{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "iKENeWEEwKIb" }, "source": [ "Explicaciones para NLP utilizando SHAP\n", "======================================" ] }, { "cell_type": "markdown", "metadata": { "id": "JctlpI10jJ15" }, "source": [ "Introducción\n", "------------" ] }, { "cell_type": "markdown", "metadata": { "id": "sT_t9OxYwKIc" }, "source": [ "La idea detrás de los valores de Shapley (Shapley, Lloyd S 1953) es la siguiente: dado un conjunto de predictores, encontrar la contribución marginal de cada predictor con respecto a la predicción general. ¿cuál es la predicción general? Es el valor esperado del modelo (EV). Piense en ello como la línea de base del modelo. Entonces, la contribución marginal indica en que medida cada predictor obliga a la predicción a alejarse de esa línea de base.\n", "\n", "Para una introducción más detallada puede ver la entrada del blog: [Model interpretability — Making your model confesses: SHAP](https://santiagof.medium.com/model-interpretability-making-your-model-confess-shapley-values-5fb95a10a624)" ] }, { "cell_type": "markdown", "metadata": { "id": "t67TU7HLY1Hg" }, "source": [ "¿Como funciona?\n", "---------------\n", "\n", "La forma en que los valores de Sharpley calculan la contribución marginal es calculando el valor predicho con y sin el valor de la característica que se está considerando actualmente y tomando la diferencia para obtener la contribución marginal. Finalmente, el valor de Sharpley se calcula promediando la contribución marginal del valor de la característica en todos los subconjuntos de características posibles (llamados coaliciones) dentro del conjunto de características en el que participa la característica.\n", "\n", "Este algorimo se encuentra implementado en la libraria Shap. 
For more information about this library, visit: https://shap.readthedocs.io/en/latest/index.html" ] }, { "cell_type": "markdown", "metadata": { "id": "Dcyc_TQ6dis7" }, "source": [ "### Running this notebook" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "id": "Ntcs1AlpfckX" }, "outputs": [], "source": [ "import warnings\n", "warnings.filterwarnings('ignore')" ] }, { "cell_type": "markdown", "metadata": { "id": "nJFcNrbmjJ17" }, "source": [ "To run this notebook, install the following libraries:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "id": "sxwUvomeZ9hd", "outputId": "6d09d272-c672-4b20-b916-d862c08325e6", "colab": { "base_uri": "https://localhost:8080/" } }, "outputs": [], "source": [ "!pip install transformers --quiet\n", "!pip install shap --quiet" ] }, { "cell_type": "markdown", "metadata": { "id": "20Sih7EhY1H0" }, "source": [ "We load the dataset the model was trained on, in case it is needed." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "id": "GgK8b6e_jJ17" }, "outputs": [], "source": [ "!wget https://raw.githubusercontent.com/santiagxf/M72109/master/NLP/Datasets/mascorpus/tweets_marketing.csv \\\n", " --quiet --no-clobber --directory-prefix ./Datasets/mascorpus/" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "id": "I8vqJD9JwKIv" }, "outputs": [], "source": [ "import pandas as pd\n", "\n", "tweets = pd.read_csv('Datasets/mascorpus/tweets_marketing.csv')" ] }, { "cell_type": "markdown", "metadata": { "id": "HcO4lC5qY1H3" }, "source": [ "Loading an NLP model\n", "--------------------" ] }, { "cell_type": 
"markdown", "metadata": { "id": "juB0uVfFY1H_" }, "source": [ "Recordemos que nuestro modelo predice los sectores a los que pertenecería el tweet, siendo ellos:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "id": "RMBEop__zFAI" }, "outputs": [], "source": [ "target_names = {0:'ALIMENTACION', 1:'AUTOMOCION', 2:'BANCA', 3:'BEBIDAS', 4:'DEPORTES', 5:'RETAIL', 6:'TELCO'}" ] }, { "cell_type": "markdown", "metadata": { "id": "_gBXNzwYwKIu" }, "source": [ "Cargaremos el modelo que fue descargado anteriornmente utilizando la librería de `transformers`. Note que cargamos tanto el `tokenizer` como el modelo propiamente dicho.\n", "\n", "> Nota: Utilizaremos un modelo entrenado para resolver el problema de tweets que venimos viendo en este curso. Este modelo fue publicado en el repositorio de HuggingFace bajo el nombre `fce-m72109/mascorpus-bert-classifier`." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "id": "FuUI_dXFxh6u", "outputId": "0356f795-7f4a-4cad-f517-adaa2acc444b", "colab": { "base_uri": "https://localhost:8080/", "height": 209, "referenced_widgets": [ "ae1c06235e0c470ca79b119203c4f8d2", "a9a7ec8111154c7abd2cbe2c508f83b3", "814f9d250f1b4f6b893ba77c291f7cf0", "359dceeedddf4e17990cc29f5e0d546f", "84c4da36ed8e425da6c7e5bf462cf07e", "d5aba6853d3f42a7a839add870d44107", "5afdb29779354008a92eb3a6c65e2160", "893a57d7352e442180212f49fb9318e5", "3f14e0f0ba774ee0b0bff1ce765f864e", "0f8c97897e214b29bb8507ff5ad39c94", "07505bbe78344114952205c4526ee45c", "abaf804f8e56409aa8b143471eb45e01", "341277d783e44b7cb954aef81cea7bad", "448071b6790a4c1c8f6bf2f5a024678c", "103fe458d3344b7f98228cad380c59b6", "0252ce749e8c4ad491af6ac5ff8443b8", "456a362fab584c2694b00782a091a8cd", "d89fb9ca1599442abe168ce8709fd4a1", "c9dbc70e7e434a4baaf48dc06a5f7d31", "ea7fcdeb14874b9bbfc3daed5271c497", "8abc075437064f6083e6f3e44fdc88ac", "67eec932a25d44e9b100e9fcee9d7825", "ad99dd682b5d425a8b9b47fbec56e668", "cb011e641c2645cbaf810a9bacf7f277", 
"1f8c1ffc3dbb428b95bcac87931be51a", "ede64cb39d4746aeba9c518573c7a15e", "e228b30b08ff4d69995395f7e7bf139a", "8076715349c74c33b8c9e27cec36baa3", "cba6440e6f524b1293c21e4bf99ee6bb", "f0b4565034db4634933759a0bdc87bbb", "175a8af62c254d038b88eec11c8f16a8", "7c0c1594667b4145a33021debe6fa5a3", "0c166081c06e44bc846344a662af32ec", "7808960651ad4780899d6c2e9ac04daf", "559a57c2a6484af18cdcfcf2c49cae6f", "02edb5b17b9344449f0d2d1e9b9a36a3", "408ffc5a8af2498f9dfc2315d0540068", "1f262e992ca645fea183eae9d8ade242", "c1d69234e79d4fb586a7b715c2c21dcc", "6fdc45279e284fceb58f204fa901184d", "5900bacbebf746baa2a72d7b1430c40b", "42d4820ce27e42b6843d7335eae4f21d", "704b495416e54946a4729f65e07b9b58", "a008d2799f654d1c80d6535b94c84852", "3272fed36db343269034914a7038a4a0", "41615d9872284d71a74087c813ea1432", "bb30917b354c49c4957ebde20f107e66", "00b45b2d41fb494280fbe305fc8dea3e", "ab8372ec58874a599f4fdfbdc36044be", "88605c61cc15494ba809f04fadedb8f7", "132fd6166999454db0353dadf29d71ca", "9c8ee61e010844bdb9ce926f284dbb0f", "cf6de04721204ffeaf616359f3736f8a", "d4c251ac1c7f44b283aa2208e811f46b", "74275ec79f4c4872aa8022047435dfa9", "79ebdbe644c64d3fa6928edab03c2d87", "f74cb893e2b5449daa26e91dd7aaec48", "28311ae1ebca4a37b1a58f8865cb6891", "81c52722eefa4b84827b33ad6358ec8b", "0b2528f56c384a069b1be3169fdb23cd", "fae326f3529245dc91433f7b5e62790d", "36fc8bd3435049d489978c97ef4ffeb8", "d17ee3df6e0f416b92875df073b33631", "226718bf0520416a9559e6bd0aad762d", "b05915f72c6a4eca82c3a1f7285160a0", "cdf789fbe0794504a01fdd9df3611bd9" ] } }, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ "Downloading: 0%| | 0.00/567 [00:00 **Tip:** Agregar los parameteros `id2label` y `label2id` permite que el modelo pueda conocer los verdaderos valores de las clases que predice. 
" ], "metadata": { "id": "20rdnMFMAY-4" } }, { "cell_type": "markdown", "source": [ "Construimos un pipeline con nuestro modelo y tokenizer" ], "metadata": { "id": "nGGoEQ5r--mc" } }, { "cell_type": "code", "source": [ "pred = transformers.pipeline(\"text-classification\", model=model, tokenizer=tokenizer, device=-1, top_k=1)" ], "metadata": { "id": "pfayUNmb-fZg" }, "execution_count": 7, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "dArhsrI6Y1H5" }, "source": [ "Generando explicaciones con SHAP\n", "--------------------------------" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "id": "snbeQywH8epf" }, "outputs": [], "source": [ "sample = [\"Nos estafaron en carrefour. No vuelvo a comprar alli jamas\"]" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "id": "lBaIIYyw8epf", "outputId": "2659b9f4-e884-4f1f-c5f4-a14bd805fa60", "colab": { "base_uri": "https://localhost:8080/", "height": 66, "referenced_widgets": [ "863a96cb5020438691b5d651cef216d8", "b706d07653284c68850177b0fab12073", "56ab17c6c51141f8bc432e493ff9f872", "e17e6ea83f16489f886794c303d0112d", "895364657bee48c8ae7ee9e77982fd34", "037b05e0c848427696817c8f5c9c979b", "ec92656cdf534f4aacc73b9c76af8369", "ac88d1b662e844f3b187de9d441670f1", "bfaddd85af98447b88ac84f3b5ff006d", "71423e6b0da54ada9c28b5242298b85a", "5445fdda0b30480787349daddd93c9fb" ] } }, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ " 0%| | 0/272 [00:00" ], "text/html": [ "\n", "
[Figure: SHAP text plot for sample [0]. Output classes: ALIMENTACION, AUTOMOCION, BANCA, BEBIDAS, DEPORTES, RETAIL, TELCO. Selected output: f_ALIMENTACION(inputs), shown relative to the base value. Per-token SHAP values: Nos 0.0, estafa 0.008, ron 0.023, en -0.031, carre -0.063, fo 0.028, ur 0.001, . 0.034, No -0.015, vuelvo 0.015, a 0.0, comprar 0.0, alli 0.0, jama 0.0, s 0.0]