{
  "cells": [
    {
      "cell_type": "markdown",
      "source": [
        "Modelado clásico de lenguaje natural\n",
        "==============================="
      ],
      "metadata": {
        "id": "nhSQKbkFSObx"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Creando un pipeline de preprocesamiento de texto\n",
        "\n",
        "A pesar de que los métodos anteriores son no supervisados, son de utilidad para el modelado de de problemas no supervisados como supervisados. Para llevar estos métodos a un entorno práctico normalmente se construyen flujos de procesamiento como el que se muestra más abajo. Estos flujos se los llama Pipeline:"
      ],
      "metadata": {
        "id": "jwFoS39i_J_r"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "<img src='https://github.com/santiagxf/M72109/blob/master/docs/nlp/_images/classic_pipeline.png?raw=1' />"
      ],
      "metadata": {
        "id": "cLSZG2Gy_J_r"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "A modo de ejemplo, vamos a utilizar la API de Scikit-Learn para generar cada uno de estos pasos y así construir un modelo que resuelva un problema de negocio de punta a punta."
      ],
      "metadata": {
        "id": "sdw5EvPI_J_s"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "**¿Que es lo que vamos a hacer?**\n",
        "Intentaremos construir un pipeline de machine learning donde como entrada recibamos texto, ejecutemos todos los pasos que vimos en este notebook incluyendo:\n",
        "\n",
        " - Eliminación de stopwords\n",
        " - Tokenización\n",
        " - Stemming y Lemmatization\n",
        " - Procesamiento especico del tema\n",
        " - Creación de features utilizando algun metodo de reducción de dimensionalidad, SVD, LSI, LDA\n",
        "\n",
        ", para luego utilizar estas features para entrenar un modelo que nos permita predecir alguna propiedad interesante del set de datos. En este caso en particular, donde estamos viendo tweets, algunos casos interesantes podrían ser:\n",
        " - Predecir el sector al que pertenece el tweet: Alimentación, Bebidas, etc.\n",
        " - Predecir el paso en el Marketing Funel al que pertece"
      ],
      "metadata": {
        "id": "ONzBUk9yhH9k"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Para ejecutar este notebook"
      ],
      "metadata": {
        "id": "EbV1KaAVSOb9"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "Para ejecutar este notebook, instale las siguientes librerias:"
      ],
      "metadata": {
        "id": "6UH3VzqXSOb-"
      }
    },
    {
      "cell_type": "code",
      "execution_count": 42,
      "source": [
        "!wget https://raw.githubusercontent.com/santiagxf/M72109/master/NLP/Datasets/mascorpus/tweets_marketing.csv \\\n",
        "    --quiet --no-clobber --directory-prefix ./Datasets/mascorpus/\n",
        "!wget https://raw.githubusercontent.com/santiagxf/M72109/master/m72109/nlp/normalization.py \\\n",
        "    --quiet --no-clobber --directory-prefix ./m72109/nlp/\n",
        "!wget https://raw.githubusercontent.com/santiagxf/M72109/master/m72109/nlp/transformation.py \\\n",
        "    --quiet --no-clobber --directory-prefix ./m72109/nlp/\n",
        "!wget https://raw.githubusercontent.com/santiagxf/M72109/master/docs/nlp/classic/classic-modeling.txt \\\n",
        "    --quiet --no-clobber\n",
        "!pip install -r classic-modeling.txt --quiet"
      ],
      "outputs": [],
      "metadata": {
        "id": "fb3InosASOb_"
      }
    },
    {
      "cell_type": "code",
      "execution_count": 2,
      "source": [
        "!python -m spacy download es_core_news_sm 1> /dev/null"
      ],
      "outputs": [],
      "metadata": {
        "id": "DkZn-F6ISOcC"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "Primero importaremos algunas librerias necesarias"
      ],
      "metadata": {
        "id": "WPpqVNrwSdhL"
      }
    },
    {
      "cell_type": "code",
      "execution_count": 3,
      "source": [
        "import pandas as pd\n",
        "import numpy as np\n",
        "import matplotlib.pyplot as plt\n",
        "from tqdm import tqdm"
      ],
      "outputs": [],
      "metadata": {
        "id": "zPfF_O0U_J9a"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Sobre el set de datos con el que vamos a trabajar"
      ],
      "metadata": {
        "id": "lE_O7bEjLebd"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "Utilizaremos como ejemplo un set de datos en español que contiene tweets que diferentes usuarios han publicado en relación a diferentes marcas de productos u empresas en el rubro de alimentación, construcción, automoviles, etc. Estos tweets, a su vez, están asociados a una de las diferentes fases en el proceso de ventas (también conocido como Marketing Funel) y por eso están tagueados con las fases de:\n",
        " - Awareness – el cliente es conciente de la existencia de un producto o servicio\n",
        " - Interest – activamente expresa el interes de un producto o servicio\n",
        " - Evaluation – aspira una marca o producto en particular\n",
        " - Purchase – toma el siguiente paso necesario para comprar el producto o servicio\n",
        " - Postpurchase - realización del proceso de compra. El cliente compara la diferencia entre lo que deseaba y lo que obtuvo\n",
        "\n",
        "Referencia: [Spanish Corpus of Tweets for Marketing](http://ceur-ws.org/Vol-2111/paper1.pdf)\n",
        "\n",
        "> Nota: La version de este conjunto de datos que utilizaremos aqui es una versión preprocesada del original."
      ],
      "metadata": {
        "id": "H8lcRTa_Li4e"
      }
    },
    {
      "cell_type": "code",
      "execution_count": 4,
      "source": [
        "tweets = pd.read_csv('Datasets/mascorpus/tweets_marketing.csv')"
      ],
      "outputs": [],
      "metadata": {
        "id": "Gc44Q7do_J9h"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "Inspeccionamos el set de datos"
      ],
      "metadata": {
        "id": "INJwReUXSs4K"
      }
    },
    {
      "cell_type": "code",
      "execution_count": 5,
      "source": [
        "tweets.head(5)"
      ],
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "                                               TEXTO  SECTOR      MARCA  \\\n",
              "0  #tablondeanuncios Funda nordica ikea #madrid h...  RETAIL       IKEA   \n",
              "1  #tr Me ofrezco para montar muebles de Ikea - H...  RETAIL       IKEA   \n",
              "2  #VozPópuli Vozpópuli @voz_populi - #LoMásLeido...  RETAIL    ALCAMPO   \n",
              "3  #ZonaTecno Destacado: Todo lo que hay que sabe...  RETAIL  CARREFOUR   \n",
              "4  $Carrefour retira pez #Panga. OCU y grupos x #...  RETAIL  CARREFOUR   \n",
              "\n",
              "       CANAL  AWARENESS  EVALUATION  PURCHASE  POSTPURCHASE  NC2  \n",
              "0  Microblog          0           0       0.0             0  1.0  \n",
              "1  Microblog          0           0       0.0             0  1.0  \n",
              "2  Microblog          0           0       0.0             0  1.0  \n",
              "3  Microblog          0           0       0.0             0  1.0  \n",
              "4  Microblog          0           0       0.0             0  1.0  "
            ],
            "text/html": [
              "\n",
              "  <div id=\"df-377abc21-aba3-4682-a88c-82c493d2f057\" class=\"colab-df-container\">\n",
              "    <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>TEXTO</th>\n",
              "      <th>SECTOR</th>\n",
              "      <th>MARCA</th>\n",
              "      <th>CANAL</th>\n",
              "      <th>AWARENESS</th>\n",
              "      <th>EVALUATION</th>\n",
              "      <th>PURCHASE</th>\n",
              "      <th>POSTPURCHASE</th>\n",
              "      <th>NC2</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>#tablondeanuncios Funda nordica ikea #madrid h...</td>\n",
              "      <td>RETAIL</td>\n",
              "      <td>IKEA</td>\n",
              "      <td>Microblog</td>\n",
              "      <td>0</td>\n",
              "      <td>0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0</td>\n",
              "      <td>1.0</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>#tr Me ofrezco para montar muebles de Ikea - H...</td>\n",
              "      <td>RETAIL</td>\n",
              "      <td>IKEA</td>\n",
              "      <td>Microblog</td>\n",
              "      <td>0</td>\n",
              "      <td>0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0</td>\n",
              "      <td>1.0</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>#VozPópuli Vozpópuli @voz_populi - #LoMásLeido...</td>\n",
              "      <td>RETAIL</td>\n",
              "      <td>ALCAMPO</td>\n",
              "      <td>Microblog</td>\n",
              "      <td>0</td>\n",
              "      <td>0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0</td>\n",
              "      <td>1.0</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>#ZonaTecno Destacado: Todo lo que hay que sabe...</td>\n",
              "      <td>RETAIL</td>\n",
              "      <td>CARREFOUR</td>\n",
              "      <td>Microblog</td>\n",
              "      <td>0</td>\n",
              "      <td>0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0</td>\n",
              "      <td>1.0</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>$Carrefour retira pez #Panga. OCU y grupos x #...</td>\n",
              "      <td>RETAIL</td>\n",
              "      <td>CARREFOUR</td>\n",
              "      <td>Microblog</td>\n",
              "      <td>0</td>\n",
              "      <td>0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0</td>\n",
              "      <td>1.0</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>\n",
              "    <div class=\"colab-df-buttons\">\n",
              "\n",
              "  <div class=\"colab-df-container\">\n",
              "    <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-377abc21-aba3-4682-a88c-82c493d2f057')\"\n",
              "            title=\"Convert this dataframe to an interactive table.\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
              "    <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
              "  </svg>\n",
              "    </button>\n",
              "\n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    .colab-df-buttons div {\n",
              "      margin-bottom: 4px;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "    <script>\n",
              "      const buttonEl =\n",
              "        document.querySelector('#df-377abc21-aba3-4682-a88c-82c493d2f057 button.colab-df-convert');\n",
              "      buttonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "      async function convertToInteractive(key) {\n",
              "        const element = document.querySelector('#df-377abc21-aba3-4682-a88c-82c493d2f057');\n",
              "        const dataTable =\n",
              "          await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                    [key], {});\n",
              "        if (!dataTable) return;\n",
              "\n",
              "        const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "          '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "          + ' to learn more about interactive tables.';\n",
              "        element.innerHTML = '';\n",
              "        dataTable['output_type'] = 'display_data';\n",
              "        await google.colab.output.renderOutput(dataTable, element);\n",
              "        const docLink = document.createElement('div');\n",
              "        docLink.innerHTML = docLinkHtml;\n",
              "        element.appendChild(docLink);\n",
              "      }\n",
              "    </script>\n",
              "  </div>\n",
              "\n",
              "\n",
              "<div id=\"df-a1b665e3-654d-4deb-b23f-d38dc04029ab\">\n",
              "  <button class=\"colab-df-quickchart\" onclick=\"quickchart('df-a1b665e3-654d-4deb-b23f-d38dc04029ab')\"\n",
              "            title=\"Suggest charts\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "<svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "     width=\"24px\">\n",
              "    <g>\n",
              "        <path d=\"M19 3H5c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h14c1.1 0 2-.9 2-2V5c0-1.1-.9-2-2-2zM9 17H7v-7h2v7zm4 0h-2V7h2v10zm4 0h-2v-4h2v4z\"/>\n",
              "    </g>\n",
              "</svg>\n",
              "  </button>\n",
              "\n",
              "<style>\n",
              "  .colab-df-quickchart {\n",
              "      --bg-color: #E8F0FE;\n",
              "      --fill-color: #1967D2;\n",
              "      --hover-bg-color: #E2EBFA;\n",
              "      --hover-fill-color: #174EA6;\n",
              "      --disabled-fill-color: #AAA;\n",
              "      --disabled-bg-color: #DDD;\n",
              "  }\n",
              "\n",
              "  [theme=dark] .colab-df-quickchart {\n",
              "      --bg-color: #3B4455;\n",
              "      --fill-color: #D2E3FC;\n",
              "      --hover-bg-color: #434B5C;\n",
              "      --hover-fill-color: #FFFFFF;\n",
              "      --disabled-bg-color: #3B4455;\n",
              "      --disabled-fill-color: #666;\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart {\n",
              "    background-color: var(--bg-color);\n",
              "    border: none;\n",
              "    border-radius: 50%;\n",
              "    cursor: pointer;\n",
              "    display: none;\n",
              "    fill: var(--fill-color);\n",
              "    height: 32px;\n",
              "    padding: 0;\n",
              "    width: 32px;\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart:hover {\n",
              "    background-color: var(--hover-bg-color);\n",
              "    box-shadow: 0 1px 2px rgba(60, 64, 67, 0.3), 0 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "    fill: var(--button-hover-fill-color);\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart-complete:disabled,\n",
              "  .colab-df-quickchart-complete:disabled:hover {\n",
              "    background-color: var(--disabled-bg-color);\n",
              "    fill: var(--disabled-fill-color);\n",
              "    box-shadow: none;\n",
              "  }\n",
              "\n",
              "  .colab-df-spinner {\n",
              "    border: 2px solid var(--fill-color);\n",
              "    border-color: transparent;\n",
              "    border-bottom-color: var(--fill-color);\n",
              "    animation:\n",
              "      spin 1s steps(1) infinite;\n",
              "  }\n",
              "\n",
              "  @keyframes spin {\n",
              "    0% {\n",
              "      border-color: transparent;\n",
              "      border-bottom-color: var(--fill-color);\n",
              "      border-left-color: var(--fill-color);\n",
              "    }\n",
              "    20% {\n",
              "      border-color: transparent;\n",
              "      border-left-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "    }\n",
              "    30% {\n",
              "      border-color: transparent;\n",
              "      border-left-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "      border-right-color: var(--fill-color);\n",
              "    }\n",
              "    40% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "    }\n",
              "    60% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "    }\n",
              "    80% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "      border-bottom-color: var(--fill-color);\n",
              "    }\n",
              "    90% {\n",
              "      border-color: transparent;\n",
              "      border-bottom-color: var(--fill-color);\n",
              "    }\n",
              "  }\n",
              "</style>\n",
              "\n",
              "  <script>\n",
              "    async function quickchart(key) {\n",
              "      const quickchartButtonEl =\n",
              "        document.querySelector('#' + key + ' button');\n",
              "      quickchartButtonEl.disabled = true;  // To prevent multiple clicks.\n",
              "      quickchartButtonEl.classList.add('colab-df-spinner');\n",
              "      try {\n",
              "        const charts = await google.colab.kernel.invokeFunction(\n",
              "            'suggestCharts', [key], {});\n",
              "      } catch (error) {\n",
              "        console.error('Error during call to suggestCharts:', error);\n",
              "      }\n",
              "      quickchartButtonEl.classList.remove('colab-df-spinner');\n",
              "      quickchartButtonEl.classList.add('colab-df-quickchart-complete');\n",
              "    }\n",
              "    (() => {\n",
              "      let quickchartButtonEl =\n",
              "        document.querySelector('#df-a1b665e3-654d-4deb-b23f-d38dc04029ab button');\n",
              "      quickchartButtonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "    })();\n",
              "  </script>\n",
              "</div>\n",
              "\n",
              "    </div>\n",
              "  </div>\n"
            ],
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "dataframe",
              "variable_name": "tweets",
              "summary": "{\n  \"name\": \"tweets\",\n  \"rows\": 3763,\n  \"fields\": [\n    {\n      \"column\": \"TEXTO\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 3678,\n        \"samples\": [\n          \"El BBVA deber\\u00eda hacer nuevos comerciales con Claudio Bravo.\\nAprovechando que ahora es el rey de la banca.\\n@alebattocchio\",\n          \"yo quiero el nuevo citroen c3!! https://t.co/5gKTjThrJk\",\n          \"Acabo de correr 5,78 km a un ritmo de 6'36'' con Nike+ https://t.co/Ui6paCTjtC #nikeplus\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"SECTOR\",\n      \"properties\": {\n        \"dtype\": \"category\",\n        \"num_unique_values\": 7,\n        \"samples\": [\n          \"RETAIL\",\n          \"TELCO\",\n          \"BEBIDAS\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"MARCA\",\n      \"properties\": {\n        \"dtype\": \"category\",\n        \"num_unique_values\": 38,\n        \"samples\": [\n          \"ESTRELLA GALICIA\",\n          \"NIKE\",\n          \"LEROY MERLIN\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"CANAL\",\n      \"properties\": {\n        \"dtype\": \"category\",\n        \"num_unique_values\": 1,\n        \"samples\": [\n          \"Microblog\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"AWARENESS\",\n      \"properties\": {\n        \"dtype\": \"number\",\n        \"std\": 0,\n        \"min\": 0,\n        \"max\": 1,\n        \"num_unique_values\": 2,\n        \"samples\": [\n          1\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"EVALUATION\",\n      \"properties\": {\n        \"dtype\": \"number\",\n        \"std\": 0,\n        \"min\": 0,\n        \"max\": 1,\n        \"num_unique_values\": 2,\n        \"samples\": [\n          1\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"PURCHASE\",\n      \"properties\": {\n        \"dtype\": \"number\",\n        \"std\": 0.15200028602567756,\n        \"min\": 0.0,\n        \"max\": 1.0,\n        \"num_unique_values\": 2,\n        \"samples\": [\n          1.0\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"POSTPURCHASE\",\n      \"properties\": {\n        \"dtype\": \"number\",\n        \"std\": 0,\n        \"min\": 0,\n        \"max\": 1,\n        \"num_unique_values\": 2,\n        \"samples\": [\n          1\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"NC2\",\n      \"properties\": {\n        \"dtype\": \"number\",\n        \"std\": 0.5047254454850222,\n        \"min\": 0.0,\n        \"max\": 11.0,\n        \"num_unique_values\": 3,\n        \"samples\": [\n          1.0\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    }\n  ]\n}"
            }
          },
          "metadata": {},
          "execution_count": 5
        }
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 452
        },
        "id": "Gd6EocPdG5A0",
        "outputId": "86c71adf-baf3-4911-cb00-949f6f4c415a"
      }
    },
    {
      "cell_type": "code",
      "execution_count": 6,
      "source": [
        "tweets.groupby('SECTOR').head(1)[['TEXTO', 'SECTOR']]"
      ],
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "                                                  TEXTO        SECTOR\n",
              "0     #tablondeanuncios Funda nordica ikea #madrid h...        RETAIL\n",
              "725   \"Ilcinsisti lis MB dispiniblis\" te odeeeeeo Mo...         TELCO\n",
              "964   #CarlosSlim y Bimbo lanzarán un vehículo eléct...  ALIMENTACION\n",
              "1298  ‼🏎Toyota #Day, 4ruedas ,1/4 milla, 1 #pasión, ...    AUTOMOCION\n",
              "1748  \"- Tú qué.\\n- Yo na.\"\\nConversaciones banco sa...         BANCA\n",
              "2348  - Cariño, te juro que sólo tenían Cruzcampo en...       BEBIDAS\n",
              "3023  #adidas #hockey Amenabar 2080 CABA https://t.c...      DEPORTES"
            ],
            "text/html": [
              "\n",
              "  <div id=\"df-248047a3-be8b-4cd3-8a2b-c54aca830b90\" class=\"colab-df-container\">\n",
              "    <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>TEXTO</th>\n",
              "      <th>SECTOR</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>#tablondeanuncios Funda nordica ikea #madrid h...</td>\n",
              "      <td>RETAIL</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>725</th>\n",
              "      <td>\"Ilcinsisti lis MB dispiniblis\" te odeeeeeo Mo...</td>\n",
              "      <td>TELCO</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>964</th>\n",
              "      <td>#CarlosSlim y Bimbo lanzarán un vehículo eléct...</td>\n",
              "      <td>ALIMENTACION</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1298</th>\n",
              "      <td>‼🏎Toyota #Day, 4ruedas ,1/4 milla, 1 #pasión, ...</td>\n",
              "      <td>AUTOMOCION</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1748</th>\n",
              "      <td>\"- Tú qué.\\n- Yo na.\"\\nConversaciones banco sa...</td>\n",
              "      <td>BANCA</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2348</th>\n",
              "      <td>- Cariño, te juro que sólo tenían Cruzcampo en...</td>\n",
              "      <td>BEBIDAS</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3023</th>\n",
              "      <td>#adidas #hockey Amenabar 2080 CABA https://t.c...</td>\n",
              "      <td>DEPORTES</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>\n",
              "    <div class=\"colab-df-buttons\">\n",
              "\n",
              "  <div class=\"colab-df-container\">\n",
              "    <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-248047a3-be8b-4cd3-8a2b-c54aca830b90')\"\n",
              "            title=\"Convert this dataframe to an interactive table.\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
              "    <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
              "  </svg>\n",
              "    </button>\n",
              "\n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    .colab-df-buttons div {\n",
              "      margin-bottom: 4px;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "    <script>\n",
              "      const buttonEl =\n",
              "        document.querySelector('#df-248047a3-be8b-4cd3-8a2b-c54aca830b90 button.colab-df-convert');\n",
              "      buttonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "      async function convertToInteractive(key) {\n",
              "        const element = document.querySelector('#df-248047a3-be8b-4cd3-8a2b-c54aca830b90');\n",
              "        const dataTable =\n",
              "          await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                    [key], {});\n",
              "        if (!dataTable) return;\n",
              "\n",
              "        const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "          '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "          + ' to learn more about interactive tables.';\n",
              "        element.innerHTML = '';\n",
              "        dataTable['output_type'] = 'display_data';\n",
              "        await google.colab.output.renderOutput(dataTable, element);\n",
              "        const docLink = document.createElement('div');\n",
              "        docLink.innerHTML = docLinkHtml;\n",
              "        element.appendChild(docLink);\n",
              "      }\n",
              "    </script>\n",
              "  </div>\n",
              "\n",
              "\n",
              "<div id=\"df-293c73f7-62e6-4e61-a24a-3922a98ee155\">\n",
              "  <button class=\"colab-df-quickchart\" onclick=\"quickchart('df-293c73f7-62e6-4e61-a24a-3922a98ee155')\"\n",
              "            title=\"Suggest charts\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "<svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "     width=\"24px\">\n",
              "    <g>\n",
              "        <path d=\"M19 3H5c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h14c1.1 0 2-.9 2-2V5c0-1.1-.9-2-2-2zM9 17H7v-7h2v7zm4 0h-2V7h2v10zm4 0h-2v-4h2v4z\"/>\n",
              "    </g>\n",
              "</svg>\n",
              "  </button>\n",
              "\n",
              "<style>\n",
              "  .colab-df-quickchart {\n",
              "      --bg-color: #E8F0FE;\n",
              "      --fill-color: #1967D2;\n",
              "      --hover-bg-color: #E2EBFA;\n",
              "      --hover-fill-color: #174EA6;\n",
              "      --disabled-fill-color: #AAA;\n",
              "      --disabled-bg-color: #DDD;\n",
              "  }\n",
              "\n",
              "  [theme=dark] .colab-df-quickchart {\n",
              "      --bg-color: #3B4455;\n",
              "      --fill-color: #D2E3FC;\n",
              "      --hover-bg-color: #434B5C;\n",
              "      --hover-fill-color: #FFFFFF;\n",
              "      --disabled-bg-color: #3B4455;\n",
              "      --disabled-fill-color: #666;\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart {\n",
              "    background-color: var(--bg-color);\n",
              "    border: none;\n",
              "    border-radius: 50%;\n",
              "    cursor: pointer;\n",
              "    display: none;\n",
              "    fill: var(--fill-color);\n",
              "    height: 32px;\n",
              "    padding: 0;\n",
              "    width: 32px;\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart:hover {\n",
              "    background-color: var(--hover-bg-color);\n",
              "    box-shadow: 0 1px 2px rgba(60, 64, 67, 0.3), 0 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "    fill: var(--button-hover-fill-color);\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart-complete:disabled,\n",
              "  .colab-df-quickchart-complete:disabled:hover {\n",
              "    background-color: var(--disabled-bg-color);\n",
              "    fill: var(--disabled-fill-color);\n",
              "    box-shadow: none;\n",
              "  }\n",
              "\n",
              "  .colab-df-spinner {\n",
              "    border: 2px solid var(--fill-color);\n",
              "    border-color: transparent;\n",
              "    border-bottom-color: var(--fill-color);\n",
              "    animation:\n",
              "      spin 1s steps(1) infinite;\n",
              "  }\n",
              "\n",
              "  @keyframes spin {\n",
              "    0% {\n",
              "      border-color: transparent;\n",
              "      border-bottom-color: var(--fill-color);\n",
              "      border-left-color: var(--fill-color);\n",
              "    }\n",
              "    20% {\n",
              "      border-color: transparent;\n",
              "      border-left-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "    }\n",
              "    30% {\n",
              "      border-color: transparent;\n",
              "      border-left-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "      border-right-color: var(--fill-color);\n",
              "    }\n",
              "    40% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "    }\n",
              "    60% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "    }\n",
              "    80% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "      border-bottom-color: var(--fill-color);\n",
              "    }\n",
              "    90% {\n",
              "      border-color: transparent;\n",
              "      border-bottom-color: var(--fill-color);\n",
              "    }\n",
              "  }\n",
              "</style>\n",
              "\n",
              "  <script>\n",
              "    async function quickchart(key) {\n",
              "      const quickchartButtonEl =\n",
              "        document.querySelector('#' + key + ' button');\n",
              "      quickchartButtonEl.disabled = true;  // To prevent multiple clicks.\n",
              "      quickchartButtonEl.classList.add('colab-df-spinner');\n",
              "      try {\n",
              "        const charts = await google.colab.kernel.invokeFunction(\n",
              "            'suggestCharts', [key], {});\n",
              "      } catch (error) {\n",
              "        console.error('Error during call to suggestCharts:', error);\n",
              "      }\n",
              "      quickchartButtonEl.classList.remove('colab-df-spinner');\n",
              "      quickchartButtonEl.classList.add('colab-df-quickchart-complete');\n",
              "    }\n",
              "    (() => {\n",
              "      let quickchartButtonEl =\n",
              "        document.querySelector('#df-293c73f7-62e6-4e61-a24a-3922a98ee155 button');\n",
              "      quickchartButtonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "    })();\n",
              "  </script>\n",
              "</div>\n",
              "\n",
              "    </div>\n",
              "  </div>\n"
            ],
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "dataframe",
              "summary": "{\n  \"name\": \"tweets\",\n  \"rows\": 7,\n  \"fields\": [\n    {\n      \"column\": \"TEXTO\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 7,\n        \"samples\": [\n          \"#tablondeanuncios Funda nordica ikea #madrid https://t.co/9TvaZSa1De https://t.co/J3bC6t6C9u\",\n          \"\\\"Ilcinsisti lis MB dispiniblis\\\" te odeeeeeo Movistar dir\\u00eda la golda peluda jajaja\",\n          \"- Cari\\u00f1o, te juro que s\\u00f3lo ten\\u00edan Cruzcampo en la tienda... https://t.co/m064KIDlpM\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"SECTOR\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 7,\n        \"samples\": [\n          \"RETAIL\",\n          \"TELCO\",\n          \"BEBIDAS\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    }\n  ]\n}"
            }
          },
          "metadata": {},
          "execution_count": 6
        }
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 269
        },
        "id": "uXwJS2Og_J9l",
        "outputId": "199bb5a9-be6f-4906-c59c-52dd2cc3d34c"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Creando un pipeline"
      ],
      "metadata": {
        "id": "FFUHr1BjSOcI"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Creando un paso de Pipeline para procesamiento de texto\n",
        "\n",
        "El paso más complejo que tenemos para crear es quizas el preprocesamiento del texto. Esto lo podemos encapsular en un modulo de Scikit-Learn. Esta libreria tiene 2 tipos de modulos:\n",
        "\n",
        " - Transformers\n",
        " - Estimators\n",
        "\n",
        "Los transformers toman un set de features y devuelven otro set de features, por eso es que reciben el nombre de \"trasformers\", porque basicamente transforman vectores. Los estimators, por el contrario, reciben un set de features y producen un podelo que aproxima, o estima, una variable target. Por este motivo, estos modulos reciben el nombre de \"estimators\".\n",
        "\n",
        "Para simplicidad, en este curso disponemos de un `TweetTextNormalizer` ya implementado que basicamente utiliza el mismo código que vimos en la lección de procesamiento de texto.\n",
        "\n",
        "> Tip: Recomendamos que revise todos los parametros que recibe esta clase."
      ],
      "metadata": {
        "id": "R-TetBhbiGMz"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "Instanciamos nuestro preprocesamiento de texto"
      ],
      "metadata": {
        "id": "UeSpeGyC_J_0"
      }
    },
    {
      "cell_type": "code",
      "execution_count": 47,
      "source": [
        "from m72109.nlp.normalization import TweetTextNormalizer\n",
        "\n",
        "normalizer = TweetTextNormalizer(\n",
        "    language='spanish',\n",
        "    lemmatize=True,\n",
        "    stem=False,\n",
        "    reduce_len=True,\n",
        "    strip_handles=True,\n",
        "    strip_stopwords=True,\n",
        "    strip_urls=True,\n",
        "    strip_accents=True,\n",
        "    token_min_len=4,\n",
        "    preserve_case=False\n",
        "  )"
      ],
      "outputs": [],
      "metadata": {
        "id": "2WWuoi17_J_t"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "Podemos ver como funciona nuestro modulo de preprocesamiento de texto al llamarlo con la función transform:"
      ],
      "metadata": {
        "id": "W8YY0A5wkPa0"
      }
    },
    {
      "cell_type": "code",
      "execution_count": 48,
      "source": [
        "tweet = tweets['TEXTO'][5]\n",
        "print(tweet)"
      ],
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            ". @PoliciadeBurgos @PCivilBurgos @Aytoburgos Mismo peligro c/ Rio Viejo junto Mercadona Villimar\n"
          ]
        }
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "N89kgK3tkYUv",
        "outputId": "8074eac0-5933-4a0c-e2bf-cbf160b698f7"
      }
    },
    {
      "cell_type": "code",
      "execution_count": 49,
      "source": [
        "normalizer.transform([tweet])"
      ],
      "outputs": [
        {
          "output_type": "stream",
          "name": "stderr",
          "text": [
            "100%|██████████| 1/1 [00:00<00:00, 13.51it/s]\n"
          ]
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "array(['mismo peligro viejo junto mercadona villimar '], dtype=object)"
            ]
          },
          "metadata": {},
          "execution_count": 49
        }
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "UIZm5SWMkgvI",
        "outputId": "07b24b8b-80d5-4902-a39a-eb1af693ba48"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Creando pasos de pipeline para la vectorizacion y ingeniería de features"
      ],
      "metadata": {
        "id": "VDuoM0BRmESe"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "Importamos algunas librerias que necesitaremos"
      ],
      "metadata": {
        "id": "WwhSKt53_J_w"
      }
    },
    {
      "cell_type": "code",
      "execution_count": 50,
      "source": [
        "from sklearn.feature_extraction.text import TfidfVectorizer\n",
        "from sklearn.decomposition import LatentDirichletAllocation\n",
        "from sklearn.linear_model import LogisticRegression\n",
        "from sklearn.pipeline import Pipeline\n",
        "from sklearn.metrics import classification_report"
      ],
      "outputs": [],
      "metadata": {
        "id": "2bx5kQr6_J_w"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "Instanciamos nuestro vectorizador, en este caso usando el método TF-IDF"
      ],
      "metadata": {
        "id": "eU4ykD8z_J_3"
      }
    },
    {
      "cell_type": "code",
      "execution_count": 51,
      "source": [
        "vectorizer = TfidfVectorizer(use_idf=True, sublinear_tf=True, norm='l2')"
      ],
      "outputs": [],
      "metadata": {
        "id": "dVIdgSc9_J_4"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "Instanciamos nuestro generador de features, que en este caso son los tópicos que LDA genere"
      ],
      "metadata": {
        "id": "fSnMG5la_J_5"
      }
    },
    {
      "cell_type": "code",
      "execution_count": 52,
      "source": [
        "featurizer = LatentDirichletAllocation(n_components=7)"
      ],
      "outputs": [],
      "metadata": {
        "id": "7o18xNsj_J_6"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Creando un paso de pipeline para clasificar"
      ],
      "metadata": {
        "id": "-G_KAGlvSOcM"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "Instanciamos nuestro clasificador que utilizará las features generadas hasta este momento"
      ],
      "metadata": {
        "id": "FumqlRDO_J__"
      }
    },
    {
      "cell_type": "code",
      "execution_count": 53,
      "source": [
        "estimator = LogisticRegression(max_iter=10000, multi_class='multinomial')"
      ],
      "outputs": [],
      "metadata": {
        "id": "2mpzmpo__KAA"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Ensamblando el pipeline"
      ],
      "metadata": {
        "id": "hYDLsEFgSOcN"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "Creamos un pipeline que ejecute todos los pasos en secuencia"
      ],
      "metadata": {
        "id": "3xkn_cWu_KAE"
      }
    },
    {
      "cell_type": "code",
      "execution_count": 54,
      "source": [
        "pipeline = Pipeline(steps=[('normalizer', normalizer),\n",
        "                           ('vectorizer', vectorizer),\n",
        "                           ('featurizer', featurizer),\n",
        "                           ('estimator', estimator)])"
      ],
      "outputs": [],
      "metadata": {
        "id": "uAccQHAG_KAE"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "En este caso intentaremos predecir el sector al que pertenece un tweet en particular. Para ello, como en todo proceso de machine learning separaremos nuestros datos en training y testing, para poder evaluar los resultados:"
      ],
      "metadata": {
        "id": "7wHgLnnYmZH0"
      }
    },
    {
      "cell_type": "code",
      "execution_count": 55,
      "source": [
        "from sklearn.model_selection import train_test_split\n",
        "\n",
        "X_train, X_test, y_train, y_test = train_test_split(tweets['TEXTO'], tweets['SECTOR'],\n",
        "                                                    test_size=0.33, stratify=tweets['SECTOR'])"
      ],
      "outputs": [],
      "metadata": {
        "id": "6yAc9s1WmjlG"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "El método fit intrenará nuestro modelo de punta a punta. Tomará unos minutos"
      ],
      "metadata": {
        "id": "UtWhdGYQn4Ey"
      }
    },
    {
      "cell_type": "code",
      "execution_count": 56,
      "source": [
        "model = pipeline.fit(X=X_train, y=y_train)"
      ],
      "outputs": [
        {
          "output_type": "stream",
          "name": "stderr",
          "text": [
            "100%|██████████| 2521/2521 [01:47<00:00, 23.56it/s]\n"
          ]
        }
      ],
      "metadata": {
        "id": "Mp-fZa4E_KAG",
        "outputId": "e1311e43-6cdc-4202-e940-e2a9698c7045",
        "colab": {
          "base_uri": "https://localhost:8080/"
        }
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "Es hora de ver que tan bien le fué a nuestro modelo en esta tarea"
      ],
      "metadata": {
        "id": "hOLZErx-oGIB"
      }
    },
    {
      "cell_type": "code",
      "execution_count": 57,
      "source": [
        "predictions = model.predict(X_test)"
      ],
      "outputs": [
        {
          "output_type": "stream",
          "name": "stderr",
          "text": [
            "100%|██████████| 1242/1242 [00:53<00:00, 23.02it/s]\n"
          ]
        }
      ],
      "metadata": {
        "id": "BiQLKHrr_KAL",
        "outputId": "94f95221-2a38-4798-c34a-552c34b83d40",
        "colab": {
          "base_uri": "https://localhost:8080/"
        }
      }
    },
    {
      "cell_type": "code",
      "execution_count": 60,
      "source": [
        "print(classification_report(y_test, predictions, zero_division=0))"
      ],
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "              precision    recall  f1-score   support\n",
            "\n",
            "ALIMENTACION       0.00      0.00      0.00       110\n",
            "  AUTOMOCION       0.00      0.00      0.00       148\n",
            "       BANCA       0.50      0.01      0.02       198\n",
            "     BEBIDAS       0.34      0.39      0.36       223\n",
            "    DEPORTES       0.28      0.32      0.30       216\n",
            "      RETAIL       0.25      0.69      0.37       268\n",
            "       TELCO       0.00      0.00      0.00        79\n",
            "\n",
            "    accuracy                           0.28      1242\n",
            "   macro avg       0.20      0.20      0.15      1242\n",
            "weighted avg       0.24      0.28      0.20      1242\n",
            "\n"
          ]
        }
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "hbDgBh5M_KAP",
        "outputId": "20900553-b2c8-43a1-c2e3-7ef344094bf3"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "¿Les parece que estás métricas son buenas? ¿Se les ocurre como mejorarlo? Algunas ideas:\n",
        "\n",
        " - ¿Quien funcionará mejor? ¿Stemmer o Lemmatization?\n",
        " - ¿Qué será mejor hacer con los hashtags? ¿Quitarlos?\n",
        " - ¿Que cantidad de factores latentes funcionará mejor? ¿7, 10, 200, 300?\n",
        " - ¿Es Logistic Regression el mejor clasificador que podemos probar? ¿Si subimos la cantidad de tópicos que me sería mejor utilizar?"
      ],
      "metadata": {
        "id": "089VY2OupEVK"
      }
    }
  ],
  "metadata": {
    "colab": {
      "name": "Topic Modeling.ipynb",
      "provenance": []
    },
    "kernelspec": {
      "display_name": "Python 3 (ipykernel)",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.8.11"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}