{ "cells": [ { "cell_type": "markdown", "source": [ "Classic natural language modeling\n", "=================================" ], "metadata": { "id": "nhSQKbkFSObx" } }, { "cell_type": "markdown", "source": [ "## Creating a text preprocessing pipeline\n", "\n", "Although the methods covered so far are unsupervised, they are useful for modeling both unsupervised and supervised problems. To put them into practice, these methods are usually chained into processing flows, which are called pipelines." ], "metadata": { "id": "jwFoS39i_J_r" } }, { "cell_type": "markdown", "source": [ "" ], "metadata": { "id": "cLSZG2Gy_J_r" } }, { "cell_type": "markdown", "source": [ "As an example, we will use the Scikit-Learn API to build each of these steps and, with them, a model that solves a business problem end to end." ], "metadata": { "id": "sdw5EvPI_J_s" } }, { "cell_type": "markdown", "source": [ "**What are we going to do?**\n", "We will build a machine learning pipeline that takes text as input and runs all the steps covered in this notebook, including:\n", "\n", " - Stopword removal\n", " - Tokenization\n", " - Stemming and lemmatization\n", " - Domain-specific processing\n", " - Feature creation using a dimensionality reduction method such as SVD, LSI, or LDA\n", "\n", "We will then use these features to train a model that predicts some interesting property of the dataset. 
In this particular case, where we are working with tweets, some interesting tasks could be:\n", " - Predicting the sector the tweet belongs to: food, beverages, etc.\n", " - Predicting the step of the marketing funnel the tweet belongs to" ], "metadata": { "id": "ONzBUk9yhH9k" } }, { "cell_type": "markdown", "source": [ "### Running this notebook" ], "metadata": { "id": "EbV1KaAVSOb9" } }, { "cell_type": "markdown", "source": [ "To run this notebook, install the following libraries:" ], "metadata": { "id": "6UH3VzqXSOb-" } }, { "cell_type": "code", "execution_count": 42, "source": [ "!wget https://raw.githubusercontent.com/santiagxf/M72109/master/NLP/Datasets/mascorpus/tweets_marketing.csv \\\n", " --quiet --no-clobber --directory-prefix ./Datasets/mascorpus/\n", "!wget https://raw.githubusercontent.com/santiagxf/M72109/master/m72109/nlp/normalization.py \\\n", " --quiet --no-clobber --directory-prefix ./m72109/nlp/\n", "!wget https://raw.githubusercontent.com/santiagxf/M72109/master/m72109/nlp/transformation.py \\\n", " --quiet --no-clobber --directory-prefix ./m72109/nlp/\n", "!wget https://raw.githubusercontent.com/santiagxf/M72109/master/docs/nlp/classic/classic-modeling.txt \\\n", " --quiet --no-clobber\n", "!pip install -r classic-modeling.txt --quiet" ], "outputs": [], "metadata": { "id": "fb3InosASOb_" } }, { "cell_type": "code", "execution_count": 2, "source": [ "!python -m spacy download es_core_news_sm 1> /dev/null" ], "outputs": [], "metadata": { "id": "DkZn-F6ISOcC" } }, { "cell_type": "markdown", "source": [ "First, we import some libraries we will need" ], "metadata": { "id": "WPpqVNrwSdhL" } }, { "cell_type": "code", "execution_count": 3, "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "from tqdm import tqdm" ], "outputs": [], "metadata": { "id": "zPfF_O0U_J9a" } }, { "cell_type": "markdown", "source": [ "## About the dataset we will work with" ], "metadata": { 
"id": "lE_O7bEjLebd" } }, { "cell_type": "markdown", "source": [ "Utilizaremos como ejemplo un set de datos en español que contiene tweets que diferentes usuarios han publicado en relación a diferentes marcas de productos u empresas en el rubro de alimentación, construcción, automoviles, etc. Estos tweets, a su vez, están asociados a una de las diferentes fases en el proceso de ventas (también conocido como Marketing Funel) y por eso están tagueados con las fases de:\n", " - Awareness – el cliente es conciente de la existencia de un producto o servicio\n", " - Interest – activamente expresa el interes de un producto o servicio\n", " - Evaluation – aspira una marca o producto en particular\n", " - Purchase – toma el siguiente paso necesario para comprar el producto o servicio\n", " - Postpurchase - realización del proceso de compra. El cliente compara la diferencia entre lo que deseaba y lo que obtuvo\n", "\n", "Referencia: [Spanish Corpus of Tweets for Marketing](http://ceur-ws.org/Vol-2111/paper1.pdf)\n", "\n", "> Nota: La version de este conjunto de datos que utilizaremos aqui es una versión preprocesada del original." ], "metadata": { "id": "H8lcRTa_Li4e" } }, { "cell_type": "code", "execution_count": 4, "source": [ "tweets = pd.read_csv('Datasets/mascorpus/tweets_marketing.csv')" ], "outputs": [], "metadata": { "id": "Gc44Q7do_J9h" } }, { "cell_type": "markdown", "source": [ "Inspeccionamos el set de datos" ], "metadata": { "id": "INJwReUXSs4K" } }, { "cell_type": "code", "execution_count": 5, "source": [ "tweets.head(5)" ], "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " TEXTO SECTOR MARCA \\\n", "0 #tablondeanuncios Funda nordica ikea #madrid h... RETAIL IKEA \n", "1 #tr Me ofrezco para montar muebles de Ikea - H... RETAIL IKEA \n", "2 #VozPópuli Vozpópuli @voz_populi - #LoMásLeido... RETAIL ALCAMPO \n", "3 #ZonaTecno Destacado: Todo lo que hay que sabe... RETAIL CARREFOUR \n", "4 $Carrefour retira pez #Panga. OCU y grupos x #... 
RETAIL CARREFOUR \n", "\n", " CANAL AWARENESS EVALUATION PURCHASE POSTPURCHASE NC2 \n", "0 Microblog 0 0 0.0 0 1.0 \n", "1 Microblog 0 0 0.0 0 1.0 \n", "2 Microblog 0 0 0.0 0 1.0 \n", "3 Microblog 0 0 0.0 0 1.0 \n", "4 Microblog 0 0 0.0 0 1.0 " ], "text/html": [ "\n", "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr><th></th><th>TEXTO</th><th>SECTOR</th><th>MARCA</th><th>CANAL</th><th>AWARENESS</th><th>EVALUATION</th><th>PURCHASE</th><th>POSTPURCHASE</th><th>NC2</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>#tablondeanuncios Funda nordica ikea #madrid h...</td><td>RETAIL</td><td>IKEA</td><td>Microblog</td><td>0</td><td>0</td><td>0.0</td><td>0</td><td>1.0</td></tr>\n",
" <tr><th>1</th><td>#tr Me ofrezco para montar muebles de Ikea - H...</td><td>RETAIL</td><td>IKEA</td><td>Microblog</td><td>0</td><td>0</td><td>0.0</td><td>0</td><td>1.0</td></tr>\n",
" <tr><th>2</th><td>#VozPópuli Vozpópuli @voz_populi - #LoMásLeido...</td><td>RETAIL</td><td>ALCAMPO</td><td>Microblog</td><td>0</td><td>0</td><td>0.0</td><td>0</td><td>1.0</td></tr>\n",
" <tr><th>3</th><td>#ZonaTecno Destacado: Todo lo que hay que sabe...</td><td>RETAIL</td><td>CARREFOUR</td><td>Microblog</td><td>0</td><td>0</td><td>0.0</td><td>0</td><td>1.0</td></tr>\n",
" <tr><th>4</th><td>$Carrefour retira pez #Panga. OCU y grupos x #...</td><td>RETAIL</td><td>CARREFOUR</td><td>Microblog</td><td>0</td><td>0</td><td>0.0</td><td>0</td><td>1.0</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
\n" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "variable_name": "tweets", "summary": "{\n \"name\": \"tweets\",\n \"rows\": 3763,\n \"fields\": [\n {\n \"column\": \"TEXTO\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 3678,\n \"samples\": [\n \"El BBVA deber\\u00eda hacer nuevos comerciales con Claudio Bravo.\\nAprovechando que ahora es el rey de la banca.\\n@alebattocchio\",\n \"yo quiero el nuevo citroen c3!! https://t.co/5gKTjThrJk\",\n \"Acabo de correr 5,78 km a un ritmo de 6'36'' con Nike+ https://t.co/Ui6paCTjtC #nikeplus\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"SECTOR\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 7,\n \"samples\": [\n \"RETAIL\",\n \"TELCO\",\n \"BEBIDAS\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"MARCA\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 38,\n \"samples\": [\n \"ESTRELLA GALICIA\",\n \"NIKE\",\n \"LEROY MERLIN\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"CANAL\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 1,\n \"samples\": [\n \"Microblog\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"AWARENESS\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 0,\n \"max\": 1,\n \"num_unique_values\": 2,\n \"samples\": [\n 1\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"EVALUATION\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 0,\n \"max\": 1,\n \"num_unique_values\": 2,\n \"samples\": [\n 1\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"PURCHASE\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.15200028602567756,\n \"min\": 0.0,\n \"max\": 1.0,\n \"num_unique_values\": 2,\n \"samples\": [\n 
1.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"POSTPURCHASE\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 0,\n \"max\": 1,\n \"num_unique_values\": 2,\n \"samples\": [\n 1\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"NC2\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.5047254454850222,\n \"min\": 0.0,\n \"max\": 11.0,\n \"num_unique_values\": 3,\n \"samples\": [\n 1.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" } }, "metadata": {}, "execution_count": 5 } ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 452 }, "id": "Gd6EocPdG5A0", "outputId": "86c71adf-baf3-4911-cb00-949f6f4c415a" } }, { "cell_type": "code", "execution_count": 6, "source": [ "tweets.groupby('SECTOR').head(1)[['TEXTO', 'SECTOR']]" ], "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " TEXTO SECTOR\n", "0 #tablondeanuncios Funda nordica ikea #madrid h... RETAIL\n", "725 \"Ilcinsisti lis MB dispiniblis\" te odeeeeeo Mo... TELCO\n", "964 #CarlosSlim y Bimbo lanzarán un vehículo eléct... ALIMENTACION\n", "1298 ‼🏎Toyota #Day, 4ruedas ,1/4 milla, 1 #pasión, ... AUTOMOCION\n", "1748 \"- Tú qué.\\n- Yo na.\"\\nConversaciones banco sa... BANCA\n", "2348 - Cariño, te juro que sólo tenían Cruzcampo en... BEBIDAS\n", "3023 #adidas #hockey Amenabar 2080 CABA https://t.c... DEPORTES" ], "text/html": [ "\n", "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr><th></th><th>TEXTO</th><th>SECTOR</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>#tablondeanuncios Funda nordica ikea #madrid h...</td><td>RETAIL</td></tr>\n",
" <tr><th>725</th><td>\"Ilcinsisti lis MB dispiniblis\" te odeeeeeo Mo...</td><td>TELCO</td></tr>\n",
" <tr><th>964</th><td>#CarlosSlim y Bimbo lanzarán un vehículo eléct...</td><td>ALIMENTACION</td></tr>\n",
" <tr><th>1298</th><td>‼🏎Toyota #Day, 4ruedas ,1/4 milla, 1 #pasión, ...</td><td>AUTOMOCION</td></tr>\n",
" <tr><th>1748</th><td>\"- Tú qué.\\n- Yo na.\"\\nConversaciones banco sa...</td><td>BANCA</td></tr>\n",
" <tr><th>2348</th><td>- Cariño, te juro que sólo tenían Cruzcampo en...</td><td>BEBIDAS</td></tr>\n",
" <tr><th>3023</th><td>#adidas #hockey Amenabar 2080 CABA https://t.c...</td><td>DEPORTES</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
\n" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "summary": "{\n \"name\": \"tweets\",\n \"rows\": 7,\n \"fields\": [\n {\n \"column\": \"TEXTO\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 7,\n \"samples\": [\n \"#tablondeanuncios Funda nordica ikea #madrid https://t.co/9TvaZSa1De https://t.co/J3bC6t6C9u\",\n \"\\\"Ilcinsisti lis MB dispiniblis\\\" te odeeeeeo Movistar dir\\u00eda la golda peluda jajaja\",\n \"- Cari\\u00f1o, te juro que s\\u00f3lo ten\\u00edan Cruzcampo en la tienda... https://t.co/m064KIDlpM\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"SECTOR\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 7,\n \"samples\": [\n \"RETAIL\",\n \"TELCO\",\n \"BEBIDAS\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" } }, "metadata": {}, "execution_count": 6 } ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 269 }, "id": "uXwJS2Og_J9l", "outputId": "199bb5a9-be6f-4906-c59c-52dd2cc3d34c" } }, { "cell_type": "markdown", "source": [ "## Building a pipeline" ], "metadata": { "id": "FFUHr1BjSOcI" } }, { "cell_type": "markdown", "source": [ "### Creating a pipeline step for text processing\n", "\n", "The most complex step we need to create is perhaps text preprocessing. We can encapsulate it in a Scikit-Learn module. This library has two types of modules:\n", "\n", " - Transformers\n", " - Estimators\n", "\n", "Transformers take a set of features and return another set of features; they are called \"transformers\" because they essentially transform vectors. Estimators, in contrast, receive a set of features and produce a model that approximates, or estimates, a target variable. 
For this reason, these modules are called \"estimators\".\n", "\n", "For simplicity, this course provides an already implemented `TweetTextNormalizer` that essentially uses the same code we saw in the text processing lesson.\n", "\n", "> Tip: We recommend reviewing all the parameters this class accepts." ], "metadata": { "id": "R-TetBhbiGMz" } }, { "cell_type": "markdown", "source": [ "We instantiate our text preprocessing step" ], "metadata": { "id": "UeSpeGyC_J_0" } }, { "cell_type": "code", "execution_count": 47, "source": [ "from m72109.nlp.normalization import TweetTextNormalizer\n", "\n", "normalizer = TweetTextNormalizer(\n", " language='spanish',\n", " lemmatize=True,\n", " stem=False,\n", " reduce_len=True,\n", " strip_handles=True,\n", " strip_stopwords=True,\n", " strip_urls=True,\n", " strip_accents=True,\n", " token_min_len=4,\n", " preserve_case=False\n", " )" ], "outputs": [], "metadata": { "id": "2WWuoi17_J_t" } }, { "cell_type": "markdown", "source": [ "We can see how our text preprocessing module works by calling its `transform` method:" ], "metadata": { "id": "W8YY0A5wkPa0" } }, { "cell_type": "code", "execution_count": 48, "source": [ "tweet = tweets['TEXTO'][5]\n", "print(tweet)" ], "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ ". 
@PoliciadeBurgos @PCivilBurgos @Aytoburgos Mismo peligro c/ Rio Viejo junto Mercadona Villimar\n" ] } ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "N89kgK3tkYUv", "outputId": "8074eac0-5933-4a0c-e2bf-cbf160b698f7" } }, { "cell_type": "code", "execution_count": 49, "source": [ "normalizer.transform([tweet])" ], "outputs": [ { "output_type": "stream", "name": "stderr", "text": [ "100%|██████████| 1/1 [00:00<00:00, 13.51it/s]\n" ] }, { "output_type": "execute_result", "data": { "text/plain": [ "array(['mismo peligro viejo junto mercadona villimar '], dtype=object)" ] }, "metadata": {}, "execution_count": 49 } ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "UIZm5SWMkgvI", "outputId": "07b24b8b-80d5-4902-a39a-eb1af693ba48" } }, { "cell_type": "markdown", "source": [ "### Creating pipeline steps for vectorization and feature engineering" ], "metadata": { "id": "VDuoM0BRmESe" } }, { "cell_type": "markdown", "source": [ "We import some libraries we will need" ], "metadata": { "id": "WwhSKt53_J_w" } }, { "cell_type": "code", "execution_count": 50, "source": [ "from sklearn.feature_extraction.text import TfidfVectorizer\n", "from sklearn.decomposition import LatentDirichletAllocation\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.pipeline import Pipeline\n", "from sklearn.metrics import classification_report" ], "outputs": [], "metadata": { "id": "2bx5kQr6_J_w" } }, { "cell_type": "markdown", "source": [ "We instantiate our vectorizer, in this case using the TF-IDF method" ], "metadata": { "id": "eU4ykD8z_J_3" } }, { "cell_type": "code", "execution_count": 51, "source": [ "vectorizer = TfidfVectorizer(use_idf=True, sublinear_tf=True, norm='l2')" ], "outputs": [], "metadata": { "id": "dVIdgSc9_J_4" } }, { "cell_type": "markdown", "source": [ "We instantiate our feature generator; here the features are the topics that LDA produces" ], "metadata": { "id": "fSnMG5la_J_5" 
} }, { "cell_type": "code", "execution_count": 52, "source": [ "featurizer = LatentDirichletAllocation(n_components=7)" ], "outputs": [], "metadata": { "id": "7o18xNsj_J_6" } }, { "cell_type": "markdown", "source": [ "### Creating a pipeline step for classification" ], "metadata": { "id": "-G_KAGlvSOcM" } }, { "cell_type": "markdown", "source": [ "We instantiate our classifier, which will use the features generated so far" ], "metadata": { "id": "FumqlRDO_J__" } }, { "cell_type": "code", "execution_count": 53, "source": [ "estimator = LogisticRegression(max_iter=10000, multi_class='multinomial')" ], "outputs": [], "metadata": { "id": "2mpzmpo__KAA" } }, { "cell_type": "markdown", "source": [ "### Assembling the pipeline" ], "metadata": { "id": "hYDLsEFgSOcN" } }, { "cell_type": "markdown", "source": [ "We create a pipeline that runs all the steps in sequence" ], "metadata": { "id": "3xkn_cWu_KAE" } }, { "cell_type": "code", "execution_count": 54, "source": [ "pipeline = Pipeline(steps=[('normalizer', normalizer),\n", " ('vectorizer', vectorizer),\n", " ('featurizer', featurizer),\n", " ('estimator', estimator)])" ], "outputs": [], "metadata": { "id": "uAccQHAG_KAE" } }, { "cell_type": "markdown", "source": [ "In this case we will try to predict the sector a given tweet belongs to. To do so, as in any machine learning process, we split our data into training and testing sets so that we can evaluate the results:" ], "metadata": { "id": "7wHgLnnYmZH0" } }, { "cell_type": "code", "execution_count": 55, "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(tweets['TEXTO'], tweets['SECTOR'],\n", " test_size=0.33, stratify=tweets['SECTOR'])" ], "outputs": [], "metadata": { "id": "6yAc9s1WmjlG" } }, { "cell_type": "markdown", "source": [ "The `fit` method will train our model end to end. 
It will take a few minutes" ], "metadata": { "id": "UtWhdGYQn4Ey" } }, { "cell_type": "code", "execution_count": 56, "source": [ "model = pipeline.fit(X=X_train, y=y_train)" ], "outputs": [ { "output_type": "stream", "name": "stderr", "text": [ "100%|██████████| 2521/2521 [01:47<00:00, 23.56it/s]\n" ] } ], "metadata": { "id": "Mp-fZa4E_KAG", "outputId": "e1311e43-6cdc-4202-e940-e2a9698c7045", "colab": { "base_uri": "https://localhost:8080/" } } }, { "cell_type": "markdown", "source": [ "It's time to see how well our model did on this task" ], "metadata": { "id": "hOLZErx-oGIB" } }, { "cell_type": "code", "execution_count": 57, "source": [ "predictions = model.predict(X_test)" ], "outputs": [ { "output_type": "stream", "name": "stderr", "text": [ "100%|██████████| 1242/1242 [00:53<00:00, 23.02it/s]\n" ] } ], "metadata": { "id": "BiQLKHrr_KAL", "outputId": "94f95221-2a38-4798-c34a-552c34b83d40", "colab": { "base_uri": "https://localhost:8080/" } } }, { "cell_type": "code", "execution_count": 60, "source": [ "print(classification_report(y_test, predictions, zero_division=0))" ], "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ " precision recall f1-score support\n", "\n", "ALIMENTACION 0.00 0.00 0.00 110\n", " AUTOMOCION 0.00 0.00 0.00 148\n", " BANCA 0.50 0.01 0.02 198\n", " BEBIDAS 0.34 0.39 0.36 223\n", " DEPORTES 0.28 0.32 0.30 216\n", " RETAIL 0.25 0.69 0.37 268\n", " TELCO 0.00 0.00 0.00 79\n", "\n", " accuracy 0.28 1242\n", " macro avg 0.20 0.20 0.15 1242\n", "weighted avg 0.24 0.28 0.20 1242\n", "\n" ] } ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "hbDgBh5M_KAP", "outputId": "20900553-b2c8-43a1-c2e3-7ef344094bf3" } }, { "cell_type": "markdown", "source": [ "Do these metrics look good to you? Can you think of ways to improve them? Some ideas:\n", "\n", " - Which will work better: stemming or lemmatization?\n", " - What is the best thing to do with hashtags? 
Remove them?\n", " - How many latent factors will work best? 7, 10, 200, 300?\n", " - Is Logistic Regression the best classifier we can try? If we increase the number of topics, which classifier would be better to use?" ], "metadata": { "id": "089VY2OupEVK" } } ], "metadata": { "colab": { "name": "Topic Modeling.ipynb", "provenance": [] }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.11" } }, "nbformat": 4, "nbformat_minor": 0 }