{
"cells": [
{
"cell_type": "markdown",
"source": [
"Vectorization with classical methods\n",
"====================================\n",
"\n",
"## Introduction"
],
"metadata": {
"id": "64VtYMpfSyf-"
},
"id": "64VtYMpfSyf-"
},
{
"cell_type": "markdown",
"source": [
"A machine learning model is, roughly speaking, a parameterized function `f(x)` that takes an `n`-dimensional vector `x` as input\n",
"and produces an `m`-dimensional vector as output. Such a function can be simple (a linear model, for example) or more complex (such as a neural network).\n",
"\n",
"When we work with natural language, most of our input data represents discrete, categorical features, whether they are words, letters, or even utterances (segments of discourse). The question we will ask ourselves, then, is: how do we encode this categorical data in a way that is practical for a machine learning model to use?\n",
"\n",
"We will discuss the available options:\n",
"\n",
" - [One-hot encoding](#One-hot-encoding)\n",
" - [Index-based encoding](#Index-based-encoding)\n",
" - [Frequency-based methods](#Frequency-based-methods)\n",
"\n",
"We will use the following corpus for our examples:"
],
"metadata": {
"id": "dLRE8wzSSygF"
},
"id": "dLRE8wzSSygF"
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"corpus = [\"El hielo es agua en estado sólido\",\n",
" \"El hielo es uno de los cuatro estados naturales del agua\",\n",
" \"El agua pura se congela a 0 grados\",\n",
" \"El hielo es el nombre común del agua en estado sólido\"]"
],
"outputs": [],
"metadata": {
"id": "FFWRRU00SygG"
},
"id": "FFWRRU00SygG"
},
{
"cell_type": "markdown",
"source": [
"### Running this notebook"
],
"metadata": {
"id": "zyuxR3UTSygJ"
},
"id": "zyuxR3UTSygJ"
},
{
"cell_type": "markdown",
"source": [
"To run this notebook, install the following libraries:"
],
"metadata": {
"id": "U0BdtmHGSygK"
},
"id": "U0BdtmHGSygK"
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"!wget https://raw.githubusercontent.com/santiagxf/M72109/master/docs/nlp/vectorization/vectorization.txt --quiet --no-clobber\n",
"!pip install -r vectorization.txt"
],
"outputs": [],
"metadata": {
"id": "Z_KMq7GnSygL"
},
"id": "Z_KMq7GnSygL"
},
{
"cell_type": "markdown",
"source": [
"## Vocabulary"
],
"metadata": {
"id": "ypFyXQwbMNEO"
},
"id": "ypFyXQwbMNEO"
},
{
"cell_type": "markdown",
"source": [
"First we will build our vocabulary:"
],
"metadata": {
"id": "HPE0QyMbEpd0"
},
"id": "HPE0QyMbEpd0"
},
{
"cell_type": "code",
"source": [
"# Map each unique word in the corpus to an integer index\n",
"vocab = {word: idx for idx, word in enumerate(set(' '.join(corpus).split()))}"
],
"metadata": {
"id": "JxpYrYfmEp3q"
},
"id": "JxpYrYfmEp3q",
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"`vocab` is a dictionary that has the words as \"keys\" and their positions as \"values\"."
],
"metadata": {
"id": "u8rI3u3LMPFP"
},
"id": "u8rI3u3LMPFP"
},
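{
"cell_type": "markdown",
"source": [
"Python does not guarantee a stable iteration order for a `set` of strings across runs, so the indices assigned above can change between executions. The following is a minimal sketch of one way to make the mapping reproducible, assuming we are free to sort the words first. The names `toy_corpus`, `sorted_vocab` and `inverse_vocab` are illustrative and not part of the original notebook:\n",
"\n",
"```python\n",
"# Toy corpus used only for this sketch (hypothetical data).\n",
"toy_corpus = ['El hielo es agua', 'El agua pura']\n",
"\n",
"# Sorting the unique words makes the word -> index mapping reproducible.\n",
"words = sorted(set(' '.join(toy_corpus).split()))\n",
"sorted_vocab = {word: idx for idx, word in enumerate(words)}\n",
"\n",
"# Inverse mapping (index -> word), handy for decoding vectors back to text.\n",
"inverse_vocab = {idx: word for word, idx in sorted_vocab.items()}\n",
"\n",
"print(sorted_vocab)\n",
"```\n",
"\n",
"Sorting costs a little extra time, but the same corpus then always yields the same indices."
],
"metadata": {
"id": "sketchSortedVocab"
},
"id": "sketchSortedVocab"
},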
{
"cell_type": "markdown",
"source": [
"## One-hot encoding"
],
"metadata": {
"id": "uuc_1Ex6SygN"
},
"id": "uuc_1Ex6SygN"
},
{
"cell_type": "markdown",
"source": [
"Since text represents discrete, categorical features, it makes sense to consider the classical methods for this type of data. **One-hot encoding** is a simple technique that consists of representing each word with a vector whose length equals the vocabulary size, where every position is zero except the one corresponding to the index of the word in question. This means that the dimensionality of the input vectors depends on the size of the vocabulary and not on the size of the corpus. This type of representation has the property that all words are equally relevant to the model."
],
"metadata": {
"id": "2zskVUwgSygN"
},
"id": "2zskVUwgSygN"
},
{
"cell_type": "markdown",
"source": [
"Although this method is simple to implement, it produces sparse representations, which become costly to store and process as the vocabulary grows. However, this does not mean that this form of encoding is impractical. We will see later that this technique can be used to learn more complex representations such as [embeddings](https://m72109.readthedocs.io/es/latest/nlp/vectorization/embeddings.html). In those setups, the input to the neural network is specified as a collection of one-hot vectors."
],
"metadata": {
"id": "11FT9sOoS_V9"
},
"id": "11FT9sOoS_V9"
},
{
"cell_type": "markdown",
"source": [
"We initialize our vectors:"
],
"metadata": {
"id": "iiEVvxaFKPyF"
},
"id": "iiEVvxaFKPyF"
},
{
"cell_type": "code",
"source": [
"import numpy as np\n",
"vectors = np.zeros((len(corpus), 11, len(vocab)))"
],
"metadata": {
"id": "CUUxT666KQJo"
},
"id": "CUUxT666KQJo",
"execution_count": 87,
"outputs": []
},
{
"cell_type": "code",
"source": [
"vectors.shape"
],
"metadata": {
"id": "GVJmBxrAO9sR",
"outputId": "1c38c380-b8fb-45fc-f72c-c99dfa47101a",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"id": "GVJmBxrAO9sR",
"execution_count": 89,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"(4, 11, 23)"
]
},
"metadata": {},
"execution_count": 89
}
]
},
{
"cell_type": "markdown",
"source": [
"> What does the 11 represent?"
],
"metadata": {
"id": "gN2Bq_4PMv-y"
},
"id": "gN2Bq_4PMv-y"
},
{
"cell_type": "markdown",
"source": [
"We put a 1 at the position of each word:"
],
"metadata": {
"id": "zVe4FMt8Keqy"
},
"id": "zVe4FMt8Keqy"
},
{
"cell_type": "code",
"source": [
"for doc_idx, doc in enumerate(corpus):\n",
"    for word_idx, word in enumerate(doc.split()):\n",
"        vectors[doc_idx][word_idx][vocab[word]] = 1"
],
"metadata": {
"id": "QXwGwscUKYcp"
},
"id": "QXwGwscUKYcp",
"execution_count": 94,
"outputs": []
},
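{
"cell_type": "markdown",
"source": [
"The following is a minimal sketch of the encoding described in this section, assuming a tiny hypothetical vocabulary and plain Python lists so that it runs without NumPy. `toy_vocab`, `one_hot` and `decode` are illustrative names:\n",
"\n",
"```python\n",
"# Hypothetical vocabulary for this sketch: word -> index.\n",
"toy_vocab = {'agua': 0, 'es': 1, 'hielo': 2}\n",
"\n",
"def one_hot(word, vocab):\n",
"    # Vector of zeros with a single 1 at the index of the word.\n",
"    vector = [0] * len(vocab)\n",
"    vector[vocab[word]] = 1\n",
"    return vector\n",
"\n",
"def decode(vector, vocab):\n",
"    # Recover the word by locating the position that holds the 1.\n",
"    inverse = {idx: word for word, idx in vocab.items()}\n",
"    return inverse[vector.index(1)]\n",
"\n",
"encoded = [one_hot(w, toy_vocab) for w in 'hielo es agua'.split()]\n",
"print(encoded)\n",
"```\n",
"\n",
"Decoding each vector with `decode` recovers the original words, which is a quick way to check that the encoding loses no information about word identity."
],
"metadata": {
"id": "sketchOneHotDecode"
},
"id": "sketchOneHotDecode"
},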
{
"cell_type": "markdown",
"source": [
"## Index-based encoding"
],
"metadata": {
"id": "fKhZpjEWSygO"
},
"id": "fKhZpjEWSygO"
},
{
"cell_type": "markdown",
"source": [
"It is a technique similar to `one-hot encoding`, except that here the vectors hold the numeric indices that the words occupy in the vocabulary. In other words, each word is encoded as an integer. As a result, the size of the vector does not depend on the size of the vocabulary.\n",
"\n",
"In general, this technique offers no advantage on its own, but it is used as an intermediate representation to implement other forms of vectorization. For this reason it is usually not referred to as a **vectorization** technique but rather as a **word dictionary** (since it maps *words* to *unique IDs*). Let's see how to use it:"
],
"metadata": {
"id": "vb5pweKcSygP"
},
"id": "vb5pweKcSygP"
},
{
"cell_type": "markdown",
"source": [
"Using the vocabulary, we transform our documents into vectors:"
],
"metadata": {
"id": "v1cr590mSygQ"
},
"id": "v1cr590mSygQ"
},
{
"cell_type": "code",
"execution_count": 84,
"source": [
"vectors = [[vocab[w] for w in s.split(' ')] for s in corpus]"
],
"outputs": [],
"metadata": {
"id": "qx-DdW3MSygQ"
},
"id": "qx-DdW3MSygQ"
},
{
"cell_type": "code",
"execution_count": 85,
"source": [
"import pandas as pd\n",
"\n",
"pd.DataFrame(vectors)"
],
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" 0 1 2 3 4 5 6 7 8 9 10\n",
"0 8 14 16 7 1 19 0 NaN NaN NaN NaN\n",
"1 8 14 16 9 21 3 22 5.0 4.0 20.0 7.0\n",
"2 8 7 15 2 6 10 12 11.0 NaN NaN NaN\n",
"3 8 14 16 13 18 17 20 7.0 1.0 19.0 0.0"
]
},
"metadata": {},
"execution_count": 85
}
],
"metadata": {
"id": "seKDU01pSygQ",
"outputId": "64b7e053-33bf-4ac4-8988-cb4abb368029",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 174
}
},
"id": "seKDU01pSygQ"
},
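{
"cell_type": "markdown",
"source": [
"The `NaN` entries in the table appear because the documents have different lengths. A common workaround is to reserve one index for padding and fill every sequence up to the length of the longest one. The following is a minimal sketch, assuming index 0 is reserved for padding and real words start at 1. `PAD`, `toy_corpus`, `toy_vocab` and `pad` are illustrative names:\n",
"\n",
"```python\n",
"PAD = 0  # Assumption: index 0 is reserved for padding.\n",
"\n",
"toy_corpus = ['El hielo es agua', 'El agua pura se congela']\n",
"words = sorted(set(' '.join(toy_corpus).split()))\n",
"# Real words start at 1 so that they never collide with PAD.\n",
"toy_vocab = {word: idx for idx, word in enumerate(words, start=1)}\n",
"\n",
"sequences = [[toy_vocab[w] for w in doc.split()] for doc in toy_corpus]\n",
"max_len = max(len(seq) for seq in sequences)\n",
"\n",
"def pad(seq, length, value=PAD):\n",
"    # Append the padding value until the sequence reaches the target length.\n",
"    return seq + [value] * (length - len(seq))\n",
"\n",
"padded = [pad(seq, max_len) for seq in sequences]\n",
"print(padded)\n",
"```\n",
"\n",
"With every row the same length, the result can be stored as a rectangular integer array instead of a table with missing values."
],
"metadata": {
"id": "sketchIndexPadding"
},
"id": "sketchIndexPadding"
},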
{
"cell_type": "markdown",
"source": [
"## Frequency-based methods"
],
"metadata": {
"id": "Mk67wK4USygR"
},
"id": "Mk67wK4USygR"
},
{
"cell_type": "markdown",
"source": [
"This is a family of methods that was, for a long time, among the most widely used for converting text into numeric representations. In general, these methods represent text using a list of word frequencies, that is, they rely in some way on how often each word appears."
],
"metadata": {
"id": "VhIZIN0ySygR"
},
"id": "VhIZIN0ySygR"
},
{
"cell_type": "markdown",
"source": [
"Methods based on bag-of-words have the following limitations:\n",
"\n",
" - Word order is ignored.\n",
" - The frequency of a word does not necessarily capture its importance.\n",
" - Marginal frequencies play an important role (the relationship between rows and columns)."
],
"metadata": {
"id": "fyRnJpBiSygR"
},
"id": "fyRnJpBiSygR"
},
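{
"cell_type": "markdown",
"source": [
"The first limitation can be illustrated with a small sketch. It assumes two hypothetical sentences that contain the same words in a different order, and it uses `collections.Counter` as a stand-in for a bag-of-words vectorizer:\n",
"\n",
"```python\n",
"from collections import Counter\n",
"\n",
"# Two hypothetical sentences with the same words but opposite meaning.\n",
"doc_a = 'el perro persigue al gato'\n",
"doc_b = 'el gato persigue al perro'\n",
"\n",
"bow_a = Counter(doc_a.split())\n",
"bow_b = Counter(doc_b.split())\n",
"\n",
"# Both documents produce exactly the same bag of words.\n",
"print(bow_a == bow_b)\n",
"```\n",
"\n",
"A model that only sees these counts cannot tell which animal is chasing which."
],
"metadata": {
"id": "sketchWordOrder"
},
"id": "sketchWordOrder"
},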
{
"cell_type": "markdown",
"source": [
"### Term frequency"
],
"metadata": {
"id": "RQJEfZHSSygR"
},
"id": "RQJEfZHSSygR"
},
{
"cell_type": "markdown",
"source": [
"It uses a vector whose length equals the vocabulary size, but where the values correspond to the frequency of the word `w` in the document `D`. The most frequent words carry more relevance.\n",
"\n",
"$$TF = \\frac {freq(w_i)} {len(doc)} $$"
],
"metadata": {
"id": "O32ZdBqQSygR"
},
"id": "O32ZdBqQSygR"
},
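{
"cell_type": "markdown",
"source": [
"The following is a minimal sketch of the formula above in plain Python, assuming whitespace tokenization and lowercasing. `term_frequency` is an illustrative helper. Note that `CountVectorizer`, used in the next cell, returns raw counts, that is, only the numerator, without dividing by the document length.\n",
"\n",
"```python\n",
"from collections import Counter\n",
"\n",
"def term_frequency(doc):\n",
"    # Relative frequency of each word: count divided by document length.\n",
"    tokens = doc.lower().split()\n",
"    counts = Counter(tokens)\n",
"    return {word: count / len(tokens) for word, count in counts.items()}\n",
"\n",
"tf = term_frequency('El hielo es el nombre del agua')\n",
"print(tf['el'])\n",
"```\n",
"\n",
"Here `el` appears twice among seven tokens, so its term frequency is 2/7, and the frequencies of a document always add up to 1."
],
"metadata": {
"id": "sketchTermFrequency"
},
"id": "sketchTermFrequency"
},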
{
"cell_type": "code",
"execution_count": 104,
"source": [
"from sklearn.feature_extraction.text import CountVectorizer\n",
"\n",
"vectorizer = CountVectorizer(min_df=0., max_df=1.)\n",
"vectors = vectorizer.fit_transform(corpus)"
],
"outputs": [],
"metadata": {
"id": "_yHWF2NySygS"
},
"id": "_yHWF2NySygS"
},
{
"cell_type": "code",
"execution_count": 105,
"source": [
"vectors = vectors.todense()"
],
"outputs": [],
"metadata": {
"id": "_CuP_PYSSygS"
},
"id": "_CuP_PYSSygS"
},
{
"cell_type": "code",
"execution_count": 106,
"source": [
"vectors.shape"
],
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"(4, 20)"
]
},
"metadata": {},
"execution_count": 106
}
],
"metadata": {
"id": "dZHgVaQPSygS",
"outputId": "612f4560-26a3-45c3-cd09-da74ffdd38e1",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"id": "dZHgVaQPSygS"
},
{
"cell_type": "markdown",
"source": [
"> What does the 20 represent in the dimensions of the vector?"
],
"metadata": {
"id": "EPNzHBq9SygS"
},
"id": "EPNzHBq9SygS"
},
{
"cell_type": "code",
"execution_count": 107,
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"# Get all the words in the vocabulary\n",
"vocab = vectorizer.get_feature_names_out()\n",
"# Vectors for each of the documents\n",
"pd.DataFrame(vectors, columns=vocab)"
],
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" agua común congela cuatro de del el en es estado estados grados \\\n",
"0 1 0 0 0 0 0 1 1 1 1 0 0 \n",
"1 1 0 0 1 1 1 1 0 1 0 1 0 \n",
"2 1 0 1 0 0 0 1 0 0 0 0 1 \n",
"3 1 1 0 0 0 1 2 1 1 1 0 0 \n",
"\n",
" hielo los naturales nombre pura se sólido uno \n",
"0 1 0 0 0 0 0 1 0 \n",
"1 1 1 1 0 0 0 0 1 \n",
"2 0 0 0 0 1 1 0 0 \n",
"3 1 0 0 1 0 0 1 0 "
]
},
"metadata": {},
"execution_count": 107
}
],
"metadata": {
"id": "jeAe-jbxSygT",
"outputId": "5239ee1c-2f0e-4cfc-83e8-03c33181853f",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 174
}
},
"id": "jeAe-jbxSygT"
},
{
"cell_type": "markdown",
"source": [
"### TF-IDF"
],
"metadata": {
"id": "tqDCQpTHSygT"
},
"id": "tqDCQpTHSygT"
},
{
"cell_type": "markdown",
"source": [
"This method is similar to term frequency, but it aims to adjust the frequency of a word in each document by considering how frequent the word is across the corpus (its dispersion). Dispersion refers precisely to how evenly words are distributed among the different documents of the corpus.\n",
"\n",
"This method assumes that:\n",
"\n",
" - The more frequent a word is across the corpus (more dispersed), the more general its meaning.\n",
" - The more concentrated the use of a word is within the corpus (low dispersion), the more likely it is that the word represents a specific topic.\n",
"\n",
"To compute the weight of each word we need:\n",
"\n",
" - **TF:** The frequency of the term in the document.\n",
"\n",
" $$TF = \\frac {freq(w_i)} {len(doc)} $$\n",
"\n",
" - **IDF:** The inverse frequency of the term across the corpus, where $freq(w_i, corpus)$ is the number of documents that contain $w_i$ and $len(corpus)$ is the number of documents.\n",
"\n",
"$$IDF = 1 + log(\\frac {len(corpus)} {freq(w_i, corpus)}) $$"
],
"metadata": {
"id": "rMp0s9bqSygT"
},
"id": "rMp0s9bqSygT"
},
{
"cell_type": "markdown",
"source": [
"Finally, `tfidf` is computed as the product of the two terms. Additionally, it is normalized using `L2`, that is, the Euclidean norm, computed over all the words of the document vector. `L2` reduces the magnitude of all the weights but does not make them 0. Although this is less efficient in terms of memory, it can be useful if we want or need to retain all the parameters.\n",
"\n",
"$$\n",
" \\textit{TF-IDF}_{normalized}(w_i) = \\frac{tf_i \\times idf_i}{\\sqrt{\\sum_j (tf_j \\times idf_j)^2}}\n",
"$$"
],
"metadata": {
"id": "FKH5deeOSygT"
},
"id": "FKH5deeOSygT"
},
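{
"cell_type": "markdown",
"source": [
"The following is a minimal sketch of the whole computation in plain Python, assuming whitespace tokenization and the unsmoothed IDF written above. `TfidfVectorizer` applies a smoothed variant of IDF by default and uses raw counts for TF, so its numbers will differ from this sketch. `toy_corpus`, `tfidf` and `tfidf_vectors` are illustrative names:\n",
"\n",
"```python\n",
"import math\n",
"from collections import Counter\n",
"\n",
"toy_corpus = ['el hielo es agua', 'el agua pura', 'el hielo']\n",
"docs = [doc.split() for doc in toy_corpus]\n",
"vocab = sorted({word for doc in docs for word in doc})\n",
"\n",
"# Document frequency: number of documents that contain each word.\n",
"df = {w: sum(w in doc for doc in docs) for w in vocab}\n",
"idf = {w: 1 + math.log(len(docs) / df[w]) for w in vocab}\n",
"\n",
"def tfidf(doc):\n",
"    counts = Counter(doc)\n",
"    # TF (relative frequency) times IDF for every word in the vocabulary.\n",
"    raw = [counts[w] / len(doc) * idf[w] for w in vocab]\n",
"    # L2 normalization: divide by the Euclidean norm of the vector.\n",
"    norm = math.sqrt(sum(x * x for x in raw))\n",
"    return [x / norm for x in raw]\n",
"\n",
"tfidf_vectors = [tfidf(doc) for doc in docs]\n",
"print([round(x, 2) for x in tfidf_vectors[2]])\n",
"```\n",
"\n",
"In the last document, `hielo` receives a larger weight than `el` because `el` appears in every document and therefore has the lowest IDF."
],
"metadata": {
"id": "sketchManualTfidf"
},
"id": "sketchManualTfidf"
},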
{
"cell_type": "markdown",
"source": [
"Let's see how to apply it:"
],
"metadata": {
"id": "srjr6hnKSygT"
},
"id": "srjr6hnKSygT"
},
{
"cell_type": "code",
"execution_count": 108,
"source": [
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"\n",
"vectorizer = TfidfVectorizer()\n",
"vectors = vectorizer.fit_transform(corpus)"
],
"outputs": [],
"metadata": {
"id": "m12pg5HiSygT"
},
"id": "m12pg5HiSygT"
},
{
"cell_type": "code",
"execution_count": 110,
"source": [
"vectors = vectors.todense()\n",
"vectors.shape"
],
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"(4, 20)"
]
},
"metadata": {},
"execution_count": 110
}
],
"metadata": {
"id": "yrICiomwSygU",
"outputId": "53df8b11-2c41-4b4a-eaf6-da1e21cad54c",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"id": "yrICiomwSygU"
},
{
"cell_type": "markdown",
"source": [
"> Note that changing the vectorization method in these cases does not change the length of our vectors (which is always given by the size of the vocabulary, 20 in this case). Only the numeric values assigned within the vectors change."
],
"metadata": {
"id": "j54VtxB0SygU"
},
"id": "j54VtxB0SygU"
},
{
"cell_type": "code",
"execution_count": 111,
"source": [
"vocab = vectorizer.get_feature_names_out()\n",
"pd.DataFrame(np.round(vectors, 2), columns=vocab)"
],
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" agua común congela cuatro de del el en es estado \\\n",
"0 0.29 0.00 0.00 0.00 0.00 0.00 0.29 0.44 0.36 0.44 \n",
"1 0.18 0.00 0.00 0.35 0.35 0.28 0.18 0.00 0.23 0.00 \n",
"2 0.24 0.00 0.47 0.00 0.00 0.00 0.24 0.00 0.00 0.00 \n",
"3 0.20 0.39 0.00 0.00 0.00 0.31 0.40 0.31 0.25 0.31 \n",
"\n",
" estados grados hielo los naturales nombre pura se sólido uno \n",
"0 0.00 0.00 0.36 0.00 0.00 0.00 0.00 0.00 0.44 0.00 \n",
"1 0.35 0.00 0.23 0.35 0.35 0.00 0.00 0.00 0.00 0.35 \n",
"2 0.00 0.47 0.00 0.00 0.00 0.00 0.47 0.47 0.00 0.00 \n",
"3 0.00 0.00 0.25 0.00 0.00 0.39 0.00 0.00 0.31 0.00 "
]
},
"metadata": {},
"execution_count": 111
}
],
"metadata": {
"id": "N4dmJHarSygU",
"outputId": "8bb32cb3-efbd-4f41-e7c9-ff20a0ea7feb",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 174
}
},
"id": "N4dmJHarSygU"
},
{
"cell_type": "markdown",
"source": [
"## Conclusion"
],
"metadata": {
"id": "Be_gwPEcSygV"
},
"id": "Be_gwPEcSygV"
},
{
"cell_type": "markdown",
"source": [
"In these methods, each feature or word that is part of our input is represented in its own dimension, and as a result we obtain representations that depend on the size of the vocabulary. These representations have the characteristic that the representation of each word is independent of the others. This means that the vector for the word *dog* is as different from the vector for *cat* as it is from the vector for *refrigerator*.\n",
"\n",
"In the next section we will look at vectorization techniques based on continuous vector spaces, where these characteristics do not hold."
],
"metadata": {
"id": "Wty0OepXSygV"
},
"id": "Wty0OepXSygV"
}
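,
{
"cell_type": "markdown",
"source": [
"The independence between word vectors described in this conclusion can be checked with a small sketch, assuming a hypothetical three-word vocabulary encoded as one-hot vectors. `cosine` is an illustrative helper:\n",
"\n",
"```python\n",
"import math\n",
"\n",
"# Hypothetical one-hot vectors for a three-word vocabulary.\n",
"dog = [1, 0, 0]\n",
"cat = [0, 1, 0]\n",
"refrigerator = [0, 0, 1]\n",
"\n",
"def cosine(u, v):\n",
"    # Cosine similarity: dot product divided by the product of the norms.\n",
"    dot = sum(a * b for a, b in zip(u, v))\n",
"    norm_u = math.sqrt(sum(a * a for a in u))\n",
"    norm_v = math.sqrt(sum(b * b for b in v))\n",
"    return dot / (norm_u * norm_v)\n",
"\n",
"# Every pair of distinct words is equally dissimilar.\n",
"print(cosine(dog, cat), cosine(dog, refrigerator))\n",
"```\n",
"\n",
"Both similarities are zero, so the encoding carries no notion that *dog* is closer in meaning to *cat* than to *refrigerator*."
],
"metadata": {
"id": "sketchOneHotCosine"
},
"id": "sketchOneHotCosine"
}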
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.11"
},
"colab": {
"provenance": []
}
},
"nbformat": 4,
"nbformat_minor": 5
}