{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "iKENeWEEwKIb" }, "source": [ "Splitting long documents into subsequences\n", "=============================================" ] }, { "cell_type": "markdown", "metadata": { "id": "sT_t9OxYwKIc" }, "source": [ "## Introduction\n", "\n", "The text sequences we work with could, in principle, contain any number of words. Computing power, however, is finite, so we need to impose limits on the dimensions of our input data. The maximum text length we can process depends on the computational requirements of the model we are using and on its assumptions. More complex models put more pressure on hardware resources and may therefore be more restrictive about the number of words we can process.\n", "\n", "We have several options for working around this limitation:\n", "\n", "- Truncate the sequences to the maximum available length, hoping that all the relevant information is in the resulting sequence. This clearly loses information, and whether it is a viable alternative depends on how much is lost.\n", "- Use a model that operates on longer sequences.\n", "- Run the model on smaller subsequences and then train a meta-model that takes the predictions for each subsequence and combines them.\n", "- Split the sequence into smaller subsequences while keeping some of the context from the subsequence that precedes the one being processed, then run the model treating each subsequence as a separate document. The predictions for all the subsequences are then aggregated using some function. For an example of this last approach, see Splitting long documents into subsequences.\n", "\n", "In this example we will explore how to carry out the last option:" ] },
{ "cell_type": "markdown", "metadata": { "id": "7b_oCydHKHEJ" }, "source": [ "### Setting up the environment" ] }, { "cell_type": "markdown", "metadata": { "id": "nFNu3tUYAKe0" }, "source": [ "We install the required libraries:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "uFcb2-bO_tVS", "outputId": "6e6af0d6-4e6a-4aa3-f616-3f5df4f40b70", "colab": { "base_uri": "https://localhost:8080/" } }, "outputs": [], "source": [ "!wget https://raw.githubusercontent.com/santiagxf/M72109/master/m72109/nlp/transformation.py \\\n", "    --no-clobber --quiet --directory-prefix ./m72109/nlp/\n", "!wget https://raw.githubusercontent.com/santiagxf/M72109/master/docs/nlp/preprocessing/long_sequences.txt \\\n", "    --no-clobber --quiet\n", "!pip install -r long_sequences.txt -q" ] }, { "cell_type": "markdown", "metadata": { "id": "QZrG1xI-61j0" }, "source": [ "### About the dataset" ] }, {
"cell_type": "markdown", "metadata": { "id": "_gBXNzwYwKIu" }, "source": [ "In this case we cannot use the dataset from the previous examples, because tweets are generally short text sequences of at most 140 characters. Our models would therefore probably never run into the problem of long texts.\n", "\n", "For this example we will use the \"20 Newsgroups\" dataset. It comprises around 18,000 newsgroup posts on 20 different topics, ranging from sports to aeronautics-related news.\n", "\n", "To keep things simple, we will only load posts from the 'alt.atheism' and 'sci.space' groups." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "tW6_ZXyGAOfs" }, "outputs": [], "source": [ "from sklearn.datasets import fetch_20newsgroups\n", "\n", "categories = ['alt.atheism', 'sci.space']\n", "newsgroups_train = fetch_20newsgroups(subset='train', categories=categories)" ] }, { "cell_type": "markdown", "metadata": { "id": "ewFjFRl9KSwQ" }, "source": [ "We can see that this dataset contains the following number of samples:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "eFEABYV3CNmi", "outputId": "9feefbc1-f29a-4832-8828-e336d610a211" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Cantidad de textos: 1073 \n", "Cantidad de anotaciones: 1073\n" ] } ], "source": [ "print(\"Cantidad de textos:\", len(newsgroups_train.data), \"\\nCantidad de anotaciones:\", newsgroups_train.target.shape[0])" ] }, { "cell_type": "markdown", "metadata": { "id": "vBdAoarcKvVn" }, "source": [ "Let's create a `pandas` `pd.DataFrame` to make the data easier to manipulate:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "qCtqp7xw6LY6" }, "outputs": [], "source": [
"import pandas as pd\n", "\n", "df = pd.DataFrame({ 'text': newsgroups_train.data, 'category': newsgroups_train.target })" ] }, { "cell_type": "markdown", "metadata": { "id": "tlploSNM6LY6" }, "source": [ "## Working around text length limits" ] }, { "cell_type": "markdown", "metadata": { "id": "qRbtuoJLK7Sf" }, "source": [ "First, let's check the lengths of our documents to get a sense of how long they can be. We will split each document into words and then plot a histogram to see how often each document length occurs:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "1MjFAxw1EorM" }, "outputs": [], "source": [ "from tensorflow.keras.preprocessing.text import Tokenizer\n", "\n", "tokenizer = Tokenizer(num_words=10000)\n", "tokenizer.fit_on_texts(df['text'])\n", "\n", "tokenized_text = tokenizer.texts_to_sequences(df['text'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 524 }, "id": "-TAJSnfAEQbP", "outputId": "cd008122-776f-426e-9f98-b1b21b1a068d" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "" ] }, "metadata": {}, "execution_count": 5 }, { "output_type": "display_data", "data": { "text/plain": [ "
" ], "image/png": "iVBORw0KGgoAAAANSUhEUgAAAekAAAHpCAYAAACmzsSXAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjcuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/bCgiHAAAACXBIWXMAAA9hAAAPYQGoP6dpAAAvSklEQVR4nO3df3RU1b3//1dCMpPwIxMCJENsItFSQERBQIxyrUhq+HEVhf7AFWlquaAW0JB+UXMVbK0apFYpGIi4FGovlFs+V6lSizcGAV2GSEJRQYxwjcIC80NjZkiUITD7+4fNlClBJUwyO8nzsdZZMHvv7LzPyYJXzpk950QYY4wAAIB1IsNdAAAAaBkhDQCApQhpAAAsRUgDAGApQhoAAEsR0gAAWIqQBgDAUoS0JGOMvF6v+Mg4AMAmhLSko0ePyuVy6ejRo+EuBQCAAEIaAABLEdIAAFiKkAYAwFKENAAAliKkAQCwFCENAIClCGkAACxFSAMAYClCGgAASxHSAABYipAGAMBShDQAAJYipAEAsBQhDQCApQhpAAAsRUgDAGApQhoAAEsR0gAAWIqQBgDAUlHhLqAz8/v9qq2tlST169dPkZH8TgQA+PZIjTZUW1ur7BVFyl5RFAhrAAC+Lc6k21hMr97hLgEA0EFxJg0AgKUIaQAALEVIAwBgKUIaAABLEdIAAFiKkAYAwFKENAAAliKkAQCwFCENAIClCGkAACxFSAMAYClCGgAASxHSAABYKqwhvX37dl1//fVKTk5WRESENm7ceNqYffv26YYbbpDL5VKPHj00evRoHTx4MNB/7NgxzZkzR3369FHPnj01bdo0VVdXt+NeAADQNsIa0o2Njbr00ktVUFDQYv///d//aezYsRo8eLC2bt2qd955RwsXLlRMTExgzPz58/XSSy9pw4YN2rZtm44cOaKpU6e21y4AANBmwvo86YkTJ2rixIln7L/vvvs0adIkLVmyJNB24YUXBv7u8Xj0zDPPaN26dbr22mslSatXr9aQIUO0Y8cOXXHFFS3O6/P55PP5Aq+9Xu+57goAACFn7XvSfr9ff/3rX/W9731PmZmZSkxM1JgxY4IuiZeXl6upqUkZGRmBtsGDBys1NVUlJSVnnDs/P18ulyuwpaSktOWuAADQKtaGdE1NjRoaGrR48WJNmDBB//u//6ubbrpJU6dO1bZt2yRJVVVVcjgcio+PD/rapKQkVVVVnXHuvLw8eTyewHbo0KG23BUAAFolrJe7v47f75ckTZkyRfPnz5ckDR8+XG+++aYKCwv1/e9/v9VzO51OOZ3OkNQJAEBbsfZMum/fvoqKitJFF10U1D5kyJDA6m63263jx4+rvr4+aEx1dbXcbnd7lQoAQJuwNqQdDodGjx6tioqKoPYPPvhA559/viRp5MiRio6OVnFxcaC/oqJCBw8eVHp6ervWCwBAqIX1cndDQ4MOHDgQeF1ZWandu3crISFBqampWrBggX7yk5/o6quv1rhx47R582a99NJL2rp1qyTJ5XJp5syZys3NVUJCguLi4jRv3jylp6efcWU3AAAdRVhDuqysTOPGjQu8zs3NlSRlZ2drzZo1uummm1RYWKj8/HzdeeedGjRokP7nf/5HY8eODXzNE088ocjISE2bNk0+n0+ZmZlasWJFu+8LAAChFmGMMeEuIty8Xq9cLpc8Ho/i4uJCNm91dbVu+2OZJOmpGaOUlJQUsrkBAJ2fte9JAwDQ1RHSAABYipAGAMBShDQAAJYipAEAsBQhDQCApQhpAAAsRUgDAGApQhoAAEtZ+6jKzsT4/aqtrZUk9evXT5GR/G4EAPhmpEU78DV6lLOuTNkrigJhDQDAN+FMup04esbL4YgOdxkAgA6EM2kAACxFSAMAYClCGgAASxHSAABYipAGAMBShDQAAJYipAEAsBQhDQCApQhpAAAsRUgDAGApQhoAAEsR0gAAWIqQBgDAUoQ0A
ACWIqQBALAUIQ0AgKUIaQAALEVIAwBgKUIaAABLEdIAAFiKkAYAwFKENAAAliKkAQCwFCENAIClCGkAACxFSAMAYClCGgAAS4U1pLdv367rr79eycnJioiI0MaNG8849vbbb1dERISWLl0a1F5XV6esrCzFxcUpPj5eM2fOVENDQ9sWDgBAOwhrSDc2NurSSy9VQUHB14574YUXtGPHDiUnJ5/Wl5WVpb1796qoqEibNm3S9u3bNXv27LYqGQCAdhMVzm8+ceJETZw48WvHHD58WPPmzdMrr7yiyZMnB/Xt27dPmzdv1s6dOzVq1ChJ0vLlyzVp0iQ99thjLYY6AAAdhdXvSfv9fs2YMUMLFizQ0KFDT+svKSlRfHx8IKAlKSMjQ5GRkSotLT3jvD6fT16vN2gDAMA2Vof0o48+qqioKN15550t9ldVVSkxMTGoLSoqSgkJCaqqqjrjvPn5+XK5XIEtJSUlpHUDABAK1oZ0eXm5fv/732vNmjWKiIgI6dx5eXnyeDyB7dChQyGdHwCAULA2pF9//XXV1NQoNTVVUVFRioqK0scff6xf/vKXGjBggCTJ7XarpqYm6OtOnDihuro6ud3uM87tdDoVFxcXtAEAYJuwLhz7OjNmzFBGRkZQW2ZmpmbMmKFbb71VkpSenq76+nqVl5dr5MiRkqQtW7bI7/drzJgx7V4zAAChFNaQbmho0IEDBwKvKysrtXv3biUkJCg1NVV9+vQJGh8dHS23261BgwZJkoYMGaIJEyZo1qxZKiwsVFNTk+bOnavp06ezshsA0OGF9XJ3WVmZRowYoREjRkiScnNzNWLECC1atOhbz7F27VoNHjxY48eP16RJkzR27FitWrWqrUoGAKDdhPVM+pprrpEx5luP/+ijj05rS0hI0Lp160JYFQAAdrB24RgAAF0dIQ0AgKUIaQAALEVIAwBgKUIaAABLEdIAAFiKkAYAwFKENAAAliKkAQCwFCENAIClCGkAACxFSAMAYClCGgAASxHSAABYipAGAMBShDQAAJYipAEAsBQh3Y6M36/a2lr5/f5wlwIA6AAI6Xbka/TojlWvqra2NtylAAA6AEK6nTl6uMJdAgCggyCkAQCwFCENAIClCGkAACxFSAMAYClCGgAASxHSAABYipAGAMBShDQAAJYipAEAsBQhDQCApQhpAAAsRUgDAGApQhoAAEsR0gAAWIqQBgDAUoQ0AACWIqQBALAUIQ0AgKUIaQAALEVIAwBgqbCG9Pbt23X99dcrOTlZERER2rhxY6CvqalJ99xzj4YNG6YePXooOTlZP/3pT3XkyJGgOerq6pSVlaW4uDjFx8dr5syZamhoaOc9AQAg9MIa0o2Njbr00ktVUFBwWt8XX3yhXbt2aeHChdq1a5eef/55VVRU6IYbbggal5WVpb1796qoqEibNm3S9u3bNXv27PbaBQAA2kxUOL/5xIkTNXHixBb7XC6XioqKgtqefPJJXX755Tp48KBSU1O1b98+bd68WTt37tSoUaMkScuXL9ekSZP02GOPKTk5uc33AQCAttKh3pP2eDyKiIhQfHy8JKmkpETx8fGBgJakjIwMRUZGqrS09Izz+Hw+eb3eoA0AANt0mJA+duyY7rnnHt18882Ki4uTJFVVVSkxMTFoXFRUlBISElRVVXXGufLz8+VyuQJbSkpKm9YOAEBrdIiQbmpq0o9//GMZY7Ry5cpzni8vL08ejyewHTp0KARVAgAQWmF9T/rbaA7ojz/+WFu2bAmcRUuS2+1WTU1N0PgTJ06orq5Obrf7jHM6nU45nc42qxkAgFCw+ky6OaD379+vV199VX369AnqT09PV319vcrLywNtW7Zskd/v15gxY9q7XAAAQiqsZ9INDQ06cOBA4HVlZaV2796thIQE9e/fXz/84Q+1a9cubdq0SSdPngy8z5yQkCCHw6EhQ4ZowoQJmjVrlgoLC9XU1KS5c+dq+vTprOwGAHR4YQ3psrIyj
Rs3LvA6NzdXkpSdna1f/epXevHFFyVJw4cPD/q61157Tddcc40kae3atZo7d67Gjx+vyMhITZs2TcuWLWuX+gEAaEthDelrrrlGxpgz9n9dX7OEhAStW7culGUBAGAFq9+TBgCgKyOkAQCwFCENAIClCGkAACxFSAMAYClCGgAASxHSAABYipAGAMBShDQAAJYipAEAsBQhDQCApQhpAAAsRUgDAGApQhoAAEsR0gAAWIqQBgDAUoQ0AACWIqQBALAUIQ0AgKUIaQAALEVIAwBgKUIaAABLEdIAAFiKkAYAwFKENAAAliKkAQCwFCENAIClCGkAACxFSAMAYClCGgAAS0WFu4DOyO/3q7a2VrW1tZIJdzUAgI6KkG4DtbW1yl5RJF+DR7F9zgt3OQCADoqQbiMxvXqHuwQAQAdHSLczY766FC5J/fr1U2QkywIAAC0jIdpZ0xdHlbOuTNkrigJhDQBASziTDgNHz3g5HNHhLgMAYDnOpAEAsBQhDQCApQhpAAAsRUgDAGCpsIb09u3bdf311ys5OVkRERHauHFjUL8xRosWLVL//v0VGxurjIwM7d+/P2hMXV2dsrKyFBcXp/j4eM2cOVMNDQ3tuBcAALSNsIZ0Y2OjLr30UhUUFLTYv2TJEi1btkyFhYUqLS1Vjx49lJmZqWPHjgXGZGVlae/evSoqKtKmTZu0fft2zZ49u712AQCANhPWj2BNnDhREydObLHPGKOlS5fq/vvv15QpUyRJzz33nJKSkrRx40ZNnz5d+/bt0+bNm7Vz506NGjVKkrR8+XJNmjRJjz32mJKTk9ttXwAACDVr35OurKxUVVWVMjIyAm0ul0tjxoxRSUmJJKmkpETx8fGBgJakjIwMRUZGqrS09Ixz+3w+eb3eoA0AANtYG9JVVVWSpKSkpKD2pKSkQF9VVZUSExOD+qOiopSQkBAY05L8/Hy5XK7AlpKSEuLqAQA4d9aGdFvKy8uTx+MJbIcOHQp3SQAAnMbakHa73ZKk6urqoPbq6upAn9vtVk1NTVD/iRMnVFdXFxjTEqfTqbi4uKANAADbWBvSaWlpcrvdKi4uDrR5vV6VlpYqPT1dkpSenq76+nqVl5cHxmzZskV+v19jxoxp95oBAAilsK7ubmho0IEDBwKvKysrtXv3biUkJCg1NVU5OTl66KGHNHDgQKWlpWnhwoVKTk7WjTfeKEkaMmSIJkyYoFmzZqmwsFBNTU2aO3eupk+fzspuAECHF9aQLisr07hx4wKvc3NzJUnZ2dlas2aN7r77bjU2Nmr27Nmqr6/X2LFjtXnzZsXExAS+Zu3atZo7d67Gjx+vyMhITZs2TcuWLWv3fQEAINQijDEm3EWEm9frlcvlksfjCcn709XV1brtj2U6dvRzRTp7yu9rCPzZ+Pmn6uUeIIcjWk/NGHXa6nUAAJpZ+540AABdHSENAIClCGkAACxFSAMAYClCGgAASxHSAABYipAGAMBSrQrpCy64QJ999tlp7fX19brgggvOuSgAANDKkP7oo4908uTJ09p9Pp8OHz58zkUBAICzvC3oiy++GPj7K6+8IpfLFXh98uRJFRcXa8CAASErDgCAruysQrr5wRYRERHKzs4O6ouOjtaAAQP0u9/9LmTFAQDQlZ1VSPv9fklfPUZy586d6tu3b5sUBQAAWvkUrMrKylDXAQAA/kWrH1VZXFys4uJi1dTUBM6wmz377LPnXBgAAF1dq0L617/+tR588EGNGjVK/fv3V0RERKjrAgCgy2tVSBcWFmrNmjWaMWNGqOsBAAD/0KrPSR8/flxXXnllqGsBAACnaFVI/8d//IfWrVsX6loAAMApWnW5+9ixY1q1apVeffVVXXLJJYqOjg7qf/zxx0NSHAAAXVmrQvqdd97R8OHDJUl79uwJ6mMRGQAAodGqkH7ttddCXQcAAPgXPKoSAABLtepMety4cV97WXvLli2tLggAAHylVSHd/H50s6amJu3evVt79
uw57cEbAACgdVoV0k888USL7b/61a/U0NBwTgUBAICvhPQ96VtuuYX7dgMAECIhDemSkhLFxMSEckoAALqsVl3unjp1atBrY4w++eQTlZWVaeHChSEpDACArq5VIe1yuYJeR0ZGatCgQXrwwQd13XXXhaQwAAC6ulaF9OrVq0NdBwAA+BetCulm5eXl2rdvnyRp6NChGjFiREiKAgAArQzpmpoaTZ8+XVu3blV8fLwkqb6+XuPGjdP69evVr1+/UNYIAECX1KrV3fPmzdPRo0e1d+9e1dXVqa6uTnv27JHX69Wdd94Z6hoBAOiSWnUmvXnzZr366qsaMmRIoO2iiy5SQUEBC8cAAAiRVp1J+/3+054hLUnR0dHy+/3nXBQAAGhlSF977bW66667dOTIkUDb4cOHNX/+fI0fPz5kxQEA0JW1KqSffPJJeb1eDRgwQBdeeKEuvPBCpaWlyev1avny5aGuEQCALqlV70mnpKRo165devXVV/X+++9LkoYMGaKMjIyQFgcAQFd2VmfSW7Zs0UUXXSSv16uIiAj94Ac/0Lx58zRv3jyNHj1aQ4cO1euvv95WtQIA0KWcVUgvXbpUs2bNUlxc3Gl9LpdLt912mx5//PGQFQcAQFd2ViH99ttva8KECWfsv+6661ReXn7ORTU7efKkFi5cqLS0NMXGxurCCy/Ub37zGxljAmOMMVq0aJH69++v2NhYZWRkaP/+/SGrAQCAcDmrkK6urm7xo1fNoqKiVFtbe85FNXv00Ue1cuVKPfnkk9q3b58effRRLVmyJGhx2pIlS7Rs2TIVFhaqtLRUPXr0UGZmpo4dOxayOgAACIezCunzzjtPe/bsOWP/O++8o/79+59zUc3efPNNTZkyRZMnT9aAAQP0wx/+UNddd53eeustSV+dRS9dulT333+/pkyZoksuuUTPPfecjhw5oo0bN4asDgAAwuGsQnrSpElauHBhi2epX375pR544AH9+7//e8iKu/LKK1VcXKwPPvhA0leX29944w1NnDhRklRZWamqqqqgVeUul0tjxoxRSUnJGef1+Xzyer1BGwAAtjmrj2Ddf//9ev755/W9731Pc+fO1aBBgyRJ77//vgoKCnTy5Endd999ISvu3nvvldfr1eDBg9WtWzedPHlSDz/8sLKysiRJVVVVkqSkpKSgr0tKSgr0tSQ/P1+//vWvQ1YnAABt4axCOikpSW+++abuuOMO5eXlBRZwRUREKDMzUwUFBacF5rn485//rLVr12rdunUaOnSodu/erZycHCUnJys7O7vV8+bl5Sk3Nzfw2uv1KiUlJRQlAwAQMmd9M5Pzzz9fL7/8sj7//HMdOHBAxhgNHDhQvXv3DnlxCxYs0L333qvp06dLkoYNG6aPP/5Y+fn5ys7OltvtlvTVgrZT3wuvrq7W8OHDzziv0+mU0+kMeb0AAIRSq24LKkm9e/fW6NGjdfnll7dJQEvSF198ocjI4BK7desWeIhHWlqa3G63iouLA/1er1elpaVKT09vk5oAAGgvrbotaHu5/vrr9fDDDys1NVVDhw7V3//+dz3++OP6+c9/Lumry+w5OTl66KGHNHDgQKWlpWnhwoVKTk7WjTfeGN7iAQA4R1aH9PLly7Vw4UL94he/UE1NjZKTk3Xbbbdp0aJFgTF33323GhsbNXv2bNXX12vs2LHavHmzYmJiwlg5AADnLsKcevuuLsrr9crlcsnj8bR4y9OzVV1drdv+WKZjRz9XpLOn/L6GwJ+Nn3+qXu4Bcjii9dSMUSFdaAcA6Fxa/Z40AABoW4Q0AACWIqQBALAUIQ0AgKUIaQAALEVIAwBgKUIaAABLEdIAAFiKkAYAwFKENAAAliKkAQCwFCENAIClCGkAACxFSAMAYClCGgAASxHSAABYipAGAMBShDQAAJYipAEAsBQhDQCApQhpAAAsRUgDAGApQhoAAEsR0gAAWIqQBgDAUoQ0AACWIqQBALAUIQ0AgKUIa
QAALEVIAwBgqahwF9BVGb9ftbW1kqR+/fopMpLflwAAwUiGMPE1epSzrkzZK4oCYQ0AwKk4kw4jR894ORzR4S4DAGApzqQBALAUIQ0AgKUIaQAALEVIAwBgKUIaAABLEdIAAFiKkAYAwFLWh/Thw4d1yy23qE+fPoqNjdWwYcNUVlYW6DfGaNGiRerfv79iY2OVkZGh/fv3h7FiAABCw+qQ/vzzz3XVVVcpOjpaf/vb3/Tee+/pd7/7nXr37h0Ys2TJEi1btkyFhYUqLS1Vjx49lJmZqWPHjoWxcgAAzp3Vdxx79NFHlZKSotWrVwfa0tLSAn83xmjp0qW6//77NWXKFEnSc889p6SkJG3cuFHTp09v95oBAAgVq8+kX3zxRY0aNUo/+tGPlJiYqBEjRujpp58O9FdWVqqqqkoZGRmBNpfLpTFjxqikpOSM8/p8Pnm93qANAADbWB3SH374oVauXKmBAwfqlVde0R133KE777xTf/jDHyRJVVVVkqSkpKSgr0tKSgr0tSQ/P18ulyuwpaSktN1OAADQSlaHtN/v12WXXaZHHnlEI0aM0OzZszVr1iwVFhae07x5eXnyeDyB7dChQyGq+Ow1P7LS7/eHrQYAgJ2sDun+/fvroosuCmobMmSIDh48KElyu92SpOrq6qAx1dXVgb6WOJ1OxcXFBW3h4mv06I5Vr/K4SgDAaawO6auuukoVFRVBbR988IHOP/98SV8tInO73SouLg70e71elZaWKj09vV1rPReOHq5wlwAAsJDVq7vnz5+vK6+8Uo888oh+/OMf66233tKqVau0atUqSVJERIRycnL00EMPaeDAgUpLS9PChQuVnJysG2+8MbzFAwBwjqwO6dGjR+uFF15QXl6eHnzwQaWlpWnp0qXKysoKjLn77rvV2Nio2bNnq76+XmPHjtXmzZsVExMTxsoBADh3EcYYE+4iws3r9crlcsnj8YTk/enq6mrd9scyHTv6uSKdPeX3NQT+bPz8U/VyDwhqazrepHU5k05bpQ4A6Nqsfk8aAICujJAGAMBShDQAAJYipAEAsBQhDQCApQhpAAAsRUgDAGApQhoAAEsR0gAAWIqQBgDAUoQ0AACWIqQBALAUIQ0AgKUIaQAALEVIAwBgKUIaAABLEdIAAFiKkAYAwFKENAAAliKkAQCwFCENAIClCGkAACxFSAMAYClCGgAASxHSAABYipAGAMBShDQAAJaKCncBkIzxq7a2VpLUr18/RUbyuxMAgDNpKzR9cVQ568qUvaIoENYAAHAmbQlHz3g5HNHhLgMAYBHOpAEAsBQhDQCApQhpAAAsRUgDAGApQhoAAEsR0gAAWIqQBgDAUoQ0AACWIqQBALAUIQ0AgKU6VEgvXrxYERERysnJCbQdO3ZMc+bMUZ8+fdSzZ09NmzZN1dXV4SvyHBj/Vw/aqK6ult/vD3c5AIAw6zAhvXPnTj311FO65JJLgtrnz5+vl156SRs2bNC2bdt05MgRTZ06NUxVnhtfo4cHbQAAAjpESDc0NCgrK0tPP/20evfuHWj3eDx65pln9Pjjj+vaa6/VyJEjtXr1ar355pvasWPHGefz+Xzyer1Bmy0cPeMV06v3Nw8EAHR6HSKk58yZo8mTJysjIyOovby8XE1NTUHtgwcPVmpqqkpKSs44X35+vlwuV2BLSUlps9oBAGgt60N6/fr12rVrl/Lz80/rq6qqksPhUHx8fFB7UlKSqqqqzjhnXl6ePB5PYDt06FCoywYA4JxZ/TzpQ4cO6a677lJRUZFiYmJCNq/T6ZTT6QzZfAAAtAWrz6TLy8tVU1Ojyy67TFFRUYqKitK2bdu0bNkyRUVFKSkpScePH1d9fX3Q11VXV8vtdoenaAAAQsTqM+nx48fr3XffDWq79dZbNXjwYN1zzz1KSUlRdHS0iouLNW3aNElSRUWFDh48qPT09HCUDABAyFgd0r169dLFF18c1NajRw/16dMn0D5z5kzl5uYqI
SFBcXFxmjdvntLT03XFFVeEo2QAAELG6pD+Np544glFRkZq2rRp8vl8yszM1IoVK8JdFgAA56zDhfTWrVuDXsfExKigoEAFBQXhKQgAgDZi9cIxAAC6MkIaAABLEdIAAFiKkAYAwFKENAAAliKkAQCwFCENAIClCGkAACxFSAMAYClCGgAASxHSAABYipAGAMBShDQAAJYipAEAsBQhDQCApQhpAAAsRUgDAGApQhoAAEsR0gAAWIqQBgDAUoQ0AACWIqQBALAUIQ0AgKUIaQAALEVIAwBgqahwF4DTGb9ftbW1kqR+/fopMpLfpQCgK+J/fwv5Gj3KWVem7BVFgbAGAHQ9nElbytEzXg5HdLjLAACEEWfSAABYipAGAMBShHSI+ZsXfZlwVwIA6OgI6RCrra3VbU++qONNTeEuBQDQwRHSbcDRPS7cJQAAOgFCGgAASxHSAABYipAGAMBShDQAAJYipAEAsBQhDQCApQhpAAAsZX1I5+fna/To0erVq5cSExN14403qqKiImjMsWPHNGfOHPXp00c9e/bUtGnTVF1dHaaKAQAIDetDetu2bZozZ4527NihoqIiNTU16brrrlNjY2NgzPz58/XSSy9pw4YN2rZtm44cOaKpU6eGsWoAAM6d9Y+q3Lx5c9DrNWvWKDExUeXl5br66qvl8Xj0zDPPaN26dbr22mslSatXr9aQIUO0Y8cOXXHFFafN6fP55PP5Aq+9Xm/b7gQAAK1g/Zn0v/J4PJKkhIQESVJ5ebmampqUkZERGDN48GClpqaqpKSkxTny8/PlcrkCW0pKStsXDgDAWepQIe33+5WTk6OrrrpKF198sSSpqqpKDodD8fHxQWOTkpJUVVXV4jx5eXnyeDyB7dChQ21dOgAAZ836y92nmjNnjvbs2aM33njjnOZxOp1yOp0hqgoAgLbRYc6k586dq02bNum1117Td77znUC72+3W8ePHVV9fHzS+urpabre7nasEACB0rA9pY4zmzp2rF154QVu2bFFaWlpQ/8iRIxUdHa3i4uJAW0VFhQ4ePKj09PT2LhcAgJCx/nL3nDlztG7dOv3lL39Rr169Au8zu1wuxcbGyuVyaebMmcrNzVVCQoLi4uI0b948paent7iyGwCAjsL6kF65cqUk6ZprrglqX716tX72s59Jkp544glFRkZq2rRp8vl8yszM1IoVK9q5UgAAQsv6kDbGfOOYmJgYFRQUqKCgoB0qaj/G71dtba369eunyEjr35kAAIQY//NbzNfo0R2rXlVtbW24SwEAhAEhbTlHD1e4SwAAhAkhDQCApQhpAAAsRUgDAGApQhoAAEsR0gAAWMr6z0njK/5/fGZaEp+bBoAugv/pO4ja2lplryhS9ooiPjcNAF0EZ9IdSEyv3uEuAQDQjjiTBgDAUoQ0AACWIqQBALAUIQ0AgKUI6Q6m+fGV1dXV8vv9QX1+v7/FdgBAx0RIdzC+Ro9y1pW1+FGs2tpaTV+ygY9oAUAnwUewOiBHz3g5HNEt9jl5tCUAdBqcSQMAYClCGgAAS3G5uxNovq93bW2tZMJdDQAgVAjpTqD5vt6+Bo8iHd3DXQ4AIEQI6U6i+b7eTcebwlwJACBUCGnLGfPPR1RyKRsAuhZC2nJNXxxVzroynfQ1KrbPeeEuBwDQjgjpDsDRM17+6NN/VM13GGPBGAB0ToR0B/bZZ5/p/9uwW74GD2fZANAJEdIdXPOCMQBA58PNTDoo4/fr008/5TI3AHRihHQH5Wv06J7/el3Hm/jIFQB0VoR0B+bo3ivcJQAA2hDvSXcyp36uul+/fpIU9DoyMjJwG9FT2wAA9iGkO5nmz1VHRUfpD7/4gSQpe0WRJOkPv/iBkpKSArcRPbUNAGAfQroTcvSMV3RUt8DZckzP3lJE8JhTV4U3n1lzVg0AduF/5E7K1+hRzroyzVu97RsXl9XW1mr6kg3/vP0oAMAKnEl3Yme6U1lLnD1cbVwNAOBscSbdR
Zh/XNL2+/0t9/9jwVl1dfUZx4RC861Mv+77fJsxANAVENJdhK/RoztWvXrGS9rHG79acJa9oqhNL3s3L1r7uu/zbcYAQFfA5e4uxPEvl7SNP/gxmI6e8XI4oiUFLyaTTv8Y19loPjOWvrrf+KkL2c60aK15YdupX8vCNgBdDSHdhTUvLjvpa9SJkyeC+poXk62/+0eSTv8Y19mora3VD3+9WrEJyYFHbjb/MnDq92lp3uaHiLT2ewNAR0ZId3HNi8tOfP7paX2nLiY71wd5OLrHnXEh2zctWuMhIgC6qk5z7bCgoEADBgxQTEyMxowZo7feeivcJVmneXFYa58/3Xx5/JNPPtEnn3wStKir+bL0qX1f97zr5rlC+Szs5u/XmgVpbbFYLRRztjRHey+sa+ln2xbz27BoMRxsrQv/FM6fUac4k/7v//5v5ebmqrCwUGPGjNHSpUuVmZmpiooKJSYmhrs8azTfjaz5kvPZCro83tSk/3ffzYHLz82LvXwNnkCfpH+2/cvl9FPninR0P/ed0zdfOv+6O621xV3YQjFnS3O09x3jWvrZhvJ7ttf+2HqnPVvrwj+F82fUKUL68ccf16xZs3TrrbdKkgoLC/XXv/5Vzz77rO69997Txvt8Pvl8vsBrj8cjSfJ6vedcy9GjR/VFfY1MRDed9DWqm7NH0J9fej9Xt2jnN7a1tu/bjf9SX3z2SYt9x6OjVFlZKUlqrKsK/L3h0yPyNXj+Mf5LnThxQpWVlTp69Kgk6dNPP9UJ35dfbf/okxRoO3NdX+pE49HT5mr49Igk6eOPFfj7qWNa0lzDmcY197c019f1tVYo5mxpjrao9dvUcKKFn3so55fadn/a+7h9W7bWhX869Wd09OhRxcbGhmzuXr16KSIi4swDTAfn8/lMt27dzAsvvBDU/tOf/tTccMMNLX7NAw88YPTVRVY2NjY2NrawbR6P52szrsOfSX/66ac6efLkaZcfkpKS9P7777f4NXl5ecrNzQ289vv9qqurU58+fb7+N5pv4PV6lZKSokOHDikuLq7V8+CbcazbD8e6/XCs248tx7pXr69/5HCHD+nWcDqdcjqdQW3x8fEhmz8uLo5/YO2EY91+ONbth2Pdfmw/1h1+dXffvn3VrVu3wA0vmlVXV8vtdoepKgAAzl2HD2mHw6GRI0equLg40Ob3+1VcXKz09PQwVgYAwLnpFJe7c3NzlZ2drVGjRunyyy/X0qVL1djYGFjt3V6cTqceeOCB0y6lI/Q41u2HY91+ONbtp6Mc6whjjAl3EaHw5JNP6re//a2qqqo0fPhwLVu2TGPGjAl3WQAAtFqnCWkAADqbDv+eNAAAnRUhDQCApQhpAAAsRUgDAGApQjpEeFTm2cnPz9fo0aPVq1cvJSYm6sYbb1RFRUXQmGPHjmnOnDnq06ePevbsqWnTpp1205qDBw9q8uTJ6t69uxITE7VgwQKdOBH8xK2tW7fqsssuk9Pp1He/+12tWbOmrXfPaosXL1ZERIRycnICbRzr0Dl8+LBuueUW9enTR7GxsRo2bJjKysoC/cYYLVq0SP3791dsbKwyMjK0f//+oDnq6uqUlZWluLg4xcfHa+bMmWpoaAga88477+jf/u3fFBMTo5SUFC1ZsqRd9s8WJ0+e1MKFC5WWlqbY2FhdeOGF+s1vfqNT10J3imN9zk+4gFm/fr1xOBzm2WefNXv37jWzZs0y8fHxprq6OtylWSszM9OsXr3a7Nmzx+zevdtMmjTJpKammoaGhsCY22+/3aSkpJji4mJTVlZmrrjiCnPllVcG+k+cOGEuvvhik5GRYf7+97+bl19+2fTt29fk5eUFxnz44Yeme/fuJjc317z33ntm+fLlplu3bmbz5s3tur+2eOutt8yAAQPMJZdcYu66665AO8c6NOrq6sz5559vfvazn5nS0lLz4YcfmldeecUcOHAgMGbx4sXG5XKZjRs3mrffftvccMMNJi0tzXz55ZeBM
RMmTDCXXnqp2bFjh3n99dfNd7/7XXPzzTcH+j0ej0lKSjJZWVlmz5495k9/+pOJjY01Tz31VLvubzg9/PDDpk+fPmbTpk2msrLSbNiwwfTs2dP8/ve/D4zpDMeakA6Byy+/3MyZMyfw+uTJkyY5Odnk5+eHsaqOpaamxkgy27ZtM8YYU19fb6Kjo82GDRsCY/bt22ckmZKSEmOMMS+//LKJjIw0VVVVgTErV640cXFxxufzGWOMufvuu83QoUODvtdPfvITk5mZ2da7ZJ2jR4+agQMHmqKiIvP9738/ENIc69C55557zNixY8/Y7/f7jdvtNr/97W8DbfX19cbpdJo//elPxhhj3nvvPSPJ7Ny5MzDmb3/7m4mIiDCHDx82xhizYsUK07t378Cxb/7egwYNCvUuWWvy5Mnm5z//eVDb1KlTTVZWljGm8xxrLnefo+PHj6u8vFwZGRmBtsjISGVkZKikpCSMlXUszc/0TkhIkCSVl5erqakp6LgOHjxYqampgeNaUlKiYcOGBT0BLTMzU16vV3v37g2MOXWO5jFd8WczZ84cTZ48+bTjwbEOnRdffFGjRo3Sj370IyUmJmrEiBF6+umnA/2VlZWqqqoKOk4ul0tjxowJOtbx8fEaNWpUYExGRoYiIyNVWloaGHP11VfL4XAExmRmZqqiokKff/55W++mFa688koVFxfrgw8+kCS9/fbbeuONNzRx4kRJnedYd4rbgoZTax6ViWB+v185OTm66qqrdPHFF0uSqqqq5HA4Tns6WVJSkqqqqgJjWjruzX1fN8br9erLL78M6cPbbbZ+/Xrt2rVLO3fuPK2PYx06H374oVauXKnc3Fz953/+p3bu3Kk777xTDodD2dnZgWPV0nE69TgmJiYG9UdFRSkhISFoTFpa2mlzNPf17t27TfbPJvfee6+8Xq8GDx6sbt266eTJk3r44YeVlZUlSZ3mWBPSCLs5c+Zoz549euONN8JdSqd06NAh3XXXXSoqKlJMTEy4y+nU/H6/Ro0apUceeUSSNGLECO3Zs0eFhYXKzs4Oc3Wdy5///GetXbtW69at09ChQ7V7927l5OQoOTm5Ux1rLnefIx6VeW7mzp2rTZs26bXXXtN3vvOdQLvb7dbx48dVX18fNP7U4+p2u1s87s19XzcmLi6uS5zZSV9dzq6pqdFll12mqKgoRUVFadu2bVq2bJmioqKUlJTEsQ6R/v3766KLLgpqGzJkiA4ePCjpn8fq6/6/cLvdqqmpCeo/ceKE6urqzurn0dktWLBA9957r6ZPn65hw4ZpxowZmj9/vvLz8yV1nmNNSJ8jHpXZOsYYzZ07Vy+88IK2bNly2uWkkSNHKjo6Oui4VlRU6ODBg4Hjmp6ernfffTfoH1lRUZHi4uIC/1Gmp6cHzdE8piv9bMaPH693331Xu3fvDmyjRo1SVlZW4O8c69C46qqrTvso4QcffKDzzz9fkpSWlia32x10nLxer0pLS4OOdX19vcrLywNjtmzZIr/fH3hoUHp6urZv366mpqbAmKKiIg0aNKhLXOqWpC+++EKRkcER1q1bN/n9fkmd6Fi3y/K0Tm79+vXG6XSaNWvWmPfee8/Mnj3bxMfHB62ERbA77rjDuFwus3XrVvPJJ58Eti+++CIw5vbbbzepqalmy5YtpqyszKSnp5v09PRAf/PHgq677jqze/dus3nzZtOvX78WPxa0YMECs2/fPlNQUNDlPhbUklNXdxvDsQ6Vt956y0RFRZmHH37Y7N+/36xdu9Z0797d/Nd//VdgzOLFi018fLz5y1/+Yt555x0zZcqUFj8WNGLECFNaWmreeOMNM3DgwKCPBdXX15ukpCQzY8YMs2fPHrN+/XrTvXv3LvURrOzsbHPeeecFPoL1/PPPm759+5q77747MKYzHGtCOkSWL19uUlNTjcPhMJdffrnZsWNHuEuymqQWt9WrVwfGfPnll+YXv/iF6d27t+nevbu56aabzCeff
BI0z0cffWQmTpxoYmNjTd++fc0vf/lL09TUFDTmtddeM8OHDzcOh8NccMEFQd+jq/rXkOZYh85LL71kLr74YuN0Os3gwYPNqlWrgvr9fr9ZuHChSUpKMk6n04wfP95UVFQEjfnss8/MzTffbHr27Gni4uLMrbfeao4ePRo05u233zZjx441TqfTnHfeeWbx4sVtvm828Xq95q677jKpqakmJibGXHDBBea+++4L+qhUZzjWPKoSAABL8Z40AACWIqQBALAUIQ0AgKUIaQAALEVIAwBgKUIaAABLEdIAAFiKkAYAwFKENAAAliKkAQCwFCENAICl/n+HJyJ03carvQAAAABJRU5ErkJggg==\n" }, "metadata": {} } ], "source": [ "text_lens = [len(n) for n in tokenized_text]\n", "\n", "import seaborn as sns\n", "sns.displot(text_lens)" ] }, { "cell_type": "markdown", "metadata": { "id": "ax4L4_5qLQNx" }, "source": [ "Vemos que tenemos varios documentos que tienen más de 1000 palabras." ] }, { "cell_type": "markdown", "metadata": { "id": "oC97vk_R6LY7" }, "source": [ "### Dividir las secuencias en subsecuencias" ] }, { "cell_type": "markdown", "metadata": { "id": "i5LmORHVLZ4M" }, "source": [ "Supongamos que el modelo del que disponemos **no puede procesar secuencias de mas de 200 palabras**. ¿Cómo resolver esta limitación y aplicar el modelo sobre este conjunto de datos?" ] }, { "cell_type": "markdown", "metadata": { "id": "v77oYtPm6LY7" }, "source": [ "Crearemos una funcion `split_text_with_context` que toma un texto de cantidad arbitraria de palabras y lo transforma en un arreglo de M textos o secuencias donde cada secuencia tiene como máximo `unique_words + context_words` palabras. 
Each sequence begins with `context_words` words from the previous sequence in order to retain the context of the sentence, which produces a *rolling window*.\n", "\n", "As an example, the following image shows a text to which this transformation was applied using `context_words=3` and `unique_words=2`\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": { "id": "xTMMAgXnCsg-" }, "source": [ "We have implemented the function as follows:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "3Wp1ry276LY7" }, "outputs": [], "source": [ "from m72109.nlp.transformation import split_text_with_context" ] }, { "cell_type": "markdown", "source": [ "> Note: The following code only displays the implementation of the function. It is included purely for educational purposes." ], "metadata": { "id": "isEdGyQvEUrC" } }, { "cell_type": "code", "source": [ "from inspect import getsource\n", "print(getsource(split_text_with_context))" ], "metadata": { "id": "XeJqa9UWDcM2", "outputId": "1da592af-e343-450d-b891-5ce7bdd91b62", "colab": { "base_uri": "https://localhost:8080/" } }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "def split_text_with_context(text: str, unique_words: int = 150, context_words: int = 50) -> List[str]:\n", " \"\"\"\n", " Divide una secuencia de palabras de longitud arbitraria en una lista de secuencias de no mas que\n", " `context_words + unique_words` palabras. 
Cada subsecuencia tendra `unique_words` palabras del texto\n", " original y las restantes palabras seran arrastradas de la subsecuencia anterior para manetener el \n", " contexto de la oración.\n", "\n", " Parameters\n", " ----------\n", " text : str\n", " El texto que queremos dividir en subsequencias.\n", " unique_words : int\n", " Cantidad de palabras únicas que deben manetenerse en cada subsecuencia.\n", " context_words : int\n", " Cantidad de palabras que deben traerse de la subsequencia anterior como contexto.\n", "\n", " Returns\n", " -------\n", " List[str]\n", " A list of sub-sequences of text of no more than `unique_words` + `context_words`.\n", " \"\"\"\n", "\n", " words = text.split()\n", " num_seqs = math.ceil(len(words)/unique_words)\n", "\n", " seqs = [' '.join(words[max(seq*unique_words - context_words, 0):(seq + 1)*unique_words]) for seq in range(num_seqs)]\n", " return seqs\n", "\n" ] } ] }, { "cell_type": "markdown", "metadata": { "id": "7cFVWe1Z6LY7" }, "source": [ "Let's check how it works on one of the texts we have available. We will look for a text with more than 1200 characters:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Vcgx3ZPpFnHJ" }, "outputs": [], "source": [ "long_sequences = df.loc[df.text.str.len() > 1200]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "Q3iyvcm1HQN4", "outputId": "5b03f213-ae51-4354-e5be-754cea5a87a6" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "From: jhwitten@cs.ruu.nl (Jurriaan Wittenberg)\n", "Subject: Re: Magellan Update - 04/16/93\n", "Organization: Utrecht University, Dept. 
of Computer Science\n", "Keywords: Magellan, JPL\n", "Lines: 29\n", "\n", "In <19APR199320262420@kelvin.jpl.nasa.gov> baalke@kelvin.jpl.nasa.gov \n", "(Ron Baalke) writes:\n", "\n", ">Forwarded from Doug Griffith, Magellan Project Manager\n", ">\n", "> MAGELLAN STATUS REPORT\n", "> April 16, 1993\n", ">\n", ">\n", ">2. Magellan has completed 7225 orbits of Venus and is now 39 days from\n", ">the end of Cycle-4 and the start of the Transition Experiment.\n", "Sorry I think I missed a bit of info on this Transition Experiment. What is it?\n", "\n", ">4. On Monday morning, April 19, the moon will occult Venus and\n", ">interrupt the tracking of Magellan for about 68 minutes.\n", "Will this mean a loss of data or will the Magellan transmit data later on ??\n", "\n", "BTW: When will NASA cut off the connection with Magellan?? Not that I am\n", "looking forward to that day but I am just curious. I believe it had something\n", "to do with the funding from the goverment (or rather _NO_ funding :-)\n", "\n", "ok that's it for now. See you guys around,\n", "Jurriaan.\n", " \n", "-- \n", "-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-\n", "|----=|=-<- - - - - - JHWITTEN@CS.RUU.NL- - - - - - - - - - - - ->-=|=----|\n", "|----=|=-<-Jurriaan Wittenberg- - -Department of ComputerScience->-=|=----|\n", "|____/|\\_________Utrecht_________________The Netherlands___________/|\\____|\n", "\n" ] } ], "source": [ "sample_text = long_sequences['text'].iloc[0]\n", "print(sample_text)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "ZQFVNXxr6LY7", "outputId": "64236c02-d3c6-40a2-e768-a65af23b86e1" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "********\n", "From: jhwitten@cs.ruu.nl (Jurriaan Wittenberg) Subject: Re: Magellan Update - 04/16/93 Organization: Utrecht University, Dept. 
of Computer Science Keywords: Magellan, JPL Lines: 29 In <19APR199320262420@kelvin.jpl.nasa.gov> baalke@kelvin.jpl.nasa.gov (Ron Baalke) writes: >Forwarded from Doug Griffith, Magellan Project Manager > > MAGELLAN STATUS REPORT\n", "********\n", "Lines: 29 In <19APR199320262420@kelvin.jpl.nasa.gov> baalke@kelvin.jpl.nasa.gov (Ron Baalke) writes: >Forwarded from Doug Griffith, Magellan Project Manager > > MAGELLAN STATUS REPORT > April 16, 1993 > > >2. Magellan has completed 7225 orbits of Venus and is now 39 days from >the end of Cycle-4 and the start of the Transition Experiment. Sorry I think I missed a bit of info\n", "********\n", ">the end of Cycle-4 and the start of the Transition Experiment. Sorry I think I missed a bit of info on this Transition Experiment. What is it? >4. On Monday morning, April 19, the moon will occult Venus and >interrupt the tracking of Magellan for about 68 minutes. Will this mean a loss of data or will the Magellan transmit\n", "********\n", "the tracking of Magellan for about 68 minutes. Will this mean a loss of data or will the Magellan transmit data later on ?? BTW: When will NASA cut off the connection with Magellan?? Not that I am looking forward to that day but I am just curious. I believe it had something to do with the funding from the\n", "********\n", "to that day but I am just curious. I believe it had something to do with the funding from the goverment (or rather _NO_ funding :-) ok that's it for now. See you guys around, Jurriaan. 
-- -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- |----=|=-<- - - - - - JHWITTEN@CS.RUU.NL- - - - - - - - - - - - ->-=|=----| |----=|=-<-Jurriaan Wittenberg- -\n", "********\n", "- - - - JHWITTEN@CS.RUU.NL- - - - - - - - - - - - ->-=|=----| |----=|=-<-Jurriaan Wittenberg- - -Department of ComputerScience->-=|=----| |____/|\\_________Utrecht_________________The Netherlands___________/|\\____|\n" ] } ], "source": [ "transformed = split_text_with_context(sample_text, 40, 20)\n", "\n", "for sequence in transformed:\n", " print('********')\n", " print(sequence)" ] }, { "cell_type": "markdown", "metadata": { "id": "v2WZ2Gsa6LY7" }, "source": [ "Apliquemoslo sobre todo el dataset:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ZE2mANQy6LY7" }, "outputs": [], "source": [ "df.loc[:,'text'] = df['text'].apply(split_text_with_context)" ] }, { "cell_type": "markdown", "metadata": { "id": "Hov3C-Rv6LY8" }, "source": [ "### Generando un nuevo dataset para entrenar el modelo" ] }, { "cell_type": "markdown", "metadata": { "id": "nbrDVfl06LY8" }, "source": [ "Hasta el momento disponemos de un dataset donde una de sus columnas es un arreglo de secuencias de texto. Esta estructura de datos no puede ser utilizada con un modelo de procesamiento de texto y por lo tanto es necesario \"aplanarla\". Esto quiere decir que debemos convertir los elementos del arreglo en filas de nuestro dataset." ] }, { "cell_type": "markdown", "metadata": { "id": "YhT_8IxB6LY8" }, "source": [ "El método `explode` transforma un data frame donde una de sus columnas es un arreglo, en otro data frame donde los elementos del arreglo se transforman en filas y los restantes valores son duplicados. Este método nos ayudará en este caso a que todas las subsecuencias que se generaron de la misma secuencia reciban los mismos valores para las restantes columnas (como por ejemplo, la variable a predecir). 
The cell below applies `explode` to the `text` column:\n", "\n", "" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "0mDjItAw6LY8" }, "outputs": [], "source": [ "df = df.explode('text').reset_index(drop=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "IwN8DNQc6LY8", "outputId": "d0bd48ce-c0df-4ef1-a8f0-0992127dff52" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "(2814, 2)" ] }, "metadata": {}, "execution_count": 12 } ], "source": [ "df.shape" ] }, { "cell_type": "markdown", "metadata": { "id": "s_D6GPOpIGoZ" }, "source": [ "> Note how we went from 1073 rows to 2814. The extra rows correspond to text sequences that were too long and were split into subsequences, each keeping the label of the original sequence." ] }, { "cell_type": "markdown", "metadata": { "id": "WQfFzkhq6LY8" }, "source": [ "Final result" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 363 }, "id": "_mPGKjEn6LY8", "outputId": "ca470eb3-67bd-4081-c7ea-9ff0a9dab5e2" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " text category\n", "0 From: bil@okcforum.osrhe.edu (Bill Conner) Sub... 0\n", "1 From: jhwitten@cs.ruu.nl (Jurriaan Wittenberg)... 1\n", "2 the tracking of Magellan for about 68 minutes.... 1\n", "3 From: sysmgr@king.eng.umd.edu (Doug Mohney) Su... 1\n", "4 > >Nobody who is interested in launching thing... 1\n", "5 little itty-bitty payloads in LEO with your la... 1\n", "6 From: pgf@srl03.cacs.usl.edu (Phil G. Fraering... 1\n", "7 gamma ray spectroscopy on distant asteroids wi... 1\n", "8 From: Nanci Ann Miller ... 0\n", "9 decided to improve both of their images on wha... 0" ], "text/html": [ "\n", "
\n" ] }, "metadata": {}, "execution_count": 13 } ], "source": [ "df.head(10)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "bBOeG-bG6LY8" }, "outputs": [], "source": [] } ], "metadata": { "colab": { "name": "long_sequences.ipynb", "provenance": [], "toc_visible": true }, "kernel_info": { "name": "nlp-py38" }, "kernelspec": { "display_name": "NLP (Python 3.8)", "language": "python", "name": "nlp-py38" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" }, "nteract": { "version": "nteract-front-end@1.0.0" } }, "nbformat": 4, "nbformat_minor": 0 }