BERT in a topic modeling problem
Introduction
Transformer-based models can help us solve many kinds of problems, from classification and regression to more complex tasks such as text summarization or conditional language generation. Here we will see how to apply topic modeling techniques, using a BERT-based model, to our usual tweet classification problem.
Running this notebook
To run this notebook, download the dataset and helper files and install the following libraries:
[ ]:
!wget https://raw.githubusercontent.com/santiagxf/M72109/master/NLP/Datasets/mascorpus/tweets_marketing.csv \
--quiet --no-clobber --directory-prefix ./Datasets/mascorpus/
!wget https://raw.githubusercontent.com/santiagxf/M72109/master/m72109/nlp/normalization.py \
--quiet --no-clobber --directory-prefix ./m72109/nlp/
!wget https://raw.githubusercontent.com/santiagxf/M72109/master/docs/nlp/neural/bertopic.txt \
--quiet --no-clobber
!wget https://raw.githubusercontent.com/santiagxf/M72109/master/docs/nlp/preprocessing/Normalization.txt \
--quiet --no-clobber
!pip install -r Normalization.txt --quiet
!pip install -r bertopic.txt --quiet
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
en-core-web-sm 3.4.0 requires spacy<3.5.0,>=3.4.0, but you have spacy 2.3.5 which is incompatible.
confection 0.0.2 requires srsly<3.0.0,>=2.4.0, but you have srsly 1.0.5 which is incompatible.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchtext 0.13.1 requires torch==1.12.1, but you have torch 1.9.0 which is incompatible.
torchaudio 0.12.1+cu113 requires torch==1.12.1, but you have torch 1.9.0 which is incompatible.
[ ]:
import warnings
warnings.filterwarnings('ignore')
Loading the dataset
[ ]:
import pandas as pd
tweets = pd.read_csv('Datasets/mascorpus/tweets_marketing.csv')
[ ]:
from m72109.nlp.normalization import TweetTextNormalizer
[ ]:
normalizer = TweetTextNormalizer(lemmatize=False, stem=False, reduce_len=True, strip_handles=True, strip_stopwords=False, strip_urls=True, strip_accents=True)
[ ]:
docs = normalizer.transform(tweets['TEXTO'])
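The flags passed to the normalizer above (strip_handles, strip_urls, reduce_len, strip_accents) can be illustrated with a small standard-library sketch. The function normalize_tweet below is hypothetical and is not the implementation of TweetTextNormalizer; in particular, it assumes that reduce_len caps runs of a repeated character at three, as NLTK's tweet tokenizer does.

```python
import re
import unicodedata


def normalize_tweet(text: str) -> str:
    """Hypothetical illustration of the flags used above (not the course's code)."""
    # strip_urls: drop http/https links
    text = re.sub(r"https?://\S+", "", text)
    # strip_handles: drop @user mentions
    text = re.sub(r"@\w+", "", text)
    # reduce_len (assumed behaviour): cap runs of a repeated character at three
    text = re.sub(r"(.)\1{2,}", r"\1\1\1", text)
    # strip_accents: decompose characters and drop the combining marks
    decomposed = unicodedata.normalize("NFKD", text)
    text = "".join(c for c in decomposed if not unicodedata.combining(c))
    # collapse the whitespace left behind by the removals
    return " ".join(text.split())


print(normalize_tweet("@marca Holaaaaaa, visitá https://t.co/abc nuestra página"))
```

Note that stripping accents this way also turns "ñ" into "n", which may or may not be desirable for Spanish text.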
Checking the available hardware
[ ]:
import torch
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
print("This notebook is running on", device)
This notebook is running on cpu
Clustering
Models like BERT can generate contextualized representations, or embeddings, for the text sequences they receive. In one of the previous examples we trained a tweet classification model. It consisted of a transformer-based architecture plus a classifier.
In this example we will not use the classification layer; we keep only the embeddings. Note that, even though we do not use the classifier (MLP), the BERT embeddings were fine-tuned specifically for the tweet classification problem. This makes the resulting clusters better suited to this dataset.
To show how this works, let's load the tweet classification model. It is published on HuggingFace under this course's account:
[ ]:
from transformers import pipeline
embedding_model = pipeline("feature-extraction", model="fce-m72109/mascorpus-bert-classifier")
Some weights of the model checkpoint at fce-m72109/mascorpus-bert-classifier were not used when initializing BertModel: ['classifier.bias', 'classifier.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Note that the task we specify in the pipeline is feature-extraction.
Clusters by words
Using the BERTopic library, we can compute the embeddings and fit a topic model for visualization:
[ ]:
from bertopic import BERTopic
topic_model = BERTopic(language='spanish', embedding_model=embedding_model)
[ ]:
topics, probs = topic_model.fit_transform(docs)
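We can see that the library has detected 28 distinct topics, numbered 0 through 27. The additional label -1 is the one BERTopic assigns to documents it treats as outliers: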
[ ]:
import numpy as np
np.unique(np.asarray(topics))
array([-1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27])
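Beyond listing the unique labels, it is often useful to know how many documents fall into each topic and how many were left as outliers. The following is a minimal sketch using only the standard library; summarize_topics is a hypothetical helper, and toy_topics is made-up data standing in for the topics list returned by fit_transform.

```python
from collections import Counter


def summarize_topics(topics):
    """Count documents per topic, treating the label -1 as outliers."""
    counts = Counter(topics)
    n_outliers = counts.pop(-1, 0)  # remove the outlier label from the tally
    return {
        "n_topics": len(counts),
        "n_outliers": n_outliers,
        "largest": counts.most_common(3),  # (topic_id, n_documents) pairs
    }


# Made-up labels for illustration only; replace with the real `topics` list.
toy_topics = [-1, 0, 0, 1, 2, 2, 2, -1]
summary = summarize_topics(toy_topics)
print(summary)
```

On the real run, calling summarize_topics(topics) would report the topic count separately from the outlier count, which avoids confusing the number of unique labels with the number of topics.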
Let's see how these topics group together:
[ ]:
topic_model.visualize_topics()
Clusters by sentences
We can also compute embeddings for a whole document or sentence. The sentence-transformers library makes this easy. As before, we point it to the model we trained for tweet classification:
[ ]:
from sentence_transformers import SentenceTransformer
sentence_model = SentenceTransformer('fce-m72109/mascorpus-bert-classifier')
embeddings = sentence_model.encode(docs, show_progress_bar=False)
WARNING:sentence_transformers.SentenceTransformer:No sentence-transformers model found with name /root/.cache/torch/sentence_transformers/fce-m72109_mascorpus-bert-classifier. Creating a new one with MEAN pooling.
Some weights of the model checkpoint at /root/.cache/torch/sentence_transformers/fce-m72109_mascorpus-bert-classifier were not used when initializing BertModel: ['classifier.bias', 'classifier.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
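The warning above says that, since the checkpoint is not a native sentence-transformers model, the library wraps it with MEAN pooling. The idea can be sketched in plain Python: the sentence vector is the average of the token vectors, skipping padding positions. The function mean_pool and the toy values are illustrative assumptions; the library itself operates on batched tensors.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors of one sequence, ignoring padded positions."""
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask excludes every token")
    # zip(*kept) iterates over dimensions, so each `dim` holds one coordinate
    # of every kept token
    return [sum(dim) / len(kept) for dim in zip(*kept)]


# Toy sequence: three token vectors of size 2, the last one being padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
sentence_vector = mean_pool(tokens, mask)
print(sentence_vector)
```

Because every token contributes equally, the resulting vector has the same dimensionality as the token embeddings regardless of the tweet's length.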
We run the visualization with the original embeddings:
[ ]:
topic_model.visualize_documents(docs, embeddings=embeddings)
Alternatively, we can reduce the dimensionality of the embeddings beforehand so that the visualization runs much faster:
[ ]:
from umap import UMAP
reduced_embeddings = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine').fit_transform(embeddings)
topic_model.visualize_documents(docs, reduced_embeddings=reduced_embeddings)
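The metric='cosine' argument makes UMAP compare embeddings by the angle between them rather than by their magnitude. As a rough illustration, and assuming the usual definition of cosine distance as one minus the cosine similarity, a plain-Python sketch looks like this (cosine_distance is a hypothetical helper, not part of umap-learn):

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


same_direction = cosine_distance([1.0, 0.0], [2.0, 0.0])  # scale is ignored
orthogonal = cosine_distance([1.0, 0.0], [0.0, 3.0])
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])
print(same_direction, orthogonal, opposite)
```

Two embeddings pointing in the same direction are at distance 0 even if one is longer, which is usually the desired behaviour when comparing text embeddings.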