BERT for a topic modeling problem

Introduction

Transformer-based models can help us solve many kinds of problems, from classification and regression to more complex tasks such as text summarization or conditional language generation. In this notebook we apply topic modeling techniques, using a BERT-based model, to our usual tweet classification problem.

Running this notebook

To run this notebook, download the dataset and the helper module, then install the required libraries:

[ ]:

!wget https://raw.githubusercontent.com/santiagxf/M72109/master/NLP/Datasets/mascorpus/tweets_marketing.csv \
    --quiet --no-clobber --directory-prefix ./Datasets/mascorpus/

!wget https://raw.githubusercontent.com/santiagxf/M72109/master/m72109/nlp/normalization.py \
    --quiet --no-clobber --directory-prefix ./m72109/nlp/

!wget https://raw.githubusercontent.com/santiagxf/M72109/master/docs/nlp/neural/bertopic.txt \
    --quiet --no-clobber

!wget https://raw.githubusercontent.com/santiagxf/M72109/master/docs/nlp/preprocessing/Normalization.txt \
    --quiet --no-clobber

!pip install -r Normalization.txt --quiet
!pip install -r bertopic.txt --quiet

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
en-core-web-sm 3.4.0 requires spacy<3.5.0,>=3.4.0, but you have spacy 2.3.5 which is incompatible.
confection 0.0.2 requires srsly<3.0.0,>=2.4.0, but you have srsly 1.0.5 which is incompatible.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchtext 0.13.1 requires torch==1.12.1, but you have torch 1.9.0 which is incompatible.
torchaudio 0.12.1+cu113 requires torch==1.12.1, but you have torch 1.9.0 which is incompatible.

[ ]:

import warnings
warnings.filterwarnings('ignore')

Loading the dataset

[ ]:

import pandas as pd

tweets = pd.read_csv('Datasets/mascorpus/tweets_marketing.csv')

[ ]:

from m72109.nlp.normalization import TweetTextNormalizer

[ ]:

normalizer = TweetTextNormalizer(
    lemmatize=False,
    stem=False,
    reduce_len=True,
    strip_handles=True,
    strip_stopwords=False,
    strip_urls=True,
    strip_accents=True,
)

[ ]:

docs = normalizer.transform(tweets['TEXTO'])

Checking the available hardware

[ ]:

import torch
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

print("This notebook is running on", device)

This notebook is running on cpu

Clustering

Models like BERT can generate contextualized representations, or embeddings, for the text sequences they receive. In one of the previous examples we trained a tweet classification model that consisted of a transformer-based architecture plus a classifier.

In this example we do not use the classification layer; we keep only the embeddings. Keep in mind that, even though we drop the classifier (MLP), the BERT embeddings were fine-tuned specifically on the tweet classification problem. This should make the resulting clusters better suited to this dataset.
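As a rough illustration of why embeddings alone are enough for clustering, the sketch below compares toy vectors with cosine similarity, one common way to measure how close two embeddings are. The vectors and the helper name `cosine_similarity` are made up for this example; real BERT embeddings have hundreds of dimensions.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional vectors standing in for tweet embeddings (made-up values).
tweet_a = [0.9, 0.1, 0.0, 0.2]
tweet_b = [0.8, 0.2, 0.1, 0.3]   # points in a direction similar to tweet_a
tweet_c = [0.0, 0.9, 0.8, 0.0]   # points in a clearly different direction

sim_ab = cosine_similarity(tweet_a, tweet_b)
sim_ac = cosine_similarity(tweet_a, tweet_c)

# Tweets with similar embeddings score higher and would tend to share a cluster.
print(sim_ab > sim_ac)
```

A clustering algorithm groups documents whose vectors are close under a measure like this one, so no classification head is needed.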

To see how this works, let's load the tweet classification model. It is published on HuggingFace under this course's account:

[ ]:

from transformers import pipeline

embedding_model = pipeline("feature-extraction", model="fce-m72109/mascorpus-bert-classifier")

Some weights of the model checkpoint at fce-m72109/mascorpus-bert-classifier were not used when initializing BertModel: ['classifier.bias', 'classifier.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

Note that the task we specify in the pipeline is feature-extraction.

Word-level clusters

Using the BERTopic library, we can compute the embeddings and then visualize them:

[ ]:

from bertopic import BERTopic

topic_model = BERTopic(language='spanish', embedding_model=embedding_model)

[ ]:

topics, probs = topic_model.fit_transform(docs)

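We can see that the library has detected 28 distinct topics (0 through 27). The additional label -1 is the one BERTopic assigns to documents it treats as outliers: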

[ ]:

import numpy as np

np.unique(np.asarray(topics))

array([-1,  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15,
       16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27])
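To go one step beyond listing the labels, a minimal sketch like the following counts how many documents fall into each topic and separates the outlier label -1. The list `example_topics` is a hypothetical stand-in with the same shape as the `topics` list returned by `fit_transform`; its values are invented.

```python
from collections import Counter

# Hypothetical topic assignments, one label per document (made-up values).
example_topics = [0, 1, -1, 0, 2, -1, 0, 1]

counts = Counter(example_topics)
n_outliers = counts.pop(-1, 0)   # documents labeled -1 (outliers)
n_topics = len(counts)           # number of topics, excluding the outlier label
largest_topic, largest_size = counts.most_common(1)[0]

print(n_topics, n_outliers, largest_topic, largest_size)
```

BERTopic also exposes `topic_model.get_topic_info()`, which returns a comparable per-topic summary as a DataFrame.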

Let's see how these topics are grouped:

[ ]:

topic_model.visualize_topics()

Sentence-level clusters

We can also compute embeddings for an entire document or sentence. The sentence-transformers library makes this easy. As before, we point it at the model we trained for tweet classification:

[ ]:

from sentence_transformers import SentenceTransformer

sentence_model = SentenceTransformer('fce-m72109/mascorpus-bert-classifier')
embeddings = sentence_model.encode(docs, show_progress_bar=False)

WARNING:sentence_transformers.SentenceTransformer:No sentence-transformers model found with name /root/.cache/torch/sentence_transformers/fce-m72109_mascorpus-bert-classifier. Creating a new one with MEAN pooling.
Some weights of the model checkpoint at /root/.cache/torch/sentence_transformers/fce-m72109_mascorpus-bert-classifier were not used when initializing BertModel: ['classifier.bias', 'classifier.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
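The warning above says that sentence-transformers falls back to MEAN pooling, because the checkpoint is a plain BERT classifier rather than a sentence-transformers model. The sketch below shows the general idea of mean pooling with NumPy on made-up token vectors: average the token embeddings of each sequence while ignoring padded positions. The function name `mean_pool` and the toy values are assumptions for illustration, not the library's internal code.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sequence, ignoring padded positions."""
    token_embeddings = np.asarray(token_embeddings, dtype=float)  # (batch, seq_len, hidden)
    mask = np.asarray(attention_mask, dtype=float)[..., None]     # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=1)                # (batch, hidden)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                # avoid division by zero
    return summed / counts

# Toy batch: 2 sequences, 3 token positions, hidden size 2 (made-up values).
tokens = [
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # last position is padding
    [[2.0, 0.0], [4.0, 2.0], [6.0, 4.0]],
]
mask = [
    [1, 1, 0],
    [1, 1, 1],
]

sentence_vectors = mean_pool(tokens, mask)
print(sentence_vectors.shape)
```

Each row of `sentence_vectors` is a single fixed-size vector per document, which is the kind of input the document-level visualization expects.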

We run the visualization with the original embeddings:

[ ]:

topic_model.visualize_documents(docs, embeddings=embeddings)

Alternatively, we can reduce the dimensionality of the embeddings beforehand so that the visualization runs much faster:

[ ]:

from umap import UMAP

reduced_embeddings = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine').fit_transform(embeddings)
topic_model.visualize_documents(docs, reduced_embeddings=reduced_embeddings)