Metadata-Version: 2.1
Name: bertopic
Version: 0.7.0
Summary: BERTopic performs topic modeling with state-of-the-art transformer models.
Home-page: https://github.com/MaartenGr/BERTopic
Author: Maarten P. Grootendorst
Author-email: maartengrootendorst@gmail.com
License: UNKNOWN
Project-URL: Documentation, https://maartengr.github.io/BERTopic/
Project-URL: Source Code, https://github.com/MaartenGr/BERTopic/
Project-URL: Issue Tracker, https://github.com/MaartenGr/BERTopic/issues
Keywords: nlp bert topic modeling embeddings
Platform: UNKNOWN
Classifier: Programming Language :: Python
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Developers
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: License :: OSI Approved :: MIT License
Classifier: Topic :: Scientific/Engineering
Classifier: Operating System :: Microsoft :: Windows
Classifier: Operating System :: POSIX
Classifier: Operating System :: Unix
Classifier: Operating System :: MacOS
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.8
Requires-Python: >=3.6
Description-Content-Type: text/markdown
Requires-Dist: numpy (>=1.20.0)
Requires-Dist: hdbscan (>=0.8.27)
Requires-Dist: umap-learn (>=0.5.0)
Requires-Dist: pandas (>=1.1.5)
Requires-Dist: scikit-learn (>=0.22.2.post1)
Requires-Dist: tqdm (>=4.41.1)
Requires-Dist: sentence-transformers (>=0.4.1)
Requires-Dist: plotly (<4.14.3,>=4.7.0)
Provides-Extra: all
Requires-Dist: torch (<1.7.1,>=1.4.0) ; extra == 'all'
Requires-Dist: flair (==0.7) ; extra == 'all'
Requires-Dist: spacy (>=3.0.1) ; extra == 'all'
Requires-Dist: tensorflow ; extra == 'all'
Requires-Dist: tensorflow-hub ; extra == 'all'
Requires-Dist: tensorflow-text ; extra == 'all'
Requires-Dist: gensim (>=3.6.0) ; extra == 'all'
Provides-Extra: dev
Requires-Dist: mkdocs (>=1.1) ; extra == 'dev'
Requires-Dist: mkdocs-material (>=4.6.3) ; extra == 'dev'
Requires-Dist: mkdocstrings (>=0.8.0) ; extra == 'dev'
Requires-Dist: pytest (>=5.4.3) ; extra == 'dev'
Requires-Dist: pytest-cov (>=2.6.1) ; extra == 'dev'
Requires-Dist: torch (<1.7.1,>=1.4.0) ; extra == 'dev'
Requires-Dist: flair (==0.7) ; extra == 'dev'
Requires-Dist: spacy (>=3.0.1) ; extra == 'dev'
Requires-Dist: tensorflow ; extra == 'dev'
Requires-Dist: tensorflow-hub ; extra == 'dev'
Requires-Dist: tensorflow-text ; extra == 'dev'
Requires-Dist: gensim (>=3.6.0) ; extra == 'dev'
Provides-Extra: docs
Requires-Dist: mkdocs (>=1.1) ; extra == 'docs'
Requires-Dist: mkdocs-material (>=4.6.3) ; extra == 'docs'
Requires-Dist: mkdocstrings (>=0.8.0) ; extra == 'docs'
Provides-Extra: flair
Requires-Dist: torch (<1.7.1,>=1.4.0) ; extra == 'flair'
Requires-Dist: flair (==0.7) ; extra == 'flair'
Provides-Extra: gensim
Requires-Dist: gensim (>=3.6.0) ; extra == 'gensim'
Provides-Extra: spacy
Requires-Dist: spacy (>=3.0.1) ; extra == 'spacy'
Provides-Extra: test
Requires-Dist: pytest (>=5.4.3) ; extra == 'test'
Requires-Dist: pytest-cov (>=2.6.1) ; extra == 'test'
Provides-Extra: use
Requires-Dist: tensorflow ; extra == 'use'
Requires-Dist: tensorflow-hub ; extra == 'use'
Requires-Dist: tensorflow-text ; extra == 'use'

[![PyPI - Python](https://img.shields.io/badge/python-v3.6+-blue.svg)](https://pypi.org/project/bertopic/)
[![Build](https://img.shields.io/github/workflow/status/MaartenGr/BERTopic/Code%20Checks/master)](https://pypi.org/project/bertopic/)
[![docs](https://img.shields.io/badge/docs-Passing-green.svg)](https://maartengr.github.io/BERTopic/)
[![PyPI - PyPi](https://img.shields.io/pypi/v/BERTopic)](https://pypi.org/project/bertopic/)
[![PyPI - License](https://img.shields.io/badge/license-MIT-green.svg)](https://github.com/MaartenGr/VLAC/blob/master/LICENSE)
[![DOI](https://zenodo.org/badge/297672263.svg)](https://zenodo.org/badge/latestdoi/297672263)


# BERTopic

<img src="images/logo.png" width="35%" height="35%" align="right" />

BERTopic is a topic modeling technique that leverages 🤗 transformers and c-TF-IDF to create dense clusters
allowing for easily interpretable topics whilst keeping important words in the topic descriptions. It even supports 
visualizations similar to LDAvis! 

Corresponding Medium posts can be found [here](https://towardsdatascience.com/topic-modeling-with-bert-779f7db187e6?source=friends_link&sk=0b5a470c006d1842ad4c8a3057063a99) 
and [here](https://towardsdatascience.com/interactive-topic-modeling-with-bertopic-1ea55e7d73d8?sk=03c2168e9e74b6bda2a1f3ed953427e4).

## Installation

Installation, with sentence-transformers, can be done using [PyPI](https://pypi.org/project/bertopic/):

```bash
pip install bertopic
```

You may want to install more depending on the transformers and language backends that you will be using. 
The possible installations are: 

```bash
pip install bertopic[flair]
pip install bertopic[gensim]
pip install bertopic[spacy]
pip install bertopic[use]
```

To install all backends:

```bash
pip install bertopic[all]
```


## Getting Started
For an in-depth overview of the features of BERTopic 
you can check the full documentation [here](https://maartengr.github.io/BERTopic/) or you can follow along 
with one of the examples below:

| Name  | Link  |
|---|---|
| Topic Modeling with BERTopic  | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1FieRA9fLdkQEGDIMYl0I3MCjSUKVF8C-?usp=sharing)  |
| (Custom) Embedding Models in BERTopic  | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/18arPPe50szvcCp_Y6xS56H2tY0m-RLqv?usp=sharing) |
| Advanced Customization in BERTopic  |  [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1ClTYut039t-LDtlcd-oQAdXWgcsSGTw9?usp=sharing) |
| (semi-)Supervised Topic Modeling with BERTopic  |  [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1bxizKzv5vfxJEB29sntU__ZC7PBSIPaQ?usp=sharing)  |
| Dynamic Topic Modeling with Trump's Tweets  | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1un8ooI-7ZNlRoK0maVkYhmNRl0XGK88f?usp=sharing)  |



## Quick Start
We start by extracting topics from the well-known 20 newsgroups dataset, which consists of English documents:

```python
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']

topic_model = BERTopic()
topics, _ = topic_model.fit_transform(docs)
```

After generating topics, we can access the frequent topics that were generated:

```python
>>> topic_model.get_topic_info()

Topic	Count	Name
-1	4630	-1_can_your_will_any
49	693	49_windows_drive_dos_file
32	466	32_jesus_bible_christian_faith
2	441	2_space_launch_orbit_lunar
22	381	22_key_encryption_keys_encrypted
```

-1 refers to all outliers and should typically be ignored. Next, let's take a look at the most 
frequent topic that was generated, topic 49:

```python
>>> topic_model.get_topic(49)

[('windows', 0.006152228076250982),
 ('drive', 0.004982897610645755),
 ('dos', 0.004845038866360651),
 ('file', 0.004140142872194834),
 ('disk', 0.004131678774810884),
 ('mac', 0.003624848635985097),
 ('memory', 0.0034840976976789903),
 ('software', 0.0034415334250699077),
 ('email', 0.0034239554442333257),
 ('pc', 0.003047105930670237)]
```  
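
To show how these two outputs fit together, here is a minimal sketch in plain Python that works on values copied from the outputs above. The helpers `most_frequent_topic` and `topic_label` are hypothetical and not part of BERTopic. The label format simply mirrors the pattern visible in the `Name` column and is an assumption, not the library's implementation.

```python
# Rows copied from the get_topic_info() output above: (topic id, document count).
topic_counts = [(-1, 4630), (49, 693), (32, 466), (2, 441), (22, 381)]

# First entries copied from the get_topic(49) output above: (word, score).
topic_49 = [
    ("windows", 0.006152228076250982),
    ("drive", 0.004982897610645755),
    ("dos", 0.004845038866360651),
    ("file", 0.004140142872194834),
    ("disk", 0.004131678774810884),
]


def most_frequent_topic(counts):
    """Return the id of the largest topic, skipping the outlier topic -1."""
    real_topics = [(topic, count) for topic, count in counts if topic != -1]
    return max(real_topics, key=lambda pair: pair[1])[0]


def topic_label(topic_id, words, top_n=4):
    """Join the topic id and its top_n words, mirroring the Name column."""
    return "_".join([str(topic_id)] + [word for word, _ in words[:top_n]])


best = most_frequent_topic(topic_counts)
print(best)                         # 49
print(topic_label(best, topic_49))  # 49_windows_drive_dos_file
```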

**NOTE**: Use `BERTopic(language="multilingual")` to select a model that supports 50+ languages. 

## Visualize Topics
After having trained our BERTopic model, we can iteratively go through perhaps a hundred topics to get a good 
understanding of the topics that were extracted. However, that takes quite some time and lacks a global representation. 
Instead, we can visualize the topics that were generated in a way very similar to 
[LDAvis](https://github.com/cpsievert/LDAvis):

```python
topic_model.visualize_topics()
``` 

<img src="images/topic_visualization.gif" width="60%" height="60%" align="center" />


## Embedding Models
BERTopic supports many embedding models that can be used to embed the documents and words:
* Sentence-Transformers
* Flair
* spaCy
* Gensim
* USE

Click [here](https://maartengr.github.io/BERTopic/tutorial/embeddings/embeddings.html) 
for a full overview of all supported embedding models. 

### Sentence-Transformers  
You can select any model from sentence-transformers [here](https://www.sbert.net/docs/pretrained_models.html) 
and pass it to BERTopic:

```python
topic_model = BERTopic(embedding_model="xlm-r-bert-base-nli-stsb-mean-tokens")
```

Or select a SentenceTransformer model with your own parameters:

```python
from sentence_transformers import SentenceTransformer

sentence_model = SentenceTransformer("distilbert-base-nli-mean-tokens", device="cpu")
topic_model = BERTopic(embedding_model=sentence_model)
```

### Flair  
[Flair](https://github.com/flairNLP/flair) allows you to choose almost any embedding model that 
is publicly available. Flair can be used as follows:

```python
from flair.embeddings import TransformerDocumentEmbeddings

roberta = TransformerDocumentEmbeddings('roberta-base')
topic_model = BERTopic(embedding_model=roberta)
```

You can select any 🤗 transformers model [here](https://huggingface.co/models).

**Custom Embeddings**    
You can also use previously generated embeddings by passing them to `fit_transform()`:

```python
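# `embeddings` must hold one vector per document, in the same order as `docs`,
# for example the output of `SentenceTransformer.encode(docs)`
topic_model = BERTopic()
topics, _ = topic_model.fit_transform(docs, embeddings)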
```

## Dynamic Topic Modeling
Dynamic topic modeling (DTM) is a collection of techniques aimed at analyzing the evolution of topics 
over time. These methods allow you to understand how a topic is represented across different times. 
Here, we will be using all of Donald Trump's tweets to see how he talked about certain topics over time: 

```python
import re
import pandas as pd

trump = pd.read_csv('https://drive.google.com/uc?export=download&id=1xRKHaP-QwACMydlDnyFPEaFdtskJuBa6')
trump.text = trump.apply(lambda row: re.sub(r"http\S+", "", row.text).lower(), 1)
trump.text = trump.apply(lambda row: " ".join(filter(lambda x:x[0]!="@", row.text.split())), 1)
trump.text = trump.apply(lambda row: " ".join(re.sub("[^a-zA-Z]+", " ", row.text).split()), 1)
trump = trump.loc[(trump.isRetweet == "f") & (trump.text != ""), :]
timestamps = trump.date.to_list()
tweets = trump.text.to_list()
```
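
The three `re.sub`/`filter` steps above can be easier to follow when applied to a single string. The sketch below is a standalone rewrite of the same cleaning logic. The function name `clean_tweet` and the sample text are made up for illustration and do not come from the dataset.

```python
import re


def clean_tweet(text: str) -> str:
    """Apply the same three cleaning steps as the pandas code above."""
    # 1. Remove URLs and lowercase the text.
    text = re.sub(r"http\S+", "", text).lower()
    # 2. Drop tokens that start with "@" (user mentions).
    text = " ".join(token for token in text.split() if not token.startswith("@"))
    # 3. Replace every run of non-letter characters with a single space.
    text = " ".join(re.sub("[^a-zA-Z]+", " ", text).split())
    return text


# Made-up sample text, not taken from the dataset.
print(clean_tweet("Check THIS out @someone: https://example.com/x 100% great!"))
# check this out great
```

A tweet that contains only a link or a mention becomes an empty string, which is why the pandas code afterwards keeps only rows where `trump.text != ""`.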

Then, we need to extract the global topic representations by simply creating and training a BERTopic model:

```python
topic_model = BERTopic(verbose=True)
topics, _ = topic_model.fit_transform(tweets)
```

From these topics, we are going to generate the topic representations at each timestamp for each topic. We do this 
by simply calling `topics_over_time` and passing in the tweets, the corresponding timestamps, and the related topics:

```python
topics_over_time = topic_model.topics_over_time(tweets, topics, timestamps)
```
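
To build some intuition for what "per timestamp, per topic" means, here is a deliberately simplified sketch in plain Python that only counts how many documents fall into each (year, topic) pair. It is not how `topics_over_time` is implemented, since the library also derives a topic representation at each timestamp. The helper `topic_counts_per_year`, the choice to group by year, and the example data are all assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime


def topic_counts_per_year(topics, timestamps):
    """Count documents per (year, topic) pair, skipping the outlier topic -1."""
    counts = Counter()
    for topic, stamp in zip(topics, timestamps):
        if topic == -1:
            continue
        year = datetime.fromisoformat(stamp).year
        counts[(year, topic)] += 1
    return dict(counts)


# Made-up example data: one topic id and one ISO-formatted timestamp per document.
example_topics = [0, 1, 0, -1, 1, 0]
example_timestamps = [
    "2019-05-01 10:00:00",
    "2019-06-12 08:30:00",
    "2020-01-03 22:15:00",
    "2020-02-14 09:00:00",
    "2020-03-20 17:45:00",
    "2020-07-04 12:00:00",
]

print(topic_counts_per_year(example_topics, example_timestamps))
# {(2019, 0): 1, (2019, 1): 1, (2020, 0): 2, (2020, 1): 1}
```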

Finally, we can visualize the topics by simply calling `visualize_topics_over_time()`: 

```python
topic_model.visualize_topics_over_time(topics_over_time, top_n=6)
```

<img src="images/dtm.gif" width="80%" height="80%" align="center" />

## Overview
For quick access to common functions, here is an overview of BERTopic's main methods:

| Method | Code  | 
|-----------------------|---|
| Fit the model    |  `BERTopic().fit(docs)` |
| Fit the model and predict documents    |  `BERTopic().fit_transform(docs)` |
| Predict new documents    |  `BERTopic().transform([new_doc])` |
| Access single topic   | `BERTopic().get_topic(topic=12)`  |   
| Access all topics     |  `BERTopic().get_topics()` |
| Get topic freq    |  `BERTopic().get_topic_freq()` |
| Get all topic information|  `BERTopic().get_topic_info()` |
| Get topics per class | `BERTopic().topics_per_class(docs, topics, classes)` |
| Dynamic Topic Modeling | `BERTopic().topics_over_time(docs, topics, timestamps)` |
| Visualize Topics    |  `BERTopic().visualize_topics()` |
| Visualize Topic Probability Distribution    |  `BERTopic().visualize_distribution(probs[0])` |
| Visualize Topics over Time   |  `BERTopic().visualize_topics_over_time(topics_over_time)` |
| Visualize Topics per Class | `BERTopic().visualize_topics_per_class(topics_per_class)` | 
| Update topic representation | `BERTopic().update_topics(docs, topics, n_gram_range=(1, 3))` |
| Reduce nr of topics | `BERTopic().reduce_topics(docs, topics, nr_topics=30)` |
| Find topics | `BERTopic().find_topics("vehicle")` |
| Save model    |  `BERTopic().save("my_model")` |
| Load model    |  `BERTopic.load("my_model")` |
| Get parameters |  `BERTopic().get_params()` |
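
The table writes each call as `BERTopic().method(...)` for brevity. In practice, a method such as `transform` is called on a model instance that has already been fitted. The sketch below chains a few of the calls from the table on a single instance. The helper `fit_save_reload` and its parameter names are hypothetical, and it assumes that `docs` and `new_docs` are lists of strings.

```python
def fit_save_reload(docs, new_docs, path="my_model"):
    """Fit a model, save it, load it back, and predict topics for new documents."""
    # Imported inside the function so that defining this helper does not
    # require BERTopic and its dependencies to be loaded.
    from bertopic import BERTopic

    topic_model = BERTopic()
    # Fit once and keep the instance around for all later calls.
    topics, _ = topic_model.fit_transform(docs)
    print(topic_model.get_topic_info().head())

    # Persist the fitted model, then restore it and predict unseen documents.
    topic_model.save(path)
    loaded_model = BERTopic.load(path)
    new_topics, _ = loaded_model.transform(new_docs)
    return topics, new_topics
```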

## Citation
To cite BERTopic in your work, please use the following bibtex reference:

```bibtex
@misc{grootendorst2020bertopic,
  author       = {Maarten Grootendorst},
  title        = {BERTopic: Leveraging BERT and c-TF-IDF to create easily interpretable topics.},
  year         = 2020,
  publisher    = {Zenodo},
  version      = {v0.5.0},
  doi          = {10.5281/zenodo.4430182},
  url          = {https://doi.org/10.5281/zenodo.4430182}
}
```


