Metadata-Version: 2.1
Name: biterm
Version: 0.1.1
Summary: Biterm Topic Model
Home-page: https://github.com/markoarnauto/biterm
Author: markoarnauto
Author-email: markus.tretzmueller@cortecs.at
License: UNKNOWN
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Description-Content-Type: text/markdown
Requires-Dist: numpy
Requires-Dist: tqdm

# Biterm Topic Model

This is a simple Cython implenentaion of the awesome
[Biterm Topic Model](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.402.4032&rep=rep1&type=pdf).
This model is accurate in short text classification.
It explicitly models the word co-occurrence patterns in the whole corpus to solve the problem of sparse word co-occurrence at document-level.

Simply install by:
```
pip install biterm
```

Load some short texts and vectorize them via sklearn.

```python
    from sklearn.feature_extraction.text import CountVectorizer

    texts = open('./data/reuters.titles').read().splitlines()[:50]
    vec = CountVectorizer(stop_words='english')
    X = vec.fit_transform(texts).toarray()
```
Get the vocabulary and the biterms from the texts.
```python
    from biterm.utility import vec_to_biterms

    vocab = np.array(vec.get_feature_names())
    biterms = vec_to_biterms(X)
```
Create a BTM and pass the biterms to train it.
```python
    from biterm.cbtm import oBTM

    btm = oBTM(num_topics=20, V=vocab)
    topics = btm.fit_transform(biterms, iterations=100)
```
Save a topic plot using pyLDAvis and explore the results! (also see *simple_btml.py*)
```python
    from biterm.btm import oBTM

    btm = oBTM(num_topics=20, V=vocab)
    topics = btm.fit_transform(biterms, iterations=100)
```

Inference is done with Gibbs Sampling and it's not really fast. The implementation is not meant for production.
But if you have to classify a lot of texts you can try using online learning. Also use the Cython version.
```python
import numpy as np
import pyLDAvis
from biterm.cbtm import oBTM # use Cython model
from sklearn.feature_extraction.text import CountVectorizer
from biterm.utility import vec_to_biterms, topic_summuary # helper functions

if __name__ == "__main__":

    texts = open('./data/reuters.titles').read().splitlines()

    # vectorize texts
    vec = CountVectorizer(stop_words='english')
    X = vec.fit_transform(texts).toarray()

    # get vocabulary
    vocab = np.array(vec.get_feature_names())

    # get biterms
    biterms = vec_to_biterms(X)

    # create btm
    btm = oBTM(num_topics=20, V=vocab)

    print("\n\n Train Online BTM ..")
    for i in range(0, len(biterms), 100): # prozess chunk of 200 texts
        biterms_chunk = biterms[i:i + 100]
        btm.fit(biterms_chunk, iterations=50)
    topics = btm.transform(biterms)

    print("\n\n Visualize Topics ..")
    vis = pyLDAvis.prepare(btm.phi_wz.T, topics, np.count_nonzero(X, axis=1), vocab, np.sum(X, axis=0))
    pyLDAvis.save_html(vis, './vis/online_btm.html')

    print("\n\n Topic coherence ..")
    topic_summuary(btm.phi_wz.T, X, vocab, 10)

    print("\n\n Texts & Topics ..")
    for i in range(len(texts)):
        print("{} (topic: {})".format(texts[i], topics[i].argmax()))
```

