Metadata-Version: 2.1
Name: bulstem
Version: 0.3.2
Summary: Python version of the BulStem stemming algorithm
Home-page: https://github.com/mhardalov/bulstem-py
Author: Momchil Hardalov
Author-email: momchil.hardalov@gmail.com
License: Apache License, Version 2.0
Keywords: NLP stemmer Bulgarian bulstem
Platform: UNKNOWN
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Education
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.6.0
Description-Content-Type: text/markdown
Provides-Extra: testing
Requires-Dist: pytest ; extra == 'testing'
Requires-Dist: nltk ; extra == 'testing'

# BulStem-py: A Python Re-implementation of BulStem - inflectional stemmer for Bulgarian

[![build](https://img.shields.io/circleci/build/github/mhardalov/bulstem-py/master)](https://circleci.com/gh/mhardalov/bulstem-py)
[![license](https://img.shields.io/github/license/mhardalov/bulstem-py.svg?color=blue)](https://github.com/mhardalov/bulstem-py/blob/master/LICENSE)


## Introduction
This is the Python version of the BulStem stemming algorithm. It follows the algorithm presented in

```
Nakov, P. BulStem: Design and evaluation of inflectional stemmer for Bulgarian. In Workshop on 
Balkan Language Resources and Tools (Balkan Conference in Informatics).
```

See http://people.ischool.berkeley.edu/~nakov/bulstem/ for the homepage of the algorithm. Also, check the original [paper](http://people.ischool.berkeley.edu/~nakov/bulstem/BulStem.pdf) for more details and examples.

## Implementation

Unlike other available implementations, this one uses a trie rather than a dictionary/hash table to find the longest rule that can be applied to a given token.
The stemmer class is derived from NLTK's `StemmerI` interface, making it compatible with NLTK pipelines.

Basic algorithm steps:
1. Find the position of the first vowel in the token.
2. Find the longest applicable rule by traversing the string in reverse order until no longer suffix matches, or until the position of the first vowel found in Step 1 is reached.
3. Prepend the non-stemmed prefix to the stemmed suffix from Step 2.
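
The steps above can be illustrated with a minimal, self-contained sketch. It is not the library's actual code: the names `build_trie` and `stem_sketch`, the vowel set, and the choice to require that a matched suffix starts strictly after the first vowel are all assumptions made for illustration, and `min_freq` and `left_context` handling is left out.

```python
# Illustrative sketch only; not the bulstem implementation.
VOWELS = set("аъоуеияю")  # assumed vowel set for this sketch
END = None                # key marking "a rule ends at this trie node"


def build_trie(rules):
    """Build a trie keyed on reversed suffixes from a {suffix: replacement} dict."""
    root = {}
    for suffix, replacement in rules.items():
        node = root
        for ch in reversed(suffix):
            node = node.setdefault(ch, {})
        node[END] = replacement
    return root


def stem_sketch(word, trie):
    """Replace the longest rule suffix that starts after the first vowel."""
    # Step 1: position of the first vowel.
    first_vowel = next((i for i, ch in enumerate(word) if ch in VOWELS), None)
    if first_vowel is None:
        return word

    # Step 2: walk the word backwards through the trie, remembering the
    # longest suffix that has a rule; stop before reaching the first vowel.
    node, best = trie, None
    for i in range(len(word) - 1, first_vowel, -1):
        node = node.get(word[i])
        if node is None:
            break
        if END in node:
            best = (i, node[END])

    if best is None:
        return word

    # Step 3: prepend the untouched prefix to the replacement.
    start, replacement = best
    return word[:start] + replacement


trie = build_trie({"ой": "о", "ият": "и"})
print(stem_sketch("порой", trie))
print(stem_sketch("вторият", trie))
```

Because the trie is walked once from the end of the token, the longest matching suffix is found in a single pass instead of one hash lookup per candidate suffix.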

## Installation

This library is compatible with Python >= 3.6.

Clone the repository and run:

### With pip

```bash
pip install -e .
pip install -r requirements.txt
```

### Test

A set of tests is included in the project, under the [tests folder](https://github.com/mhardalov/bulstem-py/tree/master/tests).
The test suite can be run as follows:


```bash
pip install -e ".[testing]"
pip install -r requirements-test.txt
python -m unittest
```

## Usage

The library needs a set of rules to apply stemming properly. The rules can be passed either as a list to the `BulStemmer` constructor, or as a path to a file containing them.

For both options the rules need to be formatted as follows:

`word ==> stem ==> freq`

A pre-defined set of rules is included in the distribution and can be used directly. The rule files can be found [here](https://github.com/mhardalov/bulstem-py/tree/master/bulstem/stemrules). (Example: [Reading the rules from an external file](#reading-the-rules-from-an-external-file))
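
To make the rule format concrete, here is a small, hedged sketch of how such lines could be parsed and filtered by frequency. The helpers `parse_rule` and `filter_rules` are hypothetical and not part of the library. Since the format line above uses a second `==>` before the frequency while the example in [Manually loading rules](#manually-loading-rules) separates it with a space, the sketch accepts both spellings. Treating `min_freq` as an inclusive lower bound is also an assumption.

```python
import re

# Hypothetical helpers for illustration; bulstem's own parser may differ.
RULE_RE = re.compile(r"(\S+)\s+==>\s+(\S+)\s+(?:==>\s+)?(\d+)")


def parse_rule(line):
    """Return (suffix, stem, freq) for a rule line, or None if it does not match."""
    match = RULE_RE.fullmatch(line.strip())
    if match is None:
        return None
    suffix, stem, freq = match.groups()
    return suffix, stem, int(freq)


def filter_rules(lines, min_freq=0):
    """Keep parsed rules whose frequency is at least min_freq (assumed inclusive)."""
    parsed = (parse_rule(line) for line in lines)
    return [rule for rule in parsed if rule is not None and rule[2] >= min_freq]


rules = filter_rules(["ой ==> о 10", "ият ==> и ==> 1"], min_freq=2)
print(rules)
```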

### Manually loading rules

```python
from bulstem.stem import BulStemmer

stemmer = BulStemmer(["ой ==> о 10"], min_freq=0, left_context=2)
stemmer.stem('порой')  # Expected output: 'поро'
```

`BulStemmer` constructor params:
1. `rules` - Iterable of strings containing rules.
2. `min_freq` - The minimum frequency of a rule to be used when stemming.
3. `left_context` - Size of the prefix which will not be stemmed.

### Reading the rules from an external file

```python
from bulstem.stem import BulStemmer


# Pre-defined names of rule sets
PRE_DEFINED_RULES = ['stem-context-1', 
                     'stem-context-2',
                     'stem-context-3']

# Expected output:
# 1 втор
# 2 втори
# 3 вторият
for i, rules_name in enumerate(PRE_DEFINED_RULES, start=1):
    stemmer = BulStemmer.from_file(rules_name, min_freq=2, left_context=i)
    print(i, stemmer.stem('вторият'))

# Load rules from an explicit file path (context-2 rules, so left_context=2)
stemmer = BulStemmer.from_file('stem_rules_context_2_utf8.txt', min_freq=2, left_context=2)
stemmer.stem('вторият')  # Expected output: 'втори'
stemmer.stem('вероятен')  # Expected output: 'вероят'
```

`BulStemmer.from_file` params:
1. `path` - Path to (or pre-defined name of) the rules file, formatted as follows: word ==> stem ==> freq.
2. `min_freq` - The minimum frequency of a rule to be used when stemming.
3. `left_context` - Size of the prefix which will not be stemmed.


## Other implementations

[Perl (Original)](http://people.ischool.berkeley.edu/~nakov/bulstem/apply_stem.pl),
[Java (JDK 1.4)](http://people.ischool.berkeley.edu/~nakov/bulstem/Stemmer.java),
[Ruby](https://github.com/tbmihailov/bulstem),
[C#](https://github.com/tbmihailov/bulstem-cs),
[Python2](https://github.com/peio/PyBulStem),
[GATE plugin (Java)](https://gate.ac.uk/gate/plugins/Lang_Bulgarian/src/gate/bulstem/BulStemPR.java)

## License

For license information, see [LICENSE](https://github.com/mhardalov/bulstem-py/blob/master/LICENSE).


