Metadata-Version: 2.1
Name: EntityNormalizer
Version: 0.1.0.post2
Summary: Library for normalizing entities based on a dictionary
Home-page: https://github.com/sociocom/EntityNormalizer
Download-URL: https://github.com/sociocom/EntityNormalizer/archive/refs/tags/0.1.0.tar.gz
Author: Gabriel Herman Bernardim Andrade
License: MIT
Description-Content-Type: text/markdown
License-File: LICENSE.txt
Requires-Dist: pandas
Requires-Dist: rapidfuzz
Requires-Dist: mojimoji

# Entity Normalizer

Python tool for normalizing entities based on a dictionary.

## Usage

This tool can be used as:
- a *command line tool*, by cloning this repository and running `pip install .` under the root of this source; then you can run`main.py` with the required parameters to process your entity-listed file.
- a *Python package* , by installing the package using `pip install EntityNormalizer`

### Input and output
  
The input file must contain one entity per line. 
The output file will contain the normalized entities, again, one per line.  
The dicitory file must be a comma-separated table file, i.e., `csv`.

If the entity does not produce any match in the dictionary, it will be normalized to `[NO_MATCH]`. 
If the entity is found in the dictionary but the normalization is empty, it will be normalized to `[NO_NORM_FOUND]`.  
  
----
  
### Command line usage

`python main.py input output dictionary source target [--matching_threshold MATCHING_THRESHOLD] [--index]`

#### Parameters

- `input`: Input file path [Required]
- `output`: Output file path [Required]
- `dictionary`: Normalization dictionary file path [Required]
- `source`: Surface form column from dictionary [Required]
- `target`: Normalization column from dictionary [Required]
- `matching_threshold`: Threshold of string similarity for the normalization to be accepted (default: 50) [Optional]
- `index`: Use column indexes instead of names [Optional]
  
#### Example

- With column names:  
    `python main.py data/input.txt data/output.txt data/dictionary.csv surface_form_col normalization_col --matching_threshold 50`
- With integer column indexes: 
    `python main.py data/input.txt data/output.txt data/dictionary.csv --index source 0 target 2 --matching_threshold 80`

---

 ### Python package usage
After instalation,  the `normalized` function can be invoked with the dicitonary and a `list` of entities to produce a `list` of normalized entities.

#### Example
```python
from EntityNormalizer import EntityDictionary, normalize

entities  =  [entity1, entity2, entity3]

normalization_dictionary  = EntityDictionary('data/dictionary.csv',  'surface_forms',  'normalizations')
normalized  = normalize(entities,  normalization_dictionary,  matching_threshold=70)

print(normalized)
```
