Metadata-Version: 2.1
Name: binonymizer
Version: 0.1
Summary: Binonymizer is a tool in Python that aims at tagging personal data in a parallel corpus.
Home-page: https://github.com/bitextor/binonymizer
Author: Prompsit Language Engineering
Author-email: info@prompsit.com
Maintainer: Marta Bañón
Maintainer-email: mbanon@prompsit.com
License: GNU General Public License v3.0
Project-URL: Binonymizer on GitHub, https://github.com/bitextor/binonymizer
Project-URL: Prompsit Language Engineering, http://www.prompsit.com
Project-URL: Paracrawl, https://paracrawl.eu/
Platform: UNKNOWN
Classifier: Environment :: Console
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3.6
Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
Classifier: Operating System :: POSIX :: Linux
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing :: Filters
Description-Content-Type: text/markdown
Requires-Dist: certifi (==2018.11.29)
Requires-Dist: chardet (==3.0.4)
Requires-Dist: cymem (==2.0.2)
Requires-Dist: cytoolz (==0.9.0.1)
Requires-Dist: dill (==0.2.9)
Requires-Dist: idna (==2.8)
Requires-Dist: jpype1
Requires-Dist: msgpack (==0.5.6)
Requires-Dist: msgpack-numpy (==0.4.3.2)
Requires-Dist: murmurhash (==1.0.1)
Requires-Dist: numpy (==1.16.1)
Requires-Dist: plac (==0.9.6)
Requires-Dist: preshed (==2.0.1)
Requires-Dist: regex (==2018.1.10)
Requires-Dist: requests (==2.21.0)
Requires-Dist: semver (==2.8.1)
Requires-Dist: six (==1.12.0)
Requires-Dist: spacy (==2.0.18)
Requires-Dist: thinc (==6.12.1)
Requires-Dist: toolz (==0.9.0)
Requires-Dist: tqdm (==4.31.1)
Requires-Dist: ujson (==1.35)
Requires-Dist: urllib3 (==1.24.1)
Requires-Dist: wrapt (==1.10.11)

# binonymizer

Binonymizer is a tool in Python that aims at tagging personal data<sup>1</sup> in a parallel corpus.

For example, for a input like:

```
URL1  URL2  My name is Marta and my email is fake@email.com    Mi nombre es Marta y mi email es fake@email.com
```

Binonymizer's output will be:

```
URL1 URL2 My name is <entity class="PER">Marta</entity> and my email is <entity class="EMAIL">fake@email.com</entity> Mi nombre es <entity class="PER">Marta</entity> y mi email es <entity class="EMAIL">fake@email.com</entity>
```
## Detectable entity tipes

Currently, the Binonymizer is able to detect and tag the following types of entities:

* PER: person names
* ORG: organism and company names
* EMAIL: email addresses
* PHONE: phone numbers
* ADDRESS: addresses
* ID: personal card IDs (such as spanish DNIs)
* MISC: other personal data, or when the type it's uncertain 
* OTHER: other

## Installation & Requirements

Binonymizer works with Python 3.6.

Requirements can be installed by using `pip`:

```
python3.6 -m pip install -r requirements.txt
 ```
Language-dependant packages and models are automatically downloaded and installed on runtime, if needed.

## Usage

Binonymizer can be run with:

```bash
binonymizer.py [-h] --format {tmx,cols} [--tmp_dir TMP_DIR]
                     [-b BLOCK_SIZE] [-p PROCESSES] [-q] [--debug]
                     [--logfile LOGFILE] [-v]
                     input [output] srclang trglang
```


### Parameters
* positional arguments:
  * input: File to be anonymized (See format below)
  * output: File with anonymization annotations (default: standard output)
  * srclang: Source language code of the input
  * trglang: Target language code of the input
* optional arguments:
  * -h, --help: show this help message and exit
* Mandatory:
  * --format {tmx,cols}: Input file format. Values: cols, tmx  ("cols" format: URL1 URL2 SOURCE_SENTENCE TARGET_SENTENCE [extra columns] tab-separated)
* Optional:
  * --tmp_dir TMP_DIR: Temporary directory where creating the temporary files of this program (default: default system temp dir, defined by the environment variable TMPDIR in Unix)
  * -b BLOCK_SIZE, --block_size BLOCK_SIZE: Sentence pairs per block (default: 10000)
  * -p PROCESSES, --processes PROCESSES: Number of processes to use (default: all CPUs minus one)
* Logging:
  * -q, --quiet: Silent logging mode (default: False)
  * --debug: Debug logging mode (default: False)
  * --logfile LOGFILE: Store log to a file (default: standard error output)
  * -v, --version: show version of this script and exit

### Example
```bash
python3.6 binonymizer.py corpus.en-es.raw corpus.en-es.anon en es --format cols  --tmp_dir /tmpdir -b50000 -p31 
```
This will read the corpus "corpus.en-es.raw", which is in a column-based format, extracting personal data and writing the tagged output in "corpus.en-es.anon". Binonymizer will run in blocks of 50000 sentences, using 31 cores, and writing temporary files in /tmpdir


## Lite version

Although `binonymizer` makes use of parallelization  by distributing workload to the available cores, some users might prefer to implement their own parallelization strategies. For that reason, a single-thread version of the scripts is provided: `binonymizer_lite.py`. The usage is exactly the same as for the full version, but omitting the blocksize (-b) and processes (-p) parameter.


## TO DO
* Pip installable
* Fully support TMX input/output
* Address recognition
* GPU support
* Automate Prompsit-python-bindings submodule ( git submodule update --remote ,  python3.6 setup.py install)



<sup>1</sup>: See EC definition of "personal information": https://ec.europa.eu/info/law/law-topic/data-protection/reform/what-personal-data_en


