Metadata-Version: 2.1
Name: MordinezNLP
Version: 0.1.0b10
Summary: Powerful Python tool for modern NLP processing
Home-page: https://github.com/BMarcin/MordinezNLP
Author: Marcin Borzymowski
License: MIT
Keywords: NLP text preprocessing cleaning tool
Platform: UNKNOWN
Classifier: Development Status :: 2 - Pre-Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Education
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.6
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: clean-text (>=0.3.0)
Requires-Dist: colorama (==0.4.4)
Requires-Dist: elasticsearch (>=7.10.1)
Requires-Dist: pdfplumber (>=0.5.25)
Requires-Dist: requests (>=2.25.1)
Requires-Dist: selectolax (>=0.2.10)
Requires-Dist: spacy (==3.0.1)
Requires-Dist: spacy-legacy (==3.0.1)
Requires-Dist: stanza (==1.1.1)
Requires-Dist: torch (>=1.7.1)
Requires-Dist: tqdm (>=4.50.2)

<h1 align="center">
MordinezNLP
</h1>

<p align="center">
    <a href="https://github.com/BMarcin/MordinezNLP/blob/main/.github/workflows/tests.yml">
        <img alt="GitHub" src="https://img.shields.io/github/workflow/status/BMarcin/MordinezNLP/Test%20and%20build%20WHL">
    </a>
    <a href="https://github.com/BMarcin/MordinezNLP/blob/main/LICENSE">
        <img alt="GitHub" src="https://img.shields.io/github/license/BMarcin/MordinezNLP">
    </a>
    <a href="https://github.com/BMarcin/MordinezNLP/stargazers">
        <img alt="GitHub" src="https://img.shields.io/github/stars/BMarcin/MordinezNLP?style=social">
    </a>
</p>

<h3 align="center">
    Useful toolkit for NLP projects
</h3>

<p>
MordinezNLP provides tools to download data from the web, CommonCrawl and Elasticsearch using multiprocessing and custom file processing functions.

MordinezNLP is a powerful tool for cleaning up dirty texts so they can be used in neural networks with better performance.

Use MordinezNLP to extract text data from PDFs (omitting tables) and from HTML files.

MordinezNLP is built on top of spaCy and Stanza.
</p>

<h3 align="center">Quick tour</h3>
Text cleaning and POS tagging

```python
from MordinezNLP.processors import BasicProcessor
from MordinezNLP.pipelines import PartOfSpeech
from MordinezNLP.tokenizers import spacy_tokenizer
import spacy

nlp = spacy.load("en_core_web_sm")
nlp.tokenizer = spacy_tokenizer(nlp)

bp = BasicProcessor()
post_process = bp.process("this is my text to process by a function", language='en')

pos_tagger = PartOfSpeech(
    nlp,
    'en'
)

pos_output = pos_tagger.process(
    [post_process],
    4,
    30,
)
```

CommonCrawl downloader

```python
from MordinezNLP.downloaders import CommonCrawlDownloader

ccd = CommonCrawlDownloader(
    [
        "reddit.com/r/space/*",
        "reddit.com/r/spacex/*",
    ]
)
ccd.download('./test_data')
```

PDF parser

```python
from io import BytesIO
from MordinezNLP.parsers import process_pdf

with open("my_pdf_doc.pdf", "rb") as f:
    # Read the whole file into an in-memory buffer and parse it
    pdf = BytesIO(f.read())
    output = process_pdf(pdf)
    print(output)
```


<h3 align="center">
Installation
</h3>

<h4>With pip</h4>

```bash
pip install MordinezNLP
```
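
To check from Python that the install succeeded, a minimal sketch using the standard library's `importlib.metadata` might look like the following. It assumes Python 3.8 or newer (on 3.6 and 3.7 the `importlib_metadata` backport would be needed instead), and the helper name `installed_version` is hypothetical, not part of MordinezNLP.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if it is missing.

    Hypothetical helper for illustration; not part of the MordinezNLP API.
    """
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Prints the installed version, or None when the package is not installed
print(installed_version("MordinezNLP"))
```

A `None` result suggests that `pip install` targeted a different interpreter or virtual environment than the one running the check.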

<h3 align="center">
URLs
</h3>

- [Docs](https://mordineznlp.readthedocs.io/en/latest/)
- [GitHub](https://github.com/BMarcin/MordinezNLP)


