Metadata-Version: 2.1
Name: SqueakyCleanText
Version: 0.1.1
Summary: A comprehensive text cleaning and preprocessing pipeline.
Home-page: https://github.com/rhnfzl/SqueakyCleanText
Author: Rehan Fazal
License: MIT
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.7
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: lingua-language-detector
Requires-Dist: nltk
Requires-Dist: emoji
Requires-Dist: ftfy
Requires-Dist: Unidecode
Requires-Dist: beautifulsoup4
Requires-Dist: transformers
Requires-Dist: torch
Provides-Extra: dev
Requires-Dist: hypothesis ; extra == 'dev'
Requires-Dist: faker ; extra == 'dev'
Requires-Dist: flake8 ; extra == 'dev'
Requires-Dist: pytest ; extra == 'dev'


# SqueakyCleanText

SqueakyCleanText is a text cleaning package that sanitizes text for classical machine learning models and language models (such as BERT and RoBERTa) without altering the meaning of the text.

## Features

- Text sanitization for classical ML models and language models.
- Removes unnecessary characters and normalizes text.
- Supports Named Entity Recognition (NER).
- Identifies the language of the text.
- Provides cleaned text with stopwords removed.

## Installation

To install SqueakyCleanText, use the following pip command:

```sh
pip install SqueakyCleanText
```

## Usage

Here's a simple example to demonstrate how to use the SqueakyCleanText package:

```python
from sct import sct

# Initialize the TextCleaner
sx = sct.TextCleaner()

# Process the text
# lmtext: text for language models; cmtext: text for classical ML; language: detected language
lmtext, cmtext, language = sx.process("Hello, My name is John!")

# Output the result
print(lmtext, cmtext, language)
# Hello, My name is hello name ENGLISH
```

## API

### `sct.TextCleaner`

#### `process(text: str) -> Tuple[str, str, str]`

Processes the input text and returns a tuple containing:
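- Text for language models (`lmtext` in the usage example): cleaned text with punctuation and unnecessary characters removed.
- Text for classical ML (`cmtext` in the usage example): cleaned text with stopwords removed.
- Detected language of the text (`language` in the usage example).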

## TODO

- Add the ability to change the NER models from the config file, supporting AutoModel and AutoTokenizer.
- Expand language support for stopwords.

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request or open an issue.

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Acknowledgements

- Thanks to the contributors and the community for their support.
