Metadata-Version: 2.1
Name: WrdSmth
Version: 0.1.6
Summary: Your Python Text Preprocessing Toolkit
Author: Nazaryan Artem Karapetovich
Author-email: spanishiwasc2@gmail.com
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Description-Content-Type: text/markdown
Requires-Dist: nltk
Requires-Dist: numba
Requires-Dist: spacy
Requires-Dist: langdetect
Requires-Dist: scikit-learn

# WrdSmth: Your Python Text Preprocessing Toolkit

**WrdSmth** is a versatile Python library designed to streamline your text preprocessing workflow. Whether you're working on Natural Language Processing (NLP) tasks, data analysis, or machine learning projects, WrdSmth provides a comprehensive suite of tools to clean, transform, and prepare your text data for optimal results.

**The full [documentation](https://github.com/SL1dee36/wrdsmth-api) is available on [GitHub](https://github.com/SL1dee36/wrdsmth-api).**

**Key Features:**

* **Cleaning:** Remove unwanted characters, HTML tags, punctuation, and extra whitespace.
* **Tokenization:** Split text into individual words or sentences.
* **Stemming:** Reduce words to their base form (stem).
* **Lemmatization:** Convert words to their canonical form (lemma).
* **Vectorization:** Transform text into numerical vectors using TF-IDF.
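
The vectorization step builds on scikit-learn's TF-IDF. To make the underlying idea concrete, here is a minimal pure-Python sketch of the TF-IDF computation itself; it is an illustration of the technique, not WrdSmth's own API:

```python
import math
from collections import Counter

def tfidf(corpus):
    """Compute a simple TF-IDF weight matrix for a whitespace-tokenized corpus."""
    docs = [doc.split() for doc in corpus]
    vocab = sorted({term for doc in docs for term in doc})
    n = len(docs)
    # Document frequency: in how many documents each term appears
    df = {t: sum(1 for d in docs if t in d) for t in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # TF (relative frequency) times IDF (log of inverse document frequency)
        vec = [(counts[t] / len(doc)) * math.log(n / df[t]) for t in vocab]
        vectors.append(vec)
    return vocab, vectors

vocab, vectors = tfidf(["the cat sat", "the dog sat"])
```

Terms shared by every document (like `the` and `sat` above) get a weight of zero, while document-specific terms (`cat`, `dog`) get positive weights, which is exactly what makes TF-IDF useful for distinguishing documents.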

**Easy to Use:**

WrdSmth offers a simple and intuitive API, making it easy to integrate into your existing projects. Just install it with `pip`:

```bash
pip install WrdSmth
```


## Usage

### 1. Cleaning Text

The `clean_text` function provides various options for cleaning text data. 

```python
from WrdSmth.cleaning import clean_text

text = "This is an example text with <br> HTML tags, punctuation!@#$%^&*(), numbers 123, a URL https://www.example.com and an email example@example.com."

# Clean text with all default options
cleaned_text = clean_text(text)
print(cleaned_text)
# Output: this is an example text with html tags numbers 123 a url httpswwwexamplecom and an email exampleexamplecom
```

**Parameters:**

* `text` (str): Text to be cleaned.
* `remove_html` (bool, optional): Remove HTML tags. Defaults to `True`.
* `remove_punctuation` (bool, optional): Remove punctuation. Defaults to `True`.
* `lowercase` (bool, optional): Convert text to lowercase. Defaults to `True`.
* `remove_extra_spaces` (bool, optional): Remove extra spaces. Defaults to `True`.
* `remove_numbers` (bool, optional): Remove numbers. Defaults to `False`.
* `replace_urls` (bool, optional): Replace URLs with a placeholder. Defaults to `False`.
* `replace_emails` (bool, optional): Replace email addresses with a placeholder. Defaults to `False`.
* `custom_regex` (str, optional): Custom regular expression pattern to remove. Defaults to `None`.
* `normalize_unicode` (bool, optional): Normalize Unicode characters. Defaults to `False`.


### 2. Tokenization

The `tokenize_text` function offers various tokenization methods:

```python
from WrdSmth.tokenization import tokenize_text

text = "This is a sentence. This is another sentence."

# Word tokenization
word_tokens = tokenize_text(text, method='word')
print(word_tokens)
# Output: ['This', 'is', 'a', 'sentence', '.', 'This', 'is', 'another', 'sentence', '.']

# Sentence tokenization
sentence_tokens = tokenize_text(text, method='sentence')
print(sentence_tokens)
# Output: ['This is a sentence.', 'This is another sentence.']
```

**Parameters:**

* `text` (str): Text to be tokenized.
* `method` (str, optional): Tokenization method ('word', 'sentence', 'regex', 'custom'). Defaults to 'word'.
* `language` (str, optional): Language of the text. Defaults to 'english'. If None, the language will be detected automatically.
* `n_gram_range` (tuple, optional): Minimum and maximum n-gram lengths (for 'word' method). Defaults to (1, 1).
* `regex_pattern` (str, optional): Regular expression pattern for tokenization (for 'regex' method). Defaults to None.
* `remove_stopwords` (bool, optional): Whether to remove stop words. Defaults to False.
* `stopwords` (list, optional): List of stop words to remove. Defaults to None (uses NLTK's English stop words).
* `lowercase` (bool, optional): Whether to lowercase the tokens. Defaults to False.
* `custom_tokenizer` (callable, optional): Custom tokenizer function. Defaults to None.

[and more...](https://github.com/SL1dee36/wrdsmth-api)
