Metadata-Version: 2.1
Name: cadence-punctuation
Version: 1.0.0
Summary: Multilingual punctuation restoration model for Indic languages
Home-page: https://github.com/AI4Bharat/Cadence
Author: AI4Bhārat
Author-email: opensource@ai4bharat.org
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch >=2.1.0
Requires-Dist: transformers >=4.51.3
Requires-Dist: safetensors >=0.4.1
Requires-Dist: numpy <2.0.0,>=1.21.0
Requires-Dist: huggingface-hub >=0.30.2

# Cadence

A multilingual punctuation restoration model based on Gemma-3-1b.

## Features
- **Multilingual Support**: English + 22 Indic languages
- **Unimodel**: A single model for all supported languages (no language identifier required)
- **Encoder**: Bidirectional encoder for fast inference
- **AutoModel Compatible**: Easy integration with Hugging Face ecosystem
- **Efficient Processing**: Supports batch processing and sliding window for long texts
- **Script-Aware**: Handles multiple scripts with appropriate punctuation rules

## Installation

```bash
pip install cadence-punctuation
```

## Quick Start

### Using the Python package (Recommended)

```python
from cadence import PunctuationModel

# Load model (local path or Hugging Face model ID)
model = PunctuationModel("path/to/download/weights")

# Punctuate single text
text = "hello world how are you today"
result = model.punctuate([text])
print(result[0])  # "Hello world, how are you today?"

# Punctuate multiple texts
texts = [
    "hello world how are you",
    "this is another test sentence",
    "यह एक हिंदी वाक्य है"  # Hindi example
]
results = model.punctuate(texts, batch_size=8)
for original, punctuated in zip(texts, results):
    print(f"Original: {original}")
    print(f"Punctuated: {punctuated}")
```

### Using AutoModel

```python
from transformers import AutoTokenizer, AutoModel
import torch

# Load model and tokenizer
model_name = "ai4bharat/Cadence"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

id2label = model.config.id2label

text = "यह एक वाक्य है इसका क्या मतलब है"
# text = "this is a test sentence what do you think"

# Tokenize input and prepare for model
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
input_ids = inputs['input_ids'][0] # Get input_ids for the first (and only) sentence

with torch.no_grad():
    outputs = model(**inputs)
    predictions_for_sentence = torch.argmax(outputs.logits, dim=-1)[0]


result_tokens_and_punctuation = []
all_token_strings = tokenizer.convert_ids_to_tokens(input_ids.tolist()) # Get all token strings

for i, token_id_value in enumerate(input_ids.tolist()):
    # Process only non-padding tokens based on the attention mask
    if inputs['attention_mask'][0][i] == 0:
        continue

    current_token_string = all_token_strings[i]

    is_special_token = token_id_value in tokenizer.all_special_ids
    
    if not is_special_token:
        result_tokens_and_punctuation.append(current_token_string)
    
    predicted_punctuation_id = predictions_for_sentence[i].item()
    punctuation_character = id2label[predicted_punctuation_id]

    if punctuation_character != "O" and not is_special_token:
        result_tokens_and_punctuation.append(punctuation_character)

punctuated_text = tokenizer.convert_tokens_to_string(result_tokens_and_punctuation)

print(f"Original Text: {text}")
print(f"Punctuated Text: {punctuated_text}")
```


## Officially Supported Languages
- English, Assamese, Bengali, Bodo, Dogri, Gujarati, Hindi, Kannada, Kashmiri, Konkani, Maithili, Malayalam, Manipuri, Marathi, Nepali, Odia, Punjabi, Sanskrit, Santali, Sindhi, Tamil, Telugu, Urdu

The tokenizer does not support Manipuri's Meitei script. The model can still punctuate Manipuri text if it is first transliterated to the Bengali script.

You can also try the model on languages not listed above, but performance may vary.

## Supported Punctuation
The model can predict the following punctuation marks:
- Period (.)
- Comma (,)  
- Question mark (?)
- Exclamation mark (!)
- Semicolon (;)
- Colon (:)
- Hyphen (-)
- Quotes (" and ')
- Ellipsis (...)
- Parentheses ()
- Hindi Danda (।)
- Urdu punctuation (۔، ؟)
- Arabic punctuation (٬ ،)
- Santali punctuation (᱾ ᱾।)
- Sanskrit punctuation (॥)
- And various combinations
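
To try the model on text that is already punctuated, the marks listed above can be stripped first. The following is a minimal sketch using only the standard library; `strip_punctuation` and `PUNCTUATION_MARKS` are hypothetical names and are not part of the `cadence` package. It deliberately leaves hyphens and apostrophes in place, since they often occur inside words.

```python
# Hypothetical helper, not part of the cadence package.
# Marks are taken from the list above; "-" and "'" are intentionally left out
# because they frequently appear inside words (e.g. "well-known", "don't").
PUNCTUATION_MARKS = [
    ".", ",", "?", "!", ";", ":", '"', "(", ")",
    "।", "۔", "،", "؟", "٬", "᱾", "॥",
]

# Map every mark to a space so that words on either side stay separated.
_TO_SPACE = str.maketrans({mark: " " for mark in PUNCTUATION_MARKS})


def strip_punctuation(text: str) -> str:
    """Replace the listed marks with spaces and collapse repeated whitespace."""
    return " ".join(text.translate(_TO_SPACE).split())


print(strip_punctuation("Hello world, how are you today?"))
print(strip_punctuation("यह एक वाक्य है।"))
```

The stripped strings can then be passed to `model.punctuate([...])` and compared with the originals.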

## Configuration Options

### PunctuationModel Parameters

All parameters are optional.
- `model_path`: Local path to which the weights are downloaded (default: None)
- `gpu_id`: GPU device ID (default: None for auto-detection)
- `cpu`: Force CPU usage (default: False)
- `max_length`: Maximum sequence length; also used as the window width when `sliding_window` is enabled (default: 300)
- `sliding_window`: Enable sliding window for long texts (default: True)
- `verbose`: Enable verbose logging (default: False)
- `d_type`: Precision with which weights are loaded (default: bfloat16)


```python
# Custom configuration
model = PunctuationModel(
    model_path="path/to/download/weights",
    gpu_id=0,  # Use specific GPU
    max_length=512,  # Length for truncation; also used as window size when sliding_window=True
    sliding_window=True,  # Handle long texts
    verbose=False,  # Quiet mode
    d_type="bfloat16"
)

# Process long texts with sliding window
long_text = "Your very long text here..." * 100
result = model.punctuate([long_text])
```
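
To illustrate the general idea behind a sliding window, here is a small sketch that splits a token sequence into overlapping windows of at most `max_length` tokens. It is an assumption-laden illustration, not the package's implementation: the function name `sliding_windows`, the `overlap` parameter, and its value of 50 are hypothetical, and how Cadence actually overlaps and merges windows may differ.

```python
from typing import List


def sliding_windows(tokens: List[str], max_length: int = 300, overlap: int = 50) -> List[List[str]]:
    """Split tokens into windows of at most max_length tokens.

    Consecutive windows share `overlap` tokens (a hypothetical choice),
    so each window starts max_length - overlap tokens after the previous one.
    """
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    if not 0 <= overlap < max_length:
        raise ValueError("overlap must satisfy 0 <= overlap < max_length")
    if not tokens:
        return []

    stride = max_length - overlap
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + max_length])
        # Stop once the current window reaches the end of the sequence.
        if start + max_length >= len(tokens):
            break
        start += stride
    return windows


tokens = [f"w{i}" for i in range(700)]
windows = sliding_windows(tokens, max_length=300, overlap=50)
print(len(windows), [len(w) for w in windows])  # 3 [300, 300, 200]
```

The overlap gives tokens near a window boundary some context on both sides; predictions for the shared tokens would then need to be reconciled when the windows are stitched back together.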

## License
MIT License
