Metadata-Version: 2.4
Name: bournemouth-forced-aligner
Version: 0.1.0
Summary: Bournemouth Forced Aligner - Phoneme-level timestamp extraction
Home-page: https://github.com/tabahi/bournemouth-forced-aligner
Author: Tabahi
Author-email: Tabahi <tabahi@duck.com>
Maintainer-email: Tabahi <tabahi@duck.com>
License: gplv3
Project-URL: Homepage, https://github.com/tabahi/bournemouth-forced-aligner
Project-URL: Documentation, https://github.com/tabahi/bournemouth-forced-aligner#readme
Project-URL: Repository, https://github.com/tabahi/bournemouth-forced-aligner
Project-URL: Bug Tracker, https://github.com/tabahi/bournemouth-forced-aligner/issues
Project-URL: Changelog, https://github.com/tabahi/bournemouth-forced-aligner/blob/main/CHANGELOG.md
Project-URL: CUPE Models, https://huggingface.co/Tabahi/CUPE-2i
Keywords: phoneme,alignment,speech,audio,timestamp,forced-alignment,bournemouth,CUPE,speech-recognition,linguistics
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Education
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Multimedia :: Sound/Audio :: Analysis
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=1.9.0
Requires-Dist: torchaudio>=0.9.0
Requires-Dist: huggingface_hub>=0.8.0
Requires-Dist: numpy>=1.19.0
Requires-Dist: click>=8.0.0
Requires-Dist: phonemizer>=3.3.0
Provides-Extra: dev
Requires-Dist: pytest>=6.0; extra == "dev"
Requires-Dist: pytest-cov>=3.0.0; extra == "dev"
Requires-Dist: black>=22.0.0; extra == "dev"
Requires-Dist: flake8>=4.0.0; extra == "dev"
Requires-Dist: mypy>=0.950; extra == "dev"
Requires-Dist: pre-commit>=2.17.0; extra == "dev"
Provides-Extra: test
Requires-Dist: pytest>=6.0; extra == "test"
Requires-Dist: pytest-cov>=3.0.0; extra == "test"
Requires-Dist: pytest-xdist>=2.5.0; extra == "test"
Provides-Extra: docs
Requires-Dist: sphinx>=4.0.0; extra == "docs"
Requires-Dist: sphinx-rtd-theme>=1.0.0; extra == "docs"
Requires-Dist: myst-parser>=0.17.0; extra == "docs"
Provides-Extra: audio
Requires-Dist: librosa>=0.8.0; extra == "audio"
Requires-Dist: soundfile>=0.10.0; extra == "audio"
Requires-Dist: pydub>=0.25.0; extra == "audio"
Provides-Extra: all
Requires-Dist: bournemouth-forced-aligner[audio,dev,docs,test]; extra == "all"
Dynamic: author
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-python


# Bournemouth Forced Aligner (BFA)


A Python library for extracting phoneme-level timestamps from audio files and transcriptions. 
**URL:** https://github.com/tabahi/bournemouth-forced-aligner

- Find the exact time, in milliseconds, at which each phoneme or word is spoken in an audio clip, provided you already have the text for the audio.

- This project depends on pretrained models from Contextless Universal Phoneme Encoder (CUPE): https://github.com/tabahi/contexless-phonemes-CUPE
- Currently, only the English model has been pretrained up to usable accuracy.
- Work in progress.

## Features
- **Fast alignment** Takes 0.2s for 10s audio on CPU.
- **Phoneme-level timestamp extraction** from audio with high accuracy
- **Viterbi algorithm** with confidence scoring and target boosting
- **Multi-language support** via espeak phonemization (*current acoustic model is English only)
- **Embedding extraction** contextless, pure phoneme embeddings for downstream machine learning tasks
- **Word-level alignment** derived from phoneme timestamps
- **Command-line interface** for hands-off use.
- **JSON output format** for easy integration with other tools
- **TextGrid output format** for importing timestamps into Praat

## Installation

### From PyPI (recommended)
```bash
pip install bournemouth-forced-aligner

# Dependencies
apt-get install espeak-ng
```

Check the installation:
```bash
# Show help
balign --help

# Show version
balign --version

# Test installation
python -c "from bournemouth_aligner import PhonemeTimestampAligner; print('Installation OK')"
```

## Getting Started

Start with example.py

```python

import torch
import time
import json
from bournemouth_aligner import PhonemeTimestampAligner


transcription = "butterfly"
audio_path = "examples/samples/audio/109867__timkahn__butterfly.wav"

model_name = "en_libri1000_uj01d_e199_val_GER=0.2307.ckpt" 
extractor = PhonemeTimestampAligner(model_name=model_name, lang='en-us', duration_max=10, device='cpu')

audio_wav = extractor.load_audio(audio_path) # can replace it with custom audio source

t0 = time.time()

timestamps = extractor.process_transcription(transcription, audio_wav, ts_out_path=None, extract_embeddings=False, vspt_path=None, do_groups=True, debug=True)

t1 = time.time()
print("Timestamps:")
print(json.dumps(timestamps, indent=4, ensure_ascii=False))
print(f"Processing time: {t1 - t0:.2f} seconds") # 0.2s

```

## Output

Sample output:
```json
{
    "segments": [
        {
            "start": 0.0,
            "end": 1.2588125,
            "text": "butterfly", "ph66": [29, 10, 58, 9, 43, 56, 23], "pg16": [7, 2, 14, 2, 8, 13, 5], "coverage_analysis": {"target_count": 7, "aligned_count": 7, "missing_count": 0, "extra_count": 0, "coverage_ratio": 1.0, "missing_phonemes": [], "extra_phonemes": []}, "ipa": ["b", "ʌ", "ɾ", "ɚ", "f", "l", "aɪ"], "word_num": [0, 0, 0, 0, 0, 0, 0], "words": [
                "butterfly"
            ],
            "phoneme_ts": [
                {
                    "phoneme_idx": 29,
                    "phoneme_label": "b",
                    "start_ms": 33.56833267211914,
                    "end_ms": 50.35249710083008,
                    "confidence": 0.9849503040313721
                },
                ...,
                {
                    "phoneme_idx": 23,
                    "phoneme_label": "aɪ",
                    "start_ms": 604.22998046875,
                    "end_ms": 621.01416015625,
                    "confidence": 0.21650740504264832
                }
            ],
            "group_ts": [
                {
                    "group_idx": 7,
                    "group_label": "voiced_stops",
                    "start_ms": 33.56833267211914,
                    "end_ms": 50.35249710083008,
                    "confidence": 0.9911064505577087
                },
                ...,
                {
                    "group_idx": 5,
                    "group_label": "diphthongs",
                    "start_ms": 604.22998046875,
                    "end_ms": 621.01416015625,
                    "confidence": 0.4117060899734497
                }
            ],
            "words_ts": [
                {
                    "word": "butterfly", "start_ms": 33.56833267211914, "end_ms": 621.01416015625, "confidence": 0.6550856615815844, "ph66": [29, 10, 58, 9, 43, 56, 23], "ipa": ["b", "ʌ", "ɾ", "ɚ", "f", "l", "aɪ"]
                }
            ]
        }
    ]
}
```
Output keys:
- 'ph66': indices into the 66 standardized phoneme classes, including silence. See more in [mapper66.py](bournemouth_aligner/mapper66.py)
- 'pg16': indices into the 16 standardized phoneme groups, such as laterals, lower front vowels, rhotics, etc. See the complete mapping in `phoneme_groups_index` in [mapper66.py](bournemouth_aligner/mapper66.py)
- 'ipa': list of IPA sequences generated by espeak. These can cause Unicode issues.
- 'words': list of words split by a simple regex: `re.findall(r"\b\w+\b|[.,!?;:]", "sentence")`
- 'phoneme_ts': aligned timestamps for phonemes (ph66).
- 'group_ts': aligned timestamps for phoneme groups (pg16). These can be more accurate than phoneme timestamps.
- 'word_num': one word index per phoneme in 'ph66'. Each entry points to the word that the corresponding phoneme belongs to.
- 'words_ts': aligned timestamps for words, remapped from 'phoneme_ts'.
- 'coverage_analysis': metrics for alignment quality. Reports insertions and deletions.
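
How 'words', 'word_num' and 'phoneme_ts' relate to each other can be illustrated with a short sketch. It applies the regex quoted above and then groups phoneme timestamps by word index to recover word spans. The helper names `split_words` and `word_spans`, the `word_num` values and the timestamps are illustrative assumptions, not part of the library, and the library's own 'words_ts' may be computed differently.

```python
import re


def split_words(text):
    # Same regex as quoted in the 'words' key description.
    return re.findall(r"\b\w+\b|[.,!?;:]", text)


def word_spans(words, word_num, phoneme_ts):
    """Group phoneme timestamps by word index (hypothetical helper)."""
    spans = {}
    for idx, ph in zip(word_num, phoneme_ts):
        start, end = spans.get(idx, (ph["start_ms"], ph["end_ms"]))
        spans[idx] = (min(start, ph["start_ms"]), max(end, ph["end_ms"]))
    return [
        {"word": words[i], "start_ms": s, "end_ms": e}
        for i, (s, e) in sorted(spans.items())
    ]


words = split_words("ah What!")
print(words)  # ['ah', 'What', '!']

# Made-up values: two phonemes for word 0, three for word 1.
word_num = [0, 0, 1, 1, 1]
phoneme_ts = [
    {"start_ms": 0.0, "end_ms": 50.0},
    {"start_ms": 50.0, "end_ms": 120.0},
    {"start_ms": 200.0, "end_ms": 260.0},
    {"start_ms": 260.0, "end_ms": 300.0},
    {"start_ms": 300.0, "end_ms": 380.0},
]
spans = word_spans(words, word_num, phoneme_ts)
print(spans)
```

Each word span runs from the start of its first phoneme to the end of its last one, which matches the 'words_ts' entry in the sample output above.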




## Alignment Accuracy

Below is an example Praat TextGrid visualization of phoneme-level alignment produced by [BFA](https://github.com/tabahi/bournemouth-forced-aligner), compared with [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner):

![Praat Alignment Example](examples/samples/images/LJ02_praat.png)

Currently, there are no reliable metrics to report other than speed. The timing distance error is 40ms on TIMIT (unreliable). MFA takes 10 seconds to align 2 seconds of audio, which makes it difficult to use in real time. BFA has potential for real-time inference because it is contextless ([see CUPE](https://huggingface.co/Tabahi/CUPE-2i)). If you have any comparisons or suggestions for improvement, please let me know in the issues.
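
If you want to run your own comparison, the sketch below shows one possible way to compute a real-time factor and a timing error between two alignments. The definition used here (mean absolute difference between phoneme start times) is an assumption, not necessarily how the TIMIT figure above was measured. Both function names and the boundary values are hypothetical.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    return processing_seconds / audio_seconds


def mean_start_error_ms(reference, predicted):
    """Mean absolute difference between matching phoneme start times, in ms."""
    if not reference or len(reference) != len(predicted):
        raise ValueError("alignments must be non-empty and of equal length")
    errors = [abs(r["start_ms"] - p["start_ms"]) for r, p in zip(reference, predicted)]
    return sum(errors) / len(errors)


# Speed figures quoted in this README.
print(real_time_factor(0.2, 10.0))   # BFA: 0.2s for 10s of audio
print(real_time_factor(10.0, 2.0))   # MFA: 10s for 2s of audio

# Made-up boundaries, for illustration only.
reference = [{"start_ms": 0}, {"start_ms": 100}, {"start_ms": 250}]
predicted = [{"start_ms": 10}, {"start_ms": 90}, {"start_ms": 280}]
print(mean_start_error_ms(reference, predicted))
```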



## How does it work?

The phoneme probabilities extracted from [CUPE](https://huggingface.co/Tabahi/CUPE-2i) are passed through the [Viterbi algorithm](https://en.wikipedia.org/wiki/Viterbi_algorithm), which traces forward and backward paths over the probabilities to pick the frames at which the expected phonemes begin and end. See the core code in [ViterbiDecoder](bournemouth_aligner/forced_alignment.py). Because of its many for-loops, it works better on CPU. There is potential for further optimization.
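
To make the idea concrete, here is a minimal sketch of monotonic forced alignment with a forward pass and a backward trace. It is not the library's `ViterbiDecoder`: it has no confidence scoring, target boosting or silence handling, and the function name `forced_align` and the toy probabilities are assumptions made for illustration.

```python
import numpy as np


def forced_align(log_probs, targets):
    """Align N target classes to T frames, keeping the target order.

    log_probs: (T, C) array of per-frame log-probabilities over C classes.
    targets:   list of N class indices in the expected order (N <= T).
    Returns one (start_frame, end_frame) pair per target, end exclusive.
    """
    T, N = log_probs.shape[0], len(targets)
    if N == 0 or N > T:
        raise ValueError("need 1 <= len(targets) <= number of frames")
    score = np.full((T, N), -np.inf)
    advanced = np.zeros((T, N), dtype=bool)  # True: moved on from target n-1
    score[0, 0] = log_probs[0, targets[0]]
    # Forward pass: each frame stays on the same target or advances by one.
    for t in range(1, T):
        for n in range(min(t + 1, N)):
            stay = score[t - 1, n]
            move = score[t - 1, n - 1] if n > 0 else -np.inf
            advanced[t, n] = move > stay
            score[t, n] = max(stay, move) + log_probs[t, targets[n]]
    # Backward pass: trace the best path from the last target at the last frame.
    frame_to_target = np.zeros(T, dtype=int)
    n = N - 1
    for t in range(T - 1, -1, -1):
        frame_to_target[t] = n
        if advanced[t, n]:
            n -= 1
    # Convert the frame-level path into (start, end) spans.
    spans = []
    for n in range(N):
        frames = np.flatnonzero(frame_to_target == n)
        spans.append((int(frames[0]), int(frames[-1]) + 1))
    return spans


# Toy example: 6 frames, 3 classes, expected order 0 -> 1 -> 2.
probs = np.array([
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.7, 0.2],
    [0.2, 0.7, 0.1],
    [0.1, 0.1, 0.8],
])
print(forced_align(np.log(probs), [0, 1, 2]))  # [(0, 2), (2, 5), (5, 6)]
```

The nested Python loops over frames and targets hint at why this kind of decoding is not a natural fit for GPU execution.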



## Advanced Usage

See [example_advanced.py](examples/example_advanced.py) for more advanced batch processing.

Extract timestamps directly from audio by first transcribing it with Whisper:

```python
# pip install git+https://github.com/openai/whisper.git 

import whisper
import json
from bournemouth_aligner import PhonemeTimestampAligner

model = whisper.load_model("turbo")

audio_path = "audio.wav"
srt_path = "whisper_output.srt.json"
ts_out_path = "timestamps.vs2.json"

result = model.transcribe(audio_path)

# save whisper output
with open(srt_path, "w") as srt_file:
    json.dump(result, srt_file)

extractor = PhonemeTimestampAligner(model_name="en_libri1000_uj01d_e199_val_GER=0.2307.ckpt", lang='en-us', duration_max=10, device='cpu')
timestamps_dict = extractor.process_srt_file(srt_path, audio_path, ts_out_path, extract_embeddings=False, vspt_path=None, debug=False)

with open(ts_out_path, "w") as ts_file:
    json.dump(timestamps_dict, ts_file)

```

Build it step by step in a notebook:
```python
import torch
import torchaudio
from bournemouth_aligner import PhonemeTimestampAligner



# Step1: Initialize PhonemeTimestampAligner
device = 'cpu' # CPU is faster for single-file processing
duration_max = 10 # it's only for padding and clipping. Set it more than your expected duration
model_name = "en_libri1000_uj01d_e199_val_GER=0.2307.ckpt" # Find more models at: https://huggingface.co/Tabahi/CUPE-2i/tree/main/ckpt
lang = 'en-us' # Each CUPE model is trained on a specific language(s)
extractor = PhonemeTimestampAligner(model_name=model_name, lang=lang, duration_max=duration_max, device=device)





# Step 2a: Load and preprocess audio - manually

audio_path = "examples/samples/audio/Schwa-What.wav"
audio_wav, sr = torchaudio.load(audio_path, normalize=True) #  normalize=True is for torch dtype normalization, not for amplitude

# Stick with CUPE's sample rate of 16000 Hz. For consistency, use the same audio loading and resampling pipeline as CUPE's training preprocessing:
resampler = torchaudio.transforms.Resample(
        orig_freq=sr,
        new_freq=16000,
        lowpass_filter_width=64,
        rolloff=0.9475937167399596,
        resampling_method="sinc_interp_kaiser",
        beta=14.769656459379492,
    )
audio_wav = resampler(audio_wav)

rms = torch.sqrt(torch.mean(audio_wav ** 2)) # rms normalize (better to have at least 75% voiced duration)
audio_wav = (audio_wav / rms) if rms > 0 else audio_wav





# Step 2b: Load and preprocess audio - streamlined
audio_wav =  extractor.load_audio(audio_path)




# Step 3: Load/create text transcriptions:
transcription = "ah What!"



# Step 4: Align
timestamps = extractor.process_transcription(transcription, audio_wav, ts_out_path=None, extract_embeddings=False, vspt_path=None, do_groups=True, debug=False)


# Step 5 (optional): Convert to TextGrid
extractor.convert_to_textgrid(timestamps, output_file="output_timestamps.TextGrid", include_confidence=False)
```


If you are interested in using the phoneme embeddings for machine learning then check out [this example](examples/read_embeddings.py).



# Command Line Interface (CLI)

Bournemouth Forced Aligner includes a powerful command-line interface for batch processing and automation. The CLI command is `balign`.


After installation, the `balign` command will be available in your terminal.


```bash
# Basic usage
balign audio.wav transcription.srt.json output.json

# With debug output
balign audio.wav transcription.srt.json output.json --debug

# Extract embeddings too
balign audio.wav transcription.srt.json output.json --embeddings embeddings.pt
```

## Command Syntax

```bash
balign [OPTIONS] AUDIO_PATH SRT_PATH OUTPUT_PATH

# example:
balign audio.wav transcription.srt.json output.json --device cuda:0 --embeddings output_embd.pt --duration-max 5
```

### Required Arguments

- **`AUDIO_PATH`**: Path to audio file (supports .wav, .mp3, .flac, etc.)
- **`SRT_PATH`**: Path to SRT file in JSON format (see [Input Format](#input-format))
- **`OUTPUT_PATH`**: Path for output timestamps file (.json)

### Options

| Option | Default | Description |
|--------|---------|-------------|
| `--model TEXT` | `en_libri1000_uj01d_e199_val_GER=0.2307.ckpt` | CUPE model name from [HuggingFace](https://huggingface.co/Tabahi/CUPE-2i/tree/main/ckpt) |
| `--lang TEXT` | `en-us` | Language code for phonemization ([espeak codes](https://github.com/espeak-ng/espeak-ng/blob/master/docs/languages.md)) |
| `--device TEXT` | `cpu` | Device for inference (`cpu` or `cuda`) |
| `--embeddings PATH` | None | Path to save phoneme embeddings (.pt file) |
| `--duration-max FLOAT` | `10.0` | Maximum segment duration in seconds. A shorter maximum duration uses less CUDA memory and is faster. |
| `--debug / --no-debug` | `False` | Enable detailed debug output |
| `--boost-targets / --no-boost-targets` | `True` | Enable target phoneme boosting for better alignment |
| `--help` | | Show help message and exit |
| `--version` | | Show version and exit |

## Usage Examples

### Basic Phoneme Alignment

```bash
# Simple alignment with English audio
balign speech.wav transcription.srt.json timestamps.json
```

### With Embeddings Extraction

```bash
# Extract phoneme embeddings for downstream tasks
balign speech.wav transcription.srt.json timestamps.json --embeddings speech_embeddings.pt
```

### Multi-language Support (*planned)

```bash
# Spanish audio
balign spanish_audio.wav transcription.srt.json output.json --lang es

# French audio  
balign french_audio.wav transcription.srt.json output.json --lang fr

# German audio
balign german_audio.wav transcription.srt.json output.json --lang de
```

### GPU Acceleration

```bash
# Use CUDA for faster processing
balign large_audio.wav transcription.srt.json output.json --device cuda
```

### Advanced Configuration

```bash
# Custom model with longer segments and debug output
balign audio.wav transcription.srt.json output.json \
    --model "en_libri1000_uj01d_e199_val_GER=0.2307.ckpt" \
    --duration-max 15 \
    --debug \
    --embeddings embeddings.pt
```

### Batch Processing

```bash
#!/bin/bash
# Process multiple files
for audio in *.wav; do
    base=$(basename "$audio" .wav)
    balign "$audio" "${base}.srt.json" "${base}_timestamps.json" --debug
done
```
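
The same loop can be sketched in Python, which makes it easy to skip audio files that have no transcription. This is only a sketch: the helper name `build_balign_commands` is hypothetical, and it assumes each `<name>.wav` has its transcription in `<name>.srt.json` in the same folder. It only builds and prints the commands. Pass each list to `subprocess.run(cmd, check=True)` to actually run them.

```python
from pathlib import Path


def build_balign_commands(folder, extra_args=()):
    """Return one argv list per .wav file that has a matching .srt.json file."""
    commands = []
    for audio in sorted(Path(folder).glob("*.wav")):
        srt = audio.with_name(audio.stem + ".srt.json")
        if not srt.exists():
            continue  # no transcription for this audio file, skip it
        out = audio.with_name(audio.stem + "_timestamps.json")
        commands.append(["balign", str(audio), str(srt), str(out), *extra_args])
    return commands


# Print the commands for the current directory without executing them.
for cmd in build_balign_commands(".", extra_args=["--debug"]):
    print(" ".join(cmd))
```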

## Input Format

The SRT file must be in JSON format with the following structure:

```json
{
  "segments": [
    {
      "start": 0.0,
      "end": 3.5,
      "text": "hello world this is a test"
    },
    {
      "start": 3.5,
      "end": 7.2,
      "text": "another segment of speech"
    }
  ]
}
```
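
Before running the aligner, it can help to check that a file follows this structure. The sketch below is a hypothetical helper, not part of the library. It checks only the three fields shown above, and the rule that `end` must be greater than `start` is an assumption.

```python
def _is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_srt_json(data):
    """Return a list of problems found in an SRT-JSON object (empty if none)."""
    if not isinstance(data, dict) or not isinstance(data.get("segments"), list):
        return ["top-level object must contain a 'segments' list"]
    problems = []
    for i, seg in enumerate(data["segments"]):
        if not isinstance(seg, dict):
            problems.append(f"segment {i}: not an object")
            continue
        for key in ("start", "end"):
            if not _is_number(seg.get(key)):
                problems.append(f"segment {i}: '{key}' must be a number")
        if not isinstance(seg.get("text"), str):
            problems.append(f"segment {i}: 'text' must be a string")
        if (_is_number(seg.get("start")) and _is_number(seg.get("end"))
                and seg["end"] <= seg["start"]):
            problems.append(f"segment {i}: 'end' must be greater than 'start'")
    return problems


good = {"segments": [{"start": 0.0, "end": 3.5, "text": "hello world this is a test"}]}
bad = {"segments": [{"start": 3.5, "end": 1.0}]}
print(validate_srt_json(good))  # []
print(validate_srt_json(bad))
```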

### Creating SRT Files

You can create SRT files using various methods:

**From Whisper output:**
```python
import whisper
import json

model = whisper.load_model("base")
result = model.transcribe("audio.wav")

# Convert to balign format
srt_data = {"segments": result["segments"]}
with open("transcription.srt.json", "w") as f:
    json.dump(srt_data, f, indent=2)
```

**Manual SRT creation:**
```python
import json

srt_data = {
    "segments": [
        {
            "start": 0.0,
            "end": 2.5,
            "text": "your transcribed text here"
        }
    ]
}

with open("transcription.srt.json", "w") as f:
    json.dump(srt_data, f, indent=2)
```

## Output Format

The CLI generates a detailed JSON file with phoneme-level timestamps:

```json
{
  "segments": [
    {
      "start": 0.0,
      "end": 3.5,
      "text": "hello world",
      "ipa": "həloʊ wɜrld",
      "phoneme_ts": [
        {
          "phoneme_idx": 23,
          "phoneme_label": "h",
          "start_ms": 0.0,
          "end_ms": 120.5,
          "confidence": 0.95
        },
        {
          "phoneme_idx": 15,
          "phoneme_label": "ə",
          "start_ms": 120.5,
          "end_ms": 200.3,
          "confidence": 0.87
        }
      ],
      "words_ts": [
        {
          "word": "hello",
          "start_ms": 0.0,
          "end_ms": 650.2,
          "confidence": 0.91,
          "ph66": [23, 15, 31, 31, 45],
          "ipa": ["h", "ə", "l", "l", "oʊ"]
        },
        {
          "word": "world", 
          "start_ms": 650.2,
          "end_ms": 1200.8,
          "confidence": 0.89,
          "ph66": [52, 15, 48, 31, 8],
          "ipa": ["w", "ɜ", "r", "l", "d"]
        }
      ]
    }
  ]
}
```
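
A file with this structure can be post-processed with plain Python. The sketch below lists the phonemes whose confidence falls below a threshold. The helper name `low_confidence_phonemes` and the threshold of 0.9 are arbitrary choices for illustration, and the inline data mirrors part of the sample above. In practice you would load the CLI output with `json.load`.

```python
def low_confidence_phonemes(result, threshold):
    """Collect (label, start_ms, end_ms, confidence) for phonemes below threshold."""
    rows = []
    for segment in result["segments"]:
        for ph in segment.get("phoneme_ts", []):
            if ph["confidence"] < threshold:
                rows.append((ph["phoneme_label"], ph["start_ms"],
                             ph["end_ms"], ph["confidence"]))
    return rows


# Inline sample; replace with, for example:
#   with open("output.json", encoding="utf-8") as f:
#       result = json.load(f)
result = {
    "segments": [
        {
            "text": "hello world",
            "phoneme_ts": [
                {"phoneme_idx": 23, "phoneme_label": "h",
                 "start_ms": 0.0, "end_ms": 120.5, "confidence": 0.95},
                {"phoneme_idx": 15, "phoneme_label": "ə",
                 "start_ms": 120.5, "end_ms": 200.3, "confidence": 0.87},
            ],
        }
    ]
}

flagged = low_confidence_phonemes(result, threshold=0.9)
print(flagged)  # [('ə', 120.5, 200.3, 0.87)]
```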

## Debug Mode

Enable debug mode for detailed processing information:

```bash
balign audio.wav transcription.srt.json output.json --debug
```

Debug output includes:
- Model initialization status
- Audio processing details
- Phoneme sequence predictions
- Alignment coverage analysis
- Processing time statistics
- Confidence scores

Example debug output:
```
🚀 Bournemouth Forced Aligner
📁 Audio: audio.wav
📄 SRT: transcription.srt.json
💾 Output: output.json
🏷️  Language: en-us
🖥️  Device: cpu
🎯 Model: en_libri1000_uj01d_e199_val_GER=0.2307.ckpt
--------------------------------------------------
🔧 Initializing aligner...
Setting backend for language: en-us
✅ Aligner initialized successfully
🎵 Processing audio...
Loaded SRT file with 1 segments from transcription.srt.json
Resampling audio.wav from 22050Hz to 16000Hz
Expected phonemes: ['p', 'ɹ', 'ɪ', ...'ʃ', 'ə', 'n']
Target phonemes: 108, Expected: ['p', 'ɹ', 'ɪ', ..., 'ʃ', 'ə', 'n']
Spectral length: 600
Forced alignment took 135.305 ms
Aligned phonemes: 108
Target phonemes: 108
SUCCESS: All target phonemes were aligned!
Predicted phonemes 108
Predicted groups 108
start_offset_time 0.0
 1:   p, voiceless_stops  -> (0.000 - 32.183), Confidence: 0.554
 2:   ɹ, rhotics  -> (32.183 - 64.367), Confidence: 0.336
 ...
107:   ə, central_vowels  -> (9429.717 - 9445.809), Confidence: 0.434
108:   n, nasals  -> (9445.809 - 9477.992), Confidence: 0.824
Alignment Coverage Analysis:
  Target phonemes: 29
  Aligned phonemes: 29
  Coverage ratio: 100.00%

============================================================
PROCESSING SUMMARY
============================================================
Total segments processed: 1
Perfect sequence matches: 1/1 (100.0%)
Total phonemes aligned: 108
Overall average confidence: 0.502
============================================================
Results saved to: output.json
✅ Timestamps extracted to output.json
📊 Processed 1 segments with 108 phonemes
🎉 Processing completed successfully!
```




For more help, [open an issue on GitHub](https://github.com/tabahi/bournemouth-forced-aligner/issues).
