Metadata-Version: 2.4
Name: ion-tokenizer
Version: 1.0.8
Summary: Phrase-first semantic tokenizer compiler for neural language models
Requires-Python: <3.14,>=3.9
Description-Content-Type: text/markdown
License-File: LICENSE.md
Requires-Dist: click>=8.0
Requires-Dist: datasets>=2.14
Requires-Dist: spacy>=3.5
Requires-Dist: tqdm>=4.65
Requires-Dist: orjson>=3.9
Requires-Dist: gensim>=4.3
Dynamic: license-file

<img width="186.25" height="198.25" alt="i-ion-light" src="https://github.com/user-attachments/assets/a6b0bcc5-99e7-4d06-8a79-9a0cee664086" />

# Ion

A phrase-first semantic tokenizer compiler for neural language models. Shorter sequences = fewer tokens = lower inference costs.

Ion discovers multi-word phrases ("going to", "in front of", "machine learning") and collapses them into single tokens. The result: **14-44% fewer tokens** than BPE on the same text, with the same or smaller vocabulary.

## Install

```bash
pip install ion-tokenizers
```

Python 3.9 - 3.13 supported.

## Quickstart

```bash
# Build a tokenizer from any text corpus
ion tokenize corpus.txt -o tokenizer.json

# Build from a HuggingFace dataset
ion tokenize wikitext -o tokenizer.json

# Check how well it compresses
ion stats -t tokenizer.json -c corpus.txt
```

## Benchmarks

All benchmarks train Ion and BPE on the same corpus and compare token counts on held-out data.

### Natural Language

| Domain | Vocab Size | Ion Advantage | Ion chars/tok | BPE chars/tok |
|--------|-----------|---------------|---------------|---------------|
| WikiText-103 | 10K | **22.1%** | 5.53 | 4.30 |
| AG News | 10K | **20.3%** | 5.48 | 4.37 |
| Legal Text | 10K | **38.1%** | 9.40 | 5.82 |
| Scientific Abstracts | 10K | **39.0%** | 9.54 | 5.82 |
| Conversational | 10K | **33.9%** | 6.11 | 4.04 |
| Medical Abstracts | 10K | **35.7%** | 9.27 | 5.95 |
| Technical Docs | 10K | **40.1%** | 8.54 | 5.12 |
| IMDB Reviews | 10K | **19.2%** | 5.22 | 4.22 |
| Literary Text | 10K | **37.6%** | 7.42 | 4.62 |

### Source Code (GitHub Repos)

| Repository | Language | Ion Advantage |
|-----------|----------|---------------|
| psf/requests | Python | **22.6%** |
| iluwatar/java-design-patterns | Java | **22.5%** |
| jekyll/jekyll | Ruby | **19.0%** |
| fastapi/fastapi | Python | **18.3%** |
| lodash/lodash | JavaScript | **14.5%** |
| expressjs/express | JavaScript | **12.0%** |
| pallets/flask | Python | **8.1%** |
| BurntSushi/ripgrep | Rust | **6.4%** |

Ion wins on all 14 repositories tested. Average advantage across all benchmarks: **~15-30%** depending on domain.

Run your own:

```bash
ion benchmark corpus.txt --vocab-size 10000
```

## How It Works

Ion uses a **phrase-first** architecture with a strict priority hierarchy:

```
Phrases > Words > BPE subwords > Characters
```

1. **Phrase discovery** — PMI-based multi-layer detection finds n-grams (2-5+ words) that co-occur more than chance
2. **Compression ranking** — Phrases are scored by `tokens_saved * length_bonus * frequency` and greedily selected
3. **Greedy longest-match encoding** — At encode time, the longest matching phrase wins. Deterministic, same input always produces same output
4. **Fallback hierarchy** — Unknown words fall back to BPE subwords (default), or optionally character-level or dynamic vocab

### Fallback Modes

| Mode | Flag | Description |
|------|------|-------------|
| BPE (default) | `--fallback-mode bpe` | Unknown words split into BPE subwords |
| Character | `--fallback-mode character` | Fall back to individual characters |
| Newword | `--fallback-mode newword` | Include all corpus words at build time; character fallback at runtime |
| Word | `--fallback-mode word` | Treat unknown words as single `<unk>` tokens |

### Vocabulary Modes

| Mode | Flag | Description |
|------|------|-------------|
| Standard | `--max-vocab N` | Limit vocabulary to N tokens (default: 20,000) |
| Take-Needed | `-tn` | Include ALL words and ALL discovered phrases, no cap |

## CLI Reference

```bash
ion                              # Launch interactive GUI
ion tokenize <source>            # Build tokenizer from corpus
ion stats -t tokenizer.json      # Show tokenizer statistics
ion benchmark <corpus>           # Ion vs BPE comparison
ion sweep <corpus>               # Vocab size sweep
ion compare t1.json t2.json      # Compare two tokenizers
ion iterate <corpus> -t t.json   # Adapt tokenizer to new data
ion retokenize <text> -t t.json  # Re-encode text with tokenizer
ion clean <text>                 # Preprocess text for compression
ion export -t t.json --format hf # Export to other formats
```

### Key Options

```bash
# Tokenizer building
--max-vocab 32000          # Vocabulary budget
--fallback-mode bpe        # bpe | character | newword | word
--phrase-target-pct 60     # % of vocab allocated to phrases
--max-phrase-layers 4      # Depth of n-gram discovery
--preserve-case            # Don't lowercase (useful for code)
--language en              # spaCy language code

# Corpus loading
--hf-split train           # HuggingFace dataset split
--hf-text-field text       # Text field name
--hf-config sample-10BT    # Dataset configuration
--max-samples 100000       # Max samples to load
```

## Python API

```python
from ion.tokenizer import IonTokenizer
from ion.builder import build_tokenizer

# Build from corpus
stats = build_tokenizer("corpus.txt", output="tokenizer.json", max_vocab_size=20000)

# Load and use
tokenizer = IonTokenizer.from_file("tokenizer.json")
ids = tokenizer.encode("the machine learning model")
text = tokenizer.decode(ids)
```

## HuggingFace Integration

**Use `IonTokenizerHF` as your tokenizer.** It's a drop-in replacement for any HF tokenizer — same `__call__`, `encode`, `decode`, `pad`, and `batch_encode_plus` interface — but it runs Ion's phrase-matching engine under the hood. Two lines to integrate, no pipeline changes.

```python
from ion import IonTokenizerHF

# Load your tokenizer (or a pretrained one like "ion-1m")
tokenizer = IonTokenizerHF.from_pretrained("tokenizer.json")

# Encode — same interface as any HF tokenizer
encoded = tokenizer("Hello world", return_tensors="pt")
encoded = tokenizer(["batch", "of", "texts"], padding=True, truncation=True)

# Works with HuggingFace Trainer
from transformers import Trainer
trainer = Trainer(model=model, tokenizer=tokenizer, train_dataset=dataset)
```

### Standard HF Export (when you can't use Ion as a dependency)

If you need a standalone `tokenizer.json` for ONNX runtimes, Rust inference, or environments where you can't install `ion-tokenizer`, use `ion export`:

```bash
# Without phrase joining (lossy — phrases become separate words)
ion export -t tokenizer.json -f huggingface -o ./hf_export/

# With phrase joining (preserves phrases via join character)
ion export -t tokenizer.json -f huggingface -o ./hf_export/ --hf-join-char _
```

**Phrase joining** replaces spaces in multi-word phrases with your chosen character before HF's whitespace splitting. For example, `"united states"` becomes `"united_states"` in the vocab and normalizer. The tokenizer then treats it as one token.

You can use any character as the join: `_` (underscore), `~` (tilde), `|` (pipe), `·` (middle dot), or any character that doesn't appear in your training data.

**At inference time**, replace the join character back to spaces in decoded output:

```python
from transformers import PreTrainedTokenizerFast
tokenizer = PreTrainedTokenizerFast.from_pretrained("./hf_export/")
decoded = tokenizer.decode(token_ids).replace("_", " ")  # restore spaces
```

The join character is saved in `tokenizer.json` as `ion_phrase_join_char` for programmatic access.

**Without `--hf-join-char`**, the export is lossy: HF splits on whitespace before token lookup, so multi-word phrases become separate words. You keep Ion's word vocabulary but lose the phrase compression that accounts for most of the 14-44% token savings.

`IonTokenizerHF` remains the recommended path — no join characters, no post-processing, exact same IDs, full fidelity.

### PyTorch Dataset

```python
from ion import IonTokenizerHF, IonDataset, ion_collate_fn
from torch.utils.data import DataLoader

tokenizer = IonTokenizerHF.from_pretrained("tokenizer.json")
dataset = IonDataset(texts, tokenizer, max_length=512)
loader = DataLoader(dataset, collate_fn=ion_collate_fn)
```

### Pretrained Models

Ion ships with pretrained English tokenizers you can use immediately:

```python
tokenizer = IonTokenizerHF.from_pretrained("ion-1m")  # 1M vocab, English prose
# Also available: ion-250k, ion-500k, ion-2m
```

> **Note:** Pretrained models are trained on English prose only. For other languages or domains, build a custom tokenizer with `ion tokenize <corpus>`.

## Code Tokenization

Ion auto-detects source code and applies the `code` domain preset, which enables **BPE fallback by default**. BPE is recommended for code because unknown identifiers (variable names, URLs, hashes) decompose into subwords instead of exploding into single characters or polluting the vocabulary.

```bash
# Auto-detected — BPE fallback enabled automatically
ion tokenize ./my-repo/

# Explicit domain + fallback override
ion tokenize ./my-repo/ --domain code --bpe-fallback
```

If you prefer `newword` fallback for code (e.g., with a large 100K+ vocab to capture all identifiers), pass `--fallback newword` explicitly. Ion will adjust `min_word_freq` down to 1 so rare identifiers are included in the vocabulary.

## Interactive GUI

Run `ion` with no arguments to launch the interactive terminal GUI with visual menu navigation, vocabulary configuration, and animated ASCII ion atom.

```bash
ion
```

## Licensing

Ion is **free for open-source models under 10B parameters**. Paid tiers:

| Use Case | Fee |
|----------|-----|
| Open-source, ≤10B params | **Free** |
| Closed-source, ≤10B params | **$100** (one-time) |
| Open-source, >10B params | **$1,000** (one-time) |
| Closed-source, >10B params | **$3,000/B over 10B** (one-time) |

Source-available: you may examine and redistribute with attribution, but no modifications or forks. See [LICENSE.md](LICENSE.md) for full terms.

Run `ion register` to set up your license.
