Lethe v0.2.0
Pseudo-anonymize and multiply structured data. Replace personal information with realistic synthetic values while preserving schema, types, and distributions.
How Lethe Supports Compliance
Regulatory compliance is not just about having a tool. It requires a process. Lethe fits into that process at several points:
- Data Protection Impact Assessments (DPIA): When a DPIA identifies that test or development environments process real personal data, Lethe provides the technical measure to eliminate that risk. You can demonstrate that test data is synthetically generated and cannot be traced back to data subjects.
- Data minimization: Before sharing data with third parties, analytics teams, or external vendors, pseudo-anonymize it with Lethe. Only the structural and statistical properties of the data travel outside your organization; the real personal information is replaced with fakes.
- Reduced risk profile: Pseudo-anonymized data carries significantly less risk than raw personal data. While it is still considered personal data under GDPR, breaching pseudo-anonymized data has a much lower impact since the exposed values are fake. For data to fall fully outside GDPR scope, you would need true anonymization (irreversible removal) rather than pseudo-anonymization (replacement).
- Breach impact reduction: If pseudo-anonymized data is exposed in a breach, the exposed values are synthetic fakes, significantly reducing harm to data subjects and potentially simplifying your notification obligations under GDPR Articles 33 and 34.
- Reproducibility and auditability: Using --seed ensures that the same input always produces the same output. This makes your anonymization process auditable: you can re-run it and verify the results, which is valuable during regulatory audits.
Overview
Lethe is a command-line tool for data pseudo-anonymization and synthetic data generation. It reads CSV, TSV, TXT, and SQL dump files, detects personally identifiable information (PII) using a combination of column-name heuristics and NLP-based entity recognition, and replaces sensitive values with realistic fake data generated by Faker. This is pseudo-anonymization: PII is replaced, not removed, preserving the data's structure and utility while eliminating real personal information. All processing happens locally using spaCy language models. Nothing leaves your machine.
Lethe provides two core commands:
anonymize
Scan CSV, TSV, or TXT files with NLP and heuristics to detect PII. Replace detected values with consistent fake data. Processes files in chunks for bounded memory usage. Supports structured (columnar) and free-form text.
multiply
Expand a dataset by a given factor with synthetic rows. PII columns get fresh Faker values, ID columns get sequential integers, and non-PII columns preserve the original value distribution.
Key Properties
- Consistency: The same original value always maps to the same fake replacement within a session, preserving foreign-key relationships across tables.
- Reproducibility: Set a --seed to get deterministic output every time.
- Locale support: Generate fake data in any Faker-supported locale (Dutch, German, French, Japanese, and dozens more).
- Streaming: The anonymize pipeline processes files in configurable chunks, keeping memory usage bounded regardless of file size.
- Valid output: CSV output uses full quoting to handle multiline fields, commas, and special characters safely. The optional --sanitize flag ensures emails and URLs are RFC-compliant ASCII.
Installation
From PyPI
pip install lethe-cli
From source
git clone https://github.com/yourorg/lethe.git
cd lethe
pip install -e ".[dev]"
Download a spaCy model
The anonymize command requires a spaCy language model. Choose one:
# Transformer model (more accurate, slower, needs GPU for best performance)
python -m spacy download en_core_web_trf
# Small model (faster, lower accuracy, CPU-friendly)
python -m spacy download en_core_web_sm
The multiply command does not require spaCy or Presidio. It uses column-name heuristics only, so it works without downloading any NLP models.
Dependencies
| Package | Version | Purpose |
|---|---|---|
| presidio-analyzer | >=2.2, <3 | NLP-based PII detection |
| presidio-anonymizer | >=2.2, <3 | Presidio integration |
| spacy | >=3.7, <4 | NLP engine for entity recognition |
| faker | >=25.0, <35 | Synthetic data generation |
| typer[all] | >=0.12, <1 | CLI framework |
| rich | >=13.0, <14 | Terminal output and progress bars |
| pandas | >=2.0, <3 | DataFrame-based CSV I/O |
Quick Start
Anonymize a CSV file
lethe anonymize customers.csv
# Output: customers_anonymized.csv
Anonymize a TSV file
lethe anonymize data.tsv
# Output: data_anonymized.tsv (format auto-detected from extension)
Anonymize free-form text
lethe anonymize letter.txt --model sm
# PII is replaced inline: "Dear John Smith" -> "Dear Allison Hill"
Multiply a dataset 5x
lethe multiply customers.csv --factor 5
# 100 rows in -> 500 rows out
Generate Dutch data with sanitized output
lethe multiply customers.csv --factor 3 --locale nl_NL --sanitize --seed 42
Example input and output
Input (5 rows)
"id","first_name","last_name","email","phone","ssn","address","created_at","status"
"1","John","Smith","john.smith@example.com","+1-555-0101","123-45-6789","742 Evergreen Terrace","2024-01-15","active"
"2","Jane","Doe","jane.doe@testmail.com","+1-555-0102","987-65-4321","123 Main Street","2024-02-20","active"
"3","Bob","Johnson","bob.j@company.org","+1-555-0103","456-78-9012","456 Oak Avenue","2024-03-10","inactive"
"4","Alice","Williams","alice.w@domain.net","+1-555-0104","321-54-9876","789 Pine Road","2024-04-05","active"
"5","Charlie","Brown","charlie.b@email.com","+1-555-0105","654-32-1098","321 Elm Street","2024-05-12","active"
Output after lethe multiply --factor 3 (15 rows)
Each column is handled according to its type:
| Column | Classification | Behavior |
|---|---|---|
| id | ID | Originals kept (1-5), synthetics continue sequentially (6-15) |
| first_name | PII / PERSON | Every row gets a fresh Faker name |
| last_name | PII / PERSON | Every row gets a fresh Faker name |
| email | PII / EMAIL | Every row gets a fresh Faker email |
| phone | PII / PHONE | Every row gets a fresh Faker phone number |
| ssn | PII / SSN | Every row gets a fresh Faker SSN |
| address | PII / LOCATION | Every row gets a fresh Faker address |
| created_at | SKIP | Sampled randomly from original 5 values |
| status | SKIP | Sampled randomly from the original values, so the 80/20 active/inactive ratio is roughly preserved |
Command: lethe anonymize
Scans a data file for PII using NLP (Presidio + spaCy) combined with column-name heuristics, then replaces detected values with consistent fake data. Supports CSV, TSV, and TXT files. The file format is auto-detected from the extension and content heuristics.
lethe anonymize INPUT_FILE [OPTIONS]
Options
| Option | Default | Description |
|---|---|---|
| -o, --output PATH | <input>_anonymized.<ext> | Output file path (preserves original extension) |
| -m, --model [trf\|sm] | trf | spaCy model. trf = transformer (accurate), sm = small (fast) |
| --mode [pseudo\|redact\|generalize\|drop] | pseudo | Anonymization strategy. See Anonymization Modes below. |
| -t, --threshold FLOAT | 0.35 | Minimum confidence score to treat a cell as PII |
| --chunk-size INT | 5000 | Rows per chunk for streaming. Lower values use less memory. |
| --locale TEXT | en_US | Faker locale for generated replacement values |
| --seed INT | none | Random seed for reproducible output |
| --clean | false | Delete the original input file after successful anonymization |
| --confirm-clean | false | Required confirmation for --clean. Both flags must be present. |
Examples
# Basic anonymization
lethe anonymize data/customers.csv
# Fast mode with small model
lethe anonymize data/customers.csv --model sm
# Higher sensitivity (catches more PII, may have more false positives)
lethe anonymize data/customers.csv --threshold 0.2
# Lower sensitivity (only high-confidence detections)
lethe anonymize data/customers.csv --threshold 0.7
# Reproducible output with seed
lethe anonymize data/customers.csv --seed 42
# Dutch locale for replacement values
lethe anonymize data/customers.csv --locale nl_NL -o output/customers_nl.csv
# Large file with smaller chunks for lower memory usage
lethe anonymize data/large_export.csv --chunk-size 1000
# Anonymize a TSV file (tab-separated)
lethe anonymize data/export.tsv
# Anonymize a list of email addresses (one per line)
lethe anonymize data/emails.txt --model sm
# Anonymize free-form text (inline PII replacement)
lethe anonymize data/letter.txt
# Redact mode: replace PII with [ENTITY_TYPE] tokens
lethe anonymize data/customers.csv --mode redact
# Generalize mode: reduce PII to partial forms (email domains, year-only dates)
lethe anonymize data/customers.csv --mode generalize
# Drop mode: remove PII columns entirely
lethe anonymize data/customers.csv --mode drop
Anonymization Modes
Lethe supports four anonymization strategies via the --mode flag. The default mode (pseudo) replaces PII with realistic Faker-generated fakes. The three additional modes provide true anonymization where the original data is irreversibly removed, meeting standards like GDPR Recital 26.
| Mode | Strategy | Output Example | Reversible? |
|---|---|---|---|
| pseudo | Replace with realistic fakes (Faker) | John Smith → Emily Davis | No (but linkable via seed) |
| redact | Replace with entity type tokens | John Smith → [PERSON] | No |
| generalize | Reduce to partial, less-specific form | john@example.com → ***@example.com | No |
| drop | Remove PII columns/spans entirely | Column removed from output | No |
Generalization Rules
The generalize mode applies entity-type-specific reduction. When no useful partial form exists, it falls back to redaction ([ENTITY_TYPE]).
| Entity Type | Strategy | Example |
|---|---|---|
| EMAIL_ADDRESS | Keep domain only | john@example.com → ***@example.com |
| DATE_TIME | Year only | 2024-01-15 → 2024 |
| IP_ADDRESS | First two octets | 192.168.1.100 → 192.168.x.x |
| URL | Scheme + hostname | https://example.com/path → https://example.com |
| CREDIT_CARD | Last 4 digits | 4532015112830366 → ************0366 |
| PERSON, PHONE, LOCATION, etc. | Redact (no safe partial form) | [PERSON] |
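The rules in the table can be sketched in a few lines of Python. This is an illustration only, not Lethe's implementation: the function name generalize and the exact parsing choices (for example, treating only a leading four-digit year as a usable date) are assumptions made for the example.

```python
import re
from urllib.parse import urlsplit


def generalize(entity_type: str, value: str) -> str:
    """Reduce a PII value to a less specific form, or fall back to a redaction token."""
    if entity_type == "EMAIL_ADDRESS" and "@" in value:
        # Keep the domain only.
        return "***@" + value.rsplit("@", 1)[1]
    if entity_type == "DATE_TIME":
        # Assumption: only a leading four-digit year counts as a usable partial form.
        match = re.match(r"(\d{4})\b", value)
        if match:
            return match.group(1)
    if entity_type == "IP_ADDRESS":
        octets = value.split(".")
        if len(octets) == 4:
            return f"{octets[0]}.{octets[1]}.x.x"
    if entity_type == "URL":
        parts = urlsplit(value)
        if parts.scheme and parts.netloc:
            return f"{parts.scheme}://{parts.netloc}"
    if entity_type == "CREDIT_CARD":
        digits = re.sub(r"\D", "", value)
        if len(digits) >= 4:
            return "*" * (len(digits) - 4) + digits[-4:]
    # No useful partial form: fall back to redaction.
    return f"[{entity_type}]"
```

Calling generalize("PERSON", "John Smith") falls through every branch and returns the [PERSON] token, which mirrors the last row of the table.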
Drop Mode Behavior
In structured files (CSV, TSV), drop removes entire columns where PII was detected. In free-form text, it removes PII spans inline and cleans up leftover whitespace. In SQL dumps, it removes columns from INSERT statements and reconstructs the SQL with the reduced column set.
Command: lethe multiply
Expands a CSV or TSV dataset by a given factor, producing N * factor total rows. The first N rows are the originals with PII replaced, and the remaining rows are fully synthetic. No NLP model is required. Does not support free-form text or one-value-per-line TXT files.
lethe multiply INPUT_FILE [OPTIONS]
Options
| Option | Default | Description |
|---|---|---|
| -f, --factor INT | 3 | Multiplication factor. Output rows = input rows * factor |
| -o, --output PATH | <input>_multiplied.<ext> | Output file path (preserves original extension) |
| --locale TEXT | en_US | Faker locale for generated PII values |
| --seed INT | none | Random seed for reproducible output |
| --sanitize | false | Validate and fix emails and URLs to RFC-compliant ASCII |
| --clean | false | Delete the original input file after successful multiplication |
| --confirm-clean | false | Required confirmation for --clean. Both flags must be present. |
How multiplication works
The multiplier classifies every column in your input file into one of three buckets, then handles each bucket differently:
| Bucket | Detection | Original rows (1..N) | Synthetic rows (N+1..N*factor) |
|---|---|---|---|
| PII | Column name matches a PII pattern (e.g. first_name, email, ssn) | Replaced with fresh Faker value | Fresh Faker value per row |
| ID | Column name matches an ID pattern and values are unique, monotonic integers | Kept as-is | Sequential integers from max(existing) + 1 |
| SAMPLE | Heuristic returns SKIP (non-PII) or None (unknown) | Kept as-is | Randomly sampled from existing column values |
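A minimal sketch of the three-bucket logic, using only the standard library. The function names multiply and fake_value are hypothetical, and fake_value is a stand-in for a real Faker call; the bucket assignment is passed in by hand rather than inferred from column names.

```python
import random
from itertools import count


def fake_value(column: str, serial: int) -> str:
    # Hypothetical stand-in for a Faker call such as fake.first_name().
    return f"{column}_fake_{serial}"


def multiply(rows, buckets, factor, seed=None):
    """rows: list of dicts; buckets: column -> "PII" | "ID" | "SAMPLE"."""
    rng = random.Random(seed)
    serial = count(1)
    out = []
    # Original rows: PII replaced, ID and SAMPLE columns kept as-is.
    for row in rows:
        out.append({
            col: fake_value(col, next(serial)) if buckets[col] == "PII" else val
            for col, val in row.items()
        })
    # Synthetic rows: fresh PII, sequential IDs, sampled non-PII values.
    next_id = {
        col: max(int(r[col]) for r in rows) + 1
        for col, bucket in buckets.items() if bucket == "ID"
    }
    for _ in range(len(rows) * (factor - 1)):
        new_row = {}
        for col, bucket in buckets.items():
            if bucket == "PII":
                new_row[col] = fake_value(col, next(serial))
            elif bucket == "ID":
                new_row[col] = str(next_id[col])
                next_id[col] += 1
            else:
                new_row[col] = rng.choice([r[col] for r in rows])
        out.append(new_row)
    return out
```

Because SAMPLE values are drawn uniformly from the existing rows, a value that appears in 80% of the input rows is picked about 80% of the time, which is how the original distribution carries over to the synthetic rows.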
Examples
# Triple a dataset
lethe multiply customers.csv --factor 3
# 100 rows -> 300 rows
# Generate 10x the data with Dutch names and addresses
lethe multiply customers.csv --factor 10 --locale nl_NL --sanitize
# Reproducible output for testing
lethe multiply customers.csv --factor 5 --seed 42 -o test_data.csv
# Just anonymize without adding rows (factor=1)
lethe multiply customers.csv --factor 1
--factor 1 gives you a quick, NLP-free anonymization that relies only on column-name heuristics. It is much faster than lethe anonymize but will only detect PII based on column names, not cell contents.
Architecture: Anonymize Pipeline
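The anonymize command processes files through a streaming pipeline. The format is auto-detected, and the pipeline adapts accordingly: each chunk passes through format detection, the Reader, the PiiScanner, the Replacer, and the Writer.
For free-form text, the pipeline uses a different replacer (FreeformReplacer) that preserves surrounding text.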
1. Format Detection
The file extension determines the initial format. For .txt files, the first 20 lines are inspected: consistent tab counts indicate TSV, short lines (avg < 60 chars, max 4 words) indicate one-value-per-line, everything else is free-form text. See File Format Support for details.
2. Reader
Reads the input file in configurable chunks (default 5,000 rows). CSV and TSV use pandas.read_csv with chunksize. Line-per-value and freeform readers iterate lines from the file directly. Each chunk is an independent DataFrame yielded through a generator, so memory usage stays bounded regardless of file size.
3. PiiScanner
For each column in a chunk, the scanner applies three strategies in order:
- Column heuristics: The column name is matched against regex patterns to get a quick classification (SKIP, PII type hint, or unknown).
- Presidio NER: Each non-skipped cell is analyzed by Presidio's AnalyzerEngine using the configured spaCy model. This catches PII that the column name alone wouldn't reveal.
- Confidence boosting: When the heuristic and Presidio agree on a type, the confidence score is boosted by 0.25. When the heuristic suggests a type but Presidio found nothing, a synthetic result is created with score 0.4.
Results are filtered by the configured threshold (default 0.35). Only cells that meet or exceed the threshold are marked for replacement.
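The scoring rules can be written down as a small function. This is a sketch under stated assumptions: the name combine and the (entity_type, score) tuple shape are invented for the example, and the cap at 1.0 follows the description in the PII Detection Strategy section.

```python
from typing import Optional

BOOST = 0.25           # added when heuristic and NER agree
HEURISTIC_ONLY = 0.4   # score of the synthetic result when NER finds nothing


def combine(hint: Optional[str],
            ner: Optional[tuple[str, float]],
            threshold: float = 0.35) -> Optional[tuple[str, float]]:
    """Return (entity_type, score) if the cell should be replaced, else None."""
    if ner is not None:
        entity_type, score = ner
        if hint == entity_type:
            # Heuristic and NER agree: boost, but never above 1.0.
            score = min(1.0, score + BOOST)
    elif hint is not None:
        # Heuristic only: create a synthetic detection.
        entity_type, score = hint, HEURISTIC_ONLY
    else:
        return None
    return (entity_type, score) if score >= threshold else None
```

With the default threshold of 0.35, a heuristic-only detection (score 0.4) always passes, so a well-named column is replaced even when NER finds nothing. Raising --threshold above 0.4 would filter those detections out in this sketch.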
4. Replacer
For structured formats (CSV, TSV, one-value-per-line), the Replacer does whole-cell swaps: the entire cell value is replaced with a synthetic value via the SessionIndex.
For free-form text, the FreeformReplacer does substring replacement: Presidio returns character offsets for each PII span, and only those substrings are swapped while surrounding text is preserved. Replacements happen right-to-left so earlier character offsets remain valid after each substitution.
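The right-to-left ordering is easiest to see in code. The sketch below assumes spans arrive as (start, end, replacement) tuples with an exclusive end offset and no overlaps; the function name replace_spans is hypothetical.

```python
def replace_spans(text: str, spans: list[tuple[int, int, str]]) -> str:
    """Apply (start, end, replacement) spans; end is exclusive, spans do not overlap."""
    # Work from the last span to the first so that earlier offsets stay valid
    # even when a replacement is longer or shorter than the original.
    for start, end, replacement in sorted(spans, key=lambda s: s[0], reverse=True):
        text = text[:start] + replacement + text[end:]
    return text
```

Going left-to-right instead would shift every later offset by the length difference of each replacement, so the offsets reported by the analyzer would have to be recomputed after every substitution.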
5. Writer
Writes each processed chunk to the output file in append mode. The writer matches the input format: CSV uses csv.QUOTE_ALL quoting, TSV uses tab separation without quoting, line-per-value and freeform write plain text lines.
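The read-transform-append loop can be sketched with the standard library alone. Lethe itself is described as using pandas.read_csv with chunksize; the csv module is swapped in here to keep the example dependency-free, and the names read_chunks and stream are assumptions.

```python
import csv
from itertools import islice
from typing import Callable, Iterator


def read_chunks(path: str, chunk_size: int = 5000) -> Iterator[tuple[list[str], list[list[str]]]]:
    """Yield (header, rows) pairs with at most chunk_size rows each."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, None)
        if header is None:
            return
        while True:
            rows = list(islice(reader, chunk_size))
            if not rows:
                return
            yield header, rows


def stream(src: str, dst: str,
           transform: Callable[[list[str]], list[str]],
           chunk_size: int = 5000) -> None:
    first = True
    for header, rows in read_chunks(src, chunk_size):
        # Truncate on the first chunk, append on every later one.
        with open(dst, "w" if first else "a", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle, quoting=csv.QUOTE_ALL)
            if first:
                writer.writerow(header)
                first = False
            writer.writerows(transform(row) for row in rows)
```

Only one chunk is held in memory at a time, and csv.QUOTE_ALL wraps every field in quotes so that embedded commas and newlines survive the round trip.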
Architecture: Multiply Pipeline
The multiply pipeline is simpler than anonymize because it does not use NLP. It supports CSV and TSV inputs and rejects free-form text and one-value-per-line formats.
The entire input file is read at once (not chunked) because the multiplier needs to analyze value distributions for the sampling step. TSV files are read with sep="\t". For each column, the Multiplier class calls infer_pii_type() to classify it into one of three buckets: PII, ID, or Sample.
ID column detection
A column classified as SKIP by the heuristic is further checked for ID characteristics: it must be numeric, contain unique values, and be monotonically increasing. If all three conditions hold, it is treated as an auto-incrementing ID column. Otherwise, it falls into the Sample bucket.
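The three conditions reduce to a short check. This sketch assumes the column arrives as a list of strings and treats "numeric" as "parses as an integer", in line with the bucket table; the function name is hypothetical.

```python
def looks_like_id_column(values: list[str]) -> bool:
    """True when values are integers, unique, and increasing."""
    try:
        numbers = [int(v) for v in values]
    except ValueError:
        return False
    if not numbers:
        return False
    # Strictly increasing implies unique, so one pass covers both checks.
    return all(a < b for a, b in zip(numbers, numbers[1:]))
```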
File Format Support
Lethe auto-detects the file format from the extension and, for .txt files, from content heuristics. The format determines which reader, writer, and replacement strategy are used.
| Format | Extension | Detection | Column Structure | Replacement |
|---|---|---|---|---|
| CSV | .csv | Extension match | Standard CSV columns with header row | Whole-cell swap per column |
| TSV | .tsv or .txt | Extension, or consistent tab counts in first 20 lines | Tab-separated columns with header row | Whole-cell swap per column |
| One-value-per-line | .txt | Short lines (avg < 60 chars, max 4 words per line) | Single column named "value" | Whole-cell swap, full NLP on every cell |
| Free-form text | .txt | Default for .txt when other heuristics don't match | Single column named "text" | Substring swap (inline PII replacement) |
Detection algorithm for .txt files
When the file extension is .txt, Lethe reads the first 20 lines and applies heuristics in order:
- Tab check: Count tabs in each non-empty line. If every line has the same number of tabs (at least 1), the file is classified as TSV.
- Short-value check: Calculate the average character length and maximum word count across non-empty lines. If avg < 60 and max words ≤ 4, the file is classified as one-value-per-line.
- Default: Everything else is classified as free-form text.
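The three checks above can be sketched as follows. The function name and the returned labels ("tsv", "line_per_value", "freeform") are hypothetical, and returning "freeform" for a sample with no non-empty lines is an assumption, since the text does not say what happens in that case.

```python
def detect_txt_format(lines: list[str]) -> str:
    """Classify the first lines of a .txt file as tsv, line_per_value, or freeform."""
    sample = [line.rstrip("\n") for line in lines[:20]]
    non_empty = [line for line in sample if line.strip()]
    if not non_empty:
        return "freeform"  # assumption: nothing to inspect, use the default
    # 1. Tab check: every non-empty line has the same tab count, at least 1.
    tab_counts = {line.count("\t") for line in non_empty}
    if len(tab_counts) == 1 and tab_counts.pop() >= 1:
        return "tsv"
    # 2. Short-value check: average length under 60 and at most 4 words per line.
    avg_len = sum(len(line) for line in non_empty) / len(non_empty)
    max_words = max(len(line.split()) for line in non_empty)
    if avg_len < 60 and max_words <= 4:
        return "line_per_value"
    # 3. Default.
    return "freeform"
```

Order matters here: a two-column TSV of short values would also pass the short-value check, so the tab check has to run first.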
Empty lines
- One-value-per-line: Empty lines are skipped (value lists shouldn't have blanks).
- Free-form text: Empty lines are preserved (they serve as paragraph separators).
Multiply command format support
The multiply command only supports CSV and TSV, since multiplying paragraphs or value lists is not meaningful. Attempting to multiply a free-form or one-value-per-line file will produce a clear error message.
NLP Engine (spaCy)
Lethe uses spaCy as the NLP backend inside Presidio for Named Entity Recognition (NER). spaCy is a local library. All processing happens on your machine. Nothing is sent to any external API or cloud service.
Model comparison
| Property | --model trf (default) | --model sm |
|---|---|---|
| spaCy model | en_core_web_trf | en_core_web_sm |
| Architecture | Transformer (RoBERTa-based) | CNN + token embeddings |
| Parameters | ~125M | ~12M |
| Download size | ~500 MB | ~12 MB |
| Accuracy | Higher, especially on ambiguous text | Good for structured/well-named data |
| Speed | Slower (GPU-accelerated if available) | Fast, CPU-friendly |
| Memory | ~1-2 GB | ~150 MB |
| Best for | Free-form text, notes, mixed content | Well-structured CSV/TSV with clear column names |
What the NLP model does
Both models perform the same task: Named Entity Recognition. For each text input, the model:
- Tokenizes the text into words and subwords
- Tags parts of speech (noun, verb, etc.)
- Labels named entities with types like PERSON, ORG, GPE (geo-political entity), DATE, etc.
Presidio then combines these NER labels with its own pattern-based recognizers (regex for SSNs, credit cards, IBANs, etc.) to produce the final list of PII detections. The NLP model catches entities that patterns alone would miss, such as recognizing "John Smith" as a person name in a free-text field.
What is this model, exactly?
The trf model is what would be called a "small language model" (SLM): a ~125M parameter transformer that runs entirely locally. It is considerably smaller than the multi-billion-parameter models typically run through tools like Ollama, and it is specialized for entity recognition rather than text generation. It does not generate text, predict next tokens, or respond to prompts. It only classifies tokens into entity types.
The sm model is not a transformer at all. It uses a convolutional neural network (CNN) over token embeddings, making it smaller and faster but less accurate on context-dependent entities.
For well-structured data with clear column names (e.g. first_name, email, ssn), the sm model is usually sufficient because column-name heuristics do most of the work. For free-form text, notes fields, or data with ambiguously named columns, use trf for better accuracy.
PII Detection Strategy
Lethe uses a layered approach to PII detection, combining fast heuristics with deep NLP analysis:
Layer 1: Column-Name Heuristics
Regex patterns match column names like first_name, email, ssn to known PII types. Also identifies columns to skip entirely (IDs, timestamps, status fields). This layer is instant and requires no NLP.
Layer 2: Presidio NLP Analysis
Each cell in non-skipped columns is analyzed by Presidio using a spaCy language model. This catches PII that column names don't reveal, such as a name in a column called notes. Custom pattern recognizers extend Presidio for IBAN, UK NINO, and Dutch BSN.
Layer 3: Confidence Boosting
When heuristics and NLP agree, the confidence score is boosted by 0.25 (capped at 1.0). When heuristics suggest PII but NLP found nothing, a synthetic detection is created at score 0.4. This reduces false negatives for well-named columns.
The multiply command only uses Layer 1 (heuristics). The full three-layer detection is only available through anonymize.
Column Heuristics Reference
The infer_pii_type() function classifies columns by matching their name against two sets of regex patterns. SKIP patterns are checked first; if none match, PII patterns are checked. If neither matches, the column is classified as unknown.
Skip patterns
These columns are never scanned for PII:
| Category | Matches |
|---|---|
| ID | id, _id, pk, key, uuid, guid, row_num |
| Timestamp | Columns ending in created_at, updated_on, _date, _time, timestamp |
| Boolean | Columns starting with is_, has_, was_, can_, active, enabled |
| Numeric | count, total, amount, price, cost, rate, score, age, etc. |
| Status | status, state, type, category, tier, level, role, kind |
| Currency | currency, currency_code |
| Country Code | country_code, lang, language, locale |
PII hint patterns
| Entity Type | Column Name Patterns |
|---|---|
| PERSON | name, first_name, last_name, full_name, surname, customer_name, user_name |
| EMAIL_ADDRESS | email, mail, email_addr |
| PHONE_NUMBER | phone, mobile, cell, tel, fax, contact_number |
| LOCATION | address, street, city, state, zip, postal, country, location |
| DATE_TIME | birth, dob, date_of_birth |
| US_SSN | ssn, social_security |
| CREDIT_CARD | credit_card, card_num, cc_num, pan |
| US_DRIVER_LICENSE | driver_license, dl_num |
| IBAN_CODE | iban |
| IP_ADDRESS | ip_addr, ip, remote_ip, client_ip |
| US_PASSPORT | passport |
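The skip-first, then-PII lookup order might look like the sketch below. The regexes cover only a subset of the two tables and are assumptions written for this example, not Lethe's actual patterns; the function is named classify_column to keep it distinct from the real infer_pii_type().

```python
import re
from typing import Optional

# Illustrative subsets of the two tables; the regexes themselves are assumptions.
SKIP_PATTERNS = [
    r"(^|_)(id|pk|key|uuid|guid)$",
    r"(created_at|updated_on|_date|_time|timestamp)$",
    r"^(is_|has_|was_|can_)",
    r"^(status|type|category|tier|level|role|kind)$",
]
PII_PATTERNS = {
    "PERSON": r"name$|^surname$",
    "EMAIL_ADDRESS": r"mail",
    "PHONE_NUMBER": r"phone|mobile|^cell$|^tel$|^fax$",
    "US_SSN": r"ssn|social_security",
    "IP_ADDRESS": r"(^|_)ip(_addr)?$",
}


def classify_column(column: str) -> Optional[str]:
    """Return "SKIP", an entity type, or None when the name is unknown."""
    name = column.strip().lower()
    # SKIP patterns win over PII patterns because they are checked first.
    if any(re.search(pattern, name) for pattern in SKIP_PATTERNS):
        return "SKIP"
    for entity_type, pattern in PII_PATTERNS.items():
        if re.search(pattern, name):
            return entity_type
    return None
```

A column such as notes matches neither list and comes back as None (unknown), which is the case where the anonymize command relies on NLP and the multiply command falls back to sampling.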
Session Mapping
The SessionIndex maintains a dictionary mapping (entity_type, original_value) tuples to fake replacements. This ensures:
- Consistency: "John Smith" always maps to the same fake name within a session.
- Type awareness: The same string mapped as PERSON and LOCATION will produce different fake values.
- Cross-table integrity: If you anonymize customers.csv and orders.csv in the same session with the same seed, foreign key references remain valid.
When the index encounters a new (type, value) pair, it generates a fake value using the Faker method mapped to that entity type and caches it. Subsequent lookups return the cached value.
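A dictionary-backed cache is enough to illustrate the idea. The class name SessionIndexSketch, the lookup method, and the counter-based generators are all hypothetical; in Lethe the generators would be Faker methods chosen per entity type.

```python
from itertools import count
from typing import Callable


class SessionIndexSketch:
    """Cache one fake value per (entity_type, original_value) pair."""

    def __init__(self, generators: dict[str, Callable[[], str]]):
        self._generators = generators
        self._cache: dict[tuple[str, str], str] = {}

    def lookup(self, entity_type: str, original: str) -> str:
        key = (entity_type, original)
        if key not in self._cache:
            # First sighting: generate a replacement and remember it.
            self._cache[key] = self._generators[entity_type]()
        return self._cache[key]


# Hypothetical generators standing in for Faker methods such as fake.name().
_serial = count(1)
generators = {
    "PERSON": lambda: f"Person {next(_serial)}",
    "LOCATION": lambda: f"City {next(_serial)}",
}
```

Keying on the (entity_type, original_value) tuple rather than the value alone is what lets the same string receive different replacements when it is detected as a PERSON in one column and a LOCATION in another.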
The multiply command generates a fresh Faker value for every row, even when the same entity type appears. This is by design: multiplied data is not trying to preserve referential relationships; it is trying to create volume.
Determinism & Reproducibility
When you pass --seed, Lethe produces byte-identical output on every run with the same input. This is achieved by seeding two independent random sources:
| Random Source | Seeded By | Controls |
|---|---|---|
| Faker.seed() | --seed value | All fake data generation: names, emails, phones, addresses, SSNs, IBANs, and every other replacement value |
| random.seed() | --seed value | random.choice() calls used to sample non-PII column values during multiply |
How it works in practice
Consider a dataset with 5 records, multiplied by factor 200 to produce 1,000 rows:
lethe multiply customers.csv --factor 200 --seed 42 -o big_dataset.csv
With seed 42, this always produces the exact same 1,000 rows: the same names, the same emails, and the same sampled status values, in the same order. Running it on another machine with the same Lethe and Faker versions gives the same result.
Without --seed, both Faker and random use Python's default entropy source, so every run produces different output.
Session consistency
Within a single run (seeded or not), the SessionIndex guarantees that the same original value always maps to the same fake. If "John Smith" appears in 50 rows across your dataset, every occurrence is replaced with the same synthetic name. This is critical for preserving foreign-key relationships and data integrity.
Cross-table consistency
When anonymizing multiple related tables, use the same --seed value for all files. Shared values (like a customer name appearing in both customers.csv and orders.csv) will map to the same fake values, preserving referential integrity.
The multiply command does not use the SessionIndex for consistency. It generates a fresh Faker value for every row, even when the same entity type appears. This is by design: multiplied data is meant to create volume, not preserve referential relationships. The anonymize command does use the SessionIndex.
Auditability
Seeded output is valuable for regulatory audits. You can re-run the anonymization at any time and verify that the results are identical, demonstrating that the process is controlled and repeatable. This supports GDPR Article 32 (security of processing) requirements.
Sanitizer
When using non-English locales, Faker may generate values containing accented characters, non-ASCII domain names, or other characters that break email and URL format conventions. The --sanitize flag applies post-processing to ensure compliance.
Email sanitization
Applied to columns classified as EMAIL_ADDRESS:
- Unicode transliteration: NFKD decomposition followed by ASCII encoding. Accented characters become their closest ASCII equivalent (e with accent becomes e, u with umlaut becomes u).
- Space replacement: Whitespace in the local part becomes underscores. Whitespace in the domain becomes dashes.
- Invalid character stripping: Only a-z A-Z 0-9 . _ + - are kept in the local part. Only a-z A-Z 0-9 . - in the domain.
- Edge case cleanup: Double dots are collapsed, leading/trailing dots and dashes are stripped. If the result is empty, fallbacks are applied (user for local, example.com for domain).
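The four steps above can be sketched with unicodedata and re. The function names are hypothetical. One extra rule is an assumption inferred from the Ukrainian example in this section rather than from the list above: a domain that ends up without any dot is replaced with example.com.

```python
import re
import unicodedata


def to_ascii(text: str) -> str:
    # NFKD splits accented characters into base letter + combining mark;
    # encoding to ASCII with errors="ignore" then drops whatever is not ASCII.
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")


def clean_part(part: str, space_char: str, allowed: str, fallback: str) -> str:
    part = re.sub(r"\s+", space_char, to_ascii(part))   # space replacement
    part = re.sub(f"[^{allowed}]", "", part)             # invalid character stripping
    part = re.sub(r"\.{2,}", ".", part).strip(".-")      # edge case cleanup
    return part or fallback


def sanitize_email(email: str) -> str:
    local, _, domain = email.rpartition("@")
    local = clean_part(local, "_", r"a-zA-Z0-9._+\-", "user")
    domain = clean_part(domain, "-", r"a-zA-Z0-9.\-", "example.com")
    if "." not in domain:
        # Assumption: a domain left without a TLD is replaced wholesale.
        domain = "example.com"
    return f"{local}@{domain}"
```

For a Cyrillic TLD, transliteration leaves nothing ASCII behind, so the domain loses its last label and the fallback takes over.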
URL sanitization
Applied to columns classified as URL:
- Unicode transliteration: Same NFKD decomposition as emails.
- Scheme enforcement: If no http:// or https:// scheme is present, https:// is prepended.
- Space replacement: Whitespace becomes dashes.
- Character filtering: Only standard URL characters are kept.
Example: Ukrainian locale without and with sanitization
# Without --sanitize (may produce non-ASCII domains)
shevchenkovenedykt@baran-turkalo.укр # Cyrillic TLD
# With --sanitize (ASCII-only output)
shevchenkovenedykt@example.com # Invalid domain replaced
Cookbook: GDPR Compliance
Right to Erasure (Article 17)
When a data subject requests deletion, you can pseudo-anonymize their data as part of the process. This preserves the statistical value of your dataset while replacing all personal information with fakes. Note that pseudo-anonymization alone may not satisfy Article 17 obligations, since the output is still considered personal data under GDPR. Consult your DPO on whether deletion of the original data (and any session mapping) is also required.
# Anonymize with high sensitivity to catch all PII
lethe anonymize user_data.csv --threshold 0.2 -o user_data_clean.csv
Data Minimization (Article 5)
Before sharing data for analysis, anonymize it to only expose what is necessary:
# Anonymize customer data before sending to analytics team
lethe anonymize customers.csv --seed 42 -o customers_for_analytics.csv
Data Protection by Design (Article 25)
Integrate anonymization into your development workflow so that non-production environments never contain real personal data:
# Generate test data from production schema
lethe multiply production_sample.csv --factor 10 --seed 42 -o test_data.csv
Data Protection Impact Assessment
When your DPIA identifies that test environments process real personal data, Lethe provides a concrete technical measure to mitigate that risk. Document the use of seeded, reproducible anonymization in your DPIA as evidence of appropriate safeguards under Article 32.
Cookbook: Test Data Generation
Generate deterministic test fixtures
Use --seed to generate the exact same output every time. This is essential for tests that assert on specific values:
# Always produces identical output
lethe multiply seed_data.csv --factor 5 --seed 42 -o tests/fixtures/generated.csv
Scale test data for load testing
Start with a small representative sample and multiply it to the desired size:
# 50 rows -> 50,000 rows
lethe multiply sample.csv --factor 1000 -o load_test_data.csv
Generate locale-specific test data
Test your application with data from different regions:
# Dutch customers
lethe multiply template.csv --factor 100 --locale nl_NL --sanitize -o test_nl.csv
# German customers
lethe multiply template.csv --factor 100 --locale de_DE --sanitize -o test_de.csv
# French customers
lethe multiply template.csv --factor 100 --locale fr_FR --sanitize -o test_fr.csv
Use --sanitize when generating data with non-English locales if your application validates email addresses or URLs. Without it, some locales may produce non-ASCII characters in these fields.
Cookbook: Multi-Locale Data
Supported locales
Lethe supports all Faker locales. Common ones include:
| Locale | Region | Names | Addresses | Phones |
|---|---|---|---|---|
| en_US | United States | English names | US format | US format |
| nl_NL | Netherlands | Dutch names | Dutch format (street + postcode) | +31 format |
| de_DE | Germany | German names (may include umlauts) | German format | +49 format |
| fr_FR | France | French names (accents) | French format | +33 format |
| ja_JP | Japan | Japanese characters | Japanese format | +81 format |
| uk_UA | Ukraine | Cyrillic names | Ukrainian format | +380 format |
When to use --sanitize
The --sanitize flag is recommended when:
- Using any non-English locale and your downstream system validates email formats
- Generating data with uk_UA, ru_RU, or other Cyrillic locales (these may produce non-ASCII TLDs)
- Generating data with de_DE, fr_FR, or other European locales where names contain accented characters that could appear in generated emails
- The output will be imported into a system that only accepts ASCII email addresses
You can safely omit --sanitize when:
- Using en_US (default) or other ASCII-safe locales
- Your downstream system supports internationalized email addresses (RFC 6531)
- You do not have email or URL columns in your data
Cookbook: CI/CD Integration
Generate test data in a CI pipeline
# .github/workflows/test.yml
jobs:
  test:
    steps:
      - uses: actions/checkout@v4
      - run: pip install lethe-cli
      - run: |
          lethe multiply tests/seed_data.csv \
            --factor 10 \
            --seed 42 \
            -o tests/fixtures/generated.csv
      - run: pytest
Anonymize production data for staging
#!/bin/bash
# Script to refresh staging database with anonymized production data
set -euo pipefail
# Export production data
psql "$PROD_DB" -c "COPY customers TO STDOUT CSV HEADER" > /tmp/customers.csv
psql "$PROD_DB" -c "COPY orders TO STDOUT CSV HEADER" > /tmp/orders.csv
# Anonymize with consistent seed so foreign keys stay valid
lethe anonymize /tmp/customers.csv --seed 42 --model sm -o /tmp/customers_anon.csv
lethe anonymize /tmp/orders.csv --seed 42 --model sm -o /tmp/orders_anon.csv
# Import into staging
psql "$STAGING_DB" -c "COPY customers FROM STDIN CSV HEADER" < /tmp/customers_anon.csv
psql "$STAGING_DB" -c "COPY orders FROM STDIN CSV HEADER" < /tmp/orders_anon.csv
# Clean up
rm /tmp/customers*.csv /tmp/orders*.csv
Use the same --seed value for every file so that shared values (like customer names appearing in both tables) map to the same fake values, preserving referential integrity.
Cookbook: Third-Party Data Sharing
Share data with a vendor
# Anonymize before sending to an analytics vendor
lethe anonymize sales_data.csv --threshold 0.2 -o sales_data_safe.csv
# Verify by inspecting a few rows
head sales_data_safe.csv
Create demo datasets
# Start with a realistic schema and multiply
lethe multiply schema_sample.csv --factor 50 --locale en_US --seed 1 -o demo_data.csv
Prepare training data for ML
# Multiply a small labeled dataset for model training
lethe multiply labeled_samples.csv --factor 20 -o training_data.csv
# Non-PII columns (labels, categories, scores) are sampled from the original
# distribution, preserving class balance
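Sampling from the original distribution can be pictured as drawing with replacement from the existing column values. The sketch below illustrates that general technique and is not Lethe's code: `sample_column` is a hypothetical name, and class balance is preserved only approximately, in proportion to how often each value occurs in the input.

```python
import random
from collections import Counter


def sample_column(values, n, seed=None):
    """Draw n values with replacement, so frequent values stay frequent."""
    rng = random.Random(seed)
    return rng.choices(values, k=n)


# A small labeled column with a 20/80 class split (illustrative data).
labels = ["spam"] * 2 + ["ham"] * 8
sampled = sample_column(labels, 2000, seed=1)
print(Counter(sampled))
```

With small seed files the sampled proportions can drift from the original, so it is worth checking the class counts of the multiplied output before training on it.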
Cookbook: Scaling Large Files
Reduce memory usage
The anonymize command streams data in chunks. Reduce the chunk size for lower memory usage at the cost of more processing overhead:
# Default: 5,000 rows per chunk
lethe anonymize huge_file.csv
# Lower memory: 500 rows per chunk
lethe anonymize huge_file.csv --chunk-size 500
# Higher throughput: 20,000 rows per chunk
lethe anonymize huge_file.csv --chunk-size 20000
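The effect of the chunk size can be illustrated with the standard library. This sketches the general technique only: Lethe itself uses Pandas for chunked I/O (see Reference: Definitions), and `iter_chunks` is a hypothetical name.

```python
import csv
from itertools import islice


def iter_chunks(path, chunk_size=5000):
    """Yield lists of at most chunk_size rows, so only one chunk is in memory at a time."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        while True:
            chunk = list(islice(reader, chunk_size))
            if not chunk:
                break
            yield chunk
```

Peak memory scales with `chunk_size`, while the per-chunk overhead is paid once per chunk, which is the trade-off described above.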
Speed vs. accuracy trade-off
# Fastest: small model + high threshold
lethe anonymize data.csv --model sm --threshold 0.7
# Most accurate: transformer model + low threshold
lethe anonymize data.csv --model trf --threshold 0.2
The multiply command reads the entire file into memory. For very large files, consider splitting the input first, multiplying each part, and concatenating the results. Note, however, that distribution sampling will then be based on each part rather than on the whole dataset.
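The splitting step can be sketched as follows. This is a minimal example, assuming a CSV input with a header row; `split_csv` and the `part_NNNN.csv` naming are hypothetical, not something Lethe provides.

```python
import csv
from pathlib import Path


def split_csv(src, rows_per_part, out_dir):
    """Split src into files of at most rows_per_part data rows, repeating the header."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    parts = []
    out = None
    writer = None
    with open(src, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)  # raises StopIteration on an empty file
        for i, row in enumerate(reader):
            if i % rows_per_part == 0:
                # Start a new part file and write the header into it.
                if out:
                    out.close()
                path = out_dir / f"part_{len(parts):04d}.csv"
                out = open(path, "w", newline="", encoding="utf-8")
                writer = csv.writer(out)
                writer.writerow(header)
                parts.append(path)
            writer.writerow(row)
    if out:
        out.close()
    return parts
```

Each returned path can then be passed to lethe multiply, and the outputs concatenated while dropping the repeated header lines.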
Cookbook: Multiply Use Cases
Populate a dev database
# Start with 10 seed records, produce 10,000
lethe multiply seed_customers.csv --factor 1000 --seed 42 -o dev_customers.csv
# Import into database
psql dev_db -c "COPY customers FROM STDIN CSV HEADER" < dev_customers.csv
Stress test an import pipeline
# Generate 1 million rows from 100 samples
lethe multiply sample.csv --factor 10000 -o stress_test.csv
Anonymize without adding rows
# factor=1 replaces all PII but keeps the same row count
# Faster than 'lethe anonymize' because it skips NLP
lethe multiply customers.csv --factor 1 -o customers_clean.csv
With --factor 1, the multiply command only replaces PII based on column-name heuristics. It will not detect PII in unexpectedly named columns (e.g., a column called notes containing names). For those cases, use lethe anonymize.
Generate localized demo data
# Create demo data for a Dutch client presentation
lethe multiply template.csv --factor 50 \
--locale nl_NL --sanitize --seed 1 -o demo_nl.csv
Process Schematic: Anonymize (Structured Data)
Complete processing flow for CSV, TSV, and one-value-per-line files through the anonymize pipeline.
Process Schematic: Anonymize (Free-form Text)
Free-form text uses a different replacement strategy that preserves surrounding text. Instead of replacing entire cells, only the PII substrings are swapped.
Process Schematic: Multiply
The multiply pipeline reads the entire file, classifies columns, then generates synthetic rows.
Process Schematic: PII Detection
Detailed flow of how a single cell is evaluated for PII during the anonymize pipeline. This runs for every cell in every non-skipped column.
Reference: PII Entity Types
These entity types are recognized by both the heuristic and NLP layers:
| Entity Type | Description | Example Original | Example Replacement |
|---|---|---|---|
| PERSON | Full name, first name, last name | John Smith | Allison Hill |
| EMAIL_ADDRESS | Email address | john@example.com | garzaanthony@example.org |
| PHONE_NUMBER | Phone/mobile/fax number | +1-555-0101 | +1-555-9876 |
| LOCATION | Street address, city, state, zip | 742 Evergreen Terrace | 123 Oak Lane, Springfield |
| DATE_TIME | Date of birth | 1990-05-15 | 1985-11-23 |
| US_SSN | Social Security Number | 123-45-6789 | 987-65-4321 |
| CREDIT_CARD | Credit card number | 4111 1111 1111 1111 | 5234 5678 9012 3456 |
| US_DRIVER_LICENSE | Driver's license number | DL12345678 | AB9876543 |
| IBAN_CODE | International bank account | NL91ABNA0417164300 | DE89370400440532013000 |
| IP_ADDRESS | IPv4 address | 192.168.1.100 | 10.45.67.89 |
| US_PASSPORT | Passport number | 123456789 | 987654321 |
| UK_NINO | UK National Insurance Number | AB123456C | CD987654A |
| NL_BSN | Dutch Citizen Service Number | 123456789 | 987654321 |
| URL | Web URL | https://example.com | https://fake-site.net |
| CRYPTO | Cryptographic hash | a1b2c3d4... | e5f6a7b8... |
Reference: Skip Patterns
Columns matching these patterns are never scanned for PII. In the multiply command, they are either treated as ID columns (if the values are unique, monotonic integers) or as sample columns (values are randomly drawn from the existing distribution).
| Pattern | Regex | Examples |
|---|---|---|
| ID | ^(id|_id|pk|key|uuid|guid|row_?num)$ | id, pk, uuid |
| Timestamp | (created|updated|...|_at|_on|_date|_time)$ | created_at, updated_on |
| Boolean | ^(is_|has_|was_|can_|should_|active|enabled|flag) | is_active, has_paid |
| Numeric | ^(count|total|amount|price|cost|rate|score|age|...)$ | amount, price |
| Status | ^(status|state|type|category|tier|level|role|kind)$ | status, role |
| Currency | ^(currency|currency_code)$ | currency |
| Country Code | ^(country_code|lang|language|locale)$ | country_code |
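To see which of your column names would be skipped, the patterns can be tried directly. This sketch uses only the patterns printed in full above; the Timestamp and Numeric rows are elided in the table and are left out. `skip_category` is a hypothetical helper, and case-sensitive matching is an assumption.

```python
import re

# Patterns copied from the table above (complete rows only).
SKIP_PATTERNS = {
    "ID": r"^(id|_id|pk|key|uuid|guid|row_?num)$",
    "Boolean": r"^(is_|has_|was_|can_|should_|active|enabled|flag)",
    "Status": r"^(status|state|type|category|tier|level|role|kind)$",
    "Currency": r"^(currency|currency_code)$",
    "Country Code": r"^(country_code|lang|language|locale)$",
}


def skip_category(column):
    """Return the name of the first matching skip pattern, or None."""
    for name, pattern in SKIP_PATTERNS.items():
        if re.search(pattern, column):
            return name
    return None


for column in ["id", "is_active", "role", "customer_id", "email"]:
    print(column, skip_category(column))
```

Note that the ID pattern as printed is anchored at both ends, so a name such as customer_id does not match it in this sketch; whether Lethe treats such columns differently is not stated here.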
Reference: Faker Generators
Each PII entity type is mapped to a specific Faker method for generating realistic replacements:
| Entity Type | Faker Method | Extra Arguments |
|---|---|---|
| PERSON | faker.name() | |
| EMAIL_ADDRESS | faker.email() | |
| PHONE_NUMBER | faker.phone_number() | |
| LOCATION | faker.address() | |
| US_SSN | faker.ssn() | |
| CREDIT_CARD | faker.credit_card_number() | |
| DATE_TIME | faker.date() | |
| IP_ADDRESS | faker.ipv4() | |
| IBAN_CODE | faker.iban() | |
| US_DRIVER_LICENSE | faker.bothify() | text="??#######" |
| US_PASSPORT | faker.bothify() | text="#########" |
| UK_NINO | faker.bothify() | text="??######?" |
| NL_BSN | faker.numerify() | text="#########" |
| URL | faker.url() | |
| CRYPTO | faker.sha1() | |
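The text templates in the table use `#` for a digit and `?` for a letter. The sketch below imitates that substitution with the standard library so the templates are easier to read; it is not Faker's code. `fill_template` is a hypothetical name, and restricting `?` to uppercase letters is an assumption (Faker's `bothify` accepts a `letters` argument that controls the alphabet).

```python
import random
import string


def fill_template(template, rng):
    """Replace '#' with a random digit and '?' with a random uppercase letter."""
    out = []
    for ch in template:
        if ch == "#":
            out.append(rng.choice(string.digits))
        elif ch == "?":
            out.append(rng.choice(string.ascii_uppercase))
        else:
            out.append(ch)  # any other character is copied through unchanged
    return "".join(out)


rng = random.Random(42)
print(fill_template("??#######", rng))  # US_DRIVER_LICENSE shape
print(fill_template("??######?", rng))  # UK_NINO shape
print(fill_template("#########", rng))  # US_PASSPORT / NL_BSN shape
```

These templates produce values with the right shape only; they do not guarantee that a generated number passes any checksum or issuing-authority rule.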
Reference: Custom Recognizers
Lethe extends Presidio with three custom pattern recognizers for entity types not well-covered by the defaults:
IBAN Recognizer
Matches International Bank Account Numbers. Pattern: [A-Z]{2}\d{2}[A-Z0-9]{4}\d{7}... with score 0.7. Context words: iban, account, bank.
UK NINO Recognizer
Matches UK National Insurance Numbers (e.g., AB 12 34 56 C) with score 0.7. Known-invalid prefixes (BG, GB, NK, etc.) are excluded. Context words: nino, national insurance, ni number.
Dutch BSN Recognizer
Matches Dutch Citizen Service Numbers (Burgerservicenummer). Pattern: 9-digit number with a low base score of 0.3 (since bare 9-digit numbers are ambiguous). Context words: bsn, burgerservicenummer, sofi.
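The UK NINO recognizer described above can be approximated with a plain regular expression plus a prefix filter. This is an illustrative sketch, not the pattern Lethe registers with Presidio: the regex is an assumption based on the example format, the prefix set contains only the three prefixes named in the text, and `find_ninos` is a hypothetical helper.

```python
import re

# Prefixes named above as known-invalid; the real list is longer ("etc.").
INVALID_PREFIXES = {"BG", "GB", "NK"}

# Two letters, six digits in optional space-separated pairs, one suffix letter.
NINO_RE = re.compile(r"\b([A-Z]{2})\s?(\d{2})\s?(\d{2})\s?(\d{2})\s?([A-Z])\b")


def find_ninos(text):
    """Return NINO-shaped substrings whose prefix is not in the invalid set."""
    return [
        m.group(0)
        for m in NINO_RE.finditer(text)
        if m.group(1) not in INVALID_PREFIXES
    ]


print(find_ninos("NI number AB 12 34 56 C on file"))
print(find_ninos("Reference GB123456A"))
```

In Presidio, a pattern like this carries a base score (0.7 for the NINO recognizer), and the context words listed above can raise the score when they appear near the match.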
Reference: Configuration
All configuration is passed via CLI flags. There is no configuration file. The LetheConfig dataclass holds all settings:
| Setting | Type | Default | Description |
|---|---|---|---|
| model | string | "trf" | spaCy model alias. "trf" maps to en_core_web_trf, "sm" to en_core_web_sm |
| threshold | float | 0.35 | Minimum confidence score to classify a cell as PII. Lower = more sensitive. |
| chunk_size | int | 5000 | Rows per chunk in streaming mode. Only affects anonymize. |
| locale | string | "en_US" | Faker locale. Affects names, addresses, phone formats, etc. |
| seed | int or null | null | Random seed for reproducibility. Same seed = same output. |
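For orientation, the settings table can be read as a dataclass like the one below. This is a reconstruction from the table, not Lethe's source: the field names and defaults come from the table, while `MODEL_ALIASES` and `spacy_model` are hypothetical additions for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Alias-to-package mapping as described in the table (hypothetical constant name).
MODEL_ALIASES = {"trf": "en_core_web_trf", "sm": "en_core_web_sm"}


@dataclass
class LetheConfig:
    model: str = "trf"          # spaCy model alias
    threshold: float = 0.35     # minimum confidence to classify a cell as PII
    chunk_size: int = 5000      # rows per chunk; only affects anonymize
    locale: str = "en_US"       # Faker locale
    seed: Optional[int] = None  # None means no fixed seed

    def spacy_model(self) -> str:
        """Resolve the alias to a spaCy package name (hypothetical helper)."""
        return MODEL_ALIASES[self.model]


print(LetheConfig())
print(LetheConfig(model="sm", seed=42).spacy_model())
```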
Threshold tuning guide
| Threshold | Behavior | Use When |
|---|---|---|
| 0.2 | Very sensitive, catches edge cases, higher false positive rate | Maximum privacy, GDPR compliance, "better safe than sorry" |
| 0.35 (default) | Balanced sensitivity | General-purpose anonymization |
| 0.5 | Moderate, fewer false positives | Well-structured data with clear column names |
| 0.7+ | Conservative, only high-confidence detections | When preserving data fidelity is critical |
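The effect of the threshold can be shown with the base scores of the custom recognizers (0.3 for NL_BSN, 0.7 for UK_NINO). This is a minimal sketch: `detected` is a hypothetical helper, treating the threshold as inclusive is an assumption, and it ignores any score boost from context words.

```python
def detected(detections, threshold=0.35):
    """Keep (entity_type, score) pairs whose score meets the threshold."""
    return [(entity, score) for entity, score in detections if score >= threshold]


# Base scores from the custom recognizer descriptions, without context boosts.
candidates = [("NL_BSN", 0.3), ("UK_NINO", 0.7)]

print(detected(candidates, 0.2))   # both candidates are kept
print(detected(candidates, 0.35))  # the bare 9-digit BSN match is dropped
```

This is why a bare 9-digit number in an unlabeled column is only replaced at low thresholds, unless nearby context words raise its score.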
Reference: Output Formats
Lethe preserves the input file format in the output. The output extension matches the input extension.
CSV output
All CSV output uses csv.QUOTE_ALL quoting. Every field is enclosed in double quotes, regardless of whether it contains special characters. This ensures that:
- Multiline values (e.g., addresses with newlines) are correctly enclosed and parseable by any standard CSV reader
- Commas in values do not break field boundaries
- Empty strings are distinguishable from null/missing values
- Downstream tools (databases, spreadsheets, data pipelines) receive unambiguous input
"id","first_name","last_name","email","status"
"1","Allison Hill","Roman","garzaanthony@example.org","active"
"2","Lindsey Cameron","Maddox","smiller@example.net","active"
"3","Melinda Henderson","Jordan","richard13@example.net","inactive"
Input files do not need to be quoted. Lethe reads both quoted and unquoted CSV.
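The quoting behavior is that of Python's standard csv module with `csv.QUOTE_ALL`. The sketch below shows how a multiline value and an empty string round-trip under that setting; the rows are illustrative, and it does not reproduce Lethe's own writer code.

```python
import csv
import io

rows = [
    ["id", "address", "status"],
    ["1", "123 Oak Lane\nSpringfield", "active"],
    ["2", "", "inactive"],
]

# Write every field in double quotes, whether or not it needs them.
buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_ALL)
writer.writerows(rows)
print(buffer.getvalue())

# Reading the text back recovers the multiline field and the empty string intact.
parsed = list(csv.reader(io.StringIO(buffer.getvalue())))
print(parsed == rows)
```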
TSV output
TSV output uses tab separation with no quoting (standard TSV convention). Fields are separated by a single tab character.
id	first_name	last_name	email	status
1	Allison Hill	Roman	garzaanthony@example.org	active
2	Lindsey Cameron	Maddox	smiller@example.net	active
TXT output (one-value-per-line)
Each value is written on its own line, with no quoting or formatting. Empty lines are skipped on input and not produced on output.
garzaanthony@example.org
smiller@example.net
richard13@example.net
TXT output (free-form text)
Lines are output as-is with PII substrings replaced inline. Empty lines (paragraph separators) are preserved.
Dear Allison Hill,
We are writing to confirm your appointment at our Springfield office. Your account
manager, Lindsey Cameron, will be available to assist you on Monday.
Please bring your ID and proof of address to 123 Oak Lane, Springfield.
You can reach us at 555-9876 or email garzaanthony@example.org for any questions.
Reference: Destructive Actions
By default, Lethe never modifies or deletes your input files. All output is written to a separate file. The --clean flag changes this behavior by deleting the original input file after successful processing.
The --clean flag
When passed to lethe anonymize or lethe multiply, the --clean flag deletes the original input file after the output file has been written successfully. This is useful in pipelines where you want to ensure the original PII-containing file does not persist on disk.
The --clean flag requires --confirm-clean to actually take effect. Using --clean alone will produce an error and exit without processing.
# This will fail with an error
lethe anonymize data.csv --clean
# This works: anonymizes, then deletes data.csv
lethe anonymize data.csv --clean --confirm-clean
Safety guarantees
- Success-only deletion: The input file is only deleted after the pipeline completes without errors. If the pipeline fails or raises an exception, the input file is untouched.
- Explicit confirmation: The --confirm-clean flag exists solely to prevent accidental deletion. It has no effect on its own.
- What gets deleted: Only the input file specified as the argument. Nothing else is touched.
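The success-only guarantee amounts to placing the delete after the pipeline call, so that any exception skips it. A minimal sketch of that ordering follows; `run_then_clean` and the `pipeline` callable are hypothetical stand-ins, not Lethe's internals.

```python
import os


def run_then_clean(input_path, pipeline, clean=False):
    """Run pipeline(input_path); delete the input only if it returned normally."""
    pipeline(input_path)  # an exception here propagates and skips the delete
    if clean:
        os.remove(input_path)  # only the input file itself is removed
```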
Convention: --confirm-<action>
Lethe follows a general pattern for destructive CLI actions: any flag that causes irreversible data loss requires a paired confirmation flag. The naming convention is --confirm-<action>. This makes destructive operations safe for scripting and CI/CD, where interactive prompts are not practical. If you are running Lethe in an automated pipeline, include both flags explicitly.
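The paired-flag convention can be sketched in a few lines. Lethe's CLI is built with Typer; this example uses the standard library's argparse instead, purely to illustrate the check, and the program name `tool` and the function `parse_args` are hypothetical.

```python
import argparse


def parse_args(argv):
    """Parse arguments and reject --clean unless --confirm-clean is also given."""
    parser = argparse.ArgumentParser(prog="tool")
    parser.add_argument("input")
    parser.add_argument("--clean", action="store_true")
    parser.add_argument("--confirm-clean", action="store_true")
    args = parser.parse_args(argv)
    if args.clean and not args.confirm_clean:
        # parser.error prints a usage message and exits with status 2.
        parser.error("--clean requires --confirm-clean")
    return args


args = parse_args(["data.csv", "--clean", "--confirm-clean"])
print(args.clean, args.confirm_clean)
```

Because the check fails fast with a non-zero exit code instead of prompting, a script running under `set -e` stops before any processing happens.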
Reference: Anonymization vs. Pseudo-anonymization
These terms are often used interchangeably, but they have distinct legal and technical meanings, particularly under GDPR.
| | Anonymization | Pseudo-anonymization |
|---|---|---|
| What happens to PII | Irreversibly removed, deleted, or aggregated. The original values are gone permanently. | Replaced with realistic fake values. The data structure is preserved, but the real PII is swapped for generated alternatives. |
| Reversibility | Impossible. No party, including the one who performed the anonymization, can recover the original data. | Theoretically reversible if someone has access to the mapping table (e.g., the SessionIndex). Without the mapping, re-identification is practically infeasible. |
| GDPR status | Not personal data (Recital 26). Falls entirely outside the scope of GDPR. No consent, lawful basis, or data subject rights obligations apply. | Still personal data (Article 4(5)). GDPR obligations still apply, but pseudo-anonymization is recognized as an appropriate safeguard (Recital 28, Article 25, Article 32) that may simplify compliance. |
| Data utility | Lower. Removing or aggregating values reduces the data's usefulness for development, testing, and analysis. | Higher. Replaced values are structurally identical to the originals (same types, formats, distributions), so the data remains useful for development, testing, analytics, and ML training. |
| Techniques | Deletion, aggregation (averages, counts), generalization (exact age to age range, city to country), k-anonymity, differential privacy. | Value substitution with consistent fakes (what Lethe does), tokenization, encryption with key management. |
| Use cases | Publishing open datasets, permanent data retention after deletion requests, removing GDPR obligations entirely for a dataset. | Dev/test environments, third-party data sharing, CI/CD fixtures, demo environments, analytics, debugging with realistic data. |
What Lethe does
Lethe performs pseudo-anonymization. It replaces PII with Faker-generated values via the SessionIndex mapping. It does not delete, aggregate, or generalize data. The output preserves full structural fidelity (foreign keys, column distributions, data types) while replacing real personal information with realistic fakes.
Lethe does not persist the SessionIndex mapping to disk. Once the process exits, the mapping itself is gone and only the replacement values remain, in the output file. Without the original input, recovering the original PII from the output is not feasible. However, if the original input file is retained alongside the output, a determined party could theoretically reconstruct the mapping by comparing the two.
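The reconstruction risk is easy to picture: if both files are available and rows keep their order, pairing values by position rebuilds the mapping. The sketch below assumes row order is preserved between input and output; `reconstruct_mapping` is a hypothetical helper and the rows are illustrative.

```python
def reconstruct_mapping(original_rows, output_rows, column):
    """Pair each fake value with the original value at the same row position."""
    return {
        out[column]: orig[column]
        for orig, out in zip(original_rows, output_rows)
    }


original = [{"name": "John Smith"}, {"name": "Jane Doe"}]
output = [{"name": "Allison Hill"}, {"name": "Lindsey Cameron"}]
print(reconstruct_mapping(original, output, "name"))
```

This is the practical reason to remove or tightly restrict the original input once the pseudo-anonymized output has been produced.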
When you need true anonymization
If your use case requires data that falls entirely outside GDPR scope, you need true anonymization rather than pseudo-anonymization. This means irreversibly removing all identifying information with no possibility of re-identification. Techniques include aggregation (replacing individual records with statistical summaries), generalization (broadening values, e.g. exact age to age range), or complete deletion of PII columns. Lethe is not the right tool for true anonymization.
Reference: Definitions
Key technologies and concepts used by Lethe.
Technologies
| Term | Description |
|---|---|
| spaCy | Open-source NLP library for tokenization and named entity recognition (NER). Runs entirely on your machine, no network calls. |
| RoBERTa | The transformer architecture used by en_core_web_trf. A ~125M parameter model from Facebook/Meta AI, fine-tuned by Explosion (spaCy) for English NER. It is not a generative model; it only classifies tokens. |
| Presidio | Microsoft's open-source PII detection framework. Combines NLP models with pattern recognizers (regex). Lethe uses its AnalyzerEngine but not its AnonymizerEngine; replacement logic is handled by Lethe's own mapping layer. |
| Faker | Python library for generating realistic fake data (names, emails, addresses, phone numbers). Supports 50+ locales for locale-appropriate output. |
| Pandas | DataFrame library used for chunked file I/O and data manipulation. Handles CSV/TSV reading and writing. |
| Typer | CLI framework built on Click. Handles argument parsing, help text, and command routing. |
| Rich | Terminal formatting library for progress bars, tables, and colored output during processing. |
Data lifecycle
| Category | Details |
|---|---|
| What is kept | Output files (anonymized or multiplied), session mapping (in memory only, not persisted to disk), command history (standard shell history). |
| What is destroyed | With --clean --confirm-clean, the original input file is deleted after successful processing. Without --clean, nothing is ever deleted. PII values are replaced in the output file only; the input file is never modified. |