Lethe v0.2.0

Pseudo-anonymize and multiply structured data. Replace personal information with realistic synthetic values while preserving schema, types, and distributions.

Built for Regulatory Compliance

Data privacy regulations impose strict requirements on how organizations collect, store, process, and share personal data, and non-compliance carries severe financial penalties and reputational damage. Lethe helps you meet these obligations with reliable, auditable pseudo-anonymization: it replaces PII with realistic fake values and can be integrated directly into your data workflows. See Anonymization vs. Pseudo-anonymization for the important distinction.

GDPR (EU)

The General Data Protection Regulation requires organizations processing EU residents' data to implement appropriate technical measures to protect personal data. Lethe performs pseudo-anonymization as defined in Article 4(5): PII is replaced with fake values, reducing the risk of re-identification. Pseudo-anonymized data is still personal data under GDPR, but benefits from relaxed processing requirements (Recital 28, Article 25, Article 32). Lethe supports Article 5 (data minimization), Article 25 (data protection by design), and Article 32 (security of processing). For true anonymization (Recital 26, where data falls outside GDPR scope entirely), PII must be irreversibly removed, not replaced.

CCPA / CPRA (California)

The California Consumer Privacy Act and its successor, the California Privacy Rights Act, grant consumers the right to know what personal information is collected, the right to delete it, and the right to opt out of its sale. Deidentified data is exempt from CCPA obligations provided it cannot reasonably identify a consumer. Lethe's whole-cell replacement with Faker-generated values makes re-identification practically infeasible, though the output should be treated as pseudo-anonymized rather than fully deidentified unless reviewed by legal counsel.

HIPAA (US Healthcare)

The Health Insurance Portability and Accountability Act's Safe Harbor method requires removal of 18 specific identifiers from protected health information (PHI). Lethe detects and replaces names, addresses, dates, phone numbers, SSNs, email addresses, and IP addresses, which covers a subset of the Safe Harbor identifiers. When combined with domain-specific review, this supports compliant de-identification of healthcare data.

AVG (Netherlands)

The Algemene Verordening Gegevensbescherming (AVG) is the Dutch name for the GDPR, which applies directly in the Netherlands and is enforced there by the Autoriteit Persoonsgegevens. The same principles apply, with local enforcement. Lethe includes Dutch-specific capabilities: a BSN (Burgerservicenummer) recognizer, Dutch locale support for generating realistic NL addresses and names, and IBAN detection for Dutch bank accounts.

How Lethe Supports Compliance

Regulatory compliance is not just about having a tool. It requires a process. Lethe fits into that process at several points:

Important: Lethe performs pseudo-anonymization (PII replacement), not true anonymization (irreversible PII removal). Under GDPR, these are legally distinct: pseudo-anonymized data is still personal data (Article 4(5)), while truly anonymized data falls outside GDPR scope entirely (Recital 26). Lethe's whole-cell replacement with Faker-generated values makes re-identification practically infeasible, but you should consult your Data Protection Officer or legal counsel to confirm whether pseudo-anonymization meets the requirements of your specific regulatory context, or whether true anonymization is required.

Overview

Lethe is a command-line tool for data pseudo-anonymization and synthetic data generation. It reads CSV, TSV, TXT, and SQL dump files, detects personally identifiable information (PII) using a combination of column-name heuristics and NLP-based entity recognition, and replaces sensitive values with realistic fake data generated by Faker. This is pseudo-anonymization: PII is replaced, not removed, preserving the data's structure and utility while eliminating real personal information. All processing happens locally using spaCy language models. Nothing leaves your machine.

Lethe provides two core commands:

anonymize

Scan CSV, TSV, or TXT files with NLP and heuristics to detect PII. Replace detected values with consistent fake data. Processes files in chunks for bounded memory usage. Supports structured (columnar) and free-form text.

multiply

Expand a dataset by a given factor with synthetic rows. PII columns get fresh Faker values, ID columns get sequential integers, and non-PII columns preserve the original value distribution.

Key Properties

Installation

From PyPI

pip install lethe-cli

From source

git clone https://github.com/yourorg/lethe.git
cd lethe
pip install -e ".[dev]"

Download a spaCy model

The anonymize command requires a spaCy language model. Choose one:

# Transformer model (more accurate, slower, needs GPU for best performance)
python -m spacy download en_core_web_trf

# Small model (faster, lower accuracy, CPU-friendly)
python -m spacy download en_core_web_sm

Note: The multiply command does not require spaCy or Presidio. It uses column-name heuristics only, so it works without downloading any NLP models.

Dependencies

Package              Version       Purpose
-------------------  ------------  ---------------------------------
presidio-analyzer    >=2.2, <3     NLP-based PII detection
presidio-anonymizer  >=2.2, <3     Presidio integration
spacy                >=3.7, <4     NLP engine for entity recognition
faker                >=25.0, <35   Synthetic data generation
typer[all]           >=0.12, <1    CLI framework
rich                 >=13.0, <14   Terminal output and progress bars
pandas               >=2.0, <3     DataFrame-based CSV I/O

Quick Start

Anonymize a CSV file

lethe anonymize customers.csv
# Output: customers_anonymized.csv

Anonymize a TSV file

lethe anonymize data.tsv
# Output: data_anonymized.tsv (format auto-detected from extension)

Anonymize free-form text

lethe anonymize letter.txt --model sm
# PII is replaced inline: "Dear John Smith" -> "Dear Allison Hill"

Multiply a dataset 5x

lethe multiply customers.csv --factor 5
# 100 rows in -> 500 rows out

Generate Dutch data with sanitized output

lethe multiply customers.csv --factor 3 --locale nl_NL --sanitize --seed 42

Example input and output

Input (5 rows)

"id","first_name","last_name","email","phone","ssn","address","created_at","status"
"1","John","Smith","john.smith@example.com","+1-555-0101","123-45-6789","742 Evergreen Terrace","2024-01-15","active"
"2","Jane","Doe","jane.doe@testmail.com","+1-555-0102","987-65-4321","123 Main Street","2024-02-20","active"
"3","Bob","Johnson","bob.j@company.org","+1-555-0103","456-78-9012","456 Oak Avenue","2024-03-10","inactive"
"4","Alice","Williams","alice.w@domain.net","+1-555-0104","321-54-9876","789 Pine Road","2024-04-05","active"
"5","Charlie","Brown","charlie.b@email.com","+1-555-0105","654-32-1098","321 Elm Street","2024-05-12","active"

Output after lethe multiply --factor 3 (15 rows)

Each column is handled according to its type:

Column      Classification  Behavior
----------  --------------  ------------------------------------------------------------
id          ID              Originals kept (1-5), synthetics continue sequentially (6-15)
first_name  PII / PERSON    Every row gets a fresh Faker name
last_name   PII / PERSON    Every row gets a fresh Faker name
email       PII / EMAIL     Every row gets a fresh Faker email
phone       PII / PHONE     Every row gets a fresh Faker phone number
ssn         PII / SSN       Every row gets a fresh Faker SSN
address     PII / LOCATION  Every row gets a fresh Faker address
created_at  SKIP            Synthetic rows sampled randomly from the original 5 values
status      SKIP            Synthetic rows sampled randomly, so the 80/20 active/inactive ratio is preserved approximately

Command: lethe anonymize

Scans a data file for PII using NLP (Presidio + spaCy) combined with column-name heuristics, then replaces detected values with consistent fake data. Supports CSV, TSV, and TXT files. The file format is auto-detected from the extension and content heuristics.

lethe anonymize INPUT_FILE [OPTIONS]

Options

Option                                  Default                   Description
--------------------------------------  ------------------------  -----------------------------------------------------------
-o, --output PATH                       <input>_anonymized.<ext>  Output file path (preserves original extension)
-m, --model [trf|sm]                    trf                       spaCy model. trf = transformer (accurate), sm = small (fast)
--mode [pseudo|redact|generalize|drop]  pseudo                    Anonymization strategy. See Anonymization Modes below.
-t, --threshold FLOAT                   0.35                      Minimum confidence score to treat a cell as PII
--chunk-size INT                        5000                      Rows per chunk for streaming. Lower values use less memory.
--locale TEXT                           en_US                     Faker locale for generated replacement values
--seed INT                              none                      Random seed for reproducible output
--clean                                 false                     Delete the original input file after successful anonymization
--confirm-clean                         false                     Required confirmation for --clean. Both flags must be present.

Examples

# Basic anonymization
lethe anonymize data/customers.csv

# Fast mode with small model
lethe anonymize data/customers.csv --model sm

# Higher sensitivity (catches more PII, may have more false positives)
lethe anonymize data/customers.csv --threshold 0.2

# Lower sensitivity (only high-confidence detections)
lethe anonymize data/customers.csv --threshold 0.7

# Reproducible output with seed
lethe anonymize data/customers.csv --seed 42

# Dutch locale for replacement values
lethe anonymize data/customers.csv --locale nl_NL -o output/customers_nl.csv

# Large file with smaller chunks for lower memory usage
lethe anonymize data/large_export.csv --chunk-size 1000

# Anonymize a TSV file (tab-separated)
lethe anonymize data/export.tsv

# Anonymize a list of email addresses (one per line)
lethe anonymize data/emails.txt --model sm

# Anonymize free-form text (inline PII replacement)
lethe anonymize data/letter.txt

# Redact mode: replace PII with [ENTITY_TYPE] tokens
lethe anonymize data/customers.csv --mode redact

# Generalize mode: reduce PII to partial forms (email domains, year-only dates)
lethe anonymize data/customers.csv --mode generalize

# Drop mode: remove PII columns entirely
lethe anonymize data/customers.csv --mode drop

Anonymization Modes

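Lethe supports four anonymization strategies via the --mode flag. The default mode (pseudo) replaces PII with realistic Faker-generated fakes. The three additional modes irreversibly remove or reduce the original values, which is the approach GDPR Recital 26 requires for true anonymization. Because generalize keeps partial values such as email domains, review its output for remaining identifiers before treating it as anonymized.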

Mode        Strategy                               Output Example                       Reversible?
----------  -------------------------------------  -----------------------------------  --------------------------
pseudo      Replace with realistic fakes (Faker)   John Smith -> Emily Davis            No (but linkable via seed)
redact      Replace with entity type tokens        John Smith -> [PERSON]               No
generalize  Reduce to partial, less-specific form  john@example.com -> ***@example.com  No
drop        Remove PII columns/spans entirely      Column removed from output           No

Generalization Rules

The generalize mode applies entity-type-specific reduction. When no useful partial form exists, it falls back to redaction ([ENTITY_TYPE]).

Entity Type                    Strategy                       Example
-----------------------------  -----------------------------  -----------------------------------------------
EMAIL_ADDRESS                  Keep domain only               john@example.com -> ***@example.com
DATE_TIME                      Year only                      2024-01-15 -> 2024
IP_ADDRESS                     First two octets               192.168.1.100 -> 192.168.x.x
URL                            Scheme + hostname              https://example.com/path -> https://example.com
CREDIT_CARD                    Last 4 digits                  4532015112830366 -> ************0366
PERSON, PHONE, LOCATION, etc.  Redact (no safe partial form)  [PERSON]
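
The rules in this table can be sketched as a single dispatch function. This is a minimal illustration, not Lethe's implementation: the function name generalize and its signature are assumptions, and the date rule assumes an ISO-style value that starts with a four-digit year.

```python
from urllib.parse import urlsplit


def generalize(entity_type: str, value: str) -> str:
    """Reduce a PII value to a less specific form (hypothetical helper)."""
    if entity_type == "EMAIL_ADDRESS" and "@" in value:
        # Keep the domain only.
        return "***@" + value.rsplit("@", 1)[1]
    if entity_type == "DATE_TIME" and value[:4].isdigit() and len(value) >= 4:
        # Assumption: the value starts with a four-digit year.
        return value[:4]
    if entity_type == "IP_ADDRESS":
        octets = value.split(".")
        if len(octets) == 4:
            # Keep the first two octets of an IPv4 address.
            return f"{octets[0]}.{octets[1]}.x.x"
    if entity_type == "URL":
        parts = urlsplit(value)
        if parts.scheme and parts.hostname:
            return f"{parts.scheme}://{parts.hostname}"
    if entity_type == "CREDIT_CARD":
        digits = [c for c in value if c.isdigit()]
        if len(digits) > 4:
            # Mask everything except the last four digits.
            return "*" * (len(digits) - 4) + "".join(digits[-4:])
    # No useful partial form: fall back to redaction.
    return f"[{entity_type}]"
```

Any entity type without a rule, or any value that does not fit the expected shape, falls through to the [ENTITY_TYPE] token, which mirrors the fallback described above.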

Drop Mode Behavior

In structured files (CSV, TSV), drop removes entire columns where PII was detected. In free-form text, it removes PII spans inline and cleans up leftover whitespace. In SQL dumps, it removes columns from INSERT statements and reconstructs the SQL with the reduced column set.

Command: lethe multiply

Expands a CSV or TSV dataset by a given factor, producing N * factor total rows. The first N rows are the originals with PII replaced, and the remaining rows are fully synthetic. The command requires no NLP model, and it does not support free-form text or one-value-per-line TXT files.

lethe multiply INPUT_FILE [OPTIONS]

Options

Option             Default                   Description
-----------------  ------------------------  -------------------------------------------------------------
-f, --factor INT   3                         Multiplication factor. Output rows = input rows * factor
-o, --output PATH  <input>_multiplied.<ext>  Output file path (preserves original extension)
--locale TEXT      en_US                     Faker locale for generated PII values
--seed INT         none                      Random seed for reproducible output
--sanitize         false                     Validate and fix emails and URLs to RFC-compliant ASCII
--clean            false                     Delete the original input file after successful multiplication
--confirm-clean    false                     Required confirmation for --clean. Both flags must be present.

How multiplication works

The multiplier classifies every column in your input file into one of three buckets, then handles each bucket differently:

Bucket  Detection                                                                    Original rows (1..N)             Synthetic rows (N+1..N*factor)
------  ---------------------------------------------------------------------------  -------------------------------  -------------------------------------------
PII     Column name matches a PII pattern (e.g. first_name, email, ssn)              Replaced with fresh Faker value  Fresh Faker value per row
ID      Column name matches an ID pattern and values are unique, monotonic integers  Kept as-is                       Sequential integers from max(existing) + 1
SAMPLE  Heuristic returns SKIP (non-PII) or None (unknown)                           Kept as-is                       Randomly sampled from existing column values
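
The per-bucket behavior can be sketched as follows. This is a simplified illustration under stated assumptions: multiply_rows, its parameters, and the fake_value callable (a stand-in for a Faker call) are hypothetical names, rows are plain dicts, and the input is assumed to be non-empty.

```python
import random
from itertools import count


def multiply_rows(rows, buckets, factor, fake_value, seed=None):
    """Expand rows to len(rows) * factor rows.

    rows:       list of dicts, one per input row (assumed non-empty)
    buckets:    column name -> "PII" | "ID" | "SAMPLE"
    fake_value: callable(column) -> str, a stand-in for a Faker call
    """
    rng = random.Random(seed)
    n = len(rows)
    # One counter per ID column, starting at max(existing) + 1.
    next_id = {
        col: count(max(int(r[col]) for r in rows) + 1)
        for col, bucket in buckets.items()
        if bucket == "ID"
    }
    out = []
    for i in range(n * factor):
        original = rows[i] if i < n else None
        new_row = {}
        for col, bucket in buckets.items():
            if bucket == "PII":
                new_row[col] = fake_value(col)            # fresh value on every row
            elif original is not None:
                new_row[col] = original[col]              # ID / SAMPLE kept as-is
            elif bucket == "ID":
                new_row[col] = str(next(next_id[col]))    # continue the sequence
            else:
                new_row[col] = rng.choice(rows)[col]      # sample an existing value
        out.append(new_row)
    return out
```

Sampling whole existing values with equal probability per input row is what keeps the output distribution close to the input distribution.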

Examples

# Triple a dataset
lethe multiply customers.csv --factor 3
# 100 rows -> 300 rows

# Generate 10x the data with Dutch names and addresses
lethe multiply customers.csv --factor 10 --locale nl_NL --sanitize

# Reproducible output for testing
lethe multiply customers.csv --factor 5 --seed 42 -o test_data.csv

# Just anonymize without adding rows (factor=1)
lethe multiply customers.csv --factor 1

Tip: Using --factor 1 gives you a quick, NLP-free anonymization that relies only on column-name heuristics. It is much faster than lethe anonymize but will only detect PII based on column names, not cell contents.

Architecture: Anonymize Pipeline

The anonymize command processes files through a streaming pipeline. The format is auto-detected, and the pipeline adapts accordingly:

detect_format (Extension + heuristics)
  -> Reader (Chunked I/O)
  -> PiiScanner (Heuristics + NLP)
  -> Replacer (Whole-cell swap)
  -> Writer (Format-matched)

For free-form text, the pipeline uses a different replacer that preserves surrounding text:

detect_format (Extension + heuristics)
  -> FreeformReader (Line-by-line)
  -> FreeformReplacer (Substring swap)
  -> FreeformWriter (Plain text)

1. Format Detection

The file extension determines the initial format. For .txt files, the first 20 lines are inspected: consistent tab counts indicate TSV, short lines (avg < 60 chars, max 4 words) indicate one-value-per-line, everything else is free-form text. See File Format Support for details.

2. Reader

Reads the input file in configurable chunks (default 5,000 rows). CSV and TSV use pandas.read_csv with chunksize. Line-per-value and freeform readers iterate lines from the file directly. Each chunk is an independent DataFrame yielded through a generator, so memory usage stays bounded regardless of file size.
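
The chunking idea can be illustrated without pandas. The sketch below is a standard-library analog, not the pandas.read_csv(chunksize=...) call that Lethe uses: read_chunks is a hypothetical name, and it yields lists of row dicts rather than DataFrames.

```python
import csv
import io
from itertools import islice


def read_chunks(handle, chunk_size=5000):
    """Yield lists of row dicts, at most chunk_size rows each."""
    reader = csv.DictReader(handle)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


# Three data rows read in chunks of two.
demo = io.StringIO("id,name\n1,a\n2,b\n3,c\n")
chunk_sizes = [len(chunk) for chunk in read_chunks(demo, chunk_size=2)]
```

Because the generator holds only one chunk at a time, peak memory depends on chunk_size rather than on the size of the file.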

3. PiiScanner

For each column in a chunk, the scanner applies three strategies in order:

  1. Column heuristics: The column name is matched against regex patterns to get a quick classification (SKIP, PII type hint, or unknown).
  2. Presidio NER: Each non-skipped cell is analyzed by Presidio's AnalyzerEngine using the configured spaCy model. This catches PII that the column name alone wouldn't reveal.
  3. Confidence boosting: When the heuristic and Presidio agree on a type, the confidence score is boosted by 0.25. When the heuristic suggests a type but Presidio found nothing, a synthetic result is created with score 0.4.

Results are filtered by the configured threshold (default 0.35). Only cells that meet or exceed the threshold are marked for replacement.
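
The scoring rules above can be sketched as a small function. The name combine_scores and its signature are assumptions for illustration. The source does not say what happens when the heuristic and Presidio disagree on the type, so this sketch keeps the Presidio result unchanged in that case. The cap at 1.0 follows the Confidence Boosting description later in this document.

```python
def combine_scores(hint, nlp_type, nlp_score, threshold=0.35):
    """Return (entity_type, score) if the cell should be replaced, else None.

    hint:      entity type suggested by the column name, or None
    nlp_type:  entity type of the best NLP detection, or None
    nlp_score: confidence of that detection (ignored when nlp_type is None)
    """
    if nlp_type is not None and hint == nlp_type:
        # Heuristic and NLP agree: boost by 0.25, capped at 1.0.
        result = (nlp_type, min(1.0, nlp_score + 0.25))
    elif nlp_type is not None:
        # NLP only, or a type mismatch (assumption: keep the NLP result as-is).
        result = (nlp_type, nlp_score)
    elif hint is not None:
        # Heuristic only: synthetic detection at a fixed score.
        result = (hint, 0.4)
    else:
        return None
    return result if result[1] >= threshold else None
```

With the default threshold of 0.35, a heuristic-only detection (score 0.4) is always replaced, while a raised threshold such as 0.7 drops it.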

4. Replacer

For structured formats (CSV, TSV, one-value-per-line), the Replacer does whole-cell swaps: the entire cell value is replaced with a synthetic value via the SessionIndex.

For free-form text, the FreeformReplacer does substring replacement: Presidio returns character offsets for each PII span, and only those substrings are swapped while surrounding text is preserved. Replacements happen right-to-left so earlier character offsets remain valid after each substitution.
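
A short sketch shows why the right-to-left order matters. The function name replace_spans and the (start, end, replacement) tuple shape are assumptions; spans are assumed not to overlap.

```python
def replace_spans(text, spans):
    """Replace character spans in text, working from the last span to the first.

    spans: iterable of (start, end, replacement), non-overlapping offsets
           into the ORIGINAL text (end is exclusive).
    """
    for start, end, replacement in sorted(spans, key=lambda s: s[0], reverse=True):
        # Editing the rightmost span first leaves all earlier offsets untouched.
        text = text[:start] + replacement + text[end:]
    return text
```

If the spans were applied left to right instead, a replacement of a different length would shift every later offset and the remaining spans would point at the wrong characters.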

5. Writer

Writes each processed chunk to the output file in append mode. The writer matches the input format: CSV uses csv.QUOTE_ALL quoting, TSV uses tab separation without quoting, line-per-value and freeform write plain text lines.
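
The format-matched quoting can be sketched with the standard csv module. This is an illustration only: write_chunk, fmt, and write_header are hypothetical names, and the sketch writes to an already-open handle instead of reopening the output file in append mode.

```python
import csv
import io


def write_chunk(handle, rows, fieldnames, fmt="csv", write_header=False):
    """Write one chunk of row dicts to an open text handle."""
    if fmt == "csv":
        # CSV: every field quoted.
        writer = csv.DictWriter(handle, fieldnames, quoting=csv.QUOTE_ALL,
                                lineterminator="\n")
    else:
        # TSV: tab-separated, no quoting. Without an escape character the csv
        # module raises an error for values that would need escaping (such as
        # a value containing a tab).
        writer = csv.DictWriter(handle, fieldnames, delimiter="\t",
                                quoting=csv.QUOTE_NONE, lineterminator="\n")
    if write_header:
        writer.writeheader()
    writer.writerows(rows)


# Two chunks written to the same handle, header only on the first.
buffer = io.StringIO()
write_chunk(buffer, [{"id": "1", "name": "a"}], ["id", "name"], write_header=True)
write_chunk(buffer, [{"id": "2", "name": "b"}], ["id", "name"])
csv_text = buffer.getvalue()
```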

Architecture: Multiply Pipeline

The multiply pipeline is simpler than anonymize because it does not use NLP. It supports CSV and TSV inputs and rejects free-form text and one-value-per-line formats:

detect_format (CSV or TSV only)
  -> Read File (Entire file)
  -> Classify Columns (Heuristics only)
  -> Generate Rows (Faker + sampling)
  -> Sanitize (Optional)
  -> Writer (Format-matched)

The entire input file is read at once (not chunked) because the multiplier needs to analyze value distributions for the sampling step. TSV files are read with sep="\t". For each column, the Multiplier class calls infer_pii_type() to classify it into one of three buckets: PII, ID, or Sample.

ID column detection

A column classified as SKIP by the heuristic is further checked for ID characteristics: it must be numeric, contain unique values, and be monotonically increasing. If all three conditions hold, it is treated as an auto-incrementing ID column. Otherwise, it falls into the Sample bucket.
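
The three conditions can be checked in a few lines. The function name looks_like_id_column is an assumption, and "numeric" is interpreted here as "parses as an integer". Unique plus monotonically increasing is equivalent to strictly increasing, so a single comparison covers both.

```python
def looks_like_id_column(values):
    """True if every value is an integer and the sequence is strictly increasing."""
    try:
        numbers = [int(v) for v in values]
    except (TypeError, ValueError):
        return False          # at least one value is not an integer
    if not numbers:
        return False          # assumption: an empty column is not an ID column
    # Strictly increasing implies both uniqueness and monotonicity.
    return all(a < b for a, b in zip(numbers, numbers[1:]))
```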

File Format Support

Lethe auto-detects the file format from the extension and, for .txt files, from content heuristics. The format determines which reader, writer, and replacement strategy are used.

Format              Extension     Detection                                              Column Structure                       Replacement
------------------  ------------  -----------------------------------------------------  -------------------------------------  ---------------------------------------
CSV                 .csv          Extension match                                        Standard CSV columns with header row   Whole-cell swap per column
TSV                 .tsv or .txt  Extension, or consistent tab counts in first 20 lines  Tab-separated columns with header row  Whole-cell swap per column
One-value-per-line  .txt          Short lines (avg < 60 chars, max 4 words per line)     Single column named "value"            Whole-cell swap, full NLP on every cell
Free-form text      .txt          Default for .txt when other heuristics don't match     Single column named "text"             Substring swap (inline PII replacement)

Detection algorithm for .txt files

When the file extension is .txt, Lethe reads the first 20 lines and applies heuristics in order:

  1. Tab check: Count tabs in each non-empty line. If every line has the same number of tabs (at least 1), the file is classified as TSV.
  2. Short-value check: Calculate the average character length and maximum word count across non-empty lines. If avg < 60 and max words ≤ 4, the file is classified as one-value-per-line.
  3. Default: Everything else is classified as free-form text.
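
The three checks above can be sketched as one function. The name classify_txt and the returned labels ("tsv", "line_per_value", "freeform") are assumptions for illustration, as is the choice to treat a file with no non-empty lines as free-form text.

```python
def classify_txt(lines, sample_size=20):
    """Classify a .txt file from its first sample_size lines."""
    sample = [line.rstrip("\n") for line in lines[:sample_size]]
    non_empty = [line for line in sample if line.strip()]
    if not non_empty:
        return "freeform"     # assumption: nothing to inspect, use the default

    # 1. Tab check: every non-empty line has the same tab count, at least 1.
    tab_counts = {line.count("\t") for line in non_empty}
    if len(tab_counts) == 1 and tab_counts.pop() >= 1:
        return "tsv"

    # 2. Short-value check: average length < 60 and at most 4 words per line.
    avg_len = sum(len(line) for line in non_empty) / len(non_empty)
    max_words = max(len(line.split()) for line in non_empty)
    if avg_len < 60 and max_words <= 4:
        return "line_per_value"

    # 3. Default.
    return "freeform"
```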

Empty lines

Multiply command format support

The multiply command only supports CSV and TSV, since multiplying paragraphs or value lists is not meaningful. Attempting to multiply a free-form or one-value-per-line file will produce a clear error message.

NLP Engine (spaCy)

Lethe uses spaCy as the NLP backend inside Presidio for Named Entity Recognition (NER). spaCy is a local library. All processing happens on your machine. Nothing is sent to any external API or cloud service.

Model comparison

Property       --model trf (default)                  --model sm
-------------  -------------------------------------  -----------------------------------------------
spaCy model    en_core_web_trf                        en_core_web_sm
Architecture   Transformer (RoBERTa-based)            CNN + token embeddings
Parameters     ~125M                                  ~12M
Download size  ~500 MB                                ~12 MB
Accuracy       Higher, especially on ambiguous text   Good for structured/well-named data
Speed          Slower (GPU-accelerated if available)  Fast, CPU-friendly
Memory         ~1-2 GB                                ~150 MB
Best for       Free-form text, notes, mixed content   Well-structured CSV/TSV with clear column names

What the NLP model does

Both models perform the same task: Named Entity Recognition. For each text input, the model:

  1. Tokenizes the text into words and subwords
  2. Tags parts of speech (noun, verb, etc.)
  3. Labels named entities with types like PERSON, ORG, GPE (geo-political entity), DATE, etc.

Presidio then combines these NER labels with its own pattern-based recognizers (regex for SSNs, credit cards, IBANs, etc.) to produce the final list of PII detections. The NLP model catches entities that patterns alone would miss, such as recognizing "John Smith" as a person name in a free-text field.

What is this model, exactly?

The trf model is what would be called a "small language model" (SLM), a ~125M parameter transformer that runs entirely locally. Like the models served by tools such as Ollama, it runs on your own hardware, but it is specialized for entity recognition rather than text generation. It does not generate text, predict next tokens, or respond to prompts. It only classifies tokens into entity types.

The sm model is not a transformer at all. It uses a convolutional neural network (CNN) over token embeddings, making it smaller and faster but less accurate on context-dependent entities.

When to use which: For structured data (CSV/TSV with clear column names like first_name, email, ssn), the sm model is usually sufficient because column-name heuristics do most of the work. For free-form text, notes fields, or data with ambiguously named columns, use trf for better accuracy.

PII Detection Strategy

Lethe uses a layered approach to PII detection, combining fast heuristics with deep NLP analysis:

Layer 1: Column-Name Heuristics

Regex patterns match column names like first_name, email, ssn to known PII types. Also identifies columns to skip entirely (IDs, timestamps, status fields). This layer is instant and requires no NLP.

Layer 2: Presidio NLP Analysis

Each cell in non-skipped columns is analyzed by Presidio using a spaCy language model. This catches PII that column names don't reveal, such as a name in a column called notes. Custom pattern recognizers extend Presidio for IBAN, UK NINO, and Dutch BSN.

Layer 3: Confidence Boosting

When heuristics and NLP agree, the confidence score is boosted by 0.25 (capped at 1.0). When heuristics suggest PII but NLP found nothing, a synthetic detection is created at score 0.4. This reduces false negatives for well-named columns.

Note: The multiply command only uses Layer 1 (heuristics). The full three-layer detection is only available through anonymize.

Column Heuristics Reference

The infer_pii_type() function classifies columns by matching their name against two sets of regex patterns. SKIP patterns are checked first; if none match, PII patterns are checked. If neither matches, the column is classified as unknown.

Skip patterns

These columns are never scanned for PII:

Category      Matches
------------  ------------------------------------------------------------------
ID            id, _id, pk, key, uuid, guid, row_num
Timestamp     Columns ending in created_at, updated_on, _date, _time, timestamp
Boolean       Columns starting with is_, has_, was_, can_, active, enabled
Numeric       count, total, amount, price, cost, rate, score, age, etc.
Status        status, state, type, category, tier, level, role, kind
Currency      currency, currency_code
Country Code  country_code, lang, language, locale

PII hint patterns

Entity Type        Column Name Patterns
-----------------  --------------------------------------------------------------------------
PERSON             name, first_name, last_name, full_name, surname, customer_name, user_name
EMAIL_ADDRESS      email, mail, email_addr
PHONE_NUMBER       phone, mobile, cell, tel, fax, contact_number
LOCATION           address, street, city, state, zip, postal, country, location
DATE_TIME          birth, dob, date_of_birth
US_SSN             ssn, social_security
CREDIT_CARD        credit_card, card_num, cc_num, pan
US_DRIVER_LICENSE  driver_license, dl_num
IBAN_CODE          iban
IP_ADDRESS         ip_addr, ip, remote_ip, client_ip
US_PASSPORT        passport
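
The skip-first lookup order can be sketched as follows. Only the function name infer_pii_type and the return values (SKIP, an entity type, or None) come from this document. The regular expressions are a small, hypothetical subset written for illustration and are not Lethe's actual patterns.

```python
import re

# Hypothetical subset of the skip patterns listed above.
SKIP_PATTERNS = [
    re.compile(r"(^|_)(id|pk|uuid|guid)$"),
    re.compile(r"(created_at|updated_on|_date|_time|timestamp)$"),
    re.compile(r"^(is|has|was|can)_"),
    re.compile(r"^(status|type|category|tier|level|role|kind)$"),
]

# Hypothetical subset of the PII hint patterns, checked in this order.
PII_PATTERNS = [
    ("EMAIL_ADDRESS", re.compile(r"e?mail")),
    ("PERSON", re.compile(r"name|surname")),
    ("PHONE_NUMBER", re.compile(r"phone|mobile|fax|(^|_)(cell|tel)($|_)")),
    ("US_SSN", re.compile(r"ssn|social_security")),
    ("IBAN_CODE", re.compile(r"iban")),
]


def infer_pii_type(column_name):
    """Return "SKIP", an entity type such as "PERSON", or None (unknown)."""
    name = column_name.strip().lower()
    # SKIP patterns win: they are checked before any PII pattern.
    if any(pattern.search(name) for pattern in SKIP_PATTERNS):
        return "SKIP"
    for entity_type, pattern in PII_PATTERNS:
        if pattern.search(name):
            return entity_type
    return None
```

Because SKIP is evaluated first, a name that matches both tables is never scanned, so the order of the two checks is part of the behavior and not just an optimization.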

Session Mapping

The SessionIndex maintains a dictionary mapping (entity_type, original_value) tuples to fake replacements. This ensures that, within a single run, the same original value always maps to the same fake value.

When the index encounters a new (type, value) pair, it generates a fake value using the Faker method mapped to that entity type and caches it. Subsequent lookups return the cached value.
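
The caching behavior can be sketched with a plain dictionary. Only the class name SessionIndex comes from this document. The constructor, the get method, and the counter-based generator that stands in for Faker are assumptions made for illustration.

```python
from itertools import count


class SessionIndex:
    """Map (entity_type, original_value) to a stable fake value (sketch)."""

    def __init__(self, generators):
        # generators: entity_type -> zero-argument callable producing a fake
        # value (in Lethe this role is played by Faker methods).
        self._generators = generators
        self._cache = {}

    def get(self, entity_type, original):
        key = (entity_type, original)
        if key not in self._cache:
            # First sighting: generate once and remember the result.
            self._cache[key] = self._generators[entity_type]()
        return self._cache[key]


# A deterministic stand-in generator so the behavior is easy to follow.
_person_numbers = count(1)
index = SessionIndex({"PERSON": lambda: f"Person {next(_person_numbers)}"})
```

Keying on the entity type as well as the value means the same string detected as two different types gets two independent replacements.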

Important: The multiply command generates a fresh Faker value for every row, even when the same original value appears more than once. This is by design: multiplied data is meant to create volume, not to preserve referential relationships.

Determinism & Reproducibility

When you pass --seed, Lethe produces byte-identical output on every run with the same input. This is achieved by seeding two independent random sources:

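Random Source  Seeded By     Controls
-------------  ------------  ----------------------------------------------------------------------------------------------------------
Faker.seed()   --seed value  All fake data generation: names, emails, phones, addresses, SSNs, IBANs, and every other replacement value
random.seed()  --seed value  random.choice() calls used to sample non-PII column values during multiply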

How it works in practice

Consider a dataset with 5 records, multiplied by factor 200 to produce 1,000 rows:

lethe multiply customers.csv --factor 200 --seed 42 -o big_dataset.csv

With seed 42, this will always produce the exact same 1,000 rows: the same names, same emails, and same sampled status values, in the same order. Run it again on another machine with the same Lethe and Faker versions and you get the same result.

Without --seed, both Faker and random use Python's default entropy source, so every run produces different output.
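
The effect of seeding on the sampling step can be shown with the standard library alone. This sketch uses a private random.Random instance instead of the global random.seed() call named in the table, and sample_column is a hypothetical helper. Faker seeding is not shown here.

```python
import random


def sample_column(values, n, seed=None):
    """Draw n values with replacement. seed=None uses fresh entropy."""
    rng = random.Random(seed)
    return [rng.choice(values) for _ in range(n)]


statuses = ["active", "active", "active", "active", "inactive"]
first_run = sample_column(statuses, 10, seed=42)
second_run = sample_column(statuses, 10, seed=42)
# first_run and second_run are identical because both start from seed 42.
```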

Session consistency

Within a single run (seeded or not), the SessionIndex guarantees that the same original value always maps to the same fake. If "John Smith" appears in 50 rows across your dataset, every occurrence is replaced with the same synthetic name. This is critical for preserving foreign-key relationships and data integrity.

Cross-table consistency

When anonymizing multiple related tables, use the same --seed value for all files. Shared values (like a customer name appearing in both customers.csv and orders.csv) will map to the same fake values, preserving referential integrity.

Important: The multiply command does not use the SessionIndex for consistency. It generates a fresh Faker value for every row, even when the same original value appears more than once. This is by design: multiplied data is meant to create volume, not preserve referential relationships. The anonymize command does use the SessionIndex.

Auditability

Seeded output is valuable for regulatory audits. You can re-run the anonymization at any time and verify that the results are identical, demonstrating that the process is controlled and repeatable. This supports GDPR Article 32 (security of processing) requirements.

Sanitizer

When using non-English locales, Faker may generate values containing accented characters, non-ASCII domain names, or other characters that break email and URL format conventions. The --sanitize flag applies post-processing to ensure compliance.

Email sanitization

Applied to columns classified as EMAIL_ADDRESS:

  1. Unicode transliteration: NFKD decomposition followed by ASCII encoding. Accented characters become their closest ASCII equivalent (e with accent becomes e, u with umlaut becomes u).
  2. Space replacement: Whitespace in the local part becomes underscores. Whitespace in the domain becomes dashes.
  3. Invalid character stripping: Only a-z A-Z 0-9 . _ + - are kept in the local part. Only a-z A-Z 0-9 . - in the domain.
  4. Edge case cleanup: Double dots are collapsed, leading/trailing dots and dashes are stripped. If the result is empty, fallbacks are applied (user for local, example.com for domain).
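
The four steps can be sketched as follows. The function names are hypothetical. The final check, which replaces a domain that has no dot left after cleanup, is an assumption inferred from the Ukrainian example in this section and is not one of the documented steps.

```python
import re
import unicodedata


def to_ascii(text):
    """NFKD-decompose, then drop everything that is not ASCII."""
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")


def _tidy(part):
    part = re.sub(r"\.{2,}", ".", part)   # collapse repeated dots
    return part.strip(".-")               # strip leading/trailing dots and dashes


def sanitize_email(email):
    # 1. Unicode transliteration.
    local, _, domain = to_ascii(email).rpartition("@")
    # 2. Space replacement.
    local = re.sub(r"\s+", "_", local)
    domain = re.sub(r"\s+", "-", domain)
    # 3. Invalid character stripping.
    local = re.sub(r"[^A-Za-z0-9._+-]", "", local)
    domain = re.sub(r"[^A-Za-z0-9.-]", "", domain)
    # 4. Edge case cleanup with fallbacks.
    local = _tidy(local) or "user"
    domain = _tidy(domain) or "example.com"
    if "." not in domain:
        # Assumption: a domain that lost its TLD is treated as invalid.
        domain = "example.com"
    return f"{local}@{domain}"
```

For the Ukrainian example in this section, the Cyrillic TLD is dropped by the ASCII step, the remaining domain has no dot, and the fallback example.com is used.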

URL sanitization

Applied to columns classified as URL:

  1. Unicode transliteration: Same NFKD decomposition as emails.
  2. Scheme enforcement: If no http:// or https:// scheme is present, https:// is prepended.
  3. Space replacement: Whitespace becomes dashes.
  4. Character filtering: Only standard URL characters are kept.
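
A minimal sketch of these steps follows. The function name sanitize_url is hypothetical, and because the text does not define "standard URL characters", the allow-list used here (the RFC 3986 unreserved and reserved characters plus the percent sign) is an assumption.

```python
import re
import unicodedata


def sanitize_url(url):
    # 1. Unicode transliteration (same NFKD approach as for emails).
    url = unicodedata.normalize("NFKD", url).encode("ascii", "ignore").decode("ascii")
    url = url.strip()
    # 2. Scheme enforcement.
    if not re.match(r"https?://", url, flags=re.IGNORECASE):
        url = "https://" + url
    # 3. Space replacement.
    url = re.sub(r"\s+", "-", url)
    # 4. Character filtering (assumed allow-list, see the note above).
    return re.sub(r"[^A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%]", "", url)
```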

Example: Ukrainian locale without and with sanitization

# Without --sanitize (may produce non-ASCII domains)
shevchenkovenedykt@baran-turkalo.укр    # Cyrillic TLD

# With --sanitize (ASCII-only output)
shevchenkovenedykt@example.com          # Invalid domain replaced

Cookbook: GDPR Compliance

Right to Erasure (Article 17)

When a data subject requests deletion, you can pseudo-anonymize their data as part of the process. This preserves the statistical value of your dataset while replacing all personal information with fakes. Note that pseudo-anonymization alone may not satisfy Article 17 obligations, since the output is still considered personal data under GDPR. Consult your DPO on whether deletion of the original data (and any session mapping) is also required.

# Anonymize with high sensitivity to catch all PII
lethe anonymize user_data.csv --threshold 0.2 -o user_data_clean.csv

Data Minimization (Article 5)

Before sharing data for analysis, anonymize it to only expose what is necessary:

# Anonymize customer data before sending to analytics team
lethe anonymize customers.csv --seed 42 -o customers_for_analytics.csv

Data Protection by Design (Article 25)

Integrate anonymization into your development workflow so that non-production environments never contain real personal data:

# Generate test data from production schema
lethe multiply production_sample.csv --factor 10 --seed 42 -o test_data.csv

Data Protection Impact Assessment

When your DPIA identifies that test environments process real personal data, Lethe provides a concrete technical measure to mitigate that risk. Document the use of seeded, reproducible anonymization in your DPIA as evidence of appropriate safeguards under Article 32.

Cookbook: Test Data Generation

Generate deterministic test fixtures

Use --seed to generate the exact same output every time. This is essential for tests that assert on specific values:

# Always produces identical output
lethe multiply seed_data.csv --factor 5 --seed 42 -o tests/fixtures/generated.csv

Scale test data for load testing

Start with a small representative sample and multiply it to the desired size:

# 50 rows -> 50,000 rows
lethe multiply sample.csv --factor 1000 -o load_test_data.csv

Generate locale-specific test data

Test your application with data from different regions:

# Dutch customers
lethe multiply template.csv --factor 100 --locale nl_NL --sanitize -o test_nl.csv

# German customers
lethe multiply template.csv --factor 100 --locale de_DE --sanitize -o test_de.csv

# French customers
lethe multiply template.csv --factor 100 --locale fr_FR --sanitize -o test_fr.csv

Tip: Always use --sanitize when generating data with non-English locales if your application validates email addresses or URLs. Without it, some locales may produce non-ASCII characters in these fields.

Cookbook: Multi-Locale Data

Supported locales

Lethe supports all Faker locales. Common ones include:

Locale  Region         Names                               Addresses                         Phones
------  -------------  ----------------------------------  --------------------------------  -----------
en_US   United States  English names                       US format                         US format
nl_NL   Netherlands    Dutch names                         Dutch format (street + postcode)  +31 format
de_DE   Germany        German names (may include umlauts)  German format                     +49 format
fr_FR   France         French names (accents)              French format                     +33 format
ja_JP   Japan          Japanese characters                 Japanese format                   +81 format
uk_UA   Ukraine        Cyrillic names                      Ukrainian format                  +380 format

When to use --sanitize

The --sanitize flag is recommended when:

You can safely omit --sanitize when:

Cookbook: CI/CD Integration

Generate test data in a CI pipeline

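# .github/workflows/test.yml
on: push                      # trigger is an example; adjust to your workflow
jobs:
  test:
    runs-on: ubuntu-latest    # a runner is required for the job to be valid
    steps:
      - uses: actions/checkout@v4
      - run: pip install lethe-cli
      # Seeded, so every CI run generates the same fixture file
      - run: |
          lethe multiply tests/seed_data.csv \
            --factor 10 \
            --seed 42 \
            -o tests/fixtures/generated.csv
      - run: pytest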

Anonymize production data for staging

#!/bin/bash
# Script to refresh staging database with anonymized production data
set -euo pipefail

# Export production data
psql "$PROD_DB" -c "COPY customers TO STDOUT CSV HEADER" > /tmp/customers.csv
psql "$PROD_DB" -c "COPY orders TO STDOUT CSV HEADER" > /tmp/orders.csv

# Anonymize with consistent seed so foreign keys stay valid
lethe anonymize /tmp/customers.csv --seed 42 --model sm -o /tmp/customers_anon.csv
lethe anonymize /tmp/orders.csv --seed 42 --model sm -o /tmp/orders_anon.csv

# Import into staging
psql "$STAGING_DB" -c "COPY customers FROM STDIN CSV HEADER" < /tmp/customers_anon.csv
psql "$STAGING_DB" -c "COPY orders FROM STDIN CSV HEADER" < /tmp/orders_anon.csv

# Clean up
rm /tmp/customers*.csv /tmp/orders*.csv

Important: When anonymizing multiple related tables, always use the same --seed value so that shared values (like customer names appearing in both tables) map to the same fake values, preserving referential integrity.

Cookbook: Third-Party Data Sharing

Share data with a vendor

# Anonymize before sending to an analytics vendor
lethe anonymize sales_data.csv --threshold 0.2 -o sales_data_safe.csv

# Verify by inspecting a few rows
head sales_data_safe.csv

Create demo datasets

# Start with a realistic schema and multiply
lethe multiply schema_sample.csv --factor 50 --locale en_US --seed 1 -o demo_data.csv

Prepare training data for ML

# Multiply a small labeled dataset for model training
lethe multiply labeled_samples.csv --factor 20 -o training_data.csv
# Non-PII columns (labels, categories, scores) are sampled from the original
# distribution, preserving class balance

Cookbook: Scaling Large Files

Reduce memory usage

The anonymize command streams data in chunks. Reduce the chunk size for lower memory usage at the cost of more processing overhead:

# Default: 5,000 rows per chunk
lethe anonymize huge_file.csv

# Lower memory: 500 rows per chunk
lethe anonymize huge_file.csv --chunk-size 500

# Higher throughput: 20,000 rows per chunk
lethe anonymize huge_file.csv --chunk-size 20000

Speed vs. accuracy trade-off

# Fastest: small model + high threshold
lethe anonymize data.csv --model sm --threshold 0.7

# Most accurate: transformer model + low threshold
lethe anonymize data.csv --model trf --threshold 0.2
Note: The multiply command reads the entire file into memory. For very large files, consider splitting the input, multiplying each part, and concatenating the results. Distribution sampling is then based on each part rather than on the whole dataset.
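
The splitting step mentioned in the note could look roughly like the sketch below, which cuts a CSV into parts that each repeat the header row. It uses only the standard library; the helper names split_rows and write_part and the part size are assumptions for illustration, not Lethe features.

```python
import csv
import io
from typing import Iterator, List, Tuple


def split_rows(reader, rows_per_part: int) -> Iterator[Tuple[List[str], List[List[str]]]]:
    """Yield (header, rows) pairs holding at most rows_per_part data rows each."""
    header = next(reader, None)
    if header is None:  # empty input: nothing to yield
        return
    part: List[List[str]] = []
    for row in reader:
        part.append(row)
        if len(part) == rows_per_part:
            yield header, part
            part = []
    if part:  # final, possibly shorter part
        yield header, part


def write_part(header: List[str], rows: List[List[str]]) -> str:
    """Serialize one part as CSV text, header first."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


source = io.StringIO("id,name\n1,a\n2,b\n3,c\n")
parts = [write_part(h, rows) for h, rows in split_rows(csv.reader(source), rows_per_part=2)]
print(parts)
```

In practice each part would be written to its own file and passed to lethe multiply separately. When concatenating the outputs, keep the header from the first part only.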

Cookbook: Multiply Use Cases

Populate a dev database

# Start with 10 seed records, produce 10,000
lethe multiply seed_customers.csv --factor 1000 --seed 42 -o dev_customers.csv

# Import into database
psql dev_db -c "COPY customers FROM STDIN CSV HEADER" < dev_customers.csv

Stress test an import pipeline

# Generate 1 million rows from 100 samples
lethe multiply sample.csv --factor 10000 -o stress_test.csv

Anonymize without adding rows

# factor=1 replaces all PII but keeps the same row count
# Faster than 'lethe anonymize' because it skips NLP
lethe multiply customers.csv --factor 1 -o customers_clean.csv
Note: With --factor 1, the multiply command only replaces PII based on column-name heuristics. It will not detect PII in unexpectedly-named columns (e.g., a column called notes containing names). For those cases, use lethe anonymize.

Generate localized demo data

# Create demo data for a Dutch client presentation
lethe multiply template.csv --factor 50 \
  --locale nl_NL --sanitize --seed 1 -o demo_nl.csv

Process Schematic: Anonymize (Structured Data)

Complete processing flow for CSV, TSV, and one-value-per-line files through the anonymize pipeline.

INPUT
  |
  v
detect_format(path)
  |
  |-- .csv ---------> TextFormat.CSV
  |-- .tsv ---------> TextFormat.TSV
  |-- .txt + tabs ---> TextFormat.TSV
  |-- .txt + short --> TextFormat.ONE_VALUE_PER_LINE
  |-- .txt + long ---> TextFormat.FREEFORM (see separate schematic)
  |
  v
make_reader(path, format, chunk_size)
  |-- CSV:       CsvReader  (pandas read_csv, chunksize=5000)
  |-- TSV:       TsvReader  (pandas read_csv, sep="\t", chunksize=5000)
  |-- ONE_VALUE: LineReader (iterate lines, skip blanks, col="value")
  |
  v
CHUNK LOOP (each chunk is a pd.DataFrame)
  |
  |   +---------------------------+
  |   | PiiScanner.scan_chunk()   |
  |   |                           |
  |   | For each column:          |
  |   |   1. infer_pii_type(name) |
  |   |      SKIP -> skip column  |
  |   |      HINT -> boost later  |
  |   |      None -> full scan    |
  |   |                           |
  |   |   2. Per cell:            |
  |   |      analyzer.analyze()   |
  |   |      boost_heuristic()    |
  |   |      pick_best()          |
  |   |      filter by threshold  |
  |   |                           |
  |   | Returns:                  |
  |   |   dict[col -> CellResults]|
  |   +---------------------------+
  |
  |   +---------------------------+
  |   | Replacer.replace_chunk()  |
  |   |                           |
  |   | For each detected cell:   |
  |   |   SessionIndex            |
  |   |     .get_or_create(       |
  |   |       entity_type,        |
  |   |       original_value      |
  |   |     )                     |
  |   |                           |
  |   | Whole-cell replacement:   |
  |   |   "John Smith" ->         |
  |   |   "Allison Hill"          |
  |   +---------------------------+
  |
  v
Writer.write_chunk(anonymized_df)
  |-- CSV:       CsvWriter  (QUOTE_ALL, append mode)
  |-- TSV:       TsvWriter  (tab-separated, no quoting)
  |-- ONE_VALUE: LineWriter (one value per line)
  |
  v
OUTPUT (input_anonymized.ext)
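
The get_or_create step in the schematic amounts to a cache keyed by entity type and original value. Here is a minimal sketch of that idea, assuming nothing about Lethe's actual class beyond the method name shown above. The class name SessionIndexSketch is hypothetical, and a numbered placeholder generator stands in for the Faker call.

```python
from typing import Callable, Dict, Tuple


class SessionIndexSketch:
    """Map (entity_type, original_value) to one stable replacement value."""

    def __init__(self, generate: Callable[[str], str]) -> None:
        self._generate = generate  # stand-in for the Faker-backed generator
        self._cache: Dict[Tuple[str, str], str] = {}

    def get_or_create(self, entity_type: str, original_value: str) -> str:
        key = (entity_type, original_value)
        if key not in self._cache:  # first sighting: generate and remember
            self._cache[key] = self._generate(entity_type)
        return self._cache[key]


# Hypothetical generator: numbered placeholders instead of Faker output.
counter = iter(range(1, 1_000_000))
index = SessionIndexSketch(lambda entity_type: f"{entity_type}_{next(counter)}")

first = index.get_or_create("PERSON", "John Smith")
again = index.get_or_create("PERSON", "John Smith")
other = index.get_or_create("PERSON", "Jane Doe")
print(first, again, other)
```

Repeated lookups of the same original value return the same replacement, which is what keeps a value consistent across rows and chunks within one run.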

Process Schematic: Anonymize (Free-form Text)

Free-form text uses a different replacement strategy that preserves surrounding text. Instead of replacing entire cells, only the PII substrings are swapped.

INPUT (letter.txt)
  |
  v
detect_format(path) --> TextFormat.FREEFORM
  |
  v
FreeformReader
  Read each line (including empty) into col "text"
  Empty lines preserved as paragraph separators
  Yield DataFrames in chunks of 5000 lines
  |
  v
CHUNK LOOP
  |
  |   +-----------------------------------------+
  |   | FreeformReplacer.replace_chunk()        |
  |   |                                         |
  |   | For each line in "text" column:         |
  |   |                                         |
  |   |  1. analyzer.analyze(text)              |
  |   |     Returns spans with char offsets:    |
  |   |     [RecognizerResult(                  |
  |   |        entity=PERSON,                   |
  |   |        start=5, end=15,                 |
  |   |        score=0.85                       |
  |   |     )]                                  |
  |   |                                         |
  |   |  2. Filter by threshold                 |
  |   |     score >= 0.35 ? keep : discard      |
  |   |                                         |
  |   |  3. Deduplicate overlapping spans       |
  |   |     Keep highest-scoring when overlap   |
  |   |                                         |
  |   |  4. Replace right-to-left               |
  |   |     Sort spans by offset descending     |
  |   |     Replace each substring:             |
  |   |                                         |
  |   |     "Dear John Smith, call 555-1234"    |
  |   |       |                                 |
  |   |       |  replace last span first        |
  |   |       v                                 |
  |   |     "Dear John Smith, call 555-9876"    |
  |   |       |                                 |
  |   |       |  then earlier spans             |
  |   |       v                                 |
  |   |     "Dear Allison Hill, call 555-9876"  |
  |   |                                         |
  |   |  (Right-to-left keeps earlier offsets   |
  |   |   valid after each substitution)        |
  |   +-----------------------------------------+
  |
  v
FreeformWriter
  Write each line as-is (one per line)
  No quoting, no CSV formatting
  |
  v
OUTPUT (letter_anonymized.txt)
Key difference from structured data: The normal Replacer does whole-cell replacement (the entire cell becomes a fake value). The FreeformReplacer does substring replacement (only the detected PII span is swapped, surrounding text is preserved). This is why "Dear John Smith, welcome" becomes "Dear Allison Hill, welcome" rather than being entirely replaced with a synthetic string.
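
The substring strategy can be sketched in a few lines of plain Python. This is an illustration of the technique (threshold filter, overlap removal, right-to-left substitution), not Lethe's FreeformReplacer. The Span type, the function names, and the scores other than 0.85 are assumptions made for the example.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    start: int        # inclusive character offset
    end: int          # exclusive character offset
    score: float
    replacement: str


def drop_overlaps(spans: List[Span]) -> List[Span]:
    """Where spans overlap, keep only the highest-scoring one."""
    kept: List[Span] = []
    for span in sorted(spans, key=lambda s: s.score, reverse=True):
        if all(span.end <= k.start or span.start >= k.end for k in kept):
            kept.append(span)
    return kept


def replace_spans(text: str, spans: List[Span], threshold: float = 0.35) -> str:
    """Swap each confident span for its replacement, last span first."""
    confident = [s for s in spans if s.score >= threshold]
    # Right-to-left order keeps the offsets of earlier spans valid.
    for span in sorted(drop_overlaps(confident), key=lambda s: s.start, reverse=True):
        text = text[:span.start] + span.replacement + text[span.end:]
    return text


line = "Dear John Smith, call 555-1234"
spans = [
    Span(5, 15, 0.85, "Allison Hill"),   # "John Smith"
    Span(22, 30, 0.75, "555-9876"),      # "555-1234" (score assumed)
    Span(10, 15, 0.40, "Hill"),          # overlaps the first span, lower score
]
print(replace_spans(line, spans))  # Dear Allison Hill, call 555-9876
```

Replacing left-to-right would also work if every later offset were shifted by the length difference of each substitution. Going right-to-left avoids that bookkeeping.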

Process Schematic: Multiply

The multiply pipeline reads the entire file, classifies columns, then generates synthetic rows.

INPUT (customers.csv or data.tsv)
  |
  v
detect_format(path)
  |-- CSV or TSV: proceed
  |-- FREEFORM or ONE_VALUE_PER_LINE: ERROR, exit
  |
  v
pd.read_csv(path, sep)   (entire file into memory)
  |
  v
Multiplier.classify_columns()
  |
  |   For each column, call infer_pii_type(column_name):
  |
  |   returns PII hint -> PII bucket (e.g., first_name -> PERSON)
  |   returns SKIP     -> check further:
  |                         numeric + unique + monotonic?
  |                           yes -> ID bucket (auto-increment)
  |                           no  -> SAMPLE bucket (random sample from originals)
  |   returns None     -> SAMPLE bucket
  |
  v
Phase 1: Anonymize originals (rows 0..N-1)
  |
  |   For each original row:
  |     PII columns    -> Faker.generate() (fresh value)
  |     ID columns     -> keep original value
  |     SAMPLE columns -> keep original value
  |
  v
Phase 2: Generate synthetic rows (rows N..N*factor-1)
  |
  |   For each synthetic row:
  |     PII columns    -> Faker.generate() (fresh value per row)
  |     ID columns     -> sequential from max(original) + 1
  |     SAMPLE columns -> random.choice(original_values)
  |
  v
Sanitize (optional, if --sanitize flag)
  |   Fix emails: NFKD -> ASCII, strip invalid chars
  |   Fix URLs:   NFKD -> ASCII, ensure scheme
  |
  v
Writer.write_chunk(result_df)
  |-- CSV: CsvWriter (QUOTE_ALL)
  |-- TSV: TsvWriter (tab-separated)
  |
  v
OUTPUT (customers_multiplied.csv)
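
For the non-PII columns, the ID and SAMPLE rules in the schematic can be approximated as below. This is a simplified sketch under stated assumptions: it skips PII columns and the column-name heuristic entirely, treats "numeric + unique + monotonic" as "strictly increasing integers", and the names is_id_column and extend_columns are hypothetical.

```python
import random
from typing import Dict, List


def is_id_column(values: List[str]) -> bool:
    """True when the values are integers in strictly increasing order."""
    try:
        numbers = [int(v) for v in values]
    except ValueError:
        return False
    return bool(numbers) and all(a < b for a, b in zip(numbers, numbers[1:]))


def extend_columns(columns: Dict[str, List[str]], factor: int, seed: int = 0) -> Dict[str, List[str]]:
    """Keep the N original values and append N * (factor - 1) synthetic ones."""
    rng = random.Random(seed)
    result: Dict[str, List[str]] = {}
    for name, values in columns.items():
        extra = len(values) * (factor - 1)
        if is_id_column(values):
            start = int(values[-1]) + 1  # last value is the max when increasing
            new = [str(start + i) for i in range(extra)]
        else:
            new = [rng.choice(values) for _ in range(extra)]  # sample originals
        result[name] = values + new
    return result


columns = {"id": ["1", "2", "3"], "status": ["active", "inactive", "active"]}
out = extend_columns(columns, factor=3, seed=42)
print(out["id"])
print(len(out["status"]))
```

Sampling with replacement from the original values is what keeps the class balance of a SAMPLE column close to the input on average. It does not guarantee the exact same proportions.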

Process Schematic: PII Detection

Detailed flow of how a single cell is evaluated for PII during the anonymize pipeline. This runs for every cell in every non-skipped column.

Column: "notes"
Cell value: "Contact John Smith at john@example.com"
  |
  v
Layer 1: Column Heuristic
  |
  |   infer_pii_type("notes")
  |
  |   Check SKIP patterns: id, timestamp, boolean, status...
  |     no match
  |
  |   Check PII patterns: name, email, phone, ssn...
  |     no match
  |
  |   Result: None (unknown column, full NLP scan required)
  |
  v
Layer 2: Presidio NLP Analysis
  |
  |   analyzer.analyze(
  |     text="Contact John Smith at john@example.com",
  |     language="en"
  |   )
  |
  |   spaCy NER identifies:
  |     "John Smith" -> PERSON (score: 0.85)
  |
  |   Presidio pattern recognizers identify:
  |     "john@example.com" -> EMAIL_ADDRESS (score: 1.0)
  |
  v
Layer 3: Confidence Boosting
  |
  |   Heuristic was None (no hint), so:
  |     No boost applied
  |     No synthetic result created
  |     Scores unchanged
  |
  |   If heuristic matched (e.g., col "email"):
  |     NLP agrees on type -> +0.25 boost (capped at 1.0)
  |     NLP found nothing  -> synthesize result at 0.4
  |
  v
pick_best(results)
  |
  |   Select highest-scoring result
  |   Compare against threshold (default 0.35)
  |
  |   0.85 >= 0.35 ? YES -> mark for replacement
  |
  v
CellResult
  entity_type: "PERSON"
  score: 0.85
  |
  v
SessionIndex.get_or_create("PERSON", "John Smith")
  |
  |   Key: ("PERSON", "John Smith")
  |   Not in cache -> Faker.name() -> "Allison Hill"
  |   Cache: {("PERSON","John Smith"): "Allison Hill"}
  |
  |   Next occurrence of "John Smith" as PERSON
  |     -> cache hit -> same "Allison Hill"
  |
  v
RESULT
  Cell value replaced: "Allison Hill"
Note: For structured data (CSV/TSV), the scanner processes one column at a time and picks the single best entity type per cell. For free-form text, all detected entities in a line are replaced independently using character offsets.
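
The boosting and selection rules from the schematic (+0.25 capped at 1.0 when the NLP result agrees with the column hint, a synthesized result at 0.4 when NLP finds nothing, then a threshold check on the best score) can be written out as a small sketch. The Detection type and both function signatures are assumptions, and the case where NLP finds a different type than the hint is left unchanged here because the schematic does not specify it.

```python
from typing import List, NamedTuple, Optional


class Detection(NamedTuple):
    entity_type: str
    score: float


def boost(detections: List[Detection], hint: Optional[str]) -> List[Detection]:
    """Apply the column-name hint to the NLP detections for one cell."""
    if hint is None:
        return list(detections)            # no hint: scores unchanged
    if not detections:
        return [Detection(hint, 0.4)]      # NLP found nothing: synthesize
    return [
        Detection(d.entity_type, min(1.0, d.score + 0.25))
        if d.entity_type == hint else d    # other types: left as-is (assumption)
        for d in detections
    ]


def pick_best(detections: List[Detection], threshold: float = 0.35) -> Optional[Detection]:
    """Return the highest-scoring detection, or None if below threshold."""
    if not detections:
        return None
    best = max(detections, key=lambda d: d.score)
    return best if best.score >= threshold else None


print(pick_best(boost([Detection("EMAIL_ADDRESS", 0.5)], hint="EMAIL_ADDRESS")))
print(pick_best(boost([], hint="PHONE_NUMBER")))
print(pick_best(boost([Detection("PERSON", 0.3)], hint=None)))
```

The three calls cover a boosted match, a synthesized result that clears the default threshold, and an unhinted low-score detection that is discarded.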

Reference: PII Entity Types

These entity types are recognized by both the heuristic and NLP layers:

| Entity Type | Description | Example Original | Example Replacement |
| --- | --- | --- | --- |
| PERSON | Full name, first name, last name | John Smith | Allison Hill |
| EMAIL_ADDRESS | Email address | john@example.com | garzaanthony@example.org |
| PHONE_NUMBER | Phone/mobile/fax number | +1-555-0101 | +1-555-9876 |
| LOCATION | Street address, city, state, zip | 742 Evergreen Terrace | 123 Oak Lane, Springfield |
| DATE_TIME | Date of birth | 1990-05-15 | 1985-11-23 |
| US_SSN | Social Security Number | 123-45-6789 | 987-65-4321 |
| CREDIT_CARD | Credit card number | 4111 1111 1111 1111 | 5234 5678 9012 3456 |
| US_DRIVER_LICENSE | Driver's license number | DL12345678 | AB9876543 |
| IBAN_CODE | International bank account | NL91ABNA0417164300 | DE89370400440532013000 |
| IP_ADDRESS | IPv4 address | 192.168.1.100 | 10.45.67.89 |
| US_PASSPORT | Passport number | 123456789 | 987654321 |
| UK_NINO | UK National Insurance Number | AB123456C | CD987654A |
| NL_BSN | Dutch Citizen Service Number | 123456789 | 987654321 |
| URL | Web URL | https://example.com | https://fake-site.net |
| CRYPTO | Cryptographic hash | a1b2c3d4... | e5f6g7h8... |

Reference: Skip Patterns

Columns matching these patterns are never scanned for PII. In the multiply command, they are either treated as ID columns (if the values are unique, monotonic integers) or as sample columns (values are randomly drawn from the existing distribution).

Pattern        Regex                                                   Examples
ID             ^(id|_id|pk|key|uuid|guid|row_?num)$                    id, pk, uuid
Timestamp      (created|updated|...|_at|_on|_date|_time)$              created_at, updated_on
Boolean        ^(is_|has_|was_|can_|should_|active|enabled|flag)       is_active, has_paid
Numeric        ^(count|total|amount|price|cost|rate|score|age|...)$    amount, price
Status         ^(status|state|type|category|tier|level|role|kind)$     status, role
Currency       ^(currency|currency_code)$                              currency
Country Code   ^(country_code|lang|language|locale)$                   country_code
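
To see how such patterns classify a column name, here is a minimal sketch using Python's re module. It includes only the five patterns that are given in full above; the Timestamp and Numeric patterns are elided in the table and are deliberately left out. The function name skip_category and the case-sensitive matching are assumptions.

```python
import re
from typing import Optional

# Only the patterns listed in full in the table; the elided ones are omitted.
SKIP_PATTERNS = {
    "ID": re.compile(r"^(id|_id|pk|key|uuid|guid|row_?num)$"),
    "Boolean": re.compile(r"^(is_|has_|was_|can_|should_|active|enabled|flag)"),
    "Status": re.compile(r"^(status|state|type|category|tier|level|role|kind)$"),
    "Currency": re.compile(r"^(currency|currency_code)$"),
    "Country Code": re.compile(r"^(country_code|lang|language|locale)$"),
}


def skip_category(column_name: str) -> Optional[str]:
    """Return the first skip category whose pattern matches, else None."""
    for category, pattern in SKIP_PATTERNS.items():
        if pattern.search(column_name):
            return category
    return None


for name in ["id", "is_active", "role", "notes", "customer_id"]:
    print(name, "->", skip_category(name))
```

Note that the ID pattern as listed is anchored at both ends, so a name such as customer_id does not match it in this sketch.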

Reference: Faker Generators

Each PII entity type is mapped to a specific Faker method for generating realistic replacements:

| Entity Type | Faker Method | Extra Arguments |
| --- | --- | --- |
| PERSON | faker.name() | |
| EMAIL_ADDRESS | faker.email() | |
| PHONE_NUMBER | faker.phone_number() | |
| LOCATION | faker.address() | |
| US_SSN | faker.ssn() | |
| CREDIT_CARD | faker.credit_card_number() | |
| DATE_TIME | faker.date() | |
| IP_ADDRESS | faker.ipv4() | |
| IBAN_CODE | faker.iban() | |
| US_DRIVER_LICENSE | faker.bothify() | text="??#######" |
| US_PASSPORT | faker.bothify() | text="#########" |
| UK_NINO | faker.bothify() | text="??######?" |
| NL_BSN | faker.numerify() | text="#########" |
| URL | faker.url() | |
| CRYPTO | faker.sha1() | |
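
In the bothify and numerify patterns, # stands for a random digit and ? for a random letter. The sketch below is a standard-library stand-in that mimics that expansion so the shapes in the table are easy to check. It is not Faker's implementation. The function name expand_pattern is hypothetical, and uppercase letters are an assumption chosen to match the example values shown earlier (Faker's own bothify draws from both cases unless told otherwise).

```python
import random
import string


def expand_pattern(text: str, rng: random.Random, letters: str = string.ascii_uppercase) -> str:
    """Replace each '#' with a digit and each '?' with a letter."""
    out = []
    for ch in text:
        if ch == "#":
            out.append(rng.choice(string.digits))
        elif ch == "?":
            out.append(rng.choice(letters))
        else:
            out.append(ch)  # any other character is copied through
    return "".join(out)


rng = random.Random(42)
for label, pattern in [
    ("US_DRIVER_LICENSE", "??#######"),
    ("US_PASSPORT", "#########"),
    ("UK_NINO", "??######?"),
]:
    print(label, expand_pattern(pattern, rng))
```

Values generated this way only follow the shape of the pattern. They are not checked against any issuing authority's validity rules.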

Reference: Custom Recognizers

Lethe extends Presidio with three custom pattern recognizers for entity types not well-covered by the defaults:

IBAN Recognizer

Matches International Bank Account Numbers. Pattern: [A-Z]{2}\d{2}[A-Z0-9]{4}\d{7}... with score 0.7. Context words: iban, account, bank.

UK NINO Recognizer

Matches UK National Insurance Numbers (e.g., AB 12 34 56 C) with score 0.7. Known-invalid prefixes (BG, GB, NK, etc.) are excluded. Context words: nino, national insurance, ni number.

Dutch BSN Recognizer

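Matches Dutch Citizen Service Numbers (Burgerservicenummer). Pattern: 9-digit number with a low base score of 0.3, since bare 9-digit numbers are ambiguous. Because 0.3 is below the default threshold of 0.35, a bare match is not replaced unless something raises its score, such as a column-name hint. Context words: bsn, burgerservicenummer, sofi.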

Reference: Configuration

All configuration is passed via CLI flags. There is no configuration file. The LetheConfig dataclass holds all settings:

| Setting | Type | Default | Description |
| --- | --- | --- | --- |
| model | string | "trf" | spaCy model alias. "trf" maps to en_core_web_trf, "sm" to en_core_web_sm |
| threshold | float | 0.35 | Minimum confidence score to classify a cell as PII. Lower = more sensitive. |
| chunk_size | int | 5000 | Rows per chunk in streaming mode. Only affects anonymize. |
| locale | string | "en_US" | Faker locale. Affects names, addresses, phone formats, etc. |
| seed | int or null | null | Random seed for reproducibility. Same seed = same output. |
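
As an illustration of the settings above, a dataclass with the same fields and defaults might look like the sketch below. The class name LetheConfigSketch, the validation in __post_init__, and the spacy_model property are assumptions added for the example. Only the field names, types, defaults, and the two model aliases come from the table.

```python
from dataclasses import dataclass
from typing import Optional

# Alias mapping as described in the table.
MODEL_ALIASES = {"trf": "en_core_web_trf", "sm": "en_core_web_sm"}


@dataclass
class LetheConfigSketch:
    model: str = "trf"
    threshold: float = 0.35
    chunk_size: int = 5000
    locale: str = "en_US"
    seed: Optional[int] = None

    def __post_init__(self) -> None:
        # Validation is an assumption; the real dataclass may not do this.
        if self.model not in MODEL_ALIASES:
            raise ValueError(f"unknown model alias: {self.model!r}")
        if not 0.0 <= self.threshold <= 1.0:
            raise ValueError("threshold must be between 0 and 1")
        if self.chunk_size <= 0:
            raise ValueError("chunk_size must be positive")

    @property
    def spacy_model(self) -> str:
        """Full spaCy package name for the configured alias."""
        return MODEL_ALIASES[self.model]


default = LetheConfigSketch()
fast = LetheConfigSketch(model="sm", threshold=0.7, seed=42)
print(default.spacy_model, default.threshold, default.chunk_size)
print(fast.spacy_model, fast.threshold, fast.seed)
```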

Threshold tuning guide

| Threshold | Behavior | Use When |
| --- | --- | --- |
| 0.2 | Very sensitive, catches edge cases, higher false positive rate | Maximum privacy, GDPR compliance, "better safe than sorry" |
| 0.35 (default) | Balanced sensitivity | General-purpose anonymization |
| 0.5 | Moderate, fewer false positives | Well-structured data with clear column names |
| 0.7+ | Conservative, only high-confidence detections | When preserving data fidelity is critical |

Reference: Output Formats

Lethe preserves the input file format in the output. The output extension matches the input extension.

CSV output

All CSV output uses csv.QUOTE_ALL quoting. Every field is enclosed in double quotes, regardless of whether it contains special characters. This ensures that:

"id","first_name","last_name","email","status"
"1","Allison Hill","Roman","garzaanthony@example.org","active"
"2","Lindsey Cameron","Maddox","smiller@example.net","active"
"3","Melinda Henderson","Jordan","richard13@example.net","inactive"

Input files do not need to be quoted. Lethe reads both quoted and unquoted CSV.
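
The quoting style shown above can be reproduced with Python's standard csv module. This is a minimal sketch, not Lethe's CsvWriter, and the sample row is made up to show how embedded commas and quotes are handled.

```python
import csv
import io

rows = [
    ["id", "first_name", "note"],
    [1, "Allison Hill", 'said "hi", then left'],
]

buffer = io.StringIO()
# QUOTE_ALL wraps every field in double quotes, including numbers.
writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
writer.writerows(rows)
print(buffer.getvalue(), end="")

# Reading the text back recovers the original field boundaries.
parsed = list(csv.reader(io.StringIO(buffer.getvalue())))
print(parsed[1])
```

Embedded double quotes are doubled on output, and the comma inside the quoted field does not split it when the file is read back.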

TSV output

TSV output uses tab separation with no quoting (standard TSV convention). Fields are separated by a single tab character.

id	first_name	last_name	email	status
1	Allison Hill	Roman	garzaanthony@example.org	active
2	Lindsey Cameron	Maddox	smiller@example.net	active

TXT output (one-value-per-line)

Each value is written on its own line, with no quoting or formatting. Empty lines are skipped on input and not produced on output.

garzaanthony@example.org
smiller@example.net
richard13@example.net

TXT output (free-form text)

Lines are output as-is with PII substrings replaced inline. Empty lines (paragraph separators) are preserved.

Dear Allison Hill,

We are writing to confirm your appointment at our Springfield office. Your account
manager, Lindsey Cameron, will be available to assist you on Monday.

Please bring your ID and proof of address to 123 Oak Lane, Springfield.
You can reach us at 555-9876 or email garzaanthony@example.org for any questions.

Reference: Destructive Actions

By default, Lethe never modifies or deletes your input files. All output is written to a separate file. The --clean flag changes this behavior by deleting the original input file after successful processing.

The --clean flag

When passed to lethe anonymize or lethe multiply, --clean deletes the original input file only after the output file has been written successfully. This is useful in pipelines where the original PII-containing file must not persist on disk.

Important: The --clean flag requires --confirm-clean to actually take effect. Using --clean alone will produce an error and exit without processing.
# This will fail with an error
lethe anonymize data.csv --clean

# This works: anonymizes, then deletes data.csv
lethe anonymize data.csv --clean --confirm-clean

Safety guarantees

Convention: --confirm-<action>

Lethe follows a general pattern for destructive CLI actions: any flag that causes irreversible data loss requires a paired confirmation flag. The naming convention is --confirm-<action>. This makes destructive operations safe for scripting and CI/CD, where interactive prompts are not practical. If you are running Lethe in an automated pipeline, include both flags explicitly.
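
The paired-flag convention can be sketched with argparse from the standard library. This is a stand-in for illustration only: Lethe is built on Typer and its actual option handling may differ. The program name lethe-sketch and the error message are hypothetical.

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="lethe-sketch")
    parser.add_argument("input")
    parser.add_argument("--clean", action="store_true",
                        help="delete the input file after successful processing")
    parser.add_argument("--confirm-clean", action="store_true",
                        help="required confirmation for --clean")
    return parser


def parse(argv: List[str]) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    # Refuse the destructive flag unless its paired confirmation is present.
    if args.clean and not args.confirm_clean:
        parser.error("--clean requires --confirm-clean")  # exits with status 2
    return args


args = parse(["data.csv", "--clean", "--confirm-clean"])
print(args.clean, args.confirm_clean)
```

Because the check runs before any processing starts, a script that forgets the confirmation flag fails fast instead of silently keeping or deleting the input.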

Reference: Anonymization vs. Pseudo-anonymization

These terms are often used interchangeably, but they have distinct legal and technical meanings, particularly under GDPR.

| | Anonymization | Pseudo-anonymization |
| --- | --- | --- |
| What happens to PII | Irreversibly removed, deleted, or aggregated. The original values are gone permanently. | Replaced with realistic fake values. The data structure is preserved, but the real PII is swapped for generated alternatives. |
| Reversibility | Impossible. No party, including the one who performed the anonymization, can recover the original data. | Theoretically reversible if someone has access to the mapping table (e.g., the SessionIndex). Without the mapping, re-identification is practically infeasible. |
| GDPR status | Not personal data (Recital 26). Falls entirely outside the scope of GDPR. No consent, lawful basis, or data subject rights obligations apply. | Still personal data (Article 4(5)). GDPR obligations still apply, but pseudo-anonymization is recognized as an appropriate safeguard (Recital 28, Article 25, Article 32) that may simplify compliance. |
| Data utility | Lower. Removing or aggregating values reduces the data's usefulness for development, testing, and analysis. | Higher. Replaced values are structurally identical to the originals (same types, formats, distributions), so the data remains useful for development, testing, analytics, and ML training. |
| Techniques | Deletion, aggregation (averages, counts), generalization (exact age to age range, city to country), k-anonymity, differential privacy. | Value substitution with consistent fakes (what Lethe does), tokenization, encryption with key management. |
| Use cases | Publishing open datasets, permanent data retention after deletion requests, removing GDPR obligations entirely for a dataset. | Dev/test environments, third-party data sharing, CI/CD fixtures, demo environments, analytics, debugging with realistic data. |

What Lethe does

Lethe performs pseudo-anonymization. It replaces PII with Faker-generated values via the SessionIndex mapping. It does not delete, aggregate, or generalize data. The output preserves full structural fidelity (foreign keys, column distributions, data types) while replacing real personal information with realistic fakes.

Lethe does not persist the SessionIndex mapping to disk. Once the process exits, the mapping is gone and only the replacement values remain, in the output file. Without the original input, recovering the original PII from the output is not feasible. However, if the original input file is retained alongside the output, a determined party could reconstruct the mapping by comparing the two.

When you need true anonymization

If your use case requires data that falls entirely outside GDPR scope, you need true anonymization rather than pseudo-anonymization. This means irreversibly removing all identifying information with no possibility of re-identification. Techniques include aggregation (replacing individual records with statistical summaries), generalization (broadening values, e.g. exact age to age range), or complete deletion of PII columns. Lethe is not the right tool for true anonymization.

Reference: Definitions

Key technologies and concepts used by Lethe.

Technologies

| Term | Description |
| --- | --- |
| spaCy | Open-source NLP library for tokenization and named entity recognition (NER). Runs entirely on your machine, with no network calls. |
| RoBERTa | The transformer architecture used by en_core_web_trf. A ~125M parameter model from Facebook/Meta AI, fine-tuned by Explosion (spaCy) for English NER. It is not a generative model; it only classifies tokens. |
| Presidio | Microsoft's open-source PII detection framework. Combines NLP models with pattern recognizers (regex). Lethe uses its AnalyzerEngine but not its AnonymizerEngine; replacement logic is handled by Lethe's own mapping layer. |
| Faker | Python library for generating realistic fake data (names, emails, addresses, phone numbers). Supports 50+ locales for locale-appropriate output. |
| Pandas | DataFrame library used for chunked file I/O and data manipulation. Handles CSV/TSV reading and writing. |
| Typer | CLI framework built on Click. Handles argument parsing, help text, and command routing. |
| Rich | Terminal formatting library for progress bars, tables, and colored output during processing. |

Data lifecycle

| Category | Details |
| --- | --- |
| What is kept | Output files (anonymized or multiplied), session mapping (in memory only, not persisted to disk), command history (standard shell history). |
| What is destroyed | With --clean --confirm-clean, the original input file is deleted after successful processing. Without --clean, nothing is ever deleted. PII values are replaced in the output file only; the input file is never modified. |