Metadata-Version: 2.4
Name: gtfs-digester
Version: 0.3.0
Summary: Canonicalization, fingerprinting, and change detection for GTFS Schedule feeds
Author: Chris Alfano
Author-email: Chris Alfano <chris@jarv.us>
License-Expression: MIT
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Topic :: Scientific/Engineering :: GIS
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Requires-Dist: blake3>=1.0.8
Requires-Dist: click>=8.3.2
Requires-Dist: fsspec>=2026.3.0
Requires-Dist: gcsfs>=2026.3.0
Requires-Dist: polars>=1.39.3
Requires-Dist: pyarrow>=23.0.1
Requires-Python: >=3.11
Project-URL: Homepage, https://github.com/JarvusInnovations/gtfs-digester
Project-URL: Repository, https://github.com/JarvusInnovations/gtfs-digester
Project-URL: Issues, https://github.com/JarvusInnovations/gtfs-digester/issues
Description-Content-Type: text/markdown

# gtfs-digester

Canonicalization, fingerprinting, and change detection for GTFS Schedule feeds.

## What It Does

Takes a GTFS zip and produces:

- A **content fingerprint** (BLAKE3 merkle hash) — identical for semantically identical feeds regardless of zip metadata, file ordering, CSV whitespace, or time formatting
- **Canonical Arrow tables** for each file — normalized, sorted by primary key, all values as strings
- **Per-file hashes** for efficient hierarchical change detection
- **Archive diffs** — added/removed/modified files and rows by primary key
- **Exploded parquet storage** — write/read from local or cloud (GCS, S3) via fsspec

All files and columns are preserved, including non-standard ones.

## Install

```bash
pip install gtfs-digester
# or
uv add gtfs-digester
```

## CLI

```bash
# Fingerprint a feed
gtfs-digester digest google_transit.zip
# Fingerprint: v1:abc123...
# Files: 11
#   agency.txt        1 rows
#   stops.txt      9245 rows
#   ...

# JSON output or just the hash
gtfs-digester digest --json google_transit.zip
gtfs-digester digest --quiet google_transit.zip

# Diff two feeds (exit code 1 if different)
gtfs-digester diff old.zip new.zip

# Write as exploded parquet
gtfs-digester write google_transit.zip ./output --schedule-url https://example.com/gtfs.zip

# Produce a normalized GTFS zip
gtfs-digester normalize messy.zip clean.zip
```
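
The `diff` exit code can also drive a script. The following is a minimal sketch that wraps the CLI from Python. It assumes that exit code 0 means the feeds are identical and that any code other than 0 or 1 signals an error; only the "1 if different" behavior is stated above. The `feeds_differ` helper and its `command` parameter are hypothetical names used for illustration.

```python
import subprocess


def feeds_differ(old_zip: str, new_zip: str, command=("gtfs-digester", "diff")) -> bool:
    """Return True when the diff command reports that the two feeds differ."""
    result = subprocess.run(
        [*command, old_zip, new_zip],
        capture_output=True,
        text=True,
    )
    if result.returncode == 0:
        # Assumption: 0 means the feeds are semantically identical.
        return False
    if result.returncode == 1:
        # Documented behavior: exit code 1 if the feeds are different.
        return True
    # Assumption: any other exit code is treated as a failure of the command itself.
    raise RuntimeError(
        f"diff command failed with exit code {result.returncode}: {result.stderr.strip()}"
    )
```

A scheduled job could call `feeds_differ("old.zip", "new.zip")` and publish a new feed version only when it returns `True`.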

## Python API

```python
from gtfs_digester import GTFSArchive

# Load and fingerprint
archive = GTFSArchive.from_zip("google_transit.zip")
print(archive.fingerprint.hex())  # v1:abc123...

# Access canonical tables
stops = archive.arrow_table("stops.txt")
print(stops.num_rows)

# Compare two feed versions
old = GTFSArchive.from_zip("old.zip")
new = GTFSArchive.from_zip("new.zip")
diff = old.diff(new)
print(diff.is_identical)
for f in diff.modified_files:
    fd = diff.file_diff(f)
    print(f"{f}: {fd.summary()}")

# Write as exploded parquet
from gtfs_digester import write_exploded
write_exploded(archive, "gs://bucket/schedules/feed-1", schedule_url="https://...")
```

## How Fingerprinting Works

1. Each `.txt` file is parsed into a PyArrow table, with every column read as a string.
2. Columns are reordered to follow the GTFS spec; unknown columns are preserved and sorted alphabetically after the spec columns.
3. Values are normalized: whitespace is stripped and times are zero-padded (`9:05:00` → `09:05:00`).
4. Rows are sorted by primary key, with numeric columns sorted numerically rather than lexicographically.
5. Each file is serialized as canonical CSV and hashed with BLAKE3.
6. The archive fingerprint is the BLAKE3 hash of the sorted `filename:hash` pairs (a Merkle tree).
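
Steps 3 to 5 can be illustrated with a small standard-library sketch. It is not the package's implementation: it skips column reordering, applies time padding to any value that looks like a time, treats integer-like key values as numeric, and uses `hashlib.blake2b` as a stand-in for BLAKE3 so that it runs without extra dependencies. All function names here are hypothetical.

```python
import csv
import hashlib
import io
import re

# Matches GTFS-style times such as "9:05:00" or "25:10:00".
_TIME = re.compile(r"^(\d+):(\d{2}):(\d{2})$")


def normalize_value(value: str) -> str:
    """Strip whitespace and zero-pad the hour of time-like values."""
    value = value.strip()
    match = _TIME.match(value)
    if match:
        hours, minutes, seconds = match.groups()
        return f"{int(hours):02d}:{minutes}:{seconds}"
    return value


def sort_key(value: str) -> tuple:
    """Order integer-like values numerically and everything else as text."""
    if re.fullmatch(r"-?\d+", value):
        return (0, int(value), "")
    return (1, 0, value)


def canonical_file_hash(csv_text: str, key_columns: list[str]) -> str:
    """Hash a CSV so that whitespace, time padding, and row order do not matter."""
    reader = csv.reader(io.StringIO(csv_text))
    header = [name.strip() for name in next(reader)]
    rows = [[normalize_value(v) for v in row] for row in reader if row]

    # Sort rows by the primary-key columns only.
    key_idx = [header.index(name) for name in key_columns]
    rows.sort(key=lambda row: [sort_key(row[i]) for i in key_idx])

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    # Stand-in digest: the package uses BLAKE3; BLAKE2b keeps this stdlib-only.
    return hashlib.blake2b(out.getvalue().encode("utf-8"), digest_size=32).hexdigest()
```

With this sketch, two `stop_times.txt` variants that differ only in row order, padding, or surrounding whitespace produce the same hash, while a changed value produces a different one.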

Files that are not part of the GTFS spec are preserved as well, with their rows sorted lexicographically.
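
The final step, combining per-file hashes into one archive fingerprint, might look like the sketch below. The exact byte layout is an assumption: pairs are joined with newlines here, `hashlib.blake2b` stands in for BLAKE3, and the `v1:` prefix simply mirrors the fingerprint format shown in the CLI output. `archive_fingerprint` and `changed_files` are hypothetical helpers, not part of the package API.

```python
import hashlib


def archive_fingerprint(file_hashes: dict[str, str]) -> str:
    """Combine per-file hashes into one digest that ignores file order."""
    lines = [f"{name}:{digest}" for name, digest in sorted(file_hashes.items())]
    # Assumption: pairs are newline-joined; the real serialization may differ.
    payload = "\n".join(lines).encode("utf-8")
    # Stand-in digest: the package uses BLAKE3.
    return "v1:" + hashlib.blake2b(payload, digest_size=32).hexdigest()


def changed_files(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """List files whose hash differs or that exist on only one side."""
    names = old.keys() | new.keys()
    return sorted(name for name in names if old.get(name) != new.get(name))
```

This is what makes change detection hierarchical: if two archive fingerprints match, nothing else needs to be compared, and if they differ, only the files returned by something like `changed_files` need a row-level diff.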

## Storage Layout

`write_exploded()` produces a version-first directory:

```text
base_path/
  _feed_digest={v1:abc...}/
    agency.parquet
    stops.parquet
    routes.parquet
    trips.parquet
    stop_times.parquet
    ...
    metadata.json       # provenance + digester info, written last (commit marker)
```

DuckDB reads across versions with hive partitioning:

```sql
SELECT * FROM read_parquet('base_path/_feed_digest=*/stops.parquet', hive_partitioning=true);
```
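
Because `metadata.json` is written last, a reader can treat its presence as a sign that a version directory is complete. The sketch below lists committed versions on a local filesystem with `pathlib`; it assumes the `_feed_digest=<value>` directory naming shown in the layout above, and `committed_versions` is a hypothetical helper rather than part of the package. Cloud paths would need an fsspec-based equivalent.

```python
from pathlib import Path


def committed_versions(base_path: str) -> list[str]:
    """Return digests of version directories that contain metadata.json."""
    versions = []
    for version_dir in sorted(Path(base_path).glob("_feed_digest=*")):
        # A directory without metadata.json may still be mid-write, so skip it.
        if (version_dir / "metadata.json").is_file():
            versions.append(version_dir.name.split("=", 1)[1])
    return versions
```

Filtering this way keeps partially written versions out of downstream queries.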

## Development

```bash
uv sync --group dev
uv run pytest tests/ -k "not Real"   # unit tests (~0.5s)
uv run pytest tests/                  # includes real feed integration tests
```

## License

MIT
