Metadata-Version: 2.4
Name: schema-preflight
Version: 2.0.0
Summary: Pre-flight schema validator — powered by sqlglot. Checks field names and types against BigQuery, Snowflake, Redshift, Spark, Databricks, Hive, Postgres, MySQL, DuckDB, Trino, ClickHouse, Kafka/Avro. Pure Python. No API. Works offline.
Author-email: Amrutham Akshithraj <amruthamakshithraj@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/AmruthamAkshithraj/schema-preflight
Project-URL: Documentation, https://github.com/AmruthamAkshithraj/schema-preflight#readme
Project-URL: Repository, https://github.com/AmruthamAkshithraj/schema-preflight
Keywords: schema,data-engineering,bigquery,snowflake,redshift,spark,avro,kafka,duckdb,databricks,trino,clickhouse,preflight,sqlglot,reserved-keywords,type-mapping,data-quality
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Topic :: Database
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Operating System :: OS Independent
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: sqlglot>=20.0.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Requires-Dist: black; extra == "dev"
Requires-Dist: mypy; extra == "dev"
Requires-Dist: ruff; extra == "dev"
Dynamic: license-file

# schema-preflight 🛫

[![PyPI](https://badge.fury.io/py/schema-preflight.svg)](https://pypi.org/project/schema-preflight/)
[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)
[![Tests](https://github.com/AmruthamAkshithraj/schema-preflight/actions/workflows/publish.yml/badge.svg)](https://github.com/AmruthamAkshithraj/schema-preflight/actions)

**Check your schema against 12 systems simultaneously — before a single byte of data is written.**

Pure Python. No API server. No network calls. Works completely offline.
Powered by [sqlglot](https://github.com/tobymao/sqlglot)'s community-maintained dialect databases.

```
You name a field 'timestamp'. Kafka accepts it. 50 million
messages later — BigQuery is broken. TIMESTAMP is a reserved
keyword. The dashboard team calls at 2am.

schema-preflight catches this in milliseconds. Before you deploy.
```

---

## Install

```bash
pip install schema-preflight
```

One dependency: `sqlglot` (the SQL parser powering dbt, Airflow, and hundreds of data tools).

---

## Quickstart — 3 lines

```python
from schema_preflight import SchemaPreflight

pf = SchemaPreflight()  # no URL, no API, pure offline

report = pf.check(
    name="OrderEvent",
    source_format="avro",
    target_systems=["bigquery", "snowflake", "redshift", "spark", "postgres"],
    fields=[
        {"name": "order_id",    "type": "string"},    # ✅ safe
        {"name": "user_id",     "type": "string"},    # ✅ safe
        {"name": "timestamp",   "type": "string"},    # ❌ CRITICAL — 5 systems
        {"name": "select",      "type": "integer"},   # ❌ CRITICAL — 5 systems
        {"name": "total",       "type": "number"},    # ✅ safe
        {"name": "metadata",    "type": "map"},       # ❌ CRITICAL — BigQuery+Redshift
        {"name": "2fa_enabled", "type": "boolean"},  # ❌ CRITICAL — starts with digit
    ],
)

report.print_summary()
report.raise_if_critical()  # raises PreflightFailed if any CRITICAL found
```

Output:
```
════════════════════════════════════════════════════════════════════
  Schema Preflight v2  ·  OrderEvent
════════════════════════════════════════════════════════════════════
  Status   : ❌  FAILED — fix before writing data
  Systems  : 5 checked  0 passed  5 failed
  Issues   : 22 critical  0 warning  9 info

  🌍 Issues affecting ALL 5 target systems
  ──────────────────────────────────────────
  🔴 [RESERVED_KEYWORD] field: 'timestamp'
     'timestamp' is a reserved keyword in all 5 systems
     Fix : Rename to 'timestamp_value' | 'event_timestamp' | 'ts'

  🔴 [RESERVED_KEYWORD] field: 'select'
     'select' is a reserved SQL keyword in all 5 systems
     Fix : Rename to 'select_value' | 'select_col' | 'event_select'

  🔴 [STARTS_WITH_DIGIT] field: '2fa_enabled'
     Starts with digit — invalid in all SQL systems and Avro
     Fix : Rename to '_2fa_enabled' or 'col_2fa_enabled'

  💡 Suggested renames:
     'timestamp'   →  timestamp_value  |  event_timestamp  |  ts
     'select'      →  select_value     |  select_col
     '2fa_enabled' →  col_2fa_enabled  |  _2fa_enabled
════════════════════════════════════════════════════════════════════
```

---

## All the ways to use it

### Plain dicts
```python
report = pf.check("User", fields=[
    {"name": "user_id",     "type": "string"},
    {"name": "email",       "type": "string"},
    {"name": "timestamp",   "type": "string"},   # flagged
    {"name": "created_at",  "type": "timestamp"},
], target_systems=["bigquery", "snowflake"])
```

### SchemaField objects (IDE autocomplete + type safety)
```python
from schema_preflight import SchemaField

fields = [
    SchemaField("user_id",  "string",  required=True),
    SchemaField("email",    "string",  required=True, max_length=255),
    SchemaField("tags",     "array",   items=SchemaField("tag", "string")),
    SchemaField("address",  "object",  fields=[
        SchemaField("city",    "string"),
        SchemaField("country", "string"),
        SchemaField("select",  "string"),  # nested reserved word — still caught
    ]),
]
pf.check("User", fields).raise_if_critical()
```

### PySpark (reads metadata — no data scan)
```python
df = spark.read.parquet("s3://raw/clickstream/")
pf.check_spark_schema(df, "ClickEvent",
    target_systems=["bigquery", "snowflake", "redshift"]
).raise_if_critical()
df.write.parquet("s3://clean/clickstream/")
```

### pandas
```python
df = pd.read_csv("orders.csv")
pf.check_pandas_schema(df, "Orders").raise_if_critical()
```

### SQLAlchemy
```python
orders = Table("orders", meta,
    Column("order_id",  String),
    Column("timestamp", String),  # flagged
)
pf.check_sqlalchemy_table(orders).raise_if_critical()
```

### Avro schema file
```python
import json
with open("order.avsc") as f:
    schema = json.load(f)
pf.check_avro_schema(schema).raise_if_critical()
```

### JSON Schema (draft-07)
```python
pf.check_json_schema({
    "title": "Order",
    "type":  "object",
    "properties": {
        "order_id":  {"type": "string"},
        "timestamp": {"type": "string"},  # flagged
    }
}).raise_if_critical()
```

### Infer from a sample dict
```python
sample = {"order_id": "123", "timestamp": "2025-01-01", "total": 99.99}
pf.check_dict_sample(sample, "Order").print_summary()
```

### Quick single field check
```python
result = pf.check_name("timestamp")
# {"safe": False, "affected_systems": ["bigquery","snowflake",...],
#  "suggestions": ["timestamp_value","event_timestamp","ts"]}
```

### Get safe renames
```python
pf.suggest("timestamp")
# → ["timestamp_value", "event_timestamp", "ts", "data_timestamp"]
```

---

## CLI

```bash
# Full schema check
preflight check \
  --name OrderEvent \
  --format avro \
  --systems bigquery snowflake redshift spark \
  --fields order_id:string timestamp:string total:number select:integer

# Check from JSON file
preflight check --file schema.json --systems bigquery snowflake

# Single field name check
preflight check-name timestamp

# Get safe suggestions
preflight suggest select

# List all supported systems
preflight systems

# JSON output (for CI pipelines)
preflight check --json --file schema.json
```

---

## Use in CI/CD (GitHub Actions)

Block PRs that introduce schema breaking changes:

```yaml
- name: Schema preflight
  run: |
    pip install schema-preflight
    preflight check \
      --file schemas/order_event.json \
      --systems bigquery snowflake redshift spark
  # exits 1 and blocks merge if CRITICAL issues found
```

---

## What it checks — 24 edge case categories

All sourced from official system documentation.

| # | Check | Systems |
|---|---|---|
| 1 | Reserved SQL keywords (via sqlglot — 31 dialects, always current) | All |
| 2 | System type name collisions (TIMESTAMP, STRING, BOOLEAN as column names) | All |
| 3 | BigQuery reserved prefixes (`_TABLE_`, `_FILE_`, `_PARTITION_`, `_CHANGE_*`, `_SESSION_`) | BigQuery |
| 4 | Names starting with a digit | All |
| 5 | Special characters (spaces, hyphens, dots) | All |
| 6 | Leading / trailing whitespace | All |
| 7 | Empty field name | All |
| 8 | Column name length limits (BQ:300, SF:255, RS:127, PG:63, MySQL:64) | All |
| 9 | Duplicate names after case-folding | All |
| 10 | Unicode chars in systems that forbid them | BQ, Redshift, Hive |
| 11 | Double-underscore prefix (pseudo-column collision) | BQ, Redshift, Postgres |
| 12 | Avro name regex `[A-Za-z_][A-Za-z0-9_]*` | Kafka/Avro |
| 13 | Avro complex union types (more than `[null, T]`) | Kafka/Avro |
| 14 | Array of array — BigQuery hard limit | BigQuery |
| 15 | Max nesting depth 15 — BigQuery hard limit | BigQuery |
| 16 | Max 10,000 columns including nested — BigQuery hard limit | BigQuery |
| 17 | Max 1,600 columns — Redshift / Postgres hard limit | Redshift, Postgres |
| 18 | Type mapping safety per system | All |
| 19 | Decimal precision exceeds system maximum | All |
| 20 | MAP type with non-string key — crashes Redshift connector | Redshift |
| 21 | MAP type — unsupported in BigQuery (no native MAP type) | BigQuery |
| 22 | Array type — unsupported in Redshift Spectrum with Avro format | Redshift |
| 23 | SQL function name collisions (COUNT, SUM, RANK — work but cause confusion) | All |
| 24 | ClickHouse case-sensitivity (ClickHouse is case-sensitive, others are not) | ClickHouse |

---

## Supported systems

| ID | System | sqlglot dialect |
|---|---|---|
| `bigquery` | Google BigQuery | ✅ |
| `snowflake` | Snowflake | ✅ |
| `redshift` | Amazon Redshift | ✅ |
| `spark` | Apache Spark SQL | ✅ |
| `databricks` | Databricks / Delta Lake | ✅ |
| `hive` | Apache Hive | ✅ |
| `postgres` | PostgreSQL | ✅ |
| `mysql` | MySQL | ✅ |
| `duckdb` | DuckDB | ✅ |
| `trino` | Trino | ✅ |
| `clickhouse` | ClickHouse | ✅ |
| `kafka_avro` | Kafka / Avro | custom |

## Supported field types

`string` `integer` `long` `float` `double` `number` `boolean` `bytes`
`date` `time` `datetime` `timestamp` `array` `object` `record` `map`
`enum` `decimal` `null` `uuid` `json`

## Supported source formats

`json` `avro` `parquet` `protobuf` `csv` `orc`

---

## Why this is different from every other tool

| Tool | What it does | Gap |
|---|---|---|
| Confluent Schema Registry | Avro compatibility for Kafka | Single system, no cross-system check |
| AWS Glue Schema Registry | Schema for AWS ecosystem | AWS-only, no cross-system check |
| sqlglot | SQL parser, 31 dialects | Checks SQL queries — not schema field names |
| buf.build | Protobuf registry | Protobuf only |
| DataHub | Data contracts | Requires full platform, no offline check |
| **schema-preflight** | **Field name + type check across all systems** | **First offline cross-system pre-flight tool** |

---

## Publish to PyPI

```bash
pip install build twine
python3 -m build
python3 -m twine upload dist/*
```

---

## License

MIT © Amrutham Akshithraj
