Metadata-Version: 2.4
Name: data-dictionary-builder
Version: 0.1.8
Summary: Extract database metadata, generate dbt-compatible YAML, compare schemas, and deliver reports — all in a single Python import.
Author-email: Isaiah Johnson <d8aguy@mail.com>
License: MIT
Project-URL: Homepage, https://github.com/GraFreak0/data_dictionary_builder
Project-URL: Repository, https://github.com/GraFreak0/data_dictionary_builder
Project-URL: Issues, https://github.com/GraFreak0/data_dictionary_builder/issues
Keywords: database,metadata,dbt,yaml,data-engineering,schema,documentation
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Topic :: Database
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: PyYAML>=6.0.1
Requires-Dist: python-dotenv>=1.0.0
Requires-Dist: click>=8.0.0
Requires-Dist: typing-extensions>=4.9.0
Requires-Dist: python-dateutil>=2.8.2
Requires-Dist: colorlog>=6.8.0
Requires-Dist: psycopg2-binary>=2.9.9
Requires-Dist: PyMySQL>=1.1.0
Requires-Dist: clickhouse-connect>=0.7.0
Requires-Dist: google-cloud-spanner>=3.40.0
Requires-Dist: reportlab>=4.0.0
Requires-Dist: secure-smtplib>=0.1.1
Requires-Dist: numpy
Requires-Dist: slack-sdk>=3.41.0
Requires-Dist: clickhouse-driver>=0.2.10
Provides-Extra: postgres
Requires-Dist: psycopg2-binary>=2.9.9; extra == "postgres"
Provides-Extra: mysql
Requires-Dist: PyMySQL>=1.1.0; extra == "mysql"
Provides-Extra: clickhouse
Requires-Dist: clickhouse-connect>=0.7.0; extra == "clickhouse"
Provides-Extra: clickhouse-native
Requires-Dist: clickhouse-driver>=0.2.7; extra == "clickhouse-native"
Provides-Extra: clickhouse-all
Requires-Dist: clickhouse-connect>=0.7.0; extra == "clickhouse-all"
Requires-Dist: clickhouse-driver>=0.2.7; extra == "clickhouse-all"
Provides-Extra: spanner
Requires-Dist: google-cloud-spanner>=3.40.0; extra == "spanner"
Provides-Extra: oracle
Requires-Dist: oracledb>=2.0.0; extra == "oracle"
Provides-Extra: sqlserver
Requires-Dist: pymssql>=2.2.0; extra == "sqlserver"
Provides-Extra: mongodb
Requires-Dist: pymongo>=4.6.0; extra == "mongodb"
Provides-Extra: pdf
Requires-Dist: reportlab>=4.0.0; extra == "pdf"
Provides-Extra: email
Requires-Dist: secure-smtplib>=0.1.1; extra == "email"
Provides-Extra: slack
Requires-Dist: slack-sdk>=3.41.0; extra == "slack"
Provides-Extra: all
Requires-Dist: psycopg2-binary>=2.9.9; extra == "all"
Requires-Dist: PyMySQL>=1.1.0; extra == "all"
Requires-Dist: clickhouse-connect>=0.7.0; extra == "all"
Requires-Dist: clickhouse-driver>=0.2.7; extra == "all"
Requires-Dist: google-cloud-spanner>=3.40.0; extra == "all"
Requires-Dist: oracledb>=2.0.0; extra == "all"
Requires-Dist: pymssql>=2.2.0; extra == "all"
Requires-Dist: pymongo>=4.6.0; extra == "all"
Requires-Dist: reportlab>=4.0.0; extra == "all"
Requires-Dist: secure-smtplib>=0.1.1; extra == "all"
Requires-Dist: slack-sdk>=3.41.0; extra == "all"
Requires-Dist: numpy; extra == "all"
Provides-Extra: dev
Requires-Dist: pytest>=7.4.3; extra == "dev"
Requires-Dist: black>=23.12.0; extra == "dev"
Requires-Dist: flake8>=6.1.0; extra == "dev"
Requires-Dist: mypy>=1.7.1; extra == "dev"
Requires-Dist: mongomock>=4.1.2; extra == "dev"
Dynamic: license-file

# data_dictionary_builder

A Python library that automates database documentation — extract live schema metadata, generate dbt-compatible YAML, compare schemas across environments, and deliver PDF reports, all in a single import.

[![PyPI](https://img.shields.io/pypi/v/data-dictionary-builder)](https://pypi.org/project/data-dictionary-builder/)
[![Python](https://img.shields.io/pypi/pyversions/data-dictionary-builder)](https://pypi.org/project/data-dictionary-builder/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)

![Logo](https://github.com/GraFreak0/data_dictionary_builder/blob/main/static/logo.png)
---

> **What's new in v0.1.8**
>
> - **Consistent description field ordering** — `description` now appears as the second key in every generated YAML block, immediately after `name`. For columns this means the order is `name → description → data_type → meta → tests` instead of the previous `name → data_type → description`. Table and schema-level ordering is unchanged (both already placed `description` right after `name`/`version`). This makes it faster to scan and fill in documentation because the description placeholder is always in the same predictable position.

---

> **What's new in v0.1.7**
>
> - **Always-present description fields** — every schema, table, and column in generated YAML now always includes a `description:` key. When no description is available from the database the field is emitted as `description: null`, giving users a clear placeholder to fill in. Existing user-written descriptions are never overwritten. Re-running against older YAML files automatically backfills any missing `description` keys.
> - **Improved gap detection** — `get_tables_without_descriptions()` and `get_columns_without_descriptions()` now explicitly treat `description: null` and `description: ""` (empty string) in YAML as undocumented, matching the expectation that a placeholder null field still needs to be filled in.
> - **Spanner multi-database extraction** — `database_id` is now optional for the Spanner connector. Pass a `databases` list to extract multiple databases in one call, or omit both to auto-discover every database in the instance. Each database is surfaced as a separate schema in the output. Single-database behaviour (`database_id` set) is unchanged.

---

> **What's new in v0.1.6**
>
> - **Column exclusion** — pass `column_exclude=["contains:peerdb", "prefix:_dlt_"]` to `extract_all_schemas()` / `extract_schema()` / `extract_table()` to strip matching columns before any YAML is written. Accepts the same six pattern types as `schema_filter`: exact, glob, `prefix:`, `suffix:`, `contains:`, `regex:`.
> - **Table exclusion** — pass `table_exclude=["prefix:tmp_", "contains:staging"]` to drop entire tables from the extraction in the same way.
> - **View extraction** — pass `include_views=True` to any extraction method to include database views alongside base tables. Views appear in the YAML with `table_type: VIEW`. Supported on all connectors (Postgres, MySQL, SQLite, SQL Server, Oracle, ClickHouse — covers both `View` and `MaterializedView` engines).
> - **CLI additions** — `--exclude-column` / `-x`, `--exclude-table` / `-T`, and `--include-views` flags added to `ddgen extract`.

---

> **What's new in v0.1.5**
>
> - **Custom output paths** — `DDHelper` now supports explicit `models_dir` and `reports_dir` parameters. You can store your models and reports in completely separate locations, or redirect them to a temporary folder during CI/CD.
> - **MongoDB Support** — extract metadata from MongoDB collections. Fields and types are automatically inferred by sampling documents. Supports `_id` as primary key and standard auth mechanisms. Install with `pip install "data-dictionary-builder[mongodb]"`.

---

> **What's new in v0.1.4**
>
> - **Multi-recipient email delivery** — `email_to` now accepts a list of addresses (e.g. `["alice@example.com", "bob@example.com"]`) in addition to a single string. The `EMAIL_TO` environment variable supports comma-separated addresses (`EMAIL_TO=alice@example.com,bob@example.com`). All recipients receive the PDF attachment in one send.
> - **Multi-target Slack delivery** — `slack_target` now accepts a list of targets (e.g. `["#data-alerts", "U012AB3CD"]`). The `SLACK_NOTIFY_TARGET` environment variable supports comma-separated values. Each target receives the full Block Kit report and optional PDF upload independently.
> - **`send_notification` updated** — both `email_to` and `slack_target` parameters accept strings or lists; the method routes correctly for either form with no API change.

---

> **What's new in v0.1.3**
>
> - **Slack notifications** — deliver schema comparison reports directly to any Slack channel or DM alongside the existing email delivery. Use `notification_type="slack"` or `"both"` in `send_notification()`. Requires a Bot User OAuth Token (`xoxb-…`) and the `slack` extra: `pip install "data-dictionary-builder[slack]"`.
> - **Optimised metadata generation** — parallel extraction is faster with improved thread scheduling and reduced connection overhead across all supported databases.
> - **Smarter schema comparison** — type normalisation has been expanded to cover more cross-database equivalences, reducing false-positive mismatches in mixed-engine pipelines.
> - **Leaner exports** — JSON metadata exports are more compact, and the `to_dict()` / `from_dict()` round-trip is validated automatically to ensure safe use in Airflow XCom and downstream catalog APIs.

---

## Installation

**pip**

```bash
# Core library (SQLite works out of the box)
pip install data-dictionary-builder

# With the connectors you need
pip install "data-dictionary-builder[postgres]"
pip install "data-dictionary-builder[mysql]"
pip install "data-dictionary-builder[clickhouse]"         # ClickHouse HTTP/HTTPS
pip install "data-dictionary-builder[clickhouse-native]"  # ClickHouse native TCP
pip install "data-dictionary-builder[spanner]"
pip install "data-dictionary-builder[oracle]"
pip install "data-dictionary-builder[sqlserver]"
pip install "data-dictionary-builder[mongodb]"

# Everything at once
pip install "data-dictionary-builder[all]"
```

**uv** *(recommended — faster resolver, built-in virtual environments)*

```bash
# Install uv
pip install uv
# or on macOS/Linux: curl -LsSf https://astral.sh/uv/install.sh | sh

# Add to your project
uv add data-dictionary-builder

# With specific connectors
uv add "data-dictionary-builder[postgres]"
uv add "data-dictionary-builder[mysql]"
uv add "data-dictionary-builder[clickhouse]"         # ClickHouse HTTP/HTTPS
uv add "data-dictionary-builder[clickhouse-native]"  # ClickHouse native TCP
uv add "data-dictionary-builder[oracle]"
uv add "data-dictionary-builder[sqlserver]"
uv add "data-dictionary-builder[spanner]"
uv add "data-dictionary-builder[mongodb]"

# Everything at once
uv add "data-dictionary-builder[all]"
```

Or use the CLI to install connectors after the fact:

```bash
ddgen install postgres
ddgen install clickhouse
ddgen install oracle
ddgen install sqlserver
ddgen install mongodb
ddgen install all
```

---

## Supported Databases

| Database | Extra | Driver |
|---|---|---|
| **SQLite** | *(built-in)* | `sqlite3` (stdlib) |
| **PostgreSQL** | `[postgres]` | `psycopg2-binary` |
| **MySQL / MariaDB** | `[mysql]` | `PyMySQL` |
| **ClickHouse** | `[clickhouse]` | `clickhouse-connect` (HTTP/HTTPS) · `clickhouse-driver` (native TCP, optional) |
| **Oracle Database** | `[oracle]` | `oracledb` (thin mode — no Oracle Client needed) |
| **SQL Server / Azure SQL** | `[sqlserver]` | `pymssql` |
| **Google Cloud Spanner** | `[spanner]` | `google-cloud-spanner` |
| **MongoDB** | `[mongodb]` | `pymongo` |

---

## 🚀 Quick Start

### 1. Installation

```bash
pip install data-dictionary-builder

# Or with specific connectors
pip install "data-dictionary-builder[postgres,mongodb]"
```

### 2. Basic Usage (Python)

```python
from data_dictionary_builder import DDHelper

# Initialize with custom paths
helper = DDHelper(
    models_dir="my_dbt_project/models", 
    reports_dir="logs/schema_reports"
)

# Extract and generate YAML
config = {
    "db_type": "postgres",
    "host": "localhost",
    "database": "my_db",
    "user": "admin",
    "password": "password"
}
helper.generate_yaml(config, schema_filter=["public"])
```

### 3. MongoDB Support

Extract metadata from MongoDB collections with automatic schema inference (via sampling):

```python
from data_dictionary_builder import DDHelper

helper = DDHelper()

mongo_config = {
    "db_type": "mongodb",
    "host": "localhost",
    "port": 27017,
    "database": "my_app_db"
}

# Extracts all collections in 'my_app_db'
# Fields and types are inferred by sampling 100 docs per collection
helper.generate_yaml(mongo_config)
```

---

## CLI

```bash
# Show all commands and supported databases
ddgen --help

# Full module and API reference
ddgen features

# Check which connectors are installed
ddgen connectors

# Install a connector
ddgen install postgres
ddgen install clickhouse
ddgen install all

# Extract metadata and generate YAML in one step
ddgen extract --db-type postgres --host prod.db.io --database mydb --user readonly

# Compare two environments
ddgen compare --source-host prod.db.io --dest-host staging.db.io --source-database mydb

# Show library version and connector summary
ddgen info

# Show version number
ddgen --version
```

---

## Schema Comparison

```python
from data_dictionary_builder import SchemaComparator, DDHelper

helper = DDHelper(".")
report = SchemaComparator(
    source_config={"db_type": "postgres", "host": "prod-db", ...},
    destination_config={"db_type": "postgres", "host": "staging-db", ...},
).compare_and_generate_report("public", include_yaml_gaps=True)

json_path = helper.save_report(report)
pdf_path  = helper.compile_pdf(source_json=json_path)

# Send via email, Slack, or both — credentials fall back to env vars
# email_to and slack_target accept a string or a list of recipients
helper.send_notification(
    notification_type="both",    # "email" | "slack" | "both"
    report=report,
    pdf_path=pdf_path,
    email_to=["alice@example.com", "bob@example.com"],
    slack_target=["#data-alerts", "#data-eng"],
)
```

---

## Airflow Integration

`DatabaseMetadata` serialises to/from plain dicts for XCom:

```python
@task
def extract():
    with MetadataExtractor(**config) as ext:
        return ext.extract_all_schemas(parallel_workers=8).to_dict()

@task
def generate_yaml(db_meta_dict):
    from data_dictionary_builder import DatabaseMetadata, YAMLGenerator
    YAMLGenerator("./models").generate_yaml_files(DatabaseMetadata.from_dict(db_meta_dict))
```

See [`tests/airflow_dag_example.py`](tests/airflow_dag_example.py) for a complete DAG.

---

## Key Features

- **Parallel extraction** — `ThreadPoolExecutor` with configurable workers; ClickHouse uses 2 queries and PostgreSQL uses 5 queries per schema regardless of table count
- **Dual ClickHouse transport** — HTTP/HTTPS via `clickhouse-connect` (default) or native TCP via `clickhouse-driver`; auto-detected, with dynamic port defaults based on transport and TLS
- **Schema filtering** — exact, glob, prefix, suffix, contains, regex — mix freely
- **Column exclusion** — strip columns matching any pattern before writing YAML (`contains:peerdb`, `prefix:_dlt_`, etc.)
- **Table exclusion** — drop entire tables matching any pattern from the extraction (`prefix:tmp_`, `contains:staging`, etc.)
- **View extraction** — opt-in view support (`include_views=True`) across all connectors; ClickHouse includes both `View` and `MaterializedView` engine types
- **Smart YAML merge** — re-running never overwrites descriptions you've written by hand
- **YAML-aware gap detection** — documentation coverage checks read from your existing YAML files, so descriptions you've added are always recognised
- **Cross-database comparison** — compare any two database types; type aliases normalised before diffing
- **PDF reports** — paginated, no row limits, table of contents (requires `reportlab`)
- **Unified notifications** — `send_notification(notification_type="email"|"slack"|"both", ...)` delivers PDF reports via SMTP email, Slack (channel or DM), or both simultaneously; all credentials fall back to environment variables
- **Email delivery** — SMTP with env-var credential fallback; PDF attached automatically
- **Slack delivery** — Block Kit–formatted comparison summaries; optional PDF file upload; supports `#channel`, `@user`, channel IDs, and user IDs
- **ExecutionTimer** — named task timing with a formatted summary table
- **Server mode** — omit `database` to scan all databases on a MySQL, ClickHouse, PostgreSQL, or MongoDB server
- **Rich CLI** — `ddgen extract`, `ddgen compare`, `ddgen features` (full API reference), `ddgen connectors`, `ddgen install`

---

## Environment Variables

Set these in a `.env` file (see `tests/.env.example`) or in your shell:

```bash
# ── Notification channel ───────────────────────────────────────────────
# "email" | "slack" | "both"  (default: email)
NOTIFICATION_TYPE=email

# ── SMTP — DDHelper.send_notification() / send_report_email() ──────────
SMTP_HOST=smtp.gmail.com
SMTP_PORT=587
SMTP_USER=you@gmail.com
SMTP_PASSWORD=xxxx xxxx xxxx xxxx
EMAIL_TO=alice@example.com,bob@example.com   # comma-separated for multiple recipients

# ── Slack — DDHelper.send_notification() ───────────────────────────────
# Bot Token Scopes required: chat:write, files:write, channels:read,
#                            users:read, im:write
SLACK_BOT_TOKEN=xoxb-your-token-here
SLACK_NOTIFY_TARGET=#data-alerts,#data-eng   # comma-separated for multiple targets
```

---

## Documentation

Full user guide, API reference, and troubleshooting: [DOCUMENTATION.md](https://github.com/GraFreak0/data_dictionary_builder/blob/main/DOCUMENTATION.md)

---

## License

[MIT](https://github.com/GraFreak0/data_dictionary_builder/blob/main/LICENSE) — free to use, modify, and distribute in personal and commercial projects.
