Metadata-Version: 2.4
Name: das-layer
Version: 0.2.0
Summary: Contract-driven DAS-layer ingestion CLI built on dlt
License: MIT License
        
        Copyright (c) 2026 Mattias Thalén
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
License-File: LICENSE
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Database
Requires-Python: >=3.12
Requires-Dist: dlt[az,deltalake,duckdb,ducklake,filesystem,parquet,sql-database]>=1.6.1
Requires-Dist: jinja2>=3.1
Requires-Dist: pydantic>=2.7
Requires-Dist: pyodata>=1.10
Requires-Dist: pyyaml>=6.0
Requires-Dist: requests>=2.31
Requires-Dist: rich>=13.7
Requires-Dist: sqlalchemy>=2.0
Requires-Dist: typer>=0.12
Provides-Extra: excel
Requires-Dist: openpyxl>=3.1; extra == 'excel'
Provides-Extra: qvd
Requires-Dist: pyqvd>=2.3; extra == 'qvd'
Provides-Extra: xml
Requires-Dist: lxml>=5.2; extra == 'xml'
Description-Content-Type: text/markdown

# das-cli

Contract-driven ingestion for the raw→staging boundary, built on [dlt](https://dlthub.com/).

[![PyPI version](https://img.shields.io/pypi/v/das-layer.svg?v=2)](https://pypi.org/project/das-layer/)
[![Python versions](https://img.shields.io/pypi/pyversions/das-layer.svg?v=2)](https://pypi.org/project/das-layer/)
[![CI](https://github.com/mattiasthalen/das-cli/actions/workflows/ci.yml/badge.svg)](https://github.com/mattiasthalen/das-cli/actions/workflows/ci.yml)
[![License](https://img.shields.io/github/license/mattiasthalen/das-cli.svg)](LICENSE)

> Built on the **DAS** (Data As System) layer concept and contract-driven
> philosophy from Daana's writing — [The rise of the model-driven data
> engineer](https://blog.daana.dev/blog/the-rise-of-model-driven-data-engineer)
> and [Contract-driven data
> transformation](https://blog.daana.dev/blog/contract-driven-data-transformation).

## What it is

das-cli lands raw source data into a compressed JSONL landing zone, then
promotes it into a contract-enforced warehouse (DuckDB, DuckLake, or Delta).
ODCS v3
contracts are the single source of truth for schema, types, and keys —
schemas are *frozen and projected*, not inferred at load time.

It is scoped to the DAS layer: lossless landing and contract-projected
staging. It is *not* an ELT modeling tool. Transformations, business logic,
and downstream modeling belong in the layers above (dbt, SQLMesh, etc.).

## How it works

das-cli works in three steps — a one-time bootstrap, then two pipelines per
ingest:

1. **Bootstrap a contract** — point das-cli at an OData service or OpenAPI
   spec; it emits an editable ODCS v3 contract per resource, plus a
   `_source.yml` for connection details. (ADR-0005, ADR-0006, ADR-0015.)

2. **P1 — land** — extract raw records into a compressed JSONL landing zone
   under `./landing/`, hive-partitioned by extract date. Each row is stored
   as a `_data: json` blob — lossless, no schema applied. (ADR-0002,
   ADR-0003.)

3. **P2 — promote** — project the JSONL blobs into contract-enforced DuckDB
   staging tables. Schemas are frozen against the ODCS contract. Loads are
   append-only with row-hash change detection: each row's hash is compared
   against the latest hash per primary key in the destination, and only
   changed rows are appended. (ADR-0004, ADR-0008, ADR-0012.)

## Quickstart

```bash
uv tool install das-layer   # or: pip install das-layer / pipx install das-layer

mkdir my-data && cd my-data
das init

das source add adventureworks \
  --type odata \
  --url https://demodata.grapecity.com/adventureworks/odata/v1

das contract bootstrap adventureworks
das ingest adventureworks
```

For a SQL source (Oracle / SQL Server / Postgres / MySQL / SQLite), pass a
SQLAlchemy URL with no credentials — das-cli templates `env_var(...)` for
username / password into `_source.yml` and writes a matching `.env`
block. The driver wheel is your responsibility:

```bash
uv tool install das-layer --with oracledb     # or --with pyodbc

das source add finance \
  --type sql \
  --url "oracle+oracledb://oracle-host:1521/?service_name=XEPDB1"
# Edit .env: set FINANCE__USERNAME / FINANCE__PASSWORD.

das contract bootstrap finance
das ingest finance
```

For a CSV source (local files), point das at a directory of CSV files.
`das contract bootstrap` walks the directory and emits one contract per
file; widen `das.spec_ref.path` to a glob if you want monthly drops
combined into a single resource (ADR-0022):

```bash
das source add accounts \
  --type filesystem \
  --url ./data/accounts/

das contract bootstrap accounts
# Edit contracts/accounts/*.odcs.yaml — widen `path` to a glob and refine types.

das ingest accounts
```

For an Excel source (local `.xlsx` workbooks), point das at a directory
of workbooks. `das contract bootstrap` enumerates each workbook's
sheets and emits one contract per `(workbook, sheet)`. Each contract's
`das.spec_ref.path` is the workbook filename; `sheet` selects which
sheet to read. Use `inject: {col: sheet}` if you want the sheet name
projected as a column (ADR-0023).

```bash
uv tool install 'das-layer[excel]'

das source add q1-exports \
  --type filesystem \
  --url ./data/q1-exports/

das contract bootstrap q1-exports
# Edit contracts/q1-exports/*.odcs.yaml — set primary keys, refine types.

das ingest q1-exports
```

For an XML source (local files), provide an XPath to identify each
record element. The reader streams matched elements via lxml's
iterparse; column names come from the element's attributes + direct
child elements. Namespaced documents are supported by declaring
`namespaces` in the contract's `das.spec_ref`.

```bash
uv tool install 'das-layer[xml]'

das source add catalog \
  --type filesystem \
  --url ./data/catalog/

das contract bootstrap catalog --xpath '//customer'
# Edit contracts/catalog/*.odcs.yaml — set primary keys, refine types.

das ingest catalog
```

For a QVD source (local Qlik `.qvd` exports), point das at a directory
of QVD files. Unlike csv/excel/xml — where bootstrap defaults all
column types to `string` — QVD's embedded metadata gives das the real
types up front (integer/number/string/date/timestamp), so you usually
only need to set primary keys post-bootstrap.

```bash
uv tool install 'das-layer[qvd]'

das source add finance \
  --type filesystem \
  --url ./data/finance/

das contract bootstrap finance
# Edit contracts/finance/*.odcs.yaml — set primary keys (types already good).

das ingest finance
```

After `das ingest`:

- raw JSONL lands in `./landing/`, hive-partitioned by extract date
- contract-enforced staging tables live in `./das.duckdb`

## Example workspace

After `das init` and `das source add --type odata ...`, the workspace looks
like this — the `das.yaml` is the workspace config; `_source.yml` is the
per-source connection; the `.odcs.yaml` file is the contract you'd typically
hand-edit after `das contract bootstrap`.

```text
.
├── das.yaml
├── .env
└── contracts/
    └── adventureworks/
        ├── _source.yml
        └── products.odcs.yaml
```

**`das.yaml`** — workspace config written by `das init`. (For DuckLake or
Delta targets, see ADR-0016 / ADR-0017.)

```yaml
target:
  destination: duckdb
  credentials:
    database: "{{ env_var('DAS_TARGET_DB', './das.duckdb') }}"

landing:
  bucket_url: "{{ env_var('DAS_LANDING_URL', './landing') }}"
  credentials: {}

contracts:
  root: ./contracts

dataset:
  prefix: das          # set to null (or omit) for no prefix; dataset = source name

defaults:
  write_disposition: append
  change_detection: row_hash
  schema_contract:
    tables: evolve
    columns: freeze
    data_type: freeze
```

**`contracts/adventureworks/_source.yml`** — per-source connection written
by `das source add`. (ADR-0005 covers the contract/source split; ADR-0007
covers secrets via `env_var`.)

```yaml
type: odata
endpoint: https://demodata.grapecity.com/adventureworks/odata/v1
spec:
  kind: odata
  metadata_url: https://demodata.grapecity.com/adventureworks/odata/v1/$metadata
auth:
  type: none
ownership:
  team: data-platform
```

**`contracts/adventureworks/products.odcs.yaml`** — ODCS v3 contract,
post-edit. Defines the projected schema, primary key, and a `das.cursor`
for incremental extraction. (ADR-0011, ADR-0014 cover the `das.*`
namespace.)

```yaml
apiVersion: v3.0.0
kind: DataContract
id: adventureworks.products
name: Products
version: 0.1.0
status: draft
info: { title: Products, owner: data-platform }
schema:
  - name: products
    physicalName: products
    physicalType: table
    properties:
      - { name: product_id, physicalName: ProductId, logicalType: integer,
          physicalType: bigint, required: true, primaryKey: true }
      - { name: name, physicalName: Name, logicalType: string, required: true }
      - { name: modified_at, physicalName: ModifiedDate, logicalType: timestamp }
customProperties:
  - { property: das.cursor, value: modified_at }
```

### Other source shapes

Only `_source.yml` varies per source type. The `das.yaml` workspace config
and the contract shape are the same regardless of source.

**OpenAPI / REST** — produced by `das source add --type openapi --url
https://api.example.com/v1`:

```yaml
type: openapi
endpoint: https://api.example.com/v1
spec:
  kind: openapi
  swagger_url: https://api.example.com/v1/swagger.json
auth:
  type: none
ownership:
  team: data-platform
```

Resource contracts under OpenAPI sources need an extra `das.endpoint`
custom property so the extractor knows which path to hit and how to
unwrap the response. (ADR-0014, ADR-0015.)

```yaml
customProperties:
  - property: das.endpoint
    value:
      path: /Customers
      response_path: $.value
```

**SQL** — produced by `das source add --type sql --url
"oracle+oracledb://oracle-host:1521/?service_name=XEPDB1"`. Credentials
are templated into the URL via `env_var`; the matching `.env` block is
written by the same command. (ADR-0018.)

```yaml
type: sql
endpoint: "oracle+oracledb://{{ env_var('FINANCE__USERNAME') }}:{{ env_var('FINANCE__PASSWORD') }}@oracle-host:1521/?service_name=XEPDB1"
spec:
  kind: sql
auth:
  type: none
ownership:
  team: data-platform
```

## Concepts

- **Contract** — an [ODCS v3](https://bitol-io.github.io/open-data-contract-standard/)
  YAML file per resource. Defines schema, types, and primary keys. Edited by
  hand after `bootstrap`; frozen on ingest. (ADR-0005, ADR-0012.)
- **`_source.yml`** — per-source connection config (type, URL, auth). Kept
  separate from contracts so resource definitions stay portable. (ADR-0005.)
- **`das.*` custom properties** — the namespace for das-cli's extensions to
  ODCS, including `das.endpoint`, `das.hash_exclude`, and load-date hints.
  (ADR-0011, ADR-0014.)
- **Landing zone** — `./landing/<source>/<resource>/_extracted_on=YYYY-MM-DD/`,
  compressed JSONL, hive-partitioned. The lossless raw record. (ADR-0002,
  ADR-0013.)
- **P1 / P2 pipelines** — extract (source → landing) and load (landing →
  DuckDB). One P1 per source, one shared P2. (ADR-0002, ADR-0008.)
- **Secrets** — referenced in `_source.yml` via `{{ env_var("...") }}`. No
  `.dlt/secrets.toml`. (ADR-0007.)

## Sources & destinations

das-cli is built on [dlt](https://dlthub.com/), so in principle any dlt
[source](https://dlthub.com/docs/dlt-ecosystem/verified-sources) or
[destination](https://dlthub.com/docs/dlt-ecosystem/destinations) should
work. The e2e suite verifies *edges*, not the full matrix: each source is
exercised against DuckDB, and each destination is exercised against
AdventureWorks (OData). A row below means the corresponding edge has an
end-to-end test against a real instance.

### Verified sources (against DuckDB)

| Source              | Notes                                                  |
|---------------------|--------------------------------------------------------|
| OData v2 / v4       | OData metadata bootstrap (ADR-0006)                    |
| OpenAPI / REST      | Per-resource `das.endpoint` (ADR-0014, 0015)           |
| Oracle (oracledb)   | SQLAlchemy reflection (ADR-0018); Oracle Free          |
| SQL Server (pyodbc) | SQLAlchemy reflection (ADR-0018); ODBC Driver 18       |
| CSV (local)         | Filesystem source via dlt (ADR-0022); glob across files |
| Excel (local .xlsx) | openpyxl read-only streaming (ADR-0022); one resource per sheet |
| XML (local)         | lxml iterparse streaming (ADR-0022); xpath-driven records  |
| QVD (Qlik export)   | pyqvd chunked streaming (ADR-0022); types inferred from header |

### Verified destinations (against OData / AdventureWorks)

| Destination | Notes                                                  |
|-------------|--------------------------------------------------------|
| DuckDB      | Default; in-process, no container                      |
| DuckLake    | Local DuckDB-file catalog (ADR-0016)                   |
| Delta       | Local filesystem + Azure ADLS Gen2 + OneLake (write paths; read-back via `sql_client` not yet wired for OneLake) — ADR-0017, ADR-0020 |

### Want a combination verified?

[Open an issue](https://github.com/mattiasthalen/das-cli/issues/new)
describing the source or destination you'd like to use. "Verified" means
there's an end-to-end edge test against a real instance — source verified
against DuckDB, or destination verified against OData. We don't fan out the
full N×M matrix; if you need a specific (source, destination) pair
exercised, file an issue and we'll discuss whether it warrants its own test.

## Development

```bash
uv sync
uv run pytest                       # unit + integration
uv run pytest -m e2e                # opt-in; hits public AdventureWorks/Northwind APIs
uv run pre-commit run --all-files
```

For architectural context (decision records, design specs, implementation
plans), see the **Documentation** section below and `CLAUDE.md`.

## Documentation

- [`docs/adrs/`](docs/adrs/) — Architecture Decision Records: *why is it
  built this way?*
- [`docs/superpowers/specs/`](docs/superpowers/specs/) — design specs:
  *what's planned / shipped?*
- [`docs/superpowers/plans/`](docs/superpowers/plans/) — implementation
  plans: *how is it being built?*
- [`CLAUDE.md`](CLAUDE.md) — project rules and ADR lifecycle policy:
  *what are the rules?*

## License

See [`LICENSE`](LICENSE).
