Metadata-Version: 2.4
Name: chat-history-schema
Version: 0.1.0
Summary: Provider-organized schema + ingestion/cleanup pipeline for AI chat exports
License-File: LICENSE
Requires-Python: >=3.13
Requires-Dist: pydantic>=2.11.5
Description-Content-Type: text/markdown

# Chat History Schema

Provider-organized schema + ingestion/cleanup pipeline for chat history exports.

## Purpose

This repo is for cleaning up raw chat data exports.  
The final merged output can be used for downstream data analysis and visualizations across your past ChatGPT and Claude conversations.

## Structure

```text
chat-history-schema/
  model/
    openai/
    anthropic/
  scripts/
    combine_conversations.py
    parse_validate_clean.py
    merge_provider_histories.py
    run_pipeline.py
  data/
    raw/{openai,anthropic}/
    combined/{openai,anthropic}/
    merged/
```

## Workflow

1. Drop raw exports into:
   - `data/raw/openai/`
   - `data/raw/anthropic/`
2. Combine raw files by provider:
   - All providers: `python scripts/combine_conversations.py --provider all`
   - One provider only:
     - OpenAI: `python scripts/combine_conversations.py --provider openai`
     - Anthropic: `python scripts/combine_conversations.py --provider anthropic`
3. Validate by provider:
   - All providers: `python scripts/parse_validate_clean.py --provider all`
   - One provider only:
     - OpenAI: `python scripts/parse_validate_clean.py --provider openai`
     - Anthropic: `python scripts/parse_validate_clean.py --provider anthropic`
4. Merge provider histories with a `provider` field (this command validates before writing output):
   - All providers: `python scripts/merge_provider_histories.py --provider all`
   - One provider only:
     - OpenAI: `python scripts/merge_provider_histories.py --provider openai`
     - Anthropic: `python scripts/merge_provider_histories.py --provider anthropic`

## One-command end-to-end run

Run combine + validate + merge in one command:

- All providers:
  - `python scripts/run_pipeline.py --provider all`
- One provider only:
  - `python scripts/run_pipeline.py --provider openai`
  - `python scripts/run_pipeline.py --provider anthropic`

## Script outputs

- `scripts/combine_conversations.py`
  - Writes:
    - `data/combined/openai/conversations.json` (when provider includes `openai`)
    - `data/combined/anthropic/conversations.json` (when provider includes `anthropic`)
- `scripts/parse_validate_clean.py`
  - Writes: no files
  - Prints terminal validation results only
- `scripts/merge_provider_histories.py`
  - Writes:
    - If you do nothing: `data/merged/all-conversations-clean.pkl`
    - If you want a different file: pass `--output-pkl <your/path.pkl>`
  - Main argument:
    - `--provider all|openai|anthropic`
- `scripts/run_pipeline.py`
  - Writes:
    - Combined JSON files under `data/combined/<provider>/conversations.json`
    - Final merged pickle:
      - If you do nothing: `data/merged/all-conversations-clean.pkl`
      - If you want a different file: pass `--output-pkl <your/path.pkl>`

## Validation behavior (make it your own)

This schema is intentionally strict and **not** meant to be a final bulletproof version for every export.  
It is designed to surface structure differences early, so you can evolve it to match your own history.

- Models use strict validation (`extra='forbid'`) so unknown fields fail fast instead of being silently ignored.
- You may see validation errors because chat histories differ by provider, account features, tool usage, and export date.
- Treat validation as an iterative loop:
  1. Run validation.
  2. Read terminal errors.
  3. Update `model/openai/*` or `model/anthropic/*` to support the new shape.
  4. Rerun until validation passes.

If you want this repo to fit your data long-term, keep extending the provider schemas as your exports evolve.

## Inspiration

This repo was inspired by the original ChatGPT schema project:  
https://github.com/ryayoung/chatgpt-schema
