Metadata-Version: 2.4
Name: video-text-extractor
Version: 1.0.6
Summary: Reconstruct code, slides, and prose from YouTube videos as Markdown.
Author: jon-chun
License: MIT
Requires-Python: >=3.12
Requires-Dist: httpx>=0.28.1
Requires-Dist: imagehash>=4.3
Requires-Dist: jsonschema>=4.22
Requires-Dist: numpy>=1.26
Requires-Dist: opencv-python-headless>=4.10
Requires-Dist: pillow>=10.3
Requires-Dist: pydantic-settings>=2.4
Requires-Dist: pydantic>=2.7
Requires-Dist: pytesseract>=0.3.10
Requires-Dist: python-dotenv>=1.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: rich>=13.7
Requires-Dist: scenedetect[opencv]>=0.6.4
Requires-Dist: structlog>=24.1
Requires-Dist: tenacity>=8.5
Requires-Dist: tree-sitter-language-pack>=0.7
Requires-Dist: tree-sitter>=0.22
Requires-Dist: typer>=0.12
Provides-Extra: align
Requires-Dist: faster-whisper>=1.0; extra == 'align'
Requires-Dist: whisperx>=3.1; extra == 'align'
Provides-Extra: asr
Requires-Dist: faster-whisper>=1.0; extra == 'asr'
Provides-Extra: dev
Requires-Dist: mypy>=1.10; extra == 'dev'
Requires-Dist: pytest-cov>=5.0; extra == 'dev'
Requires-Dist: pytest-mock>=3.14; extra == 'dev'
Requires-Dist: pytest-xdist>=3.6; extra == 'dev'
Requires-Dist: pytest>=8.2; extra == 'dev'
Requires-Dist: respx>=0.21; extra == 'dev'
Requires-Dist: ruff>=0.5; extra == 'dev'
Requires-Dist: types-jsonschema; extra == 'dev'
Requires-Dist: types-pyyaml; extra == 'dev'
Description-Content-Type: text/markdown

# video-text-extractor (vte)

Reconstruct **code**, **slides**, and **prose** shown in YouTube videos as a single Markdown document plus a machine-readable JSON manifest.

> **v1.0.0** ships the full 8-stage pipeline with PySceneDetect, VLM-fallback classifier, VLM-primary OCR, LLM topic segmentation, typing-progression code reconstruction with tree-sitter validation, Whisper ASR (opt-in via `[asr]` extra), WhisperX alignment (opt-in via `[align]`), `vte validate-manifest`, `vte models pull/list`, Dockerfile, and CI/CD workflows.

## Requirements

- Python 3.12+
- System binaries on PATH: `ffmpeg`, `ffprobe`, `tesseract`, `yt-dlp`
- Apple Silicon or NVIDIA GPU recommended for local VLM / Whisper (CPU fallback works)

## Install

### From PyPI (once published)

```bash
pip install video-text-extractor             # core
pip install "video-text-extractor[asr]"      # + Whisper ASR fallback
pip install "video-text-extractor[align]"    # + WhisperX word-level alignment
```

### From Docker

```bash
docker run --rm -v "$PWD":/work ghcr.io/jon-chun/vte extract <url> --out /work/out
```

### From source

```bash
git clone https://github.com/jon-chun/video-text-extractor
cd video-text-extractor
uv sync
uv run vte extract <url>
```

For local development with all optional extras:

```bash
uv sync --all-extras
```

## Quickstart

```bash
uv run vte extract "https://www.youtube.com/watch?v=<ID>" --out ./out
# or with a local file:
uv run vte extract path/to/video.mp4 --out ./out
```

Outputs per video:

```text
out/<video-id>/
  metadata.json
  video.mp4
  audio.m4a
  transcript/
    raw_captions.vtt        # if YouTube provided captions
    asr_segments.json       # if Whisper ASR ran
    final.json
  keyframes/
    raw/000001.png ...      # all uniform samples
    selected/000001.png ... # post-dedup working set
    index.json
    classifications.json
  ocr/
    000001.json ...
    _done.marker
  narrative/
    segments.json
  output/
    reconstructed.md        # main artifact
    manifest.json           # schema-validated
    README.md               # human-readable run summary
    elements.json
    assets/
      000001.png            # downscaled slide
      000001.thumb.png      # 200px thumbnail
  vte.log                   # structured JSON log
```

## Subcommands

```text
vte version                                              # print version
vte extract <url-or-path> [options]                      # main pipeline
vte providers ping --provider <name> --model <m>         # provider smoke test
vte models list                                          # list configured providers
vte models pull                                          # ollama pull for each ollama-configured model
vte validate-manifest <path>                             # validate manifest.json against schema
```

`vte extract` flags:

```text
--out DIR                       output directory (default: ./out)
--config PATH                   per-job YAML override
--force                         re-run all stages even if outputs exist
--force-stage NAME              re-run a single stage
--log-level LEVEL               structlog level (default: info)
--respect-no-derivatives        abort (exit 5) on ND-licensed videos
```

Exit codes:

- `0` — success (possibly with downgrades)
- `1` — unexpected error
- `2` — config error (e.g. missing env var)
- `3` — pre-emit stage failed
- `4` — emit failed
- `5` — refused (`--respect-no-derivatives` and ND license)
- `6` — provider HTTP error (from `vte providers ping`)
- `7` — manifest schema validation failed (from `vte validate-manifest`)
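
When scripting around `vte`, it can help to turn the numeric exit code back into a readable label. The sketch below mirrors the list above; the `VteExit` enum, its member names, and the `describe_exit` / `run_extract` helpers are illustrative names, not part of the package.

```python
import subprocess
from enum import IntEnum


class VteExit(IntEnum):
    """Exit codes from the list above; member names are illustrative."""

    SUCCESS = 0
    UNEXPECTED_ERROR = 1
    CONFIG_ERROR = 2
    STAGE_FAILED = 3
    EMIT_FAILED = 4
    REFUSED_ND_LICENSE = 5
    PROVIDER_HTTP_ERROR = 6
    MANIFEST_INVALID = 7


def describe_exit(code: int) -> str:
    """Return a lowercase label for a known exit code, or flag it as unknown."""
    try:
        return VteExit(code).name.lower()
    except ValueError:
        return f"unknown exit code {code}"


def run_extract(source: str, out_dir: str = "./out") -> str:
    """Run `vte extract` (must be on PATH) and describe how it exited."""
    result = subprocess.run(["vte", "extract", source, "--out", out_dir])
    return describe_exit(result.returncode)
```

`run_extract` only uses the `--out` flag documented above; a wrapper like this could, for example, retry on `provider_http_error` but stop immediately on `config_error`.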

## Providers

Configure LLM/VLM backends via the `ProviderSpec` shape. Smoke-test connectivity:

```bash
# Local Ollama:
uv run vte providers ping --provider ollama --model qwen2.5:14b-instruct

# OpenAI (requires OPENAI_API_KEY in env):
uv run vte providers ping --provider openai --model gpt-4o-mini \
  --api-key-env-var OPENAI_API_KEY

# Anthropic, OpenRouter — same shape; substitute provider and env var.
```

## Real Models (default still local)

When `cfg.classifier.vlm_fallback.enabled: true` and an Ollama (or hosted-API) provider is configured, ambiguous frames are reclassified by a VLM. When `cfg.ocr.primary.provider != "tesseract"`, the configured VLM does primary OCR with Tesseract running in parallel for an agreement score. When `cfg.narrative.segmentation.provider.provider != "fixed-window"`, the configured LLM generates real topic segments from the transcript.

**Default config is Ollama-first** — `qwen2.5vl:7b` (classifier VLM) and `llama3.1:8b` (narrative + code repair) drive the pipeline out of the box. Pull them once (`uv run vte models pull`) and you get the full v1.0 quality on the next extract. If Ollama isn't running or the models aren't pulled, every stage downgrades gracefully (recorded in `manifest.json → pipeline.stages[].downgrades[]`); the pipeline still exits 0 with M1-tier output. (v1.0.1 briefly defaulted to `qwen3-vl`, but its chain-of-thought reasoning mode produced empty/truncated structured output; v1.0.2 reverted to non-thinking models.)
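
Because a downgraded run still exits 0, the manifest is the place to check whether models were actually used. A minimal sketch for listing downgrades, assuming only the `pipeline.stages[].downgrades[]` path named above; the per-stage `name` field and the helper names are assumptions for illustration.

```python
import json
from pathlib import Path
from typing import Any


def collect_downgrades(manifest: dict[str, Any]) -> dict[str, list[Any]]:
    """Map a stage label to its downgrade entries, skipping clean stages."""
    found: dict[str, list[Any]] = {}
    stages = manifest.get("pipeline", {}).get("stages", [])
    for index, stage in enumerate(stages):
        downgrades = stage.get("downgrades") or []
        if downgrades:
            # "name" is an assumed field; fall back to the stage position.
            label = str(stage.get("name", f"stage[{index}]"))
            found[label] = list(downgrades)
    return found


def load_downgrades(manifest_path: str) -> dict[str, list[Any]]:
    """Read a manifest.json file and return its downgrades per stage."""
    text = Path(manifest_path).read_text(encoding="utf-8")
    return collect_downgrades(json.loads(text))
```

An empty result suggests every stage ran with its configured provider.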

To switch to a paid API instead of local Ollama, override the provider in your config:

```yaml
ocr:
  primary:
    provider: "openai"
    model: "gpt-4o-mini"
    api_key_env_var: "OPENAI_API_KEY"
narrative:
  segmentation:
    provider:
      provider: "anthropic"
      model: "claude-sonnet-4-5"
      api_key_env_var: "ANTHROPIC_API_KEY"
```

To revert to the pre-v1.0 heuristic-only behaviour (no model calls), set `classifier.vlm_fallback.enabled: false`, `narrative.segmentation.provider.provider: "fixed-window"`, and `reassembly.code.repair_provider.provider: "none"`.

## Code Reconstruction

Code-classified frames go through a four-stage reconstruction pipeline:

1. **Typing-progression detection** (`reasm/code_diff.py`) — consecutive code frames within a scene that look like incremental edits get collapsed; only the last frame's text is emitted.
2. **Language detection** (`reasm/language_detect.py`) — slide-hint lookup → Pygments `guess_lexer` → optional LLM tiebreaker.
3. **OCR repair** (`reasm/code_repair.py`) — LLM-driven fix of common OCR errors (`O/0`, `l/1`, broken indent), gated by tree-sitter validation: if the LLM rewrites too much, the repair is rejected.
4. **Validation** (`reasm/tree_sitter_validate.py`) — `ast.parse` for Python, `tree-sitter` for the other supported v1 languages.
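
The idea behind step 1 can be sketched with the standard library. This is not the logic in `reasm/code_diff.py`: the `difflib` similarity test, the `0.5` threshold, and the function names are assumptions chosen to illustrate collapsing a run of incremental edits down to its last frame.

```python
from difflib import SequenceMatcher


def is_incremental_edit(previous: str, current: str, threshold: float = 0.5) -> bool:
    """Heuristic: two frames belong to one typing run if their text is similar."""
    return SequenceMatcher(None, previous, current).ratio() >= threshold


def collapse_typing_runs(frames: list[str], threshold: float = 0.5) -> list[str]:
    """Collapse runs of incrementally edited frames, keeping each run's last text."""
    collapsed: list[str] = []
    for text in frames:
        if collapsed and is_incremental_edit(collapsed[-1], text, threshold):
            collapsed[-1] = text  # same run: the newer frame supersedes the older
        else:
            collapsed.append(text)  # different content: start a new run
    return collapsed
```

Given the OCR text of three frames that show a function being typed and a fourth frame showing unrelated code, this returns two entries: the finished function and the unrelated snippet.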

Default config: repair is OFF (`reassembly.code.repair_provider.provider: "none"`), but language detection + tree-sitter validation run on every code block. Each emitted `code` element carries `language`, `language_source`, `repair_status`, `validation`, and `agreement`.
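
For Python blocks, the validation step amounts to asking whether the text parses. A minimal sketch using `ast.parse`; the shape of the returned record is an assumption and is not the schema of the `validation` field in the manifest.

```python
import ast


def validate_python(source: str) -> dict[str, object]:
    """Return a small validation record for a Python code block."""
    try:
        ast.parse(source)
    except SyntaxError as exc:
        # lineno points at the first line the parser could not accept.
        return {"valid": False, "line": exc.lineno, "message": exc.msg}
    return {"valid": True, "line": None, "message": None}
```

A block that fails here is still useful output: the record tells a reader which line OCR most likely garbled.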

To enable LLM repair against a local Ollama Qwen2.5-Coder:

```yaml
reassembly:
  code:
    repair_provider:
      provider: "ollama"
      model: "qwen2.5-coder:14b"
      temperature: 0.0
```

Supported languages (v1): Python, JavaScript, TypeScript, Go, Rust, Java, C, C++, Bash.

## Whisper ASR (opt-in)

`vte` falls back from captions to Whisper ASR when captions are missing or empty. Enable with the `[asr]` extra:

```bash
pip install "video-text-extractor[asr]"
```

Word-level alignment via WhisperX is a further opt-in:

```bash
pip install "video-text-extractor[align]"
```

Neither extra is installed by default, which keeps the core install lean.

## Configuration

Precedence (highest → lowest):

1. CLI flags
2. `--config <path.yaml>` (per-job)
3. `~/.config/vte/config.yaml` (user-wide)
4. Shipped `src/vte/defaults.yaml`
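
One common way to implement this kind of precedence is a recursive merge in which later layers win. The sketch below assumes nested mappings are merged key by key rather than replaced wholesale; whether `vte` does exactly this is not stated here, and `deep_merge` / `resolve_config` are illustrative names.

```python
import copy
from typing import Any


def deep_merge(base: dict[str, Any], override: dict[str, Any]) -> dict[str, Any]:
    """Return a new dict where `override` wins; nested dicts merge recursively."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def resolve_config(*layers: dict[str, Any]) -> dict[str, Any]:
    """Merge layers passed lowest precedence first, highest precedence last."""
    resolved: dict[str, Any] = {}
    for layer in layers:
        resolved = deep_merge(resolved, layer)
    return resolved
```

With the list above, the call order would be `resolve_config(shipped_defaults, user_config, job_config, cli_overrides)`, so a CLI flag overrides the same key from any YAML file.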

Secrets (for hosted-API providers) come from environment variables only — see `.env.example`. The `config_snapshot` in `manifest.json` redacts any key whose underscore-separated components include `key`, `token`, `secret`, or `password`.
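
The redaction rule can be sketched as follows. The component test follows the description above; the placeholder string, the case-insensitive comparison, and the function names are assumptions.

```python
from typing import Any

SENSITIVE_PARTS = frozenset({"key", "token", "secret", "password"})
REDACTED = "***REDACTED***"  # placeholder text is an assumption


def is_sensitive(key: str) -> bool:
    """True if any underscore-separated component of the key is sensitive."""
    return any(part in SENSITIVE_PARTS for part in key.lower().split("_"))


def redact(value: Any) -> Any:
    """Return a copy of a nested config with sensitive values replaced."""
    if isinstance(value, dict):
        return {
            k: REDACTED if isinstance(k, str) and is_sensitive(k) else redact(v)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value
```

Matching whole components rather than substrings means `api_key_env_var` is redacted while a key such as `keyframes` is left alone.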

See [docs/configuration.md](docs/configuration.md) for the full YAML reference.

## Legal / Fair Use

`vte` is for personal, fair-use, and educational reconstruction of video content. Users are responsible for confirming their use respects the source video's license and applicable local law. The tool does not bypass DRM. The `--respect-no-derivatives` flag aborts (exit code 5) on videos licensed under no-derivatives terms.

## Development

```bash
uv sync --all-extras
uv run pytest -v                                # full suite
uv run pytest -m "not live and not e2e" -v      # default CI subset (189 tests)
uv run pytest -m integration -v                 # integration tests
uv run ruff check src tests
uv run ruff format src tests
uv run mypy src/vte/
```

## Architecture

8-stage pipeline:

```text
URL/path → acquire → transcript → keyframes → classify → ocr → narrative → reassembly → emit
```

| Stage | Implementation |
| --- | --- |
| acquire | yt-dlp + local-file passthrough; ND-license check via `--respect-no-derivatives` |
| transcript | VTT captions → Whisper ASR fallback (`[asr]`) → WhisperX alignment (`[align]`) |
| keyframes | PySceneDetect scene boundaries + uniform sampling within scenes + pHash dedup |
| classify | Heuristics (text coverage, hue counts, edge ratio) + VLM fallback for ambiguous frames |
| ocr | VLM primary + Tesseract sanity check (or Tesseract-only by config) |
| narrative | LLM topic segmentation (or fixed-window fallback) |
| reassembly | Diff-based code reconstruction + language detect + tree-sitter validation + segment interleave |
| emit | Manifest + Markdown + per-run README; schema-validated |

For the full data flow, module map, and extension points see:

- [Architecture](docs/architecture.md) — pipeline stages, module map, extension points.
- [Configuration](docs/configuration.md) — full YAML reference + provider spec.
- [FAQ](docs/faq.md) — troubleshooting + common gotchas.
- [Design plan](dev/design-plan.md) — cumulative *why* behind every decision.
- [Tech spec](dev/tech-spec_v1.md) — static v1 specification.

## License

MIT.
