Metadata-Version: 2.4
Name: batchfactory
Version: 0.2.0
Summary: Composable, cache-aware batch processing pipelines for LLMs, APIs, and dataset generation.
Project-URL: Homepage, https://github.com/fangzhangmnm/batchfactory
Project-URL: Issues, https://github.com/fangzhangmnm/batchfactory/issues
Author-email: fangzhangmnm <1561797062@qq.com>
License-Expression: MIT
License-File: LICENSE
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Requires-Python: >=3.9
Requires-Dist: aiofiles
Requires-Dist: aiolimiter
Requires-Dist: jsonlines
Requires-Dist: openai
Requires-Dist: pydantic
Requires-Dist: tqdm
Description-Content-Type: text/markdown

# BatchFactory

Composable, cache‑aware pipelines for **parallel LLM workflows**, API calls, and dataset generation.

> **Status — `v0.2` alpha.**  Stable enough for prototypes; expect fast‑moving APIs.

---

## Install

```bash
pip install batchfactory            # latest tag
pip install --upgrade batchfactory  # grab the newest patch
```

---

## Quick‑start

```python
import batchfactory as bf
from batchfactory.op import *

project = bf.CacheFolder("quickstart", 1, 0, 0)
broker  = bf.brokers.ConcurrentLLMCallBroker(project["cache/llm_broker.jsonl"])

# Rewrite the first three passages of every *.txt file into four‑line poems.

g = (
    ReadMarkdownLines("./data/*.txt", key_field="keyword", directory_str_field="directory")
    | Shuffle(42)
    | TakeFirstN(3)
    | GenerateLLMRequest(
        'Rewrite the passage from "{directory}" titled "{keyword}" as a four‑line poem.',
        model="gpt-4o-mini@openai",
    )
    | ConcurrentLLMCall(project["cache/llm_call.jsonl"], broker)
    | ExtractResponseText()
    | WriteJsonl(project["out/poems.jsonl"], output_fields=["keyword", "text", "directory"])
    | Print()
)

g.compile().execute(dispatch_brokers=True)
```

Run it twice – everything after the first run is served from the on‑disk ledger.

---

## Why BatchFactory?  **Three killer moves**

| 📦 Mass data distillation & cleanup                                                                                                                                                                                                                             | 🎭 Multi‑agent, multi‑round workflows                                                                                                                                                                   | 🔥 Hierarchical spawning for long text                                                                                                                                                                                   |
| --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| Chain `GenerateLLMRequest → ConcurrentLLMCall → ExtractResponseText` behind keyword or file sources to **bulk‑create, filter, or refine datasets** (think millions of Q\&A rows, code explanations, translation pairs) with caching and cost tracking built‑in. | `Repeat` plus chat helpers let you spin up translation swarms, code‑review pairs, or tutoring agents in **5 minutes of code** – conversations live in `chat_history`, cost and revisions are automatic. | `SpawnFromList` explodes complex items into fine‑grained subtasks, runs them **in parallel**, then `CollectAllToList` stitches results back – perfect for beat→scene→arc analy­sis or any long, messy document pipeline. |

---

### Loop snippet (Role‑Playing)

```python
Teacher = Character("teacher_name", TEACHER_PROMPT)
Student = Character("student_name", STUDENT_PROMPT)

g = ( ReadMarkdownLines("story.txt", "keyword")
      | SetField({"teacher_name":"Alice", "student_name":"Bob"})
      | Teacher("老师，请先讲解课文", 0)
      | Repeat( Student("同学提问或回答", 1)
                | Teacher("回应或继续讲解", 2), 3)
      | Teacher("请总结", 3)
      | ChatHistoryToText() | Print() )
```

### Spawn snippet (chapter → paragraph → chapter synopsis)

```python
project = bf.CacheFolder("spawn_demo", 1, 0, 0)
broker  = bf.brokers.ConcurrentLLMCallBroker(project["cache/llm.jsonl"])

g = ( ReadMarkdownLines("novel/*.md", "chapter")                 # each entry = a chapter
      | SpawnFromList("paragraphs", "para")                         # fan‑out per paragraph
      | GenerateLLMRequest("Summarise:\n{para}", model="gpt-4o-mini@openai")
      | ConcurrentLLMCall(project["cache/para_sum.jsonl"], broker)
      | CollectAllToList("text", "chapter_summaries")               # wait until ALL paras done
      | GenerateLLMRequest("Chapter synopsis:\n{chapter_summaries}",
                           model="gpt-4o-mini@openai")
      | ConcurrentLLMCall(project["cache/ch_sum.jsonl"], broker) )
```

*pseudocode, please see examples for implementation detail*

---

## Core concepts (one‑liner view)

| Term             | Story in one sentence                                                                                                                               |
| ---------------- | --------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Entry**        | Tiny record with immutable `idx`, mutable `data`, auto‑incrementing `rev`.                                                                          |
| **Op**           | Atomic node; compose with `|`or explicit`wire()`.                                                                                                   |
| **GraphSegment** | A lightweight helper: `.to_segment()` just wraps a node so `|`and`wire()` work on it—each actual op still appears exactly once.                     |
| **execute()**    | High‑level driver that *resumes*, *pumps*, and *dispatches* brokers.                                                                                |
| **Broker**       | Pluggable engine handling expensive / async jobs (LLM, search, human labels).                                                                       |
| **Ledger**       | Append‑only JSONL cache behind every broker & graph enabling instant restart.                                                                       |

*(You *can* call `pump()` manually, but 99 % of users stick to `execute()`.)*

---

## Primitive index (short list)

| Family              | Node                                                           | Blurb                             |
| ------------------- | -------------------------------------------------------------- | --------------------------------- |
| **Sources**         | `ReadMarkdownLines`, `FromList`                                | ingest files or raw dicts         |
| **Transforms**      | `Apply`, `Filter`, `SetField`                                  | python‑powered field tweaks       |
| **Spawn / Collect** | `SpawnFromList`, `CollectAllToList`                            | map‑reduce with unique child ids  |
| **Control flow**    | `If`, `While`, `Repeat`                                        | branch, loop, iterate             |
| **LLM**             | `GenerateLLMRequest → ConcurrentLLMCall → ExtractResponseText` | prompt, call, harvest             |
| **Utilities**       | `CleanupLLMData`, `PrintTotalCost`                             | tidy temporary fields, audit cost |

*(Shared‑idx ops `Replicate` / `Collect` are deprecated and vanish in v0.3.)*

---

## Example gallery

| ✨ Example                       | Demonstrates                                          |
| ------------------------------- | ----------------------------------------------------- |
| **01 – Basic pipeline**         | linear LLM transform & caching                        |
| **02 – Role‑playing loop**      | concise multi‑agent RPG using `Repeat` + chat helpers |
| **03 – Split & summarise**      | fan‑out/fan‑in summarisation (deprecated style)       |
| **04 – Long‑text segmentation** | Spawn + CollectAll power pattern                      |
| **05 – Math ops (unit)**        | loop + conditional logic under pure Python            |

---

## Broker & cache highlights

* Each expensive call is hashed → `job_idx` — **duplicate prompts are free**.
* `BrokerFailureBehavior = RETRY | STAY | EMIT` lets you decide how failures propagate.
* On restart, `execute()` reuses cached results and sends *only* the missing jobs.

---

## Roadmap → v0.3

* Enforce **unique `idx`** end‑to‑end → new `JoinByParent` replaces deprecated shared‑idx ops.
* Built‑in vector‑store & semantic‑search nodes.
* Streamlined cost & progress reporting.
* More batteries‑included tutorials.

---

© 2025 · MIT License
