Metadata-Version: 2.1
Name: bespokelabs-curator
Version: 0.1.19
Summary: Bespoke Labs Curator
Home-page: https://github.com/bespokelabsai/curator
License: Apache-2.0
Keywords: ai,curator,bespoke
Author: Bespoke Labs
Author-email: company@bespokelabs.ai
Requires-Python: >=3.10,<4.0
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Provides-Extra: code-execution
Provides-Extra: vllm
Requires-Dist: aiodocker (>=0.24.0,<0.25.0) ; extra == "code-execution"
Requires-Dist: aiofiles (>=22.0,<24.0)
Requires-Dist: anthropic (>=0.42.0,<0.43.0)
Requires-Dist: datasets (>=3.0.2,<4.0.0)
Requires-Dist: instructor (>=1.6.3,<2.0.0)
Requires-Dist: litellm (==1.55.4)
Requires-Dist: matplotlib (>=3.9.2,<4.0.0)
Requires-Dist: nest-asyncio (>=1.6.0,<2.0.0)
Requires-Dist: pandas (==2.2.2)
Requires-Dist: posthog (>=3.11.0,<4.0.0)
Requires-Dist: pydantic (>=2.9.2)
Requires-Dist: ray (>=2.41.0,<3.0.0) ; extra == "code-execution"
Requires-Dist: rich (>=13.7.0,<14.0.0)
Requires-Dist: tiktoken (>=0.7.0)
Requires-Dist: tqdm (>=4.67.0,<5.0.0)
Requires-Dist: vertexai (==1.71.1)
Requires-Dist: vllm (>=0.6.3,<0.7.0) ; extra == "vllm"
Requires-Dist: xxhash (>=3.5.0,<4.0.0)
Project-URL: Repository, https://github.com/bespokelabsai/curator
Description-Content-Type: text/markdown

<p align="center">
  <a href="https://bespokelabs.ai/" target="_blank">
    <picture>
      <source media="(prefers-color-scheme: light)" width="100px" srcset="docs/Bespoke-Labs-Logomark-Red-crop.png">
      <img alt="Bespoke Labs Logo" width="100px" src="docs/Bespoke-Labs-Logomark-Red-crop.png">
    </picture>
  </a>
</p>

<h1 align="center">Bespoke Curator</h1>
<h3 align="center" style="font-size: 20px; margin-bottom: 4px">Data Curation for Post-Training & Structured Data Extraction</h3>
<br/>

<p align="center">
  <a href="https://docs.bespokelabs.ai/bespoke-curator/getting-started">
    <img alt="Static Badge" src="https://img.shields.io/badge/Docs-docs.bespokelabs.ai-blue?style=flat&link=https%3A%2F%2Fdocs.bespokelabs.ai">
  </a>
  <a href="https://bespokelabs.ai/">
    <img alt="Site" src="https://img.shields.io/badge/Site-bespokelabs.ai-blue?link=https%3A%2F%2Fbespokelabs.ai"/>
  </a>
  <img alt="PyPI - Version" src="https://img.shields.io/pypi/v/bespokelabs-curator">
  <a href="https://twitter.com/bespokelabsai">
    <img src="https://img.shields.io/twitter/follow/bespokelabsai" alt="Follow on X" />
  </a>
  <a href="https://discord.gg/KqpXvpzVBS">
    <img alt="Discord" src="https://img.shields.io/discord/1230990265867698186">
  </a>
</p>
<div align="center">
[ English | <a href="README_zh.md">中文</a> ]
</div>

## 🎉 What's New 
* **[2025.01.30]** [Batch Processing Support for OpenAI, Anthropic, and other compatible APIs](https://www.bespokelabs.ai/blog/batch-processing-with-curator): Cut Token Costs in Half 🔥🔥🔥. Through our partnership with kluster.ai, new users using Curator can access open-source models like DeepSeek-R1 and receive a **$25 credit** (limits apply). Please [fill out this form](https://docs.google.com/forms/d/e/1FAIpQLSeBeBKA_19ljeUCkpwkUuUL5YXUUPKUExpjmBaSfmF2XAhwVA/viewform?usp=dialog) to claim your credit.
* **[2025.01.27]** We used Bespoke Curator to create [OpenThoughts-114k](https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k), a high-quality reasoning dataset (trending on HuggingFace).
* **[2025.01.22]** We used Bespoke Curator to create [Bespoke-Stratos-17k](https://huggingface.co/datasets/bespokelabs/Bespoke-Stratos-17k), a high-quality reasoning dataset (trending on HuggingFace).
* **[2025.01.15]** Curator launched 🎉

## Overview

Bespoke Curator makes it easy to create synthetic data pipelines. Whether you are training a model or extracting structured data, Curator will prepare high-quality data quickly and robustly.

* Rich Python based library for generating and curating synthetic data.
* Interactive viewer to monitor data while it is being generated.
* First class support for structured outputs.
* Built-in performance optimizations for asynchronous operations, caching, and fault recovery at every scale.
* Support for a wide range of inference options via LiteLLM, vLLM, and popular batch APIs.

![CLI in action](docs/curator-cli.gif)

Check out our full documentation for [getting started](https://docs.bespokelabs.ai/bespoke-curator/getting-started), [tutorials](https://docs.bespokelabs.ai/bespoke-curator/tutorials), [guides](https://docs.bespokelabs.ai/bespoke-curator/how-to-guides) and detailed [reference](https://docs.bespokelabs.ai/bespoke-curator/api-reference/llm-api-documentation).

## 🛠️ Installation

```bash
pip install bespokelabs-curator
```

## 🚀 Quickstart

### Using `curator.LLM`

```python
from bespokelabs import curator
llm = curator.LLM(model_name="gpt-4o-mini")
poem = llm("Write a poem about the importance of data in AI.")
print(poem.to_pandas())
```
> [!NOTE]
> Retries and caching are enabled by default to help you rapidly iterate your data pipelines.
> So now if you run the same prompt again, you will get the same response, pretty much instantly.
> You can delete the cache at `~/.cache/curator` or disable it with `export CURATOR_DISABLE_CACHE=true`.


> [!IMPORTANT]
> Make sure to set your API keys as environment variables for the model you are calling. For example running `export OPENAI_API_KEY=sk-...` and `export ANTHROPIC_API_KEY=ant-...` will allow you to run the previous two examples. A full list of supported models and their associated environment variable names can be found [in the litellm docs](https://docs.litellm.ai/docs/providers).

You can also send a list of prompts to the LLM, or a HuggingFace Dataset object (see below for more details).

### Using structured outputs and custom prompting and parsing logic

Here's an example of using structured outputs and custom prompting and parsing logic.

```python
from typing import Dict
from pydantic import BaseModel, Field
from bespokelabs import curator
from datasets import Dataset

class Poem(BaseModel):
    title: str = Field(description="The title of the poem.")
    poem: str = Field(description="The content of the poem.")

class Poet(curator.LLM):
    response_format = Poem

    def prompt(self, input: Dict) -> str:
        return f"Write two poems about {input['topic']}."

    def parse(self, input: Dict, response: Poem) -> Dict:
        return [{"title": response.title, "poem": response.poem}]

poet = Poet(model_name="gpt-4o-mini")
topics = Dataset.from_dict({'topic': ['Dreams of a Robot']})

poems = poet(topics)
print(poems.to_pandas())
```
Output:
```
    title                           poem
0   Dreams of a Robot: Awakening    In circuits deep, where silence hums, \nA dre..
1   Life of an AI Agent - Poem 1    In circuits woven, thoughts ignite,\nI dwell i...
```

In the `Poet` class:
* `response_format` is the structured output class we defined above.
* `prompt` takes the input (`input`) and returns the prompt for the LLM.
* `parse` takes the input (`input`) and the structured output (`response`) and converts it to a list of dictionaries. This is so that we can easily convert the output to a HuggingFace Dataset object.

Note that `topics` can be created with another `LLM` class as well,
and we can scale this up to create tens of thousands of diverse poems.

```python

class Topics(BaseModel):
    topics_list: List[str] = Field(description="A list of topics.")

class TopicGenerator(curator.LLM):
  response_format = Topics

  def prompt(self, subject):
    return f"Return 3 topics related to {subject}"

  def parse(self, input: str, response: Topics):
    return [{"topic": t} for t in response.topics_list]

topic_generator = TopicGenerator(model_name="gpt-4o-mini")
topics = topic_generator("Mathematics")
poems = poet(topics)
```
Output:
```
 	title                     poem
0	The Language of Algebra	  In symbols and signs, truths intertwine,..
1	The Geometry of Space	  In the world around us, shapes do collide,..
2	The Language of Logic	  In circuits and wires where silence speaks,..
```

You can see more examples in the [examples](examples) directory.

See the [docs](https://docs.bespokelabs.ai/) for more details as well as
for troubleshooting information.

> [!TIP]
> If you are generating large datasets, you may want to use [batch mode](https://docs.bespokelabs.ai/bespoke-curator/tutorials/save-usdusdusd-with-batch-mode) to save costs. Currently batch APIs from [OpenAI](https://platform.openai.com/docs/guides/batch) and [Anthropic](https://docs.anthropic.com/en/docs/build-with-claude/message-batches) are supported. With curator this is as simple as setting `batch=True` in the `LLM` class.

### Anonymized Telemetry

We collect minimal, anonymized usage telemetry to help prioritize new features and improvements that benefit the Curator community. You can opt out by setting the `TELEMETRY_ENABLED` environment variable to `False`. 

## 📖 Providers
Curator supports a wide range of providers, including OpenAI, Anthropic, and many more. 

### OpenAI backend
```python
llm = curator.LLM(
    model_name="gpt-4o-mini",
)
```
For other models that support OpenAI-compatible APIs, you can use the `openai` backend:
```python
llm = curator.LLM(
    model_name="gpt-4o-mini",
    backend="openai",
    backend_params={
        "base_url": "https://your-openai-compatible-api-url",
        "api_key": <YOUR_OPENAI_COMPATIBLE_SERVICE_API_KEY>,
    },
)
```


### LiteLLM (Anthropic, Gemini, together.ai, etc.)
Here is an example of using Gemini with litellm backend:
```python
llm = curator.LLM(
    model_name="gemini/gemini-1.5-flash",
    backend="litellm",
    backend_params={
        "max_requests_per_minute": 2_000,
        "max_tokens_per_minute": 4_000_000
    },
)
```
[Documentation](https://docs.bespokelabs.ai/bespoke-curator/how-to-guides/using-litellm-for-diverse-providers)

### Ollama
```python
llm = curator.LLM(
    model_name="ollama/llama3.1:8b",  # Ollama model identifier
    backend_params={"base_url": "http://localhost:11434"},
)
```
[Documentation](https://docs.bespokelabs.ai/bespoke-curator/how-to-guides/using-ollama-with-curator#id-2.-configure-the-ollama-backend)

### vLLM

```python
llm = curator.LLM( 
    model_name="Qwen/Qwen2.5-3B-Instruct", 
    backend="vllm", 
    backend_params={ 
        "tensor_parallel_size": 1, # Adjust based on GPU count 
        "gpu_memory_utilization": 0.7 
    }
)
```
[Documentation](https://docs.bespokelabs.ai/bespoke-curator/how-to-guides/using-vllm-with-curator#id-3-initialize-and-use-the-generator)
### DeepSeek
DeepSeek offers an OpenAI-compatible API that you can use with the `openai` backend.
> [!IMPORTANT]
> The DeepSeek API is experiencing intermittent issues and will return empty responses during times of high traffic. We recommend
calling the DeepSeek API through the `openai` backend, with a high max retries so that we can retry failed requests upon empty
response and a reasonable max requests and tokens per minute so we don't retry too aggressively and overwhelm the API.

```python
llm = curator.LLM(
    model_name="deepseek-reasoner",
    generation_params={"temp": 0.0},
    backend_params={
        "max_requests_per_minute": 100,
        "max_tokens_per_minute": 10_000_000,
        "base_url": "https://api.deepseek.com/",
        "api_key": <YOUR_DEEPSEEK_API_KEY>,
        "max_retries": 50,
    },
    backend="openai",
)
```

### kluster.ai
```python
llm = curator.LLM(
    model_name="deepseek-ai/DeepSeek-R1", 
    backend="klusterai",
)
```
[Documentation](https://docs.bespokelabs.ai/bespoke-curator/how-to-guides/using-kluster.ai-for-batch-inference)
## 📦 Batch Mode
Several providers offer about 50% discount on token usage when using batch mode. Curator makes it easy to use batch mode with a wide range of providers.

Example with OpenAI ([docs reference](https://docs.bespokelabs.ai/bespoke-curator/how-to-guides/using-openai-for-batch-inference)):
```python
llm = curator.LLM(model_name="gpt-4o-mini", batch=True)
```

See documentation:
* [OpenAI batch mode](https://docs.bespokelabs.ai/bespoke-curator/how-to-guides/using-openai-for-batch-inference)
* [Anthropic batch mode](https://docs.bespokelabs.ai/bespoke-curator/how-to-guides/using-anthropic-for-batch-inference)
* [Gemini batch mode](https://docs.bespokelabs.ai/bespoke-curator/how-to-guides/using-gemini-for-batch-inference)
* [kluster.ai batch mode](https://docs.bespokelabs.ai/bespoke-curator/how-to-guides/using-kluster.ai-for-batch-inference)

## Bespoke Curator Viewer

![Viewer in action](docs/curator-viewer.gif)

To run the bespoke dataset viewer:

```bash
curator-viewer
```

This will pop up a browser window with the viewer running on `127.0.0.1:3000` by default if you haven't specified a different host and port.

The dataset viewer shows all the different runs you have made. Once a run is selected, you can see the dataset and the responses from the LLM.

Optional parameters to run the viewer on a different host and port:
```bash
>>> curator-viewer -h
usage: curator-viewer [-h] [--host HOST] [--port PORT] [--verbose]

Curator Viewer

options:
  -h, --help     show this help message and exit
  --host HOST    Host to run the server on (default: localhost)
  --port PORT    Port to run the server on (default: 3000)
  --verbose, -v  Enables debug logging for more verbose output
```

The only requirement for running `curator-viewer` is to install node. You can install them by following the instructions [here](https://nodejs.org/en/download/package-manager).

For example, to check if you have node installed, you can run:

```bash
node -v
```

If it's not installed, installing latest node on MacOS, you can run:

```bash
# installs nvm (Node Version Manager)
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.0/install.sh | bash
# download and install Node.js (you may need to restart the terminal)
nvm install 22
# verifies the right Node.js version is in the environment
node -v # should print `v22.11.0`
# verifies the right npm version is in the environment
npm -v # should print `10.9.0`
```

## Contributing
Thank you to all the contributors for making this project possible!
Please follow [these instructions](CONTRIBUTING.md) on how to contribute.

## Citation
If you find Curator useful, please consider citing us!

```
@software{Curator: A Tool for Synthetic Data Creation,
  author = {Marten, Ryan* and Vu, Trung* and Ji, Charlie Cheng-Jie and Sharma, Kartik and Pimpalgaonkar, Shreyas and Dimakis, Alex and Sathiamoorthy, Maheswaran},
  month = jan,
  title = {{Curator}},
  year = {2025},
  howpublished = {\url{https://github.com/bespokelabsai/curator}}
}
```

