Metadata-Version: 2.3
Name: bespokelabs-curator
Version: 0.1.23.post1
Summary: Bespoke Labs Curator
Home-page: https://github.com/bespokelabsai/curator
License: Apache-2.0
Keywords: ai,curator,bespoke
Author: Bespoke Labs
Author-email: company@bespokelabs.ai
Requires-Python: >=3.10,<4.0
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Provides-Extra: code-execution
Provides-Extra: vllm
Requires-Dist: aiodocker (>=0.24.0,<0.25.0) ; extra == "code-execution"
Requires-Dist: aiofiles (>=22.0,<24.0)
Requires-Dist: anthropic (>=0.47.2,<0.48.0)
Requires-Dist: datasets (>=3.0.2,<4.0.0)
Requires-Dist: e2b-code-interpreter (==1.0.3) ; extra == "code-execution"
Requires-Dist: instructor (>=1.6.3,<2.0.0)
Requires-Dist: litellm (==1.61.3)
Requires-Dist: matplotlib (>=3.9.2,<4.0.0)
Requires-Dist: mistralai (>=1.5.1,<2.0.0)
Requires-Dist: nest-asyncio (>=1.6.0,<2.0.0)
Requires-Dist: pandas (==2.2.2)
Requires-Dist: posthog (>=3.11.0,<4.0.0)
Requires-Dist: pydantic (>=2.9.2)
Requires-Dist: ray (>=2.41.0,<3.0.0) ; extra == "code-execution"
Requires-Dist: rich (>=13.7.0,<14.0.0)
Requires-Dist: tiktoken (>=0.7.0)
Requires-Dist: tqdm (>=4.67.0,<5.0.0)
Requires-Dist: vertexai (==1.71.1)
Requires-Dist: vllm (>=0.6.3,<0.7.0) ; extra == "vllm"
Requires-Dist: xxhash (>=3.5.0,<4.0.0)
Project-URL: Repository, https://github.com/bespokelabsai/curator
Description-Content-Type: text/markdown

<p align="center">
  <a href="https://bespokelabs.ai/" target="_blank">
    <picture>
      <source media="(prefers-color-scheme: light)" width="100px" srcset="https://github.com/bespokelabsai/curator/blob/main/docs/Bespoke-Labs-Logomark-Red-crop.png">
      <img alt="Bespoke Labs Logo" width="100px" src="https://github.com/bespokelabsai/curator/blob/main/docs/Bespoke-Labs-Logomark-Red-crop.png">
    </picture>
  </a>
</p>

<h1 align="center">Bespoke Curator</h1>
<h3 align="center" style="font-size: 20px; margin-bottom: 4px">Data Curation for Post-Training & Structured Data Extraction</h3>
<br/>

<div align="center">

[![Github](https://img.shields.io/badge/Curator-000000?style=for-the-badge&logo=github&logoColor=000&logoColor=white)](https://github.com/bespokelabsai/curator/) [![Twitter](https://img.shields.io/badge/@BespokeLabsai-white?style=for-the-badge&logo=X&logoColor=white&color=000)](https://x.com/bespokelabsai) [![Hugging Face](https://img.shields.io/badge/BespokeLabs-fcd022?style=for-the-badge&logo=huggingface&logoColor=000&labelColor)](https://huggingface.co/bespokelabs) [![Discord](https://img.shields.io/badge/Bespoke_Labs-5865F2?style=for-the-badge&logo=discord&logoColor=white)](https://discord.gg/KqpXvpzVBS) 
<br>
[![Docs](https://img.shields.io/badge/Docs-docs.bespokelabs.ai-blue?style=for-the-badge&link=https%3A%2F%2Fdocs.bespokelabs.ai&labelColor=000)](https://docs.bespokelabs.ai/bespoke-curator/getting-started) [![Website](https://img.shields.io/badge/Site-bespokelabs.ai-blue?style=for-the-badge&link=https%3A%2F%2Fbespokelabs.ai&labelColor=000)](https://bespokelabs.ai/) [![PyPI](https://img.shields.io/pypi/v/bespokelabs-curator?style=for-the-badge&labelColor=000)](https://pypi.org/project/bespokelabs-curator/)
</div>

<div align="center">
[ English | <a href="docs/README_zh.md">中文</a> ]
</div>

## 🎉 What's New 
* **[2025.03.12]** [Gemini Batch support added](https://www.bespokelabs.ai/blog/effortless-gemini-batch-processing-with-curator): The Gemini batch API is challenging to use directly, and we made it much simpler! :)
* **[2025.03.05]** [Claude 3.7 Sonnet Thinking and batch mode support added](https://www.bespokelabs.ai/blog/claude-3-7-sonnet-thinking-mode-in-curator).
* **[2025.02.26]** [Code Execution Support added](https://www.bespokelabs.ai/blog/launching-code-executor): You can now run code (generated by Curator) using CodeExecutor. We support four backends: local (called multiprocessing), Ray, Docker and e2b.
* **[2025.02.06]** We used Bespoke Curator to create [s1K-1.1](https://huggingface.co/datasets/simplescaling/s1K-1.1), a high-quality sample-efficient reasoning dataset.
* **[2025.01.30]** [Batch Processing Support for OpenAI, Anthropic, and other compatible APIs](https://www.bespokelabs.ai/blog/batch-processing-with-curator): Cut Token Costs in Half 🔥🔥🔥. Through our partnership with kluster.ai, new users using Curator can access open-source models like DeepSeek-R1 and receive a **$25 credit** (limits apply). EDIT: Promotion has come to an end.
* **[2025.01.27]** We used Bespoke Curator to create [OpenThoughts-114k](https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k), a high-quality reasoning dataset (trending on HuggingFace).
* **[2025.01.22]** We used Bespoke Curator to create [Bespoke-Stratos-17k](https://huggingface.co/datasets/bespokelabs/Bespoke-Stratos-17k), a high-quality reasoning dataset (trending on HuggingFace).
* **[2025.01.15]** Curator launched 🎉

## Overview

Bespoke Curator makes it easy to create synthetic data pipelines. Whether you are training a model or extracting structured data, Curator will prepare high-quality data quickly and robustly.

* Rich Python-based library for generating and curating synthetic data.
* Interactive viewer to monitor data while it is being generated.
* First-class support for structured outputs.
* Built-in performance optimizations for asynchronous operations, caching, and fault recovery at every scale.
* Support for a wide range of inference options via LiteLLM, vLLM, and popular batch APIs.

![CLI in action](https://github.com/bespokelabsai/curator/blob/main/docs/curator-cli.gif)

Check out our full documentation for [getting started](https://docs.bespokelabs.ai/bespoke-curator/getting-started), [tutorials](https://docs.bespokelabs.ai/bespoke-curator/tutorials), [guides](https://docs.bespokelabs.ai/bespoke-curator/how-to-guides) and detailed [reference](https://docs.bespokelabs.ai/bespoke-curator/api-reference/llm-api-documentation).

## 🛠️ Installation

```bash
pip install bespokelabs-curator
```

## 📕 Examples

### Finetuning/Distillation
| **Task** | **Link(s)** | **Goal** |
|----------|--------------|-------------|
| **Product feature extraction** | <a target="_blank" href="https://colab.research.google.com/drive/1YoA23-cBcWpaSErULzBI2bo2LPGo37GQ"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a> | Finetuning a model to identify features of a product |
| **Sentiment analysis** | <a href="https://colab.research.google.com/drive/1Zfl3g7POsqqYQqkzXdyhYRSAymLhZugn?usp=sharing" target="_blank"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a> | Aspect-based sentiment analysis of restaurant reviews and finetuning using Together.ai |
| **RAFT for domain-specific RAG** | <a href="https://github.com/bespokelabsai/curator/tree/main/examples/blocks/raft" target="_blank">Code</a> | Implement Retrieval Augmented Fine-Tuning (RAFT) that processes domain-specific documents, generates questions, and prepares data for fine-tuning LLMs. |

### Data Generation
| **Task** | **Link(s)** | **Goal** |
|----------|--------------|-------------|
| **Reasoning dataset generation (Bespoke Stratos)** | <a href="https://github.com/bespokelabsai/curator/tree/main/examples/bespoke-stratos-data-generation" target="_blank">Code</a> | Generate the Bespoke-Stratos-17k dataset, focusing on reasoning traces from math, coding, and problem-solving datasets. |
| **Reasoning dataset generation (Open Thoughts)** | <a href="https://github.com/open-thoughts/open-thoughts" target="_blank">Code</a> | Generate the Open-Thoughts-114k dataset, focusing on reasoning traces from math, coding, and problem-solving datasets.|
| **Multimodal** | <a href="https://github.com/bespokelabsai/curator/tree/main/examples/multimodal" target="_blank">Code</a> | Demonstrates multimodal capabilities by generating recipes from food images |
| **Ungrounded Question Answer generation** | <a href="https://github.com/bespokelabsai/curator/tree/main/examples/ungrounded-qa" target="_blank">Code</a> | Generate diverse question-answer pairs using techniques similar to the CAMEL paper |
| **Code Execution** | <a href="https://colab.research.google.com/drive/1YKj1-BC66-3LgNkf1m5AEPswIYtpOU-k" target="_blank"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>| Execute code generated with Curator |
| **3Blue1Brown video generation** | <a href="https://github.com/bespokelabsai/curator/tree/main/examples/code-execution/math-animation" target="_blank">Code</a> | Generate videos similar to 3Blue1Brown and render them using code execution! |
| **Synthetic charts** | <a href="https://github.com/bespokelabsai/curator/blob/main/examples/code-execution/chart-generation/charts.py" target="_blank">Code</a> | Generate charts synthetically. |
| **Function calling** | <a href="https://github.com/bespokelabsai/curator/tree/main/examples/function-calling" target="_blank">Code</a> | Generate data for finetuning for function calling. |



## 🚀 Quickstart

### Using `curator.LLM`

```python
from bespokelabs import curator
llm = curator.LLM(model_name="gpt-4o-mini")
poem = llm("Write a poem about the importance of data in AI.")
print(poem.to_pandas())
```
> [!NOTE]
> Retries and caching are enabled by default to help you rapidly iterate your data pipelines.
> So now if you run the same prompt again, you will get the same response, pretty much instantly.
> You can delete the cache at `~/.cache/curator` or disable it with `export CURATOR_DISABLE_CACHE=true`.

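If you work in a notebook rather than a shell, the same cache settings can be applied from Python. The sketch below is illustrative only: `configure_curator_cache` is a hypothetical helper (not part of Curator), `./curator_cache` is a made-up path, and it assumes Curator reads `CURATOR_DISABLE_CACHE` and `CURATOR_CACHE_DIR` when it runs, so the helper should be called before creating an `LLM`.

```python
import os
from pathlib import Path
from typing import Optional


def configure_curator_cache(disable: bool = False, cache_dir: Optional[str] = None) -> None:
    """Hypothetical helper: set Curator's cache environment variables for this process."""
    if disable:
        os.environ["CURATOR_DISABLE_CACHE"] = "true"
    else:
        # Remove the variable entirely so the default (caching on) applies.
        os.environ.pop("CURATOR_DISABLE_CACHE", None)
    if cache_dir is not None:
        os.environ["CURATOR_CACHE_DIR"] = str(Path(cache_dir).expanduser())


# Keep caching on, but store it in a project-local folder (hypothetical path).
configure_curator_cache(cache_dir="./curator_cache")
```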

> [!IMPORTANT]
> Make sure to set the API key for the model you are calling as an environment variable. For example, running `export OPENAI_API_KEY=sk-...` lets you run the example above, and `export ANTHROPIC_API_KEY=sk-ant-...` does the same for Anthropic models. A full list of supported models and their associated environment variable names can be found [in the litellm docs](https://docs.litellm.ai/docs/providers).

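A quick pre-flight check can catch a missing key before a long run starts. This is a minimal sketch using only the standard library; `missing_api_keys` is a hypothetical helper, not a Curator function, and the variable names are the two mentioned above.

```python
import os
from typing import Iterable, List, Mapping


def missing_api_keys(required: Iterable[str], env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names in `required` that are unset or empty in `env`."""
    return [name for name in required if not env.get(name)]


# Check the provider keys you plan to use before starting a long run.
missing = missing_api_keys(["OPENAI_API_KEY", "ANTHROPIC_API_KEY"])
if missing:
    print("Set these environment variables first:", ", ".join(missing))
```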
You can also send a list of prompts to the LLM, or a HuggingFace Dataset object (see below for more details).

### Using structured outputs and custom prompting and parsing logic

Here's an example of using structured outputs and custom prompting and parsing logic.

```python
from typing import Dict, List
from pydantic import BaseModel, Field
from bespokelabs import curator
from datasets import Dataset

class Poem(BaseModel):
    title: str = Field(description="The title of the poem.")
    poem: str = Field(description="The content of the poem.")

class Poet(curator.LLM):
    response_format = Poem

    def prompt(self, input: Dict) -> str:
        # Build the prompt for one input row.
        return f"Write a poem about {input['topic']}."

    def parse(self, input: Dict, response: Poem) -> List[Dict]:
        # Return a list of rows; each dict becomes one row of the output dataset.
        return [{"title": response.title, "poem": response.poem}]

poet = Poet(model_name="gpt-4o-mini")
topics = Dataset.from_dict({'topic': ['Dreams of a Robot']})

poems = poet(topics)
print(poems.to_pandas())
```
Output:
```
    title                           poem
0   Dreams of a Robot: Awakening    In circuits deep, where silence hums, \nA dre..
1   Life of an AI Agent - Poem 1    In circuits woven, thoughts ignite,\nI dwell i...
```

In the `Poet` class:
* `response_format` is the structured output class we defined above.
* `prompt` takes the input (`input`) and returns the prompt for the LLM.
* `parse` takes the input (`input`) and the structured output (`response`) and converts it to a list of dictionaries. This is so that we can easily convert the output to a HuggingFace Dataset object.

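To make the `parse` contract concrete, here is a plain-Python sketch of how per-response lists of dictionaries flatten into dataset rows. It is not Curator's implementation: the `Poem` dataclass stands in for the pydantic model, `collect_rows` is a hypothetical helper, and the response text is made up. It also carries `topic` from the input into each row, which is one reason `parse` receives `input`.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass
class Poem:
    # Plain-Python stand-in for the pydantic `Poem` model.
    title: str
    poem: str


def parse(input: Dict, response: Poem) -> List[Dict]:
    # One response becomes a list of rows; the input topic is kept alongside it.
    return [{"topic": input["topic"], "title": response.title, "poem": response.poem}]


def collect_rows(inputs: Iterable[Dict], responses: Iterable[Poem],
                 parse_fn: Callable[[Dict, Poem], List[Dict]]) -> List[Dict]:
    """Flatten the per-response lists returned by `parse_fn` into one table."""
    rows: List[Dict] = []
    for item, response in zip(inputs, responses):
        rows.extend(parse_fn(item, response))
    return rows


inputs = [{"topic": "Dreams of a Robot"}]
responses = [Poem(title="Hypothetical title", poem="Hypothetical text")]
rows = collect_rows(inputs, responses, parse)
print(rows)
```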
Note that `topics` can be created with another `LLM` class as well,
and we can scale this up to create tens of thousands of diverse poems.

```python

# Continues the previous example: reuses BaseModel, Field, curator, and poet.
from typing import Dict, List

class Topics(BaseModel):
    topics_list: List[str] = Field(description="A list of topics.")

class TopicGenerator(curator.LLM):
    response_format = Topics

    def prompt(self, subject: str) -> str:
        return f"Return 3 topics related to {subject}"

    def parse(self, input: str, response: Topics) -> List[Dict]:
        # One row per topic, so each topic becomes an input to `poet`.
        return [{"topic": t} for t in response.topics_list]

topic_generator = TopicGenerator(model_name="gpt-4o-mini")
topics = topic_generator("Mathematics")
poems = poet(topics)
print(poems.to_pandas())
Output:
```
 	title                     poem
0	The Language of Algebra	  In symbols and signs, truths intertwine,..
1	The Geometry of Space	  In the world around us, shapes do collide,..
2	The Language of Logic	  In circuits and wires where silence speaks,..
```

You can see more examples in the [examples](examples) directory.

See the [docs](https://docs.bespokelabs.ai/) for more details as well as
for troubleshooting information.

> [!TIP]
> If you are generating large datasets, you may want to use [batch mode](https://docs.bespokelabs.ai/bespoke-curator/tutorials/save-usdusdusd-with-batch-mode) to save costs. Batch APIs from several providers are supported, including [OpenAI](https://platform.openai.com/docs/guides/batch) and [Anthropic](https://docs.anthropic.com/en/docs/build-with-claude/message-batches). With Curator this is as simple as setting `batch=True` in the `LLM` class.

### Anonymized Telemetry

We collect minimal, anonymized usage telemetry to help prioritize new features and improvements that benefit the Curator community. You can opt out by setting the `TELEMETRY_ENABLED` environment variable to `False`. 

## 📖 Providers
Curator supports a wide range of providers, including OpenAI, Anthropic, and many more. 

### OpenAI backend
```python
llm = curator.LLM(
    model_name="gpt-4o-mini",
)
```
For other models that support OpenAI-compatible APIs, you can use the `openai` backend:
```python
llm = curator.LLM(
    model_name="gpt-4o-mini",
    backend="openai",
    backend_params={
        "base_url": "https://your-openai-compatible-api-url",
        "api_key": "<YOUR_OPENAI_COMPATIBLE_SERVICE_API_KEY>",  # replace with your key
    },
)
```


### LiteLLM (Anthropic, Gemini, together.ai, etc.)
Here is an example of using Gemini with the `litellm` backend:
```python
llm = curator.LLM(
    model_name="gemini/gemini-1.5-flash",
    backend="litellm",
    backend_params={
        "max_requests_per_minute": 2_000,
        "max_tokens_per_minute": 4_000_000
    },
)
```
[Documentation](https://docs.bespokelabs.ai/bespoke-curator/how-to-guides/using-litellm-for-diverse-providers)

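When choosing `max_requests_per_minute` and `max_tokens_per_minute`, it can help to estimate which limit will bind for your workload. The sketch below is simple arithmetic, not a Curator API: `estimate_minutes` is a hypothetical helper, the workload numbers are made up, and the result is only a lower bound that assumes the two rate limits are the sole bottleneck.

```python
def estimate_minutes(num_requests: int, avg_tokens_per_request: float,
                     max_requests_per_minute: int, max_tokens_per_minute: int) -> float:
    """Lower bound on wall-clock minutes if the two rate limits are the only bottleneck."""
    by_requests = num_requests / max_requests_per_minute
    by_tokens = num_requests * avg_tokens_per_request / max_tokens_per_minute
    # Whichever limit takes longer to satisfy determines the bound.
    return max(by_requests, by_tokens)


# Limits from the Gemini example; the workload numbers are hypothetical.
minutes = estimate_minutes(
    num_requests=100_000,
    avg_tokens_per_request=1_000,
    max_requests_per_minute=2_000,
    max_tokens_per_minute=4_000_000,
)
print(f"At least {minutes:.0f} minutes")
```

With these numbers the request limit binds (50 minutes versus 25 for tokens); at 4,000 tokens per request the token limit would bind instead.
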
### Ollama
```python
llm = curator.LLM(
    model_name="ollama/llama3.1:8b",  # Ollama model identifier
    backend_params={"base_url": "http://localhost:11434"},
)
```
[Documentation](https://docs.bespokelabs.ai/bespoke-curator/how-to-guides/using-ollama-with-curator#id-2.-configure-the-ollama-backend)

### vLLM

```python
llm = curator.LLM( 
    model_name="Qwen/Qwen2.5-3B-Instruct", 
    backend="vllm", 
    backend_params={ 
        "tensor_parallel_size": 1, # Adjust based on GPU count 
        "gpu_memory_utilization": 0.7 
    }
)
```
[Documentation](https://docs.bespokelabs.ai/bespoke-curator/how-to-guides/using-vllm-with-curator#id-3-initialize-and-use-the-generator)

### DeepSeek
DeepSeek offers an OpenAI-compatible API that you can use with the `openai` backend.
> [!IMPORTANT]
> The DeepSeek API is experiencing intermittent issues and may return empty responses during periods of high traffic. We recommend
> calling it through the `openai` backend with a high `max_retries`, so that requests that come back empty are retried, and with
> moderate `max_requests_per_minute` and `max_tokens_per_minute` limits, so that retries do not overwhelm the API.

```python
llm = curator.LLM(
    model_name="deepseek-reasoner",
    generation_params={"temperature": 0.0},
    backend_params={
        "max_requests_per_minute": 100,
        "max_tokens_per_minute": 10_000_000,
        "base_url": "https://api.deepseek.com/",
        "api_key": "<YOUR_DEEPSEEK_API_KEY>",  # replace with your key
        "max_retries": 50,
    },
    backend="openai",
)
```
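The idea behind a high retry count paired with conservative rate limits can be sketched in plain Python. This is not Curator's internal retry logic: `call_with_retries` is a hypothetical helper, the flaky endpoint is simulated, and exponential backoff with jitter is one common way to avoid hammering an overloaded API.

```python
import random
import time
from typing import Callable, Optional


def call_with_retries(request: Callable[[], str], max_retries: int = 50,
                      base_delay: float = 0.5, max_delay: float = 30.0,
                      sleep: Callable[[float], None] = time.sleep) -> Optional[str]:
    """Call `request` until it returns a non-empty string or retries run out."""
    for attempt in range(max_retries + 1):
        response = request()
        if response:  # an empty string counts as a failed request
            return response
        if attempt < max_retries:
            # Exponential backoff with jitter, capped at `max_delay` seconds.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay * random.uniform(0.5, 1.0))
    return None


# Simulated flaky endpoint: empty twice, then a real answer.
replies = iter(["", "", "42"])
result = call_with_retries(lambda: next(replies), sleep=lambda _: None)
print(result)
```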

### kluster.ai
```python
llm = curator.LLM(
    model_name="deepseek-ai/DeepSeek-R1", 
    backend="klusterai",
)
```
[Documentation](https://docs.bespokelabs.ai/bespoke-curator/how-to-guides/using-kluster.ai-for-batch-inference)

## 📦 Batch Mode
Several providers offer a discount of about 50% on token usage in batch mode. Curator makes it easy to use batch mode with a wide range of providers.

Example with OpenAI ([docs reference](https://docs.bespokelabs.ai/bespoke-curator/how-to-guides/using-openai-for-batch-inference)):
```python
llm = curator.LLM(model_name="gpt-4o-mini", batch=True)
```
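To get a rough sense of what the discount means for a given job, a back-of-the-envelope estimate is enough. The sketch below is arithmetic only: `estimate_cost` is a hypothetical helper, and the token counts and per-million-token prices are invented for illustration, so check your provider's current pricing and actual batch discount.

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price_per_million: float, output_price_per_million: float,
                  batch_discount: float = 0.0) -> float:
    """Estimated cost in dollars; `batch_discount` is a fraction, e.g. 0.5."""
    full_price = (input_tokens * input_price_per_million
                  + output_tokens * output_price_per_million) / 1_000_000
    return full_price * (1.0 - batch_discount)


# Hypothetical workload and prices, for illustration only.
workload = dict(input_tokens=200_000_000, output_tokens=50_000_000,
                input_price_per_million=1.0, output_price_per_million=4.0)
online = estimate_cost(**workload)
batched = estimate_cost(**workload, batch_discount=0.5)
print(f"online: ${online:.2f}, batch: ${batched:.2f}")
```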

See documentation:
* [OpenAI batch mode](https://docs.bespokelabs.ai/bespoke-curator/how-to-guides/using-openai-for-batch-inference)
* [Anthropic batch mode](https://docs.bespokelabs.ai/bespoke-curator/how-to-guides/using-anthropic-for-batch-inference)
* [Gemini batch mode](https://docs.bespokelabs.ai/bespoke-curator/how-to-guides/using-gemini-for-batch-inference)
* [kluster.ai batch mode](https://docs.bespokelabs.ai/bespoke-curator/how-to-guides/using-kluster.ai-for-batch-inference)
* [Mistral batch mode](https://docs.bespokelabs.ai/bespoke-curator/how-to-guides/using-mistral-for-batch-inference)

## Bespoke Curator Viewer
The hosted Curator viewer is a rich interface that makes visually inspecting your data much easier.

You can enable it as follows:

Bash:

```shell
export CURATOR_VIEWER=1
```
Python/colab:
```python
import os
os.environ["CURATOR_VIEWER"]="1"
```

With this enabled, Curator uploads data as it is generated, and you can watch the responses stream into the viewer. The URL for the viewer is displayed right next to the rich progress display.

## Environment Variables

We support a range of environment variables to customize the behavior of Curator.

Here is a complete table of environment variables:
| Variable | Description | Default |
|----------|-------------|---------|
| `CURATOR_VIEWER` | Enables the Curator viewer for visualizing data curation when `True`. | `False` |
| `CURATOR_DISABLE_CACHE` | Disables caching for `curator.LLM` generations when `True`. Useful for fresh runs. | `False` |
| `CURATOR_CACHE_DIR` | Sets the cache directory used for `curator.LLM` generations. | `~/.cache/curator` |
| `CURATOR_DISABLE_RICH_DISPLAY` | When `True`, disables [Rich CLI](https://github.com/Textualize/rich) output (and falls back to [tqdm](https://tqdm.github.io/) logging) for local data generation monitoring. This is useful when debugging with inline breakpoints or interactive debuggers like `pdb`, where Rich's dynamic output can interfere with terminal input. | `False` |
| `TELEMETRY_ENABLED` | Enables anonymized usage telemetry when `True`. Set to `False` to opt out. | `True` |

## Contributing
Thank you to all the contributors for making this project possible!
Please follow [these instructions](docs/CONTRIBUTING.md) on how to contribute.

## Citation
If you find Curator useful, please consider citing us!

```
@software{curator2025,
  author = {Marten, Ryan* and Vu, Trung* and Ji, Charlie Cheng-Jie and Sharma, Kartik and Pimpalgaonkar, Shreyas and Dimakis, Alex and Sathiamoorthy, Maheswaran},
  month = jan,
  title = {{Curator: A Tool for Synthetic Data Creation}},
  year = {2025},
  howpublished = {\url{https://github.com/bespokelabsai/curator}}
}
```

