Metadata-Version: 2.1
Name: alpacagen
Version: 0.1.1
Summary: A tool for generating instruction-following datasets in Alpaca format
Home-page: https://github.com/qqandy/alpacagen
Author: Hsiang-An Chuang
Author-email: Hsiang-An Chuang <qqandy0120@gmail.com>
License: MIT License
        
        Copyright (c) 2024 AlpacaGen Contributors
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
Project-URL: Homepage, https://github.com/qqandy0120/alpacagen
Project-URL: Repository, https://github.com/qqandy0120/alpacagen.git
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: markitdown
Requires-Dist: langchain_text_splitters
Requires-Dist: openai>=1.0.0
Requires-Dist: tqdm
Requires-Dist: nest_asyncio

# AlpacaGen

AlpacaGen generates instruction-following datasets in the Alpaca format using large language models (LLMs). It accepts either a single file or an entire directory, splits the content into chunks automatically, and generates multiple instruction-input-output entries for each chunk.

## Features

- Support for multiple LLM providers (Azure OpenAI, OpenAI)
- Multilingual support (Traditional Chinese, English)
- Automatic text chunking with customizable size and overlap
- Batch processing with progress bars
- Configurable number of QA pairs per chunk
- JSONL output format

## Installation

```bash
pip install alpacagen
```

## Usage

### Basic Usage

```python
from alpacagen import AlpacaGen

# Initialize AlpacaGen
ag = AlpacaGen(
    llm_client='azure',  # or 'openai'
    api_key='your-api-key',
    base_url='your-api-base-url',  # Required for Azure OpenAI
    llm_model='your-model-selection'  # Optional: defaults to gpt-4o
)

# Generate dataset from a single file
chunks, dataset = ag.generate(
    input_path='your_file.txt',
    output_path='output.jsonl',
    language='zhtw',  # or 'en'
    entries_per_chunk=3
)
```

### Advanced Configuration

```python
# Process an entire directory with custom settings
chunks, dataset = ag.generate(
    input_path='your_directory/',
    output_path='output.jsonl',
    language='en',
    gen_prompt_path='custom_prompt.txt',  # Optional: Use custom prompt template
    chunk_size=4096,  # Customize chunk size
    entries_per_chunk=5  # Generate more QA pairs per chunk
)
```

## Understanding Chunks and Dataset

### Chunks

Chunks are sections of your input text that have been automatically split for processing. Each chunk contains:
- `content`: The actual text content
- `source`: The source file path
- `idx`: A formatted string showing the chunk's position (e.g., "01/17" means chunk 1 of 17)

Example chunks:
```python
# Example chunks from a technical document
[
    Chunk(
        content="Introduction to Machine Learning\nMachine learning is a subset of artificial intelligence...",
        source="ml_guide.txt",
        idx="01/03"
    ),
    Chunk(
        content="Supervised Learning Methods\nIn supervised learning, algorithms learn from labeled data...",
        source="ml_guide.txt",
        idx="02/03"
    ),
    Chunk(
        content="Practical Applications\nMachine learning is used in various fields including...",
        source="ml_guide.txt",
        idx="03/03"
    )
]
```
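
To make the `idx` format and the size/overlap settings concrete, here is a minimal sketch of character-based chunking in plain Python. It is not AlpacaGen's actual splitter (the package depends on `langchain_text_splitters`, whose behavior may differ); the `Chunk` dataclass, the `split_with_overlap` function, and the `overlap` parameter name are all hypothetical stand-ins used only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    # Hypothetical stand-in for the library's Chunk type;
    # the field names follow the list above.
    content: str
    source: str
    idx: str


def split_with_overlap(text: str, source: str,
                       chunk_size: int = 4096, overlap: int = 0) -> List[Chunk]:
    """Split text into fixed-size character windows sharing `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    # Stop once a new window would contain only already-covered overlap.
    starts = range(0, max(len(text) - overlap, 1), step)
    pieces = [text[s:s + chunk_size] for s in starts]
    total = len(pieces)
    width = max(2, len(str(total)))  # zero-pad to at least two digits, e.g. "01/17"
    return [
        Chunk(content=piece, source=source,
              idx=f"{i:0{width}d}/{total:0{width}d}")
        for i, piece in enumerate(pieces, start=1)
    ]


# Tiny demonstration with an artificially small chunk size.
for chunk in split_with_overlap("abcdefghij", "demo.txt", chunk_size=4, overlap=1):
    print(chunk.idx, repr(chunk.content))
```

A larger overlap keeps more shared context between neighboring chunks, at the cost of producing more chunks (and therefore more LLM calls) for the same input.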

### Dataset (QA Pairs)

The dataset consists of QA pairs generated from each chunk. Each QA pair contains:
- `instruction`: The task or question
- `input`: Additional context or input data
- `output`: The expected response or answer

Example QA pairs:
```python
[
    QAPair(
        instruction="Explain the basic concept of machine learning in simple terms",
        input="Consider the following introduction to machine learning",
        output="Machine learning is a subset of artificial intelligence that enables computers to learn and improve from experience without being explicitly programmed. It's similar to how humans learn from experience, but using data and algorithms instead."
    ),
    QAPair(
        instruction="What is the main characteristic of supervised learning?",
        input="Read about supervised learning methods",
        output="The main characteristic of supervised learning is that it uses labeled data for training. This means the algorithm learns from examples where the correct answers are already known, allowing it to make predictions on new, unseen data."
    ),
    QAPair(
        instruction="List three practical applications of machine learning",
        input="Based on the section about practical applications",
        output="Three practical applications of machine learning include: 1) Email spam filtering, 2) Medical diagnosis and image analysis, and 3) Recommendation systems in e-commerce platforms. These applications demonstrate how machine learning can solve real-world problems."
    )
]
```
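
Since the output is JSONL, each QA pair presumably ends up as one JSON object per line with the three keys listed above. The sketch below shows that mapping using only the standard library. The `QAPair` dataclass and the `to_jsonl` / `from_jsonl` helpers are hypothetical illustrations, not part of AlpacaGen's API, and the exact layout the library writes is an assumption.

```python
import json
from dataclasses import asdict, dataclass
from typing import Iterable, List


@dataclass
class QAPair:
    # Hypothetical stand-in for the library's QAPair type.
    instruction: str
    input: str
    output: str


def to_jsonl(pairs: Iterable[QAPair]) -> str:
    """Serialize each pair as one JSON object per line."""
    # ensure_ascii=False keeps Traditional Chinese text human-readable.
    return "".join(json.dumps(asdict(p), ensure_ascii=False) + "\n" for p in pairs)


def from_jsonl(text: str) -> List[QAPair]:
    """Parse JSONL text back into QAPair objects, skipping blank lines."""
    # json.dumps escapes newlines inside strings, so splitting on "\n" is safe.
    return [QAPair(**json.loads(line)) for line in text.split("\n") if line.strip()]


pairs = [
    QAPair(
        instruction="What is the main characteristic of supervised learning?",
        input="Read about supervised learning methods",
        output="It uses labeled data for training.",
    ),
    QAPair(instruction="什麼是機器學習?", input="", output="機器學習是人工智慧的一個分支。"),
]
text = to_jsonl(pairs)
print(text, end="")
```

One record per line means a large dataset can be read or appended to incrementally, without loading the whole file into memory.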

## Configuration Options

- `llm_client`: Choose between 'azure' or 'openai'
- `llm_model`: Specify custom model (defaults available for each client)
- `chunk_size`: Control the size of text chunks (default: 4096)
- `entries_per_chunk`: Number of QA pairs to generate per chunk (default: 3)
- `language`: Choose between 'zhtw' (Traditional Chinese) or 'en' (English)

## Best Practices

1. Start with a small test file before processing large directories
2. Monitor the generated output quality
3. Adjust chunk size based on your content
4. Use appropriate language setting for your source material
5. Consider using custom prompts for specific use cases
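
As one way to act on the "monitor output quality" advice, the following sketch runs a few basic checks over generated records before they are used for training. It assumes each record is a dict with `instruction`, `input`, and `output` keys, as described in the dataset section; the `find_problems` helper and its specific checks are hypothetical and should be adapted to your data.

```python
from typing import Dict, List

REQUIRED_KEYS = ("instruction", "input", "output")


def find_problems(records: List[Dict[str, str]]) -> List[str]:
    """Return human-readable descriptions of suspicious records."""
    problems: List[str] = []
    seen = set()
    for n, rec in enumerate(records, start=1):
        missing = [k for k in REQUIRED_KEYS if k not in rec]
        if missing:
            problems.append(f"record {n}: missing {', '.join(missing)}")
            continue
        # An empty `input` is common in Alpaca-style data, so only
        # `instruction` and `output` are required to be non-empty here.
        if not rec["instruction"].strip() or not rec["output"].strip():
            problems.append(f"record {n}: empty instruction or output")
        key = rec["instruction"].strip().lower()
        if key in seen:
            problems.append(f"record {n}: duplicate instruction")
        seen.add(key)
    return problems


records = [
    {"instruction": "Define overfitting", "input": "", "output": "A model that memorizes noise."},
    {"instruction": "define overfitting", "input": "", "output": "Same question again."},
    {"instruction": "Explain bias", "input": "", "output": "  "},
    {"instruction": "No output key", "input": ""},
]
for problem in find_problems(records):
    print(problem)
```

Checks like these only catch structural issues; judging whether the answers are actually correct still requires reading a sample of the output.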

## Limitations

- Requires valid API credentials for OpenAI or Azure OpenAI
- Processing speed depends on API rate limits
- Large directories may take significant time to process
- Memory usage scales with chunk size and batch size
