Metadata-Version: 2.4
Name: ouro-cache-fix
Version: 0.1.0
Summary: Custom cache implementation to fix KV cache bug in ByteDance/Ouro-1.4B
Author-email: Edwin Villacis <edwinvillacisa@yahoo.com>
License-Expression: MIT
Project-URL: Homepage, https://github.com/Antizana/ouro-cache-fix
Project-URL: Documentation, https://github.com/Antizana/ouro-cache-fix#readme
Project-URL: Repository, https://github.com/Antizana/ouro-cache-fix.git
Project-URL: Issues, https://github.com/Antizana/ouro-cache-fix/issues
Keywords: ouro,universal-transformer,cache,huggingface,transformers,llm,bug-fix,kv-cache
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Operating System :: OS Independent
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=2.0.0
Requires-Dist: transformers>=4.36.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
Requires-Dist: black>=23.0.0; extra == "dev"
Requires-Dist: flake8>=6.0.0; extra == "dev"
Requires-Dist: mypy>=1.0.0; extra == "dev"
Provides-Extra: examples
Requires-Dist: peft>=0.7.0; extra == "examples"
Requires-Dist: datasets>=2.14.0; extra == "examples"
Requires-Dist: tqdm>=4.65.0; extra == "examples"
Dynamic: license-file

# 🔧 Ouro Cache Fix

**Custom cache implementation to fix KV cache bug in ByteDance/Ouro-1.4B**

[![PyPI version](https://badge.fury.io/py/ouro-cache-fix.svg)](https://badge.fury.io/py/ouro-cache-fix)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Tests](https://github.com/Antizana/ouro-cache-fix/actions/workflows/tests.yml/badge.svg)](https://github.com/Antizana/ouro-cache-fix/actions/workflows/tests.yml)
[![Code Quality](https://github.com/Antizana/ouro-cache-fix/actions/workflows/code-quality.yml/badge.svg)](https://github.com/Antizana/ouro-cache-fix/actions/workflows/code-quality.yml)
[![codecov](https://codecov.io/gh/Antizana/ouro-cache-fix/branch/main/graph/badge.svg)](https://codecov.io/gh/Antizana/ouro-cache-fix)
[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)

## 🐛 Problem

ByteDance/Ouro-1.4B fails with `IndexError: list index out of range` when using `use_cache=True` during inference or training.

**Error:**

```python
IndexError: list index out of range
  at cache_position = current_ut * num_hidden_layers + layer_idx
```

## 🎯 Root Cause

Ouro uses **Universal Transformer** architecture with 4 iterative steps (`total_ut_steps=4`), requiring **96 cache indices** (4 steps × 24 layers).

HuggingFace's `DynamicCache` only creates **24 slots**, causing the error on the second Universal Transformer iteration.

## ✅ Solution

This package provides `UniversalTransformerCache` that correctly handles Ouro's multi-step architecture while preserving all 4 Universal Transformer loops.

### Performance

| Metric | Value |
|--------|-------|
| **Speed** | 1.9x faster than `use_cache=False` |
| **Architecture** | Preserves all 4 UT loops |
| **Compatibility** | Works with base and fine-tuned models |
| **Validation** | Tested on 1000+ samples |

## 📦 Installation

```bash
pip install ouro-cache-fix
```

Or from source:

```bash
git clone https://github.com/Antizana/ouro-cache-fix.git
cd ouro-cache-fix
pip install -e .
```

### Requirements

- Python >= 3.8
- PyTorch >= 2.0.0
- Transformers >= 4.36.0

### Tested Versions

This package has been tested and verified with:

| Package | Tested Versions |
|---------|----------------|
| **torch** | 2.0.0 - 2.5.0 |
| **transformers** | 4.36.0 - 4.47.0 |
| **Python** | 3.8 - 3.12 |

Newer versions should work but haven't been explicitly tested.

## 🚀 Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from ouro_cache_fix import UniversalTransformerCache
import torch

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "ByteDance/Ouro-1.4B",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("ByteDance/Ouro-1.4B")

# Create custom cache
cache = UniversalTransformerCache()

# Generate with cache enabled
inputs = tokenizer("What are the symptoms of diabetes?", return_tensors="pt")
outputs = model.generate(
    **inputs,
    past_key_values=cache,  # Use custom cache
    use_cache=True,         # Enable caching
    max_new_tokens=100
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## 📊 Benchmarks

### Speed Comparison

| Configuration | Speed | Quality |
|---------------|-------|---------|
| `use_cache=False` (original workaround) | 7.94 tok/s | ✅ Good |
| `use_cache=True` with disabled UT (1 loop) | 38.59 tok/s | ❌ **Gibberish** |
| **`UniversalTransformerCache` (4 loops)** | **14.82 tok/s** | **✅ Excellent** |

### Quality Metrics (1000 samples)

| Metric | Base Model | Fine-tuned + Custom Cache | Improvement |
|--------|------------|---------------------------|-------------|
| BLEU | 0.00 | 0.0217 | **+∞** |
| ROUGE-L | 0.06 | 0.15 | **+148%** |
| Perplexity | 17.95 | 4.95 | **-72%** |
| Token F1 | 0.09 | 0.19 | **+116%** |

## 🔬 Technical Details

### Architecture

Ouro's Universal Transformer processes each layer 4 times:

```python
for current_ut in range(4):  # UT steps
    for layer_idx in range(24):  # Layers
        cache_index = current_ut * 24 + layer_idx
        # Indices: 0-23, 24-47, 48-71, 72-95
```

`DynamicCache` only has 24 slots → **IndexError** at index 24.

### Solution

`UniversalTransformerCache` dynamically extends to 96 indices:

```python
class UniversalTransformerCache(Cache):
    def update(self, key_states, value_states, layer_idx, ...):
        # Extend cache as needed
        while len(self.key_cache) <= layer_idx:
            self.key_cache.append(None)
            self.value_cache.append(None)

        # Update cache at this index
        if self.key_cache[layer_idx] is None:
            self.key_cache[layer_idx] = key_states
        else:
            # Concatenate for sequential generation
            self.key_cache[layer_idx] = torch.cat([...])
```

## 📚 Examples

### Basic Inference

```python
from ouro_cache_fix import UniversalTransformerCache

cache = UniversalTransformerCache()
outputs = model.generate(..., past_key_values=cache, use_cache=True)
```

### Fine-tuning with LoRA

```python
from peft import LoraConfig, get_peft_model
from ouro_cache_fix import UniversalTransformerCache

# Apply LoRA
lora_config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora_config)

# Generate with custom cache
cache = UniversalTransformerCache()
outputs = model.generate(..., past_key_values=cache, use_cache=True)
```

### Batch Generation

```python
from ouro_cache_fix import UniversalTransformerCache

prompts = ["Question 1", "Question 2", "Question 3"]
inputs = tokenizer(prompts, return_tensors="pt", padding=True)

# Create cache for batch
cache = UniversalTransformerCache()

outputs = model.generate(
    **inputs,
    past_key_values=cache,
    use_cache=True,
    max_new_tokens=50
)
```

## 🧪 Testing

Run tests:

```bash
pytest tests/
```

Run benchmarks:

```bash
python examples/benchmark.py
```

## 🤝 Contributing

Contributions welcome! Please:

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit changes (`git commit -m 'Add amazing feature'`)
4. Push to branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

## 📖 Documentation

All documentation is contained in this README:

- **Installation**: See [Installation](#-installation) section above
- **Usage**: See [Quick Start](#-quick-start) section above
- **Technical Details**: See [Root Cause](#-root-cause) and [Solution](#-solution) sections
- **Performance**: See [Performance](#performance) table
- **API Reference**: See the code documentation in `ouro_cache_fix/universal_transformer_cache.py`
- **Changelog**: See [CHANGELOG.md](CHANGELOG.md) for version history

## 🐛 Known Issues

- ⚠️ Perplexity calculation may fail with PEFT models (non-critical, can be skipped)
- ⚠️ Memory usage increases by ~400-700MB when using cache (expected)

## 🗺️ Roadmap

- [ ] Publish to PyPI
- [ ] Add support for other Universal Transformer models
- [ ] Optimize memory usage
- [ ] Add streaming generation support
- [ ] Submit PR to HuggingFace Transformers

## 📄 License

MIT License - see [LICENSE](LICENSE) file

## 🙏 Acknowledgments

- ByteDance for Ouro-1.4B model
- HuggingFace for Transformers library
- Medical reasoning dataset contributors

## 📞 Citation

If you use this fix in your research, please cite:

```bibtex
@software{ouro_cache_fix,
  author = {Edwin Villacis},
  title = {Ouro Cache Fix: Custom Cache for Universal Transformers},
  year = {2025},
  url = {https://github.com/Antizana/ouro-cache-fix},
  version = {0.1.0}
}
```

## 🔗 Links

- [GitHub Repository](https://github.com/Antizana/ouro-cache-fix)
- [PyPI Package](https://pypi.org/project/ouro-cache-fix/)
- [HuggingFace Discussion](https://huggingface.co/ByteDance/Ouro-1.4B/discussions)
- [Technical Blog Post](#)

---

**Status:** ✅ Production Ready | **Version:** 0.1.0 | **Last Updated:** 2025-11-08
