Metadata-Version: 2.2
Name: llmlog-engine
Version: 1.0.0
Summary: High-performance columnar scan engine for LLM logs stored as JSONL
Author: llmlog-engine contributors
License: MIT
Project-URL: Repository, https://github.com/yuuichieguchi/llmlog_engine
Requires-Python: >=3.8
Requires-Dist: pandas>=1.0.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pandas>=1.0.0; extra == "dev"
Description-Content-Type: text/markdown

# LLMLog Engine

A high-performance columnar scan engine for LLM logs stored as JSONL. Built in C++ with SIMD-friendly data structures, exposed via Python bindings.

## Overview

LLMLog Engine is an embedded columnar database designed for analyzing LLM application logs. It provides:

- **Fast JSONL ingestion** into columnar format
- **Efficient filtering** on numeric and string columns
- **Group-by aggregations** (COUNT, SUM, AVG, MIN, MAX)
- **Dictionary encoding** for low-cardinality string columns
- **SIMD-friendly memory layout** for future performance optimization

The core is implemented in **C++17** with columnar storage; the user-facing API is **Python** and returns pandas DataFrames.

## Installation

### From Source (Development)

```bash
git clone https://github.com/yuuichieguchi/llmlog_engine
cd llmlog_engine
pip install -e .
```

Requires:
- Python 3.8+
- C++17 compiler
- cmake 3.15+
- pybind11 (installed via pip)

## Quick Start

```python
from llmlog_engine import LogStore

# Load JSONL logs
store = LogStore.from_jsonl("logs.jsonl")

# Create a query
result = (store.query()
    .filter(model="gpt-4.1", min_latency_ms=1000)
    .aggregate(
        by=["model", "route"],
        metrics={
            "count": "count",
            "avg_latency": "avg(latency_ms)",
            "avg_tokens_out": "avg(tokens_output)"
        }
    ))

print(result)
```

## Supported Fields

The engine expects JSONL records with these fields:

| Field | Type | Notes |
|-------|------|-------|
| `ts` | string | Timestamp (ISO 8601 or custom format) |
| `session_id` | string | Session identifier |
| `model` | string | Model name (dictionary-encoded) |
| `latency_ms` | int | Response latency in milliseconds |
| `tokens_input` | int | Input token count |
| `tokens_output` | int | Output token count |
| `route` | string | API route/endpoint (dictionary-encoded) |
| `status` | string | Response status: "ok", "error", etc. (dictionary-encoded) |
| `error_type` | string | Error category (optional) |
| `tags` | array | Metadata tags (future support) |

All fields are optional with sensible defaults.
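
To make the schema concrete, the following sketch writes two records with the fields from the table to a JSONL file using only the standard library. The values, the `write_sample_logs` helper, and the file name are invented for illustration. The second record deliberately omits several fields, which the table describes as optional.

```python
import json
import os
import tempfile

# Illustrative records: field names come from the table above, values are made up.
SAMPLE_RECORDS = [
    {
        "ts": "2024-01-01T00:00:00Z",
        "session_id": "s-001",
        "model": "gpt-4.1",
        "latency_ms": 423,
        "tokens_input": 120,
        "tokens_output": 921,
        "route": "chat",
        "status": "ok",
    },
    # Fields may be omitted: this record has no session, route, or token counts.
    {
        "ts": "2024-01-01T00:00:05Z",
        "model": "gpt-4.1-mini",
        "latency_ms": 1203,
        "status": "error",
        "error_type": "timeout",
    },
]


def write_sample_logs(path):
    """Write one JSON object per line (JSONL) and return the number of lines written."""
    with open(path, "w", encoding="utf-8") as f:
        for record in SAMPLE_RECORDS:
            f.write(json.dumps(record) + "\n")
    return len(SAMPLE_RECORDS)


path = os.path.join(tempfile.mkdtemp(), "sample_logs.jsonl")
n_written = write_sample_logs(path)
```

A file written this way is the kind of input `LogStore.from_jsonl(path)` is documented to accept.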

## API Reference

### LogStore

Main table class for columnar storage.

#### `LogStore.from_jsonl(path: str) -> LogStore`
Load a JSONL file into the store.

```python
store = LogStore.from_jsonl("logs.jsonl")
```

#### `row_count() -> int`
Get number of loaded rows.

```python
n = store.row_count()
```

#### `basic_stats() -> dict`
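Get a dict of summary statistics, including `row_count`, `latency_ms_min`, `latency_ms_max`, `latency_ms_avg`, and column cardinalities such as `model_cardinality`.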

```python
stats = store.basic_stats()
print(stats["latency_ms_min"])
```

#### `query() -> Query`
Create a new query builder.

```python
q = store.query()
```

### Query

Query builder for filtering and aggregation.

#### `filter(**kwargs) -> Query`
Add filter predicates. All filters are combined with AND logic.

**Supported filter parameters:**
- `model` (str): Exact match on model name
- `route` (str): Exact match on route
- `status` (str): Exact match on status
- `min_latency_ms` (int): Minimum latency
- `max_latency_ms` (int): Maximum latency
- `min_tokens_input` (int): Minimum input tokens
- `max_tokens_input` (int): Maximum input tokens
- `min_tokens_output` (int): Minimum output tokens
- `max_tokens_output` (int): Maximum output tokens

```python
q = store.query().filter(
    model="gpt-4.1",
    min_latency_ms=1000,
    route="chat"
)
```
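
The AND semantics of `filter` can be pictured with a pure-Python reference over plain dicts. This is a sketch of the documented behaviour, not the engine's implementation. `row_matches` is a hypothetical helper, and two details are assumptions that the parameter list above does not confirm: the `min_`/`max_` bounds are treated as inclusive, and rows missing a filtered field are treated as non-matching.

```python
def row_matches(row, **filters):
    """Return True if `row` (a dict) satisfies every filter (AND logic)."""
    for key, expected in filters.items():
        if key.startswith("min_"):
            value = row.get(key[4:])
            # Assumption: the lower bound is inclusive.
            if value is None or value < expected:
                return False
        elif key.startswith("max_"):
            value = row.get(key[4:])
            # Assumption: the upper bound is inclusive.
            if value is None or value > expected:
                return False
        elif row.get(key) != expected:  # exact match on model / route / status
            return False
    return True


rows = [
    {"model": "gpt-4.1", "route": "chat", "latency_ms": 1203},
    {"model": "gpt-4.1", "route": "rag", "latency_ms": 2000},
    {"model": "gpt-4.1-mini", "route": "chat", "latency_ms": 1500},
    {"model": "gpt-4.1", "route": "chat", "latency_ms": 423},
]
matched = [
    r for r in rows
    if row_matches(r, model="gpt-4.1", min_latency_ms=1000, route="chat")
]
```

Only the first row passes all three predicates; each of the others fails exactly one of them.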

#### `aggregate(by: list[str], metrics: dict[str, str]) -> pd.DataFrame`
Compute aggregations grouped by specified columns.

**Metric expressions:**
- `"count"` — Row count
- `"sum(column)"` — Sum of numeric column
- `"avg(column)"` — Average of numeric column
- `"min(column)"` — Minimum value
- `"max(column)"` — Maximum value

```python
result = q.aggregate(
    by=["model", "route"],
    metrics={
        "count": "count",
        "avg_latency": "avg(latency_ms)",
        "max_latency": "max(latency_ms)",
        "total_output": "sum(tokens_output)"
    }
)
# Returns pandas DataFrame
```

If `by` is omitted or empty, the query aggregates over all matched rows as a single group:

```python
result = store.query().aggregate(
    metrics={"count": "count", "avg_latency": "avg(latency_ms)"}
)
```
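
The metric expressions are short strings of the form `func(column)`, or the bare word `count`. As a sketch of how such strings could be split into a function name and a column (the engine's own parser may work differently, and `parse_metric` is a hypothetical helper):

```python
import re

# Matches "sum(col)", "avg(col)", "min(col)" and "max(col)".
_METRIC_RE = re.compile(r"^(sum|avg|min|max)\((\w+)\)$")


def parse_metric(expr):
    """Split a metric expression into (function, column); column is None for "count"."""
    expr = expr.strip()
    if expr == "count":
        return ("count", None)
    match = _METRIC_RE.match(expr)
    if match is None:
        raise ValueError(f"unsupported metric expression: {expr!r}")
    return (match.group(1), match.group(2))


metrics = {
    "count": "count",
    "avg_latency": "avg(latency_ms)",
    "total_output": "sum(tokens_output)",
}
parsed = {name: parse_metric(expr) for name, expr in metrics.items()}
```

The dictionary keys (`"avg_latency"`, `"total_output"`) are output column names chosen by the caller; only the values are parsed.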

## Example Usage

### Filter and Group by Model

```python
from llmlog_engine import LogStore

store = LogStore.from_jsonl("production_logs.jsonl")

# Analyze slow responses by model
slow_by_model = (store.query()
    .filter(min_latency_ms=500)
    .aggregate(
        by=["model"],
        metrics={
            "count": "count",
            "avg_latency": "avg(latency_ms)",
            "min_latency": "min(latency_ms)",
            "max_latency": "max(latency_ms)"
        }
    ))

print(slow_by_model)
```

### Multi-Dimension Analysis

```python
# Count errors by model and route
errors_by_model_route = (store.query()
    .filter(status="error")
    .aggregate(
        by=["model", "route"],
        metrics={"count": "count"}
    ))

print(errors_by_model_route)
```

### Summary Statistics

```python
# Overall stats
stats = store.basic_stats()
print(f"Total rows: {stats['row_count']}")
print(f"Avg latency: {stats['latency_ms_avg']:.1f}ms")
print(f"Max latency: {stats['latency_ms_max']}ms")
print(f"Unique models: {stats['model_cardinality']}")
```

## Performance

### Architecture Optimizations

1. **Columnar Storage**: Data organized by column, not row. Enables:
   - Efficient filtering on single columns
   - Better CPU cache utilization
   - Vectorization opportunities

2. **Dictionary Encoding**: Low-cardinality string columns (model, route, status) mapped to int32 IDs:
   - Faster equality comparisons
   - Smaller memory footprint
   - Consistent performance regardless of string length

3. **Contiguous Numeric Arrays**: `int32_t` columns stored as dense vectors:
   - SIMD-friendly layout
   - Efficient range filtering
   - Minimal memory overhead
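
As a rough illustration of the columnar idea in Python terms, the sketch below pivots row dicts into one dense typed array per numeric column using the standard `array` module. It only mirrors the layout described above and says nothing about the C++ core's actual data structures. `to_columns` is a hypothetical helper, and defaulting missing values to 0 is an assumption.

```python
from array import array


def to_columns(rows, numeric_fields):
    """Pivot a list of row dicts into {field: array('i')} (C signed int, typically 32-bit)."""
    columns = {name: array("i") for name in numeric_fields}
    for row in rows:
        for name in numeric_fields:
            # Assumption: missing numeric values default to 0.
            columns[name].append(row.get(name, 0))
    return columns


rows = [
    {"latency_ms": 423, "tokens_output": 921},
    {"latency_ms": 1203, "tokens_output": 214},
    {"latency_ms": 512},
]
cols = to_columns(rows, ["latency_ms", "tokens_output"])

# A range filter reads one contiguous column instead of visiting every row object.
slow = [i for i, v in enumerate(cols["latency_ms"]) if v >= 500]
```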

### Benchmark Results

On a 100,000-row log file:

```
Pure Python:     0.8234s
C++ Engine:      0.1205s
Speedup:         6.8x faster
```

Query: Filter by model + latency, group by route, compute 6 metrics.

## Architecture

```
User Code
    │
    ├─ Python API (LogStore, Query)
    │  └─ pandas DataFrame output
    │
    └─ C++ Core (_llmlog_engine module)
       ├─ DictionaryColumn (strings + int32 IDs)
       ├─ NumericColumn<T> (contiguous arrays)
       └─ LogStore (main engine)
          ├─ ingest_from_jsonl()
          ├─ apply_filter() → boolean mask
          └─ aggregate() → grouped metrics
```

### Memory Layout

**Columnar format (after ingestion):**
```
Column: model       [0, 1, 0, 2, 0, ...]  (int32 IDs)
Column: route       [0, 1, 0, 1, 0, ...]  (int32 IDs)
Column: latency_ms  [423, 1203, 512, ...]  (int32)
Column: tokens_out  [921, 214, 512, ...]   (int32)

Dictionary: model   {0: "gpt-4.1-mini", 1: "gpt-4.1", 2: "gpt-4-turbo"}
Dictionary: route   {0: "chat", 1: "rag"}
```
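
The layout above can be reproduced with a few lines of Python. This sketch assigns IDs in order of first appearance, which is consistent with the example dictionaries but remains an assumption about the engine. `dictionary_encode` is a hypothetical helper.

```python
def dictionary_encode(values):
    """Return (ids, dictionary), where dictionary[i] is the string for ID i."""
    ids = []
    dictionary = []  # ID -> string
    index = {}       # string -> ID
    for value in values:
        if value not in index:
            # First time this string is seen: give it the next free ID.
            index[value] = len(dictionary)
            dictionary.append(value)
        ids.append(index[value])
    return ids, dictionary


models = ["gpt-4.1-mini", "gpt-4.1", "gpt-4.1-mini", "gpt-4-turbo", "gpt-4.1-mini"]
ids, dictionary = dictionary_encode(models)
```

Each distinct string is stored once in `dictionary`, and the column itself holds only small integers.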

## Limitations (v0)

- In-memory only (no persistence or external storage yet)
- No SQL-like expression parser (use Python kwargs for filters)
- No support for complex data types (arrays, nested objects)
- Single-threaded query execution
- No distributed processing

## Future Enhancements

1. On-disk columnar format (memory-mapped access)
2. Query expression parser for string-based filters
3. Parallel scan/aggregation with thread pool
4. SIMD micro-optimizations for filter loops
5. Compression for numeric columns
6. Support for timestamp parsing and range filters
7. Approximate aggregations for large datasets

## Development

### Build from Source

```bash
mkdir build && cd build
cmake ..
make
```

### Run Tests

```bash
pytest tests/test_basic.py -v
```

### Run Benchmarks

```bash
python tests/test_bench.py
```

## Implementation Notes

### Dictionary Encoding

String columns like `model`, `route`, and `status` are dictionary-encoded:
1. Each distinct string is assigned an ID the first time it appears; every later occurrence reuses that ID
2. Comparisons are done on int32 IDs rather than on strings, which is much faster
3. Each distinct string is stored only once

This is transparent to the user:
```python
# You write:
store.query().filter(model="gpt-4.1")

# The engine internally:
# 1. Looks up "gpt-4.1" in dictionary → ID 1
# 2. Compares integer column against 1
# 3. Returns matching rows
```
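
The three steps in the comment can be sketched in Python as follows. The data reuses the Memory Layout example. `filter_equals` is a hypothetical helper, and returning no rows for a value that is absent from the dictionary is an assumption.

```python
def filter_equals(ids, dictionary, value):
    """Return the indices of rows whose dictionary-encoded column equals `value`."""
    try:
        target = dictionary.index(value)  # 1. look up the string's ID
    except ValueError:
        return []  # the value was never ingested, so no row can match
    # 2. compare the integer column against the ID; 3. collect matching rows
    return [i for i, code in enumerate(ids) if code == target]


dictionary = ["gpt-4.1-mini", "gpt-4.1", "gpt-4-turbo"]
model_ids = [0, 1, 0, 2, 0]
hits = filter_equals(model_ids, dictionary, "gpt-4.1")
```

The string comparison happens once, during the lookup; the per-row work is a plain integer comparison.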

### Filter Predicates

Filters are applied using a boolean mask:
```cpp
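std::vector<bool> mask(row_count_, true);  // Initially all true
for (const auto& predicate : predicates) {
    for (size_t i = 0; i < row_count_; ++i) {
        // matches_predicate stands in for the per-column comparison.
        // Plain assignment is used because std::vector<bool> elements are
        // proxy objects with no standard operator&=.
        mask[i] = mask[i] && matches_predicate(predicate, i);
    }
}
// Now mask[i] is true only if row i matches ALL predicates (AND logic)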
```
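
The same masking pattern, written as a runnable Python sketch over plain lists. The `(column_name, test)` predicate representation is invented for this example and is not the engine's internal format.

```python
def apply_filter(columns, predicates):
    """AND together predicates of the form (column_name, test_function)."""
    row_count = len(next(iter(columns.values())))
    mask = [True] * row_count  # initially every row matches
    for column_name, test in predicates:
        column = columns[column_name]
        for i in range(row_count):
            # A row stays selected only if it also passes this predicate.
            mask[i] = mask[i] and test(column[i])
    return mask


columns = {
    "model": [0, 1, 0, 2, 0],  # dictionary IDs
    "latency_ms": [423, 1203, 512, 2500, 1800],
}
mask = apply_filter(columns, [
    ("model", lambda code: code == 0),
    ("latency_ms", lambda ms: ms >= 500),
])
```

Because each predicate reads a single column, the inner loop walks one contiguous sequence at a time.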

### Aggregation

Once the mask is computed, aggregations scan only matching rows:
```cpp
for (const auto& [group_key, row_indices] : groups) {
    for (size_t idx : row_indices) {
        // Sum, average, min, max operations
    }
}
```
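
A Python sketch of this grouping step, assuming a mask like the one produced in the filtering stage. `aggregate_masked` is a hypothetical helper and computes only count and average for brevity.

```python
from collections import defaultdict


def aggregate_masked(group_column, value_column, mask):
    """Group the matching rows by key, then compute count and average per group."""
    groups = defaultdict(list)  # group key -> indices of matching rows
    for idx, keep in enumerate(mask):
        if keep:
            groups[group_column[idx]].append(idx)

    result = {}
    for key, row_indices in groups.items():
        total = sum(value_column[idx] for idx in row_indices)
        result[key] = {"count": len(row_indices), "avg": total / len(row_indices)}
    return result


route = ["chat", "rag", "chat", "rag", "chat"]
latency_ms = [423, 1203, 512, 2500, 1800]
mask = [False, True, True, True, True]
stats = aggregate_masked(route, latency_ms, mask)
```

Rows excluded by the mask never enter a group, so they contribute to neither the count nor the average.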

## License

MIT

## Contributing

Pull requests welcome! Please include:
- Tests for new features
- Updated documentation
- Benchmark results for performance changes

## Contact

Questions or issues? Open a GitHub issue.
