Metadata-Version: 2.4
Name: ava-protocol
Version: 0.1.4
Summary: AI Visibility Anonymizer - Privacy-preserving middleware for LLMs
Project-URL: Homepage, https://github.com/ava-protocol/ava-protocol
Project-URL: Documentation, https://ava-protocol.readthedocs.io
Project-URL: Repository, https://github.com/ava-protocol/ava-protocol
Project-URL: Bug Tracker, https://github.com/ava-protocol/ava-protocol/issues
Author-email: Gerald Enrique Nelson Mc Kenzie <lordxmen2k@gmail.com>
License: MIT
License-File: LICENSE
Keywords: ai,anonymization,data-protection,gdpr,hipaa,llm,pii,presidio,privacy,security
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Security
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing
Requires-Python: >=3.9
Requires-Dist: httpx>=0.25.0
Requires-Dist: pydantic>=2.0.0
Provides-Extra: all
Requires-Dist: boto3>=1.28.0; extra == 'all'
Requires-Dist: presidio-analyzer>=2.2.0; extra == 'all'
Requires-Dist: presidio-anonymizer>=2.2.0; extra == 'all'
Requires-Dist: spacy>=3.7.0; extra == 'all'
Provides-Extra: aws
Requires-Dist: boto3>=1.28.0; extra == 'aws'
Provides-Extra: dev
Requires-Dist: black>=23.0.0; extra == 'dev'
Requires-Dist: build>=1.0.0; extra == 'dev'
Requires-Dist: mypy>=1.5.0; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.21.0; extra == 'dev'
Requires-Dist: pytest>=7.4.0; extra == 'dev'
Requires-Dist: ruff>=0.1.0; extra == 'dev'
Requires-Dist: twine>=4.0.0; extra == 'dev'
Provides-Extra: local
Requires-Dist: presidio-analyzer>=2.2.0; extra == 'local'
Requires-Dist: presidio-anonymizer>=2.2.0; extra == 'local'
Requires-Dist: spacy>=3.7.0; extra == 'local'
Description-Content-Type: text/markdown

# AVA Protocol

**AI Visibility Anonymizer** — Privacy-preserving middleware for LLM interactions with reversible tokenization.

[![PyPI](https://img.shields.io/pypi/v/ava-protocol)](https://pypi.org/project/ava-protocol/)
[![Python](https://img.shields.io/badge/python-3.9%2B-blue)](https://www.python.org/)
[![License](https://img.shields.io/badge/license-MIT-green)](LICENSE)

**Author:** Gerald Enrique Nelson Mc Kenzie  
**DOI:** [10.5281/zenodo.19111004](https://doi.org/10.5281/zenodo.19111004)  
**Version:** 0.1.0 | March 2026

---

## What is AVA?

AVA Protocol sanitizes sensitive data (PII/PHI) before it reaches AI systems, maintains cryptographically signed audit trails, and restores the original values in AI outputs.

**Key Innovation:** Reversible tokenization preserves both privacy AND data utility — the AI works with opaque tokens, and real values are restored only in the final output.

```python
import ava
import openai  # legacy (pre-1.0) SDK interface, as used throughout this README

client = ava.Client(engine="presidio", policy="healthcare_strict")

text = "Patient John Smith, SSN 123-45-6789"

with client.session(reversibility=True) as session:
    safe = session.sanitize(text)
    # Sanitized: "Patient AVA_PERS_xK9mP2nQ, SSN AVA_SSN_fG5hI6jK"

    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": safe}]
    )

    # Restore original values in the reply text, not the raw response object
    final = session.restore(response['choices'][0]['message']['content'])
```
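
To make the idea of reversible tokenization concrete, here is a minimal standard-library sketch. It is not AVA's implementation: the two regex patterns, the `sanitize`/`restore` helper names, and the plain `dict` used as a vault are all assumptions chosen for illustration. Only the `AVA_<TYPE>_<id>` token shape is taken from the examples in this README.

```python
import re
import secrets

# Hypothetical patterns for illustration; AVA's real detectors are not shown here.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAI": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
}


def sanitize(text: str, vault: dict[str, str]) -> str:
    """Replace each match with an opaque token and record token -> value."""

    def make_replacer(label: str):
        def replace(match: re.Match) -> str:
            token = f"AVA_{label}_{secrets.token_hex(4)}"
            vault[token] = match.group(0)
            return token

        return replace

    for label, pattern in PATTERNS.items():
        text = pattern.sub(make_replacer(label), text)
    return text


def restore(text: str, vault: dict[str, str]) -> str:
    """Swap every known token back to its original value."""
    for token, value in vault.items():
        text = text.replace(token, value)
    return text


vault: dict[str, str] = {}
original = "Patient SSN 123-45-6789, email maria.g@healthmail.com"
safe = sanitize(original, vault)
round_trip = restore(safe, vault)
```

The text sent to the model (`safe`) contains only tokens, while the mapping needed to undo the substitution stays in `vault` on the caller's side.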

---

## Table of Contents

1. [Installation](#installation)
2. [Architecture](#architecture)
3. [Operating Modes](#operating-modes)
   - [Mode 1: Embedded (Local Presidio)](#mode-1-embedded-local-presidio)
   - [Mode 2: Gateway (Remote Client)](#mode-2-gateway-remote-client)
   - [Mode 3: Mock Engine (Testing)](#mode-3-mock-engine-testing)
   - [Mode 4: AWS Macie Adapter](#mode-4-aws-macie-adapter)
   - [Mode 5: Azure PII Adapter](#mode-5-azure-pii-adapter)
   - [Mode 6: Google DLP Adapter](#mode-6-google-dlp-adapter)
4. [Vault Types](#vault-types)
5. [Policies](#policies)
6. [Async API](#async-api)
7. [Production Workflows](#production-workflows)

---

## Installation

Choose your installation based on the modes you need:

```bash
# Gateway Client Only (Lightweight, ~50KB)
pip install ava-protocol

# Embedded with Local Presidio (~500MB, includes ML models)
pip install ava-protocol[local]

# AWS Macie integration
pip install ava-protocol[aws]

# Azure PII integration
pip install ava-protocol[azure]

# Google Cloud DLP integration
pip install ava-protocol[gcp]

# Everything in the `all` extra (Presidio, spaCy, and boto3 in this release)
pip install ava-protocol[all]
```

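> **Note:** Gateway mode requires no extras. Embedded mode requires `[local]` for Presidio ML models. The package metadata for this release declares only the `local`, `aws`, `dev`, and `all` extras; `azure` and `gcp` are not among them, and no extra installs a Redis client, so check which extras your installed version provides.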

---

## Architecture

```
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│  Your App   │────▶│ AVA Client  │────▶│   Engine    │
│             │     │  (Embedded  │     │  (Presidio, │
│             │◀────│   or        │◀────│   AWS, etc) │
│             │     │   Gateway)  │     │             │
└─────────────┘     └──────┬──────┘     └─────────────┘
                           │
                    ┌──────┴──────┐
                    │ Token Vault │
                    │ (Memory /   │
                    │  SQLite /   │
                    │  Redis)     │
                    └─────────────┘
```
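
The diagram can be read as three swappable parts: a client that orchestrates, an engine that finds sensitive spans, and a vault that remembers what each token stands for. The sketch below expresses that split with `typing.Protocol`. All class and method names (`Engine.detect`, `Vault.put`, `FixedNameEngine`, and so on) are hypothetical and are not AVA's actual interfaces.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Span:
    start: int
    end: int
    label: str


class Engine(Protocol):
    def detect(self, text: str) -> list[Span]: ...


class Vault(Protocol):
    def put(self, token: str, value: str) -> None: ...
    def get(self, token: str) -> str: ...


class FixedNameEngine:
    """Toy engine: flags every occurrence of a known name."""

    def __init__(self, names: list[str]) -> None:
        self.names = names

    def detect(self, text: str) -> list[Span]:
        spans = []
        for name in self.names:
            start = text.find(name)
            while start != -1:
                spans.append(Span(start, start + len(name), "PERS"))
                start = text.find(name, start + len(name))
        return spans


class DictVault:
    """Toy vault backed by an in-process dictionary."""

    def __init__(self) -> None:
        self._data: dict[str, str] = {}

    def put(self, token: str, value: str) -> None:
        self._data[token] = value

    def get(self, token: str) -> str:
        return self._data[token]


def sanitize(text: str, engine: Engine, vault: Vault) -> str:
    # Replace from right to left so earlier offsets stay valid.
    # Overlapping spans are not handled in this sketch.
    spans = sorted(engine.detect(text), key=lambda s: s.start, reverse=True)
    for i, span in enumerate(spans):
        token = f"AVA_{span.label}_{i:04d}"
        vault.put(token, text[span.start:span.end])
        text = text[:span.start] + token + text[span.end:]
    return text


vault = DictVault()
out = sanitize("Patient John Smith arrived", FixedNameEngine(["John Smith"]), vault)
```

Because `sanitize` depends only on the two protocols, swapping the toy engine for a remote detector or the dictionary for a database would not change the calling code.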

---

## Operating Modes

### Mode 1: Embedded (Local Presidio)

Self-contained deployment. All PII detection happens locally with no external calls. Best for air-gapped or high-security environments.

**Install:**
```bash
pip install ava-protocol[local]
```

**Basic example:**
```python
import ava
import openai  # legacy (pre-1.0) SDK interface

client = ava.Client(
    engine="presidio",
    policy="healthcare_strict",
    vault_type="memory"
)

with client.session(reversibility=True, ttl=3600) as session:

    medical_record = """
    Patient: Maria Gonzalez
    DOB: 1985-03-15
    SSN: 123-45-6789
    Email: maria.g@healthmail.com
    Diagnosis: Hypertension
    """

    # Sanitize before AI processing — AI never sees real data
    sanitized = session.sanitize(medical_record)
    # Patient: AVA_PERS_xK9mP2nQ
    # DOB: AVA_DATE_aB3cD4eF
    # SSN: AVA_SSN_fG5hI6jK

    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": sanitized}]
    )

    # Restore original values in the final output
    final = session.restore(response['choices'][0]['message']['content'])
```

**With SQLite vault (persistent storage):**
```python
import os

import ava

client = ava.Client(
    engine="presidio",
    policy="financial_paranoid",
    vault_type="sqlite",
    vault_config={
        "db_path": "/secure/ava_vault.db",
        "encryption_key": os.environ["VAULT_KEY"]  # AES-256
    }
)
```

---

### Mode 2: Gateway (Remote Client)

Thin client that connects to a remote AVA Gateway server. No local ML dependencies — all detection is handled server-side.

**Install:**
```bash
pip install ava-protocol  # No extras needed
```

**Basic example:**
```python
import os

import ava

client = ava.Client(
    gateway_url="https://ava-gateway.company.com",
    api_key=os.environ["AVA_API_KEY"],  # keep live keys out of source control
    policy="general_moderate"
)

# Identical API to embedded mode
with client.session(reversibility=True) as session:

    customer_email = """
    Hi, this is Robert Chen from Acme Corp.
    My credit card ending in 4532 was charged twice.
    Please refund to robert.chen@acme.com.
    """

    safe_text = session.sanitize(customer_email)
    response = support_ai.process(safe_text)
    readable = session.restore(response)
```

**Environment-based config:**
```bash
# .env
AVA_GATEWAY_URL=https://ava.internal.company.com
AVA_API_KEY=ava_sk_live_xxx
AVA_POLICY=healthcare_strict
AVA_DEFAULT_TTL=1800
```

```python
# Loads automatically from environment
client = ava.Client.from_env()
```
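
As an illustration of what environment-based loading typically involves, the sketch below reads the four variables listed above into a small settings object. The `GatewaySettings` class, the `load_settings` helper, the choice of which variables are required, and the fallback defaults are assumptions; the README does not document how `Client.from_env()` behaves internally.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class GatewaySettings:
    gateway_url: str
    api_key: str
    policy: str
    default_ttl: int


def load_settings(env: Optional[Mapping[str, str]] = None) -> GatewaySettings:
    """Build settings from an environment mapping (os.environ by default)."""
    env = os.environ if env is None else env
    # Assumption: URL and key are mandatory, policy and TTL are optional.
    missing = [k for k in ("AVA_GATEWAY_URL", "AVA_API_KEY") if not env.get(k)]
    if missing:
        raise RuntimeError(f"Missing required settings: {', '.join(missing)}")
    return GatewaySettings(
        gateway_url=env["AVA_GATEWAY_URL"],
        api_key=env["AVA_API_KEY"],
        policy=env.get("AVA_POLICY", "general_moderate"),  # assumed default
        default_ttl=int(env.get("AVA_DEFAULT_TTL", "3600")),  # assumed default
    )


settings = load_settings({
    "AVA_GATEWAY_URL": "https://ava.internal.company.com",
    "AVA_API_KEY": "ava_sk_live_xxx",
    "AVA_DEFAULT_TTL": "1800",
})
```

Accepting a mapping argument keeps the loader easy to test without mutating the real process environment.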

**Running a Gateway server:**
```python
# gateway_server.py — deploy centrally for your organization
from ava.gateway import GatewayServer

server = GatewayServer(
    detection_engine="presidio",
    vault_type="redis",
    vault_config={"host": "redis.company.com", "port": 6379},
    policies_path="/etc/ava/policies/"
)

server.run(
    host="0.0.0.0",
    port=8443,
    tls_cert="/etc/ava/server.crt"
)
```

---

### Mode 3: Mock Engine (Testing)

Regex-based detection with zero dependencies. Designed for unit tests and CI/CD pipelines where you don't want to install heavyweight ML models.

> **Detects via regex only:** Emails, phone numbers, SSNs, credit card numbers. No NLP.
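
A rough sketch of what regex-only detection for these four entity types could look like is shown below. The patterns are deliberately simple (US-style numbers only) and are assumptions for illustration, not the mock engine's actual expressions. The `EMAI`, `PHON`, and `CRED` labels follow the token prefixes used elsewhere in this README.

```python
import re

# Illustrative patterns only; the real mock engine may use different ones.
DETECTORS = {
    "EMAI": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CRED": re.compile(r"\b(?:\d{4}[- ]?){3}\d{4}\b"),
    "PHON": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


def detect(text: str) -> list[tuple[str, str]]:
    """Return (label, matched_text) pairs ordered by position in the text."""
    hits = []
    for label, pattern in DETECTORS.items():
        for match in pattern.finditer(text):
            hits.append((match.start(), label, match.group(0)))
    return [(label, value) for _, label, value in sorted(hits)]


sample = (
    "Call 555-123-4567 or mail support@example.com, "
    "SSN 123-45-6789, card 4532-1234-5678-9012"
)
found = detect(sample)
names_only = detect("Patient: John Doe")
```

Since there is no NLP step, free-text entities such as personal names produce no hits, which matters when choosing test inputs for this mode.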

**Unit test example:**
```python
import ava
import pytest

@pytest.fixture
def mock_client():
    return ava.Client(
        engine="mock",
        policy="general_moderate",
        vault_type="memory"
    )

def test_email_detection(mock_client):
    with mock_client.session() as session:
        text = "Contact us at support@example.com"
        result = session.sanitize(text)
        assert "AVA_EMAI_" in result
        assert "support@example.com" not in result

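def test_reversibility(mock_client):
    with mock_client.session(reversibility=True) as session:
        # Use an email: the regex-only mock engine does not detect names
        original = "Contact: john.doe@example.com"
        sanitized = session.sanitize(original)
        assert sanitized != original  # something was actually tokenized
        restored = session.restore(sanitized)
        assert restored == original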
```

**CI/CD pipeline (GitHub Actions):**
```yaml
# .github/workflows/test.yml
name: AVA Tests

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"

      - name: Install AVA (lightweight)
        run: pip install ava-protocol pytest  # No [local] needed

      - name: Run tests with MockEngine
        run: pytest tests/ -v
        env:
          AVA_TEST_ENGINE: mock

---

### Mode 4: AWS Macie Adapter

Enterprise-grade PII detection using AWS Macie. Supports custom data identifiers for organization-specific patterns.

**Install:**
```bash
pip install ava-protocol[aws]
aws configure
```

**Example:**
```python
import ava

client = ava.Client(
    engine="aws_macie",
    policy="financial_paranoid",
    vault_type="memory",
    engine_config={
        "region": "us-east-1",
        "custom_data_identifiers": [
            "employee-id-pattern",
            "customer-account-pattern"
        ]
    }
)

with client.session(reversibility=True) as session:
    with open("customer_data.csv", "r") as f:
        content = f.read()

    sanitized = session.sanitize(content)
    insights = sagemaker_model.analyze(sanitized)
    report = session.restore(insights)
```

---

### Mode 5: Azure PII Adapter

Microsoft Azure AI Language PII detection. Supports domain filtering (e.g., healthcare PHI only).

**Install:**
```bash
pip install ava-protocol[azure]
export AZURE_LANGUAGE_ENDPOINT=https://your-resource.cognitiveservices.azure.com
export AZURE_LANGUAGE_KEY=your_api_key_here
```

**Example:**
```python
import ava

client = ava.Client(
    engine="azure_pii",
    policy="healthcare_strict",
    vault_type="redis",
    vault_config={"host": "redis.company.com"},
    engine_config={
        "endpoint": "https://ava-pii.cognitiveservices.azure.com",
        "domain_filter": "phi"  # Health data only
    }
)

with client.session(reversibility=True) as session:
    clinical_notes = """
    Dr. Sarah Johnson examined patient Michael Brown.
    Patient reports chest pain. Contact: 555-123-4567
    """

    sanitized = session.sanitize(clinical_notes)

    response = azure_openai.ChatCompletion.create(
        deployment_id="gpt-4",
        messages=[{"role": "user", "content": sanitized}]
    )

    final = session.restore(response['choices'][0]['message']['content'])
```

---

### Mode 6: Google DLP Adapter

Google Cloud Data Loss Prevention API with 150+ built-in detectors. Supports custom inspect templates for fine-grained control.

**Install:**
```bash
pip install ava-protocol[gcp]
gcloud auth application-default login
```

**Example:**
```python
import ava

client = ava.Client(
    engine="google_dlp",
    policy="legal_confidential",
    vault_type="memory",
    engine_config={
        "project_id": "my-gcp-project",
        "inspect_template": "projects/my-gcp-project/inspectTemplates/legal-template",
        "min_likelihood": "LIKELY"
    }
)

with client.session(reversibility=True) as session:
    legal_document = """
    ATTORNEY-CLIENT PRIVILEGED
    From: attorney@lawfirm.com
    Re: Merger Discussion
    """

    sanitized = session.sanitize(legal_document)
    summary = legal_ai.summarize(sanitized)
    privileged_summary = session.restore(summary)
```

---

## Vault Types

Vaults store the token-to-value mappings that make restoration possible. Choose based on your persistence and scale requirements.

### Memory Vault (Default)

```python
client = ava.Client(engine="presidio", vault_type="memory")
```

In-process dictionary storage. Data never touches disk and is auto-purged on session exit.

> Best for: Single-session flows, air-gapped environments, maximum security.
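
The purge-on-exit behaviour can be pictured with a context manager around a dictionary. This is a sketch under stated assumptions: the `MemoryVault` class, its per-entry TTL handling, and the `session` helper are hypothetical and only mirror the behaviour described above, not AVA's internals.

```python
import time
from contextlib import contextmanager


class MemoryVault:
    """In-process token store with per-entry expiry."""

    def __init__(self, ttl: float, clock=time.monotonic) -> None:
        self._ttl = ttl
        self._clock = clock
        self._entries: dict[str, tuple[str, float]] = {}

    def put(self, token: str, value: str) -> None:
        self._entries[token] = (value, self._clock() + self._ttl)

    def get(self, token: str) -> str:
        value, expires_at = self._entries[token]  # KeyError if unknown
        if self._clock() >= expires_at:
            del self._entries[token]
            raise KeyError(token)
        return value

    def purge(self) -> None:
        self._entries.clear()

    def __len__(self) -> int:
        return len(self._entries)


@contextmanager
def session(ttl: float = 3600.0):
    vault = MemoryVault(ttl)
    try:
        yield vault
    finally:
        # Runs on normal exit and on exceptions, so mappings never outlive the block.
        vault.purge()


with session() as vault:
    vault.put("AVA_PERS_xK9mP2nQ", "John Smith")
    inside = vault.get("AVA_PERS_xK9mP2nQ")
remaining = len(vault)
```

Putting the purge in a `finally` clause is the design point: the mappings are cleared even when the code inside the session raises.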

### SQLite Vault (Persistent)

```python
import os

client = ava.Client(
    engine="presidio",
    vault_type="sqlite",
    vault_config={
        "db_path": "/secure/ava_vault.db",
        "encryption_key": os.environ["VAULT_KEY"],  # AES-256
        "journal_mode": "WAL"
    }
)
```

Survives process restarts. Sessions can be resumed by ID.

> Best for: Audit trails, long-running workflows, crash recovery.
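
To show why a file-backed vault lets a session be resumed by ID, here is a small sketch using the standard-library `sqlite3` module. The table layout and the `SqliteVault`/`load_session` names are assumptions. Encryption is intentionally left out, because the standard `sqlite3` module has no built-in encryption; a real deployment would need to encrypt values before storing them.

```python
import os
import sqlite3
import tempfile


class SqliteVault:
    """Persist token -> value mappings per session (values stored in plain text here)."""

    def __init__(self, db_path: str) -> None:
        self._conn = sqlite3.connect(db_path)
        self._conn.execute("PRAGMA journal_mode=WAL")
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS tokens ("
            "session_id TEXT NOT NULL, token TEXT NOT NULL, value TEXT NOT NULL, "
            "PRIMARY KEY (session_id, token))"
        )
        self._conn.commit()

    def put(self, session_id: str, token: str, value: str) -> None:
        with self._conn:  # commits on success, rolls back on error
            self._conn.execute(
                "INSERT OR REPLACE INTO tokens VALUES (?, ?, ?)",
                (session_id, token, value),
            )

    def load_session(self, session_id: str) -> dict[str, str]:
        rows = self._conn.execute(
            "SELECT token, value FROM tokens WHERE session_id = ?", (session_id,)
        )
        return dict(rows.fetchall())

    def close(self) -> None:
        self._conn.close()


def demo_restart() -> dict[str, str]:
    """Write with one connection, then read back with a fresh one."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "ava_vault.db")
        first = SqliteVault(path)
        first.put("session-1", "AVA_SSN_fG5hI6jK", "123-45-6789")
        first.close()  # simulates the process stopping

        second = SqliteVault(path)
        try:
            return second.load_session("session-1")
        finally:
            second.close()
```

The second connection stands in for a restarted process: it knows only the file path and the session ID, yet recovers the full mapping.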

### Redis Vault (Distributed)

```python
import os

client = ava.Client(
    engine="presidio",
    vault_type="redis",
    vault_config={
        "host": "redis.company.com",
        "port": 6379,
        "password": os.environ["REDIS_PASSWORD"],
        "ssl": True
    }
)
```

Multiple services share tokens. Enables cross-machine session sharing.

> Best for: Microservices, load-balanced deployments, multi-stage pipelines.

---

## Policies

Policies control which entity types are detected, at what sensitivity, and how tokens are retained.

### Built-in Policies

```python
# HIPAA-compliant: all 18 PHI identifiers at sensitivity 5
client = ava.Client(policy="healthcare_strict")

# PCI-DSS level 1: one-time-use tokens for credit card numbers
client = ava.Client(policy="financial_paranoid")

# Attorney-client privilege: extended retention for matter files
client = ava.Client(policy="legal_confidential")

# Balanced business use: names/emails protected, dates preserved
client = ava.Client(policy="general_moderate")

# Scientific data sharing: irreversible hashing (true anonymization)
client = ava.Client(policy="research_anonymized")
```
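
The difference between reversible tokens and the irreversible hashing mentioned for `research_anonymized` can be sketched with a keyed hash. This example swaps in HMAC-SHA256 as one plausible technique; the README does not say which hash AVA uses, and the `irreversible_token` helper and 16-character truncation are assumptions.

```python
import hashlib
import hmac


def irreversible_token(value: str, label: str, key: bytes) -> str:
    """Derive a stable pseudonym from a value; no mapping is stored anywhere."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"AVA_{label}_{digest[:16]}"


key = b"demo-key-do-not-use-in-production"
first = irreversible_token("John Smith", "PERS", key)
again = irreversible_token("John Smith", "PERS", key)
other = irreversible_token("Jane Smith", "PERS", key)
rekeyed = irreversible_token("John Smith", "PERS", b"another-key")
```

The same input always yields the same token, so records can still be joined, but without a vault there is nothing to restore from. A secret key matters here: an unkeyed hash of a low-entropy value such as an SSN could be reversed by enumerating candidates.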

### Custom YAML Policy

```yaml
# policies/enterprise_gdpr.yaml
name: enterprise_gdpr
entity_sensitivity:
  PERS: 5  # Always protected
  EMAI: 5
  PHON: 4
  DATE: 2
thresholds:
  min_confidence: 0.85
retention:
  session_ttl: 3600
  audit_retention: 90d
```

```python
client = ava.Client(policy="/path/to/policies/enterprise_gdpr.yaml")
```
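
One way to read the policy fields above is as a filter over detections: an entity is tokenized only if its type is sensitive enough and the detector is confident enough. The sketch below encodes that reading. The `Detection` class, the `should_tokenize` helper, and especially the `protect_at=3` cutoff are assumptions; the README does not define how sensitivity levels map to behaviour.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    label: str
    text: str
    confidence: float


# Mirrors the YAML policy above as a plain dictionary.
POLICY = {
    "entity_sensitivity": {"PERS": 5, "EMAI": 5, "PHON": 4, "DATE": 2},
    "thresholds": {"min_confidence": 0.85},
}


def should_tokenize(detection: Detection, policy: dict, protect_at: int = 3) -> bool:
    """Tokenize when sensitivity and confidence both clear their thresholds."""
    sensitivity = policy["entity_sensitivity"].get(detection.label, 0)
    confident = detection.confidence >= policy["thresholds"]["min_confidence"]
    return sensitivity >= protect_at and confident


decisions = {
    "name": should_tokenize(Detection("PERS", "Maria Gonzalez", 0.95), POLICY),
    "date": should_tokenize(Detection("DATE", "1985-03-15", 0.99), POLICY),
    "weak_email": should_tokenize(Detection("EMAI", "maria.g@healthmail.com", 0.50), POLICY),
}
```

Under these assumptions the name is protected, the date passes through because its sensitivity is low, and the low-confidence email match is ignored.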

---

## Async API

`ava.AsyncClient` supports concurrent sanitization, AI calls, and restoration using `asyncio.gather`.

```python
import asyncio
import ava

async def process_documents():
    client = ava.AsyncClient(
        engine="presidio",
        policy="general_moderate"
    )

    documents = ["Doc 1...", "Doc 2...", "Doc 3..."]

    async with client.session() as session:
        # Sanitize all concurrently
        sanitized = await asyncio.gather(*[
            session.sanitize(doc) for doc in documents
        ])

        # Send to AI concurrently; call_llm stands in for your own async LLM wrapper
        responses = await asyncio.gather(*[
            call_llm(doc) for doc in sanitized
        ])

        # Restore all concurrently
        final = await asyncio.gather(*[
            session.restore(r) for r in responses
        ])

    return final

asyncio.run(process_documents())
```
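
An unbounded `asyncio.gather` over a large batch can exceed provider rate limits. A common complement, sketched here with the standard library only, is to cap concurrency with a semaphore. The `bounded_gather` helper and the `fake_llm` stand-in are hypothetical and not part of AVA.

```python
import asyncio
from typing import Awaitable, Callable, Iterable


async def bounded_gather(
    func: Callable[[str], Awaitable[str]],
    items: Iterable[str],
    limit: int = 5,
) -> list[str]:
    """Run func over items concurrently, with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def run(item: str) -> str:
        async with semaphore:
            return await func(item)

    # gather preserves input order in its result list
    return await asyncio.gather(*(run(item) for item in items))


async def demo() -> tuple[list[str], int]:
    active = 0
    peak = 0

    async def fake_llm(doc: str) -> str:
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)  # simulated network latency
        active -= 1
        return doc.upper()

    results = await bounded_gather(fake_llm, ["a", "b", "c", "d"], limit=2)
    return results, peak


results, peak = asyncio.run(demo())
```

Results come back in input order, which keeps each response aligned with the document it was generated from.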

---

## Production Workflows

### Healthcare AI Assistant (FastAPI)

```python
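import ava
import openai  # legacy (pre-1.0) SDK interface
from fastapi import FastAPI

# ehr_system and audit_log stand in for your own record store and audit sink

app = FastAPI()
client = ava.Client(engine="presidio", policy="healthcare_strict")

@app.post("/summarize-record")
def summarize(record_id: str):  # plain def: FastAPI runs blocking handlers in a threadpool
    record = ehr_system.get_record(record_id)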
    record = ehr_system.get_record(record_id)

    with client.session(reversibility=True, ttl=1800) as session:
        # 1. Sanitize before sending to AI
        safe = session.sanitize(record)

        # 2. Send to OpenAI — PHI never leaves your environment
        response = openai.ChatCompletion.create(
            model="gpt-4",
            messages=[{"role": "user", "content": safe}]
        )

        # 3. Restore PHI in the summary
        summary = session.restore(
            response['choices'][0]['message']['content']
        )

        # 4. Store manifest for audit trail
        audit_log.store(session.manifest)

    return {"summary": summary, "manifest_id": session.manifest.id}
```
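
For readers wondering what a tamper-evident audit record might involve, the sketch below signs a manifest with HMAC-SHA256 over a canonical JSON encoding. The manifest fields and the `sign_manifest`/`verify_manifest` helpers are assumptions, since the README does not describe the manifest format. HMAC is also a shared-key MAC rather than a public-key signature, so it is swapped in here only because it needs no third-party library.

```python
import hashlib
import hmac
import json


def sign_manifest(manifest: dict, key: bytes) -> str:
    """Return a hex MAC over a canonical JSON encoding of the manifest."""
    payload = json.dumps(manifest, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify_manifest(manifest: dict, signature: str, key: bytes) -> bool:
    """Constant-time comparison of the expected and supplied MAC."""
    return hmac.compare_digest(sign_manifest(manifest, key), signature)


key = b"audit-signing-key"
# Hypothetical manifest fields, for illustration only.
manifest = {
    "id": "m-001",
    "policy": "healthcare_strict",
    "entities": {"PERS": 1, "SSN": 1},
}
signature = sign_manifest(manifest, key)

tampered = dict(manifest, policy="general_moderate")
ok = verify_manifest(manifest, signature, key)
caught = not verify_manifest(tampered, signature, key)
```

Sorting keys before serializing means two manifests with the same content always produce the same MAC, regardless of dictionary insertion order.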

### Financial Customer Service Bot

```python
import os

import ava

class CustomerServiceBot:
    def __init__(self):
        self.client = ava.Client(
            gateway_url="https://ava.bank.internal",
            api_key=os.environ["AVA_API_KEY"],
            policy="financial_paranoid"
        )

    async def handle(self, message: str):
        with self.client.session(reversibility=True) as session:
            # Customer input is sanitized before reaching AI
            # "My card 4532-1234-5678-9012 is wrong"
            # → "My card AVA_CRED_aB3cD4eF is wrong"
            safe = session.sanitize(message)

            ai_response = await claude.complete(f"Customer: {safe}")
            # "I'll check account AVA_CRED_aB3cD4eF"

            # Restore real values for the human agent (not the customer)
            agent_response = session.restore(ai_response)

            return {"to_agent": agent_response}
```
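
Language models sometimes alter or invent identifiers, so it can be useful to check a response for token-shaped strings that the vault does not recognise before showing it to anyone. The sketch below does that with a regex. The token shape is inferred from the examples in this README (such as `AVA_CRED_aB3cD4eF`) and may not match the real format, and `unresolved_tokens` is a hypothetical helper.

```python
import re

# Shape inferred from README examples: AVA_<3-4 uppercase letters>_<8 alphanumerics>.
TOKEN_RE = re.compile(r"AVA_[A-Z]{3,4}_[A-Za-z0-9]{8}")


def unresolved_tokens(text: str, vault: dict[str, str]) -> list[str]:
    """Return token-like strings in text that have no mapping in the vault."""
    return sorted({t for t in TOKEN_RE.findall(text) if t not in vault})


vault = {"AVA_CRED_aB3cD4eF": "4532-1234-5678-9012"}

clean = unresolved_tokens("I'll check account AVA_CRED_aB3cD4eF", vault)
suspect = unresolved_tokens("See AVA_CRED_aB3cD4eX and AVA_PERS_xK9mP2nQ", vault)
```

A non-empty result suggests the model mangled a token or produced one of its own, and the response probably should not be restored and forwarded as-is.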

---

## License

MIT License — see [LICENSE](LICENSE)