Metadata-Version: 2.4
Name: graph-sieve
Version: 1.2.2
Summary: Full Spectrum Graph Sieve - Automated Technical Term Extraction and Relationship Mapping
Author-email: graph-sieve contributors <contributors@graph-sieve.org>
License: MIT
Keywords: knowledge-graph,llm,extraction,nlp,mcp,graph-database
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: markitdown[all]
Requires-Dist: openai
Requires-Dist: pydantic
Requires-Dist: pydantic-settings
Requires-Dist: tqdm
Requires-Dist: click
Requires-Dist: mcp
Requires-Dist: numpy
Requires-Dist: pandas
Requires-Dist: hyperextract
Requires-Dist: platformdirs
Requires-Dist: networkx
Requires-Dist: pyvis
Requires-Dist: tenacity
Requires-Dist: pyOneNote
Requires-Dist: chardet
Requires-Dist: scikit-learn
Dynamic: license-file

# Graph-Sieve 🕸️📊

**Full Spectrum Graph Sieve - Automated Technical Term Extraction and Relationship Mapping**

`graph-sieve` is a knowledge management utility and service that extracts relationship-aware domain knowledge from unstructured documents (.docx, .pptx, .msg, .pdf, .one). Its multi-gate, verifiable pipeline builds a structured knowledge graph that preserves technical context and organizational links.

## ✨ Core Capabilities

- **🔍 Multi-Gate Pipeline**: A 5-gate extraction flow (Strategic Sieve -> Batch Extraction -> Multi-Source Validation -> Alias Resolution -> Global Synthesis) designed to capture terms accurately and limit hallucinations.
- **📄 Multi-Format Support**: Native handling of PDF, PPTX, DOCX, MSG, and **OneNote** (.one) files. Leverages Microsoft **MarkItDown** for deep document parsing and OCR.
- **🗺️ Relationship Mapping**: Goes beyond simple term lookup by automatically mapping how terms relate (e.g., `SUPERSEDES`, `DEPENDS_ON`, `HAS_EXPERT`).
- **🌐 Global Synthesis**: Automatically clusters the graph into communities and generates executive summaries and a global project narrative.
- **🇮🇱 Hebrew & Mixed-Language Support**: Specialized Bi-Directional (BIDI) support for Hebrew-English technical documents, ensuring technical terms are correctly extracted from mixed-language contexts.
- **⚙️ Flexible LLM Backend**: Run locally with Ollama/vLLM for privacy, or use OpenAI for scale.
- **📈 Interactive Visualization**: Generate dynamic, relationship-aware graph visualizations via PyVis.
- **🤖 MCP Server**: Integrated Model Context Protocol (MCP) server for seamless integration with AI agents like Claude Desktop or Gemini CLI.
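
To make the relationship types above concrete, here is a minimal sketch of a store for typed, directed links between terms. It is illustrative only: `TermGraph`, its methods, and the sample terms are hypothetical and do not reflect Graph-Sieve's internal schema or API.

```python
from collections import defaultdict


class TermGraph:
    """Tiny in-memory store of typed, directed relationships between terms.

    Hypothetical illustration; not Graph-Sieve's actual data model.
    """

    def __init__(self):
        # Maps a source term to a list of (relation, target) pairs.
        self._edges = defaultdict(list)

    def add(self, source, relation, target):
        """Record that `source` is linked to `target` by `relation`."""
        self._edges[source].append((relation, target))

    def related(self, source, relation=None):
        """Return targets linked from `source`, optionally filtered by relation."""
        return [
            target
            for rel, target in self._edges.get(source, [])
            if relation is None or rel == relation
        ]


graph = TermGraph()
# Sample data, invented for the example.
graph.add("Auth Service v2", "SUPERSEDES", "Auth Service v1")
graph.add("Auth Service v2", "DEPENDS_ON", "Token Store")
graph.add("Auth Service v2", "HAS_EXPERT", "Dana")

print(graph.related("Auth Service v2", "DEPENDS_ON"))  # ['Token Store']
```

Filtering on `HAS_EXPERT` in the same way gives the kind of answer an ownership query would need.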

## 🚀 Quick Start

1.  **Configure Your LLM**:
    Create a `.env` file in your working directory:
    ```env
    LLM_PROVIDER=openai
    OPENAI_API_KEY=your_key_here
    MODEL_NAME=gpt-4o-mini
    ```
    *Or use local Ollama:*
    ```env
    LLM_PROVIDER=ollama
    OLLAMA_BASE_URL=http://localhost:11434
    MODEL_NAME=llama3
    ```

2.  **Scan a Directory**:
    ```bash
    graph-sieve-scan ./path/to/documents --db my_knowledge.db
    ```

3.  **Visualize the Results**:
    ```bash
    graph-sieve-visualize --db my_knowledge.db
    ```

## 🛠️ CLI Command Reference

- `graph-sieve-scan <path>`: Extract terms from a directory or file.
  - `--db <path>`: Path to the SQLite database (default: platform-standard data dir).
  - `--seed <path>`: High-authority documents to process first.
  - `--whitelist <path>`: Text file with terms to always include.
  - `--retry-failed`: Retry processing chunks from the Dead Letter Queue (DLQ).
- `graph-sieve-lookup <term>`: Query a term, its definition, and its graph context.
- `graph-sieve-visualize`: Generate an interactive HTML graph.
- `graph-sieve-mcp`: Launch the MCP server.
- `graph-sieve-whois <term>`: Identify experts, owners, and organizations responsible for a term.
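
When scans are driven from a script, the flags listed above can be assembled programmatically. The helper below is a sketch, not part of the package: `build_scan_command` is a hypothetical name, and it only builds the argument list from the documented flags without running anything.

```python
import shlex


def build_scan_command(path=None, db=None, seed=None, whitelist=None, retry_failed=False):
    """Assemble an argv list for graph-sieve-scan from the documented flags.

    Hypothetical helper; it does not validate paths or invoke the CLI.
    """
    cmd = ["graph-sieve-scan"]
    if path:
        cmd.append(str(path))
    if db:
        cmd += ["--db", str(db)]
    if seed:
        cmd += ["--seed", str(seed)]
    if whitelist:
        cmd += ["--whitelist", str(whitelist)]
    if retry_failed:
        cmd.append("--retry-failed")
    return cmd


cmd = build_scan_command("./docs", db="my_knowledge.db", seed="./specs")
print(shlex.join(cmd))
# To execute it: subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems with paths that contain spaces.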

## 📖 Advanced Workflow

### 💎 Seed Documents
Use the `--seed` flag to process "Golden" documents (specs, architecture docs) before general notes. This sets the ground truth for term definitions and relationships.
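
One way to picture "ground truth" is a first-writer-wins merge, where definitions from seed documents are applied first and later sources only fill gaps. This README does not describe how Graph-Sieve actually resolves conflicting definitions, so the sketch below is an assumption, and `merge_definitions` is a hypothetical name.

```python
def merge_definitions(seed_defs, general_defs):
    """Combine (term, definition) pairs so that seed definitions take precedence.

    Assumed policy for illustration: the first definition seen for a term is kept.
    """
    merged = {}
    # Seeds come first; later sources only add terms that are not yet defined.
    for term, definition in list(seed_defs) + list(general_defs):
        merged.setdefault(term, definition)
    return merged


seed = [("AIP", "The AI Platform described in the architecture spec")]
notes = [("AIP", "some internal tool?"), ("DLQ", "Dead Letter Queue")]
print(merge_definitions(seed, notes))
```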

### 🔗 Alias Resolution & Canonicalization
Graph-Sieve automatically performs LLM-verified canonicalization. If it finds "AIP" and "AI Platform" in the same context, it will attempt to merge them into a single canonical entry with appropriate aliases.
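
The merge step can be sketched with a small union-find structure. Everything here is illustrative: `merge_aliases` is a hypothetical function, the `verify` callback stands in for the LLM check, and picking the longest spelling as the canonical name is an assumption rather than Graph-Sieve's documented rule.

```python
def merge_aliases(candidate_pairs, verify):
    """Group terms into canonical entries.

    candidate_pairs: iterable of (term_a, term_b) that might name the same concept.
    verify: callable(term_a, term_b) -> bool, a stand-in for the LLM verification.
    Returns {canonical_name: sorted list of aliases}.
    """
    parent = {}

    def find(term):
        # Register unseen terms as their own root, then walk up to the root.
        parent.setdefault(term, term)
        while parent[term] != term:
            parent[term] = parent[parent[term]]  # path halving
            term = parent[term]
        return term

    for a, b in candidate_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b and verify(a, b):
            parent[root_b] = root_a  # merge the two groups

    groups = {}
    for term in parent:
        groups.setdefault(find(term), []).append(term)

    result = {}
    for members in groups.values():
        # Assumption: the longest spelling is the most descriptive, so it is canonical.
        canonical = max(members, key=lambda m: (len(m), m))
        result[canonical] = sorted(m for m in members if m != canonical)
    return result


pairs = [("AIP", "AI Platform"), ("AIP", "API")]
# Toy verifier: only accepts the AIP / AI Platform pair.
accept = lambda a, b: {a, b} == {"AIP", "AI Platform"}
print(merge_aliases(pairs, accept))
```

Rejected pairs stay as separate entries with no aliases, so a false candidate such as "AIP" versus "API" is not merged.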

### 🆘 Dead Letter Queue (DLQ)
If an LLM call fails or a chunk is too complex to process, the chunk is pushed to the DLQ. Use `graph-sieve-scan --retry-failed` to re-process these chunks after updating your configuration or models.
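
The general dead-letter pattern looks roughly like the sketch below. It uses an in-memory list and hypothetical function names (`process_chunks`, `retry_failed`); this README does not specify how Graph-Sieve stores or retries its DLQ, so treat it as an illustration of the idea only.

```python
def process_chunks(chunks, extract, dlq):
    """Run `extract` on each chunk; park failures in `dlq` instead of aborting."""
    results = []
    for chunk in chunks:
        try:
            results.append(extract(chunk))
        except Exception as exc:  # broad on purpose: any failure can be retried later
            dlq.append((chunk, repr(exc)))
    return results


def retry_failed(extract, dlq):
    """Re-process everything in the DLQ; chunks that fail again are re-queued."""
    pending = [chunk for chunk, _ in dlq]
    dlq.clear()
    return process_chunks(pending, extract, dlq)


def small_model(chunk):
    # Toy extractor that cannot handle long chunks.
    if len(chunk) > 10:
        raise ValueError("chunk too complex")
    return chunk.upper()


def larger_model(chunk):
    # Toy extractor standing in for an updated model or configuration.
    return chunk.upper()


dlq = []
done = process_chunks(["short", "a much longer chunk"], small_model, dlq)
print(done, len(dlq))
recovered = retry_failed(larger_model, dlq)
print(recovered, len(dlq))
```

Keeping the error text next to each failed chunk makes it easier to decide whether a configuration change is likely to help before retrying.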

## ⚙️ Configuration (Environment Variables)

| Variable | Description | Default |
|----------|-------------|---------|
| `LLM_PROVIDER` | `openai`, `ollama`, or `vllm` | `openai` |
| `OPENAI_API_KEY` | Required if using OpenAI | None |
| `OLLAMA_BASE_URL` | URL for Ollama API | `http://localhost:11434` |
| `MODEL_NAME` | Model to use for extraction | `gpt-4o-mini` |
| `STORAGE_DIR` | Directory for graph data | Platform-specific |
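
The table above can be read as the following resolution logic. This is a standard-library sketch under stated assumptions: the package lists `pydantic-settings` as a dependency and may load its configuration differently, and `Settings` and `load_settings` are hypothetical names.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    llm_provider: str
    openai_api_key: Optional[str]
    ollama_base_url: str
    model_name: str
    storage_dir: Optional[str]  # None means "use the platform-specific default"


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Resolve settings from environment variables using the documented defaults."""
    provider = env.get("LLM_PROVIDER", "openai")
    if provider not in {"openai", "ollama", "vllm"}:
        raise ValueError(f"Unsupported LLM_PROVIDER: {provider!r}")

    api_key = env.get("OPENAI_API_KEY")
    if provider == "openai" and not api_key:
        raise ValueError("OPENAI_API_KEY is required when LLM_PROVIDER=openai")

    return Settings(
        llm_provider=provider,
        openai_api_key=api_key,
        ollama_base_url=env.get("OLLAMA_BASE_URL", "http://localhost:11434"),
        model_name=env.get("MODEL_NAME", "gpt-4o-mini"),
        storage_dir=env.get("STORAGE_DIR"),
    )


# Passing a plain dict instead of os.environ keeps the example reproducible.
settings = load_settings({"LLM_PROVIDER": "ollama", "MODEL_NAME": "llama3"})
print(settings.llm_provider, settings.model_name, settings.ollama_base_url)
```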

## 🧩 AI Agent Integration

Add Graph-Sieve to your MCP-compatible agent's configuration:

```json
{
  "mcpServers": {
    "graph-sieve": {
      "command": "graph-sieve-mcp",
      "args": []
    }
  }
}
```

## License

MIT License. See [LICENSE](LICENSE) for details.
