Metadata-Version: 2.3
Name: knowlang
Version: 0.2.1
Summary: AI-powered code understanding assistant that helps developers explore and understand complex codebases through semantic search and intelligent Q&A
License: Apache-2.0
Keywords: code-understanding,rag,llm,code-search,documentation,code-analysis,semantic-search,developer-tools
Author: gabhyun kim
Author-email: kimgb415@gmail.com
Requires-Python: >=3.10, <3.13
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Software Development :: Documentation
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Operating System :: OS Independent
Requires-Dist: chromadb (>=0.6.3,<0.7.0)
Requires-Dist: gitpython (>=3.1.44,<4.0.0)
Requires-Dist: gradio (>=5.13.1,<6.0.0)
Requires-Dist: mixpanel (>=4.10.1,<5.0.0)
Requires-Dist: ollama (>=0.4.7,<0.5.0)
Requires-Dist: pydantic-ai[logfire] (>=0.0.20,<0.0.21)
Requires-Dist: pydantic-settings (>=2.7.1,<3.0.0)
Requires-Dist: tree-sitter (>=0.24.0,<0.25.0)
Requires-Dist: tree-sitter-python (>=0.23.6,<0.24.0)
Requires-Dist: voyageai (>=0.3.2,<0.4.0)
Project-URL: Homepage, https://github.com/kimgb415/know-lang
Project-URL: Repository, https://github.com/kimgb415/know-lang
Description-Content-Type: text/markdown

---
title: KnowLangBot
emoji: 🤖
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
---

# KnowLang: Comprehensive Understanding for Complex Codebase

KnowLang is an advanced codebase exploration tool that helps software engineers better understand complex codebases through semantic search and intelligent Q&A capabilities. Our first release focuses on providing RAG-powered search and Q&A for popular open-source libraries, with Hugging Face's repositories as our initial targets.

[![Hugging Face Space](https://img.shields.io/badge/🤗%20Hugging%20Face-Space-blue)](https://huggingface.co/spaces/gabykim/KnowLang_Transformers_Demo)

## Features

- 🔍 **Semantic Code Search**: Find relevant code snippets based on natural language queries
- 📚 **Contextual Q&A**: Get detailed explanations about code functionality and implementation details
- 🎯 **Smart Chunking**: Intelligent code parsing that preserves semantic meaning
- 🔄 **Multi-Stage Retrieval**: Combined embedding and semantic search for better results
- 🐍 **Python Support**: Currently optimized for Python codebases, with a roadmap for multi-language support

## How It Works

### Code Parsing Pipeline

```mermaid
flowchart TD
    A[Git Repository] --> B[Code Files]
    B --> C[Code Parser]
    C --> D{Parse by Type}
    D --> E[Class Definitions]
    D --> F[Function Definitions]
    D --> G[Other Code]
    E --> H[Code Chunks]
    F --> H
    G --> H
    H --> I[LLM Summarization]
    H --> J
    I --> J[Embeddings]
    J --> K[(Vector Store)]
```

### RAG Chatbot Pipeline

```mermaid
flowchart LR
    A[User Query] --> B[Query Embedding]
    B --> C[Vector Search]
    C --> D[Context Collection]
    D --> E[LLM Response Generation]
    E --> F[User Interface]
```


## Prerequisites

KnowLang uses [Ollama](https://ollama.com) as its default LLM and embedding provider. Before installing KnowLang:

1. Install Ollama:
```bash
# check the official download instructions from https://ollama.com/download
curl -fsSL https://ollama.com/install.sh | sh
```

2. Pull required models:
```bash
# For LLM responses
ollama pull llama3.2

# For code embeddings
ollama pull mxbai-embed-large
```

3. Verify Ollama is running:
```bash
ollama list
```

You should see both `llama3.2` and `mxbai-embed-large` in the list of available models.
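
If you would rather check this from a script, the sketch below shows one possible approach. It assumes Ollama is serving its REST API at the default address `http://localhost:11434` and that the `/api/tags` endpoint returns a JSON object with a `models` list. The helper names `missing_models` and `fetch_tags` are illustrative and are not part of KnowLang.

```python
import json
import urllib.request

# Models named in the steps above.
REQUIRED_MODELS = ("llama3.2", "mxbai-embed-large")


def missing_models(tags_payload, required=REQUIRED_MODELS):
    """Return the required model names that are absent from an /api/tags payload."""
    # Ollama reports names with a tag suffix such as "llama3.2:latest",
    # so compare on the part before the colon.
    installed = {m["name"].split(":")[0] for m in tags_payload.get("models", [])}
    return [name for name in required if name not in installed]


def fetch_tags(base_url="http://localhost:11434"):
    """Ask a local Ollama server which models it has (assumed default address)."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)
```

Calling `missing_models(fetch_tags())` should return an empty list when both models are installed. A connection error (an `OSError` subclass) suggests the Ollama server is not running.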

Note: While Ollama is the default choice for easy setup, KnowLang supports other LLM providers through configuration. See our [Configuration Guide](configuration.md) for using alternative providers like OpenAI or Anthropic.

## Quick Start

### System Requirements

- **RAM**: 16 GB or more recommended (Ollama models require significant memory)
- **Storage**: At least 10GB free space for model files
- **OS**: 
  - Linux (recommended)
  - macOS 12+ (Intel or Apple Silicon)
  - Windows 10+ with WSL2
- **Python**: 3.10, 3.11, or 3.12


### Installation
You can install KnowLang via pip:
```bash
pip install knowlang
```
Alternatively, you can clone the repository and install it in editable mode:
```bash
git clone https://github.com/kimgb415/know-lang.git
cd know-lang
pip install -e .
```
This allows you to make changes to the source code and have them immediately reflected without reinstalling the package.

### Basic Usage

1. First, parse and index your codebase:
```bash
# For a local codebase
knowlang parse ./my-project

# For verbose output
knowlang -v parse ./my-project
```
> ⚠️ **Warning**  
> Make sure to set up the correct include and exclude paths before parsing. See the "Parser Settings" section of the [Configuration Guide](configuration.md) for details.

2. Then, launch the chat interface:
```bash
knowlang chat
```

That's it! The chat interface will open in your browser, ready to answer questions about your codebase.

![Chat Interface](chat.png)

### Advanced Usage

#### Custom Configuration
```bash
# Use custom configuration file
knowlang parse --config my_config.yaml ./my-project

# Output parsing results in JSON format
knowlang parse --output json ./my-project
```

#### Chat Interface Options
```bash
# Run on a specific port
knowlang chat --port 7860

# Create a shareable link
knowlang chat --share

# Run on a custom server
knowlang chat --server-name localhost --server-port 8000
```

### Example Session

```bash
# Parse the transformers library
$ knowlang parse ./transformers
Found 1247 code chunks
Processing summaries... Done!

# Start chatting
$ knowlang chat

💡 Ask questions like:
- How is tokenization implemented?
- Explain the training pipeline
- Show me examples of custom model usage
```

## Architecture

KnowLang uses several key technologies:

- **Tree-sitter**: For robust, language-agnostic code parsing
- **ChromaDB**: For efficient vector storage and retrieval
- **PydanticAI**: For type-safe LLM interactions
- **Gradio**: For the interactive chat interface

## Technical Details

### Code Parsing

Our code parsing pipeline uses Tree-sitter to break down source code into meaningful chunks while preserving context:

1. Repository cloning and file identification
2. Semantic parsing with Tree-sitter
3. Smart chunking based on code structure
4. LLM-powered summarization
5. Embedding generation with mxbai-embed-large
6. Vector store indexing
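
To make the chunking step (3) more concrete, here is a minimal sketch of splitting a Python file into class, function, and "other" chunks. It deliberately uses the standard-library `ast` module as a stand-in for Tree-sitter so that it runs without extra dependencies. The `CodeChunk` dataclass and `chunk_python_source` function are hypothetical names for illustration, not KnowLang's actual parser.

```python
import ast
from dataclasses import dataclass


@dataclass
class CodeChunk:
    kind: str        # "class", "function", or "other"
    name: str        # definition name, or the AST node type for other code
    start_line: int  # 1-based, inclusive
    end_line: int    # 1-based, inclusive
    source: str


def chunk_python_source(source: str) -> list[CodeChunk]:
    """Split a module into one chunk per top-level statement."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, ast.ClassDef):
            kind, name = "class", node.name
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            kind, name = "function", node.name
        else:
            kind, name = "other", type(node).__name__
        # Include decorators, which sit above the line reported by node.lineno.
        decorators = getattr(node, "decorator_list", [])
        start = min([node.lineno] + [d.lineno for d in decorators])
        text = "\n".join(lines[start - 1 : node.end_lineno])
        chunks.append(CodeChunk(kind, name, start, node.end_lineno, text))
    return chunks
```

Keeping each class or function whole preserves the context that a summarization or embedding model needs, which fixed-size text windows would cut through.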

### RAG Implementation

The RAG system uses a multi-stage retrieval process:

1. Query embedding generation
2. Initial vector similarity search
3. Context aggregation
4. LLM response generation with full context
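
The retrieval steps above can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions: the index is an in-memory list of `(chunk_text, embedding)` pairs rather than ChromaDB, the embeddings are toy vectors rather than model output, and `retrieve` and `build_prompt` are hypothetical helpers, not KnowLang functions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, index, top_k=2):
    """Return the top_k chunk texts most similar to the query embedding."""
    scored = sorted(
        index,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in scored[:top_k]]


def build_prompt(question, contexts):
    """Aggregate retrieved chunks into a single prompt for the LLM."""
    joined = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(contexts))
    return f"Answer using only the context below.\n\n{joined}\n\nQuestion: {question}"


# Toy index with 2-dimensional embeddings, for illustration only.
index = [
    ("def tokenize(): ...", [1.0, 0.0]),
    ("class Trainer: ...", [0.0, 1.0]),
    ("def encode(): ...", [0.9, 0.1]),
]
contexts = retrieve([1.0, 0.0], index, top_k=2)
prompt = build_prompt("How is tokenization implemented?", contexts)
```

In a real deployment the query vector would come from the same embedding model used at indexing time, and the resulting prompt would be sent to the configured LLM.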


## Roadmap

- [ ] Inter-repository semantic search
- [ ] Support for additional programming languages
- [ ] Automatic documentation maintenance
- [ ] Integration with popular IDEs
- [ ] Custom embedding model training
- [ ] Enhanced evaluation metrics

## License

This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details. The Apache License 2.0 is a permissive license that enables broad use, modification, and distribution while providing patent rights and protecting trademark use.

## Citation

If you use KnowLang in your research, please cite:

```bibtex
@software{knowlang2025,
  author = {KnowLang},
  title = {KnowLang: Comprehensive Understanding for Complex Codebase},
  year = {2025},
  publisher = {GitHub},
  url = {https://github.com/kimgb415/know-lang}
}
```

## Support

For support, please open an issue on GitHub or reach out to us directly through discussions.
