Metadata-Version: 2.4
Name: astchunk
Version: 0.1.0
Summary: AST-based code chunking library for improved code analysis and processing
Author-email: "Yilin (Jason) Zhang" <jasonzh3@andrew.cmu.edu>, Xinran Zhao <xinranz3@andrew.cmu.edu>, Zora Zhiruo Wang <zhiruow@andrew.cmu.edu>, Chenyang Yang <cyang3@andrew.cmu.edu>, Jiayi Wei <jiayi@augmentcode.com>, Sherry Tongshuang Wu <sherryw@andrew.cmu.edu>
Maintainer-email: "Yilin (Jason) Zhang" <jasonzh3@andrew.cmu.edu>
License: MIT License
        
        Copyright (c) 2025 Yilin (Jason) Zhang
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
        
Project-URL: Homepage, https://github.com/yilinjz/astchunk
Keywords: ast,chunking,code analysis,code indexing,code retrieval,code generation,tree-sitter,parsing
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Software Development :: Code Generators
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.20.0
Requires-Dist: pyrsistent>=0.18.0
Requires-Dist: tree-sitter>=0.20.0
Requires-Dist: tree-sitter-python>=0.20.0
Requires-Dist: tree-sitter-java>=0.20.0
Requires-Dist: tree-sitter-c-sharp>=0.20.0
Requires-Dist: tree-sitter-typescript>=0.20.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
Requires-Dist: black>=22.0.0; extra == "dev"
Requires-Dist: isort>=5.10.0; extra == "dev"
Requires-Dist: flake8>=5.0.0; extra == "dev"
Requires-Dist: mypy>=1.0.0; extra == "dev"
Requires-Dist: pre-commit>=2.20.0; extra == "dev"
Provides-Extra: docs
Requires-Dist: sphinx>=5.0.0; extra == "docs"
Requires-Dist: sphinx-rtd-theme>=1.0.0; extra == "docs"
Requires-Dist: myst-parser>=0.18.0; extra == "docs"
Provides-Extra: test
Requires-Dist: pytest>=7.0.0; extra == "test"
Requires-Dist: pytest-cov>=4.0.0; extra == "test"
Requires-Dist: pytest-xdist>=2.5.0; extra == "test"
Dynamic: license-file

# ASTChunk

This repository contains code for AST-based code chunking that preserves syntactic structure and semantic boundaries. ASTChunk divides source code into chunks along Abstract Syntax Tree (AST) boundaries, so that syntactic constructs are not split mid-way. This makes it useful for code analysis, documentation generation, and machine learning applications.

<!-- Add paper citation when available -->
<!--
This work is described in the following paper:  
>[Paper Title](paper_url)  
> Author Names
> Conference/Journal, Year

Bibtex for citations:
```bibtex
@inproceedings{<citation_key>,
    title = "<Paper Title>",
    author = "<Authors>",
    booktitle = "<Conference>",
    year = "<Year>",
    url = "<URL>",
    pages = "<Pages>",
}
```
-->

<!--
## Features

- **Structure-aware chunking**: Respects AST boundaries to avoid breaking syntactic constructs
- **Multi-language support**: Python, Java, C#, and TypeScript
- **Configurable chunk sizes**: Based on non-whitespace character count for consistent sizing
- **Metadata preservation**: Maintains file paths, line numbers, and AST context
- **Overlapping support**: Optional overlapping between chunks for better context
- **Efficient processing**: O(1) chunk size lookup with preprocessing
-->

## Installation

From PyPI:
```bash
pip install astchunk
```

From source:
```bash
git clone git@github.com:yilinjz/astchunk.git
cd astchunk
pip install -e .
```

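ASTChunk depends on [tree-sitter](https://tree-sitter.github.io/tree-sitter/) for parsing. The required language parsers are installed automatically with the package; the commands below are listed for reference, in case you need to install them manually: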

```bash
# Core dependencies (automatically installed)
pip install numpy pyrsistent tree-sitter
pip install tree-sitter-python tree-sitter-java tree-sitter-c-sharp tree-sitter-typescript
```

## Configuration Options

- **`max_chunk_size`**: Maximum non-whitespace characters per chunk
- **`language`**: Programming language for parsing
- **`metadata_template`**: Format for chunk metadata
- **`repo_level_metadata`** *(optional)*: Repository-level metadata (e.g., repo name, file path)
- **`chunk_overlap`** *(optional)*: Number of AST nodes to overlap between chunks
- **`chunk_expansion`** *(optional)*: Whether to perform chunk expansion (i.e., add metadata headers to chunks)
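
To make the size unit behind `max_chunk_size` concrete, here is a minimal sketch of counting non-whitespace characters. The helper `count_non_whitespace` is a hypothetical name used for illustration only; it is not part of the ASTChunk API, and the library's own counting may differ in detail (for example, by operating on bytes rather than characters).

```python
def count_non_whitespace(text: str) -> int:
    """Return the number of characters in `text` that are not whitespace."""
    return sum(1 for ch in text if not ch.isspace())


snippet = "def add(a, b):\n    return a + b\n"

# Total length includes spaces, indentation, and newlines.
print(len(snippet))

# Only these characters would count toward a non-whitespace size budget.
print(count_non_whitespace(snippet))
```

Measuring size this way means that indentation depth and blank lines do not change how much code fits within a chunk.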

## Quick Start

```python
from astchunk import ASTChunkBuilder

# Your source code
code = """
def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n-1) + fibonacci(n-2)

class Calculator:
    def add(self, a, b):
        return a + b
    
    def multiply(self, a, b):
        return a * b
"""

# Initialize the chunk builder
configs = {
    "max_chunk_size": 100,             # Maximum non-whitespace characters per chunk
    "language": "python",              # Supported: python, java, csharp, typescript
    "metadata_template": "default"     # Metadata format for output
}
chunk_builder = ASTChunkBuilder(**configs)

# Create chunks
chunks = chunk_builder.chunkify(code)

# Each chunk contains content and metadata
for i, chunk in enumerate(chunks):
    print(f"[Chunk {i+1}]")
    print(f"{chunk['content']}")
    print(f"Metadata: {chunk['metadata']}")
    print("-" * 50)
```

## Advanced Usage

### Customizing Chunk Parameters

```python

# Add repo-level metadata
configs['repo_level_metadata'] = {
    "filepath": "src/calculator.py"
}

# Enable overlapping between chunks
configs['chunk_overlap'] = 1

# Add chunk expansion (metadata headers)
configs['chunk_expansion'] = True

# NOTE: max_chunk_size applies to the chunks before overlapping or chunk expansion.
# The final chunk size after overlapping or chunk expansion may exceed max_chunk_size.


# Extend current code for illustration
code += """
def divide(self, a, b):
    if b == 0:
        raise ValueError("Cannot divide by zero")
    return a / b

# This is a comment
# Another comment

def subtract(self, a, b):
    return a - b

def exponent(self, a, b):
    return a ** b
"""


# Create chunks
chunks = chunk_builder.chunkify(code, **configs)

for i, chunk in enumerate(chunks):
    print(f"[Chunk {i+1}]")
    print(f"{chunk['content']}")
    print(f"Metadata: {chunk['metadata']}")
    print("-" * 50)
```
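
Because overlapping and chunk expansion can push a chunk past `max_chunk_size`, it can be useful to check the final sizes when a downstream consumer has a hard limit. The sketch below assumes only that each chunk is a dict with a `content` string, as in the loops above. The function `find_oversized_chunks` is a hypothetical helper, not part of ASTChunk, and the example data is a stand-in rather than real `chunkify()` output.

```python
from typing import Dict, List, Tuple


def find_oversized_chunks(chunks: List[Dict], max_chunk_size: int) -> List[Tuple[int, int]]:
    """Return (index, size) pairs for chunks whose non-whitespace size exceeds the limit."""
    oversized = []
    for index, chunk in enumerate(chunks):
        # Count non-whitespace characters, the same unit max_chunk_size is described in.
        size = sum(1 for ch in chunk["content"] if not ch.isspace())
        if size > max_chunk_size:
            oversized.append((index, size))
    return oversized


# Stand-in data shaped like the chunk dicts printed above (not real library output).
example_chunks = [
    {"content": "def f():\n    return 1\n", "metadata": {}},
    {"content": "x = 1\n", "metadata": {}},
]

oversized = find_oversized_chunks(example_chunks, max_chunk_size=10)
print(oversized)
```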

### Working with Files

```python
# Process a single file
with open("example.py", "r") as f:
    code = f.read()

# Alternatively, pass the optional arguments as single-use configs to an individual chunkify() call
single_use_configs = {
    "repo_level_metadata": {
        "filepath": "example.py"
    },
    "chunk_expansion": True
}

chunks = chunk_builder.chunkify(code, **single_use_configs)

# Save chunks to separate files
for i, chunk in enumerate(chunks):
    with open(f"chunk_{i+1}.py", "w") as f:
        f.write(chunk['content'])
```
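
Writing only `chunk['content']` to disk drops the metadata. If you want to keep both, one option is a JSON Lines file with one record per chunk. This is a minimal sketch: `write_chunks_jsonl` is a hypothetical helper rather than an ASTChunk function, the example chunks are stand-in data, and it assumes the metadata values are JSON-serializable (anything that is not falls back to `str`).

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, Iterable


def write_chunks_jsonl(chunks: Iterable[Dict], output_path: str) -> int:
    """Write one JSON object per chunk and return the number of chunks written."""
    count = 0
    with Path(output_path).open("w", encoding="utf-8") as f:
        for chunk in chunks:
            record = {"content": chunk["content"], "metadata": chunk["metadata"]}
            # default=str is a fallback for metadata values json cannot encode directly.
            f.write(json.dumps(record, ensure_ascii=False, default=str) + "\n")
            count += 1
    return count


# Stand-in data shaped like the chunk dicts used above (not real library output).
example_chunks = [
    {"content": "def f():\n    return 1\n", "metadata": {"filepath": "example.py"}},
    {"content": "x = 1\n", "metadata": {"filepath": "example.py"}},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    output_file = Path(tmp_dir) / "chunks.jsonl"
    written = write_chunks_jsonl(example_chunks, str(output_file))
    lines = output_file.read_text(encoding="utf-8").splitlines()

print(written, len(lines))
```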

### Processing Multiple Languages

```python
# Python code
python_builder = ASTChunkBuilder(
    max_chunk_size=1500,
    language="python",
    metadata_template="default"
)

# Java code
java_builder = ASTChunkBuilder(
    max_chunk_size=2000,
    language="java",
    metadata_template="default"
)
)

# TypeScript code
ts_builder = ASTChunkBuilder(
    max_chunk_size=1800,
    language="typescript",
    metadata_template="default"
)
```

<!-- ### Metadata Templates

Different metadata templates for various use cases:

```python
# For repoeval
repoeval_builder = ASTChunkBuilder(
    max_chunk_size=2000,
    language="python",
    metadata_template="coderagbench-repoeval"
)

# For swebench-lite
swebench_builder = ASTChunkBuilder(
    max_chunk_size=2000,
    language="python",
    metadata_template="coderagbench-swebench-lite"
)
``` -->

<!-- ## Core Functions

### Preprocessing Functions

```python
from astchunk.preprocessing import preprocess_nws_count, get_nws_count, ByteRange

# Preprocess code for efficient size calculation
code_bytes = code.encode('utf-8')
nws_cumsum = preprocess_nws_count(code_bytes)

# Get non-whitespace character count for any byte range
byte_range = ByteRange(0, 100)  # First 100 bytes
char_count = get_nws_count(nws_cumsum, byte_range)
```

### Direct AST Processing

```python
from astchunk.astnode import ASTNode
from astchunk.astchunk import ASTChunk

# Work directly with AST nodes and chunks for custom processing
# (See API documentation for detailed usage)
``` -->

## Supported Languages

| Language   | File Extensions | Status |
|------------|----------------|---------|
| Python     | `.py`          | ✅ Full support |
| Java       | `.java`        | ✅ Full support |
| C#         | `.cs`          | ✅ Full support |
| TypeScript | `.ts`, `.tsx`  | ✅ Full support |
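
When chunking a mixed-language repository, the table above can be turned into a lookup from file extension to the `language` setting. The sketch below is illustrative: `EXTENSION_TO_LANGUAGE` and `detect_language` are hypothetical names, not part of ASTChunk, and the language strings are taken from the identifiers listed in the Quick Start comment (`python`, `java`, `csharp`, `typescript`).

```python
from pathlib import Path
from typing import Optional

# Extension-to-language map built from the Supported Languages table.
EXTENSION_TO_LANGUAGE = {
    ".py": "python",
    ".java": "java",
    ".cs": "csharp",
    ".ts": "typescript",
    ".tsx": "typescript",
}


def detect_language(path: str) -> Optional[str]:
    """Return the language string for a file path, or None if the extension is unsupported."""
    return EXTENSION_TO_LANGUAGE.get(Path(path).suffix.lower())


for candidate in ["src/calculator.py", "src/App.tsx", "README.md"]:
    print(candidate, detect_language(candidate))
```

Files for which `detect_language` returns `None` can simply be skipped, and one builder per detected language can be reused across files.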

<!-- ## Contributing

We welcome contributions! Please see our [contributing guidelines](<CONTRIBUTING_URL>) for details. -->

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Version

Current version: 0.1.0
