Metadata-Version: 2.1
Name: ASTAligner
Version: 1.0.0
Summary: ASTAligner is designed to align tokens from source code snippets to Abstract Syntax Tree (AST) nodes using Tree-sitter for AST generation and various HuggingFace tokenizers for language tokenization. The library supports a wide range of programming languages and Fast tokenizers, enabling precise mapping between source code elements and their AST representations.
Home-page: https://github.com/csci-435-fall-2024/csci-435-24_p4_ast
Author: Semeru Lab
Author-email: svelascodimate@wm.edu
License: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.7
Description-Content-Type: text/markdown
Requires-Dist: Flask==3.0.3
Requires-Dist: Flask-Cors==5.0.0
Requires-Dist: protobuf
Requires-Dist: sentencepiece==0.2.0
Requires-Dist: tokenizers==0.19.1
Requires-Dist: transformers==4.44.2
Requires-Dist: tree-sitter==0.23.0
Requires-Dist: tree-sitter-cpp==0.23.1
Requires-Dist: tree-sitter-java==0.23.2
Requires-Dist: tree-sitter-python==0.23.2
Requires-Dist: tree_sitter_c_sharp==0.23.1
Requires-Dist: tree_sitter_go==0.23.3
Requires-Dist: tree_sitter_haskell==0.23.1
Requires-Dist: tree_sitter_javascript==0.23.1
Requires-Dist: tree_sitter_kotlin==1.0.1
Requires-Dist: tree_sitter_rust==0.23.1
Requires-Dist: tree_sitter_html==0.23.2
Requires-Dist: tree_sitter_c==0.23.2
Requires-Dist: tree-sitter-ruby==0.23.1
Requires-Dist: concurrently

# AST-Alignment Tool

Aligns the tokens from a code snippet to their corresponding nodes in an AST representation.

## Description

A Large Language Model (LLM) is a type of AI model designed to understand and generate human-like text based on vast amounts of data. Trained on diverse source code datasets, LLMs can automate Software Engineering tasks across various contexts, such as code translation, code summarization, test-case generation, and code completion. A critical component of an LLM is the tokenizer, which breaks text down into smaller units, typically words or subwords, that the model can process. The tokenizer's role is essential because it converts source code into a format the model can understand, ensuring efficient and accurate code processing and generation. In the context of interpretability for AI, post-hoc techniques such as ASTScore rely on alignment functions (phi) to match the tokens generated by an LLM's tokenizer with their corresponding nodes in the AST representation of a snippet.

ASTAligner is designed to align tokens from source code snippets to Abstract Syntax Tree (AST) nodes using Tree-sitter for AST generation and various HuggingFace tokenizers for language tokenization. The library supports a wide range of programming languages and Fast tokenizers, enabling precise mapping between source code elements and their AST representations.

### Goals

This project has two goals: 

(1) Create a library for aligning the tokens from a code snippet to their corresponding nodes in the AST representation.

(2) Create a tool to visualize the alignment of the tokens with their matching AST. 

### Additional Information

For more information regarding this project's background and dependencies, please refer to these readings:

(1) [Evaluating and Explaining Large Language Models for Code Using Syntactic Structures](https://arxiv.org/abs/2308.03873)

(2) [Tree-Sitter Programming Language Parser](https://github.com/tree-sitter/tree-sitter)

(3) [Hugging Face Tokenizer](https://huggingface.co/learn/nlp-course/en/chapter2/4)

## Installation

Use the package manager [pip](https://pip.pypa.io/en/stable/) to install the ASTAligner package.

```bash
pip install ASTAligner
```

## Supported Features

* 12 supported languages
    * Python
    * C
    * C++
    * C#
    * Java
    * JavaScript
    * Ruby
    * HTML
    * Go
    * Kotlin
    * Rust
    * Haskell
* 6 Tokenizers
    * [Bert-Base-Uncased](https://huggingface.co/google-bert/bert-base-uncased)
    * [CodeLlama](https://huggingface.co/docs/transformers/main/en/model_doc/code_llama)
    * [GPT2](https://huggingface.co/docs/transformers/en/model_doc/gpt2)
    * [DialoGPT](https://huggingface.co/microsoft/DialoGPT-small)
    * [Roberta-Base](https://huggingface.co/FacebookAI/roberta-base)
    * [Qwen](https://huggingface.co/Qwen/Qwen2-7B-Instruct)

## Library Usage

### ASTalign
Using the ASTalign method asks that the user provide:
* A snippet of `code` as a string, or a filepath to a text file containing code.
* The `language` of the code snippet as one of the following strings:
    * python
    * c
    * cpp
    * csharp
    * java
    * javascript
    * ruby
    * html
    * go
    * kotlin
    * rust
    * haskell
* A `tokenizer` specification as one of the following strings:
    * codellama
    * gpt2
    * bert-base-uncased
    * roberta-base
    * dialogpt
    * qwen
* (Optional) An `include_whitespace_and_special_tokens` flag. Set to `False` by default, this flag controls whether whitespace and special tokens are included in the returned tokens.

The method returns a dictionary that maps each TSTree node to the list of tokens from the code snippet that overlap with that node. Example usage:

`alignments[node]` yields `[tok1, tok2, ... , tokn]`
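
The idea of aligning by overlap can be illustrated with a minimal sketch. It is not the library's implementation: the helper names `spans_overlap` and `align_tokens_to_nodes` are hypothetical, nodes are represented by plain string labels, and the character spans are written by hand instead of coming from a tokenizer or from Tree-sitter.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # half-open character span [start, end)


def spans_overlap(a: Span, b: Span) -> bool:
    """Return True when two half-open spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def align_tokens_to_nodes(
    token_spans: List[Tuple[str, Span]],
    node_spans: Dict[str, Span],
) -> Dict[str, List[str]]:
    """Map each node label to the tokens whose spans overlap the node's span."""
    return {
        label: [tok for tok, span in token_spans if spans_overlap(span, node_span)]
        for label, node_span in node_spans.items()
    }


# Hand-written spans for the snippet "x = y + z" (assumed for illustration).
tokens = [("x", (0, 1)), ("=", (2, 3)), ("y", (4, 5)), ("+", (6, 7)), ("z", (8, 9))]
nodes = {
    "assignment": (0, 9),
    "identifier_x": (0, 1),
    "binary_operator": (4, 9),
}

alignments_sketch = align_tokens_to_nodes(tokens, nodes)
print(alignments_sketch["binary_operator"])  # ['y', '+', 'z']
```

Each node label ends up with every token whose span intersects its own, so parent nodes collect the tokens of all their children.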

### printAlignmentsTree

The printAlignmentsTree method recursively prints out an entire tree with the provided node as the root of the tree. The method prints the type of each node and the tokens that are aligned to it. This method returns nothing.

Using the method asks that the user provide:
* A node inside the tree (such as the root node)
* The alignments object returned by ASTalign

Example usage:

```python
test = r"""x = y + z"""
alignments = ASTalign(test, 'python', "bert-base-uncased")
root = getRootNode(alignments)
printAlignmentsTree(root, alignments)
```
Output:
```
-> 0  'module'
      ['x', '=', 'y', '+', 'z']

    -> 1  'expression_statement'
          ['x', '=', 'y', '+', 'z']

        -> 2  'assignment'
              ['x', '=', 'y', '+', 'z']

            -> 3  'identifier'
                  ['x']

            -> 3  '='
                  ['=']

            -> 3  'binary_operator'
                  ['y', '+', 'z']

                -> 4  'identifier'
                      ['y']

                -> 4  '+'
                      ['+']

                -> 4  'identifier'
                      ['z']
```
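
A recursive traversal of this kind can be sketched as follows. This is an illustration under stated assumptions, not the library's code: `SketchNode` and `format_alignments_tree` are hypothetical stand-ins for a Tree-sitter node and for the printing routine, and the alignments dictionary is built by hand.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(eq=False)  # eq=False keeps identity-based hashing, so nodes can be dict keys
class SketchNode:
    """Hypothetical stand-in for a syntax tree node."""
    type: str
    children: List["SketchNode"] = field(default_factory=list)


def format_alignments_tree(
    node: SketchNode,
    alignments: Dict[SketchNode, List[str]],
    depth: int = 0,
) -> List[str]:
    """Return one line for the node type and one for its tokens, then recurse."""
    indent = "    " * depth
    lines = [
        f"{indent}-> {depth}  {node.type!r}",
        f"{indent}      {alignments.get(node, [])}",
    ]
    for child in node.children:
        lines.extend(format_alignments_tree(child, alignments, depth + 1))
    return lines


# Hand-built tree and alignments for "x = y + z" (assumed for illustration).
ident = SketchNode("identifier")
equals = SketchNode("=")
binop = SketchNode("binary_operator")
root = SketchNode("assignment", [ident, equals, binop])
sketch_alignments = {
    root: ["x", "=", "y", "+", "z"],
    ident: ["x"],
    equals: ["="],
    binop: ["y", "+", "z"],
}

tree_lines = format_alignments_tree(root, sketch_alignments)
print("\n".join(tree_lines))
```

The sketch returns the lines instead of printing them directly, which makes the traversal easy to check in isolation.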


### printAlignmentsNode

The printAlignmentsNode method prints out the type of the provided node and the tokens that are aligned to it. This method returns nothing.

Using the method asks that the user provide:
* A node inside the tree (such as the root node)
* The alignments object returned by the ASTalign method

Example Usage:

```python
test = r"""x = y + z"""
alignments = ASTalign(test, 'python', "bert-base-uncased")
root = getRootNode(alignments)
printAlignmentsNode(root, alignments)
```
Output:
```
module
['x', '=', 'y', '+', 'z']
```

### getRootNode

The getRootNode method returns the root node of the tree from the provided alignments object.

Using the method asks that the user provide:
* An alignments object created by the ASTalign method

Example Usage:
```python
test = r"""x = y + z"""
alignments = ASTalign(test, 'python', "bert-base-uncased")
root = getRootNode(alignments)
```


### rangeFinder

The rangeFinder method returns the index range [start, end) that a TSTree node occupies in a string, where the start index is inclusive and the end index is exclusive.

Using the method asks that the user provide:

* The `range` of a TSTree node.
* A snippet of `code` as a string, or a filepath to a text file containing code.

Example usage:

If a node `identifier_node` corresponds to `num` in the code string `snippet = "num = 1"`, then 
```python
rangeFinder(identifier_node.range, snippet)
```
yields tuple `(0, 3)`.
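
One way such a conversion could work is sketched below. It assumes the node's range is available as zero-based `(row, column)` start and end points and that the source is plain ASCII, so columns count characters. The helper names `point_to_index` and `points_to_index_range` are hypothetical and are not part of ASTAligner.

```python
from typing import Tuple

Point = Tuple[int, int]  # (row, column), both zero-based


def point_to_index(point: Point, code: str) -> int:
    """Convert a (row, column) position into a flat index into the code string."""
    row, column = point
    lines = code.split("\n")
    # Characters on all earlier lines, plus one newline character per earlier line.
    return sum(len(line) + 1 for line in lines[:row]) + column


def points_to_index_range(start: Point, end: Point, code: str) -> Tuple[int, int]:
    """Return the half-open index range [start, end) for a pair of points."""
    return point_to_index(start, code), point_to_index(end, code)


snippet = "num = 1"
print(points_to_index_range((0, 0), (0, 3), snippet))  # (0, 3)

# The same identifier on the second line of a two-line snippet.
two_lines = "a = 1\nnum = 2"
start, end = points_to_index_range((1, 0), (1, 3), two_lines)
print(two_lines[start:end])  # num
```

Because the end index is exclusive, the resulting pair can be used directly as a Python slice.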

### ASTtokenFinder

The ASTtokenFinder method takes an index range in a string of code (as tuple), a code snippet, a language, and a tokenizer, and returns a dictionary mapping nodes _whose text overlaps with the range_ to their tokens. 

Note that the method constructs an alignments dictionary from the provided code before selecting the target nodes from the resulting alignments.

Using the method asks that the user provide:

* An index `range` in a code string as a tuple covering [start, end).
* A snippet of `code` as a string, or a filepath to a text file containing code.
* The `language` of the code snippet as a string (see [ASTalign](#ASTalign) section for language strings).
* A `tokenizer` specification as a string (see [ASTalign](#ASTalign) section for tokenizer strings).

Example usage:

If a code string `snippet = "num = 1"` produces a tree of the form
```
| assignment_expr -> "num = 1"
--| identifier -> "num"
--| assignment_op -> "="
--| value -> "1"
```
then 
```python
ASTtokenFinder((0,3), snippet, language, tokenizer)
```
may yield 

```
{assignment_expr : ['num', '=', '1'], identifier : ['num']}
```

as the text of the `assignment_expr` and `identifier` nodes overlaps the range [0, 3) in the code string.
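
The selection step can be sketched as a filter over an existing alignments dictionary. This is a sketch under assumptions, not the library's code: the helper `select_overlapping` is hypothetical, nodes are represented by string labels, and the spans and alignments for `"num = 1"` are written by hand to match the tree above.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # half-open character span [start, end)


def select_overlapping(
    query: Span,
    node_spans: Dict[str, Span],
    alignments: Dict[str, List[str]],
) -> Dict[str, List[str]]:
    """Keep only the nodes whose span shares at least one character with the query."""
    start, end = query
    return {
        label: alignments[label]
        for label, (node_start, node_end) in node_spans.items()
        if node_start < end and start < node_end
    }


# Hand-written spans and alignments for "num = 1" (assumed for illustration).
node_spans = {
    "assignment_expr": (0, 7),
    "identifier": (0, 3),
    "assignment_op": (4, 5),
    "value": (6, 7),
}
alignments = {
    "assignment_expr": ["num", "=", "1"],
    "identifier": ["num"],
    "assignment_op": ["="],
    "value": ["1"],
}

selected = select_overlapping((0, 3), node_spans, alignments)
print(selected)  # {'assignment_expr': ['num', '=', '1'], 'identifier': ['num']}
```

The `assignment_op` and `value` nodes are dropped because their spans start at or after index 3, outside the queried range.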


## Contributing
Semeru Lab ASTAligner Team: Lillie Ayer, Cassie Baker, Daniel Biedron, Peter Buddendeck, Cristian Charette-Lopez, and Stephen Ramotowski

## License

[MIT](https://choosealicense.com/licenses/mit/)
