Metadata-Version: 2.4
Name: GAICo
Version: 0.1.2
Summary: A Python library for reference-based metrics to compare generated LLM text with ground truth text
Project-URL: Homepage, https://github.com/ai4society/GenAIResultsComparator
Project-URL: Documentation, https://ai4society.github.io/projects/GenAIResultsComparator/index.html
Project-URL: Repository, https://github.com/ai4society/GenAIResultsComparator
Project-URL: Bug Tracker, https://github.com/ai4society/GenAIResultsComparator/issues
Author-email: Nitin Gupta <nitin1209@gmail.com>, Pallav Koppisetti <pallav.koppisetti5@gmail.com>, Biplav Srivastava <prof.biplav@gmail.com>
Maintainer-email: AI4Society Team <ai4societyteam@gmail.com>
License: MIT License
        
        Copyright (c) 2024 AI for Society Research Group
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
License-File: LICENSE
Keywords: evaluation,generative-ai,llm,metrics,nlp,text-comparison
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: <3.13,>=3.10
Requires-Dist: bert-score>=0.3.13
Requires-Dist: levenshtein>=0.27.1
Requires-Dist: nltk>=3.9.1
Requires-Dist: numpy>=2.2.5
Requires-Dist: pandas>=2.2.3
Requires-Dist: rouge-score>=0.1.2
Requires-Dist: scikit-learn>=1.6.1
Requires-Dist: scipy>=1.15.3
Requires-Dist: sentence-transformers>=4.1.0
Provides-Extra: visualization
Requires-Dist: matplotlib>=3.10.3; extra == 'visualization'
Requires-Dist: seaborn>=0.13.2; extra == 'visualization'
Description-Content-Type: text/markdown

# GAICo: GenAI Results Comparator

_GAICo_ is a Python library of evaluation metrics for comparing generated texts, particularly outputs from Large Language Models (LLMs), typically against reference or ground truth texts.

View the documentation at [ai4society.github.io/projects/GenAIResultsComparator/index.html](https://ai4society.github.io/projects/GenAIResultsComparator/index.html).

## Quick Start

GAICo makes it easy to evaluate and compare LLM outputs. For detailed, runnable examples, please refer to our Jupyter Notebooks in the [`examples/`](examples/) folder:

- [`quickstart.ipynb`](examples/quickstart.ipynb): Rapid hands-on with the _Experiment_ sub-module.
- [`example-1.ipynb`](examples/example-1.ipynb): For fine-grained usage, this notebook focuses on comparing **multiple model outputs** using a **single metric**.
- [`example-2.ipynb`](examples/example-2.ipynb): For fine-grained usage, this notebook demonstrates evaluating a **single model output** across **all available metrics**.

## Streamlined Workflow with _`Experiment`_

For a more integrated approach to comparing multiple models, applying thresholds, generating plots, and creating CSV reports, the `Experiment` class offers a convenient abstraction.

### Quick Example

This example demonstrates comparing multiple LLM responses against a reference answer using specified metrics, generating a plot, and outputting a CSV report.

```python
from gaico import Experiment

# Sample data from https://arxiv.org/abs/2504.07995
llm_responses = {
    "Google": "Title: Jimmy Kimmel Reacts to Donald Trump Winning the Presidential ... Snippet: Nov 6, 2024 ...",
    "Mixtral 8x7b": "I'm an AI and I don't have the ability to predict the outcome of elections.",
    "SafeChat": "Sorry, I am designed not to answer such a question.",
}
reference_answer = "Sorry, I am unable to answer such a question as it is not appropriate."

# 1. Initialize Experiment
exp = Experiment(
    llm_responses=llm_responses,
    reference_answer=reference_answer
)

# 2. Compare models using specific metrics
#   This will calculate scores for 'Jaccard' and 'ROUGE',
#   generate a plot (e.g., radar plot for multiple metrics/models),
#   and save a CSV report.
results_df = exp.compare(
    metrics=['Jaccard', 'ROUGE'],  # Specify metrics, or None for all defaults
    plot=True,
    output_csv_path="experiment_report.csv",
    custom_thresholds={"Jaccard": 0.6, "ROUGE_rouge1": 0.35} # Optional: override default thresholds
)

# The returned DataFrame contains the calculated scores
print("Scores DataFrame from compare():")
print(results_df)
```

This abstraction streamlines common evaluation tasks while still allowing access to the underlying metric classes and DataFrames for more advanced or customized use cases. More details are in [`examples/quickstart.ipynb`](examples/quickstart.ipynb).

If you want more granular control, or want to implement custom metrics, you can use the individual metric classes directly. See the remaining notebooks in the [`examples`](examples) subdirectory.
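To make the idea of applying thresholds concrete, here is a minimal sketch in plain Python. It assumes a score "passes" when it is greater than or equal to its threshold; the helper name `apply_thresholds`, the nested-dict layout, and the score values are hypothetical and are not taken from GAICo's API or output.

```python
def apply_thresholds(scores, thresholds):
    """Return {model: {metric: passed}} for every metric that has a threshold."""
    return {
        model: {
            metric: value >= thresholds[metric]  # Assumed rule: pass if score >= threshold.
            for metric, value in metric_scores.items()
            if metric in thresholds
        }
        for model, metric_scores in scores.items()
    }


# Hypothetical scores, for illustration only (not real GAICo output).
scores = {
    "model_a": {"Jaccard": 0.70, "ROUGE_rouge1": 0.30},
    "model_b": {"Jaccard": 0.20, "ROUGE_rouge1": 0.50},
}
thresholds = {"Jaccard": 0.6, "ROUGE_rouge1": 0.35}

print(apply_thresholds(scores, thresholds))
```

Metrics without an entry in `thresholds` are simply left out of the result in this sketch; the `Experiment` class may handle that case differently, for example by falling back to its default thresholds.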

<p align="center">
  <img src="https://raw.githubusercontent.com/ai4society/GenAIResultsComparator/refs/heads/main/examples/data/examples/example_2.png" alt="Sample Radar Chart showing multiple metrics for a single LLM" width="450"/>
  <br/><em>Example Radar Chart generated by the <code>examples/example-2.ipynb</code> notebook.</em>
</p>

## Description

The library provides a set of metrics that each take **two text strings as inputs**. **Outputs are normalized to a scale of 0 to 1**, where 1 indicates a perfect match between the two texts.

**_Class Structure:_** All metrics are implemented as classes, so new metrics are easy to add. Every metric derives from the `BaseMetric` class defined in `gaico/base.py`.

Each metric class inherits from this base class and is implemented with **just one required method**: `calculate()`.

This `calculate()` method takes two parameters:

- `generated_texts`: Either a string or an iterable of strings representing the texts generated by an LLM.
- `reference_texts`: Either a string or an iterable of strings representing the expected or reference texts.

If the inputs are iterables (lists, NumPy arrays, etc.), the method assumes a one-to-one mapping between the generated texts and the reference texts: the first generated text corresponds to the first reference text, and so on.
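The `calculate()` contract described above can be sketched without importing GAICo at all. The class below is a self-contained stand-in, not the library's own code: the name `JaccardSketch`, the word-level Jaccard scoring, and the choice to raise an error on mismatched inputs are all assumptions made for illustration.

```python
from collections.abc import Iterable
from typing import Union

TextInput = Union[str, Iterable[str]]


class JaccardSketch:
    """Illustrative stand-in for a metric class; not GAICo's implementation."""

    def _score_pair(self, generated: str, reference: str) -> float:
        # Word-level Jaccard similarity: |A ∩ B| / |A ∪ B|, always within [0, 1].
        gen_words = set(generated.lower().split())
        ref_words = set(reference.lower().split())
        if not gen_words and not ref_words:
            return 1.0  # Assumption: two empty texts count as a perfect match.
        return len(gen_words & ref_words) / len(gen_words | ref_words)

    def calculate(self, generated_texts: TextInput, reference_texts: TextInput):
        # Two plain strings yield a single score.
        if isinstance(generated_texts, str) and isinstance(reference_texts, str):
            return self._score_pair(generated_texts, reference_texts)
        # Assumption: mixing a string with an iterable is rejected in this sketch.
        if isinstance(generated_texts, str) or isinstance(reference_texts, str):
            raise TypeError("Pass two strings or two iterables of strings.")
        generated = list(generated_texts)
        references = list(reference_texts)
        if len(generated) != len(references):
            raise ValueError("Inputs must have the same length.")
        # One-to-one mapping: the i-th generated text is scored against the i-th reference.
        return [self._score_pair(g, r) for g, r in zip(generated, references)]


metric = JaccardSketch()
print(metric.calculate("the cat sat", "the cat sat"))    # 1.0
print(metric.calculate(["a b", "a b"], ["a b", "c d"]))  # [1.0, 0.0]
```

A real metric would inherit from `BaseMetric` rather than stand alone; check `gaico/base.py` for the exact interface before extending it.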

**_Note:_** While the library can be used to compare any strings, its main purpose is to help compare results from various LLMs.

**_Inspiration_** for the library and evaluation metrics was taken from [Microsoft's article on evaluating LLM-generated content](https://learn.microsoft.com/en-us/ai/playbook/technology-guidance/generative-ai/working-with-llms/evaluation/list-of-eval-metrics). In the article, Microsoft describes three categories of evaluation metrics: **(1)** Reference-based metrics, **(2)** Reference-free metrics, and **(3)** LLM-based metrics. _The library currently supports reference-based metrics._

<p align="center">
  <img src="https://raw.githubusercontent.com/ai4society/GenAIResultsComparator/refs/heads/main/gaico.drawio.png" alt="GAICo Overview">
</p>
<p align="center">
  <em>Overview of the workflow supported by the <i>GAICo</i> library</em>
</p>

## Table of Contents

- [Features](#features)
- [Installation](#installation)
- [Project Structure](#project-structure)
- [Running Tests](#running-tests)
- [Contributing](#contributing)
- [Citation](#citation)
- [Acknowledgments](#acknowledgments)
- [License](#license)
- [Contact](#contact)

## Features

- Implements various metrics for text comparison:
  - N-gram-based metrics (BLEU, ROUGE, JS divergence)
  - Text similarity metrics (Jaccard, Cosine, Levenshtein, Sequence Matcher)
  - Semantic similarity metrics (BERTScore)
- Supports batch processing for efficient computation
- Optimized for different input types (lists, numpy arrays, pandas Series)
- Extendable architecture for easy addition of new metrics
- Testing suite

## Installation

You can install GAICo directly from PyPI using pip:

```shell
pip install GAICo
```

To include optional dependencies for visualization features (matplotlib, seaborn), install with:

```shell
pip install GAICo[visualization]
```
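After installing, one way to confirm the setup is to query the installed distribution with the standard library. This is a hedged sketch: the helper names `installed_version` and `python_is_supported` are hypothetical, and the version bounds are copied from the package's declared `Requires-Python` (`>=3.10,<3.13`).

```python
import sys
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


def python_is_supported(version_info=sys.version_info):
    # The package metadata declares Requires-Python >=3.10,<3.13.
    return (3, 10) <= tuple(version_info[:2]) < (3, 13)


print("GAICo version:", installed_version("GAICo"))
print("Supported Python:", python_is_supported())
```

If the first line prints `None`, the package is not installed in the active environment.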

### For Developers (Installing from source)

If you want to contribute to GAICo or install it from source for development:

1. Clone the repository:

    ```shell
    git clone https://github.com/ai4society/GenAIResultsComparator.git
    cd GenAIResultsComparator
    ```

2. Set up a virtual environment and install dependencies:

    We recommend using [UV](https://docs.astral.sh/uv/#installation) for managing environments and dependencies.

    ```shell
    # Create a virtual environment (e.g., Python 3.12 recommended)
    uv venv
    # Activate the environment
    source .venv/bin/activate  # On Windows: .venv\Scripts\activate
    # Install the package in editable mode with development and visualization extras
    uv pip install -e ".[dev,visualization]"
    ```

    _If you don't want to use `uv`,_ you can install the dependencies with the following commands:

    ```shell
    # Create a virtual environment (e.g., Python 3.12 recommended)
    python3 -m venv .venv
    # Activate the environment
    source .venv/bin/activate  # On Windows: .venv\Scripts\activate
    # Install the package in editable mode with development and visualization extras
    pip install -e ".[dev,visualization]"
    ```

    *(Note: The `dev` extra includes dependencies for testing, linting, building, and documentation, as well as visualization dependencies.)*

3. Set up pre-commit hooks (optional but recommended for contributors):

    ```shell
    pre-commit install
    ```

## Project Structure

The project structure is as follows:

```shell
.
├── README.md
├── LICENSE
├── .gitignore
├── uv.lock
├── pyproject.toml
├── .pre-commit-config.yaml
├── gaico/        # Contains the library code
├── examples/     # Contains example scripts
├── tests/        # Contains test scripts
└── docs/         # Contains documentation files
```

### Code Style

We use `pre-commit` hooks to maintain code quality and consistency. The configuration for these hooks is in the `.pre-commit-config.yaml` file. These hooks run automatically on `git commit`, but you can also run them manually:

```shell
pre-commit run --all-files
```

## Running Tests

Navigate to the project root in your terminal and run:

```bash
uv run pytest
```

Or, for more verbose output:

```bash
uv run pytest -v
```

To skip the slow BERTScore tests:

```bash
uv run pytest -m "not bertscore"
```

To run only the slow BERTScore tests:

```bash
uv run pytest -m bertscore
```

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/FeatureName`)
3. Commit your changes (`git commit -m 'Add some FeatureName'`)
4. Push to the branch (`git push origin feature/FeatureName`)
5. Open a Pull Request

Please ensure that your code passes all tests and adheres to our code style guidelines (enforced by pre-commit hooks) before submitting a pull request.

## Citation

If you find this project useful, please consider citing it in your work:

```bibtex
@software{AI4Society_GAICo_GenAI_Results,
  author = {Gupta, Nitin and Koppisetti, Pallav and Srivastava, Biplav},
  license = {MIT},
  title = {{GAICo: GenAI Results Comparator}},
  year = {2025},
  url = {https://github.com/ai4society/GenAIResultsComparator}
}
```

## Acknowledgments

- The library is developed by [Nitin Gupta](https://github.com/g-nitin), [Pallav Koppisetti](https://github.com/pallavkoppisetti), and [Biplav Srivastava](https://github.com/biplav-s). Members of [AI4Society](https://ai4society.github.io) contributed to this tool as part of ongoing discussions. Major contributors are credited.
- This library uses several open-source packages including NLTK, scikit-learn, and others. Special thanks to the creators and maintainers of the implemented metrics.

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Contact

If you have any questions, feel free to reach out to us at [ai4societyteam@gmail.com](mailto:ai4societyteam@gmail.com).
