Metadata-Version: 2.1
Name: air-benchmark
Version: 0.0.2
Summary: AIR-Bench: Automated Heterogeneous Information Retrieval Benchmark
Author-email: BAAI <zhengliu1026@gmail.com>, Jina AI <research@jina.ai>
License: MIT License
        
        Copyright (c) 2024 AIR-Bench
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
        
Project-URL: homepage, https://github.com/AIR-Bench/AIR-Bench/tree/main
Project-URL: Huggingface Organization, https://huggingface.co/AIR-Bench
Project-URL: Leaderboard, https://huggingface.co/spaces/AIR-Bench/leaderboard
Keywords: embedding,benchmark,air-bench,reranker,information retrieval
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Information Technology
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: datasets >=2.18.0
Requires-Dist: mteb >=1.7.17
Requires-Dist: torch >=1.6.0
Requires-Dist: transformers >=4.33.0
Requires-Dist: rich >=13.7.1
Provides-Extra: dev
Requires-Dist: black ; extra == 'dev'
Requires-Dist: isort ; extra == 'dev'
Requires-Dist: pytest ; extra == 'dev'

<h1 align="center"> AIR-Bench: Automated Heterogeneous Information Retrieval Benchmark </h1>

<h4 align="center">
    <p>
        <a href="#introduction">Introduction</a> |
        <a href="#documentation">Documentation</a> |
        <a href="https://huggingface.co/spaces/AIR-Bench/leaderboard">Leaderboard</a> |
        <a href="#citing">Citing</a>
    </p>
</h4>

<h3 align="center">
    <a href="https://huggingface.co/spaces/AIR-Bench/leaderboard"><img style="float: middle; padding: 10px 10px 10px 10px;" width="60" height="55" src="./docs/images/hf_logo.png" /></a>
</h3>

## Introduction

### Background & Motivation

Evaluation is crucial for the development of information retrieval models. In recent years, a series of milestone benchmarks have been introduced to the community, such as MSMARCO and Natural Questions (open-domain QA), MIRACL (multilingual retrieval), and BEIR and MTEB (general-domain zero-shot retrieval). However, the existing benchmarks are severely limited in the following respects.

- **Inability to handle new domains**. All of the existing benchmarks are static: they are built for predefined domains from human-labeled data. Therefore, they cannot cover new domains that users are interested in.
- **Potential risk of over-fitting and data leakage**. Existing retrievers are intensively fine-tuned to achieve strong performance on popular benchmarks such as BEIR and MTEB. Although these benchmarks were originally designed for zero-shot, out-of-domain (O.O.D.) evaluation, in-domain training data is widely used during fine-tuning. Worse, because the existing evaluation datasets are publicly available, testing data can be mixed into a retriever's training set by mistake.

### Features of AIR-Bench

The new benchmark has the following features.

- **Automated**. The testing data is automatically generated by large language models without human intervention. Therefore, it can instantly support the evaluation of new domains at a very small cost. Besides, it is almost impossible for the new testing data to be covered by the training sets of any existing retrievers.
- **Heterogeneous and Dynamic**. The testing data is generated for diverse and constantly expanding domains and languages (i.e., multi-domain and multilingual). As a result, it provides an increasingly comprehensive evaluation benchmark for community developers.
- **Retrieval and RAG-oriented**. The new benchmark is dedicated to the evaluation of retrieval performance. In addition to the typical evaluation scenarios, like open-domain question answering or paraphrase retrieval, the new benchmark also incorporates a new setting called inner-document retrieval, which is closely related to today's LLM and RAG applications. In this new setting, the model is expected to retrieve the relevant chunks of a very long document, which contain the critical information needed to answer the input question.
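
To make the inner-document retrieval setting concrete, here is a minimal toy sketch in Python. It is not how AIR-Bench generates or evaluates data: the fixed-size word chunking, the bag-of-words cosine scoring, and every function name (`split_into_chunks`, `retrieve_chunks`, and so on) are assumptions made purely for illustration. A real retriever would replace the scoring function with an embedding model.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def split_into_chunks(text: str, chunk_size: int = 40) -> List[str]:
    """Split a document into consecutive chunks of at most `chunk_size` words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def bag_of_words(text: str) -> Counter:
    """Lowercased token counts; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve_chunks(question: str, document: str, top_k: int = 1,
                    chunk_size: int = 40) -> List[Tuple[float, str]]:
    """Return the `top_k` chunks of one long document, ranked by similarity to the question."""
    query = bag_of_words(question)
    scored = [(cosine(query, bag_of_words(chunk)), chunk)
              for chunk in split_into_chunks(document, chunk_size)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]


# Hypothetical document and question, invented for this example.
document = (
    "The warranty covers manufacturing defects for two years. "
    "Shipping usually takes five business days within Europe. "
    "Refunds are issued to the original payment method within ten days."
)
top = retrieve_chunks("How long does shipping take?", document, top_k=1, chunk_size=8)
print(top[0][1])
```

The key difference from open-domain retrieval is the candidate pool: all candidates are chunks of the same long document, so the model must separate the passage that answers the question from closely related neighbouring text.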

## Documentation

| Documentation                                                |                                                              |
| ------------------------------------------------------------ | ------------------------------------------------------------ |
| 🏭 [Pipeline](./docs/data_generation.md)                                               | The data generation pipeline of AIR-Bench                    |
| 📋 [Tasks](./docs/avaliable_tasks.md)                         | Overview of available tasks in AIR-Bench                     |
| 📈 [Leaderboard](https://huggingface.co/spaces/AIR-Bench/leaderboard) | The interactive leaderboard of AIR-Bench                     |
| 🚀 [Submit](./docs/submit_to_leaderboard.md)                  | How to submit a model to the AIR-Bench leaderboard           |
| 🤝 [Contributing](./docs/community_contribution.md)           | How to contribute to AIR-Bench                               |

## Available Evaluation Results

Detailed results are available [here](./docs/avaliable_evaluation_results.md).

Analysis of the results:

- **AIR-Bench performance scales with model size**. For example, `multilingual-e5-large` is better than `multilingual-e5-base` and `multilingual-e5-base` is better than `multilingual-e5-small`. This can also be observed in `bge-large-en-v1.5`, `bge-base-en-v1.5` and `bge-small-en-v1.5`.
- **The generated dataset maintains good consistency with the human-labeled dataset**. The Spearman correlation between the rankings on the original MSMARCO dataset and the generated MSMARCO dataset is 0.8945.
- **Model performance varies across domains**. For example, `e5-mistral-7b-instruct` is better than `bge-m3` in the healthcare domain, but worse than `bge-m3` in the law domain.
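
The consistency check in the second point compares how two versions of a dataset rank the same set of models. The sketch below shows one way to compute a Spearman correlation in plain Python. The model names and scores are hypothetical placeholders, not AIR-Bench results, and the helper functions are illustrative rather than part of the `air-benchmark` package.

```python
from typing import List


def average_ranks(values: List[float]) -> List[float]:
    """Rank values from 1 (largest) upward, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def spearman(x: List[float], y: List[float]) -> float:
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally sized lists with at least two items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when all values are tied")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical scores for four placeholder models on the two dataset versions.
human_labeled = {"model_a": 0.42, "model_b": 0.39, "model_c": 0.35, "model_d": 0.30}
generated = {"model_a": 0.55, "model_b": 0.57, "model_c": 0.48, "model_d": 0.41}

models = sorted(human_labeled)
rho = spearman([human_labeled[m] for m in models], [generated[m] for m in models])
print(f"Spearman correlation: {rho:.4f}")  # prints 0.8000 for this toy data
```

A value close to 1 means the generated dataset orders the models almost the same way as the human-labeled one; in the toy data only the top two models swap places.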

## Future Work

- More datasets will be generated to cover additional domains and languages.

## Acknowledgement


## Citing

```
```
