Metadata-Version: 2.1
Name: airb
Version: 0.0.1
Summary: AIR-Bench: Automated Heterogeneous Information Retrieval Benchmark
Home-page: https://github.com/AIR-Bench/AIR-Bench
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: datasets >=2.18.0
Requires-Dist: mteb >=1.7.17
Requires-Dist: torch >=1.6.0
Requires-Dist: transformers >=4.33.0

<h1 align="center"> AIR-Bench: Automated Heterogeneous Information Retrieval Benchmark </h1>

<h4 align="center">
    <p>
        <a href="#introduction">Introduction</a> |
        <a href="#documentation">Documentation</a> |
        <a href="https://huggingface.co/spaces/AIR-Bench/leaderboard">Leaderboard</a> |
        <a href="#citing">Citing</a>
    </p>
</h4>

<h3 align="center">
    <a href="https://huggingface.co/spaces/AIR-Bench/leaderboard"><img style="float: middle; padding: 10px 10px 10px 10px;" width="60" height="55" src="./docs/images/hf_logo.png" /></a>
</h3>

## Introduction

### Background & Motivation

Evaluation is crucial for the development of information retrieval models. In recent years, a series of milestone benchmarks have been introduced to the community, such as MSMARCO and Natural Questions (open-domain QA), MIRACL (multilingual retrieval), and BEIR and MTEB (general-domain zero-shot retrieval). However, the existing benchmarks are severely limited in the following respects.

- **Inability to handle new domains**. All of the existing benchmarks are static: they are built for predefined domains from human-labeled data. They therefore cannot cover new domains that users are interested in.
- **Potential risk of over-fitting and data leakage**. Existing retrievers are intensively fine-tuned to achieve strong performance on popular benchmarks such as BEIR and MTEB. Although these benchmarks were originally designed for zero-shot, out-of-domain (OOD) evaluation, in-domain training data is widely used during fine-tuning. Worse, because the existing evaluation datasets are publicly available, test data can be mixed into a retriever's training set by mistake.

### Features of AIR-Bench

AIR-Bench is distinguished by the following features.

- **Automated**. The testing data is generated automatically by large language models without human intervention. The benchmark can therefore support the evaluation of new domains quickly and at very low cost. In addition, the newly generated testing data is very unlikely to appear in the training sets of any existing retriever.
- **Heterogeneous and Dynamic**. The testing data is generated for diverse and constantly expanding domains and languages (i.e., multi-domain and multilingual). As a result, it provides an increasingly comprehensive evaluation benchmark for the community.
- **Retrieval and RAG-oriented**. The new benchmark is dedicated to the evaluation of retrieval performance. In addition to typical evaluation scenarios, such as open-domain question answering or paraphrase retrieval, the benchmark also incorporates a new setting called inner-document retrieval, which is closely related to today's LLM and RAG applications. In this setting, the model is expected to retrieve the chunks of a very long document that contain the critical information needed to answer the input question.
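
The inner-document retrieval setting described above can be illustrated with a small toy sketch. This is not the AIR-Bench pipeline or API: the helper names (`chunk_words`, `bow`, `cosine`, `retrieve`), the fixed-size word chunking, and the bag-of-words cosine scorer are all assumptions chosen to keep the example self-contained. A real retriever would replace the scorer with a learned embedding model.

```python
import math
import re
from collections import Counter


def chunk_words(text, size):
    """Split a document into consecutive chunks of at most `size` words (hypothetical helper)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def bow(text):
    """Lowercased bag-of-words counts; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question, chunks, k=1):
    """Return the indices of the k chunks most similar to the question."""
    q = bow(question)
    ranked = sorted(enumerate(chunks), key=lambda pair: cosine(q, bow(pair[1])), reverse=True)
    return [idx for idx, _ in ranked[:k]]


# Toy "long document" (illustrative text, not benchmark data).
document = (
    "The warranty covers parts for two years. "
    "Shipping takes five business days within Europe. "
    "Refunds are issued to the original payment method."
)
chunks = chunk_words(document, size=7)
question = "How long does shipping take within Europe?"
top = retrieve(question, chunks, k=1)
print(chunks[top[0]])
```

The point of the sketch is the shape of the task: the candidates are chunks of a single long document rather than separate documents in a corpus, and the model is judged on whether it surfaces the chunk that answers the question.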

## Documentation

| Documentation                                                | Description                                                  |
| ------------------------------------------------------------ | ------------------------------------------------------------ |
| 🏭 [Pipeline](./docs/data_generation.md)                                               | The data generation pipeline of AIR-Bench                    |
| 📋 [Tasks](./docs/avaliable_tasks.md)                         | Overview of available tasks in AIR-Bench                     |
| 📈 [Leaderboard](https://huggingface.co/spaces/AIR-Bench/leaderboard) | The interactive leaderboard of AIR-Bench                     |
| 🚀 [Submit](./docs/submit_to_leaderboard.md)                  | How to submit a model to the AIR-Bench leaderboard           |
| 🤝 [Contributing](./docs/community_contribution.md)           | How to contribute to AIR-Bench                               |

## Available Evaluation Results

Detailed results are available [here](./docs/avaliable_evaluation_results.md).

Analysis of the results:

- **AIR-Bench performance scales with model size**. For example, `multilingual-e5-large` outperforms `multilingual-e5-base`, which in turn outperforms `multilingual-e5-small`. The same trend can be observed across `bge-large-en-v1.5`, `bge-base-en-v1.5` and `bge-small-en-v1.5`.
- **The generated dataset is consistent with the human-labeled dataset**. The Spearman correlation between the model rankings on the original MSMARCO dataset and on the generated MSMARCO dataset is 0.8945.
- **Model performance varies across domains**. For example, `e5-mistral-7b-instruct` is better than `bge-m3` in the healthcare domain, but worse than `bge-m3` in the law domain.
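
As a rough illustration of the consistency check mentioned above, the sketch below computes a Spearman rank correlation between two sets of model scores. The scores are made-up placeholders, not AIR-Bench results, and the function names are hypothetical. The closed-form formula used here assumes there are no tied scores; a library routine such as `scipy.stats.spearmanr` handles ties.

```python
def ranks(scores):
    """Return 1-based ranks (1 = highest score). Assumes no tied scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    result = [0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(scores_a, scores_b):
    """Spearman correlation via 1 - 6 * sum(d^2) / (n * (n^2 - 1)), valid without ties."""
    n = len(scores_a)
    if n != len(scores_b) or n < 2:
        raise ValueError("need two score lists of equal length >= 2")
    ranks_a, ranks_b = ranks(scores_a), ranks(scores_b)
    d_squared = sum((x - y) ** 2 for x, y in zip(ranks_a, ranks_b))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical scores of five retrievers on a human-labeled set and a generated set.
original = [0.42, 0.39, 0.35, 0.31, 0.28]
generated = [0.55, 0.57, 0.49, 0.44, 0.46]
rho = spearman(original, generated)
print(f"Spearman correlation: {rho:.4f}")
```

A value close to 1 means the two datasets rank the models in nearly the same order, which is the sense in which the generated data is said to agree with the human-labeled data.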

## Future Work

- More datasets will be generated to cover more domains and languages in the future.

## Acknowledgement


## Citing

```
```
