Metadata-Version: 2.4
Name: arize-phoenix-evals
Version: 0.27.0
Summary: LLM Evaluations
Project-URL: Documentation, https://arize.com/docs/phoenix/
Project-URL: Issues, https://github.com/Arize-ai/phoenix/issues
Project-URL: Source, https://github.com/Arize-ai/phoenix
Author-email: Arize AI <phoenix-devs@arize.com>
License: Elastic-2.0
License-File: IP_NOTICE
License-File: LICENSE
Keywords: Explainability,Monitoring,Observability
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Requires-Python: <3.14,>=3.8
Requires-Dist: pandas
Requires-Dist: pystache
Requires-Dist: tqdm
Requires-Dist: typing-extensions<5,>=4.5
Provides-Extra: dev
Requires-Dist: anthropic>0.18.0; extra == 'dev'
Requires-Dist: boto3; extra == 'dev'
Requires-Dist: litellm>=1.28.9; extra == 'dev'
Requires-Dist: mistralai>=1.0.0; extra == 'dev'
Requires-Dist: openai>=1.0.0; extra == 'dev'
Requires-Dist: vertexai; extra == 'dev'
Provides-Extra: test
Requires-Dist: anthropic>=0.18.0; extra == 'test'
Requires-Dist: boto3; extra == 'test'
Requires-Dist: lameenc; extra == 'test'
Requires-Dist: litellm>=1.28.9; extra == 'test'
Requires-Dist: mistralai>=1.0.0; extra == 'test'
Requires-Dist: nest-asyncio; extra == 'test'
Requires-Dist: openai>=1.0.0; extra == 'test'
Requires-Dist: openinference-semantic-conventions; extra == 'test'
Requires-Dist: pandas; extra == 'test'
Requires-Dist: pandas-stubs<=2.0.2.230605; extra == 'test'
Requires-Dist: respx; extra == 'test'
Requires-Dist: tqdm; extra == 'test'
Requires-Dist: types-tqdm; extra == 'test'
Requires-Dist: typing-extensions<5,>=4.5; extra == 'test'
Requires-Dist: vertexai; extra == 'test'
Description-Content-Type: text/markdown

# arize-phoenix-evals

Phoenix provides tooling to evaluate LLM applications, including tools to determine whether documents retrieved by a retrieval-augmented generation (RAG) application are relevant, whether a response is toxic, and much more.

Phoenix's approach to LLM evals is notable for the following reasons:

- Includes pre-tested templates and convenience functions for a set of common Eval "tasks"
- Data science rigor applied to the testing of model and template combinations
- Designed to run as fast as possible on batches of data
- Includes benchmark datasets and tests for each eval function

## Installation

Install the `arize-phoenix-evals` sub-package via `pip`:

```shell
pip install arize-phoenix-evals
```

Note that you will also need to install the SDK of the LLM vendor you want to use with LLM Evals. For example, to use OpenAI's GPT-4, install the OpenAI Python SDK:

```shell
pip install 'openai>=1.0.0'
```
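If you are unsure whether a vendor SDK is already available in your environment, a quick check along the following lines can help. This is a minimal sketch using only the Python standard library; the helper name `is_installed` is hypothetical and not part of `phoenix.evals`.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    # Hypothetical helper for illustration; not provided by phoenix.evals.
    return importlib.util.find_spec(module_name) is not None


if not is_installed("openai"):
    print("openai is not installed; run: pip install 'openai>=1.0.0'")
```

The same check works for any other vendor SDK by passing its import name instead of `"openai"`.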

## Usage

Here is an example of running the RAG relevance eval on a dataset of Wikipedia questions and answers. The example uses scikit-learn to compute classification metrics, so install it via `pip` first:

```shell
pip install scikit-learn
```

```python
import os

from phoenix.evals import (
    RAG_RELEVANCY_PROMPT_TEMPLATE,
    RAG_RELEVANCY_PROMPT_RAILS_MAP,
    OpenAIModel,
    download_benchmark_dataset,
    llm_classify,
)
from sklearn.metrics import precision_recall_fscore_support

os.environ["OPENAI_API_KEY"] = "<your-openai-key>"

# Choose a model to evaluate on question-answering relevancy classification
model = OpenAIModel(
    model="o3-mini",
    temperature=0.0,
)

# Choose 100 examples from a small dataset of question-answer pairs
df = download_benchmark_dataset(
    task="binary-relevance-classification", dataset_name="wiki_qa-train"
)
df = df.sample(100)
df = df.rename(
    columns={
        "query_text": "input",
        "document_text": "reference",
    },
)

# Use the language model to classify each example in the dataset
rails_map = RAG_RELEVANCY_PROMPT_RAILS_MAP
class_names = list(rails_map.values())
result_df = llm_classify(df, model, RAG_RELEVANCY_PROMPT_TEMPLATE, class_names)

# Map the true labels to the class names for comparison
y_true = df["relevant"].map(rails_map)
# Get the labels generated by the model being evaluated
y_pred = result_df["label"]

# Evaluate the classification results of the model
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=class_names
)
print("Classification Results:")
for idx, label in enumerate(class_names):
    print(f"Class: {label} (count: {support[idx]})")
    print(f"  Precision: {precision[idx]:.2f}")
    print(f"  Recall:    {recall[idx]:.2f}")
    print(f"  F1 Score:  {f1[idx]:.2f}\n")
```
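To sanity-check the label mapping and metric step without calling a model, the same logic can be sketched offline. The rails map values, ground-truth labels, and predictions below are hand-written assumptions for illustration, and `per_class_scores` is a hypothetical helper that stands in for scikit-learn's `precision_recall_fscore_support`.

```python
from collections import OrderedDict

# Illustrative rails map (an assumption); the real one is imported from phoenix.evals.
rails_map = OrderedDict({True: "relevant", False: "unrelated"})
class_names = list(rails_map.values())

# Hand-written stand-ins for df["relevant"] and result_df["label"].
ground_truth = [True, True, False, False, True]
y_pred = ["relevant", "unrelated", "unrelated", "relevant", "relevant"]

# Map the boolean ground truth onto the same class names the model emits.
y_true = [rails_map[value] for value in ground_truth]


def per_class_scores(y_true, y_pred, label):
    """Return (precision, recall, f1, support) for a single class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1, tp + fn


scores = {label: per_class_scores(y_true, y_pred, label) for label in class_names}
for label, (precision, recall, f1, support) in scores.items():
    print(f"Class: {label} (count: {support})")
    print(f"  Precision: {precision:.2f}")
    print(f"  Recall:    {recall:.2f}")
    print(f"  F1 Score:  {f1:.2f}")
```

Because the scores are computed per class name, any predicted label that is not in `class_names` counts as a miss for the true class and does not get a row of its own.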

To learn more about LLM Evals, see the [LLM Evals documentation](https://arize.com/docs/phoenix/concepts/llm-evals/).
