Metadata-Version: 2.1
Name: blendsql
Version: 0.0.16
Summary: Query language to blend SQL logic and LLM reasoning across multi-modal data.
Home-page: https://github.com/parkervg/blendsql
Author: Parker Glenn
Author-email: parkervg5@gmail.com
License: Apache License 2.0
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: outlines
Requires-Dist: pyparsing==3.1.1
Requires-Dist: pandas>=2.0.0
Requires-Dist: bottleneck>=1.3.6
Requires-Dist: python-dotenv==1.0.1
Requires-Dist: sqlglot==18.13.0
Requires-Dist: sqlalchemy>=2.0.0
Requires-Dist: huggingface_hub
Requires-Dist: lark
Requires-Dist: exrex
Requires-Dist: platformdirs
Requires-Dist: attrs
Requires-Dist: tqdm
Requires-Dist: colorama
Requires-Dist: tabulate
Requires-Dist: typeguard
Requires-Dist: rapidfuzz
Provides-Extra: llama-cpp
Requires-Dist: llama-cpp-python; extra == "llama-cpp"
Provides-Extra: ollama
Requires-Dist: ollama; extra == "ollama"
Provides-Extra: openai
Requires-Dist: openai>1.0.0; extra == "openai"
Provides-Extra: transformers
Requires-Dist: transformers>=4.0.0; extra == "transformers"
Requires-Dist: datasets; extra == "transformers"
Requires-Dist: torch>=2.3.0; extra == "transformers"
Provides-Extra: research
Requires-Dist: datasets==2.16.1; extra == "research"
Requires-Dist: nltk; extra == "research"
Requires-Dist: wikiextractor; extra == "research"
Requires-Dist: rouge_score; extra == "research"
Requires-Dist: rapidfuzz; extra == "research"
Requires-Dist: records; extra == "research"
Requires-Dist: recognizers-text; extra == "research"
Requires-Dist: recognizers-text-suite; extra == "research"
Requires-Dist: emoji==1.7.0; extra == "research"
Provides-Extra: test
Requires-Dist: pytest; extra == "test"
Requires-Dist: huggingface_hub; extra == "test"
Requires-Dist: pre-commit; extra == "test"
Provides-Extra: docs
Requires-Dist: mkdocs-material; extra == "docs"
Requires-Dist: mkdocstrings; extra == "docs"
Requires-Dist: mkdocs-section-index; extra == "docs"
Requires-Dist: mkdocstrings-python; extra == "docs"
Requires-Dist: mkdocs-jupyter; extra == "docs"
Provides-Extra: demo
Requires-Dist: chainlit; extra == "demo"

<div align="right">
<a href="https://opensource.org/licenses/Apache-2.0"><img src="https://img.shields.io/badge/License-Apache_2.0-blue.svg" /></a>
<a><img src="https://img.shields.io/github/last-commit/parkervg/blendsql?color=green"/></a>
<a><img src="https://img.shields.io/badge/PRs-Welcome-Green"/></a>
<br>
</div>

<div align="center"><picture>
  <source media="(prefers-color-scheme: dark)" srcset="docs/img/logo_dark.png">
  <img alt="blendsql" src="docs/img/logo_light.png" width="350">
</picture>
<p align="center">
    <i> SQL 🤝 LLMs </i>
  </p>
<b><h3>Check out our <a href="https://parkervg.github.io/blendsql/" target="_blank">online documentation</a> for a more comprehensive overview.</h3></b>

<i>Results from the paper are available [here](https://github.com/parkervg/blendsql/tree/research-paper/research/paper-results)</i>

</div>
<br/>

```
pip install blendsql
```

BlendSQL is a *superset of SQLite* for problem decomposition and hybrid question-answering with LLMs. 

As a result, we can *Blend* together...

- 🥤 ...operations over heterogeneous data sources (e.g. tables, text, images)
- 🥤 ...the structured & interpretable reasoning of SQL with the generalizable reasoning of LLMs

It can be viewed as an inversion of the typical text-to-SQL paradigm, in which a user calls an LLM and the LLM calls a SQL program.

**Now, the user is given control to oversee all calls (LLM + SQL) within a unified query language.**

![comparison](docs/img/comparison.jpg)

For example, imagine we have the following table, titled `parks`, containing [info on national parks in the United States](https://en.wikipedia.org/wiki/List_of_national_parks_of_the_United_States).

We can use BlendSQL to build a travel planning LLM chatbot to help us navigate the options below.


| **Name**        | **Image**                                                                       | **Location**       | **Area**                          | **Recreation Visitors (2022)** | **Description**                                                                                                                          |
|-----------------|---------------------------------------------------------------------------------|--------------------|-----------------------------------|--------------------------------|------------------------------------------------------------------------------------------------------------------------------------------|
| Death Valley    | ![death_valley.jpeg](./docs/img/national_parks_example/death_valley.jpeg)       | California, Nevada | 3,408,395.63 acres (13,793.3 km2) | 1,128,862                      | Death Valley is the hottest, lowest, and driest place in the United States, with daytime temperatures that have exceeded 130 °F (54 °C). |
| Gates of the Arctic | ![everglades.jpeg](./docs/img/national_parks_example/everglades.jpeg)           | Alaska             | 7,523,897.45 acres (30,448.1 km2) | 9,457                          | The country's northernmost park protects an expanse of pure wilderness in Alaska's Brooks Range and has no park facilities.              |
| New River Gorge | ![new_river_gorge.jpeg](./docs/img/national_parks_example/new_river_gorge.jpeg) | West Virginia      | 7,021 acres (28.4 km2)            | 1,593,523                      | The New River Gorge is the deepest river gorge east of the Mississippi River.                                                            |
| Katmai          | ![katmai.jpg](./docs/img/national_parks_example/katmai.jpg)                     | Alaska             | 3,674,529.33 acres (14,870.3 km2) | 33,908                         | This park on the Alaska Peninsula protects the Valley of Ten Thousand Smokes, an ash flow formed by the 1912 eruption of Novarupta.      |

BlendSQL allows us to ask the following questions by injecting "ingredients", which are callable functions denoted by double curly brackets (`{{`, `}}`).

_Which parks don't have park facilities?_
```sql
SELECT * FROM parks
    WHERE NOT {{
        LLMValidate(
            'Does this location have park facilities?',
            context=(SELECT "Name" AS "Park", "Description" FROM parks),
        )
    }}
```

_What does the largest park in Alaska look like?_

```sql
SELECT {{VQA('Describe this image.', 'parks::Image')}} FROM parks
    WHERE "Location" = 'Alaska'
    ORDER BY {{
        LLMMap(
            'Size in km2?',
            'parks::Area'
        )
    }} DESC LIMIT 1
```
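
To build intuition for what `LLMMap` contributes to the query above, the sketch below registers an ordinary Python function as a SQLite user-defined function and sorts by its output. It is only an analogy and not BlendSQL's implementation: `size_in_km2` is a hypothetical regex stand-in for the LLM call, the table holds just three of the rows shown earlier, and the query sorts the whole table rather than filtering by location.

```python
import re
import sqlite3


def size_in_km2(area: str) -> float:
    """Hypothetical stand-in for an LLM: pull the km2 figure out of an Area string."""
    match = re.search(r"\(([\d,.]+) km2\)", area)
    if match is None:
        raise ValueError(f"no km2 figure in {area!r}")
    return float(match.group(1).replace(",", ""))


conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE parks ("Name" TEXT, "Area" TEXT)')
conn.executemany(
    "INSERT INTO parks VALUES (?, ?)",
    [
        ("Death Valley", "3,408,395.63 acres (13,793.3 km2)"),
        ("New River Gorge", "7,021 acres (28.4 km2)"),
        ("Katmai", "3,674,529.33 acres (14,870.3 km2)"),
    ],
)
# Expose the Python function to SQL, much as an ingredient exposes a model call.
conn.create_function("size_in_km2", 1, size_in_km2)

# Sorting the raw text compares strings, so "7,021 acres ..." wins.
naive = conn.execute(
    'SELECT "Name" FROM parks ORDER BY "Area" DESC LIMIT 1'
).fetchone()[0]

# Sorting by the mapped value compares actual sizes.
mapped = conn.execute(
    'SELECT "Name" FROM parks ORDER BY size_in_km2("Area") DESC LIMIT 1'
).fetchone()[0]

print(naive)   # → New River Gorge
print(mapped)  # → Katmai
```

The point of the comparison is that the `Area` column is free text, so a per-row mapping step is needed before `ORDER BY` can do anything meaningful with it.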

_Which park protects an ash flow formed by a volcanic eruption?_

```sql
{{
    LLMQA(
      'Which park protects an ash flow formed by a volcano?',
      context=(SELECT "Name", "Description" FROM parks),
      options="parks::Name"
    ) 
}}
```

For in-depth descriptions of the above queries, check out our [documentation](https://parkervg.github.io/blendsql/).
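
As a rough illustration of the double-curly-bracket convention used in the queries above, the following sketch lists the ingredient names that appear in a query string. It is a simplified regex, not the parser BlendSQL actually uses, and `find_ingredients` is a hypothetical helper introduced only for this example. It assumes that no ingredient argument contains a literal `}}`.

```python
import re

# Assumption: an ingredient looks like {{ Name( ...args... ) }} and its
# arguments never contain a literal "}}".
INGREDIENT = re.compile(
    r"\{\{\s*(?P<name>\w+)\s*\((?P<args>.*?)\)\s*\}\}",
    re.DOTALL,
)


def find_ingredients(query: str) -> list[str]:
    """Return the ingredient names found in a query, in order of appearance."""
    return [match.group("name") for match in INGREDIENT.finditer(query)]


example_query = """
SELECT {{VQA('Describe this image.', 'parks::Image')}} FROM parks
    WHERE "Location" = 'Alaska'
    ORDER BY {{
        LLMMap(
            'Size in km2?',
            'parks::Area'
        )
    }} LIMIT 1
"""

print(find_ingredients(example_query))  # → ['VQA', 'LLMMap']
```

Everything outside the matched spans is ordinary SQL, which is what lets the surrounding query be handed to the database engine once the ingredient calls have been resolved.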

### Features

- Supports many DBMS 💾
  - Currently, SQLite and PostgreSQL are functional - more to come! 
- Easily extendable to [multi-modal use cases](./examples/vqa-ingredient.ipynb) 🖼️
- Smart parsing optimizes what is passed to external functions 🧠
  - Traverses abstract syntax tree with [sqlglot](https://github.com/tobymao/sqlglot) to minimize LLM function calls 🌳
- Constrained decoding with [outlines](https://github.com/outlines-dev/outlines) 🚀
- LLM function caching, built on [diskcache](https://grantjenks.com/docs/diskcache/) 🔑
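
Two of the ideas in the list above, narrowing what reaches the external function and caching its results, can be illustrated with a small stand-alone sketch. This is not BlendSQL's code: `fake_llm` is a hypothetical stand-in for a model call, the descriptions are shortened, and the in-memory `functools.lru_cache` takes the place of an on-disk cache.

```python
import sqlite3
from functools import lru_cache

calls = []  # records every input that actually reaches the "model"


@lru_cache(maxsize=None)
def fake_llm(description: str) -> bool:
    """Hypothetical model call: does the description mention an eruption?"""
    calls.append(description)
    return "eruption" in description


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parks (name TEXT, location TEXT, description TEXT)")
conn.executemany(
    "INSERT INTO parks VALUES (?, ?, ?)",
    [
        ("Death Valley", "California, Nevada", "Hottest, lowest, and driest place."),
        ("New River Gorge", "West Virginia", "Deepest river gorge east of the Mississippi."),
        ("Katmai", "Alaska", "Ash flow formed by the 1912 eruption of Novarupta."),
    ],
)

# Step 1: let plain SQL narrow the candidates before any model call is made.
rows = conn.execute(
    "SELECT name, description FROM parks WHERE location = 'Alaska'"
).fetchall()

# Step 2: call the model only on the rows that survived the SQL filter.
matches = [name for name, description in rows if fake_llm(description)]

# Step 3: a repeated call with the same input is served from the cache.
fake_llm(rows[0][1])

print(matches)     # → ['Katmai']
print(len(calls))  # → 1
```

Of the three rows, only one ever reaches the stand-in model, and the repeated call does not reach it at all.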

## Quickstart

```python
from blendsql import blend, LLMQA
from blendsql.db import SQLite
from blendsql.models import OpenaiLLM, TransformersLLM
from blendsql.utils import fetch_from_hub

blendsql = """
SELECT * FROM w
WHERE city = {{
    LLMQA(
        'Which city is located 120 miles west of Sydney?',
        (SELECT * FROM documents WHERE documents MATCH 'sydney OR 120'),
        options='w::city'
    )
}}
"""
# Make our smoothie - the executed BlendSQL script
smoothie = blend(
    query=blendsql,
    db=SQLite(fetch_from_hub("1884_New_Zealand_rugby_union_tour_of_New_South_Wales_1.db")),
    blender=OpenaiLLM("gpt-3.5-turbo"),
    # If you don't have OpenAI set up, you can use this small Transformers model below instead
    # blender=TransformersLLM("Qwen/Qwen1.5-0.5B"),
    ingredients={LLMQA},
    verbose=True
)
print(smoothie.df)
print(smoothie.meta.prompts)
```
<hr>

### Citation

```bibtex
@article{glenn2024blendsql,
      title={BlendSQL: A Scalable Dialect for Unifying Hybrid Question Answering in Relational Algebra},
      author={Parker Glenn and Parag Pravin Dakle and Liang Wang and Preethi Raghavan},
      year={2024},
      eprint={2402.17882},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

### Acknowledgements
Special thanks to those below for inspiring this project. We definitely recommend checking out the linked work and citing it when applicable!

- The authors of [Binding Language Models in Symbolic Languages](https://arxiv.org/abs/2210.02875)
  - This paper was the primary inspiration for BlendSQL.
- The authors of [EHRXQA: A Multi-Modal Question Answering Dataset for Electronic Health Records with Chest X-ray Images](https://arxiv.org/pdf/2310.18652)
  - As far as I can tell, the first publication to propose unifying model calls within SQL 
  - Served as the inspiration for the [vqa-ingredient.ipynb](./examples/vqa-ingredient.ipynb) example
- The authors of [Grammar Prompting for Domain-Specific Language Generation with Large Language Models](https://arxiv.org/abs/2305.19234)
