Metadata-Version: 2.1
Name: blendsql
Version: 0.0.16
Summary: Query language to blend SQL logic and LLM reasoning across multi-modal data.
Home-page: https://github.com/parkervg/blendsql
Author: Parker Glenn
Author-email: parkervg5@gmail.com
License: Apache License 2.0
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: outlines
Requires-Dist: pyparsing ==3.1.1
Requires-Dist: pandas >=2.0.0
Requires-Dist: bottleneck >=1.3.6
Requires-Dist: python-dotenv ==1.0.1
Requires-Dist: sqlglot ==18.13.0
Requires-Dist: sqlalchemy >=2.0.0
Requires-Dist: huggingface-hub
Requires-Dist: lark
Requires-Dist: exrex
Requires-Dist: platformdirs
Requires-Dist: attrs
Requires-Dist: tqdm
Requires-Dist: colorama
Requires-Dist: tabulate
Requires-Dist: typeguard
Requires-Dist: rapidfuzz
Provides-Extra: demo
Requires-Dist: chainlit ; extra == 'demo'
Provides-Extra: docs
Requires-Dist: mkdocs-material ; extra == 'docs'
Requires-Dist: mkdocstrings ; extra == 'docs'
Requires-Dist: mkdocs-section-index ; extra == 'docs'
Requires-Dist: mkdocstrings-python ; extra == 'docs'
Requires-Dist: mkdocs-jupyter ; extra == 'docs'
Provides-Extra: llama-cpp
Requires-Dist: llama-cpp-python ; extra == 'llama-cpp'
Provides-Extra: ollama
Requires-Dist: ollama ; extra == 'ollama'
Provides-Extra: openai
Requires-Dist: openai >1.0.0 ; extra == 'openai'
Provides-Extra: research
Requires-Dist: datasets ==2.16.1 ; extra == 'research'
Requires-Dist: nltk ; extra == 'research'
Requires-Dist: wikiextractor ; extra == 'research'
Requires-Dist: rouge-score ; extra == 'research'
Requires-Dist: rapidfuzz ; extra == 'research'
Requires-Dist: records ; extra == 'research'
Requires-Dist: recognizers-text ; extra == 'research'
Requires-Dist: recognizers-text-suite ; extra == 'research'
Requires-Dist: emoji ==1.7.0 ; extra == 'research'
Provides-Extra: test
Requires-Dist: pytest ; extra == 'test'
Requires-Dist: huggingface-hub ; extra == 'test'
Requires-Dist: pre-commit ; extra == 'test'
Provides-Extra: transformers
Requires-Dist: transformers >=4.0.0 ; extra == 'transformers'
Requires-Dist: datasets ; extra == 'transformers'
Requires-Dist: torch >=2.3.0 ; extra == 'transformers'

<div align="right">
<a href="https://opensource.org/licenses/Apache-2.0"><img src="https://img.shields.io/badge/License-Apache_2.0-blue.svg" /></a>
<a><img src="https://img.shields.io/github/last-commit/parkervg/blendsql?color=green"/></a>
<a><img src="https://img.shields.io/badge/PRs-Welcome-Green"/></a>
<br>
</div>

<div align="center"><picture>
  <source media="(prefers-color-scheme: dark)" srcset="docs/img/logo_dark.png">
  <img alt="blendsql" src="docs/img/logo_light.png" width="350">
</picture>
<p align="center">
    <i> SQL 🤝 LLMs </i>
  </p>
<b><h3>Check out our <a href="https://parkervg.github.io/blendsql/" target="_blank">online documentation</a> for a more comprehensive overview.</h3></b>

<i>Results from the paper are available [here](https://github.com/parkervg/blendsql/tree/research-paper/research/paper-results)</i>

</div>
<br/>

```
pip install blendsql
```

BlendSQL is a *superset of SQLite* for problem decomposition and hybrid question-answering with LLMs. 

As a result, we can *Blend* together...

- 🥤 ...operations over heterogeneous data sources (e.g. tables, text, images)
- 🥤 ...the structured & interpretable reasoning of SQL with the generalizable reasoning of LLMs

It can be viewed as an inversion of the typical text-to-SQL paradigm, in which a user calls an LLM and the LLM calls a SQL program.

**Now, the user is given the control to oversee all calls (LLM + SQL) within a unified query language.**

![comparison](docs/img/comparison.jpg)

For example, imagine we have the following table titled `parks`, containing [info on national parks in the United States](https://en.wikipedia.org/wiki/List_of_national_parks_of_the_United_States).

We can use BlendSQL to build a travel planning LLM chatbot to help us navigate the options below.


| **Name**        | **Image**                                                                       | **Location**       | **Area**                          | **Recreation Visitors (2022)** | **Description**                                                                                                                          |
|-----------------|---------------------------------------------------------------------------------|--------------------|-----------------------------------|--------------------------------|------------------------------------------------------------------------------------------------------------------------------------------|
| Death Valley    | ![death_valley.jpeg](./docs/img/national_parks_example/death_valley.jpeg)       | California, Nevada | 3,408,395.63 acres (13,793.3 km2) | 1,128,862                      | Death Valley is the hottest, lowest, and driest place in the United States, with daytime temperatures that have exceeded 130 °F (54 °C). |
| Gates of the Arctic | ![everglades.jpeg](./docs/img/national_parks_example/everglades.jpeg)           | Alaska             | 7,523,897.45 acres (30,448.1 km2) | 9,457                          | The country's northernmost park protects an expanse of pure wilderness in Alaska's Brooks Range and has no park facilities.              |
| New River Gorge | ![new_river_gorge.jpeg](./docs/img/national_parks_example/new_river_gorge.jpeg) | West Virginia      | 7,021 acres (28.4 km2)            | 1,593,523                      | The New River Gorge is the deepest river gorge east of the Mississippi River.                                                            |
| Katmai          | ![katmai.jpg](./docs/img/national_parks_example/katmai.jpg)                     | Alaska             | 3,674,529.33 acres (14,870.3 km2) | 33,908                         | This park on the Alaska Peninsula protects the Valley of Ten Thousand Smokes, an ash flow formed by the 1912 eruption of Novarupta.       |

BlendSQL allows us to ask the following questions by injecting "ingredients", which are callable functions denoted by double curly brackets (`{{`, `}}`).

_Which parks don't have park facilities?_

```sql
SELECT * FROM parks
    WHERE NOT {{
        LLMValidate(
            'Does this location have park facilities?',
            context=(SELECT "Name" AS "Park", "Description" FROM parks)
        )
    }}
```

_What does the largest park in Alaska look like?_

```sql
SELECT {{VQA('Describe this image.', 'parks::Image')}} FROM parks
    WHERE "Location" = 'Alaska'
    ORDER BY {{
        LLMMap(
            'Size in km2?',
            'parks::Area'
        )
    }} LIMIT 1
```

_Which park protects an ash flow formed by a volcanic eruption?_

```sql
{{
    LLMQA(
      'Which park protects an ash flow formed by a volcano?',
      context=(SELECT "Name", "Description" FROM parks),
      options="parks::Name"
    ) 
}}
```

For in-depth descriptions of the above queries, check out our [documentation](https://parkervg.github.io/blendsql/).
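
To make the `{{ ... }}` syntax concrete, here is a minimal sketch that lists the ingredient names appearing in a query string. It is an illustration only and not how BlendSQL parses queries: the `find_ingredient_names` helper and its regular expression are assumptions made for this example, and they do not handle edge cases such as braces inside string literals.

```python
import re

# Matches the opening of an ingredient call, e.g. "{{VQA(" or "{{\n    LLMMap(".
# Hypothetical pattern for illustration; BlendSQL uses its own grammar.
INGREDIENT_CALL = re.compile(r"\{\{\s*(\w+)\s*\(")


def find_ingredient_names(query: str) -> list[str]:
    """Return ingredient names in the order they appear in the query."""
    return INGREDIENT_CALL.findall(query)


query = """
SELECT {{VQA('Describe this image.', 'parks::Image')}} FROM parks
    WHERE "Location" = 'Alaska'
    ORDER BY {{
        LLMMap(
            'Size in km2?',
            'parks::Area'
        )
    }} LIMIT 1
"""

print(find_ingredient_names(query))  # ['VQA', 'LLMMap']
```

Everything outside the double curly brackets is ordinary SQL, which is why the rest of the statement can be handed to the database unchanged.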

### Features

- Supports many DBMS 💾
  - Currently, SQLite and PostgreSQL are functional - more to come! 
- Easily extendable to [multi-modal use cases](./examples/vqa-ingredient.ipynb) 🖼️
- Smart parsing optimizes what is passed to external functions 🧠
  - Traverses abstract syntax tree with [sqlglot](https://github.com/tobymao/sqlglot) to minimize LLM function calls 🌳
- Constrained decoding with [outlines](https://github.com/outlines-dev/outlines) 🚀
- LLM function caching, built on [diskcache](https://grantjenks.com/docs/diskcache/) 🔑
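
The idea behind minimizing LLM function calls can be sketched without BlendSQL itself. The example below is an assumption-laden illustration, not BlendSQL's optimizer: `fake_llm_size_km2` is a hypothetical stand-in for a model call, and the table holds three rows copied from the `parks` example. It compares calling the function on every row against applying the plain SQL filter first.

```python
import re
import sqlite3

# Mutable counter so we can see how often the "model" is invoked.
call_counter = {"n": 0}


def fake_llm_size_km2(area_text: str) -> float:
    """Hypothetical stand-in for an expensive LLM call: extract the km2 figure."""
    call_counter["n"] += 1
    match = re.search(r"\(([\d,.]+) km2\)", area_text)
    return float(match.group(1).replace(",", ""))


conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE parks ("Name" TEXT, "Location" TEXT, "Area" TEXT)')
conn.executemany(
    "INSERT INTO parks VALUES (?, ?, ?)",
    [
        ("Death Valley", "California, Nevada", "3,408,395.63 acres (13,793.3 km2)"),
        ("New River Gorge", "West Virginia", "7,021 acres (28.4 km2)"),
        ("Katmai", "Alaska", "3,674,529.33 acres (14,870.3 km2)"),
    ],
)

# Naive plan: send every row to the model, then filter on Location.
call_counter["n"] = 0
all_rows = conn.execute('SELECT "Name", "Location", "Area" FROM parks').fetchall()
sized = [(name, loc, fake_llm_size_km2(area)) for name, loc, area in all_rows]
naive_best = max((row for row in sized if row[1] == "Alaska"), key=lambda row: row[2])[0]
naive_calls = call_counter["n"]

# Filter-first plan: let SQL evaluate the WHERE clause, then call the model
# only on the rows that survive.
call_counter["n"] = 0
alaska_rows = conn.execute(
    'SELECT "Name", "Area" FROM parks WHERE "Location" = ?', ("Alaska",)
).fetchall()
filtered_best = max(alaska_rows, key=lambda row: fake_llm_size_km2(row[1]))[0]
filtered_calls = call_counter["n"]

print(naive_best, naive_calls)        # Katmai 3
print(filtered_best, filtered_calls)  # Katmai 1
```

Both plans return the same park, but the filter-first plan invokes the stand-in model once instead of three times. On a real table, the gap grows with the number of rows the SQL predicate excludes.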

## Quickstart

```python
from blendsql import blend, LLMQA
from blendsql.db import SQLite
from blendsql.models import OpenaiLLM, TransformersLLM
from blendsql.utils import fetch_from_hub

blendsql = """
SELECT * FROM w
WHERE city = {{
    LLMQA(
        'Which city is located 120 miles west of Sydney?',
        (SELECT * FROM documents WHERE documents MATCH 'sydney OR 120'),
        options='w::city'
    )
}}
"""
# Make our smoothie - the executed BlendSQL script
smoothie = blend(
    query=blendsql,
    db=SQLite(fetch_from_hub("1884_New_Zealand_rugby_union_tour_of_New_South_Wales_1.db")),
    blender=OpenaiLLM("gpt-3.5-turbo"),
    # If you don't have OpenAI setup, you can use this small Transformers model below instead
    # blender=TransformersLLM("Qwen/Qwen1.5-0.5B"),
    ingredients={LLMQA},
    verbose=True
)
print(smoothie.df)
print(smoothie.meta.prompts)
```
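
The `options='w::city'` argument above uses a `table::column` reference to restrict the answer to values found in that column. As a rough sketch of what such a restriction could look like, the code below resolves a reference against a toy SQLite table and builds a regular expression that accepts only those values. The helper names (`resolve_column_ref`, `options_pattern`) and the sample cities are invented for this example; BlendSQL's own resolution and constrained decoding may work differently.

```python
import re
import sqlite3


def resolve_column_ref(conn: sqlite3.Connection, ref: str) -> list:
    """Return the distinct values for a 'table::column' reference (hypothetical helper)."""
    table, column = ref.split("::", 1)
    # Identifiers cannot be bound as parameters, so quote them explicitly.
    quoted_table = '"' + table.replace('"', '""') + '"'
    quoted_column = '"' + column.replace('"', '""') + '"'
    rows = conn.execute(f"SELECT DISTINCT {quoted_column} FROM {quoted_table}")
    return [row[0] for row in rows]


def options_pattern(values: list) -> re.Pattern:
    """Build a regex that matches exactly one of the allowed values."""
    return re.compile("|".join(re.escape(str(value)) for value in values))


# Toy data for illustration; not the contents of the quickstart database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE w (city TEXT)")
conn.executemany("INSERT INTO w VALUES (?)", [("Sydney",), ("Bathurst",), ("Newcastle",)])

allowed = resolve_column_ref(conn, "w::city")
pattern = options_pattern(allowed)

print(sorted(allowed))
print(pattern.fullmatch("Bathurst") is not None)  # True: value is in the column
print(pattern.fullmatch("Paris") is not None)     # False: value is not in the column
```

A pattern like this is the kind of constraint a constrained-decoding library can enforce during generation, so the model can only produce a value that already exists in the column and the result can be compared directly in the `WHERE` clause.
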
<hr>

### Citation

```bibtex
@article{glenn2024blendsql,
      title={BlendSQL: A Scalable Dialect for Unifying Hybrid Question Answering in Relational Algebra},
      author={Parker Glenn and Parag Pravin Dakle and Liang Wang and Preethi Raghavan},
      year={2024},
      eprint={2402.17882},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

### Acknowledgements
Special thanks to those below for inspiring this project. We definitely recommend checking out the linked work and citing it when applicable!

- The authors of [Binding Language Models in Symbolic Languages](https://arxiv.org/abs/2210.02875)
  - This paper was the primary inspiration for BlendSQL.
- The authors of [EHRXQA: A Multi-Modal Question Answering Dataset for Electronic Health Records with Chest X-ray Images](https://arxiv.org/pdf/2310.18652)
  - As far as I can tell, this is the first publication to propose unifying model calls within SQL.
  - Served as the inspiration for the [vqa-ingredient.ipynb](./examples/vqa-ingredient.ipynb) example
- The authors of [Grammar Prompting for Domain-Specific Language Generation with Large Language Models](https://arxiv.org/abs/2305.19234)
