Metadata-Version: 2.1
Name: bmsdna-lakeapi
Version: 0.4.5
Summary: 
License: MIT
Author: DWH Team
Author-email: you@example.com
Requires-Python: >=3.10,<4.0
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Provides-Extra: auth
Provides-Extra: datafusion
Provides-Extra: polars
Provides-Extra: schema
Provides-Extra: useradd
Requires-Dist: aiocache (>=0.12.1,<0.13.0)
Requires-Dist: argon2-cffi (>=21.3.0,<22.0.0) ; extra == "auth"
Requires-Dist: datafusion (>=24.0.0,<25.0.0) ; extra == "datafusion"
Requires-Dist: deltalake (>=0.9.0,<0.10.0)
Requires-Dist: duckdb (>=0.8.0,<0.9.0)
Requires-Dist: fastapi (>=0.95.1,<0.96.0)
Requires-Dist: jsonschema (>=4.17.3,<5.0.0) ; extra == "schema"
Requires-Dist: polars (>=0.17.14,<0.18.0)
Requires-Dist: pyjwt (>=2.6.0,<3.0.0) ; extra == "auth"
Requires-Dist: pypika (>=0.48.9,<0.49.0)
Requires-Dist: python2jsonschema (>=0.8,<0.9) ; extra == "schema"
Requires-Dist: pyyaml (>=6.0,<7.0)
Requires-Dist: ruamel.yaml (>=0.17.26,<0.18.0) ; extra == "useradd"
Requires-Dist: xlsx2csv (>=0.8.1,<0.9.0) ; extra == "polars"
Requires-Dist: xlsxwriter (>=3.1.0,<4.0.0)
Description-Content-Type: text/markdown

# The BMS Lake API

<h1 align="center">
  <img src="assets/LakeAPI.drawio.png">
  <br>
</h1>

[![tests](https://github.com/bmsuisse/lakeapi/actions/workflows/python-test.yml/badge.svg?branch=main)](https://github.com/bmsuisse/lakeapi/actions/workflows/python-test.yml)

A FastAPI plugin that exposes your data lake as an API and supports several output formats, such as Parquet, CSV, JSON, and Excel.

The Lake API also includes a minimal security layer for convenience (Basic Auth), but you can also bring your own.

Unlike [roapi](https://github.com/roapi/roapi), we intentionally do not expose most SQL by default, but limit the possible queries with a config. This makes it easy for you to control what happens to your data. If you want the SQL endpoint, you can enable it.

To run the application with the default config:

```python
from fastapi import FastAPI
import bmsdna.lakeapi
app = FastAPI()
bmsdna.lakeapi.init_lakeapi(app)
```

To adjust the config, do the following:

```python
import dataclasses

from fastapi import FastAPI
import bmsdna.lakeapi

app = FastAPI()
def_cfg = bmsdna.lakeapi.get_default_config()  # Get the default startup config
# Use dataclasses.replace to set the properties you want
cfg = dataclasses.replace(def_cfg, enable_sql_endpoint=True, data_path="tests/data")
# Arguments: the FastAPI instance, the basic config, and the table config file
sti = bmsdna.lakeapi.init_lakeapi(app, cfg, "config_test.yml")
```

## Installation

[![PyPI version](https://badge.fury.io/py/bmsdna-lakeapi.svg)](https://pypi.org/project/bmsdna-lakeapi/)

The PyPI package `bmsdna-lakeapi` can be installed like any Python package: `pip install bmsdna-lakeapi`

## OpenAPI

Everything works with OpenAPI and FastAPI: you can add other FastAPI routes and use the `/docs` and `/redoc` endpoints.

## Engine

`DuckDB` is the default query engine. `Polars` and `Datafusion` are also supported, but lack some features. The query engine can be specified at the route level and at the query level with the hidden parameter `$engine="duckdb|datafusion|polars"`. If you want Polars or Datafusion, install the corresponding extra (`polars` or `datafusion`).
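
As a rough illustration of the query-level override, the sketch below builds a request URL that pins the engine for a single call. The host, the `/api/v1/test/fruits` path, and the helper name `build_query_url` are assumptions made for this example; check the `/docs` page of your own deployment for the actual routes.

```python
from urllib.parse import urlencode

# Hypothetical base URL and route; the real path depends on your deployment.
BASE_URL = "http://localhost:8080/api/v1/test/fruits"


def build_query_url(base_url: str, engine: str, **params: object) -> str:
    """Append the hidden $engine parameter plus any other query parameters."""
    if engine not in {"duckdb", "datafusion", "polars"}:
        raise ValueError(f"unsupported engine: {engine}")
    query = {"$engine": engine, **params}
    # urlencode percent-encodes "$" as %24; the server decodes it again.
    return f"{base_url}?{urlencode(query)}"


url = build_query_url(BASE_URL, "polars", limit=10)
print(url)
```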

At the moment, DuckDB seems to perform best. Features like full-text search are also only available with `DuckDB`.

## Default Security

By default, Basic Authentication is enabled. To add a user, run `add_lakeapi_user YOURUSERNAME --yaml-file config.yml`. This adds the user to your config YAML with an argon2-hashed password.
The generated password is printed. If you do not want this logic, you can overwrite the `username_retriver` of the default config.
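
For a client calling an endpoint protected this way, a minimal sketch of building the Basic Authentication header with the standard library could look like this. The URL, the credentials, and the helper name `basic_auth_header` are placeholders chosen for the example, not part of the package.

```python
import base64
import urllib.request


def basic_auth_header(username: str, password: str) -> dict[str, str]:
    """Build the Authorization header used by HTTP Basic Authentication."""
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": f"Basic {token}"}


# Hypothetical URL and credentials, for illustration only.
request = urllib.request.Request(
    "http://localhost:8080/api/v1/test/fruits",
    headers=basic_auth_header("YOURUSERNAME", "generated-password"),
)
# urllib.request.urlopen(request) would send the call to a running server.
```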

## Standalone Mode

If you just want to run the API on its own, you can start it with a web server:

Uvicorn: `uvicorn bmsdna.lakeapi.standalone:app --host 0.0.0.0 --port 8080`

Gunicorn: `gunicorn bmsdna.lakeapi.standalone:app --workers 4 --worker-class uvicorn.workers.UvicornWorker --bind 0.0.0.0:80`

Adjust the HTTP options as needed. You also need to `pip install` uvicorn or gunicorn.

You can still use environment variables for configuration.

## Environment Variables

- CONFIG_PATH: The path of the config file, defaults to `config.yml`. If you want to split the config, you can specify a folder, too
- DATA_PATH: The path of the data files, defaults to `data`. Paths in `config.yml` are relative to DATA_PATH
- ENABLE_SQL_ENDPOINT: Set this to 1 to enable the SQL Endpoint

## Config File

By default, the application relies on a config file called `config.yml` being present at the root of your project.

The config file looks something like this, see also [our test yaml](config_test.yml):

```yaml
tables:
  - name: fruits
    tag: test
    version: 1
    api_method:
      - get
      - post
    params:
      - name: cars
        operators:
          - "="
          - in
      - name: fruits
        operators:
          - "="
          - in
    datasource:
      uri: delta/fruits
      file_type: delta

  - name: fruits_partition
    tag: test
    version: 1
    api_method:
      - get
      - post
    params:
      - name: cars
        operators:
          - "="
          - in
      - name: fruits
        operators:
          - "="
          - in
      - name: pk
        combi:
          - fruits
          - cars
      - name: combi
        combi:
          - fruits
          - cars
    datasource:
      uri: delta/fruits_partition
      file_type: delta
      select:
        - name: A
        - name: fruits
        - name: B
        - name: cars

  - name: fake_delta
    tag: test
    version: 1
    allow_get_all_pages: true
    api_method:
      - get
      - post
    params:
      - name: name
        operators:
          - "="
      - name: name1
        operators:
          - "="
    datasource:
      uri: delta/fake
      file_type: delta

  - name: fake_delta_partition
    tag: test
    version: 1
    allow_get_all_pages: true
    api_method:
      - get
      - post
    params:
      - name: name
        operators:
          - "="
      - name: name1
        operators:
          - "="
    datasource:
      uri: delta/fake
      file_type: delta

  - name: "*" # We're lazy and want to expose all in that folder. Name MUST be * and nothing else
    tag: startest
    version: 1
    api_method:
      - post
    datasource:
      uri: startest/* # Uri MUST end with /*
      file_type: delta

  - name: fruits # But we want to overwrite this one
    tag: startest
    version: 1
    api_method:
      - get
    datasource:
      uri: startest/fruits
      file_type: delta
```

## Partitioning for awesome performance

To use partitions, you can either:

- Partition by a column that you filter on.
- Partition on a special column called `columnname_md5_prefix_2`, whose value is the first two characters of the hexadecimal MD5 hash of `columnname`. If you now filter by `columnname`, this greatly reduces the number of files that have to be searched. The number of characters used is up to you; we found two to be useful.
- Partition on a special column called `columnname_md5_mod_NRPARTITIONS`, where the partition value is `str(int(hashlib.md5(COLUMNNAME).hexdigest(), 16) % NRPARTITIONS)`. This might look a bit complicated, but it's not that hard :) You're just doing a modulo on your MD5 hash, which allows you to set the exact number of partitions. Filtering is still done correctly on `columnname`.
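
The two hash-based partition values described above can be computed as in the following sketch. The function names are made up for the example, and encoding the column value as UTF-8 before hashing is an assumption; use whatever encoding your writer uses, as long as it is consistent.

```python
import hashlib


def md5_prefix_partition(value: str, prefix_len: int = 2) -> str:
    """Partition value for a `<column>_md5_prefix_<n>` column."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return digest[:prefix_len]


def md5_mod_partition(value: str, nr_partitions: int) -> str:
    """Partition value for a `<column>_md5_mod_<n>` column."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return str(int(digest, 16) % nr_partitions)


# Example: values for the columns `fruits_md5_prefix_2` and `fruits_md5_mod_16`
print(md5_prefix_partition("banana"), md5_mod_partition("banana", 16))
```

When writing the Delta table, you would add the resulting string as an extra column and partition by it.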

Why partition by MD5 hash? Imagine you have a product ID where most IDs start with a 1 and some newer ones start with a 2. Most of the data will be in the first partition. If you use an MD5 hash, the data will be spread evenly across the partitions.
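
To make the skew argument concrete, this small experiment compares both schemes on synthetic product IDs. The ID ranges and the choice of eight partitions are arbitrary values picked for the illustration.

```python
import hashlib
from collections import Counter

# Synthetic product IDs: most start with "1", a few newer ones with "2".
product_ids = [str(n) for n in range(10_000, 20_500)]

# Naive scheme: partition on the first character of the ID.
by_first_char = Counter(pid[0] for pid in product_ids)

# MD5 scheme: partition on the hash modulo a fixed number of partitions.
NR_PARTITIONS = 8
by_md5_mod = Counter(
    str(int(hashlib.md5(pid.encode("utf-8")).hexdigest(), 16) % NR_PARTITIONS)
    for pid in product_ids
)

print("first char:", dict(by_first_char))
print("md5 mod 8 :", dict(sorted(by_md5_mod.items())))
```

The first counter puts almost all rows into a single partition, while the second one spreads them over eight partitions of similar size.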

With this hack you can get sub-second results on very large data. 🚀🚀🚀

You need to use `deltalake` to use partitions, and only `str` partition columns are supported for now.

[Z-ordering](https://docs.delta.io/latest/optimizations-oss.html#z-ordering-multi-dimensional-clustering) can also help a lot :). This approach should only be used for very large datasets.

## Even more features

- Built-in paging, you can use limit/offset to control what you get
- Full-text search using DuckDB's full-text search feature
- `jsonify_complex` parameter to convert structs/lists to JSON, in case the client cannot handle structs/lists
- Metadata endpoints to retrieve data types, string lengths and more
- Easily expose entire folders by using a "\*" wildcard in both the name and the datasource.uri config, see example in the config above
- Good test coverage
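
On the client side, limit/offset paging can be wrapped in a small generator. The sketch below is independent of any HTTP library: `fetch_page` is a hypothetical callable that you would implement to send `limit` and `offset` to your endpoint, and `fake_fetch` only simulates it with an in-memory list.

```python
from collections.abc import Callable, Iterator


def iter_rows(
    fetch_page: Callable[[int, int], list[dict]], page_size: int = 100
) -> Iterator[dict]:
    """Yield all rows by requesting consecutive limit/offset windows."""
    offset = 0
    while True:
        rows = fetch_page(page_size, offset)
        yield from rows
        if len(rows) < page_size:
            break  # a short (or empty) page means nothing is left
        offset += page_size


# Stand-in for a real request; replace with a call to your endpoint.
DATA = [{"id": i} for i in range(250)]


def fake_fetch(limit: int, offset: int) -> list[dict]:
    return DATA[offset : offset + limit]


rows = list(iter_rows(fake_fetch, page_size=100))
print(len(rows))
```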

## Work in progress

Please note that this is a work in progress: changes may be made and things may break, especially at this early stage.

## Contribution

Feel free to contribute, report bugs or request enhancements.

