Metadata-Version: 2.4
Name: easy_data_loader
Version: 0.1.3
Summary: Data transfer utilities between files and databases
Author-email: Bojoi Gabriel <bojoigabriel@gmail.com>
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Topic :: Database
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3.13
Classifier: Operating System :: OS Independent
Requires-Python: >=3.13
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: click>=8.3.0
Requires-Dist: openpyxl>=3.1.5
Requires-Dist: pandas>=2.3.3
Requires-Dist: psycopg2-binary>=2.9.11
Requires-Dist: pyarrow>=22.0.0
Requires-Dist: pydantic>=2.12.5
Requires-Dist: pydantic-settings>=2.12.0
Requires-Dist: pyodbc>=5.2.0
Requires-Dist: python-dotenv>=1.1.1
Requires-Dist: sqlalchemy>=2.0.43
Dynamic: license-file

# Easy Data Loader 🚀


[![PyPI version](https://badge.fury.io/py/easy-data-loader.svg)](https://badge.fury.io/py/easy-data-loader)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
![Downloads](https://static.pepy.tech/badge/easy-data-loader)

**Easy Data Loader** is a flexible, modular Python library designed to streamline ETL (Extract, Transform, Load) processes between various file data sources (csv, xlsx, parquet, orc) and databases (MSSQL, PostgreSQL and others).

## ✨ Key Features
- **Declarative Configuration**: Manage connections and pipelines through simple Python files and `.env` resources.
- **Integrated CLI**: Initialize a standardized project structure with a single command.
- **Custom Transformation Hooks**: Inject your own Pandas transformation logic directly into the pipeline execution.
- **Performance Optimized**: Built-in support for chunked loading and writing to handle large datasets efficiently.
- **Extensible Architecture**: Uses a Factory Pattern for database connectors, making it easy to support new drivers.

---

## 📦 Installation

Install directly via `pip` or `uv`:

```bash
pip install easy_data_loader
uv add easy_data_loader
```

## 🚀 Getting Started

1. Initialize a new project structure to generate template configurations:
   ```bash
   easy-data-loader init
   ```
2. Review the generated `config/` folders for sample resources and pipelines.
3. Run all discovered pipelines across the active configurations:
   ```bash
   easy-data-loader run_all
   ```

## ✔️ Generic concepts

`easy_data_loader` uses `resources` to define a file or a database. A resource can act as either a source or a destination, which makes the following ETL scenarios possible: file -> file, file -> database, database -> file, database -> database.

The `easy_data_loader` project initializer creates the predefined folder structure `/config/resources`, where resources are expected to be defined according to the following convention: each resource is a `.env` file whose name is prefixed with the resource type, `file_` or `database_`. The predefined folder structure together with the naming convention enables `easy_data_loader` to find and load all resources.
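
The naming convention can be illustrated with a minimal sketch. It is not the library's own discovery code: `discover_resources` is a hypothetical helper, and the file names are made up for the example. It only shows how `.env` files could be grouped by their `file_` or `database_` prefix.

```python
import tempfile
from pathlib import Path


def discover_resources(resources_dir: Path) -> dict[str, list[str]]:
    """Group .env files by resource-type prefix (hypothetical helper)."""
    found: dict[str, list[str]] = {"file": [], "database": []}
    for env_file in sorted(resources_dir.glob("*.env")):
        for kind in found:
            if env_file.name.startswith(f"{kind}_"):
                # The resource name is the file name without its extension.
                found[kind].append(env_file.stem)
    return found


# Build a throwaway folder that mimics config/resources.
with tempfile.TemporaryDirectory() as tmp:
    resources_dir = Path(tmp) / "config" / "resources"
    resources_dir.mkdir(parents=True)
    for name in ("file_sales_csv.env", "database_warehouse.env", "notes.txt"):
        (resources_dir / name).touch()
    resources = discover_resources(resources_dir)

print(resources)
```

Files that do not follow the convention (here `notes.txt`) are simply ignored by the sketch.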

A secondary predefined folder `/config/pipelines` will contain the pipeline definition files, which are regular Python files. There are 3 types of pipelines that can be defined:
- `LoadPipeline` - the main pipeline type, which transports data from source to destination
- `ProcedurePipeline` - a pipeline dedicated to executing stored procedures inside a database
- `OrchestratorPipeline` - a pipeline that executes a group of pipelines sequentially

## LoadPipeline

In order to define a `LoadPipeline` we must use the `BasePipelineDefinition` from `easy_data_loader` as depicted in the example pipelines created by the initializer.
In the simplest form there are only a few mandatory parameters:
- `pipeline_name : str` - this name will be used to execute the pipeline
- `source : str` - the file name (without extension) of the resource to use as the data source
- `destination : str` - the file name (without extension) of the resource to use as the data destination

If either the source or the destination is a database, additional parameters become mandatory:
- `source_sql : str` - can be a table name or a specific query in the SQL dialect of the source database flavor
- `destination_table : str` - table name where the data will be inserted
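
The rule above can be sketched with a stand-in class. `LoadPipelineSketch` is not the library's `BasePipelineDefinition`; it only mirrors the documented field names. The sketch assumes that `source_sql` is required when the source is a database, that `destination_table` is required when the destination is a database, and that the resource type can be read from the `file_` / `database_` prefix.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LoadPipelineSketch:
    """Stand-in mirroring the documented fields; not the library's class."""

    pipeline_name: str
    source: str
    destination: str
    source_sql: Optional[str] = None
    destination_table: Optional[str] = None

    def missing_parameters(self) -> list[str]:
        """Return the database-specific parameters that are still unset."""
        missing = []
        if self.source.startswith("database_") and not self.source_sql:
            missing.append("source_sql")
        if self.destination.startswith("database_") and not self.destination_table:
            missing.append("destination_table")
        return missing


# file -> database: only the destination side needs an extra parameter.
file_to_db = LoadPipelineSketch(
    pipeline_name="load_sales",
    source="file_sales_csv",
    destination="database_warehouse",
)
incomplete = file_to_db.missing_parameters()

file_to_db.destination_table = "sales"
complete = file_to_db.missing_parameters()

print(incomplete, complete)
```

The resource and table names are illustrative only; the example pipelines generated by the initializer remain the reference for the real definition syntax.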

There are many other aspects of the pipeline that can be defined:
- `audit : str` - the pipeline has built-in audit functionality that records certain information in a SQLite database after the pipeline completes. Optionally, the same information can be recorded in a database `resource`.
- `validator : Pydantic BaseModel` - the data read from the source `resource` can be validated against a user-defined Pydantic model before it is written to the destination.
- `columns : Dict[str, ColumnDefinition]` - gives strict control over how the data is written to the destination. It serves two purposes: renaming columns and explicitly defining data types (mainly for inserting into a database table). A `ColumnDefinition` is constructed with an optional `target_name : str` for renaming a column and / or a `data_type : SqlAlchemy Type` for controlling column data types, lengths, precision, etc.
- `read_parameters : Dict[str, Any]` and `write_parameters : Dict[str, Any]` - control how the data is read from the source and written to the destination. They provide an easy way to use special delimiters for files, drop and recreate the database table, etc. `easy_data_loader` uses pandas as the transport layer, so the read and write parameters are passed to the corresponding pandas read and write functions.
- Hooks - the pipeline has a set of predefined hooks that run functions at specific moments during execution:
  - `file_pre_process : Callable` - executed before the file is read into the pandas DataFrame (e.g. unzip the file)
  - `transform : Callable` - transforms the data already loaded in the pandas DataFrame (requires pandas methods)
  - `file_post_process : Callable` - executed after the pipeline completes and the data is written to the destination, to post-process the source file (e.g. move the file to another folder)
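
The two file hooks could look roughly like the sketch below, which uses only the standard library. The function names and signatures (a `Path` in, a `Path` out) are assumptions made for illustration; the arguments that `easy_data_loader` actually passes to `file_pre_process` and `file_post_process` are not stated here.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def unzip_first_member(archive: Path) -> Path:
    """Pre-process sketch: extract the first zip member next to the archive."""
    with zipfile.ZipFile(archive) as zf:
        member = zf.namelist()[0]
        zf.extract(member, archive.parent)
    return archive.parent / member


def move_to_processed(source_file: Path) -> Path:
    """Post-process sketch: move the source file into a 'processed' subfolder."""
    target_dir = source_file.parent / "processed"
    target_dir.mkdir(exist_ok=True)
    return Path(shutil.move(str(source_file), str(target_dir / source_file.name)))


# Exercise both hooks on a throwaway archive.
with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp) / "sales.zip"
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("sales.csv", "id,amount\n1,9.5\n")

    extracted = unzip_first_member(archive)
    extracted_text = extracted.read_text()

    archived = move_to_processed(archive)
    archived_exists = archived.exists()
    original_gone = not archive.exists()

print(extracted_text, archived_exists, original_gone)
```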

## ProcedurePipeline

This secondary pipeline type is responsible for executing one or more stored procedures inside a database.
To define one we need to use the `ProcedureDefinition` with the following parameters:
- `pipeline_name : str` - this name will be used to execute the pipeline
- `audit : str, optional` - database resource name where the audit info will be recorded
- `resource : str` - database resource name where the stored procedure(s) will be executed
- `procedures : List[Tuple[str, Optional[Dict[str, Any]]]]` - list of one or more stored procedures, each with optional procedure parameters given as a dictionary
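
The shape of the `procedures` value can be shown with plain Python data. The procedure names and parameters below are invented, `describe` is a hypothetical helper used only to print the list, and passing `None` for a procedure without parameters is an assumption based on the `Optional` in the type.

```python
from typing import Any, Dict, List, Optional, Tuple

# One entry per stored procedure: (name, optional parameter dictionary).
ProcedureCall = Tuple[str, Optional[Dict[str, Any]]]

procedures: List[ProcedureCall] = [
    ("dbo.refresh_sales_summary", None),
    ("dbo.archive_orders", {"cutoff_date": "2024-12-31", "batch_size": 500}),
]


def describe(calls: List[ProcedureCall]) -> List[str]:
    """Render each entry as a readable call string (illustration only)."""
    lines = []
    for name, params in calls:
        args = ", ".join(f"{key}={value!r}" for key, value in (params or {}).items())
        lines.append(f"{name}({args})")
    return lines


described = describe(procedures)
for line in described:
    print(line)
```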

## OrchestratorPipeline

This pipeline type is responsible for sequentially executing a set of pipelines, `LoadPipeline`s and / or `ProcedurePipeline`s. It is defined using the `OrchestratorDefinition` with:
- `orchestrator_name : str` - name by which the orchestrator is executed
- `pipelines : List[str]` - list of pipelines to execute sequentially
- `fail_fast : bool, Default True` - if any pipeline fails, the remaining pipelines in the list are not executed
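
The effect of `fail_fast` can be sketched with a small sequential runner. This is not the library's orchestrator: `run_sequentially`, the `registry` of callables and the status strings are all hypothetical and only demonstrate the documented behaviour.

```python
from typing import Callable, Dict, List


def run_sequentially(
    pipelines: List[str],
    registry: Dict[str, Callable[[], None]],
    fail_fast: bool = True,
) -> Dict[str, str]:
    """Run pipelines in order; with fail_fast, stop at the first failure."""
    results: Dict[str, str] = {}
    for index, name in enumerate(pipelines):
        try:
            registry[name]()
            results[name] = "ok"
        except Exception as exc:
            results[name] = f"failed: {exc}"
            if fail_fast:
                # Everything after the failing pipeline is left unexecuted.
                for skipped in pipelines[index + 1:]:
                    results[skipped] = "skipped"
                break
    return results


def failing_step() -> None:
    raise RuntimeError("boom")


registry = {
    "load_sales": lambda: None,
    "refresh_summary": failing_step,
    "export_report": lambda: None,
}
order = ["load_sales", "refresh_summary", "export_report"]

stopped = run_sequentially(order, registry, fail_fast=True)
continued = run_sequentially(order, registry, fail_fast=False)
print(stopped)
print(continued)
```

With `fail_fast=True` the last pipeline is never started; with `fail_fast=False` the failure is recorded and the remaining pipelines still run.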
