Metadata-Version: 2.1
Name: airflow-livy-plugins
Version: 0.1
Summary: Plugins for Airflow to run Spark jobs via Livy: sessions and batches
Home-page: https://github.com/panovvv/airflow-livy-plugins
Author: Vadim Panov
Author-email: headcra6@gmail.com
License: MIT License
Platform: UNKNOWN
Requires-Python: >=3.7
Description-Content-Type: text/markdown

# Airflow Livy Plugins

[![Build Status](https://travis-ci.org/panovvv/airflow-livy-plugins.svg?branch=master)](https://travis-ci.org/panovvv/airflow-livy-plugins)
[![Code coverage](https://codecov.io/gh/panovvv/airflow-livy-plugins/branch/master/graph/badge.svg)](https://codecov.io/gh/panovvv/airflow-livy-plugins)

Plugins for Airflow to run Spark jobs via Livy: 
* Sessions,
* Batches. This mode supports additional verification via Spark/YARN REST API.

See [this blog post](https://www.shortn0tes.com/2019/08/airflow-spark-livy-sessions-batches.html "Blog post") for more information and a detailed comparison of the ways to run Spark jobs from Airflow.
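
To make the difference between the two modes concrete, here is a minimal sketch of the HTTP requests each one boils down to on the Livy side. It is not this project's code: the helper names `session_requests` and `batch_request` are hypothetical, and `LIVY_URL` assumes Livy's default port 8998. Only the endpoints (`/sessions`, `/sessions/{id}/statements`, `/batches`) and the body fields (`kind`, `code`, `file`, `args`) come from the Livy REST API.

```python
import json

# Assumption: Livy listens on its default port on the local machine.
LIVY_URL = "http://localhost:8998"


def session_requests(code, kind="pyspark"):
    """Describe the two requests needed to run a code snippet in a Livy session."""
    return [
        # 1. Create an interactive session of the given kind.
        ("POST", f"{LIVY_URL}/sessions", {"kind": kind}),
        # 2. Submit the code as a statement. "{session_id}" is a placeholder:
        #    the real id comes from the response to the first request.
        ("POST", f"{LIVY_URL}/sessions/{{session_id}}/statements", {"code": code}),
    ]


def batch_request(file, args=()):
    """Describe the single request that submits a Spark job file as a Livy batch."""
    return ("POST", f"{LIVY_URL}/batches", {"file": file, "args": list(args)})


if __name__ == "__main__":
    for method, url, body in session_requests("spark.range(10).count()"):
        print(method, url, json.dumps(body))
    print(*batch_request("local:/path/to/job.py", ["-some_flag=value"]))
```

A session takes code as text and keeps a Spark context alive between statements, while a batch takes a path to a job file plus its arguments and runs it to completion.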

## Directories and files of interest
* `airflow_home`: example DAGs and plugins for Airflow. Can be used as 
Airflow home path.
* `batches`: Spark jobs code, to be used in Livy batches.
* `sessions`: (Optionally) templated Spark code for Livy sessions.
* `airflow.sh`: helper shell script. Can be used to run sample DAGs,
prep development environment and more.
Run it to find out what other commands are available.


## How do I...

### ...run the examples?
Prerequisites:
* Python 3.7 or newer. Make sure it's installed and in __$PATH__

Now, 
1. Do you have a Spark cluster with Livy running somewhere?
    1. *No*. Either get one, or "mock" it with 
    [my Spark cluster on Docker Compose](https://github.com/panovvv/bigdata-docker-compose).
    1. *Yes*. You're golden!
1. __Optional - this step can be skipped if you're mocking a cluster on your
machine__. Open *airflow.sh*. Inside the `init_airflow ()` function you'll see the Airflow
Connections for Livy, Spark and YARN. Redefine them as appropriate.
1. Run `./airflow.sh up` to bring up the whole infrastructure. 
The Airflow UI will be available at
[localhost:8888](http://localhost:8888 "Airflow UI").
1. Press Ctrl+C to stop Airflow. Then run `./airflow.sh down` to dispose of any
remaining Airflow processes (this shouldn't be needed if everything goes well).


### ...set up development environment?

* Run `./airflow.sh dev` to install all dev dependencies.
* (PyCharm-specific) Point PyCharm to your newly created virtual environment: go to
`"Preferences" -> "Project: airflow-livy-plugins" -> "Project interpreter"`, select
"Existing environment" and pick the __python3__ executable from the __venv__ folder
(__venv/bin/python3__).
* `./airflow.sh cov` - run tests with coverage report 
(will be saved to *htmlcov/*).
* `./airflow.sh lint` - highlight code style errors.
* `./airflow.sh format` - reformat all code 
([Black](https://black.readthedocs.io/en/stable/) + 
[isort](https://readthedocs.org/projects/isort/))

### ...debug?

* (PyCharm-specific) Step-by-step debugging with `airflow test` 
and running PySpark batch jobs locally (with debugging as well) 
are supported via run configurations under `.idea/runConfigurations`.
You shouldn't have to do anything to use them - just open the folder
in PyCharm as a project.
* An example of how a batch can be run on local Spark:
```bash
python ./batches/join_2_files.py \
"file:///Users/vpanov/data/vpanov/bigdata-docker-compose/data/grades.csv" \
"file:///Users/vpanov/data/vpanov/bigdata-docker-compose/data/ssn-address.tsv" \
-file1_sep=, -file1_header=true \
-file1_schema="\`Last name\` STRING, \`First name\` STRING, SSN STRING, Test1 INT, Test2 INT, Test3 INT, Test4 INT, Final INT, Grade STRING" \
-file1_join_column=SSN -file2_header=false \
-file2_schema="\`Last name\` STRING, \`First name\` STRING, SSN STRING, Address1 STRING, Address2 STRING" \
-file2_join_column=SSN -output_header=true \
-output_columns="file1.\`Last name\` AS LastName, file1.\`First name\` AS FirstName, file1.SSN, file2.Address1, file2.Address2" 

# Optionally append to save result to file
#-output_path="file:///Users/vpanov/livy_batch_example" 
```
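
The command above passes two positional file paths followed by single-dash `-name=value` flags. As an illustration of how such a command line can be consumed, here is a minimal sketch using the standard-library `argparse`. It is an assumption for illustration only and is not taken from `batches/join_2_files.py`: the function `build_parser`, the destination names `file1_path` and `file2_path`, and the defaults are all hypothetical, and only a subset of the flags is declared.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring a few flags from the example command."""
    parser = argparse.ArgumentParser(
        description="Illustrative only, not the project's actual argument parsing."
    )
    # Two positional arguments: the files to join.
    parser.add_argument("file1_path")
    parser.add_argument("file2_path")
    # argparse accepts single-dash long options, including the -name=value form.
    parser.add_argument("-file1_sep", default=",")
    parser.add_argument("-file1_header", default="false")
    parser.add_argument("-file1_join_column")
    parser.add_argument("-file2_join_column")
    # Left as None when the result should not be saved to a file.
    parser.add_argument("-output_path", default=None)
    return parser


if __name__ == "__main__":
    example = [
        "file:///tmp/grades.csv",  # placeholder paths
        "file:///tmp/ssn-address.tsv",
        "-file1_sep=,",
        "-file1_header=true",
        "-file1_join_column=SSN",
        "-file2_join_column=SSN",
    ]
    print(build_parser().parse_args(example))
```

When a job is submitted through Livy rather than run locally, the same strings would travel as the batch's argument list instead of a shell command line.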

## TODO
* airflow.sh - replace with modern tools (e.g. pipenv + Docker image)
* Disable some of the flake8 flags for cleaner code

