Metadata-Version: 2.0
Name: aws-hadoop
Version: 1.0.dev2
Summary: Create enterprise grade Hadoop cluster in AWS in minutes.
Home-page: https://github.com/varmarakesh/aws-hadoop
Author: Rakesh Varma
Author-email: varma.rakesh@gmail.com
License: BSD
Download-URL: https://github.com/varmarakesh/aws-hadoop/tarball/1.0dev2
Description-Content-Type: UNKNOWN
Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 2.7
Requires-Dist: jinja2
Requires-Dist: coverage
Requires-Dist: pypi-publisher
Requires-Dist: python-terraform
Requires-Dist: fabric
Requires-Dist: pep8

Create an enterprise-grade Hadoop cluster in AWS
================================================

author: Rakesh Varma

Overview
--------

Create an enterprise-grade Hadoop cluster in AWS in minutes.

Installation / Usage
--------------------

Make sure [terraform](https://www.terraform.io/intro/getting-started/install.html) is installed. It is required to run this solution.
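
As a quick pre-flight check, the sketch below looks for the `terraform` executable on your `PATH` using only the Python standard library. It assumes Python 3 (for `shutil.which`), and the helper name `find_tool` is hypothetical; it is not part of aws-hadoop.

```python
import shutil


def find_tool(name="terraform"):
    """Return the full path of an executable on PATH, or None if it is missing.

    Hypothetical helper for illustration; not part of aws-hadoop.
    """
    return shutil.which(name)


if __name__ == "__main__":
    path = find_tool()
    if path is None:
        print("terraform was not found on PATH; install it before continuing")
    else:
        print("terraform found at", path)
```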

Make sure your AWS credentials exist in your local `~/.aws/credentials` file.
If you are using an `AWS_PROFILE` called `test`, then your `credentials` file should look like this:

```ini
[test]
aws_access_key_id = SOMEAWSACCESSKEYID
aws_secret_access_key = SOMEAWSSECRETACCESSKEY
```
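
Because the credentials file uses INI syntax, a short script can confirm that a profile is present before you start. This is a minimal sketch assuming Python 3; the function name `profile_has_credentials` is hypothetical and not part of aws-hadoop.

```python
import configparser
import os


def profile_has_credentials(profile, path="~/.aws/credentials"):
    """Return True if `profile` defines both keys in the credentials file.

    Hypothetical helper for illustration; not part of aws-hadoop.
    """
    parser = configparser.ConfigParser()
    # read() silently skips files that do not exist, leaving the parser empty.
    parser.read(os.path.expanduser(path))
    if not parser.has_section(profile):
        return False
    required = ("aws_access_key_id", "aws_secret_access_key")
    return all(parser.has_option(profile, key) for key in required)
```

For example, `profile_has_credentials("test")` checks the `test` profile shown above.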

Create a `config.ini` with the appropriate settings.

```ini
[default]

# AWS settings
aws_region = us-east-1
aws_profile = test
terraform_s3_bucket = hadoop-terraform-state
ssh_private_key = key.pem
vpc_id = vpc-883883883
vpc_subnets = [
                'subnet-89dad652',
                'subnet-7887z892',
                'subnet-f300b8z8'
              ]
hadoop_namenode_instance_type = t2.micro
hadoop_secondarynamenode_instance_type = t2.micro
hadoop_datanodes_instance_type = t2.micro
hadoop_datanodes_count = 2

# Hadoop settings
hadoop_replication_factor = 2
```
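
To sanity-check a `config.ini` like the one above, it can be read with the standard library. The sketch below assumes Python 3 and is not the parser aws-hadoop itself uses; `load_settings` is a hypothetical name. It relies on the continuation lines of `vpc_subnets` being indented, as in the example.

```python
import ast
import configparser


def load_settings(path="config.ini", section="default"):
    """Parse config.ini and return the chosen section as a plain dict.

    Hypothetical helper for illustration; not the parser aws-hadoop uses.
    """
    parser = configparser.ConfigParser()
    with open(path) as handle:
        parser.read_file(handle)
    settings = dict(parser[section])
    # vpc_subnets is written as a Python-style list spanning several
    # indented lines, so turn that text into a real list of strings.
    settings["vpc_subnets"] = ast.literal_eval(settings["vpc_subnets"])
    return settings
```

All other values come back as strings, so a numeric setting such as `hadoop_datanodes_count` needs an explicit `int(...)` conversion.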

Once the `config.ini` file is ready, install the library and run it. Using a virtualenv is recommended.

```sh
pip install aws-hadoop
```

Run this in Python to create a Hadoop cluster:

```python
from aws_hadoop.install import Install
Install().create()
```

To run from the source directly, install the requirements:

```sh
pip install -r requirements.txt
```

Then run the same Python code:

```python
from aws_hadoop.install import Install
Install().create()
```

### Configuration Settings

This section describes each of the settings that go into the config file. Note that some of the settings are optional.

##### aws_region

The AWS region where your terraform state bucket and your Hadoop resources get created (e.g. `us-east-1`).

##### aws_profile

The name of the AWS profile to use, as defined in your local `~/.aws/credentials` file.

##### terraform_s3_bucket

The terraform state information is maintained in the specified S3 bucket. Make sure the `aws_profile` has write access to that bucket.

##### ssh_key_pair

For Hadoop provisioning, aws_hadoop needs to connect to the Hadoop nodes using SSH. The Hadoop EC2 instances are created with the public key of the specified `ssh_key_pair`.
Make sure your machine has the matching private key in your `~/.ssh/` directory.

##### vpc_id

Specify the ID of the VPC in your AWS region in which the terraform resources should be created.

##### vpc_subnets

`vpc_subnets` is a list that contains one or more subnet IDs. You can specify as many subnet IDs as you want. The Hadoop EC2 instances will be created across the specified subnets.
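
How the nodes are spread over the subnets is not specified here. Purely as an illustration of what using multiple subnets can mean, the sketch below pairs node names with subnet IDs in round-robin order. The function `assign_subnets` and the node names are hypothetical, and the round-robin policy is an assumption rather than documented aws-hadoop behavior.

```python
from itertools import cycle


def assign_subnets(node_names, subnet_ids):
    """Pair each node name with a subnet ID in round-robin order.

    Illustrative only; the round-robin policy is an assumption and the
    actual placement is decided by aws-hadoop and terraform.
    """
    if not subnet_ids:
        raise ValueError("at least one subnet ID is required")
    # cycle() repeats the subnet list so any number of nodes can be paired.
    return dict(zip(node_names, cycle(subnet_ids)))
```

With four nodes and three subnets, the fourth node wraps around to the first subnet.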

##### hadoop_namenode_instance_type (optional)

Specify the instance type of the Hadoop namenode. If not specified, the default instance type is `t2.micro`.

##### hadoop_secondarynamenode_instance_type (optional)

Specify the instance type of the Hadoop secondarynamenode. If not specified, the default instance type is `t2.micro`.

##### hadoop_datanodes_instance_type (optional)

Specify the instance type of the Hadoop datanodes. If not specified, the default instance type is `t2.micro`.

##### hadoop_datanodes_count (optional)

Specify the number of Hadoop datanodes that should be created. If not specified, the default value is 2.

##### hadoop_replication_factor (optional)

Specify the replication factor of Hadoop. If not specified, the default value is 2.
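
The defaults for the optional Hadoop settings can be summarized in code. This is a minimal sketch of filling in missing keys, not the library's own logic; `DEFAULTS` and `with_defaults` are hypothetical names, and the values are the defaults stated in this section.

```python
# Defaults as documented in this section for the optional Hadoop settings.
DEFAULTS = {
    "hadoop_namenode_instance_type": "t2.micro",
    "hadoop_secondarynamenode_instance_type": "t2.micro",
    "hadoop_datanodes_instance_type": "t2.micro",
    "hadoop_datanodes_count": 2,
    "hadoop_replication_factor": 2,
}


def with_defaults(settings):
    """Return a copy of `settings` with missing optional keys filled in.

    Hypothetical helper for illustration; not part of aws-hadoop.
    """
    merged = dict(DEFAULTS)
    # Values supplied by the user take precedence over the defaults.
    merged.update(settings)
    return merged
```

A config that only sets `hadoop_datanodes_count` keeps that value and picks up `t2.micro` and a replication factor of 2 for the rest.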

The following are ssh settings, used to ssh into the nodes.

##### ssh_user (optional)
The SSH user, e.g. `ubuntu`.

##### ssh_use_ssh_config (optional)
Set it to `True` if you want to use the settings in your `~/.ssh/config`.

##### ssh_key_file (optional)
This is the key file location. SSH login is done through a private/public key pair.

##### ssh_proxy (optional)
Use this setting if you are using an SSH proxy server (such as a bastion host).

Logging
------

A log file `hadoop-cluster.log` is created in the local directory.


