Metadata-Version: 1.1
Name: apiarist
Version: 0.1.12
Summary: Python Hive query framework
Home-page: http://github.com/msharp/apiarist
Author: Max Sharples
Author-email: maxsharples@gmail.com
License: Apache
Description: Apiarist
        ========
        
        A python 2.5+ package for defining Hive queries which can be run on AWS
        EMR.
        
        It is, in its current form, only addressing a very narrow use-case.
        Reading large text files into a Hive database, running a Hive query, and
        outputting the results to a text file.
        
        File format can be CSV or similar - other delimiters can be specified.
        
        The jobs are runnable locally, which is mainly for testing. You will
        need a local version of Hive which is in your ``PATH`` such that the
        command ``hive -f /some/hive/script.hql`` causes hive to execute the
        contents of the file.
        
        It is heavily modeled on `mrjob <https://github.com/Yelp/mrjob>`__ and
        attempts to present a similar API and use similar common variables to
        cooperate with ``boto``.
        
        A simple Hive job
        -----------------
        
        You will need to provide four methods:
        
        -  ``table`` the name of the table that your query will select from.
        -  ``input_columns`` the columns in the source data file.
        -  ``output_columns`` the columns that your query will output.
        -  ``query`` the HiveQL query.
        
        This code lives in ``/examples``.
        
        .. code:: python
        
            from apiarist.job import HiveJob
        
            class EmailRecipientsSummary(HiveJob):
        
                def table(self):
                    return 'emails_sent'
        
                def input_columns(self):
                    return [
                            ('day', 'STRING'),
                            ('weekday', 'INT'),
                            ('sent', 'BIGINT')
                            ]
        
                def output_columns(self):
                    return [
                            ('year', 'INT'),
                            ('weekday', 'INT'),
                            ('sent', 'BIGINT')
                            ]
        
                def query(self):
                    return "SELECT YEAR(day), weekday, SUM(sent) FROM emails_sent GROUP BY YEAR(day), weekday;"
        
            if __name__ == "__main__":
                EmailRecipientsSummary().run()
        
        Try it out
        ~~~~~~~~~~
        
        Locally (must have a Hive server available):
        
        ::
        
            python email_recipients_summary.py -r local /path/to/your/local/file.csv
        
        EMR:
        
        ::
        
            python email_recipients_summary.py -r emr s3://path/to/your/S3/files/
        
        *NOTE: for the EMR command, you will need to supply some basic
        configuration.*
        
        Serde
        ~~~~~
        
        Hive allows custom a serde to be used to define data formats in tables.
        Apiarist uses `csv-serde <https://github.com/ogrodnek/csv-serde>`__ to
        handle the CSV format properly.
        
        This serde also allows configuration of the delimiter, quoting
        character, and escape character. The defaults are, delimiter = ``,``,
        quote character = ``"``, escape character = ``\``.
        
        You can override the defaults in your job. You should be careful about
        escape sequences when doing so because the value needs to be written
        into a file.
        
        It is best to definie them as string literals. Example:
        
        .. code:: python
        
            from apiarist.job import HiveJob
        
            class EmailRecipientsSummary(HiveJob):
        
                INFILE_DELIMITER_CHAR = r'\t'
                INFILE_QUOTE_CHAR = r"\'"
                INFILE_ESCAPE_CHAR = r'%'
        
                OUTFILE_DELIMITER_CHAR = r'\t'
                OUTFILE_QUOTE_CHAR = r'\"'
                OUTFILE_ESCAPE_CHAR = r"\\"
        
        Configuration
        -------------
        
        There are a range of options for providing job-specific configuration.
        
        Command-line options
        ~~~~~~~~~~~~~~~~~~~~
        
        Arguments can be passed to jobs on the command line, or programmatically
        with an array of options. Argument handling uses the
        `optparse <https://docs.python.org/2/library/optparse.html>`__ module.
        
        Various options can be passed to control the running of the job. In
        particular the AWS/EMR options.
        
        -  ``-r`` the run mode. Either ``local`` or ``emr`` (default is
           ``local``)
        -  ``--conf-path`` use a YAML configuration file.
        -  ``--output-dir`` where the results of the job will go.
        -  ``--s3-scratch-uri`` the bucket in which all the temporary files can
           go.
        -  ``--local-scratch-dir`` this is where temporary file will be written.
        -  ``--s3-log-uri`` write the logs to this location on S3.
        -  ``--ec2-instance-type`` the base instance type. Default is
           ``m3.xlarge``
        -  ``--ec2-master-instance-type`` if you want the master type to be
           different.
        -  ``--num-ec2-instances`` number of instances (including the master).
           Default is ``2``.
        -  ``--ami-version`` the ami version. Default is ``latest``.
        -  ``--hive-version``. Default is ``latest``.
        -  ``--iam-instance-profile`` role for the EC2 instances on the cluster.
           Default is ``EMR_EC2_DefaultRole``.
        -  ``--iam-service-role`` role for the Amazon EMR service on the
           cluster. Default is ``EMR_DefaultRole``.
        -  ``--s3-sync-wait-time`` to configure how long to wait after uploading
           files to S3.
        -  ``--check-emr-status-every`` configure the interval between each
           status check on a running job.
        -  ``--quiet`` less logging
        -  ``--verbose`` more logging
        -  ``--retain-hive-table`` for local mode, keep the hive table to run
           further ad-hoc queries.
        
        *NOTE: IAM roles will be mandatory for all users after June 30, 2015.
        These are set via the ``--iam-instance-profile`` and
        ``--iam-service-role`` options above.*
        
        *See `Configure IAM Roles for Amazon
        EMR <http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-iam-roles.html>`__*
        
        Configuration file
        ~~~~~~~~~~~~~~~~~~
        
        You can supply arguments to your job in a configuration file. It takes
        the same format as ``mrjob`` configuration.
        
        The name of the arguments is different, using underscores instead of
        hyphens and omitting leading hyphens. Config options are divided by the
        type of runner (local/emr) to allow provision of all options for a job
        in one file.
        
        Below is a sample config file:
        
        .. code:: yaml
        
            runners:
              emr:
                aws_access_key_id: AABBCCDDEEFF11223344
                aws_secret_access_key: AABBCCDDEEFF1122334AABBCCDDEEFF
                ec2_instance_type: m3.xlarge
                num_ec2_instances: 5
                s3_scratch_uri: s3://myjobs/scratchspace/
                ami_version: 3.2.1
                hive_version: 0.13.1
              local:
                local_scratch_dir: /home/apiarist/temp/
        
        Arguments supplied on command-line or in application code will override
        those supplied in the config file.
        
        Environment variables
        ~~~~~~~~~~~~~~~~~~~~~
        
        Some environment variables are used when the value is not provided in
        other configuration methods.
        
        ``AWS_ACCESS_KEY_ID`` and ``AWS_SECRET_ACCESS_KEY`` for connecting to
        AWS.
        
        ``S3_SCRATCH_URI`` a S3 base location where all the temporary file for
        the job will be written.
        
        ``APIARIST_TMP_DIR`` where local files will be written during job runs.
        (This is overridden by the ``--local-scratch-dir`` option)
        
        ``CSV_SERDE_JAR_S3`` a permanent location of the serde jar. If this is
        not set, Apiarist will automatically upload a copy of the jar to an S3
        location in the scratch space.
        
        Passing options to your jobs
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        
        Jobs can be configured to accept arguments.
        
        To do this, add the following method to your job class to configutr the
        options:
        
        .. code:: python
        
            def configure_options(self):
                super(EmailRecipientsSummary, self).configure_options()
                self.add_passthrough_option('--year', dest='year')
        
        And then use the option by providing it in the command line arguments,
        like this:
        
        ::
        
            python email_recipients_summary.py -r local /path/to/your/local/file.csv --year 2014
        
        Then incorporating it into your HiveQL query like this:
        
        .. code:: python
        
            def query(self):
                q = "SELECT YEAR(day), weekday, SUM(sent) "
                q += "FROM emails_sent "
                q += "WHERE YEAR(day) = {0} ".format(self.options.year)
                q += "GROUP BY YEAR(day), weekday;"
                return q
        
        License
        -------
        
        Apiarist source code is released under Apache 2 License. Check LICENSE
        file for more information.
        
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Natural Language :: English
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 2.5
Classifier: Programming Language :: Python :: 2.6
Classifier: Programming Language :: Python :: 2.7
Classifier: Topic :: System :: Distributed Computing
Provides: apiarist
