Metadata-Version: 2.1
Name: PyStellarDB
Version: 0.11.0
Summary: Python interface to StellarDB
Home-page: https://github.com/WarpCloud/PyStellarDB
Author: Zhiping Wang
Author-email: zhiping.wang@transwarp.io
License: Apache License, Version 2.0
Platform: UNKNOWN
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Topic :: Database :: Front-Ends
Requires-Python: >=2.7,<=3.7
Provides-Extra: hive
Provides-Extra: sqlalchemy
Provides-Extra: kerberos
Provides-Extra: presto
Requires-Dist: future
Requires-Dist: python-dateutil
Requires-Dist: pyhive
Requires-Dist: sasl
Requires-Dist: thrift
Requires-Dist: thrift-sasl (>=0.3.0)
Requires-Dist: pyspark (>=2.4.0)
Provides-Extra: hive
Requires-Dist: sasl (>=0.2.1); extra == 'hive'
Requires-Dist: thrift (>=0.10.0); extra == 'hive'
Provides-Extra: kerberos
Requires-Dist: requests-kerberos (>=0.12.0); extra == 'kerberos'
Provides-Extra: presto
Requires-Dist: requests (>=1.0.0); extra == 'presto'
Provides-Extra: sqlalchemy
Requires-Dist: sqlalchemy (>=1.3.0); extra == 'sqlalchemy'

PyStellarDB
===========

PyStellarDB is a Python API for executing Transwarp Extended OpenCypher (TEoC) and Hive queries.
It can also generate an RDD object for use in PySpark.
It is based on `PyHive <https://github.com/dropbox/PyHive>`_ and `PySpark <https://github.com/apache/spark/>`_.

PySpark RDD
===========

RDD objects are generated with the same mechanism as ``sc.parallelize(data)``, so the query result is held in local memory first.
A query that returns a large amount of data can therefore run out of memory.

If you do need a large result set, use the following workaround:

1. If you are querying a graph, refer to Chapter 4.4.5 of the StellarDB manual to save the query data into a temporary table.

2. If you are querying a SQL table, save your query result into a temporary table.

3. Find the HDFS path of the temporary table created in Step 1 or Step 2.

4. Use an API such as ``sc.newAPIHadoopFile()`` to generate the RDD.
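
Step 4 can be sketched as follows. This is a minimal sketch, not the library's own method: it assumes the temporary table is stored as plain text files with Hive's default ``\x01`` field delimiter, and ``split_row`` and ``load_table_rows`` are hypothetical helper names. Adjust the input format class if your table uses another storage format.

.. code-block:: python

    # Hadoop classes for reading plain text files (assumed storage format).
    TEXT_INPUT_FORMAT = "org.apache.hadoop.mapreduce.lib.input.TextInputFormat"
    KEY_CLASS = "org.apache.hadoop.io.LongWritable"
    VALUE_CLASS = "org.apache.hadoop.io.Text"


    def split_row(line, delimiter="\x01"):
        """Split one text line of the table into a tuple of column strings."""
        return tuple(line.split(delimiter))


    def load_table_rows(sc, hdfs_path, delimiter="\x01"):
        """Read the table files under hdfs_path into an RDD of column tuples.

        sc is an existing SparkContext; hdfs_path is the path found in Step 3.
        """
        pairs = sc.newAPIHadoopFile(hdfs_path, TEXT_INPUT_FORMAT, KEY_CLASS, VALUE_CLASS)
        # Each record is a (byte offset, line) pair; keep only the line.
        return pairs.map(lambda kv: split_row(kv[1], delimiter))

Call it as ``load_table_rows(sc, hdfs_path)``, where ``hdfs_path`` is the location found in Step 3.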

Usage
=====

PLAIN Mode (No security is configured)
---------------------------------------
.. code-block:: python

    from pystellardb import stellar_hive

    conn = stellar_hive.StellarConnection(host="localhost", port=10000, graph_name='pokemon')
    cur = conn.cursor()
    cur.execute('config query.lang cypher')
    cur.execute('use graph pokemon')
    cur.execute('match p = (a)-[f]->(b) return a,f,b limit 1')

    print(cur.fetchall())
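
The ``execute`` calls above follow a fixed pattern: switch the query language, select the graph, then run the query. A minimal sketch of a helper that wraps this pattern, assuming the connection and cursor expose the DB-API style ``cursor()``, ``execute()``, ``fetchall()`` and ``close()`` methods that PyHive provides; ``run_cypher`` is a hypothetical name, not part of PyStellarDB:

.. code-block:: python

    def run_cypher(conn, graph_name, query):
        """Run one Cypher query against graph_name and return all rows."""
        cur = conn.cursor()
        try:
            cur.execute('config query.lang cypher')
            cur.execute('use graph ' + graph_name)
            cur.execute(query)
            return cur.fetchall()
        finally:
            # Release the cursor even if a statement fails.
            cur.close()

With a connection like the one above, ``run_cypher(conn, 'pokemon', 'match p = (a)-[f]->(b) return a,f,b limit 1')`` returns the same rows as the longer form.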


LDAP Mode
---------
.. code-block:: python

    from pystellardb import stellar_hive

    conn = stellar_hive.StellarConnection(host="localhost", port=10000, username='hive', password='123456', auth='LDAP', graph_name='pokemon')
    cur = conn.cursor()
    cur.execute('config query.lang cypher')
    cur.execute('use graph pokemon')
    cur.execute('match p = (a)-[f]->(b) return a,f,b limit 1')

    print(cur.fetchall())


Kerberos Mode
-------------
.. code-block:: python

    # Make sure /etc/krb5.conf has the correct realm information for the KDC server
    # Make sure the correct keytab file is available in your environment
    # Run the kinit command:
    # On Linux: kinit -kt FILE_PATH_OF_KEYTAB PRINCIPAL_NAME
    # On Mac: kinit -t FILE_PATH_OF_KEYTAB -f PRINCIPAL_NAME

    from pystellardb import stellar_hive

    conn = stellar_hive.StellarConnection(host="localhost", port=10000, kerberos_service_name='hive', auth='KERBEROS', graph_name='pokemon')
    cur = conn.cursor()
    cur.execute('config query.lang cypher')
    cur.execute('use graph pokemon')
    cur.execute('match p = (a)-[f]->(b) return a,f,b limit 1')

    print(cur.fetchall())
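
Before connecting, it can help to confirm that ``kinit`` actually produced a usable ticket. A minimal sketch, assuming the ``klist`` tool is on ``PATH`` and supports the ``-s`` flag (silent mode, exit status 0 when a valid credentials cache exists) as MIT Kerberos does; ``has_kerberos_ticket`` is a hypothetical helper:

.. code-block:: python

    import subprocess


    def has_kerberos_ticket(run=subprocess.call):
        """Return True if `klist -s` reports a valid, unexpired ticket cache."""
        try:
            return run(["klist", "-s"]) == 0
        except OSError:
            # klist is not installed or not on PATH.
            return False

The ``run`` parameter exists only so the check can be exercised without a real Kerberos installation.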


Execute Hive Query
------------------
.. code-block:: python

    from pystellardb import stellar_hive

    # If the `graph_name` parameter is None, the connection executes Hive queries and returns data just as PyHive does
    conn = stellar_hive.StellarConnection(host="localhost", port=10000, database='default')
    cur = conn.cursor()
    cur.execute('SELECT * FROM default.abc limit 10')
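
The example above stops after ``execute``. A minimal sketch of reading the result together with its column names, assuming the cursor follows the DB-API 2.0 conventions that PyHive cursors follow (``description`` holds one tuple per column with the name first, and ``fetchall()`` returns a list of row tuples); ``rows_as_dicts`` is a hypothetical helper:

.. code-block:: python

    def rows_as_dicts(cur):
        """Return the remaining rows of an executed cursor as dicts keyed by column name."""
        names = [col[0] for col in cur.description]
        return [dict(zip(names, row)) for row in cur.fetchall()]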


Execute Graph Query and convert to a PySpark RDD object
--------------------------------------------------------
.. code-block:: python

    from pyspark import SparkContext
    from pystellardb import stellar_hive

    sc = SparkContext("local", "Demo App")

    conn = stellar_hive.StellarConnection(host="localhost", port=10000, graph_name='pokemon')
    cur = conn.cursor()
    cur.execute('config query.lang cypher')
    cur.execute('use graph pokemon')
    cur.execute('match p = (a)-[f]->(b) return a,f,b limit 10')

    rdd = cur.toRDD(sc)

    def f(x): print(x)

    rdd.map(lambda x: (x[0].toJSON(), x[1].toJSON(), x[2].toJSON())).foreach(f)

    # Each row of this query result is a tuple of the form (VertexObject, EdgeObject, VertexObject)
    # Vertex and Edge objects have a toJSON() method that renders the object in JSON format


Execute Hive Query and convert to a PySpark RDD object
--------------------------------------------------------
.. code-block:: python

    from pyspark import SparkContext
    from pystellardb import stellar_hive

    sc = SparkContext("local", "Demo App")

    conn = stellar_hive.StellarConnection(host="localhost", port=10000)
    cur = conn.cursor()
    cur.execute('select * from default_db.default_table limit 10')

    rdd = cur.toRDD(sc)

    def f(x): print(x)

    rdd.foreach(f)

    # Each row of this query result is a tuple of the form (Column, Column, Column)

Dependencies
============

Required:
------------

- Python 2.7 or later, up to Python 3.7 (the package metadata declares ``>=2.7,<=3.7``)

System SASL
------------

Different systems require different packages to be installed to enable SASL support.
Some examples of how to install the packages on different distributions
follow.

Ubuntu:

.. code-block:: bash

    apt-get install libsasl2-dev libsasl2-2 libsasl2-modules-gssapi-mit
    apt-get install python-dev gcc              #Update python and gcc if needed

RHEL/CentOS:

.. code-block:: bash

    yum install cyrus-sasl-md5 cyrus-sasl-plain cyrus-sasl-gssapi cyrus-sasl-devel
    yum install gcc-c++ python-devel.x86_64     #Update python and gcc if needed

    # On Python 3.x, pip3 install may fail with a message like
    # 'Can't connect to HTTPS URL because the SSL module is not available'.
    # In that case you may need to compile and reinstall Python:

    # 1. Download a newer version of openssl, e.g. https://www.openssl.org/source/openssl-1.1.1k.tar.gz
    # 2. Install openssl: ./config && make && make install
    # 3. Link openssl: echo /usr/local/lib64/ > /etc/ld.so.conf.d/openssl-1.1.1.conf
    # 4. Update dynamic lib: ldconfig -v
    # 5. Download a Python source package
    # 6. vim Modules/Setup, search '_socket socketmodule.c', uncomment
    #    _socket socketmodule.c
    #    SSL=/usr/local/ssl
    #    _ssl _ssl.c \
    #            -DUSE_SSL -I$(SSL)/include -I$(SSL)/include/openssl \
    #            -L$(SSL)/lib -lssl -lcrypto
    #
    # 7. Install Python: ./configure && make && make install

Windows:

.. code-block:: bash

    # There are three ways to install sasl for Python on Windows:
    # 1. (recommended) Download a .whl build of sasl from https://www.lfd.uci.edu/~gohlke/pythonlibs/#sasl
    # 2. (recommended) If using Anaconda, run: conda install sasl
    # 3. Install Microsoft Visual C++ 9.0/14.0 build tools for Python 2.7/3.x, then run: pip install sasl (under test)
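
After installing the system packages, a quick import check shows whether the Python side is complete. A minimal sketch: the module names are the import names of the required distributions listed in the package metadata (``dateutil`` for python-dateutil, ``thrift_sasl`` for thrift-sasl), and ``missing_modules`` is a hypothetical helper:

.. code-block:: python

    REQUIRED_MODULES = ["future", "dateutil", "pyhive", "sasl", "thrift", "thrift_sasl", "pyspark"]


    def missing_modules(names=REQUIRED_MODULES):
        """Return the names from the list that cannot be imported."""
        missing = []
        for name in names:
            try:
                __import__(name)
            except ImportError:
                missing.append(name)
        return missing

An empty list from ``missing_modules()`` means every required module could be imported.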

Notices
=======

If you install pystellardb >= 0.9, a ``beeline`` command is also installed on the system.
Delete ``/usr/local/bin/beeline`` if you don't need it.

Requirements
============

Install using

- ``pip install 'pystellardb[hive]'`` for the Hive interface.

PyHive works with

- For Hive: `HiveServer2 <https://cwiki.apache.org/confluence/display/Hive/Setting+up+HiveServer2>`_ daemon


Windows Kerberos Configuration
==============================

If you're connecting to databases using Kerberos authentication from a Windows platform,
you'll need to install and configure Kerberos for Windows first.
Get it from http://web.mit.edu/kerberos/dist/

After installation, configure the environment variables.
Make sure the Kerberos entry in your ``PATH`` comes before the JDK entry (if you have a JDK installed), because the JDK also ships ``kinit`` and related tools.

Find ``/etc/krb5.conf`` on your KDC and copy it to ``krb5.ini`` on Windows with some modifications.
For example, ``krb5.conf`` on the KDC:

.. code-block:: bash

    [logging]
    default = FILE:/var/log/krb5libs.log
    kdc = FILE:/var/log/krb5kdc.log
    admin_server = FILE:/var/log/kadmind.log

    [libdefaults]
    default_realm = DEFAULT
    dns_lookup_realm = false
    dns_lookup_kdc = false
    ticket_lifetime = 24h
    renew_lifetime = 7d
    forwardable = true
    allow_weak_crypto = true
    udp_preference_limit = 32700
    default_ccache_name = FILE:/tmp/krb5cc_%{uid}

    [realms]
    DEFAULT = {
    kdc = host1:1088
    kdc = host2:1088
    }

Modify it by deleting the ``[logging]`` section and the ``default_ccache_name`` entry in ``[libdefaults]``:

.. code-block:: bash

    [libdefaults]
    default_realm = DEFAULT
    dns_lookup_realm = false
    dns_lookup_kdc = false
    ticket_lifetime = 24h
    renew_lifetime = 7d
    forwardable = true
    allow_weak_crypto = true
    udp_preference_limit = 32700

    [realms]
    DEFAULT = {
    kdc = host1:1088
    kdc = host2:1088
    }

This is your krb5.ini for Windows Kerberos. Place a copy in each of these three locations:

- ``C:\ProgramData\MIT\Kerberos5\krb5.ini``
- ``C:\Program Files\MIT\Kerberos\krb5.ini``
- ``C:\Windows\krb5.ini``
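
The edit described above can also be scripted. A minimal sketch that drops the ``[logging]`` section and the ``default_ccache_name`` entry of ``[libdefaults]`` from the text of a ``krb5.conf``; ``krb5_conf_to_ini`` is a hypothetical helper, and reading and writing the files is left to the caller:

.. code-block:: python

    def krb5_conf_to_ini(conf_text):
        """Return conf_text without [logging] and without libdefaults' default_ccache_name."""
        kept = []
        section = None
        for line in conf_text.splitlines():
            stripped = line.strip()
            if stripped.startswith("[") and stripped.endswith("]"):
                section = stripped.lower()
            if section == "[logging]":
                continue  # drop the header and every entry of [logging]
            key = stripped.split("=", 1)[0].strip()
            if section == "[libdefaults]" and key == "default_ccache_name":
                continue
            kept.append(line)
        return "\n".join(kept).strip() + "\n"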


Finally, edit the hosts file at ``C:/Windows/System32/drivers/etc/hosts``
and add IP mappings for ``host1`` and ``host2`` from the previous example, e.g.

.. code-block:: bash

    10.6.6.96     host1
    10.6.6.97     host2

Now you can run ``kinit`` from the command line.

Testing
=======

Tests are on the way.


