Metadata-Version: 2.1
Name: SixAdsDS
Version: 0.1.6
Summary: Generic functions for SixAds data science projects
Home-page: https://bitbucket.org/eligijus112/sixadsml/
Author: Eligijus Bujokas
Author-email: eligijus@sixads.net
License: UNKNOWN
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Description-Content-Type: text/markdown
Requires-Dist: inflection
Requires-Dist: nltk
Requires-Dist: pandas
Requires-Dist: sqlalchemy
Requires-Dist: PyYAML
Requires-Dist: keras
Requires-Dist: tensorflow
Requires-Dist: sklearn
Requires-Dist: tqdm
Requires-Dist: pymysql
Requires-Dist: opencv-python
Requires-Dist: Pillow
Requires-Dist: requests

# sixadsml

Package used by the SixAds data science department. To learn more about SixAds,
visit https://sixads.net/.

The Bitbucket repository for this package is https://bitbucket.org/eligijus112/sixadsml/src/master/

# Installation

Windows users: in the Anaconda Prompt, type:

```console
curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
python get-pip.py
pip install SixAdsDS
```

# sixadsml.clean_text

Functions for text preprocessing/summarizing.
A common way to use these functions is to combine them into a pipeline whose input is a list of strings.
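
As a rough illustration of such a pipeline, the sketch below chains three steps with `functools.reduce`. The three step functions are simplified stand-ins written for this example, not the package's own implementations, and `run_pipeline` is a hypothetical helper that is not part of the package.

```python
import re
from functools import reduce

# Simplified stand-ins for the package's cleaning steps (assumptions, not the
# real implementations). Each takes a list of strings and returns a list of
# the same length.
def to_lower(string_list):
    return [s.lower() for s in string_list]

def rm_digits(string_list):
    return [re.sub(r"\d", "", s) for s in string_list]

def clean_ws(string_list):
    return [re.sub(r"\s+", " ", s).strip() for s in string_list]

def run_pipeline(string_list, steps):
    """Hypothetical helper: apply each step in order, feeding each output to the next step."""
    return reduce(lambda acc, step: step(acc), steps, string_list)

cleaned = run_pipeline(
    ["Python  3 is AWESOME", "R is good  as well"],
    [to_lower, rm_digits, clean_ws],
)
print(cleaned)  # ['python is awesome', 'r is good as well']
```

Because every step takes and returns a list of strings, the steps can be reordered or swapped without changing the helper.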

## lemmatize_word
```python
lemmatize_word(string_list, engine=<WordNetLemmatizer>)
```

Lemmatize words using one of the WordNet engines

Parameters
----------

**string_list** : list
    List which stores strings

**engine** : WordNetLemmatizer() (default)
    An object from the nltk.stem.wordnet module

Returns
-------

    List with the same length as *string_list* where each word in each
    string is lemmatized


## to_str
```python
to_str(string_list)
```

Converts every list element to str type

Parameters
----------

**string_list** : list
    List which stores strings

Returns
-------

    List with the same length as *string_list* where every list element
    is converted to a str type object


## rm_short_words
```python
rm_short_words(string_list, lower_bound=1, upper_bound=2)
```

Removes words whose length in characters is between *lower_bound* and *upper_bound*

Parameters
----------

**string_list** : list
    List which stores strings

**lower_bound**: int
    Integer indicating the lower bound of a character length

**upper_bound**: int
    Integer indicating the upper bound of a character length

Returns
-------

    List with the same length as *string_list* where every whitespace-separated
    word is removed if its length is in the range
    [lower_bound, upper_bound]

Examples
--------
>>> string_list = ['python is awesome', 'R is good as well']

>>> rm_short_words(string_list)

>>> rm_short_words(string_list, 4, 5)
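
The two calls above can be traced with a minimal re-implementation of the documented behaviour. This is a sketch that assumes an inclusive range and splitting on whitespace; `rm_short_words_sketch` is a hypothetical name, and the package's actual code may differ in detail.

```python
def rm_short_words_sketch(string_list, lower_bound=1, upper_bound=2):
    """Drop whitespace-separated words whose length is in [lower_bound, upper_bound]."""
    result = []
    for text in string_list:
        kept = [w for w in text.split() if not (lower_bound <= len(w) <= upper_bound)]
        result.append(" ".join(kept))
    return result

string_list = ['python is awesome', 'R is good as well']

short_removed = rm_short_words_sketch(string_list)
print(short_removed)  # ['python awesome', 'good well']

mid_removed = rm_short_words_sketch(string_list, 4, 5)
print(mid_removed)    # ['python is awesome', 'R is as']
```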

## to_single
```python
to_single(string_list)
```

Converts every word in string_list to its singular form.

Parameters
----------

**string_list**: list
    List which stores strings

Returns
-------

    List with the same length as *string_list* where every word is converted
    to singular form

## to_lower
```python
to_lower(string_list)
```

Makes every word in the string_list lowercase

Parameters
----------

**string_list** : list
    List which stores strings

Returns
-------

    List with the same length as *string_list* where every word is converted
    to lowercase

## rm_stop_words
```python
rm_stop_words(string_list)
```

Removes stop words using the nltk stopwords module.

Parameters
----------

**string_list** : list
    List which stores strings

Returns
-------

    List with the same length as *string_list* where every string is without
    stopwords

## rm_punctuations
```python
rm_punctuations(string_list)
```

Removes punctuation and other special characters from a string list

Parameters
----------

**string_list** : list
    List which stores strings

Returns
-------

    List with the same length as *string_list* where every string is without
    punctuation and other special characters

## rm_digits
```python
rm_digits(string_list)
```

Removes digits from a string list

Parameters
----------

**string_list** : list
    List which stores strings

Returns
-------

    List with the same length as *string_list* where every string is without
    digits

## stem_words
```python
stem_words(string_list, stemmer=<SnowballStemmer>)
```

Stems the words in a given list of strings

Parameters
----------

**string_list** : list
    List which stores strings

**stemmer** : word stemmer from the nltk.stem library
     nltk.stem.SnowballStemmer('english') (default)

Returns
-------

    List with the same length as *string_list* where every word is stemmed

## clean_ws
```python
clean_ws(string_list)
```

Cleans up runs of one or more whitespace characters

Parameters
----------

**string_list** : list
    List which stores strings

Returns
-------

    List with the same length as *string_list* where words in every string
    are separated by at most one whitespace character


## build_vocab
```python
build_vocab(string_list, verbose=True)
```

A function that creates a term frequency vocabulary from the text

Parameters
----------

**string_list** : list
    List which stores strings

**verbose** : boolean; default=True
     Whether to show the timing of the for loop

Returns
-------

    Dictionary in which each key is a unique term in *string_list* and
    each value is the number of times that term appears across all
    strings

Example
-------
>>> string_list = ['python is awesome', 'R is awesome as well']

>>> build_vocab(string_list)
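
For orientation, a term-frequency vocabulary of this kind can be sketched with `collections.Counter`. The version below assumes terms are split on whitespace and is not the package's implementation; `build_vocab_sketch` is a hypothetical name.

```python
from collections import Counter

def build_vocab_sketch(string_list):
    """Count how often each whitespace-separated term occurs across all strings."""
    vocab = Counter()
    for text in string_list:
        vocab.update(text.split())
    return dict(vocab)

vocab = build_vocab_sketch(['python is awesome', 'R is awesome as well'])
print(vocab)
# {'python': 1, 'is': 2, 'awesome': 2, 'R': 1, 'as': 1, 'well': 1}
```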

# sixadsml.images

Functions to preprocess images from the web or a local machine

## img_read_url
```python
img_read_url(url, h=256, w=256, to_grey=False, timeout=2)
```

Reads an image from a URL

Parameters
----------

**url** : string
    url (in a string format)

**h**: int
    Desired height of the returned image (px)

**w**: int
    Desired width of the returned image (px)

**to_grey**: bool
    Should the image be returned in greyscale?

**timeout**: int
    maximum wait time before dropping the request

Returns
-------

    A numpy array with dimensions (h, w, 3) or (h, w, 1) if to_grey=True

## img_read_url_PIL
```python
img_read_url_PIL(url, h=256, w=256, timeout=2)
```

Reads an image from a URL (using PIL)

Parameters
----------

**url** : string
    url (in a string format)

**h**: int
    Desired height of the returned image (px)

**w**: int
    Desired width of the returned image (px)

**timeout**: int
    maximum wait time before dropping the request

Returns
-------
    PIL.Image.Image

## img_read
```python
img_read(path, h=256, w=256, to_grey=False)
```

Reads an image from the local machine

Parameters
----------

**path** : string
    path to image on a local machine

**h**: int
    Desired height of the returned image (px)

**w**: int
    Desired width of the returned image (px)

**to_grey**: bool
    Should the image be returned in greyscale?

Returns
-------

    Numpy array with dimensions (h, w, 3) or (h, w, 1) if to_grey=True

## return_image_hist
```python
return_image_hist(image, no_bins_per_channel=10, normalize=False)
```

Function to get the histogram of the colours in a photo

Parameters
----------

**image** : numpy ndarray
    A numpy array with the shape (x, y, 3)

**no_bins_per_channel**: int
    How many bins should a histogram have for each channel of colors

**normalize** : bool
    Should the coordinates add up to 1?

Returns
-------

    A list of size 3 * no_bins_per_channel representing the distribution
    of colors in the image
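
To make the output layout concrete, here is a dependency-free sketch that works on a nested list of pixels instead of a numpy array. It assumes 8-bit channel values (0 to 255), equal-width bins, and that the three per-channel histograms are concatenated; the package's binning may differ, and `image_hist_sketch` is a hypothetical name.

```python
def image_hist_sketch(pixels, no_bins_per_channel=10, normalize=False):
    """pixels: nested list shaped (rows, cols, 3) with channel values in 0..255."""
    hist = [0] * (3 * no_bins_per_channel)
    bin_width = 256 / no_bins_per_channel
    for row in pixels:
        for pixel in row:
            for channel, value in enumerate(pixel):
                # Clamp so that the value 255 falls in the last bin.
                bin_index = min(int(value / bin_width), no_bins_per_channel - 1)
                hist[channel * no_bins_per_channel + bin_index] += 1
    if normalize:
        total = sum(hist)
        if total:
            hist = [count / total for count in hist]
    return hist

tiny_image = [[(0, 128, 255), (0, 0, 255)]]  # 1 row, 2 pixels
counts = image_hist_sketch(tiny_image, no_bins_per_channel=2)
print(counts)  # [2, 0, 1, 1, 0, 2]

shares = image_hist_sketch(tiny_image, no_bins_per_channel=2, normalize=True)
```

With `normalize=True` the sketch divides by the total count, so the returned values sum to 1.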


# sixadsml.utility

Utility functions

## make_connection
```python
make_connection(specs)
```

Creates a connection based on the information in the *specs*. Usually,
the *specs* dictionary is the output of the *read_yaml* function

Parameters
----------

**specs** : dictionary
    A dictionary that stores the user, password, host and db keys

Returns
-------

    A SQLAlchemy connection object
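
The sketch below shows one way a *specs* dictionary could be turned into a database URL. The `mysql+pymysql` dialect is an assumption based on `pymysql` appearing in the package requirements, `build_connection_url` is a hypothetical helper that is not part of the package, and the credentials are made up.

```python
from urllib.parse import quote_plus

def build_connection_url(specs, dialect="mysql+pymysql"):
    """Hypothetical helper: build a database URL from user, password, host and db keys."""
    # Escape characters such as '@' or '/' that would otherwise break the URL.
    user = quote_plus(specs["user"])
    password = quote_plus(specs["password"])
    return f"{dialect}://{user}:{password}@{specs['host']}/{specs['db']}"

# Made-up credentials for illustration only.
specs = {"user": "analyst", "password": "p@ss/word", "host": "localhost", "db": "sixads"}
url = build_connection_url(specs)
print(url)  # mysql+pymysql://analyst:p%40ss%2Fword@localhost/sixads

# With SQLAlchemy installed, a connection could then be opened with:
#   from sqlalchemy import create_engine
#   connection = create_engine(url).connect()
```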

## exec_file
```python
exec_file(file, add_params=None)
```

Executes a file with the .py extension

Parameters
----------

**file**: string
    path to the python file

**add_params**:
    additional parameters that are used in the file that is being executed

Returns
-------

    Whatever the executed file outputs

## read_yaml
```python
read_yaml(file)
```

Reads a .yml or .yaml file

Parameters
----------

**file**: string
    path to the .yml or .yaml file

Returns
-------

    Dictionary with the .yml or .yaml file contents

## chunks_of_n
```python
chunks_of_n(l, n)
```

Splits a list into *n* equally sized parts

Parameters
----------

**l**: list

**n**: int

Returns
-------

    A list of size *n* with the items of *l* split equally
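
Reading the description as "split *l* into *n* consecutive parts", the idea can be sketched as follows. How the package handles lengths that are not divisible by *n* is not documented; this sketch lets the first parts take one extra item, and `split_into_n_parts` is a hypothetical name.

```python
def split_into_n_parts(l, n):
    """Split l into n consecutive parts whose sizes differ by at most one."""
    base, extra = divmod(len(l), n)
    parts, start = [], 0
    for i in range(n):
        size = base + (1 if i < extra else 0)  # the first `extra` parts get one more item
        parts.append(l[start:start + size])
        start += size
    return parts

parts = split_into_n_parts(list(range(7)), 3)
print(parts)  # [[0, 1, 2], [3, 4], [5, 6]]
```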

## unique
```python
unique(l)
```

A handy function to return unique elements of a list or a numpy array

Parameters
----------

**l** : list or array

Returns
-------

    A list or array containing unique elements of l
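
One common way to get unique elements while keeping first-seen order is `dict.fromkeys`. Whether the package preserves order or sorts the result is not stated here, so the sketch below is one possible behaviour rather than the package's own; `unique_sketch` is a hypothetical name.

```python
def unique_sketch(l):
    """Return the distinct items of l in order of first appearance."""
    return list(dict.fromkeys(l))

distinct = unique_sketch([3, 1, 3, 2, 1])
print(distinct)  # [3, 1, 2]
```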

# sixadsml.sql_utility

Functions to deal with downloading and writing data to the database

## Get_sql
```python
Get_sql(self, /, *args, **kwargs)
```

Class that deals with downloading data

### get_google_tree
```python
Get_sql.get_google_tree(connection)
```

Downloads the Google taxonomy tree from the SixAds database

Parameters
----------

**connection**: SQLAlchemy connection object

Returns
-------

    A pandas dataframe

### get_data
```python
Get_sql.get_data(connection, select_part, from_part, where_part='')
```

Constructs a query from the given parts and executes it

Parameters
----------

**connection**: SQLAlchemy connection object

**select_part**: list
    list of strings identifying the desired columns

**from_part**: string
    the table name

**where_part**: string
    additional constraints

Returns
-------

    A pandas dataframe
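
How the three parts might be combined into one statement can be sketched as below. This assumes *where_part* holds only the condition, without the `WHERE` keyword, which the documentation does not confirm. `build_query` is a hypothetical helper, and the table and column names are made up.

```python
def build_query(select_part, from_part, where_part=""):
    """Hypothetical helper: assemble a SELECT statement from its parts."""
    query = f"SELECT {', '.join(select_part)} FROM {from_part}"
    if where_part:
        query += f" WHERE {where_part}"
    return query

# Made-up table and column names for illustration only.
query = build_query(["id", "title"], "products", "price > 10")
print(query)  # SELECT id, title FROM products WHERE price > 10

query_all = build_query(["id"], "products")
print(query_all)  # SELECT id FROM products
```

Since the parts are interpolated directly into the SQL text, a helper like this should only receive trusted strings.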

## Write_sql
```python
Write_sql(self, /, *args, **kwargs)
```

Class that deals with writing data

### write_to_table
```python
Write_sql.write_to_table(specs, table, data, if_exists='replace')
```

Writes data to the desired table

Parameters
----------

**specs**: dictionary
    Must contain the keys user, password, host and db

**table**: string
    A string referring to the table which we want to write to

**data**: pandas dataframe
    Data which we want to write to the table

**if_exists**: string
    What to do if the table already exists. Possible string values:
    'replace', 'append', 'fail'


# sixadsml.embeddings

Class for dealing with word embeddings

## load_from_text
```python
load_from_text(path)
```

Reads word embeddings from a txt document and returns them as a dictionary

Parameters
----------

**path**: string
    path to a txt document containing the word embeddings

Returns
-------

    A dictionary where the keys are individual words and the
    values are the vectors
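
A minimal sketch of this kind of parsing is shown below. It assumes the common layout of one word per line followed by space-separated floats (as in GloVe text files), which the documentation does not specify, and it reads from an in-memory buffer rather than a path so that it runs on its own. `load_embeddings_sketch` is a hypothetical name.

```python
import io

def load_embeddings_sketch(lines):
    """Parse lines of the form 'word v1 v2 ... vn' into {word: [v1, ..., vn]}."""
    embeddings = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        embeddings[parts[0]] = [float(x) for x in parts[1:]]
    return embeddings

# Made-up vectors for illustration only.
sample = io.StringIO("cat 0.1 0.2 0.3\ndog 0.4 0.5 0.6\n")
embeddings = load_embeddings_sketch(sample)
print(embeddings)  # {'cat': [0.1, 0.2, 0.3], 'dog': [0.4, 0.5, 0.6]}
```

For a real file, the same function could be given an open file handle, for example `open(path, encoding="utf-8")`.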

## tokenize_text
```python
tokenize_text(string_list, max_features, max_len)
```

Creates a tokenizer from a given text list

Parameters
----------

**string_list**: list
    List containing strings

**max_features**: int
    The maximum number of unique words that the tokenizer saves in memory

**max_len**: int
    The length of the vector into which every element of *string_list* will
    be converted.

Returns
-------

    A tuple of the tokenized text and the fitted tokenizer for future use.
    The first element of the tuple is an array of shape (len(*string_list*), max_len)
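
The shape of the result can be illustrated without Keras. The sketch below imitates what the Keras `Tokenizer` followed by `pad_sequences` does with default settings: more frequent words get smaller indices, only indices below *max_features* are kept, and padding and truncation happen at the front. It is a simplified stand-in rather than the package's code; it lowercases and splits on whitespace only, and `tokenize_text_sketch` is a hypothetical name.

```python
from collections import Counter

def tokenize_text_sketch(string_list, max_features, max_len):
    """Return (padded integer sequences, word_index) for a list of strings."""
    counts = Counter(w for text in string_list for w in text.lower().split())
    # Index 0 is reserved for padding; the most frequent word gets index 1.
    word_index = {w: i + 1 for i, (w, _) in enumerate(counts.most_common())}
    sequences = []
    for text in string_list:
        ids = [word_index[w] for w in text.lower().split()
               if word_index[w] < max_features]
        ids = ids[-max_len:]                                 # truncate at the front
        sequences.append([0] * (max_len - len(ids)) + ids)   # pad at the front
    return sequences, word_index

sequences, word_index = tokenize_text_sketch(
    ['python is awesome', 'R is awesome as well'], max_features=4, max_len=4
)
print(sequences)  # [[0, 3, 1, 2], [0, 0, 1, 2]]
```

Every row has exactly *max_len* entries, which matches the documented shape (len(*string_list*), max_len).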

## create_embedding_matrix
```python
create_embedding_matrix(embeddings, tokenizer, max_features, embed_size=300)
```

Function to create the embedding matrix to use in neural networks. This goes
directly to the embedding layer.

Parameters
----------

**embeddings**: dictionary
    output of load_from_text() function

**tokenizer**: keras.Tokenizer object
    output of tokenize_text() function

**max_features**: int
    how many unique tokens to use

**embed_size**: int
    number of dimensions of each embedding vector; default=300

Returns
-------

    A numpy.ndarray of shape (max_features, embed_size)
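
The construction can be sketched as follows: row *i* of the matrix holds the vector of the word whose token index is *i*. This is a dependency-free illustration that uses nested lists instead of a numpy array. A plain word-to-index dictionary stands in for the fitted tokenizer (Keras tokenizers expose this mapping as `word_index`), and leaving rows of unknown words at zero is an assumption. `embedding_matrix_sketch` is a hypothetical name.

```python
def embedding_matrix_sketch(embeddings, word_index, max_features, embed_size=300):
    """Build a max_features x embed_size matrix; row i is the vector of the word with index i."""
    matrix = [[0.0] * embed_size for _ in range(max_features)]
    for word, i in word_index.items():
        if i >= max_features:
            continue  # word falls outside the kept vocabulary
        vector = embeddings.get(word)
        if vector is not None:
            matrix[i] = list(vector)
    return matrix

# Made-up two-dimensional vectors for illustration only.
toy_embeddings = {"is": [1.0, 2.0], "python": [3.0, 4.0]}
toy_word_index = {"is": 1, "awesome": 2, "python": 3, "r": 4}
matrix = embedding_matrix_sketch(toy_embeddings, toy_word_index, max_features=4, embed_size=2)
print(matrix)  # [[0.0, 0.0], [1.0, 2.0], [0.0, 0.0], [3.0, 4.0]]
```

Row 0 stays at zero because index 0 is reserved for padding, and "awesome" stays at zero because it has no vector in the toy dictionary.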



