Metadata-Version: 2.1
Name: cane
Version: 0.0.1.6.1
Summary: Cane - Categorical Attribute traNsformation Environment
Home-page: https://github.com/Metalkiler/Cane-Categorical-Attribute-traNsformation-Environment
Author: Luís Miguel Matos, Paulo Cortez, Rui Mendes
Author-email: luis.matos@dsi.uminho.pt
License: MIT
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.6
Description-Content-Type: text/markdown
Requires-Dist: bounded-pool-executor (==0.0.3)
Requires-Dist: numpy (==1.18.4)
Requires-Dist: pandas (==1.0.4)
Requires-Dist: pqdm (==0.1.0)
Requires-Dist: python-dateutil (==2.8.1)
Requires-Dist: pytz (==2020.1)
Requires-Dist: tqdm (==4.46.0)
Requires-Dist: typing-extensions (==3.7.4.2)

# Cane - Categorical Attribute traNsformation Environment 
Cane is a simple but powerful preprocessing package for machine learning.


At the moment, it offers 3 preprocessing methods:

--> The Percentage Categorical Pruned (PCP) method merges all the least frequent levels (summing up to "perc" percent) into a single level, for example an "Others" category, as presented in (https://doi.org/10.1109/IJCNN.2019.8851888). It can be useful when dealing with a large number of categorical levels (e.g., city data).

--> The Inverse Document Frequency (IDF) method encodes the categorical levels as frequency-based values, where the closer a value is to 0, the more frequent the level is (https://ieeexplore.ieee.org/document/8710472).

--> Finally, it also implements a simple, standard One-Hot-Encoding method.
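
The ideas behind PCP and IDF can be illustrated with plain pandas. The following is a minimal sketch and not Cane's implementation: `pcp_sketch` and `idf_sketch` are hypothetical helper names, the rule "merge the rarest levels while their cumulative share stays within `perc`" is one reading of the description above, and the formula ln(N / count) is an assumed IDF variant chosen because it gives values closer to 0 for more frequent levels.

```python
import math

import pandas as pd


def pcp_sketch(column: pd.Series, perc: float = 0.05,
               merge_label: str = "Others") -> pd.Series:
    """Merge the rarest levels (cumulative share <= perc) into one level."""
    # Relative frequency of each level, rarest first.
    freq = column.value_counts(normalize=True, ascending=True)
    # Levels whose running total of shares stays within `perc` are merged.
    rare = freq.index[freq.cumsum() <= perc]
    # Keep values that are not rare; replace the rest with the merge label.
    return column.where(~column.isin(rare), merge_label)


def idf_sketch(column: pd.Series) -> pd.Series:
    """Replace each level by ln(N / count); frequent levels map near 0."""
    counts = column.value_counts()
    n = len(column)
    weights = {level: math.log(n / count) for level, count in counts.items()}
    return column.map(weights)


# Toy column: "d" and "e" together cover 11 of 161 rows (about 6.8%).
levels = pd.Series(["a"] * 30 + ["b"] * 50 + ["c"] * 70 + ["d"] * 10 + ["e"] * 1)

pruned = pcp_sketch(levels, perc=0.1)   # "d" and "e" become "Others"
encoded = idf_sketch(levels)            # one numeric value per row

print(pruned.value_counts())
print(encoded.groupby(levels).first())
```

In this sketch the most frequent level ("c") receives the smallest IDF value and the rarest level ("e") the largest, which matches the "closer to 0 means more frequent" behaviour described above.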




# Installation

To install this package, run the following command:

``` cmd
pip install cane
```

# Suggestions and feedback
Any feedback would be appreciated.
For questions and other suggestions, contact luis.matos@dsi.uminho.pt.


# Example
``` python
import pandas as pd
import cane
x = [k for s in ([k] * n for k, n in [('a', 30000), ('b', 50000), ('c', 70000), ('d', 10000), ('e', 1000)]) for k in s]
df = pd.DataFrame({f'x{i}' : x for i in range(1, 13)})

dataPCP, dictionary = cane.pcp(df)  # uses the PCP method with 1 core
dataPCP, dictionary = cane.pcp(df, n_coresJob=2)  # uses the PCP method with 2 cores
dataPCP, dictionary = cane.pcp(df, n_coresJob=2, disableLoadBar=False)  # with the progress bar

dataIDF = cane.idf(df)  # uses the IDF method with 1 core
dataIDF = cane.idf(df, n_coresJob=2)  # uses the IDF method with 2 cores
dataIDF = cane.idf(df, n_coresJob=2, disableLoadBar=False)  # with the progress bar

dataH = cane.one_hot(df)  # without a column prefix
dataH2 = cane.one_hot(df, column_prefix='column')  # uses the original column name as the prefix
# (useful when dealing with id number columns)
dataH3 = cane.one_hot(df, column_prefix='customColName')  # uses a custom prefix defined by
# the value of column_prefix
dataH4 = cane.one_hot(df, column_prefix='column', n_coresJob=2)  # original column name as the prefix,
# with 2 cores

dataH5 = cane.one_hot(df, column_prefix='column', n_coresJob=2,
                      disableLoadBar=False)  # original column name as the prefix,
# with 2 cores and the progress bar active

```
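
To show why prefixing by column name helps with id-like columns, here is a small pandas-only sketch. It uses `pandas.get_dummies` rather than Cane, so the exact column names Cane produces are not shown here and may differ; the toy frame and its values are made up for illustration.

```python
import pandas as pd

# Two id-like columns that share the level "2".
df = pd.DataFrame({"x1": ["1", "2", "1"], "x2": ["2", "3", "3"]})

# For a DataFrame, get_dummies prefixes each indicator column with the
# name of the column it came from, so "2" in x1 and "2" in x2 stay apart.
by_column = pd.get_dummies(df)
print(list(by_column.columns))

# A custom prefix applied to a single column.
custom = pd.get_dummies(df["x1"], prefix="customColName")
print(list(custom.columns))
```

Without a per-column prefix, the shared level "2" would produce two indicator columns with the same name, which is the situation the original column name prefix avoids.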


