Metadata-Version: 1.0
Name: arbok
Version: 0.1.8
Summary: A wrapper toolbox that provides a compatibility layer between TPOT/Auto-Sklearn and OpenML
Home-page: https://github.com/Yatoom/arbok
Author: Jeroen van Hoof
Author-email: jeroen@jeroenvanhoof.nl
License: UNKNOWN
Description: Arbok
        =====
        
        Arbok (**A**\ utoml w\ **r**\ apper tool\ **b**\ ox for **o**\ penml
        **c**\ ompatibility) provides wrappers for TPOT and Auto-Sklearn, as a
        compatibility layer between these tools and OpenML.
        
        The wrapper extends Sklearn’s ``BaseSearchCV`` and provides all the
        internal parameters that OpenML needs, such as ``cv_results_``,
        ``best_index_``, ``best_params_``, ``best_score_`` and ``classes_``.
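
        Because the wrapper follows the ``BaseSearchCV`` interface, these
        attributes can be read like on any scikit-learn search object. As a
        rough analogue (using plain ``GridSearchCV`` instead of arbok, on
        scikit-learn's bundled iris data):

        .. code:: python

            from sklearn.datasets import load_iris
            from sklearn.model_selection import GridSearchCV
            from sklearn.tree import DecisionTreeClassifier

            # Any BaseSearchCV subclass exposes the same attributes that
            # OpenML reads from the arbok wrappers.
            X, y = load_iris(return_X_y=True)
            search = GridSearchCV(
                DecisionTreeClassifier(random_state=0),
                {"max_depth": [1, 2, 3]}, cv=3
            )
            search.fit(X, y)

            print(search.best_index_, search.best_params_)
            print(search.classes_)  # [0 1 2] for iris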
        
        Installation
        ------------
        
        ::
        
            pip install arbok
        
        Simple example
        --------------
        
        .. code:: python
        
            import openml
            from arbok import AutoSklearnWrapper, TPOTWrapper
        
        
            task = openml.tasks.get_task(31)
            dataset = task.get_dataset()
        
            # Get the AutoSklearn wrapper and pass parameters like you would to AutoSklearn
            clf = AutoSklearnWrapper(
                time_left_for_this_task=3600, per_run_time_limit=360
            )
        
            # Or get the TPOT wrapper and pass parameters like you would to TPOT
            clf = TPOTWrapper(
                generations=100, population_size=100, verbosity=2
            )
        
            # Execute the task
            run = openml.runs.run_model_on_task(task, clf)
            run.publish()
        
            print('URL for run: %s/run/%d' % (openml.config.server, run.run_id))
        
        Preprocessing data
        ------------------
        
        To make the wrapper more robust, we need to preprocess the data: fill
        in missing values and one-hot encode categorical data.
        
        First, we get a mask that tells us whether each feature is
        categorical.
        
        .. code:: python
        
            dataset = task.get_dataset()
            _, categorical = dataset.get_data(return_categorical_indicator=True)
            categorical = categorical[:-1]  # Remove last index (which is the class)
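
        The indicator is simply a boolean mask over the feature columns. With
        made-up values, this is how such a mask selects columns:

        .. code:: python

            import numpy as np

            # Hypothetical indicator for a dataset with 4 features; the last
            # entry corresponds to the class column and is dropped, as above.
            categorical = [True, False, False, True, False]
            categorical = categorical[:-1]

            X = np.array([[0.0, 1.5, 2.0, 1.0],
                          [1.0, 0.5, 3.0, 0.0]])
            mask = np.array(categorical)
            nominal = X[:, mask]   # columns 0 and 3
            numeric = X[:, ~mask]  # columns 1 and 2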
        
        Next, we set up a pipeline for the preprocessing. We use a
        ``ConditionalImputer``, an imputer that can apply different strategies
        to categorical (nominal) and numerical data.
        
        .. code:: python
        
            from sklearn.pipeline import make_pipeline
            from sklearn.preprocessing import OneHotEncoder
            from arbok import ConditionalImputer
        
            preprocessor = make_pipeline(
        
                ConditionalImputer(
                    categorical_features=categorical,
                    strategy="mean",
                    strategy_nominal="most_frequent"
                ),
                
                OneHotEncoder(
                    categorical_features=categorical, handle_unknown="ignore", sparse=False
                )
            )
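
        The two-strategy idea can be sketched in plain NumPy; this is only an
        illustration of the concept, not arbok's actual implementation:

        .. code:: python

            import numpy as np

            # Mean-impute numeric columns, mode-impute nominal ones.
            X = np.array([[1.0, 0.0],
                          [np.nan, 1.0],
                          [3.0, np.nan],
                          [4.0, 1.0]])
            categorical = [False, True]

            X_imp = X.copy()
            for j, is_nominal in enumerate(categorical):
                col = X_imp[:, j]
                missing = np.isnan(col)
                if is_nominal:
                    values, counts = np.unique(col[~missing], return_counts=True)
                    fill = values[np.argmax(counts)]  # most frequent value
                else:
                    fill = col[~missing].mean()       # column mean
                col[missing] = fill

            print(X_imp)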
        
        And finally, we put everything together in one of the wrappers.
        
        .. code:: python
        
            clf = AutoSklearnWrapper(
                preprocessor=preprocessor, time_left_for_this_task=3600, per_run_time_limit=360
            )
        
        Limitations
        ~~~~~~~~~~~
        
        -  Currently, only classifiers are implemented, so regression is not
           possible.
        -  For TPOT, the ``config_dict`` variable cannot be set, because it
           causes problems with the API.
        
        Benchmarking
        ------------
        
        Installing the ``arbok`` package also installs the ``arbench`` CLI
        tool. We can generate a JSON configuration file like this:
        
        .. code:: python
        
            from arbok.bench import Benchmark
            bench = Benchmark()
            config_file = bench.create_config_file(
                   
                # Wrapper parameters
                wrapper={"refit": True, "verbose": False, "retry_on_error": True},
                
                # TPOT parameters
                tpot={
                    "max_time_mins": 6,              # Max total time in minutes
                    "max_eval_time_mins": 1          # Max time per candidate in minutes
                },
                
                # Autosklearn parameters
                autosklearn={
                    "time_left_for_this_task": 360,  # Max total time in seconds
                    "per_run_time_limit": 60         # Max time per candidate in seconds
                }
            )
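
        The exact schema of the generated file is up to arbok, but given the
        arguments above, ``config.json`` plausibly contains one section per
        parameter group, roughly:

        .. code:: json

            {
                "wrapper": {"refit": true, "verbose": false, "retry_on_error": true},
                "tpot": {"max_time_mins": 6, "max_eval_time_mins": 1},
                "autosklearn": {"time_left_for_this_task": 360, "per_run_time_limit": 60}
            }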
        
        Then we can call ``arbench`` like this:
        
        .. code:: bash
        
            arbench --classifier autosklearn --task-id 31 --config config.json
        
        Or call arbok as a Python module:
        
        .. code:: bash
        
            python -m arbok --classifier autosklearn --task-id 31 --config config.json
        
        Running a benchmark on batch systems
        ------------------------------------
        
        To run a large-scale benchmark, we can create a configuration file as
        above, then generate and submit jobs to a batch system as follows.
        
        .. code:: python
        
            # We create a benchmark setup where we specify the headers, the
            # interpreter we want to use, the directory where we store the
            # jobs (.sh-files), and the config file we created earlier.
            from arbok.bench import Benchmark

            bench = Benchmark(
                headers="#PBS -lnodes=1:cpu3\n#PBS -lwalltime=1:30:00",
                python_interpreter="python3",  # Path to interpreter
                root="/path/to/project/",
                jobs_dir="jobs",
                config_file="config.json",
                log_file="log.json"
            )
        
            # Create the config file like we did in the section above
            config_file = bench.create_config_file(
                   
                # Wrapper parameters
                wrapper={"refit": True, "verbose": False, "retry_on_error": True},
                
                # TPOT parameters
                tpot={
                    "max_time_mins": 6,              # Max total time in minutes
                    "max_eval_time_mins": 1          # Max time per candidate in minutes
                },
                
                # Autosklearn parameters
                autosklearn={
                    "time_left_for_this_task": 360,  # Max total time in seconds
                    "per_run_time_limit": 60         # Max time per candidate in seconds
                }
            )
        
            # Next, we load the tasks we want to benchmark on from OpenML.
            # In this case, we load the list of task IDs from study 99.
            import openml
            tasks = openml.study.get_study(99).tasks
        
            # Next, we create jobs for both tpot and autosklearn.
            bench.create_jobs(tasks, classifiers=["tpot", "autosklearn"])
        
            # And finally, we submit the jobs using qsub
            bench.submit_jobs()
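
        Each generated job is a plain shell script. Given the headers and
        interpreter above, a job file plausibly looks something like this
        (illustrative only; the exact contents are up to arbok):

        .. code:: bash

            #PBS -lnodes=1:cpu3
            #PBS -lwalltime=1:30:00

            cd /path/to/project/
            python3 -m arbok --classifier tpot --task-id 31 --config config.json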
        
        Preprocessing parameters
        ------------------------
        
        .. code:: python
        
            from arbok import ParamPreprocessor
            import numpy as np
            from sklearn.feature_selection import VarianceThreshold
            from sklearn.pipeline import make_pipeline
        
            X = np.array([
                [1, 2, True, "foo", "one"],
                [1, 3, False, "bar", "two"],
                [np.nan, "bar", None, None, "three"],
                [1, 7, 0, "zip", "four"],
                [1, 9, 1, "foo", "five"],
                [1, 10, 0.1, "zip", "six"]
            ], dtype=object)
        
            types = ["numeric", "mixed", "bool", "nominal", "nominal"]
        
            pipeline = make_pipeline(ParamPreprocessor(types), VarianceThreshold())
        
            pipeline.fit_transform(X)
        
        Output:
        
        ::
        
            [[-0.4472136  -0.4472136   1.41421356 -0.70710678 -0.4472136  -0.4472136
               2.23606798 -0.4472136  -0.4472136  -0.4472136   0.4472136  -0.4472136
              -0.85226648  1.        ]
             [-0.4472136   2.23606798 -0.70710678 -0.70710678 -0.4472136  -0.4472136
              -0.4472136  -0.4472136  -0.4472136   2.23606798  0.4472136  -0.4472136
              -0.5831297  -1.        ]
             [ 2.23606798 -0.4472136  -0.70710678 -0.70710678 -0.4472136  -0.4472136
              -0.4472136  -0.4472136   2.23606798 -0.4472136  -2.23606798  2.23606798
              -1.39054004 -1.        ]
             [-0.4472136  -0.4472136  -0.70710678  1.41421356 -0.4472136   2.23606798
              -0.4472136  -0.4472136  -0.4472136  -0.4472136   0.4472136  -0.4472136
               0.49341743 -1.        ]
             [-0.4472136  -0.4472136   1.41421356 -0.70710678  2.23606798 -0.4472136
              -0.4472136  -0.4472136  -0.4472136  -0.4472136   0.4472136  -0.4472136
               1.031691    1.        ]
             [-0.4472136  -0.4472136  -0.70710678  1.41421356 -0.4472136  -0.4472136
              -0.4472136   2.23606798 -0.4472136  -0.4472136   0.4472136  -0.4472136
               1.30082778  1.        ]]
        
Platform: UNKNOWN
