Automated Runs

Xcessiv includes support for various algorithms that aim to provide automation for things such as hyperparameter optimization and base learner/pipeline construction.

Once you begin an automated run, Xcessiv will take care of updating your base learner setups/base learners for you while you go do something else.

Xcessiv currently supports three types of automated runs: Bayesian Hyperparameter Search, TPOT base learner construction, and Greedy Forward Model Selection.

TPOT base learner construction

Xcessiv is great for tuning different pipelines/base learners and stacking them together, but with the sheer number of possible pipeline combinations, it helps to have something that can build a good pipeline for you automatically.

This is exactly what TPOT promises to do for you.

As of v0.4, Xcessiv has built-in support for directly exporting the pipeline code generated by TPOT as a base learner setup in Xcessiv.

Right next to the Add new base learner origin button, click on the Automated base learner generation with TPOT button. In the modal that pops up, enter the following code:

from tpot import TPOTClassifier

tpot_learner = TPOTClassifier(generations=5, population_size=50, verbosity=2)

To use TPOT, simply define a TPOTClassifier or TPOTRegressor and assign it to the variable tpot_learner. The arguments for TPOTClassifier and TPOTRegressor can be found in the TPOT API documentation.
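
For a regression task, the setup is analogous; a minimal sketch (the argument values here are just illustrative):

from tpot import TPOTRegressor

tpot_learner = TPOTRegressor(generations=5, population_size=50, verbosity=2)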

When you click Go, a new automated run will be created that runs tpot_learner on your training data, then creates a new base learner setup containing the code for the best pipeline found by TPOT.

Once TPOT is finished, you’ll likely end up with something like this in your newly generated base learner:

import numpy as np

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# NOTE: Make sure that the class is labeled 'class' in the data file
tpot_data = np.recfromcsv('PATH/TO/DATA/FILE', delimiter='COLUMN_SEPARATOR', dtype=np.float64)
features = np.delete(tpot_data.view(np.float64).reshape(tpot_data.size, -1), tpot_data.dtype.names.index('class'), axis=1)
training_features, testing_features, training_classes, testing_classes = \
    train_test_split(features, tpot_data['class'], random_state=42)

exported_pipeline = make_pipeline(
    Normalizer(norm="max"),
    ExtraTreesClassifier(bootstrap=False, criterion="entropy", max_features=0.15, min_samples_leaf=7, min_samples_split=13, n_estimators=100)
)

exported_pipeline.fit(training_features, training_classes)
results = exported_pipeline.predict(testing_features)

To convert it to an Xcessiv-compatible base learner, remove all the unneeded parts and modify the code to this:

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

base_learner = make_pipeline(
    Normalizer(norm="max"),
    ExtraTreesClassifier(bootstrap=False, criterion="entropy", max_features=0.15, min_samples_leaf=7, min_samples_split=13, n_estimators=100, random_state=8)
)

Notice two changes: we renamed exported_pipeline to base_learner to follow the Xcessiv format, and set the random_state parameter in the sklearn.ensemble.ExtraTreesClassifier object to 8 for determinism.

Set the name, meta-feature generator, and metrics for your base learner setup as usual, then verify and confirm. You will now be able to use your curated pipeline as any other base learner in your Xcessiv workflow.

Greedy Forward Model Selection

Stacking is usually reserved as the last step of the Xcessiv process, after you’ve squeezed out all you can from pipeline and hyperparameter optimization. When creating stacked ensembles, you can usually expect the ensemble’s performance to be better than that of any single base learner in it.

The problem here lies in figuring out which base learners to include in your ensemble. Stacking together the top N base learners is a good first strategy, but not always optimal. Even if a base learner doesn’t perform that well on its own, it could still provide brand new information to the secondary learner, thereby boosting the entire ensemble’s performance even further. One way to look at it is that the extra base learner gives the secondary learner a new angle from which to view the problem and make better judgments.

Figuring out which base learners to add to a stacked ensemble is much like hyperparameter optimization. You can’t really be sure if something will work until you try it. Unfortunately, trying out every possible combination of base learners is infeasible when you have hundreds of base learners to choose from.

Xcessiv provides an automated ensemble construction method based on a heuristic process called greedy forward model selection. This method is adapted from Ensemble Selection from Libraries of Models by Caruana et al.

In a nutshell, the algorithm is as follows:

  1. Start with the empty ensemble
  2. Add to the ensemble the model in the library that maximizes the ensemble’s performance on the error metric.
  3. Repeat step 2 for a fixed number of iterations or until all models have been used.

That’s it!
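
To make the procedure concrete, here is a minimal sketch of greedy forward selection (not Xcessiv’s internal implementation). It assumes you already have each base learner’s cross-validated meta-features as a list of NumPy arrays called meta_features (e.g. one predict_proba output array per base learner) and a target vector y, and it uses Logistic Regression as the secondary learner:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def ensemble_score(meta_features, indices, y):
    # Stack the chosen base learners' meta-features side by side and
    # score the resulting stacked ensemble with the secondary learner
    X = np.hstack([meta_features[i] for i in indices])
    return cross_val_score(LogisticRegression(), X, y, scoring='accuracy').mean()

def greedy_forward_selection(meta_features, y, max_num_base_learners=6):
    selected, remaining = [], list(range(len(meta_features)))
    best_indices, best_score = None, float('-inf')
    for _ in range(max_num_base_learners):
        if not remaining:
            break  # all models have been used (step 3)
        # Step 2: add the model that maximizes the ensemble's performance
        scores = {idx: ensemble_score(meta_features, selected + [idx], y)
                  for idx in remaining}
        chosen = max(scores, key=scores.get)
        selected.append(chosen)
        remaining.remove(chosen)
        # Remember the best ensemble seen across all iterations
        if scores[chosen] > best_score:
            best_indices, best_score = list(selected), scores[chosen]
    return best_indices, best_score

Each iteration evaluates only one candidate ensemble per remaining base learner, so the search stays tractable even with hundreds of base learners, at the cost of possibly missing combinations that only shine together.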

To perform greedy forward model selection in Xcessiv, simply click on the Automated ensemble search button in the Stacked Ensemble section.

Select your secondary learner in the configuration modal (Logistic Regression is a good first choice for classification tasks), copy the following code into the code box, and click Go to start your automated run:

secondary_learner_hyperparameters = {}  # hyperparameters of secondary learner

metric_to_optimize = 'Accuracy'  # metric to optimize

invert_metric = False  # Whether or not to invert metric e.g. optimizing a loss

max_num_base_learners = 6  # Maximum size of ensemble to consider (the higher this is, the longer the run will take)

secondary_learner_hyperparameters is a dictionary containing the hyperparameters for your chosen secondary learner. Again, an empty dictionary signifies default parameters.

metric_to_optimize and invert_metric mean the same things they do in Bayesian Hyperparameter Search.

max_num_base_learners refers to the total number of iterations of the algorithm. As such, this also signifies the maximum number of base learners that a stacked ensemble found through this automated run can contain. Please note that the higher this number is, the longer the search will run.
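
Putting these together, a filled-in configuration for a run that optimizes a loss metric might look like the sketch below. The metric name and hyperparameter values are purely illustrative; metric_to_optimize must match a metric you actually defined in your base learner setups:

secondary_learner_hyperparameters = {'C': 0.1}  # e.g. regularization strength for Logistic Regression

metric_to_optimize = 'Brier Score Loss'  # illustrative; use a metric defined in your setups

invert_metric = True  # a loss is better when lower, so invert it

max_num_base_learners = 10  # allow ensembles of up to 10 base learners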

Unlike TPOT pipeline construction and Bayesian optimization, which both have an element of randomness, greedy forward model selection will always explore the same ensembles if the pool of base learners remains unchanged.