Resampling Strategies

In auto-sklearn, it is possible to use different resampling strategies by specifying the arguments resampling_strategy and resampling_strategy_arguments. The following example shows common settings for the AutoSklearnClassifier.

import numpy as np
import sklearn.model_selection
import sklearn.datasets
import sklearn.metrics

import autosklearn.classification

Data Loading

X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = \
    sklearn.model_selection.train_test_split(X, y, random_state=1)

Holdout

automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=120,
    per_run_time_limit=30,
    tmp_folder='/tmp/autosklearn_resampling_example_tmp',
    disable_evaluator_output=False,
    # 'holdout' with 'train_size'=0.67 is the default setting for
    # AutoSklearnClassifier. It is specified explicitly in this example
    # for demonstration purposes.
    resampling_strategy='holdout',
    resampling_strategy_arguments={'train_size': 0.67},
)
automl.fit(X_train, y_train, dataset_name='breast_cancer')

Out:

AutoSklearnClassifier(per_run_time_limit=30,
                      resampling_strategy_arguments={'train_size': 0.67},
                      time_left_for_this_task=120,
                      tmp_folder='/tmp/autosklearn_resampling_example_tmp')

Get the Score of the final ensemble

predictions = automl.predict(X_test)
print("Accuracy score holdout: ", sklearn.metrics.accuracy_score(y_test, predictions))

Out:

Accuracy score holdout:  0.958041958041958
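
Besides 'holdout' and 'cv', auto-sklearn also documents further string-valued strategies such as 'holdout-iterative-fit', 'cv-iterative-fit' and 'partial-cv', which evaluate configurations with iterative fitting. The snippet below is a minimal sketch and not part of the timed example above; the exact set of accepted strings may vary between auto-sklearn versions, and the name automl_iter is only illustrative.

# Sketch only: holdout evaluation with iterative fitting. The option name is
# taken from the auto-sklearn documentation and may differ between versions.
automl_iter = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=120,
    per_run_time_limit=30,
    resampling_strategy='holdout-iterative-fit',
    resampling_strategy_arguments={'train_size': 0.67},
)
# Fitting would then work exactly as above:
# automl_iter.fit(X_train, y_train, dataset_name='breast_cancer')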

Cross-validation

automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=120,
    per_run_time_limit=30,
    tmp_folder='/tmp/autosklearn_resampling_example_tmp',
    disable_evaluator_output=False,
    resampling_strategy='cv',
    resampling_strategy_arguments={'folds': 5},
)
automl.fit(X_train, y_train, dataset_name='breast_cancer')

# Models trained during cross-validation can be used directly to predict
# on unseen data. For this, all k models trained during k-fold
# cross-validation are treated as a single soft-voting ensemble inside
# the ensemble constructed with ensemble selection.
print('Before re-fit')
predictions = automl.predict(X_test)
print("Accuracy score CV", sklearn.metrics.accuracy_score(y_test, predictions))

Out:

Before re-fit
Accuracy score CV 0.965034965034965

Perform a refit

During fit(), models are fitted on the individual cross-validation folds. To use all available data, we call refit(), which trains every model in the final ensemble on the whole training set.

print('After re-fit')
automl.refit(X_train.copy(), y_train.copy())
predictions = automl.predict(X_test)
print("Accuracy score CV", sklearn.metrics.accuracy_score(y_test, predictions))

Out:

After re-fit
Accuracy score CV 0.965034965034965

scikit-learn splitter objects

It is also possible to pass scikit-learn's splitter objects as the resampling strategy in order to customize further how the data is split.
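
For example, a stratified k-fold splitter can be passed directly. The following is a minimal sketch only: the names cv_object and automl_skf are illustrative, and it assumes the installed auto-sklearn version accepts scikit-learn cross-validator objects in the same way it accepts the PredefinedSplit used below.

# Sketch: pass a scikit-learn cross-validator object directly as the
# resampling strategy (illustrative only, not part of the timed example).
cv_object = sklearn.model_selection.StratifiedKFold(
    n_splits=5, shuffle=True, random_state=1
)

automl_skf = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=120,
    per_run_time_limit=30,
    resampling_strategy=cv_object,
)
# automl_skf.fit(X_train, y_train, dataset_name='breast_cancer')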

In case one needs 100% control over the splitting, scikit-learn's PredefinedSplit can be used. Below is an example of using a predefined split: we split the training data based on whether the first feature lies below its mean. In practice, one would choose a split suited to the use case at hand.

selected_indices = (X_train[:, 0] < np.mean(X_train[:, 0])).astype(int)
resampling_strategy = sklearn.model_selection.PredefinedSplit(
    test_fold=selected_indices
)

automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=120,
    per_run_time_limit=30,
    tmp_folder='/tmp/autosklearn_resampling_example_tmp',
    disable_evaluator_output=False,
    resampling_strategy=resampling_strategy,
)
automl.fit(X_train, y_train, dataset_name='breast_cancer')

print(automl.sprint_statistics())

Out:

auto-sklearn results:
  Dataset name: breast_cancer
  Metric: accuracy
  Best validation score: 0.964789
  Number of target algorithm runs: 29
  Number of successful target algorithm runs: 29
  Number of crashed target algorithm runs: 0
  Number of target algorithms that exceeded the time limit: 0
  Number of target algorithms that exceeded the memory limit: 0

For custom resampling strategies (i.e., resampling strategies that are not one of the strings predefined by auto-sklearn) it is necessary to perform a refit:

automl.refit(X_train, y_train)

Out:

AutoSklearnClassifier(per_run_time_limit=30,
                      resampling_strategy=PredefinedSplit(test_fold=array([0, 0, ..., 1, 1])),
                      time_left_for_this_task=120,
                      tmp_folder='/tmp/autosklearn_resampling_example_tmp')

Get the Score of the final ensemble (again)

This score is somewhat worse than the previous ones, as we “destroyed” the structure of the dataset by splitting it on the first feature instead of splitting it at random.

predictions = automl.predict(X_test)
print("Accuracy score custom split", sklearn.metrics.accuracy_score(y_test, predictions))

Out:

Accuracy score custom split 0.951048951048951

Total running time of the script: (6 minutes 42.991 seconds)
