Tabular Classification with Different Resampling Strategies

The following example shows how to fit a sample classification model with different resampling strategies in AutoPyTorch. By default, AutoPyTorch uses holdout validation with a 67% train size split, which corresponds to 'val_share': 0.33.
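As a rough illustration of what a holdout split with 'val_share': 0.33 means, the sketch below shuffles sample indices and sets aside about a third of them for validation. It is plain Python and not AutoPyTorch's own implementation; the helper name `holdout_split` and the seed are assumptions made for this example.

```python
import random


def holdout_split(n_samples, val_share=0.33, seed=0):
    """Shuffle indices and split them into train and validation parts.

    Hypothetical helper for illustration only.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Number of samples reserved for validation.
    n_val = int(round(n_samples * val_share))
    return indices[n_val:], indices[:n_val]


train_idx, val_idx = holdout_split(100, val_share=0.33)
print(len(train_idx), len(val_idx))  # 67 33
```

Every sample ends up in exactly one of the two parts, so the validation score is computed on data the model did not see during training.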

import os
import tempfile as tmp
import warnings

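# Keep joblib's temporary files in the system temp directory and
# limit the numerical libraries to a single thread each.
os.environ['JOBLIB_TEMP_FOLDER'] = tmp.gettempdir()
os.environ['OMP_NUM_THREADS'] = '1'
os.environ['OPENBLAS_NUM_THREADS'] = '1'
os.environ['MKL_NUM_THREADS'] = '1'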

warnings.simplefilter(action='ignore', category=UserWarning)
warnings.simplefilter(action='ignore', category=FutureWarning)

import sklearn.datasets
import sklearn.model_selection

from autoPyTorch.api.tabular_classification import TabularClassificationTask
from autoPyTorch.datasets.resampling_strategy import CrossValTypes, HoldoutValTypes

Default Resampling Strategy

Data Loading

X, y = sklearn.datasets.fetch_openml(data_id=40981, return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X,
    y,
    random_state=1,
)

Build and fit a classifier with default resampling strategy

api = TabularClassificationTask(
    # 'HoldoutValTypes.holdout_validation' with 'val_share': 0.33
    # is the default argument setting for TabularClassificationTask.
    # It is specified explicitly here for demonstration purposes.
    resampling_strategy=HoldoutValTypes.holdout_validation,
    resampling_strategy_args={'val_share': 0.33}
)

Search for an ensemble of machine learning algorithms

api.search(
    X_train=X_train,
    y_train=y_train,
    X_test=X_test.copy(),
    y_test=y_test.copy(),
    optimize_metric='accuracy',
    total_walltime_limit=150,
    func_eval_time_limit_secs=30
)
<autoPyTorch.api.tabular_classification.TabularClassificationTask object at 0x7f9aa456f820>

Cross-Validation Resampling Strategy

Build and fit a classifier with a cross-validation resampling strategy

api = TabularClassificationTask(
    resampling_strategy=CrossValTypes.k_fold_cross_validation,
    resampling_strategy_args={'num_splits': 3}
)
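To show what 'num_splits': 3 implies, the sketch below partitions sample indices into three folds and uses each fold once as the validation set. It is a simplified, unshuffled version written in plain Python; the function name `k_fold_indices` is hypothetical, and AutoPyTorch's actual splitter may shuffle or otherwise differ.

```python
def k_fold_indices(n_samples, num_splits=3):
    """Yield (train, validation) index lists for contiguous folds.

    Hypothetical helper for illustration only.
    """
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [
        n_samples // num_splits + (1 if i < n_samples % num_splits else 0)
        for i in range(num_splits)
    ]
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size


folds = list(k_fold_indices(10, num_splits=3))
for train, val in folds:
    print(len(train), val)
```

Each sample is used for validation exactly once, so every candidate model is trained three times. Under the same time limits, this usually leaves room for fewer configurations than a single holdout split.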

Search for an ensemble of machine learning algorithms

api.search(
    X_train=X_train,
    y_train=y_train,
    X_test=X_test.copy(),
    y_test=y_test.copy(),
    optimize_metric='accuracy',
    total_walltime_limit=150,
    func_eval_time_limit_secs=30
)
<autoPyTorch.api.tabular_classification.TabularClassificationTask object at 0x7f9aa4a83970>

Print the final ensemble performance

y_pred = api.predict(X_test)
score = api.score(y_pred, y_test)
print(score)
# Print the final ensemble built by AutoPyTorch
print(api.show_models())

# Print statistics from search
print(api.sprint_statistics())
{'accuracy': 0.8728323699421965}
|    | Preprocessing                                                                                  | Estimator                                                       |   Weight |
|---:|:-----------------------------------------------------------------------------------------------|:----------------------------------------------------------------|---------:|
|  0 | SimpleImputer,Variance Threshold,NoCoalescer,OneHotEncoder,NoScaler,LinearSVC Preprocessor     | embedding,ShapedMLPBackbone,FullyConnectedHead,nn.Sequential    |     0.56 |
|  1 | None                                                                                           | TabularTraditionalModel                                         |     0.16 |
|  2 | None                                                                                           | TabularTraditionalModel                                         |     0.12 |
|  3 | None                                                                                           | TabularTraditionalModel                                         |     0.08 |
|  4 | SimpleImputer,Variance Threshold,NoCoalescer,OneHotEncoder,QuantileTransformer,TruncSVD        | no embedding,ResNetBackbone,FullyConnectedHead,nn.Sequential    |     0.04 |
|  5 | SimpleImputer,Variance Threshold,NoCoalescer,OneHotEncoder,MinMaxScaler,NoFeaturePreprocessing | no embedding,ShapedMLPBackbone,FullyConnectedHead,nn.Sequential |     0.04 |
autoPyTorch results:
        Dataset name: c3af43a2-22f6-11ed-8835-b1fa420cf160
        Optimisation Metric: accuracy
        Best validation score: 0.8626733083495604
        Number of target algorithm runs: 15
        Number of successful target algorithm runs: 11
        Number of crashed target algorithm runs: 4
        Number of target algorithms that exceeded the time limit: 0
        Number of target algorithms that exceeded the memory limit: 0
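The Weight column in the table above can be read as each model's share in the final prediction. The sketch below assumes the ensemble combines class probabilities as a weighted average, which is a common scheme but is stated here as an assumption rather than taken from this example. The weights are copied from the table; the per-model probabilities are made up for illustration.

```python
# Weights copied from the show_models() table above.
weights = [0.56, 0.16, 0.12, 0.08, 0.04, 0.04]

# Hypothetical class probabilities from each model for one sample (two classes).
member_probs = [
    [0.9, 0.1],
    [0.4, 0.6],
    [0.3, 0.7],
    [0.6, 0.4],
    [0.2, 0.8],
    [0.5, 0.5],
]


def combine(weights, member_probs):
    """Weighted average of per-model class probabilities."""
    n_classes = len(member_probs[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, member_probs))
        for c in range(n_classes)
    ]


ensemble_probs = combine(weights, member_probs)
# The predicted class is the one with the highest combined probability.
predicted_class = max(range(len(ensemble_probs)), key=ensemble_probs.__getitem__)
print([round(p, 2) for p in ensemble_probs], predicted_class)
```

Because the weights sum to 1, the combined values remain valid probabilities, and the first model dominates the outcome with its weight of 0.56.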

Stratified Resampling Strategy

Build and fit a classifier with Stratified resampling strategy

api = TabularClassificationTask(
    # For demonstration purposes, we use stratified holdout
    # validation. However, one can also use
    # CrossValTypes.stratified_k_fold_cross_validation.
    resampling_strategy=HoldoutValTypes.stratified_holdout_validation,
    resampling_strategy_args={'val_share': 0.33}
)
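Stratification keeps the class proportions roughly equal in the training and validation parts, which matters most when one class is rare. The sketch below is a minimal plain-Python version of that idea; the helper `stratified_holdout_split` and the toy labels are assumptions for this example and do not reproduce AutoPyTorch's internals.

```python
import random
from collections import defaultdict


def stratified_holdout_split(labels, val_share=0.33, seed=0):
    """Split indices so each class contributes about val_share to validation.

    Hypothetical helper for illustration only.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for members in by_class.values():
        rng.shuffle(members)
        # Reserve the same share of every class for validation.
        n_val = int(round(len(members) * val_share))
        val_idx.extend(members[:n_val])
        train_idx.extend(members[n_val:])
    return sorted(train_idx), sorted(val_idx)


# Toy imbalanced labels: 90 samples of class 'a', 10 of class 'b'.
labels = ['a'] * 90 + ['b'] * 10
train_idx, val_idx = stratified_holdout_split(labels)
minority_in_val = sum(labels[i] == 'b' for i in val_idx)
print(len(val_idx), minority_in_val)  # 33 3
```

A plain random split of the same data could place anywhere from zero to all ten minority samples in the validation part; splitting per class removes that variability.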

Search for an ensemble of machine learning algorithms

api.search(
    X_train=X_train,
    y_train=y_train,
    X_test=X_test.copy(),
    y_test=y_test.copy(),
    optimize_metric='accuracy',
    total_walltime_limit=150,
    func_eval_time_limit_secs=30
)
<autoPyTorch.api.tabular_classification.TabularClassificationTask object at 0x7f9b234a8e80>

Print the final ensemble performance

y_pred = api.predict(X_test)
score = api.score(y_pred, y_test)
print(score)
# Print the final ensemble built by AutoPyTorch
print(api.show_models())

# Print statistics from search
print(api.sprint_statistics())
{'accuracy': 0.8670520231213873}
|    | Preprocessing                                                                                    | Estimator                                                       |   Weight |
|---:|:-------------------------------------------------------------------------------------------------|:----------------------------------------------------------------|---------:|
|  0 | SimpleImputer,Variance Threshold,NoCoalescer,OneHotEncoder,NoScaler,LinearSVC Preprocessor       | embedding,ShapedMLPBackbone,FullyConnectedHead,nn.Sequential    |     0.62 |
|  1 | None                                                                                             | RFLearner                                                       |     0.14 |
|  2 | None                                                                                             | KNNLearner                                                      |     0.12 |
|  3 | SimpleImputer,Variance Threshold,NoCoalescer,OneHotEncoder,StandardScaler,NoFeaturePreprocessing | no embedding,ShapedMLPBackbone,FullyConnectedHead,nn.Sequential |     0.06 |
|  4 | None                                                                                             | SVMLearner                                                      |     0.04 |
|  5 | None                                                                                             | LGBMLearner                                                     |     0.02 |
autoPyTorch results:
        Dataset name: 2b5c3792-22f7-11ed-8835-b1fa420cf160
        Optimisation Metric: accuracy
        Best validation score: 0.8362573099415205
        Number of target algorithm runs: 17
        Number of successful target algorithm runs: 13
        Number of crashed target algorithm runs: 3
        Number of target algorithms that exceeded the time limit: 1
        Number of target algorithms that exceeded the memory limit: 0

Total running time of the script: ( 8 minutes 43.593 seconds)
