Tabular Classification with Different Resampling Strategies¶
The following example shows how to fit a simple classification model with different resampling strategies in AutoPyTorch. By default, AutoPyTorch uses holdout validation with a 67% train size split.
import os
import tempfile as tmp
import warnings
os.environ['JOBLIB_TEMP_FOLDER'] = tmp.gettempdir()
os.environ['OMP_NUM_THREADS'] = '1'
os.environ['OPENBLAS_NUM_THREADS'] = '1'
os.environ['MKL_NUM_THREADS'] = '1'
warnings.simplefilter(action='ignore', category=UserWarning)
warnings.simplefilter(action='ignore', category=FutureWarning)
import sklearn.datasets
import sklearn.model_selection
from autoPyTorch.api.tabular_classification import TabularClassificationTask
from autoPyTorch.datasets.resampling_strategy import CrossValTypes, HoldoutValTypes
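The two enumerations imported above define the resampling strategies that can be passed to TabularClassificationTask. As a quick orientation, the short sketch below (not part of the original example) prints the members of both enumerations; the exact set of members depends on the installed AutoPyTorch version.

# List the available holdout and cross-validation strategies.
# The member names are what you pass as ``resampling_strategy`` below.
for strategy_enum in (HoldoutValTypes, CrossValTypes):
    print(strategy_enum.__name__, [member.name for member in strategy_enum])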
Default Resampling Strategy¶
Data Loading¶
X, y = sklearn.datasets.fetch_openml(data_id=40981, return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
X,
y,
random_state=1,
)
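Before fitting, it can help to look at the class distribution of the training split. The sketch below (not part of the original example) prints the relative class frequencies; this balance is exactly what the stratified strategies shown later try to preserve inside every split.

# Relative class frequencies in the training split (y is a pandas Series
# because the data was fetched with as_frame=True).
print(y_train.value_counts(normalize=True))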
Build and fit a classifier with the default resampling strategy¶
api = TabularClassificationTask(
# 'HoldoutValTypes.holdout_validation' with 'val_share': 0.33
# is the default argument setting for TabularClassificationTask.
# It is explicitly specified in this example for demonstration
# purposes.
resampling_strategy=HoldoutValTypes.holdout_validation,
resampling_strategy_args={'val_share': 0.33}
)
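Since these are the default values, instantiating the task without any resampling arguments would configure the same behaviour. The line below is only a sketch for comparison and is not executed in this example.

# Equivalent to the call above, relying on the documented defaults
# (HoldoutValTypes.holdout_validation with 'val_share': 0.33).
api_with_defaults = TabularClassificationTask()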
Search for an ensemble of machine learning algorithms¶
api.search(
X_train=X_train,
y_train=y_train,
X_test=X_test.copy(),
y_test=y_test.copy(),
optimize_metric='accuracy',
total_walltime_limit=150,
func_eval_time_limit_secs=30
)
<autoPyTorch.api.tabular_classification.TabularClassificationTask object at 0x7f6e14d54970>
Print the final ensemble performance¶
y_pred = api.predict(X_test)
score = api.score(y_pred, y_test)
print(score)
# Print the final ensemble built by AutoPyTorch
print(api.show_models())
# Print statistics from search
print(api.sprint_statistics())
{'accuracy': 0.8554913294797688}
| | Preprocessing | Estimator | Weight |
|---:|:-------------------------------------------------------------------------------------------------|:----------------------------------------------------------------|---------:|
| 0 | SimpleImputer,Variance Threshold,NoCoalescer,OneHotEncoder,NoScaler,LinearSVC Preprocessor | embedding,ShapedMLPBackbone,FullyConnectedHead,nn.Sequential | 0.28 |
| 1 | SimpleImputer,Variance Threshold,NoCoalescer,OneHotEncoder,StandardScaler,NoFeaturePreprocessing | no embedding,ShapedMLPBackbone,FullyConnectedHead,nn.Sequential | 0.28 |
| 2 | SimpleImputer,Variance Threshold,NoCoalescer,NoEncoder,StandardScaler,SPC | no embedding,ShapedMLPBackbone,FullyConnectedHead,nn.Sequential | 0.24 |
| 3 | SimpleImputer,Variance Threshold,NoCoalescer,OneHotEncoder,StandardScaler,NoFeaturePreprocessing | no embedding,ShapedMLPBackbone,FullyConnectedHead,nn.Sequential | 0.18 |
| 4 | None | KNNLearner | 0.02 |
autoPyTorch results:
Dataset name: 3765d4f8-2308-11ed-884d-557eb8b24584
Optimisation Metric: accuracy
Best validation score: 0.8713450292397661
Number of target algorithm runs: 19
Number of successful target algorithm runs: 15
Number of crashed target algorithm runs: 3
Number of target algorithms that exceeded the time limit: 1
Number of target algorithms that exceeded the memory limit: 0
Cross-Validation Resampling Strategy¶
Build and fit a classifier with a cross-validation resampling strategy¶
api = TabularClassificationTask(
resampling_strategy=CrossValTypes.k_fold_cross_validation,
resampling_strategy_args={'num_splits': 3}
)
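To make the effect of num_splits=3 concrete, the sketch below reproduces a plain 3-fold split of the training data with scikit-learn. This is only an illustration of the splitting scheme; AutoPyTorch performs the corresponding splits internally, and its shuffling and seeding may differ.

from sklearn.model_selection import KFold

kfold = KFold(n_splits=3)
for fold, (train_idx, val_idx) in enumerate(kfold.split(X_train)):
    # Each fold holds out roughly one third of the training rows for validation.
    print(f"fold {fold}: {len(train_idx)} train rows, {len(val_idx)} validation rows")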
Search for an ensemble of machine learning algorithms¶
api.search(
X_train=X_train,
y_train=y_train,
X_test=X_test.copy(),
y_test=y_test.copy(),
optimize_metric='accuracy',
total_walltime_limit=150,
func_eval_time_limit_secs=30
)
<autoPyTorch.api.tabular_classification.TabularClassificationTask object at 0x7f6e15587fa0>
Print the final ensemble performance¶
y_pred = api.predict(X_test)
score = api.score(y_pred, y_test)
print(score)
# Print the final ensemble built by AutoPyTorch
print(api.show_models())
# Print statistics from search
print(api.sprint_statistics())
{'accuracy': 0.8728323699421965}
| | Preprocessing | Estimator | Weight |
|---:|:-----------------------------------------------------------------------------------------------|:----------------------------------------------------------------|---------:|
| 0 | SimpleImputer,Variance Threshold,NoCoalescer,OneHotEncoder,NoScaler,LinearSVC Preprocessor | embedding,ShapedMLPBackbone,FullyConnectedHead,nn.Sequential | 0.56 |
| 1 | None | TabularTraditionalModel | 0.16 |
| 2 | None | TabularTraditionalModel | 0.12 |
| 3 | None | TabularTraditionalModel | 0.08 |
| 4 | SimpleImputer,Variance Threshold,NoCoalescer,OneHotEncoder,QuantileTransformer,TruncSVD | no embedding,ResNetBackbone,FullyConnectedHead,nn.Sequential | 0.04 |
| 5 | SimpleImputer,Variance Threshold,NoCoalescer,OneHotEncoder,MinMaxScaler,NoFeaturePreprocessing | no embedding,ShapedMLPBackbone,FullyConnectedHead,nn.Sequential | 0.04 |
autoPyTorch results:
Dataset name: 9b37e4ea-2308-11ed-884d-557eb8b24584
Optimisation Metric: accuracy
Best validation score: 0.8626733083495604
Number of target algorithm runs: 14
Number of successful target algorithm runs: 10
Number of crashed target algorithm runs: 3
Number of target algorithms that exceeded the time limit: 1
Number of target algorithms that exceeded the memory limit: 0
Stratified Resampling Strategy¶
Build and fit a classifier with a stratified resampling strategy¶
api = TabularClassificationTask(
# For demonstration purposes, we use
# stratified holdout validation. However,
# one can also use CrossValTypes.stratified_k_fold_cross_validation.
resampling_strategy=HoldoutValTypes.stratified_holdout_validation,
resampling_strategy_args={'val_share': 0.33}
)
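To illustrate what stratification adds over the plain holdout split used in the first section, the sketch below builds a stratified 67/33 split of the training data with scikit-learn and prints the class proportions of both parts. It is an analogue for illustration only, not AutoPyTorch's internal splitting code.

from sklearn.model_selection import train_test_split

X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.33, stratify=y_train, random_state=1
)
# With stratification, the validation split mirrors the class balance
# of the remaining training data almost exactly.
print(y_tr.value_counts(normalize=True))
print(y_val.value_counts(normalize=True))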
Search for an ensemble of machine learning algorithms¶
api.search(
X_train=X_train,
y_train=y_train,
X_test=X_test.copy(),
y_test=y_test.copy(),
optimize_metric='accuracy',
total_walltime_limit=150,
func_eval_time_limit_secs=30
)
<autoPyTorch.api.tabular_classification.TabularClassificationTask object at 0x7f6e1498ce80>
Print the final ensemble performance¶
y_pred = api.predict(X_test)
score = api.score(y_pred, y_test)
print(score)
# Print the final ensemble built by AutoPyTorch
print(api.show_models())
# Print statistics from search
print(api.sprint_statistics())
{'accuracy': 0.8670520231213873}
| | Preprocessing | Estimator | Weight |
|---:|:-------------------------------------------------------------------------------------------------|:----------------------------------------------------------------|---------:|
| 0 | SimpleImputer,Variance Threshold,NoCoalescer,OneHotEncoder,NoScaler,LinearSVC Preprocessor | embedding,ShapedMLPBackbone,FullyConnectedHead,nn.Sequential | 0.62 |
| 1 | None | RFLearner | 0.14 |
| 2 | None | KNNLearner | 0.12 |
| 3 | SimpleImputer,Variance Threshold,NoCoalescer,OneHotEncoder,StandardScaler,NoFeaturePreprocessing | no embedding,ShapedMLPBackbone,FullyConnectedHead,nn.Sequential | 0.06 |
| 4 | None | SVMLearner | 0.04 |
| 5 | None | LGBMLearner | 0.02 |
autoPyTorch results:
Dataset name: 04ae2bbc-2309-11ed-884d-557eb8b24584
Optimisation Metric: accuracy
Best validation score: 0.8362573099415205
Number of target algorithm runs: 17
Number of successful target algorithm runs: 13
Number of crashed target algorithm runs: 3
Number of target algorithms that exceeded the time limit: 1
Number of target algorithms that exceeded the memory limit: 0
Total running time of the script: (8 minutes 32.801 seconds)