Tabular Classification with Different Resampling Strategies
The following example shows how to fit a classification model with different resampling strategies in AutoPyTorch. By default, AutoPyTorch uses holdout validation with a 67% train size split.
import os
import tempfile as tmp
import warnings
os.environ['JOBLIB_TEMP_FOLDER'] = tmp.gettempdir()
os.environ['OMP_NUM_THREADS'] = '1'
os.environ['OPENBLAS_NUM_THREADS'] = '1'
os.environ['MKL_NUM_THREADS'] = '1'
warnings.simplefilter(action='ignore', category=UserWarning)
warnings.simplefilter(action='ignore', category=FutureWarning)
import sklearn.datasets
import sklearn.model_selection
from autoPyTorch.api.tabular_classification import TabularClassificationTask
from autoPyTorch.datasets.resampling_strategy import CrossValTypes, HoldoutValTypes
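Both resampling strategy types used in this example are, as far as the AutoPyTorch source goes, plain Python enumerations, so their members can be listed directly. This inspection step is not part of the original example; it is only a convenient way to see which holdout and cross-validation variants are available.

# Assumption: HoldoutValTypes and CrossValTypes behave as standard Python enums.
print(list(HoldoutValTypes))
print(list(CrossValTypes))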
Default Resampling Strategy
Data Loading
X, y = sklearn.datasets.fetch_openml(data_id=40981, return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X,
    y,
    random_state=1,
)
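As a small sanity check (added here, not part of the original example), the class balance of the training targets can be inspected; a noticeably imbalanced target is the main reason to prefer the stratified strategies shown at the end of this example. Because as_frame=True was requested above, y_train is a pandas Series.

# Inspect the class proportions of the training targets.
print(y_train.value_counts(normalize=True))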
Build and fit a classifier with default resampling strategy
api = TabularClassificationTask(
    # 'HoldoutValTypes.holdout_validation' with 'val_share': 0.33
    # is the default argument setting for TabularClassificationTask.
    # It is explicitly specified in this example for demonstration
    # purposes.
    resampling_strategy=HoldoutValTypes.holdout_validation,
    resampling_strategy_args={'val_share': 0.33}
)
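For intuition, 'val_share': 0.33 means that roughly a third of the training data is held out for validation, which is the 67% train split mentioned at the top. The sketch below only illustrates an equivalent split with scikit-learn; AutoPyTorch performs its own split internally, so this code is not part of the fitting procedure.

# Illustration only: an equivalent 67%/33% split done with scikit-learn.
X_tr, X_val = sklearn.model_selection.train_test_split(
    X_train, test_size=0.33, random_state=1
)
print(len(X_tr), len(X_val))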
Search for an ensemble of machine learning algorithms
api.search(
    X_train=X_train,
    y_train=y_train,
    X_test=X_test.copy(),
    y_test=y_test.copy(),
    optimize_metric='accuracy',
    total_walltime_limit=150,
    func_eval_time_limit_secs=30
)
<autoPyTorch.api.tabular_classification.TabularClassificationTask object at 0x7f9aa456f820>
Print the final ensemble performance
y_pred = api.predict(X_test)
score = api.score(y_pred, y_test)
print(score)
# Print the final ensemble built by AutoPyTorch
print(api.show_models())
# Print statistics from search
print(api.sprint_statistics())
{'accuracy': 0.8554913294797688}
| | Preprocessing | Estimator | Weight |
|---:|:-------------------------------------------------------------------------------------------------|:----------------------------------------------------------------|---------:|
| 0 | SimpleImputer,Variance Threshold,NoCoalescer,OneHotEncoder,NoScaler,LinearSVC Preprocessor | embedding,ShapedMLPBackbone,FullyConnectedHead,nn.Sequential | 0.28 |
| 1 | SimpleImputer,Variance Threshold,NoCoalescer,OneHotEncoder,StandardScaler,NoFeaturePreprocessing | no embedding,ShapedMLPBackbone,FullyConnectedHead,nn.Sequential | 0.28 |
| 2 | SimpleImputer,Variance Threshold,NoCoalescer,NoEncoder,StandardScaler,SPC | no embedding,ShapedMLPBackbone,FullyConnectedHead,nn.Sequential | 0.24 |
| 3 | SimpleImputer,Variance Threshold,NoCoalescer,OneHotEncoder,StandardScaler,NoFeaturePreprocessing | no embedding,ShapedMLPBackbone,FullyConnectedHead,nn.Sequential | 0.18 |
| 4 | None | KNNLearner | 0.02 |
autoPyTorch results:
Dataset name: 5aca1730-22f6-11ed-8835-b1fa420cf160
Optimisation Metric: accuracy
Best validation score: 0.8713450292397661
Number of target algorithm runs: 20
Number of successful target algorithm runs: 15
Number of crashed target algorithm runs: 4
Number of target algorithms that exceeded the time limit: 1
Number of target algorithms that exceeded the memory limit: 0
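The accuracy reported by api.score above can also be cross-checked directly with scikit-learn; this verification step is an addition to the original example.

# Cross-check the ensemble accuracy with scikit-learn's metric.
import sklearn.metrics
print(sklearn.metrics.accuracy_score(y_test, y_pred))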
Cross Validation Resampling Strategy
Build and fit a classifier with cross validation resampling strategy
api = TabularClassificationTask(
    resampling_strategy=CrossValTypes.k_fold_cross_validation,
    resampling_strategy_args={'num_splits': 3}
)
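With k_fold_cross_validation and 'num_splits': 3, every configuration evaluated during the search is trained and validated across three folds of the training data. For intuition only, the fold sizes of such a 3-fold split can be sketched with scikit-learn; this is an illustration, not what AutoPyTorch runs internally.

# Illustration only: fold sizes of a 3-fold split over the training data.
for train_idx, val_idx in sklearn.model_selection.KFold(n_splits=3).split(X_train):
    print(len(train_idx), len(val_idx))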
Search for an ensemble of machine learning algorithms
api.search(
    X_train=X_train,
    y_train=y_train,
    X_test=X_test.copy(),
    y_test=y_test.copy(),
    optimize_metric='accuracy',
    total_walltime_limit=150,
    func_eval_time_limit_secs=30
)
<autoPyTorch.api.tabular_classification.TabularClassificationTask object at 0x7f9aa4a83970>
Print the final ensemble performance
y_pred = api.predict(X_test)
score = api.score(y_pred, y_test)
print(score)
# Print the final ensemble built by AutoPyTorch
print(api.show_models())
# Print statistics from search
print(api.sprint_statistics())
{'accuracy': 0.8728323699421965}
| | Preprocessing | Estimator | Weight |
|---:|:-----------------------------------------------------------------------------------------------|:----------------------------------------------------------------|---------:|
| 0 | SimpleImputer,Variance Threshold,NoCoalescer,OneHotEncoder,NoScaler,LinearSVC Preprocessor | embedding,ShapedMLPBackbone,FullyConnectedHead,nn.Sequential | 0.56 |
| 1 | None | TabularTraditionalModel | 0.16 |
| 2 | None | TabularTraditionalModel | 0.12 |
| 3 | None | TabularTraditionalModel | 0.08 |
| 4 | SimpleImputer,Variance Threshold,NoCoalescer,OneHotEncoder,QuantileTransformer,TruncSVD | no embedding,ResNetBackbone,FullyConnectedHead,nn.Sequential | 0.04 |
| 5 | SimpleImputer,Variance Threshold,NoCoalescer,OneHotEncoder,MinMaxScaler,NoFeaturePreprocessing | no embedding,ShapedMLPBackbone,FullyConnectedHead,nn.Sequential | 0.04 |
autoPyTorch results:
Dataset name: c3af43a2-22f6-11ed-8835-b1fa420cf160
Optimisation Metric: accuracy
Best validation score: 0.8626733083495604
Number of target algorithm runs: 15
Number of successful target algorithm runs: 11
Number of crashed target algorithm runs: 4
Number of target algorithms that exceeded the time limit: 0
Number of target algorithms that exceeded the memory limit: 0
Stratified Resampling Strategy
Build and fit a classifier with stratified resampling strategy
api = TabularClassificationTask(
    # For demonstration purposes, we use
    # stratified holdout validation. However,
    # one can also use CrossValTypes.stratified_k_fold_cross_validation.
    resampling_strategy=HoldoutValTypes.stratified_holdout_validation,
    resampling_strategy_args={'val_share': 0.33}
)
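Stratified holdout keeps the class proportions of the target the same in the train and validation parts of the split. The sketch below illustrates the idea with scikit-learn's stratify argument; it is an added illustration, not AutoPyTorch's internal split.

# Illustration only: a stratified 33% holdout split that preserves class
# proportions in both parts.
X_tr, X_val, y_tr, y_val = sklearn.model_selection.train_test_split(
    X_train, y_train, test_size=0.33, stratify=y_train, random_state=1
)
print(y_tr.value_counts(normalize=True))
print(y_val.value_counts(normalize=True))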
Search for an ensemble of machine learning algorithms
api.search(
    X_train=X_train,
    y_train=y_train,
    X_test=X_test.copy(),
    y_test=y_test.copy(),
    optimize_metric='accuracy',
    total_walltime_limit=150,
    func_eval_time_limit_secs=30
)
<autoPyTorch.api.tabular_classification.TabularClassificationTask object at 0x7f9b234a8e80>
Print the final ensemble performance
y_pred = api.predict(X_test)
score = api.score(y_pred, y_test)
print(score)
# Print the final ensemble built by AutoPyTorch
print(api.show_models())
# Print statistics from search
print(api.sprint_statistics())
{'accuracy': 0.8670520231213873}
| | Preprocessing | Estimator | Weight |
|---:|:-------------------------------------------------------------------------------------------------|:----------------------------------------------------------------|---------:|
| 0 | SimpleImputer,Variance Threshold,NoCoalescer,OneHotEncoder,NoScaler,LinearSVC Preprocessor | embedding,ShapedMLPBackbone,FullyConnectedHead,nn.Sequential | 0.62 |
| 1 | None | RFLearner | 0.14 |
| 2 | None | KNNLearner | 0.12 |
| 3 | SimpleImputer,Variance Threshold,NoCoalescer,OneHotEncoder,StandardScaler,NoFeaturePreprocessing | no embedding,ShapedMLPBackbone,FullyConnectedHead,nn.Sequential | 0.06 |
| 4 | None | SVMLearner | 0.04 |
| 5 | None | LGBMLearner | 0.02 |
autoPyTorch results:
Dataset name: 2b5c3792-22f7-11ed-8835-b1fa420cf160
Optimisation Metric: accuracy
Best validation score: 0.8362573099415205
Number of target algorithm runs: 17
Number of successful target algorithm runs: 13
Number of crashed target algorithm runs: 3
Number of target algorithms that exceeded the time limit: 1
Number of target algorithms that exceeded the memory limit: 0
Total running time of the script: (8 minutes 43.593 seconds)