Regression¶
The following example shows how to fit a simple regression model with auto-sklearn.
from pprint import pprint
import sklearn.datasets
import sklearn.metrics
import sklearn.model_selection
import autosklearn.regression
import matplotlib.pyplot as plt
Data Loading¶
X, y = sklearn.datasets.load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = \
sklearn.model_selection.train_test_split(X, y, random_state=1)
Build and fit a regressor¶
automl = autosklearn.regression.AutoSklearnRegressor(
time_left_for_this_task=120,
per_run_time_limit=30,
tmp_folder='/tmp/autosklearn_regression_example_tmp',
)
automl.fit(X_train, y_train, dataset_name='diabetes')
Out:
/home/runner/work/auto-sklearn/auto-sklearn/autosklearn/metalearning/metalearning/meta_base.py:68: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
self.metafeatures = self.metafeatures.append(metafeatures)
/home/runner/work/auto-sklearn/auto-sklearn/autosklearn/metalearning/metalearning/meta_base.py:72: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
self.algorithm_runs[metric].append(runs)
AutoSklearnRegressor(per_run_time_limit=30, time_left_for_this_task=120,
tmp_folder='/tmp/autosklearn_regression_example_tmp')
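The search can be tailored beyond the time limits. As a minimal sketch (not executed in this example), the search space can be restricted to specific components and the optimization metric set explicitly; the component names and metric object below are illustrative choices from auto-sklearn's documented options:

# Hypothetical alternative configuration: restrict the search space and
# optimize R2 explicitly (not run as part of this example).
import autosklearn.metrics
automl_restricted = autosklearn.regression.AutoSklearnRegressor(
    time_left_for_this_task=120,
    per_run_time_limit=30,
    include={
        'regressor': ['random_forest'],
        'feature_preprocessor': ['no_preprocessing'],
    },
    metric=autosklearn.metrics.r2,
)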
View the models found by auto-sklearn¶
print(automl.leaderboard())
Out:
          rank  ensemble_weight               type      cost  duration
model_id
25           1             0.46                sgd  0.436679  0.572129
6            2             0.32     ard_regression  0.455042  0.588294
27           3             0.14     ard_regression  0.462249  0.572937
11           4             0.02      random_forest  0.507400  9.819607
7            5             0.06  gradient_boosting  0.518673  1.036891
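Each row corresponds to one pipeline in the final ensemble: cost is the loss on the internal validation set (with the default R2 metric this is 1 - R2), and ensemble_weight is the weight the pipeline receives in the ensemble. If the leaderboard method's detailed flag is available in your version, more per-model columns can be requested; a small sketch:

# Request additional per-model columns, assuming leaderboard()
# supports the `detailed` flag in this auto-sklearn version.
print(automl.leaderboard(detailed=True))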
Print the final ensemble constructed by auto-sklearn¶
pprint(automl.show_models(), indent=4)
Out:
{ 6: { 'cost': 0.4550418898836528,
'data_preprocessor': <autosklearn.pipeline.components.data_preprocessing.DataPreprocessorChoice object at 0x7fb0ef4db1f0>,
'ensemble_weight': 0.32,
'feature_preprocessor': <autosklearn.pipeline.components.feature_preprocessing.FeaturePreprocessorChoice object at 0x7fb0ea196190>,
'model_id': 6,
'rank': 2,
'regressor': <autosklearn.pipeline.components.regression.RegressorChoice object at 0x7fb0ea196640>,
'sklearn_regressor': ARDRegression(alpha_1=0.0003701926442639788, alpha_2=2.2118001735899097e-07,
copy_X=False, lambda_1=1.2037591637980971e-06,
lambda_2=4.358378124977852e-09,
threshold_lambda=1136.5286041327277, tol=0.021944240404849075)},
7: { 'cost': 0.5186726734789994,
'data_preprocessor': <autosklearn.pipeline.components.data_preprocessing.DataPreprocessorChoice object at 0x7fb0ef7af4c0>,
'ensemble_weight': 0.06,
'feature_preprocessor': <autosklearn.pipeline.components.feature_preprocessing.FeaturePreprocessorChoice object at 0x7fb0ec021e80>,
'model_id': 7,
'rank': 5,
'regressor': <autosklearn.pipeline.components.regression.RegressorChoice object at 0x7fb0ec021f70>,
'sklearn_regressor': HistGradientBoostingRegressor(l2_regularization=1.8428972335335263e-10,
learning_rate=0.012607824914758717, max_iter=512,
max_leaf_nodes=10, min_samples_leaf=8,
n_iter_no_change=0, random_state=1,
validation_fraction=None, warm_start=True)},
11: { 'cost': 0.5073997164657239,
'data_preprocessor': <autosklearn.pipeline.components.data_preprocessing.DataPreprocessorChoice object at 0x7fb0edb94fd0>,
'ensemble_weight': 0.02,
'feature_preprocessor': <autosklearn.pipeline.components.feature_preprocessing.FeaturePreprocessorChoice object at 0x7fb0efc596d0>,
'model_id': 11,
'rank': 4,
'regressor': <autosklearn.pipeline.components.regression.RegressorChoice object at 0x7fb0efc59370>,
'sklearn_regressor': RandomForestRegressor(bootstrap=False, criterion='mae',
max_features=0.6277363920171745, min_samples_leaf=6,
min_samples_split=15, n_estimators=512, n_jobs=1,
random_state=1, warm_start=True)},
25: { 'cost': 0.43667876507897496,
'data_preprocessor': <autosklearn.pipeline.components.data_preprocessing.DataPreprocessorChoice object at 0x7fb0eed45790>,
'ensemble_weight': 0.46,
'feature_preprocessor': <autosklearn.pipeline.components.feature_preprocessing.FeaturePreprocessorChoice object at 0x7fb0eaa6c6d0>,
'model_id': 25,
'rank': 1,
'regressor': <autosklearn.pipeline.components.regression.RegressorChoice object at 0x7fb0eaa6c550>,
'sklearn_regressor': SGDRegressor(alpha=0.0006517033225329654, epsilon=0.012150149892783745,
eta0=0.016444224834275295, l1_ratio=1.7462342366289323e-09,
loss='epsilon_insensitive', max_iter=16, penalty='elasticnet',
power_t=0.21521743568582094, random_state=1,
tol=0.002431731981071206, warm_start=True)},
27: { 'cost': 0.4622486119001967,
'data_preprocessor': <autosklearn.pipeline.components.data_preprocessing.DataPreprocessorChoice object at 0x7fb0eaa6c2b0>,
'ensemble_weight': 0.14,
'feature_preprocessor': <autosklearn.pipeline.components.feature_preprocessing.FeaturePreprocessorChoice object at 0x7fb0ef7afc70>,
'model_id': 27,
'rank': 3,
'regressor': <autosklearn.pipeline.components.regression.RegressorChoice object at 0x7fb0ebe50460>,
'sklearn_regressor': ARDRegression(alpha_1=2.7664515192592053e-05, alpha_2=9.504988116581138e-07,
copy_X=False, lambda_1=6.50650698230178e-09,
lambda_2=4.238533890074848e-07,
threshold_lambda=78251.58542976103, tol=0.0007301343236220855)}}
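Each entry of the dictionary returned by show_models() exposes the fitted scikit-learn estimator under the 'sklearn_regressor' key, so individual pipelines can be inspected directly. For instance, using the model ids printed above (they will differ between runs):

# Retrieve the fitted scikit-learn estimator of the highest-weighted
# model (model_id 25 in the run shown above).
ensemble = automl.show_models()
best_model = ensemble[25]['sklearn_regressor']
print(type(best_model).__name__)  # e.g. SGDRegressor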
Get the Score of the final ensemble¶
After training the estimator, we can quantify the goodness of fit. One possibility is the R2 score. Its values range from -inf to 1, with 1 being the best possible value; a dummy estimator that always predicts the mean of the data has an R2 score of 0.
train_predictions = automl.predict(X_train)
print("Train R2 score:", sklearn.metrics.r2_score(y_train, train_predictions))
test_predictions = automl.predict(X_test)
print("Test R2 score:", sklearn.metrics.r2_score(y_test, test_predictions))
Out:
Train R2 score: 0.5944780427522034
Test R2 score: 0.3959585042866587
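To make the definition concrete, the R2 score can be computed by hand as 1 - SS_res / SS_tot; the helper below is an illustrative sketch, not part of auto-sklearn or scikit-learn:

import numpy as np

def r2_by_hand(y_true, y_pred):
    # R2 = 1 - SS_res / SS_tot
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

# A constant prediction of the mean makes SS_res equal SS_tot, so R2 == 0.
print(r2_by_hand(y_test, np.full(len(y_test), y_test.mean())))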
Plot the predictions¶
Furthermore, we can now visually inspect the predictions. We plot the true values against the predicted values for both the train and test data. Points on the diagonal are perfect predictions; points below the diagonal were overestimated by the model (the predicted value is higher than the true value), and points above the diagonal were underestimated (the predicted value is lower than the true value).
plt.scatter(train_predictions, y_train, label="Train samples", c='#d95f02')
plt.scatter(test_predictions, y_test, label="Test samples", c='#7570b3')
plt.xlabel("Predicted value")
plt.ylabel("True value")
plt.legend()
plt.plot([30, 400], [30, 400], c='k', zorder=0)
plt.xlim([30, 400])
plt.ylim([30, 400])
plt.tight_layout()
plt.show()

Total running time of the script: (1 minute 56.523 seconds)