Feature Types
In auto-sklearn you can specify the feature types of a dataset when calling the fit() method by passing the feat_type argument. The following example demonstrates how to do so.
Additionally, you can provide a properly formatted pandas DataFrame, and the feature types will be automatically inferred, as demonstrated in Performance-over-time plot.
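For instance, for a hypothetical dataset whose first column is categorical and whose two remaining columns are numerical, the argument would look as follows (a sketch, not part of the example below):

# Hypothetical three-column dataset: first column categorical, rest numerical
feat_type = ["Categorical", "Numerical", "Numerical"]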
import numpy as np
import sklearn.model_selection
import sklearn.datasets
import sklearn.metrics
import autosklearn.classification
Data Loading
Load the Australian dataset from https://www.openml.org/d/40981.
bunch = sklearn.datasets.fetch_openml(data_id=40981, as_frame=True)
y = bunch["target"].to_numpy()
# np.float is a deprecated alias for the builtin float, so use float directly
X = bunch["data"].to_numpy(float)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, random_state=1
)
# Auto-sklearn can automatically recognize categorical/numerical data from a pandas
# DataFrame. This example highlights how the user can provide the feature types
# when using numpy arrays, as there is no per-column dtype in this case.
# feat_type is a list that tags each column of a DataFrame / numpy array / list
# with the case-insensitive string "Categorical" or "Numerical".
feat_type = [
    "Categorical" if x.name == "category" else "Numerical" for x in bunch["data"].dtypes
]
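As a sketch of the DataFrame alternative mentioned above: fetch_openml(as_frame=True) already returns the data as a DataFrame whose categorical columns carry the pandas "category" dtype, so the feature types can be inferred automatically and feat_type can be omitted. The split below mirrors the numpy one and is illustrative only.

X_df = bunch["data"]  # DataFrame, per-column dtypes are preserved
print(X_df.dtypes.value_counts())  # shows how many columns are "category"
X_df_train, X_df_test, y_df_train, y_df_test = sklearn.model_selection.train_test_split(
    X_df, y, random_state=1
)
# With a DataFrame, fit() can be called without feat_type:
# cls.fit(X_df_train, y_df_train)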
Build and fit a classifier
cls = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=30,
    # The two flags below are provided to speed up calculations;
    # they are not recommended for a real use case
    initial_configurations_via_metalearning=0,
    smac_scenario_args={"runcount_limit": 1},
)
cls.fit(X_train, y_train, X_test, y_test, feat_type=feat_type)
Fitting to the training data: 100%|##########| 30/30 [00:03<00:00, 9.98it/s, The total time budget for this task is 0:00:30]
AutoSklearnClassifier(ensemble_class=<class 'autosklearn.ensembles.ensemble_selection.EnsembleSelection'>,
initial_configurations_via_metalearning=0,
per_run_time_limit=3,
smac_scenario_args={'runcount_limit': 1},
time_left_for_this_task=30)
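Because X_test and y_test were passed to fit(), auto-sklearn also tracks performance on the test data during the search. Before scoring, the fitted estimator can be inspected with sprint_statistics() and leaderboard() from the estimator API; the exact output depends on the run.

# Short summary of the search: metric used, best validation score,
# number of successful/crashed target algorithm runs, etc.
print(cls.sprint_statistics())
# Models in the final ensemble, ranked by their ensemble weight
print(cls.leaderboard())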
Get the Score of the final ensemble
predictions = cls.predict(X_test)
print("Accuracy score", sklearn.metrics.accuracy_score(y_test, predictions))
Accuracy score 0.8786127167630058
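Any other scikit-learn metric can be computed from the same predictions; for instance, balanced accuracy, which weights both classes equally:

print(
    "Balanced accuracy score",
    sklearn.metrics.balanced_accuracy_score(y_test, predictions),
)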
Total running time of the script: (0 minutes 15.509 seconds)