Feature Types

In auto-sklearn it is possible to specify the feature types of a dataset by passing the argument feat_type to the method fit(). The following example demonstrates how this can be done.

Alternatively, you can provide a properly formatted pandas DataFrame, and the feature types will be inferred automatically, as demonstrated in the Performance-over-time plot example.
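
For illustration, here is a minimal sketch of that alternative (the toy DataFrame below is hypothetical and not part of this example's dataset): with proper per-column dtypes, fit() can infer the feature types on its own, and feat_type is unnecessary.

import pandas as pd

# Hypothetical toy frame: "city" carries a pandas "category" dtype, so
# auto-sklearn can infer Categorical/Numerical per column by itself.
# feat_type is only needed when that dtype information is missing,
# e.g. for plain numpy arrays or lists.
df = pd.DataFrame(
    {
        "age": [25, 32, 47, 51],                      # numerical column
        "city": pd.Categorical(["a", "b", "a", "c"]),  # categorical column
    }
)
print(df.dtypes)  # age: int64, city: category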

import numpy as np

import sklearn.model_selection
import sklearn.datasets
import sklearn.metrics

import autosklearn.classification

Data Loading

Load the Australian dataset from https://www.openml.org/d/40981

bunch = sklearn.datasets.fetch_openml(data_id=40981, as_frame=True)
y = bunch["target"].to_numpy()
X = bunch["data"].to_numpy(dtype=float)

X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, random_state=1
)

# Auto-sklearn can automatically recognize categorical/numerical data from a pandas
# DataFrame. This example highlights how the user can provide the feature types
# when using numpy arrays, since there is no per-column dtype in that case.
# feat_type is a list that tags each column of a DataFrame / numpy array / list
# with the case-insensitive string "Categorical" or "Numerical".
feat_type = [
    "Categorical" if x.name == "category" else "Numerical" for x in bunch["data"].dtypes
]
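
To double-check the mapping before fitting, you can print the tags; there is exactly one entry per column of the original DataFrame (output omitted here):

# Optional sanity check: one tag per column.
assert len(feat_type) == bunch["data"].shape[1]
print(feat_type)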

Build and fit a classifier

cls = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=30,
    # The two flags below are provided to speed up calculations;
    # they are not recommended for real use.
    initial_configurations_via_metalearning=0,
    smac_scenario_args={"runcount_limit": 1},
)
cls.fit(X_train, y_train, X_test, y_test, feat_type=feat_type)
Fitting to the training data:   0%|          | 0/30 [00:00<?, ?it/s, The total time budget for this task is 0:00:30]
Fitting to the training data:   3%|3         | 1/30 [00:01<00:29,  1.00s/it, The total time budget for this task is 0:00:30]
Fitting to the training data:   7%|6         | 2/30 [00:02<00:28,  1.00s/it, The total time budget for this task is 0:00:30]
Fitting to the training data:  10%|#         | 3/30 [00:03<00:27,  1.00s/it, The total time budget for this task is 0:00:30]
Fitting to the training data: 100%|##########| 30/30 [00:03<00:00,  9.98it/s, The total time budget for this task is 0:00:30]

AutoSklearnClassifier(ensemble_class=<class 'autosklearn.ensembles.ensemble_selection.EnsembleSelection'>,
                      initial_configurations_via_metalearning=0,
                      per_run_time_limit=3,
                      smac_scenario_args={'runcount_limit': 1},
                      time_left_for_this_task=30)
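
Before scoring, it can be useful to inspect what was actually trained. A brief optional sketch using auto-sklearn's sprint_statistics() and leaderboard() helpers (output omitted here):

# Optional: summarize the search and the models in the final ensemble.
print(cls.sprint_statistics())  # metric used, runs succeeded/crashed/timed out
print(cls.leaderboard())        # per-model rank, ensemble weight and cost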

Get the Score of the final ensemble

predictions = cls.predict(X_test)
print("Accuracy score", sklearn.metrics.accuracy_score(y_test, predictions))
Accuracy score 0.8786127167630058
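
Accuracy is only one view of the result; for a per-class breakdown, scikit-learn's classification_report can be applied to the same predictions (a sketch; output omitted here):

# Optional: per-class precision/recall/F1 on the test split.
print(sklearn.metrics.classification_report(y_test, predictions))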

Total running time of the script: ( 0 minutes 15.509 seconds)

Gallery generated by Sphinx-Gallery