APIs

Main modules

Classification

class autosklearn.classification.AutoSklearnClassifier(time_left_for_this_task=3600, per_run_time_limit=None, initial_configurations_via_metalearning=25, ensemble_size: int | None = None, ensemble_class: Type[AbstractEnsemble] | Literal['default'] | None = 'default', ensemble_kwargs: Dict[str, Any] | None = None, ensemble_nbest=50, max_models_on_disc=50, seed=1, memory_limit=3072, include: Optional[Dict[str, List[str]]] = None, exclude: Optional[Dict[str, List[str]]] = None, resampling_strategy='holdout', resampling_strategy_arguments=None, tmp_folder=None, delete_tmp_folder_after_terminate=True, n_jobs: Optional[int] = None, dask_client: Optional[dask.distributed.Client] = None, disable_evaluator_output=False, get_smac_object_callback=None, smac_scenario_args=None, logging_config=None, metadata_directory=None, metric: Scorer | Sequence[Scorer] | None = None, scoring_functions: Optional[List[Scorer]] = None, load_models: bool = True, get_trials_callback: SMACCallback | None = None, dataset_compression: Union[bool, Mapping[str, Any]] = True, allow_string_features: bool = True, disable_progress_bar: bool = False)[source]

This class implements the classification task.

Parameters
time_left_for_this_task: int, optional (default=3600)

Time limit in seconds for the search of appropriate models. By increasing this value, auto-sklearn has a higher chance of finding better models.

per_run_time_limit: int, optional (default=1/10 of time_left_for_this_task)

Time limit for a single call to the machine learning model. Model fitting will be terminated if the machine learning algorithm runs over the time limit. Set this value high enough so that typical machine learning algorithms can be fit on the training data.

initial_configurations_via_metalearning: int, optional (default=25)

Initialize the hyperparameter optimization algorithm with this many configurations which worked well on previously seen datasets. Disable if the hyperparameter optimization algorithm should start from scratch.

ensemble_size: int, optional

Number of models added to the ensemble built by Ensemble selection from libraries of models. Models are drawn with replacement. If set to 0 no ensemble is fit.

Deprecated - will be removed in Auto-sklearn 0.16. Please pass this argument via ensemble_kwargs={"ensemble_size": int} if you want to change the ensemble size for ensemble selection.

ensemble_class: Type[AbstractEnsemble] | “default”, optional (default=”default”)

Class implementing the post-hoc ensemble algorithm. Set to None to disable ensemble building, or use SingleBest to use only the single best model instead of an ensemble.

If set to “default” it will use EnsembleSelection for single-objective problems and MultiObjectiveDummyEnsemble for multi-objective problems.

ensemble_kwargs: Dict, optional

Keyword arguments that are passed to the ensemble class upon initialization.

ensemble_nbest: int, optional (default=50)

Only consider the ensemble_nbest models when building an ensemble. This is inspired by a concept called library pruning introduced in Getting Most out of Ensemble Selection. This is independent of the ensemble_class argument and this pruning step is done prior to constructing an ensemble.

max_models_on_disc: int, optional (default=50)

Defines the maximum number of models that are kept on disk. Models beyond this limit are permanently deleted. As a consequence, this value also sets the upper limit on how many models can be used for an ensemble. It must be an integer greater than or equal to 1. If set to None, all models are kept on disk.

seed: int, optional (default=1)

Used to seed SMAC. Will determine the output file names.

memory_limit: int, optional (3072)

Memory limit in MB for the machine learning algorithm. auto-sklearn will stop fitting the machine learning algorithm if it tries to allocate more than memory_limit MB.

Important notes:

  • If None is provided, no memory limit is set.

  • In case of multi-processing, memory_limit will be per job, so the total usage is n_jobs x memory_limit.

  • The memory limit also applies to the ensemble creation process.

include: Optional[Dict[str, List[str]]] = None

If None, all possible algorithms are used.

Otherwise, specifies a step and the components that are included in search. See /pipeline/components/<step>/* for available components.

Incompatible with parameter exclude.

Possible Steps:

  • "data_preprocessor"

  • "balancing"

  • "feature_preprocessor"

  • "classifier" - Only for when when using AutoSklearnClasssifier

  • "regressor" - Only for when when using AutoSklearnRegressor

Example:

include = {
    'classifier': ["random_forest"],
    'feature_preprocessor': ["no_preprocessing"]
}
exclude: Optional[Dict[str, List[str]]] = None

If None, all possible algorithms are used.

Otherwise, specifies a step and the components that are excluded from search. See /pipeline/components/<step>/* for available components.

Incompatible with parameter include.

Possible Steps:

  • "data_preprocessor"

  • "balancing"

  • "feature_preprocessor"

  • "classifier" - Only for when when using AutoSklearnClasssifier

  • "regressor" - Only for when when using AutoSklearnRegressor

Example:

exclude = {
    'classifier': ["random_forest"],
    'feature_preprocessor': ["no_preprocessing"]
}
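
For instance, a minimal sketch of restricting the search space via include when constructing the estimator (component names as in the example above; the time limit is illustrative):

import autosklearn.classification

# Search only random forests, with no feature preprocessing.
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=300,
    include={
        'classifier': ['random_forest'],
        'feature_preprocessor': ['no_preprocessing'],
    },
)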
resampling_strategy: str | BaseCrossValidator | _RepeatedSplits | BaseShuffleSplit = “holdout”

How to handle overfitting; may require resampling_strategy_arguments when using a "cv"-based method or a Splitter object.

  • Options
    • "holdout" - Use a 67:33 (train:test) split

    • "cv": perform cross validation, requires “folds” in resampling_strategy_arguments

    • "holdout-iterative-fit" - Same as “holdout” but iterative fit where possible

    • "cv-iterative-fit": Same as “cv” but iterative fit where possible

    • "partial-cv": Same as “cv” but uses intensification.

    • BaseCrossValidator - any BaseCrossValidator subclass (found in scikit-learn model_selection module)

    • _RepeatedSplits - any _RepeatedSplits subclass (found in scikit-learn model_selection module)

    • BaseShuffleSplit - any BaseShuffleSplit subclass (found in scikit-learn model_selection module)

If using a Splitter object that relies on the dataset retaining its current size and order, you will need to look at the dataset_compression argument and ensure that "subsample" is not included in the applied compression "methods", or disable it entirely with False.

resampling_strategy_arguments: Optional[Dict] = None

Additional arguments for resampling_strategy; required when using a "cv"-based strategy. The default arguments if left as None are:

{
    "train_size": 0.67,     # The size of the training set
    "shuffle": True,        # Whether to shuffle before splitting data
    "folds": 5              # Used in 'cv' based resampling strategies
}

If using a custom splitter class that takes n_splits (such as PredefinedSplit), the value of "folds" will be used.
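
For example, a sketch of a "cv"-based setup using the "folds" argument described above:

import autosklearn.classification

# 5-fold cross-validation during the search.
automl = autosklearn.classification.AutoSklearnClassifier(
    resampling_strategy='cv',
    resampling_strategy_arguments={'folds': 5},
)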

tmp_folder: string, optional (None)

Folder to store configuration output and log files. If None, /tmp/autosklearn_tmp_$pid_$random_number is used automatically.

delete_tmp_folder_after_terminate: bool, optional (True)

Remove tmp_folder when finished. If tmp_folder is None, the temporary directory is always deleted.

n_jobs: int, optional, experimental

The number of jobs to run in parallel for fit(). -1 means using all processors.

Important notes:

  • By default, Auto-sklearn uses one core.

  • Ensemble building is not affected by n_jobs but can be controlled by the number of models in the ensemble.

  • predict() is not affected by n_jobs (in contrast to most scikit-learn models)

  • If dask_client is None, a new dask client is created.

dask_client: dask.distributed.Client, optional

User-created dask client, can be used to start a dask cluster and then attach auto-sklearn to it.

disable_evaluator_output: bool or list, optional (False)

If True, disable model and prediction output. Cannot be used together with ensemble building. predict() cannot be used when this is set to True. Can also be used as a list to pass more fine-grained information on what to save. Allowed elements in the list are:

  • 'y_optimization' : do not save the predictions for the optimization set, which would later on be used to build an ensemble.

  • 'model' : do not save any model files

smac_scenario_args: dict, optional (None)

Additional arguments inserted into the scenario of SMAC. See the SMAC documentation for a list of available arguments.

get_smac_object_callback: callable

Callback function to create an object of class smac.facade.AbstractFacade. The function must accept the arguments scenario_dict, instances, num_params, runhistory, seed and ta. This is an advanced feature. Use only if you are familiar with SMAC.

logging_config: dict, optional (None)

Dictionary object specifying the logger configuration. If None, the default logging.yaml file is used, which can be found in the directory util/logging.yaml relative to the installation.

metadata_directory: str, optional (None)

Path to the metadata directory. If None, the default directory (autosklearn.metalearning.files) is used.

metric: Scorer, optional (None)

An instance of autosklearn.metrics.Scorer as created by autosklearn.metrics.make_scorer(). These are the Built-in Metrics. If None is provided, a default metric is selected depending on the task.
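
As a sketch, a built-in scorer can be passed directly (here autosklearn.metrics.roc_auc, assuming a binary classification task):

import autosklearn.classification
import autosklearn.metrics

automl = autosklearn.classification.AutoSklearnClassifier(
    metric=autosklearn.metrics.roc_auc,  # a built-in Scorer
)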

scoring_functions: List[Scorer], optional (None)

List of scorers which will be calculated for each pipeline; results will be available via cv_results_.

load_models: bool, optional (True)

Whether to load the models after fitting Auto-sklearn.

get_trials_callback: callable

A callable with the following signature:

  • (smac.SMBO, smac.RunInfo, smac.RunValue, time_left: float) -> bool | None

This will be called after SMAC, the underlying optimizer for autosklearn, finishes training each run.

You can use this to record your own information about the optimization process. You can also use this to enable early stopping based on some criteria.

See the example: Early Stopping And Callbacks.
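
A minimal callback sketch following the signature above; the stopping threshold is illustrative. Returning False stops the optimization, while returning None lets it continue:

def stop_when_good_enough(smbo, run_info, result, time_left):
    # result is a smac.RunValue; its cost is the loss of the finished run.
    if result.cost < 0.05:
        return False  # stop the optimization early
    # Implicitly returning None continues the search.

automl = autosklearn.classification.AutoSklearnClassifier(
    get_trials_callback=stop_when_good_enough,
)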

dataset_compression: Union[bool, Mapping[str, Any]] = True

We compress datasets so that they fit into some predefined amount of memory. Currently this does not apply to dataframes or sparse arrays, only to raw numpy arrays.

NOTE - If using a custom resampling_strategy that relies on specific size or ordering of data, this must be disabled to preserve these properties.

You can disable this entirely by passing False, or leave it as the default True to use the configuration below:

{
    "memory_allocation": 0.1,
    "methods": ["precision", "subsample"]
}

You can also pass your own configuration with the same keys, choosing from the available "methods"; a configuration sketch follows the method descriptions below.

The available options are described here:

  • memory_allocation

    By default, we attempt to fit the dataset into 0.1 * memory_limit. This float value can be set with "memory_allocation": 0.1. We also allow for specifying absolute memory in MB, e.g. 10MB is "memory_allocation": 10.

    The memory used by the dataset is checked after each reduction method is performed. If the dataset fits into the allocated memory, any further methods listed in "methods" will not be performed.

    For example, if methods: ["precision", "subsample"] and the "precision" reduction step was enough to make the dataset fit into memory, then the "subsample" reduction step will not be performed.

  • methods

    We provide the following methods for reducing the dataset size. These can be provided in a list and are performed in the order as given.

    • "precision" - We reduce floating point precision as follows: * np.float128 -> np.float64 * np.float96 -> np.float64 * np.float64 -> np.float32

    • subsample - We subsample data such that it fits directly into the memory allocation memory_allocation * memory_limit. Therefore, this should likely be the last method listed in "methods". Subsampling takes into account classification labels and stratifies accordingly. We guarantee that at least one occurrence of each label is included in the sampled set.

allow_string_features: bool = True

Whether autosklearn should process string features. By default, text preprocessing is enabled.

disable_progress_bar: bool = False

Whether to disable the progress bar that is displayed in the console while fitting to the training data.

Attributes
cv_results_: dict of numpy (masked) ndarrays

A dict with keys as column headers and values as columns, that can be imported into a pandas DataFrame.

Not all keys returned by scikit-learn are supported yet.

performance_over_time_: pandas.DataFrame

A DataFrame containing the models' performance-over-time data. Can be used for plotting directly. Please refer to the example Train and Test Inputs.

fit(X, y, X_test=None, y_test=None, feat_type=None, dataset_name=None)[source]

Fit auto-sklearn to given training set (X, y).

Fit both optimizes the machine learning models and builds an ensemble out of them.

Parameters
X: array-like or sparse matrix of shape = [n_samples, n_features]

The training input samples.

y: array-like, shape = [n_samples] or [n_samples, n_outputs]

The target classes.

X_test: array-like or sparse matrix of shape = [n_samples, n_features]

Test data input samples. Will be used to save test predictions for all models. This allows evaluating the performance of Auto-sklearn over time.

y_test: array-like, shape = [n_samples] or [n_samples, n_outputs]

Test data target classes. Will be used to calculate the test error of all models. This allows evaluating the performance of Auto-sklearn over time.

feat_type: list, optional (default=None)

List of str of len(X.shape[1]) describing the attribute type. Possible types are Categorical and Numerical. Categorical attributes will be automatically One-Hot encoded. The values used for a categorical attribute must be integers, obtained for example by sklearn.preprocessing.LabelEncoder.

dataset_name: str, optional (default=None)

Create nicer output. If None, a string will be determined by the md5 hash of the dataset.

Returns
self
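
A minimal end-to-end sketch (the dataset and the small time limit are illustrative):

import sklearn.datasets
import sklearn.model_selection
import autosklearn.classification

X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, random_state=1
)

automl = autosklearn.classification.AutoSklearnClassifier(time_left_for_this_task=120)
automl.fit(X_train, y_train, X_test=X_test, y_test=y_test, dataset_name='breast_cancer')
print(automl.score(X_test, y_test))
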
fit_ensemble(y, task: int = None, precision: Literal[16, 32, 64] = 32, dataset_name: Optional[str] = None, ensemble_size: int | None = None, ensemble_kwargs: Optional[Dict[str, Any]] = None, ensemble_nbest: Optional[int] = None, ensemble_class: Type[AbstractEnsemble] | Literal['default'] | None = 'default', metric: Scorer | Sequence[Scorer] | None = None)

Fit an ensemble to models trained during an optimization process.

All parameters are None by default. If no other value is given, the default values which were set in a call to fit() are used.

Parameters
y: array-like

Target values.

task: int

A constant from the module autosklearn.constants. Determines the task type (binary classification, multiclass classification, multilabel classification or regression).

precision: int

Numeric precision used when loading ensemble data. Can be either 16, 32 or 64.

dataset_name: str

Name of the current data set.

ensemble_size: int, optional

Number of models added to the ensemble built by Ensemble selection from libraries of models. Models are drawn with replacement. If set to 0 no ensemble is fit.

Deprecated - will be removed in Auto-sklearn 0.16. Please pass this argument via ensemble_kwargs={"ensemble_size": int} if you want to change the ensemble size for ensemble selection.

ensemble_kwargs: Dict, optional

Keyword arguments that are passed to the ensemble class upon initialization.

ensemble_nbest: int

Only consider the ensemble_nbest models when building an ensemble. This is inspired by a concept called library pruning introduced in Getting Most out of Ensemble Selection. This is independent of the ensemble_class argument and this pruning step is done prior to constructing an ensemble.

ensemble_class: Type[AbstractEnsemble] | “default”, optional (default=”default”)

Class implementing the post-hoc ensemble algorithm. Set to None to disable ensemble building, or use SingleBest to use only the single best model instead of an ensemble.

If set to “default” it will use EnsembleSelection for single-objective problems and MultiObjectiveDummyEnsemble for multi-objective problems.

metric: Scorer | Sequence[Scorer] | None = None

A metric or list of metrics with which to score the ensemble.

Returns
self
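
For example, a sketch of running the search without ensemble building and fitting an ensemble afterwards, passing the size via ensemble_kwargs as recommended above:

automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=120,
    ensemble_class=None,  # skip ensemble building during the search
)
automl.fit(X_train, y_train)

# Build an ensemble post hoc from the models found during the search.
automl.fit_ensemble(y_train, ensemble_kwargs={'ensemble_size': 10})
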
fit_pipeline(X: Union[List, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse._base.spmatrix], y: Union[List, pandas.core.series.Series, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse._base.spmatrix], config: Union[ConfigSpace.configuration_space.Configuration, Dict[str, Union[str, float, int]]], dataset_name: Optional[str] = None, X_test: Optional[Union[List, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse._base.spmatrix]] = None, y_test: Optional[Union[List, pandas.core.series.Series, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse._base.spmatrix]] = None, feat_type: Optional[List[str]] = None, *args, **kwargs: Dict) Tuple[Optional[autosklearn.pipeline.base.BasePipeline], smac.runhistory.runhistory.RunInfo, smac.runhistory.runhistory.RunValue]

Fits an individual pipeline configuration and returns the result to the user.

The Estimator constraints, for example the resampling strategy or memory constraints, are honored unless directly provided to the method. By default, this method supports the same signature as fit(), and any extra arguments are redirected to the TAE evaluation function, which allows for further customization while building a pipeline.

Any additional argument provided is directly passed to the worker exercising the run.

Parameters
X: array-like, shape = (n_samples, n_features)

The features used for training

y: array-like

The labels used for training

X_test: Optional[array-like], shape = (n_samples, n_features)

If provided, the testing performance will be tracked on these features.

y_test: array-like

If provided, the testing performance will be tracked on these labels.

config: Union[Configuration, Dict[str, Union[str, float, int]]]

A configuration object used to define the pipeline steps. If a dict is passed, a configuration is created based on this dict.

dataset_name: Optional[str]

Name used to tag and identify the Auto-Sklearn run.

feat_type: list, optional (default=None)

List of str of len(X.shape[1]) describing the attribute type. Possible types are Categorical and Numerical. Categorical attributes will be automatically One-Hot encoded. The values used for a categorical attribute must be integers, obtained for example by sklearn.preprocessing.LabelEncoder.

Returns
pipeline: Optional[BasePipeline]

The fitted pipeline. In case of failure while fitting the pipeline, a None is returned.

run_info: RunInfo

A named tuple that contains the configuration launched

run_value: RunValue

A named tuple that contains the result of the run

get_configuration_space(X: Union[List, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse._base.spmatrix], y: Union[List, pandas.core.series.Series, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse._base.spmatrix], X_test: Optional[Union[List, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse._base.spmatrix]] = None, y_test: Optional[Union[List, pandas.core.series.Series, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse._base.spmatrix]] = None, dataset_name: Optional[str] = None, feat_type: Optional[List[str]] = None) ConfigSpace.configuration_space.ConfigurationSpace

Returns the Configuration Space object, from which Auto-Sklearn will sample configurations and build pipelines.

Parameters
X: array-like or sparse matrix of shape = [n_samples, n_features]

Array with the training features, used to get characteristics like data sparsity

y: array-like, shape = [n_samples] or [n_samples, n_outputs]

Array with the problem labels

X_test: array-like or sparse matrix of shape = [n_samples, n_features]

Array with features used for performance estimation

y_test: array-like, shape = [n_samples] or [n_samples, n_outputs]

Array with the problem labels for the testing split

dataset_name: Optional[str]

A string to tag the Auto-Sklearn run

get_models_with_weights()

Return a list of the final ensemble found by auto-sklearn.

Returns
[(weight_1, model_1), …, (weight_n, model_n)]
get_params(deep=True)

Get parameters for this estimator.

Parameters
deep: bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns
params: dict

Parameter names mapped to their values.

leaderboard(detailed: bool = False, ensemble_only: bool = True, top_k: Union[int, Literal['all']] = 'all', sort_by: str = 'cost', sort_order: Literal['auto', 'ascending', 'descending'] = 'auto', include: Optional[Union[str, Iterable[str]]] = None) pandas.core.frame.DataFrame

Returns a pandas table of results for all evaluated models.

Gives an overview of all models trained during the search process along with various statistics about their training.

The available statistics are:

Simple:

  • "model_id" - The id given to a model by autosklearn.

  • "rank" - The rank of the model based on it’s "cost".

  • "ensemble_weight" - The weight given to the model in the ensemble.

  • "type" - The type of classifier/regressor used.

  • "cost" - The loss of the model on the validation set.

  • "duration" - Length of time the model was optimized for.

Detailed: The detailed view includes all of the simple statistics along with the following.

  • "config_id" - The id used by SMAC for optimization.

  • "budget" - How much budget was allocated to this model.

  • "status" - The return status of training the model with SMAC.

  • "train_loss" - The loss of the model on the training set.

  • "balancing_strategy" - The balancing strategy used for data preprocessing.

  • "start_time" - Time the model began being optimized

  • "end_time" - Time the model ended being optimized

  • "data_preprocessors" - The preprocessors used on the data

  • "feature_preprocessors" - The preprocessors for features types

Parameters
detailed: bool = False

Whether to give detailed information or just a simple overview.

ensemble_only: bool = True

Whether to view only models included in the ensemble or all models trained.

top_k: int or “all” = “all”

How many models to display.

sort_by: str = ‘cost’

What column to sort by. If that column is not present, the sorting defaults to the "model_id" index column.

Defaults to the metric optimized. In the case of a multi-objective optimization problem, sorting is by the first objective.

sort_order: “auto” or “ascending” or “descending” = “auto”

Which sort order to apply to the sort_by column. If left as "auto", it will sort by a sensible default where “better” is on top, otherwise defaulting to the pandas default for DataFrame.sort_values if there is no obvious “better”.

include: Optional[str or Iterable[str]]

Items to include, other items not specified will be excluded. The exception is the "model_id" index column which is always included.

If left as None, it will fall back to using the detailed parameter to decide the columns to include.

Returns
pd.DataFrame

A dataframe of statistics for the models, ordered by sort_by.
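
For example, assuming a fitted estimator automl:

# Ten best models with detailed statistics, best cost on top.
print(automl.leaderboard(detailed=True, top_k=10, sort_by='cost'))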

predict(X, batch_size=None, n_jobs=1)[source]

Predict classes for X.

Parameters
X: array-like or sparse matrix of shape = [n_samples, n_features]
Returns
y: array of shape = [n_samples] or [n_samples, n_labels]

The predicted classes.

predict_proba(X, batch_size=None, n_jobs=1)[source]

Predict probabilities of classes for all samples X.

Parameters
X: array-like or sparse matrix of shape = [n_samples, n_features]
batch_size: int (optional)

Number of data points to predict for (predicts all points at once if None).

n_jobs: int
Returns
y: array of shape = [n_samples, n_classes] or [n_samples, n_labels]

The predicted class probabilities.

refit(X, y)

Refit all models found with fit to new data.

Necessary when using cross-validation. During training, auto-sklearn fits each model k times on the dataset, but does not keep any trained model and therefore cannot be used to predict for new data points. This method fits all models found during a call to fit on the given data. This method may also be used together with holdout to avoid only using 66% of the training data to fit the final model.

Parameters
X: array-like or sparse matrix of shape = [n_samples, n_features]

The training input samples.

y: array-like, shape = [n_samples] or [n_samples, n_outputs]

The targets.

The targets.

Returns
self
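
A sketch of the cross-validation workflow described above, assuming X_train, y_train and X_test are already defined:

automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=120,
    resampling_strategy='cv',
    resampling_strategy_arguments={'folds': 5},
)
automl.fit(X_train, y_train)
automl.refit(X_train, y_train)  # refit each model once on the full training set
predictions = automl.predict(X_test)
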
score(X, y)

Return the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.

Parameters
X: array-like of shape (n_samples, n_features)

Test samples.

y: array-like of shape (n_samples,) or (n_samples, n_outputs)

True labels for X.

sample_weight: array-like of shape (n_samples,), default=None

Sample weights.

Returns
score: float

Mean accuracy of self.predict(X) wrt. y.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters
**params: dict

Estimator parameters.

Returns
self: estimator instance

Estimator instance.

show_models()

Returns a dictionary containing dictionaries of ensemble models.

Each model in the ensemble can be accessed by giving its model_id as key.

A model dictionary contains the following:

  • "model_id" - The id given to a model by autosklearn.

  • "rank" - The rank of the model based on it’s "cost".

  • "cost" - The loss of the model on the validation set.

  • "ensemble_weight" - The weight given to the model in the ensemble.

  • "voting_model" - The cv_voting_ensemble model (for ‘cv’ resampling).

  • "estimators" - List of models (dicts) in cv_voting_ensemble

    (‘cv’ resampling).

  • "data_preprocessor" - The preprocessor used on the data.

  • "balancing" - The balancing used on the data (for classification).

  • "feature_preprocessor" - The preprocessor for features types.

  • "classifier" / "regressor" - The autosklearn wrapped classifier or regressor.

  • "sklearn_classifier" or "sklearn_regressor" - The sklearn classifier or regressor.

Example

import sklearn.datasets
import sklearn.metrics
import autosklearn.regression

X, y = sklearn.datasets.load_diabetes(return_X_y=True)

automl = autosklearn.regression.AutoSklearnRegressor(
    time_left_for_this_task=120
)
automl.fit(X, y, dataset_name='diabetes')

ensemble_dict = automl.show_models()
print(ensemble_dict)

Output:

{
    25: {'model_id': 25.0,
         'rank': 1,
         'cost': 0.43667876507897496,
         'ensemble_weight': 0.38,
         'data_preprocessor': <autosklearn.pipeline.components.data_preprocessing....>,
         'feature_preprocessor': <autosklearn.pipeline.components....>,
         'regressor': <autosklearn.pipeline.components.regression....>,
         'sklearn_regressor': SGDRegressor(alpha=0.0006517033225329654,...)
        },
    6: {'model_id': 6.0,
        'rank': 2,
        'cost': 0.4550418898836528,
        'ensemble_weight': 0.3,
        'data_preprocessor': <autosklearn.pipeline.components.data_preprocessing....>,
        'feature_preprocessor': <autosklearn.pipeline.components....>,
        'regressor': <autosklearn.pipeline.components.regression....>,
        'sklearn_regressor': ARDRegression(alpha_1=0.0003701926442639788,...)
        }...
}
Returns
Dict[int, Any]: dictionary of length = number of models in the ensemble

A dictionary of models in the ensemble, where model_id is the key.

sprint_statistics()

Return the following statistics of the training result:

  • dataset name

  • metric used

  • best validation score

  • number of target algorithm runs

  • number of successful target algorithm runs

  • number of crashed target algorithm runs

  • number of target algorithm runs that exceeded the memory limit

  • number of target algorithm runs that exceeded the time limit

Returns
str
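
For example, assuming a fitted estimator automl:

print(automl.sprint_statistics())
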
class autosklearn.experimental.askl2.AutoSklearn2Classifier(time_left_for_this_task: int = 3600, per_run_time_limit=None, ensemble_size: int | None = None, ensemble_class: AbstractEnsemble | None = <class 'autosklearn.ensembles.ensemble_selection.EnsembleSelection'>, ensemble_kwargs: Dict[str, Any] | None = None, ensemble_nbest: Union[float, int] = 50, max_models_on_disc: int = 50, seed: int = 1, memory_limit: int = 3072, tmp_folder: Optional[str] = None, delete_tmp_folder_after_terminate: bool = True, n_jobs: Optional[int] = None, dask_client: Optional[dask.distributed.Client] = None, disable_evaluator_output: bool = False, smac_scenario_args: Optional[Dict[str, Any]] = None, logging_config: Optional[Dict[str, Any]] = None, metric: Optional[Scorer] = None, scoring_functions: Optional[List[Scorer]] = None, load_models: bool = True, dataset_compression: Union[bool, Mapping[str, Any]] = True, allow_string_features: bool = True, disable_progress_bar: bool = False)[source]
Parameters
time_left_for_this_task: int, optional (default=3600)

Time limit in seconds for the search of appropriate models. By increasing this value, auto-sklearn has a higher chance of finding better models.

per_run_time_limit: int, optional (default=1/10 of time_left_for_this_task)

Time limit for a single call to the machine learning model. Model fitting will be terminated if the machine learning algorithm runs over the time limit. Set this value high enough so that typical machine learning algorithms can be fit on the training data.

ensemble_size: int, optional

Number of models added to the ensemble built by Ensemble selection from libraries of models. Models are drawn with replacement. If set to 0 no ensemble is fit.

Deprecated - will be removed in Auto-sklearn 0.16. Please pass this argument via ensemble_kwargs={"ensemble_size": int} if you want to change the ensemble size for ensemble selection.

ensemble_class: Type[AbstractEnsemble], optional (default=EnsembleSelection)

Class implementing the post-hoc ensemble algorithm. Set to None to disable ensemble building, or use SingleBest to use only the single best model instead of an ensemble.

ensemble_kwargs: Dict, optional

Keyword arguments that are passed to the ensemble class upon initialization.

max_models_on_disc: int, optional (default=50)

Defines the maximum number of models that are kept on disk. Models beyond this limit are permanently deleted. As a consequence, this value also sets the upper limit on how many models can be used for an ensemble. It must be an integer greater than or equal to 1. If set to None, all models are kept on disk.

seed: int, optional (default=1)

Used to seed SMAC. Will determine the output file names.

memory_limit: int, optional (3072)

Memory limit in MB for the machine learning algorithm. auto-sklearn will stop fitting the machine learning algorithm if it tries to allocate more than memory_limit MB.

Important notes:

  • If None is provided, no memory limit is set.

  • In case of multi-processing, memory_limit will be per job, so the total usage is n_jobs x memory_limit.

  • The memory limit also applies to the ensemble creation process.

tmp_folder: string, optional (None)

Folder to store configuration output and log files. If None, /tmp/autosklearn_tmp_$pid_$random_number is used automatically.

delete_tmp_folder_after_terminate: bool, optional (True)

Remove tmp_folder when finished. If tmp_folder is None, the temporary directory is always deleted.

n_jobs: int, optional, experimental

The number of jobs to run in parallel for fit(). -1 means using all processors.

Important notes:

  • By default, Auto-sklearn uses one core.

  • Ensemble building is not affected by n_jobs but can be controlled by the number of models in the ensemble.

  • predict() is not affected by n_jobs (in contrast to most scikit-learn models)

  • If dask_client is None, a new dask client is created.

dask_client: dask.distributed.Client, optional

User-created dask client, can be used to start a dask cluster and then attach auto-sklearn to it.

disable_evaluator_output: bool or list, optional (False)

If True, disable model and prediction output. Cannot be used together with ensemble building. predict() cannot be used when this is set to True. Can also be used as a list to pass more fine-grained information on what to save. Allowed elements in the list are:

  • 'y_optimization' : do not save the predictions for the optimization/validation set, which would later on be used to build an ensemble.

  • 'model' : do not save any model files

smac_scenario_args: dict, optional (None)

Additional arguments inserted into the scenario of SMAC. See the SMAC documentation for a list of available arguments.

logging_config: dict, optional (None)

Dictionary object specifying the logger configuration. If None, the default logging.yaml file is used, which can be found in the directory util/logging.yaml relative to the installation.

metric: Scorer, optional (None)

An instance of autosklearn.metrics.Scorer as created by autosklearn.metrics.make_scorer(). These are the Built-in Metrics. If None is provided, a default metric is selected depending on the task.

scoring_functions: List[Scorer], optional (None)

List of scorers which will be calculated for each pipeline; results will be available via cv_results_.

load_models: bool, optional (True)

Whether to load the models after fitting Auto-sklearn.

disable_progress_bar: bool = False

Whether to disable the progress bar that is displayed in the console while fitting to the training data.

Attributes
cv_results_: dict of numpy (masked) ndarrays

A dict with keys as column headers and values as columns, that can be imported into a pandas DataFrame.

Not all keys returned by scikit-learn are supported yet.

fit(X, y, X_test=None, y_test=None, metric=None, feat_type=None, dataset_name=None)[source]

Fit auto-sklearn to given training set (X, y).

Fit both optimizes the machine learning models and builds an ensemble out of them.

Parameters
X: array-like or sparse matrix of shape = [n_samples, n_features]

The training input samples.

y: array-like, shape = [n_samples] or [n_samples, n_outputs]

The target classes.

X_test: array-like or sparse matrix of shape = [n_samples, n_features]

Test data input samples. Will be used to save test predictions for all models. This allows evaluating the performance of Auto-sklearn over time.

y_test: array-like, shape = [n_samples] or [n_samples, n_outputs]

Test data target classes. Will be used to calculate the test error of all models. This allows evaluating the performance of Auto-sklearn over time.

feat_type: list, optional (default=None)

List of str of len(X.shape[1]) describing the attribute type. Possible types are Categorical and Numerical. Categorical attributes will be automatically One-Hot encoded. The values used for a categorical attribute must be integers, obtained for example by sklearn.preprocessing.LabelEncoder.

dataset_name: str, optional (default=None)

Create nicer output. If None, a string will be determined by the md5 hash of the dataset.

Returns
self
fit_ensemble(y, task: int = None, precision: Literal[16, 32, 64] = 32, dataset_name: Optional[str] = None, ensemble_size: int | None = None, ensemble_kwargs: Optional[Dict[str, Any]] = None, ensemble_nbest: Optional[int] = None, ensemble_class: Type[AbstractEnsemble] | Literal['default'] | None = 'default', metric: Scorer | Sequence[Scorer] | None = None)

Fit an ensemble to models trained during an optimization process.

All parameters are None by default. If no other value is given, the default values which were set in a call to fit() are used.

Parameters
y: array-like

Target values.

task: int

A constant from the module autosklearn.constants. Determines the task type (binary classification, multiclass classification, multilabel classification or regression).

precision: int

Numeric precision used when loading ensemble data. Can be either 16, 32 or 64.

dataset_name: str

Name of the current data set.

ensemble_size: int, optional

Number of models added to the ensemble built by Ensemble selection from libraries of models. Models are drawn with replacement. If set to 0 no ensemble is fit.

Deprecated - will be removed in Auto-sklearn 0.16. Please pass this argument via ensemble_kwargs={"ensemble_size": int} if you want to change the ensemble size for ensemble selection.

ensemble_kwargs: Dict, optional

Keyword arguments that are passed to the ensemble class upon initialization.

ensemble_nbest: int

Only consider the ensemble_nbest models when building an ensemble. This is inspired by a concept called library pruning introduced in Getting Most out of Ensemble Selection. This is independent of the ensemble_class argument and this pruning step is done prior to constructing an ensemble.

ensemble_class: Type[AbstractEnsemble] | “default”, optional (default=”default”)

Class implementing the post-hoc ensemble algorithm. Set to None to disable ensemble building, or use SingleBest to use only the single best model instead of an ensemble.

If set to “default” it will use EnsembleSelection for single-objective problems and MultiObjectiveDummyEnsemble for multi-objective problems.

metric: Scorer | Sequence[Scorer] | None = None

A metric or list of metrics with which to score the ensemble.

Returns
self
fit_pipeline(X: Union[List, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse._base.spmatrix], y: Union[List, pandas.core.series.Series, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse._base.spmatrix], config: Union[ConfigSpace.configuration_space.Configuration, Dict[str, Union[str, float, int]]], dataset_name: Optional[str] = None, X_test: Optional[Union[List, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse._base.spmatrix]] = None, y_test: Optional[Union[List, pandas.core.series.Series, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse._base.spmatrix]] = None, feat_type: Optional[List[str]] = None, *args, **kwargs: Dict) Tuple[Optional[autosklearn.pipeline.base.BasePipeline], smac.runhistory.runhistory.RunInfo, smac.runhistory.runhistory.RunValue]

Fits an individual pipeline configuration and returns the result to the user.

The Estimator constraints, for example the resampling strategy or memory constraints, are honored unless directly provided to the method. By default, this method supports the same signature as fit(), and any extra arguments are redirected to the TAE evaluation function, which allows for further customization while building a pipeline.

Any additional argument provided is directly passed to the worker exercising the run.

Parameters
X: array-like, shape = (n_samples, n_features)

The features used for training

y: array-like

The labels used for training

X_test: Optional[array-like], shape = (n_samples, n_features)

If provided, the testing performance will be tracked on these features.

y_test: array-like

If provided, the testing performance will be tracked on these labels.

config: Union[Configuration, Dict[str, Union[str, float, int]]]

A configuration object used to define the pipeline steps. If a dict is passed, a configuration is created based on this dict.

dataset_name: Optional[str]

Name used to tag and identify the Auto-Sklearn run.

feat_type: list, optional (default=None)

List of str of len(X.shape[1]) describing the attribute type. Possible types are Categorical and Numerical. Categorical attributes will be automatically One-Hot encoded. The values used for a categorical attribute must be integers, obtained for example by sklearn.preprocessing.LabelEncoder.

Returns
pipeline: Optional[BasePipeline]

The fitted pipeline. In case of failure while fitting the pipeline, a None is returned.

run_info: RunInfo

A named tuple that contains the configuration launched

run_value: RunValue

A named tuple that contains the result of the run

get_configuration_space(X: Union[List, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse._base.spmatrix], y: Union[List, pandas.core.series.Series, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse._base.spmatrix], X_test: Optional[Union[List, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse._base.spmatrix]] = None, y_test: Optional[Union[List, pandas.core.series.Series, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse._base.spmatrix]] = None, dataset_name: Optional[str] = None, feat_type: Optional[List[str]] = None) ConfigSpace.configuration_space.ConfigurationSpace

Returns the Configuration Space object, from which Auto-Sklearn will sample configurations and build pipelines.

Parameters
X: array-like or sparse matrix of shape = [n_samples, n_features]

Array with the training features, used to get characteristics like data sparsity

y: array-like, shape = [n_samples] or [n_samples, n_outputs]

Array with the problem labels

X_test: array-like or sparse matrix of shape = [n_samples, n_features]

Array with features used for performance estimation

y_test: array-like, shape = [n_samples] or [n_samples, n_outputs]

Array with the problem labels for the testing split

dataset_name: Optional[str]

A string to tag the Auto-Sklearn run

get_models_with_weights()

Return a list of the final ensemble found by auto-sklearn.

Returns
[(weight_1, model_1), …, (weight_n, model_n)]
get_params(deep=True)

Get parameters for this estimator.

Parameters
deep: bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns
params: dict

Parameter names mapped to their values.

leaderboard(detailed: bool = False, ensemble_only: bool = True, top_k: Union[int, Literal['all']] = 'all', sort_by: str = 'cost', sort_order: Literal['auto', 'ascending', 'descending'] = 'auto', include: Optional[Union[str, Iterable[str]]] = None) pandas.core.frame.DataFrame

Returns a pandas table of results for all evaluated models.

Gives an overview of all models trained during the search process along with various statistics about their training.

The available statistics are:

Simple:

  • "model_id" - The id given to a model by autosklearn.

  • "rank" - The rank of the model based on it’s "cost".

  • "ensemble_weight" - The weight given to the model in the ensemble.

  • "type" - The type of classifier/regressor used.

  • "cost" - The loss of the model on the validation set.

  • "duration" - Length of time the model was optimized for.

Detailed: The detailed view includes all of the simple statistics along with the following.

  • "config_id" - The id used by SMAC for optimization.

  • "budget" - How much budget was allocated to this model.

  • "status" - The return status of training the model with SMAC.

  • "train_loss" - The loss of the model on the training set.

  • "balancing_strategy" - The balancing strategy used for data preprocessing.

  • "start_time" - Time the model began being optimized

  • "end_time" - Time the model ended being optimized

  • "data_preprocessors" - The preprocessors used on the data

  • "feature_preprocessors" - The preprocessors for features types

Parameters
detailed: bool = False

Whether to give detailed information or just a simple overview.

ensemble_only: bool = True

Whether to view only models included in the ensemble or all models trained.

top_k: int or “all” = “all”

How many models to display.

sort_by: str = ‘cost’

What column to sort by. If that column is not present, the sorting defaults to the "model_id" index column.

Defaults to the metric optimized. In the case of a multi-objective optimization problem, sorting is by the first objective.

sort_order: “auto” or “ascending” or “descending” = “auto”

Which sort order to apply to the sort_by column. If left as "auto", it will sort by a sensible default where “better” is on top, otherwise defaulting to the pandas default for DataFrame.sort_values if there is no obvious “better”.

include: Optional[str or Iterable[str]]

Items to include, other items not specified will be excluded. The exception is the "model_id" index column which is always included.

If left as None, it will fall back to using the detailed parameter to decide the columns to include.

Returns
pd.DataFrame

A dataframe of statistics for the models, ordered by sort_by.

predict(X, batch_size=None, n_jobs=1)

Predict classes for X.

Parameters
X: array-like or sparse matrix of shape = [n_samples, n_features]
Returns
y: array of shape = [n_samples] or [n_samples, n_labels]

The predicted classes.

predict_proba(X, batch_size=None, n_jobs=1)

Predict probabilities of classes for all samples X.

Parameters
X: array-like or sparse matrix of shape = [n_samples, n_features]
batch_size: int (optional)

Number of data points to predict for (predicts all points at once if None).

n_jobs: int
Returns
y: array of shape = [n_samples, n_classes] or [n_samples, n_labels]

The predicted class probabilities.

refit(X, y)

Refit all models found with fit to new data.

Necessary when using cross-validation. During training, auto-sklearn fits each model k times on the dataset, but does not keep any trained model and therefore cannot be used to predict for new data points. This method fits all models found during a call to fit on the given data. This method may also be used together with holdout to avoid only using 66% of the training data to fit the final model.

Parameters
X: array-like or sparse matrix of shape = [n_samples, n_features]

The training input samples.

y: array-like, shape = [n_samples] or [n_samples, n_outputs]

The targets.

The targets.

Returns
self
score(X, y)

Return the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.

Parameters
X: array-like of shape (n_samples, n_features)

Test samples.

y: array-like of shape (n_samples,) or (n_samples, n_outputs)

True labels for X.

sample_weight: array-like of shape (n_samples,), default=None

Sample weights.

Returns
score: float

Mean accuracy of self.predict(X) wrt. y.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters
**params: dict

Estimator parameters.

Returns
self: estimator instance

Estimator instance.

show_models()

Returns a dictionary containing dictionaries of ensemble models.

Each model in the ensemble can be accessed by giving its model_id as key.

A model dictionary contains the following:

  • "model_id" - The id given to a model by autosklearn.

  • "rank" - The rank of the model based on it’s "cost".

  • "cost" - The loss of the model on the validation set.

  • "ensemble_weight" - The weight given to the model in the ensemble.

  • "voting_model" - The cv_voting_ensemble model (for ‘cv’ resampling).

  • "estimators" - List of models (dicts) in cv_voting_ensemble

    (‘cv’ resampling).

  • "data_preprocessor" - The preprocessor used on the data.

  • "balancing" - The balancing used on the data (for classification).

  • "feature_preprocessor" - The preprocessor for features types.

  • "classifier" / "regressor" - The autosklearn wrapped classifier or regressor.

  • "sklearn_classifier" or "sklearn_regressor" - The sklearn classifier or regressor.

Example

import sklearn.datasets
import sklearn.metrics
import autosklearn.regression

X, y = sklearn.datasets.load_diabetes(return_X_y=True)

automl = autosklearn.regression.AutoSklearnRegressor(
    time_left_for_this_task=120
)
automl.fit(X, y, dataset_name='diabetes')

ensemble_dict = automl.show_models()
print(ensemble_dict)

Output:

{
    25: {'model_id': 25.0,
         'rank': 1,
         'cost': 0.43667876507897496,
         'ensemble_weight': 0.38,
         'data_preprocessor': <autosklearn.pipeline.components.data_preprocessing....>,
         'feature_preprocessor': <autosklearn.pipeline.components....>,
         'regressor': <autosklearn.pipeline.components.regression....>,
         'sklearn_regressor': SGDRegressor(alpha=0.0006517033225329654,...)
        },
    6: {'model_id': 6.0,
        'rank': 2,
        'cost': 0.4550418898836528,
        'ensemble_weight': 0.3,
        'data_preprocessor': <autosklearn.pipeline.components.data_preprocessing....>,
        'feature_preprocessor': <autosklearn.pipeline.components....>,
        'regressor': <autosklearn.pipeline.components.regression....>,
        'sklearn_regressor': ARDRegression(alpha_1=0.0003701926442639788,...)
        }...
}
Returns
Dict[int, Any]: dictionary of length = number of models in the ensemble

A dictionary of models in the ensemble, where model_id is the key.

sprint_statistics()

Return the following statistics of the training result:

  • dataset name

  • metric used

  • best validation score

  • number of target algorithm runs

  • number of successful target algorithm runs

  • number of crashed target algorithm runs

  • number of target algorithm runs that exceeded the memory limit

  • number of target algorithm runs that exceeded the time limit

Returns
str
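
A minimal usage sketch; AutoSklearn2Classifier follows the same estimator interface as AutoSklearnClassifier (dataset and time limit are illustrative):

import sklearn.datasets
import sklearn.model_selection
from autosklearn.experimental.askl2 import AutoSklearn2Classifier

X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, random_state=1
)

automl = AutoSklearn2Classifier(time_left_for_this_task=120)
automl.fit(X_train, y_train)
print(automl.score(X_test, y_test))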

Regression

class autosklearn.regression.AutoSklearnRegressor(time_left_for_this_task=3600, per_run_time_limit=None, initial_configurations_via_metalearning=25, ensemble_size: int | None = None, ensemble_class: Type[AbstractEnsemble] | Literal['default'] | None = 'default', ensemble_kwargs: Dict[str, Any] | None = None, ensemble_nbest=50, max_models_on_disc=50, seed=1, memory_limit=3072, include: Optional[Dict[str, List[str]]] = None, exclude: Optional[Dict[str, List[str]]] = None, resampling_strategy='holdout', resampling_strategy_arguments=None, tmp_folder=None, delete_tmp_folder_after_terminate=True, n_jobs: Optional[int] = None, dask_client: Optional[dask.distributed.Client] = None, disable_evaluator_output=False, get_smac_object_callback=None, smac_scenario_args=None, logging_config=None, metadata_directory=None, metric: Scorer | Sequence[Scorer] | None = None, scoring_functions: Optional[List[Scorer]] = None, load_models: bool = True, get_trials_callback: SMACCallback | None = None, dataset_compression: Union[bool, Mapping[str, Any]] = True, allow_string_features: bool = True, disable_progress_bar: bool = False)[source]

This class implements the regression task.
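
A minimal usage sketch (dataset and time limit are illustrative):

import sklearn.datasets
import sklearn.model_selection
import autosklearn.regression

X, y = sklearn.datasets.load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, random_state=1
)

automl = autosklearn.regression.AutoSklearnRegressor(time_left_for_this_task=120)
automl.fit(X_train, y_train, dataset_name='diabetes')
print(automl.score(X_test, y_test))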

Parameters
time_left_for_this_task: int, optional (default=3600)

Time limit in seconds for the search of appropriate models. By increasing this value, auto-sklearn has a higher chance of finding better models.

per_run_time_limit: int, optional (default=1/10 of time_left_for_this_task)

Time limit for a single call to the machine learning model. Model fitting will be terminated if the machine learning algorithm runs over the time limit. Set this value high enough so that typical machine learning algorithms can be fit on the training data.

initial_configurations_via_metalearning: int, optional (default=25)

Initialize the hyperparameter optimization algorithm with this many configurations which worked well on previously seen datasets. Disable if the hyperparameter optimization algorithm should start from scratch.

ensemble_size: int, optional

Number of models added to the ensemble built by Ensemble selection from libraries of models. Models are drawn with replacement. If set to 0 no ensemble is fit.

Deprecated - will be removed in Auto-sklearn 0.16. Please pass this argument via ensemble_kwargs={"ensemble_size": int} if you want to change the ensemble size for ensemble selection.

ensemble_class: Type[AbstractEnsemble] | “default”, optional (default=”default”)

Class implementing the post-hoc ensemble algorithm. Set to None to disable ensemble building, or use SingleBest to use only the single best model instead of an ensemble.

If set to “default” it will use EnsembleSelection for single-objective problems and MultiObjectiveDummyEnsemble for multi-objective problems.

ensemble_kwargs: Dict, optional

Keyword arguments that are passed to the ensemble class upon initialization.

ensemble_nbest: int, optional (default=50)

Only consider the ensemble_nbest models when building an ensemble. This is inspired by a concept called library pruning introduced in Getting Most out of Ensemble Selection. This is independent of the ensemble_class argument and this pruning step is done prior to constructing an ensemble.

max_models_on_disc: int, optional (default=50)

Defines the maximum number of models that are kept on disk. Models beyond this limit are permanently deleted. As a consequence, this value also sets the upper limit on how many models can be used for an ensemble. It must be an integer greater than or equal to 1. If set to None, all models are kept on disk.

seed: int, optional (default=1)

Used to seed SMAC. Will determine the output file names.

memory_limit: int, optional (3072)

Memory limit in MB for the machine learning algorithm. auto-sklearn will stop fitting the machine learning algorithm if it tries to allocate more than memory_limit MB.

Important notes:

  • If None is provided, no memory limit is set.

  • In case of multi-processing, memory_limit will be per job, so the total usage is n_jobs x memory_limit.

  • The memory limit also applies to the ensemble creation process.

include: Optional[Dict[str, List[str]]] = None

If None, all possible algorithms are used.

Otherwise, specifies a step and the components that are included in search. See /pipeline/components/<step>/* for available components.

Incompatible with parameter exclude.

Possible Steps:

  • "data_preprocessor"

  • "balancing"

  • "feature_preprocessor"

  • "classifier" - Only for when when using AutoSklearnClasssifier

  • "regressor" - Only for when when using AutoSklearnRegressor

Example:

include = {
    'classifier': ["random_forest"],
    'feature_preprocessor': ["no_preprocessing"]
}
exclude: Optional[Dict[str, List[str]]] = None

If None, all possible algorithms are used.

Otherwise, specifies a step and the components that are excluded from search. See /pipeline/components/<step>/* for available components.

Incompatible with parameter include.

Possible Steps:

  • "data_preprocessor"

  • "balancing"

  • "feature_preprocessor"

  • "classifier" - Only for when when using AutoSklearnClasssifier

  • "regressor" - Only for when when using AutoSklearnRegressor

Example:

exclude = {
    'classifier': ["random_forest"],
    'feature_preprocessor': ["no_preprocessing"]
}
resampling_strategy: str | BaseCrossValidator | _RepeatedSplits | BaseShuffleSplit = “holdout”

How to handle overfitting; may require resampling_strategy_arguments when using a "cv"-based method or a Splitter object.

  • Options
    • "holdout" - Use a 67:33 (train:test) split

    • "cv": perform cross validation, requires “folds” in resampling_strategy_arguments

    • "holdout-iterative-fit" - Same as “holdout” but iterative fit where possible

    • "cv-iterative-fit": Same as “cv” but iterative fit where possible

    • "partial-cv": Same as “cv” but uses intensification.

    • BaseCrossValidator - any BaseCrossValidator subclass (found in scikit-learn model_selection module)

    • _RepeatedSplits - any _RepeatedSplits subclass (found in scikit-learn model_selection module)

    • BaseShuffleSplit - any BaseShuffleSplit subclass (found in scikit-learn model_selection module)

If using a Splitter object that relies on the dataset retaining its current size and order, you will need to look at the dataset_compression argument and ensure that "subsample" is not included in the applied compression "methods", or disable it entirely with False.

resampling_strategy_arguments: Optional[Dict] = None

Additional arguments for resampling_strategy; required when using a "cv"-based strategy. The default arguments if left as None are:

{
    "train_size": 0.67,     # The size of the training set
    "shuffle": True,        # Whether to shuffle before splitting data
    "folds": 5              # Used in 'cv' based resampling strategies
}

If using a custom splitter class that takes n_splits (such as PredefinedSplit), the value of "folds" will be used.

tmp_folder: string, optional (None)

Folder to store configuration output and log files. If None, /tmp/autosklearn_tmp_$pid_$random_number is used automatically.

delete_tmp_folder_after_terminate: bool, optional (True)

Remove tmp_folder when finished. If tmp_folder is None, the temporary directory is always deleted.

n_jobs: int, optional, experimental

The number of jobs to run in parallel for fit(). -1 means using all processors.

Important notes:

  • By default, Auto-sklearn uses one core.

  • Ensemble building is not affected by n_jobs but can be controlled by the number of models in the ensemble.

  • predict() is not affected by n_jobs (in contrast to most scikit-learn models)

  • If dask_client is None, a new dask client is created.
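Taken together, the notes above mean total memory usage scales with n_jobs. A sketch with illustrative values: four workers each limited to 2048 MB allow roughly 4 x 2048 MB in total.

import autosklearn.regression

# Illustrative limits: 4 parallel jobs, each capped at 2048 MB,
# for a total memory budget of roughly 4 x 2048 MB.
automl = autosklearn.regression.AutoSklearnRegressor(
    n_jobs=4,
    memory_limit=2048,
)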

dask_clientdask.distributed.Client, optional

A user-created dask client, which can be used to start a dask cluster and then attach auto-sklearn to it.

disable_evaluator_output: bool or list, optional (False)

If True, disable model and prediction output. Cannot be used together with ensemble building; predict() cannot be used when this is set to True. Can also be given as a list for more fine-grained control over what to save. Allowed elements in the list are:

  • 'y_optimization' : do not save the predictions for the optimization set, which would later on be used to build an ensemble.

  • 'model' : do not save any model files

smac_scenario_argsdict, optional (None)

Additional arguments inserted into the scenario of SMAC. See the SMAC documentation for a list of available arguments.

get_smac_object_callbackcallable

Callback function to create an object of class smac.facade.AbstractFacade. The function must accept the arguments scenario_dict, instances, num_params, runhistory, seed and ta. This is an advanced feature. Use only if you are familiar with SMAC.

logging_configdict, optional (None)

dictionary object specifying the logger configuration. If None, the default logging.yaml file is used, which can be found in the directory util/logging.yaml relative to the installation.

metadata_directorystr, optional (None)

path to the metadata directory. If None, the default directory (autosklearn.metalearning.files) is used.

metricScorer, optional (None)

An instance of autosklearn.metrics.Scorer as created by autosklearn.metrics.make_scorer(). These are the Built-in Metrics. If None is provided, a default metric is selected depending on the task.

scoring_functionsList[Scorer], optional (None)

List of scorers which will be calculated for each pipeline; results will be available via cv_results_

load_modelsbool, optional (True)

Whether to load the models after fitting Auto-sklearn.

get_trials_callback: callable

A callable with the following definition.

  • (smac.SMBO, smac.RunInfo, smac.RunValue, time_left: float) -> bool | None

This will be called after SMAC, the underlying optimizer for autosklearn, finishes training each run.

You can use this to record your own information about the optimization process. You can also use it to enable early stopping based on some criteria.

See the example: Early Stopping And Callbacks.
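A minimal sketch of such a callback (the stopping criterion is illustrative): it returns False to stop the search once a run reaches a sufficiently low validation loss.

import autosklearn.regression

def stop_when_good_enough(smbo, run_info, run_value, time_left):
    # Illustrative criterion: stop once a run achieves a validation loss below 0.05.
    if run_value.cost < 0.05:
        return False  # returning False stops the optimization
    return None       # None (or True) lets the search continue

automl = autosklearn.regression.AutoSklearnRegressor(
    time_left_for_this_task=120,
    get_trials_callback=stop_when_good_enough,
)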

dataset_compression: Union[bool, Mapping[str, Any]] = True

We compress datasets so that they fit into some predefined amount of memory. Currently this does not apply to dataframes or sparse arrays, only to raw numpy arrays.

NOTE - If using a custom resampling_strategy that relies on specific size or ordering of data, this must be disabled to preserve these properties.

You can disable this entirely by passing False, or leave it as the default True, which uses the following configuration:

{
    "memory_allocation": 0.1,
    "methods": ["precision", "subsample"]
}

You can also pass your own configuration with the same keys, choosing from the available "methods"; see the sketch after the list below.

The available options are described here:

  • memory_allocation

    By default, we attempt to fit the dataset into 0.1 * memory_limit. This float value can be set with "memory_allocation": 0.1. We also allow for specifying absolute memory in MB, e.g. 10MB is "memory_allocation": 10.

    The memory used by the dataset is checked after each reduction method is performed. If the dataset fits into the allocated memory, any further methods listed in "methods" will not be performed.

    For example, if methods: ["precision", "subsample"] and the "precision" reduction step was enough to make the dataset fit into memory, then the "subsample" reduction step will not be performed.

  • methods

    We provide the following methods for reducing the dataset size. These can be provided in a list and are performed in the order as given.

    • "precision" - We reduce floating point precision as follows: * np.float128 -> np.float64 * np.float96 -> np.float64 * np.float64 -> np.float32

    • subsample - We subsample data such that it fits directly into the memory allocation memory_allocation * memory_limit. Therefore, this should likely be the last method listed in "methods". Subsampling takes into account classification labels and stratifies accordingly. We guarantee that at least one occurrence of each label is included in the sampled set.
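For example, a sketch of a custom configuration (the absolute allocation value is illustrative) that reserves 256 MB for the dataset and only ever reduces precision:

import autosklearn.regression

# Cap the dataset at an absolute 256 MB and never subsample,
# only reduce floating point precision.
automl = autosklearn.regression.AutoSklearnRegressor(
    dataset_compression={"memory_allocation": 256, "methods": ["precision"]},
)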

allow_string_features: bool = True

Whether autosklearn should process string features. By default, text preprocessing is enabled.

disable_progress_bar: bool = False

Whether to disable the progress bar that is displayed in the console while fitting to the training data.

Attributes
cv_results_dict of numpy (masked) ndarrays

A dict with keys as column headers and values as columns, that can be imported into a pandas DataFrame.

Not all keys returned by scikit-learn are supported yet.

performance_over_time_pandas.core.frame.DataFrame

A DataFrame containing the models' performance over time. Can be used directly for plotting. Please refer to the example Train and Test Inputs.

fit(X, y, X_test=None, y_test=None, feat_type=None, dataset_name=None)[source]

Fit Auto-sklearn to given training set (X, y).

Fit both optimizes the machine learning models and builds an ensemble out of them.

Parameters
Xarray-like or sparse matrix of shape = [n_samples, n_features]

The training input samples.

yarray-like, shape = [n_samples] or [n_samples, n_targets]

The regression target.

X_testarray-like or sparse matrix of shape = [n_samples, n_features]

Test data input samples. Will be used to save test predictions for all models. This allows evaluating the performance of Auto-sklearn over time.

y_testarray-like, shape = [n_samples] or [n_samples, n_targets]

The regression target. Will be used to calculate the test error of all models. This allows evaluating the performance of Auto-sklearn over time.

feat_typelist, optional (default=None)

List of str of len(X.shape[1]) describing the attribute type. Possible types are Categorical and Numerical. Categorical attributes will be automatically One-Hot encoded.

dataset_namestr, optional (default=None)

Create nicer output. If None, a string will be determined by the md5 hash of the dataset.

Returns
self
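A minimal end-to-end sketch of fit() (dataset and time budget are illustrative):

import sklearn.datasets
import sklearn.model_selection
import autosklearn.regression

X, y = sklearn.datasets.load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, random_state=1)

automl = autosklearn.regression.AutoSklearnRegressor(time_left_for_this_task=120)

# Passing the test split lets auto-sklearn record test predictions over time.
automl.fit(X_train, y_train, X_test=X_test, y_test=y_test, dataset_name='diabetes')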
fit_ensemble(y, task: int = None, precision: Literal[16, 32, 64] = 32, dataset_name: Optional[str] = None, ensemble_size: int | None = None, ensemble_kwargs: Optional[Dict[str, Any]] = None, ensemble_nbest: Optional[int] = None, ensemble_class: Type[AbstractEnsemble] | Literal['default'] | None = 'default', metric: Scorer | Sequence[Scorer] | None = None)

Fit an ensemble to models trained during an optimization process.

All parameters are None by default. If no other value is given, the default values which were set in a call to fit() are used.

Parameters
yarray-like

Target values.

taskint

A constant from the module autosklearn.constants. Determines the task type (binary classification, multiclass classification, multilabel classification or regression).

precisionint

Numeric precision used when loading ensemble data. Can be either 16, 32 or 64.

dataset_namestr

Name of the current data set.

ensemble_sizeint, optional

Number of models added to the ensemble built by Ensemble selection from libraries of models. Models are drawn with replacement. If set to 0 no ensemble is fit.

Deprecated - will be removed in Auto-sklearn 0.16. Please pass this argument via ensemble_kwargs={"ensemble_size": int} if you want to change the ensemble size for ensemble selection.

ensemble_kwargsDict, optional

Keyword arguments that are passed to the ensemble class upon initialization.

ensemble_nbestint

Only consider the ensemble_nbest models when building an ensemble. This is inspired by a concept called library pruning introduced in Getting Most out of Ensemble Selection. This is independent of the ensemble_class argument and this pruning step is done prior to constructing an ensemble.

ensemble_classType[AbstractEnsemble] | “default”, optional (default=”default”)

Class implementing the post-hoc ensemble algorithm. Set to None to disable ensemble building, or use SingleBest to use only the single best model instead of an ensemble.

If set to “default” it will use EnsembleSelection for single-objective problems and MultiObjectiveDummyEnsemble for multi-objective problems.

metric: Scorer | Sequence[Scorer] | None = None

A metric or list of metrics to score the ensemble with

Returns
self
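A sketch of the intended workflow (the ensemble size is illustrative): search without building an ensemble, then fit one afterwards.

import sklearn.datasets
import autosklearn.regression

X, y = sklearn.datasets.load_diabetes(return_X_y=True)

# Search first without building any post-hoc ensemble ...
automl = autosklearn.regression.AutoSklearnRegressor(
    time_left_for_this_task=120,
    ensemble_class=None,
)
automl.fit(X, y, dataset_name='diabetes')

# ... then build an ensemble of up to 10 models from the trained candidates.
automl.fit_ensemble(y, ensemble_kwargs={"ensemble_size": 10})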
fit_pipeline(X: Union[List, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse._base.spmatrix], y: Union[List, pandas.core.series.Series, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse._base.spmatrix], config: Union[ConfigSpace.configuration_space.Configuration, Dict[str, Union[str, float, int]]], dataset_name: Optional[str] = None, X_test: Optional[Union[List, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse._base.spmatrix]] = None, y_test: Optional[Union[List, pandas.core.series.Series, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse._base.spmatrix]] = None, feat_type: Optional[List[str]] = None, *args, **kwargs: Dict) Tuple[Optional[autosklearn.pipeline.base.BasePipeline], smac.runhistory.runhistory.RunInfo, smac.runhistory.runhistory.RunValue]

Fits an individual pipeline configuration and returns the result to the user.

The estimator's constraints, for example the resampling strategy or memory limits, are honored unless directly overridden by arguments to this method. By default, this method supports the same signature as fit(), and any extra arguments are redirected to the TAE evaluation function, which allows for further customization while building a pipeline.

Any additional argument provided is directly passed to the worker exercising the run.

Parameters
X: array-like, shape = (n_samples, n_features)

The features used for training

y: array-like

The labels used for training

X_test: Optionalarray-like, shape = (n_samples, n_features)

If provided, the testing performance will be tracked on these features.

y_test: array-like

If provided, the testing performance will be tracked on these labels.

config: Union[Configuration, Dict[str, Union[str, float, int]]]

A configuration object used to define the pipeline steps. If a dict is passed, a configuration is created based on this dict.

dataset_name: Optional[str]

Name used to tag and identify the Auto-Sklearn run.

feat_typelist, optional (default=None)

List of str of len(X.shape[1]) describing the attribute type. Possible types are Categorical and Numerical. Categorical attributes will be automatically One-Hot encoded. The values used for a categorical attribute must be integers, obtained for example by sklearn.preprocessing.LabelEncoder.

Returns
pipeline: Optional[BasePipeline]

The fitted pipeline. In case of failure while fitting the pipeline, None is returned.

run_info: RunInfo

A named tuple that contains the configuration launched

run_value: RunValue

A named tuple that contains the result of the run

get_configuration_space(X: Union[List, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse._base.spmatrix], y: Union[List, pandas.core.series.Series, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse._base.spmatrix], X_test: Optional[Union[List, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse._base.spmatrix]] = None, y_test: Optional[Union[List, pandas.core.series.Series, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse._base.spmatrix]] = None, dataset_name: Optional[str] = None, feat_type: Optional[List[str]] = None) ConfigSpace.configuration_space.ConfigurationSpace

Returns the Configuration Space object, from which Auto-Sklearn will sample configurations and build pipelines.

Parameters
Xarray-like or sparse matrix of shape = [n_samples, n_features]

Array with the training features, used to get characteristics like data sparsity

yarray-like, shape = [n_samples] or [n_samples, n_outputs]

Array with the problem labels

X_testarray-like or sparse matrix of shape = [n_samples, n_features]

Array with features used for performance estimation

y_testarray-like, shape = [n_samples] or [n_samples, n_outputs]

Array with the problem labels for the testing split

dataset_name: Optional[str]

A string to tag the Auto-Sklearn run
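A sketch combining get_configuration_space() and fit_pipeline(): sample one configuration from the space and evaluate it under the estimator's constraints (the dataset and the handling of the return values are illustrative).

import sklearn.datasets
import autosklearn.regression

X, y = sklearn.datasets.load_diabetes(return_X_y=True)

automl = autosklearn.regression.AutoSklearnRegressor(time_left_for_this_task=120)

# Build the configuration space for this dataset and draw one configuration.
cs = automl.get_configuration_space(X, y, dataset_name='diabetes')
config = cs.sample_configuration()

# Fit the single sampled pipeline; pipeline is None if the fit failed.
pipeline, run_info, run_value = automl.fit_pipeline(X=X, y=y, config=config)
if pipeline is not None:
    print(run_value.cost)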

get_models_with_weights()

Return the final ensemble found by auto-sklearn as a list of (weight, model) pairs.

Returns
[(weight_1, model_1), …, (weight_n, model_n)]
get_params(deep=True)

Get parameters for this estimator.

Parameters
deepbool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns
paramsdict

Parameter names mapped to their values.

leaderboard(detailed: bool = False, ensemble_only: bool = True, top_k: Union[int, Literal['all']] = 'all', sort_by: str = 'cost', sort_order: Literal['auto', 'ascending', 'descending'] = 'auto', include: Optional[Union[str, Iterable[str]]] = None) pandas.core.frame.DataFrame

Returns a pandas table of results for all evaluated models.

Gives an overview of all models trained during the search process along with various statistics about their training.

The available statistics are:

Simple:

  • "model_id" - The id given to a model by autosklearn.

  • "rank" - The rank of the model based on it’s "cost".

  • "ensemble_weight" - The weight given to the model in the ensemble.

  • "type" - The type of classifier/regressor used.

  • "cost" - The loss of the model on the validation set.

  • "duration" - Length of time the model was optimized for.

Detailed: The detailed view includes all of the simple statistics along with the following.

  • "config_id" - The id used by SMAC for optimization.

  • "budget" - How much budget was allocated to this model.

  • "status" - The return status of training the model with SMAC.

  • "train_loss" - The loss of the model on the training set.

  • "balancing_strategy" - The balancing strategy used for data preprocessing.

  • "start_time" - Time the model began being optimized

  • "end_time" - Time the model ended being optimized

  • "data_preprocessors" - The preprocessors used on the data

  • "feature_preprocessors" - The preprocessors for features types

Parameters
detailed: bool = False

Whether to give detailed information or just a simple overview.

ensemble_only: bool = True

Whether to view only models included in the ensemble or all models trained.

top_k: int or “all” = “all”

How many models to display.

sort_by: str = ‘cost’

What column to sort by. If that column is not present, the sorting defaults to the "model_id" index column.

Defaults to the optimized metric; in a multi-objective optimization problem, sorting uses the first objective.

sort_order: “auto” or “ascending” or “descending” = “auto”

Which sort order to apply to the sort_by column. If left as "auto", it will sort by a sensible default where “better” is on top, otherwise defaulting to the pandas default for DataFrame.sort_values if there is no obvious “better”.

include: Optional[str or Iterable[str]]

Items to include, other items not specified will be excluded. The exception is the "model_id" index column which is always included.

If left as None, it falls back to the detailed parameter to decide which columns to include.

Returns
pd.DataFrame

A dataframe of statistics for the models, ordered by sort_by.
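For example, a usage sketch (assuming automl is an already fitted estimator):

# Assuming `automl` is a fitted estimator: show the ten best models
# over all runs, with the detailed statistics listed above.
df = automl.leaderboard(detailed=True, ensemble_only=False, top_k=10)
print(df)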

predict(X, batch_size=None, n_jobs=1)[source]

Predict regression target for X.

Parameters
Xarray-like or sparse matrix of shape = [n_samples, n_features]
Returns
yarray of shape = [n_samples] or [n_samples, n_outputs]

The predicted values.

refit(X, y)

Refit all models found with fit to new data.

Necessary when using cross-validation. During training, auto-sklearn fits each model k times on the dataset, but does not keep any trained model and can therefore not be used to predict for new data points. This method fits all models found during a call to fit() on the given data. It may also be used together with holdout to avoid fitting the final model on only 67% of the training data.

Parameters
Xarray-like or sparse matrix of shape = [n_samples, n_features]

The training input samples.

yarray-like, shape = [n_samples] or [n_samples, n_outputs]

The targets.

Returns
self
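A sketch of the cross-validation workflow described above (the time budget is illustrative):

import sklearn.datasets
import autosklearn.regression

X, y = sklearn.datasets.load_diabetes(return_X_y=True)

automl = autosklearn.regression.AutoSklearnRegressor(
    time_left_for_this_task=120,
    resampling_strategy='cv',
    resampling_strategy_arguments={'folds': 5},
)
automl.fit(X, y, dataset_name='diabetes')

# With 'cv', no single final model is kept, so refit on the full data first.
automl.refit(X, y)
predictions = automl.predict(X)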
score(X, y)

Return the coefficient of determination \(R^2\) of the prediction.

The coefficient \(R^2\) is defined as \((1 - \frac{u}{v})\), where \(u\) is the residual sum of squares ((y_true - y_pred) ** 2).sum() and \(v\) is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get a \(R^2\) score of 0.0.

Parameters
Xarray-like of shape (n_samples, n_features)

Test samples. For some estimators this may be a precomputed kernel matrix or a list of generic objects instead with shape (n_samples, n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for the estimator.

yarray-like of shape (n_samples,) or (n_samples, n_outputs)

True values for X.

sample_weightarray-like of shape (n_samples,), default=None

Sample weights.

Returns
scorefloat

\(R^2\) of self.predict(X) wrt. y.

Notes

The \(R^2\) score used when calling score on a regressor uses multioutput='uniform_average' from version 0.23 to keep consistent with default value of r2_score(). This influences the score method of all the multioutput regressors (except for MultiOutputRegressor).

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters
**paramsdict

Estimator parameters.

Returns
selfestimator instance

Estimator instance.

show_models()

Returns a dictionary containing dictionaries of ensemble models.

Each model in the ensemble can be accessed by giving its model_id as key.

A model dictionary contains the following:

  • "model_id" - The id given to a model by autosklearn.

  • "rank" - The rank of the model based on it’s "cost".

  • "cost" - The loss of the model on the validation set.

  • "ensemble_weight" - The weight given to the model in the ensemble.

  • "voting_model" - The cv_voting_ensemble model (for ‘cv’ resampling).

  • "estimators" - List of models (dicts) in cv_voting_ensemble

    (‘cv’ resampling).

  • "data_preprocessor" - The preprocessor used on the data.

  • "balancing" - The balancing used on the data (for classification).

  • "feature_preprocessor" - The preprocessor for features types.

  • "classifier" / "regressor" - The autosklearn wrapped classifier or regressor.

  • "sklearn_classifier" or "sklearn_regressor" - The sklearn classifier or regressor.

Example

import sklearn.datasets
import sklearn.metrics
import autosklearn.regression

X, y = sklearn.datasets.load_diabetes(return_X_y=True)

automl = autosklearn.regression.AutoSklearnRegressor(
    time_left_for_this_task=120,
)
automl.fit(X, y, dataset_name='diabetes')

ensemble_dict = automl.show_models()
print(ensemble_dict)

Output:

{
    25: {'model_id': 25.0,
         'rank': 1,
         'cost': 0.43667876507897496,
         'ensemble_weight': 0.38,
         'data_preprocessor': <autosklearn.pipeline.components.data_preprocessing....>,
         'feature_preprocessor': <autosklearn.pipeline.components....>,
         'regressor': <autosklearn.pipeline.components.regression....>,
         'sklearn_regressor': SGDRegressor(alpha=0.0006517033225329654,...)
        },
    6: {'model_id': 6.0,
        'rank': 2,
        'cost': 0.4550418898836528,
        'ensemble_weight': 0.3,
        'data_preprocessor': <autosklearn.pipeline.components.data_preprocessing....>,
        'feature_preprocessor': <autosklearn.pipeline.components....>,
        'regressor': <autosklearn.pipeline.components.regression....>,
        'sklearn_regressor': ARDRegression(alpha_1=0.0003701926442639788,...)
        }...
}
Returns
Dict(int, Any)dictionary of length = number of models in the ensemble

A dictionary of models in the ensemble, where model_id is the key.

sprint_statistics()

Return the following statistics of the training result:

  • dataset name

  • metric used

  • best validation score

  • number of target algorithm runs

  • number of successful target algorithm runs

  • number of crashed target algorithm runs

  • number of target algorithm runs that exceeded the memory limit

  • number of target algorithm runs that exceeded the time limit

Returns
str

Metrics

autosklearn.metrics.make_scorer(name: str, score_func: Callable, *, optimum: float = 1.0, worst_possible_result: float = 0.0, greater_is_better: bool = True, needs_proba: bool = False, needs_threshold: bool = False, needs_X: bool = False, **kwargs: Any) autosklearn.metrics.Scorer[source]

Make a scorer from a performance metric or loss function.

Factory inspired by scikit-learn which wraps scikit-learn scoring functions to be used in auto-sklearn.

Parameters
name: str

Descriptive name of the metric

score_funccallable

Score function (or loss function) with signature score_func(y, y_pred, **kwargs).

optimumint or float, default=1

The best score achievable by the score function, i.e. maximum in case of scorer function and minimum in case of loss function.

worst_possible_resultint or float, default=0

The worst score achievable by the score function, i.e. minimum in case of scorer function and maximum in case of loss function.

greater_is_betterboolean, default=True

Whether score_func is a score function (default), meaning high is good, or a loss function, meaning low is good. In the latter case, the scorer object will sign-flip the outcome of the score_func.

needs_probaboolean, default=False

Whether score_func requires predict_proba to get probability estimates out of a classifier.

needs_thresholdboolean, default=False

Whether score_func takes a continuous decision certainty. This only works for binary classification.

needs_Xboolean, default=False

Whether score_func requires X in __call__ to compute a metric.

**kwargsadditional arguments

Additional parameters to be passed to score_func.

Returns
scorercallable

Callable object that returns a scalar score; greater is better. Set greater_is_better to False for loss functions.
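As a sketch, wrapping a simple error-rate loss (lower is better) as a scorer:

import numpy as np
import autosklearn.metrics

def error_rate(y_true, y_pred):
    # Fraction of misclassified samples; lower is better.
    return np.mean(y_true != y_pred)

error_scorer = autosklearn.metrics.make_scorer(
    name='error_rate',
    score_func=error_rate,
    optimum=0,
    worst_possible_result=1,
    greater_is_better=False,
)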

Built-in Metrics

Classification metrics

Note: The default autosklearn.metrics.f1, autosklearn.metrics.precision and autosklearn.metrics.recall built-in metrics are applicable only for binary classification. In order to apply them on multilabel and multiclass classification, please use the corresponding metrics with an appropriate averaging mechanism, such as autosklearn.metrics.f1_macro. For more information about how these metrics are used, please read this scikit-learn documentation.

autosklearn.metrics.accuracy

alias of accuracy

autosklearn.metrics.balanced_accuracy

alias of balanced_accuracy

autosklearn.metrics.f1

alias of f1

autosklearn.metrics.f1_macro

alias of f1_macro

autosklearn.metrics.f1_micro

alias of f1_micro

autosklearn.metrics.f1_samples

alias of f1_samples

autosklearn.metrics.f1_weighted

alias of f1_weighted

autosklearn.metrics.roc_auc

alias of roc_auc

autosklearn.metrics.precision

alias of precision

autosklearn.metrics.precision_macro

alias of precision_macro

autosklearn.metrics.precision_micro

alias of precision_micro

autosklearn.metrics.precision_samples

alias of precision_samples

autosklearn.metrics.precision_weighted

alias of precision_weighted

autosklearn.metrics.average_precision

alias of average_precision

autosklearn.metrics.recall

alias of recall

autosklearn.metrics.recall_macro

alias of recall_macro

autosklearn.metrics.recall_micro

alias of recall_micro

autosklearn.metrics.recall_samples

alias of recall_samples

autosklearn.metrics.recall_weighted

alias of recall_weighted

autosklearn.metrics.log_loss

alias of log_loss

Regression metrics

autosklearn.metrics.r2

alias of r2

autosklearn.metrics.mean_squared_error

alias of mean_squared_error

autosklearn.metrics.mean_absolute_error

alias of mean_absolute_error

autosklearn.metrics.median_absolute_error

alias of median_absolute_error

Extension Interfaces

class autosklearn.pipeline.components.base.AutoSklearnClassificationAlgorithm[source]

Provide an abstract interface for classification algorithms in auto-sklearn.

See Extending auto-sklearn for more information.
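A condensed sketch of such a subclass, loosely following the extension examples; the wrapped estimator, the properties dictionary, the hyperparameter range, and the add_classifier registration helper are illustrative and may differ between versions:

from ConfigSpace.configuration_space import ConfigurationSpace
from ConfigSpace.hyperparameters import UniformFloatHyperparameter

from autosklearn.pipeline.components.base import AutoSklearnClassificationAlgorithm
from autosklearn.pipeline.components.classification import add_classifier
from autosklearn.pipeline.constants import DENSE, PREDICTIONS, SIGNED_DATA, UNSIGNED_DATA

class LogisticRegressionComponent(AutoSklearnClassificationAlgorithm):
    # Hypothetical wrapper around scikit-learn's LogisticRegression.

    def __init__(self, C, random_state=None):
        self.C = C
        self.random_state = random_state
        self.estimator = None

    def fit(self, X, y):
        from sklearn.linear_model import LogisticRegression
        self.estimator = LogisticRegression(C=self.C, random_state=self.random_state)
        self.estimator.fit(X, y)
        return self

    def predict(self, X):
        return self.estimator.predict(X)

    def predict_proba(self, X):
        return self.estimator.predict_proba(X)

    @staticmethod
    def get_properties(dataset_properties=None):
        return {
            'shortname': 'LR',
            'name': 'Logistic Regression',
            'handles_regression': False,
            'handles_classification': True,
            'handles_multiclass': True,
            'handles_multilabel': False,
            'handles_multioutput': False,
            'is_deterministic': True,
            'input': (DENSE, SIGNED_DATA, UNSIGNED_DATA),
            'output': (PREDICTIONS,),
        }

    @staticmethod
    def get_hyperparameter_search_space(dataset_properties=None):
        cs = ConfigurationSpace()
        cs.add_hyperparameter(UniformFloatHyperparameter('C', 0.01, 100.0, log=True))
        return cs

# Register the component so it can be selected via include={'classifier': [...]}.
add_classifier(LogisticRegressionComponent)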

get_estimator()[source]

Return the underlying estimator object.

Returns
estimatorthe underlying estimator object
predict(X)[source]

The predict function calls the predict function of the underlying scikit-learn model and returns an array with the predictions.

Parameters
Xarray-like, shape = (n_samples, n_features)
Returns
array, shape = (n_samples,) or shape = (n_samples, n_labels)

Returns the predicted values

Notes

Please see the scikit-learn API documentation for further information.

predict_proba(X)[source]

Predict probabilities.

Parameters
Xarray-like, shape = (n_samples, n_features)
Returns
array, shape=(n_samples,) if n_classes == 2 else (n_samples, n_classes)
class autosklearn.pipeline.components.base.AutoSklearnRegressionAlgorithm[source]

Provide an abstract interface for regression algorithms in auto-sklearn.

Make a subclass of this and put it into the directory autosklearn/pipeline/components/regression to make it available.

get_estimator()[source]

Return the underlying estimator object.

Returns
estimatorthe underlying estimator object
predict(X)[source]

The predict function calls the predict function of the underlying scikit-learn model and returns an array with the predictions.

Parameters
Xarray-like, shape = (n_samples, n_features)
Returns
array, shape = (n_samples,) or shape = (n_samples, n_targets)

Returns the predicted values

Notes

Please see the scikit-learn API documentation for further information.

class autosklearn.pipeline.components.base.AutoSklearnPreprocessingAlgorithm[source]

Provide an abstract interface for preprocessing algorithms in auto-sklearn.

See Extending auto-sklearn for more information.

get_preprocessor()[source]

Return the underlying preprocessor object.

Returns
preprocessorthe underlying preprocessor object
transform(X)[source]

The transform function calls the transform function of the underlying scikit-learn model and returns the transformed array.

Parameters
Xarray-like, shape = (n_samples, n_features)
Returns
Xarray

Return the transformed training data

Notes

Please see the scikit-learn API documentation for further information.

Ensembles

Single objective

class autosklearn.ensembles.EnsembleSelection(task_type: int, metrics: Sequence[Scorer] | Scorer, backend: Backend, ensemble_size: int = 50, bagging: bool = False, mode: str = 'fast', random_state: int | np.random.RandomState | None = None)[source]

An ensemble of selected algorithms

Fitting an EnsembleSelection generates an ensemble from the models generated during the search process, which can then be used for prediction.

Parameters
task_type: int

An identifier indicating which task is being performed.

metrics: Sequence[Scorer] | Scorer

The metric used to evaluate the models. If multiple metrics are passed, ensemble selection only optimizes for the first.

backendBackend

Gives access to the backend of Auto-sklearn. Not used by Ensemble Selection.

bagging: bool = False

Whether to use bagging in ensemble selection

mode: str in [‘fast’, ‘slow’] = ‘fast’

Which kind of ensemble generation to use:

  • 'slow' - The original method used in Rich Caruana's ensemble selection.

  • 'fast' - A faster version of Rich Caruana's ensemble selection.

random_state: int | RandomState | None = None

The random_state used for ensemble selection.

  • None - Uses numpy’s default RandomState object

  • int - Successive calls to fit will produce the same results

  • RandomState - Truly random, each call to fit will produce different results, even with the same object.

References

Ensemble selection from libraries of models
Rich Caruana, Alexandru Niculescu-Mizil, Geoff Crew and Alex Ksikes
ICML 2004
fit(base_models_predictions: List[np.ndarray], true_targets: np.ndarray, model_identifiers: List[Tuple[int, int, float]], runs: Sequence[Run], X_data: SUPPORTED_FEAT_TYPES | None = None) EnsembleSelection[source]

Fit an ensemble given predictions of base models and targets.

Ensemble building maximizes performance (in contrast to hyperparameter optimization)!

Parameters
base_models_predictions: np.ndarray

shape = (n_base_models, n_data_points, n_targets) n_targets is the number of classes in case of classification, n_targets is 0 or 1 in case of regression

Can be a list of 2d numpy arrays as well to prevent copying all predictions into a single, large numpy array.

X_datalist-like or sparse data
true_targetsarray of shape [n_targets]
model_identifiersidentifier for each base model.

Can be used for practical text output of the ensemble.

runs: Sequence[Run]

Additional information for each run executed by SMAC that was considered by the ensemble builder.

Returns
self
get_identifiers_with_weights() List[Tuple[Tuple[int, int, float], float]][source]

Return (identifier, weight)-pairs for all models that were passed to the ensemble builder.

Parameters
modelsdict {identifiermodel object}

The identifiers are the same as the one presented to the fit() method. Models can be used for nice printing.

Returns
List[Tuple[Tuple[int, int, float], float]]
get_models_with_weights(models: Dict[Tuple[int, int, float], autosklearn.pipeline.base.BasePipeline]) List[Tuple[float, autosklearn.pipeline.base.BasePipeline]][source]

List of (weight, model) pairs for all models included in the ensemble.

Parameters
modelsdict {identifiermodel object}

The identifiers are the same as the one presented to the fit() method. Models can be used for nice printing.

Returns
List[Tuple[float, BasePipeline]]
get_selected_model_identifiers() List[Tuple[int, int, float]][source]

Return identifiers of models in the ensemble.

This includes models which have a weight of zero!

Returns
list
get_validation_performance() float[source]

Return validation performance of ensemble.

Returns
float
predict(base_models_predictions: Union[numpy.ndarray, List[numpy.ndarray]]) numpy.ndarray[source]

Create ensemble predictions from the base model predictions.

Parameters
base_models_predictionsnp.ndarray

shape = (n_base_models, n_data_points, n_targets) Same as in the fit method.

Returns
np.ndarray

Single model classes

These classes wrap a single model to provide a unified interface in Auto-sklearn.

class autosklearn.ensembles.SingleBest(task_type: int, metrics: Sequence[Scorer] | Scorer, backend: Backend, random_state: int | np.random.RandomState | None = None)[source]

Ensemble consisting of the single best model.

Parameters
task_type: int

An identifier indicating which task is being performed.

metrics: Sequence[Scorer] | Scorer

The metrics used to evaluate the models.

random_state: int | RandomState | None = None

Not used.

backendBackend

Gives access to the backend of Auto-sklearn. Not used.
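A sketch of selecting this class as the post-hoc ensemble (the import path is assumed from the class path documented above):

import autosklearn.regression
from autosklearn.ensembles import SingleBest

# Keep only the single best model instead of building a weighted ensemble.
automl = autosklearn.regression.AutoSklearnRegressor(
    time_left_for_this_task=120,
    ensemble_class=SingleBest,
)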

fit(base_models_predictions: np.ndarray | list[np.ndarray], true_targets: np.ndarray, model_identifiers: list[tuple[int, int, float]], runs: Sequence[Run], X_data: SUPPORTED_FEAT_TYPES | None = None) SingleBest[source]

Select the single best model.

Parameters
base_models_predictions: np.ndarray

shape = (n_base_models, n_data_points, n_targets) n_targets is the number of classes in case of classification, n_targets is 0 or 1 in case of regression

Can be a list of 2d numpy arrays as well to prevent copying all predictions into a single, large numpy array.

true_targetsarray of shape [n_targets]
model_identifiersidentifier for each base model.

Can be used for practical text output of the ensemble.

runs: Sequence[Run]

Additional information for each run executed by SMAC that was considered by the ensemble builder. Not used.

X_dataarray-like | sparse matrix | None = None
Returns
self
class autosklearn.ensembles.SingleModelEnsemble(task_type: int, metrics: Sequence[Scorer] | Scorer, backend: Backend, model_index: int, random_state: int | np.random.RandomState | None = None)[source]

Ensemble consisting of a single model.

This class is used by the MultiObjectiveDummyEnsemble to represent ensembles consisting of a single model, and this class should not be used on its own.

Do not use by yourself!

Parameters
task_type: int

An identifier indicating which task is being performed.

metrics: Sequence[Scorer] | Scorer

The metrics used to evaluate the models.

backendBackend

Gives access to the backend of Auto-sklearn. Not used.

model_indexint

Index of the model that constitutes the ensemble. This index will be used to select the correct predictions that will be passed during fit and predict.

random_state: int | RandomState | None = None

Not used.

fit(base_models_predictions: np.ndarray | list[np.ndarray], true_targets: np.ndarray, model_identifiers: list[tuple[int, int, float]], runs: Sequence[Run], X_data: SUPPORTED_FEAT_TYPES | None = None) SingleModelEnsemble[source]

Dummy implementation of the fit method.

The actual work of passing the model index is done in the constructor. This method only stores the identifier of the selected model and computes its validation loss.

Parameters
base_models_predictions: np.ndarray

shape = (n_base_models, n_data_points, n_targets) n_targets is the number of classes in case of classification, n_targets is 0 or 1 in case of regression

Can be a list of 2d numpy arrays as well to prevent copying all predictions into a single, large numpy array.

true_targetsarray of shape [n_targets]
model_identifiersidentifier for each base model.

Can be used for practical text output of the ensemble.

runs: Sequence[Run]

Additional information for each run executed by SMAC that was considered by the ensemble builder. Not used.

X_datalist-like | spmatrix | None = None

X data to feed to a metric if it requires it

Returns
self
class autosklearn.ensembles.SingleBestFromRunhistory(task_type: int, metrics: Sequence[Scorer] | Scorer, backend: Backend, run_history: RunHistory, seed: int, random_state: int | np.random.RandomState | None = None)[source]

In the case of a crash, this class searches for the best individual model.

Such model is returned as an ensemble of a single object, to comply with the expected interface of an AbstractEnsemble.

Do not use by yourself!

get_identifiers_from_run_history() list[tuple[int, int, float]][source]

Parses the run history to identify the best performing model.

Populates the identifiers attribute, which is used by the backend to access the actual model.

Multi-objective

class autosklearn.ensembles.MultiObjectiveDummyEnsemble(task_type: int, metrics: Sequence[Scorer] | Scorer, backend: Backend, random_state: int | np.random.RandomState | None = None)[source]

A dummy implementation of a multi-objective ensemble.

Builds one single-model ensemble for each model on the Pareto front.

Parameters
task_type: int

An identifier indicating which task is being performed.

metrics: Sequence[Scorer] | Scorer

The metrics used to evaluate the models.

backendBackend

Gives access to the backend of Auto-sklearn. Not used.

random_state: int | RandomState | None = None

Not used.

fit(base_models_predictions: list[np.ndarray], true_targets: np.ndarray, model_identifiers: list[tuple[int, int, float]], runs: Sequence[Run], X_data: SUPPORTED_FEAT_TYPES | None = None) MultiObjectiveDummyEnsemble[source]

Select dummy ensembles given predictions of base models and targets.

Parameters
base_models_predictions: np.ndarray

shape = (n_base_models, n_data_points, n_targets) n_targets is the number of classes in case of classification, n_targets is 0 or 1 in case of regression

Can be a list of 2d numpy arrays as well to prevent copying all predictions into a single, large numpy array.

true_targetsarray of shape [n_targets]
model_identifiersidentifier for each base model.

Can be used for practical text output of the ensemble.

runs: Sequence[Run]

Additional information for each run executed by SMAC that was considered by the ensemble builder. Not used.

X_datalist-like | sparse matrix | None = None

X data to give to the metric if required

Returns
self
get_identifiers_with_weights() list[tuple[tuple[int, int, float], float]][source]

Return (identifier, weight)-pairs for all models that were passed to the ensemble builder, based on the ensemble that is best for the 1st metric.

Parameters
modelsdict {identifiermodel object}

The identifiers are the same as the one presented to the fit() method. Models can be used for nice printing.

Returns
list[tuple[tuple[int, int, float], float]]
get_models_with_weights(models: dict[tuple[int, int, float], BasePipeline]) list[tuple[float, BasePipeline]][source]

Return a list of (weight, model) pairs for the ensemble that is best for the 1st metric.

Parameters
modelsdict {identifiermodel object}

The identifiers are the same as the one presented to the fit() method. Models can be used for nice printing.

Returns
list[tuple[float, BasePipeline]]
get_selected_model_identifiers() list[tuple[int, int, float]][source]

Return identifiers of models in the ensemble that is best for the 1st metric.

This includes models which have a weight of zero!

Returns
list
get_validation_performance() float[source]

Validation performance of the ensemble that is best for the 1st metric.

Returns
float
property pareto_set: Sequence[autosklearn.ensembles.abstract_ensemble.AbstractEnsemble]

Get a sequence of ensembles that are on the Pareto front.

Returns
Sequence[AbstractEnsemble]
Raises
SklearnNotFittedError

If fit has not been called and the Pareto set does not exist yet.

predict(base_models_predictions: np.ndarray | list[np.ndarray]) np.ndarray[source]

Predict using the ensemble which is best for the 1st metric.

Parameters
base_models_predictionsnp.ndarray

shape = (n_base_models, n_data_points, n_targets) Same as in the fit method.

Returns
np.ndarray