APIs

Main modules

Classification

class autosklearn.classification.AutoSklearnClassifier(time_left_for_this_task=3600, per_run_time_limit=None, initial_configurations_via_metalearning=25, ensemble_size: int = 50, ensemble_nbest=50, max_models_on_disc=50, seed=1, memory_limit=3072, include=None, exclude=None, resampling_strategy='holdout', resampling_strategy_arguments=None, tmp_folder=None, delete_tmp_folder_after_terminate=True, n_jobs: Optional[int] = None, dask_client: Optional[distributed.client.Client] = None, disable_evaluator_output=False, get_smac_object_callback=None, smac_scenario_args=None, logging_config=None, metadata_directory=None, metric=None, scoring_functions: Optional[List[autosklearn.metrics.Scorer]] = None, load_models: bool = True, get_trials_callback=None)[source]

This class implements the classification task.
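For orientation, a minimal usage sketch; the dataset, split, and time budgets below are illustrative choices, not defaults:

    import sklearn.datasets
    import sklearn.metrics
    import sklearn.model_selection

    import autosklearn.classification

    # Illustrative dataset and split (any tabular classification data works).
    X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
        X, y, random_state=1
    )

    automl = autosklearn.classification.AutoSklearnClassifier(
        time_left_for_this_task=120,  # total search budget in seconds
        per_run_time_limit=30,        # budget for each candidate model
    )
    automl.fit(X_train, y_train)
    y_hat = automl.predict(X_test)
    print("Accuracy:", sklearn.metrics.accuracy_score(y_test, y_hat))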

Parameters
time_left_for_this_task : int, optional (default=3600)

Time limit in seconds for the search of appropriate models. By increasing this value, auto-sklearn has a higher chance of finding better models.

per_run_time_limit : int, optional (default=1/10 of time_left_for_this_task)

Time limit for a single call to the machine learning model. Model fitting will be terminated if the machine learning algorithm runs over the time limit. Set this value high enough so that typical machine learning algorithms can be fit on the training data.

initial_configurations_via_metalearning : int, optional (default=25)

Initialize the hyperparameter optimization algorithm with this many configurations which worked well on previously seen datasets. Set to 0 if the hyperparameter optimization algorithm should start from scratch.

ensemble_size : int, optional (default=50)

Number of models added to the ensemble built by Ensemble selection from libraries of models. Models are drawn with replacement.

ensemble_nbest : int, optional (default=50)

Only consider the ensemble_nbest models when building an ensemble.

max_models_on_disc : int, optional (default=50)

Defines the maximum number of models that are kept on disk; models beyond this number are permanently deleted. As a consequence, this value also caps how many models can be used in an ensemble. Must be an integer greater than or equal to 1. If set to None, all models are kept on disk.

seed : int, optional (default=1)

Used to seed SMAC. Will determine the output file names.

memory_limit : int, optional (default=3072)

Memory limit in MB for the machine learning algorithm. auto-sklearn will stop fitting the machine learning algorithm if it tries to allocate more than memory_limit MB. If None is provided, no memory limit is set. In case of multi-processing, memory_limit will be per job. This memory limit also applies to the ensemble creation process.

include : dict, optional (default=None)

If None, all possible algorithms are used. Otherwise, specifies the set of algorithms to use for each pipeline component. include and exclude are incompatible if used together on the same component.

exclude : dict, optional (default=None)

If None, all possible algorithms are used. Otherwise, specifies the set of algorithms not to use for each pipeline component. include and exclude are incompatible if used together on the same component.

resampling_strategy : string or object, optional (default=‘holdout’)

How to handle overfitting; may require ‘resampling_strategy_arguments’. A configuration sketch follows the argument list below.

  • ‘holdout’: 67:33 (train:test) split

  • ‘holdout-iterative-fit’: 67:33 (train:test) split, calls iterative fit where possible

  • ‘cv’: cross-validation, requires ‘folds’

  • ‘cv-iterative-fit’: cross-validation, calls iterative fit where possible

  • ‘partial-cv’: cross-validation with intensification, requires ‘folds’

  • BaseCrossValidator object: any BaseCrossValidator class found in the scikit-learn model_selection module

  • _RepeatedSplits object: any _RepeatedSplits class found in the scikit-learn model_selection module

  • BaseShuffleSplit object: any BaseShuffleSplit class found in the scikit-learn model_selection module

resampling_strategy_arguments : dict, optional if ‘holdout’ (train_size default=0.67)

Additional arguments for resampling_strategy:

  • train_size should be between 0.0 and 1.0 and represents the proportion of the dataset to include in the train split.

  • shuffle determines whether the data is shuffled prior to splitting it into train and validation.

Available arguments:

  • ‘holdout’: {‘train_size’: float}

  • ‘holdout-iterative-fit’: {‘train_size’: float}

  • ‘cv’: {‘folds’: int}

  • ‘cv-iterative-fit’: {‘folds’: int}

  • ‘partial-cv’: {‘folds’: int, ‘shuffle’: bool}

  • BaseCrossValidator or _RepeatedSplits or BaseShuffleSplit object: all arguments required by the chosen class, as specified in the scikit-learn documentation. If arguments are not provided, scikit-learn defaults are used. If no defaults are available, an exception is raised. Refer to the ‘n_splits’ argument as ‘folds’.
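A sketch of configuring the resampling strategy; the 5-fold setting is an illustrative assumption:

    import autosklearn.classification
    from sklearn.model_selection import StratifiedKFold

    # Built-in strategy with its arguments: 5-fold cross-validation.
    automl = autosklearn.classification.AutoSklearnClassifier(
        time_left_for_this_task=120,
        resampling_strategy='cv',
        resampling_strategy_arguments={'folds': 5},
    )

    # Alternatively, pass a scikit-learn splitter object directly.
    splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
    automl = autosklearn.classification.AutoSklearnClassifier(
        time_left_for_this_task=120,
        resampling_strategy=splitter,
    )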

tmp_folder : string, optional (default=None)

Folder to store configuration output and log files. If None, /tmp/autosklearn_tmp_$pid_$random_number is used automatically.

delete_tmp_folder_after_terminate : bool, optional (default=True)

Remove tmp_folder when finished. If tmp_folder is None, the temporary directory will always be deleted.

n_jobs : int, optional, experimental

The number of jobs to run in parallel for fit(). -1 means using all processors. By default, Auto-sklearn uses a single core for fitting the machine learning model and a single core for fitting an ensemble. Ensemble building is not affected by n_jobs but can be controlled by the number of models in the ensemble. In contrast to most scikit-learn models, n_jobs given in the constructor is not applied to the predict() method. If dask_client is None, a new dask client is created.

dask_client : dask.distributed.Client, optional

User-created dask client; can be used to start a dask cluster and then attach auto-sklearn to it.

disable_evaluator_output : bool or list, optional (default=False)

If True, disable model and prediction output. Cannot be used together with ensemble building. predict() cannot be used when setting this to True. Can also be used as a list to pass more fine-grained information on what to save. Allowed elements in the list are:

  • 'y_optimization' : do not save the predictions for the optimization/validation set, which would later on be used to build an ensemble.

  • 'model' : do not save any model files

smac_scenario_args : dict, optional (default=None)

Additional arguments inserted into the scenario of SMAC. See the SMAC documentation for a list of available arguments.

get_smac_object_callback : callable

Callback function to create an object of class smac.optimizer.smbo.SMBO. The function must accept the arguments scenario_dict, instances, num_params, runhistory, seed and ta. This is an advanced feature. Use only if you are familiar with SMAC.

logging_config : dict, optional (default=None)

Dictionary object specifying the logger configuration. If None, the default logging.yaml file is used, which can be found in the directory util/logging.yaml relative to the installation.

metadata_directory : str, optional (default=None)

Path to the metadata directory. If None, the default directory (autosklearn.metalearning.files) is used.

metric : Scorer, optional (default=None)

An instance of autosklearn.metrics.Scorer as created by autosklearn.metrics.make_scorer(). These are the Built-in Metrics. If None is provided, a default metric is selected depending on the task.

scoring_functions : List[Scorer], optional (default=None)

List of scorers which will be calculated for each pipeline; results will be available via cv_results_.

load_models : bool, optional (default=True)

Whether to load the models after fitting Auto-sklearn.

get_trials_callback : callable

Callback function to create an object of a subclass defined in the module smac.callbacks. This is an advanced feature. Use only if you are familiar with SMAC.

Attributes
cv_results_ : dict of numpy (masked) ndarrays

A dict with keys as column headers and values as columns, that can be imported into a pandas DataFrame.

Not all keys returned by scikit-learn are supported yet.

performance_over_time_ : pandas.core.frame.DataFrame

A DataFrame containing the models' performance-over-time data. Can be used for plotting directly. Please refer to the example Train and Test Inputs.

fit(X, y, X_test=None, y_test=None, feat_type=None, dataset_name=None)[source]

Fit auto-sklearn to given training set (X, y).

Fit both optimizes the machine learning models and builds an ensemble out of them. To disable ensembling, set ensemble_size==0.

Parameters
X : array-like or sparse matrix of shape = [n_samples, n_features]

The training input samples.

y : array-like, shape = [n_samples] or [n_samples, n_outputs]

The target classes.

X_test : array-like or sparse matrix of shape = [n_samples, n_features]

Test data input samples. Will be used to save test predictions for all models. This allows evaluating the performance of Auto-sklearn over time.

y_test : array-like, shape = [n_samples] or [n_samples, n_outputs]

Test data target classes. Will be used to calculate the test error of all models. This allows evaluating the performance of Auto-sklearn over time.

feat_type : list, optional (default=None)

List of str of len(X.shape[1]) describing the attribute type. Possible types are Categorical and Numerical. Categorical attributes will be automatically one-hot encoded. The values used for a categorical attribute must be integers, obtained for example by sklearn.preprocessing.LabelEncoder.

dataset_name : str, optional (default=None)

Create nicer output. If None, a string will be determined by the md5 hash of the dataset.

Returns
self
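A sketch of passing test data so that test performance is tracked; X_train, y_train, X_test and y_test are assumed to exist as in the earlier usage sketch:

    automl = autosklearn.classification.AutoSklearnClassifier(time_left_for_this_task=120)
    automl.fit(X_train, y_train, X_test=X_test, y_test=y_test,
               dataset_name='my_dataset')
    # Test scores recorded during the search; see performance_over_time_.
    print(automl.performance_over_time_.tail())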
fit_ensemble(y, task=None, precision=32, dataset_name=None, ensemble_nbest=None, ensemble_size=None)

Fit an ensemble to models trained during an optimization process.

All parameters are None by default. If no other value is given, the default values which were set in a call to fit() are used.

Calling this function is only necessary if ensemble_size==0, for example when executing auto-sklearn in parallel.

Parameters
y : array-like

Target values.

task : int

A constant from the module autosklearn.constants. Determines the task type (binary classification, multiclass classification, multilabel classification or regression).

precision : str

Numeric precision used when loading ensemble data. Can be either '16', '32' or '64'.

dataset_name : str

Name of the current data set.

ensemble_nbest : int

Determines how many models should be considered for ensemble building. This is inspired by a concept called library pruning introduced in Getting Most out of Ensemble Selection.

ensemble_size : int

Size of the ensemble built by Ensemble Selection.

Returns
self
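A sketch of building the ensemble after the search, with ensembling disabled during fit(); the sizes are illustrative:

    automl = autosklearn.classification.AutoSklearnClassifier(
        time_left_for_this_task=120,
        ensemble_size=0,  # no ensemble is built during the search
    )
    automl.fit(X_train, y_train)
    automl.fit_ensemble(y_train, ensemble_size=20)  # build it post hoc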
fit_pipeline(X: Union[List, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse.base.spmatrix], y: Union[List, pandas.core.series.Series, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse.base.spmatrix], config: Union[ConfigSpace.configuration_space.Configuration, Dict[str, Union[str, float, int]]], dataset_name: Optional[str] = None, X_test: Optional[Union[List, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse.base.spmatrix]] = None, y_test: Optional[Union[List, pandas.core.series.Series, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse.base.spmatrix]] = None, feat_type: Optional[List[str]] = None, *args, **kwargs: Dict) Tuple[Optional[autosklearn.pipeline.base.BasePipeline], smac.runhistory.runhistory.RunInfo, smac.runhistory.runhistory.RunValue]

Fits an individual pipeline configuration and returns the result to the user.

The Estimator constraints are honored, for example the resampling strategy, or memory constraints, unless directly provided to the method. By default, this method supports the same signature as fit(), and any extra arguments are redirected to the TAE evaluation function, which allows for further customization while building a pipeline.

Any additional argument provided is directly passed to the worker exercising the run.

Parameters
X: array-like, shape = (n_samples, n_features)

The features used for training

y: array-like

The labels used for training

X_test : Optional array-like, shape = (n_samples, n_features)

If provided, the testing performance will be tracked on these features.

y_test : array-like

If provided, the testing performance will be tracked on these labels.

config: Union[Configuration, Dict[str, Union[str, float, int]]]

A configuration object used to define the pipeline steps. If a dictionary is passed, a configuration is created based on this dictionary.

dataset_name: Optional[str]

Name that will be used to tag and identify the Auto-Sklearn run

feat_type : list, optional (default=None)

List of str of len(X.shape[1]) describing the attribute type. Possible types are Categorical and Numerical. Categorical attributes will be automatically One-Hot encoded. The values used for a categorical attribute must be integers, obtained for example by sklearn.preprocessing.LabelEncoder.

Returns
pipeline: Optional[BasePipeline]

The fitted pipeline. In case of failure while fitting the pipeline, a None is returned.

run_info : RunInfo

A named tuple that contains the configuration launched

run_value: RunValue

A named tuple that contains the result of the run
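A sketch of evaluating one sampled configuration; it assumes an estimator instance and training data as in the earlier sketches:

    cs = automl.get_configuration_space(X_train, y_train)
    config = cs.sample_configuration()  # a Configuration (a plain dict also works)
    pipeline, run_info, run_value = automl.fit_pipeline(
        X=X_train, y=y_train, config=config,
        X_test=X_test, y_test=y_test,
    )
    print(run_value.status, run_value.cost)  # fields of SMAC's RunValue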

get_configuration_space(X: Union[List, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse.base.spmatrix], y: Union[List, pandas.core.series.Series, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse.base.spmatrix], X_test: Optional[Union[List, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse.base.spmatrix]] = None, y_test: Optional[Union[List, pandas.core.series.Series, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse.base.spmatrix]] = None, dataset_name: Optional[str] = None, feat_type: Optional[List[str]] = None) ConfigSpace.configuration_space.ConfigurationSpace

Returns the Configuration Space object, from which Auto-Sklearn will sample configurations and build pipelines.

Parameters
X : array-like or sparse matrix of shape = [n_samples, n_features]

Array with the training features, used to get characteristics like data sparsity

y : array-like, shape = [n_samples] or [n_samples, n_outputs]

Array with the problem labels

X_test : array-like or sparse matrix of shape = [n_samples, n_features]

Array with features used for performance estimation

y_test : array-like, shape = [n_samples] or [n_samples, n_outputs]

Array with the problem labels for the testing split

dataset_name: Optional[str]

A string to tag the Auto-Sklearn run

get_models_with_weights()

Return a list of the final ensemble found by auto-sklearn.

Returns
[(weight_1, model_1), …, (weight_n, model_n)]
get_params(deep=True)

Get parameters for this estimator.

Parameters
deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns
params : dict

Parameter names mapped to their values.

leaderboard(detailed: bool = False, ensemble_only: bool = True, top_k: Union[int, Literal['all']] = 'all', sort_by: str = 'cost', sort_order: Literal['auto', 'ascending', 'descending'] = 'auto', include: Optional[Union[str, Iterable[str]]] = None) pandas.core.frame.DataFrame

Returns a pandas table of results for all evaluated models.

Gives an overview of all models trained during the search process along with various statistics about their training.

The available statistics are:

Simple:

  • "model_id" - The id given to a model by autosklearn.

  • "rank" - The rank of the model based on it’s "cost".

  • "ensemble_weight" - The weight given to the model in the ensemble.

  • "type" - The type of classifier/regressor used.

  • "cost" - The loss of the model on the validation set.

  • "duration" - Length of time the model was optimized for.

Detailed: The detailed view includes all of the simple statistics along with the following.

  • "config_id" - The id used by SMAC for optimization.

  • "budget" - How much budget was allocated to this model.

  • "status" - The return status of training the model with SMAC.

  • "train_loss" - The loss of the model on the training set.

  • "balancing_strategy" - The balancing strategy used for data preprocessing.

  • "start_time" - Time the model began being optimized

  • "end_time" - Time the model ended being optimized

  • "data_preprocessors" - The preprocessors used on the data

  • "feature_preprocessors" - The preprocessors for features types

Parameters
detailed: bool = False

Whether to give detailed information or just a simple overview.

ensemble_only: bool = True

Whether to view only models included in the ensemble or all models trained.

top_k: int or “all” = “all”

How many models to display.

sort_by: str = ‘cost’

What column to sort by. If that column is not present, the sorting defaults to the "model_id" index column.

sort_order: “auto” or “ascending” or “descending” = “auto”

Which sort order to apply to the sort_by column. If left as "auto", it will sort by a sensible default where “better” is on top, otherwise defaulting to the pandas default for DataFrame.sort_values if there is no obvious “better”.

include: Optional[str or Iterable[str]]

Items to include, other items not specified will be excluded. The exception is the "model_id" index column which is always included.

If left as None, it will fall back to using the detailed param to decide the columns to include.

Returns
pd.DataFrame

A dataframe of statistics for the models, ordered by sort_by.
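A sketch of inspecting results after fit():

    print(automl.leaderboard())  # simple view, ensemble members only
    print(automl.leaderboard(
        detailed=True,        # add config_id, budget, train_loss, ...
        ensemble_only=False,  # include models outside the ensemble
        top_k=10,
    ))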

predict(X, batch_size=None, n_jobs=1)[source]

Predict classes for X.

Parameters
X : array-like or sparse matrix of shape = [n_samples, n_features]
Returns
y : array of shape = [n_samples] or [n_samples, n_labels]

The predicted classes.

predict_proba(X, batch_size=None, n_jobs=1)[source]

Predict probabilities of classes for all samples X.

Parameters
X : array-like or sparse matrix of shape = [n_samples, n_features]
batch_size : int (optional)

Number of data points to predict for (predicts all points at once if None).

n_jobs : int
Returns
y : array of shape = [n_samples, n_classes] or [n_samples, n_labels]

The predicted class probabilities.

refit(X, y)

Refit all models found with fit to new data.

Necessary when using cross-validation: during training, auto-sklearn fits each model k times on the dataset but does not keep any trained model, so the models cannot be used to predict for new data points. This method fits all models found during a call to fit on the given data. It may also be used together with holdout to avoid fitting the final model on only 66% of the training data.

Parameters
X : array-like or sparse matrix of shape = [n_samples, n_features]

The training input samples.

y : array-like, shape = [n_samples] or [n_samples, n_outputs]

The targets.

Returns
self
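A sketch of the cross-validation workflow in which refit() is required before predicting:

    automl = autosklearn.classification.AutoSklearnClassifier(
        time_left_for_this_task=120,
        resampling_strategy='cv',
        resampling_strategy_arguments={'folds': 5},
    )
    automl.fit(X_train, y_train)
    automl.refit(X_train, y_train)  # fit each model once on the full training data
    y_hat = automl.predict(X_test)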
score(X, y)

Return the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.

Parameters
X : array-like of shape (n_samples, n_features)

Test samples.

y : array-like of shape (n_samples,) or (n_samples, n_outputs)

True labels for X.

sample_weight : array-like of shape (n_samples,), default=None

Sample weights.

Returns
score : float

Mean accuracy of self.predict(X) wrt. y.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters
**params : dict

Estimator parameters.

Returns
self : estimator instance

Estimator instance.

show_models()

Return a representation of the final ensemble found by auto-sklearn.

Returns
str
sprint_statistics()

Return the following statistics of the training result:

  • dataset name

  • metric used

  • best validation score

  • number of target algorithm runs

  • number of successful target algorithm runs

  • number of crashed target algorithm runs

  • number of target algorithm runs that exceeded the memory limit

  • number of target algorithm runs that exceeded the time limit

Returns
str
class autosklearn.experimental.askl2.AutoSklearn2Classifier(time_left_for_this_task: int = 3600, per_run_time_limit=None, ensemble_size: int = 50, ensemble_nbest: Union[float, int] = 50, max_models_on_disc: int = 50, seed: int = 1, memory_limit: int = 3072, tmp_folder: Optional[str] = None, delete_tmp_folder_after_terminate: bool = True, n_jobs: Optional[int] = None, dask_client: Optional[distributed.client.Client] = None, disable_evaluator_output: bool = False, smac_scenario_args: Optional[Dict[str, Any]] = None, logging_config: Optional[Dict[str, Any]] = None, metric: Optional[autosklearn.metrics.Scorer] = None, scoring_functions: Optional[List[autosklearn.metrics.Scorer]] = None, load_models: bool = True)[source]
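An experimental variant of AutoSklearnClassifier following the Auto-sklearn 2.0 approach; as the signature shows, it omits, among others, the include, exclude and resampling_strategy arguments. A minimal usage sketch (data and budget are illustrative, with X_train and y_train as in the earlier sketches):

    from autosklearn.experimental.askl2 import AutoSklearn2Classifier

    automl = AutoSklearn2Classifier(time_left_for_this_task=120)
    automl.fit(X_train, y_train)
    y_hat = automl.predict(X_test)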
Parameters
time_left_for_this_task : int, optional (default=3600)

Time limit in seconds for the search of appropriate models. By increasing this value, auto-sklearn has a higher chance of finding better models.

per_run_time_limit : int, optional (default=1/10 of time_left_for_this_task)

Time limit for a single call to the machine learning model. Model fitting will be terminated if the machine learning algorithm runs over the time limit. Set this value high enough so that typical machine learning algorithms can be fit on the training data.

ensemble_size : int, optional (default=50)

Number of models added to the ensemble built by Ensemble selection from libraries of models. Models are drawn with replacement.

ensemble_nbest : int, optional (default=50)

Only consider the ensemble_nbest models when building an ensemble.

max_models_on_disc : int, optional (default=50)

Defines the maximum number of models that are kept on disk; models beyond this number are permanently deleted. As a consequence, this value also caps how many models can be used in an ensemble. Must be an integer greater than or equal to 1. If set to None, all models are kept on disk.

seed : int, optional (default=1)

Used to seed SMAC. Will determine the output file names.

memory_limit : int, optional (default=3072)

Memory limit in MB for the machine learning algorithm. auto-sklearn will stop fitting the machine learning algorithm if it tries to allocate more than memory_limit MB. If None is provided, no memory limit is set. In case of multi-processing, memory_limit will be per job. This memory limit also applies to the ensemble creation process.

tmp_folder : string, optional (default=None)

Folder to store configuration output and log files. If None, /tmp/autosklearn_tmp_$pid_$random_number is used automatically.

delete_tmp_folder_after_terminate : bool, optional (default=True)

Remove tmp_folder when finished. If tmp_folder is None, the temporary directory will always be deleted.

n_jobs : int, optional, experimental

The number of jobs to run in parallel for fit(). -1 means using all processors. By default, Auto-sklearn uses a single core for fitting the machine learning model and a single core for fitting an ensemble. Ensemble building is not affected by n_jobs but can be controlled by the number of models in the ensemble. In contrast to most scikit-learn models, n_jobs given in the constructor is not applied to the predict() method. If dask_client is None, a new dask client is created.

dask_client : dask.distributed.Client, optional

User-created dask client; can be used to start a dask cluster and then attach auto-sklearn to it.

disable_evaluator_output : bool or list, optional (default=False)

If True, disable model and prediction output. Cannot be used together with ensemble building. predict() cannot be used when setting this to True. Can also be used as a list to pass more fine-grained information on what to save. Allowed elements in the list are:

  • 'y_optimization' : do not save the predictions for the optimization/validation set, which would later on be used to build an ensemble.

  • 'model' : do not save any model files

smac_scenario_args : dict, optional (default=None)

Additional arguments inserted into the scenario of SMAC. See the SMAC documentation for a list of available arguments.

logging_config : dict, optional (default=None)

Dictionary object specifying the logger configuration. If None, the default logging.yaml file is used, which can be found in the directory util/logging.yaml relative to the installation.

metric : Scorer, optional (default=None)

An instance of autosklearn.metrics.Scorer as created by autosklearn.metrics.make_scorer(). These are the Built-in Metrics. If None is provided, a default metric is selected depending on the task.

scoring_functions : List[Scorer], optional (default=None)

List of scorers which will be calculated for each pipeline; results will be available via cv_results_.

load_models : bool, optional (default=True)

Whether to load the models after fitting Auto-sklearn.

Attributes
cv_results_ : dict of numpy (masked) ndarrays

A dict with keys as column headers and values as columns, that can be imported into a pandas DataFrame.

Not all keys returned by scikit-learn are supported yet.

fit(X, y, X_test=None, y_test=None, metric=None, feat_type=None, dataset_name=None)[source]

Fit auto-sklearn to given training set (X, y).

Fit both optimizes the machine learning models and builds an ensemble out of them. To disable ensembling, set ensemble_size==0.

Parameters
X : array-like or sparse matrix of shape = [n_samples, n_features]

The training input samples.

y : array-like, shape = [n_samples] or [n_samples, n_outputs]

The target classes.

X_test : array-like or sparse matrix of shape = [n_samples, n_features]

Test data input samples. Will be used to save test predictions for all models. This allows evaluating the performance of Auto-sklearn over time.

y_test : array-like, shape = [n_samples] or [n_samples, n_outputs]

Test data target classes. Will be used to calculate the test error of all models. This allows evaluating the performance of Auto-sklearn over time.

feat_type : list, optional (default=None)

List of str of len(X.shape[1]) describing the attribute type. Possible types are Categorical and Numerical. Categorical attributes will be automatically one-hot encoded. The values used for a categorical attribute must be integers, obtained for example by sklearn.preprocessing.LabelEncoder.

dataset_name : str, optional (default=None)

Create nicer output. If None, a string will be determined by the md5 hash of the dataset.

Returns
self
fit_ensemble(y, task=None, precision=32, dataset_name=None, ensemble_nbest=None, ensemble_size=None)

Fit an ensemble to models trained during an optimization process.

All parameters are None by default. If no other value is given, the default values which were set in a call to fit() are used.

Calling this function is only necessary if ensemble_size==0, for example when executing auto-sklearn in parallel.

Parameters
y : array-like

Target values.

task : int

A constant from the module autosklearn.constants. Determines the task type (binary classification, multiclass classification, multilabel classification or regression).

precision : str

Numeric precision used when loading ensemble data. Can be either '16', '32' or '64'.

dataset_name : str

Name of the current data set.

ensemble_nbest : int

Determines how many models should be considered for ensemble building. This is inspired by a concept called library pruning introduced in Getting Most out of Ensemble Selection.

ensemble_size : int

Size of the ensemble built by Ensemble Selection.

Returns
self
fit_pipeline(X: Union[List, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse.base.spmatrix], y: Union[List, pandas.core.series.Series, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse.base.spmatrix], config: Union[ConfigSpace.configuration_space.Configuration, Dict[str, Union[str, float, int]]], dataset_name: Optional[str] = None, X_test: Optional[Union[List, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse.base.spmatrix]] = None, y_test: Optional[Union[List, pandas.core.series.Series, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse.base.spmatrix]] = None, feat_type: Optional[List[str]] = None, *args, **kwargs: Dict) Tuple[Optional[autosklearn.pipeline.base.BasePipeline], smac.runhistory.runhistory.RunInfo, smac.runhistory.runhistory.RunValue]

Fits an individual pipeline configuration and returns the result to the user.

The Estimator constraints are honored, for example the resampling strategy, or memory constraints, unless directly provided to the method. By default, this method supports the same signature as fit(), and any extra arguments are redirected to the TAE evaluation function, which allows for further customization while building a pipeline.

Any additional argument provided is directly passed to the worker exercising the run.

Parameters
X: array-like, shape = (n_samples, n_features)

The features used for training

y: array-like

The labels used for training

X_test : Optional array-like, shape = (n_samples, n_features)

If provided, the testing performance will be tracked on these features.

y_test : array-like

If provided, the testing performance will be tracked on these labels.

config: Union[Configuration, Dict[str, Union[str, float, int]]]

A configuration object used to define the pipeline steps. If a dictionary is passed, a configuration is created based on this dictionary.

dataset_name: Optional[str]

Name that will be used to tag and identify the Auto-Sklearn run

feat_type : list, optional (default=None)

List of str of len(X.shape[1]) describing the attribute type. Possible types are Categorical and Numerical. Categorical attributes will be automatically One-Hot encoded. The values used for a categorical attribute must be integers, obtained for example by sklearn.preprocessing.LabelEncoder.

Returns
pipeline: Optional[BasePipeline]

The fitted pipeline. In case of failure while fitting the pipeline, a None is returned.

run_info : RunInfo

A named tuple that contains the configuration launched

run_value: RunValue

A named tuple that contains the result of the run

get_configuration_space(X: Union[List, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse.base.spmatrix], y: Union[List, pandas.core.series.Series, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse.base.spmatrix], X_test: Optional[Union[List, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse.base.spmatrix]] = None, y_test: Optional[Union[List, pandas.core.series.Series, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse.base.spmatrix]] = None, dataset_name: Optional[str] = None, feat_type: Optional[List[str]] = None) ConfigSpace.configuration_space.ConfigurationSpace

Returns the Configuration Space object, from which Auto-Sklearn will sample configurations and build pipelines.

Parameters
X : array-like or sparse matrix of shape = [n_samples, n_features]

Array with the training features, used to get characteristics like data sparsity

y : array-like, shape = [n_samples] or [n_samples, n_outputs]

Array with the problem labels

X_test : array-like or sparse matrix of shape = [n_samples, n_features]

Array with features used for performance estimation

y_test : array-like, shape = [n_samples] or [n_samples, n_outputs]

Array with the problem labels for the testing split

dataset_name: Optional[str]

A string to tag the Auto-Sklearn run

get_models_with_weights()

Return a list of the final ensemble found by auto-sklearn.

Returns
[(weight_1, model_1), …, (weight_n, model_n)]
get_params(deep=True)

Get parameters for this estimator.

Parameters
deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns
params : dict

Parameter names mapped to their values.

leaderboard(detailed: bool = False, ensemble_only: bool = True, top_k: Union[int, Literal['all']] = 'all', sort_by: str = 'cost', sort_order: Literal['auto', 'ascending', 'descending'] = 'auto', include: Optional[Union[str, Iterable[str]]] = None) pandas.core.frame.DataFrame

Returns a pandas table of results for all evaluated models.

Gives an overview of all models trained during the search process along with various statistics about their training.

The available statistics are:

Simple:

  • "model_id" - The id given to a model by autosklearn.

  • "rank" - The rank of the model based on it’s "cost".

  • "ensemble_weight" - The weight given to the model in the ensemble.

  • "type" - The type of classifier/regressor used.

  • "cost" - The loss of the model on the validation set.

  • "duration" - Length of time the model was optimized for.

Detailed: The detailed view includes all of the simple statistics along with the following.

  • "config_id" - The id used by SMAC for optimization.

  • "budget" - How much budget was allocated to this model.

  • "status" - The return status of training the model with SMAC.

  • "train_loss" - The loss of the model on the training set.

  • "balancing_strategy" - The balancing strategy used for data preprocessing.

  • "start_time" - Time the model began being optimized

  • "end_time" - Time the model ended being optimized

  • "data_preprocessors" - The preprocessors used on the data

  • "feature_preprocessors" - The preprocessors for features types

Parameters
detailed: bool = False

Whether to give detailed information or just a simple overview.

ensemble_only: bool = True

Whether to view only models included in the ensemble or all models trained.

top_k: int or “all” = “all”

How many models to display.

sort_by: str = ‘cost’

What column to sort by. If that column is not present, the sorting defaults to the "model_id" index column.

sort_order: “auto” or “ascending” or “descending” = “auto”

Which sort order to apply to the sort_by column. If left as "auto", it will sort by a sensible default where “better” is on top, otherwise defaulting to the pandas default for DataFrame.sort_values if there is no obvious “better”.

include: Optional[str or Iterable[str]]

Items to include, other items not specified will be excluded. The exception is the "model_id" index column which is always included.

If left as None, it will fall back to using the detailed param to decide the columns to include.

Returns
pd.DataFrame

A dataframe of statistics for the models, ordered by sort_by.

predict(X, batch_size=None, n_jobs=1)

Predict classes for X.

Parameters
X : array-like or sparse matrix of shape = [n_samples, n_features]
Returns
y : array of shape = [n_samples] or [n_samples, n_labels]

The predicted classes.

predict_proba(X, batch_size=None, n_jobs=1)

Predict probabilities of classes for all samples X.

Parameters
X : array-like or sparse matrix of shape = [n_samples, n_features]
batch_size : int (optional)

Number of data points to predict for (predicts all points at once if None).

n_jobs : int
Returns
y : array of shape = [n_samples, n_classes] or [n_samples, n_labels]

The predicted class probabilities.

refit(X, y)

Refit all models found with fit to new data.

Necessary when using cross-validation: during training, auto-sklearn fits each model k times on the dataset but does not keep any trained model, so the models cannot be used to predict for new data points. This method fits all models found during a call to fit on the given data. It may also be used together with holdout to avoid fitting the final model on only 66% of the training data.

Parameters
X : array-like or sparse matrix of shape = [n_samples, n_features]

The training input samples.

y : array-like, shape = [n_samples] or [n_samples, n_outputs]

The targets.

Returns
self
score(X, y)

Return the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.

Parameters
X : array-like of shape (n_samples, n_features)

Test samples.

y : array-like of shape (n_samples,) or (n_samples, n_outputs)

True labels for X.

sample_weight : array-like of shape (n_samples,), default=None

Sample weights.

Returns
score : float

Mean accuracy of self.predict(X) wrt. y.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters
**params : dict

Estimator parameters.

Returns
self : estimator instance

Estimator instance.

show_models()

Return a representation of the final ensemble found by auto-sklearn.

Returns
str
sprint_statistics()

Return the following statistics of the training result:

  • dataset name

  • metric used

  • best validation score

  • number of target algorithm runs

  • number of successful target algorithm runs

  • number of crashed target algorithm runs

  • number of target algorithm runs that exceeded the memory limit

  • number of target algorithm runs that exceeded the time limit

Returns
str

Regression

class autosklearn.regression.AutoSklearnRegressor(time_left_for_this_task=3600, per_run_time_limit=None, initial_configurations_via_metalearning=25, ensemble_size: int = 50, ensemble_nbest=50, max_models_on_disc=50, seed=1, memory_limit=3072, include=None, exclude=None, resampling_strategy='holdout', resampling_strategy_arguments=None, tmp_folder=None, delete_tmp_folder_after_terminate=True, n_jobs: Optional[int] = None, dask_client: Optional[distributed.client.Client] = None, disable_evaluator_output=False, get_smac_object_callback=None, smac_scenario_args=None, logging_config=None, metadata_directory=None, metric=None, scoring_functions: Optional[List[autosklearn.metrics.Scorer]] = None, load_models: bool = True, get_trials_callback=None)[source]

This class implements the regression task.
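For orientation, a minimal usage sketch; the dataset and time budgets are illustrative:

    import sklearn.datasets
    import sklearn.metrics
    import sklearn.model_selection

    import autosklearn.regression

    X, y = sklearn.datasets.load_diabetes(return_X_y=True)
    X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
        X, y, random_state=1
    )

    automl = autosklearn.regression.AutoSklearnRegressor(
        time_left_for_this_task=120,  # total search budget in seconds
        per_run_time_limit=30,        # budget for each candidate model
    )
    automl.fit(X_train, y_train)
    print("R2:", sklearn.metrics.r2_score(y_test, automl.predict(X_test)))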

Parameters
time_left_for_this_task : int, optional (default=3600)

Time limit in seconds for the search of appropriate models. By increasing this value, auto-sklearn has a higher chance of finding better models.

per_run_time_limit : int, optional (default=1/10 of time_left_for_this_task)

Time limit for a single call to the machine learning model. Model fitting will be terminated if the machine learning algorithm runs over the time limit. Set this value high enough so that typical machine learning algorithms can be fit on the training data.

initial_configurations_via_metalearning : int, optional (default=25)

Initialize the hyperparameter optimization algorithm with this many configurations which worked well on previously seen datasets. Set to 0 if the hyperparameter optimization algorithm should start from scratch.

ensemble_size : int, optional (default=50)

Number of models added to the ensemble built by Ensemble selection from libraries of models. Models are drawn with replacement.

ensemble_nbest : int, optional (default=50)

Only consider the ensemble_nbest models when building an ensemble.

max_models_on_disc : int, optional (default=50)

Defines the maximum number of models that are kept on disk; models beyond this number are permanently deleted. As a consequence, this value also caps how many models can be used in an ensemble. Must be an integer greater than or equal to 1. If set to None, all models are kept on disk.

seed : int, optional (default=1)

Used to seed SMAC. Will determine the output file names.

memory_limit : int, optional (default=3072)

Memory limit in MB for the machine learning algorithm. auto-sklearn will stop fitting the machine learning algorithm if it tries to allocate more than memory_limit MB. If None is provided, no memory limit is set. In case of multi-processing, memory_limit will be per job. This memory limit also applies to the ensemble creation process.

include : dict, optional (default=None)

If None, all possible algorithms are used. Otherwise, specifies the set of algorithms to use for each pipeline component. include and exclude are incompatible if used together on the same component.

exclude : dict, optional (default=None)

If None, all possible algorithms are used. Otherwise, specifies the set of algorithms not to use for each pipeline component. include and exclude are incompatible if used together on the same component.

resampling_strategy : string or object, optional (default=‘holdout’)

How to handle overfitting; may require ‘resampling_strategy_arguments’.

  • ‘holdout’: 67:33 (train:test) split

  • ‘holdout-iterative-fit’: 67:33 (train:test) split, calls iterative fit where possible

  • ‘cv’: cross-validation, requires ‘folds’

  • ‘cv-iterative-fit’: cross-validation, calls iterative fit where possible

  • ‘partial-cv’: cross-validation with intensification, requires ‘folds’

  • BaseCrossValidator object: any BaseCrossValidator class found in the scikit-learn model_selection module

  • _RepeatedSplits object: any _RepeatedSplits class found in the scikit-learn model_selection module

  • BaseShuffleSplit object: any BaseShuffleSplit class found in the scikit-learn model_selection module

resampling_strategy_arguments : dict, optional if ‘holdout’ (train_size default=0.67)

Additional arguments for resampling_strategy:

  • train_size should be between 0.0 and 1.0 and represents the proportion of the dataset to include in the train split.

  • shuffle determines whether the data is shuffled prior to splitting it into train and validation.

Available arguments:

  • ‘holdout’: {‘train_size’: float}

  • ‘holdout-iterative-fit’: {‘train_size’: float}

  • ‘cv’: {‘folds’: int}

  • ‘cv-iterative-fit’: {‘folds’: int}

  • ‘partial-cv’: {‘folds’: int, ‘shuffle’: bool}

  • BaseCrossValidator or _RepeatedSplits or BaseShuffleSplit object: all arguments required by the chosen class, as specified in the scikit-learn documentation. If arguments are not provided, scikit-learn defaults are used. If no defaults are available, an exception is raised. Refer to the ‘n_splits’ argument as ‘folds’.

tmp_folder : string, optional (default=None)

Folder to store configuration output and log files. If None, /tmp/autosklearn_tmp_$pid_$random_number is used automatically.

delete_tmp_folder_after_terminate : bool, optional (default=True)

Remove tmp_folder when finished. If tmp_folder is None, the temporary directory will always be deleted.

n_jobs : int, optional, experimental

The number of jobs to run in parallel for fit(). -1 means using all processors. By default, Auto-sklearn uses a single core for fitting the machine learning model and a single core for fitting an ensemble. Ensemble building is not affected by n_jobs but can be controlled by the number of models in the ensemble. In contrast to most scikit-learn models, n_jobs given in the constructor is not applied to the predict() method. If dask_client is None, a new dask client is created.

dask_client : dask.distributed.Client, optional

User-created dask client; can be used to start a dask cluster and then attach auto-sklearn to it.

disable_evaluator_output : bool or list, optional (default=False)

If True, disable model and prediction output. Cannot be used together with ensemble building. predict() cannot be used when setting this to True. Can also be used as a list to pass more fine-grained information on what to save. Allowed elements in the list are:

  • 'y_optimization' : do not save the predictions for the optimization/validation set, which would later on be used to build an ensemble.

  • 'model' : do not save any model files

smac_scenario_args : dict, optional (default=None)

Additional arguments inserted into the scenario of SMAC. See the SMAC documentation for a list of available arguments.

get_smac_object_callback : callable

Callback function to create an object of class smac.optimizer.smbo.SMBO. The function must accept the arguments scenario_dict, instances, num_params, runhistory, seed and ta. This is an advanced feature. Use only if you are familiar with SMAC.

logging_config : dict, optional (default=None)

Dictionary object specifying the logger configuration. If None, the default logging.yaml file is used, which can be found in the directory util/logging.yaml relative to the installation.

metadata_directory : str, optional (default=None)

Path to the metadata directory. If None, the default directory (autosklearn.metalearning.files) is used.

metric : Scorer, optional (default=None)

An instance of autosklearn.metrics.Scorer as created by autosklearn.metrics.make_scorer(). These are the Built-in Metrics. If None is provided, a default metric is selected depending on the task.

scoring_functions : List[Scorer], optional (default=None)

List of scorers which will be calculated for each pipeline; results will be available via cv_results_.

load_models : bool, optional (default=True)

Whether to load the models after fitting Auto-sklearn.

get_trials_callback : callable

Callback function to create an object of a subclass defined in the module smac.callbacks. This is an advanced feature. Use only if you are familiar with SMAC.

Attributes
cv_results_ : dict of numpy (masked) ndarrays

A dict with keys as column headers and values as columns, that can be imported into a pandas DataFrame.

Not all keys returned by scikit-learn are supported yet.

performance_over_time_ : pandas.core.frame.DataFrame

A DataFrame containing the models' performance-over-time data. Can be used for plotting directly. Please refer to the example Train and Test Inputs.

fit(X, y, X_test=None, y_test=None, feat_type=None, dataset_name=None)[source]

Fit Auto-sklearn to given training set (X, y).

Fit both optimizes the machine learning models and builds an ensemble out of them. To disable ensembling, set ensemble_size==0.

Parameters
X : array-like or sparse matrix of shape = [n_samples, n_features]

The training input samples.

y : array-like, shape = [n_samples] or [n_samples, n_targets]

The regression target.

X_test : array-like or sparse matrix of shape = [n_samples, n_features]

Test data input samples. Will be used to save test predictions for all models. This allows evaluating the performance of Auto-sklearn over time.

y_test : array-like, shape = [n_samples] or [n_samples, n_targets]

The regression target. Will be used to calculate the test error of all models. This allows evaluating the performance of Auto-sklearn over time.

feat_type : list, optional (default=None)

List of str of len(X.shape[1]) describing the attribute type. Possible types are Categorical and Numerical. Categorical attributes will be automatically one-hot encoded.

dataset_name : str, optional (default=None)

Create nicer output. If None, a string will be determined by the md5 hash of the dataset.

Returns
self
fit_ensemble(y, task=None, precision=32, dataset_name=None, ensemble_nbest=None, ensemble_size=None)

Fit an ensemble to models trained during an optimization process.

All parameters are None by default. If no other value is given, the default values which were set in a call to fit() are used.

Calling this function is only necessary if ensemble_size==0, for example when executing auto-sklearn in parallel.

Parameters
y : array-like

Target values.

task : int

A constant from the module autosklearn.constants. Determines the task type (binary classification, multiclass classification, multilabel classification or regression).

precision : str

Numeric precision used when loading ensemble data. Can be either '16', '32' or '64'.

dataset_name : str

Name of the current data set.

ensemble_nbest : int

Determines how many models should be considered for ensemble building. This is inspired by a concept called library pruning introduced in Getting Most out of Ensemble Selection.

ensemble_size : int

Size of the ensemble built by Ensemble Selection.

Returns
self
fit_pipeline(X: Union[List, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse.base.spmatrix], y: Union[List, pandas.core.series.Series, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse.base.spmatrix], config: Union[ConfigSpace.configuration_space.Configuration, Dict[str, Union[str, float, int]]], dataset_name: Optional[str] = None, X_test: Optional[Union[List, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse.base.spmatrix]] = None, y_test: Optional[Union[List, pandas.core.series.Series, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse.base.spmatrix]] = None, feat_type: Optional[List[str]] = None, *args, **kwargs: Dict) Tuple[Optional[autosklearn.pipeline.base.BasePipeline], smac.runhistory.runhistory.RunInfo, smac.runhistory.runhistory.RunValue]

Fits an individual pipeline configuration and returns the result to the user.

The Estimator constraints are honored, for example the resampling strategy, or memory constraints, unless directly provided to the method. By default, this method supports the same signature as fit(), and any extra arguments are redirected to the TAE evaluation function, which allows for further customization while building a pipeline.

Any additional argument provided is directly passed to the worker exercising the run.

Parameters
X: array-like, shape = (n_samples, n_features)

The features used for training

y: array-like

The labels used for training

X_test : Optional array-like, shape = (n_samples, n_features)

If provided, the testing performance will be tracked on these features.

y_test : array-like

If provided, the testing performance will be tracked on these labels.

config: Union[Configuration, Dict[str, Union[str, float, int]]]

A configuration object used to define the pipeline steps. If a dictionary is passed, a configuration is created based on this dictionary.

dataset_name: Optional[str]

Name that will be used to tag and identify the Auto-Sklearn run

feat_type : list, optional (default=None)

List of str of len(X.shape[1]) describing the attribute type. Possible types are Categorical and Numerical. Categorical attributes will be automatically One-Hot encoded. The values used for a categorical attribute must be integers, obtained for example by sklearn.preprocessing.LabelEncoder.

Returns
pipeline: Optional[BasePipeline]

The fitted pipeline. In case of failure while fitting the pipeline, a None is returned.

run_info : RunInfo

A named tuple that contains the configuration launched

run_value: RunValue

A named tuple that contains the result of the run

get_configuration_space(X: Union[List, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse.base.spmatrix], y: Union[List, pandas.core.series.Series, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse.base.spmatrix], X_test: Optional[Union[List, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse.base.spmatrix]] = None, y_test: Optional[Union[List, pandas.core.series.Series, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse.base.spmatrix]] = None, dataset_name: Optional[str] = None, feat_type: Optional[List[str]] = None) ConfigSpace.configuration_space.ConfigurationSpace

Returns the Configuration Space object, from which Auto-Sklearn will sample configurations and build pipelines.

Parameters
X : array-like or sparse matrix of shape = [n_samples, n_features]

Array with the training features, used to get characteristics like data sparsity

y : array-like, shape = [n_samples] or [n_samples, n_outputs]

Array with the problem labels

X_test : array-like or sparse matrix of shape = [n_samples, n_features]

Array with features used for performance estimation

y_test : array-like, shape = [n_samples] or [n_samples, n_outputs]

Array with the problem labels for the testing split

dataset_name: Optional[str]

A string to tag the Auto-Sklearn run

get_models_with_weights()

Return a list of the final ensemble found by auto-sklearn.

Returns
[(weight_1, model_1), …, (weight_n, model_n)]
get_params(deep=True)

Get parameters for this estimator.

Parameters
deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns
params : dict

Parameter names mapped to their values.

leaderboard(detailed: bool = False, ensemble_only: bool = True, top_k: Union[int, Literal['all']] = 'all', sort_by: str = 'cost', sort_order: Literal['auto', 'ascending', 'descending'] = 'auto', include: Optional[Union[str, Iterable[str]]] = None) pandas.core.frame.DataFrame

Returns a pandas table of results for all evaluated models.

Gives an overview of all models trained during the search process along with various statistics about their training.

The available statistics are:

Simple:

  • "model_id" - The id given to a model by autosklearn.

  • "rank" - The rank of the model based on it’s "cost".

  • "ensemble_weight" - The weight given to the model in the ensemble.

  • "type" - The type of classifier/regressor used.

  • "cost" - The loss of the model on the validation set.

  • "duration" - Length of time the model was optimized for.

Detailed: The detailed view includes all of the simple statistics along with the following.

  • "config_id" - The id used by SMAC for optimization.

  • "budget" - How much budget was allocated to this model.

  • "status" - The return status of training the model with SMAC.

  • "train_loss" - The loss of the model on the training set.

  • "balancing_strategy" - The balancing strategy used for data preprocessing.

  • "start_time" - Time the model began being optimized

  • "end_time" - Time the model ended being optimized

  • "data_preprocessors" - The preprocessors used on the data

  • "feature_preprocessors" - The preprocessors for features types

Parameters
detailed: bool = False

Whether to give detailed information or just a simple overview.

ensemble_only: bool = True

Whether to view only models included in the ensemble or all models trained.

top_k: int or “all” = “all”

How many models to display.

sort_by: str = ‘cost’

What column to sort by. If that column is not present, the sorting defaults to the "model_id" index column.

sort_order: “auto” or “ascending” or “descending” = “auto”

Which sort order to apply to the sort_by column. If left as "auto", it will sort by a sensible default where “better” is on top, otherwise defaulting to the pandas default for DataFrame.sort_values if there is no obvious “better”.

include: Optional[str or Iterable[str]]

Items to include, other items not specified will be excluded. The exception is the "model_id" index column which is always included.

If left as None, it will fall back to using the detailed param to decide the columns to include.

Returns
pd.DataFrame

A dataframe of statistics for the models, ordered by sort_by.

predict(X, batch_size=None, n_jobs=1)[source]

Predict regression target for X.

Parameters
X : array-like or sparse matrix of shape = [n_samples, n_features]
Returns
y : array of shape = [n_samples] or [n_samples, n_outputs]

The predicted values.

refit(X, y)

Refit all models found with fit to new data.

Necessary when using cross-validation: during training, auto-sklearn fits each model k times on the dataset but does not keep any trained model, so the models cannot be used to predict for new data points. This method fits all models found during a call to fit on the given data. It may also be used together with holdout to avoid fitting the final model on only 66% of the training data.

Parameters
X : array-like or sparse matrix of shape = [n_samples, n_features]

The training input samples.

y : array-like, shape = [n_samples] or [n_samples, n_outputs]

The targets.

Returns
self
score(X, y)

Return the coefficient of determination \(R^2\) of the prediction.

The coefficient \(R^2\) is defined as \((1 - \frac{u}{v})\), where \(u\) is the residual sum of squares ((y_true - y_pred) ** 2).sum() and \(v\) is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get a \(R^2\) score of 0.0.

Parameters
X : array-like of shape (n_samples, n_features)

Test samples. For some estimators this may be a precomputed kernel matrix or a list of generic objects instead with shape (n_samples, n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for the estimator.

y : array-like of shape (n_samples,) or (n_samples, n_outputs)

True values for X.

sample_weight : array-like of shape (n_samples,), default=None

Sample weights.

Returns
score : float

\(R^2\) of self.predict(X) wrt. y.

Notes

The \(R^2\) score used when calling score on a regressor uses multioutput='uniform_average' from version 0.23 to keep consistent with default value of r2_score(). This influences the score method of all the multioutput regressors (except for MultiOutputRegressor).

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters
**params : dict

Estimator parameters.

Returns
self : estimator instance

Estimator instance.

show_models()

Return a representation of the final ensemble found by auto-sklearn.

Returns
str
sprint_statistics()

Return the following statistics of the training result:

  • dataset name

  • metric used

  • best validation score

  • number of target algorithm runs

  • number of successful target algorithm runs

  • number of crashed target algorithm runs

  • number of target algorithm runs that exceeded the memory limit

  • number of target algorithm runs that exceeded the time limit

Returns
str

Metrics

autosklearn.metrics.make_scorer(name: str, score_func: Callable, optimum: float = 1.0, worst_possible_result: float = 0.0, greater_is_better: bool = True, needs_proba: bool = False, needs_threshold: bool = False, **kwargs: Any) autosklearn.metrics.Scorer[source]

Make a scorer from a performance metric or loss function.

Factory inspired by scikit-learn which wraps scikit-learn scoring functions to be used in auto-sklearn.

Parameters
score_func : callable

Score function (or loss function) with signature score_func(y, y_pred, **kwargs).

optimum : int or float, default=1

The best score achievable by the score function, i.e. maximum in case of scorer function and minimum in case of loss function.

greater_is_better : boolean, default=True

Whether score_func is a score function (default), meaning high is good, or a loss function, meaning low is good. In the latter case, the scorer object will sign-flip the outcome of the score_func.

needs_proba : boolean, default=False

Whether score_func requires predict_proba to get probability estimates out of a classifier.

needs_threshold : boolean, default=False

Whether score_func takes a continuous decision certainty. This only works for binary classification.

**kwargs : additional arguments

Additional parameters to be passed to score_func.

Returns
scorer : callable

Callable object that returns a scalar score; greater is better.
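A sketch of wrapping a scikit-learn metric; the F2 score and its beta value are illustrative, with beta forwarded to the score function via **kwargs:

    import sklearn.metrics

    import autosklearn.classification
    import autosklearn.metrics

    f2_scorer = autosklearn.metrics.make_scorer(
        name='f2',
        score_func=sklearn.metrics.fbeta_score,
        optimum=1.0,
        greater_is_better=True,
        beta=2,  # forwarded to fbeta_score
    )
    automl = autosklearn.classification.AutoSklearnClassifier(metric=f2_scorer)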

Built-in Metrics

Classification metrics

Note: The default autosklearn.metrics.f1, autosklearn.metrics.precision and autosklearn.metrics.recall built-in metrics are applicable only for binary classification. In order to apply them on multilabel and multiclass classification, please use the corresponding metrics with an appropriate averaging mechanism, such as autosklearn.metrics.f1_macro. For more information about how these metrics are used, please read this scikit-learn documentation.
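A sketch of selecting a built-in metric for a multiclass problem:

    import autosklearn.classification
    import autosklearn.metrics

    automl = autosklearn.classification.AutoSklearnClassifier(
        time_left_for_this_task=120,
        metric=autosklearn.metrics.f1_macro,  # macro-averaged F1
    )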

autosklearn.metrics.accuracy

alias of accuracy

autosklearn.metrics.balanced_accuracy

alias of balanced_accuracy

autosklearn.metrics.f1

alias of f1

autosklearn.metrics.f1_macro

alias of f1_macro

autosklearn.metrics.f1_micro

alias of f1_micro

autosklearn.metrics.f1_samples

alias of f1_samples

autosklearn.metrics.f1_weighted

alias of f1_weighted

autosklearn.metrics.roc_auc

alias of roc_auc

autosklearn.metrics.precision

alias of precision

autosklearn.metrics.precision_macro

alias of precision_macro

autosklearn.metrics.precision_micro

alias of precision_micro

autosklearn.metrics.precision_samples

alias of precision_samples

autosklearn.metrics.precision_weighted

alias of precision_weighted

autosklearn.metrics.average_precision

alias of average_precision

autosklearn.metrics.recall

alias of recall

autosklearn.metrics.recall_macro

alias of recall_macro

autosklearn.metrics.recall_micro

alias of recall_micro

autosklearn.metrics.recall_samples

alias of recall_samples

autosklearn.metrics.recall_weighted

alias of recall_weighted

autosklearn.metrics.log_loss

alias of log_loss

Regression metrics

autosklearn.metrics.r2

alias of r2

autosklearn.metrics.mean_squared_error

alias of mean_squared_error

autosklearn.metrics.mean_absolute_error

alias of mean_absolute_error

autosklearn.metrics.median_absolute_error

alias of median_absolute_error

Extension Interfaces

class autosklearn.pipeline.components.base.AutoSklearnClassificationAlgorithm[source]

Provide an abstract interface for classification algorithms in auto-sklearn.

See Extending auto-sklearn for more information.
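A condensed sketch of a custom component; the Gaussian naive Bayes wrapper and its property values are illustrative, and the complete, authoritative pattern is in Extending auto-sklearn:

    from ConfigSpace.configuration_space import ConfigurationSpace

    import autosklearn.pipeline.components.classification
    from autosklearn.pipeline.components.base import AutoSklearnClassificationAlgorithm
    from autosklearn.pipeline.constants import DENSE, PREDICTIONS, UNSIGNED_DATA

    class MyGaussianNB(AutoSklearnClassificationAlgorithm):
        def __init__(self, random_state=None):
            self.estimator = None

        @staticmethod
        def get_properties(dataset_properties=None):
            # Declares what the component can handle and what it consumes/produces.
            return {
                'shortname': 'MyGNB',
                'name': 'My Gaussian Naive Bayes',
                'handles_regression': False,
                'handles_classification': True,
                'handles_multiclass': True,
                'handles_multilabel': False,
                'handles_multioutput': False,
                'is_deterministic': True,
                'input': (DENSE, UNSIGNED_DATA),
                'output': (PREDICTIONS,),
            }

        @staticmethod
        def get_hyperparameter_search_space(dataset_properties=None):
            return ConfigurationSpace()  # no hyperparameters in this sketch

        def fit(self, X, y):
            from sklearn.naive_bayes import GaussianNB
            self.estimator = GaussianNB().fit(X, y)
            return self

        def predict(self, X):
            return self.estimator.predict(X)

        def predict_proba(self, X):
            return self.estimator.predict_proba(X)

    # Register the component so it can be selected, e.g. via include.
    autosklearn.pipeline.components.classification.add_classifier(MyGaussianNB)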

get_estimator()[source]

Return the underlying estimator object.

Returns
estimator : the underlying estimator object
predict(X)[source]

The predict function calls the predict function of the underlying scikit-learn model and returns an array with the predictions.

Parameters
X : array-like, shape = (n_samples, n_features)
Returns
array, shape = (n_samples,) or shape = (n_samples, n_labels)

Returns the predicted values

Notes

Please see the scikit-learn API documentation for further information.

predict_proba(X)[source]

Predict probabilities.

Parameters
X : array-like, shape = (n_samples, n_features)
Returns
array, shape=(n_samples,) if n_classes == 2 else (n_samples, n_classes)
class autosklearn.pipeline.components.base.AutoSklearnRegressionAlgorithm[source]

Provide an abstract interface for regression algorithms in auto-sklearn.

Make a subclass of this and put it into the directory autosklearn/pipeline/components/regression to make it available.

get_estimator()[source]

Return the underlying estimator object.

Returns
estimator : the underlying estimator object
predict(X)[source]

The predict function calls the predict function of the underlying scikit-learn model and returns an array with the predictions.

Parameters
X : array-like, shape = (n_samples, n_features)
Returns
array, shape = (n_samples,) or shape = (n_samples, n_targets)

Returns the predicted values

Notes

Please see the scikit-learn API documentation for further information.

class autosklearn.pipeline.components.base.AutoSklearnPreprocessingAlgorithm[source]

Provide an abstract interface for preprocessing algorithms in auto-sklearn.

See Extending auto-sklearn for more information.

get_preprocessor()[source]

Return the underlying preprocessor object.

Returns
preprocessor : the underlying preprocessor object
transform(X)[source]

The transform function calls the transform function of the underlying scikit-learn model and returns the transformed array.

Parameters
X : array-like, shape = (n_samples, n_features)
Returns
X : array

Return the transformed training data

Notes

Please see the scikit-learn API documentation for further information.