APIs¶
Main modules¶
Classification¶
- class autosklearn.classification.AutoSklearnClassifier(time_left_for_this_task=3600, per_run_time_limit=None, initial_configurations_via_metalearning=25, ensemble_size: int | None = None, ensemble_class: Type[AbstractEnsemble] | Literal['default'] | None = 'default', ensemble_kwargs: Dict[str, Any] | None = None, ensemble_nbest=50, max_models_on_disc=50, seed=1, memory_limit=3072, include: Optional[Dict[str, List[str]]] = None, exclude: Optional[Dict[str, List[str]]] = None, resampling_strategy='holdout', resampling_strategy_arguments=None, tmp_folder=None, delete_tmp_folder_after_terminate=True, n_jobs: Optional[int] = None, dask_client: Optional[dask.distributed.Client] = None, disable_evaluator_output=False, get_smac_object_callback=None, smac_scenario_args=None, logging_config=None, metadata_directory=None, metric: Scorer | Sequence[Scorer] | None = None, scoring_functions: Optional[List[Scorer]] = None, load_models: bool = True, get_trials_callback: SMACCallback | None = None, dataset_compression: Union[bool, Mapping[str, Any]] = True, allow_string_features: bool = True, disable_progress_bar: bool = False)[source]¶
This class implements the classification task.
- Parameters
- time_left_for_this_task: int, optional (default=3600)
Time limit in seconds for the search of appropriate models. By increasing this value, auto-sklearn has a higher chance of finding better models.
- per_run_time_limit: int, optional (default=1/10 of time_left_for_this_task)
Time limit for a single call to the machine learning model. Model fitting will be terminated if the machine learning algorithm runs over the time limit. Set this value high enough so that typical machine learning algorithms can be fit on the training data.
- initial_configurations_via_metalearning: int, optional (default=25)
Initialize the hyperparameter optimization algorithm with this many configurations which worked well on previously seen datasets. Disable if the hyperparameter optimization algorithm should start from scratch.
- ensemble_size: int, optional
Number of models added to the ensemble built by ensemble selection from libraries of models. Models are drawn with replacement. If set to 0, no ensemble is fit.
Deprecated - will be removed in Auto-sklearn 0.16. Please pass this argument via ensemble_kwargs={"ensemble_size": int} if you want to change the ensemble size for ensemble selection.
- ensemble_class: Type[AbstractEnsemble] | "default", optional (default="default")
Class implementing the post-hoc ensemble algorithm. Set to None to disable ensemble building, or use SingleBest to obtain only the single best model instead of an ensemble.
If set to "default", EnsembleSelection is used for single-objective problems and MultiObjectiveDummyEnsemble for multi-objective problems.
- ensemble_kwargs: Dict, optional
Keyword arguments that are passed to the ensemble class upon initialization.
- ensemble_nbest: int, optional (default=50)
Only consider the ensemble_nbest models when building an ensemble. This is inspired by a concept called library pruning introduced in Getting Most out of Ensemble Selection. This pruning step is independent of the ensemble_class argument and is done prior to constructing an ensemble.
- max_models_on_disc: int, optional (default=50)
Defines the maximum number of models kept on disc. Any additional models are permanently deleted. Because of this, the value also sets the upper limit on how many models can be used in an ensemble. It must be an integer greater than or equal to 1. If set to None, all models are kept on the disc.
- seed: int, optional (default=1)
Used to seed SMAC. Will determine the output file names.
- memory_limit: int, optional (3072)
Memory limit in MB for the machine learning algorithm. auto-sklearn will stop fitting the machine learning algorithm if it tries to allocate more than memory_limit MB.
Important notes:
If None is provided, no memory limit is set.
In case of multi-processing, memory_limit will be per job, so the total usage is n_jobs x memory_limit.
The memory limit also applies to the ensemble creation process.
- include: Optional[Dict[str, List[str]]] = None
If None, all possible algorithms are used.
Otherwise, specifies a step and the components that are included in the search. See /pipeline/components/<step>/* for available components.
Incompatible with parameter exclude.
Possible Steps:
"data_preprocessor"
"balancing"
"feature_preprocessor"
"classifier" - only when using AutoSklearnClassifier
"regressor" - only when using AutoSklearnRegressor
Example:
include = { 'classifier': ["random_forest"], 'feature_preprocessor': ["no_preprocessing"] }
- exclude: Optional[Dict[str, List[str]]] = None
If None, all possible algorithms are used.
Otherwise, specifies a step and the components that are excluded from the search. See /pipeline/components/<step>/* for available components.
Incompatible with parameter include.
Possible Steps:
"data_preprocessor"
"balancing"
"feature_preprocessor"
"classifier" - only when using AutoSklearnClassifier
"regressor" - only when using AutoSklearnRegressor
Example:
exclude = { 'classifier': ["random_forest"], 'feature_preprocessor': ["no_preprocessing"] }
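A sketch of how such a dictionary is passed to the estimator (the component names come from the examples above; the time budget is illustrative):

import autosklearn.classification

# Search only random forests, with no feature preprocessing.
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=120,
    include={
        'classifier': ["random_forest"],
        'feature_preprocessor': ["no_preprocessing"],
    },
)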
- resampling_strategy: str | BaseCrossValidator | _RepeatedSplits | BaseShuffleSplit = "holdout"
How to handle overfitting; might need resampling_strategy_arguments if using a "cv"-based method or a Splitter object.
- Options
"holdout" - use a 67:33 (train:test) split
"cv" - perform cross-validation; requires "folds" in resampling_strategy_arguments
"holdout-iterative-fit" - same as "holdout", but iterative fit where possible
"cv-iterative-fit" - same as "cv", but iterative fit where possible
"partial-cv" - same as "cv", but uses intensification
BaseCrossValidator - any BaseCrossValidator subclass (found in the scikit-learn model_selection module)
_RepeatedSplits - any _RepeatedSplits subclass (found in the scikit-learn model_selection module)
BaseShuffleSplit - any BaseShuffleSplit subclass (found in the scikit-learn model_selection module)
If using a Splitter object that relies on the dataset retaining its current size and order, you will need to look at the dataset_compression argument and ensure that "subsample" is not included in the applied compression "methods", or disable dataset compression entirely with False.
- resampling_strategy_arguments: Optional[Dict] = None
Additional arguments for resampling_strategy; required if using a "cv"-based strategy. The default arguments if left as None are:
{
    "train_size": 0.67,    # The size of the training set
    "shuffle": True,       # Whether to shuffle before splitting data
    "folds": 5             # Used in 'cv' based resampling strategies
}
If using a custom splitter class which takes n_splits (such as PredefinedSplit), the value of "folds" will be used.
- tmp_folder: string, optional (None)
Folder to store configuration output and log files. If None, /tmp/autosklearn_tmp_$pid_$random_number is used automatically.
- delete_tmp_folder_after_terminate: bool, optional (True)
Remove tmp_folder when finished. If tmp_folder is None, the tmp_dir will always be deleted.
- n_jobs: int, optional, experimental
The number of jobs to run in parallel for fit(). -1 means using all processors.
Important notes:
By default, Auto-sklearn uses one core.
Ensemble building is not affected by n_jobs but can be controlled by the number of models in the ensemble.
predict() is not affected by n_jobs (in contrast to most scikit-learn models).
If dask_client is None, a new dask client is created.
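A minimal sketch of a parallel run under the notes above (the budgets are illustrative):

import autosklearn.classification

# Four workers, each limited to 3072 MB, so total memory usage may
# reach roughly n_jobs x memory_limit = 12288 MB.
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=600,
    n_jobs=4,
    memory_limit=3072,
)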
- dask_client: dask.distributed.Client, optional
User-created dask client, can be used to start a dask cluster and then attach auto-sklearn to it.
- disable_evaluator_output: bool or list, optional (False)
If True, disable model and prediction output. Cannot be used together with ensemble building. predict() cannot be used when setting this to True. Can also be used as a list to pass more fine-grained information on what to save. Allowed elements in the list are:
'y_optimization' - do not save the predictions for the optimization set, which would later on be used to build an ensemble.
'model' - do not save any model files.
- smac_scenario_args: dict, optional (None)
Additional arguments inserted into the scenario of SMAC. See the SMAC documentation for a list of available arguments.
- get_smac_object_callback: callable
Callback function to create an object of class smac.facade.AbstractFacade. The function must accept the arguments scenario_dict, instances, num_params, runhistory, seed and ta. This is an advanced feature. Use only if you are familiar with SMAC.
- logging_config: dict, optional (None)
Dictionary object specifying the logger configuration. If None, the default logging.yaml file is used, which can be found in the directory util/logging.yaml relative to the installation.
- metadata_directory: str, optional (None)
Path to the metadata directory. If None, the default directory (autosklearn.metalearning.files) is used.
- metric: Scorer, optional (None)
An instance of autosklearn.metrics.Scorer as created by autosklearn.metrics.make_scorer(). These are the Built-in Metrics. If None is provided, a default metric is selected depending on the task.
- scoring_functions: List[Scorer], optional (None)
List of scorers which will be calculated for each pipeline. Results will be available via cv_results_.
- load_models: bool, optional (True)
Whether to load the models after fitting Auto-sklearn.
- get_trials_callback: callable
A callable with the following definition:
(smac.SMBO, smac.RunInfo, smac.RunValue, time_left: float) -> bool | None
This will be called after SMAC, the underlying optimizer for auto-sklearn, finishes evaluating each run.
You can use this to record your own information about the optimization process, or to enable early stopping based on some criteria.
See the example: Early Stopping And Callbacks.
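A sketch of such a callback, modeled on that early-stopping example (the smac import paths assume the SMAC version bundled with auto-sklearn; the 0.10 threshold is illustrative):

from smac.optimizer.smbo import SMBO
from smac.runhistory.runhistory import RunInfo, RunValue

import autosklearn.classification


def stop_on_low_cost(smbo: SMBO, run_info: RunInfo, result: RunValue, time_left: float):
    # Returning False stops the optimization; returning None continues as normal.
    if result.cost <= 0.10:
        print(f"Stopping early: cost {result.cost} reached")
        return False


automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=120,
    get_trials_callback=stop_on_low_cost,
)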
- dataset_compression: Union[bool, Mapping[str, Any]] = True
We compress datasets so that they fit into some predefined amount of memory. Currently this does not apply to dataframes or sparse arrays, only to raw numpy arrays.
NOTE: if using a custom resampling_strategy that relies on a specific size or ordering of the data, this must be disabled to preserve these properties.
You can disable this entirely by passing False, or leave it as the default True, which is equivalent to the following configuration:
{ "memory_allocation": 0.1, "methods": ["precision", "subsample"] }
You can also pass your own configuration with the same keys, choosing from the available "methods". The available options are described here:
- memory_allocation
By default, we attempt to fit the dataset into 0.1 * memory_limit. This float value can be set with "memory_allocation": 0.1. We also allow for specifying absolute memory in MB, e.g. 10MB is "memory_allocation": 10.
The memory used by the dataset is checked after each reduction method is performed. If the dataset fits into the allocated memory, any further methods listed in "methods" will not be performed.
For example, if methods: ["precision", "subsample"] and the "precision" reduction step was enough to make the dataset fit into memory, then the "subsample" reduction step will not be performed.
- methods
We provide the following methods for reducing the dataset size. These can be provided in a list and are performed in the order given.
"precision" - We reduce floating point precision as follows:
np.float128 -> np.float64
np.float96 -> np.float64
np.float64 -> np.float32
"subsample" - We subsample data such that it fits directly into the memory allocation memory_allocation * memory_limit. Therefore, this should likely be the last method listed in "methods". Subsampling takes into account classification labels and stratifies accordingly. We guarantee that at least one occurrence of each label is included in the sampled set.
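A sketch of a custom configuration (the 512 MB budget is illustrative): keep only precision reduction, so a custom splitter's size and ordering are preserved.

import autosklearn.classification

automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=300,
    dataset_compression={
        "memory_allocation": 512,   # absolute budget in MB for the dataset
        "methods": ["precision"],   # no "subsample": dataset size/order preserved
    },
)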
- allow_string_features: bool = True
Whether auto-sklearn should process string features. By default, text preprocessing is enabled.
- disable_progress_bar: bool = False
Whether to disable the progress bar that is displayed in the console while fitting to the training data.
- Attributes
- cv_results_: dict of numpy (masked) ndarrays
A dict with keys as column headers and values as columns, that can be imported into a pandas DataFrame.
Not all keys returned by scikit-learn are supported yet.
- performance_over_time_: pandas.core.frame.DataFrame
A DataFrame containing the models' performance over time. Can be used for plotting directly. Please refer to the example Train and Test Inputs.
- fit(X, y, X_test=None, y_test=None, feat_type=None, dataset_name=None)[source]¶
Fit auto-sklearn to given training set (X, y).
Fit both optimizes the machine learning models and builds an ensemble out of them.
- Parameters
- X: array-like or sparse matrix of shape = [n_samples, n_features]
The training input samples.
- y: array-like, shape = [n_samples] or [n_samples, n_outputs]
The target classes.
- X_test: array-like or sparse matrix of shape = [n_samples, n_features]
Test data input samples. Will be used to save test predictions for all models. This allows evaluating the performance of Auto-sklearn over time.
- y_test: array-like, shape = [n_samples] or [n_samples, n_outputs]
Test data target classes. Will be used to calculate the test error of all models. This allows evaluating the performance of Auto-sklearn over time.
- feat_type: list, optional (default=None)
List of str of len(X.shape[1]) describing the attribute type. Possible types are Categorical and Numerical. Categorical attributes will be automatically One-Hot encoded. The values used for a categorical attribute must be integers, obtained for example by sklearn.preprocessing.LabelEncoder.
- dataset_name: str, optional (default=None)
Create nicer output. If None, a string will be determined by the md5 hash of the dataset.
- Returns
- self
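A minimal end-to-end sketch of fit() (the dataset and budgets are illustrative):

import sklearn.datasets
import sklearn.model_selection

import autosklearn.classification

X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, random_state=1
)

automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=120,   # total search budget in seconds
    per_run_time_limit=30,         # budget per candidate model
)
# Passing X_test/y_test lets auto-sklearn track test performance over time.
automl.fit(X_train, y_train, X_test=X_test, y_test=y_test, dataset_name="breast_cancer")
print(automl.sprint_statistics())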
- fit_ensemble(y, task: int = None, precision: Literal[16, 32, 64] = 32, dataset_name: Optional[str] = None, ensemble_size: int | None = None, ensemble_kwargs: Optional[Dict[str, Any]] = None, ensemble_nbest: Optional[int] = None, ensemble_class: Type[AbstractEnsemble] | Literal['default'] | None = 'default', metric: Scorer | Sequence[Scorer] | None = None)¶
Fit an ensemble to models trained during an optimization process.
All parameters are None by default. If no other value is given, the default values which were set in a call to fit() are used.
- Parameters
- y: array-like
Target values.
- task: int
A constant from the module autosklearn.constants. Determines the task type (binary classification, multiclass classification, multilabel classification or regression).
- precision: int
Numeric precision used when loading ensemble data. Can be either 16, 32 or 64.
- dataset_name: str
Name of the current data set.
- ensemble_size: int, optional
Number of models added to the ensemble built by ensemble selection from libraries of models. Models are drawn with replacement. If set to 0, no ensemble is fit.
Deprecated - will be removed in Auto-sklearn 0.16. Please pass this argument via ensemble_kwargs={"ensemble_size": int} if you want to change the ensemble size for ensemble selection.
- ensemble_kwargs: Dict, optional
Keyword arguments that are passed to the ensemble class upon initialization.
- ensemble_nbest: int
Only consider the ensemble_nbest models when building an ensemble. This is inspired by a concept called library pruning introduced in Getting Most out of Ensemble Selection. This pruning step is independent of the ensemble_class argument and is done prior to constructing an ensemble.
- ensemble_class: Type[AbstractEnsemble] | "default", optional (default="default")
Class implementing the post-hoc ensemble algorithm. Set to None to disable ensemble building, or use SingleBest to obtain only the single best model instead of an ensemble.
If set to "default", EnsembleSelection is used for single-objective problems and MultiObjectiveDummyEnsemble for multi-objective problems.
- metric: Scorer | Sequence[Scorer] | None = None
A metric or list of metrics to score the ensemble with.
- Returns
- self
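A sketch of fitting an ensemble after the search (assumes automl was already fitted, e.g. with ensemble building disabled; the sizes are illustrative):

from autosklearn.ensembles.ensemble_selection import EnsembleSelection

automl.fit_ensemble(
    y_train,
    ensemble_class=EnsembleSelection,
    ensemble_kwargs={"ensemble_size": 10},  # preferred over the deprecated ensemble_size
    ensemble_nbest=25,
)
print(automl.leaderboard())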
- fit_pipeline(X: Union[List, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse._base.spmatrix], y: Union[List, pandas.core.series.Series, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse._base.spmatrix], config: Union[ConfigSpace.configuration_space.Configuration, Dict[str, Union[str, float, int]]], dataset_name: Optional[str] = None, X_test: Optional[Union[List, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse._base.spmatrix]] = None, y_test: Optional[Union[List, pandas.core.series.Series, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse._base.spmatrix]] = None, feat_type: Optional[List[str]] = None, *args, **kwargs: Dict) Tuple[Optional[autosklearn.pipeline.base.BasePipeline], smac.runhistory.runhistory.RunInfo, smac.runhistory.runhistory.RunValue] ¶
Fits an individual pipeline configuration and returns the result to the user.
The Estimator constraints are honored, for example the resampling strategy, or memory constraints, unless directly provided to the method. By default, this method supports the same signature as fit(), and any extra arguments are redirected to the TAE evaluation function, which allows for further customization while building a pipeline.
Any additional argument provided is directly passed to the worker exercising the run.
- Parameters
- X: array-like, shape = (n_samples, n_features)
The features used for training
- y: array-like
The labels used for training
- X_test: Optional array-like, shape = (n_samples, n_features)
If provided, the testing performance will be tracked on these features.
- y_test: array-like
If provided, the testing performance will be tracked on these labels.
- config: Union[Configuration, Dict[str, Union[str, float, int]]]
A configuration object used to define the pipeline steps. If a dict is passed, a configuration is created based on this dict.
- dataset_name: Optional[str]
Name that will be used to tag and identify the Auto-Sklearn run.
- feat_type: list, optional (default=None)
List of str of len(X.shape[1]) describing the attribute type. Possible types are Categorical and Numerical. Categorical attributes will be automatically One-Hot encoded. The values used for a categorical attribute must be integers, obtained for example by sklearn.preprocessing.LabelEncoder.
- Returns
- pipeline: Optional[BasePipeline]
The fitted pipeline. In case of failure while fitting the pipeline, a None is returned.
- run_info: RunInfo
A named tuple that contains the configuration launched
- run_value: RunValue
A named tuple that contains the result of the run
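A sketch of evaluating a single configuration (assumes X_train etc. from the fit() sketch above):

# Sample one configuration from the estimator's search space and evaluate it
# under the estimator's resampling and memory constraints.
cs = automl.get_configuration_space(X_train, y_train, dataset_name="breast_cancer")
config = cs.sample_configuration()

pipeline, run_info, run_value = automl.fit_pipeline(
    X=X_train,
    y=y_train,
    config=config,
    X_test=X_test,
    y_test=y_test,
)
print(run_value.status, run_value.cost)  # pipeline is None if fitting failed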
- get_configuration_space(X: Union[List, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse._base.spmatrix], y: Union[List, pandas.core.series.Series, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse._base.spmatrix], X_test: Optional[Union[List, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse._base.spmatrix]] = None, y_test: Optional[Union[List, pandas.core.series.Series, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse._base.spmatrix]] = None, dataset_name: Optional[str] = None, feat_type: Optional[List[str]] = None) ConfigSpace.configuration_space.ConfigurationSpace ¶
Returns the Configuration Space object, from which Auto-Sklearn will sample configurations and build pipelines.
- Parameters
- X: array-like or sparse matrix of shape = [n_samples, n_features]
Array with the training features, used to get characteristics like data sparsity
- y: array-like, shape = [n_samples] or [n_samples, n_outputs]
Array with the problem labels
- X_test: array-like or sparse matrix of shape = [n_samples, n_features]
Array with features used for performance estimation
- y_test: array-like, shape = [n_samples] or [n_samples, n_outputs]
Array with the problem labels for the testing split
- dataset_name: Optional[str]
A string to tag the Auto-Sklearn run
- get_models_with_weights()¶
Return a list of the final ensemble found by auto-sklearn.
- Returns
- [(weight_1, model_1), …, (weight_n, model_n)]
- get_params(deep=True)¶
Get parameters for this estimator.
- Parameters
- deep: bool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
- paramsdict
Parameter names mapped to their values.
- leaderboard(detailed: bool = False, ensemble_only: bool = True, top_k: Union[int, Literal['all']] = 'all', sort_by: str = 'cost', sort_order: Literal['auto', 'ascending', 'descending'] = 'auto', include: Optional[Union[str, Iterable[str]]] = None) pandas.core.frame.DataFrame ¶
Returns a pandas table of results for all evaluated models.
Gives an overview of all models trained during the search process along with various statistics about their training.
The available statistics are:
Simple:
"model_id" - The id given to a model by autosklearn.
"rank" - The rank of the model based on its "cost".
"ensemble_weight" - The weight given to the model in the ensemble.
"type" - The type of classifier/regressor used.
"cost" - The loss of the model on the validation set.
"duration" - Length of time the model was optimized for.
Detailed: The detailed view includes all of the simple statistics along with the following.
"config_id" - The id used by SMAC for optimization.
"budget" - How much budget was allocated to this model.
"status" - The return status of training the model with SMAC.
"train_loss" - The loss of the model on the training set.
"balancing_strategy" - The balancing strategy used for data preprocessing.
"start_time" - Time the model began being optimized.
"end_time" - Time the model ended being optimized.
"data_preprocessors" - The preprocessors used on the data.
"feature_preprocessors" - The preprocessors for feature types.
- Parameters
- detailed: bool = False
Whether to give detailed information or just a simple overview.
- ensemble_only: bool = True
Whether to view only models included in the ensemble or all models trained.
- top_k: int or “all” = “all”
How many models to display.
- sort_by: str = 'cost'
What column to sort by. If that column is not present, the sorting defaults to the "model_id" index column.
Defaults to the metric optimized. In case of a multi-objective optimization problem, sorts by the first objective.
- sort_order: "auto" or "ascending" or "descending" = "auto"
Which sort order to apply to the sort_by column. If left as "auto", it will sort by a sensible default where "better" is on top, otherwise defaulting to the pandas default for DataFrame.sort_values if there is no obvious "better".
- include: Optional[str or Iterable[str]]
Items to include; other items not specified will be excluded. The exception is the "model_id" index column, which is always included.
If left as None, it will fall back to using the detailed param to decide the columns to include.
- Returns
- pd.DataFrame
A dataframe of statistics for the models, ordered by sort_by.
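For example (assumes a fitted automl; the listed columns follow the statistics described above):

# Top five models by cost, simple view.
print(automl.leaderboard(top_k=5))

# Every model trained during the search, with detailed statistics.
detailed = automl.leaderboard(detailed=True, ensemble_only=False)
print(detailed[["type", "cost", "train_loss", "duration"]])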
- predict(X, batch_size=None, n_jobs=1)[source]¶
Predict classes for X.
- Parameters
- X: array-like or sparse matrix of shape = [n_samples, n_features]
- Returns
- y: array of shape = [n_samples] or [n_samples, n_labels]
The predicted classes.
- predict_proba(X, batch_size=None, n_jobs=1)[source]¶
Predict probabilities of classes for all samples X.
- Parameters
- X: array-like or sparse matrix of shape = [n_samples, n_features]
- batch_size: int (optional)
Number of data points to predict for (predicts all points at once if None).
- n_jobs: int
- Returns
- y: array of shape = [n_samples, n_classes] or [n_samples, n_labels]
The predicted class probabilities.
- refit(X, y)¶
Refit all models found with fit to new data.
Necessary when using cross-validation. During training, auto-sklearn fits each model k times on the dataset, but does not keep any trained model and can therefore not be used to predict for new data points. This method fits all models found during a call to fit on the given data. This method may also be used together with holdout to avoid only using 66% of the training data to fit the final model.
- Parameters
- X: array-like or sparse matrix of shape = [n_samples, n_features]
The training input samples.
- y: array-like, shape = [n_samples] or [n_samples, n_outputs]
The targets.
- Returns
- self
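A sketch of the cross-validation workflow this enables (the fold count is illustrative):

import autosklearn.classification

automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=120,
    resampling_strategy="cv",
    resampling_strategy_arguments={"folds": 5},
)
automl.fit(X_train, y_train)
automl.refit(X_train, y_train)   # required before predict() when using "cv"
predictions = automl.predict(X_test)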
- score(X, y)¶
Return the mean accuracy on the given test data and labels.
In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.
- Parameters
- X: array-like of shape (n_samples, n_features)
Test samples.
- y: array-like of shape (n_samples,) or (n_samples, n_outputs)
True labels for X.
- sample_weight: array-like of shape (n_samples,), default=None
Sample weights.
- Returns
- score: float
Mean accuracy of self.predict(X) w.r.t. y.
- set_params(**params)¶
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
- Parameters
- **params: dict
Estimator parameters.
- Returns
- self: estimator instance
Estimator instance.
- show_models()¶
Returns a dictionary containing dictionaries of ensemble models.
Each model in the ensemble can be accessed by giving its model_id as key.
A model dictionary contains the following:
"model_id" - The id given to a model by autosklearn.
"rank" - The rank of the model based on its "cost".
"cost" - The loss of the model on the validation set.
"ensemble_weight" - The weight given to the model in the ensemble.
"voting_model" - The cv_voting_ensemble model (for 'cv' resampling).
"estimators" - List of models (dicts) in cv_voting_ensemble ('cv' resampling).
"data_preprocessor" - The preprocessor used on the data.
"balancing" - The balancing used on the data (for classification).
"feature_preprocessor" - The preprocessor for feature types.
"classifier" / "regressor" - The autosklearn wrapped classifier or regressor.
"sklearn_classifier" or "sklearn_regressor" - The sklearn classifier or regressor.
Example
import sklearn.datasets
import sklearn.metrics

import autosklearn.regression

X, y = sklearn.datasets.load_diabetes(return_X_y=True)

automl = autosklearn.regression.AutoSklearnRegressor(
    time_left_for_this_task=120
)
automl.fit(X, y, dataset_name='diabetes')

ensemble_dict = automl.show_models()
print(ensemble_dict)
Output:
{
    25: {
        'model_id': 25.0,
        'rank': 1,
        'cost': 0.43667876507897496,
        'ensemble_weight': 0.38,
        'data_preprocessor': <autosklearn.pipeline.components.data_preprocessing....>,
        'feature_preprocessor': <autosklearn.pipeline.components....>,
        'regressor': <autosklearn.pipeline.components.regression....>,
        'sklearn_regressor': SGDRegressor(alpha=0.0006517033225329654,...)
    },
    6: {
        'model_id': 6.0,
        'rank': 2,
        'cost': 0.4550418898836528,
        'ensemble_weight': 0.3,
        'data_preprocessor': <autosklearn.pipeline.components.data_preprocessing....>,
        'feature_preprocessor': <autosklearn.pipeline.components....>,
        'regressor': <autosklearn.pipeline.components.regression....>,
        'sklearn_regressor': ARDRegression(alpha_1=0.0003701926442639788,...)
    },
    ...
}
- Returns
- Dict(int, Any): dictionary of length = number of models in the ensemble
A dictionary of models in the ensemble, where model_id is the key.
- sprint_statistics()¶
Return the following statistics of the training result:
dataset name
metric used
best validation score
number of target algorithm runs
number of successful target algorithm runs
number of crashed target algorithm runs
number of target algorithm runs that exceeded the memory limit
number of target algorithm runs that exceeded the time limit
- Returns
- str
- class autosklearn.experimental.askl2.AutoSklearn2Classifier(time_left_for_this_task: int = 3600, per_run_time_limit=None, ensemble_size: int | None = None, ensemble_class: AbstractEnsemble | None = <class 'autosklearn.ensembles.ensemble_selection.EnsembleSelection'>, ensemble_kwargs: Dict[str, Any] | None = None, ensemble_nbest: Union[float, int] = 50, max_models_on_disc: int = 50, seed: int = 1, memory_limit: int = 3072, tmp_folder: Optional[str] = None, delete_tmp_folder_after_terminate: bool = True, n_jobs: Optional[int] = None, dask_client: Optional[dask.distributed.Client] = None, disable_evaluator_output: bool = False, smac_scenario_args: Optional[Dict[str, Any]] = None, logging_config: Optional[Dict[str, Any]] = None, metric: Optional[Scorer] = None, scoring_functions: Optional[List[Scorer]] = None, load_models: bool = True, dataset_compression: Union[bool, Mapping[str, Any]] = True, allow_string_features: bool = True, disable_progress_bar: bool = False)[source]¶
- Parameters
- time_left_for_this_task: int, optional (default=3600)
Time limit in seconds for the search of appropriate models. By increasing this value, auto-sklearn has a higher chance of finding better models.
- per_run_time_limit: int, optional (default=1/10 of time_left_for_this_task)
Time limit for a single call to the machine learning model. Model fitting will be terminated if the machine learning algorithm runs over the time limit. Set this value high enough so that typical machine learning algorithms can be fit on the training data.
- ensemble_size: int, optional
Number of models added to the ensemble built by ensemble selection from libraries of models. Models are drawn with replacement. If set to 0, no ensemble is fit.
Deprecated - will be removed in Auto-sklearn 0.16. Please pass this argument via ensemble_kwargs={"ensemble_size": int} if you want to change the ensemble size for ensemble selection.
- ensemble_class: Type[AbstractEnsemble], optional (default=EnsembleSelection)
Class implementing the post-hoc ensemble algorithm. Set to None to disable ensemble building, or use SingleBest to obtain only the single best model instead of an ensemble.
- ensemble_kwargs: Dict, optional
Keyword arguments that are passed to the ensemble class upon initialization.
- max_models_on_disc: int, optional (default=50)
Defines the maximum number of models kept on disc. Any additional models are permanently deleted. Because of this, the value also sets the upper limit on how many models can be used in an ensemble. It must be an integer greater than or equal to 1. If set to None, all models are kept on the disc.
- seed: int, optional (default=1)
Used to seed SMAC. Will determine the output file names.
- memory_limit: int, optional (3072)
Memory limit in MB for the machine learning algorithm. auto-sklearn will stop fitting the machine learning algorithm if it tries to allocate more than memory_limit MB.
Important notes:
If None is provided, no memory limit is set.
In case of multi-processing, memory_limit will be per job, so the total usage is n_jobs x memory_limit.
The memory limit also applies to the ensemble creation process.
- tmp_folder: string, optional (None)
Folder to store configuration output and log files. If None, /tmp/autosklearn_tmp_$pid_$random_number is used automatically.
- delete_tmp_folder_after_terminate: bool, optional (True)
Remove tmp_folder when finished. If tmp_folder is None, the tmp_dir will always be deleted.
- n_jobs: int, optional, experimental
The number of jobs to run in parallel for fit(). -1 means using all processors.
Important notes:
By default, Auto-sklearn uses one core.
Ensemble building is not affected by n_jobs but can be controlled by the number of models in the ensemble.
predict() is not affected by n_jobs (in contrast to most scikit-learn models).
If dask_client is None, a new dask client is created.
- dask_client: dask.distributed.Client, optional
User-created dask client, can be used to start a dask cluster and then attach auto-sklearn to it.
- disable_evaluator_output: bool or list, optional (False)
If True, disable model and prediction output. Cannot be used together with ensemble building. predict() cannot be used when setting this to True. Can also be used as a list to pass more fine-grained information on what to save. Allowed elements in the list are:
'y_optimization' - do not save the predictions for the optimization/validation set, which would later on be used to build an ensemble.
'model' - do not save any model files.
- smac_scenario_args: dict, optional (None)
Additional arguments inserted into the scenario of SMAC. See the SMAC documentation for a list of available arguments.
- logging_config: dict, optional (None)
Dictionary object specifying the logger configuration. If None, the default logging.yaml file is used, which can be found in the directory util/logging.yaml relative to the installation.
- metric: Scorer, optional (None)
An instance of autosklearn.metrics.Scorer as created by autosklearn.metrics.make_scorer(). These are the Built-in Metrics. If None is provided, a default metric is selected depending on the task.
- scoring_functions: List[Scorer], optional (None)
List of scorers which will be calculated for each pipeline. Results will be available via cv_results_.
- load_models: bool, optional (True)
Whether to load the models after fitting Auto-sklearn.
- disable_progress_bar: bool = False
Whether to disable the progress bar that is displayed in the console while fitting to the training data.
- Attributes
- cv_results_: dict of numpy (masked) ndarrays
A dict with keys as column headers and values as columns, that can be imported into a pandas DataFrame.
Not all keys returned by scikit-learn are supported yet.
- fit(X, y, X_test=None, y_test=None, metric=None, feat_type=None, dataset_name=None)[source]¶
Fit auto-sklearn to given training set (X, y).
Fit both optimizes the machine learning models and builds an ensemble out of them.
- Parameters
- X: array-like or sparse matrix of shape = [n_samples, n_features]
The training input samples.
- y: array-like, shape = [n_samples] or [n_samples, n_outputs]
The target classes.
- X_test: array-like or sparse matrix of shape = [n_samples, n_features]
Test data input samples. Will be used to save test predictions for all models. This allows evaluating the performance of Auto-sklearn over time.
- y_test: array-like, shape = [n_samples] or [n_samples, n_outputs]
Test data target classes. Will be used to calculate the test error of all models. This allows evaluating the performance of Auto-sklearn over time.
- feat_type: list, optional (default=None)
List of str of len(X.shape[1]) describing the attribute type. Possible types are Categorical and Numerical. Categorical attributes will be automatically One-Hot encoded. The values used for a categorical attribute must be integers, obtained for example by sklearn.preprocessing.LabelEncoder.
- dataset_name: str, optional (default=None)
Create nicer output. If None, a string will be determined by the md5 hash of the dataset.
- Returns
- self
- fit_ensemble(y, task: int = None, precision: Literal[16, 32, 64] = 32, dataset_name: Optional[str] = None, ensemble_size: int | None = None, ensemble_kwargs: Optional[Dict[str, Any]] = None, ensemble_nbest: Optional[int] = None, ensemble_class: Type[AbstractEnsemble] | Literal['default'] | None = 'default', metric: Scorer | Sequence[Scorer] | None = None)¶
Fit an ensemble to models trained during an optimization process.
All parameters are None by default. If no other value is given, the default values which were set in a call to fit() are used.
- Parameters
- y: array-like
Target values.
- task: int
A constant from the module autosklearn.constants. Determines the task type (binary classification, multiclass classification, multilabel classification or regression).
- precision: int
Numeric precision used when loading ensemble data. Can be either 16, 32 or 64.
- dataset_name: str
Name of the current data set.
- ensemble_size: int, optional
Number of models added to the ensemble built by ensemble selection from libraries of models. Models are drawn with replacement. If set to 0, no ensemble is fit.
Deprecated - will be removed in Auto-sklearn 0.16. Please pass this argument via ensemble_kwargs={"ensemble_size": int} if you want to change the ensemble size for ensemble selection.
- ensemble_kwargs: Dict, optional
Keyword arguments that are passed to the ensemble class upon initialization.
- ensemble_nbest: int
Only consider the ensemble_nbest models when building an ensemble. This is inspired by a concept called library pruning introduced in Getting Most out of Ensemble Selection. This pruning step is independent of the ensemble_class argument and is done prior to constructing an ensemble.
- ensemble_class: Type[AbstractEnsemble] | "default", optional (default="default")
Class implementing the post-hoc ensemble algorithm. Set to None to disable ensemble building, or use SingleBest to obtain only the single best model instead of an ensemble.
If set to "default", EnsembleSelection is used for single-objective problems and MultiObjectiveDummyEnsemble for multi-objective problems.
- metric: Scorer | Sequence[Scorer] | None = None
A metric or list of metrics to score the ensemble with.
- Returns
- self
- fit_pipeline(X: Union[List, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse._base.spmatrix], y: Union[List, pandas.core.series.Series, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse._base.spmatrix], config: Union[ConfigSpace.configuration_space.Configuration, Dict[str, Union[str, float, int]]], dataset_name: Optional[str] = None, X_test: Optional[Union[List, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse._base.spmatrix]] = None, y_test: Optional[Union[List, pandas.core.series.Series, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse._base.spmatrix]] = None, feat_type: Optional[List[str]] = None, *args, **kwargs: Dict) Tuple[Optional[autosklearn.pipeline.base.BasePipeline], smac.runhistory.runhistory.RunInfo, smac.runhistory.runhistory.RunValue] ¶
Fits an individual pipeline configuration and returns the result to the user.
The Estimator constraints are honored, for example the resampling strategy, or memory constraints, unless directly provided to the method. By default, this method supports the same signature as fit(), and any extra arguments are redirected to the TAE evaluation function, which allows for further customization while building a pipeline.
Any additional argument provided is directly passed to the worker exercising the run.
- Parameters
- X: array-like, shape = (n_samples, n_features)
The features used for training
- y: array-like
The labels used for training
- X_test: Optional array-like, shape = (n_samples, n_features)
If provided, the testing performance will be tracked on these features.
- y_test: array-like
If provided, the testing performance will be tracked on these labels.
- config: Union[Configuration, Dict[str, Union[str, float, int]]]
A configuration object used to define the pipeline steps. If a dict is passed, a configuration is created based on this dict.
- dataset_name: Optional[str]
Name that will be used to tag and identify the Auto-Sklearn run.
- feat_type: list, optional (default=None)
List of str of len(X.shape[1]) describing the attribute type. Possible types are Categorical and Numerical. Categorical attributes will be automatically One-Hot encoded. The values used for a categorical attribute must be integers, obtained for example by sklearn.preprocessing.LabelEncoder.
- Returns
- pipeline: Optional[BasePipeline]
The fitted pipeline. In case of failure while fitting the pipeline, a None is returned.
- run_info: RunInfo
A named tuple that contains the configuration launched
- run_value: RunValue
A named tuple that contains the result of the run
- get_configuration_space(X: Union[List, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse._base.spmatrix], y: Union[List, pandas.core.series.Series, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse._base.spmatrix], X_test: Optional[Union[List, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse._base.spmatrix]] = None, y_test: Optional[Union[List, pandas.core.series.Series, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse._base.spmatrix]] = None, dataset_name: Optional[str] = None, feat_type: Optional[List[str]] = None) ConfigSpace.configuration_space.ConfigurationSpace ¶
Returns the Configuration Space object, from which Auto-Sklearn will sample configurations and build pipelines.
- Parameters
- X: array-like or sparse matrix of shape = [n_samples, n_features]
Array with the training features, used to get characteristics like data sparsity
- y: array-like, shape = [n_samples] or [n_samples, n_outputs]
Array with the problem labels
- X_test: array-like or sparse matrix of shape = [n_samples, n_features]
Array with features used for performance estimation
- y_test: array-like, shape = [n_samples] or [n_samples, n_outputs]
Array with the problem labels for the testing split
- dataset_name: Optional[str]
A string to tag the Auto-Sklearn run
- get_models_with_weights()¶
Return a list of the final ensemble found by auto-sklearn.
- Returns
- [(weight_1, model_1), …, (weight_n, model_n)]
- get_params(deep=True)¶
Get parameters for this estimator.
- Parameters
- deep: bool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
- paramsdict
Parameter names mapped to their values.
- leaderboard(detailed: bool = False, ensemble_only: bool = True, top_k: Union[int, Literal['all']] = 'all', sort_by: str = 'cost', sort_order: Literal['auto', 'ascending', 'descending'] = 'auto', include: Optional[Union[str, Iterable[str]]] = None) pandas.core.frame.DataFrame ¶
Returns a pandas table of results for all evaluated models.
Gives an overview of all models trained during the search process along with various statistics about their training.
The available statistics are:
Simple:
"model_id" - The id given to a model by autosklearn.
"rank" - The rank of the model based on its "cost".
"ensemble_weight" - The weight given to the model in the ensemble.
"type" - The type of classifier/regressor used.
"cost" - The loss of the model on the validation set.
"duration" - Length of time the model was optimized for.
Detailed: The detailed view includes all of the simple statistics along with the following.
"config_id" - The id used by SMAC for optimization.
"budget" - How much budget was allocated to this model.
"status" - The return status of training the model with SMAC.
"train_loss" - The loss of the model on the training set.
"balancing_strategy" - The balancing strategy used for data preprocessing.
"start_time" - Time the model began being optimized.
"end_time" - Time the model ended being optimized.
"data_preprocessors" - The preprocessors used on the data.
"feature_preprocessors" - The preprocessors for feature types.
- Parameters
- detailed: bool = False
Whether to give detailed information or just a simple overview.
- ensemble_only: bool = True
Whether to view only models included in the ensemble or all models trained.
- top_k: int or “all” = “all”
How many models to display.
- sort_by: str = 'cost'
What column to sort by. If that column is not present, the sorting defaults to the "model_id" index column.
Defaults to the metric optimized. In case of a multi-objective optimization problem, sorts by the first objective.
- sort_order: "auto" or "ascending" or "descending" = "auto"
Which sort order to apply to the sort_by column. If left as "auto", it will sort by a sensible default where "better" is on top, otherwise defaulting to the pandas default for DataFrame.sort_values if there is no obvious "better".
- include: Optional[str or Iterable[str]]
Items to include; other items not specified will be excluded. The exception is the "model_id" index column, which is always included.
If left as None, it will fall back to using the detailed param to decide the columns to include.
- Returns
- pd.DataFrame
A dataframe of statistics for the models, ordered by sort_by.
- predict(X, batch_size=None, n_jobs=1)¶
Predict classes for X.
- Parameters
- X: array-like or sparse matrix of shape = [n_samples, n_features]
- Returns
- y: array of shape = [n_samples] or [n_samples, n_labels]
The predicted classes.
- predict_proba(X, batch_size=None, n_jobs=1)¶
Predict probabilities of classes for all samples X.
- Parameters
- X: array-like or sparse matrix of shape = [n_samples, n_features]
- batch_size: int (optional)
Number of data points to predict for (predicts all points at once if None).
- n_jobs: int
- Returns
- y: array of shape = [n_samples, n_classes] or [n_samples, n_labels]
The predicted class probabilities.
- refit(X, y)¶
Refit all models found with fit to new data.
Necessary when using cross-validation. During training, auto-sklearn fits each model k times on the dataset, but does not keep any trained model and can therefore not be used to predict for new data points. This method fits all models found during a call to fit on the given data. This method may also be used together with holdout to avoid only using 66% of the training data to fit the final model.
- Parameters
- X: array-like or sparse matrix of shape = [n_samples, n_features]
The training input samples.
- y: array-like, shape = [n_samples] or [n_samples, n_outputs]
The targets.
- Returns
- self
- score(X, y)¶
Return the mean accuracy on the given test data and labels.
In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.
- Parameters
- X: array-like of shape (n_samples, n_features)
Test samples.
- y: array-like of shape (n_samples,) or (n_samples, n_outputs)
True labels for X.
- sample_weight: array-like of shape (n_samples,), default=None
Sample weights.
- Returns
- score: float
Mean accuracy of self.predict(X) w.r.t. y.
- set_params(**params)¶
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
- Parameters
- **params: dict
Estimator parameters.
- Returns
- self: estimator instance
Estimator instance.
- show_models()¶
Returns a dictionary containing dictionaries of ensemble models.
Each model in the ensemble can be accessed by giving its model_id as key.
A model dictionary contains the following:
"model_id" - The id given to a model by autosklearn.
"rank" - The rank of the model based on its "cost".
"cost" - The loss of the model on the validation set.
"ensemble_weight" - The weight given to the model in the ensemble.
"voting_model" - The cv_voting_ensemble model (for 'cv' resampling).
"estimators" - List of models (dicts) in cv_voting_ensemble ('cv' resampling).
"data_preprocessor" - The preprocessor used on the data.
"balancing" - The balancing used on the data (for classification).
"feature_preprocessor" - The preprocessor for feature types.
"classifier" / "regressor" - The autosklearn wrapped classifier or regressor.
"sklearn_classifier" or "sklearn_regressor" - The sklearn classifier or regressor.
Example
import sklearn.datasets
import sklearn.metrics

import autosklearn.regression

X, y = sklearn.datasets.load_diabetes(return_X_y=True)

automl = autosklearn.regression.AutoSklearnRegressor(
    time_left_for_this_task=120
)
automl.fit(X, y, dataset_name='diabetes')

ensemble_dict = automl.show_models()
print(ensemble_dict)
Output:
{
    25: {
        'model_id': 25.0,
        'rank': 1,
        'cost': 0.43667876507897496,
        'ensemble_weight': 0.38,
        'data_preprocessor': <autosklearn.pipeline.components.data_preprocessing....>,
        'feature_preprocessor': <autosklearn.pipeline.components....>,
        'regressor': <autosklearn.pipeline.components.regression....>,
        'sklearn_regressor': SGDRegressor(alpha=0.0006517033225329654,...)
    },
    6: {
        'model_id': 6.0,
        'rank': 2,
        'cost': 0.4550418898836528,
        'ensemble_weight': 0.3,
        'data_preprocessor': <autosklearn.pipeline.components.data_preprocessing....>,
        'feature_preprocessor': <autosklearn.pipeline.components....>,
        'regressor': <autosklearn.pipeline.components.regression....>,
        'sklearn_regressor': ARDRegression(alpha_1=0.0003701926442639788,...)
    },
    ...
}
- Returns
- Dict(int, Any): dictionary of length = number of models in the ensemble
A dictionary of models in the ensemble, where model_id is the key.
- sprint_statistics()¶
Return the following statistics of the training result:
dataset name
metric used
best validation score
number of target algorithm runs
number of successful target algorithm runs
number of crashed target algorithm runs
number of target algorithm runs that exceeded the memory limit
number of target algorithm runs that exceeded the time limit
- Returns
- str
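A minimal sketch of using this class; note from the signature above that it exposes fewer arguments than AutoSklearnClassifier (for example, no include/exclude or resampling options). The dataset and budget are illustrative.

import sklearn.datasets
import sklearn.model_selection

from autosklearn.experimental.askl2 import AutoSklearn2Classifier

X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, random_state=1
)

automl = AutoSklearn2Classifier(time_left_for_this_task=120)
automl.fit(X_train, y_train)
print(automl.score(X_test, y_test))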
Regression¶
- class autosklearn.regression.AutoSklearnRegressor(time_left_for_this_task=3600, per_run_time_limit=None, initial_configurations_via_metalearning=25, ensemble_size: int | None = None, ensemble_class: Type[AbstractEnsemble] | Literal['default'] | None = 'default', ensemble_kwargs: Dict[str, Any] | None = None, ensemble_nbest=50, max_models_on_disc=50, seed=1, memory_limit=3072, include: Optional[Dict[str, List[str]]] = None, exclude: Optional[Dict[str, List[str]]] = None, resampling_strategy='holdout', resampling_strategy_arguments=None, tmp_folder=None, delete_tmp_folder_after_terminate=True, n_jobs: Optional[int] = None, dask_client: Optional[dask.distributed.Client] = None, disable_evaluator_output=False, get_smac_object_callback=None, smac_scenario_args=None, logging_config=None, metadata_directory=None, metric: Scorer | Sequence[Scorer] | None = None, scoring_functions: Optional[List[Scorer]] = None, load_models: bool = True, get_trials_callback: SMACCallback | None = None, dataset_compression: Union[bool, Mapping[str, Any]] = True, allow_string_features: bool = True, disable_progress_bar: bool = False)[source]¶
This class implements the regression task.
- Parameters
- time_left_for_this_task: int, optional (default=3600)
Time limit in seconds for the search of appropriate models. By increasing this value, auto-sklearn has a higher chance of finding better models.
- per_run_time_limit: int, optional (default=1/10 of time_left_for_this_task)
Time limit for a single call to the machine learning model. Model fitting will be terminated if the machine learning algorithm runs over the time limit. Set this value high enough so that typical machine learning algorithms can be fit on the training data.
- initial_configurations_via_metalearning: int, optional (default=25)
Initialize the hyperparameter optimization algorithm with this many configurations which worked well on previously seen datasets. Disable if the hyperparameter optimization algorithm should start from scratch.
- ensemble_size: int, optional
Number of models added to the ensemble built by ensemble selection from libraries of models. Models are drawn with replacement. If set to 0, no ensemble is fit.
Deprecated - will be removed in Auto-sklearn 0.16. Please pass this argument via ensemble_kwargs={"ensemble_size": int} if you want to change the ensemble size for ensemble selection.
- ensemble_class: Type[AbstractEnsemble] | "default", optional (default="default")
Class implementing the post-hoc ensemble algorithm. Set to None to disable ensemble building, or use SingleBest to obtain only the single best model instead of an ensemble.
If set to "default", EnsembleSelection is used for single-objective problems and MultiObjectiveDummyEnsemble for multi-objective problems.
- ensemble_kwargs: Dict, optional
Keyword arguments that are passed to the ensemble class upon initialization.
- ensemble_nbest: int, optional (default=50)
Only consider the ensemble_nbest models when building an ensemble. This is inspired by a concept called library pruning introduced in Getting Most out of Ensemble Selection. This pruning step is independent of the ensemble_class argument and is done prior to constructing an ensemble.
- max_models_on_disc: int, optional (default=50)
Defines the maximum number of models kept on disc. Any additional models are permanently deleted. Because of this, the value also sets the upper limit on how many models can be used in an ensemble. It must be an integer greater than or equal to 1. If set to None, all models are kept on the disc.
- seed: int, optional (default=1)
Used to seed SMAC. Will determine the output file names.
- memory_limit: int, optional (3072)
Memory limit in MB for the machine learning algorithm. auto-sklearn will stop fitting the machine learning algorithm if it tries to allocate more than memory_limit MB.
Important notes:
If None is provided, no memory limit is set.
In case of multi-processing, memory_limit will be per job, so the total usage is n_jobs x memory_limit.
The memory limit also applies to the ensemble creation process.
- include: Optional[Dict[str, List[str]]] = None
If None, all possible algorithms are used.
Otherwise, specifies a step and the components that are included in the search. See /pipeline/components/<step>/* for available components.
Incompatible with parameter exclude.
Possible Steps:
"data_preprocessor"
"balancing"
"feature_preprocessor"
"classifier" - only when using AutoSklearnClassifier
"regressor" - only when using AutoSklearnRegressor
Example:
include = { 'classifier': ["random_forest"], 'feature_preprocessor': ["no_preprocessing"] }
- exclude: Optional[Dict[str, List[str]]] = None
If None, all possible algorithms are used.
Otherwise, specifies a step and the components that are excluded from the search. See /pipeline/components/<step>/* for available components.
Incompatible with parameter include.
Possible Steps:
"data_preprocessor"
"balancing"
"feature_preprocessor"
"classifier" - only when using AutoSklearnClassifier
"regressor" - only when using AutoSklearnRegressor
Example:
exclude = { 'classifier': ["random_forest"], 'feature_preprocessor': ["no_preprocessing"] }
- resampling_strategy: str | BaseCrossValidator | _RepeatedSplits | BaseShuffleSplit = "holdout"
How to handle overfitting; might need resampling_strategy_arguments if using a "cv"-based method or a Splitter object.
- Options
"holdout" - use a 67:33 (train:test) split
"cv" - perform cross-validation; requires "folds" in resampling_strategy_arguments
"holdout-iterative-fit" - same as "holdout", but iterative fit where possible
"cv-iterative-fit" - same as "cv", but iterative fit where possible
"partial-cv" - same as "cv", but uses intensification
BaseCrossValidator - any BaseCrossValidator subclass (found in the scikit-learn model_selection module)
_RepeatedSplits - any _RepeatedSplits subclass (found in the scikit-learn model_selection module)
BaseShuffleSplit - any BaseShuffleSplit subclass (found in the scikit-learn model_selection module)
If using a Splitter object that relies on the dataset retaining its current size and order, you will need to look at the dataset_compression argument and ensure that "subsample" is not included in the applied compression "methods", or disable dataset compression entirely with False.
- resampling_strategy_arguments: Optional[Dict] = None
Additional arguments for resampling_strategy; required if using a "cv"-based strategy. The default arguments if left as None are:
{
    "train_size": 0.67,    # The size of the training set
    "shuffle": True,       # Whether to shuffle before splitting data
    "folds": 5             # Used in 'cv' based resampling strategies
}
If using a custom splitter class which takes n_splits (such as PredefinedSplit), the value of "folds" will be used.
- tmp_folder: string, optional (None)
Folder to store configuration output and log files. If None, /tmp/autosklearn_tmp_$pid_$random_number is used automatically.
- delete_tmp_folder_after_terminate: bool, optional (True)
Remove tmp_folder when finished. If tmp_folder is None, the tmp_dir will always be deleted.
- n_jobs: int, optional, experimental
The number of jobs to run in parallel for fit(). -1 means using all processors.
Important notes:
By default, Auto-sklearn uses one core.
Ensemble building is not affected by n_jobs but can be controlled by the number of models in the ensemble.
predict() is not affected by n_jobs (in contrast to most scikit-learn models).
If dask_client is None, a new dask client is created.
- dask_client: dask.distributed.Client, optional
User-created dask client, can be used to start a dask cluster and then attach auto-sklearn to it.
- disable_evaluator_output: bool or list, optional (False)
If True, disable model and prediction output. Cannot be used together with ensemble building. predict() cannot be used when setting this to True. Can also be used as a list to pass more fine-grained information on what to save. Allowed elements in the list are:
'y_optimization' - do not save the predictions for the optimization set, which would later on be used to build an ensemble.
'model' - do not save any model files.
- smac_scenario_args: dict, optional (None)
Additional arguments inserted into the scenario of SMAC. See the SMAC documentation for a list of available arguments.
- get_smac_object_callbackcallable
Callback function to create an object of class smac.facade.AbstractFacade. The function must accept the arguments scenario_dict, instances, num_params, runhistory, seed and ta. This is an advanced feature. Use only if you are familiar with SMAC.
- logging_config: dict, optional (None)
Dictionary object specifying the logger configuration. If None, the default logging.yaml file is used, which can be found in the directory util/logging.yaml relative to the installation.
- metadata_directory: str, optional (None)
path to the metadata directory. If None, the default directory (autosklearn.metalearning.files) is used.
- metricScorer, optional (None)
An instance of autosklearn.metrics.Scorer as created by autosklearn.metrics.make_scorer(). These are the Built-in Metrics. If None is provided, a default metric is selected depending on the task.
- scoring_functions: List[Scorer], optional (None)
List of scorers which will be calculated for each pipeline; results will be available via cv_results_.
- load_modelsbool, optional (True)
Whether to load the models after fitting Auto-sklearn.
- get_trials_callback: callable
A callable with the following definition.
(smac.SMBO, smac.RunInfo, smac.RunValue, time_left: float) -> bool | None
This will be called after SMAC, the underlying optimizer for autosklearn, finishes training each run.
You can use this to record your own information about the optimization process, or to enable early stopping based on some criteria, as sketched below.
See the example: Early Stopping And Callbacks.
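A minimal sketch of such a callback (the 0.02 cost threshold is purely illustrative):

import autosklearn.regression

# Returning False from the callback stops the SMAC optimization loop.
def stop_early(smbo, run_info, run_value, time_left):
    if run_value.cost < 0.02:  # illustrative threshold
        return False

automl = autosklearn.regression.AutoSklearnRegressor(
    time_left_for_this_task=120,
    get_trials_callback=stop_early,
)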
- dataset_compression: Union[bool, Mapping[str, Any]] = True
We compress datasets so that they fit into some predefined amount of memory. Currently this does not apply to dataframes or sparse arrays, only to raw numpy arrays.
NOTE - If using a custom resampling_strategy that relies on a specific size or ordering of data, this must be disabled to preserve these properties.
You can disable this entirely by passing False, or leave it as the default True to use the configuration below:
{
    "memory_allocation": 0.1,
    "methods": ["precision", "subsample"]
}
You can also pass your own configuration with the same keys, choosing from the available "methods". The available options are described here:
- memory_allocation
By default, we attempt to fit the dataset into 0.1 * memory_limit. This float value can be set with "memory_allocation": 0.1. We also allow for specifying absolute memory in MB, e.g. 10MB is "memory_allocation": 10.
The memory used by the dataset is checked after each reduction method is performed. If the dataset fits into the allocated memory, any further methods listed in "methods" will not be performed.
For example, if methods: ["precision", "subsample"] and the "precision" reduction step was enough to make the dataset fit into memory, then the "subsample" reduction step will not be performed.
- methods
We provide the following methods for reducing the dataset size. These can be provided in a list and are performed in the order given.
"precision" - We reduce floating point precision as follows:
np.float128 -> np.float64
np.float96 -> np.float64
np.float64 -> np.float32
"subsample" - We subsample the data so that it fits directly into the memory allocation memory_allocation * memory_limit. Therefore, this should likely be the last method listed in "methods". Subsampling takes into account classification labels and stratifies accordingly. We guarantee that at least one occurrence of each label is included in the sampled set.
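For instance, a minimal sketch of a custom compression configuration (the 512 MB allocation is purely illustrative):

import autosklearn.regression

automl = autosklearn.regression.AutoSklearnRegressor(
    time_left_for_this_task=120,
    # Absolute allocation of 512 MB; only reduce precision, never subsample,
    # so the dataset size and ordering are preserved.
    dataset_compression={"memory_allocation": 512, "methods": ["precision"]},
)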
- allow_string_features: bool = True
Whether autosklearn should process string features. By default, text preprocessing is enabled.
- disable_progress_bar: bool = False
Whether to disable the progress bar that is displayed in the console while fitting to the training data.
- Attributes
- cv_results_dict of numpy (masked) ndarrays
A dict with keys as column headers and values as columns, which can be imported into a pandas DataFrame.
Not all keys returned by scikit-learn are supported yet.
- performance_over_time_pandas.core.frame.DataFrame
A DataFrame containing the models' performance-over-time data, which can be used directly for plotting. Please refer to the example Train and Test Inputs, and the sketch below.
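A minimal sketch of populating and plotting this attribute (X_train, y_train, X_test, y_test are assumed to already exist; the "Timestamp" column name follows the Train and Test Inputs example):

import autosklearn.regression

automl = autosklearn.regression.AutoSklearnRegressor(time_left_for_this_task=120)
automl.fit(X_train, y_train, X_test=X_test, y_test=y_test)

# One row per point in time; includes train/test scores of the ensemble.
# Plotting requires matplotlib to be installed.
automl.performance_over_time_.plot(x="Timestamp", kind="line")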
- fit(X, y, X_test=None, y_test=None, feat_type=None, dataset_name=None)[source]¶
Fit Auto-sklearn to given training set (X, y).
Fit both optimizes the machine learning models and builds an ensemble out of them.
- Parameters
- Xarray-like or sparse matrix of shape = [n_samples, n_features]
The training input samples.
- yarray-like, shape = [n_samples] or [n_samples, n_targets]
The regression target.
- X_testarray-like or sparse matrix of shape = [n_samples, n_features]
Test data input samples. Will be used to save test predictions for all models. This allows evaluating the performance of Auto-sklearn over time.
- y_testarray-like, shape = [n_samples] or [n_samples, n_targets]
The regression target. Will be used to calculate the test error of all models. This allows evaluating the performance of Auto-sklearn over time.
- feat_typelist, optional (default=None)
List of str of len(X.shape[1]) describing the attribute type. Possible types are Categorical and Numerical. Categorical attributes will be automatically One-Hot encoded.
- dataset_namestr, optional (default=None)
Create nicer output. If None, a string will be determined by the md5 hash of the dataset.
- Returns
- self
- fit_ensemble(y, task: int = None, precision: Literal[16, 32, 64] = 32, dataset_name: Optional[str] = None, ensemble_size: int | None = None, ensemble_kwargs: Optional[Dict[str, Any]] = None, ensemble_nbest: Optional[int] = None, ensemble_class: Type[AbstractEnsemble] | Literal['default'] | None = 'default', metric: Scorer | Sequence[Scorer] | None = None)¶
Fit an ensemble to models trained during an optimization process.
All parameters are None by default. If no other value is given, the default values which were set in a call to fit() are used.
- Parameters
- yarray-like
Target values.
- taskint
A constant from the module
autosklearn.constants
. Determines the task type (binary classification, multiclass classification, multilabel classification or regression).- precisionint
Numeric precision used when loading ensemble data. Can be either 16, 32 or 64.
- dataset_name: str
Name of the current data set.
- ensemble_sizeint, optional
Number of models added to the ensemble built by Ensemble selection from libraries of models. Models are drawn with replacement. If set to 0, no ensemble is fit.
Deprecated - will be removed in Auto-sklearn 0.16. Please pass this argument via ensemble_kwargs={"ensemble_size": int} if you want to change the ensemble size for ensemble selection.
Keyword arguments that are passed to the ensemble class upon initialization.
- ensemble_nbestint
Only consider the ensemble_nbest models when building an ensemble. This is inspired by a concept called library pruning introduced in Getting Most out of Ensemble Selection. This is independent of the ensemble_class argument, and this pruning step is done prior to constructing an ensemble.
- ensemble_class: Type[AbstractEnsemble] | "default", optional (default="default")
Class implementing the post-hoc ensemble algorithm. Set to None to disable ensemble building, or use SingleBest to use only the single best model instead of an ensemble.
If set to "default" it will use EnsembleSelection for single-objective problems and MultiObjectiveDummyEnsemble for multi-objective problems.
- metric: Scorer | Sequence[Scorer] | None = None
A metric or list of metrics to score the ensemble with.
- Returns
- self
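A minimal sketch of searching without an ensemble and building one post hoc (X and y are assumed to already exist):

import autosklearn.regression

automl = autosklearn.regression.AutoSklearnRegressor(
    time_left_for_this_task=120,
    ensemble_class=None,  # search only; skip ensemble construction
)
automl.fit(X, y)

# Build an ensemble afterwards from the models found during the search.
automl.fit_ensemble(y, ensemble_class="default", ensemble_nbest=25)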
- fit_pipeline(X: Union[List, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse._base.spmatrix], y: Union[List, pandas.core.series.Series, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse._base.spmatrix], config: Union[ConfigSpace.configuration_space.Configuration, Dict[str, Union[str, float, int]]], dataset_name: Optional[str] = None, X_test: Optional[Union[List, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse._base.spmatrix]] = None, y_test: Optional[Union[List, pandas.core.series.Series, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse._base.spmatrix]] = None, feat_type: Optional[List[str]] = None, *args, **kwargs: Dict) Tuple[Optional[autosklearn.pipeline.base.BasePipeline], smac.runhistory.runhistory.RunInfo, smac.runhistory.runhistory.RunValue] ¶
Fits an individual pipeline configuration and returns the result to the user.
The Estimator constraints are honored, for example the resampling strategy, or memory constraints, unless directly provided to the method. By default, this method supports the same signature as fit(), and any extra arguments are redirected to the TAE evaluation function, which allows for further customization while building a pipeline.
Any additional argument provided is directly passed to the worker exercising the run.
- Parameters
- X: array-like, shape = (n_samples, n_features)
The features used for training
- y: array-like
The labels used for training
- X_test: Optional array-like, shape = (n_samples, n_features)
If provided, the testing performance will be tracked on these features.
- y_test: array-like
If provided, the testing performance will be tracked on these labels.
- config: Union[Configuration, Dict[str, Union[str, float, int]]]
A configuration object used to define the pipeline steps. If a dict is passed, a configuration is created based on this dict.
- dataset_name: Optional[str]
Name used to tag and identify the Auto-Sklearn run.
- feat_typelist, optional (default=None)
List of str of len(X.shape[1]) describing the attribute type. Possible types are Categorical and Numerical. Categorical attributes will be automatically One-Hot encoded. The values used for a categorical attribute must be integers, obtained for example by sklearn.preprocessing.LabelEncoder.
- Returns
- pipeline: Optional[BasePipeline]
The fitted pipeline. In case of failure while fitting the pipeline, a None is returned.
- run_info: RunInfo
A named tuple that contains the launched configuration.
- run_value: RunValue
A named tuple that contains the result of the run
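A minimal sketch of evaluating one configuration sampled from the search space (X and y are assumed to already exist):

import autosklearn.regression

automl = autosklearn.regression.AutoSklearnRegressor(time_left_for_this_task=120)

cs = automl.get_configuration_space(X, y)
config = cs.sample_configuration()  # one random configuration

pipeline, run_info, run_value = automl.fit_pipeline(X=X, y=y, config=config)
if pipeline is not None:
    print(run_value.cost)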
- get_configuration_space(X: Union[List, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse._base.spmatrix], y: Union[List, pandas.core.series.Series, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse._base.spmatrix], X_test: Optional[Union[List, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse._base.spmatrix]] = None, y_test: Optional[Union[List, pandas.core.series.Series, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse._base.spmatrix]] = None, dataset_name: Optional[str] = None, feat_type: Optional[List[str]] = None) ConfigSpace.configuration_space.ConfigurationSpace ¶
Returns the Configuration Space object, from which Auto-Sklearn will sample configurations and build pipelines.
- Parameters
- Xarray-like or sparse matrix of shape = [n_samples, n_features]
Array with the training features, used to get characteristics like data sparsity
- yarray-like, shape = [n_samples] or [n_samples, n_outputs]
Array with the problem labels
- X_testarray-like or sparse matrix of shape = [n_samples, n_features]
Array with features used for performance estimation
- y_testarray-like, shape = [n_samples] or [n_samples, n_outputs]
Array with the problem labels for the testing split
- dataset_name: Optional[str]
A string to tag the Auto-Sklearn run
- get_models_with_weights()¶
Return the final ensemble found by auto-sklearn as a list of (weight, model) pairs.
- Returns
- [(weight_1, model_1), …, (weight_n, model_n)]
- get_params(deep=True)¶
Get parameters for this estimator.
- Parameters
- deepbool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
- paramsdict
Parameter names mapped to their values.
- leaderboard(detailed: bool = False, ensemble_only: bool = True, top_k: Union[int, Literal['all']] = 'all', sort_by: str = 'cost', sort_order: Literal['auto', 'ascending', 'descending'] = 'auto', include: Optional[Union[str, Iterable[str]]] = None) pandas.core.frame.DataFrame ¶
Returns a pandas table of results for all evaluated models.
Gives an overview of all models trained during the search process along with various statistics about their training.
The available statistics are:
Simple:
"model_id"
- The id given to a model byautosklearn
."rank"
- The rank of the model based on it’s"cost"
."ensemble_weight"
- The weight given to the model in the ensemble."type"
- The type of classifier/regressor used."cost"
- The loss of the model on the validation set."duration"
- Length of time the model was optimized for.
Detailed: The detailed view includes all of the simple statistics along with the following.
"config_id"
- The id used by SMAC for optimization."budget"
- How much budget was allocated to this model."status"
- The return status of training the model with SMAC."train_loss"
- The loss of the model on the training set."balancing_strategy"
- The balancing strategy used for data preprocessing."start_time"
- Time the model began being optimized"end_time"
- Time the model ended being optimized"data_preprocessors"
- The preprocessors used on the data"feature_preprocessors"
- The preprocessors for features types
- Parameters
- detailed: bool = False
Whether to give detailed information or just a simple overview.
- ensemble_only: bool = True
Whether to view only models included in the ensemble or all models trained.
- top_k: int or “all” = “all”
How many models to display.
- sort_by: str = ‘cost’
What column to sort by. If that column is not present, the sorting defaults to the "model_id" index column.
Defaults to the metric optimized. Sorts by the first objective in case of a multi-objective optimization problem.
- sort_order: “auto” or “ascending” or “descending” = “auto”
Which sort order to apply to the
sort_by
column. If left as"auto"
, it will sort by a sensible default where “better” is on top, otherwise defaulting to the pandas default for DataFrame.sort_values if there is no obvious “better”.- include: Optional[str or Iterable[str]]
Items to include; other items not specified will be excluded. The exception is the "model_id" index column, which is always included.
If left as None, it will fall back to using the detailed param to decide the columns to include.
- Returns
- pd.DataFrame
A dataframe of statistics for the models, ordered by
sort_by
.
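A minimal sketch of inspecting a fitted estimator (automl is assumed to have been fit already):

# Top 10 models in the final ensemble, best first.
print(automl.leaderboard(top_k=10))

# Every model trained during the search, with detailed statistics.
print(automl.leaderboard(detailed=True, ensemble_only=False, sort_by="train_loss"))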
- predict(X, batch_size=None, n_jobs=1)[source]¶
Predict regression target for X.
- Parameters
- Xarray-like or sparse matrix of shape = [n_samples, n_features]
- Returns
- yarray of shape = [n_samples] or [n_samples, n_outputs]
The predicted values.
- refit(X, y)¶
Refit all models found with fit to new data.
Necessary when using cross-validation. During training, auto-sklearn fits each model k times on the dataset, but does not keep any trained model, and can therefore not be used to predict for new data points. This method fits all models found during a call to fit on the given data. This method may also be used together with holdout to avoid only using 67% of the training data to fit the final model.
- Parameters
- Xarray-like or sparse matrix of shape = [n_samples, n_features]
The training input samples.
- yarray-like, shape = [n_samples] or [n_samples, n_outputs]
The targets.
- Returns
- self
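A minimal sketch of the cross-validation workflow that requires this method (X, y and X_new are assumed to already exist):

import autosklearn.regression

automl = autosklearn.regression.AutoSklearnRegressor(
    time_left_for_this_task=120,
    resampling_strategy="cv",
    resampling_strategy_arguments={"folds": 5},
)
automl.fit(X, y)
automl.refit(X, y)  # retrain each model once on the full training data
predictions = automl.predict(X_new)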
- score(X, y)¶
Return the coefficient of determination \(R^2\) of the prediction.
The coefficient \(R^2\) is defined as \((1 - \frac{u}{v})\), where \(u\) is the residual sum of squares
((y_true - y_pred) ** 2).sum()
and \(v\) is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get an \(R^2\) score of 0.0.
- Parameters
- Xarray-like of shape (n_samples, n_features)
Test samples. For some estimators this may be a precomputed kernel matrix or a list of generic objects instead with shape
(n_samples, n_samples_fitted)
, wheren_samples_fitted
is the number of samples used in the fitting for the estimator.- yarray-like of shape (n_samples,) or (n_samples, n_outputs)
True values for X.
- sample_weightarray-like of shape (n_samples,), default=None
Sample weights.
- Returns
- scorefloat
\(R^2\) of
self.predict(X)
wrt. y.
Notes
The \(R^2\) score used when calling
score
on a regressor usesmultioutput='uniform_average'
from version 0.23 to keep consistent with default value ofr2_score()
. This influences thescore
method of all the multioutput regressors (except forMultiOutputRegressor
).
- set_params(**params)¶
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as
Pipeline
). The latter have parameters of the form<component>__<parameter>
so that it’s possible to update each component of a nested object.- Parameters
- **paramsdict
Estimator parameters.
- Returns
- selfestimator instance
Estimator instance.
- show_models()¶
Returns a dictionary containing dictionaries of ensemble models.
Each model in the ensemble can be accessed by giving its model_id as key.
A model dictionary contains the following:
"model_id"
- The id given to a model byautosklearn
."rank"
- The rank of the model based on it’s"cost"
."cost"
- The loss of the model on the validation set."ensemble_weight"
- The weight given to the model in the ensemble."voting_model"
- Thecv_voting_ensemble
model (for ‘cv’ resampling)."estimators"
- List of models (dicts) incv_voting_ensemble
(‘cv’ resampling).
"data_preprocessor"
- The preprocessor used on the data."balancing"
- The balancing used on the data (for classification)."feature_preprocessor"
- The preprocessor for features types."classifier"
/"regressor"
- The autosklearn wrapped classifier or regressor."sklearn_classifier"
or"sklearn_regressor"
- The sklearn classifier or regressor.
Example
import sklearn.datasets
import sklearn.metrics
import autosklearn.regression

X, y = sklearn.datasets.load_diabetes(return_X_y=True)

automl = autosklearn.regression.AutoSklearnRegressor(
    time_left_for_this_task=120
)
automl.fit(X, y, dataset_name='diabetes')

ensemble_dict = automl.show_models()
print(ensemble_dict)
Output:
{
    25: {
        'model_id': 25.0,
        'rank': 1,
        'cost': 0.43667876507897496,
        'ensemble_weight': 0.38,
        'data_preprocessor': <autosklearn.pipeline.components.data_preprocessing....>,
        'feature_preprocessor': <autosklearn.pipeline.components....>,
        'regressor': <autosklearn.pipeline.components.regression....>,
        'sklearn_regressor': SGDRegressor(alpha=0.0006517033225329654,...)
    },
    6: {
        'model_id': 6.0,
        'rank': 2,
        'cost': 0.4550418898836528,
        'ensemble_weight': 0.3,
        'data_preprocessor': <autosklearn.pipeline.components.data_preprocessing....>,
        'feature_preprocessor': <autosklearn.pipeline.components....>,
        'regressor': <autosklearn.pipeline.components.regression....>,
        'sklearn_regressor': ARDRegression(alpha_1=0.0003701926442639788,...)
    },
    ...
}
- Returns
- Dict(int, Any)dictionary of length = number of models in the ensemble
A dictionary of models in the ensemble, where
model_id
is the key.
- sprint_statistics()¶
Return the following statistics of the training result:
dataset name
metric used
best validation score
number of target algorithm runs
number of successful target algorithm runs
number of crashed target algorithm runs
number of target algorithm runs that exceeded the memory limit
number of target algorithm runs that exceeded the time limit
- Returns
- str
Metrics¶
- autosklearn.metrics.make_scorer(name: str, score_func: Callable, *, optimum: float = 1.0, worst_possible_result: float = 0.0, greater_is_better: bool = True, needs_proba: bool = False, needs_threshold: bool = False, needs_X: bool = False, **kwargs: Any) autosklearn.metrics.Scorer [source]¶
Make a scorer from a performance metric or loss function.
Factory inspired by scikit-learn which wraps scikit-learn scoring functions to be used in auto-sklearn.
- Parameters
- name: str
Descriptive name of the metric
- score_funccallable
Score function (or loss function) with signature score_func(y, y_pred, **kwargs).
- optimum: int or float, default=1
The best score achievable by the score function, i.e. maximum in case of scorer function and minimum in case of loss function.
- worst_possible_result: int or float, default=0
The worst score achievable by the score function, i.e. minimum in case of scorer function and maximum in case of loss function.
- greater_is_betterboolean, default=True
Whether score_func is a score function (default), meaning high is good, or a loss function, meaning low is good. In the latter case, the scorer object will sign-flip the outcome of the score_func.
- needs_probaboolean, default=False
Whether score_func requires predict_proba to get probability estimates out of a classifier.
- needs_thresholdboolean, default=False
Whether score_func takes a continuous decision certainty. This only works for binary classification.
- needs_Xboolean, default=False
Whether score_func requires X in __call__ to compute a metric.
- **kwargsadditional arguments
Additional parameters to be passed to score_func.
- Returns
- scorercallable
Callable object that returns a scalar score where greater is better (pass greater_is_better=False for loss functions).
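A minimal sketch of wrapping a scikit-learn loss function (the sentinel worst-case value is an assumption, since MAE is unbounded above):

import sklearn.metrics
import autosklearn.metrics
import autosklearn.regression

# MAE is a loss: its optimum is 0 and lower is better.
mae_scorer = autosklearn.metrics.make_scorer(
    name="mae",
    score_func=sklearn.metrics.mean_absolute_error,
    optimum=0,
    worst_possible_result=1e10,  # large sentinel; MAE has no upper bound
    greater_is_better=False,
)

automl = autosklearn.regression.AutoSklearnRegressor(
    time_left_for_this_task=120,
    metric=mae_scorer,
)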
Built-in Metrics¶
Classification metrics¶
Note: The default autosklearn.metrics.f1
, autosklearn.metrics.precision
and autosklearn.metrics.recall
built-in metrics are applicable only for binary classification. In order to apply them on multilabel and multiclass
classification, please use the corresponding metrics with an appropriate averaging mechanism, such as autosklearn.metrics.f1_macro
.
For more information about how these metrics are used, please read
this scikit-learn documentation.
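For example, a short sketch of choosing a multiclass-appropriate metric:

import autosklearn.classification
import autosklearn.metrics

# Plain f1 is binary-only; f1_macro averages over all classes.
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=120,
    metric=autosklearn.metrics.f1_macro,
)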
- autosklearn.metrics.accuracy¶
alias of accuracy
- autosklearn.metrics.balanced_accuracy¶
alias of balanced_accuracy
- autosklearn.metrics.f1¶
alias of f1
- autosklearn.metrics.f1_macro¶
alias of f1_macro
- autosklearn.metrics.f1_micro¶
alias of f1_micro
- autosklearn.metrics.f1_samples¶
alias of f1_samples
- autosklearn.metrics.f1_weighted¶
alias of f1_weighted
- autosklearn.metrics.roc_auc¶
alias of roc_auc
- autosklearn.metrics.precision¶
alias of precision
- autosklearn.metrics.precision_macro¶
alias of precision_macro
- autosklearn.metrics.precision_micro¶
alias of precision_micro
- autosklearn.metrics.precision_samples¶
alias of precision_samples
- autosklearn.metrics.precision_weighted¶
alias of precision_weighted
- autosklearn.metrics.average_precision¶
alias of average_precision
- autosklearn.metrics.recall¶
alias of recall
- autosklearn.metrics.recall_macro¶
alias of recall_macro
- autosklearn.metrics.recall_micro¶
alias of recall_micro
- autosklearn.metrics.recall_samples¶
alias of recall_samples
- autosklearn.metrics.recall_weighted¶
alias of recall_weighted
- autosklearn.metrics.log_loss¶
alias of log_loss
Extension Interfaces¶
- class autosklearn.pipeline.components.base.AutoSklearnClassificationAlgorithm[source]¶
Provide an abstract interface for classification algorithms in auto-sklearn.
See Extending auto-sklearn for more information.
- get_estimator()[source]¶
Return the underlying estimator object.
- Returns
- estimatorthe underlying estimator object
- predict(X)[source]¶
The predict function calls the predict function of the underlying scikit-learn model and returns an array with the predictions.
- Parameters
- Xarray-like, shape = (n_samples, n_features)
- Returns
- array, shape = (n_samples,) or shape = (n_samples, n_labels)
Returns the predicted values
Notes
Please see the scikit-learn API documentation for further information.
- class autosklearn.pipeline.components.base.AutoSklearnRegressionAlgorithm[source]¶
Provide an abstract interface for regression algorithms in auto-sklearn.
Make a subclass of this and put it into the directory autosklearn/pipeline/components/regression to make it available (see the sketch after this class entry).
- get_estimator()[source]¶
Return the underlying estimator object.
- Returns
- estimatorthe underlying estimator object
- predict(X)[source]¶
The predict function calls the predict function of the underlying scikit-learn model and returns an array with the predictions.
- Parameters
- Xarray-like, shape = (n_samples, n_features)
- Returns
- array, shape = (n_samples,) or shape = (n_samples, n_targets)
Returns the predicted values
Notes
Please see the scikit-learn API documentation for further information.
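A rough sketch of such a subclass. The exact signatures of get_properties and get_hyperparameter_search_space vary slightly across versions, the Ridge estimator is purely illustrative, and the add_regressor registration hook is assumed from the Extending auto-sklearn examples:

from ConfigSpace.configuration_space import ConfigurationSpace

from autosklearn.pipeline.components.base import AutoSklearnRegressionAlgorithm
from autosklearn.pipeline.constants import DENSE, SPARSE, UNSIGNED_DATA, PREDICTIONS


class RidgeComponent(AutoSklearnRegressionAlgorithm):
    def __init__(self, random_state=None):
        self.estimator = None
        self.random_state = random_state

    def fit(self, X, y):
        from sklearn.linear_model import Ridge
        self.estimator = Ridge()
        self.estimator.fit(X, y)
        return self

    def predict(self, X):
        if self.estimator is None:
            raise NotImplementedError()
        return self.estimator.predict(X)

    @staticmethod
    def get_properties(dataset_properties=None):
        # Keys as used in the extension examples; values describe this component.
        return {
            "shortname": "Ridge",
            "name": "Ridge Regression",
            "handles_regression": True,
            "handles_classification": False,
            "handles_multiclass": False,
            "handles_multilabel": False,
            "handles_multioutput": False,
            "is_deterministic": True,
            "input": (DENSE, SPARSE, UNSIGNED_DATA),
            "output": (PREDICTIONS,),
        }

    @staticmethod
    def get_hyperparameter_search_space(dataset_properties=None):
        return ConfigurationSpace()  # no tuned hyperparameters in this sketch


# Registration hook, as used in the extension examples:
from autosklearn.pipeline.components.regression import add_regressor
add_regressor(RidgeComponent)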
- class autosklearn.pipeline.components.base.AutoSklearnPreprocessingAlgorithm[source]¶
Provide an abstract interface for preprocessing algorithms in auto-sklearn.
See Extending auto-sklearn for more information.
- get_preprocessor()[source]¶
Return the underlying preprocessor object.
- Returns
- preprocessorthe underlying preprocessor object
- transform(X)[source]¶
The transform function calls the transform function of the underlying scikit-learn model and returns the transformed array.
- Parameters
- Xarray-like, shape = (n_samples, n_features)
- Returns
- Xarray
Return the transformed training data
Notes
Please see the scikit-learn API documentation for further information.
Ensembles¶
Single objective¶
- class autosklearn.ensembles.EnsembleSelection(task_type: int, metrics: Sequence[Scorer] | Scorer, backend: Backend, ensemble_size: int = 50, bagging: bool = False, mode: str = 'fast', random_state: int | np.random.RandomState | None = None)[source]¶
An ensemble of selected algorithms.
Fitting an EnsembleSelection generates an ensemble from the models generated during the search process. It can be further used for prediction.
- Parameters
- task_type: int
An identifier indicating which task is being performed.
- metrics: Sequence[Scorer] | Scorer
The metric used to evaluate the models. If multiple metrics are passed, ensemble selection only optimizes for the first.
- backendBackend
Gives access to the backend of Auto-sklearn. Not used by Ensemble Selection.
- bagging: bool = False
Whether to use bagging in ensemble selection
- mode: str in [‘fast’, ‘slow’] = ‘fast’
Which kind of ensemble generation to use:
'slow' - The original method used in Rich Caruana's ensemble selection.
'fast' - A faster version of Rich Caruana's ensemble selection.
- random_state: int | RandomState | None = None
The random_state used for ensemble selection.
None - Uses numpy’s default RandomState object
int - Successive calls to fit will produce the same results
RandomState - Truly random, each call to fit will produce different results, even with the same object.
References
Ensemble selection from libraries of models. Rich Caruana, Alexandru Niculescu-Mizil, Geoff Crew and Alex Ksikes. ICML 2004.
- fit(base_models_predictions: List[np.ndarray], true_targets: np.ndarray, model_identifiers: List[Tuple[int, int, float]], runs: Sequence[Run], X_data: SUPPORTED_FEAT_TYPES | None = None) EnsembleSelection [source]¶
Fit an ensemble given predictions of base models and targets.
Ensemble building maximizes performance (in contrast to hyperparameter optimization)!
- Parameters
- base_models_predictions: np.ndarray
shape = (n_base_models, n_data_points, n_targets) n_targets is the number of classes in case of classification, n_targets is 0 or 1 in case of regression
Can be a list of 2d numpy arrays as well to prevent copying all predictions into a single, large numpy array.
- X_datalist-like or sparse data
- true_targetsarray of shape [n_targets]
- model_identifiersidentifier for each base model.
Can be used for practical text output of the ensemble.
- runs: Sequence[Run]
Additional information for each run executed by SMAC that was considered by the ensemble builder.
- Returns
- self
- get_identifiers_with_weights() List[Tuple[Tuple[int, int, float], float]] [source]¶
Return (identifier, weight) pairs for all models that were passed to the ensemble builder.
- Parameters
- modelsdict {identifiermodel object}
The identifiers are the same as the ones presented to the fit() method. Models can be used for nice printing.
- Returns
- List[Tuple[Tuple[int, int, float], float]]
- get_models_with_weights(models: Dict[Tuple[int, int, float], autosklearn.pipeline.base.BasePipeline]) List[Tuple[float, autosklearn.pipeline.base.BasePipeline]] [source]¶
List of (weight, model) pairs for all models included in the ensemble.
- Parameters
- modelsdict {identifiermodel object}
The identifiers are the same as the ones presented to the fit() method. Models can be used for nice printing.
- Returns
- List[Tuple[float, BasePipeline]]
- get_selected_model_identifiers() List[Tuple[int, int, float]] [source]¶
Return identifiers of models in the ensemble.
This includes models which have a weight of zero!
- Returns
- list
- get_validation_performance() float [source]¶
Return validation performance of ensemble.
- Returns
- float
- predict(base_models_predictions: Union[numpy.ndarray, List[numpy.ndarray]]) numpy.ndarray [source]¶
Create ensemble predictions from the base model predictions.
- Parameters
- base_models_predictionsnp.ndarray
shape = (n_base_models, n_data_points, n_targets) Same as in the fit method.
- Returns
- np.ndarray
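A minimal sketch of selecting this ensemble class explicitly when constructing an estimator (the ensemble_size value is illustrative):

import autosklearn.regression
from autosklearn.ensembles import EnsembleSelection

automl = autosklearn.regression.AutoSklearnRegressor(
    time_left_for_this_task=120,
    ensemble_class=EnsembleSelection,
    ensemble_kwargs={"ensemble_size": 10},
)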
Single model classes¶
These classes wrap a single model to provide a unified interface in Auto-sklearn.
- class autosklearn.ensembles.SingleBest(task_type: int, metrics: Sequence[Scorer] | Scorer, backend: Backend, random_state: int | np.random.RandomState | None = None)[source]¶
Ensemble consisting of the single best model.
- Parameters
- task_type: int
An identifier indicating which task is being performed.
- metrics: Sequence[Scorer] | Scorer
The metrics used to evaluate the models.
- random_state: int | RandomState | None = None
Not used.
- backendBackend
Gives access to the backend of Auto-sklearn. Not used.
- fit(base_models_predictions: np.ndarray | list[np.ndarray], true_targets: np.ndarray, model_identifiers: list[tuple[int, int, float]], runs: Sequence[Run], X_data: SUPPORTED_FEAT_TYPES | None = None) SingleBest [source]¶
Select the single best model.
- Parameters
- base_models_predictions: np.ndarray
shape = (n_base_models, n_data_points, n_targets) n_targets is the number of classes in case of classification, n_targets is 0 or 1 in case of regression
Can be a list of 2d numpy arrays as well to prevent copying all predictions into a single, large numpy array.
- true_targetsarray of shape [n_targets]
- model_identifiersidentifier for each base model.
Can be used for practical text output of the ensemble.
- runs: Sequence[Run]
Additional information for each run executed by SMAC that was considered by the ensemble builder. Not used.
- X_dataarray-like | sparse matrix | None = None
- Returns
- self
- class autosklearn.ensembles.SingleModelEnsemble(task_type: int, metrics: Sequence[Scorer] | Scorer, backend: Backend, model_index: int, random_state: int | np.random.RandomState | None = None)[source]¶
Ensemble consisting of a single model.
This class is used by the MultiObjectiveDummyEnsemble to represent ensembles consisting of a single model. Do not use this class on its own!
- Parameters
- task_type: int
An identifier indicating which task is being performed.
- metrics: Sequence[Scorer] | Scorer
The metrics used to evaluate the models.
- backendBackend
Gives access to the backend of Auto-sklearn. Not used.
- model_indexint
Index of the model that constitutes the ensemble. This index will be used to select the correct predictions that will be passed during
fit
andpredict
.- random_state: int | RandomState | None = None
Not used.
- fit(base_models_predictions: np.ndarray | list[np.ndarray], true_targets: np.ndarray, model_identifiers: list[tuple[int, int, float]], runs: Sequence[Run], X_data: SUPPORTED_FEAT_TYPES | None = None) SingleModelEnsemble [source]¶
Dummy implementation of the
fit
method.
The actual work of passing the model index is done in the constructor. This method only stores the identifier of the selected model and computes its validation loss.
- Parameters
- base_models_predictions: np.ndarray
shape = (n_base_models, n_data_points, n_targets) n_targets is the number of classes in case of classification, n_targets is 0 or 1 in case of regression
Can be a list of 2d numpy arrays as well to prevent copying all predictions into a single, large numpy array.
- true_targetsarray of shape [n_targets]
- model_identifiersidentifier for each base model.
Can be used for practical text output of the ensemble.
- runs: Sequence[Run]
Additional information for each run executed by SMAC that was considered by the ensemble builder. Not used.
- X_datalist-like | spmatrix | None = None
X data to feed to a metric if it requires it
- Returns
- self
- class autosklearn.ensembles.SingleBestFromRunhistory(task_type: int, metrics: Sequence[Scorer] | Scorer, backend: Backend, run_history: RunHistory, seed: int, random_state: int | np.random.RandomState | None = None)[source]¶
In the case of a crash, this class searches for the best individual model.
Such a model is returned as an ensemble of a single object, to comply with the expected interface of an AbstractEnsemble.
Do not use by yourself!
Multi-objective¶
- class autosklearn.ensembles.MultiObjectiveDummyEnsemble(task_type: int, metrics: Sequence[Scorer] | Scorer, backend: Backend, random_state: int | np.random.RandomState | None = None)[source]¶
A dummy implementation of a multi-objective ensemble.
Builds one single-model ensemble for each individual model on the Pareto front.
- Parameters
- task_type: int
An identifier indicating which task is being performed.
- metrics: Sequence[Scorer] | Scorer
The metrics used to evaluate the models.
- backendBackend
Gives access to the backend of Auto-sklearn. Not used.
- random_state: int | RandomState | None = None
Not used.
- fit(base_models_predictions: list[np.ndarray], true_targets: np.ndarray, model_identifiers: list[tuple[int, int, float]], runs: Sequence[Run], X_data: SUPPORTED_FEAT_TYPES | None = None) MultiObjectiveDummyEnsemble [source]¶
Select dummy ensembles given predictions of base models and targets.
- Parameters
- base_models_predictions: np.ndarray
shape = (n_base_models, n_data_points, n_targets) n_targets is the number of classes in case of classification, n_targets is 0 or 1 in case of regression
Can be a list of 2d numpy arrays as well to prevent copying all predictions into a single, large numpy array.
- true_targetsarray of shape [n_targets]
- model_identifiersidentifier for each base model.
Can be used for practical text output of the ensemble.
- runs: Sequence[Run]
Additional information for each run executed by SMAC that was considered by the ensemble builder. Not used.
- X_datalist-like | sparse matrix | None = None
X data to give to the metric if required
- Returns
- self
- get_identifiers_with_weights() list[tuple[tuple[int, int, float], float]] [source]¶
Return (identifier, weight) pairs for all models that were passed to the ensemble builder, based on the ensemble that is best for the 1st metric.
- Parameters
- modelsdict {identifiermodel object}
The identifiers are the same as the ones presented to the fit() method. Models can be used for nice printing.
- Returns
- list[tuple[tuple[int, int, float], float]]
- get_models_with_weights(models: dict[tuple[int, int, float], BasePipeline]) list[tuple[float, BasePipeline]] [source]¶
Return a list of (weight, model) pairs for the ensemble that is best for the 1st metric.
- Parameters
- modelsdict {identifiermodel object}
The identifiers are the same as the ones presented to the fit() method. Models can be used for nice printing.
- Returns
- list[tuple[float, BasePipeline]]
- get_selected_model_identifiers() list[tuple[int, int, float]] [source]¶
Return identifiers of models in the ensemble that is best for the 1st metric.
This includes models which have a weight of zero!
- Returns
- list
- get_validation_performance() float [source]¶
Validation performance of the ensemble that is best for the 1st metric.
- Returns
- float
- property pareto_set: Sequence[autosklearn.ensembles.abstract_ensemble.AbstractEnsemble]¶
Get a sequence of ensembles that are on the Pareto front.
- Returns
- Sequence[AbstractEnsemble]
- Raises
- SklearnNotFittedError
If fit has not been called and the Pareto set does not exist yet.