FAQ

General

Where can I find examples on how to use auto-sklearn?

We provide examples on using auto-sklearn for multiple use cases ranging from simple classification to advanced uses such as feature importance, parallel runs and customization. They can be found in the Examples.

What type of tasks can auto-sklearn tackle?

auto-sklearn can accept targets for the following tasks (more details on Sklearn algorithms):

  • Binary Classification

  • Multiclass Classification

  • Multilabel Classification

  • Regression

  • Multioutput Regression

You can provide feature and target training pairs (X_train/y_train) to auto-sklearn to fit an ensemble of pipelines as described in the next section. Optionally, you can measure the ability of this fitted model to generalize to unseen data by providing a testing pair (X_test/y_test). For further details, please refer to the Example Performance-over-time plot. Supported formats for these training and testing pairs are: np.ndarray, pd.DataFrame, scipy.sparse.csr_matrix and python lists.

If your data contains categorical values (in the features or targets), auto-sklearn will automatically encode your data using a sklearn.preprocessing.LabelEncoder for unidimensional data and a sklearn.preprocessing.OrdinalEncoder for multidimensional data.

Regarding the features, there are two methods to guide auto-sklearn to properly encode categorical columns:

  • Providing a X_train/X_test numpy array with the optional flag feat_type. For further details, you can check the Example Feature Types.

  • Providing a pandas DataFrame with properly formatted columns. If a column has a numerical dtype, auto-sklearn will not encode it and it will be passed directly to scikit-learn. If the column has a categorical or boolean dtype, it will be encoded. If the column is of any other type (object or time series), an error will be raised. For further details on how to properly encode your data, you can check the Pandas Example Working with categorical data. If you are working with time series, it is recommended that you follow the approach described in Working with time data.
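The DataFrame route described above can be sketched as follows. This is a minimal illustration, assuming pandas is installed; the column names and values are made up for the example. It only prepares the data and does not call auto-sklearn.

```python
import pandas as pd

# Toy training features (made-up values, for illustration only).
X_train = pd.DataFrame(
    {
        "age": [25, 32, 47, 51],                      # numerical: passed through as-is
        "city": ["Berlin", "Paris", "Berlin", "Rome"],  # plain strings: must be converted
        "member": [True, False, True, True],          # boolean: will be encoded
    }
)

# Mark the string column as categorical so that it is encoded
# instead of triggering an error for an unsupported column type.
X_train["city"] = X_train["city"].astype("category")

print(X_train.dtypes)
```

After this conversion every column is numerical, boolean or categorical, which are the column types the text above lists as accepted.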

Regarding the targets (y_train/y_test), if the task involves a classification problem, the targets will be automatically encoded. It is recommended to provide both y_train and y_test during fit, so that a common encoding is created between these splits (if only y_train is provided during fit, the categorical encoder will not be able to handle new classes that are exclusive to y_test). If the task is regression, no encoding happens on the targets.

Where can I find slides and notebooks from talks and tutorials?

We provide resources for talks, tutorials and presentations on auto-sklearn under auto-sklearn-talks.

How should I cite auto-sklearn in a scientific publication?

If you’ve used auto-sklearn in scientific publications, we would appreciate citations.

@inproceedings{feurer-neurips15a,
    title     = {Efficient and Robust Automated Machine Learning},
    author    = {Feurer, Matthias and Klein, Aaron and Eggensperger, Katharina and Springenberg, Jost and Blum, Manuel and Hutter, Frank},
    booktitle = {Advances in Neural Information Processing Systems 28 (2015)},
    pages     = {2962--2970},
    year      = {2015}
}

Or this, if you’ve used auto-sklearn 2.0 in your work:

@article{feurer-arxiv20a,
    title     = {Auto-Sklearn 2.0: Hands-free AutoML via Meta-Learning},
    author    = {Feurer, Matthias and Eggensperger, Katharina and Falkner, Stefan and Lindauer, Marius and Hutter, Frank},
    journal   = {arXiv:2007.04074 [cs.LG]},
    year      = {2020}
}

I want to contribute. What can I do?

This sounds great. Please have a look at our contribution guide.

I have a question which is not answered here. What should I do?

Thanks a lot. We regularly update this section with questions from our issue tracker, so please ask your question there.

Resource Management

How should I set the time and memory limits?

While auto-sklearn alleviates manual hyperparameter tuning, the user still has to set memory and time limits. For most datasets a memory limit of 3GB or 6GB as found on most modern computers is sufficient. For the time limits it is harder to give clear guidelines. If possible, a good default is a total time limit of one day, and a time limit of 30 minutes for a single run.

Further guidelines can be found in auto-sklearn/issues/142.
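The guideline values above could be translated into estimator arguments roughly as follows. This is a sketch, not a recommendation for every dataset: recommended_limits is a hypothetical helper introduced here, and it assumes that time_left_for_this_task and per_run_time_limit are given in seconds and memory_limit in MB. The estimator is only constructed inside build_classifier, so the import of auto-sklearn is deferred until it is actually needed.

```python
def recommended_limits(total_hours=24, per_run_minutes=30, memory_gb=3):
    """Convert human-friendly limits into keyword arguments (hypothetical helper)."""
    return {
        "time_left_for_this_task": int(total_hours * 3600),  # total budget, seconds
        "per_run_time_limit": int(per_run_minutes * 60),      # single run, seconds
        "memory_limit": int(memory_gb * 1024),                # per run, MB
    }


limits = recommended_limits()
print(limits)


def build_classifier(**overrides):
    """Create a classifier with the limits above; extra keyword arguments win."""
    import autosklearn.classification  # imported lazily

    return autosklearn.classification.AutoSklearnClassifier(**{**limits, **overrides})
```

For a quick experiment one might call recommended_limits(total_hours=1, per_run_minutes=5) instead and pass the result to the estimator in the same way.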

How many CPU cores does auto-sklearn use by default?

By default, auto-sklearn uses one core. See also Parallel computation on how to configure this.

How can I run auto-sklearn in parallel?

auto-sklearn supports parallel Bayesian optimization via the use of Dask.distributed. By providing the argument n_jobs to the estimator construction, one can control the number of cores available to auto-sklearn (as shown in the Example Parallel Usage on a single machine). Distributed processes are also supported by providing a custom client object to auto-sklearn, as in the Example Parallel Usage: Spawning workers from the command line. When multiple cores are available, auto-sklearn will create a worker per core and use the available workers both to search for better machine learning models and to build an ensemble with them until the time resource is exhausted.

Note: auto-sklearn requires all workers to have access to a shared file system for storing training data and models.

auto-sklearn employs threadpoolctl to control the number of threads employed by scientific libraries like numpy or scikit-learn. This is done exclusively during the building procedure of models, not during inference. In particular, auto-sklearn allows each pipeline to use at most 1 thread during training. At predicting and scoring time this limitation is not enforced by auto-sklearn. You can control the number of resources employed by the pipelines by setting the following variables in your environment, prior to running auto-sklearn:

$ export OPENBLAS_NUM_THREADS=1
$ export MKL_NUM_THREADS=1
$ export OMP_NUM_THREADS=1

For further information about how scikit-learn handles multiprocessing, please check the Parallelism, resource management, and configuration documentation from the library.

Auto-sklearn is extremely memory hungry in a sequential setting

Auto-sklearn can appear very memory hungry (i.e. requiring a lot of memory for small datasets) due to the use of fork for creating new processes when running sequentially. If this happens in a parallel setting or if you pass your own dask client, it is due to a different issue; see the other entries below.

Let’s go into some more detail and discuss how to fix it: Auto-sklearn executes each machine learning algorithm in its own process to be able to apply a memory limit and a time limit. To start such a process, Python gives three options: fork, forkserver and spawn. The default fork copies the whole process memory into the subprocess. If the main process already uses 1.5GB of main memory and we apply a 3GB memory limit to Auto-sklearn, executing a machine learning pipeline is limited to use at most 1.5GB. We would have loved to use forkserver or spawn as the default option instead. Both copy only relevant data into the subprocess and thereby alleviate the issue of eating up a lot of your main memory, and they do not suffer from potential deadlocks as fork does (see here). However, they have the downside that code must be guarded by if __name__ == "__main__" or executed in a notebook, and we decided that we do not want to require this by default.

There are now two possible solutions:

  1. Use Auto-sklearn in parallel: if you use Auto-sklearn in parallel, it defaults to forkserver, as the parallelization mechanism itself requires the code that uses Auto-sklearn to be guarded. Please find more information on how to do this in the following two examples:

    1. Parallel Usage on a single machine

    2. Parallel Usage: Spawning workers from the command line

    Note

    This requires all code to be guarded by if __name__ == "__main__".

  2. Pass a dask client. If the user passes a dask client, Auto-sklearn can no longer assume that it runs in sequential mode and will use a forkserver to start new processes.

    Note

    This requires all code to be guarded by if __name__ == "__main__".

We therefore suggest using one of the above settings by default.

Auto-sklearn is extremely memory hungry in a parallel setting

When running Auto-sklearn in a parallel setting it starts new processes for evaluating machine learning models using the forkserver mechanism. Code that is in the main script and that is not guarded by if __name__ == "__main__" will be executed for each subprocess. If, for example, you are loading your dataset outside of the guarded code, your dataset will be loaded for each evaluation of a machine learning algorithm, thus filling up your RAM.

We therefore suggest moving all code inside functions or the main block.

Auto-sklearn crashes with a segmentation fault

Please make sure that you have read and followed the Installation section! In case everything is set up correctly, this is most likely due to the dependency pyrfr not being compiled correctly. If this is the case please execute:

import pyrfr.regression as reg
data = reg.default_data_container(64)

If this fails, the pyrfr dependency is most likely not compiled correctly. We advise you to do the following:

  1. Check if you can use a pre-compiled version of the pyrfr to avoid compiling it yourself. We provide pre-compiled versions of the pyrfr on pypi.

  2. Check if the dependencies specified under Installation are correctly installed, especially that you have swig and a C++ compiler.

  3. If you are not yet using Conda, consider using it; it simplifies installation of the correct dependencies.

  4. Install the correct build dependencies before installing the pyrfr. You can check the following github issues for suggestions: 1025, 856.

Results, Log Files and Output

How can I get an overview of the run statistics?

sprint_statistics() is a method that prints the name of the dataset, the metric used, and the best validation score obtained by running auto-sklearn. It additionally prints the number of both successful and unsuccessful algorithm runs.

What was the performance over time?

performance_over_time_ returns a DataFrame containing the model's performance over time, which can be used for plotting directly (here is an example: Performance-over-time plot).

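import matplotlib.pyplot as plt

# `automl` is an auto-sklearn estimator that has already been fitted.
automl.performance_over_time_.plot(
    x='Timestamp',
    kind='line',
    legend=True,
    title='Auto-sklearn accuracy over time',
    grid=True,
)
plt.show()

Which models were evaluated?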

You can see all models evaluated using automl.leaderboard(ensemble_only=False).

Which models are in the final ensemble?

Use either automl.leaderboard(ensemble_only=True) or automl.show_models().

Is there more data I can look at?

cv_results_ returns a dict with keys as column headers and values as columns, that can be imported into a pandas DataFrame, e.g. df = pd.DataFrame(automl.cv_results_)
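As a small illustration of working with such a dict, the sketch below builds a DataFrame and sorts it by score. The dict is a made-up stand-in for automl.cv_results_, and the key names (mean_test_score, mean_fit_time, status) are assumptions about what a run reports; check the keys of your own cv_results_ before relying on them.

```python
import pandas as pd

# Made-up stand-in for automl.cv_results_ (keys map to equally long columns).
cv_results = {
    "mean_test_score": [0.91, 0.87, 0.94],
    "mean_fit_time": [1.2, 0.8, 3.5],
    "status": ["Success", "Success", "Success"],
}

df = pd.DataFrame(cv_results)

# Best configurations first.
best_first = df.sort_values("mean_test_score", ascending=False).reset_index(drop=True)
print(best_first)
```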

Where does Auto-sklearn output files by default?

Auto-sklearn heavily uses the hard drive to store temporary data, models and log files which can be used to inspect the behavior of Auto-sklearn. Each run of Auto-sklearn requires its own directory. If not provided by the user, Auto-sklearn requests a temporary directory from Python, which by default is located under /tmp and starts with autosklearn_tmp_ followed by a random string. By default, this directory is deleted when the Auto-sklearn object is finished fitting. If you want to keep these files you can pass the argument delete_tmp_folder_after_terminate=False to the Auto-sklearn object.

The autosklearn.classification.AutoSklearnClassifier and all other auto-sklearn estimators accept the argument tmp_folder, which changes where such output is written.

There’s an additional argument output_directory which can be passed to Auto-sklearn and it controls where test predictions of the ensemble are stored if the test set is passed to fit().
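A sketch of how the temporary directory could be chosen and kept for later inspection, using the tmp_folder and delete_tmp_folder_after_terminate arguments named above. The directory name autosklearn_run_example is a hypothetical choice; since each run requires its own directory, a fresh path should be used per run. The estimator is only created inside build_classifier, so auto-sklearn is imported lazily.

```python
import os
import tempfile

# Hypothetical run directory below the system temp location.
run_dir = os.path.join(tempfile.gettempdir(), "autosklearn_run_example")

output_kwargs = {
    "tmp_folder": run_dir,                       # where models and logs are written
    "delete_tmp_folder_after_terminate": False,  # keep the files after fitting
}
print(output_kwargs)


def build_classifier():
    """Create a classifier that writes to run_dir and keeps its files."""
    import autosklearn.classification  # imported lazily

    return autosklearn.classification.AutoSklearnClassifier(**output_kwargs)
```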

Auto-sklearn's logfiles eat up all my disk space. What can I do?

Auto-sklearn heavily uses the hard drive to store temporary data, models and log files which can be used to inspect the behavior of Auto-sklearn. By default, Auto-sklearn stores 50 models and their predictions on the validation data (which is a subset of the training data in case of holdout and the full training data in case of cross-validation) on the hard drive. Redundant models and their predictions (i.e. when we have more than 50 models) are removed every time the ensemble builder finishes an iteration, which means that the number of models stored on disk can temporarily be higher if a model is output while the ensemble builder is running.

You can change the number of models that will be stored on disk by passing an integer for the argument max_models_on_disc to Auto-sklearn, for example to reduce the number of stored models if you have space issues.

As the number of models is only an indicator of the disk space used, it is also possible to pass the amount of disk space in MB that the models are allowed to use as a float (also via the max_models_on_disc argument). As above, this is only a guideline for how much space is used, as redundant models are only removed from disk when the ensemble builder finishes an iteration.

Note

Especially when running in parallel it can happen that multiple models are constructed during one run of the ensemble builder and thus Auto-sklearn can exceed the given limit.

Note

These limits apply only to models and their predictions, not to other files stored in the temporary directory such as the log files.

The Search Space

How can I restrict the search space?

The following shows an example of how to exclude all preprocessing methods and restrict the configuration space to only random forests.

import autosklearn.classification
automl = autosklearn.classification.AutoSklearnClassifier(
    include = {
        'classifier': ["random_forest"],
        'feature_preprocessor': ["no_preprocessing"]
    },
    exclude=None
)
automl.fit(X_train, y_train)
predictions = automl.predict(X_test)

Note: The strings used to identify estimators and preprocessors are the filenames without .py.

For a full list please have a look at the source code (in autosklearn/pipeline/components/).

We also provide an example of how to restrict the classifiers that are searched over: Interpretable models.

How can I turn off data preprocessing?

Data preprocessing includes One-Hot encoding of categorical features, imputation of missing values and the normalization of features or samples. These steps ensure that the data that gets to the sklearn models is well formed and can be used for training models.

While this is necessary in general, if you’d like to disable this step, please refer to this example.

How can I turn off feature preprocessing?

Feature preprocessing is a single transformer which implements, for example, feature selection or transformation of features into a different space (e.g. PCA).

This can be turned off by setting include={'feature_preprocessor': ["no_preprocessing"]} as shown in the example above.

Will non-scikit-learn models be added to Auto-sklearn?

The short answer: no.

The long answer is a bit more nuanced: maintaining Auto-sklearn requires a lot of time and effort, which would grow even larger when depending on more libraries. Also, adding more libraries would require us to generate meta-data more often. Lastly, having more choices does not guarantee a better performance for most users as having more choices demands a longer search for good models and can lead to more overfitting.

Nevertheless, everyone can still add their favorite model to Auto-sklearn’s search space by following the examples on how to extend Auto-sklearn.

If there is interest in creating an Auto-sklearn-contrib repository with 3rd-party models please open an issue for that.

How can I only search for interpretable models?

Auto-sklearn can be restricted to only use interpretable models and preprocessing algorithms. Please see the Section The search space to learn how to restrict the models which are searched over or see the Example Interpretable models.

We don’t provide a judgement on which of the models are interpretable as this is very much up to the specific use case, but would like to note that decision trees and linear models are usually the most interpretable.

Ensembling

What can I configure in the ensemble building process?

The following hyperparameters control how the ensemble is constructed:

  • ensemble_size determines the maximal size of the ensemble. If it is set to zero, no ensemble will be constructed.

  • ensemble_nbest allows the user to directly specify the number of models considered for the ensemble. This hyperparameter can be an integer n, such that only the best n models are used in the final ensemble. If a float between 0.0 and 1.0 is provided, ensemble_nbest would be interpreted as a fraction suggesting the percentage of models to use in the ensemble building process (namely, if ensemble_nbest is a float, library pruning is implemented as described in Caruana et al. (2006)).

  • max_models_on_disc defines the maximum number of models that are kept on the disc, as a mechanism to control the amount of disc space consumed by auto-sklearn. Throughout the automl process, different individual models are optimized, and their predictions (and other metadata) are stored on disc. The user can set the upper bound on how many models are acceptable to keep on disc, yet this variable takes priority in the definition of the number of models used by the ensemble builder (that is, the minimum of ensemble_size, ensemble_nbest and max_models_on_disc determines the maximal number of models used in the ensemble). If set to None, this feature is disabled.
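The interaction of the three hyperparameters can be illustrated with a few lines of plain Python. This is only a literal reading of the minimum rule stated above, not auto-sklearn's actual implementation: max_models_in_ensemble is a hypothetical helper, and it assumes that ensemble_nbest and max_models_on_disc are given as integer model counts (not as a fraction or a size in MB).

```python
def max_models_in_ensemble(ensemble_size, ensemble_nbest, max_models_on_disc=None):
    """Upper bound on the models used in the ensemble (hypothetical helper)."""
    if ensemble_size == 0:
        # An ensemble size of zero means no ensemble is constructed.
        return 0
    caps = [ensemble_size, ensemble_nbest]
    if max_models_on_disc is not None:
        # None disables the on-disc limit, so it only counts when set.
        caps.append(max_models_on_disc)
    return min(caps)


print(max_models_in_ensemble(50, 50, 10))    # the disc limit is the tightest bound
print(max_models_in_ensemble(50, 20, None))  # disc limit disabled
```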

Which models are in the final ensemble?

The results obtained from the final ensemble can be printed by calling show_models() or leaderboard(). The auto-sklearn ensemble is composed of scikit-learn models that can be inspected as exemplified in the Example Obtain run information.

Can I fit an ensemble post-hoc only?

It is possible to build ensembles post-hoc. An example on how to do this (first searching for individual models, and then building an ensemble from them) can be seen in Sequential Usage.

Configuring the Search Procedure

Can I change the resampling strategy?

Examples for using holdout and cross-validation can be found in the examples.

Can I use a custom metric?

Examples for using a custom metric can be found in the examples.

Meta-Learning

Which datasets are used for meta-learning?

We updated the list of datasets used for meta-learning several times and this list now differs significantly from the original 140 datasets we used in 2015 when the paper and the package were released. An up-to-date list of OpenML task IDs can be found on github.

Which meta-features are used for meta-learning?

We do not have a user guide on meta-features but they are all pretty simple and can be found in the source code.

Issues and Debugging

How can I limit the number of model evaluations for debugging?

In certain cases, for example for debugging, it can be helpful to limit the number of model evaluations. We do not provide this as an argument in the API as we believe that it should NOT be used in practice, but that the user should rather provide time limits. An example on how to add the number of models to try as an additional stopping condition can be found in this github issue. Please note that Auto-sklearn will stop when either the time limit or the number of models termination condition is reached.

Why does the final ensemble contain only a dummy model?

This is a symptom of the problem that all runs started by Auto-sklearn failed. Usually, the issue is that the runtime or memory limit was too tight. Please check the output of sprint_statistics() to see the distribution of why runs failed. If there are mostly crashed runs, please check the log file for further details. If there are mostly runs that exceed the memory or time limit, please increase the respective limit and rerun the optimization.

Auto-sklearn does not use the specified amount of resources?

Auto-sklearn wraps scikit-learn and therefore inherits its parallelism implementation. In short, scikit-learn uses two modes of parallelizing computations:

  1. By using joblib to distribute independent function calls on multiple cores.

  2. By using lower level libraries such as OpenMP and numpy to distribute more fine-grained computation.

This means that Auto-sklearn can use more resources than expected by the user. For technical reasons we can only control the 1st way of parallel execution, but not the 2nd. Thus, the user needs to make sure that the lower level parallelization libraries only use as many cores as allocated (on a laptop or workstation running a single copy of Auto-sklearn it can be fine to not adjust this, but when using a compute cluster it is necessary to align the parallelism setting with the number of requested CPUs). This can be done by setting the following environment variables: MKL_NUM_THREADS, OPENBLAS_NUM_THREADS, BLIS_NUM_THREADS and OMP_NUM_THREADS.

More details can be found in the scikit-learn docs.
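If exporting the variables in the shell is inconvenient, the same effect can usually be achieved from Python, provided this happens before numpy, scikit-learn or auto-sklearn are imported, because the numerical libraries typically read these variables when they are loaded. The sketch below uses the four variable names listed above; limit_native_threads is a hypothetical helper.

```python
import os

THREAD_VARIABLES = (
    "MKL_NUM_THREADS",
    "OPENBLAS_NUM_THREADS",
    "BLIS_NUM_THREADS",
    "OMP_NUM_THREADS",
)


def limit_native_threads(n_threads=1):
    """Cap the thread pools of the lower level libraries (hypothetical helper)."""
    for name in THREAD_VARIABLES:
        os.environ[name] = str(n_threads)


# Call this first, then import numpy, scikit-learn or auto-sklearn.
limit_native_threads(1)
print({name: os.environ[name] for name in THREAD_VARIABLES})
```

On a compute cluster, n_threads would be set to the number of CPUs requested for the job.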

Other

Model persistence

auto-sklearn is mostly a wrapper around scikit-learn. Therefore, it is possible to follow the persistence Example from scikit-learn.
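A minimal sketch of such a round trip with pickle from the standard library. The helper names save_model and load_model are hypothetical, and a plain dict stands in for a fitted estimator so that the snippet runs without auto-sklearn; in practice the fitted automl object would be passed instead. As with any pickle file, only load files you trust, and use the same library versions for saving and loading.

```python
import os
import pickle
import tempfile


def save_model(model, path):
    """Serialize a fitted model to disk (hypothetical helper)."""
    with open(path, "wb") as handle:
        pickle.dump(model, handle)


def load_model(path):
    """Load a previously saved model from disk (hypothetical helper)."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Stand-in object; replace it with a fitted auto-sklearn estimator.
stand_in = {"name": "fitted-model-placeholder", "weights": [0.7, 0.3]}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "automl.pkl")
    save_model(stand_in, model_path)
    restored = load_model(model_path)

print(restored)
```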

Vanilla auto-sklearn

In order to obtain vanilla auto-sklearn as used in Efficient and Robust Automated Machine Learning set ensemble_size=1 and initial_configurations_via_metalearning=0:

import autosklearn.classification
automl = autosklearn.classification.AutoSklearnClassifier(
    ensemble_size=1,
    initial_configurations_via_metalearning=0
)

An ensemble of size one will result in always choosing the current best model according to its performance on the validation set. Setting the initial configurations found by meta-learning to zero makes auto-sklearn use the regular SMAC algorithm for suggesting new hyperparameter configurations.