APIs¶
Main modules¶
Classification¶
- class autoPyTorch.api.tabular_classification.TabularClassificationTask(seed: int = 1, n_jobs: int = 1, n_threads: int = 1, logging_config: Optional[Dict] = None, ensemble_size: int = 50, ensemble_nbest: int = 50, max_models_on_disc: int = 50, temporary_directory: Optional[str] = None, output_directory: Optional[str] = None, delete_tmp_folder_after_terminate: bool = True, delete_output_folder_after_terminate: bool = True, include_components: Optional[Dict[str, Any]] = None, exclude_components: Optional[Dict[str, Any]] = None, resampling_strategy: Union[CrossValTypes, HoldoutValTypes, NoResamplingStrategyTypes] = HoldoutValTypes.holdout_validation, resampling_strategy_args: Optional[Dict[str, Any]] = None, backend: Optional[Backend] = None, search_space_updates: Optional[HyperparameterSearchSpaceUpdates] = None)[source]¶
Tabular Classification API for the pipelines.
- Args:
- seed (int: default=1):
seed to be used for reproducibility.
- n_jobs (int: default=1):
number of consecutive processes to spawn.
- n_threads (int: default=1):
number of threads to use for each process.
- logging_config (Optional[Dict]):
Specifies the configuration for logging. If None, it is loaded from logging.yaml.
- ensemble_size (int: default=50):
Number of models added to the ensemble built by Ensemble selection from libraries of models. Models are drawn with replacement.
- ensemble_nbest (int: default=50):
Only consider the ensemble_nbest best models when building the ensemble.
- max_models_on_disc (int: default=50):
Maximum number of models saved to disc. Also controls the size of the ensemble, as any additional models will be deleted. Must be greater than or equal to 1.
- temporary_directory (str):
Folder to store configuration output and log file
- output_directory (str):
Folder to store predictions for optional test set
- delete_tmp_folder_after_terminate (bool):
Determines whether to delete the temporary directory when finished.
- delete_output_folder_after_terminate (bool):
Determines whether to delete the output directory when finished.
- include_components (Optional[Dict[str, Any]]):
Dictionary containing components to include. Key is the node name and Value is an Iterable of the names of the components to include. Only these components will be present in the search space.
- exclude_components (Optional[Dict[str, Any]]):
Dictionary containing components to exclude. Key is the node name and Value is an Iterable of the names of the components to exclude. All except these components will be present in the search space.
- resampling_strategy (RESAMPLING_STRATEGIES: default=HoldoutValTypes.holdout_validation):
Strategy to split the training data.
- resampling_strategy_args (Optional[Dict[str, Any]]):
Arguments required for the chosen resampling strategy. If None, uses the default values provided in DEFAULT_RESAMPLING_PARAMETERS in `datasets/resampling_strategy.py`.
- search_space_updates (Optional[HyperparameterSearchSpaceUpdates]):
Search space updates that can be used to modify the search space of particular components or choice modules of the pipeline.
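Example (a minimal construction sketch; the directory paths and the reduced ensemble size are illustrative assumptions, not recommended values):

.. code-block:: python

    from autoPyTorch.api.tabular_classification import TabularClassificationTask

    # Illustrative settings; every argument shown has a sensible default.
    api = TabularClassificationTask(
        seed=42,                                      # reproducibility
        n_jobs=2,                                     # two parallel evaluation processes
        ensemble_size=20,                             # smaller ensemble than the default 50
        temporary_directory='./tmp/autoPyTorch_tmp',  # hypothetical path
        output_directory='./tmp/autoPyTorch_out',     # hypothetical path
        delete_tmp_folder_after_terminate=False,      # keep logs around for inspection
    )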
- build_pipeline(dataset_properties: Dict[str, Union[int, float, str, List, bool, Tuple]], include_components: Optional[Dict[str, Any]] = None, exclude_components: Optional[Dict[str, Any]] = None, search_space_updates: Optional[HyperparameterSearchSpaceUpdates] = None) TabularClassificationPipeline [source]¶
Build a pipeline according to the current task and the passed dataset properties.
- Args:
- dataset_properties (Dict[str, Any]):
Characteristics of the dataset to guide the pipeline choices of components
- include_components (Optional[Dict[str, Any]]):
Dictionary containing components to include. Key is the node name and Value is an Iterable of the names of the components to include. Only these components will be present in the search space.
- exclude_components (Optional[Dict[str, Any]]):
Dictionary containing components to exclude. Key is the node name and Value is an Iterable of the names of the components to exclude. All except these components will be present in the search space.
- search_space_updates (Optional[HyperparameterSearchSpaceUpdates]):
Search space updates that can be used to modify the search space of particular components or choice modules of the pipeline
- Returns:
TabularClassificationPipeline
- fit_pipeline(configuration: Configuration, *, dataset: Optional[BaseDataset] = None, X_train: Optional[Union[List, DataFrame, ndarray]] = None, y_train: Optional[Union[List, DataFrame, ndarray]] = None, X_test: Optional[Union[List, DataFrame, ndarray]] = None, y_test: Optional[Union[List, DataFrame, ndarray]] = None, dataset_name: Optional[str] = None, resampling_strategy: Optional[Union[CrossValTypes, HoldoutValTypes, NoResamplingStrategyTypes]] = None, resampling_strategy_args: Optional[Dict[str, Any]] = None, run_time_limit_secs: int = 60, memory_limit: Optional[int] = None, eval_metric: Optional[str] = None, all_supported_metrics: bool = False, budget_type: Optional[str] = None, include_components: Optional[Dict[str, Any]] = None, exclude_components: Optional[Dict[str, Any]] = None, search_space_updates: Optional[HyperparameterSearchSpaceUpdates] = None, budget: Optional[float] = None, pipeline_options: Optional[Dict] = None, disable_file_output: Optional[List[Union[str, DisableFileOutputParameters]]] = None) Tuple[Optional[BasePipeline], RunInfo, RunValue, BaseDataset] ¶
Fit a pipeline on the given task for the given budget. A pipeline configuration can be specified; if None, the default configuration is used.
Fit uses the estimator's pipeline_options attribute, which the user can interact with via the get_pipeline_options()/set_pipeline_options() methods.
- Args:
- configuration (Configuration):
configuration to fit the pipeline with.
- dataset (BaseDataset):
An object of the appropriate child class of BaseDataset, that will be used to fit the pipeline
- X_train, y_train, X_test, y_test: Union[np.ndarray, List, pd.DataFrame]
A pair of features (X_train) and targets (y_train) used to fit a pipeline. Additionally, a holdout pair (X_test, y_test) can be provided to track the generalization performance at each stage.
- dataset_name (Optional[str]):
Name of the dataset, used as the experiment name. If None, a random value is used.
- resampling_strategy (Optional[RESAMPLING_STRATEGIES]):
Strategy to split the training data. If None, uses HoldoutValTypes.holdout_validation.
- resampling_strategy_args (Optional[Dict[str, Any]]):
Arguments required for the chosen resampling strategy. If None, uses the default values provided in DEFAULT_RESAMPLING_PARAMETERS in `datasets/resampling_strategy.py`.
- run_time_limit_secs (int: default=60):
Time limit for a single call to the machine learning model. Model fitting will be terminated if the machine learning algorithm runs over the time limit. Set this value high enough so that typical machine learning algorithms can be fit on the training data.
- memory_limit (Optional[int]):
Memory limit in MB for the machine learning algorithm. autopytorch will stop fitting the machine learning algorithm if it tries to allocate more than memory_limit MB. If None is provided, no memory limit is set. In case of multi-processing, memory_limit will be per job. This memory limit also applies to the ensemble creation process.
- eval_metric (Optional[str]):
Name of the metric that is used to evaluate a pipeline.
- all_supported_metrics (bool: default=False):
If True, all metrics supported by the current task will be calculated for each pipeline and results will be available via cv_results.
- budget_type (str):
Type of budget to be used when fitting the pipeline. It can be one of:
- epochs: The training of each pipeline will be terminated after a number of epochs have passed. This number of epochs is determined by the budget argument of this method.
- runtime: The training of each pipeline will be terminated after a number of seconds have passed. This number of seconds is determined by the budget argument of this method. The overall fitting time of a pipeline is controlled by func_eval_time_limit_secs. ‘runtime’ only controls the allocated time to train a pipeline, but it does not consider the overall time it takes to create a pipeline (data loading and preprocessing, other i/o operations, etc.).
- include_components (Optional[Dict[str, Any]]):
Dictionary containing components to include. Key is the node name and Value is an Iterable of the names of the components to include. Only these components will be present in the search space.
- exclude_components (Optional[Dict[str, Any]]):
Dictionary containing components to exclude. Key is the node name and Value is an Iterable of the names of the components to exclude. All except these components will be present in the search space.
- search_space_updates (Optional[HyperparameterSearchSpaceUpdates]):
Updates to be made to the hyperparameter search space of the pipeline
- budget (Optional[float]):
Budget to fit a single run of the pipeline. If not provided, uses the default in the pipeline config
- pipeline_options (Optional[Dict]):
Valid config options include “device”, “torch_num_threads”, “early_stopping”, “use_tensorboard_logger”, “metrics_during_training”
- disable_file_output (Optional[List[Union[str, DisableFileOutputParameters]]]):
Used as a list to pass more fine-grained information on what to save. Must be a member of DisableFileOutputParameters. Allowed elements in the list are:
- y_optimization:
do not save the predictions for the optimization set, which would later on be used to build an ensemble. Note that SMAC optimizes a metric evaluated on the optimization set.
- pipeline:
do not save any individual pipeline files
- pipelines:
In case of cross validation, disables saving the joint model of the pipelines fit on each fold.
- y_test:
do not save the predictions for the test set.
- all:
do not save any of the above.
For more information check autoPyTorch.evaluation.utils.DisableFileOutputParameters.
- Returns:
- (BasePipeline):
fitted pipeline
- (RunInfo):
Run information
- (RunValue):
Result of fitting the pipeline
- (BaseDataset):
Dataset created from the given tensors
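Example (a hedged sketch of fitting a single configuration; it assumes an already-constructed task api with in-memory X_train/y_train, and draws the configuration from the search space as shown under get_search_space() below):

.. code-block:: python

    # Build the dataset, sample one configuration, and fit it for a fixed budget.
    dataset = api.get_dataset(X_train, y_train, dataset_name='example')
    configuration = api.get_search_space(dataset).sample_configuration()

    pipeline, run_info, run_value, dataset = api.fit_pipeline(
        configuration=configuration,
        dataset=dataset,
        run_time_limit_secs=120,   # illustrative single-fit time limit
        budget_type='epochs',
        budget=10,                 # train for at most 10 epochs
    )
    print(run_value)               # status, cost and additional run information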
- get_dataset(X_train: Union[List, DataFrame, ndarray], y_train: Union[List, DataFrame, ndarray], X_test: Optional[Union[List, DataFrame, ndarray]] = None, y_test: Optional[Union[List, DataFrame, ndarray]] = None, resampling_strategy: Optional[Union[CrossValTypes, HoldoutValTypes, NoResamplingStrategyTypes]] = None, resampling_strategy_args: Optional[Dict[str, Any]] = None, dataset_name: Optional[str] = None, dataset_compression: Optional[Dict[str, Union[int, float, List[str]]]] = None, **kwargs: Any) BaseDataset ¶
Returns an object of a child class of BaseDataset according to the current task.
- Args:
- X_train (Union[List, pd.DataFrame, np.ndarray]):
Training feature set.
- y_train (Union[List, pd.DataFrame, np.ndarray]):
Training target set.
- X_test (Optional[Union[List, pd.DataFrame, np.ndarray]]):
Testing feature set
- y_test (Optional[Union[List, pd.DataFrame, np.ndarray]]):
Testing target set
- resampling_strategy (Optional[RESAMPLING_STRATEGIES]):
Strategy to split the training data. If None, uses HoldoutValTypes.holdout_validation.
- resampling_strategy_args (Optional[Dict[str, Any]]):
Arguments required for the chosen resampling strategy. If None, uses the default values provided in DEFAULT_RESAMPLING_PARAMETERS in `datasets/resampling_strategy.py`.
- dataset_name (Optional[str]):
Name of the dataset, used as the experiment name.
- dataset_compression (Optional[DatasetCompressionSpec]):
We compress datasets so that they fit into some predefined amount of memory. Note that you can pass your own configuration with the same keys, choosing from the available "methods". The available options are described here:
- memory_allocation:
Absolute memory in MB, e.g. 10MB is "memory_allocation": 10. It can be either float or int. The memory used by the dataset is checked after each reduction method is performed. If the dataset fits into the allocated memory, any further methods listed in "methods" will not be performed.
- methods:
We currently provide the following methods for reducing the dataset size. These can be provided in a list and are performed in the order given.
- "precision": We reduce floating point precision as follows:
np.float128 -> np.float64
np.float96 -> np.float64
np.float64 -> np.float32
pandas DataFrames are reduced using the downcast option of pd.to_numeric to the lowest possible precision.
- "subsample": We subsample the data such that it fits directly into the memory allocation (memory_allocation * memory_limit). Therefore, this should likely be the last method listed in "methods". Subsampling takes into account classification labels and stratifies accordingly. We guarantee that at least one occurrence of each label is included in the sampled set.
- kwargs (Any):
Can be used to pass task-specific dataset arguments. Currently supports passing feat_types for tabular tasks, which specifies whether a feature is ‘numerical’ or ‘categorical’.
- Returns:
- BaseDataset:
the dataset object
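Example (a sketch of building the dataset object with cross-validation; the resampling_strategy_args key 'num_splits' is an assumption taken from DEFAULT_RESAMPLING_PARAMETERS):

.. code-block:: python

    from autoPyTorch.datasets.resampling_strategy import CrossValTypes

    dataset = api.get_dataset(
        X_train, y_train,
        X_test=X_test, y_test=y_test,
        resampling_strategy=CrossValTypes.k_fold_cross_validation,
        resampling_strategy_args={'num_splits': 3},  # assumed key name
        dataset_name='my_tabular_task',
    )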
- get_incumbent_results(include_traditional: bool = False) Tuple[Configuration, Dict[str, Union[int, str, float]]] ¶
Get the incumbent configuration and the corresponding results.
- Args:
- include_traditional (bool):
Whether to include results from traditional pipelines
- Returns:
- Configuration (CS.Configuration):
The incumbent configuration
- Dict[str, Union[int, str, float]]:
Additional information about the run of the incumbent configuration.
- get_pipeline_options() dict ¶
Returns the current pipeline configuration.
- get_search_results() SearchResults ¶
Get the interface to obtain the search results easily.
- get_search_space(dataset: Optional[BaseDataset] = None) ConfigurationSpace ¶
Returns the current search space as a ConfigurationSpace object.
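Example (a short inspection sketch; sample_configuration() is ConfigSpace functionality rather than part of this API):

.. code-block:: python

    cs = api.get_search_space(dataset)  # ConfigurationSpace for the current task
    print(cs)                           # hyperparameters, defaults and conditions
    config = cs.sample_configuration()  # a Configuration usable with fit_pipeline()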
- plot_perf_over_time(metric_name: str, ax: Optional[Axes] = None, plot_setting_params: PlotSettingParams = PlotSettingParams(n_points=20, xscale='linear', yscale='linear', xlabel=None, ylabel=None, title=None, title_kwargs={}, xlim=None, ylim=None, grid=True, legend=True, legend_kwargs={}, show=False, figname=None, figsize=None, savefig_kwargs={}), color_label_settings: ColorLabelSettings = ColorLabelSettings(single_train=('red', None), single_opt=('blue', None), single_test=('green', None), ensemble_train=('brown', None), ensemble_test=('purple', None)), *args: Any, **kwargs: Any) None ¶
Visualize the performance over time using matplotlib. The plot-related arguments are based on matplotlib; please refer to the matplotlib documentation for more details.
- Args:
- metric_name (str):
The name of the metric to visualize. The names are available in
autoPyTorch.metrics.CLASSIFICATION_METRICS
autoPyTorch.metrics.REGRESSION_METRICS
- ax (Optional[plt.Axes]):
axis to plot (subplots of matplotlib). If None, it will be created automatically.
- plot_setting_params (PlotSettingParams):
Parameters for the plot.
- color_label_settings (ColorLabelSettings):
The settings of a pair of color and label for each plot.
- args, kwargs (Any):
Arguments for the ax.plot.
- Note:
You might need to run export DISPLAY=:0.0 if you are using a non-GUI environment.
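Example (a plotting sketch to run after search() has finished; the import location of PlotSettingParams is an assumption):

.. code-block:: python

    from autoPyTorch.utils.results_visualizer import PlotSettingParams  # assumed module path

    params = PlotSettingParams(
        xscale='log', yscale='linear',
        xlabel='Runtime (s)', ylabel='Accuracy',
        figname='perf_over_time.png',  # save the figure instead of showing it
    )
    api.plot_perf_over_time(metric_name='accuracy', plot_setting_params=params)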
- predict(X_test: ndarray, batch_size: Optional[int] = None, n_jobs: int = 1) ndarray [source]¶
Generate the estimator predictions based on the given examples from the test set.
- Args:
- X_test (np.ndarray):
The test set examples.
- Returns:
Array with estimator predictions.
- refit(dataset: Optional[BaseDataset] = None, X_train: Optional[Union[List, DataFrame, ndarray]] = None, y_train: Optional[Union[List, DataFrame, ndarray]] = None, X_test: Optional[Union[List, DataFrame, ndarray]] = None, y_test: Optional[Union[List, DataFrame, ndarray]] = None, dataset_name: Optional[str] = None, resampling_strategy: Union[CrossValTypes, HoldoutValTypes, NoResamplingStrategyTypes] = NoResamplingStrategyTypes.no_resampling, resampling_strategy_args: Optional[Dict[str, Any]] = None, total_walltime_limit: int = 120, run_time_limit_secs: int = 60, memory_limit: Optional[int] = None, eval_metric: Optional[str] = None, all_supported_metrics: bool = False, budget_type: Optional[str] = None, budget: Optional[float] = None, pipeline_options: Optional[Dict] = None, disable_file_output: Optional[List[Union[str, DisableFileOutputParameters]]] = None) BaseTask ¶
Fit all the models found in the ensemble on the whole training set X_train. We therefore recommend using NoResamplingStrategyTypes.no_resampling; nevertheless, refit can also use other splitting techniques such as holdout or cross-validation.
Refit uses the estimator's pipeline_options attribute, which the user can interact with via the get_pipeline_options()/set_pipeline_options() methods.
- Args:
- dataset (BaseDataset):
An object of the appropriate child class of BaseDataset, that will be used to fit the pipeline
- X_train, y_train, X_test, y_test: Union[np.ndarray, List, pd.DataFrame]
A pair of features (X_train) and targets (y_train) used to fit a pipeline. Additionally, a holdout pair (X_test, y_test) can be provided to track the generalization performance at each stage.
- dataset_name (Optional[str]):
Name of the dataset, used as the experiment name. If None, a random value is used.
- resampling_strategy (ResamplingStrategies):
Strategy to split the training data. Defaults to NoResamplingStrategyTypes.no_resampling.
- resampling_strategy_args (Optional[Dict[str, Any]]):
Arguments required for the chosen resampling strategy. If None, uses the default values provided in DEFAULT_RESAMPLING_PARAMETERS in `datasets/resampling_strategy.py`.
- total_walltime_limit (int):
Total time that can be used by all the models to be refitted. Defaults to 120.
- run_time_limit_secs (int: default=60):
Time limit for a single call to the machine learning model. Model fitting will be terminated if the machine learning algorithm runs over the time limit. Set this value high enough so that typical machine learning algorithms can be fit on the training data.
- memory_limit (Optional[int]):
Memory limit in MB for the machine learning algorithm. autopytorch will stop fitting the machine learning algorithm if it tries to allocate more than memory_limit MB. If None is provided, no memory limit is set. In case of multi-processing, memory_limit will be per job. This memory limit also applies to the ensemble creation process.
- eval_metric (Optional[str]):
Name of the metric that is used to evaluate a pipeline.
- all_supported_metrics (bool: default=False):
If True, all metrics supported by the current task will be calculated for each pipeline and results will be available via cv_results.
- budget_type (str):
Type of budget to be used when fitting the pipeline. It can be one of:
- epochs: The training of each pipeline will be terminated after a number of epochs have passed. This number of epochs is determined by the budget argument of this method.
- runtime: The training of each pipeline will be terminated after a number of seconds have passed. This number of seconds is determined by the budget argument of this method. The overall fitting time of a pipeline is controlled by func_eval_time_limit_secs. ‘runtime’ only controls the allocated time to train a pipeline, but it does not consider the overall time it takes to create a pipeline (data loading and preprocessing, other i/o operations, etc.).
- budget (Optional[float]):
Budget to fit a single run of the pipeline. If not provided, uses the default in the pipeline config
- pipeline_options (Optional[Dict]):
Valid config options include “device”, “torch_num_threads”, “early_stopping”, “use_tensorboard_logger”, “metrics_during_training”
- disable_file_output (Optional[List[Union[str, DisableFileOutputParameters]]]):
Used as a list to pass more fine-grained information on what to save. Must be a member of DisableFileOutputParameters. Allowed elements in the list are:
- y_optimization:
do not save the predictions for the optimization set, which would later on be used to build an ensemble. Note that SMAC optimizes a metric evaluated on the optimization set.
- pipeline:
do not save any individual pipeline files
- pipelines:
In case of cross validation, disables saving the joint model of the pipelines fit on each fold.
- y_test:
do not save the predictions for the test set.
- all:
do not save any of the above.
For more information check autoPyTorch.evaluation.utils.DisableFileOutputParameters.
- Returns:
self
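Example (a refit sketch following a finished search(); NoResamplingStrategyTypes.no_resampling mirrors the default, and the time limits are illustrative):

.. code-block:: python

    from autoPyTorch.datasets.resampling_strategy import NoResamplingStrategyTypes

    api.refit(
        X_train=X_train, y_train=y_train,
        dataset_name='my_tabular_task',
        resampling_strategy=NoResamplingStrategyTypes.no_resampling,
        total_walltime_limit=300,   # budget for refitting all ensemble members
        run_time_limit_secs=120,    # per-model limit
    )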
- score(y_pred: ndarray, y_test: Union[ndarray, DataFrame]) Dict[str, float] ¶
Calculate the evaluation measure on the test set.
- Args:
- y_pred (np.ndarray):
The test predictions
- y_test (np.ndarray):
The test ground truth labels.
- Returns:
- Dict[str, float]:
Value of the evaluation metric calculated on the test set.
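Example (prediction and scoring go together; score() compares predictions against the ground truth using the metric the estimator optimizes):

.. code-block:: python

    y_pred = api.predict(X_test)         # class predictions for the test examples
    results = api.score(y_pred, y_test)  # {metric_name: value}
    print(results)                       # e.g. {'accuracy': 0.93}; the key depends on the metric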
- search(optimize_metric: str, X_train: Optional[Union[List, DataFrame, ndarray]] = None, y_train: Optional[Union[List, DataFrame, ndarray]] = None, X_test: Optional[Union[List, DataFrame, ndarray]] = None, y_test: Optional[Union[List, DataFrame, ndarray]] = None, dataset_name: Optional[str] = None, feat_types: Optional[List[str]] = None, budget_type: str = 'epochs', min_budget: int = 5, max_budget: int = 50, total_walltime_limit: int = 100, func_eval_time_limit_secs: Optional[int] = None, enable_traditional_pipeline: bool = True, memory_limit: int = 4096, smac_scenario_args: Optional[Dict[str, Any]] = None, get_smac_object_callback: Optional[Callable] = None, all_supported_metrics: bool = True, precision: int = 32, disable_file_output: Optional[List[Union[str, DisableFileOutputParameters]]] = None, load_models: bool = True, portfolio_selection: Optional[str] = None, dataset_compression: Union[Mapping[str, Any], bool] = False) BaseTask [source]¶
Search for the best pipeline configuration for the given dataset.
search() both optimizes the machine learning models and builds an ensemble out of them using the optimizer. To disable ensembling, set ensemble_size==0.
- Args:
- X_train, y_train, X_test, y_test: Union[np.ndarray, List, pd.DataFrame]
A pair of features (X_train) and targets (y_train) used to fit a pipeline. Additionally, a holdout pair (X_test, y_test) can be provided to track the generalization performance of each stage.
- feat_types (Optional[List[str]]):
Description of the feature types of the columns. Accepts ‘numerical’ for integers and float data, and ‘categorical’ for categories, strings and bool. Defaults to None.
- optimize_metric (str):
Name of the metric that is used to evaluate a pipeline.
- budget_type (str):
Type of budget to be used when fitting the pipeline. It can be one of:
- epochs: The training of each pipeline will be terminated after a number of epochs have passed. This number of epochs is determined by the budget argument of this method.
- runtime: The training of each pipeline will be terminated after a number of seconds have passed. This number of seconds is determined by the budget argument of this method. The overall fitting time of a pipeline is controlled by func_eval_time_limit_secs. ‘runtime’ only controls the allocated time to train a pipeline, but it does not consider the overall time it takes to create a pipeline (data loading and preprocessing, other i/o operations, etc.).
budget_type determines the units of min_budget/max_budget: if budget_type==’epochs’, min_budget refers to epochs, whereas if budget_type==’runtime’, min_budget refers to seconds.
- min_budget (int):
Auto-PyTorch uses Hyperband to trade-off resources between running many pipelines at min_budget and running the top performing pipelines on max_budget. min_budget states the minimum resource allocation a pipeline should have so that we can compare and quickly discard bad performing models. For example, if the budget_type is epochs, and min_budget=5, then we will run every pipeline to a minimum of 5 epochs before performance comparison.
- max_budget (int):
Auto-PyTorch uses Hyperband to trade-off resources between running many pipelines at min_budget and running the top performing pipelines on max_budget. max_budget states the maximum resource allocation a pipeline will be run with. For example, if the budget_type is epochs, and max_budget=50, then the pipeline training will be terminated after 50 epochs.
- total_walltime_limit (int: default=100):
Time limit in seconds for the search of appropriate models. By increasing this value, autopytorch has a higher chance of finding better models.
- func_eval_time_limit_secs (Optional[int]):
Time limit for a single call to the machine learning model. Model fitting will be terminated if the machine learning algorithm runs over the time limit. Set this value high enough so that typical machine learning algorithms can be fit on the training data. When set to None, this time will automatically be set to total_walltime_limit // 2 to allow enough time to fit at least 2 individual machine learning algorithms. Set to np.inf in case no time limit is desired.
- enable_traditional_pipeline (bool: default=True):
We fit traditional machine learning algorithms (LightGBM, CatBoost, RandomForest, ExtraTrees, KNN, SVM) prior to building PyTorch neural networks. You can disable this feature by setting this flag to False. All machine learning algorithms that are fitted during search() are considered for ensemble building.
- memory_limit (int: default=4096):
Memory limit in MB for the machine learning algorithm. Autopytorch will stop fitting the machine learning algorithm if it tries to allocate more than memory_limit MB. If None is provided, no memory limit is set. In case of multi-processing, memory_limit will be per job. This memory limit also applies to the ensemble creation process.
- smac_scenario_args (Optional[Dict]):
Additional arguments inserted into the scenario of SMAC. See the SMAC documentation for a list of available arguments.
- get_smac_object_callback (Optional[Callable]):
Callback function to create an object of class smac.optimizer.smbo.SMBO. The function must accept the arguments scenario_dict, instances, num_params, runhistory, seed and ta. This is an advanced feature. Use only if you are familiar with SMAC.
- tae_func (Optional[Callable]):
TargetAlgorithm to be optimised. If None, eval_function available in autoPyTorch/evaluation/train_evaluator is used. Must be child class of AbstractEvaluator.
- all_supported_metrics (bool: default=True):
If True, all metrics supporting current task will be calculated for each pipeline and results will be available via cv_results
- precision (int: default=32):
Numeric precision used when loading ensemble data. Can be either 16, 32 or 64.
- disable_file_output (Optional[List[Union[str, DisableFileOutputParameters]]]):
Used as a list to pass more fine-grained information on what to save. Must be a member of DisableFileOutputParameters. Allowed elements in the list are:
- y_optimization:
do not save the predictions for the optimization set, which would later on be used to build an ensemble. Note that SMAC optimizes a metric evaluated on the optimization set.
- pipeline:
do not save any individual pipeline files
- pipelines:
In case of cross validation, disables saving the joint model of the pipelines fit on each fold.
- y_test:
do not save the predictions for the test set.
- all:
do not save any of the above.
For more information check autoPyTorch.evaluation.utils.DisableFileOutputParameters.
- load_models (bool: default=True):
Whether to load the models after fitting AutoPyTorch.
- portfolio_selection (Optional[str]):
This argument controls the initial configurations that AutoPyTorch uses to warm start SMAC for hyperparameter optimization. By default, no warm-starting happens. The user can provide a path to a json file containing configurations, similar to the greedy portfolio file shipped with AutoPyTorch. Additionally, the keyword ‘greedy’ is supported, which uses the default portfolio from AutoPyTorch Tabular.
- dataset_compression (Union[bool, Mapping[str, Any]]: default=False):
We compress datasets so that they fit into some predefined amount of memory.
Default configuration when set to True:

.. code-block:: python

    {"memory_allocation": 0.1, "methods": ["precision"]}

You can also pass your own configuration with the same keys, choosing from the available "methods". The available options are described here:
- memory_allocation:
By default, we attempt to fit the dataset into 0.1 * memory_limit. This float value can be set with "memory_allocation": 0.1. We also allow for specifying absolute memory in MB, e.g. 10MB is "memory_allocation": 10. The memory used by the dataset is checked after each reduction method is performed. If the dataset fits into the allocated memory, any further methods listed in "methods" will not be performed.
- methods:
We currently provide the following methods for reducing the dataset size. These can be provided in a list and are performed in the order given.
- "precision": We reduce floating point precision as follows:
np.float128 -> np.float64
np.float96 -> np.float64
np.float64 -> np.float32
pandas DataFrames are reduced using the downcast option of pd.to_numeric to the lowest possible precision.
- "subsample": We subsample the data such that it fits directly into the memory allocation (memory_allocation * memory_limit). Therefore, this should likely be the last method listed in "methods". Subsampling takes into account classification labels and stratifies accordingly. We guarantee that at least one occurrence of each label is included in the sampled set.
- Returns:
self
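Example (an end-to-end sketch of search(); sklearn is used only to obtain toy data and is an assumption of this example, and the time limits are deliberately small):

.. code-block:: python

    import sklearn.datasets
    import sklearn.model_selection
    from autoPyTorch.api.tabular_classification import TabularClassificationTask

    X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
        X, y, random_state=1)

    api = TabularClassificationTask(seed=42)
    api.search(
        optimize_metric='accuracy',
        X_train=X_train, y_train=y_train,
        X_test=X_test, y_test=y_test,
        budget_type='epochs', min_budget=5, max_budget=50,
        total_walltime_limit=300,      # five minutes of overall search
        func_eval_time_limit_secs=50,  # per-pipeline fit limit
    )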
- set_pipeline_options(**pipeline_options_kwargs: Any) None ¶
Checks whether the arguments are valid and then sets them in the current pipeline configuration.
- Args:
**pipeline_options_kwargs: Valid config options include “num_run”, “device”, “budget_type”, “epochs”, “runtime”, “torch_num_threads”, “early_stopping”, “use_tensorboard_logger”, “metrics_during_training”
- Returns:
None
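Example (a round-trip sketch for the pipeline-option accessors, using only option names listed above; the values are illustrative):

.. code-block:: python

    api.set_pipeline_options(device='cuda', early_stopping=10)
    print(api.get_pipeline_options())  # now reflects the updated entries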
- show_models() str ¶
Returns a Markdown table containing details about the final ensemble/configuration.
- Returns:
- str:
Markdown table of models.
- sprint_statistics() str ¶
Returns a formatted string with statistics about the SMAC search.
These statistics include:
Optimisation Metric
Best Optimisation score achieved by individual pipelines
Total number of target algorithm runs
Total number of successful target algorithm runs
Total number of crashed target algorithm runs
Total number of target algorithm runs that exceeded the time limit
Total number of target algorithm runs that exceeded the memory limit
- Returns:
- (str):
Formatted string with statistics
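Example (after search() finishes, the reporting helpers above can be combined as follows):

.. code-block:: python

    print(api.sprint_statistics())  # formatted summary of the SMAC search
    print(api.show_models())        # Markdown table of the final ensemble
    incumbent_config, incumbent_info = api.get_incumbent_results()
    print(incumbent_config)         # best single configuration found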
Regression¶
- class autoPyTorch.api.tabular_regression.TabularRegressionTask(seed: int = 1, n_jobs: int = 1, n_threads: int = 1, logging_config: Optional[Dict] = None, ensemble_size: int = 50, ensemble_nbest: int = 50, max_models_on_disc: int = 50, temporary_directory: Optional[str] = None, output_directory: Optional[str] = None, delete_tmp_folder_after_terminate: bool = True, delete_output_folder_after_terminate: bool = True, include_components: Optional[Dict[str, Any]] = None, exclude_components: Optional[Dict[str, Any]] = None, resampling_strategy: Union[CrossValTypes, HoldoutValTypes, NoResamplingStrategyTypes] = HoldoutValTypes.holdout_validation, resampling_strategy_args: Optional[Dict[str, Any]] = None, backend: Optional[Backend] = None, search_space_updates: Optional[HyperparameterSearchSpaceUpdates] = None)[source]¶
Tabular Regression API for the pipelines.
- Args:
- seed (int: default=1):
seed to be used for reproducibility.
- n_jobs (int: default=1):
number of consecutive processes to spawn.
- n_threads (int: default=1):
number of threads to use for each process.
- logging_config (Optional[Dict]):
Specifies the configuration for logging. If None, it is loaded from logging.yaml.
- ensemble_size (int: default=50):
Number of models added to the ensemble built by Ensemble selection from libraries of models. Models are drawn with replacement.
- ensemble_nbest (int: default=50):
Only consider the ensemble_nbest best models when building the ensemble.
- max_models_on_disc (int: default=50):
Maximum number of models saved to disc. Also controls the size of the ensemble, as any additional models will be deleted. Must be greater than or equal to 1.
- temporary_directory (str):
Folder to store configuration output and log file
- output_directory (str):
Folder to store predictions for optional test set
- delete_tmp_folder_after_terminate (bool):
Determines whether to delete the temporary directory when finished.
- delete_output_folder_after_terminate (bool):
Determines whether to delete the output directory when finished.
- include_components (Optional[Dict[str, Any]]):
Dictionary containing components to include. Key is the node name and Value is an Iterable of the names of the components to include. Only these components will be present in the search space.
- exclude_components (Optional[Dict[str, Any]]):
Dictionary containing components to exclude. Key is the node name and Value is an Iterable of the names of the components to exclude. All except these components will be present in the search space.
- resampling_strategy (RESAMPLING_STRATEGIES: default=HoldoutValTypes.holdout_validation):
Strategy to split the training data.
- resampling_strategy_args (Optional[Dict[str, Any]]):
Arguments required for the chosen resampling strategy. If None, uses the default values provided in DEFAULT_RESAMPLING_PARAMETERS in `datasets/resampling_strategy.py`.
- search_space_updates (Optional[HyperparameterSearchSpaceUpdates]):
Search space updates that can be used to modify the search space of particular components or choice modules of the pipeline.
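The regression API mirrors the classification one. Example (a compact end-to-end sketch on a toy dataset; sklearn usage and the small time limit are assumptions of this example):

.. code-block:: python

    import sklearn.datasets
    import sklearn.model_selection
    from autoPyTorch.api.tabular_regression import TabularRegressionTask

    X, y = sklearn.datasets.load_diabetes(return_X_y=True)
    X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
        X, y, random_state=1)

    api = TabularRegressionTask(seed=42)
    api.search(optimize_metric='r2', X_train=X_train, y_train=y_train,
               X_test=X_test, y_test=y_test, total_walltime_limit=300)

    y_pred = api.predict(X_test)
    print(api.score(y_pred, y_test))  # e.g. {'r2': 0.45}; the value depends on the run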
- build_pipeline(dataset_properties: Dict[str, Union[int, float, str, List, bool, Tuple]], include_components: Optional[Dict[str, Any]] = None, exclude_components: Optional[Dict[str, Any]] = None, search_space_updates: Optional[HyperparameterSearchSpaceUpdates] = None) TabularRegressionPipeline [source]¶
Build a pipeline according to the current task and the passed dataset properties.
- Args:
- dataset_properties (Dict[str, Any]):
Characteristics of the dataset to guide the pipeline choices of components
- include_components (Optional[Dict[str, Any]]):
Dictionary containing components to include. Key is the node name and Value is an Iterable of the names of the components to include. Only these components will be present in the search space.
- exclude_components (Optional[Dict[str, Any]]):
Dictionary containing components to exclude. Key is the node name and Value is an Iterable of the names of the components to exclude. All except these components will be present in the search space.
- search_space_updates (Optional[HyperparameterSearchSpaceUpdates]):
Search space updates that can be used to modify the search space of particular components or choice modules of the pipeline
- Returns:
TabularRegressionPipeline:
- fit_pipeline(configuration: Configuration, *, dataset: Optional[BaseDataset] = None, X_train: Optional[Union[List, DataFrame, ndarray]] = None, y_train: Optional[Union[List, DataFrame, ndarray]] = None, X_test: Optional[Union[List, DataFrame, ndarray]] = None, y_test: Optional[Union[List, DataFrame, ndarray]] = None, dataset_name: Optional[str] = None, resampling_strategy: Optional[Union[CrossValTypes, HoldoutValTypes, NoResamplingStrategyTypes]] = None, resampling_strategy_args: Optional[Dict[str, Any]] = None, run_time_limit_secs: int = 60, memory_limit: Optional[int] = None, eval_metric: Optional[str] = None, all_supported_metrics: bool = False, budget_type: Optional[str] = None, include_components: Optional[Dict[str, Any]] = None, exclude_components: Optional[Dict[str, Any]] = None, search_space_updates: Optional[HyperparameterSearchSpaceUpdates] = None, budget: Optional[float] = None, pipeline_options: Optional[Dict] = None, disable_file_output: Optional[List[Union[str, DisableFileOutputParameters]]] = None) Tuple[Optional[BasePipeline], RunInfo, RunValue, BaseDataset] ¶
Fit a pipeline on the given task for the given budget. A pipeline configuration can be specified; if None, the default configuration is used.
Fit uses the estimator's pipeline_options attribute, which the user can interact with via the get_pipeline_options()/set_pipeline_options() methods.
- Args:
- configuration (Configuration):
configuration to fit the pipeline with.
- dataset (BaseDataset):
An object of the appropriate child class of BaseDataset, that will be used to fit the pipeline
- X_train, y_train, X_test, y_test: Union[np.ndarray, List, pd.DataFrame]
A pair of features (X_train) and targets (y_train) used to fit a pipeline. Additionally, a holdout pair (X_test, y_test) can be provided to track the generalization performance at each stage.
- dataset_name (Optional[str]):
Name of the dataset, used as the experiment name. If None, a random value is used.
- resampling_strategy (Optional[RESAMPLING_STRATEGIES]):
Strategy to split the training data. If None, uses HoldoutValTypes.holdout_validation.
- resampling_strategy_args (Optional[Dict[str, Any]]):
Arguments required for the chosen resampling strategy. If None, uses the default values provided in DEFAULT_RESAMPLING_PARAMETERS in `datasets/resampling_strategy.py`.
- run_time_limit_secs (int: default=60):
Time limit for a single call to the machine learning model. Model fitting will be terminated if the machine learning algorithm runs over the time limit. Set this value high enough so that typical machine learning algorithms can be fit on the training data.
- memory_limit (Optional[int]):
Memory limit in MB for the machine learning algorithm. autopytorch will stop fitting the machine learning algorithm if it tries to allocate more than memory_limit MB. If None is provided, no memory limit is set. In case of multi-processing, memory_limit will be per job. This memory limit also applies to the ensemble creation process.
- eval_metric (Optional[str]):
Name of the metric that is used to evaluate a pipeline.
- all_supported_metrics (bool: default=False):
If True, all metrics supported by the current task will be calculated for each pipeline and results will be available via cv_results.
- budget_type (str):
Type of budget to be used when fitting the pipeline. It can be one of:
- epochs: The training of each pipeline will be terminated after a number of epochs have passed. This number of epochs is determined by the budget argument of this method.
- runtime: The training of each pipeline will be terminated after a number of seconds have passed. This number of seconds is determined by the budget argument of this method. The overall fitting time of a pipeline is controlled by func_eval_time_limit_secs. ‘runtime’ only controls the allocated time to train a pipeline, but it does not consider the overall time it takes to create a pipeline (data loading and preprocessing, other i/o operations, etc.).
- include_components (Optional[Dict[str, Any]]):
Dictionary containing components to include. Key is the node name and Value is an Iterable of the names of the components to include. Only these components will be present in the search space.
- exclude_components (Optional[Dict[str, Any]]):
Dictionary containing components to exclude. Key is the node name and Value is an Iterable of the names of the components to exclude. All except these components will be present in the search space.
- search_space_updates (Optional[HyperparameterSearchSpaceUpdates]):
Updates to be made to the hyperparameter search space of the pipeline
- budget (Optional[float]):
Budget to fit a single run of the pipeline. If not provided, uses the default in the pipeline config
- pipeline_options (Optional[Dict]):
Valid config options include “device”, “torch_num_threads”, “early_stopping”, “use_tensorboard_logger”, “metrics_during_training”
- disable_file_output (Optional[List[Union[str, DisableFileOutputParameters]]]):
Used as a list to pass more fine-grained information on what to save. Must be a member of DisableFileOutputParameters. Allowed elements in the list are:
- y_optimization:
do not save the predictions for the optimization set, which would later on be used to build an ensemble. Note that SMAC optimizes a metric evaluated on the optimization set.
- pipeline:
do not save any individual pipeline files
- pipelines:
In case of cross validation, disables saving the joint model of the pipelines fit on each fold.
- y_test:
do not save the predictions for the test set.
- all:
do not save any of the above.
For more information check autoPyTorch.evaluation.utils.DisableFileOutputParameters.
- Returns:
- (BasePipeline):
fitted pipeline
- (RunInfo):
Run information
- (RunValue):
Result of fitting the pipeline
- (BaseDataset):
Dataset created from the given tensors
- get_dataset(X_train: Union[List, DataFrame, ndarray], y_train: Union[List, DataFrame, ndarray], X_test: Optional[Union[List, DataFrame, ndarray]] = None, y_test: Optional[Union[List, DataFrame, ndarray]] = None, resampling_strategy: Optional[Union[CrossValTypes, HoldoutValTypes, NoResamplingStrategyTypes]] = None, resampling_strategy_args: Optional[Dict[str, Any]] = None, dataset_name: Optional[str] = None, dataset_compression: Optional[Dict[str, Union[int, float, List[str]]]] = None, **kwargs: Any) BaseDataset ¶
Returns an object of a child class of BaseDataset according to the current task.
- Args:
- X_train (Union[List, pd.DataFrame, np.ndarray]):
Training feature set.
- y_train (Union[List, pd.DataFrame, np.ndarray]):
Training target set.
- X_test (Optional[Union[List, pd.DataFrame, np.ndarray]]):
Testing feature set
- y_test (Optional[Union[List, pd.DataFrame, np.ndarray]]):
Testing target set
- resampling_strategy (Optional[RESAMPLING_STRATEGIES]):
Strategy to split the training data. If None, uses HoldoutValTypes.holdout_validation.
- resampling_strategy_args (Optional[Dict[str, Any]]):
Arguments required for the chosen resampling strategy. If None, uses the default values provided in DEFAULT_RESAMPLING_PARAMETERS in `datasets/resampling_strategy.py`.
- dataset_name (Optional[str]):
Name of the dataset, used as the experiment name.
- dataset_compression (Optional[DatasetCompressionSpec]):
We compress datasets so that they fit into some predefined amount of memory. Note that you can pass your own configuration with the same keys, choosing from the available "methods". The available options are described here:
- memory_allocation:
Absolute memory in MB, e.g. 10MB is "memory_allocation": 10. It can be either float or int. The memory used by the dataset is checked after each reduction method is performed. If the dataset fits into the allocated memory, any further methods listed in "methods" will not be performed.
- methods:
We currently provide the following methods for reducing the dataset size. These can be provided in a list and are performed in the order given.
- "precision": We reduce floating point precision as follows:
np.float128 -> np.float64
np.float96 -> np.float64
np.float64 -> np.float32
pandas DataFrames are reduced using the downcast option of pd.to_numeric to the lowest possible precision.
- "subsample": We subsample the data such that it fits directly into the memory allocation (memory_allocation * memory_limit). Therefore, this should likely be the last method listed in "methods". Subsampling takes into account classification labels and stratifies accordingly. We guarantee that at least one occurrence of each label is included in the sampled set.
- kwargs (Any):
Can be used to pass task-specific dataset arguments. Currently supports passing feat_types for tabular tasks, which specifies whether a feature is ‘numerical’ or ‘categorical’.
- Returns:
- BaseDataset:
the dataset object
- get_incumbent_results(include_traditional: bool = False) Tuple[Configuration, Dict[str, Union[int, str, float]]] ¶
Get the incumbent configuration and the corresponding results.
- Args:
- include_traditional (bool):
Whether to include results from traditional pipelines
- Returns:
- Configuration (CS.Configuration):
The incumbent configuration
- Dict[str, Union[int, str, float]]:
Additional information about the run of the incumbent configuration.
- get_pipeline_options() dict ¶
Returns the current pipeline configuration.
- get_search_results() SearchResults ¶
Get the interface to obtain the search results easily.
- get_search_space(dataset: Optional[BaseDataset] = None) ConfigurationSpace ¶
Returns the current search space as a ConfigurationSpace object.
- plot_perf_over_time(metric_name: str, ax: Optional[Axes] = None, plot_setting_params: PlotSettingParams = PlotSettingParams(n_points=20, xscale='linear', yscale='linear', xlabel=None, ylabel=None, title=None, title_kwargs={}, xlim=None, ylim=None, grid=True, legend=True, legend_kwargs={}, show=False, figname=None, figsize=None, savefig_kwargs={}), color_label_settings: ColorLabelSettings = ColorLabelSettings(single_train=('red', None), single_opt=('blue', None), single_test=('green', None), ensemble_train=('brown', None), ensemble_test=('purple', None)), *args: Any, **kwargs: Any) None ¶
Visualize the performance over time using matplotlib. The plot-related arguments are based on matplotlib; please refer to the matplotlib documentation for more details.
- Args:
- metric_name (str):
The name of the metric to visualize. The names are available in
autoPyTorch.metrics.CLASSIFICATION_METRICS
autoPyTorch.metrics.REGRESSION_METRICS
- ax (Optional[plt.Axes]):
axis to plot (subplots of matplotlib). If None, it will be created automatically.
- plot_setting_params (PlotSettingParams):
Parameters for the plot.
- color_label_settings (ColorLabelSettings):
The settings of a pair of color and label for each plot.
- args, kwargs (Any):
Arguments for the ax.plot.
- Note:
You might need to run export DISPLAY=:0.0 if you are using a non-GUI environment.
- predict(X_test: ndarray, batch_size: Optional[int] = None, n_jobs: int = 1) ndarray [source]¶
Generate the estimator predictions based on the given examples from the test set.
- Args:
- X_test (np.ndarray):
The test set examples.
- Returns:
Array with estimator predictions.
- refit(dataset: Optional[BaseDataset] = None, X_train: Optional[Union[List, DataFrame, ndarray]] = None, y_train: Optional[Union[List, DataFrame, ndarray]] = None, X_test: Optional[Union[List, DataFrame, ndarray]] = None, y_test: Optional[Union[List, DataFrame, ndarray]] = None, dataset_name: Optional[str] = None, resampling_strategy: Union[CrossValTypes, HoldoutValTypes, NoResamplingStrategyTypes] = NoResamplingStrategyTypes.no_resampling, resampling_strategy_args: Optional[Dict[str, Any]] = None, total_walltime_limit: int = 120, run_time_limit_secs: int = 60, memory_limit: Optional[int] = None, eval_metric: Optional[str] = None, all_supported_metrics: bool = False, budget_type: Optional[str] = None, budget: Optional[float] = None, pipeline_options: Optional[Dict] = None, disable_file_output: Optional[List[Union[str, DisableFileOutputParameters]]] = None) BaseTask ¶
Fit all the models found in the ensemble on the whole training set X_train. We therefore recommend using NoResamplingStrategyTypes.no_resampling; nevertheless, refit can also use other splitting techniques such as holdout or cross-validation.
Refit uses the estimator's pipeline_options attribute, which the user can interact with via the get_pipeline_options()/set_pipeline_options() methods.
- Args:
- dataset (BaseDataset):
An object of the appropriate child class of BaseDataset, that will be used to fit the pipeline
- X_train, y_train, X_test, y_test: Union[np.ndarray, List, pd.DataFrame]
A pair of features (X_train) and targets (y_train) used to fit a pipeline. Additionally, a holdout pair (X_test, y_test) can be provided to track the generalization performance at each stage.
- dataset_name (Optional[str]):
Name of the dataset, used as the experiment name. If None, a random value is used.
- resampling_strategy (ResamplingStrategies):
Strategy to split the training data. Defaults to NoResamplingStrategyTypes.no_resampling.
- resampling_strategy_args (Optional[Dict[str, Any]]):
Arguments required for the chosen resampling strategy. If None, uses the default values provided in DEFAULT_RESAMPLING_PARAMETERS in `datasets/resampling_strategy.py`.
- total_walltime_limit (int):
Total time that can be used by all the models to be refitted. Defaults to 120.
- run_time_limit_secs (int: default=60):
Time limit for a single call to the machine learning model. Model fitting will be terminated if the machine learning algorithm runs over the time limit. Set this value high enough so that typical machine learning algorithms can be fit on the training data.
- memory_limit (Optional[int]):
Memory limit in MB for the machine learning algorithm. autopytorch will stop fitting the machine learning algorithm if it tries to allocate more than memory_limit MB. If None is provided, no memory limit is set. In case of multi-processing, memory_limit will be per job. This memory limit also applies to the ensemble creation process.
- eval_metric (Optional[str]):
Name of the metric that is used to evaluate a pipeline.
- all_supported_metrics (bool: default=False):
If True, all metrics supported by the current task will be calculated for each pipeline and results will be available via cv_results.
- budget_type (str):
Type of budget to be used when fitting the pipeline. It can be one of:
- epochs: The training of each pipeline will be terminated after a number of epochs have passed. This number of epochs is determined by the budget argument of this method.
- runtime: The training of each pipeline will be terminated after a number of seconds have passed. This number of seconds is determined by the budget argument of this method. The overall fitting time of a pipeline is controlled by func_eval_time_limit_secs. ‘runtime’ only controls the allocated time to train a pipeline, but it does not consider the overall time it takes to create a pipeline (data loading and preprocessing, other i/o operations, etc.).
- budget (Optional[float]):
Budget to fit a single run of the pipeline. If not provided, uses the default in the pipeline config
- pipeline_options (Optional[Dict]):
Valid config options include “device”, “torch_num_threads”, “early_stopping”, “use_tensorboard_logger”, “metrics_during_training”
- disable_file_output (Optional[List[Union[str, DisableFileOutputParameters]]]):
Used as a list to pass more fine-grained information on what to save. Must be a member of DisableFileOutputParameters. Allowed elements in the list are:
- y_optimization:
do not save the predictions for the optimization set, which would later on be used to build an ensemble. Note that SMAC optimizes a metric evaluated on the optimization set.
- pipeline:
do not save any individual pipeline files
- pipelines:
In case of cross validation, disables saving the joint model of the pipelines fit on each fold.
- y_test:
do not save the predictions for the test set.
- all:
do not save any of the above.
For more information check autoPyTorch.evaluation.utils.DisableFileOutputParameters.
- Returns:
self
- score(y_pred: ndarray, y_test: Union[ndarray, DataFrame]) Dict[str, float] ¶
Calculate the evaluation measure on the test set.
- Args:
- y_pred (np.ndarray):
The test predictions
- y_test (np.ndarray):
The test ground truth labels.
- Returns:
- Dict[str, float]:
Value of the evaluation metric calculated on the test set.
- search(optimize_metric: str, X_train: Optional[Union[List, DataFrame, ndarray]] = None, y_train: Optional[Union[List, DataFrame, ndarray]] = None, X_test: Optional[Union[List, DataFrame, ndarray]] = None, y_test: Optional[Union[List, DataFrame, ndarray]] = None, dataset_name: Optional[str] = None, feat_types: Optional[List[str]] = None, budget_type: str = 'epochs', min_budget: int = 5, max_budget: int = 50, total_walltime_limit: int = 100, func_eval_time_limit_secs: Optional[int] = None, enable_traditional_pipeline: bool = True, memory_limit: int = 4096, smac_scenario_args: Optional[Dict[str, Any]] = None, get_smac_object_callback: Optional[Callable] = None, all_supported_metrics: bool = True, precision: int = 32, disable_file_output: Optional[List[Union[str, DisableFileOutputParameters]]] = None, load_models: bool = True, portfolio_selection: Optional[str] = None, dataset_compression: Union[Mapping[str, Any], bool] = False) BaseTask [source]¶
Search for the best pipeline configuration for the given dataset.
search() both optimizes the machine learning models and builds an ensemble out of them using the optimizer. To disable ensembling, set ensemble_size==0.
- Args:
- X_train, y_train, X_test, y_test: Union[np.ndarray, List, pd.DataFrame]
A pair of features (X_train) and targets (y_train) used to fit a pipeline. Additionally, a holdout pair (X_test, y_test) can be provided to track the generalization performance of each stage.
- feat_types (Optional[List[str]]):
Description of the feature types of the columns. Accepts ‘numerical’ for integers and float data, and ‘categorical’ for categories, strings and bool. Defaults to None.
- optimize_metric (str):
Name of the metric that is used to evaluate a pipeline.
- budget_type (str):
Type of budget to be used when fitting the pipeline. It can be one of:
- epochs: The training of each pipeline will be terminated after a number of epochs have passed. This number of epochs is determined by the budget argument of this method.
- runtime: The training of each pipeline will be terminated after a number of seconds have passed. This number of seconds is determined by the budget argument of this method. The overall fitting time of a pipeline is controlled by func_eval_time_limit_secs. ‘runtime’ only controls the allocated time to train a pipeline, but it does not consider the overall time it takes to create a pipeline (data loading and preprocessing, other i/o operations, etc.).
budget_type determines the units of min_budget/max_budget: if budget_type==’epochs’, min_budget refers to epochs, whereas if budget_type==’runtime’, min_budget refers to seconds.
- min_budget (int):
Auto-PyTorch uses Hyperband to trade-off resources between running many pipelines at min_budget and running the top performing pipelines on max_budget. min_budget states the minimum resource allocation a pipeline should have so that we can compare and quickly discard bad performing models. For example, if the budget_type is epochs, and min_budget=5, then we will run every pipeline to a minimum of 5 epochs before performance comparison.
- max_budget (int):
Auto-PyTorch uses Hyperband to trade-off resources between running many pipelines at min_budget and running the top performing pipelines on max_budget. max_budget states the maximum resource allocation a pipeline will be run with. For example, if the budget_type is epochs, and max_budget=50, then the pipeline training will be terminated after 50 epochs.
- total_walltime_limit (int: default=100):
Time limit in seconds for the search of appropriate models. By increasing this value, autopytorch has a higher chance of finding better models.
- func_eval_time_limit_secs (Optional[int]):
Time limit for a single call to the machine learning model. Model fitting will be terminated if the machine learning algorithm runs over the time limit. Set this value high enough so that typical machine learning algorithms can be fit on the training data. When set to None, this time will automatically be set to total_walltime_limit // 2 to allow enough time to fit at least 2 individual machine learning algorithms. Set to np.inf in case no time limit is desired.
- enable_traditional_pipeline (bool: default=True):
We fit traditional machine learning algorithms (LightGBM, CatBoost, RandomForest, ExtraTrees, KNN, SVM) prior to building PyTorch neural networks. You can disable this feature by setting this flag to False. All machine learning algorithms that are fitted during search() are considered for ensemble building.
- memory_limit (int: default=4096):
Memory limit in MB for the machine learning algorithm. Autopytorch will stop fitting the machine learning algorithm if it tries to allocate more than memory_limit MB. If None is provided, no memory limit is set. In case of multi-processing, memory_limit will be per job. This memory limit also applies to the ensemble creation process.
- smac_scenario_args (Optional[Dict]):
Additional arguments inserted into the scenario of SMAC. See the SMAC documentation for a list of available arguments.
- get_smac_object_callback (Optional[Callable]):
Callback function to create an object of class smac.optimizer.smbo.SMBO. The function must accept the arguments scenario_dict, instances, num_params, runhistory, seed and ta. This is an advanced feature. Use only if you are familiar with SMAC.
- tae_func (Optional[Callable]):
TargetAlgorithm to be optimised. If None, eval_function available in autoPyTorch/evaluation/train_evaluator is used. Must be child class of AbstractEvaluator.
- all_supported_metrics (bool: default=True):
If True, all metrics supporting current task will be calculated for each pipeline and results will be available via cv_results
- precision (int: default=32):
Numeric precision used when loading ensemble data. Can be ‘16’, ‘32’ or ‘64’.
- disable_file_output (Optional[List[Union[str, DisableFileOutputParameters]]]):
Used as a list to pass more fine-grained information on what to save. Must be a member of DisableFileOutputParameters. Allowed elements in the list are:
- y_optimization:
do not save the predictions for the optimization set, which would later on be used to build an ensemble. Note that SMAC optimizes a metric evaluated on the optimization set.
- pipeline:
do not save any individual pipeline files
- pipelines:
In case of cross validation, disables saving the joint model of the pipelines fit on each fold.
- y_test:
do not save the predictions for the test set.
- all:
do not save any of the above.
For more information check autoPyTorch.evaluation.utils.DisableFileOutputParameters.
- load_models (bool: default=True):
Whether to load the models after fitting AutoPyTorch.
- portfolio_selection (Optional[str]):
This argument controls the initial configurations that AutoPyTorch uses to warm start SMAC for hyperparameter optimization. By default, no warm-starting happens. The user can provide a path to a json file containing configurations, similar to (…herepathtogreedy…). Additionally, the keyword ‘greedy’ is supported, which would use the default portfolio from AutoPyTorch Tabular.
- dataset_compression: Union[bool, Mapping[str, Any]] = True
We compress datasets so that they fit into some predefined amount of memory. Default configuration when left as True (a usage sketch follows this method's documentation):
{"memory_allocation": 0.1, "methods": ["precision"]}
You can also pass your own configuration with the same keys, choosing from the available "methods". The available options are described here:
- memory_allocation:
By default, we attempt to fit the dataset into 0.1 * memory_limit. This float value can be set with "memory_allocation": 0.1. We also allow specifying absolute memory in MB, e.g. 10MB is "memory_allocation": 10. The memory used by the dataset is checked after each reduction method is performed. If the dataset fits into the allocated memory, any further methods listed in "methods" will not be performed.
- methods:
We currently provide the following methods for reducing the dataset size. These can be provided in a list and are performed in the order given.
- "precision":
We reduce floating point precision as follows: np.float128 -> np.float64, np.float96 -> np.float64, np.float64 -> np.float32. Pandas dataframes are reduced using the downcast option of pd.to_numeric to the lowest possible precision.
- "subsample":
We subsample data such that it fits directly into the memory allocation memory_allocation * memory_limit. Therefore, this should likely be the last method listed in "methods". Subsampling takes classification labels into account and stratifies accordingly; we guarantee that at least one occurrence of each label is included in the sampled set.
- Returns:
self
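A minimal end-to-end sketch of calling search() with the budget and dataset_compression arguments described above; the sklearn dataset and all limits are illustrative, not recommendations:

    import sklearn.datasets
    import sklearn.model_selection

    from autoPyTorch.api.tabular_classification import TabularClassificationTask

    # Illustrative data; any tabular classification dataset works.
    X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, random_state=1)

    api = TabularClassificationTask(seed=1)
    api.search(
        X_train=X_train, y_train=y_train,
        X_test=X_test, y_test=y_test,
        optimize_metric='accuracy',
        budget_type='epochs', min_budget=5, max_budget=50,  # Hyperband budgets in epochs
        total_walltime_limit=300,
        func_eval_time_limit_secs=50,
        dataset_compression={"memory_allocation": 0.2, "methods": ["precision", "subsample"]},
    )
    y_pred = api.predict(X_test)
    print(api.score(y_pred, y_test))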
- set_pipeline_options(**pipeline_options_kwargs: Any) None ¶
Checks whether the given arguments are valid and then sets them in the current pipeline configuration.
- Args:
**pipeline_options_kwargs: Valid config options include “num_run”, “device”, “budget_type”, “epochs”, “runtime”, “torch_num_threads”, “early_stopping”, “use_tensorboard_logger”, “metrics_during_training”
- Returns:
None
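For instance, a small sketch of adjusting the pipeline configuration before a subsequent fit; the option values are illustrative:

    from autoPyTorch.api.tabular_classification import TabularClassificationTask

    api = TabularClassificationTask()
    # Only the option names listed above are accepted; invalid keys are rejected.
    api.set_pipeline_options(device='cuda', torch_num_threads=2, early_stopping=20)
    print(api.get_pipeline_options())  # the updated pipeline configuration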
- show_models() str ¶
Returns a Markdown table containing details about the final ensemble/configuration.
- Returns:
- str:
Markdown table of models.
- sprint_statistics() str ¶
Prints statistics about the SMAC search.
These statistics include:
Optimisation Metric
Best Optimisation score achieved by individual pipelines
Total number of target algorithm runs
Total number of successful target algorithm runs
Total number of crashed target algorithm runs
Total number of target algorithm runs that exceeded the time limit
Total number of target algorithm runs that exceeded the memory limit
- Returns:
- (str):
Formatted string with statistics
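For example, a short sketch of inspecting a finished run; it assumes api is a task on which search() has already completed:

    # `api` is a TabularClassificationTask whose search() has finished.
    print(api.sprint_statistics())  # formatted SMAC run statistics
    print(api.show_models())        # Markdown table describing the final ensemble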
Time Series Forecasting¶
- class autoPyTorch.api.time_series_forecasting.TimeSeriesForecastingTask(seed: int = 1, n_jobs: int = 1, logging_config: Optional[Dict] = None, ensemble_size: int = 50, ensemble_nbest: int = 50, max_models_on_disc: int = 50, temporary_directory: Optional[str] = None, output_directory: Optional[str] = None, delete_tmp_folder_after_terminate: bool = True, delete_output_folder_after_terminate: bool = True, include_components: Optional[Dict] = None, exclude_components: Optional[Dict] = None, resampling_strategy: Union[CrossValTypes, HoldoutValTypes, NoResamplingStrategyTypes] = HoldoutValTypes.time_series_hold_out_validation, resampling_strategy_args: Optional[Dict[str, Any]] = None, backend: Optional[Backend] = None, search_space_updates: Optional[HyperparameterSearchSpaceUpdates] = None)[source]¶
Time Series Forecasting API to the pipelines.
- Args:
- seed (int):
seed to be used for reproducibility.
- n_jobs (int), (default=1):
number of consecutive processes to spawn.
- logging_config (Optional[Dict]):
specifies configuration for logging, if None, it is loaded from the logging.yaml
- ensemble_size (int), (default=50):
Number of models added to the ensemble built by Ensemble selection from libraries of models. Models are drawn with replacement.
- ensemble_nbest (int), (default=50):
only consider the ensemble_nbest models to build the ensemble
- max_models_on_disc (int), (default=50):
Maximum number of models saved to disc. Also, controls the size of the ensemble as any additional models will be deleted. Must be greater than or equal to 1.
- temporary_directory (str):
folder to store configuration output and log file
- output_directory (str):
folder to store predictions for optional test set
- delete_tmp_folder_after_terminate (bool):
determines whether to delete the temporary directory, when finished
- include_components (Optional[Dict]):
If None, all possible components are used. Otherwise specifies set of components to use.
- exclude_components (Optional[Dict]):
If None, all possible components are used. Otherwise specifies set of components not to use. Incompatible with include_components.
- build_pipeline(dataset_properties: Dict[str, Union[int, float, str, List, bool, Tuple]], include_components: Optional[Dict[str, Any]] = None, exclude_components: Optional[Dict[str, Any]] = None, search_space_updates: Optional[HyperparameterSearchSpaceUpdates] = None) TimeSeriesForecastingPipeline [source]¶
Build pipeline according to current task and for the passed dataset properties
- Args:
- dataset_properties (Dict[str, Any]):
Characteristics of the dataset to guide the pipeline choices of components
- include_components (Optional[Dict[str, Any]]):
Dictionary containing components to include. Key is the node name and Value is an Iterable of the names of the components to include. Only these components will be present in the search space.
- exclude_components (Optional[Dict[str, Any]]):
Dictionary containing components to exclude. Key is the node name and Value is an Iterable of the names of the components to exclude. All except these components will be present in the search space.
- search_space_updates (Optional[HyperparameterSearchSpaceUpdates]):
Search space updates that can be used to modify the search space of particular components or choice modules of the pipeline
- Returns:
- TimeSeriesForecastingPipeline:
Pipeline compatible with the given dataset properties.
- fit_pipeline(configuration: Configuration, *, dataset: Optional[BaseDataset] = None, X_train: Optional[Union[List, DataFrame, ndarray]] = None, y_train: Optional[Union[List, DataFrame, ndarray]] = None, X_test: Optional[Union[List, DataFrame, ndarray]] = None, y_test: Optional[Union[List, DataFrame, ndarray]] = None, dataset_name: Optional[str] = None, resampling_strategy: Optional[Union[CrossValTypes, HoldoutValTypes, NoResamplingStrategyTypes]] = None, resampling_strategy_args: Optional[Dict[str, Any]] = None, run_time_limit_secs: int = 60, memory_limit: Optional[int] = None, eval_metric: Optional[str] = None, all_supported_metrics: bool = False, budget_type: Optional[str] = None, include_components: Optional[Dict[str, Any]] = None, exclude_components: Optional[Dict[str, Any]] = None, search_space_updates: Optional[HyperparameterSearchSpaceUpdates] = None, budget: Optional[float] = None, pipeline_options: Optional[Dict] = None, disable_file_output: Optional[List[Union[str, DisableFileOutputParameters]]] = None) Tuple[Optional[BasePipeline], RunInfo, RunValue, BaseDataset] ¶
Fit a pipeline on the given task for the given budget. A pipeline configuration can be specified; if None, the default configuration is used.
Fit uses the estimator pipeline_options attribute, with which the user can interact via the get_pipeline_options()/set_pipeline_options() methods.
- Args:
- configuration (Configuration):
configuration to fit the pipeline with.
- dataset (BaseDataset):
An object of the appropriate child class of BaseDataset, that will be used to fit the pipeline
- X_train, y_train, X_test, y_test: Union[np.ndarray, List, pd.DataFrame]
A pair of features (X_train) and targets (y_train) used to fit a pipeline. Additionally, a holdout of these pairs (X_test, y_test) can be provided to track the generalization performance of each stage.
- dataset_name (Optional[str]):
Name of the dataset, used as the experiment name. If None, a random value is used.
- resampling_strategy (Optional[RESAMPLING_STRATEGIES]):
Strategy to split the training data. if None, uses HoldoutValTypes.holdout_validation.
- resampling_strategy_args (Optional[Dict[str, Any]]):
Arguments required for the chosen resampling strategy. If None, uses the default values provided in DEFAULT_RESAMPLING_PARAMETERS in
`datasets/resampling_strategy.py`
.
- run_time_limit_secs (int: default=60):
Time limit for a single call to the machine learning model. Model fitting will be terminated if the machine learning algorithm runs over the time limit. Set this value high enough so that typical machine learning algorithms can be fit on the training data.
- memory_limit (Optional[int]):
Memory limit in MB for the machine learning algorithm. autopytorch will stop fitting the machine learning algorithm if it tries to allocate more than memory_limit MB. If None is provided, no memory limit is set. In case of multi-processing, memory_limit will be per job. This memory limit also applies to the ensemble creation process.
- eval_metric (Optional[str]):
Name of the metric that is used to evaluate a pipeline.
- all_supported_metrics (bool: default=False):
If True, all metrics supporting current task will be calculated for each pipeline and results will be available via cv_results
- budget_type (str):
Type of budget to be used when fitting the pipeline. It can be one of:
- epochs: The training of each pipeline will be terminated after
a number of epochs have passed. This number of epochs is determined by the budget argument of this method.
- runtime: The training of each pipeline will be terminated after
a number of seconds have passed. This number of seconds is determined by the budget argument of this method. The overall fitting time of a pipeline is controlled by func_eval_time_limit_secs. ‘runtime’ only controls the allocated time to train a pipeline, but it does not consider the overall time it takes to create a pipeline (data loading and preprocessing, other i/o operations, etc.).
- include_components (Optional[Dict[str, Any]]):
Dictionary containing components to include. Key is the node name and Value is an Iterable of the names of the components to include. Only these components will be present in the search space.
- exclude_components (Optional[Dict[str, Any]]):
Dictionary containing components to exclude. Key is the node name and Value is an Iterable of the names of the components to exclude. All except these components will be present in the search space.
- search_space_updates (Optional[HyperparameterSearchSpaceUpdates]):
Updates to be made to the hyperparameter search space of the pipeline
- budget (Optional[float]):
Budget to fit a single run of the pipeline. If not provided, uses the default in the pipeline config
- pipeline_options (Optional[Dict]):
Valid config options include “device”, “torch_num_threads”, “early_stopping”, “use_tensorboard_logger”, “metrics_during_training”
- disable_file_output (Optional[List[Union[str, DisableFileOutputParameters]]]):
Used as a list to pass more fine-grained information on what to save. Must be a member of DisableFileOutputParameters. Allowed elements in the list are:
- y_optimization:
do not save the predictions for the optimization set, which would later on be used to build an ensemble. Note that SMAC optimizes a metric evaluated on the optimization set.
- pipeline:
do not save any individual pipeline files
- pipelines:
In case of cross validation, disables saving the joint model of the pipelines fit on each fold.
- y_test:
do not save the predictions for the test set.
- all:
do not save any of the above.
For more information check autoPyTorch.evaluation.utils.DisableFileOutputParameters.
- Returns:
- (BasePipeline):
fitted pipeline
- (RunInfo):
Run information
- (RunValue):
Result of fitting the pipeline
- (BaseDataset):
Dataset created from the given tensors
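A sketch of evaluating a single sampled configuration without running a full search; shown with the tabular classification task for brevity, since fit_pipeline, get_dataset and get_search_space are shared across the task classes (data and limits are illustrative):

    import sklearn.datasets

    from autoPyTorch.api.tabular_classification import TabularClassificationTask

    X, y = sklearn.datasets.load_iris(return_X_y=True)

    api = TabularClassificationTask()
    dataset = api.get_dataset(X_train=X, y_train=y)
    # Sample one configuration from the search space built for this dataset.
    configuration = api.get_search_space(dataset).sample_configuration()

    pipeline, run_info, run_value, dataset = api.fit_pipeline(
        configuration=configuration,
        dataset=dataset,
        run_time_limit_secs=60,
    )
    print(run_value)  # cost, time and status of this single evaluation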
- get_dataset(X_train: Union[List, DataFrame, ndarray], y_train: Union[List, DataFrame, ndarray], X_test: Optional[Union[List, DataFrame, ndarray]] = None, y_test: Optional[Union[List, DataFrame, ndarray]] = None, resampling_strategy: Optional[Union[CrossValTypes, HoldoutValTypes, NoResamplingStrategyTypes]] = None, resampling_strategy_args: Optional[Dict[str, Any]] = None, dataset_name: Optional[str] = None, dataset_compression: Optional[Dict[str, Union[int, float, List[str]]]] = None, **kwargs: Any) BaseDataset ¶
Returns an object of a child class of BaseDataset according to the current task.
- Args:
- X_train (Union[List, pd.DataFrame, np.ndarray]):
Training feature set.
- y_train (Union[List, pd.DataFrame, np.ndarray]):
Training target set.
- X_test (Optional[Union[List, pd.DataFrame, np.ndarray]]):
Testing feature set
- y_test (Optional[Union[List, pd.DataFrame, np.ndarray]]):
Testing target set
- resampling_strategy (Optional[RESAMPLING_STRATEGIES]):
Strategy to split the training data. if None, uses HoldoutValTypes.holdout_validation.
- resampling_strategy_args (Optional[Dict[str, Any]]):
arguments required for the chosen resampling strategy. If None, uses the default values provided in DEFAULT_RESAMPLING_PARAMETERS in
`datasets/resampling_strategy.py`
.
- dataset_name (Optional[str]):
name of the dataset, used as experiment name.
- dataset_compression (Optional[DatasetCompressionSpec]):
We compress datasets so that they fit into some predefined amount of memory. You can pass your own configuration with the keys "memory_allocation" and "methods":
- memory_allocation:
Can be either a float or an int. Absolute memory in MB, e.g. 10MB is "memory_allocation": 10; a float is interpreted as a fraction of memory_limit. The memory used by the dataset is checked after each reduction method is performed. If the dataset fits into the allocated memory, any further methods listed in "methods" will not be performed.
- methods:
We currently provide the following methods for reducing the dataset size. These can be provided in a list and are performed in the order given.
- "precision":
We reduce floating point precision as follows: np.float128 -> np.float64, np.float96 -> np.float64, np.float64 -> np.float32. Pandas dataframes are reduced using the downcast option of pd.to_numeric to the lowest possible precision.
- "subsample":
We subsample data such that it fits directly into the memory allocation memory_allocation * memory_limit. Therefore, this should likely be the last method listed in "methods". Subsampling takes classification labels into account and stratifies accordingly; we guarantee that at least one occurrence of each label is included in the sampled set.
- kwargs (Any):
can be used to pass task specific dataset arguments. Currently supports passing feat_types for tabular tasks which specifies whether a feature is ‘numerical’ or ‘categorical’.
- Returns:
- BaseDataset:
the dataset object
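A sketch of building the dataset object explicitly; feat_types is the task-specific kwarg mentioned above and applies to the tabular tasks:

    import sklearn.datasets

    from autoPyTorch.api.tabular_classification import TabularClassificationTask

    X, y = sklearn.datasets.load_iris(return_X_y=True)

    api = TabularClassificationTask()
    dataset = api.get_dataset(
        X_train=X,
        y_train=y,
        feat_types=['numerical'] * X.shape[1],  # one entry per feature
    )
    print(type(dataset).__name__)  # a child class of BaseDataset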
- get_incumbent_results(include_traditional: bool = False) Tuple[Configuration, Dict[str, Union[int, str, float]]] ¶
Get Incumbent config and the corresponding results
- Args:
- include_traditional (bool):
Whether to include results from traditional pipelines
- Returns:
- Configuration (CS.ConfigurationSpace):
The incumbent configuration
- Dict[str, Union[int, str, float]]:
Additional information about the run of the incumbent configuration.
- get_pipeline_options() dict ¶
Returns the current pipeline configuration.
- get_search_results() SearchResults ¶
Get the interface to obtain the search results easily.
- get_search_space(dataset: Optional[BaseDataset] = None) ConfigurationSpace ¶
Returns the current search space as ConfigurationSpace object.
- plot_perf_over_time(metric_name: str, ax: Optional[Axes] = None, plot_setting_params: PlotSettingParams = PlotSettingParams(n_points=20, xscale='linear', yscale='linear', xlabel=None, ylabel=None, title=None, title_kwargs={}, xlim=None, ylim=None, grid=True, legend=True, legend_kwargs={}, show=False, figname=None, figsize=None, savefig_kwargs={}), color_label_settings: ColorLabelSettings = ColorLabelSettings(single_train=('red', None), single_opt=('blue', None), single_test=('green', None), ensemble_train=('brown', None), ensemble_test=('purple', None)), *args: Any, **kwargs: Any) None ¶
Visualize the performance over time using matplotlib. The plot related arguments are based on matplotlib. Please refer to the matplotlib documentation for more details.
- Args:
- metric_name (str):
The name of metric to visualize. The names are available in
autoPyTorch.metrics.CLASSIFICATION_METRICS
autoPyTorch.metrics.REGRESSION_METRICS
- ax (Optional[plt.Axes]):
axis to plot (subplots of matplotlib). If None, it will be created automatically.
- plot_setting_params (PlotSettingParams):
Parameters for the plot.
- color_label_settings (ColorLabelSettings):
The settings of a pair of color and label for each plot.
- args, kwargs (Any):
Arguments for the ax.plot.
- Note:
You might need to run export DISPLAY=:0.0 if you are using non-GUI based environment.
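A plotting sketch; it assumes api is a task on which search() has finished and that PlotSettingParams is importable from autoPyTorch.utils.results_visualizer (treat the import path as an assumption):

    from autoPyTorch.utils.results_visualizer import PlotSettingParams  # assumed import path

    params = PlotSettingParams(
        n_points=20,
        xscale='log', yscale='linear',
        xlabel='runtime [s]', ylabel='accuracy',  # 'accuracy' fits a classification task
        show=True,
    )
    # `api` is a task on which search() has already been run.
    api.plot_perf_over_time(metric_name='accuracy', plot_setting_params=params)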
- predict(X_test: Optional[List[Union[ndarray, DataFrame, TimeSeriesSequence]]] = None, batch_size: Optional[int] = None, n_jobs: int = 1, past_targets: Optional[List[ndarray]] = None, future_targets: Optional[List[Union[ndarray, DataFrame, TimeSeriesSequence]]] = None, start_times: List[DatetimeIndex] = []) ndarray [source]¶
Predict the future variables.
- Args:
- X_test (List[Union[np.ndarray, pd.DataFrame, TimeSeriesSequence]])
if it is a list of TimeSeriesSequence, then it is the series to be forecasted. Otherwise, it is the known future features
- batch_size: Optional[int]
batch size
- n_jobs (int):
number of jobs
- past_targets (Optional[List[np.ndarray]])
past observed targets, required when X_test is not a list of TimeSeriesSequence
- future_targets (Optional[List[Union[np.ndarray, pd.DataFrame, TimeSeriesSequence]]]):
future targets (test sets)
- start_times (List[pd.DatetimeIndex]):
starting time of each series when they are sampled. If it is not given, we simply start with a fixed timestamp.
- Returns:
- np.ndarray:
Predicted values, with shape (B, H, N), where B is the number of series, H is the forecasting horizon (n_prediction_steps) and N is the number of targets.
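To make the output contract concrete, a tiny indexing illustration with dummy values:

    import numpy as np

    # Suppose predict() returned forecasts for B=3 series, H=6 steps, N=1 target.
    pred = np.zeros((3, 6, 1))        # same layout as the return value above
    first_series = pred[0, :, 0]      # the 6-step forecast of the first series
    assert first_series.shape == (6,)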
- refit(dataset: Optional[BaseDataset] = None, X_train: Optional[Union[List, DataFrame, ndarray]] = None, y_train: Optional[Union[List, DataFrame, ndarray]] = None, X_test: Optional[Union[List, DataFrame, ndarray]] = None, y_test: Optional[Union[List, DataFrame, ndarray]] = None, dataset_name: Optional[str] = None, resampling_strategy: Union[CrossValTypes, HoldoutValTypes, NoResamplingStrategyTypes] = NoResamplingStrategyTypes.no_resampling, resampling_strategy_args: Optional[Dict[str, Any]] = None, total_walltime_limit: int = 120, run_time_limit_secs: int = 60, memory_limit: Optional[int] = None, eval_metric: Optional[str] = None, all_supported_metrics: bool = False, budget_type: Optional[str] = None, budget: Optional[float] = None, pipeline_options: Optional[Dict] = None, disable_file_output: Optional[List[Union[str, DisableFileOutputParameters]]] = None) BaseTask ¶
Fit all the models found in the ensemble on the whole training set X_train; we therefore recommend using NoResamplingStrategy. Nevertheless, refit can still fit using other splitting techniques such as holdout or cross validation.
Refit uses the estimator pipeline_options attribute, with which the user can interact via the get_pipeline_options()/set_pipeline_options() methods.
- Args:
- dataset (BaseDataset):
An object of the appropriate child class of BaseDataset, that will be used to fit the pipeline
- X_train, y_train, X_test, y_test: Union[np.ndarray, List, pd.DataFrame]
A pair of features (X_train) and targets (y_train) used to fit a pipeline. Additionally, a holdout of these pairs (X_test, y_test) can be provided to track the generalization performance of each stage.
- dataset_name (Optional[str]):
Name of the dataset, used as the experiment name. If None, a random value is used.
- resampling_strategy (ResamplingStrategies):
Strategy to split the training data. Defaults to NoResamplingStrategyTypes.no_resampling.
- resampling_strategy_args (Optional[Dict[str, Any]]):
Arguments required for the chosen resampling strategy. If None, uses the default values provided in DEFAULT_RESAMPLING_PARAMETERS in
`datasets/resampling_strategy.py`
.
- total_walltime_limit (int):
Total time that can be used by all the models to be refitted. Defaults to 120.
- run_time_limit_secs (int: default=60):
Time limit for a single call to the machine learning model. Model fitting will be terminated if the machine learning algorithm runs over the time limit. Set this value high enough so that typical machine learning algorithms can be fit on the training data.
- memory_limit (Optional[int]):
Memory limit in MB for the machine learning algorithm. autopytorch will stop fitting the machine learning algorithm if it tries to allocate more than memory_limit MB. If None is provided, no memory limit is set. In case of multi-processing, memory_limit will be per job. This memory limit also applies to the ensemble creation process.
- eval_metric (Optional[str]):
Name of the metric that is used to evaluate a pipeline.
- all_supported_metrics (bool: default=False):
If True, all metrics supporting current task will be calculated for each pipeline and results will be available via cv_results
- budget_type (str):
Type of budget to be used when fitting the pipeline. It can be one of:
- epochs: The training of each pipeline will be terminated after
a number of epochs have passed. This number of epochs is determined by the budget argument of this method.
- runtime: The training of each pipeline will be terminated after
a number of seconds have passed. This number of seconds is determined by the budget argument of this method. The overall fitting time of a pipeline is controlled by func_eval_time_limit_secs. ‘runtime’ only controls the allocated time to train a pipeline, but it does not consider the overall time it takes to create a pipeline (data loading and preprocessing, other i/o operations, etc.).
- budget (Optional[float]):
Budget to fit a single run of the pipeline. If not provided, uses the default in the pipeline config
- pipeline_options (Optional[Dict]):
Valid config options include “device”, “torch_num_threads”, “early_stopping”, “use_tensorboard_logger”, “metrics_during_training”
- disable_file_output (Optional[List[Union[str, DisableFileOutputParameters]]]):
Used as a list to pass more fine-grained information on what to save. Must be a member of DisableFileOutputParameters. Allowed elements in the list are:
- y_optimization:
do not save the predictions for the optimization set, which would later on be used to build an ensemble. Note that SMAC optimizes a metric evaluated on the optimization set.
- pipeline:
do not save any individual pipeline files
- pipelines:
In case of cross validation, disables saving the joint model of the pipelines fit on each fold.
- y_test:
do not save the predictions for the test set.
- all:
do not save any of the above.
For more information check autoPyTorch.evaluation.utils.DisableFileOutputParameters.
- Returns:
self
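A refit sketch; it assumes api, X_train and y_train from an earlier search(), and imports the resampling strategy from the `datasets/resampling_strategy.py` module referenced above:

    from autoPyTorch.datasets.resampling_strategy import NoResamplingStrategyTypes

    # `api` is a task on which search() has finished; X_train/y_train as used there.
    api.refit(
        X_train=X_train,
        y_train=y_train,
        resampling_strategy=NoResamplingStrategyTypes.no_resampling,  # default: train on all data
        total_walltime_limit=120,
        run_time_limit_secs=60,
    )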
- score(y_pred: ndarray, y_test: Union[ndarray, DataFrame]) Dict[str, float] ¶
Calculate the score (evaluation measure) on the test set.
- Args:
- y_pred (np.ndarray):
The test predictions
- y_test (np.ndarray):
The test ground truth labels.
- Returns:
- Dict[str, float]:
Value of the evaluation metric calculated on the test set.
- search(optimize_metric: str, X_train: Optional[Union[List, DataFrame]] = None, y_train: Optional[Union[List, DataFrame]] = None, X_test: Optional[Union[List, DataFrame]] = None, y_test: Optional[Union[List, DataFrame]] = None, n_prediction_steps: int = 1, freq: Optional[Union[str, int, List[int]]] = None, start_times: Optional[List[DatetimeIndex]] = None, series_idx: Optional[Union[List[Union[str, int]], str, int]] = None, dataset_name: Optional[str] = None, budget_type: str = 'epochs', min_budget: Union[int, float] = 5, max_budget: Union[int, float] = 50, total_walltime_limit: int = 100, func_eval_time_limit_secs: Optional[int] = None, enable_traditional_pipeline: bool = False, memory_limit: Optional[int] = 4096, smac_scenario_args: Optional[Dict[str, Any]] = None, get_smac_object_callback: Optional[Callable] = None, all_supported_metrics: bool = True, precision: int = 32, disable_file_output: List = [], load_models: bool = True, portfolio_selection: Optional[str] = None, suggested_init_models: Optional[List[str]] = None, custom_init_setting_path: Optional[str] = None, min_num_test_instances: Optional[int] = None, dataset_compression: Union[Mapping[str, Any], bool] = False, **forecasting_dataset_kwargs: Any) BaseTask [source]¶
Search for the best pipeline configuration for the given dataset.
Search both optimizes the machine learning models and builds an ensemble out of them. To disable ensembling, set ensemble_size==0.
- Args:
- optimize_metric (str):
name of the metric that is used to evaluate a pipeline.
- X_train: Optional[Union[List, pd.DataFrame]]
A pair of features (X_train) and targets (y_train) used to fit a pipeline. Additionally, a holdout of these pairs (X_test, y_test) can be provided to track the generalization performance of each stage.
- y_train: Union[List, pd.DataFrame]
training target, must be given
- X_test: Optional[Union[List, pd.DataFrame]]
Test features. Test series need to end one step before forecasting starts.
- y_test: Optional[Union[List, pd.DataFrame]]
Test Targets
- n_prediction_steps: int
How many steps in advance we need to predict
- freq: Optional[Union[str, int, List[int]]]
Frequency information; it determines the configuration space of the window size. If not given, we use the default configuration.
- start_times: List[pd.DatetimeIndex]
A list indicating the start time of each series in the training sets
- series_idx: Optional[Union[List[Union[str, int]], str, int]]
variable in X indicating series indices
- dataset_name: Optional[str],
dataset name
- budget_type (str):
Type of budget to be used when fitting the pipeline. It can be one of:
- epochs: The training of each pipeline will be terminated after
a number of epochs have passed. This number of epochs is determined by the budget argument of this method.
- runtime: The training of each pipeline will be terminated after
a number of seconds have passed. This number of seconds is determined by the budget argument of this method. The overall fitting time of a pipeline is controlled by func_eval_time_limit_secs. ‘runtime’ only controls the allocated time to train a pipeline, but it does not consider the overall time it takes to create a pipeline (data loading and preprocessing, other i/o operations, etc.). budget_type will determine the units of min_budget/max_budget. If budget_type==’epochs’ is used, min_budget will refer to epochs whereas if budget_type==’runtime’ then min_budget will refer to seconds.
- resolution: The sample resolution of the time series. For instance, if a time series sequence is
[0, 1, 2, 3, 4] with resolution 0.5, the sequence fed to the network is [0, 2, 4]
- min_budget Union[int, float]:
Auto-PyTorch uses Hyperband to trade-off resources between running many pipelines at min_budget and running the top performing pipelines on max_budget. min_budget states the minimum resource allocation a pipeline should have so that we can compare and quickly discard bad performing models. For example, if the budget_type is epochs, and min_budget=5, then we will run every pipeline to a minimum of 5 epochs before performance comparison.
- max_budget Union[int, float]:
Auto-PyTorch uses Hyperband to trade-off resources between running many pipelines at min_budget and running the top performing pipelines on max_budget. max_budget states the maximum resource allocation a pipeline is going to be ran. For example, if the budget_type is epochs, and max_budget=50, then the pipeline training will be terminated after 50 epochs.
- total_walltime_limit (int), (default=100): Time limit
in seconds for the search of appropriate models. By increasing this value, autopytorch has a higher chance of finding better models.
- func_eval_time_limit_secs (Optional[int]): Time limit
for a single call to the machine learning model. Model fitting will be terminated if the machine learning algorithm runs over the time limit. Set this value high enough so that typical machine learning algorithms can be fit on the training data.
- memory_limit (Optional[int]), (default=4096): Memory
limit in MB for the machine learning algorithm. autopytorch will stop fitting the machine learning algorithm if it tries to allocate more than memory_limit MB. If None is provided, no memory limit is set. In case of multi-processing, memory_limit will be per job. This memory limit also applies to the ensemble creation process.
- smac_scenario_args (Optional[Dict]): Additional arguments inserted
into the scenario of SMAC. See the SMAC documentation (https://automl.github.io/SMAC3/master/options.html?highlight=scenario#scenario) for a list of available arguments.
- get_smac_object_callback (Optional[Callable]): Callback function
to create an object of class smac.optimizer.smbo.SMBO (https://automl.github.io/SMAC3/master/apidoc/smac.optimizer.smbo.html). The function must accept the arguments scenario_dict, instances, num_params, runhistory, seed and ta. This is an advanced feature. Use only if you are familiar with SMAC (https://automl.github.io/SMAC3/master/index.html).
- all_supported_metrics (bool), (default=True): if True, all
metrics supporting current task will be calculated for each pipeline and results will be available via cv_results
- precision (int), (default=32): Numeric precision used when loading
ensemble data. Can be ‘16’, ‘32’ or ‘64’.
- disable_file_output (Optional[List[Union[str, DisableFileOutputParameters]]]):
Used as a list to pass more fine-grained information on what to save. Must be a member of DisableFileOutputParameters. Allowed elements in the list are:
- y_optimization:
do not save the predictions for the optimization set, which would later on be used to build an ensemble. Note that SMAC optimizes a metric evaluated on the optimization set.
- pipeline:
do not save any individual pipeline files
- pipelines:
In case of cross validation, disables saving the joint model of the pipelines fit on each fold.
- y_test:
do not save the predictions for the test set.
- all:
do not save any of the above.
For more information check autoPyTorch.evaluation.utils.DisableFileOutputParameters.
- load_models (bool), (default=True): Whether to load the
models after fitting AutoPyTorch.
- suggested_init_models: Optional[List[str]]
suggested initial models with their default configuration settings
- custom_init_setting_path: Optional[str]
path to a json file that contains the initial configuration suggested by the users
- min_num_test_instances: Optional[int]
If None, the full validation set is evaluated at each fidelity. Otherwise, the number of evaluated validation instances grows proportionally with the fidelity, but is never smaller than this value.
- forecasting_dataset_kwargs: Dict[str, Any]
Forecasting dataset kwargs used to initialize forecasting dataset
- Returns:
self
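A minimal forecasting sketch with synthetic target-only series; the metric name and the per-series start timestamps are assumptions based on the argument descriptions above:

    import numpy as np
    import pandas as pd

    from autoPyTorch.api.time_series_forecasting import TimeSeriesForecastingTask

    # Three synthetic univariate target series of different lengths.
    y_train = [np.sin(np.arange(n) / 7.0) for n in (120, 100, 140)]
    start_times = [pd.Timestamp('2000-01-01')] * len(y_train)  # assumed: one start time per series

    api = TimeSeriesForecastingTask()
    api.search(
        optimize_metric='mean_MASE_forecasting',  # assumed forecasting metric name
        y_train=y_train,
        n_prediction_steps=6,   # forecast horizon
        freq='1D',              # daily data
        start_times=start_times,
        total_walltime_limit=300,
        func_eval_time_limit_secs=50,
    )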
- set_pipeline_options(**pipeline_options_kwargs: Any) None ¶
Checks whether the given arguments are valid and then sets them in the current pipeline configuration.
- Args:
**pipeline_options_kwargs: Valid config options include “num_run”, “device”, “budget_type”, “epochs”, “runtime”, “torch_num_threads”, “early_stopping”, “use_tensorboard_logger”, “metrics_during_training”
- Returns:
None
- show_models() str ¶
Returns a Markdown table containing details about the final ensemble/configuration.
- Returns:
- str:
Markdown table of models.
- sprint_statistics() str ¶
Prints statistics about the SMAC search.
These statistics include:
Optimisation Metric
Best Optimisation score achieved by individual pipelines
Total number of target algorithm runs
Total number of successful target algorithm runs
Total number of crashed target algorithm runs
Total number of target algorithm runs that exceeded the time limit
Total number of target algorithm runs that exceeded the memory limit
- Returns:
- (str):
Formatted string with statistics
- update_sliding_window_size(n_prediction_steps: int) None [source]¶
The size of the sliding window is heavily dependent on the dataset, so we only update it once that information is available from the dataset.
- Args:
- n_prediction_steps (int):
Forecast horizon. The base sliding window size could also be set based on the forecast horizon.
Pipelines¶
Tabular Classification¶
- class autoPyTorch.pipeline.tabular_classification.TabularClassificationPipeline(config: Optional[Configuration] = None, steps: Optional[List[Tuple[str, Union[autoPyTorchComponent, autoPyTorchChoice]]]] = None, dataset_properties: Optional[Dict[str, Union[int, float, str, List, bool, Tuple]]] = None, include: Optional[Dict[str, Any]] = None, exclude: Optional[Dict[str, Any]] = None, random_state: Optional[RandomState] = None, init_params: Optional[Dict[str, Any]] = None, search_space_updates: Optional[HyperparameterSearchSpaceUpdates] = None)[source]¶
This class is a wrapper around Sklearn Pipeline to integrate autoPyTorch components and choices for tabular classification tasks.
It implements a pipeline, which includes the following as steps:
imputer
encoder
scaler
feature_preprocessor
tabular_transformer
preprocessing
network_embedding
network_backbone
network_head
network
network_init
optimizer
lr_scheduler
data_loader
trainer
Contrary to the sklearn API it is not possible to enumerate the possible parameters in the __init__ function because we only know the available classifiers at runtime. For this reason the user must specify the parameters by passing an instance of ConfigSpace.configuration_space.Configuration.
- Args:
- config (Configuration)
The configuration to evaluate.
- steps (Optional[List[Tuple[str, autoPyTorchChoice]]]):
The list of autoPyTorchComponent or autoPyTorchChoice that build the pipeline. If provided, they won’t be dynamically produced.
- include (Optional[Dict[str, Any]]):
Allows the caller to specify which configurations to honor during the creation of the configuration space.
- exclude (Optional[Dict[str, Any]]):
Allows the caller to specify which configurations to avoid during the creation of the configuration space.
- random_state (np.random.RandomState):
Allows to produce reproducible results by setting a seed for randomized settings
- init_params (Optional[Dict[str, Any]]):
Optional initial settings for the config
- search_space_updates (Optional[HyperparameterSearchSpaceUpdates]):
Search space updates that can be used to modify the search space of particular components or choice modules of the pipeline
- Attributes:
- steps (List[Tuple[str, PipelineStepType]]):
The steps of the current pipeline. Each step in an AutoPyTorch pipeline is either an autoPyTorchChoice or an autoPyTorchComponent. Both of these are child classes of sklearn ‘BaseEstimator’ and they perform operations on and transform the fit dictionary. For more info, check documentation of ‘autoPyTorchChoice’ or ‘autoPyTorchComponent’.
- config (Configuration):
A configuration to delimit the current component choice
- random_state (Optional[np.random.RandomState]):
Allows to produce reproducible results by setting a seed for randomized settings
- get_pipeline_representation() Dict[str, str] [source]¶
Returns a representation of the pipeline, so that it can be consumed and formatted by the API.
It should be a representation that follows: [{‘PreProcessing’: <>, ‘Estimator’: <>}]
- Returns:
Dict: contains the pipeline representation in a short format
- predict_proba(X: ndarray, batch_size: Optional[int] = None) ndarray [source]¶
predict_proba.
- Args:
- X (np.ndarray):
Input to the pipeline, from which to guess targets
- batch_size (Optional[int]):
Controls whether the pipeline will be called on small chunks of the data. Useful when calling the predict method on the whole array X results in a MemoryError.
- Returns:
- np.ndarray:
Probabilities of the target being certain class
- score(X: ndarray, y: ndarray, batch_size: Optional[int] = None, metric_name: str = 'accuracy') float [source]¶
Scores the fitted estimator on (X, y)
- Args:
- X (np.ndarray):
input to the pipeline, from which to guess targets
- batch_size (Optional[int]):
batch_size controls whether the pipeline will be called on small chunks of the data. Useful when calling the predict method on the whole array X results in a MemoryError.
- y (np.ndarray):
Ground Truth labels
- metric_name (str: default = ‘accuracy’):
name of the metric to be calculated
- Returns:
float: score based on the metric name
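A hedged sketch of driving the pipeline by hand. get_dataset_requirements, get_required_dataset_info and get_dataset_properties are the helpers the library itself uses to derive dataset_properties; treat the exact names as assumptions:

    import sklearn.datasets

    from autoPyTorch.api.tabular_classification import TabularClassificationTask
    from autoPyTorch.pipeline.tabular_classification import TabularClassificationPipeline
    from autoPyTorch.utils.pipeline import get_dataset_requirements  # assumed helper

    X, y = sklearn.datasets.load_iris(return_X_y=True)

    api = TabularClassificationTask()
    dataset = api.get_dataset(X_train=X, y_train=y)
    requirements = get_dataset_requirements(info=dataset.get_required_dataset_info())
    dataset_properties = dataset.get_dataset_properties(requirements)

    pipeline = TabularClassificationPipeline(dataset_properties=dataset_properties)
    config = pipeline.get_hyperparameter_search_space().sample_configuration()
    pipeline.set_hyperparameters(config)  # ready for fit() with an AutoPyTorch fit dictionary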
- class autoPyTorch.pipeline.traditional_tabular_classification.TraditionalTabularClassificationPipeline(config: Optional[Configuration] = None, steps: Optional[List[Tuple[str, Union[autoPyTorchComponent, autoPyTorchChoice]]]] = None, dataset_properties: Optional[Dict[str, Union[int, float, str, List, bool, Tuple]]] = None, include: Optional[Dict[str, Any]] = None, exclude: Optional[Dict[str, Any]] = None, random_state: Optional[RandomState] = None, init_params: Optional[Dict[str, Any]] = None, search_space_updates: Optional[HyperparameterSearchSpaceUpdates] = None)[source]¶
A pipeline to fit traditional ML methods for tabular classification.
- Args:
- config (Configuration)
The configuration to evaluate.
- steps (Optional[List[Tuple[str, Union[autoPyTorchComponent, autoPyTorchChoice]]]]):
the list of autoPyTorchComponent or autoPyTorchChoice that build the pipeline. If provided, they won’t be dynamically produced.
- include (Optional[Dict[str, Any]]):
Allows the caller to specify which configurations to honor during the creation of the configuration space.
- exclude (Optional[Dict[str, Any]]):
Allows the caller to specify which configurations to avoid during the creation of the configuration space.
- random_state (np.random.RandomState):
Allows to produce reproducible results by setting a seed for randomized settings
- init_params (Optional[Dict[str, Any]]):
Optional initial settings for the config
- search_space_updates (Optional[HyperparameterSearchSpaceUpdates]):
Search space updates that can be used to modify the search space of particular components or choice modules of the pipeline
- Attributes:
- steps (List[Tuple[str, PipelineStepType]]):
The steps of the current pipeline. Each step in an AutoPyTorch pipeline is either an autoPyTorchChoice or an autoPyTorchComponent. Both of these are child classes of sklearn ‘BaseEstimator’ and they perform operations on and transform the fit dictionary. For more info, check documentation of ‘autoPyTorchChoice’ or ‘autoPyTorchComponent’.
- config (Configuration):
A configuration to delimit the current component choice
- random_state (Optional[np.random.RandomState]):
Allows to produce reproducible results by setting a seed for randomized settings
- get_pipeline_representation() Dict[str, str] [source]¶
Returns a representation of the pipeline, so that it can be consumed and formatted by the API.
It should be a representation that follows: [{‘PreProcessing’: <>, ‘Estimator’: <>}]
- Returns:
- Dict:
Contains the pipeline representation in a short format
- predict(X: ndarray, batch_size: Optional[int] = None) ndarray [source]¶
Predict the output using the selected model.
- Args:
- X (np.ndarray):
Input data on which to predict
- batch_size (Optional[int]):
Controls whether the pipeline will be called on small chunks of the data. Useful when calling the predict method on the whole array X results in a MemoryError.
- Returns:
np.ndarray: the predicted values given input X
- predict_proba(X: ndarray, batch_size: Optional[int] = None) ndarray [source]¶
predict_proba.
- Args:
- X (np.ndarray):
Input to the pipeline, from which to guess targets
- batch_size (Optional[int]):
Controls whether the pipeline will be called on small chunks of the data. Useful when calling the predict method on the whole array X results in a MemoryError.
- Returns:
- np.ndarray:
Probabilities of the target being certain class
Tabular Regression¶
- class autoPyTorch.pipeline.tabular_regression.TabularRegressionPipeline(config: Optional[Configuration] = None, steps: Optional[List[Tuple[str, Union[autoPyTorchComponent, autoPyTorchChoice]]]] = None, dataset_properties: Optional[Dict[str, Union[int, float, str, List, bool, Tuple]]] = None, include: Optional[Dict[str, Any]] = None, exclude: Optional[Dict[str, Any]] = None, random_state: Optional[RandomState] = None, init_params: Optional[Dict[str, Any]] = None, search_space_updates: Optional[HyperparameterSearchSpaceUpdates] = None)[source]¶
This class is a wrapper around Sklearn Pipeline to integrate autoPyTorch components and choices for tabular regression tasks.
It implements a pipeline, which includes the following as steps:
imputer
encoder
scaler
feature_preprocessor
tabular_transformer
preprocessing
network_embedding
network_backbone
network_head
network
network_init
optimizer
lr_scheduler
data_loader
trainer
Contrary to the sklearn API it is not possible to enumerate the possible parameters in the __init__ function because we only know the available regressors at runtime. For this reason the user must specify the parameters by passing an instance of ConfigSpace.configuration_space.Configuration.
- Args:
- config (Configuration)
The configuration to evaluate.
- steps (Optional[List[Tuple[str, autoPyTorchChoice]]]):
the list of autoPyTorchComponent or autoPyTorchChoice that build the pipeline. If provided, they won’t be dynamically produced.
- include (Optional[Dict[str, Any]]):
Allows the caller to specify which configurations to honor during the creation of the configuration space.
- exclude (Optional[Dict[str, Any]]):
Allows the caller to specify which configurations to avoid during the creation of the configuration space.
- random_state (np.random.RandomState):
Allows to produce reproducible results by setting a seed for randomized settings
- init_params (Optional[Dict[str, Any]]):
Optional initial settings for the config
- search_space_updates (Optional[HyperparameterSearchSpaceUpdates]):
Search space updates that can be used to modify the search space of particular components or choice modules of the pipeline
- Attributes:
- steps (List[Tuple[str, PipelineStepType]]):
The steps of the current pipeline. Each step in an AutoPyTorch pipeline is either an autoPyTorchChoice or an autoPyTorchComponent. Both of these are child classes of sklearn ‘BaseEstimator’ and they perform operations on and transform the fit dictionary. For more info, check documentation of ‘autoPyTorchChoice’ or ‘autoPyTorchComponent’.
- config (Configuration):
A configuration to delimit the current component choice
- random_state (Optional[np.random.RandomState]):
Allows to produce reproducible results by setting a seed for randomized settings
- get_pipeline_representation() Dict[str, str] [source]¶
Returns a representation of the pipeline, so that it can be consumed and formatted by the API.
It should be a representation that follows: [{‘PreProcessing’: <>, ‘Estimator’: <>}]
- Returns:
Dict: contains the pipeline representation in a short format
- score(X: ndarray, y: ndarray, batch_size: Optional[int] = None, metric_name: str = 'r2') float [source]¶
Scores the fitted estimator on (X, y)
- Args:
- X (np.ndarray):
input to the pipeline, from which to guess targets
- batch_size (Optional[int]):
batch_size controls whether the pipeline will be called on small chunks of the data. Useful when calling the predict method on the whole array X results in a MemoryError.
- y (np.ndarray):
Ground Truth labels
- metric_name (str, default = ‘r2’):
name of the metric to be calculated
- Returns:
float: score based on the metric name
- class autoPyTorch.pipeline.traditional_tabular_regression.TraditionalTabularRegressionPipeline(config: Optional[Configuration] = None, steps: Optional[List[Tuple[str, Union[autoPyTorchComponent, autoPyTorchChoice]]]] = None, dataset_properties: Optional[Dict[str, Any]] = None, include: Optional[Dict[str, Any]] = None, exclude: Optional[Dict[str, Any]] = None, random_state: Optional[RandomState] = None, init_params: Optional[Dict[str, Any]] = None)[source]¶
A pipeline to fit traditional ML methods for tabular regression.
- Args:
- config (Configuration)
The configuration to evaluate.
- steps (Optional[List[Tuple[str, autoPyTorchChoice]]]):
the list of autoPyTorchComponent or autoPyTorchChoice that build the pipeline. If provided, they won’t be dynamically produced.
- include (Optional[Dict[str, Any]]):
Allows the caller to specify which configurations to honor during the creation of the configuration space.
- exclude (Optional[Dict[str, Any]]):
Allows the caller to specify which configurations to avoid during the creation of the configuration space.
- random_state (np.random.RandomState):
Allows to produce reproducible results by setting a seed for randomized settings
- init_params (Optional[Dict[str, Any]]):
Optional initial settings for the config
- search_space_updates (Optional[HyperparameterSearchSpaceUpdates]):
Search space updates that can be used to modify the search space of particular components or choice modules of the pipeline
- Attributes:
- steps (List[Tuple[str, PipelineStepType]]):
The steps of the current pipeline. Each step in an AutoPyTorch pipeline is either an autoPyTorchChoice or an autoPyTorchComponent. Both of these are child classes of sklearn ‘BaseEstimator’ and they perform operations on and transform the fit dictionary. For more info, check documentation of ‘autoPyTorchChoice’ or ‘autoPyTorchComponent’.
- config (Configuration):
A configuration to delimit the current component choice
- random_state (Optional[np.random.RandomState]):
Allows to produce reproducible results by setting a seed for randomized settings
- get_pipeline_representation() Dict[str, str] [source]¶
Returns a representation of the pipeline, so that it can be consumed and formatted by the API.
It should be a representation that follows: [{‘PreProcessing’: <>, ‘Estimator’: <>}]
- Returns:
- Dict[str, str]:
Contains the pipeline representation in a short format
- predict(X: ndarray, batch_size: Optional[int] = None) ndarray [source]¶
Predict the output using the selected model.
- Args:
- X (np.ndarray):
Input data on which to predict
- batch_size (Optional[int]):
Controls whether the pipeline will be called on small chunks of the data. Useful when calling the predict method on the whole array X results in a MemoryError.
- Returns:
np.ndarray: the predicted values given input X
Time Series Forecasting¶
- class autoPyTorch.pipeline.time_series_forecasting.TimeSeriesForecastingPipeline(config: Optional[Configuration] = None, steps: Optional[List[Tuple[str, Union[autoPyTorchComponent, autoPyTorchChoice]]]] = None, dataset_properties: Optional[Dict[str, Union[int, float, str, List, bool, Tuple]]] = None, include: Optional[Dict[str, Any]] = None, exclude: Optional[Dict[str, Any]] = None, random_state: Optional[RandomState] = None, init_params: Optional[Dict[str, Any]] = None, search_space_updates: Optional[HyperparameterSearchSpaceUpdates] = None)[source]¶
This class is a proof of concept to integrate AutoPyTorch components.
It implements a pipeline, which includes the following as steps:
one preprocessing step and one neural network.
Contrary to the sklearn API it is not possible to enumerate the possible parameters in the __init__ function because we only know the available regressors at runtime. For this reason the user must specify the parameters by passing an instance of ConfigSpace.configuration_space.Configuration.
- Args:
- config (Configuration):
The configuration to evaluate.
- random_state (Optional[RandomState]):
random_state is the random number generator
- get_pipeline_representation() Dict[str, str] [source]¶
Returns a representation of the pipeline, so that it can be consumed and formatted by the API.
It should be a representation that follows: [{‘PreProcessing’: <>, ‘Estimator’: <>}]
- Returns:
Dict: contains the pipeline representation in a short format
- predict(X: List[Union[ndarray, DataFrame, TimeSeriesSequence]], batch_size: Optional[int] = None) ndarray [source]¶
Predict the output using the selected model.
- Args:
- X (List[Union[np.ndarray, pd.DataFrame, TimeSeriesSequence]]):
input data to predict
- batch_size (Optional[int]):
batch_size controls whether the pipeline will be called on small chunks of the data. Useful when calling the predict method on the whole array X results in a MemoryError.
- Returns:
- np.ndarray:
the predicted values given input X
- score(X: List[Union[ndarray, DataFrame, TimeSeriesSequence]], y: ndarray, batch_size: Optional[int] = None, **score_kwargs: Any) float [source]¶
Scores the fitted estimator on (X, y)
- Args:
- X (List[Union[np.ndarray, pd.DataFrame, TimeSeriesSequence]]):
input to the pipeline, from which to guess targets
- batch_size (Optional[int]):
batch_size controls whether the pipeline will be called on small chunks of the data. Useful when calling the predict method on the whole array X results in a MemoryError.
- Returns:
- float:
coefficient of determination R^2 of the prediction
Steps in Pipeline¶
autoPyTorchComponent¶
- class autoPyTorch.pipeline.components.base_component.autoPyTorchComponent(random_state: Optional[RandomState] = None)[source]¶
Provides an abstract interface which can be used to create steps of a pipeline in AutoPyTorch.
- Args:
- random_state (Optional[np.random.RandomState]):
Allows to produce reproducible results by setting a seed for randomized settings
- check_requirements(X: Dict[str, Any], y: Optional[Any] = None) None [source]¶
A mechanism in code to ensure the correctness of the fit dictionary. It recursively makes sure that the children and parent level requirements are honored before fit.
- Args:
- X (Dict[str, Any]):
Dictionary with fitted parameters. It is a message passing mechanism, in which during a transform, a component adds relevant information so that further stages can be properly fitted
- fit(X: Dict[str, Any], y: Optional[Any] = None) autoPyTorchComponent [source]¶
The fit function calls the fit function of the underlying model and returns self.
- Args:
- X (Dict[str, Any]):
Dictionary with fitted parameters. It is a message passing mechanism, in which during a transform, a component adds relevant information so that further stages can be properly fitted
- y (Any):
Not Used – to comply with API
- Returns:
- self:
returns an instance of self.
- Notes:
Please see the scikit-learn API documentation for further information.
- get_fit_requirements() Optional[List[FitRequirement]] [source]¶
Function to get the keys that the component requires to be present in the fit dictionary
- Returns:
- List[FitRequirement]:
a list containing required keys in a named tuple (name: str, type: object)
- static get_hyperparameter_search_space(dataset_properties: Optional[Dict[str, Union[int, float, str, List, bool, Tuple]]] = None) ConfigurationSpace [source]¶
Return the configuration space of this classification algorithm.
- Args:
- dataset_properties (Optional[Dict[str, Union[str, int]]):
Describes the dataset to work on
- Returns:
- ConfigurationSpace:
The configuration space of this algorithm.
- static get_properties(dataset_properties: Optional[Dict[str, Union[int, float, str, List, bool, Tuple]]] = None) Dict[str, Union[str, bool]] [source]¶
Get the properties of the underlying algorithm.
- Args:
- dataset_properties (Optional[Dict[str, Union[str, int]]):
Describes the dataset to work on
- Returns:
- Dict[str, Any]:
Properties of the algorithm
- classmethod get_required_properties() Optional[List[str]] [source]¶
Function to get the properties in the component that are required for properly fitting the pipeline. Usually defined in the base class of the component
- Returns:
- List[str]:
list of properties autopytorch component must have for proper functioning of the pipeline
- set_hyperparameters(configuration: Configuration, init_params: Optional[Dict[str, Any]] = None) BaseEstimator [source]¶
Applies a configuration to the given component. This method translates a hierarchical configuration key to an actual parameter of the autoPyTorch component.
- Args:
- configuration (Configuration):
Which configuration to apply to the chosen component
- init_params (Optional[Dict[str, any]]):
Optional arguments to initialize the chosen component
- Returns:
An instance of self
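A schematic sketch of the interface a new component implements; this is a toy, unregistered step, and the fit-dictionary key it writes is purely illustrative:

    from typing import Any, Dict, Optional

    from ConfigSpace.configuration_space import ConfigurationSpace

    from autoPyTorch.pipeline.components.base_component import autoPyTorchComponent


    class ToyComponent(autoPyTorchComponent):
        """Minimal step that records a flag in the fit dictionary."""

        def fit(self, X: Dict[str, Any], y: Any = None) -> 'ToyComponent':
            self.check_requirements(X)  # honor parent/child fit requirements
            self.fitted_ = True
            return self

        def transform(self, X: Dict[str, Any]) -> Dict[str, Any]:
            X.update({'toy_component_fitted': self.fitted_})  # message passing to later steps
            return X

        @staticmethod
        def get_properties(dataset_properties: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
            return {'shortname': 'Toy', 'name': 'Toy Component'}

        @staticmethod
        def get_hyperparameter_search_space(
            dataset_properties: Optional[Dict[str, Any]] = None,
        ) -> ConfigurationSpace:
            return ConfigurationSpace()  # no tunable hyperparameters in this toy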
autoPyTorchChoice¶
- class autoPyTorch.pipeline.components.base_choice.autoPyTorchChoice(dataset_properties: Dict[str, Union[int, float, str, List, bool, Tuple]], random_state: Optional[RandomState] = None)[source]¶
Allows for the dynamic generation of components as pipeline steps.
- Args:
- dataset_properties (Dict[str, Union[str, BaseDatasetPropertiesType]]):
Describes the dataset to work on
- random_state (Optional[np.random.RandomState]):
Allows to produce reproducible results by setting a seed for randomized settings
- Attributes:
- random_state (Optional[np.random.RandomState]):
Allows to produce reproducible results by setting a seed for randomized settings
- choice (autoPyTorchComponent):
the choice of components for this stage
- check_requirements(X: Dict[str, Any], y: Optional[Any] = None) None [source]¶
A mechanism in code to ensure the correctness of the fit dictionary. It recursively makes sure that the children and parent level requirements are honored before fit.
- Args:
- X (Dict[str, Any]):
Dictionary with fitted parameters. It is a message passing mechanism, in which during a transform, a component adds relevant information so that further stages can be properly fitted
- fit(X: Dict[str, Any], y: Any) autoPyTorchComponent [source]¶
Handy method to check if a component is fitted
- Args:
- X (Dict[str, Any]):
Dependencies needed by current component to perform fit
- y (Any):
not used. To comply with sklearn API
- get_available_components(dataset_properties: Optional[Dict[str, Union[int, float, str, List, bool, Tuple]]] = None, include: Optional[List[str]] = None, exclude: Optional[List[str]] = None) Dict[str, autoPyTorchComponent] [source]¶
Wrapper over get_components() to incorporate the include/exclude user specification
- Args:
- dataset_properties (Optional[Dict[str, BaseDatasetPropertiesType]]):
Describes the dataset to work on
- include: Optional[List[str]]:
what components to include. It is an exhaustive list, and will exclusively use these components.
- exclude: Optional[List[str]]:
which components to skip. Can’t be used together with include
- Returns:
- Dict[str, autoPyTorchComponent]:
A dictionary with valid components for this choice object
- get_components() Dict[str, autoPyTorchComponent] [source]¶
Returns an ordered dict with the components available for the current step.
- Args:
- cls (autoPyTorchChoice):
The choice object from which to query the valid components
- Returns:
- Dict[str, autoPyTorchComponent]:
The available components via a mapping from the module name to the component class
- get_hyperparameter_search_space(dataset_properties: Optional[Dict[str, Union[int, float, str, List, bool, Tuple]]] = None, default: Optional[str] = None, include: Optional[List[str]] = None, exclude: Optional[List[str]] = None) ConfigurationSpace [source]¶
Returns the configuration space of the current chosen components
- Args:
- dataset_properties (Optional[Dict[str, BaseDatasetPropertiesType]]):
Describes the dataset to work on
- default (Optional[str]):
Default component to use in hyperparameters
- include: Optional[List[str]]:
what components to include. It is an exhaustive list, and will exclusively use these components.
- exclude: Optional[List[str]]:
which components to skip
- Returns:
- ConfigurationSpace: the configuration space of the hyper-parameters of the
chosen component
- predict(X: ndarray) ndarray [source]¶
Predicts the target given an input, by using the chosen component
- Args:
- X (np.ndarray):
input features from which to predict the target
- Returns:
- np.ndarray:
the target prediction
- set_hyperparameters(configuration: Configuration, init_params: Optional[Dict[str, Any]] = None) autoPyTorchChoice [source]¶
Applies a configuration to the given component. This method translates a hierarchical configuration key to an actual parameter of the autoPyTorch component.
- Args:
- configuration (Configuration):
Which configuration to apply to the chosen component
- init_params (Optional[Dict[str, any]]):
Optional arguments to initialize the chosen component
- Returns:
self: returns an instance of self