CAVE(folders: List[str], output_dir: str, ta_exec_dir: List[str], file_format: str = 'SMAC3', validation_format='NONE', validation_method: str = 'epm', pimp_max_samples: int = -1, fanova_pairwise: bool = True, use_budgets: bool = False, seed: int = 42, show_jupyter: bool = True, verbose_level: str = 'OFF')¶
Initialize CAVE facade to handle analyzing, plotting and building the report-page easily. During initialization, the analysis-infrastructure is built and the data is validated, the overall best incumbent is found and default+incumbent are evaluated for all instances for all runs, by default using an EPM.
In the internal data-management the we have three types of runhistories: original, validated and epm.
- original_rh contain only runs that have been gathered during the optimization-process.
- validated_rh may contain original runs, but also data that was not gathered iteratively during the optimization, but systematically through external validation of interesting configurations. Important: NO ESTIMATED RUNS IN validated RUNHISTORIES!
- epm_rh contain runs that are gathered through empirical performance models.
Runhistories are organized as follows:
- each ConfiguratorRun has an original_runhistory- and a combined_runhistory-attribute
- if available, each ConfiguratorRun’s validated_runhistory contains a runhistory with validation-data gathered after the optimization
- combined_runhistory always contains as many real runs as possible
- folders (list<strings>) – paths to relevant SMAC runs
- output_dir (string) – output for cave to write results (figures + report)
- ta_exec_dir (string) – execution directory for target algorithm (to find instance.txt specified in scenario, ..)
- file_format (str) – what format the rundata is in, options are [SMAC3, SMAC2, BOHB and CSV]
- file_format – what format the validation rundata is in, options are [SMAC3, SMAC2, CSV and None]
- validation_method (string) – from [validation, epm], how to estimate missing runs
- pimp_max_samples (int) – passed to PIMP for configuration
- fanova_pairwise (bool) – whether to calculate pairwise marginals for fanova
- use_budgets (bool) – if true, individual runs are treated as different budgets. they are not evaluated together, but compared against each other. runs are expected in ascending budget-size.
- seed (int) – random seed for analysis (e.g. the random forests)
- show_jupyter (bool) – default True, tries to output plots and tables to jupyter-frontend, if available
- verbose_level (str) – from [OFF, INFO, DEBUG, DEV_DEBUG and WARNING]
The instance features are projected into a two/three dimensional space using principal component analysis (PCA) and the footprint of each algorithm is plotted, i.e., on which instances the default or the optimized configuration performs well. In contrast to the other analysis methods in this section, these plots allow insights into which of the two configurations performs well on specific types or clusters of instances. Inspired by Smith-Miles.
analyze(performance=True, cdf=True, scatter=True, cfp=True, cfp_time_slider=False, cfp_max_plot=-1, cfp_number_quantiles=10, param_importance=['forward_selection', 'ablation', 'fanova'], pimp_sort_table_by: str = 'average', feature_analysis=['box_violin', 'correlation', 'importance', 'clustering', 'feature_cdf'], parallel_coordinates=True, cost_over_time=True, algo_footprint=True)¶
Analyze the available data and build HTML-webpage as dict. Save webpage in ‘self.output_dir/CAVE/report.html’. Analyzing is performed with the analyzer-instance that is initialized in the __init__
- performance (bool) – whether to calculate par10-values
- cdf (bool) – whether to plot cdf
- scatter (bool) – whether to plot scatter
- cfp (bool) – whether to perform configuration visualization
- cfp_time_slider (bool) – whether to include an interactive time-slider in configuration footprint
- cfp_max_plot (int) – limit number of configurations considered for configuration footprint (-1 -> all configs)
- cfp_number_quantiles (int) – number of steps over time generated in configuration footprint
- param_importance (List[str]) – containing methods for parameter importance
- pimp_sort_table (str) – in what order the parameter-importance overview should be organized
- feature_analysis (List[str]) – containing methods for feature analysis
- parallel_coordinates (bool) – whether to plot parallel coordinates
- cost_over_time (bool) – whether to plot cost over time
- algo_footprint (bool) – whether to plot algorithm footprints
Show the incumbents for each budget (i.e. the best configuration by kernel-estimation using data from that budget).
Visualizing the learning curves of the individual Hyperband-iterations. Model based picks are marked with a cross. The config-id tuple denotes (iteration, stage, id_within_stage), where the iteration is a hyperband iteration and the stage is the index of the budget used. It can be interpreted as a nested index-identifier.
Box and Violin Plots show the distribution of each feature value across the instances. Box plots show the quantiles of the distribution and violin plots show the approximated probability density of the feature values. Such plots are useful to inspect the instances and to detect characteristics of the instances. For example, if the distributions have two or more modes, it could indicate that the instance set is heterogeneous which could cause problems in combination with racing strategies configurators typically use. NaN values are removed from the data.
Use spearman correlation, to get a correlation-value and a p-value for every pairwise combination of budgets. First value is the correlation, second is the p-value (the p-value roughly estimates the likelihood to obtain this correlation coefficient with uncorrelated datasets).
Ablation Analysis is a method to determine parameter importance by comparing two parameter configurations, typically the default and the optimized configuration. It uses a greedy forward search to determine the order of flipping the parameter settings from default configuration to incumbent such that in each step the cost is maximally decreased.
fANOVA (functional analysis of variance) computes the fraction of the variance in the cost space explained by changing a parameter by marginalizing over all other parameters, for each parameter (or for pairs of parameters). Parameters with high importance scores will have a large impact on the performance. To this end, a random forest is trained as an empirical performance model on the available empirical data from the available runhistories.
Comparing parameters of default and incumbent. Parameters that differ from default to incumbent are presented first.
configurator_footprint(cave, use_timeslider=False, max_confs=1000, num_quantiles=8)¶
Analysis of the iteratively sampled configurations during the optimization procedure. Multi-dimensional scaling (MDS) is used to reduce dimensionality of the search space and plot the distribution of evaluated configurations. The larger the dot, the more often the configuration was evaluated on instances from the set. Configurations that were incumbents at least once during optimization are marked as red squares. Configurations acquired through local search are marked with a ‘x’. The downward triangle denotes the final incumbent, whereas the orange upward triangle denotes the default configuration. The heatmap and the colorbar correspond to the predicted performance in that part of the search space.
- use_timeslider (bool) – whether to generate time-slíder widget in bokehplot (cool, but time-consuming)
- max_confs (int) – maximum number of configurations to consider for the plot
- num_quantiles (int) – number of quantiles for evolution over time (number of time-steps to look at)
configurators_behavior(d, run, cost_over_time=False, cfp=False, cfp_max_plot=-1, cfp_time_slider=False, cfp_number_quantiles=1, parallel_coordinates=False)¶
Depicts the average cost of the best so far found configuration (using all trajectory data) over the time spent by the configurator (including target algorithm runs and the overhead generated by the configurator) If the curve flattens out early, it indicates that too much time was spent for the configurator run; whereas a curve that is still improving at the end of the budget indicates that one should increase the configuration budget. The plotted standard deviation gives the uncertainty over multiple configurator runs.
feature_analysis(d, run, box_violin=False, correlation=False, clustering=False, importance=False)¶
Clustering instances in 2d; the color encodes the cluster assigned to each cluster. Similar to ISAC, we use a k-means to cluster the instances in the feature space. As pre-processing, we use standard scaling and a PCA to 2 dimensions. To guess the number of clusters, we use the silhouette score on the range of 2 to 12 in the number of clusters
Correlation of features based on the Pearson product-moment correlation. Since instance features are used to train an empirical performance model in model-based configurators, it can be important to remove correlated features in a pre-processing step depending on the machine-learning algorithm. Darker fields corresponds to a larger correlation between the features.
Using an empirical performance model, performance changes of a configuration along each parameter are calculated. To quantify the importance of a parameter value, the variance of all cost values by changing that parameter are predicted and then the fraction of all variances is computed. This analysis is inspired by the human behaviour to look for improvements in the neighborhood of individual parameters of a configuration.
Meta data, i.e. number of instances and parameters as well as configuration budget. Statistics apply to the best run, if multiple configurator runs are compared.
parallel_coordinates(cave, params: Union[int, List[str]] = 5, n_configs: int = 100, max_runs_epm: int = 300000)¶
Previously used by Golovin et al. to study the frequency of chosen parameter settings in black-box-optimization. Each line corresponds to one configuration in the runhistory and shows the parameter settings and the corresponding (estimated) average cost. To handle large configuration spaces with hundreds of parameters, the (at most) 10 most important parameters based on a fANOVA parameter importance analysis are plotted. To emphasize better configurations, the performance is encoded in the color of each line, ranging from blue to red. These plots provide insights into whether the configurator focused on specific parameter values and how these correlate to their costs.
NOTE: the given runhistory should contain only optimization and no validation to analyze the explored parameter-space.
- params (List[str] or int) – if int, plot at most params parameters, trying to determine with parameter importance. if List of strings, the names of the parameters to be plotted
- n_configs (int) – number of configs. will try to find most interesting configs to plot
- max_runs_epm (int) – this is a maximum of runs to be used for training of the epm. use to avoid MemoryErrors
parameter_importance(d, run, ablation=False, fanova=False, forward_selection=False, lpi=False, pimp_sort_table_by='average')¶
Perform the specified parameter importance procedures.
performance_analysis(d, run, performance, cdf, scatter, algo_footprint)¶
Generate performance analysis.
- d (dictionary) – dictionary to add entries to
- cdf, scatter, algo_footprint (performance,) – what analysis-methods to perform
If the run-objective is ‘runtime’: PAR stands for Penalized Average Runtime. If there is a timeout in the scenario, runs that were thus cut off can be penalized with a factor (because we do not know how long it would have run). PAR1 is no penalty, PAR10 will count all cutoffs with a factor of 10.
For timeouts: if there are multiple runs on the same configuration-instance pair (with different seeds), some resulting in timeouts and some not, the majority decides here.
P-value (between 0 and 1) results from comparing default and incumbent using a paired permutation test with 10000 iterations (permuting instances) and tests against the null-hypothesis that the mean of performance between default and incumbent is equal.
Oracle performance searches for the best single run per instance (so the best seed/configuration-pair that was seen) and aggregates over them.
Parameters are sorted by pimp_sort_table_by. Note, that the values are not directly comparable, since the different techniques provide different metrics (see respective tooltips for details on the differences).
Forward Selection is a generic method to obtain a subset of parameters to achieve the same prediction error as with the full parameter set. Each parameter is scored by how much the out-of-bag-error of an empirical performance model based on a random forest is decreased.
Depicts cost distributions over the set of instances. Since these are empirical distributions, the plots show step functions. These plots provide insights into how well configurations perform up to a certain threshold. For runtime scenarios this shows the probability of solving all instances from the set in a given timeframe. On the left side the training-data is scattered, on the right side the test-data is scattered.
Scatter plots show the costs of the default and optimized parameter configuration on each instance. Since this looses detailed information about the individual cost on each instance by looking at aggregated cost values in tables, scatter plots provide a more detailed picture. They provide insights whether overall performance improvements can be explained only by some outliers or whether they are due to improvements on the entire instance set. On the left side the training-data is scattered, on the right side the test-data is scattered.
If the analyzed configurator uses budgets, print a list of available budgets.