# Quickstart

Make sure you have followed the setup guide first.
We will use the synthetic MFHartmann benchmark for this tutorial, as it requires no downloads to run.
In general, the only import you should need for generic use is `import mfpbench`, and `mfpbench.get(...)` to get a benchmark.
There are also some nuances when working with tabular data that are worth mentioning; see the Tabular Benchmarks section below for more information.
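For example, a minimal sketch of the workflow covered in the rest of this page (`'mfh3'` is one of the synthetic MFHartmann benchmark names listed further down):

```python
import mfpbench

# The synthetic 3-dimensional MFHartmann benchmark; no data download required.
benchmark = mfpbench.get("mfh3")

config = benchmark.sample()                          # sample a random config
result = benchmark.query(config, at=benchmark.end)   # evaluate at the maximum fidelity
print(result.error)
```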
**Quick Reference**

**Useful Properties**

* `.space` - The space of the benchmark.
* `.start` - The starting fidelity of the benchmark.
* `.end` - The end fidelity of the benchmark.
* `.fidelity_name` - The name of the fidelity.
* `.table` - The table backing a `TabularBenchmark`.
* `.Config` - The type of config used by the benchmark; attached to the benchmark object.
* `.Result` - The type of result used by the benchmark; attached to the benchmark object.

**Main Methods**

* `sample(n)` - Sample one or many configs from a benchmark.
* `query(config, at)` - Query a benchmark at a given fidelity.
* `trajectory(config)` - Get the full trajectory curve of a config.

**Other**

* `load()` - Load a benchmark into memory if not already loaded.
## Getting a benchmark

We try to make the normal use case of a benchmark as simple as possible.
For this we use `get()`. Each benchmark comes with its own `**kwargs`,
which you can find in the API documentation of `get()`.
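As a hedged sketch of how those `**kwargs` are passed through (this assumes the lcbench data has already been downloaded as described in the setup guide; `'3945'` is one of the task ids listed below):

```python
import mfpbench

# YAHPO-Gym's lcbench needs a task_id; other benchmarks take different kwargs.
benchmark = mfpbench.get("lcbench", task_id="3945", seed=1)
```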
**API**

`get()` - Get a benchmark.

| PARAMETER | DESCRIPTION |
|---|---|
| `name` | The name of the benchmark. |
| `prior` | The prior to use for the benchmark. `str` - if it ends in `.json`, `.yaml` or `.yml`, it will be converted to a path and used as a path to a config; otherwise it is treated as a preset. `Path` - a path to a file. `Config` - a `Config` object. `None` - use the default if available. |
| `preload` | Whether to preload the benchmark data. |
| `**kwargs` | Extra arguments, optional or required for other benchmarks. Please look up the associated benchmarks. |
For the `**kwargs`, please see the benchmarks listed below by `name=`.

**`name='lcbench'` (YAHPO-GYM)**

Possible `task_id=` values:
('3945', '7593', '34539', '126025', '126026', '126029', '146212', '167104', '167149', '167152', '167161', '167168', '167181', '167184', '167185', '167190', '167200', '167201', '168329', '168330', '168331', '168335', '168868', '168908', '168910', '189354', '189862', '189865', '189866', '189873', '189905', '189906', '189908', '189909')
| PARAMETER | DESCRIPTION |
|---|---|
| `task_id` | The task id to choose. |
| `seed` | The seed to use. |
| `datadir` | The path to where mfpbench stores its data. If left to `None`, the default data directory is used. |
| `seed` | The seed for the benchmark instance. |
| `prior` | The prior to use for the benchmark. If `None`, no prior is used. If a `str`, will check the local location first for a prior specific to this benchmark, otherwise assumes it to be a `Path`. If a `Path`, will load the prior from the path. If a `Mapping`, will be used directly. |
| `perturb_prior` | If given, will perturb the prior by this amount. Only used if a prior is given. |
| `session` | The onnxruntime session to use. If `None`, will create a new one. Not for the faint of heart: this is only a backdoor for onnx compatibility issues with YahpoGym. You are advised not to use this unless you know what you are doing. |
**`name='lm1b_transformer_2048'` (PD1)**

| PARAMETER | DESCRIPTION |
|---|---|
| `datadir` | Path to the data directory. |
| `seed` | The seed to use for the space. |
| `prior` | Any prior to use for the benchmark. |
| `perturb_prior` | Whether to perturb the prior. If specified, this is interpreted as the std of a normal from which to perturb numerical hyperparameters of the prior, and the raw probability of swapping a categorical value. |
**`name='uniref50_transformer_128'` (PD1)**

| PARAMETER | DESCRIPTION |
|---|---|
| `datadir` | Path to the data directory. |
| `seed` | The seed to use for the space. |
| `prior` | Any prior to use for the benchmark. |
| `perturb_prior` | Whether to perturb the prior. If specified, this is interpreted as the std of a normal from which to perturb numerical hyperparameters of the prior, and the raw probability of swapping a categorical value. |
**`name='cifar100_wideresnet_2048'` (PD1)**

| PARAMETER | DESCRIPTION |
|---|---|
| `datadir` | Path to the data directory. |
| `seed` | The seed to use for the space. |
| `prior` | Any prior to use for the benchmark. |
| `perturb_prior` | Whether to perturb the prior. If specified, this is interpreted as the std of a normal from which to perturb numerical hyperparameters of the prior, and the raw probability of swapping a categorical value. |
**`name='imagenet_resnet_512'` (PD1)**

| PARAMETER | DESCRIPTION |
|---|---|
| `datadir` | Path to the data directory. |
| `seed` | The seed to use for the space. |
| `prior` | Any prior to use for the benchmark. |
| `perturb_prior` | Whether to perturb the prior. If specified, this is interpreted as the std of a normal from which to perturb numerical hyperparameters of the prior, and the raw probability of swapping a categorical value. |
**`name='jahs'`**

Possible `task_id=` values:

| PARAMETER | DESCRIPTION |
|---|---|
| `task_id` | The specific task to use. |
| `datadir` | The path to where mfpbench stores its data. If left to `None`, the default data directory is used. |
| `seed` | The seed to give this benchmark instance. |
| `prior` | The prior to use for the benchmark. |
| `perturb_prior` | If given, will perturb the prior by this amount. Only used if a prior is given. |
**`name='mfh3'`**

| PARAMETER | DESCRIPTION |
|---|---|
| `seed` | The seed to use. |
| `bias` | How much bias to introduce. |
| `noise` | How much noise to introduce. |
| `prior` | The prior to use for the benchmark. |
| `perturb_prior` | If not `None`, will perturb the prior by this amount. For numericals, this is interpreted as the standard deviation of a normal distribution, while for categoricals, this is interpreted as the probability of swapping the value for a random one. |
**`name='mfh6'`**

| PARAMETER | DESCRIPTION |
|---|---|
| `seed` | The seed to use. |
| `bias` | How much bias to introduce. |
| `noise` | How much noise to introduce. |
| `prior` | The prior to use for the benchmark. |
| `perturb_prior` | If not `None`, will perturb the prior by this amount. For numericals, this is interpreted as the standard deviation of a normal distribution, while for categoricals, this is interpreted as the probability of swapping the value for a random one. |
**`name='lcbench_tabular'`**

Possible `task_id=` values:
('adult', 'airlines', 'albert', 'Amazon_employee_access', 'APSFailure', 'Australian', 'bank-marketing', 'blood-transfusion-service-center', 'car', 'christine', 'cnae-9', 'connect-4', 'covertype', 'credit-g', 'dionis', 'fabert', 'Fashion-MNIST', 'helena', 'higgs', 'jannis', 'jasmine', 'jungle_chess_2pcs_raw_endgame_complete', 'kc1', 'KDDCup09_appetency', 'kr-vs-kp', 'mfeat-factors', 'MiniBooNE', 'nomao', 'numerai28.6', 'phoneme', 'segment', 'shuttle', 'sylvine', 'vehicle', 'volkert')
| PARAMETER | DESCRIPTION |
|---|---|
| `task_id` | The task to benchmark on. |
| `datadir` | The directory to look for the data in. If left to `None`, the default data directory is used. |
| `remove_constants` | Whether to remove constant config columns from the data or not. |
| `seed` | The seed to use. |
| `prior` | The prior to use for the benchmark. If `None`, no prior is used. If a `str`, will check the local location first for a prior specific to this benchmark, otherwise assumes it to be a `Path`. If a `Path`, will load the prior from the path. If a `Mapping`, will be used directly. |
| `perturb_prior` | If not `None`, will perturb the prior by this amount. For numericals, this is interpreted as the standard deviation of a normal distribution, while for categoricals, this is interpreted as the probability of swapping the value for a random one. |
**Preloading benchmarks**

By default, benchmarks will not load in their required data or surrogate models. To have these ready and in memory, you can pass in `preload=True`.
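For example, a minimal sketch using the synthetic `'mfh3'` benchmark again:

```python
import mfpbench

# Load any required data or surrogate models up front rather than lazily.
benchmark = mfpbench.get("mfh3", preload=True)
```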
## Properties of Benchmarks

All benchmarks inherit from `Benchmark`, which has some useful properties we might want to know about:
print(f"Benchmark fidelity starts at: {benchmark.start}")
print(f"Benchmark fidelity ends at: {benchmark.end}")
print(f"Benchmark fidelity is called: {benchmark.fidelity_name}")
print(f"Benchmark has conditionals: {benchmark.has_conditionals}")
print("Benchmark has the following space")
print(benchmark.space)
Benchmark fidelity starts at: 3
Benchmark fidelity ends at: 100
Benchmark fidelity is called: z
Benchmark has conditionals: False
Benchmark has the following space
Configuration space object:
Hyperparameters:
mfh3
X_0, Type: UniformFloat, Range: [0.0, 1.0], Default: 0.5
X_1, Type: UniformFloat, Range: [0.0, 1.0], Default: 0.5
X_2, Type: UniformFloat, Range: [0.0, 1.0], Default: 0.5
## Sampling from a benchmark

To sample from a benchmark, we use the `sample()` method.
This method takes in a number of samples to return and returns a list of configs; called with no arguments, it returns a single config.

```python
config = benchmark.sample()
print(config)

configs = benchmark.sample(10, seed=2)
```
## Querying a benchmark

To query a benchmark, we use the `query()` method.
This method takes in a config and a fidelity to query at and returns the `Result` of the benchmark at that fidelity.
By default, this will query at the maximum fidelity, but you can pass `at=` to query at a different fidelity.

```python
value = benchmark.query(config)
print(value)

value = benchmark.query(config, at=benchmark.start)
print(value)
```
When querying a benchmark, you can get the entire trajectory curve of a config with `trajectory()`. This will be a `list[Result]`, one for each fidelity available.

```python
trajectory = benchmark.trajectory(config)
print(len(trajectory))

errors = [r.error for r in trajectory]

trajectory = benchmark.trajectory(config, frm=benchmark.start, to=benchmark.end // 2)
print(len(trajectory))
```
**Tip**

The `query()` and `trajectory()` functions can take in a `Config` object or anything that looks like a mapping.
## Working with `Config` objects

When interacting with a `Benchmark`, you will always be returned `Config` objects. These contain some simple methods to make working with them easier.
They behave like an immutable dictionary, so you can use them just as you would a read-only `dict`.
```python
config = benchmark.sample()

print("index", config["X_1"])
print("get", config.get("X_1231", 42))

for key, value in config.items():
    print(key, value)

print("contains", "X_1" in config)
print("len", len(config))
print("dict", dict(config))
```
**How is that done?**

This is done by inheriting from Python's `Mapping` class and implementing its methods, namely `__getitem__()`, `__iter__()` and `__len__()`. You can also implement things to look like lists, containers and other pythonic things!
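As an illustration only (this is not the actual `Config` implementation), a minimal read-only mapping looks like this:

```python
from collections.abc import Mapping


class FrozenConfig(Mapping):
    """Illustrative sketch of a read-only, dict-like config."""

    def __init__(self, values: dict[str, float]):
        self._values = dict(values)

    def __getitem__(self, key: str) -> float:
        return self._values[key]

    def __iter__(self):
        return iter(self._values)

    def __len__(self) -> int:
        return len(self._values)


c = FrozenConfig({"X_0": 0.5, "X_1": 0.2})
print("X_0" in c, len(c), dict(c))  # True 2 {'X_0': 0.5, 'X_1': 0.2}
```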
* `Config.dict()` returns a dictionary of the config. This is useful for working with the config in other libraries.
* `Config.copy()` returns a new config with the same values.
* `Config.mutate()` takes in a dictionary of keys to values and returns a new config with those values changed.
* `Config.perturb()` takes in the space the config is from, a standard deviation and/or a categorical swap chance, and returns a new config with the values perturbed by a normal distribution with the given standard deviation and/or the categorical swap chance.
* `Config.save()` and `Config.from_file()` are used to save and load configs to and from disk.
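A hedged usage sketch of the methods above (the exact call styles are assumptions; for instance, whether `mutate()` takes keyword arguments or a dict, and the keyword name for the standard deviation in `perturb()`, may differ, so check the `Config` API reference):

```python
config = benchmark.sample()

as_dict = config.dict()    # plain dict, handy for other libraries
clone = config.copy()      # a new config with the same values

# Assumed call styles, for illustration only:
mutated = config.mutate(X_1=0.9)                   # change one value
noisy = config.perturb(benchmark.space, std=0.1)   # perturb within the space

config.save("my_config.yaml")                      # write to disk
restored = benchmark.Config.from_file("my_config.yaml")
```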
## Working with `Result` objects

When interacting with a `Benchmark`, all results will be communicated back with `Result` objects. These contain some simple methods to make working with them easier. Every benchmark has a different set of results available, but in general we try to make at least an `error` and a `score` available. We also make a `cost` available, which is often something like the time taken to train the specific config. The `error` and `score` attributes are usually validation errors and scores. Some benchmarks also provide a `test_error` and `test_score`, which are the test errors and scores, but not all of them do.
```python
config = benchmark.sample()
result = benchmark.query(config)

print("error", result.error)
print("cost", result.cost)
print(result)
```
These share the `dict()` and `from_dict()` methods with `Config` objects but do not behave like dictionaries.

The most notable property of `Result` objects is that they also carry the `fidelity` at which they were evaluated and the `config` that was evaluated to generate the result.
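For example (50 is just an arbitrary fidelity inside the `[start, end]` range of the MFHartmann benchmark used above):

```python
result = benchmark.query(config, at=50)
print(result.fidelity)  # the fidelity the result was evaluated at, i.e. 50
print(result.config)    # the config that produced this result
```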
## Tabular Benchmarks

Some benchmarks are tabular in nature, meaning they have a table of results that can be queried. These benchmarks inherit from `TabularBenchmark` and have a `table` property that is the ground source of truth for the benchmark. This table is a `pandas.DataFrame` and can be queried as such.
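For instance, given some tabular benchmark bound to a hypothetical `tabular_benchmark` variable, a sketch of poking at the raw table directly; the column and index names here follow the `GenericTabularBenchmark` example further below and will differ per benchmark:

```python
df = tabular_benchmark.table  # a pandas.DataFrame

# Standard pandas operations apply; in the example further below the frame
# is indexed by (config id, fidelity):
print(df.loc["a"])            # all fidelities of config "a"
print(df["error"].idxmin())   # (config id, fidelity) of the row with the lowest error
```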
In general, tabular benchmarks will construct themselves using the base `TabularBenchmark`. This requires the following arguments, which are used to normalize the table for efficient indexing and usage. Predefined tabular benchmarks, e.g. `LCBenchTabularBenchmark`, will fill these in for you.
**Required arguments for a `TabularBenchmark`**

The main required arguments are `.config_name`, `.fidelity_name`, `.config_keys` and `.result_keys`.
| PARAMETER | DESCRIPTION |
|---|---|
| `name` | The name of this benchmark. |
| `table` | The table to use for the benchmark. |
| `config_name` | The column in the table that contains the config id. |
| `fidelity_name` | The column in the table that contains the fidelity. |
| `result_keys` | The columns in the table that contain the results. |
| `config_keys` | The columns in the table that contain the config values. |
| `remove_constants` | Whether to remove constant config columns from the data or not. |
| `space` | The configuration space to use for the benchmark. If `None`, will just be an empty space. |
| `prior` | The prior to use for the benchmark. If `None`, no prior is used. If a string, will be treated as a prior specific to this benchmark if it can be found, otherwise assumed to be a `Path`. If a `Path`, will load the prior from the path. If a dict or `Configuration`, will be used directly. |
| `perturb_prior` | If not `None`, will perturb the prior by this amount. For numericals, this is interpreted as the standard deviation of a normal distribution, while for categoricals, this is interpreted as the probability of swapping the value for a random one. |
| `seed` | The seed to use for the benchmark. |
### Difference for `Config`

When working with tabular benchmarks, the config type used is a `TabularConfig`.
The one difference is that it includes an `.id` property that is used to identify the config in the table. This is what's used to retrieve results from the table.
If this is missing when doing a `query()`, we'll do our best to match the config to the table and get the correct id, but this is not guaranteed.
When using `dict()`, this `id` is not included in the dictionary.
In general, you should either store the `config` object itself, or at least `config.id`, so that you can include it back in before calling `query()`.
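A hedged sketch of that advice, assuming `benchmark` is some `TabularBenchmark` instance (how an id would be re-attached to a plain dict depends on the `TabularConfig` API; the safest route is to keep the config object around and pass it to `query()` directly):

```python
config = benchmark.sample()
stored_id = config.id        # keep the id alongside anything you serialize
raw = config.dict()          # note: does not contain the id

# Safest: query with the config object itself, which still carries its id.
result = benchmark.query(config, at=benchmark.end)
```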
### Using your own Tabular Data

To facilitate the use of your own tabular data, we provide a `GenericTabularBenchmark` that can be used to load in and use your own tabular data.
```python
import pandas as pd

from mfpbench import GenericTabularBenchmark

# Create some fake data
df = pd.DataFrame(
    {
        "config": ["a", "a", "a", "b", "b", "b", "c", "c", "c"],
        "fidelity": [1, 2, 3, 1, 2, 3, 1, 2, 3],
        "balanced_accuracy": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
        "color": ["red", "red", "red", "blue", "blue", "blue", "green", "green", "green"],
        "shape": ["circle", "circle", "circle", "square", "square", "square", "triangle", "triangle", "triangle"],
        "kind": ["mlp", "mlp", "mlp", "mlp", "mlp", "mlp", "mlp", "mlp", "mlp"],
    }
)
print(df)
print()
print()

benchmark = GenericTabularBenchmark(
    df,
    name="mydata",
    config_name="config",
    fidelity_name="fidelity",
    config_keys=["color", "shape"],
    result_keys=["balanced_accuracy"],
    result_mapping={
        "error": lambda df: 1 - df["balanced_accuracy"],
        "score": lambda df: df["balanced_accuracy"],
    },
    remove_constants=True,
)
print(benchmark.table)
```
```
  config  fidelity  balanced_accuracy  color     shape kind
0      a         1                0.1    red    circle  mlp
1      a         2                0.2    red    circle  mlp
2      a         3                0.3    red    circle  mlp
3      b         1                0.4   blue    square  mlp
4      b         2                0.5   blue    square  mlp
5      b         3                0.6   blue    square  mlp
6      c         1                0.7  green  triangle  mlp
7      c         2                0.8  green  triangle  mlp
8      c         3                0.9  green  triangle  mlp


                 balanced_accuracy  color     shape  error  score
config fidelity
a      1                       0.1    red    circle    0.9    0.1
       2                       0.2    red    circle    0.8    0.2
       3                       0.3    red    circle    0.7    0.3
b      1                       0.4   blue    square    0.6    0.4
       2                       0.5   blue    square    0.5    0.5
       3                       0.6   blue    square    0.4    0.6
c      1                       0.7  green  triangle    0.3    0.7
       2                       0.8  green  triangle    0.2    0.8
       3                       0.9  green  triangle    0.1    0.9
```
You can then operate on this benchmark as expected.
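For example, a short sketch using only the methods shown earlier on this page, applied to the `benchmark` constructed above:

```python
config = benchmark.sample()
result = benchmark.query(config, at=2)   # fidelities 1, 2 and 3 exist in the table
print(result.error, result.score)

trajectory = benchmark.trajectory(config)
print([r.error for r in trajectory])
```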
**API for `GenericTabularBenchmark`**

| PARAMETER | DESCRIPTION |
|---|---|
| `table` | The table to use for the benchmark. |
| `name` | The name of the benchmark. If `None`, a default name will be used. |
| `fidelity_name` | The column in the table that contains the fidelity. |
| `config_name` | The column in the table that contains the config id. |
| `result_keys` | The columns in the table that contain the results. |
| `config_keys` | The columns in the table that contain the config values. |
| `result_mapping` | A mapping from the result keys to the table keys. If a string, will be used as the key in the table. If a callable, will be called with the table and the result will be used as the value. |
| `remove_constants` | Whether to remove constant config columns from the data or not. |
| `space` | The configuration space to use for the benchmark. If `None`, will just be an empty space. |
| `seed` | The seed to use. |
| `prior` | The prior to use for the benchmark. If `None`, no prior is used. If a `str`, will check the local location first for a prior specific to this benchmark, otherwise assumes it to be a `Path`. If a `Path`, will load the prior from the path. If a `Mapping`, will be used directly. |
| `perturb_prior` | If not `None`, will perturb the prior by this amount. For numericals, this is interpreted as the standard deviation of a normal distribution, while for categoricals, this is interpreted as the probability of swapping the value for a random one. |