Spaces#
A common requirement when optimizing a pipeline is the ability to parametrize it. To do so, we often parametrize each component separately, with the structure of the pipeline adding additional constraints.
To facilitate this, we allow the construction of
pipelines, where each part
of the pipeline can contain a .space.
When we wish to extract the entire search space from the pipeline, we can
call search_space(parser=...)
on the root node
of our pipeline, which returns a space object.
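Conceptually, search_space() walks the pipeline tree and gathers each node's .space into one combined space, qualifying each name by its position in the tree. The sketch below illustrates that idea with the standard library only; the Node class and naming scheme here are illustrative stand-ins, not AMLTK's actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A stand-in for a pipeline node that owns a .space."""

    name: str
    space: object = None  # treated opaquely by the tree walk
    children: list = field(default_factory=list)


def collect_spaces(node: Node, prefix: str = ""):
    """Walk the tree, yielding (qualified_name, space) pairs."""
    qualified = f"{prefix}{node.name}"
    if node.space is not None:
        yield qualified, node.space
    for child in node.children:
        yield from collect_spaces(child, prefix=f"{qualified}:")


tree = Node("Pipeline", children=[Node("PCA", space={"n_components": (1, 3)})])
print(dict(collect_spaces(tree)))
# {'Pipeline:PCA': {'n_components': (1, 3)}}
```

The qualified names ("Pipeline:PCA:n_components" and so on) match the naming you will see in the parsed spaces throughout this page.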
Unfortunately, there are quite a few search space implementations out there.
Some support concepts such as forbidden combinations, conditionals and
functional constraints, while others support only plain numerical
parameters. The choice of a particular space representation may also
depend on the Optimizer
you wish to use, as an optimizer typically has one preferred search
space representation.
To generalize over this, AMLTK itself will not care what is in a .space
of each part of the pipeline, i.e.
╭─ Component(object) ────────╮
│ item class object(...) │
│ space 'hmmm, a str space?' │
╰────────────────────────────╯
What follows below is a list of supported parsers you can pass as parser=
to extract a search space representation.
ConfigSpace#
ConfigSpace is a library for representing and sampling configurations for hyperparameter optimization. It features a straightforward API for defining hyperparameters, their ranges and even conditional dependencies.
It is generally flexible enough for more complex use cases, even handling the complex pipelines of AutoSklearn and AutoPyTorch, large scale hyperparameter spaces over which to optimize entire pipelines at a time.
Requirements
This requires ConfigSpace,
which can be installed with pip install ConfigSpace.
In general, you should have the ConfigSpace documentation ready to consult for a full understanding of how to construct hyperparameter spaces with AMLTK.
Basic Usage#
You can directly use the parser()
function and pass it into the search_space()
method of a Node,
or you can simply provide
search_space(parser="configspace", ...)
for simplicity.
from amltk.pipeline import Component, Choice, Sequential
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
my_pipeline = (
Sequential(name="Pipeline")
>> Component(PCA, space={"n_components": (1, 3)})
>> Choice(
Component(
SVC,
space={"C": (0.1, 10.0)}
),
Component(
RandomForestClassifier,
space={"n_estimators": (10, 100), "criterion": ["gini", "log_loss"]},
),
Component(
MLPClassifier,
space={
"activation": ["identity", "logistic", "relu"],
"alpha": (0.0001, 0.1),
"learning_rate": ["constant", "invscaling", "adaptive"],
},
),
name="estimator"
)
)
space = my_pipeline.search_space("configspace")
print(space)
Configuration space object:
Hyperparameters:
Pipeline:PCA:n_components, Type: UniformInteger, Range: [1, 3], Default: 2
Pipeline:estimator:MLPClassifier:activation, Type: Categorical, Choices: {identity, logistic, relu}, Default: identity
Pipeline:estimator:MLPClassifier:alpha, Type: UniformFloat, Range: [0.0001, 0.1], Default: 0.05005
Pipeline:estimator:MLPClassifier:learning_rate, Type: Categorical, Choices: {constant, invscaling, adaptive}, Default: constant
Pipeline:estimator:RandomForestClassifier:criterion, Type: Categorical, Choices: {gini, log_loss}, Default: gini
Pipeline:estimator:RandomForestClassifier:n_estimators, Type: UniformInteger, Range: [10, 100], Default: 55
Pipeline:estimator:SVC:C, Type: UniformFloat, Range: [0.1, 10.0], Default: 5.05
Pipeline:estimator:__choice__, Type: Categorical, Choices: {MLPClassifier, RandomForestClassifier, SVC}, Default: MLPClassifier
Conditions:
Pipeline:estimator:MLPClassifier:activation | Pipeline:estimator:__choice__ == 'MLPClassifier'
Pipeline:estimator:MLPClassifier:alpha | Pipeline:estimator:__choice__ == 'MLPClassifier'
Pipeline:estimator:MLPClassifier:learning_rate | Pipeline:estimator:__choice__ == 'MLPClassifier'
Pipeline:estimator:RandomForestClassifier:criterion | Pipeline:estimator:__choice__ == 'RandomForestClassifier'
Pipeline:estimator:RandomForestClassifier:n_estimators | Pipeline:estimator:__choice__ == 'RandomForestClassifier'
Pipeline:estimator:SVC:C | Pipeline:estimator:__choice__ == 'SVC'
Here we have an example of a few different kinds of hyperparameters:
PCA:n_components
is an integer with a uniform distribution over the range 1 to 3, as specified by its integer bounds in a tuple.
SVC:C
is a float with a uniform distribution over the range 0.1 to 10.0, as specified by its float bounds in a tuple.
RandomForestClassifier:criterion
is a categorical hyperparameter with two choices, "gini"
and "log_loss".
There is also a Choice
node, which is a special node that indicates that
we could choose from one of these estimators. This leads to the conditionals that you
can see in the printed out space.
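To make the semantics of those conditions concrete: a hyperparameter under a Choice branch is only active when __choice__ selects that branch. The sketch below mimics that rule in plain Python for the pipeline above; it is illustrative only, not how AMLTK or ConfigSpace actually evaluate conditions.

```python
# The branch names of the Choice node in the example pipeline.
BRANCHES = {"SVC", "RandomForestClassifier", "MLPClassifier"}


def active_params(config: dict, choice_key: str = "Pipeline:estimator:__choice__") -> dict:
    """Drop hyperparameters belonging to branches that were not chosen."""
    chosen = config[choice_key]
    active = {}
    for name, value in config.items():
        # Find which branch (if any) this hyperparameter lives under.
        branch = next((part for part in name.split(":") if part in BRANCHES), None)
        if branch is None or branch == chosen:
            active[name] = value
    return active


config = {
    "Pipeline:estimator:__choice__": "SVC",
    "Pipeline:PCA:n_components": 2,
    "Pipeline:estimator:SVC:C": 1.0,
    "Pipeline:estimator:MLPClassifier:alpha": 0.01,
}
print(active_params(config))
# Keeps the PCA and SVC entries, drops the MLPClassifier one.
```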
You may wish to remove all conditionals if an Optimizer
does not support them, or
you may wish to remove them for other reasons. You can do this by passing
conditionals=False
to the parser()
function.
Configuration space object:
Hyperparameters:
Pipeline:PCA:n_components, Type: UniformInteger, Range: [1, 3], Default: 2
Pipeline:estimator:MLPClassifier:activation, Type: Categorical, Choices: {identity, logistic, relu}, Default: identity
Pipeline:estimator:MLPClassifier:alpha, Type: UniformFloat, Range: [0.0001, 0.1], Default: 0.05005
Pipeline:estimator:MLPClassifier:learning_rate, Type: Categorical, Choices: {constant, invscaling, adaptive}, Default: constant
Pipeline:estimator:RandomForestClassifier:criterion, Type: Categorical, Choices: {gini, log_loss}, Default: gini
Pipeline:estimator:RandomForestClassifier:n_estimators, Type: UniformInteger, Range: [10, 100], Default: 55
Pipeline:estimator:SVC:C, Type: UniformFloat, Range: [0.1, 10.0], Default: 5.05
Pipeline:estimator:__choice__, Type: Categorical, Choices: {MLPClassifier, RandomForestClassifier, SVC}, Default: MLPClassifier
Likewise, you can also remove all hierarchy from the space, which may make downstream tasks easier,
by passing flat=True
to the parser()
function.
Configuration space object:
Hyperparameters:
MLPClassifier:activation, Type: Categorical, Choices: {identity, logistic, relu}, Default: identity
MLPClassifier:alpha, Type: UniformFloat, Range: [0.0001, 0.1], Default: 0.05005
MLPClassifier:learning_rate, Type: Categorical, Choices: {constant, invscaling, adaptive}, Default: constant
PCA:n_components, Type: UniformInteger, Range: [1, 3], Default: 2
RandomForestClassifier:criterion, Type: Categorical, Choices: {gini, log_loss}, Default: gini
RandomForestClassifier:n_estimators, Type: UniformInteger, Range: [10, 100], Default: 55
SVC:C, Type: UniformFloat, Range: [0.1, 10.0], Default: 5.05
estimator:__choice__, Type: Categorical, Choices: {MLPClassifier, RandomForestClassifier, SVC}, Default: MLPClassifier
Conditions:
MLPClassifier:activation | estimator:__choice__ == 'MLPClassifier'
MLPClassifier:alpha | estimator:__choice__ == 'MLPClassifier'
MLPClassifier:learning_rate | estimator:__choice__ == 'MLPClassifier'
RandomForestClassifier:criterion | estimator:__choice__ == 'RandomForestClassifier'
RandomForestClassifier:n_estimators | estimator:__choice__ == 'RandomForestClassifier'
SVC:C | estimator:__choice__ == 'SVC'
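Comparing the flat output with the hierarchical one above, flattening keeps only the final parent:name segment of each hierarchical key. A conceptual sketch of that renaming (illustrative only, not AMLTK's implementation):

```python
def flatten_name(name: str) -> str:
    """Keep only the last 'parent:param' segment of a hierarchical name."""
    return ":".join(name.split(":")[-2:])


names = [
    "Pipeline:PCA:n_components",
    "Pipeline:estimator:MLPClassifier:activation",
    "Pipeline:estimator:__choice__",
]
print([flatten_name(n) for n in names])
# ['PCA:n_components', 'MLPClassifier:activation', 'estimator:__choice__']
```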
More Specific Hyperparameters#
You'll often want to be a bit more specific with your hyperparameters, here we just
show a few examples of how you'd couple your pipelines a bit more towards ConfigSpace
.
from ConfigSpace import Float, Categorical, Normal
from amltk.pipeline import Searchable
s = Searchable(
space={
"lr": Float("lr", bounds=(1e-5, 1.), log=True, default=0.3),
"balance": Float("balance", bounds=(-1.0, 1.0), distribution=Normal(0.0, 0.5)),
"color": Categorical("color", ["red", "green", "blue"], weights=[2, 1, 1], default="blue"),
},
name="Something-To-Search",
)
print(s.search_space("configspace"))
Configuration space object:
Hyperparameters:
Something-To-Search:balance, Type: NormalFloat, Mu: 0.0 Sigma: 0.5, Range: [-1.0, 1.0], Default: 0.0
Something-To-Search:color, Type: Categorical, Choices: {red, green, blue}, Default: blue, Probabilities: (0.5, 0.25, 0.25)
Something-To-Search:lr, Type: UniformFloat, Range: [1e-05, 1.0], Default: 0.3, on log-scale
Conditionals and Advanced Usage#
We will refer you to the
ConfigSpace documentation for the construction
of these. However, once you've constructed a ConfigurationSpace
and added any forbiddens and
conditionals, you may simply set it as the .space
attribute.
from amltk.pipeline import Component, Choice, Sequential
from ConfigSpace import ConfigurationSpace, EqualsCondition, InCondition
myspace = ConfigurationSpace({"A": ["red", "green", "blue"], "B": (1, 10), "C": (-100.0, 0.0)})
myspace.add_conditions([
EqualsCondition(myspace["B"], myspace["A"], "red"), # B is active when A is red
InCondition(myspace["C"], myspace["A"], ["green", "blue"]), # C is active when A is green or blue
])
component = Component(object, space=myspace, name="MyThing")
parsed_space = component.search_space("configspace")
print(parsed_space)
Configuration space object:
Hyperparameters:
MyThing:A, Type: Categorical, Choices: {red, green, blue}, Default: red
MyThing:B, Type: UniformInteger, Range: [1, 10], Default: 6
MyThing:C, Type: UniformFloat, Range: [-100.0, 0.0], Default: -50.0
Conditions:
MyThing:B | MyThing:A == 'red'
MyThing:C | MyThing:A in {'green', 'blue'}
Optuna#
Optuna parser for parsing out a
search_space()
from a pipeline.
Requirements
This requires Optuna,
which can be installed with pip install optuna.
Limitations
Optuna features a very dynamic search space (define-by-run), where users typically sample from a trial object and use ordinary Python control flow to define conditionality.
This means we cannot trivially represent this conditionality in a static search space. While band-aids are possible, it naturally does not sit well with the static output of a parser.
As such, our parser does not support conditionals or choices! Users may still use define-by-run within their optimization function itself.
If you have experience with Optuna and have any suggestions, please feel free to open an issue or PR on GitHub!
Usage#
The typical way to represent a search space for Optuna is just a dictionary,
where the keys are the names of the hyperparameters and the values are either
integer/float tuples indicating boundaries or some discrete set of values.
When you need to customize the distribution further, the value can also
directly be a BaseDistribution,
an Optuna type.
from amltk.pipeline import Component
from optuna.distributions import FloatDistribution
c = Component(
object,
space={
"myint": (1, 10),
"myfloat": (1.0, 10.0),
"mycategorical": ["a", "b", "c"],
"log-scale-custom": FloatDistribution(1e-10, 1e-2, log=True),
},
name="name",
)
space = c.search_space(parser="optuna")
{
'name:myint': IntDistribution(high=10, log=False, low=1, step=1),
'name:myfloat': FloatDistribution(high=10.0, log=False, low=1.0, step=None),
'name:mycategorical': CategoricalDistribution(choices=('a', 'b', 'c')),
'name:log-scale-custom': FloatDistribution(high=0.01, log=True, low=1e-10,
step=None)
}
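The tuple/list convention shown above boils down to a small dispatch on the value's type. The sketch below mirrors that rule in plain Python, returning descriptive labels rather than real Optuna distribution objects (illustrative only, not AMLTK's parser code):

```python
def kind_of(value) -> str:
    """Classify a space value according to the Optuna parser convention."""
    if isinstance(value, tuple) and all(isinstance(v, int) for v in value):
        return "int-range"  # e.g. (1, 10) becomes an IntDistribution
    if isinstance(value, tuple) and all(isinstance(v, float) for v in value):
        return "float-range"  # e.g. (1.0, 10.0) becomes a FloatDistribution
    if isinstance(value, (list, set)):
        return "categorical"  # e.g. ["a", "b", "c"] becomes a CategoricalDistribution
    return "distribution"  # assume a BaseDistribution was given directly


print(kind_of((1, 10)), kind_of((1.0, 10.0)), kind_of(["a", "b"]))
# int-range float-range categorical
```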
You may also pass the parser function directly to parser=
if preferred:
from amltk.pipeline.parsers.optuna import parser as optuna_parser
space = c.search_space(parser=optuna_parser)
{
'name:myint': IntDistribution(high=10, log=False, low=1, step=1),
'name:myfloat': FloatDistribution(high=10.0, log=False, low=1.0, step=None),
'name:mycategorical': CategoricalDistribution(choices=('a', 'b', 'c')),
'name:log-scale-custom': FloatDistribution(high=0.01, log=True, low=1e-10,
step=None)
}
When using search_space()
on some nested
structures, you may want to flatten the names of the hyperparameters. For this you
can use flat=.
from amltk.pipeline import Searchable, Sequential
seq = Sequential(
Searchable({"myint": (1, 10)}, name="nested_1"),
Searchable({"myfloat": (1.0, 10.0)}, name="nested_2"),
name="seq"
)
hierarchical_space = seq.search_space(parser="optuna", flat=False) # Default
flat_space = seq.search_space(parser="optuna", flat=True)
{
'seq:nested_1:myint': IntDistribution(high=10, log=False, low=1, step=1),
'seq:nested_2:myfloat': FloatDistribution(high=10.0, log=False, low=1.0,
step=None)
}
{
'nested_1:myint': IntDistribution(high=10, log=False, low=1, step=1),
'nested_2:myfloat': FloatDistribution(high=10.0, log=False, low=1.0,
step=None)
}