Spaces#
A common requirement when optimizing a pipeline is the ability to parametrize it. To do so, we often parametrize each component separately, with the structure of the pipeline adding additional constraints.
To facilitate this, we allow the construction of
pipelines, where each part
of the pipeline can contain a .space.
When we wish to extract the entire search space from the pipeline, we can
call search_space(parser=...)
on the root node
of our pipeline, which returns a space object.
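Conceptually, search_space() walks the pipeline tree and gathers each node's .space into one combined space, qualifying each name by its position in the tree. The sketch below illustrates that idea with the standard library only; the Node class and naming scheme here are illustrative stand-ins, not AMLTK's actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A stand-in for a pipeline node that owns a .space."""

    name: str
    space: object = None  # treated opaquely by the tree walk
    children: list = field(default_factory=list)


def collect_spaces(node: Node, prefix: str = ""):
    """Walk the tree, yielding (qualified_name, space) pairs."""
    qualified = f"{prefix}{node.name}"
    if node.space is not None:
        yield qualified, node.space
    for child in node.children:
        yield from collect_spaces(child, prefix=f"{qualified}:")


tree = Node("Pipeline", children=[Node("PCA", space={"n_components": (1, 3)})])
print(dict(collect_spaces(tree)))
# {'Pipeline:PCA': {'n_components': (1, 3)}}
```

The qualified names ("Pipeline:PCA:n_components" and so on) match the naming you will see in the parsed spaces throughout this page.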
Unfortunately, there are quite a few search space implementations out there.
Some support concepts such as forbidden combinations, conditionals and
functional constraints, while others support only plain numerical
parameters. The choice of a particular space representation may also
depend on the Optimizer
you wish to use, as an optimizer typically has one preferred search
space representation.
To generalize over this, AMLTK itself will not care what is in a .space
of each part of the pipeline, i.e.
╭─ Component(object) ────────╮
│ item class object(...) │
│ space 'hmmm, a str space?' │
╰────────────────────────────╯
What follows below is a list of supported parsers you can pass as parser=
to extract a search space representation.
ConfigSpace#
ConfigSpace is a library for representing and sampling configurations for hyperparameter optimization. It features a straightforward API for defining hyperparameters, their ranges and even conditional dependencies.
It is generally flexible enough for more complex use cases, even handling the complex pipelines of AutoSklearn and AutoPyTorch, large scale hyperparameter spaces over which to optimize entire pipelines at a time.
Requirements
This requires ConfigSpace,
which can be installed with pip install ConfigSpace.
In general, you should have the ConfigSpace documentation ready to consult for a full understanding of how to construct hyperparameter spaces with AMLTK.
Basic Usage#
You can directly use the parser()
function and pass it into the search_space()
method of a Node,
or you can simply provide
search_space(parser="configspace", ...)
for simplicity.
from amltk.pipeline import Component, Choice, Sequential
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
my_pipeline = (
Sequential(name="Pipeline")
>> Component(PCA, space={"n_components": (1, 3)})
>> Choice(
Component(
SVC,
space={"C": (0.1, 10.0)}
),
Component(
RandomForestClassifier,
space={"n_estimators": (10, 100), "criterion": ["gini", "log_loss"]},
),
Component(
MLPClassifier,
space={
"activation": ["identity", "logistic", "relu"],
"alpha": (0.0001, 0.1),
"learning_rate": ["constant", "invscaling", "adaptive"],
},
),
name="estimator"
)
)
space = my_pipeline.search_space("configspace")
print(space)
Configuration space object:
Hyperparameters:
Pipeline:PCA:n_components, Type: UniformInteger, Range: [1, 3], Default: 2
Pipeline:estimator:MLPClassifier:activation, Type: Categorical, Choices: {identity, logistic, relu}, Default: identity
Pipeline:estimator:MLPClassifier:alpha, Type: UniformFloat, Range: [0.0001, 0.1], Default: 0.05005
Pipeline:estimator:MLPClassifier:learning_rate, Type: Categorical, Choices: {constant, invscaling, adaptive}, Default: constant
Pipeline:estimator:RandomForestClassifier:criterion, Type: Categorical, Choices: {gini, log_loss}, Default: gini
Pipeline:estimator:RandomForestClassifier:n_estimators, Type: UniformInteger, Range: [10, 100], Default: 55
Pipeline:estimator:SVC:C, Type: UniformFloat, Range: [0.1, 10.0], Default: 5.05
Pipeline:estimator:__choice__, Type: Categorical, Choices: {MLPClassifier, RandomForestClassifier, SVC}, Default: MLPClassifier
Conditions:
Pipeline:estimator:MLPClassifier:activation | Pipeline:estimator:__choice__ == 'MLPClassifier'
Pipeline:estimator:MLPClassifier:alpha | Pipeline:estimator:__choice__ == 'MLPClassifier'
Pipeline:estimator:MLPClassifier:learning_rate | Pipeline:estimator:__choice__ == 'MLPClassifier'
Pipeline:estimator:RandomForestClassifier:criterion | Pipeline:estimator:__choice__ == 'RandomForestClassifier'
Pipeline:estimator:RandomForestClassifier:n_estimators | Pipeline:estimator:__choice__ == 'RandomForestClassifier'
Pipeline:estimator:SVC:C | Pipeline:estimator:__choice__ == 'SVC'
Here we have an example of a few different kinds of hyperparameters:
PCA:n_components
is an integer with a uniform distribution over the range 1 to 3, as specified by its integer bounds in a tuple.
SVC:C
is a float with a uniform distribution over the range 0.1 to 10.0, as specified by its float bounds in a tuple.
RandomForestClassifier:criterion
is a categorical hyperparameter with two choices, "gini"
and "log_loss".
There is also a Choice
node, which is a special node that indicates that
we could choose from one of these estimators. This leads to the conditionals that you
can see in the printed out space.
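To make the semantics of those conditions concrete: a hyperparameter under a Choice branch is only active when __choice__ selects that branch. The sketch below mimics that rule in plain Python for the pipeline above; it is illustrative only, not how AMLTK or ConfigSpace actually evaluate conditions.

```python
# The branch names of the Choice node in the example pipeline.
BRANCHES = {"SVC", "RandomForestClassifier", "MLPClassifier"}


def active_params(config: dict, choice_key: str = "Pipeline:estimator:__choice__") -> dict:
    """Drop hyperparameters belonging to branches that were not chosen."""
    chosen = config[choice_key]
    active = {}
    for name, value in config.items():
        # Find which branch (if any) this hyperparameter lives under.
        branch = next((part for part in name.split(":") if part in BRANCHES), None)
        if branch is None or branch == chosen:
            active[name] = value
    return active


config = {
    "Pipeline:estimator:__choice__": "SVC",
    "Pipeline:PCA:n_components": 2,
    "Pipeline:estimator:SVC:C": 1.0,
    "Pipeline:estimator:MLPClassifier:alpha": 0.01,
}
print(active_params(config))
# Keeps the PCA and SVC entries, drops the MLPClassifier one.
```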
You may wish to remove all conditionals if an Optimizer
does not support them, or
you may wish to remove them for other reasons. You can do this by passing
conditionals=False
to the parser()
function.
Configuration space object:
Hyperparameters:
Pipeline:PCA:n_components, Type: UniformInteger, Range: [1, 3], Default: 2
Pipeline:estimator:MLPClassifier:activation, Type: Categorical, Choices: {identity, logistic, relu}, Default: identity
Pipeline:estimator:MLPClassifier:alpha, Type: UniformFloat, Range: [0.0001, 0.1], Default: 0.05005
Pipeline:estimator:MLPClassifier:learning_rate, Type: Categorical, Choices: {constant, invscaling, adaptive}, Default: constant
Pipeline:estimator:RandomForestClassifier:criterion, Type: Categorical, Choices: {gini, log_loss}, Default: gini
Pipeline:estimator:RandomForestClassifier:n_estimators, Type: UniformInteger, Range: [10, 100], Default: 55
Pipeline:estimator:SVC:C, Type: UniformFloat, Range: [0.1, 10.0], Default: 5.05
Pipeline:estimator:__choice__, Type: Categorical, Choices: {MLPClassifier, RandomForestClassifier, SVC}, Default: MLPClassifier
Likewise, you can also remove all hierarchy from the space, which may make downstream tasks easier,
by passing flat=True
to the parser()
function.
Configuration space object:
Hyperparameters:
MLPClassifier:activation, Type: Categorical, Choices: {identity, logistic, relu}, Default: identity
MLPClassifier:alpha, Type: UniformFloat, Range: [0.0001, 0.1], Default: 0.05005
MLPClassifier:learning_rate, Type: Categorical, Choices: {constant, invscaling, adaptive}, Default: constant
PCA:n_components, Type: UniformInteger, Range: [1, 3], Default: 2
RandomForestClassifier:criterion, Type: Categorical, Choices: {gini, log_loss}, Default: gini
RandomForestClassifier:n_estimators, Type: UniformInteger, Range: [10, 100], Default: 55
SVC:C, Type: UniformFloat, Range: [0.1, 10.0], Default: 5.05
estimator:__choice__, Type: Categorical, Choices: {MLPClassifier, RandomForestClassifier, SVC}, Default: MLPClassifier
Conditions:
MLPClassifier:activation | estimator:__choice__ == 'MLPClassifier'
MLPClassifier:alpha | estimator:__choice__ == 'MLPClassifier'
MLPClassifier:learning_rate | estimator:__choice__ == 'MLPClassifier'
RandomForestClassifier:criterion | estimator:__choice__ == 'RandomForestClassifier'
RandomForestClassifier:n_estimators | estimator:__choice__ == 'RandomForestClassifier'
SVC:C | estimator:__choice__ == 'SVC'
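Comparing the flat output with the hierarchical one above, flattening keeps only the final parent:name segment of each hierarchical key. A conceptual sketch of that renaming (illustrative only, not AMLTK's implementation):

```python
def flatten_name(name: str) -> str:
    """Keep only the last 'parent:param' segment of a hierarchical name."""
    return ":".join(name.split(":")[-2:])


names = [
    "Pipeline:PCA:n_components",
    "Pipeline:estimator:MLPClassifier:activation",
    "Pipeline:estimator:__choice__",
]
print([flatten_name(n) for n in names])
# ['PCA:n_components', 'MLPClassifier:activation', 'estimator:__choice__']
```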
More Specific Hyperparameters#
You'll often want to be a bit more specific with your hyperparameters, here we just
show a few examples of how you'd couple your pipelines a bit more towards ConfigSpace
.
from ConfigSpace import Float, Categorical, Normal
from amltk.pipeline import Searchable
s = Searchable(
space={
"lr": Float("lr", bounds=(1e-5, 1.), log=True, default=0.3),
"balance": Float("balance", bounds=(-1.0, 1.0), distribution=Normal(0.0, 0.5)),
"color": Categorical("color", ["red", "green", "blue"], weights=[2, 1, 1], default="blue"),
},
name="Something-To-Search",
)
print(s.search_space("configspace"))
Configuration space object:
Hyperparameters:
Something-To-Search:balance, Type: NormalFloat, Mu: 0.0 Sigma: 0.5, Range: [-1.0, 1.0], Default: 0.0
Something-To-Search:color, Type: Categorical, Choices: {red, green, blue}, Default: blue, Probabilities: (0.5, 0.25, 0.25)
Something-To-Search:lr, Type: UniformFloat, Range: [1e-05, 1.0], Default: 0.3, on log-scale
Conditionals and Advanced Usage#
We will refer you to the
ConfigSpace documentation for the construction
of these. However, once you've constructed a ConfigurationSpace
and added any forbiddens and
conditionals, you may simply set it as the .space
attribute.
from amltk.pipeline import Component, Choice, Sequential
from ConfigSpace import ConfigurationSpace, EqualsCondition, InCondition
myspace = ConfigurationSpace({"A": ["red", "green", "blue"], "B": (1, 10), "C": (-100.0, 0.0)})
myspace.add_conditions([
EqualsCondition(myspace["B"], myspace["A"], "red"), # B is active when A is red
InCondition(myspace["C"], myspace["A"], ["green", "blue"]), # C is active when A is green or blue
])
component = Component(object, space=myspace, name="MyThing")
parsed_space = component.search_space("configspace")
print(parsed_space)
Configuration space object:
Hyperparameters:
MyThing:A, Type: Categorical, Choices: {red, green, blue}, Default: red
MyThing:B, Type: UniformInteger, Range: [1, 10], Default: 6
MyThing:C, Type: UniformFloat, Range: [-100.0, 0.0], Default: -50.0
Conditions:
MyThing:B | MyThing:A == 'red'
MyThing:C | MyThing:A in {'green', 'blue'}
Optuna#
Optuna parser for parsing out a
search_space()
from a pipeline.
Requirements
This requires Optuna,
which can be installed with pip install optuna.
Limitations
Optuna features a very dynamic search space (define-by-run), where users typically sample from a trial object and use ordinary Python control flow to define conditionality.
This means we cannot trivially represent this conditionality in a static search space. While band-aids are possible, it naturally does not sit well with the static output of a parser.
As such, our parser does not support conditionals or choices! Users may still use define-by-run within their optimization function itself.
If you have experience with Optuna and have any suggestions, please feel free to open an issue or PR on GitHub!
Usage#
The typical way to represent a search space for Optuna is just a dictionary,
where the keys are the names of the hyperparameters and the values are either
integer/float tuples indicating boundaries or some discrete set of values.
When you need to customize the distribution further, the value can also
directly be a BaseDistribution,
an Optuna type.
from amltk.pipeline import Component
from optuna.distributions import FloatDistribution
c = Component(
object,
space={
"myint": (1, 10),
"myfloat": (1.0, 10.0),
"mycategorical": ["a", "b", "c"],
"log-scale-custom": FloatDistribution(1e-10, 1e-2, log=True),
},
name="name",
)
space = c.search_space(parser="optuna")
{
'name:myint': IntDistribution(high=10, log=False, low=1, step=1),
'name:myfloat': FloatDistribution(high=10.0, log=False, low=1.0, step=None),
'name:mycategorical': CategoricalDistribution(choices=('a', 'b', 'c')),
'name:log-scale-custom': FloatDistribution(high=0.01, log=True, low=1e-10,
step=None)
}
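The tuple/list convention shown above boils down to a small dispatch on the value's type. The sketch below mirrors that rule in plain Python, returning descriptive labels rather than real Optuna distribution objects (illustrative only, not AMLTK's parser code):

```python
def kind_of(value) -> str:
    """Classify a space value according to the Optuna parser convention."""
    if isinstance(value, tuple) and all(isinstance(v, int) for v in value):
        return "int-range"  # e.g. (1, 10) becomes an IntDistribution
    if isinstance(value, tuple) and all(isinstance(v, float) for v in value):
        return "float-range"  # e.g. (1.0, 10.0) becomes a FloatDistribution
    if isinstance(value, (list, set)):
        return "categorical"  # e.g. ["a", "b", "c"] becomes a CategoricalDistribution
    return "distribution"  # assume a BaseDistribution was given directly


print(kind_of((1, 10)), kind_of((1.0, 10.0)), kind_of(["a", "b"]))
# int-range float-range categorical
```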
You may also pass the parser function directly to parser=
if preferred:
from amltk.pipeline.parsers.optuna import parser as optuna_parser
space = c.search_space(parser=optuna_parser)
{
'name:myint': IntDistribution(high=10, log=False, low=1, step=1),
'name:myfloat': FloatDistribution(high=10.0, log=False, low=1.0, step=None),
'name:mycategorical': CategoricalDistribution(choices=('a', 'b', 'c')),
'name:log-scale-custom': FloatDistribution(high=0.01, log=True, low=1e-10,
step=None)
}
When using search_space()
on some nested
structures, you may want to flatten the names of the hyperparameters. For this you
can use flat=.
from amltk.pipeline import Searchable, Sequential
seq = Sequential(
Searchable({"myint": (1, 10)}, name="nested_1"),
Searchable({"myfloat": (1.0, 10.0)}, name="nested_2"),
name="seq"
)
hierarchical_space = seq.search_space(parser="optuna", flat=False) # Default
flat_space = seq.search_space(parser="optuna", flat=True)
{
'seq:nested_1:myint': IntDistribution(high=10, log=False, low=1, step=1),
'seq:nested_2:myfloat': FloatDistribution(high=10.0, log=False, low=1.0,
step=None)
}
{
'nested_1:myint': IntDistribution(high=10, log=False, low=1, step=1),
'nested_2:myfloat': FloatDistribution(high=10.0, log=False, low=1.0,
step=None)
}