
Pipelines Guide#

AutoML-toolkit was built to support future development of AutoML systems and a central part of an AutoML system is its pipeline. The purpose of this guide is to help you understand all the utility AutoML-toolkit can provide to help you define your pipeline. We will do this by introducing concepts from the ground up, rather than top down. Please see the reference if you just want to quickly look something up.


Introduction#

Pipelines in an AutoML system come in many different forms. For example, one might be an sklearn.pipeline.Pipeline, another might be a deep-learning pipeline, while some might even represent a real-life machinery process and the settings of those machines.

To accommodate this, AutoML-Toolkit provides an abstract representation of a pipeline, which helps you define its search space and, where possible, build concrete objects in code (see builders).

We categorize this into 4 steps:

  1. Parametrize your pipeline using the various components, including the kinds of items in the pipeline, their search spaces and any additional configuration. Each of the various component types carries its own meaning when performing the next steps.

  2. pipeline.search_space(parser=...): get a usable search space out of the pipeline. This can then be passed to an Optimizer.

  3. pipeline.configure(config=...): configure your pipeline, either manually or using a configuration suggested by an optimizer.

  4. pipeline.build(builder=...): build your configured pipeline definition into something usable, e.g. an sklearn.pipeline.Pipeline or a torch.nn.Module.

At the core of these definitions are the many Nodes a pipeline consists of. By combining these together, you can define a directed acyclic graph (DAG) that represents the structure of your pipeline. Here is one such sklearn example that we will build up towards.

╭─ Sequential(Classy Pipeline) ────────────────────────────────────────────────╮
 ╭─ Split(preprocessing) ───────────────────────────────────────────────────╮ 
  config {                                                                  
             'categoricals':                                                
         <sklearn.compose._column_transformer.make_column_selector object   
         at 0x7fcf55fc6320>,                                                
             'numerics':                                                    
         <sklearn.compose._column_transformer.make_column_selector object   
         at 0x7fcf55fc4640>                                                 
         }                                                                  
  ╭─ Sequential(categoricals) ──────╮ ╭─ Sequential(numerics) ───────────╮  
   ╭─ Fixed(SimpleImputer) ──────╮   ╭─ Component(SimpleImputer) ───╮   
    item SimpleImputer(fill_va…     item  class                     
         strategy='constant')             SimpleImputer(...)        
   ╰─────────────────────────────╯    space {                         
                 'strategy': [         
   ╭─ Fixed(OneHotEncoder) ──────╮                  'mean',           
    item OneHotEncoder(drop='f…                   'median'          
   ╰─────────────────────────────╯              ]                     
  ╰─────────────────────────────────╯         }                         
                                       ╰──────────────────────────────╯   
                                      ╰──────────────────────────────────╯  
 ╰──────────────────────────────────────────────────────────────────────────╯ 
  
 ╭─ Component(RandomForestClassifier) ──────────────────────────────────╮     
  item  class RandomForestClassifier(...)                                   
  space {'n_estimators': (10, 100), 'criterion': ['gini', 'log_loss']}      
 ╰──────────────────────────────────────────────────────────────────────╯     
╰──────────────────────────────────────────────────────────────────────────────╯

Pipeline
from sklearn.compose import make_column_selector
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
import numpy as np

from amltk.pipeline import Component, Split, Sequential

feature_preprocessing = Split(
    {
        "categoricals": [SimpleImputer(strategy="constant", fill_value="missing"), OneHotEncoder(drop="first")],
        "numerics": Component(SimpleImputer, space={"strategy": ["mean", "median"]}),
    },
    config={
        "categoricals": make_column_selector(dtype_include=object),
        "numerics": make_column_selector(dtype_include=np.number),
    },
    name="preprocessing",
)

pipeline = Sequential(
    feature_preprocessing,
    Component(RandomForestClassifier, space={"n_estimators": (10, 100), "criterion": ["gini", "log_loss"]}),
    name="Classy Pipeline",
)
rich printing

To get the same output locally (terminal or Notebook), you can either call thing.__rich__(), use from rich import print; print(thing), or, in a Notebook, simply leave it as the last object of a cell.
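For example, with the Classy Pipeline object defined above and rich installed, printing it directly reproduces the panels shown:

from rich import print

# Renders the same panel layout as shown above.
print(pipeline)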

Once we have our pipeline definition, extracting a search space, configuring it and building it into something useful can all be done with the methods covered in the rest of this guide.
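Condensed into a few lines, the whole workflow on the pipeline above looks roughly like this, a sketch using the "configspace" parser and "sklearn" builder that the following sections walk through in detail:

space = pipeline.search_space("configspace")            # 2. extract a search space
config = space.sample_configuration()                   # sample one, or let an Optimizer suggest it
configured_pipeline = pipeline.configure(config)        # 3. configure a copy of the pipeline
built_pipeline = configured_pipeline.build("sklearn")   # 4. build an sklearn.pipeline.Pipeline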

Guide Requirements

For this guide, we will be using ConfigSpace and scikit-learn. You can install them manually or as follows:

pip install "amltk[sklearn, configspace]"

Component#

A pipeline consists of building blocks which we can combine to create a DAG. We will start by introducing the Component and its common operations, and then show how to combine components together.

A Component is the most common kind of node in a pipeline. Like all parts of the pipeline, they subclass Node, but a Component signifies this is some concrete object, with a possible .space and .config.

Definition#

Naming Nodes

By default, a Component (or any Node for that matter) will use the function/class name for the .name of the Node. You can explicitly pass a name= as a keyword argument when constructing these.

from dataclasses import dataclass

from amltk.pipeline import Component

@dataclass
class MyModel:
    f: float
    i: int
    c: str

my_component = Component(
    MyModel,
    space={"f": (0.0, 1.0), "i": (0, 10), "c": ["red", "green", "blue"]},
)

╭─ Component(MyModel) ─────────────────────────────────────────────────╮
 item  class MyModel(...)                                             
 space {'f': (0.0, 1.0), 'i': (0, 10), 'c': ['red', 'green', 'blue']} 
╰──────────────────────────────────────────────────────────────────────╯
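As noted in the "Naming Nodes" box above, you can override this default by passing name= explicitly; a small sketch (the name here is purely illustrative):

named_component = Component(
    MyModel,
    name="my-custom-model",
    space={"f": (0.0, 1.0), "i": (0, 10), "c": ["red", "green", "blue"]},
)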

You can also use a function instead of a class if that is preferred.

def myfunc(f: float, i: int, c: str) -> MyModel:
    if f < 0.5:
        c = "red"
    return MyModel(f=f, i=i, c=c)

component_with_function = Component(
    myfunc,
    space={"f": (0.0, 1.0), "i": (0, 10), "c": ["red", "green", "blue"]},
)

╭─ Component(function) ────────────────────────────────────────────────╮
 item  def myfunc(...)                                                
 space {'f': (0.0, 1.0), 'i': (0, 10), 'c': ['red', 'green', 'blue']} 
╰──────────────────────────────────────────────────────────────────────╯

Search Space#

If interacting with an Optimizer, you'll often require some search space object to pass to it. To extract a search space from a Component, we can call search_space(parser=...), passing in the kind of search space you'd like to get out of it.

space = my_component.search_space("configspace")
print(space)
Configuration space object:
  Hyperparameters:
    MyModel:c, Type: Categorical, Choices: {red, green, blue}, Default: red
    MyModel:f, Type: UniformFloat, Range: [0.0, 1.0], Default: 0.5
    MyModel:i, Type: UniformInteger, Range: [0, 10], Default: 5

Available Search Spaces

Please see the spaces reference

Depending on what you pass as the parser= to search_space(parser=...), we'll attempt to give you a valid search space. In this case, we specified "configspace" and so we get a ConfigSpace implementation.

You may also define your own parser= and use that if desired.
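For instance, a hypothetical sketch, assuming parser= also accepts a callable that is handed the node being parsed and may return whatever space object you construct (check the parser reference for the exact contract):

def dict_parser(node):
    # Collect the raw `.space` of this node and its direct children into one
    # flat dict, prefixing each hyperparameter with the owning node's name.
    collected = {}
    for child in (node, *node.nodes):
        for key, value in (child.space or {}).items():
            collected[f"{child.name}:{key}"] = value
    return collected

custom_space = my_component.search_space(parser=dict_parser)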

Configure#

Now that we have a search space, we can sample a configuration from it and configure(config=...) the component with it.

config = space.sample_configuration()
configured_component = my_component.configure(config)

╭─ Component(MyModel) ──────────────────────────────────────────────────╮
 item   class MyModel(...)                                             
 config {'c': 'green', 'f': 0.42650157485855034, 'i': 7}               
 space  {'f': (0.0, 1.0), 'i': (0, 10), 'c': ['red', 'green', 'blue']} 
╰───────────────────────────────────────────────────────────────────────╯

You'll notice that each variable in the space has been set to some value. We could also manually define a config and pass that in. You are not obliged to fully specify this either.

manually_configured_component = my_component.configure({"f": 0.5, "i": 1})

╭─ Component(MyModel) ──────────────────────────────────────────────────╮
 item   class MyModel(...)                                             
 config {'f': 0.5, 'i': 1}                                             
 space  {'f': (0.0, 1.0), 'i': (0, 10), 'c': ['red', 'green', 'blue']} 
╰───────────────────────────────────────────────────────────────────────╯

Immutable methods!

One thing you may have noticed is that we assigned the result of configure(config=...) to a new variable. This is because configure() does not mutate the original my_component; it returns a copy with all of the config variables set.
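A quick way to see this for yourself:

# The original component is left untouched; configure() hands back a new copy.
configured_copy = my_component.configure({"f": 0.5, "i": 1, "c": "red"})
assert configured_copy is not my_component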

Build#

To build the individual item of a Component, we can use build_item(), which simply calls the .item with the config we have set.

# Same as if we did `configured_component.item(**configured_component.config)`
the_built_model = configured_component.build_item()
print(the_built_model)
MyModel(f=0.42650157485855034, i=7, c='green')

However, as we'll see later, we often have multiple steps of a pipeline joined together and so we need some way to get a full object out of it that takes into account all of these items joined together. We can do this with build(builder=...).

the_built_model = configured_component.build(builder="sklearn")
print(the_built_model)
Pipeline(steps=[('MyModel', MyModel(f=0.42650157485855034, i=7, c='green'))])

For a look at the available arguments to pass to builder=, see the builder reference

Fixed#

Sometimes we just have some part of the pipeline with no search space and no configuration required, i.e. just some prebuilt thing. We can use the Fixed node type to signify this.

from amltk.pipeline import Fixed
from sklearn.ensemble import RandomForestClassifier

frozen_rf = Fixed(RandomForestClassifier(n_estimators=5))
╭─ Fixed(RandomForestClassifier) ─────────────╮
 item RandomForestClassifier(n_estimators=5)
╰─────────────────────────────────────────────╯

Parameter Requests#

Sometimes you may wish to explicitly specify that some value should be added to the .config during configure(), one which would be difficult to include in the config directly, for example the random_state of an sklearn estimator. You can pass these extra parameters to configure(params={...}); they do not require any namespace prefixing.

For this reason, we introduce the concept of a request(), allowing you to specify that a certain parameter should be added to the config during configure().

from dataclasses import dataclass

from amltk import Component, request

@dataclass
class MyModel:
    f: float
    random_state: int

my_component = Component(
    MyModel,
    space={"f": (0.0, 1.0)},
    config={"random_state": request("seed", default=42)}
)

# Without passing the params
configured_component_no_seed = my_component.configure({"f": 0.5})

# With passing the params
configured_component_with_seed = my_component.configure({"f": 0.5}, params={"seed": 1337})

╭─ Component(MyModel) ──────────────────╮
 item   class MyModel(...)             
 config {'random_state': 42, 'f': 0.5} 
 space  {'f': (0.0, 1.0)}              
╰───────────────────────────────────────╯

╭─ Component(MyModel) ────────────────────╮
 item   class MyModel(...)               
 config {'random_state': 1337, 'f': 0.5} 
 space  {'f': (0.0, 1.0)}                
╰─────────────────────────────────────────╯

If you explicitly require a parameter to be set, just do not set a default=.

my_component = Component(
    MyModel,
    space={"f": (0.0, 1.0)},
    config={"random_state": request("seed")}
)

my_component.configure({"f": 0.5}, params={"seed": 5})  # All good

try:
    my_component.configure({"f": 0.5})  # Missing required parameter
except ValueError as e:
    print(e)
Missing request=ParamRequest(key='seed', default=<object object at 0x7fcf36860340>) for Component(name='MyModel', item=<class 'MyModel'>, nodes=(), config={'random_state': ParamRequest(key='seed', default=<object object at 0x7fcf36860340>)}, space={'f': (0.0, 1.0)}, fidelities=None, config_transform=None, meta=None).
params=None

Config Transform#

Some search spaces and optimizers have limitations on the kinds of parameters they can support; one notable example is tuple parameters. To get around this, we can pass a config_transform= to a Component, which will transform the config before it is passed to the .item during build().

from dataclasses import dataclass

from amltk import Component

@dataclass
class MyModel:
    dimensions: tuple[int, int]

def config_transform(config: dict, _) -> dict:
    """Convert "dim1" and "dim2" into a tuple."""
    dim1 = config.pop("dim1")
    dim2 = config.pop("dim2")
    config["dimensions"] = (dim1, dim2)
    return config

my_component = Component(
    MyModel,
    space={"dim1": (1, 10), "dim2": (1, 10)},
    config_transform=config_transform,
)

configured_component = my_component.configure({"dim1": 5, "dim2": 5})

╭─ Component(MyModel) ─────────────────────────╮
 item      class MyModel(...)                 
 config    {'dimensions': (5, 5)}             
 space     {'dim1': (1, 10), 'dim2': (1, 10)} 
 transform def config_transform(...)          
╰──────────────────────────────────────────────╯

Transform Context

There may be times when you have some additional context which you only know at configuration time. In this case, you can pass it to configure(..., transform_context=...) and it will be forwarded as the second argument to your .config_transform.
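A small sketch building on the example above, where the "scale" key and its value are purely illustrative and the context arrives as the second argument of the transform:

def scaled_transform(config: dict, ctx: dict | None) -> dict:
    """Build the tuple, scaled by a factor only known at configure-time."""
    scale = ctx["scale"] if ctx is not None else 1
    config["dimensions"] = (config.pop("dim1") * scale, config.pop("dim2") * scale)
    return config

scaled_component = Component(
    MyModel,
    space={"dim1": (1, 10), "dim2": (1, 10)},
    config_transform=scaled_transform,
)
configured = scaled_component.configure(
    {"dim1": 2, "dim2": 3},
    transform_context={"scale": 10},
)  # .config becomes {'dimensions': (20, 30)}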

Sequential#

A single component might be enough for some basic definitions, but generally we need to combine multiple components. AutoML-Toolkit is designed for larger and more complex structures, which can be built from simple atomic Nodes.

Chaining Together Nodes#

We'll begin by creating two components that wrap scikit-learn estimators.

from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier

from amltk.pipeline import Component

imputer = Component(SimpleImputer, space={"strategy": ["median", "mean"]})
rf = Component(RandomForestClassifier, space={"n_estimators": (10, 100)})

╭─ Component(SimpleImputer) ─────────────╮
 item  class SimpleImputer(...)         
 space {'strategy': ['median', 'mean']} 
╰────────────────────────────────────────╯

╭─ Component(RandomForestClassifier) ─────╮
 item  class RandomForestClassifier(...) 
 space {'n_estimators': (10, 100)}       
╰─────────────────────────────────────────╯

Infix >>

To join these two components together, we can either use the >> infix notation or pass them directly to a Sequential. Note, however, that a random name will be given to the resulting Sequential when using the infix notation.

joined_components = imputer >> rf
from amltk.pipeline import Sequential
pipeline = Sequential(imputer, rf, name="My Pipeline")

╭─ Sequential(My Pipeline) ───────────────────╮
 ╭─ Component(SimpleImputer) ─────────────╮  
  item  class SimpleImputer(...)           
  space {'strategy': ['median', 'mean']}   
 ╰────────────────────────────────────────╯  
  
 ╭─ Component(RandomForestClassifier) ─────╮ 
  item  class RandomForestClassifier(...)  
  space {'n_estimators': (10, 100)}        
 ╰─────────────────────────────────────────╯ 
╰─────────────────────────────────────────────╯

Operations#

You can perform many of the same operations as we did for an individual node, but now they take everything in the pipeline into account.

space = pipeline.search_space("configspace")
config = space.sample_configuration()
configured_pipeline = pipeline.configure(config)

Configuration space object:
  Hyperparameters:
    My Pipeline:RandomForestClassifier:n_estimators, Type: UniformInteger, 
Range: [10, 100], Default: 55
    My Pipeline:SimpleImputer:strategy, Type: Categorical, Choices: {median, 
mean}, Default: median


Configuration(values={
  'My Pipeline:RandomForestClassifier:n_estimators': 100,
  'My Pipeline:SimpleImputer:strategy': 'median',
})

╭─ Sequential(My Pipeline) ────────────────────╮
 ╭─ Component(SimpleImputer) ──────────────╮  
  item   class SimpleImputer(...)           
  config {'strategy': 'median'}             
  space  {'strategy': ['median', 'mean']}   
 ╰─────────────────────────────────────────╯  
  
 ╭─ Component(RandomForestClassifier) ──────╮ 
  item   class RandomForestClassifier(...)  
  config {'n_estimators': 100}              
  space  {'n_estimators': (10, 100)}        
 ╰──────────────────────────────────────────╯ 
╰──────────────────────────────────────────────╯

To build a pipeline of nodes, we simply call build(builder=...). We explicitly pass the builder we want to use, which informs build() how to go from the abstract pipeline definition you've defined to something concrete you can use. You can find the available builders here.

from sklearn.pipeline import Pipeline as SklearnPipeline

built_pipeline = configured_pipeline.build("sklearn")
assert isinstance(built_pipeline, SklearnPipeline)

Pipeline(steps=[('SimpleImputer', SimpleImputer(strategy='median')),
                ('RandomForestClassifier', RandomForestClassifier())])

Other Building blocks#

We saw the basic building block of a Component, but AutoML-Toolkit also provides support for some other kinds of building blocks. These building blocks can be attached and joined together just like a Component can and allow for much more complex pipeline structures.

Choice#

A Choice is a way to define a choice between multiple components. This is useful when you want to search over multiple algorithms, which may each have their own hyperparameters.

We'll start again by creating two nodes:

from dataclasses import dataclass

from amltk.pipeline import Component

@dataclass
class ModelA:
    i: int

@dataclass
class ModelB:
    c: str

model_a = Component(ModelA, space={"i": (0, 100)})
model_b = Component(ModelB, space={"c": ["red", "blue"]})

╭─ Component(ModelA) ─────╮
 item  class ModelA(...) 
 space {'i': (0, 100)}   
╰─────────────────────────╯

╭─ Component(ModelB) ──────────╮
 item  class ModelB(...)      
 space {'c': ['red', 'blue']} 
╰──────────────────────────────╯

Now combining them into a choice is rather straight forward:

from amltk.pipeline import Choice

model_choice = Choice(model_a, model_b, name="estimator")

╭─ Choice(estimator) ──────────────────────────────────────────╮
 ╭─ Component(ModelA) ─────╮ ╭─ Component(ModelB) ──────────╮ 
  item  class ModelA(...)   item  class ModelB(...)       
  space {'i': (0, 100)}     space {'c': ['red', 'blue']}  
 ╰─────────────────────────╯ ╰──────────────────────────────╯ 
╰──────────────────────────────────────────────────────────────╯

Conditionals and Search Spaces

Not all search space implementations support conditionals and so some parser= may not be able to handle this. In this case, there won't be any conditionality in the search space.

Check out the parser reference for more information.

Just as we did with a Component, we can also get a search_space() from the choice.

space = model_choice.search_space("configspace")

Configuration space object:
  Hyperparameters:
    estimator:ModelA:i, Type: UniformInteger, Range: [0, 100], Default: 50
    estimator:ModelB:c, Type: Categorical, Choices: {red, blue}, Default: red
    estimator:__choice__, Type: Categorical, Choices: {ModelA, ModelB}, Default:
ModelA
  Conditions:
    estimator:ModelA:i | estimator:__choice__ == 'ModelA'
    estimator:ModelB:c | estimator:__choice__ == 'ModelB'


When we configure() a choice, the decision of which component to use is made according to what is set in the config; the builder will later collapse the choice down to the single chosen component.

config = space.sample_configuration()
configured_choice = model_choice.configure(config)

╭─ Choice(estimator) ───────────────────────────────────────────╮
 config {'__choice__': 'ModelB'}                               
 ╭─ Component(ModelA) ─────╮ ╭─ Component(ModelB) ───────────╮ 
  item  class ModelA(...)   item   class ModelB(...)       
  space {'i': (0, 100)}     config {'c': 'red'}            
 ╰─────────────────────────╯  space  {'c': ['red', 'blue']}  
                             ╰───────────────────────────────╯ 
╰───────────────────────────────────────────────────────────────╯

You'll notice that configuring set the .config of the Choice to {"__choice__": "ModelA"} or {"__choice__": "ModelB"}. This lets a builder know which of the two to build.

Split#

A Split is a way to signify a split in the dataflow of a pipeline. A Split by itself does not do anything; it informs the builder what to do, and each builder has its own strategy for dealing with one.

Let's go ahead with a scikit-learn example, where we'll split the data into categorical and numerical features and then perform some preprocessing on each of them.

from sklearn.compose import make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
import numpy as np

from amltk.pipeline import Component, Split

select_categories = make_column_selector(dtype_include=object)
select_numerical = make_column_selector(dtype_include=np.number)

preprocessor = Split(
    {
        "categories": [SimpleImputer(strategy="constant", fill_value="missing"), OneHotEncoder(drop="first")],
        "numerics": Component(SimpleImputer, space={"strategy": ["mean", "median"]}),
    },
    config={"categories": select_categories, "numerics": select_numerical},
    name="feature_preprocessing",
)

╭─ Split(feature_preprocessing) ───────────────────────────────────────────────╮
 config {                                                                     
            'categories':                                                     
        <sklearn.compose._column_transformer.make_column_selector object at   
        0x7fcf55e09330>,                                                      
            'numerics':                                                       
        <sklearn.compose._column_transformer.make_column_selector object at   
        0x7fcf55e09120>                                                       
        }                                                                     
 ╭─ Sequential(categories) ──────────╮ ╭─ Sequential(numerics) ─────────────╮ 
  ╭─ Fixed(SimpleImputer) ────────╮   ╭─ Component(SimpleImputer) ─────╮  
   item SimpleImputer(fill_valu…     item  class SimpleImputer(...)   
        strategy='constant')         space {                          
  ╰───────────────────────────────╯              'strategy': [          
                    'mean',            
  ╭─ Fixed(OneHotEncoder) ────────╮                  'median'           
   item OneHotEncoder(drop='fir…               ]                      
  ╰───────────────────────────────╯          }                          
 ╰───────────────────────────────────╯  ╰────────────────────────────────╯  
                                       ╰────────────────────────────────────╯ 
╰──────────────────────────────────────────────────────────────────────────────╯

The first thing to note here is that we passed a dict to Split so that we can name the individual paths. This is important because we need some name to refer to them when configuring the Split; it achieves this by simply wrapping each of the paths in a Sequential.

The second is that the keys set in the .config match the names of the paths. This lets the Split know which data should be sent where. Each builder= has its own way of setting up a Split, and you should refer to the builders reference for more information.

Our last step is to convert this into a usable object, so once again we use build().

built_pipeline = preprocessor.build("sklearn")

Pipeline(steps=[('feature_preprocessing',
                 ColumnTransformer(transformers=[('categories',
                                                  Pipeline(steps=[('SimpleImputer',
                                                                   SimpleImputer(fill_value='missing',
                                                                                 strategy='constant')),
                                                                  ('OneHotEncoder',
                                                                   OneHotEncoder(drop='first'))]),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x7fcf55e09330>),
                                                 ('SimpleImputer',
                                                  SimpleImputer(),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x7fcf55e09120>)]))])

Join#

TODO

TODO

Searchable#

TODO

TODO

Option#

TODO

Please feel free to provide a contribution!