Pipeline

A pipeline is a collection of Nodes that are connected together to form a directed acyclic graph, where the nodes follow a parent-child relationship. Its purpose is to form an abstract representation of what you want to search over or optimize, which is then built into a concrete object.

Key Operations#

Once a pipeline is created, you can perform 3 very critical operations on it:

Node#

A Node is the basic building block of a pipeline. It contains various attributes, including:

  • .name - The name of the node, which is used to identify it in the pipeline.
  • .item - The concrete object or some function to construct one
  • .space - A search space to consider for this node
  • .config - The specific configuration to use for this node once build is called.
  • .nodes - Other nodes that this node links to.
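To make the roles of these attributes concrete, here is a minimal sketch using a hypothetical `ToyNode` dataclass. It is not amltk's `Node` class, and the `build_item` helper is an assumption made for illustration only. It shows one plausible reading of the list above: `.item` is either a ready-made object or something callable that constructs one, and `.config` supplies the arguments for that construction.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToyNode:
    """Hypothetical stand-in that mirrors the attributes listed above."""

    name: str
    item: Any = None                             # concrete object, or a callable that builds one
    space: dict = field(default_factory=dict)    # search space for this node
    config: dict = field(default_factory=dict)   # chosen configuration
    nodes: tuple = ()                            # other nodes this node links to

    def build_item(self) -> Any:
        # Assumption of this sketch: a callable item is treated as a constructor
        # and called with the config, anything else is returned unchanged.
        if callable(self.item):
            return self.item(**self.config)
        return self.item


@dataclass
class Model:
    x: float


node = ToyNode(name="model", item=Model, space={"x": (0.0, 1.0)}, config={"x": 0.25})
built = node.build_item()
print(built)  # Model(x=0.25)
```

The search space only describes which values are allowed; the object is constructed from whichever configuration ends up being selected.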

To give syntactic meaning to these nodes, we have various subclasses. For example, Sequential is a node where the order of the nodes it contains matters, while a Component is a node that can be used to parametrize and construct a concrete object, but does not lead to anything else.

Each node type here is either a leaf or a branch, where a branch has children, while a leaf does not.
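The leaf/branch distinction can be sketched with a small tree walk. The `TreeNode` class below is a hypothetical stand-in written with the standard library only, not amltk's own node type; it treats any node without children as a leaf and visits nodes depth-first, parents before children.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TreeNode:
    """Hypothetical node: a branch has children, a leaf has none."""

    name: str
    nodes: tuple = ()

    @property
    def is_leaf(self) -> bool:
        return len(self.nodes) == 0

    def walk(self):
        # Depth-first traversal that yields a node before its children.
        yield self
        for child in self.nodes:
            yield from child.walk()


tree = TreeNode(
    "pipeline",
    (
        TreeNode("preprocessing", (TreeNode("imputer"), TreeNode("encoder"))),
        TreeNode("estimator"),
    ),
)

leaves = [n.name for n in tree.walk() if n.is_leaf]
branches = [n.name for n in tree.walk() if not n.is_leaf]
print(leaves)    # ['imputer', 'encoder', 'estimator']
print(branches)  # ['pipeline', 'preprocessing']
```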

The various node types are listed here:

Component - leaf#

A parametrizable node type with some way to build an object, given a configuration.

from amltk.pipeline import Component
from dataclasses import dataclass

@dataclass
class Model:
    x: float

c = Component(Model, space={"x": (0.0, 1.0)}, name="model")

╭─ Component(model) ──────╮
 item  class Model(...)  
 space {'x': (0.0, 1.0)} 
╰─────────────────────────╯

Searchable - leaf#

A parametrizable node type that contains a search space that should be searched over, but does not provide a concrete object.

from amltk.pipeline import Searchable

def run_script(mode, n):
    # ... run some actual script
    pass

script_space = Searchable({"mode": ["orange", "blue", "red"], "n": (10, 100)})

╭─ Searchable(Searchable-9AyxMgrj) ─────────────────────────╮
 space {'mode': ['orange', 'blue', 'red'], 'n': (10, 100)} 
╰───────────────────────────────────────────────────────────╯
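Since a Searchable carries only a search space, something else has to turn a configuration drawn from that space into an action. The sketch below uses the standard library only and shows one way that could look. The `sample` helper is hypothetical, and its reading of the space (a list is a set of categorical choices, a tuple is an inclusive numeric range) is an assumption of this example rather than a statement about how amltk parses spaces.

```python
import random


def sample(space: dict, rng: random.Random) -> dict:
    """Draw one configuration from a simple dict-based space (hypothetical helper)."""
    config = {}
    for name, domain in space.items():
        if isinstance(domain, list):
            # Assumed meaning: categorical choice among the listed values.
            config[name] = rng.choice(domain)
        elif isinstance(domain, tuple):
            low, high = domain
            if isinstance(low, int) and isinstance(high, int):
                config[name] = rng.randint(low, high)   # inclusive integer range
            else:
                config[name] = rng.uniform(low, high)   # continuous range
        else:
            config[name] = domain                        # treated as a constant
    return config


def run_script(mode, n):
    # Placeholder for the real script: just report what it was called with.
    return f"mode={mode}, n={n}"


space = {"mode": ["orange", "blue", "red"], "n": (10, 100)}
config = sample(space, random.Random(0))
result = run_script(**config)
print(result)
```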

Fixed - leaf#

A non-parametrizable node type that contains an object that should be used as is.

from amltk.pipeline import Fixed
from sklearn.ensemble import RandomForestClassifier

estimator = RandomForestClassifier()
# ... pretend it was fit
fitted_estimator = Fixed(estimator)

╭─ Fixed(RandomForestClassifier) ─╮
 item RandomForestClassifier()   
╰─────────────────────────────────╯

Sequential - branch#

A node type which signifies an order between its children, such as a sequential set of preprocessing and estimator through which the data should flow.

from amltk.pipeline import Component, Sequential
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

pipeline = Sequential(
    PCA(n_components=3),
    Component(RandomForestClassifier, space={"n_estimators": (10, 100)}),
    name="my_pipeline"
)

╭─ Sequential(my_pipeline) ───────────────────╮
 ╭─ Fixed(PCA) ─────────────╮                
  item PCA(n_components=3)                 
 ╰──────────────────────────╯                
  
 ╭─ Component(RandomForestClassifier) ─────╮ 
  item  class RandomForestClassifier(...)  
  space {'n_estimators': (10, 100)}        
 ╰─────────────────────────────────────────╯ 
╰─────────────────────────────────────────────╯

Choice - branch#

A node type that signifies a choice between multiple children, usually chosen during configuration.

from amltk.pipeline import Choice, Component
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

rf = Component(RandomForestClassifier, space={"n_estimators": (10, 100)})
mlp = Component(MLPClassifier, space={"activation": ["logistic", "relu", "tanh"]})

estimator_choice = Choice(rf, mlp, name="estimator")

╭─ Choice(estimator) ──────────────────────────────────────────────────────────╮
 ╭─ Component(MLPClassifier) ─────╮ ╭─ Component(RandomForestClassifier)─╮    
  item  class MLPClassifier(...)   item  class                            
  space {                                RandomForestClassifier(...)      
            'activation': [        space {'n_estimators': (10, 100)}      
                'logistic',       ╰────────────────────────────────────╯    
                'relu',                                                     
                'tanh'                                                      
            ]                                                               
        }                                                                   
 ╰────────────────────────────────╯                                           
╰──────────────────────────────────────────────────────────────────────────────╯

Split - branch#

A node where the output of the previous node is split amongst its children, according to its configuration.

from amltk.pipeline import Component, Split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_selector

categorical_pipeline = [
    SimpleImputer(strategy="constant", fill_value="missing"),
    OneHotEncoder(drop="first"),
]
numerical_pipeline = Component(SimpleImputer, space={"strategy": ["mean", "median"]})

preprocessor = Split(
    {"categories": categorical_pipeline, "numerical": numerical_pipeline},
    name="my_split"
)

╭─ Split(my_split) ────────────────────────────────────────────────────────────╮
 ╭─ Sequential(categories) ──────────╮ ╭─ Sequential(numerical) ────────────╮ 
  ╭─ Fixed(SimpleImputer) ────────╮   ╭─ Component(SimpleImputer) ─────╮  
   item SimpleImputer(fill_valu…     item  class SimpleImputer(...)   
        strategy='constant')         space {                          
  ╰───────────────────────────────╯              'strategy': [          
                    'mean',            
  ╭─ Fixed(OneHotEncoder) ────────╮                  'median'           
   item OneHotEncoder(drop='fir…               ]                      
  ╰───────────────────────────────╯          }                          
 ╰───────────────────────────────────╯  ╰────────────────────────────────╯  
                                       ╰────────────────────────────────────╯ 
╰──────────────────────────────────────────────────────────────────────────────╯

Join - branch#

A node where the output of the previous node is sent to all of its children.

from amltk.pipeline import Join, Component
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest

pca = Component(PCA, space={"n_components": (1, 3)})
kbest = Component(SelectKBest, space={"k": (1, 3)})

join = Join(pca, kbest, name="my_feature_union")

╭─ Join(my_feature_union) ────────────────────────────────────────────╮
 ╭─ Component(PCA) ───────────────╮ ╭─ Component(SelectKBest) ─────╮ 
  item  class PCA(...)             item  class SelectKBest(...)  
  space {'n_components': (1, 3)}   space {'k': (1, 3)}           
 ╰────────────────────────────────╯ ╰──────────────────────────────╯ 
╰─────────────────────────────────────────────────────────────────────╯

Syntax Sugar#

You can connect these nodes together either by using the constructors explicitly, as shown in the examples, or by using the following operators:

  • >> - Connect nodes together to form a Sequential
  • & - Connect nodes together to form a Join
  • | - Connect nodes together to form a Choice
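Operators like these are typically implemented in Python through the `__rshift__`, `__and__` and `__or__` special methods. The sketch below illustrates that mechanism with a hypothetical `Toy` class; it is not amltk's implementation, and the flattening of chained operators into a single parent is an assumption of this example. It also shows a point worth remembering when mixing operators: Python gives `>>` higher precedence than `&`, which in turn binds tighter than `|`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Toy:
    """Hypothetical node used only to demonstrate operator overloading."""

    kind: str
    name: str = ""
    children: tuple = ()

    def _combine(self, other: "Toy", kind: str) -> "Toy":
        # Assumption of this sketch: chaining the same operator extends the
        # existing parent, so a >> b >> c gives one parent with three children.
        left = self.children if self.kind == kind else (self,)
        return Toy(kind, children=left + (other,))

    def __rshift__(self, other: "Toy") -> "Toy":
        return self._combine(other, "Sequential")

    def __and__(self, other: "Toy") -> "Toy":
        return self._combine(other, "Join")

    def __or__(self, other: "Toy") -> "Toy":
        return self._combine(other, "Choice")


a, b, c = Toy("Leaf", "a"), Toy("Leaf", "b"), Toy("Leaf", "c")

seq = a >> b >> c
print(seq.kind, [child.name for child in seq.children])  # Sequential ['a', 'b', 'c']

# `>>` binds tighter than `|`, so this parses as (a >> b) | c.
mixed = a >> b | c
print(mixed.kind, [child.kind for child in mixed.children])  # Choice ['Sequential', 'Leaf']
```

When the intended grouping differs from what precedence gives you, explicit parentheses keep the structure unambiguous.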

There is also a shorthand based on built-in Python containers that you may find useful:

  • {comp1, comp2, comp3} - This will automatically be converted into a Choice between the given components.
  • (comp1, comp2, comp3) - This will automatically be converted into a Join between the given components.
  • [comp1, comp2, comp3] - This will automatically be converted into a Sequential between the given components.
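The mapping from container type to node type can be sketched as a small recursive function. The `Shape` class and `interpret` helper below are hypothetical and written with the standard library only; plain strings stand in for components. The sketch mirrors the rule stated above (set to Choice, tuple to Join, list to Sequential) and makes no claim about how amltk performs the conversion internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shape:
    """Hypothetical record of which node type a container would become."""

    kind: str
    children: tuple


def interpret(obj):
    if isinstance(obj, list):
        return Shape("Sequential", tuple(interpret(o) for o in obj))
    if isinstance(obj, tuple):
        return Shape("Join", tuple(interpret(o) for o in obj))
    if isinstance(obj, (set, frozenset)):
        # Sets are unordered, so sort by repr to get a reproducible result.
        ordered = sorted(obj, key=repr)
        return Shape("Choice", tuple(interpret(o) for o in ordered))
    return obj  # anything else is treated as a single component


shape = interpret(["imputer", ("pca", "kbest"), {"rf", "mlp"}])
print(shape.kind)                                              # Sequential
print([getattr(c, "kind", c) for c in shape.children])         # ['imputer', 'Join', 'Choice']
print(shape.children[2].children)                              # ('mlp', 'rf')
```

One practical consequence of using a set for a Choice is that its members must be hashable and their order is not preserved, so a list cannot be placed directly inside a set.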