Pipeline
A pipeline is a collection of Nodes that are connected together to form a directed acyclic graph, where the nodes
follow a parent-child relationship. The purpose of these is to form an abstract
representation of what you want to search over/optimize and then build into a concrete object.
Key Operations#
Once a pipeline is created, you can perform three key operations on it:
search_space(parser=...)
- This will return the search space of the pipeline, as defined by its nodes. See the reference for the available parsers and search spaces.
configure(config=...)
- This will return a new pipeline where each node is configured correctly.
build(builder=...)
- This will return some concrete object from a configured pipeline. See the reference for the available builders.
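The three operations above can be pictured with a small standalone sketch. It does not use amltk: `ToyNode`, its methods, and the `name:param` key format are hypothetical stand-ins chosen for illustration, and the real parsers and builders are not modelled.

```python
from dataclasses import dataclass, field, replace
from typing import Any, Callable


@dataclass(frozen=True)
class ToyNode:
    """Hypothetical stand-in for a single pipeline node (not the amltk class)."""

    name: str
    item: Callable[..., Any]
    space: dict[str, Any] = field(default_factory=dict)
    config: dict[str, Any] = field(default_factory=dict)

    def search_space(self) -> dict[str, Any]:
        # Prefix each hyperparameter with the node name (assumed key format).
        return {f"{self.name}:{key}": value for key, value in self.space.items()}

    def configure(self, config: dict[str, Any]) -> "ToyNode":
        # Return a new node holding only the entries that belong to this node.
        prefix = f"{self.name}:"
        own = {k[len(prefix):]: v for k, v in config.items() if k.startswith(prefix)}
        return replace(self, config=own)

    def build(self) -> Any:
        # Construct the concrete object from the stored configuration.
        return self.item(**self.config)


@dataclass
class Model:
    x: float


node = ToyNode("model", Model, space={"x": (0.0, 1.0)})
space = node.search_space()
configured = node.configure({"model:x": 0.25})
model = configured.build()
```

Note that `configure` leaves the original node untouched and returns a new one, mirroring the "return a new pipeline" behaviour described above.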
Node#
A Node is the basic building block of a pipeline.
It contains various attributes, such as:
.name
- The name of the node, which is used to identify it in the pipeline.
.item
- The concrete object or some function to construct one.
.space
- A search space to consider for this node.
.config
- The specific configuration to use for this node once build is called.
.nodes
- Other nodes that this node links to.
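To make the parent-child structure concrete, here is a minimal sketch of a node carrying the attributes listed above, plus a depth-first walk over its children. `ToyNode` and `walk` are hypothetical names for illustration only and are not the amltk implementation.

```python
from dataclasses import dataclass
from typing import Any, Iterator, Optional


@dataclass
class ToyNode:
    """Hypothetical node mirroring the attributes listed above."""

    name: str
    item: Any = None
    space: Optional[dict] = None
    config: Optional[dict] = None
    nodes: tuple = ()

    def walk(self) -> Iterator["ToyNode"]:
        # Depth-first traversal: the node itself, then each child in turn.
        yield self
        for child in self.nodes:
            yield from child.walk()


scaler = ToyNode("scaler", item=object)
model = ToyNode("model", item=dict, space={"x": (0.0, 1.0)})
root = ToyNode("pipeline", nodes=(scaler, model))

names = [node.name for node in root.walk()]
leaves = [node.name for node in root.walk() if not node.nodes]
```

Here `root` is a branch because it has children, while `scaler` and `model` are leaves.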
To give syntactic meaning to these nodes, we have various subclasses. For example,
Sequential is a node where the order of the nodes it contains matters, while a Component is a node
that can be used to parametrize and construct a concrete object, but does not lead to anything else.
Each node type here is either a leaf or a branch, where a branch has children, while a leaf does not.
The various node types are listed here:
Component - leaf#
A parametrizable node type with some way to build an object, given a configuration.
from amltk.pipeline import Component
from dataclasses import dataclass

@dataclass
class Model:
    x: float

c = Component(Model, space={"x": (0.0, 1.0)}, name="model")
╭─ Component(model) ──────╮
│ item  class Model(...)  │
│ space {'x': (0.0, 1.0)} │
╰─────────────────────────╯
Searchable - leaf#
A parametrizable node type that contains a search space that should be searched over, but does not provide a concrete object.
from amltk.pipeline import Searchable

def run_script(mode, n):
    # ... run some actual script
    pass

script_space = Searchable({"mode": ["orange", "blue", "red"], "n": (10, 100)})
╭─ Searchable(Searchable-XL8TnRdi) ─────────────────────────╮
│ space {'mode': ['orange', 'blue', 'red'], 'n': (10, 100)} │
╰───────────────────────────────────────────────────────────╯
Fixed - leaf#
A non-parametrizable node type that contains an object that should be used as is.
from amltk.pipeline import Component, Fixed, Sequential
from sklearn.ensemble import RandomForestClassifier
estimator = RandomForestClassifier()
# ... pretend it was fit
fitted_estimator = Fixed(estimator)
╭─ Fixed(RandomForestClassifier) ─╮
│ item RandomForestClassifier()   │
╰─────────────────────────────────╯
Sequential - branch#
A node type which signifies an order between its children, such as a sequence of preprocessing steps followed by an estimator, through which the data should flow.
from amltk.pipeline import Component, Sequential
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

pipeline = Sequential(
    PCA(n_components=3),
    Component(RandomForestClassifier, space={"n_estimators": (10, 100)}),
    name="my_pipeline"
)
╭─ Sequential(my_pipeline) ───────────────────╮
│ ╭─ Fixed(PCA) ─────────────╮ │
│ │ item PCA(n_components=3) │ │
│ ╰──────────────────────────╯ │
│ ↓ │
│ ╭─ Component(RandomForestClassifier) ─────╮ │
│ │ item class RandomForestClassifier(...) │ │
│ │ space {'n_estimators': (10, 100)} │ │
│ ╰─────────────────────────────────────────╯ │
╰─────────────────────────────────────────────╯
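As a rough illustration of why the order of children matters, the sketch below passes data through plain functions one after another. `run_sequential`, `double`, and `increment` are hypothetical helpers, not part of amltk; how a real Sequential is executed depends on the builder used.

```python
from functools import reduce
from typing import Any, Callable, Sequence


def run_sequential(steps: Sequence[Callable[[Any], Any]], data: Any) -> Any:
    """Feed the output of each step into the next one, in order."""
    return reduce(lambda value, step: step(value), steps, data)


def double(x: int) -> int:
    return x * 2


def increment(x: int) -> int:
    return x + 1


first = run_sequential([double, increment], 3)   # (3 * 2) + 1
second = run_sequential([increment, double], 3)  # (3 + 1) * 2
```

Swapping the two steps changes the result, which is the property a Sequential node encodes.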
Choice - branch#
A node type that signifies a choice between multiple children, usually chosen during configuration.
from amltk.pipeline import Choice, Component
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
rf = Component(RandomForestClassifier, space={"n_estimators": (10, 100)})
mlp = Component(MLPClassifier, space={"activation": ["logistic", "relu", "tanh"]})
estimator_choice = Choice(rf, mlp, name="estimator")
╭─ Choice(estimator) ──────────────────────────────────────────────────────────╮
│ ╭─ Component(MLPClassifier) ─────╮ ╭─ Component(RandomForestClassifier)─╮ │
│ │ item class MLPClassifier(...) │ │ item class │ │
│ │ space { │ │ RandomForestClassifier(...) │ │
│ │ 'activation': [ │ │ space {'n_estimators': (10, 100)} │ │
│ │ 'logistic', │ ╰────────────────────────────────────╯ │
│ │ 'relu', │ │
│ │ 'tanh' │ │
│ │ ] │ │
│ │ } │ │
│ ╰────────────────────────────────╯ │
╰──────────────────────────────────────────────────────────────────────────────╯
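One way to picture a choice being made during configuration is sketched below. The classes `ToyComponent` and `ToyChoice`, and the `__choice__` key used to record the selection, are assumptions made for this illustration rather than a description of amltk's internals.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class ToyComponent:
    name: str
    space: dict[str, Any]


@dataclass(frozen=True)
class ToyChoice:
    name: str
    nodes: tuple[ToyComponent, ...]

    def search_space(self) -> dict[str, Any]:
        # One categorical entry for the choice itself (assumed key name),
        # plus every child's hyperparameters under a prefixed key.
        space: dict[str, Any] = {
            f"{self.name}:__choice__": [node.name for node in self.nodes]
        }
        for node in self.nodes:
            for key, value in node.space.items():
                space[f"{self.name}:{node.name}:{key}"] = value
        return space

    def configure(self, config: dict[str, Any]) -> ToyComponent:
        # Keep only the child named by the configuration.
        chosen = config[f"{self.name}:__choice__"]
        return next(node for node in self.nodes if node.name == chosen)


rf = ToyComponent("rf", {"n_estimators": (10, 100)})
mlp = ToyComponent("mlp", {"activation": ["logistic", "relu", "tanh"]})
choice = ToyChoice("estimator", (rf, mlp))

space = choice.search_space()
picked = choice.configure(
    {"estimator:__choice__": "mlp", "estimator:mlp:activation": "relu"}
)
```

The search space covers every child, but after configuration only the selected child remains.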
Split - branch#
A node where the output of the previous node is split amongst its children, according to its configuration.
from amltk.pipeline import Component, Split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_selector

categorical_pipeline = [
    SimpleImputer(strategy="constant", fill_value="missing"),
    OneHotEncoder(drop="first"),
]
numerical_pipeline = Component(SimpleImputer, space={"strategy": ["mean", "median"]})

preprocessor = Split(
    {"categories": categorical_pipeline, "numerical": numerical_pipeline},
    name="my_split"
)
╭─ Split(my_split) ────────────────────────────────────────────────────────────╮
│ ╭─ Sequential(categories) ──────────╮ ╭─ Sequential(numerical) ────────────╮ │
│ │ ╭─ Fixed(SimpleImputer) ────────╮ │ │ ╭─ Component(SimpleImputer) ─────╮ │ │
│ │ │ item SimpleImputer(fill_valu… │ │ │ │ item class SimpleImputer(...) │ │ │
│ │ │ strategy='constant') │ │ │ │ space { │ │ │
│ │ ╰───────────────────────────────╯ │ │ │ 'strategy': [ │ │ │
│ │ ↓ │ │ │ 'mean', │ │ │
│ │ ╭─ Fixed(OneHotEncoder) ────────╮ │ │ │ 'median' │ │ │
│ │ │ item OneHotEncoder(drop='fir… │ │ │ │ ] │ │ │
│ │ ╰───────────────────────────────╯ │ │ │ } │ │ │
│ ╰───────────────────────────────────╯ │ ╰────────────────────────────────╯ │ │
│ ╰────────────────────────────────────╯ │
╰──────────────────────────────────────────────────────────────────────────────╯
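The idea of splitting data amongst children can be sketched in plain Python as routing named columns to different functions. Everything here (`run_split`, the `routing` mapping, and the two child functions) is hypothetical and only illustrates the concept; it is not how amltk or scikit-learn implement it.

```python
from typing import Any, Callable

Columns = dict[str, list[Any]]


def run_split(
    data: Columns,
    routing: dict[str, list[str]],
    children: dict[str, Callable[[Columns], Columns]],
) -> Columns:
    """Send each child only the columns assigned to it, then merge the results."""
    merged: Columns = {}
    for child_name, column_names in routing.items():
        subset = {name: data[name] for name in column_names}
        merged.update(children[child_name](subset))
    return merged


def fill_missing(columns: Columns) -> Columns:
    # Replace missing categorical values with a constant.
    return {
        name: ["missing" if value is None else value for value in values]
        for name, values in columns.items()
    }


def scale(columns: Columns) -> Columns:
    # Multiply numerical values by two, standing in for real preprocessing.
    return {name: [value * 2 for value in values] for name, values in columns.items()}


data = {"colour": ["red", None], "height": [1.0, 3.0]}
routing = {"categories": ["colour"], "numerical": ["height"]}
children = {"categories": fill_missing, "numerical": scale}

result = run_split(data, routing, children)
```

In this sketch the `routing` mapping plays the role of the split's configuration: it decides which child sees which part of the data.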
Join - branch#
A node where the output of the previous node is sent to all of its children.
from amltk.pipeline import Join, Component
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
pca = Component(PCA, space={"n_components": (1, 3)})
kbest = Component(SelectKBest, space={"k": (1, 3)})
join = Join(pca, kbest, name="my_feature_union")
╭─ Join(my_feature_union) ────────────────────────────────────────────╮
│ ╭─ Component(PCA) ───────────────╮ ╭─ Component(SelectKBest) ─────╮ │
│ │ item  class PCA(...)           │ │ item  class SelectKBest(...) │ │
│ │ space {'n_components': (1, 3)} │ │ space {'k': (1, 3)}          │ │
│ ╰────────────────────────────────╯ ╰──────────────────────────────╯ │
╰─────────────────────────────────────────────────────────────────────╯
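In contrast to a split, every child of a join receives the same input. The sketch below shows that idea with plain functions; `run_join` and the two feature functions are hypothetical, and merging the children's outputs into one result is an assumption suggested by the "feature union" name in the example above.

```python
from typing import Callable, Sequence


def run_join(
    data: Sequence[float],
    children: Sequence[Callable[[Sequence[float]], dict[str, float]]],
) -> dict[str, float]:
    """Give the full input to every child and merge what they return."""
    joined: dict[str, float] = {}
    for child in children:
        joined.update(child(data))
    return joined


def mean_feature(values: Sequence[float]) -> dict[str, float]:
    return {"mean": sum(values) / len(values)}


def max_feature(values: Sequence[float]) -> dict[str, float]:
    return {"max": max(values)}


features = run_join([1.0, 2.0, 3.0], [mean_feature, max_feature])
```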
Syntax Sugar#
You can connect these nodes together using the constructors explicitly, as shown in the examples. We also provide some infix operators:
>>
- Connect nodes together to form a Sequential
&
- Connect nodes together to form a Join
|
- Connect nodes together to form a Choice
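Operators like these are typically provided through Python's `__rshift__`, `__and__`, and `__or__` methods. The sketch below shows the mechanism on a hypothetical `ToyNode` class; the naming scheme and the way results nest are assumptions for illustration and may differ from what amltk does.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyNode:
    """Hypothetical node used only to demonstrate operator overloading."""

    name: str
    kind: str = "Component"
    nodes: tuple = ()

    def _combine(self, other: "ToyNode", kind: str) -> "ToyNode":
        # Wrap both operands in a new branch node of the requested kind.
        label = f"{kind}({self.name}, {other.name})"
        return ToyNode(name=label, kind=kind, nodes=(self, other))

    def __rshift__(self, other: "ToyNode") -> "ToyNode":
        return self._combine(other, "Sequential")

    def __and__(self, other: "ToyNode") -> "ToyNode":
        return self._combine(other, "Join")

    def __or__(self, other: "ToyNode") -> "ToyNode":
        return self._combine(other, "Choice")


a, b, c = ToyNode("a"), ToyNode("b"), ToyNode("c")
seq = a >> b
join = a & b
choice = a | b
mixed = a >> b | c  # parsed as (a >> b) | c
```

Python's operator precedence applies as usual: `>>` binds tighter than `&`, which binds tighter than `|`, so parentheses are worth adding whenever operators are mixed.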
There is also another shorthand that you may find useful to know:
{comp1, comp2, comp3}
- This will automatically be converted into a Choice between the given components.
(comp1, comp2, comp3)
- This will automatically be converted into a Join between the given components.
[comp1, comp2, comp3]
- This will automatically be converted into a Sequential between the given components.
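The container shorthand can be understood as a dispatch on the Python container type. The sketch below is a hypothetical converter, `as_toy_node`, that returns simple `(kind, children)` tuples instead of real nodes; it only illustrates the mapping and is not amltk's conversion code.

```python
from typing import Any


def as_toy_node(obj: Any) -> tuple[str, Any]:
    """Map set -> Choice, tuple -> Join, list -> Sequential, anything else -> Leaf."""
    if isinstance(obj, (set, frozenset)):
        # Sets are unordered, so sort the children for a deterministic result.
        children = sorted((as_toy_node(child) for child in obj), key=repr)
        return ("Choice", children)
    if isinstance(obj, tuple):
        return ("Join", [as_toy_node(child) for child in obj])
    if isinstance(obj, list):
        return ("Sequential", [as_toy_node(child) for child in obj])
    return ("Leaf", obj)


tree = as_toy_node(["impute", ("pca", "kbest"), {"rf", "mlp"}])
```

Because Python sets only accept hashable elements, a list or another set cannot be placed directly inside the `{...}` form; whatever goes into a set literal must itself be hashable.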