
Pipeline

Pieces of a Pipeline#

A pipeline is a collection of Nodes connected together to form a directed acyclic graph, where the nodes follow a parent-child relationship. The purpose of a pipeline is to form an abstract representation of what you want to search over/optimize, which can then be built into a concrete object.

These Nodes allow you to specify the function/object that will be used, its search space and any configuration you want to explicitly apply. The various components listed below give these nodes extra syntactic meaning, e.g. a Choice represents a choice between its children, while a Sequential indicates that each child follows one after the other.

Once a pipeline is created, you can perform three critical operations on it: search_space() to extract a search space, configure() to apply a configuration, and build() to construct a concrete object.

Components#

You can use the various different node types to build a pipeline.

You can connect these nodes together either by using the constructors explicitly, as shown in the examples, or with the following infix operators:

  • >> - Connect nodes together to form a Sequential
  • & - Connect nodes together to form a Join
  • | - Connect nodes together to form a Choice

There are also some other short-hands that you may find useful to know:

  • {comp1, comp2, comp3} - This will automatically be converted into a Choice between the given components.
  • (comp1, comp2, comp3) - This will automatically be converted into a Join between the given components.
  • [comp1, comp2, comp3] - This will automatically be converted into a Sequential between the given components.

For each of these components we will show examples using the "sklearn" builder.

The components are:

Component#

Bases: Node[Item, Space]

A Component of the pipeline with a possible item and no children.

This is the basic building block of most pipelines. It accepts as its item= some function that will be called with build_item() to build that one part of the pipeline.

When build_item() is called, the .config on this node will be passed to the function to build the item.

A common pattern is to use a Component to wrap a constructor, specifying the space= and config= to be used when building the item.

from amltk.pipeline import Component
from sklearn.ensemble import RandomForestClassifier

rf = Component(
    RandomForestClassifier,
    config={"max_depth": 3},
    space={"n_estimators": (10, 100)}
)

config = {"n_estimators": 50}  # Sample from some space or something
configured_rf = rf.configure(config)

estimator = configured_rf.build_item()

RandomForestClassifier(max_depth=3, n_estimators=50)

Whenever some other node sees a function/constructor, e.g. RandomForestClassifier, this will automatically be converted into a Component.

from amltk.pipeline import Sequential
from sklearn.ensemble import RandomForestClassifier

pipeline = Sequential(RandomForestClassifier, name="my_pipeline")

The default .name of a component is the name of the class/function that it will use. You can explicitly set the name= if you want to when constructing the component.

Like all Nodes, a Component accepts an explicit name=, item=, config=, space=, fidelities=, config_transform= and meta=.

See Also
Source code in src/amltk/pipeline/components.py
def __init__(
    self,
    item: Callable[..., Item],
    *,
    name: str | None = None,
    config: Config | None = None,
    space: Space | None = None,
    fidelities: Mapping[str, Any] | None = None,
    config_transform: Callable[[Config, Any], Config] | None = None,
    meta: Mapping[str, Any] | None = None,
):
    """See [`Node`][amltk.pipeline.node.Node] for details."""
    super().__init__(
        name=name if name is not None else entity_name(item),
        item=item,
        config=config,
        space=space,
        fidelities=fidelities,
        config_transform=config_transform,
        meta=meta,
    )

Sequential#

Bases: Node[Item, Space]

A Sequential set of operations in a pipeline.

This indicates the different children in .nodes should act one after another, feeding the output of one into the next.

from amltk.pipeline import Component, Sequential
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

pipeline = Sequential(
    PCA(n_components=3),
    Component(RandomForestClassifier, space={"n_estimators": (10, 100)}),
    name="my_pipeline"
)

space = pipeline.search_space("configspace")

configuration = space.sample_configuration()

configured_pipeline = pipeline.configure(configuration)

sklearn_pipeline = pipeline.build("sklearn")

Pipeline(steps=[('PCA', PCA(n_components=3)),
                ('RandomForestClassifier', RandomForestClassifier())])

You may also just chain together nodes using an infix operator >> if you prefer:

from amltk.pipeline import Component, Sequential
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

pipeline = (
    Sequential(name="my_pipeline")
    >> PCA(n_components=3)
    >> Component(RandomForestClassifier, space={"n_estimators": (10, 100)})
)

Whenever some other node sees a list, i.e. [comp1, comp2, comp3], this will automatically be converted into a Sequential.

from amltk.pipeline import Choice
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

pipeline_choice = Choice(
    [SimpleImputer(), RandomForestClassifier()],
    [StandardScaler(), MLPClassifier()],
    name="pipeline_choice"
)

Like all Nodes, a Sequential accepts an explicit name=, item=, config=, space=, fidelities=, config_transform= and meta=.

See Also
Source code in src/amltk/pipeline/components.py
def __init__(
    self,
    *nodes: Node | NodeLike,
    name: str | None = None,
    item: Item | Callable[[Item], Item] | None = None,
    config: Config | None = None,
    space: Space | None = None,
    fidelities: Mapping[str, Any] | None = None,
    config_transform: Callable[[Config, Any], Config] | None = None,
    meta: Mapping[str, Any] | None = None,
):
    """See [`Node`][amltk.pipeline.node.Node] for details."""
    _nodes = tuple(as_node(n) for n in nodes)

    # Perhaps we need to do a deeper check on this...
    if not all_unique(_nodes, key=lambda node: node.name):
        raise DuplicateNamesError(self)

    if name is None:
        name = f"Seq-{randuid(8)}"

    super().__init__(
        *_nodes,
        name=name,
        item=item,
        config=config,
        space=space,
        fidelities=fidelities,
        config_transform=config_transform,
        meta=meta,
    )

Choice#

Bases: Node[Item, Space]

A Choice between different subcomponents.

This indicates that a choice should be made between the different children in .nodes, usually done when you configure() with some config from a search_space().

from amltk.pipeline import Choice, Component
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

rf = Component(RandomForestClassifier, space={"n_estimators": (10, 100)})
mlp = Component(MLPClassifier, space={"activation": ["logistic", "relu", "tanh"]})

estimator_choice = Choice(rf, mlp, name="estimator")

space = estimator_choice.search_space("configspace")

config = space.sample_configuration()

configured_choice = estimator_choice.configure(config)

chosen_estimator = configured_choice.chosen()

estimator = chosen_estimator.build_item()

RandomForestClassifier(n_estimators=53)

You may also just add nodes to a Choice using an infix operator | if you prefer:

from amltk.pipeline import Choice, Component
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

rf = Component(RandomForestClassifier, space={"n_estimators": (10, 100)})
mlp = Component(MLPClassifier, space={"activation": ["logistic", "relu", "tanh"]})

estimator_choice = (
    Choice(name="estimator") | mlp | rf
)

Whenever some other node sees a set, i.e. {comp1, comp2, comp3}, this will automatically be converted into a Choice.

from amltk.pipeline import Choice, Component, Sequential
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.impute import SimpleImputer

rf = Component(RandomForestClassifier, space={"n_estimators": (10, 100)})
mlp = Component(MLPClassifier, space={"activation": ["logistic", "relu", "tanh"]})

pipeline = Sequential(
    SimpleImputer(fill_value=0),
    {mlp, rf},
    name="my_pipeline",
)

Like all Nodes, a Choice accepts an explicit name=, item=, config=, space=, fidelities=, config_transform= and meta=.

Order of nodes

The given nodes of a choice are always ordered according to their name, so indexing choice.nodes may not be reliable if modifying the choice dynamically.

Please use choice["name"] to access the nodes instead.

See Also
Source code in src/amltk/pipeline/components.py
def __init__(
    self,
    *nodes: Node | NodeLike,
    name: str | None = None,
    item: Item | Callable[[Item], Item] | None = None,
    config: Config | None = None,
    space: Space | None = None,
    fidelities: Mapping[str, Any] | None = None,
    config_transform: Callable[[Config, Any], Config] | None = None,
    meta: Mapping[str, Any] | None = None,
):
    """See [`Node`][amltk.pipeline.node.Node] for details."""
    _nodes: tuple[Node, ...] = tuple(
        sorted((as_node(n) for n in nodes), key=lambda n: n.name),
    )
    if not all_unique(_nodes, key=lambda node: node.name):
        raise ValueError(
            f"Can't handle nodes as we can not generate a __choice__ for {nodes=}."
            "\nAll nodes must have a unique name. Please provide a `name=` to them",
        )

    if name is None:
        name = f"Choice-{randuid(8)}"

    super().__init__(
        *_nodes,
        name=name,
        item=item,
        config=config,
        space=space,
        fidelities=fidelities,
        config_transform=config_transform,
        meta=meta,
    )

Split#

Bases: Node[Item, Space]

A Split of data in a pipeline.

This indicates the different children in .nodes should act in parallel but on different subsets of data.

from amltk.pipeline import Component, Split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_selector

categorical_pipeline = [
    SimpleImputer(strategy="constant", fill_value="missing"),
    OneHotEncoder(drop="first"),
]
numerical_pipeline = Component(SimpleImputer, space={"strategy": ["mean", "median"]})

preprocessor = Split(
    {
        "categories": categorical_pipeline,
        "numerical": numerical_pipeline,
    },
    config={
        # This is how you would configure the split for the sklearn builder in particular
        "categories": make_column_selector(dtype_include="category"),
        "numerical": make_column_selector(dtype_exclude="category"),
    },
    name="my_split"
)

space = preprocessor.search_space("configspace")

configuration = space.sample_configuration()

configured_preprocessor = preprocessor.configure(configuration)

built_preprocessor = configured_preprocessor.build("sklearn")

Pipeline(steps=[('my_split',
                 ColumnTransformer(transformers=[('categories',
                                                  Pipeline(steps=[('SimpleImputer',
                                                                   SimpleImputer(fill_value='missing',
                                                                                 strategy='constant')),
                                                                  ('OneHotEncoder',
                                                                   OneHotEncoder(drop='first'))]),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x7f4568da8bb0>),
                                                 ('SimpleImputer',
                                                  SimpleImputer(),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x7f4568da9c30>)]))])

The Split is a slight oddity compared to the other components in that it allows a dict as its first argument, where the keys are the names of the different paths through which data will go, and the values are the nodes that will receive that data.

If nodes are passed in positionally, as for all other components, the name of the first node is usually what a builder relies on to make sense of how to use the Split.

Like all Nodes, a Split accepts an explicit name=, item=, config=, space=, fidelities=, config_transform= and meta=.

See Also
Source code in src/amltk/pipeline/components.py
def __init__(
    self,
    *nodes: Node | NodeLike | dict[str, Node | NodeLike],
    name: str | None = None,
    item: Item | Callable[[Item], Item] | None = None,
    config: Config | None = None,
    space: Space | None = None,
    fidelities: Mapping[str, Any] | None = None,
    config_transform: Callable[[Config, Any], Config] | None = None,
    meta: Mapping[str, Any] | None = None,
):
    """See [`Node`][amltk.pipeline.node.Node] for details."""
    if any(isinstance(n, dict) for n in nodes):
        if len(nodes) > 1:
            raise ValueError(
                "Can't handle multiple nodes with a dictionary as a node.\n"
                f"{nodes=}",
            )
        _node = nodes[0]
        assert isinstance(_node, dict)

        def _construct(key: str, value: Node | NodeLike) -> Node:
            match value:
                case list():
                    return Sequential(*value, name=key)
                case set() | tuple():
                    return as_node(value, name=key)
                case _:
                    return Sequential(value, name=key)

        _nodes = tuple(_construct(key, value) for key, value in _node.items())
    else:
        _nodes = tuple(as_node(n) for n in nodes)

    if not all_unique(_nodes, key=lambda node: node.name):
        raise ValueError(
            f"Can't handle nodes they do not all contain unique names, {nodes=}."
            "\nAll nodes must have a unique name. Please provide a `name=` to them",
        )

    if name is None:
        name = f"Split-{randuid(8)}"

    super().__init__(
        *_nodes,
        name=name,
        item=item,
        config=config,
        space=space,
        fidelities=fidelities,
        config_transform=config_transform,
        meta=meta,
    )

Join#

Bases: Node[Item, Space]

Join together different parts of the pipeline.

This indicates the different children in .nodes should act in tandem with one another, for example, concatenating the outputs of the various members of the Join.

from amltk.pipeline import Join, Component
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest

pca = Component(PCA, space={"n_components": (1, 3)})
kbest = Component(SelectKBest, space={"k": (1, 3)})

join = Join(pca, kbest, name="my_feature_union")

space = join.search_space("configspace")

pipeline = join.build("sklearn")

Pipeline(steps=[('my_feature_union',
                 FeatureUnion(transformer_list=[('PCA', PCA()),
                                                ('SelectKBest',
                                                 SelectKBest())]))])

You may also just join together nodes using an infix operator & if you prefer:

from amltk.pipeline import Join, Component
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest

pca = Component(PCA, space={"n_components": (1, 3)})
kbest = Component(SelectKBest, space={"k": (1, 3)})

# Can not parametrize or name the join
join = pca & kbest

# With a parametrized join
join = (
    Join(name="my_feature_union") & pca & kbest
)
item = join.build("sklearn")

Pipeline(steps=[('my_feature_union',
                 FeatureUnion(transformer_list=[('PCA', PCA()),
                                                ('SelectKBest',
                                                 SelectKBest())]))])

Whenever some other node sees a tuple, i.e. (comp1, comp2, comp3), this will automatically be converted into a Join.

from amltk.pipeline import Sequential, Component
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.ensemble import RandomForestClassifier

pca = Component(PCA, space={"n_components": (1, 3)})
kbest = Component(SelectKBest, space={"k": (1, 3)})

# The tuple (pca, kbest) is automatically converted into a Join
join = Sequential(
    (pca, kbest),
    RandomForestClassifier(n_estimators=5),
    name="my_feature_union",
)

╭─ Sequential(my_feature_union) ──────────────────────────────────────────╮
│ ╭─ Join(Join-FX2L7XS4) ───────────────────────────────────────────────╮ │
│ │ ╭─ Component(PCA) ───────────────╮ ╭─ Component(SelectKBest) ─────╮ │ │
│ │ │ item  class PCA(...)           │ │ item  class SelectKBest(...) │ │ │
│ │ │ space {'n_components': (1, 3)} │ │ space {'k': (1, 3)}          │ │ │
│ │ ╰────────────────────────────────╯ ╰──────────────────────────────╯ │ │
│ ╰─────────────────────────────────────────────────────────────────────╯ │
│                                    ↓                                    │
│ ╭─ Fixed(RandomForestClassifier) ─────────────╮                         │
│ │ item RandomForestClassifier(n_estimators=5) │                         │
│ ╰─────────────────────────────────────────────╯                         │
╰─────────────────────────────────────────────────────────────────────────╯

Like all Nodes, a Join accepts an explicit name=, item=, config=, space=, fidelities=, config_transform= and meta=.

See Also
Source code in src/amltk/pipeline/components.py
def __init__(
    self,
    *nodes: Node | NodeLike,
    name: str | None = None,
    item: Item | Callable[[Item], Item] | None = None,
    config: Config | None = None,
    space: Space | None = None,
    fidelities: Mapping[str, Any] | None = None,
    config_transform: Callable[[Config, Any], Config] | None = None,
    meta: Mapping[str, Any] | None = None,
):
    """See [`Node`][amltk.pipeline.node.Node] for details."""
    _nodes = tuple(as_node(n) for n in nodes)
    if not all_unique(_nodes, key=lambda node: node.name):
        raise ValueError(
            f"Can't handle nodes they do not all contain unique names, {nodes=}."
            "\nAll nodes must have a unique name. Please provide a `name=` to them",
        )

    if name is None:
        name = f"Join-{randuid(8)}"

    super().__init__(
        *_nodes,
        name=name,
        item=item,
        config=config,
        space=space,
        fidelities=fidelities,
        config_transform=config_transform,
        meta=meta,
    )

Fixed#

Bases: Node[Item, None]

A Fixed part of the pipeline that represents something that can not be configured and is used directly as is.

It consists of an .item that is fixed, non-configurable and non-searchable. It also has no children.

This is useful for representing parts of the pipeline that are fixed, for example if you have a pipeline that is a Sequential of nodes, but you want to fix the first component to be a PCA with n_components=3, you can use a Fixed to represent that.

from amltk.pipeline import Component, Fixed, Sequential
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import PCA

rf = Component(RandomForestClassifier, space={"n_estimators": (10, 100)})
pca = Fixed(PCA(n_components=3))

pipeline = Sequential(pca, rf, name="my_pipeline")

Whenever some other node sees an instance of something, i.e. something that can't be called, this will automatically be converted into a Fixed.

from amltk.pipeline import Sequential
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import PCA

pipeline = Sequential(
    PCA(n_components=3),
    RandomForestClassifier(n_estimators=50),
    name="my_pipeline",
)

The default .name of a Fixed is the class name of the item that it will use. You can explicitly set the name= if you want to when constructing it.

A Fixed accepts only an explicit name=, item= and meta=.

See Also
Source code in src/amltk/pipeline/components.py
def __init__(
    self,
    item: Item,
    *,
    name: str | None = None,
    config: None = None,
    space: None = None,
    fidelities: None = None,
    config_transform: None = None,
    meta: Mapping[str, Any] | None = None,
):
    """See [`Node`][amltk.pipeline.node.Node] for details."""
    super().__init__(
        name=name if name is not None else entity_name(item),
        item=item,
        config=config,
        space=space,
        fidelities=fidelities,
        config_transform=config_transform,
        meta=meta,
    )

Searchable#

Bases: Node[None, Space]

A Searchable node of the pipeline which just represents a search space, no item attached.

While not usually applicable to pipelines you want to build, this component is useful for creating a search space, especially if the real pipeline you want to optimize can not be built directly. For example, if you are optimizing a script, you may wish to use a Searchable to represent its search space.

from amltk.pipeline import Searchable

script_space = Searchable({"mode": ["orange", "blue", "red"], "n": (10, 100)})

A Searchable explicitly does not allow for item= to be set, nor can it have any children. A Searchable accepts an explicit name=, config=, space=, fidelities=, config_transform= and meta=.

See Also
Source code in src/amltk/pipeline/components.py
def __init__(
    self,
    space: Space | None = None,
    *,
    name: str | None = None,
    config: Config | None = None,
    fidelities: Mapping[str, Any] | None = None,
    config_transform: Callable[[Config, Any], Config] | None = None,
    meta: Mapping[str, Any] | None = None,
):
    """See [`Node`][amltk.pipeline.node.Node] for details."""
    if name is None:
        name = f"Searchable-{randuid(8)}"

    super().__init__(
        name=name,
        config=config,
        space=space,
        fidelities=fidelities,
        config_transform=config_transform,
        meta=meta,
    )

Node#

A pipeline consists of Nodes, which hold the various attributes required to build a pipeline, such as the .item, its .space, its .config and so on.

The Nodes are connected to each other in a parent-child relationship, where the children are simply the .nodes that the parent leads to.

To give these attributes and relations meaning, there are various subclasses of Node which give different syntactic meanings when you want to construct something like a search_space() or build() some concrete object out of the pipeline.

For example, a Sequential node gives the meaning that each of its children in .nodes should follow one another while something like a Choice gives the meaning that only one of its children should be chosen.

You will likely never have to create a Node directly, but instead use the various components to create the pipeline.

Hashing

When hashing a node, i.e. to put it in a set or use it as a key in a dict, only the name of the node and the hash of its children are used. This means that two nodes with the same name and connectivity will hash equally.

Equality

When considering equality, all the fields of the node are compared, including even the parent and branches fields. This means two nodes are considered equal if they look the same and they are connected to nodes that also look the same.