Pipelines Guide#
AutoML-Toolkit was built to support future development of AutoML systems, and a central part of an AutoML system is its pipeline. The purpose of this guide is to help you understand the utilities AutoML-Toolkit provides for defining your pipeline. We will do this by introducing concepts from the ground up, rather than top down. Please see the reference if you just want to quickly look something up.
Introduction#
The pipelines that exist in an AutoML system come in many different forms. For example, one might be an sklearn.pipeline.Pipeline, another might be some deep-learning pipeline, while others might even represent a real-life machinery process and the settings of its machines.
To accommodate this, what AutoML-Toolkit provides is an abstract representation of a pipeline, to help you define its search space and also to build concrete objects in code if possible (see builders).
We categorize this into 4 steps:
- Parametrize your pipeline using the various components, including the kinds of items in the pipeline, the search spaces and any additional configuration. Each type of component carries its own meaning when performing the next steps.
- pipeline.search_space(parser=...): Get a usable search space out of the pipeline. This can then be passed to an Optimizer.
- pipeline.configure(config=...): Configure your pipeline, either manually or using a configuration suggested by an optimizer.
- pipeline.build(builder=...): Build your configured pipeline definition into something usable, e.g. an sklearn.pipeline.Pipeline or a torch.nn.Module.
At the core of these definitions are the many Nodes they consist of. By combining these together, you can define a directed acyclic graph (DAG) that represents the structure of your pipeline.
Here is one such sklearn example that we will build up towards.
╭─ Sequential(Classy Pipeline) ────────────────────────────────────────────────╮
│ ╭─ Split(preprocessing) ───────────────────────────────────────────────────╮ │
│ │ config {                                                                 │ │
│ │            'categoricals':                                               │ │
│ │        <sklearn.compose._column_transformer.make_column_selector object  │ │
│ │        at 0x7efd5a328550>,                                               │ │
│ │            'numerics':                                                   │ │
│ │        <sklearn.compose._column_transformer.make_column_selector object  │ │
│ │        at 0x7efd5a32a590>                                                │ │
│ │        }                                                                 │ │
│ │ ╭─ Sequential(categoricals) ──────╮ ╭─ Sequential(numerics) ───────────╮ │ │
│ │ │ ╭─ Fixed(SimpleImputer) ──────╮ │ │ ╭─ Component(SimpleImputer) ───╮ │ │ │
│ │ │ │ item SimpleImputer(fill_va… │ │ │ │ item  class                  │ │ │ │
│ │ │ │      strategy='constant')   │ │ │ │       SimpleImputer(...)     │ │ │ │
│ │ │ ╰─────────────────────────────╯ │ │ │ space {                      │ │ │ │
│ │ │                ↓                │ │ │           'strategy': [      │ │ │ │
│ │ │ ╭─ Fixed(OneHotEncoder) ──────╮ │ │ │               'mean',        │ │ │ │
│ │ │ │ item OneHotEncoder(drop='f… │ │ │ │               'median'       │ │ │ │
│ │ │ ╰─────────────────────────────╯ │ │ │           ]                  │ │ │ │
│ │ ╰─────────────────────────────────╯ │ │       }                      │ │ │ │
│ │                                     │ ╰──────────────────────────────╯ │ │ │
│ │                                     ╰──────────────────────────────────╯ │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
│                                      ↓                                       │
│ ╭─ Component(RandomForestClassifier) ──────────────────────────────────╮     │
│ │ item  class RandomForestClassifier(...)                              │     │
│ │ space {'n_estimators': (10, 100), 'criterion': ['gini', 'log_loss']} │     │
│ ╰──────────────────────────────────────────────────────────────────────╯     │
╰──────────────────────────────────────────────────────────────────────────────╯
from sklearn.compose import make_column_selector
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
import numpy as np

from amltk.pipeline import Component, Split, Sequential

feature_preprocessing = Split(
    {
        "categoricals": [SimpleImputer(strategy="constant", fill_value="missing"), OneHotEncoder(drop="first")],
        "numerics": Component(SimpleImputer, space={"strategy": ["mean", "median"]}),
    },
    config={
        "categoricals": make_column_selector(dtype_include=object),
        "numerics": make_column_selector(dtype_include=np.number),
    },
    name="preprocessing",
)

pipeline = Sequential(
    feature_preprocessing,
    Component(RandomForestClassifier, space={"n_estimators": (10, 100), "criterion": ["gini", "log_loss"]}),
    name="Classy Pipeline",
)
rich printing
To get the same output locally (terminal or Notebook), you can either call thing.__rich__(), use from rich import print; print(thing), or, in a Notebook, simply leave it as the last object of a cell.
Once we have our pipeline definition, extracting a search space, configuring it and building it into something useful can be done with the search_space(), configure() and build() methods.
Guide Requirements
For this guide, we will be using ConfigSpace and scikit-learn. You can install them manually with pip, for example pip install ConfigSpace scikit-learn.
Component#
A pipeline consists of building blocks which we can combine together to create a DAG. We will start by introducing the Component, the common operations, and then show how to combine them together.
A Component is the most common kind of node in a pipeline. Like all parts of the pipeline, it subclasses Node, but a Component signifies that this is some concrete object, with a possible .space and .config.
Definition#
Naming Nodes
By default, a Component (or any Node for that matter) will use the function or class name for the .name of the Node. You can explicitly pass a name= as a keyword argument when constructing these.
from dataclasses import dataclass

from amltk.pipeline import Component

@dataclass
class MyModel:
    f: float
    i: int
    c: str

my_component = Component(
    MyModel,
    space={"f": (0.0, 1.0), "i": (0, 10), "c": ["red", "green", "blue"]},
)
╭─ Component(MyModel) ─────────────────────────────────────────────────╮
│ item  class MyModel(...)                                             │
│ space {'f': (0.0, 1.0), 'i': (0, 10), 'c': ['red', 'green', 'blue']} │
╰──────────────────────────────────────────────────────────────────────╯
You can also use a function instead of a class if that is preferred.
def myfunc(f: float, i: int, c: str) -> MyModel:
    if f < 0.5:
        c = "red"
    return MyModel(f=f, i=i, c=c)

component_with_function = Component(
    myfunc,
    space={"f": (0.0, 1.0), "i": (0, 10), "c": ["red", "green", "blue"]},
)
╭─ Component(function) ────────────────────────────────────────────────╮
│ item  def myfunc(...)                                                │
│ space {'f': (0.0, 1.0), 'i': (0, 10), 'c': ['red', 'green', 'blue']} │
╰──────────────────────────────────────────────────────────────────────╯
Search Space#
If interacting with an Optimizer, you'll often require some search space object to pass to it. To extract a search space from a Component, we can call search_space(parser=...), passing in the kind of search space you'd like to get out of it.
Available Search Spaces
Please see the spaces reference.
Depending on what you pass as the parser= to search_space(parser=...), we'll attempt to give you a valid search space. For example, specifying "configspace" gives you a ConfigSpace implementation.
You may also define your own parser= and use that if desired.
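To make the space notation concrete, here is a minimal, standard-library-only sketch of how a parser might read it: a two-element tuple as a numeric range (integer or float, depending on the bounds) and a list as categorical choices. The helper `sample_config` is hypothetical and is not part of AutoML-Toolkit; real parsers such as the "configspace" one produce much richer search space objects.

```python
import random
from typing import Any


def sample_config(space: dict[str, Any], rng: random.Random) -> dict[str, Any]:
    """Draw one configuration from a dict-style space (hypothetical helper)."""
    config: dict[str, Any] = {}
    for name, domain in space.items():
        if isinstance(domain, list):
            # A list is read as a set of categorical choices.
            config[name] = rng.choice(domain)
        elif isinstance(domain, tuple) and len(domain) == 2:
            low, high = domain
            if isinstance(low, int) and isinstance(high, int):
                # Two integer bounds: an inclusive integer range.
                config[name] = rng.randint(low, high)
            else:
                # Otherwise treat the bounds as a float range.
                config[name] = rng.uniform(low, high)
        else:
            raise ValueError(f"Cannot interpret domain for {name!r}: {domain!r}")
    return config


space = {"f": (0.0, 1.0), "i": (0, 10), "c": ["red", "green", "blue"]}
config = sample_config(space, random.Random(0))
print(config)
```

The sampled dictionary has one entry per hyperparameter, which is the same shape of config that configure(config=...) accepts.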
Configure#
Pretty straightforward, but what do we do with a config sampled from this space? We can configure(config=...) the component with it.
╭─ Component(MyModel) ──────────────────────────────────────────────────╮
│ item   class MyModel(...)                                             │
│ config {'c': 'red', 'f': 0.8512433941288, 'i': 0}                     │
│ space  {'f': (0.0, 1.0), 'i': (0, 10), 'c': ['red', 'green', 'blue']} │
╰───────────────────────────────────────────────────────────────────────╯
You'll notice that each variable in the space has been set to some value. We could also manually define a config and pass that in. You are not obliged to fully specify this either.
╭─ Component(MyModel) ──────────────────────────────────────────────────╮
│ item   class MyModel(...)                                             │
│ config {'f': 0.5, 'i': 1}                                             │
│ space  {'f': (0.0, 1.0), 'i': (0, 10), 'c': ['red', 'green', 'blue']} │
╰───────────────────────────────────────────────────────────────────────╯
Immutable methods!
One thing you may have noticed is that we assigned the result of configure(config=...) to a new variable. This is because we do not mutate the original my_component and instead return a copy with all of the config variables set.
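The copy-on-configure behaviour can be pictured with a small stand-in class. This is only a sketch using the standard library: `ToyComponent` is a hypothetical name and does not reflect how AutoML-Toolkit implements its nodes.

```python
from dataclasses import dataclass, field, replace
from typing import Any, Optional


@dataclass(frozen=True)
class ToyComponent:
    """Hypothetical stand-in for a node holding a space and a config."""

    name: str
    space: dict[str, Any] = field(default_factory=dict)
    config: Optional[dict[str, Any]] = None

    def configure(self, config: dict[str, Any]) -> "ToyComponent":
        # Build a new object rather than mutating `self`.
        merged = {**(self.config or {}), **config}
        return replace(self, config=merged)


original = ToyComponent("MyModel", space={"f": (0.0, 1.0), "i": (0, 10)})
configured = original.configure({"f": 0.5, "i": 1})

print(original.config)    # None
print(configured.config)  # {'f': 0.5, 'i': 1}
```

Because the original is left untouched, the same definition can be configured many times, once per configuration suggested by an optimizer.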
Build#
To build the individual item of a Component we can use build_item(), which simply calls the .item with the config we have set.
# Same as if we did `configured_component.item(**configured_component.config)`
the_built_model = configured_component.build_item()
print(the_built_model)
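In plain Python terms, the comment above amounts to keyword-argument unpacking. The following sketch uses only a dataclass and no AutoML-Toolkit objects; `ModelWithDefault` is a hypothetical class added to show why a partially specified config can still build when the item has defaults.

```python
from dataclasses import dataclass


@dataclass
class MyModel:
    f: float
    i: int
    c: str


@dataclass
class ModelWithDefault:
    """Hypothetical variant whose `c` has a default value."""

    f: float
    i: int
    c: str = "red"


# A fully specified config is unpacked into keyword arguments.
full_config = {"f": 0.5, "i": 1, "c": "blue"}
print(MyModel(**full_config))  # MyModel(f=0.5, i=1, c='blue')

# A partial config only works if the item supplies the missing values itself.
partial_config = {"f": 0.5, "i": 1}
print(ModelWithDefault(**partial_config))  # ModelWithDefault(f=0.5, i=1, c='red')
```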
However, as we'll see later, we often have multiple steps of a pipeline joined together and so we need some way to get a full object out of it that takes into account all of these items joined together. We can do this with build(builder=...).
For a look at the available arguments to pass to builder=, see the builder reference.
Fixed#
Sometimes we just have some part of the pipeline with no search space and no configuration required, i.e. just some prebuilt thing. We can use the Fixed node type to signify this.
from amltk.pipeline import Fixed
from sklearn.ensemble import RandomForestClassifier
frozen_rf = Fixed(RandomForestClassifier(n_estimators=5))
╭─ Fixed(RandomForestClassifier) ─────────────╮
│ item RandomForestClassifier(n_estimators=5) │
╰─────────────────────────────────────────────╯
Parameter Requests#
Sometimes you may wish to specify that some value should be added to the .config during configure(), where it would be difficult to include that value in the config directly, for example the random_state of an sklearn estimator.
For this reason, we introduce the concept of a request(), which lets you specify that a certain parameter should be added to the config during configure(). You pass these extra parameters into configure(params={...}), and they do not require any namespace prefixing.
from dataclasses import dataclass

from amltk import Component, request

@dataclass
class MyModel:
    f: float
    random_state: int

my_component = Component(
    MyModel,
    space={"f": (0.0, 1.0)},
    config={"random_state": request("seed", default=42)}
)

# Without passing the params
configured_component_no_seed = my_component.configure({"f": 0.5})

# With passing the params
configured_component_with_seed = my_component.configure({"f": 0.5}, params={"seed": 1337})
╭─ Component(MyModel) ──────────────────╮
│ item   class MyModel(...)             │
│ config {'random_state': 42, 'f': 0.5} │
│ space  {'f': (0.0, 1.0)}              │
╰───────────────────────────────────────╯
╭─ Component(MyModel) ────────────────────╮
│ item   class MyModel(...)               │
│ config {'random_state': 1337, 'f': 0.5} │
│ space  {'f': (0.0, 1.0)}                │
╰─────────────────────────────────────────╯
If you explicitly require a parameter to be set, just do not set a default=.
my_component = Component(
    MyModel,
    space={"f": (0.0, 1.0)},
    config={"random_state": request("seed")}
)

my_component.configure({"f": 0.5}, params={"seed": 5})  # All good

try:
    my_component.configure({"f": 0.5})  # Missing required parameter
except ValueError as e:
    print(e)
Missing request=ParamRequest(_has_default=False, key='seed', default=<NotSet>) for Component(name='MyModel', item=<class '_code_block_session_Pipeline_Parameter_Request_n1_.MyModel'>, nodes=(), config={'random_state': ParamRequest(_has_default=False, key='seed', default=<NotSet>)}, space={'f': (0.0, 1.0)}, fidelities=None, config_transform=None, meta=None).
params=None
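The resolution rule shown in these outputs (use the passed parameter, else the default, else fail) can be sketched in a few lines of plain Python. `ToyRequest` and `resolve_requests` are hypothetical names used for illustration only; they are not AutoML-Toolkit's request() or its internals.

```python
from dataclasses import dataclass
from typing import Any, Optional

_NOT_SET = object()  # sentinel meaning "no default was given"


@dataclass(frozen=True)
class ToyRequest:
    """Hypothetical placeholder asking for `key` from the params."""

    key: str
    default: Any = _NOT_SET


def resolve_requests(config: dict[str, Any], params: Optional[dict[str, Any]] = None) -> dict[str, Any]:
    """Replace every ToyRequest in `config` with a concrete value."""
    params = params or {}
    resolved: dict[str, Any] = {}
    for name, value in config.items():
        if isinstance(value, ToyRequest):
            if value.key in params:
                value = params[value.key]  # an explicitly passed parameter wins
            elif value.default is not _NOT_SET:
                value = value.default  # otherwise fall back to the default
            else:
                raise ValueError(f"Missing required parameter {value.key!r} for {name!r}")
        resolved[name] = value
    return resolved


config = {"f": 0.5, "random_state": ToyRequest("seed", default=42)}
print(resolve_requests(config))                          # {'f': 0.5, 'random_state': 42}
print(resolve_requests(config, params={"seed": 1337}))   # {'f': 0.5, 'random_state': 1337}
```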
Config Transform#
Some search spaces and optimizers may have limitations in the kinds of parameters they can support; one notable example is tuple parameters. To get around this, we can pass a config_transform= to Component, which will transform the config before it is passed to the .item during build().
from dataclasses import dataclass

from amltk import Component

@dataclass
class MyModel:
    dimensions: tuple[int, int]

def config_transform(config: dict, _) -> dict:
    """Convert "dim1" and "dim2" into a tuple."""
    dim1 = config.pop("dim1")
    dim2 = config.pop("dim2")
    config["dimensions"] = (dim1, dim2)
    return config

my_component = Component(
    MyModel,
    space={"dim1": (1, 10), "dim2": (1, 10)},
    config_transform=config_transform,
)

configured_component = my_component.configure({"dim1": 5, "dim2": 5})
╭─ Component(MyModel) ─────────────────────────╮
│ item      class MyModel(...)                 │
│ config    {'dimensions': (5, 5)}             │
│ space     {'dim1': (1, 10), 'dim2': (1, 10)} │
│ transform def config_transform(...)          │
╰──────────────────────────────────────────────╯
Transform Context
There may be times where you have some additional context, which you may only know at configuration time. In this case, it is possible to pass this additional context to configure(..., transform_context=...), which will be forwarded as the second argument to your .config_transform.
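As an illustration of a transform that actually uses its second argument, the sketch below converts a searched fraction into an absolute count once the number of features is known. The parameter names `feature_fraction` and `max_features` and the context key `n_features` are assumptions made up for this example; the function is called directly here rather than through configure().

```python
from typing import Any


def config_transform(config: dict[str, Any], context: Any) -> dict[str, Any]:
    """Turn a searched fraction into an absolute feature count."""
    new_config = dict(config)  # work on a copy rather than the caller's dict
    fraction = new_config.pop("feature_fraction")
    n_features = context["n_features"]  # only known once the data has been seen
    new_config["max_features"] = max(1, round(fraction * n_features))
    return new_config


transformed = config_transform({"feature_fraction": 0.25}, {"n_features": 20})
print(transformed)  # {'max_features': 5}
```

Searching over a fraction keeps the search space independent of any particular dataset, while the context supplies the dataset-specific part.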
Sequential#
A single component might be enough for some basic definitions, but generally we need to combine multiple components together. AutoML-Toolkit is designed for larger and more complex structures, which can be made from simple atomic Nodes.
Chaining Together Nodes#
We'll begin by creating two components that wrap scikit-learn estimators.
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from amltk.pipeline import Component
imputer = Component(SimpleImputer, space={"strategy": ["median", "mean"]})
rf = Component(RandomForestClassifier, space={"n_estimators": (10, 100)})
╭─ Component(SimpleImputer) ─────────────╮
│ item  class SimpleImputer(...)         │
│ space {'strategy': ['median', 'mean']} │
╰────────────────────────────────────────╯
╭─ Component(RandomForestClassifier) ─────╮
│ item  class RandomForestClassifier(...) │
│ space {'n_estimators': (10, 100)}       │
╰─────────────────────────────────────────╯
Infix >>
To join these two components together, we can either use the infix notation >>, or pass them directly to a Sequential. However, a random name will be given when using the infix notation.
╭─ Sequential(My Pipeline) ───────────────────╮
│ ╭─ Component(SimpleImputer) ─────────────╮  │
│ │ item  class SimpleImputer(...)         │  │
│ │ space {'strategy': ['median', 'mean']} │  │
│ ╰────────────────────────────────────────╯  │
│                      ↓                      │
│ ╭─ Component(RandomForestClassifier) ─────╮ │
│ │ item  class RandomForestClassifier(...) │ │
│ │ space {'n_estimators': (10, 100)}       │ │
│ ╰─────────────────────────────────────────╯ │
╰─────────────────────────────────────────────╯
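For intuition about the infix notation: Python lets a class define the behaviour of >> through __rshift__. The sketch below uses hypothetical `ToyNode` and `ToySequential` classes (not AutoML-Toolkit's own) to show why an infix chain ends up with a generated name, while the explicit constructor can take one.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyNode:
    """Hypothetical single step of a pipeline."""

    name: str

    def __rshift__(self, other: "ToyNode") -> "ToySequential":
        # `a >> b` has nowhere to take a name from, so one is generated.
        return ToySequential((self, other), name=f"Seq-{uuid.uuid4().hex[:8]}")


@dataclass(frozen=True)
class ToySequential:
    """Hypothetical ordered chain of steps."""

    nodes: tuple
    name: str

    def __rshift__(self, other: ToyNode) -> "ToySequential":
        # Extending a chain keeps its name and appends the new node.
        return ToySequential((*self.nodes, other), name=self.name)


imputer = ToyNode("SimpleImputer")
rf = ToyNode("RandomForestClassifier")

chained = imputer >> rf                                  # generated name
named = ToySequential((imputer, rf), name="My Pipeline")  # explicit name

print([node.name for node in chained.nodes], chained.name)
print([node.name for node in named.nodes], named.name)
```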
Operations#
You can perform many of the same operations as we did for the individual node, but now taking into account everything in the pipeline.
space = pipeline.search_space("configspace")
config = space.sample_configuration()
configured_pipeline = pipeline.configure(config)
Configuration space object:
  Hyperparameters:
    My Pipeline:RandomForestClassifier:n_estimators, Type: UniformInteger, Range: [10, 100], Default: 55
    My Pipeline:SimpleImputer:strategy, Type: Categorical, Choices: {median, mean}, Default: median

Configuration(values={
  'My Pipeline:RandomForestClassifier:n_estimators': 86,
  'My Pipeline:SimpleImputer:strategy': 'median',
})
╭─ Sequential(My Pipeline) ────────────────────╮
│ ╭─ Component(SimpleImputer) ──────────────╮  │
│ │ item   class SimpleImputer(...)         │  │
│ │ config {'strategy': 'median'}           │  │
│ │ space  {'strategy': ['median', 'mean']} │  │
│ ╰─────────────────────────────────────────╯  │
│                      ↓                       │
│ ╭─ Component(RandomForestClassifier) ──────╮ │
│ │ item   class RandomForestClassifier(...) │ │
│ │ config {'n_estimators': 86}              │ │
│ │ space  {'n_estimators': (10, 100)}       │ │
│ ╰──────────────────────────────────────────╯ │
╰──────────────────────────────────────────────╯
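Notice how the flat keys of the sampled configuration, such as 'My Pipeline:SimpleImputer:strategy', end up as per-node configs. A minimal sketch of that grouping, assuming the colon-separated naming seen in the output above, could look like this. `split_by_node` is a hypothetical helper and not the library's own code.

```python
from typing import Any


def split_by_node(flat_config: dict[str, Any], pipeline_name: str) -> dict[str, dict[str, Any]]:
    """Group 'pipeline:node:param' keys into one config dict per node."""
    per_node: dict[str, dict[str, Any]] = {}
    prefix = f"{pipeline_name}:"
    for key, value in flat_config.items():
        if not key.startswith(prefix):
            continue  # belongs to something outside this pipeline
        node_name, param = key[len(prefix):].split(":", maxsplit=1)
        per_node.setdefault(node_name, {})[param] = value
    return per_node


flat = {
    "My Pipeline:RandomForestClassifier:n_estimators": 86,
    "My Pipeline:SimpleImputer:strategy": "median",
}
print(split_by_node(flat, "My Pipeline"))
# {'RandomForestClassifier': {'n_estimators': 86}, 'SimpleImputer': {'strategy': 'median'}}
```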
To build a pipeline of nodes, we simply call build(builder=...). We explicitly pass the builder we want to use, which informs build() how to go from the abstract pipeline definition you've defined to something concrete you can use.
You can find the available builders here.
from sklearn.pipeline import Pipeline as SklearnPipeline
built_pipeline = configured_pipeline.build("sklearn")
assert isinstance(built_pipeline, SklearnPipeline)
Pipeline(steps=[('SimpleImputer', SimpleImputer(strategy='median')), ('RandomForestClassifier', RandomForestClassifier(n_estimators=86))])
Other Building blocks#
We saw the basic building block of a Component, but AutoML-Toolkit also provides support for some other kinds of building blocks. These building blocks can be attached and joined together just like a Component can and allow for much more complex pipeline structures.
Choice#
A Choice is a way to define a choice between multiple components. This is useful when you want to search over multiple algorithms, which may each have their own hyperparameters.
We'll start again by creating two nodes:
from dataclasses import dataclass

from amltk.pipeline import Component

@dataclass
class ModelA:
    i: int

@dataclass
class ModelB:
    c: str

model_a = Component(ModelA, space={"i": (0, 100)})
model_b = Component(ModelB, space={"c": ["red", "blue"]})
╭─ Component(ModelA) ─────╮
│ item  class ModelA(...) │
│ space {'i': (0, 100)}   │
╰─────────────────────────╯
╭─ Component(ModelB) ──────────╮
│ item  class ModelB(...)      │
│ space {'c': ['red', 'blue']} │
╰──────────────────────────────╯
Now combining them into a choice is rather straightforward:
╭─ Choice(estimator) ──────────────────────────────────────────╮
│ ╭─ Component(ModelA) ─────╮ ╭─ Component(ModelB) ──────────╮ │
│ │ item  class ModelA(...) │ │ item  class ModelB(...)      │ │
│ │ space {'i': (0, 100)}   │ │ space {'c': ['red', 'blue']} │ │
│ ╰─────────────────────────╯ ╰──────────────────────────────╯ │
╰──────────────────────────────────────────────────────────────╯
Conditionals and Search Spaces
Not all search space implementations support conditionals, and so some parser= may not be able to handle this. In this case, there won't be any conditionality in the search space.
Check out the parser reference for more information.
Just as we did with a Component, we can also get a search_space() from the choice.
Configuration space object:
  Hyperparameters:
    estimator:ModelA:i, Type: UniformInteger, Range: [0, 100], Default: 50
    estimator:ModelB:c, Type: Categorical, Choices: {red, blue}, Default: red
    estimator:__choice__, Type: Categorical, Choices: {ModelA, ModelB}, Default: ModelA
  Conditions:
    estimator:ModelA:i | estimator:__choice__ == 'ModelA'
    estimator:ModelB:c | estimator:__choice__ == 'ModelB'
When we configure() a choice, the config determines which single component it will collapse down to.
╭─ Choice(estimator) ───────────────────────────────────────────╮
│ config {'__choice__': 'ModelB'}                               │
│ ╭─ Component(ModelA) ─────╮ ╭─ Component(ModelB) ───────────╮ │
│ │ item  class ModelA(...) │ │ item   class ModelB(...)      │ │
│ │ space {'i': (0, 100)}   │ │ config {'c': 'blue'}          │ │
│ ╰─────────────────────────╯ │ space  {'c': ['red', 'blue']} │ │
│                             ╰───────────────────────────────╯ │
╰───────────────────────────────────────────────────────────────╯
You'll notice that configuring sets the .config of the Choice to {"__choice__": "ModelA"} or {"__choice__": "ModelB"}, using the names of the nodes. This lets a builder know which of these two to build.
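The selection step can be sketched without any library code. Assuming the key layout seen in the search space output above (a `__choice__` entry plus parameters prefixed by the chosen node's name), a hypothetical helper `select_branch` picks the branch and the parameters that belong to it:

```python
from typing import Any


def select_branch(
    choice_name: str,
    branches: list[str],
    config: dict[str, Any],
) -> tuple[str, dict[str, Any]]:
    """Return the chosen branch name and the parameters that belong to it."""
    chosen = config[f"{choice_name}:__choice__"]
    if chosen not in branches:
        raise ValueError(f"{chosen!r} is not one of {branches}")
    prefix = f"{choice_name}:{chosen}:"
    # Parameters of the branches that were not chosen are simply ignored.
    params = {key[len(prefix):]: value for key, value in config.items() if key.startswith(prefix)}
    return chosen, params


config = {"estimator:__choice__": "ModelB", "estimator:ModelB:c": "blue"}
print(select_branch("estimator", ["ModelA", "ModelB"], config))
# ('ModelB', {'c': 'blue'})
```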
Split#
A Split is a way to signify a split in the dataflow of a pipeline. This Split by itself will not do anything, but it informs the builder about what to do. Each builder will have its own specific strategy for dealing with one.
Let's go ahead with a scikit-learn example, where we'll split the data into categorical and numerical features and then perform some preprocessing on each of them.
from sklearn.compose import make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
import numpy as np

from amltk.pipeline import Component, Split

select_categories = make_column_selector(dtype_include=object)
select_numerical = make_column_selector(dtype_include=np.number)

preprocessor = Split(
    {
        "categories": [SimpleImputer(strategy="constant", fill_value="missing"), OneHotEncoder(drop="first")],
        "numerics": Component(SimpleImputer, space={"strategy": ["mean", "median"]}),
    },
    config={"categories": select_categories, "numerics": select_numerical},
    name="feature_preprocessing",
)
╭─ Split(feature_preprocessing) ───────────────────────────────────────────────╮
│ config {                                                                     │
│            'categories':                                                     │
│        <sklearn.compose._column_transformer.make_column_selector object at   │
│        0x7efd5a32b5b0>,                                                      │
│            'numerics':                                                       │
│        <sklearn.compose._column_transformer.make_column_selector object at   │
│        0x7efd5a329f00>                                                       │
│        }                                                                     │
│ ╭─ Sequential(categories) ──────────╮ ╭─ Sequential(numerics) ─────────────╮ │
│ │ ╭─ Fixed(SimpleImputer) ────────╮ │ │ ╭─ Component(SimpleImputer) ─────╮ │ │
│ │ │ item SimpleImputer(fill_valu… │ │ │ │ item  class SimpleImputer(...) │ │ │
│ │ │      strategy='constant')     │ │ │ │ space {                        │ │ │
│ │ ╰───────────────────────────────╯ │ │ │           'strategy': [        │ │ │
│ │                 ↓                 │ │ │               'mean',          │ │ │
│ │ ╭─ Fixed(OneHotEncoder) ────────╮ │ │ │               'median'         │ │ │
│ │ │ item OneHotEncoder(drop='fir… │ │ │ │           ]                    │ │ │
│ │ ╰───────────────────────────────╯ │ │ │       }                        │ │ │
│ ╰───────────────────────────────────╯ │ ╰────────────────────────────────╯ │ │
│                                       ╰────────────────────────────────────╯ │
╰──────────────────────────────────────────────────────────────────────────────╯
The first important thing to note here is that we passed a dict to Split, so that we can name the individual paths. This is important because we need some name to refer to them when configuring the Split. It does this by simply wrapping each of the paths in a Sequential.
The second thing is that the keys set in the .config match those of the paths. This lets the Split know which data should be sent where. Each builder= will have its own way of setting up a Split, and you should refer to the builders reference for more information.
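To picture what "which data should be sent where" means, here is a library-free sketch of the idea: each named path gets the columns picked by the selector registered under the same name, and the results are merged back together. All names here (`run_split`, `fill_constant`, `fill_mean` and the toy column data) are hypothetical; an actual builder, such as the scikit-learn one, has its own strategy.

```python
from typing import Any, Callable

Columns = dict[str, list[Any]]


def is_text(values: list[Any]) -> bool:
    return all(v is None or isinstance(v, str) for v in values)


def is_number(values: list[Any]) -> bool:
    return all(v is None or isinstance(v, (int, float)) for v in values)


def fill_constant(columns: Columns) -> Columns:
    """Replace missing values with the string 'missing'."""
    return {name: ["missing" if v is None else v for v in vals] for name, vals in columns.items()}


def fill_mean(columns: Columns) -> Columns:
    """Replace missing values with the mean of the present ones."""
    filled: Columns = {}
    for name, vals in columns.items():
        present = [v for v in vals if v is not None]
        mean = sum(present) / len(present)
        filled[name] = [mean if v is None else v for v in vals]
    return filled


def run_split(
    data: Columns,
    paths: dict[str, Callable[[Columns], Columns]],
    selectors: dict[str, Callable[[Columns], list[str]]],
) -> Columns:
    # The selector names must match the path names, just like the Split's config.
    if paths.keys() != selectors.keys():
        raise ValueError("Every path needs a selector with the same name")
    merged: Columns = {}
    for name, path in paths.items():
        subset = {col: data[col] for col in selectors[name](data)}
        merged.update(path(subset))
    return merged


data = {"colour": ["red", None, "blue"], "size": [1.0, None, 3.0]}
paths = {"categories": fill_constant, "numerics": fill_mean}
selectors = {
    "categories": lambda d: [c for c, v in d.items() if is_text(v)],
    "numerics": lambda d: [c for c, v in d.items() if is_number(v)],
}
print(run_split(data, paths, selectors))
# {'colour': ['red', 'missing', 'blue'], 'size': [1.0, 2.0, 3.0]}
```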
Our last step is just to convert this into a usable object, and so once again we use build().
Pipeline(steps=[('feature_preprocessing', ColumnTransformer(transformers=[('categories', Pipeline(steps=[('SimpleImputer', SimpleImputer(fill_value='missing', strategy='constant')), ('OneHotEncoder', OneHotEncoder(drop='first'))]), <sklearn.compose._column_transformer.make_column_selector object at 0x7efd5a32b5b0>), ('SimpleImputer', SimpleImputer(), <sklearn.compose._column_transformer.make_column_selector object at 0x7efd5a329f00>)]))])
Join#
TODO
Searchable#
TODO
Option#
TODO
Please feel free to provide a contribution!