# Builders
A pipeline of `Node`s is just an abstract representation of some implementation of a pipeline that will actually do things, for example an sklearn `Pipeline` or a PyTorch `Sequential`. To facilitate custom builders and to allow you to customize building, an explicit `builder=` argument is required when calling `.build(builder=...)` on your pipeline.

Each builder gives the various kinds of components an actual meaning. With the sklearn `builder()`, for example, a `Split` translates to a `ColumnTransformer` and a `Sequential` translates to an sklearn `Pipeline`.
## Scikit-learn

`amltk.pipeline.builders.sklearn`
The sklearn `builder()` converts a pipeline made of `Node`s into an sklearn `Pipeline`.

**Requirements:** this requires `sklearn`, which can be installed with `pip install scikit-learn`.
Each kind of node corresponds to a different part of the end pipeline:
`Fixed` - The estimator will simply be cloned, allowing you to directly configure some object in a pipeline.
```python
from sklearn.ensemble import RandomForestClassifier
from amltk.pipeline import Fixed

est = Fixed(RandomForestClassifier(n_estimators=25))
built_pipeline = est.build("sklearn")
```
```
Pipeline(steps=[('RandomForestClassifier', RandomForestClassifier(n_estimators=25))])
```
`Component` - The estimator will be built from the component's config. This is mostly useful to allow a space to be defined for the component.
```python
from sklearn.ensemble import RandomForestClassifier
from amltk.pipeline import Component

est = Component(RandomForestClassifier, space={"n_estimators": (10, 100)})

# ... likely get the configuration through an optimizer or sampling
configured_est = est.configure({"n_estimators": 25})
built_pipeline = configured_est.build("sklearn")
```
```
Pipeline(steps=[('RandomForestClassifier', RandomForestClassifier(n_estimators=25))])
```
`Sequential` - The sequential will be converted into a `Pipeline`, building whatever nodes are contained within it.
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import PCA
from amltk.pipeline import Component, Sequential

pipeline = Sequential(
    PCA(n_components=3),
    Component(RandomForestClassifier, config={"n_estimators": 25}),
)
built_pipeline = pipeline.build("sklearn")
```
```
Pipeline(steps=[('PCA', PCA(n_components=3)), ('RandomForestClassifier', RandomForestClassifier(n_estimators=25))])
```
`Split` - The split will be converted into a `ColumnTransformer`, where each path and the data that should go through it is specified by the split's config. You can provide a `ColumnTransformer` directly as the item to the `Split`, or, if left blank, it will default to the standard sklearn one.

You can use a `Fixed` with the special keyword `"passthrough"` as you might normally do with a `ColumnTransformer`.

By default, we provide two special keywords you can pass to a `Split`, namely `"categorical"` and `"numerical"`, which will automatically configure a `ColumnTransformer` to pass the appropriate columns of a dataframe to the given paths.
```python
from amltk.pipeline import Split, Component
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

categorical_pipeline = [
    SimpleImputer(strategy="constant", fill_value="missing"),
    Component(
        OneHotEncoder,
        space={
            "min_frequency": (0.01, 0.1),
            "handle_unknown": ["ignore", "infrequent_if_exist"],
        },
        config={"drop": "first"},
    ),
]
numerical_pipeline = [SimpleImputer(strategy="median"), StandardScaler()]

split = Split(
    {
        "categorical": categorical_pipeline,
        "numerical": numerical_pipeline,
    }
)
```
```
╭─ Split(Split-OtdhQu0d) ──────────────────────────────────────────────────────╮
│ ╭─ Sequential(categorical) ─────────╮ ╭─ Sequential(numerical) ────────────╮ │
│ │ ╭─ Fixed(SimpleImputer) ────────╮ │ │ ╭─ Fixed(SimpleImputer) ─────────╮ │ │
│ │ │ item SimpleImputer(fill_valu… │ │ │ │ item SimpleImputer(strategy='… │ │ │
│ │ │ strategy='constant') │ │ │ ╰────────────────────────────────╯ │ │
│ │ ╰───────────────────────────────╯ │ │ ↓ │ │
│ │ ↓ │ │ ╭─ Fixed(StandardScaler) ─╮ │ │
│ │ ╭─ Component(OneHotEncoder) ────╮ │ │ │ item StandardScaler() │ │ │
│ │ │ item class │ │ │ ╰─────────────────────────╯ │ │
│ │ │ OneHotEncoder(...) │ │ ╰────────────────────────────────────╯ │
│ │ │ config {'drop': 'first'} │ │ │
│ │ │ space { │ │ │
│ │ │ 'min_frequency': ( │ │ │
│ │ │ 0.01, │ │ │
│ │ │ 0.1 │ │ │
│ │ │ ), │ │ │
│ │ │ 'handle_unknown': │ │ │
│ │ │ [ │ │ │
│ │ │ 'ignore', │ │ │
│ │ │ 'infrequent_i… │ │ │
│ │ │ ] │ │ │
│ │ │ } │ │ │
│ │ ╰───────────────────────────────╯ │ │
│ ╰───────────────────────────────────╯ │
╰──────────────────────────────────────────────────────────────────────────────╯
```
You can manually specify the column selectors if you prefer.
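For reference, the behaviour the builder produces for manual selectors corresponds to plain sklearn usage like the following. This is a sketch using only sklearn, not the amltk API, and the column names are hypothetical:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Scale the "age" column and pass "city" through untouched,
# mirroring a Split with manually specified column selectors
# and a "passthrough" path.
ct = ColumnTransformer(
    transformers=[
        ("scale", StandardScaler(), ["age"]),
        ("keep", "passthrough", ["city"]),
    ]
)

X = pd.DataFrame({"age": [20.0, 30.0, 40.0], "city": ["a", "b", "a"]})
out = ct.fit_transform(X)
print(out.shape)  # (3, 2)
```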
`Join` - The join will be converted into a `FeatureUnion`.
```python
from amltk.pipeline import Join
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest

join = Join(PCA(n_components=2), SelectKBest(k=3), name="my_feature_union")
pipeline = join.build("sklearn")
```
```
Pipeline(steps=[('my_feature_union', FeatureUnion(transformer_list=[('PCA', PCA(n_components=2)), ('SelectKBest', SelectKBest(k=3))]))])
```
`Choice` - The estimator will be built from the chosen component's config. This is very similar to `Component`.
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from amltk.pipeline import Choice

# The choice here is usually provided during the `.configure()` step.
estimator_choice = Choice(
    RandomForestClassifier(),
    MLPClassifier(),
    config={"__choice__": "RandomForestClassifier"},
)
built_pipeline = estimator_choice.build("sklearn")
```
```
Pipeline(steps=[('RandomForestClassifier', RandomForestClassifier())])
```
## PyTorch
**Planned**

If anyone has good knowledge of building PyTorch networks in a more functional manner and would like to contribute, please feel free to reach out!

At the moment, we do not provide any native support for `torch`. You can, however, make use of `skorch` to convert your networks to a scikit-learn interface, and then use the scikit-learn builder instead.