# Builders
A pipeline of `Node`s is just an abstract representation of some implementation of a pipeline that will actually do things, for example an sklearn `Pipeline` or a PyTorch `Sequential`. To facilitate custom builders and to allow you to customize building, there is an explicit `builder=` argument required when calling `.build(builder=...)` on your pipeline.
Each builder gives the various kinds of components an actual meaning: for example, with the sklearn builder, a `Split` translates to a `ColumnTransformer` and a `Sequential` translates to an sklearn `Pipeline`.
## Scikit-learn
The sklearn builder converts a pipeline made of `Node`s into a sklearn `Pipeline`.

**Requirements:** This requires `sklearn`, which can be installed with `pip install scikit-learn`.
Each kind of node corresponds to a different part of the end pipeline:
`Fixed` - The estimator will simply be cloned, allowing you to directly configure some object in a pipeline.
```python
from sklearn.ensemble import RandomForestClassifier
from amltk.pipeline import Fixed

est = Fixed(RandomForestClassifier(n_estimators=25))
built_pipeline = est.build("sklearn")
```

```
Pipeline(steps=[('RandomForestClassifier',
                 RandomForestClassifier(n_estimators=25))])
```
`Component` - The estimator will be built from the component's config. This is mostly useful to allow a space to be defined for the component.
```python
from sklearn.ensemble import RandomForestClassifier
from amltk.pipeline import Component

est = Component(RandomForestClassifier, space={"n_estimators": (10, 100)})

# ... Likely get the configuration through an optimizer or sampling
configured_est = est.configure({"n_estimators": 25})
built_pipeline = configured_est.build("sklearn")
```

```
Pipeline(steps=[('RandomForestClassifier',
                 RandomForestClassifier(n_estimators=25))])
```
`Sequential` - The sequential will be converted into a `Pipeline`, building whatever nodes are contained within it.
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import PCA
from amltk.pipeline import Component, Sequential

pipeline = Sequential(
    PCA(n_components=3),
    Component(RandomForestClassifier, config={"n_estimators": 25}),
)
built_pipeline = pipeline.build("sklearn")
```

```
Pipeline(steps=[('PCA', PCA(n_components=3)),
                ('RandomForestClassifier',
                 RandomForestClassifier(n_estimators=25))])
```
`Split` - The split will be converted into a `ColumnTransformer`, where each path and the data that should go through it is specified by the split's config. You can provide a `ColumnTransformer` directly as the item to the `Split`, or, if left blank, it will default to the standard sklearn one.
You can use a `Fixed` with the special keyword `"passthrough"` as you might normally do with a `ColumnTransformer`.
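To make the `"passthrough"` behaviour concrete, here is a minimal plain-sklearn sketch (no amltk involved) showing one column transformed and one passed through untouched:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Column 0 is standardized, column 1 is passed through unchanged.
ct = ColumnTransformer([
    ("scaled", StandardScaler(), [0]),
    ("untouched", "passthrough", [1]),
])
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
Xt = ct.fit_transform(X)
```

A `Fixed("passthrough")` in a `Split` path plays the same role as the `"passthrough"` string does here.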
By default, we provide two special keywords you can provide to a `Split`, namely `"categorical"` and `"numerical"`, which will automatically configure a `ColumnTransformer` to pass the appropriate columns of a data-frame to the given paths.
```python
from amltk.pipeline import Split, Component
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

categorical_pipeline = [
    SimpleImputer(strategy="constant", fill_value="missing"),
    Component(
        OneHotEncoder,
        space={
            "min_frequency": (0.01, 0.1),
            "handle_unknown": ["ignore", "infrequent_if_exist"],
        },
        config={"drop": "first"},
    ),
]
numerical_pipeline = [SimpleImputer(strategy="median"), StandardScaler()]

split = Split(
    {
        "categories": categorical_pipeline,
        "numbers": numerical_pipeline,
    }
)
```
```
╭─ Split(Split-wGfUuzPZ) ──────────────────────────────────────────────────────╮
│ ╭─ Sequential(categories) ──────────╮ ╭─ Sequential(numbers) ──────────────╮ │
│ │ ╭─ Fixed(SimpleImputer) ────────╮ │ │ ╭─ Fixed(SimpleImputer) ─────────╮ │ │
│ │ │ item SimpleImputer(fill_valu… │ │ │ │ item SimpleImputer(strategy='… │ │ │
│ │ │ strategy='constant') │ │ │ ╰────────────────────────────────╯ │ │
│ │ ╰───────────────────────────────╯ │ │ ↓ │ │
│ │ ↓ │ │ ╭─ Fixed(StandardScaler) ─╮ │ │
│ │ ╭─ Component(OneHotEncoder) ────╮ │ │ │ item StandardScaler() │ │ │
│ │ │ item class │ │ │ ╰─────────────────────────╯ │ │
│ │ │ OneHotEncoder(...) │ │ ╰────────────────────────────────────╯ │
│ │ │ config {'drop': 'first'} │ │ │
│ │ │ space { │ │ │
│ │ │ 'min_frequency': ( │ │ │
│ │ │ 0.01, │ │ │
│ │ │ 0.1 │ │ │
│ │ │ ), │ │ │
│ │ │ 'handle_unknown': │ │ │
│ │ │ [ │ │ │
│ │ │ 'ignore', │ │ │
│ │ │ 'infrequent_i… │ │ │
│ │ │ ] │ │ │
│ │ │ } │ │ │
│ │ ╰───────────────────────────────╯ │ │
│ ╰───────────────────────────────────╯ │
╰──────────────────────────────────────────────────────────────────────────────╯
```
You can manually specify the column selectors if you prefer.
```python
import numpy as np
from sklearn.compose import make_column_selector

split = Split(
    {
        "categories": categorical_pipeline,
        "numbers": numerical_pipeline,
    },
    config={
        "categories": make_column_selector(dtype_include=object),
        "numbers": make_column_selector(dtype_include=np.number),
    },
)
```
`Join` - The join will be converted into a `FeatureUnion`.
```python
from amltk.pipeline import Join, Component
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest

join = Join(PCA(n_components=2), SelectKBest(k=3), name="my_feature_union")
pipeline = join.build("sklearn")
```

```
Pipeline(steps=[('my_feature_union',
                 FeatureUnion(transformer_list=[('PCA', PCA(n_components=2)),
                                                ('SelectKBest',
                                                 SelectKBest(k=3))]))])
```
`Choice` - The estimator will be built from the chosen component's config. This is very similar to `Component`.
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from amltk.pipeline import Choice

# The choice here is usually provided during the `.configure()` step.
estimator_choice = Choice(
    RandomForestClassifier(),
    MLPClassifier(),
    config={"__choice__": "RandomForestClassifier"},
)
built_pipeline = estimator_choice.build("sklearn")
```

```
Pipeline(steps=[('RandomForestClassifier', RandomForestClassifier())])
```
## PyTorch
**Planned:** If anyone has good knowledge of building PyTorch networks in a more functional manner and would like to contribute, please feel free to reach out!

At the moment, we do not provide any native support for `torch`. You can, however, make use of skorch to convert your networks to a scikit-learn interface and use the scikit-learn builder instead.