# Builders
A pipeline of `Node`s is just an abstract representation of some implementation of a pipeline that will actually do things, for example an sklearn `Pipeline` or a PyTorch `Sequential`. To facilitate custom builders and to allow you to customize building, there is an explicit `builder=` argument required when calling `.build(builder=...)` on your pipeline.
Each builder gives the various kinds of components an actual meaning: for example, with the sklearn builder, a `Split` translates to a `ColumnTransformer` and a `Sequential` translates to an sklearn `Pipeline`.
## Scikit-learn
The sklearn builder converts a pipeline made of `Node`s into a sklearn `Pipeline`.

**Requirements:** This requires `sklearn`, which can be installed with `pip install scikit-learn`.
Each kind of node corresponds to a different part of the end pipeline:
`Fixed` - The estimator will simply be cloned, allowing you to directly configure some object in a pipeline.
```python
from sklearn.ensemble import RandomForestClassifier
from amltk.pipeline import Fixed

est = Fixed(RandomForestClassifier(n_estimators=25))
built_pipeline = est.build("sklearn")
```

```
Pipeline(steps=[('RandomForestClassifier',
                 RandomForestClassifier(n_estimators=25))])
```
`Component` - The estimator will be built from the component's config. This is mostly useful to allow a space to be defined for the component.
```python
from sklearn.ensemble import RandomForestClassifier
from amltk.pipeline import Component

est = Component(RandomForestClassifier, space={"n_estimators": (10, 100)})

# ... Likely get the configuration through an optimizer or sampling
configured_est = est.configure({"n_estimators": 25})
built_pipeline = configured_est.build("sklearn")
```

```
Pipeline(steps=[('RandomForestClassifier',
                 RandomForestClassifier(n_estimators=25))])
```
`Sequential` - The sequential will be converted into a `Pipeline`, building whatever nodes are contained within it.
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import PCA
from amltk.pipeline import Component, Sequential

pipeline = Sequential(
    PCA(n_components=3),
    Component(RandomForestClassifier, config={"n_estimators": 25}),
)
built_pipeline = pipeline.build("sklearn")
```

```
Pipeline(steps=[('PCA', PCA(n_components=3)),
                ('RandomForestClassifier',
                 RandomForestClassifier(n_estimators=25))])
```
`Split` - The split will be converted into a `ColumnTransformer`, where each path and the data that should go through it is specified by the split's config. You can provide a `ColumnTransformer` directly as the item to the `Split`, or, if left blank, it will default to the standard sklearn one.
You can use a `Fixed` with the special keyword `"passthrough"` as you might normally do with a `ColumnTransformer`.
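To make the `"passthrough"` behaviour concrete, here is a minimal plain-sklearn sketch (no amltk involved) showing one column transformed and one passed through untouched:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Column 0 is standardized, column 1 is passed through unchanged.
ct = ColumnTransformer([
    ("scaled", StandardScaler(), [0]),
    ("untouched", "passthrough", [1]),
])
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
Xt = ct.fit_transform(X)
```

A `Fixed("passthrough")` in a `Split` path plays the same role as the `"passthrough"` string does here.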
By default, we provide two special keywords you can provide to a `Split`, namely `"categorical"` and `"numerical"`, which will automatically configure a `ColumnTransformer` to pass the appropriate columns of a data-frame to the given paths.
```python
from amltk.pipeline import Split, Component
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

categorical_pipeline = [
    SimpleImputer(strategy="constant", fill_value="missing"),
    Component(
        OneHotEncoder,
        space={
            "min_frequency": (0.01, 0.1),
            "handle_unknown": ["ignore", "infrequent_if_exist"],
        },
        config={"drop": "first"},
    ),
]
numerical_pipeline = [SimpleImputer(strategy="median"), StandardScaler()]

split = Split(
    {
        "categories": categorical_pipeline,
        "numbers": numerical_pipeline,
    }
)
```
```
╭─ Split(Split-wGfUuzPZ) ──────────────────────────────────────────────────────╮
│ ╭─ Sequential(categories) ──────────╮ ╭─ Sequential(numbers) ──────────────╮ │
│ │ ╭─ Fixed(SimpleImputer) ────────╮ │ │ ╭─ Fixed(SimpleImputer) ─────────╮ │ │
│ │ │ item SimpleImputer(fill_valu… │ │ │ │ item SimpleImputer(strategy='… │ │ │
│ │ │ strategy='constant') │ │ │ ╰────────────────────────────────╯ │ │
│ │ ╰───────────────────────────────╯ │ │ ↓ │ │
│ │ ↓ │ │ ╭─ Fixed(StandardScaler) ─╮ │ │
│ │ ╭─ Component(OneHotEncoder) ────╮ │ │ │ item StandardScaler() │ │ │
│ │ │ item class │ │ │ ╰─────────────────────────╯ │ │
│ │ │ OneHotEncoder(...) │ │ ╰────────────────────────────────────╯ │
│ │ │ config {'drop': 'first'} │ │ │
│ │ │ space { │ │ │
│ │ │ 'min_frequency': ( │ │ │
│ │ │ 0.01, │ │ │
│ │ │ 0.1 │ │ │
│ │ │ ), │ │ │
│ │ │ 'handle_unknown': │ │ │
│ │ │ [ │ │ │
│ │ │ 'ignore', │ │ │
│ │ │ 'infrequent_i… │ │ │
│ │ │ ] │ │ │
│ │ │ } │ │ │
│ │ ╰───────────────────────────────╯ │ │
│ ╰───────────────────────────────────╯ │
╰──────────────────────────────────────────────────────────────────────────────╯
```
You can manually specify the column selectors if you prefer.
```python
import numpy as np
from sklearn.compose import make_column_selector

split = Split(
    {
        "categories": categorical_pipeline,
        "numbers": numerical_pipeline,
    },
    config={
        "categories": make_column_selector(dtype_include=object),
        "numbers": make_column_selector(dtype_include=np.number),
    },
)
```
`Join` - The join will be converted into a `FeatureUnion`.
```python
from amltk.pipeline import Join, Component
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest

join = Join(PCA(n_components=2), SelectKBest(k=3), name="my_feature_union")
pipeline = join.build("sklearn")
```

```
Pipeline(steps=[('my_feature_union',
                 FeatureUnion(transformer_list=[('PCA', PCA(n_components=2)),
                                                ('SelectKBest',
                                                 SelectKBest(k=3))]))])
```
`Choice` - The estimator will be built from the chosen component's config. This is very similar to `Component`.
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from amltk.pipeline import Choice

# The choice here is usually provided during the `.configure()` step.
estimator_choice = Choice(
    RandomForestClassifier(),
    MLPClassifier(),
    config={"__choice__": "RandomForestClassifier"},
)
built_pipeline = estimator_choice.build("sklearn")
```

```
Pipeline(steps=[('RandomForestClassifier', RandomForestClassifier())])
```
## PyTorch
**Planned:** If anyone has good knowledge of building PyTorch networks in a more functional manner and would like to contribute, please feel free to reach out!

At the moment, we do not provide any native support for `torch`. You can, however, make use of skorch to convert your networks to a scikit-learn interface and use the scikit-learn builder instead.