Sklearn

The sklearn builder, build(), converts a pipeline made of Nodes into a sklearn Pipeline.

Requirements

This requires sklearn, which can be installed with:

pip install "amltk[scikit-learn]"

# Or directly
pip install scikit-learn

Each kind of node corresponds to a different part of the end pipeline:

Fixed - The estimator will simply be cloned, allowing you to directly configure some object in a pipeline.

from sklearn.ensemble import RandomForestClassifier
from amltk.pipeline import Fixed

est = Fixed(RandomForestClassifier(n_estimators=25))
built_pipeline = est.build("sklearn")

Pipeline(steps=[('RandomForestClassifier',
                 RandomForestClassifier(n_estimators=25))])
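
To make the cloning behaviour concrete, here is a minimal sketch that assumes the builder relies on sklearn's own `clone` (the text above only says the estimator is "cloned", so the exact mechanism is an assumption). It shows that a clone is a new, unfitted object carrying the same hyperparameters.

```python
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier

# The estimator you would wrap in a Fixed node.
original = RandomForestClassifier(n_estimators=25)

# Assumption: the builder copies the estimator much like sklearn's clone().
cloned = clone(original)

# The clone is a distinct object with identical hyperparameters.
is_distinct = cloned is not original
same_n_estimators = cloned.get_params()["n_estimators"] == 25
```

Because the built pipeline holds a copy, fitting it should leave the object you passed to Fixed untouched, under the assumption stated above.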

Component - The estimator will be built from the component's config. This is mostly useful to allow a space to be defined for the component.

from sklearn.ensemble import RandomForestClassifier
from amltk.pipeline import Component

est = Component(RandomForestClassifier, space={"n_estimators": (10, 100)})

# ... Likely get the configuration through an optimizer or sampling
configured_est = est.configure({"n_estimators": 25})

built_pipeline = configured_est.build("sklearn")

Pipeline(steps=[('RandomForestClassifier',
                 RandomForestClassifier(n_estimators=25))])

Sequential - The sequential will be converted into a Pipeline, building whatever nodes are contained within it.

from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import PCA
from amltk.pipeline import Component, Sequential

pipeline = Sequential(
    PCA(n_components=3),
    Component(RandomForestClassifier, config={"n_estimators": 25})
)
built_pipeline = pipeline.build("sklearn")

Pipeline(steps=[('PCA', PCA(n_components=3)),
                ('RandomForestClassifier',
                 RandomForestClassifier(n_estimators=25))])

Split - The split will be converted into a ColumnTransformer, where each path and the data that should go through it are specified by the split's config. You can provide a ColumnTransformer directly as the item of the Split; if left blank, it defaults to the standard sklearn one.

You can use a Fixed with the special keyword "passthrough" as you might normally do with a ColumnTransformer.

By default, we provide two special keywords you can pass to a Split, namely "categorical" and "numerical", which will automatically configure a ColumnTransformer to pass the appropriate columns of a data-frame to the given paths.

from amltk.pipeline import Split, Component
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

categorical_pipeline = [
    SimpleImputer(strategy="constant", fill_value="missing"),
    Component(
        OneHotEncoder,
        space={
            "min_frequency": (0.01, 0.1),
            "handle_unknown": ["ignore", "infrequent_if_exist"],
        },
        config={"drop": "first"},
    ),
]
numerical_pipeline = [SimpleImputer(strategy="median"), StandardScaler()]

split = Split(
    {
        "categorical": categorical_pipeline,
        "numerical": numerical_pipeline
    }
)

Split(Split-cEua0bAh)
    Sequential(categorical)
        Fixed(SimpleImputer)
            item   SimpleImputer(fill_valu…, strategy='constant')
        Component(OneHotEncoder)
            item   class OneHotEncoder(...)
            config {'drop': 'first'}
            space  {'min_frequency': (0.01, 0.1), 'handle_unknown': ['ignore', 'infrequent_i…]}
    Sequential(numerical)
        Fixed(SimpleImputer)
            item   SimpleImputer(strategy='…
        Fixed(StandardScaler)
            item   StandardScaler()

You can manually specify the column selectors if you prefer.

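import numpy as np
from sklearn.compose import make_column_selector

# Reuses categorical_pipeline and numerical_pipeline from the example above.
split = Split(
    {
        "categories": categorical_pipeline,
        "numbers": numerical_pipeline,
    },
    # Each key in the config maps a path name to the columns it should receive.
    config={
        "categories": make_column_selector(dtype_include=object),
        "numbers": make_column_selector(dtype_include=np.number),
    },
)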
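
For orientation, the manual selectors above map onto plain sklearn roughly as follows. This is a sketch of a hand-written ColumnTransformer, not the exact object amltk produces; the toy data-frame, its column names, and the use of `make_pipeline` for each path are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one object-dtype column and one numeric column.
frame = pd.DataFrame(
    {
        "color": pd.Series(["red", "blue", "red", "green"], dtype=object),
        "size": [1.0, 2.0, 3.0, 4.0],
    }
)

# Each entry is (name, transformer, column selector), mirroring the paths
# and the config of the Split.
transformer = ColumnTransformer(
    [
        (
            "categories",
            make_pipeline(
                SimpleImputer(strategy="constant", fill_value="missing"),
                OneHotEncoder(handle_unknown="ignore"),
            ),
            make_column_selector(dtype_include=object),
        ),
        (
            "numbers",
            make_pipeline(SimpleImputer(strategy="median"), StandardScaler()),
            make_column_selector(dtype_include=np.number),
        ),
    ]
)

# Three one-hot columns for "color" plus one scaled column for "size".
transformed = transformer.fit_transform(frame)
```

The selectors are evaluated when the transformer is fitted, so the same definition can be reused on any data-frame whose dtypes follow the same convention.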

Join - The join will be converted into a FeatureUnion.

from amltk.pipeline import Join, Component
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest

join = Join(PCA(n_components=2), SelectKBest(k=3), name="my_feature_union")

pipeline = join.build("sklearn")

Pipeline(steps=[('my_feature_union',
                 FeatureUnion(transformer_list=[('PCA', PCA(n_components=2)),
                                                ('SelectKBest',
                                                 SelectKBest(k=3))]))])
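
As a reminder of what a FeatureUnion does once fitted, the sketch below builds the same union directly in sklearn and applies it to the iris dataset (the dataset is chosen here only for illustration). Each transformer sees the full input, and their outputs are concatenated column-wise.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

X, y = load_iris(return_X_y=True)

# The same two transformers as in the Join above, assembled by hand.
union = FeatureUnion(
    [
        ("PCA", PCA(n_components=2)),
        ("SelectKBest", SelectKBest(k=3)),
    ]
)

# 2 PCA components + 3 selected features are stacked side by side.
features = union.fit_transform(X, y)
n_columns = features.shape[1]
```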

Choice - The estimator will be built from the chosen component's config. This is very similar to Component.

from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from amltk.pipeline import Choice

# The choice here is usually provided during the `.configure()` step.
estimator_choice = Choice(
    RandomForestClassifier(),
    MLPClassifier(),
    config={"__choice__": "RandomForestClassifier"}
)

built_pipeline = estimator_choice.build("sklearn")

Pipeline(steps=[('RandomForestClassifier', RandomForestClassifier())])

def build(node, *, pipeline_type=SklearnPipeline, **pipeline_kwargs)

Build a pipeline into a usable object.

Parameters:

node - The node from which to build a pipeline. TYPE: Node[Any, Any]

pipeline_type - The type of pipeline to build. Defaults to the standard sklearn pipeline but can be any derivative of that, e.g. ImbLearn's pipeline. TYPE: type[SklearnPipelineT] DEFAULT: Pipeline

**pipeline_kwargs - The kwargs to pass to the pipeline_type. TYPE: Any DEFAULT: {}

Returns:

SklearnPipelineT - The built pipeline.

Source code in src/amltk/pipeline/builders/sklearn.py
def build(
    node: Node[Any, Any],
    *,
    pipeline_type: type[SklearnPipelineT] = SklearnPipeline,
    **pipeline_kwargs: Any,
) -> SklearnPipelineT:
    """Build a pipeline into a usable object.

    Args:
        node: The node from which to build a pipeline.
        pipeline_type: The type of pipeline to build. Defaults to the standard
            sklearn pipeline but can be any derivative of that, i.e. ImbLearn's
            pipeline.
        **pipeline_kwargs: The kwargs to pass to the pipeline_type.

    Returns:
        The built pipeline
    """
    return pipeline_type(list(_iter_steps(node)), **pipeline_kwargs)  # type: ignore
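
The last line of build() simply calls pipeline_type on a list of (name, estimator) steps and forwards any extra keyword arguments. The sketch below reproduces that pattern with plain sklearn objects; `build_from_steps` and `TaggedPipeline` are hypothetical names introduced for illustration, and the hand-written step list stands in for what the private `_iter_steps` helper would yield.

```python
from typing import Any

from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline


class TaggedPipeline(Pipeline):
    """Hypothetical Pipeline subclass, standing in for e.g. ImbLearn's pipeline."""


def build_from_steps(
    steps: list[tuple[str, Any]],
    *,
    pipeline_type: type[Pipeline] = Pipeline,
    **pipeline_kwargs: Any,
) -> Pipeline:
    """Hypothetical stand-in for build(): wrap the steps in pipeline_type."""
    return pipeline_type(list(steps), **pipeline_kwargs)


# Hand-written steps, in place of those produced from a pipeline of Nodes.
steps = [
    ("PCA", PCA(n_components=3)),
    ("RandomForestClassifier", RandomForestClassifier(n_estimators=25)),
]

# verbose=True is forwarded untouched to the pipeline constructor.
built = build_from_steps(steps, pipeline_type=TaggedPipeline, verbose=True)
step_names = [name for name, _ in built.steps]
```

Any keyword accepted by the chosen pipeline class, such as `memory` or `verbose` for sklearn's Pipeline, can be passed this way.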