Metalearning#
An important part of AutoML systems is performing well on new, unseen data. There are a variety of methods to achieve this; we provide some building blocks to help implement them.
API
The meta-learning features have not been extensively used yet and as such no stable API has been settled on. We will deprecate any API subject to change before changing it.
MetaFeatures#
A MetaFeature is some statistic about a dataset/task that can be used to make datasets or tasks more comparable, thus enabling meta-learning methods.
Calculating the meta-features of a dataset is quite straightforward.
import openml
from amltk.metalearning import compute_metafeatures
dataset = openml.datasets.get_dataset(
    31,  # credit-g
    download_data=True,
    download_features_meta_data=False,
    download_qualities=False,
)
X, y, _, _ = dataset.get_data(
    dataset_format="dataframe",
    target=dataset.default_target_attribute,
)
mfs = compute_metafeatures(X, y)
print(mfs)
instance_count 1000.000000
log_instance_count 6.907755
number_of_classes 2.000000
number_of_features 20.000000
log_number_of_features 2.995732
percentage_missing_values 0.000000
percentage_of_instances_with_missing_values 0.000000
percentage_of_features_with_missing_values 0.000000
percentage_of_categorical_columns_with_missing_values 0.000000
percentage_of_categorical_values_with_missing_values 0.000000
percentage_of_numeric_columns_with_missing_values 0.000000
percentage_of_numeric_values_with_missing_values 0.000000
number_of_numeric_features 7.000000
number_of_categorical_features 13.000000
ratio_numerical_features 0.350000
ratio_categorical_features 0.650000
ratio_features_to_instances 0.020000
minority_class_imbalance 0.200000
majority_class_imbalance 0.200000
class_imbalance 0.400000
mean_categorical_imbalance 0.500500
std_categorical_imbalance 0.234994
skewness_mean 0.920379
skewness_std 0.904952
skewness_min -0.531348
skewness_max 1.949628
kurtosis_mean 0.924278
kurtosis_std 1.785467
kurtosis_min -1.381449
kurtosis_max 4.292590
dtype: float64
By default, compute_metafeatures() will calculate every MetaFeature implemented, iterating through their subclasses to do so. You can also pass an explicit list with compute_metafeatures(X, y, features=[...]).
Implementing your own is also quite straightforward:
from amltk.metalearning import MetaFeature, compute_metafeatures
import openml
import pandas as pd
dataset = openml.datasets.get_dataset(
    31,  # credit-g
    download_data=True,
    download_features_meta_data=False,
    download_qualities=False,
)
X, y, _, _ = dataset.get_data(
    dataset_format="dataframe",
    target=dataset.default_target_attribute,
)

class TotalValues(MetaFeature):
    @classmethod
    def compute(
        cls,
        x: pd.DataFrame,
        y: pd.Series | pd.DataFrame,
        dependancy_values: dict,
    ) -> int:
        return int(x.shape[0] * x.shape[1])
mfs = compute_metafeatures(X, y, features=[TotalValues])
print(mfs)
Many metafeatures rely on pre-computed dataset statistics that do not need to be calculated more than once, so you can specify the dependencies of a metafeature. When a metafeature would return something other than a single value, i.e. a dict or a pd.DataFrame, we instead call it a DatasetStatistic. These will not be included in the result of compute_metafeatures(). A DatasetStatistic is only calculated once per call to compute_metafeatures(), so it can be re-used across all MetaFeatures that require it as a dependency.
from amltk.metalearning import MetaFeature, DatasetStatistic, compute_metafeatures
import openml
import pandas as pd
dataset = openml.datasets.get_dataset(
    31,  # credit-g
    download_data=True,
    download_features_meta_data=False,
    download_qualities=False,
)
X, y, _, _ = dataset.get_data(
    dataset_format="dataframe",
    target=dataset.default_target_attribute,
)

class NAValues(DatasetStatistic):
    """A mask of all NA values in a dataset"""

    @classmethod
    def compute(
        cls,
        x: pd.DataFrame,
        y: pd.Series | pd.DataFrame,
        dependancy_values: dict,
    ) -> pd.DataFrame:
        return x.isna()


class PercentageNA(MetaFeature):
    """The percentage of values missing"""

    dependencies = (NAValues,)

    @classmethod
    def compute(
        cls,
        x: pd.DataFrame,
        y: pd.Series | pd.DataFrame,
        dependancy_values: dict,
    ) -> float:
        na_values = dependancy_values[NAValues]
        n_na = na_values.sum().sum()
        n_values = int(x.shape[0] * x.shape[1])
        return float(n_na / n_values)
mfs = compute_metafeatures(X, y, features=[PercentageNA])
print(mfs)
To view the description of a particular MetaFeature, you can call .description() on it. The descriptions of every implemented metafeature are listed below.
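As a rough sketch (hypothetical helper names, not necessarily how the listing below was generated, and assuming description() can be called on the class itself), you could print every description yourself along these lines:

from amltk.metalearning import MetaFeature

def all_subclasses(cls: type) -> list[type]:
    # Recursively collect every subclass, since metafeatures may subclass each other.
    direct = cls.__subclasses__()
    return direct + [sub for c in direct for sub in all_subclasses(c)]

for metafeature in all_subclasses(MetaFeature):
    print("---")
    print(metafeature.__name__)  # the listing below shows snake_case names instead
    print("---")
    print(f"* {metafeature.description()}")

Note that the listing below also picks up the TotalValues and PercentageNA examples defined above, since they too subclass MetaFeature.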
---
instance_count
---
* Number of instances in the dataset.
---
log_instance_count
---
* Logarithm of the number of instances in the dataset.
---
number_of_classes
---
* Number of classes in the dataset.
---
number_of_features
---
* Number of features in the dataset.
---
log_number_of_features
---
* Logarithm of the number of features in the dataset.
---
percentage_missing_values
---
* Percentage of missing values in the dataset.
---
percentage_of_instances_with_missing_values
---
* Percentage of instances with missing values.
---
percentage_of_features_with_missing_values
---
* Percentage of features with missing values.
---
percentage_of_categorical_columns_with_missing_values
---
* Percentage of categorical columns with missing values.
---
percentage_of_categorical_values_with_missing_values
---
* Percentage of categorical values with missing values.
---
percentage_of_numeric_columns_with_missing_values
---
* Percentage of numeric columns with missing values.
---
percentage_of_numeric_values_with_missing_values
---
* Percentage of numeric values with missing values.
---
number_of_numeric_features
---
* Number of numeric features in the dataset.
---
number_of_categorical_features
---
* Number of categorical features in the dataset.
---
ratio_numerical_features
---
* Ratio of numerical features to total features in the dataset.
---
ratio_categorical_features
---
* Ratio of categoricals features to total features in the dataset.
---
ratio_features_to_instances
---
* Ratio of features to instances in the dataset.
---
minority_class_imbalance
---
* Imbalance of the minority class in the dataset. 0 => Balanced. 1 imbalanced.
---
majority_class_imbalance
---
* Imbalance of the majority class in the dataset. 0 => Balanced. 1 imbalanced.
---
class_imbalance
---
* Mean Target Imbalance of the classes in general.
0 => Balanced. 1 Imbalanced.
---
mean_categorical_imbalance
---
* The mean imbalance of categorical features.
---
std_categorical_imbalance
---
* The std imbalance of categorical features.
---
skewness_mean
---
* The mean skewness of numerical features.
---
skewness_std
---
* The std skewness of numerical features.
---
skewness_min
---
* The min skewness of numerical features.
---
skewness_max
---
* The max skewness of numerical features.
---
kurtosis_mean
---
* The mean kurtosis of numerical features.
---
kurtosis_std
---
* The std kurtosis of numerical features.
---
kurtosis_min
---
* The min kurtosis of numerical features.
---
kurtosis_max
---
* The max kurtosis of numerical features.
---
total_values
---
*
---
percentage_n_a
---
* The percentage of values missing
Dataset Distances#
One common way to define how similar two datasets are is to compute some "similarity" between them. This notion of "similarity" requires first computing some features of each dataset (metafeatures), so that we can apply a numeric distance function to them.
Let's see how we can quickly compute the distance between some datasets with dataset_distance()!
import pandas as pd
import openml
from amltk.metalearning import compute_metafeatures
def get_dataset(dataset_id: int) -> tuple[pd.DataFrame, pd.Series]:
    dataset = openml.datasets.get_dataset(
        dataset_id,
        download_data=True,
        download_features_meta_data=False,
        download_qualities=False,
    )
    X, y, _, _ = dataset.get_data(
        dataset_format="dataframe",
        target=dataset.default_target_attribute,
    )
    return X, y
d31 = get_dataset(31)
d3 = get_dataset(3)
d4 = get_dataset(4)
metafeatures_dict = {
    "dataset_31": compute_metafeatures(*d31),
    "dataset_3": compute_metafeatures(*d3),
    "dataset_4": compute_metafeatures(*d4),
}
metafeatures = pd.DataFrame(metafeatures_dict)
print(metafeatures)
dataset_31 ... dataset_4
instance_count 1000.000000 ... 57.000000
log_instance_count 6.907755 ... 4.043051
number_of_classes 2.000000 ... 2.000000
number_of_features 20.000000 ... 16.000000
log_number_of_features 2.995732 ... 2.772589
percentage_missing_values 0.000000 ... 0.357456
percentage_of_instances_with_missing_values 0.000000 ... 0.982456
percentage_of_features_with_missing_values 0.000000 ... 1.000000
percentage_of_categorical_columns_with_missing_... 0.000000 ... 1.000000
percentage_of_categorical_values_with_missing_v... 0.000000 ... 0.410088
percentage_of_numeric_columns_with_missing_values 0.000000 ... 1.000000
percentage_of_numeric_values_with_missing_values 0.000000 ... 0.304825
number_of_numeric_features 7.000000 ... 8.000000
number_of_categorical_features 13.000000 ... 8.000000
ratio_numerical_features 0.350000 ... 0.500000
ratio_categorical_features 0.650000 ... 0.500000
ratio_features_to_instances 0.020000 ... 0.280702
minority_class_imbalance 0.200000 ... 0.149123
majority_class_imbalance 0.200000 ... 0.149123
class_imbalance 0.400000 ... 0.298246
mean_categorical_imbalance 0.500500 ... 0.308063
std_categorical_imbalance 0.234994 ... 0.228906
skewness_mean 0.920379 ... 0.255076
skewness_std 0.904952 ... 1.420729
skewness_min -0.531348 ... -2.007217
skewness_max 1.949628 ... 3.318064
kurtosis_mean 0.924278 ... 2.046258
kurtosis_std 1.785467 ... 4.890029
kurtosis_min -1.381449 ... -2.035406
kurtosis_max 4.292590 ... 13.193069
[30 rows x 3 columns]
Now we want to know which one of "dataset_3" or "dataset_4" is more similar to "dataset_31".
from amltk.metalearning import dataset_distance
target = metafeatures_dict.pop("dataset_31")
others = metafeatures_dict
distances = dataset_distance(target, others, distance_metric="l2")
print(distances)
It seems "dataset_3" is, by this notion of distance, closer to "dataset_31" than "dataset_4" is. However, the metafeatures are not all on the same scale. For example, many lie in (0, 1), but some, like instance_count, can completely dominate the distance.
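To make that concrete, here is a small self-contained sketch (not the library implementation; the values are taken from the table above) of how an unscaled feature like instance_count swamps an "l2" distance:

import numpy as np
import pandas as pd

# Two toy metafeature vectors, using values from the table above.
a = pd.Series({"instance_count": 1000.0, "ratio_features_to_instances": 0.02})
b = pd.Series({"instance_count": 57.0, "ratio_features_to_instances": 0.280702})

# The euclidean (l2) distance is dominated almost entirely by instance_count;
# the contribution of ratio_features_to_instances is negligible.
print(np.linalg.norm(a.to_numpy() - b.to_numpy()))  # ~943.0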
Let's repeat the computation, but specify that we should apply a "minmax" scaling across the rows.
distances = dataset_distance(
    target,
    others,
    distance_metric="l2",
    scaler="minmax",
)
print(distances)
Now "dataset_3"
is considered more similar but the difference between the two is a lot less
dramatic. In general, applying some scaling to values of different scales is required for metalearning.
You can also use an sklearn.preprocessing.MinMaxScaler or anything other scaler from scikit-learn for that matter.
from sklearn.preprocessing import MinMaxScaler
distances = dataset_distance(
    target,
    others,
    distance_metric="l2",
    scaler=MinMaxScaler(),
)
print(distances)
Portfolio Selection#
A portfolio in meta-learning is a set (ordered or not) of configurations that maximizes some notion of coverage across datasets or tasks. The intuition is that a portfolio which covers many existing datasets or tasks is likely to also cover a new one!
Suppose we are given the performances of some configurations across some datasets.
import pandas as pd
performances = {
    "c1": [90, 60, 20, 10],
    "c2": [20, 10, 90, 20],
    "c3": [10, 20, 40, 90],
    "c4": [90, 10, 10, 10],
}
portfolio = pd.DataFrame(performances, index=["dataset_1", "dataset_2", "dataset_3", "dataset_4"])
print(portfolio)
If we could only choose k=3 of these configurations for some new dataset, which ones would we choose, and in what priority? This is where portfolio_selection() comes in!
The idea is to pick a subset of these configurations that maximizes some notion of utility for the portfolio. Beginning with the empty portfolio, we add configurations from the full set one by one until we reach k.
Let's see this in action!
import pandas as pd
from amltk.metalearning import portfolio_selection
performances = {
    "c1": [90, 60, 20, 10],
    "c2": [20, 10, 90, 20],
    "c3": [10, 20, 40, 90],
    "c4": [90, 10, 10, 10],
}
portfolio = pd.DataFrame(performances, index=["dataset_1", "dataset_2", "dataset_3", "dataset_4"])

selected_portfolio, trajectory = portfolio_selection(
    portfolio,
    k=3,
    scaler="minmax",
)
print(selected_portfolio)
print()
print(trajectory)
The trajectory tells us which configuration was added at each step, along with the utility of the portfolio once that configuration was added. However, we haven't specified how exactly we define the utility of a given portfolio. We could define our own function to do so:
import pandas as pd
from amltk.metalearning import portfolio_selection
performances = {
    "c1": [90, 60, 20, 10],
    "c2": [20, 10, 90, 20],
    "c3": [10, 20, 40, 90],
    "c4": [90, 10, 10, 10],
}
portfolio = pd.DataFrame(performances, index=["dataset_1", "dataset_2", "dataset_3", "dataset_4"])

def my_function(p: pd.DataFrame) -> float:
    # Take the maximum score for each dataset and then take the mean across them.
    return p.max(axis=1).mean()

selected_portfolio, trajectory = portfolio_selection(
    portfolio,
    k=3,
    scaler="minmax",
    portfolio_value=my_function,
)
print(selected_portfolio)
print()
print(trajectory)
This pattern of reducing across all configurations for a dataset and then aggregating those values is common enough that you can also just supply these two operations directly and we will perform the rest.
import pandas as pd
import numpy as np
from amltk.metalearning import portfolio_selection
performances = {
    "c1": [90, 60, 20, 10],
    "c2": [20, 10, 90, 20],
    "c3": [10, 20, 40, 90],
    "c4": [90, 10, 10, 10],
}
portfolio = pd.DataFrame(performances, index=["dataset_1", "dataset_2", "dataset_3", "dataset_4"])

selected_portfolio, trajectory = portfolio_selection(
    portfolio,
    k=3,
    scaler="minmax",
    row_reducer=np.max,  # This is actually the default
    aggregator=np.mean,  # This is actually the default
)
print(selected_portfolio)
print()
print(trajectory)