Metalearning#
An important part of AutoML systems is performing well on new, unseen data. There are a variety of methods to achieve this; we provide some building blocks to help implement them.
API
The meta-learning features have not been extensively used yet and as such no stable API has been settled on. We will deprecate any API subject to change before changing it.
MetaFeatures#
A MetaFeature is some statistic about a dataset/task that can be used to make datasets or tasks more comparable, thus enabling meta-learning methods.
Calculating the meta-features of a dataset is quite straightforward.
import openml
from amltk.metalearning import compute_metafeatures
dataset = openml.datasets.get_dataset(
    31,  # credit-g
    download_data=True,
    download_features_meta_data=False,
    download_qualities=False,
)
X, y, _, _ = dataset.get_data(
    dataset_format="dataframe",
    target=dataset.default_target_attribute,
)
mfs = compute_metafeatures(X, y)
print(mfs)
instance_count 1000.000000
log_instance_count 6.907755
number_of_classes 2.000000
number_of_features 20.000000
log_number_of_features 2.995732
percentage_missing_values 0.000000
percentage_of_instances_with_missing_values 0.000000
percentage_of_features_with_missing_values 0.000000
percentage_of_categorical_columns_with_missing_values 0.000000
percentage_of_categorical_values_with_missing_values 0.000000
percentage_of_numeric_columns_with_missing_values 0.000000
percentage_of_numeric_values_with_missing_values 0.000000
number_of_numeric_features 7.000000
number_of_categorical_features 13.000000
ratio_numerical_features 0.350000
ratio_categorical_features 0.650000
ratio_features_to_instances 0.020000
minority_class_imbalance 0.200000
majority_class_imbalance 0.200000
class_imbalance 0.400000
mean_categorical_imbalance 0.500500
std_categorical_imbalance 0.234994
skewness_mean 0.920379
skewness_std 0.904952
skewness_min -0.531348
skewness_max 1.949628
kurtosis_mean 0.924278
kurtosis_std 1.785467
kurtosis_min -1.381449
kurtosis_max 4.292590
dtype: float64
By default, compute_metafeatures() will calculate every MetaFeature implemented, iterating through their subclasses to do so. You can also pass an explicit list with compute_metafeatures(X, y, features=[...]).
Implementing your own is also quite straightforward:
from amltk.metalearning import MetaFeature, compute_metafeatures
import openml
import pandas as pd
dataset = openml.datasets.get_dataset(
    31,  # credit-g
    download_data=True,
    download_features_meta_data=False,
    download_qualities=False,
)
X, y, _, _ = dataset.get_data(
    dataset_format="dataframe",
    target=dataset.default_target_attribute,
)

class TotalValues(MetaFeature):
    @classmethod
    def compute(
        cls,
        x: pd.DataFrame,
        y: pd.Series | pd.DataFrame,
        dependancy_values: dict,
    ) -> int:
        return int(x.shape[0] * x.shape[1])
mfs = compute_metafeatures(X, y, features=[TotalValues])
print(mfs)
Many metafeatures rely on pre-computed dataset statistics that do not need to be calculated more than once, so you can specify the dependencies of a metafeature. When a metafeature would return something other than a single value, i.e. a dict or a pd.DataFrame, we instead call it a DatasetStatistic. These will not be included in the result of compute_metafeatures(). A DatasetStatistic is only calculated once per call to compute_metafeatures(), so it can be re-used across all MetaFeatures that require it as a dependency.
from amltk.metalearning import MetaFeature, DatasetStatistic, compute_metafeatures
import openml
import pandas as pd
dataset = openml.datasets.get_dataset(
    31,  # credit-g
    download_data=True,
    download_features_meta_data=False,
    download_qualities=False,
)
X, y, _, _ = dataset.get_data(
    dataset_format="dataframe",
    target=dataset.default_target_attribute,
)

class NAValues(DatasetStatistic):
    """A mask of all NA values in a dataset"""

    @classmethod
    def compute(
        cls,
        x: pd.DataFrame,
        y: pd.Series | pd.DataFrame,
        dependancy_values: dict,
    ) -> pd.DataFrame:
        return x.isna()


class PercentageNA(MetaFeature):
    """The percentage of values missing"""

    dependencies = (NAValues,)

    @classmethod
    def compute(
        cls,
        x: pd.DataFrame,
        y: pd.Series | pd.DataFrame,
        dependancy_values: dict,
    ) -> float:
        na_values = dependancy_values[NAValues]
        n_na = na_values.sum().sum()
        n_values = int(x.shape[0] * x.shape[1])
        return float(n_na / n_values)
mfs = compute_metafeatures(X, y, features=[PercentageNA])
print(mfs)
To view the description of a particular MetaFeature, you can call .description() on it. The descriptions of every implemented metafeature are listed below.
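As a rough sketch (hypothetical helper names, not necessarily how the listing below was generated, and assuming description() can be called on the class itself), you could print every description yourself along these lines:

from amltk.metalearning import MetaFeature

def all_subclasses(cls: type) -> list[type]:
    # Recursively collect every subclass, since metafeatures may subclass each other.
    direct = cls.__subclasses__()
    return direct + [sub for c in direct for sub in all_subclasses(c)]

for metafeature in all_subclasses(MetaFeature):
    print("---")
    print(metafeature.__name__)  # the listing below shows snake_case names instead
    print("---")
    print(f"* {metafeature.description()}")

Note that the listing below also picks up the TotalValues and PercentageNA examples defined above, since they too subclass MetaFeature.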
---
instance_count
---
* Number of instances in the dataset.
---
log_instance_count
---
* Logarithm of the number of instances in the dataset.
---
number_of_classes
---
* Number of classes in the dataset.
---
number_of_features
---
* Number of features in the dataset.
---
log_number_of_features
---
* Logarithm of the number of features in the dataset.
---
percentage_missing_values
---
* Percentage of missing values in the dataset.
---
percentage_of_instances_with_missing_values
---
* Percentage of instances with missing values.
---
percentage_of_features_with_missing_values
---
* Percentage of features with missing values.
---
percentage_of_categorical_columns_with_missing_values
---
* Percentage of categorical columns with missing values.
---
percentage_of_categorical_values_with_missing_values
---
* Percentage of categorical values with missing values.
---
percentage_of_numeric_columns_with_missing_values
---
* Percentage of numeric columns with missing values.
---
percentage_of_numeric_values_with_missing_values
---
* Percentage of numeric values with missing values.
---
number_of_numeric_features
---
* Number of numeric features in the dataset.
---
number_of_categorical_features
---
* Number of categorical features in the dataset.
---
ratio_numerical_features
---
* Ratio of numerical features to total features in the dataset.
---
ratio_categorical_features
---
* Ratio of categoricals features to total features in the dataset.
---
ratio_features_to_instances
---
* Ratio of features to instances in the dataset.
---
minority_class_imbalance
---
* Imbalance of the minority class in the dataset. 0 => Balanced. 1 imbalanced.
---
majority_class_imbalance
---
* Imbalance of the majority class in the dataset. 0 => Balanced. 1 imbalanced.
---
class_imbalance
---
* Mean Target Imbalance of the classes in general.
0 => Balanced. 1 Imbalanced.
---
mean_categorical_imbalance
---
* The mean imbalance of categorical features.
---
std_categorical_imbalance
---
* The std imbalance of categorical features.
---
skewness_mean
---
* The mean skewness of numerical features.
---
skewness_std
---
* The std skewness of numerical features.
---
skewness_min
---
* The min skewness of numerical features.
---
skewness_max
---
* The max skewness of numerical features.
---
kurtosis_mean
---
* The mean kurtosis of numerical features.
---
kurtosis_std
---
* The std kurtosis of numerical features.
---
kurtosis_min
---
* The min kurtosis of numerical features.
---
kurtosis_max
---
* The max kurtosis of numerical features.
---
total_values
---
*
---
percentage_n_a
---
* The percentage of values missing
Dataset Distances#
One common way to define how similar two datasets are is to compute some "similarity" between them. This notion of "similarity" requires first computing some features of each dataset (metafeatures), so that we can apply a numeric distance function to them.
Let's see how we can quickly compute the distance between some datasets with dataset_distance()!
import pandas as pd
import openml
from amltk.metalearning import compute_metafeatures
def get_dataset(dataset_id: int) -> tuple[pd.DataFrame, pd.Series]:
    dataset = openml.datasets.get_dataset(
        dataset_id,
        download_data=True,
        download_features_meta_data=False,
        download_qualities=False,
    )
    X, y, _, _ = dataset.get_data(
        dataset_format="dataframe",
        target=dataset.default_target_attribute,
    )
    return X, y
d31 = get_dataset(31)
d3 = get_dataset(3)
d4 = get_dataset(4)
metafeatures_dict = {
    "dataset_31": compute_metafeatures(*d31),
    "dataset_3": compute_metafeatures(*d3),
    "dataset_4": compute_metafeatures(*d4),
}
metafeatures = pd.DataFrame(metafeatures_dict)
print(metafeatures)
dataset_31 ... dataset_4
instance_count 1000.000000 ... 57.000000
log_instance_count 6.907755 ... 4.043051
number_of_classes 2.000000 ... 2.000000
number_of_features 20.000000 ... 16.000000
log_number_of_features 2.995732 ... 2.772589
percentage_missing_values 0.000000 ... 0.357456
percentage_of_instances_with_missing_values 0.000000 ... 0.982456
percentage_of_features_with_missing_values 0.000000 ... 1.000000
percentage_of_categorical_columns_with_missing_... 0.000000 ... 1.000000
percentage_of_categorical_values_with_missing_v... 0.000000 ... 0.410088
percentage_of_numeric_columns_with_missing_values 0.000000 ... 1.000000
percentage_of_numeric_values_with_missing_values 0.000000 ... 0.304825
number_of_numeric_features 7.000000 ... 8.000000
number_of_categorical_features 13.000000 ... 8.000000
ratio_numerical_features 0.350000 ... 0.500000
ratio_categorical_features 0.650000 ... 0.500000
ratio_features_to_instances 0.020000 ... 0.280702
minority_class_imbalance 0.200000 ... 0.149123
majority_class_imbalance 0.200000 ... 0.149123
class_imbalance 0.400000 ... 0.298246
mean_categorical_imbalance 0.500500 ... 0.308063
std_categorical_imbalance 0.234994 ... 0.228906
skewness_mean 0.920379 ... 0.255076
skewness_std 0.904952 ... 1.420729
skewness_min -0.531348 ... -2.007217
skewness_max 1.949628 ... 3.318064
kurtosis_mean 0.924278 ... 2.046258
kurtosis_std 1.785467 ... 4.890029
kurtosis_min -1.381449 ... -2.035406
kurtosis_max 4.292590 ... 13.193069
[30 rows x 3 columns]
Now we want to know which one of "dataset_3" or "dataset_4" is more similar to "dataset_31".
from amltk.metalearning import dataset_distance
target = metafeatures_dict.pop("dataset_31")
others = metafeatures_dict
distances = dataset_distance(target, others, distance_metric="l2")
print(distances)
It seems "dataset_3" is, by this notion of distance, closer to "dataset_31" than "dataset_4" is. However, the metafeatures are not all on the same scale. For example, many lie in (0, 1), but some, like instance_count, can completely dominate the distance.
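To make that concrete, here is a small self-contained sketch (not the library implementation; the values are taken from the table above) of how an unscaled feature like instance_count swamps an "l2" distance:

import numpy as np
import pandas as pd

# Two toy metafeature vectors, using values from the table above.
a = pd.Series({"instance_count": 1000.0, "ratio_features_to_instances": 0.02})
b = pd.Series({"instance_count": 57.0, "ratio_features_to_instances": 0.280702})

# The euclidean (l2) distance is dominated almost entirely by instance_count;
# the contribution of ratio_features_to_instances is negligible.
print(np.linalg.norm(a.to_numpy() - b.to_numpy()))  # ~943.0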
Let's repeat the computation, but specify that we should apply a "minmax" scaling across the rows.
distances = dataset_distance(
    target,
    others,
    distance_metric="l2",
    scaler="minmax",
)
print(distances)
Now "dataset_3"
is considered more similar but the difference between the two is a lot less
dramatic. In general, applying some scaling to values of different scales is required for metalearning.
You can also use an sklearn.preprocessing.MinMaxScaler or anything other scaler from scikit-learn for that matter.
from sklearn.preprocessing import MinMaxScaler
distances = dataset_distance(
    target,
    others,
    distance_metric="l2",
    scaler=MinMaxScaler(),
)
print(distances)
Portfolio Selection#
A portfolio in meta-learning is a set (ordered or not) of configurations that maximizes some notion of coverage across datasets or tasks. The intuition is that a portfolio which covers many existing datasets or tasks is likely to also cover a new one!
Suppose we are given the performances of some configurations across some datasets.
import pandas as pd
performances = {
    "c1": [90, 60, 20, 10],
    "c2": [20, 10, 90, 20],
    "c3": [10, 20, 40, 90],
    "c4": [90, 10, 10, 10],
}
portfolio = pd.DataFrame(performances, index=["dataset_1", "dataset_2", "dataset_3", "dataset_4"])
print(portfolio)
If we could only choose k=3 of these configurations for some new dataset, which ones would we choose, and in what priority? This is where portfolio_selection() comes in!
The idea is to pick a subset of these configurations that maximizes some notion of utility for the portfolio. Beginning with the empty portfolio, we add configurations from the full set one by one until we reach k.
Let's see this in action!
import pandas as pd
from amltk.metalearning import portfolio_selection
performances = {
    "c1": [90, 60, 20, 10],
    "c2": [20, 10, 90, 20],
    "c3": [10, 20, 40, 90],
    "c4": [90, 10, 10, 10],
}
portfolio = pd.DataFrame(performances, index=["dataset_1", "dataset_2", "dataset_3", "dataset_4"])

selected_portfolio, trajectory = portfolio_selection(
    portfolio,
    k=3,
    scaler="minmax",
)
print(selected_portfolio)
print()
print(trajectory)
The trajectory tells us which configuration was added at each step, along with the utility of the portfolio once that configuration was added. However, we haven't specified how exactly we define the utility of a given portfolio. We could define our own function to do so:
import pandas as pd
from amltk.metalearning import portfolio_selection
performances = {
    "c1": [90, 60, 20, 10],
    "c2": [20, 10, 90, 20],
    "c3": [10, 20, 40, 90],
    "c4": [90, 10, 10, 10],
}
portfolio = pd.DataFrame(performances, index=["dataset_1", "dataset_2", "dataset_3", "dataset_4"])

def my_function(p: pd.DataFrame) -> float:
    # Take the maximum score for each dataset and then take the mean across them.
    return p.max(axis=1).mean()

selected_portfolio, trajectory = portfolio_selection(
    portfolio,
    k=3,
    scaler="minmax",
    portfolio_value=my_function,
)
print(selected_portfolio)
print()
print(trajectory)
This pattern of reducing across all configurations for a dataset and then aggregating those values is common enough that you can also just supply these two operations directly and we will perform the rest.
import pandas as pd
import numpy as np
from amltk.metalearning import portfolio_selection
performances = {
    "c1": [90, 60, 20, 10],
    "c2": [20, 10, 90, 20],
    "c3": [10, 20, 40, 90],
    "c4": [90, 10, 10, 10],
}
portfolio = pd.DataFrame(performances, index=["dataset_1", "dataset_2", "dataset_3", "dataset_4"])

selected_portfolio, trajectory = portfolio_selection(
    portfolio,
    k=3,
    scaler="minmax",
    row_reducer=np.max,  # This is actually the default
    aggregator=np.mean,  # This is actually the default
)
print(selected_portfolio)
print()
print(trajectory)