
Dataset distances

One common way to define how similar two datasets are is to compute a distance between them. This requires first computing a set of numeric features describing each dataset (metafeatures), so that a distance function can be applied to the resulting vectors.
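As a toy illustration of the idea (values loosely based on the metafeature table further below), two datasets reduced to metafeature vectors can be compared with, say, an L2 (Euclidean) distance:

import numpy as np

# Toy metafeature vectors for two datasets, in the same feature order:
# (instance_count, number_of_features, class_imbalance)
a = np.array([1000.0, 20.0, 0.4])
b = np.array([57.0, 16.0, 0.3])

# L2 (Euclidean) distance between the metafeature vectors.
print(np.linalg.norm(a - b))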

Let's see how we can quickly compute the distance between some datasets with dataset_distance()!

Dataset Distances P.1
import pandas as pd
import openml

from amltk.metalearning import compute_metafeatures

def get_dataset(dataset_id: int) -> tuple[pd.DataFrame, pd.Series]:
    dataset = openml.datasets.get_dataset(
        dataset_id,
        download_data=True,
        download_features_meta_data=False,
        download_qualities=False,
    )
    X, y, _, _ = dataset.get_data(
        dataset_format="dataframe",
        target=dataset.default_target_attribute,
    )
    return X, y

d31 = get_dataset(31)
d3 = get_dataset(3)
d4 = get_dataset(4)

metafeatures_dict = {
    "dataset_31": compute_metafeatures(*d31),
    "dataset_3": compute_metafeatures(*d3),
    "dataset_4": compute_metafeatures(*d4),
}

metafeatures = pd.DataFrame(metafeatures_dict)
print(metafeatures)
                                                     dataset_31  ...  dataset_4
instance_count                                      1000.000000  ...  57.000000
log_instance_count                                     6.907755  ...   4.043051
number_of_classes                                      2.000000  ...   2.000000
number_of_features                                    20.000000  ...  16.000000
log_number_of_features                                 2.995732  ...   2.772589
percentage_missing_values                              0.000000  ...   0.357456
percentage_of_instances_with_missing_values            0.000000  ...   0.982456
percentage_of_features_with_missing_values             0.000000  ...   1.000000
percentage_of_categorical_columns_with_missing_...     0.000000  ...   1.000000
percentage_of_categorical_values_with_missing_v...     0.000000  ...   0.410088
percentage_of_numeric_columns_with_missing_values      0.000000  ...   1.000000
percentage_of_numeric_values_with_missing_values       0.000000  ...   0.304825
number_of_numeric_features                             7.000000  ...   8.000000
number_of_categorical_features                        13.000000  ...   8.000000
ratio_numerical_features                               0.350000  ...   0.500000
ratio_categorical_features                             0.650000  ...   0.500000
ratio_features_to_instances                            0.020000  ...   0.280702
minority_class_imbalance                               0.200000  ...   0.149123
majority_class_imbalance                               0.200000  ...   0.149123
class_imbalance                                        0.400000  ...   0.298246
mean_categorical_imbalance                             0.500500  ...   0.308063
std_categorical_imbalance                              0.234994  ...   0.228906
skewness_mean                                          0.920379  ...   0.255076
skewness_std                                           0.904952  ...   1.420729
skewness_min                                          -0.531348  ...  -2.007217
skewness_max                                           1.949628  ...   3.318064
kurtosis_mean                                          0.924278  ...   2.046258
kurtosis_std                                           1.785467  ...   4.890029
kurtosis_min                                          -1.381449  ...  -2.035406
kurtosis_max                                           4.292590  ...  13.193069

[30 rows x 3 columns]

Now we want to know which of "dataset_3" and "dataset_4" is more similar to "dataset_31".

Dataset Distances P.2
from amltk.metalearning import dataset_distance

target = metafeatures_dict.pop("dataset_31")
others = metafeatures_dict

distances = dataset_distance(target, others, distance_metric="l2")
print(distances)
dataset_4     943.079572
dataset_3    2196.197231
Name: l2, dtype: float64

Seems like "dataset_4" is, by this notion, closer to "dataset_31" than "dataset_3". However, the metafeatures are not all on the same scale: many lie between 0 and 1, while some, like instance_count, can completely dominate the distance.
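You can check this directly against the metafeature table above: the difference in instance_count alone (1000 vs. 57, i.e. 943) accounts for almost the entire unscaled distance to "dataset_4".

import numpy as np

# Recompute the unscaled L2 distance to "dataset_4" by hand.
diff = metafeatures["dataset_31"] - metafeatures["dataset_4"]
print(np.sqrt((diff**2).sum()))  # ~943.08, almost all from instance_count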

Let's repeat the computation, but this time specify that a "minmax" scaling should be applied across the rows, so that each metafeature is rescaled to [0, 1] across the datasets.

Dataset Distances P.3
distances = dataset_distance(
    target,
    others,
    distance_metric="l2",
    scaler="minmax"
)
print(distances)
dataset_3    3.293831
dataset_4    3.480296
Name: l2, dtype: float64

Now "dataset_3" is considered more similar but the difference between the two is a lot less dramatic. In general, applying some scaling to values of different scales is required for metalearning.

You can also use an sklearn.preprocessing.MinMaxScaler, or any other scaler from scikit-learn for that matter.

Dataset Distances P.4
from sklearn.preprocessing import MinMaxScaler

distances = dataset_distance(
    target,
    others,
    distance_metric="l2",
    scaler=MinMaxScaler()
)
print(distances)
dataset_3    3.293831
dataset_4    3.480296
Name: l2, dtype: float64
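For instance, a StandardScaler would standardize each metafeature rather than map it to [0, 1]. The resulting distances (and possibly the ranking) will differ from the "minmax" values above, so no output is reproduced here.

from sklearn.preprocessing import StandardScaler

# Z-score each metafeature across the datasets instead of min-max scaling.
distances = dataset_distance(
    target,
    others,
    distance_metric="l2",
    scaler=StandardScaler(),
)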

def dataset_distance(target, dataset_metafeatures, *, distance_metric='l2', scaler=None, closest_n=None)

Calculates the distance between a target dataset and a set of datasets.

This uses the metafeatures of the datasets to calculate the distance.

PARAMETER DESCRIPTION
target

The target dataset's metafeatures.

TYPE: Series

dataset_metafeatures

A dictionary of dataset names to their metafeatures.

TYPE: Mapping[str, Series]

distance_metric

The method to use to calculate the distance. A callable takes in the target dataset's metafeatures and another dataset's metafeatures, and should return the distance between the two (see the usage sketch after the Returns description).

TYPE: DistanceMetric | NearestNeighborsDistance | NamedDistance DEFAULT: 'l2'

scaler

A scaler used to scale the metafeatures before distances are computed (see the custom-callable sketch after the source code below).

TYPE: TransformerMixin | Callable[[DataFrame], DataFrame] | Literal['minmax'] | None DEFAULT: None

closest_n

The number of closest datasets to return. If None, all datasets are returned.

TYPE: int | None DEFAULT: None

RETURNS DESCRIPTION
Series

Series with the index being the dataset name and the values being the distance.
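As a hedged usage sketch of the parameters above (reusing target and others from the earlier examples): distance_metric also accepts a callable over two metafeature Series, and closest_n trims the result. The mean_abs_diff function below is hypothetical and, being symmetric, does not depend on the order in which the two Series are passed internally.

import pandas as pd

# Hypothetical symmetric metric: mean absolute difference between two
# metafeature vectors (Series aligned on metafeature names).
def mean_abs_diff(x: pd.Series, y: pd.Series) -> float:
    return float((x - y).abs().mean())

# Only the single closest dataset is returned.
closest = dataset_distance(
    target,
    others,
    distance_metric=mean_abs_diff,
    scaler="minmax",
    closest_n=1,
)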

Source code in src/amltk/metalearning/dataset_distances.py
def dataset_distance(  # noqa: C901, PLR0912
    target: pd.Series,
    dataset_metafeatures: Mapping[str, pd.Series],
    *,
    distance_metric: (DistanceMetric | NearestNeighborsDistance | NamedDistance) = "l2",
    scaler: TransformerMixin
    | Callable[[pd.DataFrame], pd.DataFrame]
    | Literal["minmax"]
    | None = None,
    closest_n: int | None = None,
) -> pd.Series:
    """Calculates the distance between a target dataset and a set of datasets.

    This uses the metafeatures of the datasets to calculate the distance.

    Args:
        target: The target dataset's metafeatures.
        dataset_metafeatures: A dictionary of dataset names to their metafeatures.
        distance_metric: The method to use to calculate the distance.
            Takes in the target dataset's metafeatures and a dataset's
            metafeatures, and should return the distance between the two.
        scaler: A scaler to use to scale the metafeatures.
        closest_n: The number of closest datasets to return. If None, all datasets
            are returned.

    Returns:
        Series with the index being the dataset name and the values being the distance.
    """
    outname: str
    if isinstance(distance_metric, str):
        outname = distance_metric
    else:
        outname = funcname(distance_metric)

    if target.name is None:
        target = target.copy()
        target.name = "target-dataset"

    _method = (
        distance_metrics[distance_metric]
        if isinstance(distance_metric, str)
        else distance_metric
    )

    if not isinstance(_method, NearestNeighborsDistance):
        _method = _metric_for_frame(_method)

    metafeatures = {
        name: ds_metafeatures.rename(name)
        for name, ds_metafeatures in dataset_metafeatures.items()
    }

    # Index is dataset name with columns being the values
    #      | mf1 | mf2
    # d1
    # d2
    # d3
    combined = pd.concat([target, *metafeatures.values()], axis=1).T

    if scaler is None:
        pass
    elif scaler == "minmax":
        min_maxs = combined.agg(["min", "max"], axis=0).T

        mins = min_maxs["min"]
        maxs = min_maxs["max"]
        normalizer = maxs - mins
        # Capture the zero-range mask *before* rewriting the normalizer,
        # otherwise the `mins` assignment below would match nothing.
        constant = normalizer == 0
        mins[constant] = 0
        normalizer[constant] = 1

        norm = lambda row: (row - mins) / normalizer  # applied per dataset row
        combined = combined.apply(norm, axis=1)
    elif safe_isinstance(scaler, "TransformerMixin"):
        combined = scaler.set_output(transform="pandas").fit_transform(  # type: ignore
            combined,
        )
    elif callable(scaler):
        combined = scaler(combined)
    else:
        raise ValueError(f"Unsure how to handle {scaler=}")

    # We now transpose the dataframe so that the index is the metafeature name
    # while the columns are the dataset names
    #   x   | d1 | d2 | d3          y | dy
    #  mf1                      mf1
    #  mf2                      mf2
    x = combined.T.drop(columns=target.name)
    y = combined.loc[target.name]

    # Should return a series with index being dataset names and values being the
    #     | distance
    # d1
    # d2
    dataset_distances = _method(x, y)

    if not isinstance(dataset_distances, pd.Series):
        dataset_distances = pd.Series(
            dataset_distances,
            dtype=float,
            index=list(dataset_metafeatures.keys()),
            name=outname,
        )
    else:
        dataset_distances = dataset_distances.astype(float).rename(outname)

    dataset_distances = dataset_distances.sort_values()

    if closest_n is not None:
        if closest_n > len(dataset_distances):
            warnings.warn(
                f"Cannot get {closest_n} closest datasets when there are"
                f" only {len(dataset_distances)} datasets. Returning all.",
                UserWarning,
                stacklevel=2,
            )

        dataset_distances = dataset_distances.iloc[:closest_n]

    return dataset_distances
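Since scaler also accepts a plain Callable[[DataFrame], DataFrame], a custom scaling function can be passed directly. A hypothetical sketch (the frame it receives has datasets as rows and metafeatures as columns, as in the source above):

import pandas as pd

# Hypothetical custom scaler: z-score each metafeature column of the
# combined frame (rows are datasets, columns are metafeatures).
def zscore(df: pd.DataFrame) -> pd.DataFrame:
    std = df.std().replace(0, 1)  # keep constant metafeatures finite
    return (df - df.mean()) / std

distances = dataset_distance(target, others, distance_metric="l2", scaler=zscore)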