
Dataset distances

One common way to define how similar two datasets are is to compute a distance between them. This requires first computing a set of numeric features describing each dataset (metafeatures), so that a distance function can be applied to the resulting vectors.
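As a toy illustration of the idea (values loosely based on the metafeature table further below), two datasets reduced to metafeature vectors can be compared with, say, an L2 (Euclidean) distance:

import numpy as np

# Toy metafeature vectors for two datasets, in the same feature order:
# (instance_count, number_of_features, class_imbalance)
a = np.array([1000.0, 20.0, 0.4])
b = np.array([57.0, 16.0, 0.3])

# L2 (Euclidean) distance between the metafeature vectors.
print(np.linalg.norm(a - b))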

Let's see how we can quickly compute the distance between some datasets with dataset_distance()!

Dataset Distances P.1
import pandas as pd
import openml

from amltk.metalearning import compute_metafeatures

def get_dataset(dataset_id: int) -> tuple[pd.DataFrame, pd.Series]:
    dataset = openml.datasets.get_dataset(
        dataset_id,
        download_data=True,
        download_features_meta_data=False,
        download_qualities=False,
    )
    X, y, _, _ = dataset.get_data(
        dataset_format="dataframe",
        target=dataset.default_target_attribute,
    )
    return X, y

d31 = get_dataset(31)
d3 = get_dataset(3)
d4 = get_dataset(4)

metafeatures_dict = {
    "dataset_31": compute_metafeatures(*d31),
    "dataset_3": compute_metafeatures(*d3),
    "dataset_4": compute_metafeatures(*d4),
}

metafeatures = pd.DataFrame(metafeatures_dict)
print(metafeatures)
                                                     dataset_31  ...  dataset_4
instance_count                                      1000.000000  ...  57.000000
log_instance_count                                     6.907755  ...   4.043051
number_of_classes                                      2.000000  ...   2.000000
number_of_features                                    20.000000  ...  16.000000
log_number_of_features                                 2.995732  ...   2.772589
percentage_missing_values                              0.000000  ...   0.357456
percentage_of_instances_with_missing_values            0.000000  ...   0.982456
percentage_of_features_with_missing_values             0.000000  ...   1.000000
percentage_of_categorical_columns_with_missing_...     0.000000  ...   1.000000
percentage_of_categorical_values_with_missing_v...     0.000000  ...   0.410088
percentage_of_numeric_columns_with_missing_values      0.000000  ...   1.000000
percentage_of_numeric_values_with_missing_values       0.000000  ...   0.304825
number_of_numeric_features                             7.000000  ...   8.000000
number_of_categorical_features                        13.000000  ...   8.000000
ratio_numerical_features                               0.350000  ...   0.500000
ratio_categorical_features                             0.650000  ...   0.500000
ratio_features_to_instances                            0.020000  ...   0.280702
minority_class_imbalance                               0.200000  ...   0.149123
majority_class_imbalance                               0.200000  ...   0.149123
class_imbalance                                        0.400000  ...   0.298246
mean_categorical_imbalance                             0.500500  ...   0.308063
std_categorical_imbalance                              0.234994  ...   0.228906
skewness_mean                                          0.920379  ...   0.255076
skewness_std                                           0.904952  ...   1.420729
skewness_min                                          -0.531348  ...  -2.007217
skewness_max                                           1.949628  ...   3.318064
kurtosis_mean                                          0.924278  ...   2.046258
kurtosis_std                                           1.785467  ...   4.890029
kurtosis_min                                          -1.381449  ...  -2.035406
kurtosis_max                                           4.292590  ...  13.193069

[30 rows x 3 columns]

Now we want to know which of "dataset_3" and "dataset_4" is more similar to "dataset_31".

Dataset Distances P.2
from amltk.metalearning import dataset_distance

target = metafeatures_dict.pop("dataset_31")
others = metafeatures_dict

distances = dataset_distance(target, others, distance_metric="l2")
print(distances)
dataset_4     943.079572
dataset_3    2196.197231
Name: l2, dtype: float64

Seems like "dataset_4" is, by this notion, closer to "dataset_31" than "dataset_3". However, the metafeatures are not all on the same scale: many lie between 0 and 1, while some, like instance_count, can completely dominate the distance.
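You can check this directly against the metafeature table above: the difference in instance_count alone (1000 vs. 57, i.e. 943) accounts for almost the entire unscaled distance to "dataset_4".

import numpy as np

# Recompute the unscaled L2 distance to "dataset_4" by hand.
diff = metafeatures["dataset_31"] - metafeatures["dataset_4"]
print(np.sqrt((diff**2).sum()))  # ~943.08, almost all from instance_count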

Let's repeat the computation, but this time specify that a "minmax" scaling should be applied across the rows, so that each metafeature is rescaled to [0, 1] across the datasets.

Dataset Distances P.3
distances = dataset_distance(
    target,
    others,
    distance_metric="l2",
    scaler="minmax"
)
print(distances)
dataset_3    3.293831
dataset_4    3.480296
Name: l2, dtype: float64

Now "dataset_3" is considered more similar but the difference between the two is a lot less dramatic. In general, applying some scaling to values of different scales is required for metalearning.

You can also use an sklearn.preprocessing.MinMaxScaler, or any other scaler from scikit-learn for that matter.

Dataset Distances P.4
from sklearn.preprocessing import MinMaxScaler

distances = dataset_distance(
    target,
    others,
    distance_metric="l2",
    scaler=MinMaxScaler()
)
print(distances)
dataset_3    3.293831
dataset_4    3.480296
Name: l2, dtype: float64
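For instance, a StandardScaler would standardize each metafeature rather than map it to [0, 1]. The resulting distances (and possibly the ranking) will differ from the "minmax" values above, so no output is reproduced here.

from sklearn.preprocessing import StandardScaler

# Z-score each metafeature across the datasets instead of min-max scaling.
distances = dataset_distance(
    target,
    others,
    distance_metric="l2",
    scaler=StandardScaler(),
)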

def dataset_distance(target, dataset_metafeatures, *, distance_metric='l2', scaler=None, closest_n=None)

Calculates the distance between a target dataset and a set of datasets.

This uses the metafeatures of the datasets to calculate the distance.

PARAMETER DESCRIPTION
target

The target dataset's metafeatures.

TYPE: Series

dataset_metafeatures

A dictionary of dataset names to their metafeatures.

TYPE: Mapping[str, Series]

distance_metric

The method to use to calculate the distance. A callable takes in the target dataset's metafeatures and another dataset's metafeatures, and should return the distance between the two (see the usage sketch after the Returns description).

TYPE: DistanceMetric | NearestNeighborsDistance | NamedDistance DEFAULT: 'l2'

scaler

A scaler used to scale the metafeatures before distances are computed (see the custom-callable sketch after the source code below).

TYPE: TransformerMixin | Callable[[DataFrame], DataFrame] | Literal['minmax'] | None DEFAULT: None

closest_n

The number of closest datasets to return. If None, all datasets are returned.

TYPE: int | None DEFAULT: None

RETURNS DESCRIPTION
Series

Series with the index being the dataset name and the values being the distance.
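As a hedged usage sketch of the parameters above (reusing target and others from the earlier examples): distance_metric also accepts a callable over two metafeature Series, and closest_n trims the result. The mean_abs_diff function below is hypothetical and, being symmetric, does not depend on the order in which the two Series are passed internally.

import pandas as pd

# Hypothetical symmetric metric: mean absolute difference between two
# metafeature vectors (Series aligned on metafeature names).
def mean_abs_diff(x: pd.Series, y: pd.Series) -> float:
    return float((x - y).abs().mean())

# Only the single closest dataset is returned.
closest = dataset_distance(
    target,
    others,
    distance_metric=mean_abs_diff,
    scaler="minmax",
    closest_n=1,
)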

Source code in src/amltk/metalearning/dataset_distances.py
def dataset_distance(  # noqa: C901, PLR0912
    target: pd.Series,
    dataset_metafeatures: Mapping[str, pd.Series],
    *,
    distance_metric: (DistanceMetric | NearestNeighborsDistance | NamedDistance) = "l2",
    scaler: TransformerMixin
    | Callable[[pd.DataFrame], pd.DataFrame]
    | Literal["minmax"]
    | None = None,
    closest_n: int | None = None,
) -> pd.Series:
    """Calculates the distance between a target dataset and a set of datasets.

    This uses the metafeatures of the datasets to calculate the distance.

    Args:
        target: The target dataset's metafeatures.
        dataset_metafeatures: A dictionary of dataset names to their metafeatures.
        distance_metric: The method to use to calculate the distance.
            Takes in the target dataset's metafeatures and a dataset's
            metafeatures, and should return the distance between the two.
        scaler: A scaler to use to scale the metafeatures.
        closest_n: The number of closest datasets to return. If None, all datasets
            are returned.

    Returns:
        Series with the index being the dataset name and the values being the distance.
    """
    outname: str
    if isinstance(distance_metric, str):
        outname = distance_metric
    else:
        outname = funcname(distance_metric)

    if target.name is None:
        target = target.copy()
        target.name = "target-dataset"

    _method = (
        distance_metrics[distance_metric]
        if isinstance(distance_metric, str)
        else distance_metric
    )

    if not isinstance(_method, NearestNeighborsDistance):
        _method = _metric_for_frame(_method)

    metafeatures = {
        name: ds_metafeatures.rename(name)
        for name, ds_metafeatures in dataset_metafeatures.items()
    }

    # Index is dataset name with columns being the values
    #      | mf1 | mf2
    # d1
    # d2
    # d3
    combined = pd.concat([target, *metafeatures.values()], axis=1).T

    if scaler is None:
        pass
    elif scaler == "minmax":
        min_maxs = combined.agg(["min", "max"], axis=0).T

        mins = min_maxs["min"]
        maxs = min_maxs["max"]
        normalizer = maxs - mins
        # Capture the zero-range mask *before* rewriting the normalizer,
        # otherwise the `mins` assignment below would match nothing.
        constant = normalizer == 0
        mins[constant] = 0
        normalizer[constant] = 1

        norm = lambda row: (row - mins) / normalizer  # applied per dataset row
        combined = combined.apply(norm, axis=1)
    elif safe_isinstance(scaler, "TransformerMixin"):
        combined = scaler.set_output(transform="pandas").fit_transform(  # type: ignore
            combined,
        )
    elif callable(scaler):
        combined = scaler(combined)
    else:
        raise ValueError(f"Unsure how to handle {scaler=}")

    # We now transpose the dataframe so that the index is the metafeature name
    # while the columns are the dataset names
    #   x   | d1 | d2 | d3          y | dy
    #  mf1                      mf1
    #  mf2                      mf2
    x = combined.T.drop(columns=target.name)
    y = combined.loc[target.name]

    # Should return a series with index being dataset names and values being the
    #     | distance
    # d1
    # d2
    dataset_distances = _method(x, y)

    if not isinstance(dataset_distances, pd.Series):
        dataset_distances = pd.Series(
            dataset_distances,
            dtype=float,
            index=list(dataset_metafeatures.keys()),
            name=outname,
        )
    else:
        dataset_distances = dataset_distances.astype(float).rename(outname)

    dataset_distances = dataset_distances.sort_values()

    if closest_n is not None:
        if closest_n > len(dataset_distances):
            warnings.warn(
                f"Cannot get {closest_n} closest datasets when there are"
                f" only {len(dataset_distances)} datasets. Returning all.",
                UserWarning,
                stacklevel=2,
            )

        dataset_distances = dataset_distances.iloc[:closest_n]

    return dataset_distances
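Since scaler also accepts a plain Callable[[DataFrame], DataFrame], a custom scaling function can be passed directly. A hypothetical sketch (the frame it receives has datasets as rows and metafeatures as columns, as in the source above):

import pandas as pd

# Hypothetical custom scaler: z-score each metafeature column of the
# combined frame (rows are datasets, columns are metafeatures).
def zscore(df: pd.DataFrame) -> pd.DataFrame:
    std = df.std().replace(0, 1)  # keep constant metafeatures finite
    return (df - df.mean()) / std

distances = dataset_distance(target, others, distance_metric="l2", scaler=zscore)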