# Dataset distances

`amltk.metalearning.dataset_distances`
One common way to define how similar two datasets are is to compute some "similarity" between them. This notion of "similarity" requires first computing some features of a dataset (metafeatures), over which we can then compute a numeric distance function.

Let's see how quickly we can compute the distance between some datasets with `dataset_distance()`!
```python
import pandas as pd
import openml

from amltk.metalearning import compute_metafeatures

def get_dataset(dataset_id: int) -> tuple[pd.DataFrame, pd.Series]:
    dataset = openml.datasets.get_dataset(
        dataset_id,
        download_data=True,
        download_features_meta_data=False,
        download_qualities=False,
    )
    X, y, _, _ = dataset.get_data(
        dataset_format="dataframe",
        target=dataset.default_target_attribute,
    )
    return X, y

d31 = get_dataset(31)
d3 = get_dataset(3)
d4 = get_dataset(4)

metafeatures_dict = {
    "dataset_31": compute_metafeatures(*d31),
    "dataset_3": compute_metafeatures(*d3),
    "dataset_4": compute_metafeatures(*d4),
}
metafeatures = pd.DataFrame(metafeatures_dict)
print(metafeatures)
```
```
                                                     dataset_31  ...  dataset_4
instance_count                                      1000.000000  ...  57.000000
log_instance_count                                     6.907755  ...   4.043051
number_of_classes                                      2.000000  ...   2.000000
number_of_features                                    20.000000  ...  16.000000
log_number_of_features                                 2.995732  ...   2.772589
percentage_missing_values                              0.000000  ...   0.357456
percentage_of_instances_with_missing_values            0.000000  ...   0.982456
percentage_of_features_with_missing_values             0.000000  ...   1.000000
percentage_of_categorical_columns_with_missing_...     0.000000  ...   1.000000
percentage_of_categorical_values_with_missing_v...     0.000000  ...   0.410088
percentage_of_numeric_columns_with_missing_values      0.000000  ...   1.000000
percentage_of_numeric_values_with_missing_values       0.000000  ...   0.304825
number_of_numeric_features                             7.000000  ...   8.000000
number_of_categorical_features                        13.000000  ...   8.000000
ratio_numerical_features                               0.350000  ...   0.500000
ratio_categorical_features                             0.650000  ...   0.500000
ratio_features_to_instances                            0.020000  ...   0.280702
minority_class_imbalance                               0.200000  ...   0.149123
majority_class_imbalance                               0.200000  ...   0.149123
class_imbalance                                        0.400000  ...   0.298246
mean_categorical_imbalance                             0.500500  ...   0.308063
std_categorical_imbalance                              0.234994  ...   0.228906
skewness_mean                                          0.920379  ...   0.255076
skewness_std                                           0.904952  ...   1.420729
skewness_min                                          -0.531348  ...  -2.007217
skewness_max                                           1.949628  ...   3.318064
kurtosis_mean                                          0.924278  ...   2.046258
kurtosis_std                                           1.785467  ...   4.890029
kurtosis_min                                          -1.381449  ...  -2.035406
kurtosis_max                                           4.292590  ...  13.193069

[30 rows x 3 columns]
```
Now we want to know which one of "dataset_3" or "dataset_4" is more similar to "dataset_31".
```python
from amltk.metalearning import dataset_distance

target = metafeatures_dict.pop("dataset_31")
others = metafeatures_dict

distances = dataset_distance(target, others, distance_metric="l2")
print(distances)
```
It seems "dataset_3" is, by some notion, closer to "dataset_31" than "dataset_4" is. However, the metafeatures are not all on a comparable scale. For example, many lie in (0, 1), but some, like `instance_count`, can completely dominate the distance.
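To see why, here is a small illustration (with made-up numbers, not the metafeatures above) of how one large-magnitude feature swamps an unscaled L2 distance:

```python
import pandas as pd
import numpy as np

# Two hypothetical candidates described by two metafeatures.
# "a" matches the target on the large feature, "b" on the small one.
target = pd.Series({"instance_count": 1000.0, "class_imbalance": 0.4})
a = pd.Series({"instance_count": 1000.0, "class_imbalance": 0.9})
b = pd.Series({"instance_count": 950.0, "class_imbalance": 0.4})

def l2(u: pd.Series, v: pd.Series) -> float:
    """Plain Euclidean distance between two aligned Series."""
    return float(np.sqrt(((u - v) ** 2).sum()))

# instance_count dominates: "b" looks far away even though it matches
# the target exactly on the bounded metafeature.
print(l2(target, a))  # 0.5
print(l2(target, b))  # 50.0
```

A 50-instance difference outweighs any difference in the (0, 1)-bounded metafeatures, which is rarely what we mean by "similar datasets".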
Let's repeat the computation, but specify that a "minmax" scaling should be applied across the rows.
```python
distances = dataset_distance(
    target,
    others,
    distance_metric="l2",
    scaler="minmax",
)
print(distances)
```
Now "dataset_3" is still considered more similar, but the difference between the two is much less dramatic. In general, applying some scaling to values of different magnitudes is required for metalearning.

You can also use an `sklearn.preprocessing.MinMaxScaler`, or any other scaler from scikit-learn for that matter.
```python
from sklearn.preprocessing import MinMaxScaler

distances = dataset_distance(
    target,
    others,
    distance_metric="l2",
    scaler=MinMaxScaler(),
)
print(distances)
```
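The `scaler` parameter's type also admits a plain callable from `DataFrame` to `DataFrame`. As a sketch (assuming the callable receives a frame with metafeatures as rows and datasets as columns, matching the layout printed earlier), a hand-rolled row-wise min-max scaling could look like this; the toy values are made up for illustration:

```python
import pandas as pd

def minmax_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Scale each row (metafeature) into [0, 1] across datasets."""
    lo = df.min(axis=1)
    span = (df.max(axis=1) - lo).replace(0, 1)  # guard against constant rows
    return df.sub(lo, axis=0).div(span, axis=0)

# Toy metafeature frame: rows are metafeatures, columns are datasets.
df = pd.DataFrame(
    {"d1": [1000.0, 0.4], "d2": [50.0, 0.9]},
    index=["instance_count", "class_imbalance"],
)
print(minmax_rows(df))
#                   d1   d2
# instance_count   1.0  0.0
# class_imbalance  0.0  1.0
```

You could then pass `scaler=minmax_rows` to `dataset_distance()`; treat the exact orientation of the frame the callable receives as an assumption to verify against the library.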
## dataset_distance

```python
dataset_distance(
    target: Series,
    dataset_metafeatures: Mapping[str, Series],
    *,
    distance_metric: (
        DistanceMetric
        | NearestNeighborsDistance
        | NamedDistance
    ) = "l2",
    scaler: (
        TransformerMixin
        | Callable[[DataFrame], DataFrame]
        | Literal["minmax"]
        | None
    ) = None,
    closest_n: int | None = None,
) -> Series
```
Calculates the distance between a target dataset and a set of datasets.
This uses the metafeatures of the datasets to calculate the distance.
| PARAMETER | DESCRIPTION |
| --- | --- |
| `target` | The target dataset's metafeatures. TYPE: `Series` |
| `dataset_metafeatures` | A dictionary of dataset names to their metafeatures. TYPE: `Mapping[str, Series]` |
| `distance_metric` | The method used to calculate the distance. Takes in the target dataset's metafeatures and a candidate dataset's metafeatures; should return the distance between the two. TYPE: `DistanceMetric \| NearestNeighborsDistance \| NamedDistance` DEFAULT: `"l2"` |
| `scaler` | A scaler used to scale the metafeatures. TYPE: `TransformerMixin \| Callable[[DataFrame], DataFrame] \| Literal["minmax"] \| None` DEFAULT: `None` |
| `closest_n` | The number of closest datasets to return. If `None`, all datasets are returned. TYPE: `int \| None` DEFAULT: `None` |

| RETURNS | DESCRIPTION |
| --- | --- |
| `Series` | A Series with the index being the dataset name and the values being the distance. |
Source code in src/amltk/metalearning/dataset_distances.py