# Metafeatures

A `MetaFeature` is some statistic about a dataset/task that can be used to
make datasets or tasks more comparable, thus enabling meta-learning methods.

Calculating the metafeatures of a dataset is quite straightforward:
```python
import openml

from amltk.metalearning import compute_metafeatures

dataset = openml.datasets.get_dataset(
    31,  # credit-g
    download_data=True,
    download_features_meta_data=False,
    download_qualities=False,
)
X, y, _, _ = dataset.get_data(
    dataset_format="dataframe",
    target=dataset.default_target_attribute,
)

mfs = compute_metafeatures(X, y)
print(mfs)
```
```
instance_count                                           1000.000000
log_instance_count                                          6.907755
number_of_classes                                           2.000000
number_of_features                                         20.000000
log_number_of_features                                      2.995732
percentage_missing_values                                   0.000000
percentage_of_instances_with_missing_values                 0.000000
percentage_of_features_with_missing_values                  0.000000
percentage_of_categorical_columns_with_missing_values       0.000000
percentage_of_categorical_values_with_missing_values        0.000000
percentage_of_numeric_columns_with_missing_values           0.000000
percentage_of_numeric_values_with_missing_values            0.000000
number_of_numeric_features                                  7.000000
number_of_categorical_features                             13.000000
ratio_numerical_features                                    0.350000
ratio_categorical_features                                  0.650000
ratio_features_to_instances                                 0.020000
minority_class_imbalance                                    0.200000
majority_class_imbalance                                    0.200000
class_imbalance                                             0.400000
mean_categorical_imbalance                                  0.500500
std_categorical_imbalance                                   0.234994
skewness_mean                                               0.920379
skewness_std                                                0.904952
skewness_min                                               -0.531348
skewness_max                                                1.949628
kurtosis_mean                                               0.924278
kurtosis_std                                                1.785467
kurtosis_min                                               -1.381449
kurtosis_max                                                4.292590
dtype: float64
```
By default, `compute_metafeatures()` will calculate every `MetaFeature`
implemented, iterating through their subclasses to do so. You can also pass an
explicit list with `compute_metafeatures(X, y, features=[...])`.

Implementing your own is also quite straightforward:
```python
import openml
import pandas as pd  # needed for the type annotations below

from amltk.metalearning import MetaFeature, compute_metafeatures

dataset = openml.datasets.get_dataset(
    31,  # credit-g
    download_data=True,
    download_features_meta_data=False,
    download_qualities=False,
)
X, y, _, _ = dataset.get_data(
    dataset_format="dataframe",
    target=dataset.default_target_attribute,
)


class TotalValues(MetaFeature):
    @classmethod
    def compute(
        cls,
        x: pd.DataFrame,
        y: pd.Series | pd.DataFrame,
        dependancy_values: dict,
    ) -> int:
        return int(x.shape[0] * x.shape[1])


mfs = compute_metafeatures(X, y, features=[TotalValues])
print(mfs)
```
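Given credit-g's 1000 instances and 20 features (see the output above,
1000 * 20 = 20000), this should print something like:

```
total_values    20000.0
dtype: float64
```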
Many metafeatures rely on pre-computed dataset statistics that do not need to
be calculated more than once, so you can specify the dependencies of a
metafeature. When a statistic would return something other than a single
value, i.e. a `dict` or a `pd.DataFrame`, we instead call it a
`DatasetStatistic`. These will not be included in the result of
`compute_metafeatures()`.

A `DatasetStatistic` will only be calculated once per call to
`compute_metafeatures()`, so it can be re-used across all `MetaFeature`s that
require it as a dependency.
```python
import openml
import pandas as pd  # needed for the type annotations below

from amltk.metalearning import DatasetStatistic, MetaFeature, compute_metafeatures

dataset = openml.datasets.get_dataset(
    31,  # credit-g
    download_data=True,
    download_features_meta_data=False,
    download_qualities=False,
)
X, y, _, _ = dataset.get_data(
    dataset_format="dataframe",
    target=dataset.default_target_attribute,
)


class NAValues(DatasetStatistic):
    """A mask of all NA values in a dataset"""

    @classmethod
    def compute(
        cls,
        x: pd.DataFrame,
        y: pd.Series | pd.DataFrame,
        dependancy_values: dict,
    ) -> pd.DataFrame:
        return x.isna()


class PercentageNA(MetaFeature):
    """The percentage of values missing"""

    dependencies = (NAValues,)

    @classmethod
    def compute(
        cls,
        x: pd.DataFrame,
        y: pd.Series | pd.DataFrame,
        dependancy_values: dict,
    ) -> float:
        na_values = dependancy_values[NAValues]
        n_na = na_values.sum().sum()
        n_values = int(x.shape[0] * x.shape[1])
        return float(n_na / n_values)


mfs = compute_metafeatures(X, y, features=[PercentageNA])
print(mfs)
```
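Since credit-g has no missing values (`percentage_missing_values` was 0 in the
earlier output), this should print something like:

```
percentage_n_a    0.0
dtype: float64
```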
To view the description of a particular `MetaFeature`, you can call
`.description()` on it. Otherwise, you can access all of them as shown below:
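For example, a minimal sketch using `metafeature_descriptions()` (documented
in the reference further down; that it is importable from the
`amltk.metalearning.metafeatures` module is an assumption based on the source
file listed there):

```python
from amltk.metalearning.metafeatures import metafeature_descriptions

# Print each metafeature's name and description in the format shown below.
for name, description in metafeature_descriptions().items():
    print("---")
    print(name)
    print("---")
    print("*", description)
```

which prints: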
```
---
instance_count
---
* Number of instances in the dataset.
---
log_instance_count
---
* Logarithm of the number of instances in the dataset.
---
number_of_classes
---
* Number of classes in the dataset.
---
number_of_features
---
* Number of features in the dataset.
---
log_number_of_features
---
* Logarithm of the number of features in the dataset.
---
percentage_missing_values
---
* Percentage of missing values in the dataset.
---
percentage_of_instances_with_missing_values
---
* Percentage of instances with missing values.
---
percentage_of_features_with_missing_values
---
* Percentage of features with missing values.
---
percentage_of_categorical_columns_with_missing_values
---
* Percentage of categorical columns with missing values.
---
percentage_of_categorical_values_with_missing_values
---
* Percentage of categorical values with missing values.
---
percentage_of_numeric_columns_with_missing_values
---
* Percentage of numeric columns with missing values.
---
percentage_of_numeric_values_with_missing_values
---
* Percentage of numeric values with missing values.
---
number_of_numeric_features
---
* Number of numeric features in the dataset.
---
number_of_categorical_features
---
* Number of categorical features in the dataset.
---
ratio_numerical_features
---
* Ratio of numerical features to total features in the dataset.
---
ratio_categorical_features
---
* Ratio of categorical features to total features in the dataset.
---
ratio_features_to_instances
---
* Ratio of features to instances in the dataset.
---
minority_class_imbalance
---
* Imbalance of the minority class in the dataset. 0 => balanced, 1 => imbalanced.
---
majority_class_imbalance
---
* Imbalance of the majority class in the dataset. 0 => balanced, 1 => imbalanced.
---
class_imbalance
---
* Mean target imbalance of the classes in general. 0 => balanced, 1 => imbalanced.
---
mean_categorical_imbalance
---
* The mean imbalance of categorical features.
---
std_categorical_imbalance
---
* The std imbalance of categorical features.
---
skewness_mean
---
* The mean skewness of numerical features.
---
skewness_std
---
* The std skewness of numerical features.
---
skewness_min
---
* The min skewness of numerical features.
---
skewness_max
---
* The max skewness of numerical features.
---
kurtosis_mean
---
* The mean kurtosis of numerical features.
---
kurtosis_std
---
* The std kurtosis of numerical features.
---
kurtosis_min
---
* The min kurtosis of numerical features.
---
kurtosis_max
---
* The max kurtosis of numerical features.
---
total_values
---
*
---
percentage_n_a
---
* The percentage of values missing
```
class DatasetStatistic
(source: src/amltk/metalearning/metafeatures.py)

Base class for a dataset statistic. A dataset statistic is a function that
takes a dataset and returns some value(s) that describe the dataset.

If looking to create metafeatures, see the `MetaFeature` class, which
restricts the statistic to be a single number.
def description()  (classmethod)

def name()  (classmethod)
def compute(x, y, dependancy_values)  (abstractmethod, classmethod)

Compute the value of this statistic.

| PARAMETER | DESCRIPTION |
| --- | --- |
| `x` | The features of the dataset. |
| `y` | The labels of the dataset. |
| `dependancy_values` | A dictionary of dependency values. |

| RETURNS | DESCRIPTION |
| --- | --- |
| `S` | The value of this statistic. |
def retrieve(dependancy_values)  (classmethod)

Retrieve the value of this statistic from the dependency values.

| PARAMETER | DESCRIPTION |
| --- | --- |
| `dependancy_values` | A dictionary of dependency values. |

| RETURNS | DESCRIPTION |
| --- | --- |
| `S` | The value of this statistic. |
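As a hedged sketch, inside the `PercentageNA.compute()` example from earlier,
`retrieve()` could replace the direct dictionary lookup (assuming it simply
looks up this statistic's own entry in the dependency dict):

```python
# Hypothetical alternative to `dependancy_values[NAValues]`, assuming
# retrieve() looks up the statistic's own entry in the dependency dict.
na_values = NAValues.retrieve(dependancy_values)
```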
class MetaFeature

Bases: `DatasetStatistic[M]`

Used to indicate a metafeature to include. This differs from
`DatasetStatistic` in that it must return a single value.

def iter()  (classmethod)
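A minimal sketch of how this might be used; that `iter()` yields the known
`MetaFeature` subclasses is an assumption, based on how
`compute_metafeatures()` is described as discovering them:

```python
from amltk.metalearning import MetaFeature

# Assumption: iter() yields the registered MetaFeature subclasses,
# mirroring how compute_metafeatures() discovers them by default.
for metafeature in MetaFeature.iter():
    print(metafeature.name())
```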
class NAValues

class ClassImbalanceRatios

Bases: `DatasetStatistic[tuple[Series, float]]`

Imbalance ratios of each class in the dataset. Will return the ratios of each
class and the ratio expected if perfectly balanced.

class CategoricalImbalanceRatios

class CategoricalColumns

class NumericalColumns

class InstanceCount

class LogInstanceCount

class NumberOfClasses

class NumberOfFeatures

class LogNumberOfFeatures

class PercentageMissingValues

class PercentageOfInstancesWithMissingValues

class PercentageOfFeaturesWithMissingValues

class PercentageOfCategoricalColumnsWithMissingValues

class PercentageOfCategoricalValuesWithMissingValues

class PercentageOfNumericColumnsWithMissingValues

class PercentageOfNumericValuesWithMissingValues

class NumberOfNumericFeatures

class NumberOfCategoricalFeatures

class RatioNumericalFeatures

class RatioCategoricalFeatures

class RatioFeaturesToInstances

class ClassCounts

class MinorityClassImbalance

Bases: `MetaFeature[float]`

Imbalance of the minority class in the dataset. 0 => balanced, 1 => imbalanced.

class MajorityClassImbalance

Bases: `MetaFeature[float]`

Imbalance of the majority class in the dataset. 0 => balanced, 1 => imbalanced.

class ClassImbalance

Bases: `MetaFeature[float]`

Mean target imbalance of the classes in general. 0 => balanced, 1 => imbalanced.

class ImbalancePerCategory

Bases: `DatasetStatistic[dict[str, float]]`

Imbalance of each categorical feature. 0 => balanced, 1 => most imbalanced.
No categories implies perfectly balanced.

class MeanCategoricalImbalance

class StdCategoricalImbalance

class SkewnessPerNumericalColumn

class SkewnessMean

class SkewnessStd

class SkewnessMin

class SkewnessMax

class KurtosisPerNumericalColumn

class KurtosisMean

class KurtosisStd

class KurtosisMin

class KurtosisMax
def imbalance_ratios(col)

Compute the imbalance ratio of a categorical column. This is done by computing
the distance of each item's ratio to what a perfectly balanced ratio would be.
We then sum up the distances, dividing by the worst case to normalize between
0 and 1.

| PARAMETER | DESCRIPTION |
| --- | --- |
| `col` | A column of values. If a `DataFrame`, the values from the subset of columns will be used. |

| RETURNS | DESCRIPTION |
| --- | --- |
| `Series` | A tuple of the imbalance ratios, sorted from lowest (0) to highest (1), |
| `float` | and the expected ratio if perfectly balanced. |
def column_imbalance(ratios, balanced_ratio)

Compute the imbalance of a column. This is done by computing the distance of
each item's ratio to what a perfectly balanced ratio would be. We then sum up
the distances, dividing by the worst case to normalize between 0 and 1.
0 indicates a perfectly balanced column; 1 indicates a column where all items
are of the same type.

| PARAMETER | DESCRIPTION |
| --- | --- |
| `ratios` | The ratios of each item in the column. |
| `balanced_ratio` | The ratio of a column if perfectly balanced. |

| RETURNS | DESCRIPTION |
| --- | --- |
| `float` | The imbalance of the column. |
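To make the normalization concrete, here is a small hedged sketch; the import
path `amltk.metalearning.metafeatures` is an assumption taken from the source
file referenced above, and the expected value follows from the description
(for ratios (0.75, 0.25) against a balanced ratio of 0.5, the summed distance
is 0.5 and the two-category worst case is 1.0):

```python
import pandas as pd

# Assumption: these helpers are importable from the source module
# referenced above (src/amltk/metalearning/metafeatures.py).
from amltk.metalearning.metafeatures import column_imbalance, imbalance_ratios

col = pd.Series(["a", "a", "a", "b"])  # a 75% / 25% split over two categories

# ratios: each category's share of the column; balanced: the share each
# category would have if perfectly balanced (0.5 for two categories).
ratios, balanced = imbalance_ratios(col)

# Distances to balanced: |0.75 - 0.5| + |0.25 - 0.5| = 0.5; the worst case
# (all values "a") sums to 1.0, so this should print roughly 0.5.
print(column_imbalance(ratios, balanced))
```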
def metafeature_descriptions(features=None)

Get the descriptions of the metafeatures available.

| PARAMETER | DESCRIPTION |
| --- | --- |
| `features` | The metafeatures. If `None`, all subclasses of `MetaFeature` are used. |

| RETURNS | DESCRIPTION |
| --- | --- |
| `dict[str, str]` | The descriptions of the metafeatures. |
def compute_metafeatures(X, y, *, features=None)

Compute metafeatures for a dataset.

| PARAMETER | DESCRIPTION |
| --- | --- |
| `X` | The features of the dataset. |
| `y` | The labels of the dataset. |
| `features` | The metafeatures to compute. If `None`, all subclasses of `MetaFeature` are computed. |

| RETURNS | DESCRIPTION |
| --- | --- |
| `Series` | A series of metafeatures. |