Data
amltk.sklearn.data
Data utilities for scikit-learn.
split_data
split_data(
    *items: Sequence,
    splits: dict[str, float],
    seed: Seed | None = None,
    shuffle: bool = True,
    stratify: Sequence | None = None
) -> dict[str, tuple[Sequence, ...]]
Split a set of items into multiple splits.
from amltk.sklearn.data import split_data
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
splits = split_data(x, y, splits={"train": 0.6, "val": 0.2, "test": 0.2})
train_x, train_y = splits["train"]
val_x, val_y = splits["val"]
test_x, test_y = splits["test"]
PARAMETER | DESCRIPTION |
---|---|
items | The items to split. Must be indexable, such as a list, np.ndarray, pandas DataFrame/Series or tuple. TYPE: `Sequence` |
splits | A dictionary mapping split names to their fraction of the data. The fractions must sum to 1. TYPE: `dict[str, float]` |
seed | The seed to use for the random state. TYPE: `Seed \| None` DEFAULT: `None` |
shuffle | Whether to shuffle the data before splitting. Passed forward to sklearn.model_selection.train_test_split. TYPE: `bool` DEFAULT: `True` |
stratify | The values to stratify on. Passed forward to sklearn.model_selection.train_test_split. Stratification is applied to every split: the stratification values are themselves split alongside the items so that later splits can still be stratified. TYPE: `Sequence \| None` DEFAULT: `None` |
RETURNS | DESCRIPTION |
---|---|
`dict[str, tuple[Sequence, ...]]` | A dictionary of split names and their split items. |
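The stratification note above can be made concrete with a minimal sketch. It assumes a multi-way split is built by chaining calls to sklearn.model_selection.train_test_split, with the stratification values split alongside the items at each step. The name `chained_split` is hypothetical and this is not amltk's actual implementation, only an illustration of the idea under that assumption.

```python
from __future__ import annotations

from collections.abc import Sequence

from sklearn.model_selection import train_test_split


def chained_split(
    *items: Sequence,
    splits: dict[str, float],
    seed: int | None = None,
    shuffle: bool = True,
    stratify: Sequence | None = None,
) -> dict[str, tuple[Sequence, ...]]:
    """Hypothetical sketch of an N-way split, NOT amltk's own code."""
    if abs(sum(splits.values()) - 1.0) > 1e-8:
        raise ValueError(f"Split fractions must sum to 1, got {splits}")

    names = list(splits)
    remaining = list(items)
    remaining_frac = 1.0
    result: dict[str, tuple[Sequence, ...]] = {}

    for name in names[:-1]:
        # Fraction of the data still left that this split should take.
        frac = splits[name] / remaining_frac
        # Split the stratification values too, so the next round can use them.
        arrays = remaining if stratify is None else [*remaining, stratify]
        out = train_test_split(
            *arrays,
            train_size=frac,
            random_state=seed,
            shuffle=shuffle,
            stratify=stratify,
        )
        # train_test_split returns (a_taken, a_rest, b_taken, b_rest, ...).
        taken, rest = out[0::2], out[1::2]
        if stratify is not None:
            # The last array was the stratification values: drop it from the
            # outputs and keep only its leftover part for the next round.
            taken, stratify = taken[:-1], rest[-1]
            rest = rest[:-1]
        result[name] = tuple(taken)
        remaining = list(rest)
        remaining_frac -= splits[name]

    # Whatever is left over forms the final split.
    result[names[-1]] = tuple(remaining)
    return result


x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
parts = chained_split(
    x, y, splits={"train": 0.6, "val": 0.2, "test": 0.2}, seed=0, stratify=y
)
train_x, train_y = parts["train"]
```

Note that sklearn's train_test_split rejects `stratify` when `shuffle=False`, so stratified splitting needs shuffling enabled in this sketch.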
Source code in src/amltk/sklearn/data.py
train_val_test_split
train_val_test_split(
    *items: Sequence,
    splits: tuple[float, float, float],
    seed: Seed | None = None,
    shuffle: bool = True,
    stratify: Sequence | None = None
) -> tuple[Sequence, ...]
Split a set of items into train, val and test splits.
from amltk.sklearn.data import train_val_test_split
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
train_x, train_y, val_x, val_y, test_x, test_y = train_val_test_split(
    x, y, splits=(0.6, 0.2, 0.2),
)
PARAMETER | DESCRIPTION |
---|---|
items | The items to split. Must be indexable, such as a list, np.ndarray, pandas DataFrame/Series or tuple. TYPE: `Sequence` |
splits | A tuple of the fractions of the data to use for the train, val and test splits, in that order. The fractions must sum to 1. TYPE: `tuple[float, float, float]` |
seed | The seed to use for the random state. TYPE: `Seed \| None` DEFAULT: `None` |
shuffle | Whether to shuffle the data before splitting. Passed forward to sklearn.model_selection.train_test_split. TYPE: `bool` DEFAULT: `True` |
stratify | The values to stratify on. Passed forward to sklearn.model_selection.train_test_split. Stratification is applied to every split: the stratification values are themselves split alongside the items so that later splits can still be stratified. TYPE: `Sequence \| None` DEFAULT: `None` |
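RETURNS | DESCRIPTION |
---|---|
`tuple[Sequence, ...]` | A flat tuple containing the train, val and test splits of each item. For two items `x` and `y`, the example above unpacks it as `(train_x, train_y, val_x, val_y, test_x, test_y)`. |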
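To illustrate how the flat return value relates to the named splits of split_data, here is a small sketch. The helper `flatten_splits` is hypothetical and not part of amltk, and the `named` dictionary is hand-written illustrative data rather than real library output. It only shows the unpacking order used in the example above: all train items first, then val, then test.

```python
from __future__ import annotations

from collections.abc import Sequence


def flatten_splits(
    splits: dict[str, tuple[Sequence, ...]],
    order: tuple[str, ...] = ("train", "val", "test"),
) -> tuple[Sequence, ...]:
    """Hypothetical helper: lay named splits out in flat unpacking order."""
    return tuple(item for name in order for item in splits[name])


# Hand-written stand-in for a dictionary of named splits (illustrative only).
named = {
    "train": ([1, 2, 3, 4, 5, 6], [0, 0, 0, 0, 1, 1]),
    "val": ([7, 8], [1, 1]),
    "test": ([9, 10], [1, 1]),
}

train_x, train_y, val_x, val_y, test_x, test_y = flatten_splits(named)
```

With more than two items per call, the same layout would extend naturally: each split contributes one entry per item, in the order the items were passed.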