Optimizing RandomForest using DEHB¶
This notebook builds on the template from the 00_interfacing_DEHB notebook and applies it to an actual problem: optimizing the hyperparameters of a Random Forest model on a dataset.
Additional requirements:
- scikit-learn>=0.24
import time
import numpy as np
import warnings
seed = 123
np.random.seed(seed)
warnings.filterwarnings('ignore')
The problem defined here is to optimize a Random Forest model, on any given dataset, using DEHB. The hyperparameters chosen to be optimized are:
max_depth
min_samples_split
max_features
min_samples_leaf
while the n_estimators
hyperparameter of the Random Forest is chosen to be the fidelity parameter instead. Too few trees ($<10$) may not allow adequate ensembling for the aggregated prediction to be significantly better than the individual tree predictions, whereas a large number of trees (~$100$) often gives accurate predictions but is naturally slower to train and predict with, on account of having more trees to fit. Therefore, a smaller n_estimators
can be used as a cheaper approximation of the full fidelity of n_estimators=100
, as the quick timing check below illustrates.
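A quick, illustrative timing check (a sketch, not part of the original setup) makes this trade-off concrete by fitting the same forest with few versus many trees; load_wine is used here purely as a stand-in dataset.
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

X, y = load_wine(return_X_y=True)
for n_trees in (2, 100):
    start = time.time()
    RandomForestClassifier(n_estimators=n_trees, random_state=seed).fit(X, y)
    # more trees give a stronger ensemble, but fitting takes proportionally longer
    print("n_estimators={:>3}: fit took {:.3f}s".format(n_trees, time.time() - start))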
Defining fidelity range¶
min_fidelity, max_fidelity = 2, 50
For the remaining $4$ hyperparameters, the search space can be created as a ConfigSpace
object, with the domain of each individual parameter defined.
Creating search space¶
import ConfigSpace as CS
def create_search_space(seed=123):
"""Parameter space to be optimized --- contains the hyperparameters
"""
cs = CS.ConfigurationSpace(seed=seed)
cs.add_hyperparameters([
CS.UniformIntegerHyperparameter(
'max_depth', lower=1, upper=15, default_value=2, log=False
),
CS.UniformIntegerHyperparameter(
'min_samples_split', lower=2, upper=128, default_value=2, log=True
),
CS.UniformFloatHyperparameter(
'max_features', lower=0.1, upper=0.9, default_value=0.5, log=False
),
CS.UniformIntegerHyperparameter(
'min_samples_leaf', lower=1, upper=64, default_value=1, log=True
),
])
return cs
cs = create_search_space(seed)
print(cs)
Configuration space object:
  Hyperparameters:
    max_depth, Type: UniformInteger, Range: [1, 15], Default: 2
    max_features, Type: UniformFloat, Range: [0.1, 0.9], Default: 0.5
    min_samples_leaf, Type: UniformInteger, Range: [1, 64], Default: 1, on log-scale
    min_samples_split, Type: UniformInteger, Range: [2, 128], Default: 2, on log-scale
dimensions = len(cs.get_hyperparameters())
print("Dimensionality of search space: {}".format(dimensions))
Dimensionality of search space: 4
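As a quick sanity check, a random configuration can be drawn from this space:
# Drawing one random configuration from the search space
sample_config = cs.sample_configuration()
print(sample_config)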
Now the primary black/gray-box interface to the Random Forest model needs to be built for DEHB to query. As shown in the 00_interfacing_DEHB
notebook, this function has a signature akin to: target_function(config, fidelity)
, and returns a dictionary containing the fitness (score)
and the cost
of the evaluation. It must be noted that DEHB minimizes, and therefore the fitness
returned by this target_function
should account for that, for example by negating an accuracy score. A minimal skeleton follows.
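In outline (a minimal sketch for illustration, not the actual function used in this notebook), a DEHB target function looks like this:
def sketch_target_function(config, fidelity, **kwargs):
    start = time.time()
    # ... train a model configured by `config`, using `fidelity` resources ...
    score = 0.0                     # placeholder validation score (higher is better)
    cost = time.time() - start      # e.g., wall-clock time of this evaluation
    return {
        "fitness": -score,          # DEHB minimizes, so negate a score to be maximized
        "cost": cost,
        "info": {}                  # optional extra information to record
    }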
In this example, the target function trains a Random Forest model on a dataset. We load a dataset here and maintain a fixed, train-validation-test split for one complete run. Multiple DEHB runs can therefore optimize on the same validation split, and evaluate final performance on the same test set.
Creating target function to optimize (2 parts)¶
1 ) Preparing dataset and splits¶
from sklearn.datasets import load_iris, load_digits, load_wine, load_diabetes
classification = {"iris": load_iris, "digits": load_digits, "wine": load_wine}
# Regression datasets, needed by the else-branch of prepare_dataset below;
# only diabetes is included here as an example
regression = {"diabetes": load_diabetes}
from sklearn.model_selection import train_test_split
def prepare_dataset(model_type="classification", dataset=None):
if model_type == "classification":
if dataset is None:
dataset = np.random.choice(list(classification.keys()))
_data = classification[dataset]()
else:
if dataset is None:
dataset = np.random.choice(list(regression.keys()))
_data = regression[dataset]()
train_X, rest_X, train_y, rest_y = train_test_split(
_data.get("data"),
_data.get("target"),
train_size=0.7,
shuffle=True,
random_state=seed
)
# 10% test and 20% validation data
valid_X, test_X, valid_y, test_y = train_test_split(
rest_X, rest_y,
test_size=0.3333,
shuffle=True,
random_state=seed
)
return train_X, train_y, valid_X, valid_y, test_X, test_y, dataset
train_X, train_y, valid_X, valid_y, test_X, test_y, dataset = \
prepare_dataset(model_type="classification")
print(dataset)
print("Train size: {}\nValid size: {}\nTest size: {}".format(
train_X.shape, valid_X.shape, test_X.shape
))
wine
Train size: (124, 13)
Valid size: (36, 13)
Test size: (18, 13)
2 ) Function interface with DEHB¶
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, make_scorer
accuracy_scorer = make_scorer(accuracy_score)
def target_function(config, fidelity, **kwargs):
# Extracting support information
seed = kwargs["seed"]
train_X = kwargs["train_X"]
train_y = kwargs["train_y"]
valid_X = kwargs["valid_X"]
valid_y = kwargs["valid_y"]
max_fidelity = kwargs["max_fidelity"]
if fidelity is None:
fidelity = max_fidelity
start = time.time()
# Building model
model = RandomForestClassifier(
**config.get_dictionary(),
n_estimators=int(fidelity),
bootstrap=True,
random_state=seed,
)
# Training the model on the complete training set
model.fit(train_X, train_y)
# Evaluating the model on the validation set
valid_accuracy = accuracy_scorer(model, valid_X, valid_y)
cost = time.time() - start
# Evaluating the model on the test set (taken from the enclosing scope) as additional info
test_accuracy = accuracy_scorer(model, test_X, test_y)
result = {
"fitness": -valid_accuracy, # DE/DEHB minimizes
"cost": cost,
"info": {
"test_score": test_accuracy,
"fidelity": fidelity
}
}
return result
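Before handing this function to DEHB, it can be sanity-checked on a single sampled configuration (an illustrative one-off call):
# One-off evaluation of the target function on a random configuration
check_config = cs.sample_configuration()
print(target_function(
    check_config, fidelity=10,
    seed=seed, train_X=train_X, train_y=train_y,
    valid_X=valid_X, valid_y=valid_y, max_fidelity=max_fidelity
))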
We now have all the components needed to define the problem to be optimized. DEHB can be initialized with all of this information.
Running DEHB¶
from dehb import DEHB
dehb = DEHB(
f=target_function,
cs=cs,
dimensions=dimensions,
min_fidelity=min_fidelity,
max_fidelity=max_fidelity,
n_workers=1,
output_path="./temp"
)
2024-08-12 11:52:31.013 | WARNING | dehb.optimizers.dehb:__init__:264 - A checkpoint already exists, results could potentially be overwritten.
trajectory, runtime, history = dehb.run(
total_cost=10,
# parameters expected as **kwargs in target_function are passed here
seed=123,
train_X=train_X,
train_y=train_y,
valid_X=valid_X,
valid_y=valid_y,
max_fidelity=dehb.max_fidelity
)
2024-08-12 11:52:41.024 | WARNING | dehb.optimizers.dehb:_timeout_handler:352 - Runtime budget exhausted. Saving optimization checkpoint now.
print(len(trajectory), len(runtime), len(history), end="\n\n")
# Last recorded function evaluation
last_eval = history[-1]
config_id, config, score, cost, fidelity, _info = last_eval
print("Last evaluated configuration, ")
print(dehb.vector_to_configspace(config), end="")
print("got a score of {}, was evaluated at a fidelity of {:.2f} and "
"took {:.3f} seconds to run.".format(score, fidelity, cost))
print("The additional info attached: {}".format(_info))
439 439 439

Last evaluated configuration, 
Configuration(values={
  'max_depth': 8,
  'max_features': 0.6243371076843,
  'min_samples_leaf': 10,
  'min_samples_split': 3,
})
got a score of -1.0, was evaluated at a fidelity of 50.00 and took 0.060 seconds to run.
The additional info attached: {'test_score': 1.0, 'fidelity': 50.0}
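The incumbent (best configuration seen so far) can also be inspected directly; dehb.inc_config is the attribute this notebook uses again below, and inc_score is assumed here to hold the corresponding fitness:
# Inspecting the incumbent; `inc_score` is assumed to mirror `inc_config`
incumbent = dehb.vector_to_configspace(dehb.inc_config)
print(incumbent)
print("Incumbent fitness: {}".format(dehb.inc_score))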
Below, we let DEHB optimize over $5$ different runs. The reset()
method allows DEHB to begin optimization from scratch by clearing all history and starting with fresh random samples. Each DEHB run lasts just $10$ seconds, as set by total_cost=10
. We then report the mean and the standard deviation of the best score seen across these $5$ runs.
runs = 5
best_config_list = []
for i in range(runs):
# Resetting to begin optimization again
dehb.reset()
# Executing a run of DEHB optimization lasting for 10s
trajectory, runtime, history = dehb.run(
total_cost=10,
seed=123,
train_X=train_X,
train_y=train_y,
valid_X=valid_X,
valid_y=valid_y,
max_fidelity=dehb.max_fidelity
)
best_config = dehb.vector_to_configspace(dehb.inc_config)
# Creating a model using the best configuration found
model = RandomForestClassifier(
**best_config.get_dictionary(),
n_estimators=int(max_fidelity),
bootstrap=True,
random_state=seed,
)
# Training the model on the complete training set
model.fit(
np.concatenate((train_X, valid_X)),
np.concatenate((train_y, valid_y))
)
# Evaluating the model on the held-out test set
test_accuracy = accuracy_scorer(model, test_X, test_y)
best_config_list.append((best_config, test_accuracy))
2024-08-12 11:52:51.100 | WARNING | dehb.optimizers.dehb:_timeout_handler:352 - Runtime budget exhausted. Saving optimization checkpoint now.
2024-08-12 11:53:01.199 | WARNING | dehb.optimizers.dehb:_timeout_handler:352 - Runtime budget exhausted. Saving optimization checkpoint now.
2024-08-12 11:53:11.333 | WARNING | dehb.optimizers.dehb:_timeout_handler:352 - Runtime budget exhausted. Saving optimization checkpoint now.
2024-08-12 11:53:21.448 | WARNING | dehb.optimizers.dehb:_timeout_handler:352 - Runtime budget exhausted. Saving optimization checkpoint now.
2024-08-12 11:53:31.569 | WARNING | dehb.optimizers.dehb:_timeout_handler:352 - Runtime budget exhausted. Saving optimization checkpoint now.
print("Mean score across trials: ", np.mean([score for _, score in best_config_list]))
print("Std. dev. of score across trials: ", np.std([score for _, score in best_config_list]))
Mean score across trials:  1.0
Std. dev. of score across trials:  0.0
for config, score in best_config_list:
print("{} got an accuracy of {} on the test set.".format(config, score))
print()
Configuration(values={
  'max_depth': 15,
  'max_features': 0.7478507781239,
  'min_samples_leaf': 7,
  'min_samples_split': 2,
}) got an accuracy of 1.0 on the test set.

Configuration(values={
  'max_depth': 15,
  'max_features': 0.7478507781239,
  'min_samples_leaf': 7,
  'min_samples_split': 2,
}) got an accuracy of 1.0 on the test set.

Configuration(values={
  'max_depth': 15,
  'max_features': 0.7478507781239,
  'min_samples_leaf': 7,
  'min_samples_split': 2,
}) got an accuracy of 1.0 on the test set.

Configuration(values={
  'max_depth': 15,
  'max_features': 0.7478507781239,
  'min_samples_leaf': 7,
  'min_samples_split': 2,
}) got an accuracy of 1.0 on the test set.

Configuration(values={
  'max_depth': 15,
  'max_features': 0.7478507781239,
  'min_samples_leaf': 7,
  'min_samples_split': 2,
}) got an accuracy of 1.0 on the test set.