Optimizing RandomForest using the Ask & Tell interface¶
This notebook builds on the template from 00_interfacing_DEHB and applies it to an actual problem: optimizing the hyperparameters of a Random Forest model on a dataset. Here we use DEHB with its built-in ask and tell functionality.
Additional requirements:
- scikit-learn>=0.24
import time
import numpy as np
import warnings
seed = 123
np.random.seed(seed)
warnings.filterwarnings('ignore')
The problem defined here is to optimize a Random Forest model, on any given dataset, using DEHB. The hyperparameters chosen to be optimized are:
- max_depth
- min_samples_split
- max_features
- min_samples_leaf
The n_estimators hyperparameter of the Random Forest is chosen to be the fidelity parameter instead. A small number of trees may not allow adequate ensembling for the grouped prediction to be significantly better than the individual tree predictions, whereas a large number of trees often gives accurate predictions but is naturally slower to train and to predict with. Therefore, a smaller n_estimators can be used as a cheaper approximation of the actual fidelity of n_estimators=100.
Defining fidelity range¶
min_fidelity, max_fidelity = 2, 50
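To get a feel for which n_estimators values this range can translate into, the sketch below builds a geometrically spaced ladder of fidelities between min_fidelity and max_fidelity, in the style of Hyperband. The helper name fidelity_ladder and the reduction factor eta=3 are assumptions made for illustration and are not taken from this notebook; DEHB's internal schedule may differ in detail.

```python
import math


def fidelity_ladder(min_fidelity, max_fidelity, eta=3):
    """Return geometrically spaced fidelities, lowest first.

    Hypothetical helper for illustration; eta=3 is an assumed reduction factor.
    """
    # Number of levels that fit between the two bounds when dividing by eta
    n_levels = int(math.log(max_fidelity / min_fidelity) / math.log(eta)) + 1
    # The highest level equals max_fidelity; each lower level is divided by eta
    return [max_fidelity * eta ** -(n_levels - 1 - i) for i in range(n_levels)]


levels = fidelity_ladder(2, 50)
# The target function in this notebook casts the fidelity with int()
tree_counts = [int(f) for f in levels]
print(levels)
print(tree_counts)
```

Under these assumptions the ladder has three levels, and the middle one (about 16.67) agrees with the fidelity reported in the run output later in this notebook.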
For the remaining hyperparameters, the search space can be created as a ConfigSpace object, with the domain of individual parameters defined.
Creating search space¶
import ConfigSpace as CS
def create_search_space(seed=123):
    """Parameter space to be optimized --- contains the hyperparameters
    """
    cs = CS.ConfigurationSpace(seed=seed)
    cs.add_hyperparameters([
        CS.UniformIntegerHyperparameter(
            'max_depth', lower=1, upper=15, default_value=2, log=False
        ),
        CS.UniformIntegerHyperparameter(
            'min_samples_split', lower=2, upper=128, default_value=2, log=True
        ),
        CS.UniformFloatHyperparameter(
            'max_features', lower=0.1, upper=0.9, default_value=0.5, log=False
        ),
        CS.UniformIntegerHyperparameter(
            'min_samples_leaf', lower=1, upper=64, default_value=1, log=True
        ),
    ])
    return cs
cs = create_search_space(seed)
print(cs)
Configuration space object:
  Hyperparameters:
    max_depth, Type: UniformInteger, Range: [1, 15], Default: 2
    max_features, Type: UniformFloat, Range: [0.1, 0.9], Default: 0.5
    min_samples_leaf, Type: UniformInteger, Range: [1, 64], Default: 1, on log-scale
    min_samples_split, Type: UniformInteger, Range: [2, 128], Default: 2, on log-scale
dimensions = len(cs.get_hyperparameters())
print("Dimensionality of search space: {}".format(dimensions))
Dimensionality of search space: 4
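Two of the four hyperparameters are declared with log=True. As a rough illustration of what a log scale means for an integer range, the sketch below samples uniformly in log space and rounds back to an integer. The function sample_log_uniform_int is a hypothetical helper written for this example; it is not ConfigSpace's implementation, which may handle rounding and bounds differently.

```python
import math
import random


def sample_log_uniform_int(lower, upper, rng):
    """Draw an integer whose logarithm is uniform on [log(lower), log(upper)].

    Hypothetical helper for illustration only.
    """
    log_value = rng.uniform(math.log(lower), math.log(upper))
    # Round back to an integer and clip to the declared bounds
    return min(upper, max(lower, round(math.exp(log_value))))


rng = random.Random(123)
# Same bounds as min_samples_split in the search space above
samples = [sample_log_uniform_int(2, 128, rng) for _ in range(1000)]
# On a log scale, [2, 16] and [16, 128] each span three doublings,
# so roughly half of the draws should fall at or below 16
share_small = sum(s <= 16 for s in samples) / len(samples)
print(share_small)
```

The practical effect is that small values such as 2, 4 and 8 are explored about as often as large ones such as 32, 64 and 128, which a linear scale would not do.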
Now the primary black/gray-box interface to the Random Forest model needs to be built for DEHB to query. As given in the 00_interfacing_DEHB notebook, this function will have a signature akin to target_function(config, fidelity) and return the score and the cost of the evaluation; in this notebook they are returned in a dictionary under the keys "fitness" and "cost". Note that DEHB minimizes, so the score returned by this target_function must account for that, for example by negating a metric that should be maximized.
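As a minimal sketch of this contract, the toy function below follows the same shape as the target function defined later in this notebook: it takes a configuration and a fidelity, and returns a dictionary with "fitness", "cost" and "info". The name toy_target_function, the single parameter "x" and the quadratic score are all made up for illustration.

```python
import time


def toy_target_function(config, fidelity, **kwargs):
    """Hypothetical target function: higher score is better, so it is negated."""
    start = time.time()
    # Made-up score in (-inf, 1], maximal at x = 0.3
    score = 1.0 - (config["x"] - 0.3) ** 2
    cost = time.time() - start
    return {
        "fitness": -score,  # negated because the optimizer minimizes
        "cost": cost,
        "info": {"fidelity": fidelity},
    }


good = toy_target_function({"x": 0.3}, fidelity=10)
bad = toy_target_function({"x": 0.9}, fidelity=10)
print(good["fitness"], bad["fitness"])
```

The better configuration ends up with the lower fitness, which is what a minimizer needs in order to prefer it.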
In this example, the target function trains a Random Forest model on a dataset. We load a dataset here and keep a fixed train-validation-test split for one complete run. Multiple DEHB runs can therefore optimize on the same validation split and evaluate final performance on the same test set.
Creating target function to optimize (2 parts)¶
1 ) Preparing dataset and splits¶
from sklearn.datasets import load_iris, load_digits, load_wine
classification = {"iris": load_iris, "digits": load_digits, "wine": load_wine}
from sklearn.model_selection import train_test_split
def prepare_dataset(model_type="classification", dataset=None):
    if model_type == "classification":
        if dataset is None:
            dataset = np.random.choice(list(classification.keys()))
        _data = classification[dataset]()
    else:
        # Only classification datasets are defined in this notebook
        raise ValueError("Unsupported model_type: {}".format(model_type))
    train_X, rest_X, train_y, rest_y = train_test_split(
        _data.get("data"),
        _data.get("target"),
        train_size=0.7,
        shuffle=True,
        random_state=seed
    )
    # 10% test and 20% validation data
    valid_X, test_X, valid_y, test_y = train_test_split(
        rest_X, rest_y,
        test_size=0.3333,
        shuffle=True,
        random_state=seed
    )
    return train_X, train_y, valid_X, valid_y, test_X, test_y, dataset
train_X, train_y, valid_X, valid_y, test_X, test_y, dataset = \
    prepare_dataset(model_type="classification")
print(dataset)
print("Train size: {}\nValid size: {}\nTest size: {}".format(
    train_X.shape, valid_X.shape, test_X.shape
))
wine
Train size: (124, 13)
Valid size: (36, 13)
Test size: (18, 13)
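The sizes printed above can be reproduced by hand from the two split fractions. The sketch below does the arithmetic for the 178 samples shown in the output (124 + 36 + 18), under the assumption that the training fraction is rounded down and the test fraction is rounded up, which is how these numbers work out here.

```python
import math

n_samples = 124 + 36 + 18  # total size implied by the output above

# First split: train_size=0.7, the remainder is held out
n_train = math.floor(0.7 * n_samples)
n_rest = n_samples - n_train

# Second split of the remainder: test_size=0.3333, the rest is validation
n_test = math.ceil(0.3333 * n_rest)
n_valid = n_rest - n_test

print(n_train, n_valid, n_test)
```

This gives roughly 70% training, 20% validation and 10% test data, in line with the comment in prepare_dataset.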
2 ) Function interface with DEHB¶
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, make_scorer
accuracy_scorer = make_scorer(accuracy_score)
def target_function(config, fidelity, **kwargs):
    # Extracting support information
    seed = kwargs["seed"]
    train_X = kwargs["train_X"]
    train_y = kwargs["train_y"]
    valid_X = kwargs["valid_X"]
    valid_y = kwargs["valid_y"]
    max_fidelity = kwargs["max_fidelity"]
    if fidelity is None:
        fidelity = max_fidelity
    start = time.time()
    # Building model
    model = RandomForestClassifier(
        **config.get_dictionary(),
        n_estimators=int(fidelity),
        bootstrap=True,
        random_state=seed,
    )
    # Training the model on the complete training set
    model.fit(train_X, train_y)
    # Evaluating the model on the validation set
    valid_accuracy = accuracy_scorer(model, valid_X, valid_y)
    cost = time.time() - start
    # Evaluating the model on the test set as additional info
    # (test_X and test_y are read from the notebook's global scope)
    test_accuracy = accuracy_scorer(model, test_X, test_y)
    result = {
        "fitness": -valid_accuracy,  # DE/DEHB minimizes
        "cost": cost,
        "info": {
            "test_score": test_accuracy,
            "fidelity": fidelity
        }
    }
    return result
We now have all the components needed to define the problem to be optimized. DEHB can be initialized using all of this information.
Running DEHB¶
from dehb import DEHB
dehb = DEHB(
    f=target_function,  # Here we do not need to necessarily specify the target function, but it can still be useful to call 'run' later.
    cs=cs,
    dimensions=dimensions,
    min_fidelity=min_fidelity,
    max_fidelity=max_fidelity,
    n_workers=1,
    output_path="./temp"
)
2024-08-12 11:53:34.972 | WARNING | dehb.optimizers.dehb:__init__:264 - A checkpoint already exists, results could potentially be overwritten.
n_function_evals = 50
for _ in range(n_function_evals):
    # Ask for the job_info, including the configuration to run and the fidelity
    job_info = dehb.ask()
    # Evaluate the configuration on the given fidelity. Here you are free to use
    # any technique to compute the result. This job could e.g. be forwarded to
    # a worker on your cluster (which does not need to use Dask).
    # The results dict has to contain the keys "cost" and "fitness" with an additional "info"
    # dict for additional, user-specific data.
    res = target_function(job_info["config"], job_info["fidelity"],
                          # parameters as **kwargs in target_function
                          seed=123,
                          train_X=train_X,
                          train_y=train_y,
                          valid_X=valid_X,
                          valid_y=valid_y,
                          max_fidelity=dehb.max_fidelity)
    # When the evaluation is done, report the results back to the DEHB controller.
    dehb.tell(job_info, res)
trajectory = dehb.traj
runtime = dehb.runtime
history = dehb.history
print(len(trajectory), len(runtime), len(history), end="\n\n")
# Last recorded function evaluation
last_eval = history[-1]
config_id, config, score, cost, fidelity, _info = last_eval
print("Last evaluated configuration, ")
print(dehb.vector_to_configspace(config), end="")
print("got a score of {}, was evaluated at a fidelity of {:.2f} and "
      "took {:.3f} seconds to run.".format(score, fidelity, cost))
print("The additional info attached: {}".format(_info))
print()
print("Best evaluated configuration, ")
best_config = dehb.vector_to_configspace(dehb.inc_config)
# Creating a model using the best configuration found
model = RandomForestClassifier(
    **best_config.get_dictionary(),
    n_estimators=int(max_fidelity),
    bootstrap=True,
    random_state=seed,
)
# Training the model on the combined training and validation sets
model.fit(
    np.concatenate((train_X, valid_X)),
    np.concatenate((train_y, valid_y))
)
# Evaluating the model on the held-out test set
test_accuracy = accuracy_scorer(model, test_X, test_y)
print(f"{best_config} got an accuracy of {test_accuracy} on the test set.")
50 50 50

Last evaluated configuration, 
Configuration(values={
  'max_depth': 9,
  'max_features': 0.6012653351497,
  'min_samples_leaf': 7,
  'min_samples_split': 23,
})got a score of -1.0, was evaluated at a fidelity of 16.67 and took 0.018 seconds to run.
The additional info attached: {'test_score': 1.0, 'fidelity': 16.666666666666664}

Best evaluated configuration, 
Configuration(values={
  'max_depth': 8,
  'max_features': 0.6012653351497,
  'min_samples_leaf': 14,
  'min_samples_split': 23,
}) got an accuracy of 0.9444444444444444 on the test set.
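The loop above follows a general ask/tell pattern: the optimizer proposes a job, the caller evaluates it however it likes, and the result is reported back. To make the division of responsibilities explicit, the sketch below implements the same pattern with a deliberately simple random-search optimizer. The class RandomSearchAskTell and the evaluate function are hypothetical stand-ins written for this example; they are not DEHB and do not reflect its internals.

```python
import random


class RandomSearchAskTell:
    """Hypothetical stand-in optimizer exposing ask() and tell(); not DEHB."""

    def __init__(self, lower, upper, seed=0):
        self.rng = random.Random(seed)
        self.lower, self.upper = lower, upper
        self.history = []  # (config, fitness) pairs in evaluation order
        self.best = None   # best (config, fitness) seen so far

    def ask(self):
        # Propose a job: a configuration and the fidelity to evaluate it at
        x = self.rng.uniform(self.lower, self.upper)
        return {"config": {"x": x}, "fidelity": 1.0}

    def tell(self, job_info, result):
        # Record the reported result and update the incumbent (minimization)
        fitness = result["fitness"]
        self.history.append((job_info["config"], fitness))
        if self.best is None or fitness < self.best[1]:
            self.best = (job_info["config"], fitness)


def evaluate(config, fidelity):
    # Made-up objective to minimize, with its optimum at x = 0.3
    return {"fitness": (config["x"] - 0.3) ** 2, "cost": 0.0, "info": {}}


optimizer = RandomSearchAskTell(0.0, 1.0, seed=123)
for _ in range(50):
    job = optimizer.ask()
    optimizer.tell(job, evaluate(job["config"], job["fidelity"]))
print(len(optimizer.history), optimizer.best)
```

Because the evaluation happens outside the optimizer, the call to evaluate could just as well be dispatched to a remote worker, as long as the matching job is passed back to tell together with its result.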
After running DEHB for 50 function evaluations using the ask and tell interface, we can still call the run function in order to keep optimizing without explicitly using ask and tell.
# Continuing the ask/tell run of DEHB optimization for another 10s
trajectory, runtime, history = dehb.run(
    total_cost=10,
    seed=123,
    train_X=train_X,
    train_y=train_y,
    valid_X=valid_X,
    valid_y=valid_y,
    max_fidelity=dehb.max_fidelity
)
best_config = dehb.vector_to_configspace(dehb.inc_config)
# Creating a model using the best configuration found
model = RandomForestClassifier(
    **best_config.get_dictionary(),
    n_estimators=int(max_fidelity),
    bootstrap=True,
    random_state=seed,
)
# Training the model on the combined training and validation sets
model.fit(
    np.concatenate((train_X, valid_X)),
    np.concatenate((train_y, valid_y))
)
# Evaluating the model on the held-out test set
test_accuracy = accuracy_scorer(model, test_X, test_y)
print(len(trajectory), len(runtime), len(history), end="\n\n")
# Last recorded function evaluation
last_eval = history[-1]
config_id, config, score, cost, fidelity, _info = last_eval
print("Last evaluated configuration, ")
print(dehb.vector_to_configspace(config), end="")
print("got a score of {}, was evaluated at a fidelity of {:.2f} and "
      "took {:.3f} seconds to run.".format(score, fidelity, cost))
print("The additional info attached: {}".format(_info))
print()
print("Best evaluated configuration, ")
print(f"{best_config} got an accuracy of {test_accuracy} on the test set.")
2024-08-12 11:53:46.076 | WARNING | dehb.optimizers.dehb:_timeout_handler:352 - Runtime budget exhausted. Saving optimization checkpoint now.
477 477 477

Last evaluated configuration, 
Configuration(values={
  'max_depth': 2,
  'max_features': 0.5378918797939,
  'min_samples_leaf': 8,
  'min_samples_split': 10,
})got a score of -1.0, was evaluated at a fidelity of 50.00 and took 0.062 seconds to run.
The additional info attached: {'test_score': 1.0, 'fidelity': 50.0}

Best evaluated configuration, 
Configuration(values={
  'max_depth': 8,
  'max_features': 0.5845328598914,
  'min_samples_leaf': 1,
  'min_samples_split': 19,
}) got an accuracy of 1.0 on the test set.
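In the output above, the trajectory, runtime and history all have the same length, which is consistent with one entry being recorded per function evaluation. The sketch below shows one way such a trajectory can be derived from a sequence of scores, assuming it stores the best (lowest) score seen so far after each evaluation. The helper incumbent_trajectory and the example scores are made up for illustration and are not read from the DEHB object.

```python
def incumbent_trajectory(scores):
    """Return the running minimum of a sequence of scores.

    Hypothetical helper; assumes lower scores are better.
    """
    best = float("inf")
    trajectory = []
    for score in scores:
        best = min(best, score)
        trajectory.append(best)
    return trajectory


# Made-up negated accuracies, in evaluation order
example_scores = [-0.80, -0.90, -0.70, -1.00, -0.95]
example_trajectory = incumbent_trajectory(example_scores)
print(example_trajectory)
```

A trajectory built this way never increases, so plotting it against the cumulative runtime gives a simple view of how quickly the optimizer improves.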