Optimizing RandomForest using DEHB¶
This notebook builds on the template from the 00_interfacing_DEHB notebook and applies it to an actual problem: optimizing the hyperparameters of a Random Forest model on a dataset.
Additional requirements:
- scikit-learn>=0.24
import time
import numpy as np
import warnings
seed = 123
np.random.seed(seed)
warnings.filterwarnings('ignore')
The problem defined here is to optimize a Random Forest model, on any given dataset, using DEHB. The hyperparameters chosen to be optimized are:
max_depth
min_samples_split
max_features
min_samples_leaf
while the n_estimators
hyperparameter of the Random Forest is chosen to be the fidelity parameter instead. Too few trees ($<10$) may not allow adequate ensembling for the aggregated prediction to be significantly better than the individual tree predictions, whereas a large number of trees (~$100$) often gives accurate predictions but is naturally slower to train and predict with, on account of having more trees to fit. Therefore, a smaller n_estimators
can be used as a cheaper approximation of the full fidelity of n_estimators=100
, as the quick timing check below illustrates.
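A quick, illustrative timing check (a sketch, not part of the original setup) makes this trade-off concrete by fitting the same forest with few versus many trees; load_wine is used here purely as a stand-in dataset.
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

X, y = load_wine(return_X_y=True)
for n_trees in (2, 100):
    start = time.time()
    RandomForestClassifier(n_estimators=n_trees, random_state=seed).fit(X, y)
    # more trees give a stronger ensemble, but fitting takes proportionally longer
    print("n_estimators={:>3}: fit took {:.3f}s".format(n_trees, time.time() - start))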
Defining fidelity range¶
min_fidelity, max_fidelity = 2, 50
For the remaining $4$ hyperparameters, the search space can be created as a ConfigSpace
object, with the domain of each individual parameter defined.
Creating search space¶
import ConfigSpace as CS
def create_search_space(seed=123):
"""Parameter space to be optimized --- contains the hyperparameters
"""
cs = CS.ConfigurationSpace(seed=seed)
cs.add_hyperparameters([
CS.UniformIntegerHyperparameter(
'max_depth', lower=1, upper=15, default_value=2, log=False
),
CS.UniformIntegerHyperparameter(
'min_samples_split', lower=2, upper=128, default_value=2, log=True
),
CS.UniformFloatHyperparameter(
'max_features', lower=0.1, upper=0.9, default_value=0.5, log=False
),
CS.UniformIntegerHyperparameter(
'min_samples_leaf', lower=1, upper=64, default_value=1, log=True
),
])
return cs
cs = create_search_space(seed)
print(cs)
Configuration space object:
  Hyperparameters:
    max_depth, Type: UniformInteger, Range: [1, 15], Default: 2
    max_features, Type: UniformFloat, Range: [0.1, 0.9], Default: 0.5
    min_samples_leaf, Type: UniformInteger, Range: [1, 64], Default: 1, on log-scale
    min_samples_split, Type: UniformInteger, Range: [2, 128], Default: 2, on log-scale
dimensions = len(cs.get_hyperparameters())
print("Dimensionality of search space: {}".format(dimensions))
Dimensionality of search space: 4
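As a quick sanity check, a random configuration can be drawn from this space:
# Drawing one random configuration from the search space
sample_config = cs.sample_configuration()
print(sample_config)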
Now the primary black/gray-box interface to the Random Forest model needs to be built for DEHB to query. As shown in the 00_interfacing_DEHB
notebook, this function has a signature akin to: target_function(config, fidelity)
, and returns a dictionary containing the fitness (score)
and the cost
of the evaluation. It must be noted that DEHB minimizes, and therefore the fitness
returned by this target_function
should account for that, for example by negating an accuracy score. A minimal skeleton follows.
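In outline (a minimal sketch for illustration, not the actual function used in this notebook), a DEHB target function looks like this:
def sketch_target_function(config, fidelity, **kwargs):
    start = time.time()
    # ... train a model configured by `config`, using `fidelity` resources ...
    score = 0.0                     # placeholder validation score (higher is better)
    cost = time.time() - start      # e.g., wall-clock time of this evaluation
    return {
        "fitness": -score,          # DEHB minimizes, so negate a score to be maximized
        "cost": cost,
        "info": {}                  # optional extra information to record
    }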
In this example, the target function trains a Random Forest model on a dataset. We load a dataset here and maintain a fixed, train-validation-test split for one complete run. Multiple DEHB runs can therefore optimize on the same validation split, and evaluate final performance on the same test set.
Creating target function to optimize (2 parts)¶
1 ) Preparing dataset and splits¶
from sklearn.datasets import load_iris, load_digits, load_wine, load_diabetes
classification = {"iris": load_iris, "digits": load_digits, "wine": load_wine}
# Regression datasets, needed by the else-branch of prepare_dataset below;
# only diabetes is included here as an example
regression = {"diabetes": load_diabetes}
from sklearn.model_selection import train_test_split
def prepare_dataset(model_type="classification", dataset=None):
if model_type == "classification":
if dataset is None:
dataset = np.random.choice(list(classification.keys()))
_data = classification[dataset]()
else:
if dataset is None:
dataset = np.random.choice(list(regression.keys()))
_data = regression[dataset]()
train_X, rest_X, train_y, rest_y = train_test_split(
_data.get("data"),
_data.get("target"),
train_size=0.7,
shuffle=True,
random_state=seed
)
# 10% test and 20% validation data
valid_X, test_X, valid_y, test_y = train_test_split(
rest_X, rest_y,
test_size=0.3333,
shuffle=True,
random_state=seed
)
return train_X, train_y, valid_X, valid_y, test_X, test_y, dataset
train_X, train_y, valid_X, valid_y, test_X, test_y, dataset = \
prepare_dataset(model_type="classification")
print(dataset)
print("Train size: {}\nValid size: {}\nTest size: {}".format(
train_X.shape, valid_X.shape, test_X.shape
))
wine
Train size: (124, 13)
Valid size: (36, 13)
Test size: (18, 13)
2 ) Function interface with DEHB¶
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, make_scorer
accuracy_scorer = make_scorer(accuracy_score)
def target_function(config, fidelity, **kwargs):
# Extracting support information
seed = kwargs["seed"]
train_X = kwargs["train_X"]
train_y = kwargs["train_y"]
valid_X = kwargs["valid_X"]
valid_y = kwargs["valid_y"]
max_fidelity = kwargs["max_fidelity"]
if fidelity is None:
fidelity = max_fidelity
start = time.time()
# Building model
model = RandomForestClassifier(
**config.get_dictionary(),
n_estimators=int(fidelity),
bootstrap=True,
random_state=seed,
)
# Training the model on the complete training set
model.fit(train_X, train_y)
# Evaluating the model on the validation set
valid_accuracy = accuracy_scorer(model, valid_X, valid_y)
cost = time.time() - start
# Evaluating the model on the test set (taken from the enclosing scope) as additional info
test_accuracy = accuracy_scorer(model, test_X, test_y)
result = {
"fitness": -valid_accuracy, # DE/DEHB minimizes
"cost": cost,
"info": {
"test_score": test_accuracy,
"fidelity": fidelity
}
}
return result
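Before handing this function to DEHB, it can be sanity-checked on a single sampled configuration (an illustrative one-off call):
# One-off evaluation of the target function on a random configuration
check_config = cs.sample_configuration()
print(target_function(
    check_config, fidelity=10,
    seed=seed, train_X=train_X, train_y=train_y,
    valid_X=valid_X, valid_y=valid_y, max_fidelity=max_fidelity
))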
We now have all the components needed to define the problem to be optimized. DEHB can be initialized with all of this information.
Running DEHB¶
from dehb import DEHB
dehb = DEHB(
f=target_function,
cs=cs,
dimensions=dimensions,
min_fidelity=min_fidelity,
max_fidelity=max_fidelity,
n_workers=1,
output_path="./temp"
)
2024-08-12 11:52:31.013 | WARNING | dehb.optimizers.dehb:__init__:264 - A checkpoint already exists, results could potentially be overwritten.
trajectory, runtime, history = dehb.run(
total_cost=10,
# parameters expected as **kwargs in target_function are passed here
seed=123,
train_X=train_X,
train_y=train_y,
valid_X=valid_X,
valid_y=valid_y,
max_fidelity=dehb.max_fidelity
)
2024-08-12 11:52:41.024 | WARNING | dehb.optimizers.dehb:_timeout_handler:352 - Runtime budget exhausted. Saving optimization checkpoint now.
print(len(trajectory), len(runtime), len(history), end="\n\n")
# Last recorded function evaluation
last_eval = history[-1]
config_id, config, score, cost, fidelity, _info = last_eval
print("Last evaluated configuration, ")
print(dehb.vector_to_configspace(config), end="")
print("got a score of {}, was evaluated at a fidelity of {:.2f} and "
"took {:.3f} seconds to run.".format(score, fidelity, cost))
print("The additional info attached: {}".format(_info))
439 439 439

Last evaluated configuration, 
Configuration(values={
  'max_depth': 8,
  'max_features': 0.6243371076843,
  'min_samples_leaf': 10,
  'min_samples_split': 3,
})
got a score of -1.0, was evaluated at a fidelity of 50.00 and took 0.060 seconds to run.
The additional info attached: {'test_score': 1.0, 'fidelity': 50.0}
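The incumbent (best configuration seen so far) can also be inspected directly; dehb.inc_config is the attribute this notebook uses again below, and inc_score is assumed here to hold the corresponding fitness:
# Inspecting the incumbent; `inc_score` is assumed to mirror `inc_config`
incumbent = dehb.vector_to_configspace(dehb.inc_config)
print(incumbent)
print("Incumbent fitness: {}".format(dehb.inc_score))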
Below, we let DEHB optimize over $5$ different runs. The reset()
method allows DEHB to begin optimization from scratch by clearing all history and starting with fresh random samples. Each DEHB run lasts just $10$ seconds, as set by total_cost=10
. We then report the mean and the standard deviation of the best score seen across these $5$ runs.
runs = 5
best_config_list = []
for i in range(runs):
# Resetting to begin optimization again
dehb.reset()
# Executing a run of DEHB optimization lasting for 10s
trajectory, runtime, history = dehb.run(
total_cost=10,
seed=123,
train_X=train_X,
train_y=train_y,
valid_X=valid_X,
valid_y=valid_y,
max_fidelity=dehb.max_fidelity
)
best_config = dehb.vector_to_configspace(dehb.inc_config)
# Creating a model using the best configuration found
model = RandomForestClassifier(
**best_config.get_dictionary(),
n_estimators=int(max_fidelity),
bootstrap=True,
random_state=seed,
)
# Training the model on the complete training set
model.fit(
np.concatenate((train_X, valid_X)),
np.concatenate((train_y, valid_y))
)
# Evaluating the model on the held-out test set
test_accuracy = accuracy_scorer(model, test_X, test_y)
best_config_list.append((best_config, test_accuracy))
2024-08-12 11:52:51.100 | WARNING | dehb.optimizers.dehb:_timeout_handler:352 - Runtime budget exhausted. Saving optimization checkpoint now.
2024-08-12 11:53:01.199 | WARNING | dehb.optimizers.dehb:_timeout_handler:352 - Runtime budget exhausted. Saving optimization checkpoint now.
2024-08-12 11:53:11.333 | WARNING | dehb.optimizers.dehb:_timeout_handler:352 - Runtime budget exhausted. Saving optimization checkpoint now.
2024-08-12 11:53:21.448 | WARNING | dehb.optimizers.dehb:_timeout_handler:352 - Runtime budget exhausted. Saving optimization checkpoint now.
2024-08-12 11:53:31.569 | WARNING | dehb.optimizers.dehb:_timeout_handler:352 - Runtime budget exhausted. Saving optimization checkpoint now.
print("Mean score across trials: ", np.mean([score for _, score in best_config_list]))
print("Std. dev. of score across trials: ", np.std([score for _, score in best_config_list]))
Mean score across trials:  1.0
Std. dev. of score across trials:  0.0
for config, score in best_config_list:
print("{} got an accuracy of {} on the test set.".format(config, score))
print()
Configuration(values={
  'max_depth': 15,
  'max_features': 0.7478507781239,
  'min_samples_leaf': 7,
  'min_samples_split': 2,
}) got an accuracy of 1.0 on the test set.

Configuration(values={
  'max_depth': 15,
  'max_features': 0.7478507781239,
  'min_samples_leaf': 7,
  'min_samples_split': 2,
}) got an accuracy of 1.0 on the test set.

Configuration(values={
  'max_depth': 15,
  'max_features': 0.7478507781239,
  'min_samples_leaf': 7,
  'min_samples_split': 2,
}) got an accuracy of 1.0 on the test set.

Configuration(values={
  'max_depth': 15,
  'max_features': 0.7478507781239,
  'min_samples_leaf': 7,
  'min_samples_split': 2,
}) got an accuracy of 1.0 on the test set.

Configuration(values={
  'max_depth': 15,
  'max_features': 0.7478507781239,
  'min_samples_leaf': 7,
  'min_samples_split': 2,
}) got an accuracy of 1.0 on the test set.