Parallelism
SMAC supports multiple workers natively via Dask. Just specify n_workers
in the scenario and you are ready to go.
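A minimal sketch of this (the quadratic target function and the single hyperparameter are illustrative, not part of SMAC): setting n_workers in the Scenario is enough to have trials evaluated in parallel. Note the if __name__ == "__main__" guard, which is required when Dask spawns workers (see the warning below).

from ConfigSpace import ConfigurationSpace, Float
from smac import HyperparameterOptimizationFacade, Scenario

def quadratic(config, seed: int = 0) -> float:
    # Toy objective: minimize x^2
    return config["x"] ** 2

if __name__ == "__main__":
    cs = ConfigurationSpace()
    cs.add_hyperparameter(Float("x", (-5.0, 5.0)))

    # n_workers > 1 makes SMAC evaluate trials in parallel via Dask
    scenario = Scenario(cs, n_trials=100, n_workers=4)
    smac = HyperparameterOptimizationFacade(scenario, quadratic)
    incumbent = smac.optimize()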
Note
Please keep in mind that additional workers are only used to evaluate trials. The main thread still orchestrates the optimization process, including training the surrogate model.
Warning
Using a high number of workers when the target function evaluation is fast might be counterproductive due to the overhead of communication. Consider using only one worker in this case.
Warning
When using multiple workers, SMAC is no longer reproducible.
Warning
You cannot use resource limitation (pynisher, via the scenario arguments trial_walltime_limit and trial_memory_limit). This is because pynisher works by running your function inside a subprocess; once in the subprocess, the resources are limited for that process before your function runs. This does not work together with pickling, which Dask requires to schedule jobs on the cluster, even on a local one.
Warning
Start/run SMAC inside an if __name__ == "__main__" block in your script; otherwise Dask cannot spawn jobs correctly and you will most likely see this runtime error:
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
Running on a Cluster
You can also pass a custom Dask client, e.g. to run on a SLURM cluster. See our parallelism example.
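A sketch of the idea, assuming dask_jobqueue is installed and that the facade accepts a dask_client argument as in the parallelism example; the queue name, cores, memory, and walltime values are placeholders for your cluster:

from ConfigSpace import ConfigurationSpace, Float
from dask.distributed import Client
from dask_jobqueue import SLURMCluster
from smac import HyperparameterOptimizationFacade, Scenario

def quadratic(config, seed: int = 0) -> float:
    return config["x"] ** 2

if __name__ == "__main__":
    cs = ConfigurationSpace()
    cs.add_hyperparameter(Float("x", (-5.0, 5.0)))

    # Placeholder SLURM settings; adjust to your cluster.
    cluster = SLURMCluster(
        queue="short",
        cores=4,
        memory="8GB",
        walltime="00:30:00",
    )
    cluster.scale(jobs=4)  # request 4 worker jobs from SLURM
    client = Client(cluster)

    scenario = Scenario(cs, n_trials=200)
    # With a custom client, the degree of parallelism is governed by the
    # Dask cluster rather than by n_workers in the scenario.
    smac = HyperparameterOptimizationFacade(scenario, quadratic, dask_client=client)
    incumbent = smac.optimize()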
Warning
On some clusters you cannot spawn new jobs when running a SLURMCluster inside a job instead of on the login node. No obvious error might be raised, but the run can hang silently.
Warning
Sometimes you need to modify your launch command, which can be done via SLURMCluster.job_cls.submit_command.
cluster.job_cls.submit_command = submit_command
cluster.job_cls.cancel_command = cancel_command
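For example, a hypothetical tweak that routes job submission and cancellation through the login node (the exact commands depend on your setup and are only illustrative):

# Hypothetical: submit/cancel via the login node instead of the current job
cluster.job_cls.submit_command = "ssh login-node sbatch"
cluster.job_cls.cancel_command = "ssh login-node scancel"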