Multi-Objective Optimization#
Multi-Objective Optimization (MOO) addresses the challenge of optimizing multiple, often competing objectives simultaneously. Unlike single-objective optimization where there is one clear optimum, multi-objective problems have a set of trade-off solutions known as the Pareto Front. This is particularly relevant in deep learning, where practitioners often need to optimize for multiple goals such as validation accuracy, inference latency, training time, and fairness.
What is Multi-Objective Optimization?#
In multi-objective optimization, we seek to minimize (or maximize) a vector-valued objective function:
Since optimizing all objectives simultaneously is generally impossible, the goal is to find the Pareto Set, a set of non-dominated solutions where improving one objective necessarily worsens another.
Pareto Optimality#
A configuration \(\boldsymbol{\lambda}_2\) Pareto dominates \(\boldsymbol{\lambda}_1\) (denoted \(\boldsymbol{\lambda}_2 \prec \boldsymbol{\lambda}_1\)) if: - It is at least as good as \(\boldsymbol{\lambda}_1\) in all objectives - It is strictly better in at least one objective
A configuration is Pareto Optimal if no other configuration dominates it. The set of all Pareto Optimal configurations is the Pareto Set, and the set of their objective values is the Pareto Front.
Advantages of Multi-Objective Optimization
- Real-world relevance: Captures the practical reality that practitioners optimize for multiple goals simultaneously
- Informed trade-offs: Identifies the Pareto Front to help practitioners choose configurations that match their priorities
- Resource awareness: Can explicitly optimize for computational costs, inference latency, and other practical constraints
Challenges with Multi-Objective Optimization
- Computational cost: Exploring the entire Pareto Front is typically more expensive than finding a single optimum
- Scale with objectives: Difficulty increases with the number of objectives
- Prior integration: Traditional MOO algorithms lack robust mechanisms to incorporate expert knowledge for multiple objectives
PriMO: Prior-Informed Multi-Objective Optimizer#
PriMO (see paper) is the first hyperparameter optimization algorithm that effectively integrates multi-objective expert priors while exploiting cheap approximations of expensive objectives. It combines the strengths of multi-fidelity optimization with prior-guided Bayesian optimization, achieving state-of-the-art performance in both multi-objective and single-objective settings.
Key Components#
1. Multi-Objective Expert Priors#
PriMO extends single-objective priors to the multi-objective setting using a factorized prior approach. For each objective \(f_i\), a prior \(\pi_{f_i}(\boldsymbol{\lambda})\) represents a probability distribution over the location of that objective's optimum:
The compound prior is then:
This allows experts to specify beliefs about optimal regions for each objective independently, which is more intuitive than specifying priors over Pareto fronts directly.
2. Initial Design Strategy: Leveraging Cheap Approximations#
PriMO uses a multi-fidelity algorithm (MOASHA - Multi-Objective Asynchronous Successive Halving) during an initial design phase to generate strong seed points at full fidelity. This exploits cheap low-fidelity evaluations to quickly identify promising regions before transitioning to expensive full-fidelity optimization.
The initial design terminates once a budget threshold is reached, after which only full-fidelity evaluations are inserted into the dataset \(\mathcal{D}\) for use in the Bayesian optimization phase. This approach provides a "warm start" to the BO phase while consuming resources efficiently.
3. Prior-Augmented Bayesian Optimization with ε-Greedy Strategy#
PriMO's core innovation is an ε-greedy prior-weighted acquisition function that balances prior exploitation with exploration. At each iteration, the acquisition function is:
where: - \(\alpha(\boldsymbol{\lambda}, \mathcal{D})\) is the base acquisition function (qLogNoisyEI) - One prior is randomly selected at each iteration - The decay factor is: \(\gamma = \exp\left(-\frac{n_{\text{BO}}^2}{n_d}\right)\) - \(n_{\text{BO}}\) is the number of BO samples - \(n_d\) is the dimensionality of the search space
Key insight: Unlike previous methods that depend on priors too long, PriMO's quadratic decay ensures rapid reduction of prior influence, allowing quick recovery from misleading priors while still exploiting good priors early.
Algorithm Advantages#
Why Use PriMO?
- Robust to prior quality: The ε-greedy strategy with quadratic decay allows PriMO to leverage good priors for early speedup while recovering from bad priors
- Compute-efficient: Up to 10x speedups observed on DL benchmarks when good priors are available
- Multi-fidelity support: Leverages cheap proxy tasks to accelerate initial seed discovery
- True multi-objective: Unlike scalarization baselines, PriMO explores the full Pareto Front through random weight resampling
- No learning curve: Works effectively even with modest budgets typical in deep learning (10-20 evaluations)
When to Use PriMO#
PriMO should be your choice when:
- ✅ You have multiple objectives to optimize (e.g., accuracy and latency)
- ✅ You have expert knowledge about good hyperparameter regions for each objective
- ✅ You can leverage cheap approximations (lower fidelity runs, smaller datasets, fewer epochs)
- ✅ You have limited compute budget (typical practical deep learning scenarios)
PriMO may not be ideal when:
- ❌ You only have a single objective (use
PiBOinstead) - ❌ You have no access to multi-fidelity information
- ❌ You cannot express expert knowledge as priors for individual objectives
Practical Considerations#
Caution with Priors
- Misleading priors: Even if priors are bad, PriMO recovers after ~10 BO samples due to its decay schedule. However, multiple runs with different seeds may be needed for reliable coverage
- Prior generation: Good priors should reflect expert intuition or be derived from similar tasks. Use the highest-fidelity data available when constructing priors
- Objective scaling: Ensure objectives are on comparable scales for effective scalarization, or normalize them appropriately
See the algorithm's implementation details in the API.
Info
PriMO is the recommended choice in NePS when you have both Multi-Fidelity and Multi-Objective settings with expert priors available.
Single-Objective Special Case#
PriMO can be specialized to single-objective optimization by:
- Replacing the multi-fidelity MOASHA component with ASHA in the initial design
- Using the single objective's prior throughout (instead of randomly selecting among priors)
- Maintaining the ε-greedy prior-weighted acquisition function
Comparison to Related Work#
| Aspect | MOASHA | BO+RW | ParEGO | πBO+RW | MO-Priorband | PriMO |
|---|---|---|---|---|---|---|
| Multi-Objective | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Multi-Fidelity | ✅ | ❌ | ❌ | ❌ | ✅ | ✅ |
| Multi-Objective Priors | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ |
| Robust to Bad Priors | ✅ | ✅ | ✅ | ❌ | ❌ | ✅ |
| Sample Efficiency | ✅ | ❌ | ❌ | ✅ | ✅ | ✅ |