Mighty exploration policy
mighty.mighty_exploration.mighty_exploration_policy
Mighty Exploration Policy.
MightyExplorationPolicy
Generic Exploration Policy Interface.
Now supports:
- Discrete: model(state) → logits → Categorical
- Continuous (squashed-Gaussian): model(state) → (action, z, mean, log_std)
- Continuous (Standard PPO): model(state) → (action, mean, log_std)
- Continuous (legacy): model(state) → (mean, std)
:param discrete: True if the action space is discrete
Source code in mighty/mighty_exploration/mighty_exploration_policy.py
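To make the supported conventions concrete, the sketch below dispatches on the model output. It is a minimal illustration under assumed names (`sample_action`, `model`, `state`), not the class implementation; the tanh-squash correction for the squashed-Gaussian case is covered separately by sample_nondeterministic_logprobs below.

```python
import torch
from torch.distributions import Categorical, Normal

# Minimal sketch of the model-output conventions listed above; not the
# class implementation. `sample_action`, `model`, and `state` are
# hypothetical names.
def sample_action(model, state, discrete: bool):
    out = model(state)
    if discrete:
        # Discrete: model(state) -> logits -> Categorical
        dist = Categorical(logits=out)
        action = dist.sample()
        return action, dist.log_prob(action)
    if len(out) == 4:
        # Squashed Gaussian: model(state) -> (action, z, mean, log_std).
        # Log-prob of the pre-squash sample z; the tanh correction is
        # handled by sample_nondeterministic_logprobs (see below).
        action, z, mean, log_std = out
        return action, Normal(mean, log_std.exp()).log_prob(z).sum(-1)
    if len(out) == 3:
        # Standard PPO: model(state) -> (action, mean, log_std)
        action, mean, log_std = out
        return action, Normal(mean, log_std.exp()).log_prob(action).sum(-1)
    # Legacy: model(state) -> (mean, std)
    mean, std = out
    dist = Normal(mean, std)
    action = dist.sample()
    return action, dist.log_prob(action).sum(-1)
```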
__call__
Get action.
:param s: state
:param return_logp: return logprobs
:param metrics: current metric dict
:param eval: eval mode
:return: action or (action, logprobs)
Source code in mighty/mighty_exploration/mighty_exploration_policy.py
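As a usage illustration of the documented call contract, the toy policy below returns an action, or (action, logprobs) when return_logp=True, and a greedy action in eval mode. The class and model here are hypothetical stand-ins, not the Mighty implementation.

```python
import numpy as np
import torch
from torch.distributions import Categorical

# Toy stand-in that mirrors the documented __call__ contract; not the
# Mighty class itself.
class ToyPolicy:
    def __init__(self, obs_dim: int = 4, n_actions: int = 3):
        self.model = torch.nn.Linear(obs_dim, n_actions)

    def __call__(self, s, return_logp=False, metrics=None, eval=False):
        logits = self.model(torch.as_tensor(s, dtype=torch.float32))
        dist = Categorical(logits=logits)
        # eval mode: greedy action; otherwise sample for exploration
        action = logits.argmax(dim=-1) if eval else dist.sample()
        return (action, dist.log_prob(action)) if return_logp else action

policy = ToyPolicy()
state = np.zeros((1, 4), dtype=np.float32)          # [batch, obs_dim]
action, logprobs = policy(state, return_logp=True)  # (action, logprobs)
greedy = policy(state, eval=True)                   # action only
```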
explore
Explore.
:param s: state
:param return_logp: return logprobs
:param _: not used
:return: action or (action, logprobs)
Source code in mighty/mighty_exploration/mighty_exploration_policy.py
explore_func
sample_func_logits
state_np: np.ndarray of shape [batch, obs_dim]
Returns: (action_tensor, log_prob_tensor)
Source code in mighty/mighty_exploration/mighty_exploration_policy.py
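A sketch of this branch under the stated assumptions (a model mapping [batch, obs_dim] observations to action logits); not the library source.

```python
import numpy as np
import torch
from torch.distributions import Categorical

# Sketch of the logits branch described above; not the library source.
# `model` is assumed to map a [batch, obs_dim] tensor to action logits.
def sample_func_logits(model, state_np: np.ndarray):
    state = torch.as_tensor(state_np, dtype=torch.float32)
    logits = model(state)                    # [batch, n_actions]
    dist = Categorical(logits=logits)
    action = dist.sample()                   # [batch]
    log_prob = dist.log_prob(action)         # [batch]
    return action, log_prob                  # (action_tensor, log_prob_tensor)
```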
sample_func_q
Q-learning branch:
• state_np: np.ndarray of shape [batch, obs_dim]
• model(state) returns Q-values: tensor [batch, n_actions]
We choose action = argmax(Q) and also return the full Q-vector.
Source code in mighty/mighty_exploration/mighty_exploration_policy.py
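A sketch of the Q-learning branch as described above; `model` is assumed to return a [batch, n_actions] Q-value tensor.

```python
import numpy as np
import torch

# Sketch of the Q-learning branch described above; not the library source.
# `model` is assumed to return Q-values of shape [batch, n_actions].
def sample_func_q(model, state_np: np.ndarray):
    state = torch.as_tensor(state_np, dtype=torch.float32)
    q_values = model(state)                  # [batch, n_actions]
    action = q_values.argmax(dim=-1)         # action = argmax(Q)
    return action, q_values                  # greedy action plus full Q-vector
```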
sample_nondeterministic_logprobs
sample_nondeterministic_logprobs(
z: Tensor,
mean: Tensor,
log_std: Tensor,
sac: bool = False,
) -> Tensor
Compute log-prob of a Gaussian sample z ~ N(mean, exp(log_std)), and if sac=True apply the tanh-squash correction to get log π(a).
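A sketch of this computation under the stated assumptions, using the numerically stable identity log(1 − tanh(z)²) = 2·(log 2 − z − softplus(−2z)); the helper name is hypothetical and this is not the library source.

```python
import math
import torch
import torch.nn.functional as F
from torch.distributions import Normal

# Sketch of the described computation; not the library source.
def gaussian_logprob(z, mean, log_std, sac=False):
    # log N(z; mean, exp(log_std)), summed over action dimensions
    log_prob = Normal(mean, log_std.exp()).log_prob(z).sum(dim=-1)
    if sac:
        # tanh-squash correction: subtract log(1 - tanh(z)^2), stable form
        correction = 2.0 * (math.log(2.0) - z - F.softplus(-2.0 * z))
        log_prob = log_prob - correction.sum(dim=-1)
    return log_prob
```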