mdp_playground.envs.rl_toy_env.RLToyEnv

class mdp_playground.envs.rl_toy_env.RLToyEnv(**config)[source]

Bases: gym.core.Env

The base toy environment in MDP Playground. It is parameterised by a config dict and can be instantiated to be an MDP with any of the possible dimensions from the accompanying research paper. The class extends OpenAI Gym’s environment gym.Env.

The accompanying paper is available at: https://arxiv.org/abs/1909.07750.

Instead of implementing a new class for every type of MDP, the intent is to capture as many common dimensions across different types of environments as possible and to be able to control the difficulty of an environment by allowing fine-grained control over each of these dimensions. The focus is to be as flexible as possible.

The configuration for the environment is passed as a dict at initialisation and contains all the information needed to determine the dynamics of the MDP that the instantiated environment will emulate. We recommend looking at the examples in example.py to begin using the environment since the dimensions and config options are mostly self-explanatory. If you want to specify custom MDPs, please see the use_custom_mdp config option below. For more details, we list here the dimensions and config options (their names here correspond to the keys to be passed in the config dict):
state_space_type : str

Specifies the environment type. Options are “continuous”, “discrete” and “grid”. The “grid” environment is essentially a discretised version of the continuous environment.

delay : int >= 0

Delays each reward by this number of timesteps.

sequence_length : int >= 1

Intrinsic sequence length of the reward function of an environment. For discrete environments, randomly selected sequences of this length are set to be rewardable at initialisation if use_custom_mdp = false and generate_random_mdp = true.

transition_noise : float in range [0, 1] or Python function(rng)

For discrete environments, this is a float that specifies the fraction of timesteps at which the environment transitions to a noisy next state, chosen independently and uniformly at random. For continuous environments, if it is a float, it is used as the standard deviation of an i.i.d. normal noise distribution. If it is a Python function with one argument, its output is added to the next state. The argument is the Random Number Generator (RNG) of the environment, which is an np.random.RandomState object. This RNG should be used for all calls to the desired random function used as noise, to ensure reproducibility.

reward_noise : float or Python function(rng)

If it is a float, it is used as the standard deviation of an i.i.d. normal noise distribution. If it is a Python function with one argument, its output is added to the reward at every time step. The argument is the Random Number Generator (RNG) of the environment, which is an np.random.RandomState object. This RNG should be used for all calls to the desired random function used as noise, to ensure reproducibility.
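
A hedged sketch of the function form for the two noise options above; the single-rng-argument signature follows the descriptions in this section, while the particular distributions and scales are illustrative:

    # Noise passed as single-argument functions of the environment's RNG
    # (an np.random.RandomState). Distributions and scales are illustrative.
    noise_options = {
        "transition_noise": lambda rng: rng.normal(loc=0.0, scale=0.1),  # continuous envs
        "reward_noise": lambda rng: rng.normal(loc=0.0, scale=1.0),
    }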

reward_density : float in range [0, 1]

The fraction of possible sequences of a given length that will be selected to be rewardable at initialisation time.

reward_scale : float

Multiplies the rewards by this value at every time step.

reward_shift : float

This value is added to the reward at every time step.

diameter : int > 0

For discrete environments, if diameter = d, the state space is set up as a d-partite graph (and NOT a complete d-partite graph): if we order the d independent sets as 1, 2, ..., d, states in set 1 have actions leading to states in set 2, and so on, with the final set d having actions leading back to states in set 1. The number of actions for each state is thus (number of states) / d.

terminal_state_density : float in range [0, 1]

For discrete environments, the fraction of states that are terminal; the terminal states are fixed to the “last” states when we consider them to be ordered by their numerical value. This is w.l.o.g. because discrete states are categorical. For continuous environments, please see terminal_states and term_state_edge for how to control terminal states.

term_state_reward : float

Adds this to the reward if a terminal state was reached at the current time step.

image_representations : boolean

Boolean to associate an image as the external observation with every discrete categorical state. For discrete envs, this is handled by an mdp_playground.spaces.ImageMultiDiscrete object. It associates the image of an n + 3 sided polygon for a categorical state n. More details can be found in the documentation for the ImageMultiDiscrete class. For continuous and grid envs, this is handled by an mdp_playground.spaces.ImageContinuous object. More details can be found in the documentation for the ImageContinuous class.

irrelevant_features : boolean

If True, an additional irrelevant sub-space (irrelevant to achieving rewards) is present as part of the observation space. This sub-space has its own transition dynamics, independent of the dynamics of the relevant sub-space. For discrete environments, state_space_size must additionally be specified as a list. For continuous environments, the option relevant_indices must be specified; it gives the dimensions relevant to achieving rewards. For grid environments, nothing additional needs to be done, as the relevant grid shape is also used as the irrelevant grid shape.

use_custom_mdp : boolean

If true, users specify their own transition and reward functions using the config options transition_function and reward_function (see below). Optionally, they can also use init_state_dist and terminal_states for discrete spaces (see below).

transition_function : Python function(state, action) or a 2-D numpy.ndarray

A Python function emulating P(s, a). For discrete envs it’s also possible to specify an |S|x|A| transition matrix.

reward_function : Python function(state_sequence, action_sequence) or a 2-D numpy.ndarray

A Python function emulating R(state_sequence, action_sequence). The state_sequence is recorded by the environment, and transition_function is called before reward_function, so the “current” state (when step() was called) and the next state are the last 2 states in the sequence. For discrete environments, it is also possible to specify an |S|x|A| reward matrix, in which case the reward is assumed to be a function of the “current” state and action. If use_custom_mdp = false and the environment is continuous, this is a string that chooses one of the following predefined reward functions: move_along_a_line or move_to_a_point. If use_custom_mdp = false and the environment is grid, this is a string that chooses one of the following predefined reward functions: move_to_a_point. Support for sequences is planned.

Also see make_denser documentation.
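
As a hedged sketch (not a verbatim example from the library), a custom discrete MDP could be specified with matrices using only the config keys documented above; the 3-state, 2-action MDP below is purely illustrative:

    import numpy as np

    # Illustrative 3-state, 2-action custom MDP; P and R are |S| x |A| matrices.
    P = np.array([[1, 2],
                  [2, 0],
                  [0, 1]])                      # next state for each (state, action)
    R = np.array([[0.0, 1.0],
                  [0.0, 0.0],
                  [1.0, 0.0]])                  # reward for each (state, action)

    custom_config = {
        "state_space_type": "discrete",
        "use_custom_mdp": True,
        "state_space_size": 3,
        "action_space_size": 2,
        "transition_function": P,
        "reward_function": R,
        "init_state_dist": np.array([0.5, 0.5, 0.0]),  # optional
        "terminal_states": np.array([2]),              # optional
    }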

Specific to discrete environments:
state_space_size : int > 0 or list of length 2

A number specifying size of the state space for normal discrete environments and a list of len = 2 when irrelevant_features is True (The list contains sizes of relevant and irrelevant sub-spaces where the 1st sub-space is assumed relevant and the 2nd sub-space is assumed irrelevant). NOTE: When automatically generating MDPs, do not specify this value as its value depends on the action_space_size and the diameter as state_space_size = action_space_size * diameter.

action_space_size : int > 0

Similar description as state_space_size. When automatically generating MDPs, however, its value determines the state_space_size.

reward_dist : list with 2 floats or a Python function(env_rng, reward_sequence_dict)

If it is a list with 2 floats, these 2 values are interpreted as a closed interval and taken as the end points of a categorical distribution whose points are spaced equally along the interval. If it is a Python function, it samples rewards for the rewardable_sequences dict of the environment. The rewardable_sequences dict holds the rewardable sequences, with a tuple holding the sequence as the key and the reward handed out as the value. The 1st argument of the reward_dist function is the Random Number Generator (RNG) of the environment, which is an np.random.RandomState object. This RNG should be used for all calls to the desired random function used to sample rewards, to ensure reproducibility. The 2nd argument is the rewardable_sequences dict of the environment, which is provided because one may need access to the already created reward sequences inside the reward_dist function.
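
A hedged sketch of the two accepted forms; the function form assumes the callable returns the reward to associate with a newly added rewardable sequence, which is our reading of the description above:

    # Interval form: rewards taken from equally spaced points in [0.5, 1.0].
    reward_dist_interval = [0.5, 1.0]

    # Function form: sample a reward using the env's RNG; the 2nd argument gives
    # access to the already created rewardable sequences, should they be needed.
    def reward_dist_fn(rng, rewardable_sequences):
        return rng.uniform(0.5, 1.0)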

init_state_dist : 1-D numpy.ndarray

Specifies an array of initialisation probabilities for the discrete state space.

terminal_states : Python function(state) or 1-D numpy.ndarray

A Python function with the state as argument that returns whether the state is terminal. If this is specified as an array, the array lists the discrete states that are terminal.

Specific to image_representations for discrete envs:
image_transforms : str

String containing the transforms to be applied to the image representations. As long as one of the following words is present in the string - shift, scale, rotate, flip - the corresponding transform is applied at random to the polygon in the image representation whenever an observation is generated. Either care is explicitly taken that the polygon remains inside the image region, or a warning is generated.

sh_quant : int

An int to quantise the shift transforms.

scale_range : (float, float)

A tuple of real numbers to specify (min_scaling, max_scaling).

ro_quant : int

An int to quantise the rotation transforms.
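
As a hedged sketch using only keys documented in this section (values illustrative), a discrete environment with image observations and transforms might be configured as:

    image_config = {
        "state_space_type": "discrete",
        "action_space_size": 8,
        "image_representations": True,
        "image_transforms": "shift,scale,rotate,flip",
        "sh_quant": 2,              # quantisation of shift transforms
        "scale_range": (0.5, 1.5),  # (min_scaling, max_scaling)
        "ro_quant": 9,              # quantisation of rotation transforms
        "seed": 0,
    }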

Specific to continuous environments:
state_space_dim : int

A number specifying state space dimensionality. A Gym Box space of this dimensionality will be instantiated.

action_space_dim : int

Same description as state_space_dim. This is currently set equal to state_space_dim and doesn’t need to be specified.

relevant_indices : list

A list that provides the dimensions relevant to achieving rewards for continuous environments. The dynamics for these dimensions are independent of the dynamics for the remaining (irrelevant) dimensions.

state_space_max : float

Max absolute value that a dimension of the space can take. A Gym Box will be instantiated with range [-state_space_max, state_space_max]. Sampling will be done as for Gym Box spaces.

action_space_max : float

Similar description as for state_space_max.

terminal_states : numpy.ndarray

The centres of hypercube sub-spaces which are terminal.

term_state_edge : float

The edge length of the hypercube sub-spaces which are terminal.

transition_dynamics_order : int

An order of n implies that the n-th state derivative is set equal to the action/inertia.

inertia : float or numpy.ndarray

Inertia of the rigid body or point object that is being simulated. If a numpy.ndarray, it specifies independent inertiae for the dimensions and its shape should be (state_space_dim,).

time_unit : float

Time duration over which the action is applied to the system.

target_point : numpy.ndarray

The target point in case move_to_a_point is the reward_function. If make_denser is false, target_radius determines distance from the target point at which the sparse reward is handed out.

action_loss_weight : float

A coefficient to multiply the norm of the action and subtract it from the reward to penalise the action magnitude.
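
A hedged sketch of a continuous-environment config using only the options above (values are illustrative):

    import numpy as np

    continuous_config = {
        "state_space_type": "continuous",
        "state_space_dim": 2,
        "state_space_max": 10.0,
        "action_space_max": 1.0,
        "transition_dynamics_order": 1,
        "inertia": 1.0,
        "time_unit": 1.0,
        "reward_function": "move_to_a_point",
        "target_point": np.array([2.0, 2.0]),
        "make_denser": True,
        "action_loss_weight": 0.0,
        "seed": 0,
    }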

Specific to grid environments:
grid_shape : tuple

Shape of the grid environment. If irrelevant_features is True, this is replicated to add a grid which is irrelevant to the reward.

target_point : numpy.ndarray

The target point in case move_to_a_point is the reward_function. If make_denser is false, reward is only handed out when the target point is reached.

terminal_states : Python function(state) or 1-D numpy.ndarray

Same description as for terminal_states under discrete environments.
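
Similarly, a hedged sketch of a grid-environment config using the options above (values are illustrative):

    import numpy as np

    grid_config = {
        "state_space_type": "grid",
        "grid_shape": (8, 8),
        "reward_function": "move_to_a_point",
        "target_point": np.array([5, 5]),
        "make_denser": True,
        "seed": 0,
    }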

Other important config:
Specific to discrete environments:
repeats_in_sequences : boolean

If true, allows rewardable sequences to have repeating states in them.

maximally_connected : boolean

If true, sets the transition function such that every state in independent set i can transition to every state in independent set i + 1. If false, the next state for each (state, action) pair in independent set i is chosen at random from independent set i + 1, so a given state need not be connected to every state in set i + 1.

reward_every_n_steps : boolean

Hand out rewards only at multiples of sequence_length steps. This makes the probability of an agent executing overlapping rewarding sequences zero, which makes it simpler to evaluate HRL algorithms and whether they can “discretise” time correctly. Noise is added at every step, regardless of this setting. Currently not implemented for the make_denser = true case or for continuous and grid environments.

generate_random_mdp : boolean

If true, automatically generate MDPs when use_custom_mdp = false. Currently, this option doesn’t need to be specified because random MDPs are always generated when use_custom_mdp = false.

Specific to continuous environments:

None as of now.

For all environment types (continuous, discrete and grid):
make_denser : boolean

If true, makes the reward denser in environments. For discrete environments, hands out a partial reward for completing partial sequences. For continuous environments, for reward function move_to_a_point, the base reward handed out is equal to the distance moved towards the target point in the current timestep. For grid envs, the base reward handed out is equal to the Manhattan distance moved towards the target point in the current timestep.

seed : int or dict

Recommended to be passed as an int which generates seeds to be used for the various components of the environment. It is, however, possible to control individual seeds by passing it as a dict. Please see the default initialisation for seeds below to see how to do that.

log_filename : str

The name of the log file to which logs are written.

log_level : logging.LOG_LEVEL option

Python log level for logging.
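
Putting the pieces together, the sketch below shows what instantiation and interaction might look like for an auto-generated discrete environment. The import path follows the module name at the top of this page, the config keys are the ones documented above, and the values are illustrative; see example.py in the repository for the canonical examples:

    from mdp_playground.envs.rl_toy_env import RLToyEnv

    # Illustrative auto-generated discrete environment.
    config = {
        "state_space_type": "discrete",
        "action_space_size": 8,
        "diameter": 1,
        "delay": 1,
        "sequence_length": 3,
        "reward_density": 0.25,
        "make_denser": False,
        "terminal_state_density": 0.25,
        "reward_scale": 1.0,
        "reward_shift": 0.0,
        "seed": 0,
    }

    env = RLToyEnv(**config)
    obs = env.reset()
    for _ in range(20):
        action = env.action_space.sample()
        obs, reward, done, info = env.step(action)  # gym.core.Env 4-tuple API
        if done:
            obs = env.reset()
    env.close()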

Below, we list the important attributes and methods for this class.

config

The config contains all the details required to generate an environment.

Type

dict

seed[source]

Recommended to be set to an int, which automatically sets seeds for the environment and for the relevant, irrelevant, and externally visible observation and action spaces. If fine-grained control over the seeds is necessary, a dict with keys as in the source code further below can be passed.

Type

int or dict

observation_space

The externally visible observation space for the environment.

Type

Gym.Space

action_space

The externally visible action space for the environment.

Type

Gym.Space

rewardable_sequences

Holds the rewardable sequences. The keys are tuples of rewardable sequences and the values are the rewards handed out. When make_denser is True for discrete environments, this dict also holds the rewardable partial sequences.

Type

dict

init_terminal_states()[source]

Initialises terminal states, T

init_init_state_dist()[source]

Initialises initial state distribution, rho_0

init_transition_function()[source]

Initialises transition function, P

init_reward_function()[source]

Initialises reward function, R

transition_function(state, action)[source]

The transition function of the MDP, P.

P(state, action)

Defined as a lambda function in the call to init_transition_function() and is equivalent to calling transition_function().

reward_function(state, action)[source]

The reward function of the MDP, R.

R(state, action)

Defined as a lambda function in the call to init_reward_function() and is equivalent to calling reward_function().

get_augmented_state()[source]

Gets the underlying Markovian state of the MDP.

reset()[source]

Resets environment state

seed()[source]

Sets the seed for the numpy RNG used by the environment (state and action spaces have their own seeds as well)

step(action, imaginary_rollout=False)[source]

Performs 1 transition of the MDP

__init__(**config)[source]

Initialises the MDP to be emulated using the settings provided in config.

Parameters

config (dict) – the member variable config is initialised to this value after inserting defaults

Methods

__init__(**config)

Initialises the MDP to be emulated using the settings provided in config.

close()

Override close in your subclass to perform any necessary cleanup.

get_augmented_state()

Intended to return the full augmented state which would be Markovian.

init_init_state_dist()

Initialises initial state distribution, rho_0, to be uniform over the non-terminal states for discrete environments.

init_reward_function()

Initialises reward function, R by selecting random sequences to be rewardable for discrete environments.

init_terminal_states()

Initialises terminal state set to be the ‘last’ states for discrete environments.

init_transition_function()

Initialises transition function, P by selecting random next states for every (state, action) tuple for discrete environments.

render([mode])

Renders the environment.

reset()

Resets the environment for the beginning of an episode and samples a start state from rho_0.

reward_function(state, action)

The reward function, R.

seed([seed])

Initialises the Numpy RNG for the environment by calling a utility for this in Gym.

step(action[, imaginary_rollout])

The step function for the environment.

transition_function(state, action)

The transition function, P.

Attributes

action_space

metadata

observation_space

reward_range

spec

unwrapped

Completely unwrap this env.

close()

Override close in your subclass to perform any necessary cleanup.

Environments will automatically close() themselves when garbage collected or when the program exits.

get_augmented_state()[source]

Intended to return the full augmented state which would be Markovian. (However, it’s not Markovian w.r.t. the noise in P and R because we’re not returning the underlying RNG.) Currently, returns the augmented state, which is the sequence of length “delay + sequence_length + 1” of past states, for both discrete and continuous environments. Additionally, the current state derivatives are also returned for continuous environments.

Returns

  • dict – Contains the augmented state at the end of the current transition

  • #TODO For noisy processes, this would need the noise distribution and random seed too. Also add the irrelevant state parts, etc.? We don’t need the irrelevant parts for the state to be Markovian.

init_init_state_dist()[source]

Initialises initial state distribution, rho_0, to be uniform over the non-terminal states for discrete environments. For both discrete and continuous environments, the uniform sampling over non-terminal states is taken care of in reset() when setting the initial state for an episode.

init_reward_function()[source]

Initialises reward function, R by selecting random sequences to be rewardable for discrete environments. For continuous environments, we have fixed available options for the reward function.

init_terminal_states()[source]

Initialises terminal state set to be the ‘last’ states for discrete environments. For continuous environments, terminal states will be in a hypercube centred around config[‘terminal_states’] with the edge of the hypercube of length config[‘term_state_edge’].

init_transition_function()[source]

Initialises transition function, P by selecting random next states for every (state, action) tuple for discrete environments. For continuous environments, we have 1 option for the transition function which varies depending on dynamics order and inertia and time_unit for a point object.

render(mode='human')

Renders the environment.

The set of supported modes varies per environment. (And some environments do not support rendering at all.) By convention, if mode is:

  • human: render to the current display or terminal and return nothing. Usually for human consumption.

  • rgb_array: Return an numpy.ndarray with shape (x, y, 3), representing RGB values for an x-by-y pixel image, suitable for turning into a video.

  • ansi: Return a string (str) or StringIO.StringIO containing a terminal-style text representation. The text can include newlines and ANSI escape sequences (e.g. for colors).

Note:

Make sure that your class’s metadata ‘render.modes’ key includes the list of supported modes. It’s recommended to call super() in implementations to use the functionality of this method.

Args:

mode (str): the mode to render with

Example:

    class MyEnv(Env):
        metadata = {'render.modes': ['human', 'rgb_array']}

        def render(self, mode='human'):
            if mode == 'rgb_array':
                return np.array(...)  # return RGB frame suitable for video
            elif mode == 'human':
                ...  # pop up a window and render
            else:
                super(MyEnv, self).render(mode=mode)  # just raise an exception

reset()[source]

Resets the environment for the beginning of an episode and samples a start state from rho_0. For discrete environments uses the defined rho_0 directly. For continuous environments, samples a state and resamples until a non-terminal state is sampled.

Returns

The start state for a new episode.

Return type

int or np.array

reward_function(state, action)[source]

The reward function, R.

Rewards the sequences selected to be rewardable at initialisation for discrete environments. For continuous environments, we have fixed available options for the reward function:

  • move_to_a_point rewards moving to a predefined location. It has sparse and dense settings.

  • move_along_a_line rewards moving along ANY direction in space as long as it’s a fixed direction for sequence_length consecutive steps.

Parameters
  • state (list) – The underlying MDP state (also called augmented state in this code) that the environment uses to calculate its reward. Normally, just the sequence of past states of length delay + sequence_length + 1.

  • action (single action dependent on action space) – Action magnitudes are penalised immediately in the case of continuous spaces and, in effect, play no role for discrete spaces as the reward in that case only depends on sequences of states. We say “in effect” because it _is_ used in case of a custom R to calculate R(s, a) but that is equivalent to using the “next” state s’ as the reward determining criterion in case of deterministic transitions. _Sequences_ of _actions_ are currently NOT used to calculate the reward. Since the underlying MDP dynamics are deterministic, a state and action map 1-to-1 with the next state and so, just a sequence of _states_ should be enough to calculate the reward.

Returns

  • double – The reward at the end of the current transition

  • #TODO Make reward depend on the action sequence too instead of just state sequence, as it is currently?

seed(seed=None)[source]

Initialises the Numpy RNG for the environment by calling a utility for this in Gym.

The environment has its own RNG and so do the state and action spaces held by the environment.

Parameters

seed (int) – seed to initialise the np_random instance held by the environment. Cannot use numpy.int64 or similar because Gym doesn’t accept it.

Returns

The seed returned by Gym

Return type

int

step(action, imaginary_rollout=False)[source]

The step function for the environment.

Parameters
  • action (int or np.array) – The action that the environment will use to perform a transition.

  • imaginary_rollout (boolean) – Option for the user to perform “imaginary” transitions, e.g., for model-based RL. If set to true, the underlying augmented state of the MDP is not changed and the user is responsible for maintaining and providing a list of states to this function to be able to perform a rollout.

Returns

The next state, reward, whether the episode terminated and additional info dict at the end of the current transition

Return type

int or np.array, double, boolean, dict

transition_function(state, action)[source]

The transition function, P.

Performs a transition according to the initialised P for discrete environments (with dynamics independent for the relevant vs irrelevant dimension sub-spaces). For continuous environments, we have one fixed available option for the dynamics (which is the same for relevant and irrelevant dimensions): the order of the system decides the dynamics. For an nth order system, the nth order derivative of the state is set to the action value / inertia for time_unit seconds, and the dynamics are then integrated over the time_unit to obtain the next state.

Parameters
  • state (list) – The state that the environment will use to perform a transition.

  • action (list) – The action that the environment will use to perform a transition.

Returns

The state at the end of the current transition

Return type

int or np.array
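
For intuition, a hedged sketch (not the library's exact implementation) of the first-order point-object dynamics described above, where the first state derivative is set to action / inertia and integrated over time_unit:

    import numpy as np

    def first_order_transition(state, action, inertia=1.0, time_unit=1.0):
        # next_state = state + (action / inertia) * time_unit
        state = np.asarray(state, dtype=float)
        action = np.asarray(action, dtype=float)
        return state + (action / inertia) * time_unit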

property unwrapped

Completely unwrap this env.

Returns:

gym.Env: The base non-wrapped gym.Env instance