diff --git a/doc/source/rllib/key-concepts.rst b/doc/source/rllib/key-concepts.rst
index 9efd1d86a3c9..447b7b5a24fa 100644
--- a/doc/source/rllib/key-concepts.rst
+++ b/doc/source/rllib/key-concepts.rst
@@ -1,4 +1,3 @@
-
 .. include:: /_includes/rllib/we_are_hiring.rst

 .. include:: /_includes/rllib/new_api_stack.rst
@@ -95,53 +94,22 @@ which implements the proximal policy optimization algorithm in RLlib.
     # Train via Ray Tune.
     tune.run("PPO", config=config)

+    # Create rollout workers as Ray actors.
+    workers = [RolloutWorker.remote() for _ in range(num_workers)]
+    # Gather experiences in parallel.
+    trajectories = ray.get([worker.sample.remote() for worker in workers])
-RLlib `Algorithm classes `__ coordinate the distributed workflow of running rollouts and optimizing policies.
-Algorithm classes leverage parallel iterators to implement the desired computation pattern.
-The following figure shows *synchronous sampling*, the simplest of `these patterns `__:
-
-.. figure:: images/a2c-arch.svg
-
-    Synchronous Sampling (e.g., A2C, PG, PPO)
+    # Concatenate the trajectories.
+    batch = concat(trajectories)
-RLlib uses `Ray actors `__ to scale training from a single core to many thousands of cores in a cluster.
-You can `configure the parallelism `__ used for training by changing the ``num_env_runners`` parameter.
-See this `scaling guide `__ for more details here.
-
-
-RL Modules
-----------
-
-`RLModules `__ are framework-specific neural network containers.
-In a nutshell, they carry the neural networks and define how to use them during three phases that occur in
-reinforcement learning: Exploration, inference and training.
-A minimal RL Module can contain a single neural network and define its exploration-, inference- and
-training logic to only map observations to actions. Since RL Modules can map observations to actions, they naturally
-implement reinforcement learning policies in RLlib and can therefore be found in the :py:class:`~ray.rllib.evaluation.rollout_worker.RolloutWorker`,
-where their exploration and inference logic is used to sample from an environment.
-The second place in RLlib where RL Modules commonly occur is the :py:class:`~ray.rllib.core.learner.learner.Learner`,
-where their training logic is used in training the neural network.
-RL Modules extend to the multi-agent case, where a single :py:class:`~ray.rllib.core.rl_module.multi_rl_module.MultiRLModule`
-contains multiple RL Modules. The following figure is a rough sketch of how the above can look in practice:
-
-.. image:: images/rllib-concepts-rlmodules-sketch.png
-
-
-.. note::
-
-    RL Modules are currently in alpha stage. They are wrapped in legacy :py:class:`~ray.rllib.policy.Policy` objects
-    to be used in :py:class:`~ray.rllib.evaluation.rollout_worker.RolloutWorker` for sampling.
-    This should be transparent to the user, but the following
-    `Policy Evaluation `__ section still refers to these legacy Policy objects.
-
-.. policy-evaluation:
-
-Policy Evaluation
------------------
-
-Given an environment and policy, policy evaluation produces `batches `__ of experiences. This is your classic "environment interaction loop". Efficient policy evaluation can be burdensome to get right, especially when leveraging vectorization, RNNs, or when operating in a multi-agent environment. RLlib provides a `RolloutWorker `__ class that manages all of this, and this class is used in most RLlib algorithms.
+    # Learn on the trajectory batch.
+    policy.learn_on_batch(batch)
+    # Broadcast the updated policy weights to the workers.
+    weights = policy.get_weights()
+    ray.get([worker.set_weights.remote(weights) for worker in workers])

 You can use rollout workers standalone to produce batches of experiences. This can be done by calling ``worker.sample()`` on a worker instance, or ``worker.sample.remote()`` in parallel on worker instances created as Ray actors (see `EnvRunnerGroup `__).

 Here is an example of creating a set of rollout workers and using them to gather experiences in parallel. The trajectories are concatenated, the policy learns on the trajectory batch, and then we broadcast the policy weights to the workers for the next round of rollouts:
@@ -203,6 +171,21 @@ serving as a container for the individual agents' sample batches.

 Training Step Method (``Algorithm.training_step()``)
 ----------------------------------------------------

 .. TODO all training_step snippets below must be tested
 .. note::
     It's important to have a good understanding of the basic :ref:`ray core methods ` before reading this section.
@@ -280,6 +263,15 @@ An example implementation of VPG could look like the following:

 Let's further break down our above ``training_step()`` code.
 In the first step, we collect trajectory data from the environment(s):

 .. code-block:: python

     train_batch = synchronous_parallel_sample(
         worker_set=self.env_runner_group,
         max_env_steps=self.config["train_batch_size"]
     )
@@ -363,8 +355,6 @@ By default, in RLlib, we create a set of workers that can be used for sampling a
 We create a :py:class:`~ray.rllib.env.env_runner_group.EnvRunnerGroup` object inside of ``setup`` which is called when an RLlib algorithm is created.
 The :py:class:`~ray.rllib.env.env_runner_group.EnvRunnerGroup` has a ``local_worker`` and ``remote_workers`` if ``num_env_runners > 0`` in the experiment config.
 In RLlib we typically use ``local_worker`` for training and ``remote_workers`` for sampling.
-
-
 :ref:`Train Ops `: These are methods that improve the policy and update workers. The most basic operator,
 ``train_one_step``, takes in as input a batch of experiences and emits a ``ResultDict`` with metrics as output. For training with GPUs, use
@@ -373,4 +363,4 @@ training update.

 :ref:`Replay Buffers `: RLlib provides `a collection `__ of replay
-buffers that can be used for storing and sampling experiences.
+buffers that can be used for storing and sampling experiences.
\ No newline at end of file
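
For orientation, here is a rough, untested sketch of how the pieces described above (sampling via ``synchronous_parallel_sample``, updating via ``train_one_step``, and broadcasting new weights) could be combined in a custom ``training_step()``. The exact import paths and attribute names (for example ``env_runner_group`` versus the older ``workers``) vary between RLlib versions, so treat this as an illustration rather than a drop-in implementation:

.. code-block:: python

    from ray.rllib.algorithms.algorithm import Algorithm
    from ray.rllib.execution.rollout_ops import synchronous_parallel_sample
    from ray.rllib.execution.train_ops import train_one_step


    class MyAlgorithm(Algorithm):
        def training_step(self):
            # 1) Sample a batch of experiences from the (remote) EnvRunners.
            train_batch = synchronous_parallel_sample(
                worker_set=self.env_runner_group,
                max_env_steps=self.config["train_batch_size"],
            )
            # 2) Improve the policy on the collected batch; returns a ResultDict with metrics.
            train_results = train_one_step(self, train_batch)
            # 3) Broadcast the updated weights back to the sampling workers.
            self.env_runner_group.sync_weights()
            return train_results
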
diff --git a/doc/source/rllib/rllib-env.rst b/doc/source/rllib/rllib-env.rst
index b992bbb7e8b7..069a21c4cd42 100644
--- a/doc/source/rllib/rllib-env.rst
+++ b/doc/source/rllib/rllib-env.rst
@@ -102,29 +102,12 @@ Performance
 Also check out the `scaling guide `__ for RLlib training.

 There are two ways to scale experience collection with Gym environments:
-
-    1. **Vectorization within a single process:** Though many envs can achieve high frame rates per core, their throughput is limited in practice by policy evaluation between steps. For example, even small TensorFlow models incur a couple milliseconds of latency to evaluate. This can be worked around by creating multiple envs per process and batching policy evaluations across these envs.
-
-    You can configure ``{"num_envs_per_env_runner": M}`` to have RLlib create ``M`` concurrent environments per worker. RLlib auto-vectorizes Gym environments via `VectorEnv.wrap() `__.
-
-    2. **Distribute across multiple processes:** You can also have RLlib create multiple processes (Ray actors) for experience collection. In most algorithms this can be controlled by setting the ``{"num_env_runners": N}`` config.
-
-.. image:: images/throughput.png
-
-You can also combine vectorization and distributed execution, as shown in the above figure. Here we plot just the throughput of RLlib policy evaluation from 1 to 128 CPUs. PongNoFrameskip-v4 on GPU scales from 2.4k to ∼200k actions/s, and Pendulum-v1 on CPU from 15k to 1.5M actions/s. One machine was used for 1-16 workers, and a Ray cluster of four machines for 32-128 workers. Each worker was configured with ``num_envs_per_env_runner=64``.
-
-Expensive Environments
-~~~~~~~~~~~~~~~~~~~~~~
-
-Some environments may be very resource-intensive to create. RLlib will create ``num_env_runners + 1`` copies of the environment since one copy is needed for the driver process. To avoid paying the extra overhead of the driver copy, which is needed to access the env's action and observation spaces, you can defer environment initialization until ``reset()`` is called.
-
-Vectorized
+Gymnasium
 ----------
-RLlib will auto-vectorize Gym envs for batch evaluation if the ``num_envs_per_env_runner`` config is set, or you can define a custom environment class that subclasses `VectorEnv `__ to implement ``vector_step()`` and ``vector_reset()``.
-
-Note that auto-vectorization only applies to policy inference by default. This means that policy inference will be batched, but your envs will still be stepped one at a time. If you would like your envs to be stepped in parallel, you can set ``"remote_worker_envs": True``. This will create env instances in Ray actors and step them in parallel. These remote processes introduce communication overheads, so this only helps if your env is very expensive to step / reset.
+RLlib uses Gymnasium as its environment interface for single-agent training.
+For more information on how to implement a custom Gymnasium environment, see the `gymnasium.Env class definition `__.
+You may find the `SimpleCorridor `__ example useful as a reference.
+
+Note: Make sure your environment specifications and imports are compatible with Gymnasium 1.0.0. For example, Atari environments now use the ``ale_py`` prefix, such as ``ale_py:ALE/Pong-v5``, instead of the previous ``ALE/Pong-v5``.
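
As a concrete illustration of the Gymnasium interface referenced above, here is a minimal, self-contained sketch of a corridor-style custom env (loosely modeled on the ``SimpleCorridor`` example; the class name and config keys are made up for illustration). Note the Gymnasium-style ``reset()`` returning ``(obs, info)`` and ``step()`` returning a 5-tuple:

.. code-block:: python

    import gymnasium as gym
    import numpy as np


    class SimpleCorridorLikeEnv(gym.Env):
        """Toy corridor env: walk right until the goal position is reached."""

        def __init__(self, config=None):
            # RLlib passes a config dict (EnvContext) to the constructor.
            self.length = (config or {}).get("corridor_length", 5)
            self.observation_space = gym.spaces.Box(0.0, self.length, shape=(1,), dtype=np.float32)
            self.action_space = gym.spaces.Discrete(2)  # 0 = left, 1 = right
            self.pos = 0.0

        def reset(self, *, seed=None, options=None):
            super().reset(seed=seed)
            self.pos = 0.0
            # Gymnasium reset returns (observation, info).
            return np.array([self.pos], dtype=np.float32), {}

        def step(self, action):
            self.pos = max(0.0, self.pos + (1.0 if action == 1 else -1.0))
            terminated = self.pos >= self.length
            truncated = False
            reward = 1.0 if terminated else -0.1
            # Gymnasium step returns the 5-tuple (obs, reward, terminated, truncated, info).
            return np.array([self.pos], dtype=np.float32), reward, terminated, truncated, {}
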
 When using remote envs, you can control the batching level for inference with ``remote_env_batch_wait_ms``.
 The default value of 0ms means envs execute asynchronously and inference is only batched opportunistically. Setting the timeout to a large value will result in fully batched inference and effectively synchronous environment stepping. The optimal value depends on your environment step / reset time, and model inference speed.

 Multi-Agent and Hierarchical
@@ -189,6 +172,22 @@ Here is an example of an env, in which all agents always step simultaneously:

 And another example, where agents step one after the other (turn-based game):

 .. code-block:: python

     # Env, in which two agents step in sequence (turn-based game).
@@ -271,6 +270,13 @@ If all the agents will be using the same algorithm class to train, then you can

 To exclude some policies in your ``multiagent.policies`` dictionary, you can use the ``multiagent.policies_to_train`` setting.
 For example, you may want to have one or more random (non-learning) policies interact with your learning ones:

 .. code-block:: python
@@ -319,6 +325,14 @@ For how to use multiple training methods at once (here DQN and PPO),
 see the `two-algorithm example `__.
 Metrics are reported for each policy separately, for example:

 .. code-block:: bash
    :emphasize-lines: 6,14,22
@@ -374,6 +388,16 @@ PettingZoo Multi-Agent Environments

 A more complete example is here: `rllib_pistonball.py `__

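
To tie the multi-agent pieces above together (multiple policies, a mapping function, and ``policies_to_train``), here is a rough, untested configuration sketch. The env name and policy IDs are hypothetical, and the exact ``policy_mapping_fn`` signature differs between RLlib versions, hence the ``*args, **kwargs``:

.. code-block:: python

    from ray.rllib.algorithms.ppo import PPOConfig
    from ray.rllib.policy.policy import PolicySpec

    config = (
        PPOConfig()
        .environment("my_multi_agent_env")  # hypothetical, previously registered env
        .multi_agent(
            policies={
                "learned": PolicySpec(),   # the policy we actually optimize
                "opponent": PolicySpec(),  # a second policy we leave untrained (e.g., a scripted opponent)
            },
            # Map "player1" to the learned policy, everyone else to the opponent.
            policy_mapping_fn=lambda agent_id, *args, **kwargs: (
                "learned" if agent_id == "player1" else "opponent"
            ),
            # Only the learned policy receives gradient updates.
            policies_to_train=["learned"],
        )
    )
    algo = config.build()
    print(algo.train())
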
 Rock Paper Scissors Example
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -431,6 +455,12 @@ To update the critic, you'll also have to modify the loss of the policy. For an

 Alternatively, you can use an observation function to share observations between agents. In this strategy, each observation includes all global state, and policies use a custom model to ignore state they aren't supposed to "see" when computing actions. The advantage of this approach is that it's very simple and you don't have to change the algorithm at all -- just use the observation func (i.e., like an env wrapper) and custom model. However, it is a bit less principled in that you have to change the agent observation spaces to include training-time only information. You can find a runnable example of this strategy at `examples/centralized_critic_2.py `__.

 Grouping Agents
 ~~~~~~~~~~~~~~~
@@ -485,6 +515,15 @@ External Agents and Applications

 In many situations, it does not make sense for an environment to be "stepped" by RLlib. For example, if a policy is to be used in a web serving system, then it is more natural for an agent to query a service that serves policy decisions, and for that service to learn from experience over time. This case also naturally arises with **external simulators** (e.g. Unity3D, other game engines, or the Gazebo robotics simulator) that run independently outside the control of RLlib, but may still want to leverage RLlib for training.

 .. figure:: images/rllib-training-inside-a-unity3d-env.png
    :scale: 75 %
@@ -541,6 +580,23 @@ In remote inference mode, each computed action requires a network call to the server.

 Example:

 .. code-block:: python

     client = PolicyClient("http://localhost:9900", inference_mode="local")
@@ -601,4 +657,4 @@ For more complex / high-performance environment integrations, you can instead ex
 `BaseEnv `__ class. This low-level API models multiple agents executing asynchronously in multiple environments.
 A call to ``BaseEnv:poll()`` returns observations from ready agents keyed by 1) their environment, then 2) agent ids.
-Actions for those agents are sent back via ``BaseEnv:send_actions()``. BaseEnv is used to implement all the other env types in RLlib, so it offers a superset of their functionality.
+Actions for those agents are sent back via ``BaseEnv:send_actions()``. BaseEnv is used to implement all the other env types in RLlib, so it offers a superset of their functionality.
\ No newline at end of file
diff --git a/doc/source/rllib/rllib-models.rst b/doc/source/rllib/rllib-models.rst
index 717c6bb196c6..0363fb459131 100644
--- a/doc/source/rllib/rllib-models.rst
+++ b/doc/source/rllib/rllib-models.rst
@@ -72,6 +72,22 @@ These include options for the ``FullyConnectedNetworks`` (``fcnet_hiddens`` and
 ``VisionNetworks`` (``conv_filters`` and ``conv_activation``), auto-RNN wrapping, auto-Attention (`GTrXL `__) wrapping,
 and some special options for Atari environments:

 .. literalinclude:: ../../../rllib/models/catalog.py
    :language: python
    :start-after: __sphinx_doc_begin__
    :end-before: __sphinx_doc_end__
@@ -120,6 +136,10 @@ X=last Conv2D layer's number of filters, so that RLlib can flatten it. An inform

 .. _auto_lstm_and_attention:

 Built-in auto-LSTM, and auto-Attention Wrappers
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -186,6 +206,24 @@ For example, for manipulating your env's observations or rewards, do:

         return np.clip(reward, self.min, self.max)

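
To make the reward-manipulation snippet above concrete, here is a small, self-contained sketch (not taken from the RLlib docs) of a Gymnasium reward-clipping wrapper in the same spirit; the clip bounds and env are arbitrary:

.. code-block:: python

    import gymnasium as gym
    import numpy as np


    class ClipRewardWrapper(gym.RewardWrapper):
        """Clips rewards from the wrapped env into [min_, max_]."""

        def __init__(self, env, min_=-1.0, max_=1.0):
            super().__init__(env)
            self.min = min_
            self.max = max_

        # Override `reward` to post-process the original reward coming from the env.
        def reward(self, reward):
            return float(np.clip(reward, self.min, self.max))


    # Usage: wrap any Gymnasium env before handing it to RLlib (e.g., via an env creator).
    env = ClipRewardWrapper(gym.make("CartPole-v1"), min_=-1.0, max_=1.0)
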
 Custom Models: Implementing your own Forward Logic
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -257,6 +295,7 @@ Usually, the dict contains only the current observation ``obs`` and an ``is_trai
 states (in case of RNNs or attention nets). You can also override extra methods of the model such as
 ``value_function`` to implement a custom value branch.

 Additional supervised/self-supervised losses can be added via the ``TorchModelV2.custom_loss`` method:
 See these examples of `fully connected `__, `convolutional `__, and `recurrent `__ torch models.
@@ -324,6 +363,17 @@ You can check out the `rnn_model.py `__ models as examples to implement
+your own (either TF or Torch).
+
+
 .. _attention:

 Implementing custom Attention Networks
@@ -390,6 +440,20 @@ Take a look at this model example that does exactly that:
     :end-before: __sphinx_doc_end__

 **Using the Trajectory View API: Passing in the last n actions (or rewards or observations) as inputs to a custom Model**

 It is sometimes helpful for learning not only to look at the current observation
@@ -464,26 +528,61 @@ You can also use the ``custom_loss()`` API to add in self-supervised losses such

 Variable-length / Complex Observation Spaces
 --------------------------------------------

+.. code-block:: python
+
+    class ParametricActionsModel(TFModelV2):
+        def __init__(self, obs_space, action_space, num_outputs, model_config, name):
+            super(ParametricActionsModel, self).__init__(obs_space, action_space, num_outputs, model_config, name)
+            self.action_embed_model = FullyConnectedNetwork(
+                Box(-1, 1, shape=(action_embedding_sz, )),
+                action_space,
+                action_embedding_sz,
+                model_config,
+                name + "_action_embed")
+            self.action_logits_model = FullyConnectedNetwork(
+                obs_space["real_obs"],
+                action_space,
+                num_outputs,
+                model_config,
+                name + "_action_logits")
+
+        def forward(self, input_dict, state, seq_lens):
+            action_mask = input_dict["obs"]["action_mask"]
+            avail_actions = input_dict["obs"]["avail_actions"]
+            real_obs = input_dict["obs"]["real_obs"]
+
+            action_embed = self.action_embed_model({"obs": avail_actions})
+            action_logits = self.action_logits_model({"obs": real_obs})
+
+            # Compute dot product between action embeddings and logits.
+            action_logits = tf.reduce_sum(action_logits * action_embed, axis=-1)
+
+            # Mask out invalid actions.
+            inf_mask = tf.maximum(tf.log(action_mask), tf.float32.min)
+            return action_logits + inf_mask, state
+
+Finally, the custom model can be registered and used with an RLlib algorithm:
-RLlib supports complex and variable-length observation spaces, including ``gym.spaces.Tuple``, ``gym.spaces.Dict``, and ``rllib.utils.spaces.Repeated``. The handling of these spaces is transparent to the user. RLlib internally will insert preprocessors to insert padding for repeated elements, flatten complex observations into a fixed-size vector during transit, and unpack the vector into the structured tensor before sending it to the model. The flattened observation is available to the model as ``input_dict["obs_flat"]``, and the unpacked observation as ``input_dict["obs"]``.
-
-To enable batching of struct observations, RLlib unpacks them in a `StructTensor-like format `__. In summary, repeated fields are "pushed down" and become the outer dimensions of tensor batches, as illustrated in this figure from the StructTensor RFC.
-
-.. image:: images/struct-tensor.png
+.. code-block:: python
-
-For further information about complex observation spaces, see:
-  * A custom environment and model that uses `repeated struct fields `__.
-  * The pydoc of the `Repeated space `__.
-  * The pydoc of the batched `repeated values tensor `__.
-  * The `unit tests `__ for Tuple and Dict spaces.
+
+    ModelCatalog.register_custom_model("pa_model", ParametricActionsModel)
+
+    config = {
+        "env": MyParamActionEnv,
+        "model": {
+            "custom_model": "pa_model",
+        },
+        ...
+    }
+    algo = PPO(config=config)
+    algo.train()
-
-Variable-length / Parametric Action Spaces
-------------------------------------------
+
+Note: The above example uses TensorFlow, but the same logic applies to PyTorch models.
-
-Custom models can be used to work with environments where (1) the set of valid actions `varies per step `__, and/or (2) the number of valid actions is `very large `__. The general idea is that the meaning of actions can be completely conditioned on the observation, i.e., the ``a`` in ``Q(s, a)`` becomes just a token in ``[0, MAX_AVAIL_ACTIONS)`` that only has meaning in the context of ``s``. This works with algorithms in the `DQN and policy-gradient families `__ and can be implemented as follows:
+
+Gymnasium Compatibility
+-----------------------
-
-1. The environment should return a mask and/or list of valid action embeddings as part of the observation for each step. To enable batching, the number of actions can be allowed to vary from 1 to some max number:
+
+RLlib now supports environments and models using the ``gymnasium`` library, a fork of the original ``gym`` library, including the latest ``gymnasium`` version 1.0.0.
+Make sure your custom environments and models are compatible with ``gymnasium`` by updating import statements and using the new API where necessary. For example, replace ``import gym`` with ``import gymnasium as gym`` and update any environment registration or creation logic accordingly.
+For more information on transitioning to ``gymnasium``, refer to the ``gymnasium`` documentation and migration guides.

 .. code-block:: python

     class MyParamActionEnv(gym.Env):
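
For intuition, the observation space of an env like ``MyParamActionEnv`` above typically carries the action mask (and, optionally, per-action embeddings) alongside the actual observation, matching the keys used in ``ParametricActionsModel.forward()``. The following is a rough sketch under assumed sizes; the field shapes are illustrative and not taken from the RLlib example:

.. code-block:: python

    import numpy as np
    from gymnasium.spaces import Box, Dict, Discrete

    MAX_AVAIL_ACTIONS = 10
    EMBEDDING_SIZE = 16

    action_space = Discrete(MAX_AVAIL_ACTIONS)
    observation_space = Dict({
        # 1.0 where an action is currently valid, 0.0 otherwise.
        "action_mask": Box(0.0, 1.0, shape=(MAX_AVAIL_ACTIONS,), dtype=np.float32),
        # Optional: one embedding vector per available action (for the dot-product trick above).
        "avail_actions": Box(-1.0, 1.0, shape=(MAX_AVAIL_ACTIONS, EMBEDDING_SIZE), dtype=np.float32),
        # The "real" underlying observation of the env.
        "real_obs": Box(-np.inf, np.inf, shape=(4,), dtype=np.float32),
    })
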
@@ -550,6 +649,10 @@ Note that since masking introduces ``tf.float32.min`` values into the model outp

 Autoregressive Action Distributions
 -----------------------------------

 In an action space with multiple components (e.g., ``Tuple(a1, a2)``), you might want ``a2`` to be conditioned on the sampled value of ``a1``, i.e., ``a2_sampled ~ P(a2 | a1_sampled, obs)``. Normally, ``a1`` and ``a2`` would be sampled independently, reducing the expressivity of the policy.

 To do this, you need both a custom model that implements the autoregressive pattern, and a custom action distribution class that leverages that model. The `autoregressive_action_dist.py `__ example shows how this can be implemented for a simple binary action space. For a more complex space, a more efficient architecture such as a `MADE `__ is recommended. Note that sampling an `N-part` action requires `N` forward passes through the model, however computing the log probability of an action can be done in one pass:
@@ -627,6 +730,27 @@ To do this, you need both a custom model that implements the autoregressive patt
             name="a1_logits",
             activation=None,
             kernel_initializer=normc_initializer(0.01))(ctx_input)

 # P(a2 | a1)
 # --note: typically you'd want to implement P(a2 | a1, obs) as follows:
@@ -659,4 +783,4 @@ To do this, you need both a custom model that implements the autoregressive patt

 .. note::

-    Not all algorithms support autoregressive action distributions; see the `algorithm overview table `__ for more information.
+    Not all algorithms support autoregressive action distributions; see the `algorithm overview table `__ for more information.
\ No newline at end of file
diff --git a/doc/source/rllib/single-agent-episode.rst b/doc/source/rllib/single-agent-episode.rst
index 87fa50790d45..dda8fbaf899c 100644
--- a/doc/source/rllib/single-agent-episode.rst
+++ b/doc/source/rllib/single-agent-episode.rst
@@ -77,6 +77,23 @@ The :py:class:`~ray.rllib.env.single_agent_episode.SingleAgentEpisode` construct
 APIs exposed to the user.

+.. note::
+
+    The ``SingleAgentEpisode`` class now includes an ``is_reset`` property, which returns ``True`` if the
+    ``add_env_reset()`` method has been called. This can be useful for checking whether an episode has
+    been initialized with a reset observation.
+
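
To illustrate the ``is_reset`` property mentioned in the note above, here is a rough, version-dependent sketch; the argument names of the episode methods may differ slightly across RLlib releases:

.. code-block:: python

    import gymnasium as gym
    from ray.rllib.env.single_agent_episode import SingleAgentEpisode

    env = gym.make("CartPole-v1")
    obs, info = env.reset()

    episode = SingleAgentEpisode()
    print(episode.is_reset)  # False: no reset observation has been logged yet.

    # Log the reset observation; afterwards the episode counts as "reset".
    episode.add_env_reset(observation=obs, infos=info)
    print(episode.is_reset)  # True
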
 Using the getter APIs of SingleAgentEpisode
 -------------------------------------------
@@ -151,6 +168,25 @@ episodes (one non-finalized the other finalized):

 Episode.cut() and lookback buffers
 ----------------------------------
@@ -215,4 +251,4 @@ while looking back a certain amount of timesteps from each of these global times

 .. literalinclude:: doc_code/sa_episode.py
     :language: python
     :start-after: rllib-sa-episode-06-begin
-    :end-before: rllib-sa-episode-06-end
+    :end-before: rllib-sa-episode-06-end
\ No newline at end of file
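
Closing with a rough illustration of the getter APIs that the snippet above refers to (method names follow the RLlib docs, but exact signatures may vary between versions): an episode's recent history can be read back with negative indices or slices.

.. code-block:: python

    import gymnasium as gym
    from ray.rllib.env.single_agent_episode import SingleAgentEpisode

    env = gym.make("CartPole-v1")
    obs, info = env.reset()
    episode = SingleAgentEpisode()
    episode.add_env_reset(observation=obs)

    # Step the env a few times and record each transition in the episode.
    for _ in range(5):
        action = env.action_space.sample()
        obs, reward, terminated, truncated, info = env.step(action)
        episode.add_env_step(
            observation=obs, action=action, reward=reward,
            terminated=terminated, truncated=truncated,
        )
        if terminated or truncated:
            break

    # Read back the most recent data through the getter APIs.
    last_obs = episode.get_observations(-1)
    last_rewards = episode.get_rewards(slice(-3, None))
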