From 8bf50740d080023a8abadab85d84edcbf8a8aadf Mon Sep 17 00:00:00 2001 From: "promptless[bot]" <179508745+promptless[bot]@users.noreply.github.com> Date: Tue, 5 Nov 2024 00:48:53 +0000 Subject: [PATCH 1/2] Docs update (a37083b) --- doc/source/rllib/rllib-advanced-api.rst | 116 ++++++++++++--------- doc/source/rllib/rllib-env.rst | 118 ++++++++++++++++------ doc/source/rllib/rllib-examples.rst | 59 ++++++++++- doc/source/rllib/single-agent-episode.rst | 36 ++++++- 4 files changed, 243 insertions(+), 86 deletions(-) diff --git a/doc/source/rllib/rllib-advanced-api.rst b/doc/source/rllib/rllib-advanced-api.rst index 96daf9d06d20..4c26dde5030b 100644 --- a/doc/source/rllib/rllib-advanced-api.rst +++ b/doc/source/rllib/rllib-advanced-api.rst @@ -1,4 +1,3 @@ - .. include:: /_includes/rllib/new_api_stack.rst .. _rllib-advanced-api-doc: @@ -97,6 +96,24 @@ used implement communication patterns such as parameter servers and all-reduce. Callbacks and Custom Metrics ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +You can provide callbacks to be called at points during policy evaluation. +These callbacks have access to state for the current +`episode `__. +Certain callbacks such as ``on_postprocess_trajectory``, ``on_sample_end``, +and ``on_train_result`` are also places where custom postprocessing can be applied to +intermediate data or results. +``` +.. literalinclude:: ./doc_code/advanced_api.py + :language: python + :start-after: __rllib-adv_api_counter_begin__ + :end-before: __rllib-adv_api_counter_end__ + +Ray actors provide high levels of performance, so in more complex cases they can be +used implement communication patterns such as parameter servers and all-reduce. + +Callbacks and Custom Metrics +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + You can provide callbacks to be called at points during policy evaluation. These callbacks have access to state for the current `episode `__. @@ -176,6 +193,18 @@ that currently use these by default: .. View table below at: https://docs.google.com/drawings/d/1dEMhosbu7HVgHEwGBuMlEDyPiwjqp_g6bZ0DzCMaoUM/edit?usp=sharing .. image:: images/rllib-exploration-api-table.svg +An Exploration class implements the ``get_exploration_action`` method, +in which the exact exploratory behavior is defined. +It takes the model’s output, the action distribution class, the model itself, +a timestep (the global env-sampling steps already taken), +and an ``explore`` switch and outputs a tuple of a) action and +b) log-likelihood: +The following table lists all built-in Exploration sub-classes and the agents +that currently use these by default: + +.. View table below at: https://docs.google.com/drawings/d/1dEMhosbu7HVgHEwGBuMlEDyPiwjqp_g6bZ0DzCMaoUM/edit?usp=sharing +.. image:: images/rllib-exploration-api-table.svg + An Exploration class implements the ``get_exploration_action`` method, in which the exact exploratory behavior is defined. It takes the model’s output, the action distribution class, the model itself, @@ -259,61 +288,33 @@ rewards with different settings (for example, with exploration turned off, or on of environment configurations). You can activate evaluating policies during training (``Algorithm.train()``) by setting the ``evaluation_interval`` to an int value (> 0) indicating every how many ``Algorithm.train()`` calls an "evaluation step" should be run: - -.. 
literalinclude:: ./doc_code/advanced_api.py - :language: python - :start-after: __rllib-adv_api_evaluation_1_begin__ - :end-before: __rllib-adv_api_evaluation_1_end__ - -An evaluation step runs - using its own ``EnvRunner`` instances - for ``evaluation_duration`` -episodes or time-steps, depending on the ``evaluation_duration_unit`` setting, which can take values -of either ``"episodes"`` (default) or ``"timesteps"``. - -.. literalinclude:: ./doc_code/advanced_api.py + .. literalinclude:: ./doc_code/advanced_api.py :language: python - :start-after: __rllib-adv_api_evaluation_2_begin__ - :end-before: __rllib-adv_api_evaluation_2_end__ + :start-after: __rllib-adv_api_evaluation_4_begin__ + :end-before: __rllib-adv_api_evaluation_4_end__ -Note: When using ``evaluation_duration_unit=timesteps`` and your ``evaluation_duration`` -setting isn't divisible by the number of evaluation workers (configurable with -``evaluation_num_env_runners``), RLlib rounds up the number of time-steps specified to -the nearest whole number of time-steps that is divisible by the number of evaluation -workers. -Also, when using ``evaluation_duration_unit=episodes`` and your -``evaluation_duration`` setting isn't divisible by the number of evaluation workers -(configurable with ``evaluation_num_env_runners``), RLlib runs the remainder of episodes -on the first n evaluation EnvRunners and leave the remaining workers idle for that time. +Note: The evaluation step will not be exactly the same length as the training step, but +it will be close. This is useful for ensuring that evaluation does not become a bottleneck +when running in parallel. -For example: +You can also customize the evaluation process by providing a custom evaluation function +via the ``custom_eval_function`` config key. This function takes the current Algorithm +instance and the evaluation workers as arguments and can return a custom evaluation result +dict. This allows for more complex evaluation logic, such as evaluating on a different +set of environments or using a different set of metrics. .. literalinclude:: ./doc_code/advanced_api.py :language: python - :start-after: __rllib-adv_api_evaluation_3_begin__ - :end-before: __rllib-adv_api_evaluation_3_end__ - -Before each evaluation step, weights from the main model are synchronized -to all evaluation workers. - -By default, the evaluation step (if there is one in the current iteration) is run -right **after** the respective training step. -For example, for ``evaluation_interval=1``, the sequence of events is: -``train(0->1), eval(1), train(1->2), eval(2), train(2->3), ...``. -Here, the indices show the version of neural network weights used. -``train(0->1)`` is an update step that changes the weights from version 0 to -version 1 and ``eval(1)`` then uses weights version 1. -Weights index 0 represents the randomly initialized weights of the neural network. - -Another example: For ``evaluation_interval=2``, the sequence is: -``train(0->1), train(1->2), eval(2), train(2->3), train(3->4), eval(4), ...``. - -Instead of running ``train``- and ``eval``-steps in sequence, it is also possible to -run them in parallel with the ``evaluation_parallel_to_training=True`` config setting. -In this case, both training- and evaluation steps are run at the same time using multi-threading. -This can speed up the evaluation process significantly, but leads to a 1-iteration -delay between reported training- and evaluation results. 
-The evaluation results are behind in this case b/c they use slightly outdated -model weights (synchronized after the previous training step). + :start-after: __rllib-adv_api_evaluation_5_begin__ + :end-before: __rllib-adv_api_evaluation_5_end__ + +In this example, the custom evaluation function sets different corridor lengths for +different evaluation workers, allowing for evaluation on different environment configurations. +The evaluation results are then logged alongside the training results. +For more information on customizing the evaluation process, see the documentation for +the ``custom_eval_function`` config key in the RLlib API reference. +``` For example, for ``evaluation_parallel_to_training=True`` and ``evaluation_interval=1``, the sequence is now: ``train(0->1) + eval(0), train(1->2) + eval(1), train(2->3) + eval(2)``, @@ -392,6 +393,21 @@ of training iterations before stopping. Below are some examples of how the custom evaluation metrics are reported nested under the ``evaluation`` key of normal training results: +.. TODO make sure these outputs are still valid. +.. code-block:: bash + + ------------------------------------------------------------------------ + Sample output for `python custom_evaluation.py --no-custom-eval` + ------------------------------------------------------------------------ +There is also an end-to-end example of how to set up a custom online evaluation in +`custom_evaluation.py `__. +Note that if you only want to evaluate your policy at the end of training, +you can set ``evaluation_interval: [int]``, where ``[int]`` should be the number +of training iterations before stopping. + +Below are some examples of how the custom evaluation metrics are reported nested under +the ``evaluation`` key of normal training results: + .. TODO make sure these outputs are still valid. .. code-block:: bash @@ -471,4 +487,4 @@ using the following call: rollout_worker = get_global_worker() Policy losses are defined over the ``post_batch`` data, so you can mutate that in -the callbacks to change what data the policy loss function sees. +the callbacks to change what data the policy loss function sees. \ No newline at end of file diff --git a/doc/source/rllib/rllib-env.rst b/doc/source/rllib/rllib-env.rst index b992bbb7e8b7..d0e6b07c55a9 100644 --- a/doc/source/rllib/rllib-env.rst +++ b/doc/source/rllib/rllib-env.rst @@ -20,7 +20,7 @@ RLlib works with several different types of environments, including `Farama-Foun Configuring Environments ------------------------ -You can pass either a string name or a Python class to specify an environment. By default, strings will be interpreted as a gym `environment name `__. +You can pass either a string name or a Python class to specify an environment. By default, strings will be interpreted as a gymnasium `environment name `__. Custom env classes passed directly to the algorithm must take a single ``env_config`` parameter in their constructor: .. code-block:: python @@ -97,6 +97,23 @@ RLlib uses Gymnasium as its environment interface for single-agent training. For Performance ~~~~~~~~~~~ +.. tip:: + + Also check out the `scaling guide `__ for RLlib training. + +There are two ways to scale experience collection with Gymnasium environments: +.. tip:: + + When using logging in an environment, the logging configuration needs to be done inside the environment, which runs inside Ray workers. Any configurations outside the environment, e.g., before starting Ray will be ignored. 
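To make the tip above concrete, here is a minimal, hypothetical sketch (the environment, logger name, and log level are illustrative only and not part of RLlib's API): the logging configuration goes into the environment's constructor, so it runs inside the Ray worker process that actually steps the env.

.. code-block:: python

    import logging

    import gymnasium as gym
    import numpy as np


    class LoggingCorridorEnv(gym.Env):
        """Toy env that configures its own logging inside the worker process."""

        def __init__(self, config=None):
            # Configure logging here, inside the env: this constructor runs in
            # the Ray worker process, so the configuration actually takes effect.
            logging.basicConfig(level=logging.INFO)
            self.logger = logging.getLogger("LoggingCorridorEnv")

            self.observation_space = gym.spaces.Box(0.0, 10.0, shape=(1,), dtype=np.float32)
            self.action_space = gym.spaces.Discrete(2)
            self.pos = 0.0

        def reset(self, *, seed=None, options=None):
            super().reset(seed=seed)
            self.pos = 0.0
            self.logger.info("Episode reset.")
            return np.array([self.pos], dtype=np.float32), {}

        def step(self, action):
            self.pos += float(action)  # Action 1 moves right, 0 stays put.
            terminated = self.pos >= 10.0
            self.logger.info("Stepped to position %s", self.pos)
            reward = 1.0 if terminated else -0.1
            return np.array([self.pos], dtype=np.float32), reward, terminated, False, {}

Any logging configured in the driver script before Ray starts only affects the driver process, which is why the call lives inside the env itself.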
+ +Gymnasium +---------- + +RLlib uses Gymnasium as its environment interface for single-agent training. For more information on how to implement a custom Gymnasium environment, see the `gymnasium.Env class definition `__. You may find the `SimpleCorridor `__ example useful as a reference. + +Performance +~~~~~~~~~~~ + .. tip:: Also check out the `scaling guide `__ for RLlib training. @@ -132,6 +149,15 @@ Multi-Agent and Hierarchical In a multi-agent environment, there are more than one "agent" acting simultaneously, in a turn-based fashion, or in a combination of these two. +For example, in a traffic simulation, there may be multiple "car" and "traffic light" agents in the environment, +acting simultaneously. Whereas in a board game, you may have two or more agents acting in a turn-base fashion. +When using remote envs, you can control the batching level for inference with ``remote_env_batch_wait_ms``. The default value of 0ms means envs execute asynchronously and inference is only batched opportunistically. Setting the timeout to a large value will result in fully batched inference and effectively synchronous environment stepping. The optimal value depends on your environment step / reset time, and model inference speed. + +Multi-Agent and Hierarchical +---------------------------- + +In a multi-agent environment, there are more than one "agent" acting simultaneously, in a turn-based fashion, or in a combination of these two. + For example, in a traffic simulation, there may be multiple "car" and "traffic light" agents in the environment, acting simultaneously. Whereas in a board game, you may have two or more agents acting in a turn-base fashion. @@ -203,37 +229,6 @@ And another example, where agents step one after the other (turn-based game): # ... { # ... "player1": [[...]], # ... } - - # In the following call to `step`, only those agents' actions should be - # provided that were present in the returned obs dict: - new_obs, rewards, dones, infos = env.step(actions={"player1": ...}) - - # Similarly, new_obs, rewards, dones, etc. also become dicts. - # Note that only in the `rewards` dict, any agent may be listed (even those that have - # not(!) acted in the `step()` call). Rewards for individual agents will be added - # up to the point where a new action for that agent is needed. This way, you may - # implement a turn-based 2-player game, in which player-2's reward is published - # in the `rewards` dict immediately after player-1 has acted. - print(rewards) - # ... {"player1": 0, "player2": 0} - - # Individual agents can early exit; The entire episode is done when - # dones["__all__"] = True. - print(dones) - # ... {"player1": False, "__all__": False} - - # In the next step, it's player2's turn. Therefore, `new_obs` only container - # this agent's ID: - print(new_obs) - # ... { - # ... "player2": [[...]] - # ... } - - -If all the agents will be using the same algorithm class to train, then you can setup multi-agent training as follows: - -.. code-block:: python - algo = pg.PGAgent(env="my_multiagent_env", config={ "multiagent": { "policies": { @@ -271,6 +266,13 @@ If all the agents will be using the same algorithm class to train, then you can To exclude some policies in your ``multiagent.policies`` dictionary, you can use the ``multiagent.policies_to_train`` setting. For example, you may want to have one or more random (non learning) policies interact with your learning ones: +.. 
code-block:: python +while True: + print(algo.train()) + +To exclude some policies in your ``multiagent.policies`` dictionary, you can use the ``multiagent.policies_to_train`` setting. +For example, you may want to have one or more random (non learning) policies interact with your learning ones: + .. code-block:: python @@ -319,6 +321,14 @@ For how to use multiple training methods at once (here DQN and PPO), see the `two-algorithm example `__. Metrics are reported for each policy separately, for example: +.. code-block:: bash + :emphasize-lines: 6,14,22 +Here is a simple `example training script `__ +in which you can vary the number of agents and policies in the environment. +For how to use multiple training methods at once (here DQN and PPO), +see the `two-algorithm example `__. +Metrics are reported for each policy separately, for example: + .. code-block:: bash :emphasize-lines: 6,14,22 @@ -374,6 +384,16 @@ PettingZoo Multi-Agent Environments A more complete example is here: `rllib_pistonball.py `__ +Rock Paper Scissors Example +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The `rock_paper_scissors_heuristic_vs_learned.py `__ +and `rock_paper_scissors_learned_vs_learned.py `__ examples demonstrate several types of policies competing against each other: heuristic policies of repeating the same move, beating the last opponent move, and learned LSTM and feedforward policies. + +.. figure:: images/rock-paper-scissors.png +A more complete example is here: `rllib_pistonball.py `__ + + Rock Paper Scissors Example ~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -431,6 +451,12 @@ To update the critic, you'll also have to modify the loss of the policy. For an Alternatively, you can use an observation function to share observations between agents. In this strategy, each observation includes all global state, and policies use a custom model to ignore state they aren't supposed to "see" when computing actions. The advantage of this approach is that it's very simple and you don't have to change the algorithm at all -- just use the observation func (i.e., like an env wrapper) and custom model. However, it is a bit less principled in that you have to change the agent observation spaces to include training-time only information. You can find a runnable example of this strategy at `examples/centralized_critic_2.py `__. +Grouping Agents +~~~~~~~~~~~~~~~ +**Strategy 2: Sharing observations through an observation function**: + +Alternatively, you can use an observation function to share observations between agents. In this strategy, each observation includes all global state, and policies use a custom model to ignore state they aren't supposed to "see" when computing actions. The advantage of this approach is that it's very simple and you don't have to change the algorithm at all -- just use the observation func (i.e., like an env wrapper) and custom model. However, it is a bit less principled in that you have to change the agent observation spaces to include training-time only information. You can find a runnable example of this strategy at `examples/centralized_critic_2.py `__. + Grouping Agents ~~~~~~~~~~~~~~~ @@ -485,6 +511,15 @@ External Agents and Applications In many situations, it does not make sense for an environment to be "stepped" by RLlib. For example, if a policy is to be used in a web serving system, then it is more natural for an agent to query a service that serves policy decisions, and for that service to learn from experience over time. This case also naturally arises with **external simulators** (e.g. 
Unity3D, other game engines, or the Gazebo robotics simulator) that run independently outside the control of RLlib, but may still want to leverage RLlib for training. +.. figure:: images/rllib-training-inside-a-unity3d-env.png + :scale: 75 % +See this file for a runnable example: `hierarchical_training.py `__. + +External Agents and Applications +-------------------------------- + +In many situations, it does not make sense for an environment to be "stepped" by RLlib. For example, if a policy is to be used in a web serving system, then it is more natural for an agent to query a service that serves policy decisions, and for that service to learn from experience over time. This case also naturally arises with **external simulators** (e.g. Unity3D, other game engines, or the Gazebo robotics simulator) that run independently outside the control of RLlib, but may still want to leverage RLlib for training. + .. figure:: images/rllib-training-inside-a-unity3d-env.png :scale: 75 % @@ -541,6 +576,23 @@ In remote inference mode, each computed action requires a network call to the se Example: +.. code-block:: python + + client = PolicyClient("http://localhost:9900", inference_mode="local") + episode_id = client.start_episode() + ... + action = client.get_action(episode_id, cur_obs) + ... + client.end_episode(episode_id, last_obs) + +To understand the difference between standard envs, external envs, and connecting with a ``PolicyClient``, refer to the following figure: +Clients can then connect in either *local* or *remote* inference mode. +In local inference mode, copies of the policy are downloaded from the server and cached on the client for a configurable period of time. +This allows actions to be computed by the client without requiring a network round trip each time. +In remote inference mode, each computed action requires a network call to the server. + +Example: + .. code-block:: python client = PolicyClient("http://localhost:9900", inference_mode="local") @@ -601,4 +653,4 @@ For more complex / high-performance environment integrations, you can instead ex `BaseEnv `__ class. This low-level API models multiple agents executing asynchronously in multiple environments. A call to ``BaseEnv:poll()`` returns observations from ready agents keyed by 1) their environment, then 2) agent ids. -Actions for those agents are sent back via ``BaseEnv:send_actions()``. BaseEnv is used to implement all the other env types in RLlib, so it offers a superset of their functionality. +Actions for those agents are sent back via ``BaseEnv:send_actions()``. BaseEnv is used to implement all the other env types in RLlib, so it offers a superset of their functionality. \ No newline at end of file diff --git a/doc/source/rllib/rllib-examples.rst b/doc/source/rllib/rllib-examples.rst index 616290b6bdd8..77bf574c8232 100644 --- a/doc/source/rllib/rllib-examples.rst +++ b/doc/source/rllib/rllib-examples.rst @@ -82,6 +82,17 @@ Connectors (as opposed to ``Connector``, which only continue to work on the old API stack |old_stack|). +- |new_stack| `How to frame-stack Atari image observations `__: + An example using Atari framestacking in a very efficient manner, not in the environment itself (as a `gym.Wrapper`), + but by stacking the observations on-the-fly using `EnvToModule` and `LearnerConnector` pipelines. + This method of framestacking is more efficient as it avoids having to send large observation + tensors through the network (ray). +.. 
note:: + RLlib's Connector API has been re-written from scratch for the new API stack (|new_stack|). + Connector-pieces and -pipelines are now referred to as :py:class:`~ray.rllib.connectors.connector_v2.ConnectorV2` + (as opposed to ``Connector``, which only continue to work on the old API stack |old_stack|). + + - |new_stack| `How to frame-stack Atari image observations `__: An example using Atari framestacking in a very efficient manner, not in the environment itself (as a `gym.Wrapper`), but by stacking the observations on-the-fly using `EnvToModule` and `LearnerConnector` pipelines. @@ -130,6 +141,13 @@ Environments - |new_stack| `How to set up rendering (and recording) of the environment trajectories during training with WandB `__: Example showing how you can render and record episode trajectories of your gymnasium envs and log the videos to WandB. +Environments +------------ +- |new_stack| `How to register a custom gymnasium environment `__: + Example showing how to write your own RL environment using ``gymnasium`` and register it to run train your algorithm against this env with RLlib. + +- |new_stack| `How to set up rendering (and recording) of the environment trajectories during training with WandB `__: + Example showing how you can render and record episode trajectories of your gymnasium envs and log the videos to WandB. Note: The environment names have been updated to use the `ale_py` prefix for Atari environments. - |old_stack| `How to run a Unity3D multi-agent environment locally `__: Example of how to setup an RLlib Algorithm against a locally running Unity3D editor instance to @@ -170,6 +188,23 @@ GPU (for Training and Sampling) Hierarchical Training --------------------- +- |old_stack| `How to setup hierarchical training `__: + Example of hierarchical training using the multi-agent API. + +Inference (of Models/Policies) +------------------------------ +GPU (for Training and Sampling) +------------------------------- + +- |new_stack| `How to use fractional GPUs for training an RLModule `__: + If your model is small and easily fits on a single GPU and you want to therefore train + other models alongside it to save time and cost, this script shows you how to set up + your RLlib config with a fractional number of GPUs on the learner (model training) + side. + +Hierarchical Training +--------------------- + - |old_stack| `How to setup hierarchical training `__: Example of hierarchical training using the multi-agent API. @@ -207,7 +242,6 @@ Multi-Agent RL Example of running a custom hand-coded policy alongside trainable policies. - |new_stack| `How to train a single policy (weight sharing) controlling more than one agents `__: Example of how to define weight-sharing layers between two different policies. - - |old_stack| `Hwo to write and set up a model with centralized critic `__: Example of customizing PPO to leverage a centralized value function. - |old_stack| `How to write and set up a model with centralized critic in the env `__: @@ -246,6 +280,14 @@ Ray Tune and RLlib RLModules --------- +- |new_stack| `How to configure an autoregressive action distribution `__: + Learning with an auto-regressive action distribution (for example, two action components, where distribution of the second component depends on the first's actually sampled value). +- |new_stack| `How to Custom tune experiment `__: + How to run a custom Ray Tune experiment with RLlib with custom training- and evaluation phases. 
+ +RLModules +--------- + - |new_stack| `How to configure an autoregressive action distribution `__: Learning with an auto-regressive action distribution (for example, two action components, where distribution of the second component depends on the first's actually sampled value). @@ -277,6 +319,17 @@ For example, see this tuned Atari example for PPO, which learns to solve the Pon in roughly 5min. It can be run like this on a single g5.24xlarge (or g6.24xlarge) machine with 4 GPUs and 96 CPUs: +.. code-block:: bash + + $ cd ray/rllib/tuned_examples/ppo + $ python atari_ppo.py --env=ale_py:ALE/Pong-v5 --num-gpus=4 --num-env-runners=95 + +Note that some of the files in this folder are used for RLlib's daily or weekly +release tests as well. +For example, see this tuned Atari example for PPO, which learns to solve the Pong environment +in roughly 5min. It can be run like this on a single g5.24xlarge (or g6.24xlarge) machine with +4 GPUs and 96 CPUs: + .. code-block:: bash $ cd ray/rllib/tuned_examples/ppo @@ -323,6 +376,8 @@ Community Examples Example of optimizing mixed-autonomy traffic simulations with RLlib / multi-agent. +Blog Posts +++++++++++ Blog Posts ++++++++++ @@ -335,4 +390,4 @@ Blog Posts - |old_stack| `Scaling Multi-Agent Reinforcement Learning `__: Blog post of a brief tutorial on multi-agent RL and its design in RLlib. - |old_stack| `Functional RL with Keras and TensorFlow Eager `__: - Exploration of a functional paradigm for implementing reinforcement learning (RL) algorithms. + Exploration of a functional paradigm for implementing reinforcement learning (RL) algorithms. \ No newline at end of file diff --git a/doc/source/rllib/single-agent-episode.rst b/doc/source/rllib/single-agent-episode.rst index 87fa50790d45..f1f52ad92899 100644 --- a/doc/source/rllib/single-agent-episode.rst +++ b/doc/source/rllib/single-agent-episode.rst @@ -77,6 +77,21 @@ The :py:class:`~ray.rllib.env.single_agent_episode.SingleAgentEpisode` construct APIs exposed to the user. +Using the getter APIs of SingleAgentEpisode +------------------------------------------- + +Now that there is a :py:class:`~ray.rllib.env.single_agent_episode.SingleAgentEpisode` to work with, one can explore +and extract information from this episode using its different "getter" methods: +``` +**(Single-agent) Episode**: The episode starts with a single observation (the "reset observation"), then +continues on each timestep with a 3-tuple of `(observation, action, reward)`. Note that because of the reset observation, +every episode - at each timestep - always contains one more observation than it contains actions or rewards. +Important additional properties of an Episode are its `id_` (str) and `terminated/truncated` (bool) flags. +See further below for a detailed description of the :py:class:`~ray.rllib.env.single_agent_episode.SingleAgentEpisode` +APIs exposed to the user. + +A new property `is_reset` has been added to the `SingleAgentEpisode` class, which returns `True` if the `add_env_reset()` method has already been called. This property helps in determining whether the episode has been reset. + Using the getter APIs of SingleAgentEpisode ------------------------------------------- @@ -151,6 +166,25 @@ episodes (one non-finalized the other finalized): +Episode.cut() and lookback buffers +---------------------------------- +.. 
figure:: images/episodes/sa_episode_non_finalized.svg + :width: 800 + :align: left + + **Complex observations in a non-finalized episode**: Each individual observation is a (complex) dict matching the + gym environment's observation space. There are three such observation items stored in the episode so far. + +.. figure:: images/episodes/sa_episode_finalized.svg + :width: 600 + :align: left + + **Complex observations in a finalized episode**: The entire observation record is a single (complex) dict matching the + gym environment's observation space. At the leafs of the structure are `NDArrays` holding the individual values of the leaf. + Note that these `NDArrays` have an extra batch dim (axis=0), whose length matches the length of the episode stored (here 3). + + + Episode.cut() and lookback buffers ---------------------------------- @@ -215,4 +249,4 @@ while looking back a certain amount of timesteps from each of these global times .. literalinclude:: doc_code/sa_episode.py :language: python :start-after: rllib-sa-episode-06-begin - :end-before: rllib-sa-episode-06-end + :end-before: rllib-sa-episode-06-end \ No newline at end of file From ac88da7e2094c5ff0affd2113ce820654978bd72 Mon Sep 17 00:00:00 2001 From: "promptless[bot]" <179508745+promptless[bot]@users.noreply.github.com> Date: Tue, 5 Nov 2024 00:51:33 +0000 Subject: [PATCH 2/2] Docs update (a37083b) --- doc/source/rllib/key-concepts.rst | 84 ++++++------ doc/source/rllib/rllib-advanced-api.rst | 116 +++++++--------- doc/source/rllib/rllib-env.rst | 96 +++++++------- doc/source/rllib/rllib-examples.rst | 59 +-------- doc/source/rllib/rllib-models.rst | 154 +++++++++++++++++++--- doc/source/rllib/single-agent-episode.rst | 8 +- 6 files changed, 283 insertions(+), 234 deletions(-) diff --git a/doc/source/rllib/key-concepts.rst b/doc/source/rllib/key-concepts.rst index 9efd1d86a3c9..447b7b5a24fa 100644 --- a/doc/source/rllib/key-concepts.rst +++ b/doc/source/rllib/key-concepts.rst @@ -1,4 +1,3 @@ - .. include:: /_includes/rllib/we_are_hiring.rst .. include:: /_includes/rllib/new_api_stack.rst @@ -95,53 +94,22 @@ which implements the proximal policy optimization algorithm in RLlib. # Train via Ray Tune. tune.run("PPO", config=config) + # Create rollout workers as Ray actors. + workers = [RolloutWorker.remote() for _ in range(num_workers)] + # Gather experiences in parallel. + trajectories = ray.get([worker.sample.remote() for worker in workers]) -RLlib `Algorithm classes `__ coordinate the distributed workflow of running rollouts and optimizing policies. -Algorithm classes leverage parallel iterators to implement the desired computation pattern. -The following figure shows *synchronous sampling*, the simplest of `these patterns `__: - -.. figure:: images/a2c-arch.svg - - Synchronous Sampling (e.g., A2C, PG, PPO) + # Concatenate the trajectories. + batch = concat(trajectories) -RLlib uses `Ray actors `__ to scale training from a single core to many thousands of cores in a cluster. -You can `configure the parallelism `__ used for training by changing the ``num_env_runners`` parameter. -See this `scaling guide `__ for more details here. - - -RL Modules ----------- - -`RLModules `__ are framework-specific neural network containers. -In a nutshell, they carry the neural networks and define how to use them during three phases that occur in -reinforcement learning: Exploration, inference and training. 
-A minimal RL Module can contain a single neural network and define its exploration-, inference- and -training logic to only map observations to actions. Since RL Modules can map observations to actions, they naturally -implement reinforcement learning policies in RLlib and can therefore be found in the :py:class:`~ray.rllib.evaluation.rollout_worker.RolloutWorker`, -where their exploration and inference logic is used to sample from an environment. -The second place in RLlib where RL Modules commonly occur is the :py:class:`~ray.rllib.core.learner.learner.Learner`, -where their training logic is used in training the neural network. -RL Modules extend to the multi-agent case, where a single :py:class:`~ray.rllib.core.rl_module.multi_rl_module.MultiRLModule` -contains multiple RL Modules. The following figure is a rough sketch of how the above can look in practice: - -.. image:: images/rllib-concepts-rlmodules-sketch.png - - -.. note:: - - RL Modules are currently in alpha stage. They are wrapped in legacy :py:class:`~ray.rllib.policy.Policy` objects - to be used in :py:class:`~ray.rllib.evaluation.rollout_worker.RolloutWorker` for sampling. - This should be transparent to the user, but the following - `Policy Evaluation `__ section still refers to these legacy Policy objects. - -.. policy-evaluation: - -Policy Evaluation ------------------ - -Given an environment and policy, policy evaluation produces `batches `__ of experiences. This is your classic "environment interaction loop". Efficient policy evaluation can be burdensome to get right, especially when leveraging vectorization, RNNs, or when operating in a multi-agent environment. RLlib provides a `RolloutWorker `__ class that manages all of this, and this class is used in most RLlib algorithms. + # Learn on the trajectory batch. + policy.learn_on_batch(batch) + # Broadcast the updated policy weights to the workers. + weights = policy.get_weights() + ray.get([worker.set_weights.remote(weights) for worker in workers]) +``` You can use rollout workers standalone to produce batches of experiences. This can be done by calling ``worker.sample()`` on a worker instance, or ``worker.sample.remote()`` in parallel on worker instances created as Ray actors (see `EnvRunnerGroup `__). Here is an example of creating a set of rollout workers and using them gather experiences in parallel. The trajectories are concatenated, the policy learns on the trajectory batch, and then we broadcast the policy weights to the workers for the next round of rollouts: @@ -203,6 +171,21 @@ serving as a container for the individual agents' sample batches. Training Step Method (``Algorithm.training_step()``) ---------------------------------------------------- +.. TODO all training_step snippets below must be tested +.. note:: + It's important to have a good understanding of the basic :ref:`ray core methods ` before reading this section. + Furthermore, we utilize concepts such as the ``SampleBatch`` (and its more advanced sibling: the ``MultiAgentBatch``), + ``RolloutWorker``, and ``Algorithm``, which can be read about on this page + and the :ref:`rollout worker reference docs `. + + Finally, developers who are looking to implement custom algorithms should familiarize themselves with the :ref:`Policy ` and + :ref:`Model ` classes. + +What is it? +~~~~~~~~~~~ +Training Step Method (``Algorithm.training_step()``) +---------------------------------------------------- + .. TODO all training_step snippets below must be tested .. 
note:: It's important to have a good understanding of the basic :ref:`ray core methods ` before reading this section. @@ -280,6 +263,15 @@ An example implementation of VPG could look like the following: Let's further break down our above ``training_step()`` code. In the first step, we collect trajectory data from the environment(s): +.. code-block:: python + + train_batch = synchronous_parallel_sample( + worker_set=self.env_runner_group, + max_env_steps=self.config["train_batch_size"] + ) +Let's further break down our above ``training_step()`` code. +In the first step, we collect trajectory data from the environment(s): + .. code-block:: python train_batch = synchronous_parallel_sample( @@ -363,8 +355,6 @@ By default, in RLlib, we create a set of workers that can be used for sampling a We create a :py:class:`~ray.rllib.env.env_runner_group.EnvRunnerGroup` object inside of ``setup`` which is called when an RLlib algorithm is created. The :py:class:`~ray.rllib.env.env_runner_group.EnvRunnerGroup` has a ``local_worker`` and ``remote_workers`` if ``num_env_runners > 0`` in the experiment config. In RLlib we typically use ``local_worker`` for training and ``remote_workers`` for sampling. - - :ref:`Train Ops `: These are methods that improve the policy and update workers. The most basic operator, ``train_one_step``, takes in as input a batch of experiences and emits a ``ResultDict`` with metrics as output. For training with GPUs, use @@ -373,4 +363,4 @@ training update. :ref:`Replay Buffers `: RLlib provides `a collection `__ of replay -buffers that can be used for storing and sampling experiences. +buffers that can be used for storing and sampling experiences. \ No newline at end of file diff --git a/doc/source/rllib/rllib-advanced-api.rst b/doc/source/rllib/rllib-advanced-api.rst index 4c26dde5030b..96daf9d06d20 100644 --- a/doc/source/rllib/rllib-advanced-api.rst +++ b/doc/source/rllib/rllib-advanced-api.rst @@ -1,3 +1,4 @@ + .. include:: /_includes/rllib/new_api_stack.rst .. _rllib-advanced-api-doc: @@ -96,24 +97,6 @@ used implement communication patterns such as parameter servers and all-reduce. Callbacks and Custom Metrics ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -You can provide callbacks to be called at points during policy evaluation. -These callbacks have access to state for the current -`episode `__. -Certain callbacks such as ``on_postprocess_trajectory``, ``on_sample_end``, -and ``on_train_result`` are also places where custom postprocessing can be applied to -intermediate data or results. -``` -.. literalinclude:: ./doc_code/advanced_api.py - :language: python - :start-after: __rllib-adv_api_counter_begin__ - :end-before: __rllib-adv_api_counter_end__ - -Ray actors provide high levels of performance, so in more complex cases they can be -used implement communication patterns such as parameter servers and all-reduce. - -Callbacks and Custom Metrics -~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - You can provide callbacks to be called at points during policy evaluation. These callbacks have access to state for the current `episode `__. @@ -193,18 +176,6 @@ that currently use these by default: .. View table below at: https://docs.google.com/drawings/d/1dEMhosbu7HVgHEwGBuMlEDyPiwjqp_g6bZ0DzCMaoUM/edit?usp=sharing .. image:: images/rllib-exploration-api-table.svg -An Exploration class implements the ``get_exploration_action`` method, -in which the exact exploratory behavior is defined. 
-It takes the model’s output, the action distribution class, the model itself, -a timestep (the global env-sampling steps already taken), -and an ``explore`` switch and outputs a tuple of a) action and -b) log-likelihood: -The following table lists all built-in Exploration sub-classes and the agents -that currently use these by default: - -.. View table below at: https://docs.google.com/drawings/d/1dEMhosbu7HVgHEwGBuMlEDyPiwjqp_g6bZ0DzCMaoUM/edit?usp=sharing -.. image:: images/rllib-exploration-api-table.svg - An Exploration class implements the ``get_exploration_action`` method, in which the exact exploratory behavior is defined. It takes the model’s output, the action distribution class, the model itself, @@ -288,33 +259,61 @@ rewards with different settings (for example, with exploration turned off, or on of environment configurations). You can activate evaluating policies during training (``Algorithm.train()``) by setting the ``evaluation_interval`` to an int value (> 0) indicating every how many ``Algorithm.train()`` calls an "evaluation step" should be run: - .. literalinclude:: ./doc_code/advanced_api.py - :language: python - :start-after: __rllib-adv_api_evaluation_4_begin__ - :end-before: __rllib-adv_api_evaluation_4_end__ -Note: The evaluation step will not be exactly the same length as the training step, but -it will be close. This is useful for ensuring that evaluation does not become a bottleneck -when running in parallel. +.. literalinclude:: ./doc_code/advanced_api.py + :language: python + :start-after: __rllib-adv_api_evaluation_1_begin__ + :end-before: __rllib-adv_api_evaluation_1_end__ -You can also customize the evaluation process by providing a custom evaluation function -via the ``custom_eval_function`` config key. This function takes the current Algorithm -instance and the evaluation workers as arguments and can return a custom evaluation result -dict. This allows for more complex evaluation logic, such as evaluating on a different -set of environments or using a different set of metrics. +An evaluation step runs - using its own ``EnvRunner`` instances - for ``evaluation_duration`` +episodes or time-steps, depending on the ``evaluation_duration_unit`` setting, which can take values +of either ``"episodes"`` (default) or ``"timesteps"``. .. literalinclude:: ./doc_code/advanced_api.py :language: python - :start-after: __rllib-adv_api_evaluation_5_begin__ - :end-before: __rllib-adv_api_evaluation_5_end__ + :start-after: __rllib-adv_api_evaluation_2_begin__ + :end-before: __rllib-adv_api_evaluation_2_end__ -In this example, the custom evaluation function sets different corridor lengths for -different evaluation workers, allowing for evaluation on different environment configurations. -The evaluation results are then logged alongside the training results. +Note: When using ``evaluation_duration_unit=timesteps`` and your ``evaluation_duration`` +setting isn't divisible by the number of evaluation workers (configurable with +``evaluation_num_env_runners``), RLlib rounds up the number of time-steps specified to +the nearest whole number of time-steps that is divisible by the number of evaluation +workers. +Also, when using ``evaluation_duration_unit=episodes`` and your +``evaluation_duration`` setting isn't divisible by the number of evaluation workers +(configurable with ``evaluation_num_env_runners``), RLlib runs the remainder of episodes +on the first n evaluation EnvRunners and leave the remaining workers idle for that time. + +For example: + +.. 
literalinclude:: ./doc_code/advanced_api.py + :language: python + :start-after: __rllib-adv_api_evaluation_3_begin__ + :end-before: __rllib-adv_api_evaluation_3_end__ + +Before each evaluation step, weights from the main model are synchronized +to all evaluation workers. + +By default, the evaluation step (if there is one in the current iteration) is run +right **after** the respective training step. +For example, for ``evaluation_interval=1``, the sequence of events is: +``train(0->1), eval(1), train(1->2), eval(2), train(2->3), ...``. +Here, the indices show the version of neural network weights used. +``train(0->1)`` is an update step that changes the weights from version 0 to +version 1 and ``eval(1)`` then uses weights version 1. +Weights index 0 represents the randomly initialized weights of the neural network. + +Another example: For ``evaluation_interval=2``, the sequence is: +``train(0->1), train(1->2), eval(2), train(2->3), train(3->4), eval(4), ...``. + +Instead of running ``train``- and ``eval``-steps in sequence, it is also possible to +run them in parallel with the ``evaluation_parallel_to_training=True`` config setting. +In this case, both training- and evaluation steps are run at the same time using multi-threading. +This can speed up the evaluation process significantly, but leads to a 1-iteration +delay between reported training- and evaluation results. +The evaluation results are behind in this case b/c they use slightly outdated +model weights (synchronized after the previous training step). -For more information on customizing the evaluation process, see the documentation for -the ``custom_eval_function`` config key in the RLlib API reference. -``` For example, for ``evaluation_parallel_to_training=True`` and ``evaluation_interval=1``, the sequence is now: ``train(0->1) + eval(0), train(1->2) + eval(1), train(2->3) + eval(2)``, @@ -393,21 +392,6 @@ of training iterations before stopping. Below are some examples of how the custom evaluation metrics are reported nested under the ``evaluation`` key of normal training results: -.. TODO make sure these outputs are still valid. -.. code-block:: bash - - ------------------------------------------------------------------------ - Sample output for `python custom_evaluation.py --no-custom-eval` - ------------------------------------------------------------------------ -There is also an end-to-end example of how to set up a custom online evaluation in -`custom_evaluation.py `__. -Note that if you only want to evaluate your policy at the end of training, -you can set ``evaluation_interval: [int]``, where ``[int]`` should be the number -of training iterations before stopping. - -Below are some examples of how the custom evaluation metrics are reported nested under -the ``evaluation`` key of normal training results: - .. TODO make sure these outputs are still valid. .. code-block:: bash @@ -487,4 +471,4 @@ using the following call: rollout_worker = get_global_worker() Policy losses are defined over the ``post_batch`` data, so you can mutate that in -the callbacks to change what data the policy loss function sees. \ No newline at end of file +the callbacks to change what data the policy loss function sees. 
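To make the last point concrete, the following is a minimal sketch on the old API stack; the callback class name and the reward-scaling factor are purely illustrative. It mutates the post-processed batch in ``on_postprocess_trajectory`` so that the policy loss sees the modified data:

.. code-block:: python

    from ray.rllib.algorithms.callbacks import DefaultCallbacks
    from ray.rllib.algorithms.ppo import PPOConfig


    class ScaleRewardsCallbacks(DefaultCallbacks):
        def on_postprocess_trajectory(
            self,
            *,
            worker,
            episode,
            agent_id,
            policy_id,
            policies,
            postprocessed_batch,
            original_batches,
            **kwargs,
        ):
            # Mutate the post-processed batch in place. The policy loss is
            # defined over this data, so the scaled rewards are what the
            # loss function ends up seeing.
            postprocessed_batch["rewards"] = postprocessed_batch["rewards"] * 0.01


    config = (
        PPOConfig()
        .environment("CartPole-v1")
        .callbacks(ScaleRewardsCallbacks)
    )
    algo = config.build()

The hook is invoked on the worker side once per agent and trajectory, so multi-agent setups can branch on ``agent_id`` or ``policy_id`` inside the same callback.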
diff --git a/doc/source/rllib/rllib-env.rst b/doc/source/rllib/rllib-env.rst index d0e6b07c55a9..069a21c4cd42 100644 --- a/doc/source/rllib/rllib-env.rst +++ b/doc/source/rllib/rllib-env.rst @@ -20,7 +20,7 @@ RLlib works with several different types of environments, including `Farama-Foun Configuring Environments ------------------------ -You can pass either a string name or a Python class to specify an environment. By default, strings will be interpreted as a gymnasium `environment name `__. +You can pass either a string name or a Python class to specify an environment. By default, strings will be interpreted as a gym `environment name `__. Custom env classes passed directly to the algorithm must take a single ``env_config`` parameter in their constructor: .. code-block:: python @@ -101,56 +101,13 @@ Performance Also check out the `scaling guide `__ for RLlib training. -There are two ways to scale experience collection with Gymnasium environments: -.. tip:: - - When using logging in an environment, the logging configuration needs to be done inside the environment, which runs inside Ray workers. Any configurations outside the environment, e.g., before starting Ray will be ignored. - +There are two ways to scale experience collection with Gym environments: Gymnasium ---------- RLlib uses Gymnasium as its environment interface for single-agent training. For more information on how to implement a custom Gymnasium environment, see the `gymnasium.Env class definition `__. You may find the `SimpleCorridor `__ example useful as a reference. -Performance -~~~~~~~~~~~ - -.. tip:: - - Also check out the `scaling guide `__ for RLlib training. - -There are two ways to scale experience collection with Gym environments: - - 1. **Vectorization within a single process:** Though many envs can achieve high frame rates per core, their throughput is limited in practice by policy evaluation between steps. For example, even small TensorFlow models incur a couple milliseconds of latency to evaluate. This can be worked around by creating multiple envs per process and batching policy evaluations across these envs. - - You can configure ``{"num_envs_per_env_runner": M}`` to have RLlib create ``M`` concurrent environments per worker. RLlib auto-vectorizes Gym environments via `VectorEnv.wrap() `__. - - 2. **Distribute across multiple processes:** You can also have RLlib create multiple processes (Ray actors) for experience collection. In most algorithms this can be controlled by setting the ``{"num_env_runners": N}`` config. - -.. image:: images/throughput.png - -You can also combine vectorization and distributed execution, as shown in the above figure. Here we plot just the throughput of RLlib policy evaluation from 1 to 128 CPUs. PongNoFrameskip-v4 on GPU scales from 2.4k to ∼200k actions/s, and Pendulum-v1 on CPU from 15k to 1.5M actions/s. One machine was used for 1-16 workers, and a Ray cluster of four machines for 32-128 workers. Each worker was configured with ``num_envs_per_env_runner=64``. - -Expensive Environments -~~~~~~~~~~~~~~~~~~~~~~ - -Some environments may be very resource-intensive to create. RLlib will create ``num_env_runners + 1`` copies of the environment since one copy is needed for the driver process. To avoid paying the extra overhead of the driver copy, which is needed to access the env's action and observation spaces, you can defer environment initialization until ``reset()`` is called. 
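As a hedged sketch of this pattern (``_launch_simulator`` is a hypothetical placeholder for whatever expensive setup your environment needs), the constructor only defines the spaces, and the heavy resources are created lazily on the first ``reset()`` call:

.. code-block:: python

    import gymnasium as gym
    import numpy as np


    class LazyInitEnv(gym.Env):
        """Defers expensive setup until the first `reset()` call.

        The driver copy of this env only needs `observation_space` and
        `action_space`, so its constructor stays cheap.
        """

        def __init__(self, config=None):
            self.observation_space = gym.spaces.Box(-1.0, 1.0, shape=(4,), dtype=np.float32)
            self.action_space = gym.spaces.Discrete(2)
            self._sim = None  # Heavy resource handle, created lazily.

        def reset(self, *, seed=None, options=None):
            super().reset(seed=seed)
            if self._sim is None:
                # Only workers that actually step the env pay this cost.
                self._sim = self._launch_simulator()
            return np.zeros(4, dtype=np.float32), {}

        def step(self, action):
            obs = self.observation_space.sample()
            return obs, 0.0, False, False, {}

        def _launch_simulator(self):
            # Placeholder for an expensive external simulator, asset loading, etc.
            return object()

The driver still constructs one copy to read the spaces, but since that copy is not used for sampling in typical configurations, it never pays for the simulator.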
- -Vectorized ----------- - -RLlib will auto-vectorize Gym envs for batch evaluation if the ``num_envs_per_env_runner`` config is set, or you can define a custom environment class that subclasses `VectorEnv `__ to implement ``vector_step()`` and ``vector_reset()``. - -Note that auto-vectorization only applies to policy inference by default. This means that policy inference will be batched, but your envs will still be stepped one at a time. If you would like your envs to be stepped in parallel, you can set ``"remote_worker_envs": True``. This will create env instances in Ray actors and step them in parallel. These remote processes introduce communication overheads, so this only helps if your env is very expensive to step / reset. - -When using remote envs, you can control the batching level for inference with ``remote_env_batch_wait_ms``. The default value of 0ms means envs execute asynchronously and inference is only batched opportunistically. Setting the timeout to a large value will result in fully batched inference and effectively synchronous environment stepping. The optimal value depends on your environment step / reset time, and model inference speed. - -Multi-Agent and Hierarchical ----------------------------- - -In a multi-agent environment, there are more than one "agent" acting simultaneously, in a turn-based fashion, or in a combination of these two. - -For example, in a traffic simulation, there may be multiple "car" and "traffic light" agents in the environment, -acting simultaneously. Whereas in a board game, you may have two or more agents acting in a turn-base fashion. +Note: With the recent update, ensure that your environment specifications and imports are compatible with Gymnasium version 1.0.0. For example, if you are using Atari environments, you should now use the `ale_py` prefix, such as `ale_py:ALE/Pong-v5`, instead of the previous `ALE/Pong-v5`. When using remote envs, you can control the batching level for inference with ``remote_env_batch_wait_ms``. The default value of 0ms means envs execute asynchronously and inference is only batched opportunistically. Setting the timeout to a large value will result in fully batched inference and effectively synchronous environment stepping. The optimal value depends on your environment step / reset time, and model inference speed. Multi-Agent and Hierarchical @@ -229,6 +186,53 @@ And another example, where agents step one after the other (turn-based game): # ... { # ... "player1": [[...]], # ... } +And another example, where agents step one after the other (turn-based game): + +.. code-block:: python + + # Env, in which two agents step in sequence (tuen-based game). + # The env is in charge of the produced agent ID. Our env here produces + # agent IDs: "player1" and "player2". + env = TicTacToe() + + # Observations are a dict mapping agent names to their obs. Only those + # agents' names that require actions in the next call to `step()` should + # be present in the returned observation dict (here: one agent at a time). + print(env.reset()) + # ... { + # ... "player1": [[...]], + # ... } + + # In the following call to `step`, only those agents' actions should be + # provided that were present in the returned obs dict: + new_obs, rewards, dones, infos = env.step(actions={"player1": ...}) + + # Similarly, new_obs, rewards, dones, etc. also become dicts. + # Note that only in the `rewards` dict, any agent may be listed (even those that have + # not(!) acted in the `step()` call). 
Rewards for individual agents will be added + # up to the point where a new action for that agent is needed. This way, you may + # implement a turn-based 2-player game, in which player-2's reward is published + # in the `rewards` dict immediately after player-1 has acted. + print(rewards) + # ... {"player1": 0, "player2": 0} + + # Individual agents can early exit; The entire episode is done when + # dones["__all__"] = True. + print(dones) + # ... {"player1": False, "__all__": False} + + # In the next step, it's player2's turn. Therefore, `new_obs` only container + # this agent's ID: + print(new_obs) + # ... { + # ... "player2": [[...]] + # ... } + + +If all the agents will be using the same algorithm class to train, then you can setup multi-agent training as follows: + +.. code-block:: python + algo = pg.PGAgent(env="my_multiagent_env", config={ "multiagent": { "policies": { diff --git a/doc/source/rllib/rllib-examples.rst b/doc/source/rllib/rllib-examples.rst index 77bf574c8232..616290b6bdd8 100644 --- a/doc/source/rllib/rllib-examples.rst +++ b/doc/source/rllib/rllib-examples.rst @@ -82,17 +82,6 @@ Connectors (as opposed to ``Connector``, which only continue to work on the old API stack |old_stack|). -- |new_stack| `How to frame-stack Atari image observations `__: - An example using Atari framestacking in a very efficient manner, not in the environment itself (as a `gym.Wrapper`), - but by stacking the observations on-the-fly using `EnvToModule` and `LearnerConnector` pipelines. - This method of framestacking is more efficient as it avoids having to send large observation - tensors through the network (ray). -.. note:: - RLlib's Connector API has been re-written from scratch for the new API stack (|new_stack|). - Connector-pieces and -pipelines are now referred to as :py:class:`~ray.rllib.connectors.connector_v2.ConnectorV2` - (as opposed to ``Connector``, which only continue to work on the old API stack |old_stack|). - - - |new_stack| `How to frame-stack Atari image observations `__: An example using Atari framestacking in a very efficient manner, not in the environment itself (as a `gym.Wrapper`), but by stacking the observations on-the-fly using `EnvToModule` and `LearnerConnector` pipelines. @@ -141,13 +130,6 @@ Environments - |new_stack| `How to set up rendering (and recording) of the environment trajectories during training with WandB `__: Example showing how you can render and record episode trajectories of your gymnasium envs and log the videos to WandB. -Environments ------------- -- |new_stack| `How to register a custom gymnasium environment `__: - Example showing how to write your own RL environment using ``gymnasium`` and register it to run train your algorithm against this env with RLlib. - -- |new_stack| `How to set up rendering (and recording) of the environment trajectories during training with WandB `__: - Example showing how you can render and record episode trajectories of your gymnasium envs and log the videos to WandB. Note: The environment names have been updated to use the `ale_py` prefix for Atari environments. - |old_stack| `How to run a Unity3D multi-agent environment locally `__: Example of how to setup an RLlib Algorithm against a locally running Unity3D editor instance to @@ -188,23 +170,6 @@ GPU (for Training and Sampling) Hierarchical Training --------------------- -- |old_stack| `How to setup hierarchical training `__: - Example of hierarchical training using the multi-agent API. 
- -Inference (of Models/Policies) ------------------------------- -GPU (for Training and Sampling) -------------------------------- - -- |new_stack| `How to use fractional GPUs for training an RLModule `__: - If your model is small and easily fits on a single GPU and you want to therefore train - other models alongside it to save time and cost, this script shows you how to set up - your RLlib config with a fractional number of GPUs on the learner (model training) - side. - -Hierarchical Training ---------------------- - - |old_stack| `How to setup hierarchical training `__: Example of hierarchical training using the multi-agent API. @@ -242,6 +207,7 @@ Multi-Agent RL Example of running a custom hand-coded policy alongside trainable policies. - |new_stack| `How to train a single policy (weight sharing) controlling more than one agents `__: Example of how to define weight-sharing layers between two different policies. + - |old_stack| `Hwo to write and set up a model with centralized critic `__: Example of customizing PPO to leverage a centralized value function. - |old_stack| `How to write and set up a model with centralized critic in the env `__: @@ -280,14 +246,6 @@ Ray Tune and RLlib RLModules --------- -- |new_stack| `How to configure an autoregressive action distribution `__: - Learning with an auto-regressive action distribution (for example, two action components, where distribution of the second component depends on the first's actually sampled value). -- |new_stack| `How to Custom tune experiment `__: - How to run a custom Ray Tune experiment with RLlib with custom training- and evaluation phases. - -RLModules ---------- - - |new_stack| `How to configure an autoregressive action distribution `__: Learning with an auto-regressive action distribution (for example, two action components, where distribution of the second component depends on the first's actually sampled value). @@ -319,17 +277,6 @@ For example, see this tuned Atari example for PPO, which learns to solve the Pon in roughly 5min. It can be run like this on a single g5.24xlarge (or g6.24xlarge) machine with 4 GPUs and 96 CPUs: -.. code-block:: bash - - $ cd ray/rllib/tuned_examples/ppo - $ python atari_ppo.py --env=ale_py:ALE/Pong-v5 --num-gpus=4 --num-env-runners=95 - -Note that some of the files in this folder are used for RLlib's daily or weekly -release tests as well. -For example, see this tuned Atari example for PPO, which learns to solve the Pong environment -in roughly 5min. It can be run like this on a single g5.24xlarge (or g6.24xlarge) machine with -4 GPUs and 96 CPUs: - .. code-block:: bash $ cd ray/rllib/tuned_examples/ppo @@ -376,8 +323,6 @@ Community Examples Example of optimizing mixed-autonomy traffic simulations with RLlib / multi-agent. -Blog Posts -++++++++++ Blog Posts ++++++++++ @@ -390,4 +335,4 @@ Blog Posts - |old_stack| `Scaling Multi-Agent Reinforcement Learning `__: Blog post of a brief tutorial on multi-agent RL and its design in RLlib. - |old_stack| `Functional RL with Keras and TensorFlow Eager `__: - Exploration of a functional paradigm for implementing reinforcement learning (RL) algorithms. \ No newline at end of file + Exploration of a functional paradigm for implementing reinforcement learning (RL) algorithms. 
diff --git a/doc/source/rllib/rllib-models.rst b/doc/source/rllib/rllib-models.rst
index 717c6bb196c6..0363fb459131 100644
--- a/doc/source/rllib/rllib-models.rst
+++ b/doc/source/rllib/rllib-models.rst
@@ -72,6 +72,22 @@ These include options for the ``FullyConnectedNetworks`` (``fcnet_hiddens`` and
 ``VisionNetworks`` (``conv_filters`` and ``conv_activation``), auto-RNN wrapping,
 auto-Attention (`GTrXL `__) wrapping, and some special options for Atari environments:
 
 .. literalinclude:: ../../../rllib/models/catalog.py
    :language: python
    :start-after: __sphinx_doc_begin__
    :end-before: __sphinx_doc_end__
 
@@ -120,6 +136,10 @@ X=last Conv2D layer's number of filters, so that RLlib can flatten it. An inform
 
 .. _auto_lstm_and_attention:
 
 Built-in auto-LSTM and auto-Attention Wrappers
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
@@ -186,6 +206,24 @@ For example, for manipulating your env's observations or rewards, do:
 
     # Override `reward()` to custom-process the original reward coming
     # from the env.
     def reward(self, reward):
         # E.g., simple clipping between `self.min` and `self.max`.
         return np.clip(reward, self.min, self.max)
 
 Custom Models: Implementing your own Forward Logic
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 If you would like to provide your own model logic (instead of using RLlib's built-in defaults), you
 can sub-class either ``TFModelV2`` (for TensorFlow) or ``TorchModelV2`` (for PyTorch) and then
 register and specify your sub-class in the config as follows:
 
 .. _tensorflow-models:
 
 Custom TensorFlow Models
 ````````````````````````
 
@@ -257,6 +295,7 @@ Usually, the dict contains only the current observation ``obs`` and an ``is_trai
 states (in case of RNNs or attention nets). You can also override extra methods of the model such as
 ``value_function`` to implement a custom value branch.
 
 Additional supervised/self-supervised losses can be added via the ``TorchModelV2.custom_loss`` method:
 
 See these examples of `fully connected `__, `convolutional `__, and `recurrent `__ torch models.
@@ -324,6 +363,17 @@ You can check out the `rnn_model.py `__ models as examples to implement
 your own (either TF or Torch).
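The "Custom Models: Implementing your own Forward Logic" section above boils down to sub-classing ``TFModelV2`` (or ``TorchModelV2``), overriding ``forward()`` (and usually ``value_function()``), and registering the class. Here is a minimal TensorFlow sketch; the class name, layer sizes, and the ``"my_tf_model"`` registration key are illustrative assumptions:

.. code-block:: python

    from ray.rllib.models import ModelCatalog
    from ray.rllib.models.tf.tf_modelv2 import TFModelV2
    from ray.rllib.utils.framework import try_import_tf

    tf1, tf, tfv = try_import_tf()


    class MyTFModel(TFModelV2):
        """Tiny fully connected model with a separate value head."""

        def __init__(self, obs_space, action_space, num_outputs, model_config, name):
            super().__init__(obs_space, action_space, num_outputs, model_config, name)
            inputs = tf.keras.layers.Input(shape=obs_space.shape, name="obs")
            hidden = tf.keras.layers.Dense(64, activation="relu")(inputs)
            logits = tf.keras.layers.Dense(num_outputs, name="logits")(hidden)
            value = tf.keras.layers.Dense(1, name="value")(hidden)
            self.base_model = tf.keras.Model(inputs, [logits, value])

        def forward(self, input_dict, state, seq_lens):
            logits, self._value_out = self.base_model(input_dict["obs"])
            return logits, state

        def value_function(self):
            return tf.reshape(self._value_out, [-1])


    # Register the class under a name, then point the config's `custom_model` at it.
    ModelCatalog.register_custom_model("my_tf_model", MyTFModel)
    config = {"model": {"custom_model": "my_tf_model"}}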
.. _attention:

Implementing custom Attention Networks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

@@ -390,6 +440,20 @@ Take a look at this model example that does exactly that:
    :end-before: __sphinx_doc_end__
 
 
 **Using the Trajectory View API: Passing in the last n actions (or rewards or observations) as inputs to a custom Model**
 
 It is sometimes helpful for learning not only to look at the current observation
@@ -464,26 +528,61 @@ You can also use the ``custom_loss()`` API to add in self-supervised losses such
 
 Variable-length / Complex Observation Spaces
 --------------------------------------------
 
+.. code-block:: python
+
+    # Assumes the usual old-API-stack imports, e.g.:
+    #   from ray.rllib.models.tf.tf_modelv2 import TFModelV2
+    #   from ray.rllib.models.tf.fcnet import FullyConnectedNetwork
+    #   tf1, tf, tfv = try_import_tf()
+    class ParametricActionsModel(TFModelV2):
+        def __init__(self, obs_space, action_space, num_outputs, model_config,
+                     name, action_embedding_sz=2):
+            super(ParametricActionsModel, self).__init__(
+                obs_space, action_space, num_outputs, model_config, name)
+            # Embeds the "real" part of the observation into the same space as
+            # the per-action embeddings provided by the env.
+            self.action_embed_model = FullyConnectedNetwork(
+                obs_space["real_obs"],
+                action_space,
+                action_embedding_sz,
+                model_config,
+                name + "_action_embed")
+
+        def forward(self, input_dict, state, seq_lens):
+            action_mask = input_dict["obs"]["action_mask"]
+            # Shape: [BATCH, MAX_ACTIONS, action_embedding_sz].
+            avail_actions = input_dict["obs"]["avail_actions"]
+            real_obs = input_dict["obs"]["real_obs"]
+
+            # RLlib models return an (output, state) tuple when called.
+            intent_vector, _ = self.action_embed_model({"obs": real_obs})
+
+            # Dot product of each available action's embedding with the intent
+            # vector yields per-action logits: [BATCH, MAX_ACTIONS].
+            action_logits = tf.reduce_sum(
+                avail_actions * tf.expand_dims(intent_vector, 1), axis=-1)
+
+            # Mask out invalid actions (clamp to float32.min rather than -inf
+            # for numerical stability).
+            inf_mask = tf.maximum(tf.math.log(action_mask), tf.float32.min)
+            return action_logits + inf_mask, state
 
-RLlib supports complex and variable-length observation spaces, including ``gym.spaces.Tuple``, ``gym.spaces.Dict``, and ``rllib.utils.spaces.Repeated``. The handling of these spaces is transparent to the user. RLlib internally inserts preprocessors that pad repeated elements, flatten complex observations into a fixed-size vector in transit, and unpack the vector into the structured tensor before sending it to the model. The flattened observation is available to the model as ``input_dict["obs_flat"]``, and the unpacked observation as ``input_dict["obs"]``.
-
-To enable batching of struct observations, RLlib unpacks them in a `StructTensor-like format `__. In summary, repeated fields are "pushed down" and become the outer dimensions of tensor batches, as illustrated in this figure from the StructTensor RFC.
-
-.. image:: images/struct-tensor.png
-
-For further information about complex observation spaces, see:
- - * A custom environment and model that uses `repeated struct fields `__.
 - * The pydoc of the `Repeated space `__.
 - * The pydoc of the batched `repeated values tensor `__.
 - * The `unit tests `__ for Tuple and Dict spaces.
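For orientation, the ``ParametricActionsModel`` above expects a dict observation with the three keys it reads in ``forward()``. A sketch of what such a space could look like; the sizes below are illustrative placeholders, not values from the original example:

.. code-block:: python

    from gymnasium.spaces import Box, Dict, Discrete  # `gym.spaces` works the same way

    MAX_AVAIL_ACTIONS = 10   # illustrative padding size
    ACTION_EMBEDDING_SZ = 2  # must match the model's embedding size

    observation_space = Dict({
        # 0/1 mask over the (padded) action slots.
        "action_mask": Box(0.0, 1.0, shape=(MAX_AVAIL_ACTIONS,)),
        # One embedding vector per (padded) action slot.
        "avail_actions": Box(-1.0, 1.0, shape=(MAX_AVAIL_ACTIONS, ACTION_EMBEDDING_SZ)),
        # The "actual" observation of the underlying problem.
        "real_obs": Box(-1.0, 1.0, shape=(4,)),
    })
    action_space = Discrete(MAX_AVAIL_ACTIONS)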
+3. Finally, the custom model can be registered and used with an RLlib algorithm:
+
+.. code-block:: python
+
+    ModelCatalog.register_custom_model("pa_model", ParametricActionsModel)
+
+    config = {
+        "env": MyParamActionEnv,
+        "model": {
+            "custom_model": "pa_model",
+        },
+        ...
+    }
+    algo = PPO(config=config)
+    algo.train()
+
+Note: The above example uses TensorFlow; the same logic applies to PyTorch models.
 
-Variable-length / Parametric Action Spaces
--------------------------------------------
-
-Custom models can be used to work with environments where (1) the set of valid actions `varies per step `__, and/or (2) the number of valid actions is `very large `__. The general idea is that the meaning of actions can be completely conditioned on the observation, that is, the ``a`` in ``Q(s, a)`` becomes just a token in ``[0, MAX_AVAIL_ACTIONS)`` that only has meaning in the context of ``s``. This works with algorithms in the `DQN and policy-gradient families `__ and can be implemented as follows:
-
-1. The environment should return a mask and/or a list of valid action embeddings as part of the observation for each step. To enable batching, the number of actions can be allowed to vary from 1 to some max number:
 
+Gymnasium Compatibility
+-----------------------
+
+RLlib supports environments and models written against the `gymnasium` library, the maintained fork of the original `gym` library, including `gymnasium` 1.0.0. Make sure your custom environments and models are compatible with `gymnasium` by updating import statements and switching to the new API where necessary, for example by replacing `import gym` with `import gymnasium as gym` and updating any environment registration or creation logic accordingly.
+For more information on transitioning to `gymnasium`, see the `gymnasium` documentation and migration guides.
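As a rough illustration of the `gym`-to-`gymnasium` changes described above (a sketch using the standard ``CartPole-v1`` env): ``reset()`` now returns an ``(obs, info)`` tuple, and ``step()`` returns five values, with the old ``done`` flag split into ``terminated`` and ``truncated``:

.. code-block:: python

    import gymnasium as gym  # was: import gym

    env = gym.make("CartPole-v1")
    obs, info = env.reset(seed=42)  # was: obs = env.reset()
    terminated = truncated = False
    while not (terminated or truncated):
        action = env.action_space.sample()
        # was: obs, reward, done, info = env.step(action)
        obs, reward, terminated, truncated, info = env.step(action)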
 .. code-block:: python
 
     class MyParamActionEnv(gym.Env):
@@ -550,6 +649,10 @@ Note that since masking introduces ``tf.float32.min`` values into the model outp
 
 Autoregressive Action Distributions
 -----------------------------------
 
 In an action space with multiple components (e.g., ``Tuple(a1, a2)``), you might want ``a2`` to be conditioned on the sampled value of ``a1``, i.e., ``a2_sampled ~ P(a2 | a1_sampled, obs)``. Normally, ``a1`` and ``a2`` would be sampled independently, reducing the expressivity of the policy.
 
 To do this, you need both a custom model that implements the autoregressive pattern, and a custom action distribution class that leverages that model. The `autoregressive_action_dist.py `__ example shows how this can be implemented for a simple binary action space. For a more complex space, a more efficient architecture such as a `MADE `__ is recommended.
 
 Note that sampling an `N`-part action requires `N` forward passes through the model; computing the log probability of an action, however, can be done in a single pass:
@@ -627,6 +730,27 @@ To do this, you need both a custom model that implements the autoregressive patt
             name="a1_logits",
             activation=None,
             kernel_initializer=normc_initializer(0.01))(ctx_input)
 
     # P(a2 | a1)
     # --note: typically you'd want to implement P(a2 | a1, obs) as follows:
@@ -659,4 +783,4 @@ To do this, you need both a custom model that implements the autoregressive patt
 
 .. note::
 
-   Not all algorithms support autoregressive action distributions; see the `algorithm overview table `__ for more information.
+   Not all algorithms support autoregressive action distributions; see the `algorithm overview table `__ for more information.
\ No newline at end of file
diff --git a/doc/source/rllib/single-agent-episode.rst b/doc/source/rllib/single-agent-episode.rst
index f1f52ad92899..dda8fbaf899c 100644
--- a/doc/source/rllib/single-agent-episode.rst
+++ b/doc/source/rllib/single-agent-episode.rst
@@ -82,15 +82,17 @@ Using the getter APIs of SingleAgentEpisode
 Now that there is a :py:class:`~ray.rllib.env.single_agent_episode.SingleAgentEpisode` to work with, one can explore
 and extract information from this episode using its different "getter" methods:
 
-```
+
+.. note::
+
+    The `SingleAgentEpisode` class includes an `is_reset` property, which returns `True` once the `add_env_reset()` method has been called. This is useful for checking whether an episode has already been initialized with a reset observation.
 
 **(Single-agent) Episode**: The episode starts with a single observation (the "reset observation"), then
 continues on each timestep with a 3-tuple of `(observation, action, reward)`. Note that because of the reset observation,
 every episode - at each timestep - always contains one more observation than it contains actions or rewards.
-Important additional properties of an Episode are its `id_` (str) and `terminated/truncated` (bool) flags.
+Important additional properties of an Episode are its `id_` (str), its `terminated/truncated` (bool) flags, and its `is_reset` (bool) flag, which indicates whether the episode has received its reset observation.
 See further below for a detailed description of the :py:class:`~ray.rllib.env.single_agent_episode.SingleAgentEpisode` APIs exposed to the user.
-A new property `is_reset` has been added to the `SingleAgentEpisode` class, which returns `True` if the `add_env_reset()` method has already been called. This property helps in determining whether the episode has been reset.
 
 Using the getter APIs of SingleAgentEpisode
-------------------------------------------
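To illustrate these getters and the ``is_reset`` property mentioned above, here is a small sketch; argument names are kept minimal and exact signatures may differ slightly across Ray versions:

.. code-block:: python

    import numpy as np

    from ray.rllib.env.single_agent_episode import SingleAgentEpisode

    episode = SingleAgentEpisode()
    assert not episode.is_reset  # no reset observation logged yet

    episode.add_env_reset(observation=np.array([0.0]))
    assert episode.is_reset

    # Each further timestep adds an (observation, action, reward) triple.
    episode.add_env_step(observation=np.array([0.1]), action=0, reward=1.0)
    episode.add_env_step(observation=np.array([0.2]), action=1, reward=1.0)

    # The episode always holds one more observation than actions or rewards.
    print(len(episode))                    # -> 2 (env steps logged so far)
    print(episode.get_observations([-1]))  # most recent observation
    print(episode.get_actions([-1]))       # most recent action
    print(episode.get_rewards([-1]))       # most recent reward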