84 changes: 37 additions & 47 deletions doc/source/rllib/key-concepts.rst

.. include:: /_includes/rllib/we_are_hiring.rst

.. include:: /_includes/rllib/new_api_stack.rst

… which implements the proximal policy optimization algorithm in RLlib.

# Train via Ray Tune.
tune.run("PPO", config=config)

RLlib `Algorithm classes <rllib-concepts.html#algorithms>`__ coordinate the distributed workflow of running rollouts and optimizing policies.
Algorithm classes leverage parallel iterators to implement the desired computation pattern.
The following figure shows *synchronous sampling*, the simplest of `these patterns <rllib-algorithms.html>`__:

.. figure:: images/a2c-arch.svg

    Synchronous Sampling (e.g., A2C, PG, PPO)

RLlib uses `Ray actors <actors.html>`__ to scale training from a single core to many thousands of cores in a cluster.
You can `configure the parallelism <rllib-training.html#specifying-resources>`__ used for training by changing the ``num_env_runners`` parameter.
See the `scaling guide <rllib-training.html#scaling-guide>`__ for more details.
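
For example, a hedged config sketch (``env_runners()`` is the config method on newer RLlib versions; older releases expose the same settings through ``rollouts()``, and ``CartPole-v1`` is just a placeholder env):

.. code-block:: python

    from ray.rllib.algorithms.ppo import PPOConfig

    # Run sampling on 4 parallel EnvRunner (rollout worker) actors.
    config = (
        PPOConfig()
        .environment("CartPole-v1")
        .env_runners(num_env_runners=4)
    )
    algo = config.build()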


RL Modules
----------

`RLModules <rllib-rlmodule.html>`__ are framework-specific neural network containers.
In a nutshell, they carry the neural networks and define how to use them during the three phases that occur in
reinforcement learning: exploration, inference, and training.
A minimal RL Module can contain a single neural network and define its exploration, inference, and
training logic to simply map observations to actions. Since RL Modules can map observations to actions, they naturally
implement reinforcement learning policies in RLlib and can therefore be found in the :py:class:`~ray.rllib.evaluation.rollout_worker.RolloutWorker`,
where their exploration and inference logic is used to sample from an environment.
The second place in RLlib where RL Modules commonly occur is the :py:class:`~ray.rllib.core.learner.learner.Learner`,
where their training logic is used in training the neural network.
RL Modules extend to the multi-agent case, where a single :py:class:`~ray.rllib.core.rl_module.multi_rl_module.MultiRLModule`
contains multiple RL Modules. The following figure is a rough sketch of how the above can look in practice:

.. image:: images/rllib-concepts-rlmodules-sketch.png
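
As a hedged usage sketch only (construction of an actual module is omitted, and the exact input/output formats of the ``forward_*`` methods vary by framework and RLlib version), the three phases map to three forward calls:

.. code-block:: python

    # `rl_module` is assumed to be an already-built RL Module instance.
    batch = {"obs": observations}

    # Used on EnvRunners/RolloutWorkers to sample actions from an environment.
    exploration_out = rl_module.forward_exploration(batch)
    inference_out = rl_module.forward_inference(batch)

    # Used inside a Learner to produce the outputs needed for the loss.
    train_out = rl_module.forward_train(batch)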


.. note::

RL Modules are currently in alpha stage. They are wrapped in legacy :py:class:`~ray.rllib.policy.Policy` objects
to be used in :py:class:`~ray.rllib.evaluation.rollout_worker.RolloutWorker` for sampling.
This should be transparent to the user, but the following
`Policy Evaluation <key-concepts.html#policy-evaluation>`__ section still refers to these legacy Policy objects.

.. _policy-evaluation:

Policy Evaluation
-----------------

Given an environment and policy, policy evaluation produces `batches <https://github.com/ray-project/ray/blob/master/rllib/policy/sample_batch.py>`__ of experiences. This is your classic "environment interaction loop". Efficient policy evaluation can be burdensome to get right, especially when leveraging vectorization, RNNs, or when operating in a multi-agent environment. RLlib provides a `RolloutWorker <https://github.com/ray-project/ray/blob/master/rllib/evaluation/rollout_worker.py>`__ class that manages all of this, and this class is used in most RLlib algorithms.

You can use rollout workers standalone to produce batches of experiences. This can be done by calling ``worker.sample()`` on a worker instance, or ``worker.sample.remote()`` in parallel on worker instances created as Ray actors (see `EnvRunnerGroup <https://github.com/ray-project/ray/blob/master/rllib/env/env_runner_group.py>`__).

Here is an example of creating a set of rollout workers and using them to gather experiences in parallel. The trajectories are concatenated, the policy learns on the trajectory batch, and then we broadcast the policy weights to the workers for the next round of rollouts:
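
A minimal sketch of that loop (assuming ``num_workers`` ``RolloutWorker`` Ray actors, a ``concat`` helper for merging sample batches, and a local ``policy`` object):

.. code-block:: python

    # Create rollout workers as Ray actors.
    workers = [RolloutWorker.remote() for _ in range(num_workers)]

    # Gather experiences in parallel.
    trajectories = ray.get([worker.sample.remote() for worker in workers])

    # Concatenate the trajectories.
    batch = concat(trajectories)

    # Learn on the trajectory batch.
    policy.learn_on_batch(batch)

    # Broadcast the updated policy weights to the workers.
    weights = policy.get_weights()
    ray.get([worker.set_weights.remote(weights) for worker in workers])
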
Training Step Method (``Algorithm.training_step()``)
----------------------------------------------------

.. TODO all training_step snippets below must be tested
.. note::
It's important to have a good understanding of the basic :ref:`ray core methods <core-walkthrough>` before reading this section.
Furthermore, we utilize concepts such as the ``SampleBatch`` (and its more advanced sibling: the ``MultiAgentBatch``),
``RolloutWorker``, and ``Algorithm``, which can be read about on this page
and the :ref:`rollout worker reference docs <rolloutworker-reference-docs>`.

Finally, developers who are looking to implement custom algorithms should familiarize themselves with the :ref:`Policy <rllib-policy-walkthrough>` and
:ref:`Model <rllib-models-walkthrough>` classes.

What is it?
~~~~~~~~~~~

An example implementation of VPG could look like the following:
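
As a hedged sketch (helper names follow the sampling and train ops described further below; exact signatures and the returned ``ResultDict`` may differ between RLlib versions), such a ``training_step()`` on an ``Algorithm`` subclass could be structured like this:

.. code-block:: python

    def training_step(self) -> ResultDict:
        # 1) Sample trajectories from all EnvRunners in parallel.
        train_batch = synchronous_parallel_sample(
            worker_set=self.env_runner_group,
            max_env_steps=self.config["train_batch_size"],
        )

        # 2) Improve the policy on the collected batch.
        train_results = train_one_step(self, train_batch)

        # 3) Broadcast the updated weights back to the EnvRunners.
        self.env_runner_group.sync_weights()

        return train_results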
Let's further break down our above ``training_step()`` code.
In the first step, we collect trajectory data from the environment(s):

.. code-block:: python

    train_batch = synchronous_parallel_sample(
        worker_set=self.env_runner_group,
        max_env_steps=self.config["train_batch_size"],
    )

By default, in RLlib, we create a set of workers that can be used for sampling and training.
We create a :py:class:`~ray.rllib.env.env_runner_group.EnvRunnerGroup` object inside of ``setup`` which is called when an RLlib algorithm is created. The :py:class:`~ray.rllib.env.env_runner_group.EnvRunnerGroup` has a ``local_worker``
and ``remote_workers`` if ``num_env_runners > 0`` in the experiment config. In RLlib we typically use ``local_worker``
for training and ``remote_workers`` for sampling.


:ref:`Train Ops <train-ops-docs>`:
These are methods that improve the policy and update workers. The most basic operator, ``train_one_step``, takes in as
input a batch of experiences and emits a ``ResultDict`` with metrics as output. For training with GPUs, use
``multi_gpu_train_one_step`` to perform the training update.

:ref:`Replay Buffers <replay-buffer-reference-docs>`:
RLlib provides `a collection <https://github.com/ray-project/ray/tree/master/rllib/utils/replay_buffers>`__ of replay
buffers that can be used for storing and sampling experiences.
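
For example, a hedged sketch of configuring such a buffer through an algorithm config (the buffer type string and keys shown are illustrative):

.. code-block:: python

    from ray.rllib.algorithms.dqn import DQNConfig

    # Ask DQN to use a prioritized multi-agent replay buffer with 50k capacity.
    config = DQNConfig().training(
        replay_buffer_config={
            "type": "MultiAgentPrioritizedReplayBuffer",
            "capacity": 50_000,
        }
    )
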
98 changes: 77 additions & 21 deletions doc/source/rllib/rllib-env.rst

Performance
-----------

Also check out the `scaling guide <rllib-training.html#scaling-guide>`__ for RLlib training.

There are two ways to scale experience collection with Gym environments:

1. **Vectorization within a single process:** Though many envs can achieve high frame rates per core, their throughput is limited in practice by policy evaluation between steps. For example, even small TensorFlow models incur a couple milliseconds of latency to evaluate. This can be worked around by creating multiple envs per process and batching policy evaluations across these envs.

You can configure ``{"num_envs_per_env_runner": M}`` to have RLlib create ``M`` concurrent environments per worker. RLlib auto-vectorizes Gym environments via `VectorEnv.wrap() <https://github.com/ray-project/ray/blob/master/rllib/env/vector_env.py>`__.

2. **Distribute across multiple processes:** You can also have RLlib create multiple processes (Ray actors) for experience collection. In most algorithms this can be controlled by setting the ``{"num_env_runners": N}`` config.

.. image:: images/throughput.png

You can also combine vectorization and distributed execution, as shown in the above figure. Here we plot just the throughput of RLlib policy evaluation from 1 to 128 CPUs. PongNoFrameskip-v4 on GPU scales from 2.4k to ∼200k actions/s, and Pendulum-v1 on CPU from 15k to 1.5M actions/s. One machine was used for 1-16 workers, and a Ray cluster of four machines for 32-128 workers. Each worker was configured with ``num_envs_per_env_runner=64``.
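
A hedged config sketch that combines both axes (method and parameter names as used elsewhere on this page; ``CartPole-v1`` is just a placeholder env):

.. code-block:: python

    from ray.rllib.algorithms.ppo import PPOConfig

    config = (
        PPOConfig()
        .environment("CartPole-v1")
        # 4 sampling processes (Ray actors), each stepping 8 envs per inference batch.
        .env_runners(num_env_runners=4, num_envs_per_env_runner=8)
    )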

Expensive Environments
~~~~~~~~~~~~~~~~~~~~~~

Some environments may be very resource-intensive to create. RLlib will create ``num_env_runners + 1`` copies of the environment since one copy is needed for the driver process. To avoid paying the extra overhead of the driver copy, which is needed to access the env's action and observation spaces, you can defer environment initialization until ``reset()`` is called.
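
A hedged sketch of that pattern (the "expensive" resource here is a stand-in for whatever heavy setup your env needs):

.. code-block:: python

    import gymnasium as gym
    import numpy as np

    class ExpensiveEnv(gym.Env):
        def __init__(self, config=None):
            # Only define spaces here; the driver copy just needs the spaces
            # and never steps the env, so heavy setup is deferred to reset().
            self.action_space = gym.spaces.Discrete(2)
            self.observation_space = gym.spaces.Box(-1.0, 1.0, (4,), np.float32)
            self.simulator = None

        def reset(self, *, seed=None, options=None):
            super().reset(seed=seed)
            if self.simulator is None:
                # Placeholder for e.g. launching an external simulator process.
                self.simulator = object()
            return self.observation_space.sample(), {}

        def step(self, action):
            return self.observation_space.sample(), 0.0, False, False, {}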

Vectorized
----------

RLlib will auto-vectorize Gym envs for batch evaluation if the ``num_envs_per_env_runner`` config is set, or you can define a custom environment class that subclasses `VectorEnv <https://github.com/ray-project/ray/blob/master/rllib/env/vector_env.py>`__ to implement ``vector_step()`` and ``vector_reset()``.

Note that auto-vectorization only applies to policy inference by default. This means that policy inference will be batched, but your envs will still be stepped one at a time. If you would like your envs to be stepped in parallel, you can set ``"remote_worker_envs": True``. This will create env instances in Ray actors and step them in parallel. These remote processes introduce communication overheads, so this only helps if your env is very expensive to step or reset.

When using remote envs, you can control the batching level for inference with ``remote_env_batch_wait_ms``. The default value of 0 ms means envs execute asynchronously and inference is only batched opportunistically. Setting the timeout to a large value results in fully batched inference and effectively synchronous environment stepping. The optimal value depends on your environment step/reset time and model inference speed.

Gymnasium
---------

RLlib uses Gymnasium as its environment interface for single-agent training. For more information on how to implement a custom Gymnasium environment, see the `gymnasium.Env class definition <https://github.com/Farama-Foundation/Gymnasium/blob/main/gymnasium/core.py>`__. You may find the `SimpleCorridor <https://github.com/ray-project/ray/blob/master/rllib/examples/custom_env.py>`__ example useful as a reference.

.. note::

    Ensure that your environment specifications and imports are compatible with Gymnasium version 1.0.0. For example, if you are using Atari environments, use the ``ale_py`` prefix, such as ``ale_py:ALE/Pong-v5``, instead of the previous ``ALE/Pong-v5``.
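
A minimal, hedged sketch of such a custom env (a toy corridor similar in spirit to the SimpleCorridor example linked above; all numbers are illustrative):

.. code-block:: python

    import gymnasium as gym
    import numpy as np

    class CorridorEnv(gym.Env):
        """Walk right until reaching the end of a 1-D corridor."""

        def __init__(self, config=None):
            self.end_pos = (config or {}).get("corridor_length", 10)
            self.cur_pos = 0
            self.action_space = gym.spaces.Discrete(2)  # 0=left, 1=right
            self.observation_space = gym.spaces.Box(0.0, self.end_pos, (1,), np.float32)

        def reset(self, *, seed=None, options=None):
            super().reset(seed=seed)
            self.cur_pos = 0
            return np.array([self.cur_pos], np.float32), {}

        def step(self, action):
            self.cur_pos = max(0, self.cur_pos + (1 if action == 1 else -1))
            terminated = self.cur_pos >= self.end_pos
            reward = 1.0 if terminated else -0.1
            return np.array([self.cur_pos], np.float32), reward, terminated, False, {}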

Multi-Agent and Hierarchical
----------------------------

Here is an example of an env in which all agents always step simultaneously:
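
A hedged sketch of what such an env can look like (agent IDs and dynamics are made up; obs, reward, and termination dicts are keyed by agent ID, with the special ``__all__`` key marking the end of the episode):

.. code-block:: python

    from ray.rllib.env.multi_agent_env import MultiAgentEnv

    class SimultaneousStepEnv(MultiAgentEnv):
        """Both agents receive an obs and must send an action on every step."""

        def __init__(self, config=None):
            super().__init__()
            # Observation/action space definitions omitted in this sketch.
            self.t = 0

        def reset(self, *, seed=None, options=None):
            self.t = 0
            # Return an obs (and infos) for every agent that acts in the next `step()`.
            return {"player1": 0, "player2": 0}, {}

        def step(self, action_dict):
            # `action_dict` contains exactly one action per agent that got an obs.
            self.t += 1
            obs = {"player1": self.t, "player2": self.t}
            rewards = {"player1": 1.0, "player2": 1.0}
            terminateds = {"__all__": self.t >= 10}
            truncateds = {"__all__": False}
            return obs, rewards, terminateds, truncateds, {}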

And another example, where agents step one after the other (turn-based game):

.. code-block:: python

    # Env, in which two agents step in sequence (turn-based game).
    # The env is in charge of the produced agent ID. Our env here produces
    # agent IDs: "player1" and "player2".
    env = TicTacToe()

    # Observations are a dict mapping agent names to their obs. Only those
    # agents' names that require actions in the next call to `step()` should
    # be present in the returned observation dict (here: one agent at a time).
    print(env.reset())
    # ... {
    # ...   "player1": [[...]],
    # ... }

If all the agents will be using the same algorithm class to train, then you can set up multi-agent training as follows:

.. code-block:: python

    while True:
        print(algo.train())

To exclude some policies in your ``multiagent.policies`` dictionary, you can use the ``multiagent.policies_to_train`` setting.
For example, you may want to have one or more random (non learning) policies interact with your learning ones:
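
A hedged sketch of such a setup (the env name, agent IDs, and policy IDs are made up for illustration):

.. code-block:: python

    from ray.rllib.algorithms.ppo import PPOConfig

    config = (
        PPOConfig()
        .environment("my_multi_agent_env")  # placeholder, registered elsewhere
        .multi_agent(
            policies={"learned", "random"},
            # Map one agent to the learning policy, the other to "random".
            policy_mapping_fn=(
                lambda agent_id, episode, **kwargs:
                    "learned" if agent_id == "player1" else "random"
            ),
            # Only "learned" is optimized; "random" keeps its initial weights.
            policies_to_train=["learned"],
        )
    )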

Here is a simple `example training script <https://github.com/ray-project/ray/blob/master/rllib/examples/multi_agent_cartpole.py>`__
in which you can vary the number of agents and policies in the environment.
For how to use multiple training methods at once (here DQN and PPO),
see the `two-algorithm example <https://github.com/ray-project/ray/blob/master/rllib/examples/multi_agent/two_algorithms.py>`__.
Metrics are reported for each policy separately, for example:

.. code-block:: bash
   :emphasize-lines: 6,14,22

PettingZoo Multi-Agent Environments
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A more complete example is here: `rllib_pistonball.py <https://github.com/Farama-Foundation/PettingZoo/blob/master/tutorials/Ray/rllib_pistonball.py>`__


Rock Paper Scissors Example
~~~~~~~~~~~~~~~~~~~~~~~~~~~

The `rock_paper_scissors_heuristic_vs_learned.py <https://github.com/ray-project/ray/blob/master/rllib/examples/multi_agent/rock_paper_scissors_heuristic_vs_learned.py>`__
and `rock_paper_scissors_learned_vs_learned.py <https://github.com/ray-project/ray/blob/master/rllib/examples/multi_agent/rock_paper_scissors_learned_vs_learned.py>`__ examples demonstrate several types of policies competing against each other: heuristic policies of repeating the same move, beating the last opponent move, and learned LSTM and feedforward policies.

.. figure:: images/rock-paper-scissors.png

To update the critic, you'll also have to modify the loss of the policy.

**Strategy 2: Sharing observations through an observation function**:

Alternatively, you can use an observation function to share observations between agents. In this strategy, each observation includes all global state, and policies use a custom model to ignore state they aren't supposed to "see" when computing actions. The advantage of this approach is that it's very simple and you don't have to change the algorithm at all -- just use the observation func (i.e., like an env wrapper) and custom model. However, it is a bit less principled in that you have to change the agent observation spaces to include training-time only information. You can find a runnable example of this strategy at `examples/centralized_critic_2.py <https://github.com/ray-project/ray/blob/master/rllib/examples/centralized_critic_2.py>`__.

Grouping Agents
~~~~~~~~~~~~~~~

See this file for a runnable example: `hierarchical_training.py <https://github.com/ray-project/ray/blob/master/rllib/examples/hierarchical/hierarchical_training.py>`__.

External Agents and Applications
--------------------------------

In many situations, it does not make sense for an environment to be "stepped" by RLlib. For example, if a policy is to be used in a web serving system, then it is more natural for an agent to query a service that serves policy decisions, and for that service to learn from experience over time. This case also naturally arises with **external simulators** (e.g. Unity3D, other game engines, or the Gazebo robotics simulator) that run independently outside the control of RLlib, but may still want to leverage RLlib for training.

.. figure:: images/rllib-training-inside-a-unity3d-env.png
    :scale: 75 %

Clients can then connect in either *local* or *remote* inference mode.
In local inference mode, copies of the policy are downloaded from the server and cached on the client for a configurable period of time.
This allows actions to be computed by the client without requiring a network round trip each time.
In remote inference mode, each computed action requires a network call to the server.

Example:

.. code-block:: python

    client = PolicyClient("http://localhost:9900", inference_mode="local")
    episode_id = client.start_episode()
    ...
    action = client.get_action(episode_id, cur_obs)
    ...
    client.end_episode(episode_id, last_obs)

To understand the difference between standard envs, external envs, and connecting with a ``PolicyClient``, refer to the following figure:

For more complex / high-performance environment integrations, you can instead extend the low-level
`BaseEnv <https://github.com/ray-project/ray/blob/master/rllib/env/base_env.py>`__ class.
This low-level API models multiple agents executing asynchronously in multiple environments.
A call to ``BaseEnv:poll()`` returns observations from ready agents keyed by 1) their environment, then 2) agent ids.
Actions for those agents are sent back via ``BaseEnv:send_actions()``. BaseEnv is used to implement all the other env types in RLlib, so it offers a superset of their functionality.
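
A hedged sketch of that polling loop (the exact ``poll()`` return signature varies between RLlib versions, and ``compute_actions_for`` is a made-up helper standing in for your policy inference):

.. code-block:: python

    # `base_env` is assumed to be an existing BaseEnv instance.
    while True:
        # Observations come back keyed by env_id first, then agent_id.
        obs, rewards, terminateds, truncateds, infos, off_policy_actions = base_env.poll()

        # Compute actions only for the agents that are ready, keyed the same way.
        actions = {
            env_id: compute_actions_for(agent_obs)  # made-up helper
            for env_id, agent_obs in obs.items()
        }
        base_env.send_actions(actions)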