[Question] I do not understand the GPU and memory usage of SB3 #1630

EloyAnguiano · 2023-07-27T08:56:39Z

❓ Question

I think I do not underestand the memory usage of SB3. I have a Dict observation space of some huge matrixes, so my observation space is 17MB approx:

(Pdb) [sys.getsizeof(v) for k, v in obs.items()]
[2039312, 2968, 12235248, 105800, 2968, 2968, 2968, 2039312, 116, 2039312, 2968, 2968, 2968]
(Pdb) sum([sys.getsizeof(v) for k, v in obs.items()])/1024/1024
17.623783111572266

I training a PPO agent over a Vectorized environment with the make_vec_env function at n_envs = 2 and the hyperparameters of my PPO agent are n_steps = 6 and my batch_size is 16. If I underestood well, my rollout buffer will be n_steps x n_envs = 12 so the rollout_buffer will be 17 x 12 = 204 MB. I assume that the batch_size of 16 will get the minimum so it is equivalent of having a batch size of 12.

The problem here is that when I'm using a GPU device (80GB A100) it stabilizes at 70GB of usage at the beginning and a little bit later it stops for the lack of space at the device. How is this even possible?

Checklist

I have checked that there is no similar issue in the repo
I have read the documentation
If code there is, it is minimal and working
If code there is, it is formatted using the markdown code blocks for both code and stack traces.

The text was updated successfully, but these errors were encountered:

araffin · 2023-07-27T09:47:51Z

Hello,
there is an important information missing which is your network architecture.
The rollout buffer store things in the RAM not on the GPU.
And most GPU memory is taken by weights and gradients.

Might be a duplicate of #863

EloyAnguiano · 2023-07-27T10:25:06Z

Printing my model size with this:

def print_model_size(model):
    param_size = 0
    for param in model.parameters():
        param_size += param.nelement() * param.element_size()
    buffer_size = 0
    for buffer in model.buffers():
        buffer_size += buffer.nelement() * buffer.element_size()

    size_all_mb = (param_size + buffer_size) / 1024**2
    print(f'model size: {size_all_mb:.3f}MB')  # noqa: T201

And calling it like:
print_model_size(agent.policy)

Returns:
model size: 8.369MB

Is there any part of the agent that could be bigger? I am using my custom FeatureExtractor class but I assume its included at policy argument

EloyAnguiano · 2023-07-27T10:26:17Z

Also, whenever yoou run a batch od data on GPU ypu have to transfer those data to the cuda device, so the data are in GPU at some point, isn't it?

EloyAnguiano · 2023-10-18T14:40:07Z

I am still unable to figure out the problem in here nor in the #863 issue. There, the solution was to flatten the observation, but this does not explain anything

EloyAnguiano · 2023-10-23T14:00:17Z

@araffin I think that GPU usage could be a bit more optimal. First of all, debugging the PPO class (the train method) I found that the GPU usage is a bit confusing, if I keep every hyperparameter fixed (n_steps, batch_size, etc...) but I change the number of environments at the vectorized environment, the GPU usage differs from one another:

1 environment: 1815MiB
16 environments: 8551MiB

I do not understand this as the self.rollout_buffer.size() is still the n_steps as before (32 in my case), so I do not know where does this come from. Indeed the only specifications that should affect the memory usage should be the policy size itself, the batch_size (this is key, the rollout_buffer should be at RAM and whenever we want to train with a batch, you retrieve those data to GPU) and the gradients of the model for the backpropagation.

Does this make any sense? Am I missing something?

araffin · 2023-10-23T15:07:53Z

This should answer your question:

stable-baselines3/stable_baselines3/common/buffers.py

Lines 391 to 398 in aab5459

    
           self.observations = np.zeros((self.buffer_size, self.n_envs, *self.obs_shape), dtype=np.float32) 
        
           self.actions = np.zeros((self.buffer_size, self.n_envs, self.action_dim), dtype=np.float32) 
        
           self.rewards = np.zeros((self.buffer_size, self.n_envs), dtype=np.float32) 
        
           self.returns = np.zeros((self.buffer_size, self.n_envs), dtype=np.float32) 
        
           self.episode_starts = np.zeros((self.buffer_size, self.n_envs), dtype=np.float32) 
        
           self.values = np.zeros((self.buffer_size, self.n_envs), dtype=np.float32) 
        
           self.log_probs = np.zeros((self.buffer_size, self.n_envs), dtype=np.float32) 
        
           self.advantages = np.zeros((self.buffer_size, self.n_envs), dtype=np.float32)

EloyAnguiano · 2023-10-24T07:12:35Z

Yes, it does. Thanks a lot. Also, this opens up my ather question and it is that I think that the rollout buffer should not be on GPU and that the GPU usage should be controlled by the batch size at each epoch. Thus, you could collect a giant rollout_buffer but train on a small but fast GPU by choosing a correct batch_size. Isn't it?

araffin · 2023-10-24T07:53:24Z

#1720 (comment)

EloyAnguiano · 2023-10-24T08:08:35Z

Sorry, I do not understand. If rollout buffer is always on cpu, why at #1630 (comment) the number of used environments improves the GPU usage?

EloyAnguiano · 2023-10-24T08:10:17Z

Indeed, if I debug PPO training at a GPU, I get this:

(Pdb) self.rollout_buffer.device
device(type='cuda', index=2)

This should mean that the data of the rollout_buffer are alocated at GPU

araffin · 2023-10-24T08:23:27Z

the number of used environments improves the GPU usage?

are you using subprocesses? if so, that might be due to the way python multiprocessing work.

This should mean that the data of the rollout_buffer are alocated at GPU

if you look at the code (and you should), the device is only used here:

stable-baselines3/stable_baselines3/common/buffers.py

Lines 127 to 139 in aab5459

    
               def to_torch(self, array: np.ndarray, copy: bool = True) -> th.Tensor: 
        
                   """ 
        
                   Convert a numpy array to a PyTorch tensor. 
        
                   Note: it copies the data by default 
        
                   :param array: 
        
                   :param copy: Whether to copy or not the data (may be useful to avoid changing things 
        
                       by reference). This argument is inoperative if the device is not the CPU. 
        
                   :return: 
        
                   """ 
        
                   if copy: 
        
                       return th.tensor(array, device=self.device) 
        
                   return th.as_tensor(array, device=self.device)

when sampling the data there:

stable-baselines3/stable_baselines3/common/buffers.py

Line 520 in aab5459

return RolloutBufferSamples(*tuple(map(self.to_torch, data)))

EloyAnguiano · 2023-10-24T08:35:19Z

I am creating the environment like this:

gym_env = make_vec_env(make_env,
                               env_kwargs=env_kwargs,
                               n_envs=args.n_envs,
                               vec_env_cls=SubprocVecEnv)

So I assume it uses some kind of multiprocessing, yes. What does this has to do with GPU usage?

EloyAnguiano · 2023-10-30T11:47:10Z

Hi again @araffin . I am still unable to figure out how, it the transition of data from RolloutBuffer is done at each sampling, how can the GPU usage be so big just when the code goes into train method, as this should not have any data on GPU, only the model.

EloyAnguiano added the question Further information is requested label Jul 27, 2023

EloyAnguiano changed the title ~~[Question] I do not underestanc the GPU and memory usage of SB3~~ [Question] I do not underestand the GPU and memory usage of SB3 Jul 27, 2023

EloyAnguiano mentioned this issue Jul 27, 2023

[Bug] GPU memory explodes when using Conv2D layers in Dict Observations FeatureExtractor #863

Closed

EloyAnguiano changed the title ~~[Question] I do not underestand the GPU and memory usage of SB3~~ [Question] I do not understand the GPU and memory usage of SB3 Oct 19, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Question] I do not understand the GPU and memory usage of SB3 #1630

[Question] I do not understand the GPU and memory usage of SB3 #1630

EloyAnguiano commented Jul 27, 2023 •

edited

Loading

araffin commented Jul 27, 2023

EloyAnguiano commented Jul 27, 2023

EloyAnguiano commented Jul 27, 2023

EloyAnguiano commented Oct 18, 2023

EloyAnguiano commented Oct 23, 2023 •

edited

Loading

araffin commented Oct 23, 2023

EloyAnguiano commented Oct 24, 2023

araffin commented Oct 24, 2023

EloyAnguiano commented Oct 24, 2023

EloyAnguiano commented Oct 24, 2023 •

edited

Loading

araffin commented Oct 24, 2023

EloyAnguiano commented Oct 24, 2023 •

edited

Loading

EloyAnguiano commented Oct 30, 2023

[Question] I do not understand the GPU and memory usage of SB3 #1630

[Question] I do not understand the GPU and memory usage of SB3 #1630

Comments

EloyAnguiano commented Jul 27, 2023 • edited Loading

❓ Question

Checklist

araffin commented Jul 27, 2023

EloyAnguiano commented Jul 27, 2023

EloyAnguiano commented Jul 27, 2023

EloyAnguiano commented Oct 18, 2023

EloyAnguiano commented Oct 23, 2023 • edited Loading

araffin commented Oct 23, 2023

EloyAnguiano commented Oct 24, 2023

araffin commented Oct 24, 2023

EloyAnguiano commented Oct 24, 2023

EloyAnguiano commented Oct 24, 2023 • edited Loading

araffin commented Oct 24, 2023

EloyAnguiano commented Oct 24, 2023 • edited Loading

EloyAnguiano commented Oct 30, 2023

EloyAnguiano commented Jul 27, 2023 •

edited

Loading

EloyAnguiano commented Oct 23, 2023 •

edited

Loading

EloyAnguiano commented Oct 24, 2023 •

edited

Loading

EloyAnguiano commented Oct 24, 2023 •

edited

Loading