@alirahkay
Collaborator

No description provided.

)
batch = self.preprocess_update_batch(batch)

self._model_optimizer.zero_grad()
Collaborator

I think this `zero_grad()` call can be moved down to right above line 435.

:py:class:`~hive.replays.circular_replay.CircularReplayBuffer`.
discount_rate (float): A number between 0 and 1 specifying how much
future rewards are discounted by the agent.
n_step (int): The horizon used in n-step returns to compute TD(n) targets.
Collaborator

Doubt: Is this the length of the horizon used while planning to tune the policy?
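
For reference, "n-step returns" in the docstring usually refers to the quantity sketched below: the first n rewards are summed with discounting, and a value estimate at step n stands in for the rest. This is a minimal sketch, not the code in this PR; `n_step_target` and its arguments are hypothetical names, and whether the same value also bounds planning rollouts is for the author to confirm.

```python
def n_step_target(rewards, bootstrap_value, discount_rate, n_step):
    """Return sum_{k<n} gamma^k * r_k + gamma^n * bootstrap_value.

    rewards: rewards observed after the current state, in order.
    bootstrap_value: value estimate (e.g. max Q) at the state n steps ahead.
    """
    if len(rewards) < n_step:
        raise ValueError("need at least n_step rewards")
    target = 0.0
    for k in range(n_step):
        # Each later reward is discounted one more time.
        target += (discount_rate ** k) * rewards[k]
    # Bootstrap from the value estimate n steps ahead.
    return target + (discount_rate ** n_step) * bootstrap_value


# Three-step target with a discount rate of 0.5 (hypothetical numbers).
example = n_step_target(
    [1.0, 1.0, 1.0], bootstrap_value=8.0, discount_rate=0.5, n_step=3
)
```

With `n_step=1` this reduces to the usual one-step TD target.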

stack_size=stack_size,
gamma=discount_rate,
)
self._planning_buffer = planning_buffer(
Collaborator

Doubt: Why are there separate replay buffers for planning and learning?

):
self._logger.log_scalar("train_qval", torch.max(qvals), self._timescale)
agent_traj_state = {}
return action, agent_traj_state
Collaborator

Minor comment: Defining agent_traj_state might not be necessary.

"observation": update_info["observation"],
"action": update_info["action"],
"reward": update_info["reward"],
"done": update_info["terminated"],
Collaborator

Why is `or update_info["truncated"]` not added for this replay buffer?
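
To make the difference concrete, here is a small sketch of the two variants. `make_transition` and `include_truncated` are hypothetical names used only for illustration; the dictionary keys are the ones in the snippet above.

```python
def make_transition(update_info, include_truncated):
    """Build a replay-buffer entry from an update_info dict.

    include_truncated=False mirrors the snippet above (done = terminated).
    include_truncated=True also marks time-limit cutoffs as done.
    """
    done = update_info["terminated"]
    if include_truncated:
        done = done or update_info["truncated"]
    return {
        "observation": update_info["observation"],
        "action": update_info["action"],
        "reward": update_info["reward"],
        "done": done,
    }


# An episode cut off by a time limit: terminated is False, truncated is True.
info = {
    "observation": [0.0],
    "action": 1,
    "reward": 0.5,
    "terminated": False,
    "truncated": True,
}
without_truncated = make_transition(info, include_truncated=False)
with_truncated = make_transition(info, include_truncated=True)
```

The two variants differ only on truncated episodes. Which one fits probably depends on whether `done` is meant to stop bootstrapping or to mark episode boundaries in this buffer.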

return

(
preprocessed_learning_update_info,
Collaborator

Why have two replay buffers? From what I understand, both buffers store the same transitions; only the batch size for planning and for model learning might differ. That could be passed as a separate argument instead. Also, having two buffers increases the memory the model requires.
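
A rough sketch of the single-buffer alternative, assuming both buffers really do hold identical transitions. `SharedReplayBuffer` is a hypothetical stand-in written for this comment, not the `CircularReplayBuffer` API.

```python
import random


class SharedReplayBuffer:
    """Hypothetical single buffer sampled with a per-call batch size."""

    def __init__(self, capacity, seed=None):
        self._capacity = capacity
        self._storage = []
        self._next_index = 0
        self._rng = random.Random(seed)

    def add(self, transition):
        # Overwrite the oldest entry once the buffer is full (circular).
        if len(self._storage) < self._capacity:
            self._storage.append(transition)
        else:
            self._storage[self._next_index] = transition
        self._next_index = (self._next_index + 1) % self._capacity

    def sample(self, batch_size):
        # Sample without replacement; never request more than is stored.
        count = min(batch_size, len(self._storage))
        return self._rng.sample(self._storage, count)

    def __len__(self):
        return len(self._storage)


buffer = SharedReplayBuffer(capacity=100, seed=0)
for step in range(50):
    buffer.add({"observation": step, "action": 0, "reward": 0.0, "done": False})

learning_batch = buffer.sample(batch_size=32)  # model-learning update
planning_batch = buffer.sample(batch_size=8)   # planning update
```

This only covers the case where the stored transitions are the same. If the two buffers are configured differently (for example in `n_step` or `stack_size`), separate buffers may still be needed.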


# Observations
obs_pred_list = []
for a in range(self._act_dim):
Collaborator

Curious question: Isn't there a better way to do it without the for loop?
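
One possible way, sketched with NumPy under the assumption that the predictor takes the encoded observation concatenated with a scalar action index: build the inputs for all actions at once and call the predictor a single time. `pair_with_all_actions` is a hypothetical helper, not code from this PR.

```python
import numpy as np


def pair_with_all_actions(encoded_obs, act_dim):
    """Pair every encoded observation with every action index.

    encoded_obs: array of shape (batch, feat).
    Returns an array of shape (batch * act_dim, feat + 1) whose last column
    is the action index, so a predictor can run once instead of act_dim times.
    """
    batch, _ = encoded_obs.shape
    # Repeat each observation act_dim times: (batch * act_dim, feat).
    obs_rep = np.repeat(encoded_obs, act_dim, axis=0)
    # Cycle through the action indices per observation: (batch * act_dim, 1).
    actions = np.tile(np.arange(act_dim, dtype=encoded_obs.dtype), batch)[:, None]
    return np.concatenate([obs_rep, actions], axis=1)


encoded = np.arange(6, dtype=np.float32).reshape(2, 3)  # batch=2, feat=3
inputs = pair_with_all_actions(encoded, act_dim=4)      # shape (8, 4)
```

The predictor output can then be reshaped to `(batch, act_dim, ...)`. The same pattern should carry over to PyTorch with `torch.repeat_interleave` and `torch.cat`.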

# Observations
self._obs_encoder = observation_encoder_net(in_dim)
obs_predictor_in_dim = (
np.prod(calculate_output_dim(self._obs_encoder, in_dim)) + 1
Collaborator

Question: Is the extra dimension (the `+ 1`) added for the action? I thought actions are generally one-hot encoded for discrete action spaces.
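
To spell out the two options being compared, a small sketch: a scalar action index adds one input feature, while a one-hot action adds `act_dim` features. `action_features` and `predictor_in_dim` are hypothetical helpers, and the sizes below are made up.

```python
def action_features(action, act_dim, one_hot):
    """Encode a discrete action as predictor input features."""
    if one_hot:
        # One feature per possible action, with a 1.0 at the chosen index.
        return [1.0 if i == action else 0.0 for i in range(act_dim)]
    # A single feature holding the action index.
    return [float(action)]


def predictor_in_dim(encoded_dim, act_dim, one_hot):
    """Input width of the predictor: encoded observation plus action features."""
    return encoded_dim + len(action_features(0, act_dim, one_hot))


scalar_dim = predictor_in_dim(encoded_dim=64, act_dim=4, one_hot=False)
one_hot_dim = predictor_in_dim(encoded_dim=64, act_dim=4, one_hot=True)
```

With an encoder output of 64 features and 4 actions, the scalar form gives an input width of 65 (matching the `+ 1`), and the one-hot form gives 68.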


3 participants