Thanks for releasing the code. I have two questions about the algorithm implementation.
- The PEX code is built on IQL (both offline and online). During online fine-tuning, the algorithm uses `dist.sample()` to choose `w` and `action_2` when interacting with the environment. What I want to know is why, at evaluation time, you choose `w` and `action_2` with epsilon-greedy instead of a purely greedy operation (see the first sketch after this list).
- I plan to use SAC as the backbone algorithm for PEX, as you did in the paper. Since no SAC version is released, I would like to know how to transfer the offline-trained Q to the online SAC. Because SAC's Q incorporates the entropy term, I am not sure whether it is reasonable to directly use the offline-trained Q as SAC's Q and update it with the soft Bellman equation (the second sketch below shows what I have in mind).
- Furthermore, I don't understand the adaptation of the SAC actor training shown in the picture below. Could you describe the adaptation in more detail? (My reading of the plain SAC actor update is in the third sketch.)
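
To make the first question concrete, here is a minimal sketch of how I read the action composition. `pi_offline`, `pi_online`, `q_net`, and `inv_temp` are my own placeholder names, not identifiers from your code:

```python
import torch

def select_action(obs, pi_offline, pi_online, q_net, inv_temp=10.0,
                  mode="sample", epsilon=0.1):
    """Sketch of PEX-style action composition (my reading, not your code).

    mode="sample": w ~ Categorical(softmax(Q * inv_temp))  (online interaction)
    mode="eps":    epsilon-greedy over the two candidates  (what evaluation does)
    mode="greedy": argmax over the two candidates          (what I expected)
    """
    a1 = pi_offline(obs).sample()  # candidate from the frozen offline policy
    a2 = pi_online(obs).sample()   # candidate from the online policy
    q = torch.stack([q_net(obs, a1), q_net(obs, a2)])  # Q-value of each candidate

    if mode == "sample":
        w = torch.distributions.Categorical(logits=q * inv_temp).sample()
    elif mode == "eps":
        w = torch.randint(0, 2, ()) if torch.rand(()) < epsilon else q.argmax()
    else:  # "greedy"
        w = q.argmax()
    return a1 if w == 0 else a2
```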
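For the second question, this is a minimal sketch of the transfer I have in mind, assuming a standard twin-critic SAC setup; `offline_q`, `actor`, and the hyperparameters are again placeholders of mine:

```python
import copy
import torch
import torch.nn.functional as F

def make_sac_critics(offline_q):
    # Initialize SAC's twin critics (and their targets) from the offline-trained Q.
    q1, q2 = copy.deepcopy(offline_q), copy.deepcopy(offline_q)
    return q1, q2, copy.deepcopy(q1), copy.deepcopy(q2)

def soft_bellman_loss(q1, q2, q1_t, q2_t, actor, batch, gamma=0.99, alpha=0.2):
    obs, act, rew, next_obs, done = batch
    with torch.no_grad():
        dist = actor(next_obs)
        next_act = dist.sample()
        log_prob = dist.log_prob(next_act).sum(-1, keepdim=True)
        target_q = torch.min(q1_t(next_obs, next_act), q2_t(next_obs, next_act))
        # Soft Bellman backup: the entropy term (-alpha * log_prob) enters here,
        # which the offline IQL Q was never trained with -- hence my concern.
        y = rew + gamma * (1.0 - done) * (target_q - alpha * log_prob)
    return F.mse_loss(q1(obs, act), y) + F.mse_loss(q2(obs, act), y)
```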
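And for the last point, my understanding of the plain SAC actor update is the sketch below (same placeholder names); I cannot tell from the picture which part of it the adaptation changes:

```python
import torch

def sac_actor_loss(q1, q2, actor, obs, alpha=0.2):
    dist = actor(obs)
    act = dist.rsample()  # reparameterized sample so gradients flow to the actor
    log_prob = dist.log_prob(act).sum(-1, keepdim=True)
    q = torch.min(q1(obs, act), q2(obs, act))
    # Plain SAC objective: maximize Q plus entropy, i.e. minimize this loss.
    return (alpha * log_prob - q).mean()
```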
