First, I would like to express my gratitude to the researchers who transparently shared their outstanding work.
I have a question regarding the optimizer interchangeability discussed in Section 3.5.1 of the technical report. The report states that the best training results were observed when Muon was used as the optimizer in both pre-training and SFT. I am curious whether the same finding held when moving beyond SFT to RL, that is, whether keeping Muon as the optimizer during RL training also produced the best results.