Dear authors, we're working on reproducing the Extended RL stage from your BusterX++ paper and have run into some implementation questions regarding the Thinking Reward. Would it be possible to share a few more details?
1. Thinking Reward scoring:
- The paper mentions evaluating "coherence and completeness" – how exactly is this quantified?
- Could you share the prompt used with SophiaVL-R1-Thinking-Reward-Model-3B?
- Do you evaluate only the <think>...</think> portion or the entire response?
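For context, our current reimplementation extracts only the reasoning span before passing it to the reward model – this is our assumption, not something stated in the paper, and `extract_think` is our own helper:

```python
import re

def extract_think(response: str):
    """Return the contents of the first <think>...</think> block, or None.

    This reflects our guess that only the reasoning span is scored;
    please correct us if the full response is used instead.
    """
    match = re.search(r"<think>(.*?)</think>", response, re.DOTALL)
    return match.group(1).strip() if match else None

# extract_think("<think>step 1... step 2...</think><answer>real</answer>")
# -> "step 1... step 2..."
```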
2. Reward combination in Extended RL:
- Is it simply: total = format + accuracy + thinking?
- Are these three rewards summed directly, or are there weighting coefficients?
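To be concrete, this is what we are currently running – a weighted sum with all coefficients defaulting to 1.0 (so it reduces to a plain sum); the weight names are ours:

```python
def total_reward(format_r: float, accuracy_r: float, thinking_r: float,
                 w_format: float = 1.0, w_accuracy: float = 1.0,
                 w_thinking: float = 1.0) -> float:
    """Weighted combination of the three rewards (our assumption).

    With the default weights of 1.0 this is the direct sum
    format + accuracy + thinking asked about above.
    """
    return w_format * format_r + w_accuracy * accuracy_r + w_thinking * thinking_r
```

If you do use non-uniform coefficients, the three values would be very helpful.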
3. Edge cases:
- How do you handle cases where the model doesn't generate <think> tags or the reasoning is very short?
- Is a zero reward given, or is there some fallback scoring?
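Right now we fall back to a zero thinking reward when the tags are missing or the span is shorter than an arbitrary threshold – again purely our guess, with `score_fn` standing in for the reward-model call:

```python
import re

def thinking_reward(response: str, score_fn, min_len: int = 20) -> float:
    """Zero fallback when <think> is absent or too short (our assumption).

    score_fn is a placeholder for the actual reward-model scoring call;
    min_len is an arbitrary cutoff we picked, not a value from the paper.
    """
    match = re.search(r"<think>(.*?)</think>", response, re.DOTALL)
    if match is None:
        return 0.0
    reasoning = match.group(1).strip()
    if len(reasoning) < min_len:
        return 0.0
    return score_fn(reasoning)
```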
Thanks for your time and the great work!