Added deepseek_v3 in models#2681
Conversation
|
|
||
| from prime_rl.trainer.models.deepseek_v3.attention_deepseek_v3 import DeepSeekAttentionCore | ||
|
|
||
| DeepSeekAttentionCore._compute_attention = _ring_compute_attention |
There was a problem hiding this comment.
Ring CP breaks DeepSeek scaling
High Severity
With context parallel ring attention enabled, DeepSeekAttentionCore._compute_attention is replaced by a helper that does not take or forward softmax_scale, while the varlen path still passes MLA/YARN self.scaling. Training with cp > 1 and flash attention can raise a keyword error or run ring flash with the wrong softmax temperature.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 01496f7. Configure here.
| "DeepseekV3ForCausalLM", | ||
| "DeepseekV3Model", | ||
| "DeepseekV3PreTrainedModel", | ||
| ] |
There was a problem hiding this comment.
Missing KL validation table
Medium Severity
This PR adds a new custom deepseek_v3 implementation but does not include the required mean KL mismatch table (20 steps, math env, batch_size=64, all entries below 0.015).
Triggered by project rule: BugBot Instructions
Reviewed by Cursor Bugbot for commit 01496f7. Configure here.
S1ro1
left a comment
There was a problem hiding this comment.
Left some comments that are blocking currently, else looks reasonable to me.
There are few blocking things - we now require 20 steps with kl_mismatch < 0.015 across all steps on math env with BS=64 before merging new models. Is this something you can do on your end? If not feel free to drop a config for it and I can run it at the earliest convenience.
Also can you add/mention relevant parts to the model in docs/README where other model impls are mentioned.
| @@ -0,0 +1,131 @@ | |||
| import torch | |||
There was a problem hiding this comment.
This file seems to copy most of the current attention impl without any (or few) changes, any reason for it? If there are any changes, let's move them to the shared impl if not breaking?
There was a problem hiding this comment.
I wasn't been able to reuse them, because 'FlashAttention' and 'SDPAAttention' have their own versions of q,k,v projections which are differ from the ones that are used in DeepSeek. Because of this to reuse original attention I would need to rewrite 'init', 'forward' and 'attn_projections' methods. Which is basically the same as rewriting whole module from scratch.
| @@ -0,0 +1,65 @@ | |||
|
|
|||
There was a problem hiding this comment.
Can we change configs to full shape and also fix formatting?
There was a problem hiding this comment.
Thx. Where can I find example of the full shape of the config?
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 3 total unresolved issues (including 2 from previous reviews).
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit b70c306. Configure here.
| "DeepseekV3ForCausalLM", | ||
| "DeepseekV3Model", | ||
| "DeepseekV3PreTrainedModel", | ||
| ] |
There was a problem hiding this comment.
New custom model missing required KL mismatch table
Medium Severity
This PR introduces deepseek_v3 as a new custom model but does not include the required table showing mean KL mismatch across 20 steps on a math environment with batch_size=64. Per project rules, all entries in such a table must be lower than 0.015 before the PR can be accepted.
Triggered by project rule: BugBot Instructions
Reviewed by Cursor Bugbot for commit b70c306. Configure here.


whoever for correct cp implementation 'softmax_scalling' factor must be added to ring attention.
Note
Medium Risk
Large new model path with MoE routing, MLA attention scaling, and distributed CP patching; incorrect softmax_scale under ring CP could skew training numerics.
Overview
Adds a custom PrimeRL DeepSeek V3 stack (config, MLA attention, group-aware MoE router, HF↔Prime weight conversion) and wires it into
AutoConfig/AutoModelForCausalLMPrimeRL.DeepSeekAttentionCoresupports SDPA, packed flash-attn, and varlen flash paths with an MLA-specificsoftmax_scalehook for ring attention (noted in the PR description as required for correct CP). Ring and Ulysses CP substitution now patchDeepSeekAttentionCore._compute_attentionalongside other models.Includes
configs/deepseek_v3/sft.tomlfor a small test checkpoint and unit tests that compare logits/grads to HuggingFace, cover weight conversion, and verify CP patching.Reviewed by Cursor Bugbot for commit b70c306. Bugbot is set up for automated code reviews on this repo. Configure here.