Added deepseek_v3 in models by Sirorezka · Pull Request #2681 · PrimeIntellect-ai/prime-rl

Sirorezka · 2026-06-01T18:44:22Z

However, for a correct CP implementation, the softmax scaling factor ('softmax_scale') must be added to ring attention.

Note

Medium Risk
Large new model path with MoE routing, MLA attention scaling, and distributed CP patching; incorrect softmax_scale under ring CP could skew training numerics.

Overview
Adds a custom PrimeRL DeepSeek V3 stack (config, MLA attention, group-aware MoE router, HF↔Prime weight conversion) and wires it into AutoConfig / AutoModelForCausalLMPrimeRL.

DeepSeekAttentionCore supports SDPA, packed flash-attn, and varlen flash paths with an MLA-specific softmax_scale hook for ring attention (noted in the PR description as required for correct CP). Ring and Ulysses CP substitution now patch DeepSeekAttentionCore._compute_attention alongside other models.

Includes configs/deepseek_v3/sft.toml for a small test checkpoint and unit tests that compare logits/grads to HuggingFace, cover weight conversion, and verify CP patching.

^{Reviewed by Cursor Bugbot for commit b70c306. Bugbot is set up for automated code reviews on this repo. Configure here.}
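To make the "MLA-specific softmax_scale" point concrete, here is a minimal sketch of how such a scale is commonly derived when YaRN rope scaling is active. It is modeled on my understanding of the HuggingFace DeepSeek V3 attention, not on the code in this PR; the function names `yarn_mscale` and `mla_softmax_scale` and the example dimensions are assumptions for illustration only.

```python
import math


def yarn_mscale(factor: float, mscale: float = 1.0) -> float:
    """YaRN magnitude correction; 1.0 when the context is not extended."""
    if factor <= 1.0:
        return 1.0
    return 0.1 * mscale * math.log(factor) + 1.0


def mla_softmax_scale(
    qk_nope_head_dim: int,
    qk_rope_head_dim: int,
    rope_factor: float = 1.0,
    mscale_all_dim: float = 0.0,
) -> float:
    """Softmax scale for an MLA head (hypothetical helper, assumed formula).

    The query/key head dim is the sum of the non-rotary and rotary parts,
    so the base scale is not 1/sqrt(v_head_dim) or 1/sqrt(hidden/heads).
    """
    qk_head_dim = qk_nope_head_dim + qk_rope_head_dim
    scale = qk_head_dim ** -0.5
    if mscale_all_dim:
        m = yarn_mscale(rope_factor, mscale_all_dim)
        scale *= m * m  # the correction applies to both q and k
    return scale


# Illustrative dimensions only (assumed, not the PR's test checkpoint).
default_scale = mla_softmax_scale(128, 64)
yarn_scale = mla_softmax_scale(128, 64, rope_factor=40.0, mscale_all_dim=1.0)
print(f"default={default_scale:.5f} yarn={yarn_scale:.5f}")
```

If the scale is computed this way, any attention backend that falls back to its own default scale would use a different softmax temperature, which is why the value has to be forwarded explicitly.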

cursor · 2026-06-01T19:16:46Z

+
+    from prime_rl.trainer.models.deepseek_v3.attention_deepseek_v3 import DeepSeekAttentionCore
+
+    DeepSeekAttentionCore._compute_attention = _ring_compute_attention


Ring CP breaks DeepSeek scaling

High Severity

With context parallel ring attention enabled, DeepSeekAttentionCore._compute_attention is replaced by a helper that does not take or forward softmax_scale, while the varlen path still passes MLA/YARN self.scaling. Training with cp > 1 and flash attention can raise a keyword error or run ring flash with the wrong softmax temperature.

Additional Locations (1)

src/prime_rl/trainer/models/deepseek_v3/attention_deepseek_v3.py#L69-L128

^{Reviewed by Cursor Bugbot for commit 01496f7. Configure here.}
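The failure mode described above can be reproduced with a toy example. The sketch below uses plain Python instead of flash attention, and `AttentionCore`, `ring_without_scale`, and `ring_with_scale` are hypothetical stand-ins, not the PR's classes or the repo's ring helper. It only shows the shape of the problem: a patched-in function that does not accept `softmax_scale` either raises on the keyword or silently falls back to the default temperature.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(q, keys, softmax_scale=None):
    """Weights of one query over the keys; defaults to 1/sqrt(head_dim)."""
    if softmax_scale is None:
        softmax_scale = len(q) ** -0.5
    scores = [softmax_scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    return softmax(scores)


class AttentionCore:
    """Hypothetical stand-in for an attention core with a custom scale."""

    def __init__(self, scaling):
        self.scaling = scaling

    def _compute_attention(self, q, keys, softmax_scale=None):
        return attention_weights(q, keys, softmax_scale)

    def forward(self, q, keys):
        # The model-specific scale is always passed by keyword.
        return self._compute_attention(q, keys, softmax_scale=self.scaling)


def ring_without_scale(self, q, keys):
    # Replacement that ignores the scale: wrong temperature if called directly.
    return attention_weights(q, keys)


def ring_with_scale(self, q, keys, softmax_scale=None):
    # Replacement that accepts and forwards the scale.
    return attention_weights(q, keys, softmax_scale)


core = AttentionCore(scaling=0.9)  # differs from the default 4 ** -0.5 = 0.5
q = [1.0, 2.0, 0.0, 1.0]
keys = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 1.0]]
reference = core.forward(q, keys)

AttentionCore._compute_attention = ring_without_scale
try:
    core.forward(q, keys)
    keyword_error = False
except TypeError:  # unexpected keyword argument 'softmax_scale'
    keyword_error = True
wrong_temperature = ring_without_scale(core, q, keys)

AttentionCore._compute_attention = ring_with_scale
patched = core.forward(q, keys)
```

A unit test for the real patch could follow the same pattern: compare the patched output against the unpatched path with a non-default scale, so that a dropped argument shows up as a numeric mismatch rather than passing silently.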

Seems relevant as well

cursor · 2026-06-01T19:16:46Z

+    "DeepseekV3ForCausalLM",
+    "DeepseekV3Model",
+    "DeepseekV3PreTrainedModel",
+]


Missing KL validation table

Medium Severity

This PR adds a new custom deepseek_v3 implementation but does not include the required mean KL mismatch table (20 steps, math env, batch_size=64, all entries below 0.015).

^{Triggered by project rule: BugBot Instructions}

^{Reviewed by Cursor Bugbot for commit 01496f7. Configure here.}

What is the KL validation table?

Explained in my comment

S1ro1

Left some comments that are blocking currently, else looks reasonable to me.

There are a few blocking things: we now require 20 steps with kl_mismatch < 0.015 across all steps on the math env with BS=64 before merging new models. Is this something you can do on your end? If not, feel free to drop a config for it and I can run it at the earliest convenience.

Also, can you add or mention the model in the relevant parts of the docs/README where other model impls are mentioned?
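The merge gate described above (20 steps, every step's mean KL mismatch below 0.015) can be checked mechanically once the per-step values are logged. The sketch below is a hypothetical helper, not part of prime-rl; how the mismatch itself is measured is left to the training run, and the values used here are synthetic placeholders, not measurements.

```python
def check_kl_gate(mean_kl_per_step, threshold=0.015, required_steps=20):
    """Return (passed, failing) where failing lists (step, value) pairs.

    `not kl < threshold` also flags NaN values as failures.
    """
    failing = [
        (step, kl)
        for step, kl in enumerate(mean_kl_per_step)
        if not kl < threshold
    ]
    enough_steps = len(mean_kl_per_step) >= required_steps
    return enough_steps and not failing, failing


def format_table(mean_kl_per_step):
    """Render the per-step values as a plain-text table for the PR body."""
    lines = ["step | mean KL mismatch", "-----|-----------------"]
    lines += [f"{step:>4} | {kl:.4f}" for step, kl in enumerate(mean_kl_per_step)]
    return "\n".join(lines)


# Synthetic placeholder values for illustration only.
placeholder = [0.004 + 0.0005 * (i % 3) for i in range(20)]
passed, failing = check_kl_gate(placeholder)
print(format_table(placeholder))
print("gate passed:", passed)
```

Requiring every step to pass, rather than the average over the run, matches the "across all steps" wording in the review comment.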

S1ro1 · 2026-06-02T01:07:10Z

@@ -0,0 +1,131 @@
+import torch


This file seems to copy most of the current attention impl with few or no changes. Any reason for it? If there are changes, let's move them to the shared impl, provided that doesn't break anything.

I wasn't able to reuse them because 'FlashAttention' and 'SDPAAttention' have their own versions of the q, k, v projections, which differ from the ones used in DeepSeek. To reuse the original attention I would need to rewrite the 'init', 'forward' and 'attn_projections' methods, which is basically the same as rewriting the whole module from scratch.

S1ro1 · 2026-06-02T01:09:24Z

@@ -0,0 +1,65 @@
+


Can we change configs to full shape and also fix formatting?

Thx. Where can I find an example of the full shape of the config?

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 3 total unresolved issues (including 2 from previous reviews).

^{Reviewed by Cursor Bugbot for commit b70c306. Configure here.}

cursor · 2026-06-02T16:56:16Z

+    "DeepseekV3ForCausalLM",
+    "DeepseekV3Model",
+    "DeepseekV3PreTrainedModel",
+]


New custom model missing required KL mismatch table

Medium Severity

This PR introduces deepseek_v3 as a new custom model but does not include the required table showing mean KL mismatch across 20 steps on a math environment with batch_size=64. Per project rules, all entries in such a table must be lower than 0.015 before the PR can be accepted.

^{Triggered by project rule: BugBot Instructions}

^{Reviewed by Cursor Bugbot for commit b70c306. Configure here.}

added deepseek_v3

af3d764

cursor Bot reviewed Jun 1, 2026

Sirorezka added 2 commits June 1, 2026 22:10

fixing issues revealed by bot

72eb006

Merge branch 'main' into feat_deepseak_v3

01496f7

cursor Bot reviewed Jun 1, 2026

S1ro1 reviewed Jun 2, 2026

fixing issues reveald in pr

3a3c8dc

cursor Bot reviewed Jun 2, 2026

Comment thread src/prime_rl/trainer/models/deepseek_v3/configuration_deepseek_v3.py Outdated

Comment thread src/prime_rl/trainer/models/deepseek_v3/converting_deepseek_v3.py Outdated

fixing issues reveald in pr

b70c306

cursor Bot reviewed Jun 2, 2026
