Added deepseek_v3 in models #2681

Open
Sirorezka wants to merge 5 commits into PrimeIntellect-ai:main from Sirorezka:feat_deepseak_v3

Conversation


@Sirorezka Sirorezka commented Jun 1, 2026

However, for a correct CP implementation, a 'softmax_scale' factor must be added to ring attention.


Note

Medium Risk
Large new model path with MoE routing, MLA attention scaling, and distributed CP patching; incorrect softmax_scale under ring CP could skew training numerics.

Overview
Adds a custom PrimeRL DeepSeek V3 stack (config, MLA attention, group-aware MoE router, HF↔Prime weight conversion) and wires it into AutoConfig / AutoModelForCausalLMPrimeRL.

DeepSeekAttentionCore supports SDPA, packed flash-attn, and varlen flash paths with an MLA-specific softmax_scale hook for ring attention (noted in the PR description as required for correct CP). Ring and Ulysses CP substitution now patch DeepSeekAttentionCore._compute_attention alongside other models.
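To make the MLA-specific scale concrete, here is a minimal sketch of how such a `softmax_scale` is commonly derived. It follows the formula used in the Hugging Face DeepSeek-V3 reference implementation as I understand it; the helper name `mla_softmax_scale` and the numeric values (taken from the published DeepSeek-V3 config) are assumptions, not code from this PR.

```python
import math


def yarn_get_mscale(scale: float = 1.0, mscale: float = 1.0) -> float:
    """YaRN magnitude correction; 1.0 when no context extension is applied."""
    if scale <= 1:
        return 1.0
    return 0.1 * mscale * math.log(scale) + 1.0


def mla_softmax_scale(
    qk_nope_head_dim: int,
    qk_rope_head_dim: int,
    rope_factor: float = 1.0,
    mscale_all_dim: float = 0.0,
) -> float:
    """Hypothetical helper: softmax scale for MLA, optionally YaRN-corrected."""
    qk_head_dim = qk_nope_head_dim + qk_rope_head_dim
    scale = qk_head_dim ** -0.5
    if mscale_all_dim:
        m = yarn_get_mscale(rope_factor, mscale_all_dim)
        scale *= m * m
    return scale


# Assumed values from the published DeepSeek-V3 config (not from this PR).
default_scale = (128 + 64) ** -0.5  # what a kernel would use if no scale is passed
yarn_scale = mla_softmax_scale(128, 64, rope_factor=40.0, mscale_all_dim=1.0)
ratio = yarn_scale / default_scale  # how far off the kernel default would be
```

Under these assumptions the YaRN-corrected scale is noticeably larger than the `head_dim ** -0.5` default that attention kernels fall back to, which is why the scale has to be passed explicitly on every attention path, including ring attention.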

Includes configs/deepseek_v3/sft.toml for a small test checkpoint and unit tests that compare logits/grads to HuggingFace, cover weight conversion, and verify CP patching.

Reviewed by Cursor Bugbot for commit b70c306.

Comment thread src/prime_rl/trainer/models/deepseek_v3/converting_deepseek_v3.py
Comment thread tests/unit/train/models/test_deepseek_v3.py
Comment thread configs/deepseek_v3/sft.toml
Comment thread src/prime_rl/trainer/models/deepseek_v3/modeling_deepseek_v3.py Outdated
Comment thread src/prime_rl/trainer/models/deepseek_v3/modeling_deepseek_v3.py
Comment thread src/prime_rl/trainer/models/deepseek_v3/attention_deepseek_v3.py
Comment thread src/prime_rl/trainer/models/deepseek_v3/attention_deepseek_v3.py Outdated
Comment thread src/prime_rl/trainer/models/deepseek_v3/converting_deepseek_v3.py

from prime_rl.trainer.models.deepseek_v3.attention_deepseek_v3 import DeepSeekAttentionCore

DeepSeekAttentionCore._compute_attention = _ring_compute_attention

Ring CP breaks DeepSeek scaling

High Severity

With context parallel ring attention enabled, DeepSeekAttentionCore._compute_attention is replaced by a helper that does not take or forward softmax_scale, while the varlen path still passes MLA/YARN self.scaling. Training with cp > 1 and flash attention can raise a keyword error or run ring flash with the wrong softmax temperature.
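The failure mode described above can be reproduced with a toy example. This is a sketch only: `AttentionCore` and both ring functions are hypothetical stand-ins, not the classes or helpers from this PR, and the "attention" just returns a tuple so the control flow is easy to see.

```python
class AttentionCore:
    """Hypothetical stand-in for an attention core that forwards a custom scale."""

    def __init__(self, scaling: float):
        self.scaling = scaling

    def _compute_attention(self, q, k, v, softmax_scale=None):
        return ("local", softmax_scale)

    def forward(self, q, k, v):
        # The caller always passes the model-specific scale by keyword.
        return self._compute_attention(q, k, v, softmax_scale=self.scaling)


def ring_compute_attention_no_scale(self, q, k, v):
    """Replacement that does not accept softmax_scale (the reported bug)."""
    return ("ring", None)


def ring_compute_attention(self, q, k, v, softmax_scale=None):
    """Replacement that accepts the scale and would forward it to the ring kernel."""
    return ("ring", softmax_scale)


core = AttentionCore(scaling=0.135)

# Patching in the scale-unaware helper makes the keyword call fail.
AttentionCore._compute_attention = ring_compute_attention_no_scale
try:
    core.forward(1, 2, 3)
    raised = False
except TypeError:
    raised = True

# Patching in the scale-aware helper keeps the model-specific scale intact.
AttentionCore._compute_attention = ring_compute_attention
result = core.forward(1, 2, 3)
```

If the replacement instead swallowed unknown keywords (for example via `**kwargs`) without forwarding them, the call would succeed but silently use the kernel's default scale, which is the second outcome the comment warns about.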

Reviewed by Cursor Bugbot for commit 01496f7.

Collaborator

Seems relevant as well

"DeepseekV3ForCausalLM",
"DeepseekV3Model",
"DeepseekV3PreTrainedModel",
]

Missing KL validation table

Medium Severity

This PR adds a new custom deepseek_v3 implementation but does not include the required mean KL mismatch table (20 steps, math env, batch_size=64, all entries below 0.015).

Triggered by project rule: BugBot Instructions

Reviewed by Cursor Bugbot for commit 01496f7.

Author

What is the KL validation table?

Collaborator

Explained in my comment

Comment thread src/prime_rl/trainer/models/deepseek_v3/attention_deepseek_v3.py
Collaborator

@S1ro1 S1ro1 left a comment

Left some comments that are currently blocking; otherwise this looks reasonable to me.

There are a few blocking things: we now require 20 steps with kl_mismatch < 0.015 across all steps on the math env with BS=64 before merging new models. Is this something you can do on your end? If not, feel free to drop a config for it and I can run it at the earliest convenience.

Also, can you mention the model in the docs/README where the other model implementations are listed?
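For readers unfamiliar with the merge gate described above, it can be sketched roughly as follows. The exact definition of `kl_mismatch` used by the project is not given in this thread, so the estimator below (a non-negative per-token KL estimate between trainer and inference log-probabilities) and every function name are assumptions for illustration only.

```python
import math

KL_THRESHOLD = 0.015  # from the review comment
REQUIRED_STEPS = 20   # from the review comment


def mean_kl_mismatch(trainer_logprobs, inference_logprobs):
    """Assumed estimator: mean of exp(d) - 1 - d with d = trainer - inference.

    Both inputs hold log-probabilities of the same sampled tokens.
    """
    if len(trainer_logprobs) != len(inference_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    if not trainer_logprobs:
        raise ValueError("need at least one token")
    total = 0.0
    for t, i in zip(trainer_logprobs, inference_logprobs):
        d = t - i
        total += math.exp(d) - 1.0 - d
    return total / len(trainer_logprobs)


def passes_kl_gate(per_step_kl):
    """True only if there are enough steps and every step is below the threshold."""
    return len(per_step_kl) >= REQUIRED_STEPS and all(
        kl < KL_THRESHOLD for kl in per_step_kl
    )
```

A run would then be summarized as a 20-row table of per-step values, with `passes_kl_gate` applied to the full list rather than to the average, since the requirement is stated per step.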

@@ -0,0 +1,131 @@
import torch
Collaborator

This file seems to copy most of the current attention impl with few or no changes. Any reason for that? If there are changes, let's move them to the shared impl, as long as that doesn't break anything.

Author

@Sirorezka Sirorezka Jun 2, 2026

I wasn't able to reuse them, because 'FlashAttention' and 'SDPAAttention' have their own versions of the q, k, v projections, which differ from the ones used in DeepSeek. To reuse the original attention I would need to rewrite the 'init', 'forward' and 'attn_projections' methods, which is basically the same as rewriting the whole module from scratch.
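To illustrate the projection mismatch mentioned above, here is a small sketch comparing weight shapes. The MLA projection names follow the Hugging Face DeepSeek-V3 implementation and the default dimensions come from the published DeepSeek-V3 config; both are assumptions here, and the modules in this PR may be named or laid out differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MLAConfig:
    # Assumed DeepSeek-V3 dimensions; not taken from this PR.
    hidden_size: int = 7168
    num_heads: int = 128
    q_lora_rank: int = 1536
    kv_lora_rank: int = 512
    qk_nope_head_dim: int = 128
    qk_rope_head_dim: int = 64
    v_head_dim: int = 128


def standard_projection_shapes(hidden_size: int, num_heads: int, head_dim: int):
    """(out_features, in_features) for plain multi-head attention."""
    inner = num_heads * head_dim
    return {
        "q_proj": (inner, hidden_size),
        "k_proj": (inner, hidden_size),
        "v_proj": (inner, hidden_size),
        "o_proj": (hidden_size, inner),
    }


def mla_projection_shapes(cfg: MLAConfig):
    """(out_features, in_features) for low-rank MLA projections."""
    qk_head_dim = cfg.qk_nope_head_dim + cfg.qk_rope_head_dim
    return {
        "q_a_proj": (cfg.q_lora_rank, cfg.hidden_size),
        "q_b_proj": (cfg.num_heads * qk_head_dim, cfg.q_lora_rank),
        "kv_a_proj_with_mqa": (cfg.kv_lora_rank + cfg.qk_rope_head_dim, cfg.hidden_size),
        "kv_b_proj": (
            cfg.num_heads * (cfg.qk_nope_head_dim + cfg.v_head_dim),
            cfg.kv_lora_rank,
        ),
        "o_proj": (cfg.hidden_size, cfg.num_heads * cfg.v_head_dim),
    }


mla_shapes = mla_projection_shapes(MLAConfig())
std_shapes = standard_projection_shapes(7168, 128, 128)
```

The two layouts share only `o_proj` by name, and the query/key head dimension differs from the value head dimension in MLA, which is consistent with the point that a shared attention module built around `q_proj`/`k_proj`/`v_proj` cannot be reused without overriding its projection logic.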

Comment thread src/prime_rl/trainer/models/deepseek_v3/converting_deepseek_v3.py
Comment thread configs/deepseek_v3/sft.toml Outdated
@@ -0,0 +1,65 @@

Collaborator

Can we change configs to full shape and also fix formatting?

Author

Thanks. Where can I find an example of a full-shape config?

Comment thread src/prime_rl/trainer/models/deepseek_v3/configuration_deepseek_v3.py Outdated
Comment thread src/prime_rl/trainer/models/deepseek_v3/converting_deepseek_v3.py Outdated
@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 3 total unresolved issues (including 2 from previous reviews).

Reviewed by Cursor Bugbot for commit b70c306.

"DeepseekV3ForCausalLM",
"DeepseekV3Model",
"DeepseekV3PreTrainedModel",
]

New custom model missing required KL mismatch table

Medium Severity

This PR introduces deepseek_v3 as a new custom model but does not include the required table showing mean KL mismatch across 20 steps on a math environment with batch_size=64. Per project rules, all entries in such a table must be lower than 0.015 before the PR can be accepted.

Triggered by project rule: BugBot Instructions

Reviewed by Cursor Bugbot for commit b70c306.
