
Conversation


@Framartin Framartin commented Jun 13, 2025

What does this PR do?

Bug fix: yes (fixes #211).

This PR fixes several bugs in the speculative decoding example.

Overview:

  • Add support for freezing the base model in main.py, launch.sh, and the README (this was recommended, but not actually implemented, contrary to what the README previously stated)
  • Fix the data format to support the Daring-Anteater dataset
  • Use the --chat flag in the server_generate.py calls
  • Add a system prompt to the fastchat prompt
  • Use client.chat.completions.create() instead of client.completions.create(), and fix the vLLM-specific arguments (see the sketch after this list)
  • Expose --gradient_accumulation_steps
  • Replace an assert with a raised exception
  • Document --train_bs in the README to ease adaptation to multiple GPUs while keeping the effective batch size constant
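
For reference, a minimal sketch of the client-side change, assuming a vLLM server exposing the OpenAI-compatible API; the endpoint URL, model name, and sampling values below are placeholders, not the example's actual settings:

from openai import OpenAI

# Placeholder endpoint and model name; the example's launch scripts set the real values.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Before: plain completions endpoint, which bypasses the chat template.
# response = client.completions.create(model="base-model", prompt=prompt)

# After: chat completions endpoint, so the server applies the model's chat template
# and a system prompt can be passed explicitly.
response = client.chat.completions.create(
    model="base-model",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain speculative decoding in one sentence."},
    ],
    temperature=0.0,
    max_tokens=256,
    # vLLM-specific sampling options go through extra_body rather than top-level kwargs.
    extra_body={"top_k": 1},
)
print(response.choices[0].message.content)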

Testing

I've run the modified scripts.

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines
  • Is this change backward compatible?: Yes
  • Did you write any new necessary tests?: No
  • Did you add or update any necessary documentation?: Yes
  • Did you update Changelog?: No

Additional Information

@Framartin changed the title from "Fix spec dec signed" to "Fix speculative decoding example" on Jun 13, 2025
@Framartin changed the title from "Fix speculative decoding example" to "[WIP] Fix speculative decoding example" on Jun 17, 2025
@Framartin changed the title from "[WIP] Fix speculative decoding example" to "Fix speculative decoding example" on Jun 17, 2025
@Framartin (Author) commented:

@yeyu-nvidia In addition to addressing your comment above, I've added commits to fix other issues:

  • Add support for freezing the base model (this was recommended, but not actually implemented, contrary to what the README previously stated); see the sketch after this list
  • Support the data format of server_generate.py in medusa_utils.py and eagle_utils.py
  • Expose --train_bs (to show how to keep the effective batch size constant when increasing the number of GPUs)
  • Expose --gradient_accumulation_steps
  • Replace an assert with a raised exception
  • Update the README accordingly
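
A minimal sketch of the freezing logic and the batch-size bookkeeping, assuming the draft modules' parameter names contain "medusa" or "eagle"; this is a hypothetical helper for illustration, not the example's exact code:

def freeze_base_model(model, trainable_substrings=("medusa", "eagle")):
    # Freeze every parameter that does not belong to the draft modules, so only
    # the Medusa/EAGLE heads are updated during training. The substring check is
    # illustrative; the real code keys off the converted model's module names.
    for name, param in model.named_parameters():
        if not any(key in name for key in trainable_substrings):
            param.requires_grad = False

# Effective batch size in a Hugging Face Trainer-style setup:
#   effective_bs = train_bs (per device) x gradient_accumulation_steps x number of GPUs
# e.g. 4 x 4 x 2 = 32, so doubling the GPU count means halving --train_bs or
# --gradient_accumulation_steps to keep the effective batch size constant.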

This is good on my side. Please let me know if there are any issue.

}
mtsp.convert(model, [(mode, config)])

for name, param in model.named_parameters():
@yeyu-nvidia (Contributor), Jun 17, 2025:

This is not needed. We already have a freeze_base_model argument in forward, and it defaults to True: https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/modelopt/torch/speculative/plugins/transformers.py#L102

if [[ "$1" != *=* ]]; then shift; fi
DO_EVAL="${1#*=}"
;;
--freeze_base_model*)
Contributor comment:

This is not needed.

else:
raise Exception(f"{training_args.mode} is not supported!")

if training_args.freeze_base_model:
Contributor comment:

Again, this part is not needed.

@Tala-mahhmmoodi commented:

Hey @Framartin, while reviewing your PR, I'd suggest the following code changes:

👉 Code Suggestion for #214

You can also review and apply these suggestions locally on your machine.


@kevalmorabia97 (Collaborator) commented:

Hi @Framartin, can you please address the feedback and update your PR?

@kevalmorabia97 requested a review from a team as a code owner on September 2, 2025, 14:29
@kevalmorabia97 added the "stale" (Not updated in a long time) label on Sep 8, 2025
Labels
stale (Not updated in a long time)
Development

Successfully merging this pull request may close these issues:

Bugs in the speculative decoding example
4 participants