
Conversation


@Framartin Framartin commented Jun 13, 2025

What does this PR do?

Bug fix: yes (fixes #211).

This PR fixes several bugs in the speculative decoding example.

Overview:

  • Add support for freezing the base model in main.py, launch.sh, and the README (this was recommended, but not actually implemented, contrary to what the README previously stated)
  • Fix the data format to support the Daring-Anteater dataset
  • Use the --chat flag in the server_generate.py calls
  • Add a system prompt to the fastchat prompt
  • Use client.chat.completions.create() instead of client.completions.create(), and fix the vLLM-specific arguments (see the sketch after this list)
  • Expose --gradient_accumulation_steps
  • Replace an assert with a raised exception
  • Document --train_bs in the README to ease adaptation to multiple GPUs while keeping the effective batch size constant
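
For reference, a minimal sketch of the client-side change, assuming a vLLM server exposing the OpenAI-compatible API; the endpoint URL, model name, and sampling values below are placeholders, not the example's actual settings:

from openai import OpenAI

# Placeholder endpoint and model name; the example's launch scripts set the real values.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Before: plain completions endpoint, which bypasses the chat template.
# response = client.completions.create(model="base-model", prompt=prompt)

# After: chat completions endpoint, so the server applies the model's chat template
# and a system prompt can be passed explicitly.
response = client.chat.completions.create(
    model="base-model",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain speculative decoding in one sentence."},
    ],
    temperature=0.0,
    max_tokens=256,
    # vLLM-specific sampling options go through extra_body rather than top-level kwargs.
    extra_body={"top_k": 1},
)
print(response.choices[0].message.content)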

Testing

I've run the modified scripts.

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines
  • Is this change backward compatible?: Yes
  • Did you write any new necessary tests?: No
  • Did you add or update any necessary documentation?: Yes
  • Did you update Changelog?: No

Additional Information

@Framartin changed the title from "Fix spec dec signed" to "Fix speculative decoding example" on Jun 13, 2025
@Framartin changed the title from "Fix speculative decoding example" to "[WIP] Fix speculative decoding example" on Jun 17, 2025
@Framartin changed the title from "[WIP] Fix speculative decoding example" to "Fix speculative decoding example" on Jun 17, 2025
@Framartin (Author) commented:

@yeyu-nvidia In addition to addressing your comment above, I've added commits to fix other issues:

  • Add support for freezing the base model (this was recommended, but not actually implemented, contrary to what the README previously stated); see the sketch after this list
  • Support the data format of server_generate.py in medusa_utils.py and eagle_utils.py
  • Expose --train_bs (to show how to keep the effective batch size constant when increasing the number of GPUs)
  • Expose --gradient_accumulation_steps
  • Replace an assert with a raised exception
  • Update the README accordingly
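
A minimal sketch of the freezing logic and the batch-size bookkeeping, assuming the draft modules' parameter names contain "medusa" or "eagle"; this is a hypothetical helper for illustration, not the example's exact code:

def freeze_base_model(model, trainable_substrings=("medusa", "eagle")):
    # Freeze every parameter that does not belong to the draft modules, so only
    # the Medusa/EAGLE heads are updated during training. The substring check is
    # illustrative; the real code keys off the converted model's module names.
    for name, param in model.named_parameters():
        if not any(key in name for key in trainable_substrings):
            param.requires_grad = False

# Effective batch size in a Hugging Face Trainer-style setup:
#   effective_bs = train_bs (per device) x gradient_accumulation_steps x number of GPUs
# e.g. 4 x 4 x 2 = 32, so doubling the GPU count means halving --train_bs or
# --gradient_accumulation_steps to keep the effective batch size constant.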

This is good on my side. Please let me know if there are any issue.

}
mtsp.convert(model, [(mode, config)])

for name, param in model.named_parameters():
@yeyu-nvidia (Contributor), Jun 17, 2025:

This is not needed. We already have a freeze_base_model argument in forward, and it defaults to True: https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/modelopt/torch/speculative/plugins/transformers.py#L102

if [[ "$1" != *=* ]]; then shift; fi
DO_EVAL="${1#*=}"
;;
--freeze_base_model*)
Contributor comment:

This is not needed.

else:
raise Exception(f"{training_args.mode} is not supported!")

if training_args.freeze_base_model:
Contributor comment:

Again, this part is not needed.

@Tala-mahhmmoodi commented:

Hey @Framartin, while reviewing your PR, I'd suggest the following code changes:

👉 Code Suggestion for #214

You can also review and apply these suggestions locally on your machine.


@kevalmorabia97 (Collaborator) commented:

Hi @Framartin, can you please address the feedback and update your PR?

@kevalmorabia97 requested a review from a team as a code owner on September 2, 2025, 14:29
@kevalmorabia97 added the "stale" (Not updated in a long time) label on Sep 8, 2025
Labels
stale (Not updated in a long time)
Development

Successfully merging this pull request may close these issues:

Bugs in the speculative decoding example
4 participants