Enable Nemotron Nano VLM v1 & v2 NVFP4 PTQ workflow #347
Merged: kevalmorabia97 merged 30 commits into main from zhiyu/support-nemotron-nano-vlm-v1-nvfp4 on Oct 24, 2025.
Commits (30), showing changes from all commits:
476b59f (ajrasane): Add option to benchmark pipeline in diffusion_trt.py (#457)
b583d98 (Edwardf0t1): default attn_implementaion to eager to avoid issues
8b76102 (Edwardf0t1): add proper detection and handling for nemotron VL model in ptq examples
89d207c (Edwardf0t1): create fake vl inputs in export for nemotron VL model
6991fdf (Edwardf0t1): update fake inputs generation, initialize distributed for Nemotron mo…
80fecf0 (Edwardf0t1): remove distributed prcessing setup and vision input generation since …
15f0d61 (Edwardf0t1): special handling for nemotron VL preview generation in hf_ptq
ae42b9b (Edwardf0t1): fix mypy error
587d427 (Edwardf0t1): add support for v2 model inference (.generate) with image inputs
208cb9e (Edwardf0t1): debug loading v2 converted nvfp4 weights from mcore
f94558f (Edwardf0t1): load scalers only for v2 fp4
31c4f75 (Edwardf0t1): re-use existing vlm detection util function
ec4a0ef (Edwardf0t1): refactor and create a utils script for vlm
5f0ea72 (Edwardf0t1): remove dulicated is_nemotron_vl usage
60a698a (Edwardf0t1): update
446e135 (Edwardf0t1): add a util function to extract language model from VLM, update changelog
f849c17 (Edwardf0t1): fix format
96e1613 (Edwardf0t1): update
c572513 (Edwardf0t1): update
8e6dea3 (Edwardf0t1): update
16bea91 (Edwardf0t1): WIP: local changes before pulling remote updates
8e1d6cb (kevalmorabia97): Increase gpu_tests timeout from 90 to 120 mins
4561de9 (Edwardf0t1): revert torch_onnx.py
57d388e (Edwardf0t1): revert diffusion_trt.py
f9b88fd (Edwardf0t1): minor
0e00954 (Edwardf0t1): update
1a3bac1 (Edwardf0t1): update
a4fa12d (Edwardf0t1): update
6216038 (Edwardf0t1): update
4352ab6 (Edwardf0t1): update
Hunk 1: new helper functions added to the PTQ example script, ahead of build_quant_cfg():

```python
# @@ -39,6 +39,91 @@
SPECULATIVE_MODEL_LIST = ["Eagle", "Medusa"]


def run_nemotron_vl_preview(
    full_model, tokenizer, input_ids, pyt_ckpt_path, stage_name, allow_fallback=False
):
    """Run text-only and VL preview generation for Nemotron VL models.

    Args:
        full_model: The full VL model
        tokenizer: The tokenizer
        input_ids: Input tensor for generation
        pyt_ckpt_path: Path to the model checkpoint
        stage_name: Description of the stage (e.g., "before quantization", "after quantization")
        allow_fallback: Whether to allow fallback to standard generate on failure

    Returns:
        Generated text response or None if generation failed
    """
    from vlm_utils import run_text_only_generation, run_vl_preview_generation

    print(f"Running text-only preview generation for Nemotron VL model ({stage_name})...")
    question = tokenizer.decode(input_ids[0], skip_special_tokens=True)
    generation_config = {
        "max_new_tokens": 100,
        "do_sample": False,
        "eos_token_id": tokenizer.eos_token_id,
    }

    # Try text-only generation
    text_response = run_text_only_generation(
        full_model, tokenizer, question, generation_config, pyt_ckpt_path
    )

    if text_response is not None:
        print(f"✅ Text-only generation successful: {text_response[:100]}...")
        generated_ids = text_response
    elif allow_fallback:
        print("Text-only generation failed, falling back to standard generate...")
        generated_ids = full_model.generate(input_ids, max_new_tokens=100)
    else:
        generated_ids = None

    # Run additional VL test with images
    print(f"Running additional VL test with images ({stage_name})...")
    run_vl_preview_generation(full_model, tokenizer, pyt_ckpt_path, stage_name)

    return generated_ids
```
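To make the helper's role in the PTQ flow concrete, here is a minimal usage sketch. It is not part of the diff: the checkpoint path and prompt are placeholders, and `full_model`/`tokenizer` are assumed to have been loaded earlier in hf_ptq.py.

```python
# Hypothetical usage sketch (not from this PR): run the preview helper around quantization.
# Assumes `full_model` and `tokenizer` are already loaded; the path and prompt are placeholders.
pyt_ckpt_path = "path/to/nemotron-vl-checkpoint"  # placeholder
prompt = "Describe what a vision-language model does."
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(full_model.device)

# Before quantization: allow falling back to plain generate() if text-only generation fails.
run_nemotron_vl_preview(
    full_model, tokenizer, input_ids, pyt_ckpt_path, "before quantization", allow_fallback=True
)

# ... quantize the model here ...

# After quantization: no fallback, so a failure shows up as a None return value.
generated_ids = run_nemotron_vl_preview(
    full_model, tokenizer, input_ids, pyt_ckpt_path, "after quantization"
)
```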
Hunk 1, continued: config-based multimodal and Nemotron VL detection helpers:

```python
# @@ -39,6 +39,91 @@ (continued)
def _is_multimodal_config(config):
    """Check if a config indicates a multimodal model (config-only version of is_multimodal_model)."""
    return (
        hasattr(config, "vision_config")  # Standard vision config (e.g., Qwen2.5-VL)
        or getattr(config, "model_type", "") == "phi4mm"  # Phi-4 multimodal
        or hasattr(config, "vision_lora")  # Vision LoRA configurations
        or hasattr(config, "audio_processor")  # Audio processing capabilities
        or (
            hasattr(config, "embd_layer") and hasattr(config.embd_layer, "image_embd_layer")
        )  # Image embedding layers
    )


def is_nemotron_vl(model_or_config):
    """Check if model or config indicates a Nemotron VL model.

    Args:
        model_or_config: Either a model instance or a config object.

    Returns:
        bool: True if it's a Nemotron VL model, False otherwise.
    """
    # Try to get config from model, or use directly if it's a config
    if hasattr(model_or_config, "config"):
        config = model_or_config.config
        from modelopt.torch.export.model_utils import is_multimodal_model

        if not is_multimodal_model(model_or_config):
            return False
    else:
        config = model_or_config
        if not _is_multimodal_config(config):
            return False

    architectures = getattr(config, "architectures", [])
    return any("nemotron" in arch.lower() for arch in architectures)


def build_quant_cfg(
    qformat,
    kv_cache_qformat,
    # ... (existing function signature continues unchanged)
```
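A small, self-contained sketch of how the detection helpers behave, using stand-in config objects instead of a real checkpoint; the architecture string is invented purely to exercise the "nemotron" substring check.

```python
from types import SimpleNamespace

# Stand-in configs (hypothetical): is_nemotron_vl() accepts either a model with a
# .config attribute or a bare config, so a namespace with the right attributes is
# enough to exercise the config-only path.
vl_config = SimpleNamespace(
    vision_config={"hidden_size": 1024},          # satisfies _is_multimodal_config()
    architectures=["FakeNemotronVLForCausalLM"],  # invented name containing "nemotron"
)
text_only_config = SimpleNamespace(architectures=["LlamaForCausalLM"])

print(is_nemotron_vl(vl_config))         # True: multimodal config + "nemotron" architecture
print(is_nemotron_vl(text_only_config))  # False: not multimodal, so it short-circuits
```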
Hunk 2, inside get_model(): load the config once and disable automatic device mapping when a Nemotron VL model is detected:

```python
# @@ -185,7 +270,21 @@ def get_model(
    if device == "cpu":
        device_map = "cpu"

    # Prepare config kwargs for loading
    config_kwargs = {"trust_remote_code": trust_remote_code} if trust_remote_code else {}

    # Load config once and handle VL model detection
    try:
        hf_config = AutoConfig.from_pretrained(ckpt_path, **config_kwargs)
        if is_nemotron_vl(hf_config):
            print(
                "Detected Nemotron VL model from config. "
                "Disabling automatic device mapping for compatibility."
            )
            device_map = None
    except Exception as e:
        print(f"Error: Could not load config from {ckpt_path}: {e}")
        raise RuntimeError(f"Failed to load model configuration from {ckpt_path}") from e
    if attn_implementation is not None:
        config_kwargs["attn_implementation"] = attn_implementation
```
Hunk 3, inside get_model(): remove the now-redundant second config load:

```diff
@@ -207,11 +306,6 @@ def get_model(
         )
         model = hf_vila.llm
     else:
-        hf_config = AutoConfig.from_pretrained(
-            ckpt_path,
-            **config_kwargs,
-        )
-
         if use_seq_device_map:
             device_map = "sequential"
             # If we use sequential, set max_memory limit to ensure that the model does not occupy the full GPU
```
Hunk 4, after the model is loaded: when automatic device mapping was disabled, move the model to the target device explicitly. A review exchange attached to this hunk is reproduced after the code.

```python
# @@ -282,6 +376,12 @@ def get_model(
        **model_kwargs,
    )
    model.eval()

    # If device_map was disabled (None), manually move model to target device
    if device_map is None and device != "cpu":
        print(f"Moving model to {device} device...")
        model = model.to(device)

    if device == "cuda" and not is_model_on_gpu(model):
        print("Warning: Some parameters are not on a GPU. Calibration can be slow or hit OOM")
```

Review exchange on the `if device_map is None and device != "cpu":` line:

Collaborator: what if device == "cpu"?
Author (Contributor): That was handled by HF's device_map="cpu" in L210.
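Putting the get_model() changes together, here is a hedged sketch of the load path these hunks implement: detect a Nemotron VL checkpoint from its config alone, disable automatic device mapping, then move the loaded model explicitly. The checkpoint path and AutoModel class are placeholders; the real script drives this through its own get_model() arguments.

```python
from transformers import AutoConfig, AutoModelForCausalLM

ckpt_path = "path/to/nemotron-vl-checkpoint"  # placeholder
device = "cuda"
device_map = "auto"

# Config-only detection: no weights are loaded at this point.
hf_config = AutoConfig.from_pretrained(ckpt_path, trust_remote_code=True)
if is_nemotron_vl(hf_config):
    device_map = None  # let the script place the model itself

model = AutoModelForCausalLM.from_pretrained(
    ckpt_path, trust_remote_code=True, device_map=device_map
)
model.eval()

# Mirrors the hunk above: if automatic mapping was disabled, move the model manually.
if device_map is None and device != "cpu":
    model = model.to(device)
```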
Review comment: Forward-dated release entry. 0.39 (2025-11-07) is in the future (today is 2025-10-23); please mark it as Unreleased/TBD to avoid confusion until the release is cut.