[New feature] Support fakequant serve with latest vllm and generalize calib logic #369
base: main
Conversation
Walkthrough
Adds a vLLM fakequant serving example: Dockerfile and README, a CLI utility to convert HuggingFace amax tensors to vLLM format, a vLLM integration that calibrates and applies quantization with Worker monkey-patches, and a small import addition for vLLM fused MoE support.
Sequence Diagram(s)
sequenceDiagram
autonumber
participant U as User
participant CLI as vLLM Server (main)
participant W as Worker
participant Q as fakequant_run_prolog
participant M as Model
participant D as Dataloader
U->>CLI: Start server
CLI->>W: initialize & warm up
rect rgba(200,230,255,0.25)
note right of W: Monkey‑patched warm-up flow
W->>Q: run prolog (quant prep)
Q->>D: build calibration dataset
loop calibration batches
Q->>M: forward pass (execute_model)
M-->>Q: activations / stats
end
Q->>M: resolve & apply quant_config
alt optional amax file
Q->>M: load amax state_dict
end
Q->>M: fold weights
end
W->>W: original warm-up / compile
CLI-->>U: Server ready (serving /v1/*)
sequenceDiagram
autonumber
participant OP as Operator
participant C as convert_amax_hf2vllm.py
participant FS as Filesystem
OP->>C: Run with --input ckpt --output amax.pt
C->>FS: Load HF state_dict
C->>C: Group & merge amax keys (qkv, gate_up) via elementwise max
alt dry-run
C-->>OP: Print summary (no write)
else save
C->>FS: Save converted vLLM amax state_dict
C-->>OP: Report completion
end
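The merge step in the second diagram boils down to stacking the per-projection amax tensors and taking an elementwise max. A minimal sketch, with illustrative key names (the actual script matches keys with regexes and handles gate/up projections the same way):

```python
import torch

def merge_amax(amax_tensors: list[torch.Tensor]) -> torch.Tensor:
    """Combine q/k/v (or gate/up) amax values into one fused-layer amax."""
    assert all(t.shape == amax_tensors[0].shape for t in amax_tensors), "amax shapes must match"
    # Quantization ranges are merged conservatively via elementwise max.
    return torch.stack(amax_tensors).max(dim=0)[0]

# Hypothetical HF-style keys -> one fused vLLM-style key
hf_amax = {
    "model.layers.0.self_attn.q_proj.input_quantizer._amax": torch.tensor(1.0),
    "model.layers.0.self_attn.k_proj.input_quantizer._amax": torch.tensor(2.0),
    "model.layers.0.self_attn.v_proj.input_quantizer._amax": torch.tensor(1.5),
}
vllm_amax = {
    "model.layers.0.self_attn.qkv_proj.input_quantizer._amax": merge_amax(list(hf_amax.values()))
}
print(vllm_amax)  # the fused amax is tensor(2.)
```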
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Pre-merge checks and finishing touches: ❌ Failed checks (1 warning), ✅ Passed checks (2 passed)
Actionable comments posted: 9
🧹 Nitpick comments (8)
.pre-commit-config.yaml (1)
35-39: Reverting clang-format to v9 may cause formatting drift; confirm toolchain compatibility.
v9.0.0 is quite old and lacks many later fixes. Please verify that CI/dev environments and any IDE integrations rely on the same version to avoid churn. If this is for compatibility with an external toolchain, document the rationale here.
Would you like me to scan the repo for references to clang-format versions in docs/CI to confirm alignment?
examples/vllm_serve/Dockerfile (2)
32-35: Robust loop over requirements; avoid backslash escapes.
Use read -r to avoid backslash interpretation and set IFS.

-RUN find TensorRT-Model-Optimizer/examples -name "requirements.txt" | grep -v "windows" | while read req_file; do \
+RUN find TensorRT-Model-Optimizer/examples -name "requirements.txt" | grep -v "windows" | while IFS= read -r req_file; do \
     echo "Installing from $req_file"; \
     pip install -r "$req_file" || echo "Warning: Failed to install from $req_file"; \
     done
16-23: Consider a .dockerignore instead of copying and pruning .git.
COPY . brings in large/unnecessary files; a .dockerignore reduces the build context and speeds up builds.
examples/vllm_serve/convert_amax_hf2vllm.py (1)
155-163: CLI UX: require input/output unless --test, and add an --only-amax option.
Add a flag to save only amax keys to keep checkpoints small and avoid later confusion.

 parser = argparse.ArgumentParser(
     description="Convert amax values from HuggingFace to vLLM format"
 )
@@
 parser.add_argument("--dry-run", action="store_true", help="Show conversion without saving")
 parser.add_argument("--test", action="store_true", help="Run test with sample data")
+parser.add_argument("--only-amax", action="store_true", help="Save only *_amax keys to output")
examples/vllm_serve/vllm_serve_fakequant.py (4)
117-124: Consider left padding for calibration consistency.
Left padding often yields better calibration; dataset_utils also warns about this.

-    calib_dataloader = get_dataset_dataloader(
+    tokenizer.padding_side = "left"
+    calib_dataloader = get_dataset_dataloader(
223-227: determine_available_memory: return value not used; simplify.
The wrapper is annotated to return None. Returning results is harmless, but you can just call the original and return None to match the signature.

-def new_determine_available_memory(self) -> None:
-    with disable_compilation(self.model_runner.model):
-        results = old_determine_available_memory(self)
-    return results
+def new_determine_available_memory(self) -> None:
+    with disable_compilation(self.model_runner.model):
+        old_determine_available_memory(self)
229-233: Prolog trigger OK; confirm idempotency.
compile_or_warm_up_model may be called more than once. Ensure fakequant_run_prolog is idempotent or guarded, for example as sketched below.
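One simple way to satisfy the idempotency concern, sketched with stand-in objects rather than the example's actual vLLM Worker (the flag name is an assumption):

```python
def fakequant_run_prolog(worker):
    # Stand-in for the example's prolog (calibration + quantization + amax load).
    print("running fakequant prolog once")

def guarded_prolog(worker):
    # Guard: only the first warm-up triggers calibration/quantization.
    if getattr(worker, "_fakequant_prolog_done", False):
        return
    fakequant_run_prolog(worker)
    worker._fakequant_prolog_done = True

class FakeWorker:  # stand-in for vLLM's Worker
    pass

w = FakeWorker()
guarded_prolog(w)  # runs the prolog
guarded_prolog(w)  # a second warm-up call is a no-op
```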
1-52: Duplicate/stacked license headers; consolidate to avoid confusion.
The file contains multiple Apache-2.0 and MIT blocks. Keep a single project header and add third-party notices in a dedicated section or a third-party notices file.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
.pre-commit-config.yaml (1 hunks)
examples/vllm_serve/Dockerfile (1 hunks)
examples/vllm_serve/README.md (1 hunks)
examples/vllm_serve/convert_amax_hf2vllm.py (1 hunks)
examples/vllm_serve/vllm_serve_fakequant.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
examples/vllm_serve/vllm_serve_fakequant.py (3)
modelopt/torch/utils/dataset_utils.py (1)
  get_dataset_dataloader (157-232)
modelopt/torch/quantization/qtensor/base_qtensor.py (1)
  to (115-123)
examples/vllm_serve/convert_amax_hf2vllm.py (1)
  main (155-209)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: code-quality
- GitHub Check: build-docs
🔇 Additional comments (7)
examples/vllm_serve/README.md (1)
24-25: Confirm CLI flag validity across vLLM versions.
The short flag -tp may differ across versions. Ensure make_arg_parser exposes -tp; otherwise use --tensor-parallel-size.
examples/vllm_serve/Dockerfile (1)
26-27: Flash-Attn version compatibility.
Pinning flash-attn to 2.7.4.post1 may conflict with the base image CUDA/PyTorch versions. Please confirm ABI/wheel availability for the base image.
examples/vllm_serve/convert_amax_hf2vllm.py (2)
71-85: Merging logic OK; consider dtype/device preservation and scalar handling.
torch.stack + max preserves dtype/device; fine. Ensure q/k/v shapes match; otherwise raise with a clear error.
191-201: Dry-run output helpful.
Good observability of key changes.
examples/vllm_serve/vllm_serve_fakequant.py (3)
171-175: PP intermediate tensor handoff may differ across vLLM versions.
The recv_tensor_dict signature and gating on is_first_rank can vary (0.10 vs 0.11). Please validate on both targets.
183-185: Rank-0 printing only: good.
Prevents noisy logs in distributed runs.
241-251: Server startup path likely OK.
FlexibleArgumentParser + make_arg_parser + uvloop.run(run_server(args)) matches vLLM patterns.
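For reference, a minimal sketch of that startup path (import locations follow recent vLLM releases and may differ across versions; the example additionally registers a positional model argument):

```python
import uvloop
from vllm.entrypoints.openai.api_server import run_server
from vllm.entrypoints.openai.cli_args import make_arg_parser
from vllm.utils import FlexibleArgumentParser

def main() -> None:
    parser = FlexibleArgumentParser(description="OpenAI-compatible vLLM server with fakequant calibration")
    parser = make_arg_parser(parser)  # adds the standard vLLM serve CLI flags
    args = parser.parse_args()
    uvloop.run(run_server(args))  # serves /v1/* once the engine is ready

if __name__ == "__main__":
    main()
```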
print(f"Saving vLLM checkpoint to: {args.output}") | ||
os.makedirs(os.path.dirname(args.output), exist_ok=True) | ||
torch.save(vllm_state_dict, args.output) | ||
print("Conversion complete!") |
Bug: os.makedirs with empty dirname fails when output is a filename.
Guard dirname before creating.
-print(f"Saving vLLM checkpoint to: {args.output}")
-os.makedirs(os.path.dirname(args.output), exist_ok=True)
-torch.save(vllm_state_dict, args.output)
+print(f"Saving vLLM checkpoint to: {args.output}")
+out_dir = os.path.dirname(args.output)
+if out_dir:
+ os.makedirs(out_dir, exist_ok=True)
+to_save = vllm_state_dict
+if getattr(args, "only_amax", False):
+ to_save = {k: v for k, v in vllm_state_dict.items() if k.endswith("_amax")}
+torch.save(to_save, args.output)
print(f"Saving vLLM checkpoint to: {args.output}") | |
os.makedirs(os.path.dirname(args.output), exist_ok=True) | |
torch.save(vllm_state_dict, args.output) | |
print("Conversion complete!") | |
print(f"Saving vLLM checkpoint to: {args.output}") | |
out_dir = os.path.dirname(args.output) | |
if out_dir: | |
os.makedirs(out_dir, exist_ok=True) | |
to_save = vllm_state_dict | |
if getattr(args, "only-amax", False): | |
to_save = {k: v for k, v in vllm_state_dict.items() if k.endswith("_amax")} | |
torch.save(to_save, args.output) | |
print("Conversion complete!") |
🤖 Prompt for AI Agents
In examples/vllm_serve/convert_amax_hf2vllm.py around lines 206 to 209,
os.makedirs(os.path.dirname(args.output), exist_ok=True) will raise if
os.path.dirname(args.output) is empty (when args.output is just a filename);
guard the dirname before creating directories by computing d =
os.path.dirname(args.output) and only call os.makedirs(d, exist_ok=True) if d is
non-empty (or truthy), otherwise skip mkdir and proceed to torch.save.
@@ -0,0 +1,44 @@
FROM vllm/vllm-openai:v0.10.2 |
Align base image with tested/targeted vLLM version.
Base is v0.10.2; README claims tests with 0.9.0 and 0.11.2. Pick one target (prefer latest tested) and update image, or document the intended version matrix.
-FROM vllm/vllm-openai:v0.10.2
+# Consider aligning with the tested version, e.g.:
+# FROM vllm/vllm-openai:v0.11.2
+FROM vllm/vllm-openai:v0.11.2
🤖 Prompt for AI Agents
In examples/vllm_serve/Dockerfile around line 1, the base image tag (v0.10.2)
does not match the README's tested versions (0.9.0 and 0.11.2); update the
Dockerfile to use the intended target image (e.g., FROM
vllm/vllm-openai:v0.11.2) to align with the latest tested version, or
alternatively add a comment/README entry documenting the supported vLLM version
matrix and rationale for keeping v0.10.2.
done

# Allow users to run without root
RUN chmod -R 777 /workspace
Avoid 777; prefer least privilege.
chmod -R 777 is unsafe. Grant group write as needed or create a non-root user.
-RUN chmod -R 777 /workspace
+# Example: create non-root user and set appropriate permissions
+RUN useradd -m -u 1000 appuser \
+ && chown -R appuser:appuser /workspace
+USER appuser
🤖 Prompt for AI Agents
In examples/vllm_serve/Dockerfile around line 38, replace the unsafe "chmod -R
777 /workspace" with a least-privilege approach: create a non-root user and
group, chown the workspace to that user/group, and set restrictive perms (e.g.,
755 for dirs and 644 for files or 775/g+w only where group write is required)
instead of global 777; then switch to that USER with a USER directive so the
container runs without root privileges. Ensure recursive ownership is limited to
the workspace path and avoid granting execute/write bits to all users.
Compared with realquant, fakequant is 2-5x slower, but doesn't require dedicated kernel support and facilitates research.

This example is tested with vllm 0.9.0 and 0.11.2
Version mismatch with Docker base image.
README states testing with vLLM 0.9.0 and 0.11.2, but the Dockerfile uses v0.10.2. Align the example and base image or document supported versions explicitly.
🤖 Prompt for AI Agents
In examples/vllm_serve/README.md around line 7, the README claims testing with
vLLM 0.9.0 and 0.11.2 while the Dockerfile uses v0.10.2; update either the
README or the Dockerfile so versions align or explicitly list all supported vLLM
versions. Fix by choosing one source of truth: (a) change the README to state
the Dockerfile's v0.10.2 and note any other supported versions, or (b) update
the Dockerfile to one of the versions listed in the README; also add a short
note in the README explaining the tested and supported vLLM versions and any
compatibility caveats.
## Calibrate and serve fake quant model in vLLM

Step 1: Modify `quant_config` in `vllm_serve_fake_quant.py` for the desired quantization format
File name typo.
Use vllm_serve_fakequant.py (no extra underscore). Current text says vllm_serve_fake_quant.py.
-Step 1: Modify `quant_config` in `vllm_serve_fake_quant.py` for the desired quantization format
+Step 1: Modify `quant_config` in `vllm_serve_fakequant.py` for the desired quantization format
🤖 Prompt for AI Agents
In examples/vllm_serve/README.md around line 19, the referenced filename
contains a typo: "vllm_serve_fake_quant.py" should be "vllm_serve_fakequant.py";
update the README line to use the correct filename (vllm_serve_fakequant.py) so
instructions match the actual file name.
if tokenizer.pad_token != "<unk>" or tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
Fix pad_token condition.
Set pad_token when missing or equals unk; current logic is inverted.
- if tokenizer.pad_token != "<unk>" or tokenizer.pad_token is None:
- tokenizer.pad_token = tokenizer.eos_token
+ if tokenizer.pad_token is None or tokenizer.pad_token == getattr(tokenizer, "unk_token", "<unk>"):
+ tokenizer.pad_token = tokenizer.eos_token
🤖 Prompt for AI Agents
In examples/vllm_serve/vllm_serve_fakequant.py around lines 110 to 112, the
conditional that sets tokenizer.pad_token is inverted; change it to set
pad_token when it is missing or equals "<unk>" (i.e., if tokenizer.pad_token is
None or tokenizer.pad_token == "<unk>": set tokenizer.pad_token =
tokenizer.eos_token) so the pad token is replaced only when absent or explicitly
"<unk>".
quant_cfg = getattr(mtq, quant_config["quant_format"])

with disable_compilation(self.model):
    mtq.quantize(self.model, quant_cfg, forward_loop=calibrate_loop)
Handle invalid quant_format gracefully.
Guard getattr and raise a clear error if the format name is unknown.
- quant_cfg = getattr(mtq, quant_config["quant_format"])
+ try:
+ quant_cfg = getattr(mtq, quant_config["quant_format"])
+ except AttributeError as e:
+ raise ValueError(f"Unknown quant_format: {quant_config['quant_format']}") from e
🤖 Prompt for AI Agents
In examples/vllm_serve/vllm_serve_fakequant.py around lines 177 to 181, the code
uses getattr(mtq, quant_config["quant_format"]) without validating the attribute
and will raise an unclear AttributeError for unknown formats; update this to
check for the attribute (e.g., using hasattr or try/except AttributeError) and
if missing raise a clear ValueError (or RuntimeError) that includes the invalid
quant_format and optionally lists supported formats, before proceeding to call
mtq.quantize.
.pre-commit-config.yaml (Outdated)

 - repo: https://github.com/pre-commit/mirrors-clang-format
-  rev: v21.1.0
+  rev: v9.0.0
is this change intentional?
Without this change, the pre-commit hooks just throw an error and cannot run: "metal not supported"
Please add this directory to https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/.github/CODEOWNERS and use one of the existing reviewer teams or create a new one.
Please add this new example to the Changelog under a new 0.39 section in https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/CHANGELOG.rst.
LGTM! Thanks.
if tokenizer.pad_token != "<unk>" or tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

if quant_config["amax_file_path"]:
Can we check if there's an "amax.pt" file inside the model folder when the amax path is not specified?
I would rather not do this as silently loading a checkpoint seems fishy to me. Loading a checkpoint should be explicitly specified
c03be05 to 6d69e05 (force-push)
Actionable comments posted: 0
♻️ Duplicate comments (9)
examples/vllm_serve/Dockerfile (2)

1-1: Version mismatch with README.
The Dockerfile uses v0.10.2 but the README (line 7) states testing with vLLM 0.9.0 and 0.11.2. Align the versions or document the supported version matrix.

38-38: Security issue: overly permissive permissions.
chmod -R 777 grants all users read/write/execute on /workspace, which is unsafe. Create a non-root user and set restrictive permissions instead. Apply this diff to improve security:

-RUN chmod -R 777 /workspace
+RUN useradd -m -u 1000 appuser \
+    && chown -R appuser:appuser /workspace
+USER appuser

examples/vllm_serve/convert_amax_hf2vllm.py (1)

206-209: Bug: directory creation fails when output is a bare filename.
os.path.dirname(args.output) returns an empty string for bare filenames, causing os.makedirs to fail or misbehave. Guard against an empty dirname. Apply this diff:

 print(f"Saving vLLM checkpoint to: {args.output}")
-os.makedirs(os.path.dirname(args.output), exist_ok=True)
+out_dir = os.path.dirname(args.output)
+if out_dir:
+    os.makedirs(out_dir, exist_ok=True)
 torch.save(vllm_state_dict, args.output)

examples/vllm_serve/vllm_serve_fakequant.py (4)

75-95: Raises on unknown model types.
disable_compilation throws ValueError if the model lacks both model and language_model attributes. Make it tolerant by falling back to the object itself or a no-op. Apply this diff:

 @contextmanager
-def disable_compilation(model):
-    """Context manager to temporarily disable torch.compile"""
-    do_not_compile = True
-    if hasattr(model, "model"):
-        do_not_compile = model.model.do_not_compile
-        model.model.do_not_compile = True
-    elif hasattr(model, "language_model"):  # VLM requires this
-        do_not_compile = model.language_model.model.do_not_compile
-        model.language_model.model.do_not_compile = True
-    else:
-        raise ValueError("Model does not have a model or language_model attribute")
-
+def disable_compilation(obj):
+    """Temporarily set do_not_compile on the underlying model if available."""
+    target = None
+    if hasattr(obj, "model"):
+        target = obj.model
+    elif hasattr(obj, "language_model") and hasattr(obj.language_model, "model"):
+        target = obj.language_model.model
+    else:
+        target = obj
+    old = getattr(target, "do_not_compile", False)
     try:
+        setattr(target, "do_not_compile", True)
         yield
     finally:
-        if hasattr(model, "model"):
-            model.model.do_not_compile = do_not_compile
-        elif hasattr(model, "language_model"):
-            model.language_model.model.do_not_compile = do_not_compile
+        setattr(target, "do_not_compile", old)

110-112: Bug: pad_token condition is inverted.
The current logic sets pad_token when it is NOT "<unk>" OR is None, which is incorrect. It should set pad_token when it IS None OR equals "<unk>". Apply this diff:

-    if tokenizer.pad_token != "<unk>" or tokenizer.pad_token is None:
+    if tokenizer.pad_token is None or tokenizer.pad_token == "<unk>":
         tokenizer.pad_token = tokenizer.eos_token

177-181: Missing validation for quant_format.
getattr(mtq, quant_config["quant_format"]) will raise an unclear AttributeError if the format is invalid. Validate and provide a clear error message. Apply this diff:

-    quant_cfg = getattr(mtq, quant_config["quant_format"])
+    try:
+        quant_cfg = getattr(mtq, quant_config["quant_format"])
+    except AttributeError as e:
+        raise ValueError(
+            f"Unknown quant_format: {quant_config['quant_format']}. "
+            f"Check modelopt.torch.quantization for valid formats."
+        ) from e

187-213: Weak validation: count-only check doesn't verify key mapping.
Checking only amax key counts can miss mismatches when keys differ but counts match. Filter to _amax keys, verify exact key correspondence, and report missing/extra keys. Apply this diff:

     amax_file_path = quant_config["amax_file_path"]
     if amax_file_path:
         print(f"Loading amax values from {amax_file_path}")
         saved_amax_dict = torch.load(amax_file_path, map_location=self.device)
         current_state_dict = self.model.state_dict()

-        # Count amax keys in checkpoint and model
-        checkpoint_amax_keys = [key for key in saved_amax_dict if key.endswith("amax")]
-        model_amax_keys = [key for key in current_state_dict if key.endswith("amax")]
+        # Filter to amax keys only
+        saved_amax = {k: v for k, v in saved_amax_dict.items() if "_amax" in k}
+        model_amax_keys = {k for k in current_state_dict if "_amax" in k}

-        checkpoint_amax_count = len(checkpoint_amax_keys)
-        model_amax_count = len(model_amax_keys)
-
-        # Ensure counts match
-        if checkpoint_amax_count != model_amax_count:
-            raise ValueError(
-                f"Mismatch in amax key counts: checkpoint has {checkpoint_amax_count} "
-                f"amax keys but model has {model_amax_count} amax keys. "
-            )
+        missing_in_model = set(saved_amax.keys()) - model_amax_keys
+        extra_in_model = model_amax_keys - set(saved_amax.keys())
+
+        if missing_in_model or extra_in_model:
+            error_msg = "Amax key mismatch:\n"
+            if missing_in_model:
+                error_msg += f"  Keys in file but not in model: {sorted(list(missing_in_model))[:5]}\n"
+            if extra_in_model:
+                error_msg += f"  Keys in model but not in file: {sorted(list(extra_in_model))[:5]}\n"
+            raise ValueError(error_msg)

-        # Update amax values
-        for key, value in saved_amax_dict.items():
-            if key in current_state_dict:
-                current_state_dict[key] = value.to(self.device)
+        # Update amax values only
+        with torch.no_grad():
+            for key, value in saved_amax.items():
+                current_state_dict[key] = value.to(self.device)

         self.model.load_state_dict(current_state_dict, strict=True)

examples/vllm_serve/README.md (2)

7-7: Version mismatch with Dockerfile.
The README claims testing with vLLM 0.9.0 and 0.11.2, but the Dockerfile uses v0.10.2. Align these or explicitly document the version matrix.

19-19: Filename typo.
The referenced filename vllm_serve_fake_quant.py has an extra underscore. The actual file is vllm_serve_fakequant.py. Apply this diff:

-Step 1: Modify `quant_config` in `vllm_serve_fake_quant.py` for the desired quantization format
+Step 1: Modify `quant_config` in `vllm_serve_fakequant.py` for the desired quantization format

🧹 Nitpick comments (3)
examples/vllm_serve/Dockerfile (1)

29-29: Document why CUDA extension compilation failures are acceptable.
The || true swallows compilation errors. If precompilation is optional, add a brief comment explaining why failures are tolerated.

-RUN python3 -c "import modelopt.torch.quantization.extensions as ext; ext.precompile()" || true
+# Pre-compile CUDA extensions (optional; compilation may fail in some environments)
+RUN python3 -c "import modelopt.torch.quantization.extensions as ext; ext.precompile()" || true

examples/vllm_serve/vllm_serve_fakequant.py (1)

113-116: Consider auto-detecting amax.pt in the model directory.
As suggested in a past review, checking for amax.pt in the model folder when amax_file_path is None would improve user experience. Apply this diff:

+    # Auto-detect amax file if not specified
+    if not quant_config["amax_file_path"]:
+        default_amax = os.path.join(self.model_config.model, "amax.pt")
+        if os.path.isfile(default_amax):
+            quant_config["amax_file_path"] = default_amax
+            print(f"Auto-detected amax file at {default_amax}")
+
     if quant_config["amax_file_path"]:
         # If amax file path is provided, we only need to do a simple calibration step
         quant_config["quant_num_samples"] = 1

examples/vllm_serve/README.md (1)

46-56: Section marked WIP but provides complete instructions.
The heading says "(WIP)" but the steps appear complete and functional. Either remove the WIP tag or clarify what remains unfinished.

-## Load QAT/PTQ model and serve in vLLM (WIP)
+## Load QAT/PTQ model and serve in vLLM

Or clarify what's incomplete:

-## Load QAT/PTQ model and serve in vLLM (WIP)
+## Load QAT/PTQ model and serve in vLLM (WIP: only tested for Llama 3.1)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
examples/vllm_serve/Dockerfile (1 hunks)
examples/vllm_serve/README.md (1 hunks)
examples/vllm_serve/convert_amax_hf2vllm.py (1 hunks)
examples/vllm_serve/vllm_serve_fakequant.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
examples/vllm_serve/vllm_serve_fakequant.py (2)
  modelopt/torch/utils/dataset_utils.py (1)
    get_dataset_dataloader (157-232)
  modelopt/torch/quantization/qtensor/base_qtensor.py (1)
    to (115-123)
examples/vllm_serve/convert_amax_hf2vllm.py (1)
  examples/vllm_serve/vllm_serve_fakequant.py (1)
    main (241-250)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build-docs
- GitHub Check: code-quality
🔇 Additional comments (4)
examples/vllm_serve/convert_amax_hf2vllm.py (2)
26-86: LGTM: Conversion logic is sound.
The function correctly merges QKV and gate/up projection amax values using elementwise max, which is appropriate for quantization ranges.
88-153: LGTM: Test coverage validates key transformations.
The test function provides clear validation of the merging logic with expected vs. actual key comparisons.
examples/vllm_serve/vllm_serve_fakequant.py (1)
125-176: LGTM: Calibration loop correctly integrates with vLLM's execution model.
The synthetic request construction and scheduler output assembly properly mirror vLLM's internal structures for calibration.
examples/vllm_serve/README.md (1)
1-57: LGTM: Clear documentation of the workflow.
The README provides a well-structured guide covering environment setup, calibration, serving, and evaluation with concrete examples.
Looks great!!
6d69e05 to 39d7ded (force-push)
39d7ded to 5f1ed83 (force-push)
Actionable comments posted: 0
🧹 Nitpick comments (1)
examples/vllm_serve/vllm_serve_fakequant.py (1)

187-213: Consider stricter amax key validation.
While the count-based check (lines 197-205) catches gross mismatches, it doesn't verify that checkpoint keys actually map to model keys. If the checkpoint contains different amax keys (e.g., from a different model architecture), the count could match but keys could be misaligned. Consider validating key sets explicitly:

-        checkpoint_amax_keys = [key for key in saved_amax_dict if key.endswith("amax")]
-        model_amax_keys = [key for key in current_state_dict if key.endswith("amax")]
-
-        checkpoint_amax_count = len(checkpoint_amax_keys)
-        model_amax_count = len(model_amax_keys)
-
-        # Ensure counts match
-        if checkpoint_amax_count != model_amax_count:
-            raise ValueError(
-                f"Mismatch in amax key counts: checkpoint has {checkpoint_amax_count} "
-                f"amax keys but model has {model_amax_count} amax keys. "
-            )
+        # Filter to amax keys only
+        checkpoint_amax_keys = {k for k in saved_amax_dict if k.endswith("_amax") or k.endswith("amax")}
+        model_amax_keys = {k for k in current_state_dict if k.endswith("_amax") or k.endswith("amax")}
+
+        # Verify key sets match
+        missing_in_model = checkpoint_amax_keys - model_amax_keys
+        extra_in_model = model_amax_keys - checkpoint_amax_keys
+
+        if missing_in_model or extra_in_model:
+            raise ValueError(
+                f"Amax key mismatch:\n"
+                f"  Keys in checkpoint not found in model: {sorted(list(missing_in_model))[:5]}...\n"
+                f"  Keys in model not found in checkpoint: {sorted(list(extra_in_model))[:5]}..."
+            )

         # Update amax values
-        for key, value in saved_amax_dict.items():
-            if key in current_state_dict:
+        with torch.no_grad():
+            for key in checkpoint_amax_keys:
+                value = saved_amax_dict[key]
                 current_state_dict[key] = value.to(self.device)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
examples/vllm_serve/Dockerfile (1 hunks)
examples/vllm_serve/README.md (1 hunks)
examples/vllm_serve/convert_amax_hf2vllm.py (1 hunks)
examples/vllm_serve/vllm_serve_fakequant.py (1 hunks)
modelopt/torch/quantization/plugins/vllm.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- examples/vllm_serve/README.md
- modelopt/torch/quantization/plugins/vllm.py
🧰 Additional context used
🧬 Code graph analysis (1)
examples/vllm_serve/vllm_serve_fakequant.py (3)
  modelopt/torch/utils/dataset_utils.py (1)
    get_dataset_dataloader (157-232)
  examples/llm_autodeploy/api_server.py (1)
    run_server (194-208)
  modelopt/torch/quantization/qtensor/base_qtensor.py (1)
    to (115-123)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: wait-checks / wait
- GitHub Check: wait-checks / wait
- GitHub Check: build-docs
- GitHub Check: code-quality
🔇 Additional comments (8)
examples/vllm_serve/Dockerfile (2)
29-29: LGTM: Pre-compile with fallback.
Using || true allows the build to continue if CUDA extension pre-compilation fails, which is appropriate for environments where compilation might not be immediately possible.
32-35: LGTM: Requirements install with error handling.
The script iterates through the requirements files, attempts installation, and logs warnings on failure without stopping the build. This is a pragmatic approach for optional dependencies.
examples/vllm_serve/convert_amax_hf2vllm.py (2)
26-85: LGTM: Conversion logic is sound.
The function correctly:
- Preserves non-amax keys unchanged.
- Groups q/k/v and gate/up amax keys using regex patterns.
- Merges grouped keys by taking the elementwise max via torch.stack().max(dim=0)[0].
- Handles single-key groups by renaming.
The regex patterns and merging strategy align with the vLLM format requirements described in the PR.
88-153: LGTM: Test function validates conversion.
The test creates a representative HF state dict with q/k/v and gate/up amax keys, runs the conversion, and verifies the output matches the expected vLLM keys using set comparison. This provides a basic sanity check for the conversion logic.
examples/vllm_serve/vllm_serve_fakequant.py (4)
113-123: LGTM: Efficient calibration when amax is pre-computed.
Setting quant_num_samples = 1 when amax_file_path is provided optimizes calibration by performing only a minimal forward pass before loading pre-computed amax values. The dataloader setup correctly uses the modelopt utility function.
241-250: LGTM: Main entry point structure.
The main function correctly:
- Creates a FlexibleArgumentParser for vLLM arguments.
- Adds a positional model argument.
- Uses make_arg_parser to augment it with vLLM CLI arguments.
- Launches the server with uvloop for async execution.
223-226: Verify the new_determine_available_memory return type.
The wrapper is annotated -> None but returns results from the original determine_available_memory. Confirm what the original method returns (e.g., memory stats) and update the annotation or adjust the return accordingly.
169-175: Check get_pp_group() behavior on non-distributed setups.
get_pp_group() comes from vllm.distributed.parallel_state; verify that when torch.distributed isn't initialized (e.g., single-GPU) calling .is_first_rank or .recv_tensor_dict won't error. A guard along these lines is sketched below.
What does this PR do?
Type of change: New feature and example
Overview: vLLM linear and MoE layer quantization support is already in the modelopt library. This example shows how to calibrate and serve a fakequant model with vLLM. For more detailed instructions, see the README.
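For context, the ModelOpt calibration call this example is built around looks roughly like the snippet below (a sketch: the model and dataloader are placeholders, and the real example drives the forward passes through vLLM's execute_model rather than calling the model directly):

```python
import torch
import modelopt.torch.quantization as mtq

def calibrate_fakequant(model, calib_dataloader, quant_format: str = "FP8_DEFAULT_CFG"):
    quant_cfg = getattr(mtq, quant_format)  # e.g. FP8_DEFAULT_CFG, NVFP4_DEFAULT_CFG

    def forward_loop(m):
        # Forward passes over calibration data collect amax statistics.
        with torch.no_grad():
            for batch in calib_dataloader:
                m(batch["input_ids"])

    # Inserts fake-quant modules and calibrates their ranges in place.
    return mtq.quantize(model, quant_cfg, forward_loop=forward_loop)
```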
Usage
Testing
This example is tested with the latest vLLM 0.10.2, with Qwen3 MoE/dense models and the Llama 3.1 series.
Tested with GSM8k.
Before your PR is "Ready for review"
Additional Information
Summary by CodeRabbit
New Features
Documentation
Chores