
Conversation

RalphMao
Collaborator

@RalphMao RalphMao commented Sep 25, 2025

What does this PR do?

Type of change: New feature and example

Overview: vLLM linear and MoE layer quantization support is already available in the ModelOpt library. This example shows how to calibrate and serve a fake-quantized model with vLLM. For more detailed instructions, see the README.

Usage

python vllm_serve_fakequant.py <model_path> -tp 8 --host 0.0.0.0 --port 8000

Testing

This example is tested with the latest vLLM 0.10.2, using Qwen3 MoE/dense models and the Llama 3.1 series.

Tested with GSM8k.

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes/No
  • Did you write any new necessary tests?: Yes/No
  • Did you add or update any necessary documentation?: Yes/No
  • Did you update Changelog?: Yes/No

Additional Information

Summary by CodeRabbit

  • New Features

    • Added a vLLM fake-quantization serving workflow with calibration, weight folding, optional amax loading, and a server entrypoint.
    • Added a utility to convert Hugging Face amax values to vLLM format with CLI and test mode.
  • Documentation

    • Expanded example README with Docker setup, calibration/serving steps, sample API requests, and evaluation notes.
  • Chores

    • Added a Docker image for the example with required dependencies and a flexible default shell.


coderabbitai bot commented Sep 25, 2025

Walkthrough

Adds a vLLM fakequant serving example: Dockerfile and README, a CLI utility to convert HuggingFace amax tensors to vLLM format, a vLLM integration that calibrates and applies quantization with Worker monkey‑patches, and a small import addition for vLLM fused MoE support.
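
For readers new to this pattern, the sketch below illustrates the Worker monkey-patching the walkthrough refers to. It is illustrative only: the Worker import path assumes vLLM's v1 worker layout, and the helper names (fakequant_run_prolog, disable_compilation) are the ones this example defines.

# Illustrative sketch, not the PR's exact code.
from vllm.v1.worker.gpu_worker import Worker  # assumed v1 layout; path may differ by vLLM version

from vllm_serve_fakequant import disable_compilation, fakequant_run_prolog  # example helpers

old_warm_up = Worker.compile_or_warm_up_model
old_available_memory = Worker.determine_available_memory

def patched_warm_up(self):
    # Calibrate and apply quantization once, then fall through to the normal warm-up.
    fakequant_run_prolog(self)
    old_warm_up(self)

def patched_available_memory(self):
    # Profile available memory with torch.compile disabled so quantized modules are not compiled here.
    with disable_compilation(self.model_runner.model):
        return old_available_memory(self)

Worker.compile_or_warm_up_model = patched_warm_up
Worker.determine_available_memory = patched_available_memory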

Changes

Cohort / File(s) Summary
Container setup
examples/vllm_serve/Dockerfile
New Dockerfile based on vllm/vllm-openai:v0.10.2; configures env and working dir, installs system deps, copies local TensorRT-Model-Optimizer source, removes .git, installs package editable with extras and dev-test, installs flash-attn, precompiles CUDA extensions, installs example requirements, adjusts permissions, clears ENTRYPOINT, sets default CMD.
Documentation
examples/vllm_serve/README.md
New README describing Docker build/run, calibration and serving steps for fakequant models with vLLM, example curl API test, optional lm_eval use, and WIP guidance for loading QAT/PTQ amax into quant_config.
AMAX conversion utility
examples/vllm_serve/convert_amax_hf2vllm.py
New script converting HF amax keys to vLLM format: preserves non-amax keys, merges q_proj/k_proj/v_proj → qkv_proj and gate_proj/up_proj → gate_up_proj via elementwise max, includes CLI (--input, --output, --dry-run, --test) and an internal test routine.
vLLM fakequant integration
examples/vllm_serve/vllm_serve_fakequant.py
New example adding disable_compilation context manager, quant_config, fakequant_run_prolog for calibration runs (dataloader, forward passes, apply quantization, optional load amax), weight folding, and monkey‑patches for Worker.determine_available_memory and Worker.compile_or_warm_up_model; includes main() to launch server.
vLLM fused MoE import
modelopt/torch/quantization/plugins/vllm.py
Adds import vllm.model_executor.layers.fused_moe.layer as vllm_fused_moe_layer to reference fused MoE symbols; no other logic changes.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant U as User
  participant CLI as vLLM Server (main)
  participant W as Worker
  participant Q as fakequant_run_prolog
  participant M as Model
  participant D as Dataloader

  U->>CLI: Start server
  CLI->>W: initialize & warm up
  rect rgba(200,230,255,0.25)
    note right of W: Monkey‑patched warm-up flow
    W->>Q: run prolog (quant prep)
    Q->>D: build calibration dataset
    loop calibration batches
      Q->>M: forward pass (execute_model)
      M-->>Q: activations / stats
    end
    Q->>M: resolve & apply quant_config
    alt optional amax file
      Q->>M: load amax state_dict
    end
    Q->>M: fold weights
  end
  W->>W: original warm-up / compile
  CLI-->>U: Server ready (serving /v1/*)
sequenceDiagram
  autonumber
  participant OP as Operator
  participant C as convert_amax_hf2vllm.py
  participant FS as Filesystem

  OP->>C: Run with --input ckpt --output amax.pt
  C->>FS: Load HF state_dict
  C->>C: Group & merge amax keys (qkv, gate_up) via elementwise max
  alt dry-run
    C-->>OP: Print summary (no write)
  else save
    C->>FS: Save converted vLLM amax state_dict
    C-->>OP: Report completion
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

I thump my paws on silicon ground,
A Docker burrow freshly found.
I merge amax carrots, stash them tight,
Calibrate by moonlit byte.
Serve hops of speed with quantized delight. 🐇

Pre-merge checks

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title clearly identifies the addition of support for fakequant serving in the latest vLLM and the generalization of calibration logic, which aligns directly with the primary changes in the examples, scripts, and README provided in this pull request.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 9

🧹 Nitpick comments (8)
.pre-commit-config.yaml (1)

35-39: Reverting clang-format to v9 may cause formatting drift; confirm toolchain compatibility.

v9.0.0 is quite old and lacks many later fixes. Please verify CI/dev environments and any IDE integrations rely on the same version to avoid churn. If this is for compatibility with an external toolchain, document the rationale here.

Would you like me to scan the repo for references to clang-format versions in docs/CI to confirm alignment?

examples/vllm_serve/Dockerfile (2)

32-35: Robust loop over requirements; avoid backslash escapes.

Use read -r to avoid backslash interpretation and set IFS.

-RUN find TensorRT-Model-Optimizer/examples -name "requirements.txt" | grep -v "windows" | while read req_file; do \
+RUN find TensorRT-Model-Optimizer/examples -name "requirements.txt" | grep -v "windows" | while IFS= read -r req_file; do \
         echo "Installing from $req_file"; \
         pip install -r "$req_file" || echo "Warning: Failed to install from $req_file"; \
     done

16-23: Consider .dockerignore instead of copying and pruning .git.

COPY . brings in large/unnecessary files; .dockerignore reduces context and speeds builds.

examples/vllm_serve/convert_amax_hf2vllm.py (1)

155-163: CLI UX: require input/output unless --test, and add --only-amax option.

Add a flag to save only amax keys to keep checkpoints small and avoid later confusion.

 parser = argparse.ArgumentParser(
     description="Convert amax values from HuggingFace to vLLM format"
 )
@@
 parser.add_argument("--dry-run", action="store_true", help="Show conversion without saving")
 parser.add_argument("--test", action="store_true", help="Run test with sample data")
+parser.add_argument("--only-amax", action="store_true", help="Save only *_amax keys to output")
examples/vllm_serve/vllm_serve_fakequant.py (4)

117-124: Consider left padding for calibration consistency.

Left padding often yields better calibration; dataset_utils also warns.

-    calib_dataloader = get_dataset_dataloader(
+    tokenizer.padding_side = "left"
+    calib_dataloader = get_dataset_dataloader(

223-227: determine_available_memory: return value not used; simplify.

The original returns None. Returning results is harmless, but you can just call and return None to match signature.

-def new_determine_available_memory(self) -> None:
-    with disable_compilation(self.model_runner.model):
-        results = old_determine_available_memory(self)
-    return results
+def new_determine_available_memory(self) -> None:
+    with disable_compilation(self.model_runner.model):
+        old_determine_available_memory(self)

229-233: Prolog trigger OK; confirm idempotency.

compile_or_warm_up_model may be called more than once. Ensure fakequant_run_prolog is idempotent or guarded.
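
One possible guard, sketched below; the instance-attribute flag is illustrative and not from the PR:

def new_compile_or_warm_up_model(self):
    # Run the quantization prolog only on the first warm-up call for this worker.
    if not getattr(self, "_fakequant_prolog_done", False):
        fakequant_run_prolog(self)
        self._fakequant_prolog_done = True
    old_compile_or_warm_up_model(self)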


1-52: Duplicate/stacked license headers; consolidate to avoid confusion.

The file contains multiple Apache-2.0 and MIT blocks. Keep a single project header and add third-party notices in a dedicated section or 3rd-party notices file.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4ff8fc9 and c03be05.

📒 Files selected for processing (5)
  • .pre-commit-config.yaml (1 hunks)
  • examples/vllm_serve/Dockerfile (1 hunks)
  • examples/vllm_serve/README.md (1 hunks)
  • examples/vllm_serve/convert_amax_hf2vllm.py (1 hunks)
  • examples/vllm_serve/vllm_serve_fakequant.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
examples/vllm_serve/vllm_serve_fakequant.py (3)
modelopt/torch/utils/dataset_utils.py (1)
  • get_dataset_dataloader (157-232)
modelopt/torch/quantization/qtensor/base_qtensor.py (1)
  • to (115-123)
examples/vllm_serve/convert_amax_hf2vllm.py (1)
  • main (155-209)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: code-quality
  • GitHub Check: build-docs
🔇 Additional comments (7)
examples/vllm_serve/README.md (1)

24-25: Confirm CLI flags validity across vLLM versions.

The short flag -tp may differ across versions. Ensure make_arg_parser exposes -tp; otherwise use --tensor-parallel-size.

examples/vllm_serve/Dockerfile (1)

26-27: Flash-Attn version compatibility.

Pinning flash-attn to 2.7.4.post1 may conflict with the base image CUDA/PyTorch versions. Please confirm ABI/Wheel availability for the base image.

examples/vllm_serve/convert_amax_hf2vllm.py (2)

71-85: Merging logic OK; consider dtype/device preservation and scalar handling.

torch.stack + max preserves dtype/device; fine. Ensure q/k/v shapes match; else raise with a clear error.
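
A hedged sketch of such a check; the helper name is illustrative, and the merge mirrors the script's torch.stack().max() approach:

import torch

def merge_amax(tensors: list[torch.Tensor], names: list[str]) -> torch.Tensor:
    # Merge per-projection amax tensors (e.g. q/k/v) by elementwise max,
    # raising a clear error when shapes disagree instead of a bare stack failure.
    shapes = {tuple(t.shape) for t in tensors}
    if len(shapes) != 1:
        raise ValueError(f"amax shape mismatch for {names}: {sorted(shapes)}")
    return torch.stack(tensors).max(dim=0)[0]  # keeps dtype and device

For example, merge_amax([q_amax, k_amax, v_amax], ["q_proj", "k_proj", "v_proj"]) would yield the merged qkv_proj amax.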


191-201: Dry-run output helpful.

Good observability of key changes.

examples/vllm_serve/vllm_serve_fakequant.py (3)

171-175: PP intermediate tensor handoff may differ across vLLM versions.

recv_tensor_dict signature and gating on is_first_rank can vary (0.10 vs 0.11). Please validate on both targets.


183-185: Rank-0 printing only: good.

Prevents noisy logs in distributed runs.


241-251: Server startup path likely OK.

FlexibleArgumentParser + make_arg_parser + uvloop.run(run_server(args)) matches vLLM patterns.

Comment on lines +206 to +209
print(f"Saving vLLM checkpoint to: {args.output}")
os.makedirs(os.path.dirname(args.output), exist_ok=True)
torch.save(vllm_state_dict, args.output)
print("Conversion complete!")

⚠️ Potential issue

Bug: os.makedirs with empty dirname fails when output is a filename.

Guard dirname before creating.

-print(f"Saving vLLM checkpoint to: {args.output}")
-os.makedirs(os.path.dirname(args.output), exist_ok=True)
-torch.save(vllm_state_dict, args.output)
+print(f"Saving vLLM checkpoint to: {args.output}")
+out_dir = os.path.dirname(args.output)
+if out_dir:
+    os.makedirs(out_dir, exist_ok=True)
+to_save = vllm_state_dict
+if getattr(args, "only-amax", False):
+    to_save = {k: v for k, v in vllm_state_dict.items() if k.endswith("_amax")}
+torch.save(to_save, args.output)
🤖 Prompt for AI Agents
In examples/vllm_serve/convert_amax_hf2vllm.py around lines 206 to 209,
os.makedirs(os.path.dirname(args.output), exist_ok=True) will raise if
os.path.dirname(args.output) is empty (when args.output is just a filename);
guard the dirname before creating directories by computing d =
os.path.dirname(args.output) and only call os.makedirs(d, exist_ok=True) if d is
non-empty (or truthy), otherwise skip mkdir and proceed to torch.save.

@@ -0,0 +1,44 @@
FROM vllm/vllm-openai:v0.10.2

⚠️ Potential issue

Align base image with tested/targeted vLLM version.

Base is v0.10.2; README claims tests with 0.9.0 and 0.11.2. Pick one target (prefer latest tested) and update image, or document the intended version matrix.

-FROM vllm/vllm-openai:v0.10.2
+# Consider aligning with the tested version, e.g.:
+# FROM vllm/vllm-openai:v0.11.2
+FROM vllm/vllm-openai:v0.11.2
🤖 Prompt for AI Agents
In examples/vllm_serve/Dockerfile around line 1, the base image tag (v0.10.2)
does not match the README's tested versions (0.9.0 and 0.11.2); update the
Dockerfile to use the intended target image (e.g., FROM
vllm/vllm-openai:v0.11.2) to align with the latest tested version, or
alternatively add a comment/README entry documenting the supported vLLM version
matrix and rationale for keeping v0.10.2.

done

# Allow users to run without root
RUN chmod -R 777 /workspace

⚠️ Potential issue

Avoid 777; prefer least privilege.

chmod -R 777 is unsafe. Grant group write as needed or create a non-root user.

-RUN chmod -R 777 /workspace
+# Example: create non-root user and set appropriate permissions
+RUN useradd -m -u 1000 appuser \
+ && chown -R appuser:appuser /workspace
+USER appuser
🤖 Prompt for AI Agents
In examples/vllm_serve/Dockerfile around line 38, replace the unsafe "chmod -R
777 /workspace" with a least-privilege approach: create a non-root user and
group, chown the workspace to that user/group, and set restrictive perms (e.g.,
755 for dirs and 644 for files or 775/g+w only where group write is required)
instead of global 777; then switch to that USER with a USER directive so the
container runs without root privileges. Ensure recursive ownership is limited to
the workspace path and avoid granting execute/write bits to all users.


Compared with realquant, fakequant is 2-5x slower, but doesn't require dedicated kernel support and facilitates research.

This example is tested with vllm 0.9.0 and 0.11.2

⚠️ Potential issue

Version mismatch with Docker base image.

README states testing with vLLM 0.9.0 and 0.11.2, but the Dockerfile uses v0.10.2. Align the example and base image or document supported versions explicitly.

🤖 Prompt for AI Agents
In examples/vllm_serve/README.md around line 7, the README claims testing with
vLLM 0.9.0 and 0.11.2 while the Dockerfile uses v0.10.2; update either the
README or the Dockerfile so versions align or explicitly list all supported vLLM
versions. Fix by choosing one source of truth: (a) change the README to state
the Dockerfile's v0.10.2 and note any other supported versions, or (b) update
the Dockerfile to one of the versions listed in the README; also add a short
note in the README explaining the tested and supported vLLM versions and any
compatibility caveats.


## Calibrate and serve fake quant model in vLLM

Step 1: Modify `quant_config` in `vllm_serve_fake_quant.py` for the desired quantization format

⚠️ Potential issue

File name typo.

Use vllm_serve_fakequant.py (no extra underscore). Current text says vllm_serve_fake_quant.py.

-Step 1: Modify `quant_config` in `vllm_serve_fake_quant.py` for the desired quantization format
+Step 1: Modify `quant_config` in `vllm_serve_fakequant.py` for the desired quantization format
🤖 Prompt for AI Agents
In examples/vllm_serve/README.md around line 19, the referenced filename
contains a typo: "vllm_serve_fake_quant.py" should be "vllm_serve_fakequant.py";
update the README line to use the correct filename (vllm_serve_fakequant.py) so
instructions match the actual file name.

Comment on lines +110 to +112
if tokenizer.pad_token != "<unk>" or tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token


⚠️ Potential issue

Fix pad_token condition.

Set pad_token when missing or equals unk; current logic is inverted.

-    if tokenizer.pad_token != "<unk>" or tokenizer.pad_token is None:
-        tokenizer.pad_token = tokenizer.eos_token
+    if tokenizer.pad_token is None or tokenizer.pad_token == getattr(tokenizer, "unk_token", "<unk>"):
+        tokenizer.pad_token = tokenizer.eos_token
🤖 Prompt for AI Agents
In examples/vllm_serve/vllm_serve_fakequant.py around lines 110 to 112, the
conditional that sets tokenizer.pad_token is inverted; change it to set
pad_token when it is missing or equals "<unk>" (i.e., if tokenizer.pad_token is
None or tokenizer.pad_token == "<unk>": set tokenizer.pad_token =
tokenizer.eos_token) so the pad token is replaced only when absent or explicitly
"<unk>".

Comment on lines +177 to +181
quant_cfg = getattr(mtq, quant_config["quant_format"])

with disable_compilation(self.model):
mtq.quantize(self.model, quant_cfg, forward_loop=calibrate_loop)


⚠️ Potential issue

Handle invalid quant_format gracefully.

Guard getattr and raise a clear error if the format name is unknown.

-    quant_cfg = getattr(mtq, quant_config["quant_format"])
+    try:
+        quant_cfg = getattr(mtq, quant_config["quant_format"])
+    except AttributeError as e:
+        raise ValueError(f"Unknown quant_format: {quant_config['quant_format']}") from e
🤖 Prompt for AI Agents
In examples/vllm_serve/vllm_serve_fakequant.py around lines 177 to 181, the code
uses getattr(mtq, quant_config["quant_format"]) without validating the attribute
and will raise an unclear AttributeError for unknown formats; update this to
check for the attribute (e.g., using hasattr or try/except AttributeError) and
if missing raise a clear ValueError (or RuntimeError) that includes the invalid
quant_format and optionally lists supported formats, before proceeding to call
mtq.quantize.


- repo: https://github.com/pre-commit/mirrors-clang-format
-    rev: v21.1.0
+    rev: v9.0.0
Collaborator

is this change intentional?

Collaborator Author

Without this change, the pre-commit hooks just throw an error and cannot run: "metal not supported"

Collaborator

Please add this directory in https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/.github/CODEOWNERS and use one of the existing reviewer teams or create a new one.

Collaborator

Please add this new example in Changelog under a new 0.39 section in https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/CHANGELOG.rst

@mxinO mxinO left a comment

LGTM! Thanks.

if tokenizer.pad_token != "<unk>" or tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token

if quant_config["amax_file_path"]:

Can we check if there's an "amax.pt" file inside the model folder if the amax path is not specified?

Collaborator Author

I would rather not do this, as silently loading a checkpoint seems fishy to me. Loading a checkpoint should be explicitly specified.

@RalphMao RalphMao force-pushed the huizim/vllm_serve_update branch from c03be05 to 6d69e05 on October 1, 2025 at 17:53
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (9)
examples/vllm_serve/Dockerfile (2)

1-1: Version mismatch with README.

The Dockerfile uses v0.10.2 but the README (line 7) states testing with vLLM 0.9.0 and 0.11.2. Align the versions or document the supported version matrix.


38-38: Security issue: overly permissive permissions.

chmod -R 777 grants all users read/write/execute on /workspace, which is unsafe. Create a non-root user and set restrictive permissions instead.

Apply this diff to improve security:

-RUN chmod -R 777 /workspace
+RUN useradd -m -u 1000 appuser \
+ && chown -R appuser:appuser /workspace
+USER appuser
examples/vllm_serve/convert_amax_hf2vllm.py (1)

206-209: Bug: directory creation fails when output is a bare filename.

os.path.dirname(args.output) returns an empty string for bare filenames, causing os.makedirs to fail or misbehave. Guard against empty dirname.

Apply this diff:

 print(f"Saving vLLM checkpoint to: {args.output}")
-os.makedirs(os.path.dirname(args.output), exist_ok=True)
+out_dir = os.path.dirname(args.output)
+if out_dir:
+    os.makedirs(out_dir, exist_ok=True)
 torch.save(vllm_state_dict, args.output)
examples/vllm_serve/vllm_serve_fakequant.py (4)

75-95: Raises on unknown model types.

disable_compilation throws ValueError if the model lacks both model and language_model attributes. Make it tolerant by falling back to the object itself or no-op.

Apply this diff:

 @contextmanager
-def disable_compilation(model):
-    """Context manager to temporarily disable torch.compile"""
-    do_not_compile = True
-    if hasattr(model, "model"):
-        do_not_compile = model.model.do_not_compile
-        model.model.do_not_compile = True
-    elif hasattr(model, "language_model"):  # VLM requires this
-        do_not_compile = model.language_model.model.do_not_compile
-        model.language_model.model.do_not_compile = True
-    else:
-        raise ValueError("Model does not have a model or language_model attribute")
-
+def disable_compilation(obj):
+    """Temporarily set do_not_compile on the underlying model if available."""
+    target = None
+    if hasattr(obj, "model"):
+        target = obj.model
+    elif hasattr(obj, "language_model") and hasattr(obj.language_model, "model"):
+        target = obj.language_model.model
+    else:
+        target = obj
+    old = getattr(target, "do_not_compile", False)
     try:
+        setattr(target, "do_not_compile", True)
         yield
     finally:
-        if hasattr(model, "model"):
-            model.model.do_not_compile = do_not_compile
-        elif hasattr(model, "language_model"):
-            model.language_model.model.do_not_compile = do_not_compile
+        setattr(target, "do_not_compile", old)

110-112: Bug: pad_token condition is inverted.

The current logic sets pad_token when it is NOT "<unk>" OR is None, which is incorrect. It should set pad_token when it IS None OR equals "<unk>".

Apply this diff:

-    if tokenizer.pad_token != "<unk>" or tokenizer.pad_token is None:
+    if tokenizer.pad_token is None or tokenizer.pad_token == "<unk>":
         tokenizer.pad_token = tokenizer.eos_token

177-181: Missing validation for quant_format.

getattr(mtq, quant_config["quant_format"]) will raise an unclear AttributeError if the format is invalid. Validate and provide a clear error message.

Apply this diff:

-    quant_cfg = getattr(mtq, quant_config["quant_format"])
+    try:
+        quant_cfg = getattr(mtq, quant_config["quant_format"])
+    except AttributeError as e:
+        raise ValueError(
+            f"Unknown quant_format: {quant_config['quant_format']}. "
+            f"Check modelopt.torch.quantization for valid formats."
+        ) from e

187-213: Weak validation: count-only check doesn't verify key mapping.

Checking only amax key counts can miss mismatches when keys differ but counts match. Filter to _amax keys, verify exact key correspondence, and report missing/extra keys.

Apply this diff:

     amax_file_path = quant_config["amax_file_path"]
     if amax_file_path:
         print(f"Loading amax values from {amax_file_path}")
         saved_amax_dict = torch.load(amax_file_path, map_location=self.device)
         current_state_dict = self.model.state_dict()
 
-        # Count amax keys in checkpoint and model
-        checkpoint_amax_keys = [key for key in saved_amax_dict if key.endswith("amax")]
-        model_amax_keys = [key for key in current_state_dict if key.endswith("amax")]
+        # Filter to amax keys only
+        saved_amax = {k: v for k, v in saved_amax_dict.items() if "_amax" in k}
+        model_amax_keys = {k for k in current_state_dict if "_amax" in k}
 
-        checkpoint_amax_count = len(checkpoint_amax_keys)
-        model_amax_count = len(model_amax_keys)
-
-        # Ensure counts match
-        if checkpoint_amax_count != model_amax_count:
-            raise ValueError(
-                f"Mismatch in amax key counts: checkpoint has {checkpoint_amax_count} "
-                f"amax keys but model has {model_amax_count} amax keys. "
-            )
+        missing_in_model = set(saved_amax.keys()) - model_amax_keys
+        extra_in_model = model_amax_keys - set(saved_amax.keys())
+        
+        if missing_in_model or extra_in_model:
+            error_msg = "Amax key mismatch:\n"
+            if missing_in_model:
+                error_msg += f"  Keys in file but not in model: {sorted(list(missing_in_model))[:5]}\n"
+            if extra_in_model:
+                error_msg += f"  Keys in model but not in file: {sorted(list(extra_in_model))[:5]}\n"
+            raise ValueError(error_msg)
 
-        # Update amax values
-        for key, value in saved_amax_dict.items():
-            if key in current_state_dict:
-                current_state_dict[key] = value.to(self.device)
+        # Update amax values only
+        with torch.no_grad():
+            for key, value in saved_amax.items():
+                current_state_dict[key] = value.to(self.device)
 
         self.model.load_state_dict(current_state_dict, strict=True)
examples/vllm_serve/README.md (2)

7-7: Version mismatch with Dockerfile.

The README claims testing with vLLM 0.9.0 and 0.11.2, but the Dockerfile uses v0.10.2. Align these or explicitly document the version matrix.


19-19: Filename typo.

The referenced filename vllm_serve_fake_quant.py has an extra underscore. The actual file is vllm_serve_fakequant.py.

Apply this diff:

-Step 1: Modify `quant_config` in `vllm_serve_fake_quant.py` for the desired quantization format
+Step 1: Modify `quant_config` in `vllm_serve_fakequant.py` for the desired quantization format
🧹 Nitpick comments (3)
examples/vllm_serve/Dockerfile (1)

29-29: Document why CUDA extension compilation failures are acceptable.

The || true swallows compilation errors. If precompilation is optional, add a brief comment explaining why failures are tolerated.

-RUN python3 -c "import modelopt.torch.quantization.extensions as ext; ext.precompile()" || true
+# Pre-compile CUDA extensions (optional; compilation may fail in some environments)
+RUN python3 -c "import modelopt.torch.quantization.extensions as ext; ext.precompile()" || true
examples/vllm_serve/vllm_serve_fakequant.py (1)

113-116: Consider auto-detecting amax.pt in model directory.

As suggested in past review, checking for amax.pt in the model folder when amax_file_path is None would improve user experience.

Apply this diff:

+    # Auto-detect amax file if not specified
+    if not quant_config["amax_file_path"]:
+        default_amax = os.path.join(self.model_config.model, "amax.pt")
+        if os.path.isfile(default_amax):
+            quant_config["amax_file_path"] = default_amax
+            print(f"Auto-detected amax file at {default_amax}")
+    
     if quant_config["amax_file_path"]:
         # If amax file path is provided, we only need to do a simple calibration step
         quant_config["quant_num_samples"] = 1
examples/vllm_serve/README.md (1)

46-56: Section marked WIP but provides complete instructions.

The heading says "(WIP)" but the steps appear complete and functional. Either remove the WIP tag or clarify what remains unfinished.

-## Load QAT/PTQ model and serve in vLLM (WIP)
+## Load QAT/PTQ model and serve in vLLM

Or clarify what's incomplete:

-## Load QAT/PTQ model and serve in vLLM (WIP)
+## Load QAT/PTQ model and serve in vLLM (WIP: only tested for Llama 3.1)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c03be05 and 6d69e05.

📒 Files selected for processing (4)
  • examples/vllm_serve/Dockerfile (1 hunks)
  • examples/vllm_serve/README.md (1 hunks)
  • examples/vllm_serve/convert_amax_hf2vllm.py (1 hunks)
  • examples/vllm_serve/vllm_serve_fakequant.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
examples/vllm_serve/vllm_serve_fakequant.py (2)
modelopt/torch/utils/dataset_utils.py (1)
  • get_dataset_dataloader (157-232)
modelopt/torch/quantization/qtensor/base_qtensor.py (1)
  • to (115-123)
examples/vllm_serve/convert_amax_hf2vllm.py (1)
examples/vllm_serve/vllm_serve_fakequant.py (1)
  • main (241-250)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: build-docs
  • GitHub Check: code-quality
🔇 Additional comments (4)
examples/vllm_serve/convert_amax_hf2vllm.py (2)

26-86: LGTM: Conversion logic is sound.

The function correctly merges QKV and gate/up projection amax values using elementwise max, which is appropriate for quantization ranges.


88-153: LGTM: Test coverage validates key transformations.

The test function provides clear validation of the merging logic with expected vs. actual key comparisons.

examples/vllm_serve/vllm_serve_fakequant.py (1)

125-176: LGTM: Calibration loop correctly integrates with vLLM's execution model.

The synthetic request construction and scheduler output assembly properly mirror vLLM's internal structures for calibration.

examples/vllm_serve/README.md (1)

1-57: LGTM: Clear documentation of the workflow.

The README provides a well-structured guide covering environment setup, calibration, serving, and evaluation with concrete examples.

Contributor

@realAsma realAsma left a comment

Looks great!!

@RalphMao RalphMao force-pushed the huizim/vllm_serve_update branch from 6d69e05 to 39d7ded on October 3, 2025 at 22:20
@RalphMao RalphMao requested a review from a team as a code owner October 3, 2025 22:20
@RalphMao RalphMao force-pushed the huizim/vllm_serve_update branch from 39d7ded to 5f1ed83 on October 3, 2025 at 23:06
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
examples/vllm_serve/vllm_serve_fakequant.py (1)

187-213: Consider stricter amax key validation.

While the count-based check (lines 197-205) catches gross mismatches, it doesn't verify that checkpoint keys actually map to model keys. If the checkpoint contains different amax keys (e.g., from a different model architecture), the count could match but keys could be misaligned.

Consider validating key sets explicitly:

-    checkpoint_amax_keys = [key for key in saved_amax_dict if key.endswith("amax")]
-    model_amax_keys = [key for key in current_state_dict if key.endswith("amax")]
-
-    checkpoint_amax_count = len(checkpoint_amax_keys)
-    model_amax_count = len(model_amax_keys)
-
-    # Ensure counts match
-    if checkpoint_amax_count != model_amax_count:
-        raise ValueError(
-            f"Mismatch in amax key counts: checkpoint has {checkpoint_amax_count} "
-            f"amax keys but model has {model_amax_count} amax keys. "
-        )
+    # Filter to amax keys only
+    checkpoint_amax_keys = {k for k in saved_amax_dict if k.endswith("_amax") or k.endswith("amax")}
+    model_amax_keys = {k for k in current_state_dict if k.endswith("_amax") or k.endswith("amax")}
+
+    # Verify key sets match
+    missing_in_model = checkpoint_amax_keys - model_amax_keys
+    extra_in_model = model_amax_keys - checkpoint_amax_keys
+
+    if missing_in_model or extra_in_model:
+        raise ValueError(
+            f"Amax key mismatch:\n"
+            f"  Keys in checkpoint not found in model: {sorted(list(missing_in_model))[:5]}...\n"
+            f"  Keys in model not found in checkpoint: {sorted(list(extra_in_model))[:5]}..."
+        )
 
     # Update amax values
-    for key, value in saved_amax_dict.items():
-        if key in current_state_dict:
+    with torch.no_grad():
+        for key in checkpoint_amax_keys:
+            value = saved_amax_dict[key]
             current_state_dict[key] = value.to(self.device)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 39d7ded and 5f1ed83.

📒 Files selected for processing (5)
  • examples/vllm_serve/Dockerfile (1 hunks)
  • examples/vllm_serve/README.md (1 hunks)
  • examples/vllm_serve/convert_amax_hf2vllm.py (1 hunks)
  • examples/vllm_serve/vllm_serve_fakequant.py (1 hunks)
  • modelopt/torch/quantization/plugins/vllm.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • examples/vllm_serve/README.md
  • modelopt/torch/quantization/plugins/vllm.py
🧰 Additional context used
🧬 Code graph analysis (1)
examples/vllm_serve/vllm_serve_fakequant.py (3)
modelopt/torch/utils/dataset_utils.py (1)
  • get_dataset_dataloader (157-232)
examples/llm_autodeploy/api_server.py (1)
  • run_server (194-208)
modelopt/torch/quantization/qtensor/base_qtensor.py (1)
  • to (115-123)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: wait-checks / wait
  • GitHub Check: wait-checks / wait
  • GitHub Check: build-docs
  • GitHub Check: code-quality
🔇 Additional comments (8)
examples/vllm_serve/Dockerfile (2)

29-29: LGTM: Pre-compile with fallback.

Using || true allows the build to continue if CUDA extension pre-compilation fails, which is appropriate for environments where compilation might not be immediately possible.


32-35: LGTM: Requirements install with error handling.

The script iterates through requirements files, attempts installation, and logs warnings on failure without stopping the build. This is a pragmatic approach for optional dependencies.

examples/vllm_serve/convert_amax_hf2vllm.py (2)

26-85: LGTM: Conversion logic is sound.

The function correctly:

  • Preserves non-amax keys unchanged.
  • Groups q/k/v and gate/up amax keys using regex patterns.
  • Merges grouped keys by taking elementwise max via torch.stack().max(dim=0)[0].
  • Handles single-key groups by renaming.

The regex patterns and merging strategy align with the vLLM format requirements described in the PR.


88-153: LGTM: Test function validates conversion.

The test creates a representative HF state dict with q/k/v and gate/up amax keys, runs the conversion, and verifies the output matches expected vLLM keys using set comparison. This provides a basic sanity check for the conversion logic.

examples/vllm_serve/vllm_serve_fakequant.py (4)

113-123: LGTM: Efficient calibration when amax is pre-computed.

Setting quant_num_samples = 1 when amax_file_path is provided optimizes calibration by performing only a minimal forward pass before loading pre-computed amax values. The dataloader setup correctly uses the modelopt utility function.


241-250: LGTM: Main entry point structure.

The main function correctly:

  • Creates a FlexibleArgumentParser for vLLM arguments.
  • Adds a positional model argument.
  • Uses make_arg_parser to augment with vLLM CLI arguments.
  • Launches the server with uvloop for async execution.

223-226: Verify new_determine_available_memory return type
The wrapper is annotated -> None but returns results from the original determine_available_memory. Confirm what the original method returns (e.g., memory stats) and update the annotation or adjust the return accordingly.


169-175: Check get_pp_group() behavior on non-distributed setups
get_pp_group() comes from vllm.distributed.parallel_state; verify that when torch.distributed isn’t initialized (e.g. single-GPU) calling .is_first_rank or .recv_tensor_dict won’t error.
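
A guard along these lines could look like the sketch below; it assumes get_pp_group is importable from vllm.distributed, and attribute availability may vary across versions:

import torch.distributed as dist
from vllm.distributed import get_pp_group  # assumed re-export; may live under vllm.distributed.parallel_state

def maybe_recv_intermediate_tensors():
    # Single-process / single-GPU runs have no pipeline-parallel handoff to receive.
    if not dist.is_initialized():
        return None
    pp_group = get_pp_group()
    if pp_group.is_first_rank:
        return None
    return pp_group.recv_tensor_dict()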
