AI: Enforce Q5_K_M minimum quantization for SGLang production inference #1320

@anchapin

Summary

Audit the current SGLang self-hosted inference deployment (#1203) to verify that the served model uses at least Q5_K_M (5-bit) quantization. Models quantized to Q4 or below produce measurably more syntax errors in generated code — mismatched brackets, truncated JSON, invalid JavaScript — which directly degrades PortKit's Bedrock output quality.

Problem

The SGLang deployment (PR #1297) went live without a documented quantization floor. The inference quality literature for code generation is unambiguous:

"Q5_K_M (5-bit) quantization is the absolute minimum threshold for reliable code generation; 4-bit and below often introduce syntax errors (e.g., mismatched brackets) that break compilation."

Bedrock Add-on output is structured code: manifest.json must be valid JSON with exact field names, and the Scripting API .js files must parse without syntax errors. Q4 quantization artifacts (off-by-one token predictions, truncated outputs) would silently produce invalid output that fails downstream validation.

What to do

  1. Check current serving format: SSH into the SGLang RunPod instance and verify which GGUF/AWQ/EXL2 model file is being served. Document the quantization level.

  2. If below Q5_K_M: Download and serve the Q5_K_M or Q6_K GGUF of the current model instead. For Qwen2.5-Coder-7B, the Q5_K_M GGUF is available on HuggingFace from Bartowski's quantized models.

  3. For GPU-only inference paths (AWQ/EXL2 via vLLM): The equivalent quality floor is AWQ 4-bit with group size 128 (which preserves more precision than standard Q4). If using AWQ, ensure group_size ≤ 128.

  4. Add a startup assertion to the SGLang launch script that logs the quantization level and warns if it detects a model below the 5-bit threshold; a sketch follows this list.

  5. Document the standard in the SGLang deployment runbook: minimum Q5_K_M for GGUF, minimum AWQ-4bit-gs128 for GPU-only paths.
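
A minimal sketch of the startup assertion from step 4, assuming the launch script can call a small Python helper as a preflight step before `sglang.launch_server` is invoked. It infers the GGUF quant tag from the model filename (Bartowski-style uploads embed it, e.g. `...-Q5_K_M.gguf`) and, for AWQ/GPTQ checkpoints, reads `group_size` from `quantize_config.json` or the `quantization_config` block of `config.json`. The helper name `assert_quant_floor` is illustrative, not an existing SGLang API.

```python
import json
import logging
import re
import sys
from pathlib import Path

log = logging.getLogger("quant_floor")

# Approximate bits-per-weight for common GGUF quant tags (gating only).
GGUF_QUANT_BITS = {
    "Q2_K": 2, "Q3_K_S": 3, "Q3_K_M": 3, "Q3_K_L": 3,
    "Q4_0": 4, "Q4_K_S": 4, "Q4_K_M": 4,
    "Q5_0": 5, "Q5_K_S": 5, "Q5_K_M": 5,
    "Q6_K": 6, "Q8_0": 8,
}

def assert_quant_floor(model_path: str, min_bits: int = 5,
                       max_group_size: int = 128) -> None:
    """Log the quantization level; warn and exit if below the floor."""
    path = Path(model_path)

    if path.name.lower().endswith(".gguf"):
        # Pull the quant tag out of filenames like "...-Q5_K_M.gguf".
        match = re.search(r"(Q\d\w*)\.gguf$", path.name, re.IGNORECASE)
        tag = match.group(1).upper() if match else None
        bits = GGUF_QUANT_BITS.get(tag)
        if bits is None:
            log.warning("Cannot infer quant level from %s; check manually",
                        path.name)
            return
        log.info("GGUF quantization: %s (~%d-bit)", tag, bits)
        if bits < min_bits:
            log.error("%s is below the Q5_K_M floor; refusing to start", tag)
            sys.exit(1)
        return

    # AWQ/GPTQ checkpoint directory: look for quantization metadata.
    for cfg_file, key in [("quantize_config.json", None),
                          ("config.json", "quantization_config")]:
        cfg_path = path / cfg_file
        if not cfg_path.is_file():
            continue
        cfg = json.loads(cfg_path.read_text())
        if key:
            cfg = cfg.get(key) or {}
        if "group_size" not in cfg:
            continue
        log.info("Quantization: %s-bit, group_size=%s",
                 cfg.get("bits"), cfg["group_size"])
        # group_size -1 means one group per row, coarser than gs128.
        if cfg["group_size"] <= 0 or cfg["group_size"] > max_group_size:
            log.error("group_size %s violates the gs128 floor; refusing to start",
                      cfg["group_size"])
            sys.exit(1)
        return

    log.warning("No quantization metadata found under %s; check manually", path)
```

Calling `assert_quant_floor(args.model_path)` before the server process starts makes the floor self-enforcing rather than relying on the runbook alone.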

Acceptance Criteria

  • Current quantization level documented
  • Model served at ≥ Q5_K_M (GGUF) or ≥ AWQ-4bit-gs128 (GPU)
  • JSON validity rate on eval set ≥ 90% (run the eval script from docs/ml_intern_finetuning_prompt.md)
  • JavaScript syntax pass rate ≥ 85% on eval set (a scoring sketch for both rates follows this list)
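
A rough sketch of how the two rates could be scored, assuming the eval harness (per docs/ml_intern_finetuning_prompt.md) writes each generated manifest to a `.json` file and each Scripting API sample to a `.js` file under one output directory; the directory name `eval_outputs` is illustrative. JS syntax is checked with `node --check`, which parses without executing.

```python
import json
import subprocess
from pathlib import Path

def validity_rates(out_dir: str) -> tuple[float, float]:
    """Return (JSON validity rate, JS syntax pass rate) over eval outputs."""
    root = Path(out_dir)

    # manifest.json samples must parse as valid JSON.
    json_files = list(root.glob("**/*.json"))
    json_ok = 0
    for f in json_files:
        try:
            json.loads(f.read_text())
            json_ok += 1
        except json.JSONDecodeError:
            pass

    # Scripting API samples must pass Node's syntax-only check.
    js_files = list(root.glob("**/*.js"))
    js_ok = sum(
        subprocess.run(["node", "--check", str(f)],
                       capture_output=True).returncode == 0
        for f in js_files
    )

    return (json_ok / max(len(json_files), 1),
            js_ok / max(len(js_files), 1))

if __name__ == "__main__":
    json_rate, js_rate = validity_rates("eval_outputs")
    print(f"JSON validity:  {json_rate:.1%} (target >= 90%)")
    print(f"JS syntax pass: {js_rate:.1%} (target >= 85%)")
```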
