Summary
Audit the current SGLang self-hosted inference deployment (#1203) to verify the serving model uses at least Q5_K_M (5-bit) quantization. Models quantized to Q4 or below produce measurable increases in syntax errors in generated code — mismatched brackets, truncated JSON, invalid JavaScript — which directly degrades PortKit's Bedrock output quality.
Problem
The SGLang deployment (PR #1297) went live without a documented quantization floor. The inference quality literature for code generation is unambiguous:
"Q5_K_M (5-bit) quantization is the absolute minimum threshold for reliable code generation; 4-bit and below often introduce syntax errors (e.g., mismatched brackets) that break compilation."
Bedrock Add-on output is structured code: manifest.json must be valid JSON with exact field names, and the Scripting API .js files must parse without syntax errors. Q4 quantization artifacts (off-by-one token predictions, truncated outputs) would silently produce invalid output that fails downstream validation.
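To make the failure mode concrete, here is a minimal sketch of the kind of downstream validation that catches these artifacts. It assumes a conventional add-on layout (manifest.json at the root, scripts under scripts/) and that Node.js is available for `node --check`; the helper name and paths are illustrative, not PortKit's actual validator.

```python
import json
import subprocess
from pathlib import Path

def validate_addon_output(addon_dir: str) -> list[str]:
    """Collect validation failures for a generated Bedrock add-on directory.

    Hypothetical helper for illustration; layout and paths are assumptions.
    """
    errors = []
    manifest = Path(addon_dir) / "manifest.json"
    try:
        json.loads(manifest.read_text())  # truncated JSON fails here
    except (OSError, json.JSONDecodeError) as exc:
        errors.append(f"manifest.json: {exc}")

    for script in Path(addon_dir).glob("scripts/*.js"):
        # `node --check` parses the file without executing it, so a
        # mismatched bracket surfaces as a nonzero exit code.
        result = subprocess.run(
            ["node", "--check", str(script)],
            capture_output=True, text=True,
        )
        if result.returncode != 0:
            errors.append(f"{script.name}: {result.stderr.strip()}")
    return errors
```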
What to do
- Check current serving format: SSH into the SGLang RunPod instance and verify which GGUF/AWQ/EXL2 model file is being served. Document the quantization level (see the detection sketch after this list).
- If below Q5_K_M: Download and serve the Q5_K_M or Q6_K GGUF of the current model instead. For Qwen2.5-Coder-7B, the Q5_K_M GGUF is available on HuggingFace from Bartowski's quantized models (a download sketch follows the list).
- For GPU-only inference paths (AWQ/EXL2 via vLLM): The equivalent quality floor is AWQ 4-bit with group size 128, which preserves more precision than standard Q4. If using AWQ, ensure group_size ≤ 128 (a config-check sketch follows the list).
- Add a startup assertion to the SGLang launch script that logs the quantization level and warns if it detects a model below the 5-bit threshold.
- Document the standard in the SGLang deployment runbook: minimum Q5_K_M for GGUF, minimum AWQ-4bit-gs128 for GPU-only paths.
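For the audit and the startup assertion, a minimal Python sketch of filename-based detection, assuming the served GGUF follows the usual quant-tag naming convention (e.g. ...-Q5_K_M.gguf). The tag-to-bits table and function name are illustrative, not an SGLang API.

```python
import logging
import re

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("sglang.launch")

# Bits implied by common GGUF quant tags (illustrative subset).
QUANT_BITS = {
    "Q3_K_S": 3, "Q3_K_M": 3, "Q4_0": 4, "Q4_K_S": 4, "Q4_K_M": 4,
    "Q5_0": 5, "Q5_K_S": 5, "Q5_K_M": 5, "Q6_K": 6, "Q8_0": 8,
}

def assert_quant_floor(model_path: str, min_bits: int = 5) -> None:
    """Log the quant level inferred from the filename; warn below the floor."""
    match = re.search(r"(Q\d_(?:K_[SM]|K|0))", model_path, re.IGNORECASE)
    tag = match.group(1).upper() if match else None
    bits = QUANT_BITS.get(tag)
    log.info("Serving %s (quant tag: %s)", model_path, tag or "unknown")
    if bits is None:
        log.warning("Could not infer quantization level from filename.")
    elif bits < min_bits:
        log.warning("Quant %s is below the %d-bit floor (Q5_K_M).", tag, min_bits)

assert_quant_floor("Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf")  # emits a warning
```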
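For the upgrade step, a sketch of pulling the Q5_K_M GGUF with huggingface_hub. The repo id and filename are assumptions based on Bartowski's usual naming and should be verified against the actual HuggingFace page.

```python
from huggingface_hub import hf_hub_download

# Repo id and filename are assumptions; confirm on HuggingFace first.
gguf_path = hf_hub_download(
    repo_id="bartowski/Qwen2.5-Coder-7B-Instruct-GGUF",
    filename="Qwen2.5-Coder-7B-Instruct-Q5_K_M.gguf",
)
print(gguf_path)  # local cache path to pass to the SGLang launch command
```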
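For the AWQ path, a sketch of checking the group size from the checkpoint's config.json. The quantization_config field names follow the transformers convention ("quant_method", "group_size") and should be confirmed against the actual file.

```python
import json
from pathlib import Path

def check_awq_group_size(model_dir: str, max_group_size: int = 128) -> None:
    """Read the HF config.json and verify the AWQ group size meets the floor."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    qcfg = config.get("quantization_config", {})
    if qcfg.get("quant_method") != "awq":
        print("Not an AWQ checkpoint; skipping group-size check.")
        return
    # Smaller group sizes preserve more precision, so the floor is an upper bound.
    group_size = qcfg.get("group_size")
    assert group_size is not None and group_size <= max_group_size, (
        f"AWQ group_size={group_size} exceeds the gs{max_group_size} floor"
    )
    print(f"AWQ 4-bit with group_size={group_size}: OK")
```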
Acceptance Criteria
- The quantization level of the currently served model is verified and documented.
- The served model meets the floor: Q5_K_M or better for GGUF, or AWQ 4-bit with group_size ≤ 128 for GPU-only paths.
- The SGLang launch script logs the detected quantization level and warns when it falls below the 5-bit threshold.
- The quantization floor is recorded in the SGLang deployment runbook (see docs/ml_intern_finetuning_prompt.md).
References
docs/ml_intern_finetuning_prompt.md