AI: Enforce Q5_K_M minimum quantization for SGLang production inference #1320

@anchapin

Summary

Audit the current SGLang self-hosted inference deployment (#1203) to verify that the served model uses at least Q5_K_M (5-bit) quantization. Models quantized to Q4 or below produce measurably more syntax errors in generated code — mismatched brackets, truncated JSON, invalid JavaScript — which directly degrades PortKit's Bedrock output quality.

Problem

The SGLang deployment (PR #1297) went live without a documented quantization floor. The inference quality literature for code generation is unambiguous:

"Q5_K_M (5-bit) quantization is the absolute minimum threshold for reliable code generation; 4-bit and below often introduce syntax errors (e.g., mismatched brackets) that break compilation."

Bedrock Add-on output is structured code: manifest.json must be valid JSON with exact field names, and the Scripting API .js files must parse without syntax errors. Q4 quantization artifacts (off-by-one token predictions, truncated outputs) would silently produce invalid output that fails downstream validation.

What to do

  1. Check current serving format: SSH into the SGLang RunPod instance and verify which GGUF/AWQ/EXL2 model file is being served. Document the quantization level.

  2. If below Q5_K_M: Download and serve the Q5_K_M or Q6_K GGUF of the current model instead. For Qwen2.5-Coder-7B, the Q5_K_M GGUF is available on HuggingFace from Bartowski's quantized models.

  3. For GPU-only inference paths (AWQ/EXL2 via vLLM): The equivalent quality floor is AWQ 4-bit with group size 128 (which preserves more precision than standard Q4). If using AWQ, ensure group_size ≤ 128.

  4. Add a startup assertion to the SGLang launch script that logs the quantization level and warns if it detects a model below the 5-bit threshold; a sketch follows this list.

  5. Document the standard in the SGLang deployment runbook: minimum Q5_K_M for GGUF, minimum AWQ-4bit-gs128 for GPU-only paths.
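
A minimal sketch of the startup assertion from step 4, assuming the launch script can call a small Python helper as a preflight step before `sglang.launch_server` is invoked. It infers the GGUF quant tag from the model filename (Bartowski-style uploads embed it, e.g. `...-Q5_K_M.gguf`) and, for AWQ/GPTQ checkpoints, reads `group_size` from `quantize_config.json` or the `quantization_config` block of `config.json`. The helper name `assert_quant_floor` is illustrative, not an existing SGLang API.

```python
import json
import logging
import re
import sys
from pathlib import Path

log = logging.getLogger("quant_floor")

# Approximate bits-per-weight for common GGUF quant tags (gating only).
GGUF_QUANT_BITS = {
    "Q2_K": 2, "Q3_K_S": 3, "Q3_K_M": 3, "Q3_K_L": 3,
    "Q4_0": 4, "Q4_K_S": 4, "Q4_K_M": 4,
    "Q5_0": 5, "Q5_K_S": 5, "Q5_K_M": 5,
    "Q6_K": 6, "Q8_0": 8,
}

def assert_quant_floor(model_path: str, min_bits: int = 5,
                       max_group_size: int = 128) -> None:
    """Log the quantization level; warn and exit if below the floor."""
    path = Path(model_path)

    if path.name.lower().endswith(".gguf"):
        # Pull the quant tag out of filenames like "...-Q5_K_M.gguf".
        match = re.search(r"(Q\d\w*)\.gguf$", path.name, re.IGNORECASE)
        tag = match.group(1).upper() if match else None
        bits = GGUF_QUANT_BITS.get(tag)
        if bits is None:
            log.warning("Cannot infer quant level from %s; check manually",
                        path.name)
            return
        log.info("GGUF quantization: %s (~%d-bit)", tag, bits)
        if bits < min_bits:
            log.error("%s is below the Q5_K_M floor; refusing to start", tag)
            sys.exit(1)
        return

    # AWQ/GPTQ checkpoint directory: look for quantization metadata.
    for cfg_file, key in [("quantize_config.json", None),
                          ("config.json", "quantization_config")]:
        cfg_path = path / cfg_file
        if not cfg_path.is_file():
            continue
        cfg = json.loads(cfg_path.read_text())
        if key:
            cfg = cfg.get(key) or {}
        if "group_size" not in cfg:
            continue
        log.info("Quantization: %s-bit, group_size=%s",
                 cfg.get("bits"), cfg["group_size"])
        # group_size -1 means one group per row, coarser than gs128.
        if cfg["group_size"] <= 0 or cfg["group_size"] > max_group_size:
            log.error("group_size %s violates the gs128 floor; refusing to start",
                      cfg["group_size"])
            sys.exit(1)
        return

    log.warning("No quantization metadata found under %s; check manually", path)
```

Calling `assert_quant_floor(args.model_path)` before the server process starts makes the floor self-enforcing rather than relying on the runbook alone.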

Acceptance Criteria

  • Current quantization level documented
  • Model served at ≥ Q5_K_M (GGUF) or ≥ AWQ-4bit-gs128 (GPU)
  • JSON validity rate on eval set ≥ 90% (run the eval script from docs/ml_intern_finetuning_prompt.md)
  • JavaScript syntax pass rate ≥ 85% on eval set (a scoring sketch for both rates follows this list)
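
A rough sketch of how the two rates could be scored, assuming the eval harness (per docs/ml_intern_finetuning_prompt.md) writes each generated manifest to a `.json` file and each Scripting API sample to a `.js` file under one output directory; the directory name `eval_outputs` is illustrative. JS syntax is checked with `node --check`, which parses without executing.

```python
import json
import subprocess
from pathlib import Path

def validity_rates(out_dir: str) -> tuple[float, float]:
    """Return (JSON validity rate, JS syntax pass rate) over eval outputs."""
    root = Path(out_dir)

    # manifest.json samples must parse as valid JSON.
    json_files = list(root.glob("**/*.json"))
    json_ok = 0
    for f in json_files:
        try:
            json.loads(f.read_text())
            json_ok += 1
        except json.JSONDecodeError:
            pass

    # Scripting API samples must pass Node's syntax-only check.
    js_files = list(root.glob("**/*.js"))
    js_ok = sum(
        subprocess.run(["node", "--check", str(f)],
                       capture_output=True).returncode == 0
        for f in js_files
    )

    return (json_ok / max(len(json_files), 1),
            js_ok / max(len(js_files), 1))

if __name__ == "__main__":
    json_rate, js_rate = validity_rates("eval_outputs")
    print(f"JSON validity:  {json_rate:.1%} (target >= 90%)")
    print(f"JS syntax pass: {js_rate:.1%} (target >= 85%)")
```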
