Add Qwen 3.5 and Gemma 4 models to benchmark suite #1

@vimeto

Description

New models to benchmark

For the MobiHoc 2026 paper revision, we should include newer model families to strengthen the generalizability of our findings.

Models to add

Qwen 3.5 family:

  • Qwen 3.5 (small variants, ~1-4B) — successor to Qwen 3, likely with improved tool calling
  • Check mlx-community for INT4 variants
  • SGLang should support these with qwen parser

Gemma 4 family:

  • Gemma 4 (small variants) — successor to Gemma 3n
  • May resolve the SGLang/vLLM compatibility issues we had with Gemma 3n
  • Check if standard attention (no Conv3d) improves runtime support
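
One way to check the last point: load the model's config (e.g. via `AutoConfig.from_pretrained(...).to_dict()` from `transformers`) and scan it for Conv3d-style layer types, which were the source of the Gemma 3n runtime issues. A minimal sketch with mocked-up configs — the helper name and config shapes are illustrative, not from our codebase:

```python
def uses_conv3d(config: dict) -> bool:
    """Recursively scan a (nested) HF-style config dict for Conv3d layer types."""
    for key, value in config.items():
        if isinstance(value, dict):
            if uses_conv3d(value):
                return True
        elif "conv3d" in key.lower() or "conv3d" in str(value).lower():
            return True
    return False

# Mocked-up configs: one shaped like Gemma 3n's multimodal stack, one plain.
gemma3n_like = {"vision_config": {"layer_type": "Conv3d"}, "text_config": {}}
plain_attention = {"text_config": {"layer_type": "attention"}}
```

If Gemma 4's config comes back clean, that is a good sign SGLang/vLLM will load it without the custom-op workarounds Gemma 3n needed.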

What to do

  1. Check HuggingFace for available model variants and sizes
  2. Add model configs to MLX_MODEL_MAP and MODELS lists
  3. Create optimized prompts in optimized_prompts.py
  4. Validate that tool calling works on both the MLX and SGLang backends
  5. Run full benchmark sweep
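
For step 2, the new entries would look roughly like the sketch below. The short keys and `mlx-community` repo ids are placeholders — the actual repo names need to be confirmed on HuggingFace once the INT4 conversions exist:

```python
# Sketch of step 2, assuming MLX_MODEL_MAP maps a short model key to an
# mlx-community repo id and MODELS lists the keys to benchmark.
# Repo ids marked TODO are placeholders, not confirmed HuggingFace names.
MLX_MODEL_MAP = {
    "qwen3-4b": "mlx-community/Qwen3-4B-4bit",             # existing entry
    "qwen3.5-4b": "mlx-community/TODO-qwen3.5-4bit",       # placeholder
    "gemma4-small": "mlx-community/TODO-gemma4-4bit",      # placeholder
}

MODELS = list(MLX_MODEL_MAP.keys())
```

Keeping `MODELS` derived from the map avoids the two lists drifting apart when we add the new families.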

Context

  • Current models: Qwen 3 (4B, 0.6B), Llama 3.2 3B, DeepSeek R1 1.5B, Gemma 3n E2B
  • Newer models would test if our structural findings (capability threshold, prefill dominance, inverse efficiency) hold across model generations
  • Reviewer A: "the paper feels like a snapshot in time" — newer models address this
