Support multi-GPU eval in tallyqa.py (#69)#334

Open
BuildWithAbid wants to merge 1 commit into m87-labs:main from BuildWithAbid:multi-gpu-eval-tallyqa
Conversation


@BuildWithAbid commented Apr 23, 2026

Closes #69 (reference pattern — I'll port the other moondream/eval/*.py scripts in a follow-up PR once the approach here is approved).

Approach

TallyQA eval is embarrassingly parallel per sample (no gradient sync, each row is independent), so multi-GPU is just dataset sharding plus a single all_reduce at the end:

  1. If launched under torchrun, init an NCCL process group and pin each process to its local-rank device.
  2. Shard the HuggingFace dataset contiguously via dataset.shard(world_size, rank).
  3. Each rank runs the existing eval loop on its shard.
  4. all_reduce the four counters (total, correct, total_simple, correct_simple); compute accuracies on rank 0.

No new dependencies — just torch.distributed, which ships with torch.
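The four steps above can be sketched as follows. This is a minimal, hypothetical outline rather than the actual tallyqa.py code: `eval_one` stands in for the real model call, and list slicing stands in for `dataset.shard()`.

```python
import os
import torch
import torch.distributed as dist

def run_eval(samples, eval_one):
    """Shard `samples` across ranks and sum the four counters with all_reduce.

    `eval_one(sample)` returns (is_correct, is_simple); both the name and
    the list-based sharding are illustrative stand-ins.
    """
    distributed = "LOCAL_RANK" in os.environ  # set by torchrun per worker
    if distributed:
        dist.init_process_group(backend="nccl")
        rank, world = dist.get_rank(), dist.get_world_size()
        torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
        samples = samples[rank::world]  # stand-in for dataset.shard(world, rank)
    # Counters: total, correct, total_simple, correct_simple
    counts = torch.zeros(4, dtype=torch.long)
    for s in samples:
        is_correct, is_simple = eval_one(s)
        counts[0] += 1
        counts[1] += int(is_correct)
        if is_simple:
            counts[2] += 1
            counts[3] += int(is_correct)
    if distributed:
        counts = counts.cuda()
        dist.all_reduce(counts, op=dist.ReduceOp.SUM)  # sum counters across ranks
        if rank != 0:
            return None  # only rank 0 computes and reports accuracies
    total, correct, total_simple, correct_simple = counts.tolist()
    return {"acc": correct / total, "acc_simple": correct_simple / total_simple}
```

When LOCAL_RANK is unset the function never touches torch.distributed, which is what keeps the single-GPU path unchanged.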

Usage

Single GPU / CPU / MPS (unchanged):

python -m moondream.eval.tallyqa --model <path>

Multi-GPU:

torchrun --nproc_per_node=<N> -m moondream.eval.tallyqa --model <path>

The distributed path only activates when LOCAL_RANK is set in the environment (i.e. the process was launched by torchrun), so existing single-GPU invocations behave exactly as before.
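The activation check can be as small as the following sketch (a hypothetical helper; `maybe_init_distributed` is not a name from the actual patch):

```python
import os
import torch
import torch.distributed as dist

def maybe_init_distributed():
    """Return the local rank if launched by torchrun, else None.

    torchrun exports LOCAL_RANK for every worker it spawns, so its absence
    means a plain `python -m ...` launch and we leave everything untouched.
    """
    if "LOCAL_RANK" not in os.environ:
        return None  # single-GPU / CPU / MPS path, unchanged
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)  # pin this process to its GPU
    dist.init_process_group(backend="nccl")
    return local_rank
```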

Notes

  • Progress bar and final print only run on rank 0 to keep stdout clean.
  • dist.destroy_process_group() is called on shutdown to avoid NCCL exit warnings.
  • Drive-by: switched the three args.debug references inside eval_tallyqa to the debug parameter it already accepts — this means eval_all.py can call it without leaking module globals. Happy to split this into a separate commit/PR if preferred.
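The rank-0 gating in the first note can be sketched like this (assuming torchrun's RANK convention; `print_rank0` is an illustrative helper, not necessarily the patch's actual shape):

```python
import os

rank = int(os.environ.get("RANK", "0"))  # torchrun sets RANK; defaults to 0

def print_rank0(*args, **kwargs):
    """Print only on rank 0 so N processes don't interleave their stdout."""
    if rank == 0:
        print(*args, **kwargs)

# The progress bar gets the same treatment via tqdm's `disable` flag:
#   for sample in tqdm(dataset, disable=rank != 0): ...
```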

Test plan

  • Syntax-checks and passes black --check.
  • Single-GPU path: no changes to control flow when LOCAL_RANK is unset.
  • Multi-GPU smoke test on 2× T4 (Modal). To isolate the distributed plumbing from model correctness, I ran the eval with a stubbed model (counts its own calls, returns a fixed answer) on the first 40 dataset samples. Results: single-GPU processed 126 QAs; 2-GPU sharded into 79 + 47 = 126 (full coverage, no overlap); accuracies matched exactly between both runs (Simple 36.29%, Full 36.51%); the rank-0-gated final prints appeared exactly once in the 2-GPU output. This directly exercises dist.init_process_group, dataset.shard(contiguous=True), dist.all_reduce, and the rank-0 print guards.

Once this pattern is accepted I'll port the other eval scripts (pope.py, chartqa.py, textvqa.py, docvqa.py, mmstar.py, coco_map.py, naturalbench.py, countbenchqa.py, realworldqa.py) and update eval_all.py in a follow-up.

Eval is embarrassingly parallel per sample, so each rank evaluates a
shard of the dataset and the counts are summed with all_reduce at the
end. Launch with:

  torchrun --nproc_per_node=<N> -m moondream.eval.tallyqa --model <path>

The single-GPU / CPU / MPS path is unchanged: when LOCAL_RANK is not in
the environment, the process group is never initialized.

Also replaces the few `args.debug` references inside eval_tallyqa with
the `debug` parameter it already accepts, so the function can be called
from eval_all.py without relying on module-level globals.

@chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5511be6b64


Development

Successfully merging this pull request may close these issues.

Support running eval scripts on multiple GPUs