
Fix GPU dynamic batching reordering documents silently #98

Merged
raphaelsty merged 1 commit into main from fix/gpu-dynamic-batch-doc-order
May 18, 2026

@raphaelsty
Collaborator

`tokenize_documents_in_batches` sorts documents by token length and buckets them by shape on the GPU path, but the returned `Vec<PreparedDocumentBatch>` discards the original input order. Callers that map embeddings back to input positions (notably `colgrep::index::run_pool_stage` via `original_to_unique`) then pair every code unit with the wrong embedding, producing an unusable index. The symptom is silent: indexing succeeds and search runs, but results are unrelated to the query.
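To make the failure mode concrete, here is a minimal sketch of how length sorting without index tracking misaligns outputs. It is not the crate's code: `token_len`, `sorted_lengths_without_indices`, and `misaligned_positions` are hypothetical names, and whitespace word counts stand in for real tokenization and embeddings.

```rust
/// Hypothetical stand-in for tokenization: count whitespace-separated words.
fn token_len(doc: &str) -> usize {
    doc.split_whitespace().count()
}

/// Hypothetical stand-in for the GPU path: one output per document, returned
/// in length-sorted order with no record of each document's input position.
fn sorted_lengths_without_indices(docs: &[&str]) -> Vec<usize> {
    let mut lengths: Vec<usize> = docs.iter().map(|d| token_len(d)).collect();
    lengths.sort_unstable();
    lengths
}

/// Count positions where the output at position `i` no longer belongs to the
/// document the caller passed at position `i`.
fn misaligned_positions(docs: &[&str]) -> usize {
    let expected: Vec<usize> = docs.iter().map(|d| token_len(d)).collect();
    let outputs = sorted_lengths_without_indices(docs);
    expected
        .iter()
        .zip(outputs.iter())
        .filter(|(a, b)| a != b)
        .count()
}
```

A caller that zips outputs with inputs by position sees no error in this sketch either; the mismatch only shows up when the two are compared, which mirrors the silent symptom described above.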

Reproduced on `axios` with `LateOn-Code-edge`: query "request and response interceptors" returns `helpers/speedometer.js`, `utils.js`, `helpers/composeSignals.js` on GPU vs `core/Axios.js`, `core/InterceptorManager.js`, `adapters/xhr.js` on CPU. The full semble code-search benchmark dropped from ~0.69 to ~0.16 NDCG@10 on the first repos.

Track the original input position alongside each tokenized document through sorting and bucketing, store it in `PreparedDocumentBatch`, and have `encode_prepared_document_batches` restore the caller's input order before returning. Batches produced through code paths that don't populate `original_input_indices` (e.g. `tokenize_documents`, `prepare_batch_from_tokenizer_encodings`) are passed through unchanged, so the public API stays backwards compatible.
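The restore step can be sketched roughly as follows. This is an illustration under stated assumptions, not the merged implementation: `Batch` is a hypothetical, simplified stand-in for `PreparedDocumentBatch`, a single `f32` stands in for each document's embedding, `restore_input_order` is an invented name, and modelling `original_input_indices` as an `Option` is a guess at how "not populated" might be represented.

```rust
/// Hypothetical, simplified stand-in for `PreparedDocumentBatch`.
struct Batch {
    /// One value per document in the batch (stand-in for an embedding).
    embeddings: Vec<f32>,
    /// Original input position of each document in this batch.
    /// `None` models code paths that never populate the field.
    original_input_indices: Option<Vec<usize>>,
}

/// Flatten the batches. If every batch carries indices, scatter each output
/// back to its original input position; otherwise pass the outputs through
/// in batch order, unchanged.
/// Assumes the indices, taken together, form a permutation of `0..total`.
fn restore_input_order(batches: Vec<Batch>) -> Vec<f32> {
    let all_indexed = batches
        .iter()
        .all(|b| b.original_input_indices.is_some());
    if !all_indexed {
        // Legacy path: no ordering information, keep batch order as-is.
        return batches.into_iter().flat_map(|b| b.embeddings).collect();
    }

    let total: usize = batches.iter().map(|b| b.embeddings.len()).sum();
    let mut restored = vec![0.0_f32; total];
    for batch in batches {
        let indices = batch.original_input_indices.unwrap_or_default();
        for (embedding, index) in batch.embeddings.into_iter().zip(indices) {
            restored[index] = embedding;
        }
    }
    restored
}
```

Scattering by stored index keeps the caller-facing contract simple: output `i` belongs to input `i`, regardless of how documents were sorted or bucketed internally.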

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
@raphaelsty raphaelsty merged commit 548b760 into main May 18, 2026
20 checks passed