Fix GPU dynamic batching reordering documents silently #98
Merged
`tokenize_documents_in_batches` sorts documents by token length and buckets them by shape on the GPU path, but the returned `Vec<PreparedDocumentBatch>` discards the original input order. Callers that map embeddings back to input positions (notably `colgrep::index::run_pool_stage` via `original_to_unique`) then pair every code unit with the wrong embedding, producing an unusable index. The symptom is silent: indexing succeeds and search runs, but results are unrelated to the query.

Reproduced on `axios` with `LateOn-Code-edge`: query "request and response interceptors" returns `helpers/speedometer.js`, `utils.js`, `helpers/composeSignals.js` on GPU vs `core/Axios.js`, `core/InterceptorManager.js`, `adapters/xhr.js` on CPU. The full semble code-search benchmark dropped from ~0.69 to ~0.16 NDCG@10 on the first repos.

Track the original input position alongside each tokenized document through sorting and bucketing, store it in `PreparedDocumentBatch`, and have `encode_prepared_document_batches` restore the caller's input order before returning. Batches produced through code paths that don't populate `original_input_indices` (e.g. `tokenize_documents`, `prepare_batch_from_tokenizer_encodings`) are passed through unchanged, so the public API stays backwards compatible.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
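The idea behind the fix can be illustrated with a minimal, self-contained sketch. This is not the code from this PR: `Batch`, `batch_by_length`, `encode`, and `encode_in_input_order` are hypothetical stand-ins, token lengths stand in for tokenized documents, and the "embedding" is a single `f32`. Only the field name `original_input_indices` is taken from the description above, and modelling it as an `Option` is an assumption made here to show the pass-through case.

```rust
/// Hypothetical stand-in for a prepared batch; the real
/// `PreparedDocumentBatch` is assumed to carry more fields.
#[derive(Debug, Clone)]
struct Batch {
    /// Token counts stand in for the tokenized documents.
    token_lens: Vec<usize>,
    /// Input position of each document, when the producing path tracks it.
    original_input_indices: Option<Vec<usize>>,
}

/// Sort documents by token length and chunk them into batches,
/// carrying each document's input position through the sort.
fn batch_by_length(token_lens: &[usize], batch_size: usize) -> Vec<Batch> {
    assert!(batch_size > 0, "batch_size must be non-zero");
    // Pair every document with its input position before reordering.
    let mut indexed: Vec<(usize, usize)> =
        token_lens.iter().copied().enumerate().collect();
    // Stable sort: documents of equal length keep their input order.
    indexed.sort_by_key(|&(_, len)| len);
    indexed
        .chunks(batch_size)
        .map(|chunk| Batch {
            token_lens: chunk.iter().map(|&(_, len)| len).collect(),
            original_input_indices: Some(chunk.iter().map(|&(i, _)| i).collect()),
        })
        .collect()
}

/// Toy encoder: one f32 per document, equal to its token length.
fn encode(batch: &Batch) -> Vec<f32> {
    batch.token_lens.iter().map(|&len| len as f32).collect()
}

/// Encode every batch, then restore the caller's input order if all
/// batches carry indices; otherwise return results in batch order.
fn encode_in_input_order(batches: &[Batch]) -> Vec<f32> {
    let flat: Vec<f32> = batches.iter().flat_map(encode).collect();
    // `None` as soon as a single batch lacks index information.
    let indices: Option<Vec<usize>> = batches
        .iter()
        .map(|b| b.original_input_indices.clone())
        .collect::<Option<Vec<Vec<usize>>>>()
        .map(|per_batch| per_batch.into_iter().flatten().collect());
    match indices {
        Some(idx) if idx.len() == flat.len() => {
            // Scatter each embedding back to its input position.
            // Assumes `idx` is a permutation of 0..flat.len().
            let mut out = vec![0.0_f32; flat.len()];
            for (embedding, i) in flat.into_iter().zip(idx) {
                out[i] = embedding;
            }
            out
        }
        // Untracked batches pass through unchanged.
        _ => flat,
    }
}
```

With token lengths `[7, 2, 5, 3]` and a batch size of 2, the batches hold the documents in sorted order (`[2, 3]` and `[5, 7]`), yet `encode_in_input_order` returns the results in the caller's order. A batch whose `original_input_indices` is `None` is returned in batch order, which mirrors the backwards-compatible behaviour described above.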