ADD: batch document ingest + bulk purge with partial success #57

Open

ljhh-0611 wants to merge 4 commits into refactoring-applied-backed-v2 from
Conversation
- Add `Store::ingest_many` with partial success (`IngestResult`/`IngestFailure`), batch index optimization, and best-effort cleanup on failure
- Add `Store::purge_many` (PurgeResult/PurgeFailure)
- POST /documents: multi-file multipart upload with per-file validation
- DELETE /documents: bulk purge via JSON body { ids: [...] }
- GET /documents/{id}: single document retrieval
- Response DTOs: BatchIngestResponse, BatchPurgeResponse, FailedItem
- 14 new document tests + e2e test rewritten for multi-doc HTTP flow
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Keep the document batch API behavior unchanged while removing a silent fallback in failure indexing and trimming duplicated test HTTP setup. Constraint: cleanup is scoped to PR #57 / a69222c document-batch changes only.
Rejected: rewriting broader Speedwagon parser/tool clippy warnings | outside the requested commit scope.
Confidence: high
Scope-risk: narrow
Directive: Keep ingest_many response semantics provisional until the Store contract is hardened.
Tested: cargo fmt --check -p agent-k-backend -p speedwagon; cargo check -p agent-k-backend; cargo test -p agent-k-backend --test document_test; cargo test -p speedwagon --no-default-features --lib; cargo clippy -p agent-k-backend --tests
Not-tested: live ignored e2e RAG test requiring OPENAI_API_KEY
Contributor
There are some conflicts. Let me resolve them.
…rs/document.rs

- Extract document handlers from router.rs into handlers/document.rs
- Register document module in handlers/mod.rs
- Add document routes to router.rs using plain axum routing (Multipart/Query extractors don't implement aide OperationHandler)
- Update DEFAULT_MODEL in session.rs to gpt-5.4-mini from document-batch branch
- Resolve Store::new() API change in tests/common/mod.rs
- Update e2e_test.rs imports and AppState::new() signature

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
khj809 reviewed May 7, 2026
```rust
        to_index.push((idx, id, content));
    }
    Err(e) => {
        remove_ingest_artifact(&corpus_path);
```
Contributor

This line may remove a previously existing corpus file if `fs::read_to_string` fails. Suggest adding a flag recording whether the corpus file was newly created, and calling `remove_ingest_artifact` only if the flag is set:
```rust
let corpus_was_new = !corpus_path.exists();
// ...
if corpus_was_new {
    remove_ingest_artifact(&corpus_path);
}
```

```rust
    Ok(title) => docs.push((id.to_string(), title, content)),
    Err(e) => {
        let corpus_path = self.root.join("corpus").join(format!("{id}.md"));
        remove_ingest_artifact(&corpus_path);
```
Contributor

The same issue as above applies here if `parser::get_title` fails.
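To make the suggested pattern concrete, here is a self-contained sketch. The helper name `ingest_with_cleanup` and the simulated parse failure are hypothetical; only the `corpus_was_new` guard mirrors the suggestion. It shows that failure cleanup removes the corpus file only when this ingest created it:

```rust
use std::fs;
use std::path::Path;

// Hypothetical helper demonstrating the reviewer's suggestion: remember
// whether the corpus file existed before this ingest, and on failure only
// remove the file if this ingest created it, so a pre-existing corpus
// file is never deleted by a failed re-ingest.
fn ingest_with_cleanup(corpus_path: &Path, parsed: Result<&str, &str>) -> Result<(), String> {
    let corpus_was_new = !corpus_path.exists();
    // Write the (possibly empty) content first, mirroring the real flow
    // where the corpus file is created before a later step can still fail.
    fs::write(corpus_path, parsed.unwrap_or("")).map_err(|e| e.to_string())?;
    match parsed {
        Ok(_) => Ok(()),
        Err(e) => {
            // Best-effort cleanup, scoped to files this ingest created.
            if corpus_was_new {
                let _ = fs::remove_file(corpus_path);
            }
            Err(e.to_string())
        }
    }
}
```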
```rust
#[derive(Clone, Debug, Deserialize, JsonSchema)]
#[serde(deny_unknown_fields)]
pub struct BulkPurgeRequest {
    pub ids: Vec<Uuid>,
```
Contributor

`DocumentResponse.id` is a `String`, but `ids` here is `Vec<Uuid>`. It would be better to unify them to either `Uuid` or `String`.
khj809 reviewed May 7, 2026
```rust
let mut filenames: Vec<String> = Vec::new();
let mut failed: Vec<FailedItem> = Vec::new();

while let Ok(Some(field)) = multipart.next_field().await {
```
Contributor

This loop may terminate silently if `multipart.next_field().await` returns `Err`. Suggest fixing it as below, so the API can respond explicitly with a 400 error:
Suggested change

```rust
// before
while let Ok(Some(field)) = multipart.next_field().await {

// after
while let Some(field) = multipart
    .next_field()
    .await
    .map_err(|e| AppError::bad_request(format!("multipart error: {e}")))?
{
```
Summary
This PR adds batch-oriented document materialization APIs on top of the Speedwagon store, with partial-success semantics for both ingest and purge. The goal is to let clients upload or delete multiple documents in one request while still preserving successful work when individual files or ids fail.
It builds on the Speedwagon-backed session/message flow from #48 by exposing the document corpus through HTTP endpoints and covering the multi-document path in backend integration tests.
Changes
Speedwagon Store
- `Store::ingest_many` returning `IngestResult { succeeded, failed }`.
- `Store::purge_many` returning `PurgeResult { purged, failed }`.

Backend Document API
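A minimal sketch of the partial-success result shape, shown here for the purge side. Only the outer field names (`purged`/`failed`) come from the PR; the fields inside `PurgeFailure` and the `HashSet`-backed corpus are assumptions for illustration:

```rust
use std::collections::HashSet;

// Assumed failure shape; only `purged`/`failed` are named by the PR.
#[derive(Debug)]
pub struct PurgeFailure {
    pub name: String,  // id that failed to purge (assumed field name)
    pub error: String, // human-readable reason
}

#[derive(Debug)]
pub struct PurgeResult {
    pub purged: Vec<String>,
    pub failed: Vec<PurgeFailure>,
}

// Illustrative partial-success split, not the PR's implementation:
// known ids are removed from the corpus, unknown ids are reported in
// `failed` instead of aborting the whole batch.
pub fn purge_many(corpus: &mut HashSet<String>, ids: &[String]) -> PurgeResult {
    let mut result = PurgeResult { purged: Vec::new(), failed: Vec::new() };
    for id in ids {
        if corpus.remove(id) {
            result.purged.push(id.clone());
        } else {
            result.failed.push(PurgeFailure {
                name: id.clone(),
                error: "document not found".to_string(),
            });
        }
    }
    result
}
```

The key design point is that a bad id never fails the whole request; the caller inspects both vectors.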
- `backend-v2/src/model/document.rs`: `DocumentResponse`, `BatchIngestResponse`, `BatchPurgeResponse`, `FailedItem`, `BulkPurgeRequest`
- `GET /documents`: list indexed documents
- `POST /documents`: multipart multi-file ingest
- `GET /documents/{id}`: fetch one document by id
- `DELETE /documents/{id}`: purge one document by id
- `DELETE /documents`: bulk purge via JSON body
- Supported file types: `pdf`, `md`, `markdown`, and `txt`.
- `201 Created` when a batch ingest has no failures, and `200 OK` when the request completed with partial failures.

Tests
- `document_test` integration suite covering the document endpoints.
- `tests/common` extended with reusable document API helpers.

API Shape
Batch ingest
```json
{
  "succeeded": [ { "id": "...", "title": "...", "len": 123 } ],
  "failed": [ { "name": "bad.exe", "error": "unsupported file type '.exe' — supported: pdf, md, txt" } ]
}
```

Bulk purge
Request:

```json
{ "ids": ["..."] }
```

Response:

```json
{ "purged": ["..."], "failed": [ { "name": "...", "error": "document not found" } ] }
```

Validation
- `cargo fmt --check -p agent-k-backend`
- `cargo check -p agent-k-backend`
- `cargo test -p agent-k-backend --test document_test`: 14 passed
- `cargo test -p agent-k-backend --test e2e_test -- --ignored --list`: confirms the ignored live E2E test is present

Notes / Out of Scope
- The live E2E test is marked `#[ignore]` because it requires `OPENAI_API_KEY` and performs an actual model-backed RAG round trip.
- `ingest_many` currently follows the existing single-document `Store` behavior as closely as possible, but this is an initial API shape rather than a finalized long-term contract. As the Store is hardened (metadata generation, indexing strategy, failure recovery, corpus ownership, and migration behavior), the batch ingest API and its response/error semantics are expected to evolve with it.
- Supported formats are limited to `pdf`, `md`, `markdown`, and `txt`. Additional formats should be added through the shared Speedwagon `FileType`/translator path.
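The per-file validation and status-code rules described under the API section can be sketched as below. The helper names and exact error wording are illustrative, not taken from the PR; the all-failed case is not specified there, so this sketch simply treats it as another partial outcome:

```rust
// Hedged sketch of the per-file extension check for the supported formats
// (pdf, md, markdown, txt) and of the batch-ingest status rule: 201 when
// no file failed, 200 when the `failed` list is non-empty.
fn supported_extension(filename: &str) -> Result<(), String> {
    let ext = filename
        .rsplit_once('.')
        .map(|(_, e)| e.to_ascii_lowercase());
    match ext.as_deref() {
        Some("pdf") | Some("md") | Some("markdown") | Some("txt") => Ok(()),
        Some(other) => Err(format!("unsupported file type '.{other}'")),
        None => Err("missing file extension".to_string()),
    }
}

fn batch_ingest_status(failed_count: usize) -> u16 {
    if failed_count == 0 {
        201 // Created: every file ingested
    } else {
        200 // OK: partial success, body carries the `failed` list
    }
}
```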