ADD: batch document ingest + bulk purge with partial success#57

Open
ljhh-0611 wants to merge 4 commits into refactoring-applied-backed-v2 from feat/document-batch

Conversation

Collaborator

@ljhh-0611 ljhh-0611 commented May 7, 2026

Summary

This PR adds batch-oriented document materialization APIs on top of the Speedwagon store, with partial-success semantics for both ingest and purge. The goal is to let clients upload or delete multiple documents in one request while still preserving successful work when individual files or ids fail.

It builds on the Speedwagon-backed session/message flow from #48 by exposing the document corpus through HTTP endpoints and covering the multi-document path in backend integration tests.

Changes

Speedwagon Store

  • Add Store::ingest_many returning IngestResult { succeeded, failed }.
    • Valid documents are accumulated and written to the index in one batch.
    • Per-file failures are recorded with the original item index.
    • Newly written intermediate corpus/origin files are cleaned up on failure where possible.
    • Duplicate documents are treated as successful and return the existing deterministic UUID.
  • Add Store::purge_many returning PurgeResult { purged, failed }.
    • Missing ids are reported as per-item failures instead of failing the whole request.
  • Update the preset corpus path to use the new batch ingest result shape.

Backend Document API

  • Add document DTOs in backend-v2/src/model/document.rs:
    • DocumentResponse
    • BatchIngestResponse
    • BatchPurgeResponse
    • FailedItem
    • BulkPurgeRequest
  • Extend backend routing with document endpoints:
    • GET /documents — list indexed documents
    • POST /documents — multipart multi-file ingest
    • GET /documents/{id} — fetch one document by id
    • DELETE /documents/{id} — purge one document by id
    • DELETE /documents — bulk purge via JSON body
  • Add per-file multipart validation and file-type mapping for pdf, md, markdown, and txt.
  • Return 201 Created when a batch ingest has no failures, and 200 OK when the request completed with partial failures.
  • Reject unsupported file types and empty bulk purge requests with client-facing errors instead of store-level panics.

Tests

  • Add a dedicated document_test integration suite covering:
    • empty document lists
    • single ingest/list/get/purge
    • duplicate ingest behavior
    • multi-file ingest
    • mixed success/failure ingest
    • unsupported file handling
    • bulk purge success and partial failure
    • missing document behavior
  • Expand tests/common with reusable document API helpers.
  • Rewrite the ignored E2E flow to ingest two documents over HTTP, ask the agent about both facts, bulk purge them, verify the list is empty, then confirm the session still responds after purge.

API Shape

Batch ingest

POST /documents
Content-Type: multipart/form-data
{
  "succeeded": [
    { "id": "...", "title": "...", "len": 123 }
  ],
  "failed": [
    { "name": "bad.exe", "error": "unsupported file type '.exe' — supported: pdf, md, txt" }
  ]
}

Bulk purge

DELETE /documents
Content-Type: application/json
{ "ids": ["..."] }
{
  "purged": ["..."],
  "failed": [
    { "name": "...", "error": "document not found" }
  ]
}

Validation

  • cargo fmt --check -p agent-k-backend
  • cargo check -p agent-k-backend
  • cargo test -p agent-k-backend --test document_test — 14 passed
  • cargo test -p agent-k-backend --test e2e_test -- --ignored --list — confirms the ignored live E2E test is present

Notes / Out of Scope

  • The live E2E test remains #[ignore] because it requires OPENAI_API_KEY and performs an actual model-backed RAG round trip.
  • Temporary API / Store contract warning: ingest_many currently follows the existing single-document Store behavior as closely as possible, but this is an initial API shape rather than a finalized long-term contract. As the Store is hardened (metadata generation, indexing strategy, failure recovery, corpus ownership, and migration behavior), the batch ingest API and its response/error semantics are expected to evolve with it.
  • This PR keeps the current single shared Speedwagon store design from Test speedwagon tool (#48); multi-tenant or per-user corpus ownership should be handled in a later API/auth integration pass.
  • Multipart upload currently accepts the supported document extensions only (pdf, md, markdown, txt). Additional formats should be added through the shared Speedwagon FileType/translator path.

ljhh-0611 and others added 2 commits May 7, 2026 15:14
- Add `Store::ingest_many` partial success (IngestResult/IngestFailure)
  with batch index optimization and best-effort cleanup on failure
- Add `Store::purge_many` (PurgeResult/PurgeFailure)
- POST /documents: multi-file multipart upload with per-file validation
- DELETE /documents: bulk purge via JSON body { ids: [...] }
- GET /documents/{id}: single document retrieval
- Response DTOs: BatchIngestResponse, BatchPurgeResponse, FailedItem
- 14 new document tests + e2e test rewritten for multi-doc HTTP flow

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Keep the document batch API behavior unchanged while removing a silent fallback in failure indexing and trimming duplicated test HTTP setup.

Constraint: Cleanup is scoped to PR #57 / a69222c document-batch changes only.
Rejected: Rewriting broader Speedwagon parser/tool clippy warnings | outside the requested commit scope.
Confidence: high
Scope-risk: narrow
Directive: Keep ingest_many response semantics provisional until the Store contract is hardened.
Tested: cargo fmt --check -p agent-k-backend -p speedwagon; cargo check -p agent-k-backend; cargo test -p agent-k-backend --test document_test; cargo test -p speedwagon --no-default-features --lib; cargo clippy -p agent-k-backend --tests
Not-tested: live ignored e2e RAG test requiring OPENAI_API_KEY
@ljhh-0611 ljhh-0611 marked this pull request as ready for review May 7, 2026 06:43
@ljhh-0611 ljhh-0611 requested a review from khj809 May 7, 2026 06:43
@ljhh-0611 ljhh-0611 self-assigned this May 7, 2026
@khj809
Contributor

khj809 commented May 7, 2026

There are some conflicts. Let me resolve them.

khj809 and others added 2 commits May 7, 2026 21:13
…rs/document.rs

- Extract document handlers from router.rs into handlers/document.rs
- Register document module in handlers/mod.rs
- Add document routes to router.rs using plain axum routing
  (Multipart/Query extractors don't implement aide OperationHandler)
- Update DEFAULT_MODEL in session.rs to gpt-5.4-mini from document-batch branch
- Resolve Store::new() API change in tests/common/mod.rs
- Update e2e_test.rs imports and AppState::new() signature

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
    to_index.push((idx, id, content));
}
Err(e) => {
    remove_ingest_artifact(&corpus_path);
Contributor

This line may remove a previously existing corpus file if fs::read_to_string fails. Suggest adding a flag that records whether the corpus file was newly created, and calling remove_ingest_artifact only if the flag is set:

let corpus_was_new = !corpus_path.exists();
// ...
if corpus_was_new {
    remove_ingest_artifact(&corpus_path);
}

Ok(title) => docs.push((id.to_string(), title, content)),
Err(e) => {
    let corpus_path = self.root.join("corpus").join(format!("{id}.md"));
    remove_ingest_artifact(&corpus_path);
Contributor

Same issue as above if parser::get_title fails.

#[derive(Clone, Debug, Deserialize, JsonSchema)]
#[serde(deny_unknown_fields)]
pub struct BulkPurgeRequest {
    pub ids: Vec<Uuid>,
Contributor

DocumentResponse.id is a String, but here ids is Vec<Uuid>. It would be better to unify them as either Uuid or String.

let mut filenames: Vec<String> = Vec::new();
let mut failed: Vec<FailedItem> = Vec::new();

while let Ok(Some(field)) = multipart.next_field().await {
Contributor

This loop terminates silently if multipart.next_field().await returns Err. Suggest fixing it as below, so the API can explicitly respond with a 400 error:

Suggested change
- while let Ok(Some(field)) = multipart.next_field().await {
+ while let Some(field) = multipart.next_field().await
+     .map_err(|e| AppError::bad_request(format!("multipart error: {e}")))? {
