ADD: batch document ingest + bulk purge with partial success#57

Open
ljhh-0611 wants to merge 4 commits into refactoring-applied-backed-v2 from feat/document-batch

Conversation

Collaborator

@ljhh-0611 ljhh-0611 commented May 7, 2026

Summary

This PR adds batch-oriented document materialization APIs on top of the Speedwagon store, with partial-success semantics for both ingest and purge. The goal is to let clients upload or delete multiple documents in one request while still preserving successful work when individual files or ids fail.

It builds on the Speedwagon-backed session/message flow from #48 by exposing the document corpus through HTTP endpoints and covering the multi-document path in backend integration tests.

Changes

Speedwagon Store

  • Add Store::ingest_many returning IngestResult { succeeded, failed }.
    • Valid documents are accumulated and written to the index in one batch.
    • Per-file failures are recorded with the original item index.
    • Newly written intermediate corpus/origin files are cleaned up on failure where possible.
    • Duplicate documents are treated as successful and return the existing deterministic UUID.
  • Add Store::purge_many returning PurgeResult { purged, failed }.
    • Missing ids are reported as per-item failures instead of failing the whole request.
  • Update the preset corpus path to use the new batch ingest result shape.

Backend Document API

  • Add document DTOs in backend-v2/src/model/document.rs:
    • DocumentResponse
    • BatchIngestResponse
    • BatchPurgeResponse
    • FailedItem
    • BulkPurgeRequest
  • Extend backend routing with document endpoints:
    • GET /documents — list indexed documents
    • POST /documents — multipart multi-file ingest
    • GET /documents/{id} — fetch one document by id
    • DELETE /documents/{id} — purge one document by id
    • DELETE /documents — bulk purge via JSON body
  • Add per-file multipart validation and file-type mapping for pdf, md, markdown, and txt.
  • Return 201 Created when a batch ingest has no failures, and 200 OK when the request completed with partial failures.
  • Reject unsupported file types and empty bulk purge requests with client-facing errors instead of store-level panics.

Tests

  • Add a dedicated document_test integration suite covering:
    • empty document lists
    • single ingest/list/get/purge
    • duplicate ingest behavior
    • multi-file ingest
    • mixed success/failure ingest
    • unsupported file handling
    • bulk purge success and partial failure
    • missing document behavior
  • Expand tests/common with reusable document API helpers.
  • Rewrite the ignored E2E flow to ingest two documents over HTTP, ask the agent about both facts, bulk purge them, verify the list is empty, then confirm the session still responds after purge.

API Shape

Batch ingest

POST /documents
Content-Type: multipart/form-data
{
  "succeeded": [
    { "id": "...", "title": "...", "len": 123 }
  ],
  "failed": [
    { "name": "bad.exe", "error": "unsupported file type '.exe' — supported: pdf, md, txt" }
  ]
}

Bulk purge

DELETE /documents
Content-Type: application/json
{ "ids": ["..."] }
{
  "purged": ["..."],
  "failed": [
    { "name": "...", "error": "document not found" }
  ]
}

Validation

  • cargo fmt --check -p agent-k-backend
  • cargo check -p agent-k-backend
  • cargo test -p agent-k-backend --test document_test — 14 passed
  • cargo test -p agent-k-backend --test e2e_test -- --ignored --list — confirms the ignored live E2E test is present

Notes / Out of Scope

  • The live E2E test remains #[ignore] because it requires OPENAI_API_KEY and performs an actual model-backed RAG round trip.
  • Temporary API / Store contract warning: ingest_many currently follows the existing single-document Store behavior as closely as possible, but this is an initial API shape rather than a finalized long-term contract. As the Store is hardened (metadata generation, indexing strategy, failure recovery, corpus ownership, and migration behavior), the batch ingest API and its response/error semantics are expected to evolve with it.
  • This PR keeps the current single shared Speedwagon store design from Test speedwagon tool (#48); multi-tenant or per-user corpus ownership should be handled in a later API/auth integration pass.
  • Multipart upload currently accepts the supported document extensions only (pdf, md, markdown, txt). Additional formats should be added through the shared Speedwagon FileType/translator path.

ljhh-0611 and others added 2 commits May 7, 2026 15:14
- Add `Store::ingest_many` partial success (IngestResult/IngestFailure)
  with batch index optimization and best-effort cleanup on failure
- Add `Store::purge_many` (PurgeResult/PurgeFailure)
- POST /documents: multi-file multipart upload with per-file validation
- DELETE /documents: bulk purge via JSON body { ids: [...] }
- GET /documents/{id}: single document retrieval
- Response DTOs: BatchIngestResponse, BatchPurgeResponse, FailedItem
- 14 new document tests + e2e test rewritten for multi-doc HTTP flow

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Keep the document batch API behavior unchanged while removing a silent fallback in failure indexing and trimming duplicated test HTTP setup.

Constraint: Cleanup is scoped to PR #57 / a69222c document-batch changes only.
Rejected: Rewriting broader Speedwagon parser/tool clippy warnings | outside the requested commit scope.
Confidence: high
Scope-risk: narrow
Directive: Keep ingest_many response semantics provisional until the Store contract is hardened.
Tested: cargo fmt --check -p agent-k-backend -p speedwagon; cargo check -p agent-k-backend; cargo test -p agent-k-backend --test document_test; cargo test -p speedwagon --no-default-features --lib; cargo clippy -p agent-k-backend --tests
Not-tested: live ignored e2e RAG test requiring OPENAI_API_KEY
@ljhh-0611 ljhh-0611 marked this pull request as ready for review May 7, 2026 06:43
@ljhh-0611 ljhh-0611 requested a review from khj809 May 7, 2026 06:43
@ljhh-0611 ljhh-0611 self-assigned this May 7, 2026
@khj809
Contributor

khj809 commented May 7, 2026

There are some conflicts. Let me resolve them.

khj809 and others added 2 commits May 7, 2026 21:13
…rs/document.rs

- Extract document handlers from router.rs into handlers/document.rs
- Register document module in handlers/mod.rs
- Add document routes to router.rs using plain axum routing
  (Multipart/Query extractors don't implement aide OperationHandler)
- Update DEFAULT_MODEL in session.rs to gpt-5.4-mini from document-batch branch
- Resolve Store::new() API change in tests/common/mod.rs
- Update e2e_test.rs imports and AppState::new() signature

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
    to_index.push((idx, id, content));
}
Err(e) => {
    remove_ingest_artifact(&corpus_path);
Contributor

This line may remove a previously existing corpus file if fs::read_to_string fails. Suggest adding a flag that records whether the corpus file was newly created, and calling remove_ingest_artifact only if the flag is set:

let corpus_was_new = !corpus_path.exists();
// ...
if corpus_was_new {
    remove_ingest_artifact(&corpus_path);
}

Ok(title) => docs.push((id.to_string(), title, content)),
Err(e) => {
    let corpus_path = self.root.join("corpus").join(format!("{id}.md"));
    remove_ingest_artifact(&corpus_path);
Contributor

Same issue as above if parser::get_title fails.

#[derive(Clone, Debug, Deserialize, JsonSchema)]
#[serde(deny_unknown_fields)]
pub struct BulkPurgeRequest {
    pub ids: Vec<Uuid>,
Contributor

DocumentResponse.id is a String, but here ids is Vec<Uuid>. It would be better to unify them as either Uuid or String.

let mut filenames: Vec<String> = Vec::new();
let mut failed: Vec<FailedItem> = Vec::new();

while let Ok(Some(field)) = multipart.next_field().await {
Contributor

This loop terminates silently if multipart.next_field().await returns Err. Suggest fixing it as below, so the API can explicitly respond with a 400 error:

Suggested change
- while let Ok(Some(field)) = multipart.next_field().await {
+ while let Some(field) = multipart.next_field().await
+     .map_err(|e| AppError::bad_request(format!("multipart error: {e}")))? {
