feat(speedwagon): add DescriptionAgent for KB-level description generation #47
Conversation
…ments

- Drop the "alongside other KBs" framing from the prompt opening so the model isn't primed to emit comparison vocabulary; the existing "do not compare" clause now matches the framing.
- Note Korean output's ~1/3 char density at the same budget; per-language budgets are deferred (the LLM can't count Korean words reliably either).
- Trim verbose doc-comments across description.rs and `Store::describe`; add a cost note (~24K input chars at N=200) to discourage synchronous use on indexing hot paths.

Verified against the 4-KB / 12-probe harness: routing accuracy stays 12/12 across the shipped baseline, the prompt-only change, and the word-budget variants. `cargo test -p speedwagon --lib description`: 12 passed, 1 ignored.
(Apologies for raising something unrelated; I discovered this while running this PR's tests.) @jhlee525 The behavior is irregular, so we may need to fix it. How should we modify this?
It appears that for stores containing a Korean document set, the generated description comes out in Korean. Just as an LLM thinks in English even while responding in Korean, using English for the description may work better. I do not have a strong preference regarding a fixed-language proposal; however, if you are interested in this, I would like to add a bit more justification for it. (Making the decision after verifying it through benchmarks could be one approach. However, since the characteristics of the currently available datasets are relatively clear, I suspect this is not a case LLMs would easily get confused by, and the scope of the decision is too small in comparison. So I would not recommend it.)
I agree with this direction. For AgentCard descriptions, English may also be more token-efficient in many cases, especially compared to Korean descriptions, while improving consistency and readability. |
Switch `&[(String, String)]` / `&[String]` to `&[(&str, &str)]` / `&[&str]` in `generate`, `get_description`, `build_user_message`, and `fallback_description`, and drop the upfront title/purpose clone in `Store::describe`. Caller-side `Document` strings are borrowed directly, and the fallback title vec is built only on the empty-LLM-response branch.
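A minimal sketch of the borrowed-slice shape, assuming hypothetical function bodies (only the signature style mirrors the PR; the real `build_user_message` lives in description.rs):

```rust
// Hypothetical sketch: borrowed-slice signatures let callers pass `&str`
// views of owned Document strings without an upfront `(String, String)` clone.
// The body below is illustrative, not the crate's actual rendering logic.
fn build_user_message(kb_name: &str, pairs: &[(&str, &str)]) -> String {
    let mut msg = format!("Knowledge base: {kb_name}\n");
    for (title, purpose) in pairs {
        msg.push_str(&format!("- {title}: {purpose}\n"));
    }
    msg
}

fn main() {
    // Owned strings, as a caller's Document list would hold them.
    let docs: Vec<(String, String)> =
        vec![("Intro".into(), "Overview of the system".into())];
    // Borrow directly from the owned strings; no clone on the caller side.
    let pairs: Vec<(&str, &str)> =
        docs.iter().map(|(t, p)| (t.as_str(), p.as_str())).collect();
    let msg = build_user_message("speedwagon", &pairs);
    assert!(msg.contains("- Intro: Overview of the system"));
}
```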
grf53 left a comment
In addition to the notes above, I believe we could consider feeding the LLM the existing description plus data from the modified documents, to avoid repeatedly inputting nearly identical data when re-describing.
However, since this method can skew the output toward the changes, and so requires observation and a higher-level policy, it would be advisable to revisit it when improvements are needed.
LGTM
Recency bias is actually the main reason for the current policy. Purpose is capped at 200 chars (~120 avg), so
Ticket
resolves #45
Summary
Adds a `DescriptionAgent` to the speedwagon crate that turns a KB's `(title, purpose)` list into one ~200-character description. Pairs with `PurposeAgent` from #41: same Ailoy stack, same JSON-with-fallback parser. Library work only: no DB column, no `Manual | Auto` toggle, no HTTP endpoint. A backend that wants to refresh its descriptions calls `Store::describe` and writes the result back.

The motivating failure mode: chat-agent reads each Speedwagon's `description` in two places (`chat-agent/src/speedwagon/dispatch.rs:110-114` as the tool description, and `backend/src/prompt.rs:66` as the system-prompt KB list). Today that field is plain user input that goes stale the moment documents change. An offline harness reproduced the silent-degradation case: with a four-KB / twelve-probe harness (three hand-written cross-domain questions per KB), a single empty description drops cross-domain routing accuracy from 12/12 (all probes routed to the right KB) to 11/12. All `X/12` numbers in this PR refer to this harness.

Changes
`description.rs` (new)

- `DescriptionAgent` mirrors `parser::PurposeAgent`. Inputs: KB name, optional instruction, the full `(title, purpose)` list. Output: a single line, target ~200 chars.
- `get_description` is the env-backed entry point, same shape as `parser::get_title` / `parser::get_purpose`: `dotenvy::dotenv().ok()`, build a provider from `OPENAI_API_KEY`, run the agent, swap in `fallback_description` if the body comes back empty.
- `fallback_description(N, top_titles)` writes `"{N} documents including: {top-5 titles}"`. The harness compared this against the empty-string fallback and `"{top-20 titles}"`; empty-string dropped to 11/12 (the no-description KB got bypassed when a query was ambiguous), and N + top-5 matched the longer fallback at a quarter of the tokens.

Prompt body
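The fallback shape can be sketched as follows (hypothetical body; the real implementation in description.rs may differ in details such as truncation of long titles):

```rust
// Hypothetical sketch of the N + top-5 fallback described above.
// Titles are the cheapest identity signal when the LLM is unavailable.
fn fallback_description(n: usize, top_titles: &[&str]) -> String {
    // Render at most five titles, keeping the fallback short.
    let shown: Vec<&str> = top_titles.iter().take(5).copied().collect();
    format!("{n} documents including: {}", shown.join(", "))
}

fn main() {
    let titles = ["A", "B", "C", "D", "E", "F", "G"];
    let s = fallback_description(7, &titles);
    assert_eq!(s, "7 documents including: A, B, C, D, E");
}
```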
The system instruction is hardcoded. The shipped prompt is a self-anchored revision — its opening sentence does not mention "alongside descriptions of other knowledge bases", so the model is not primed to emit comparison vocabulary. Earlier candidates that did mention peers still routed 12/12 on the harness, but the self-anchored framing keeps the prompt aligned with the later "do not compare" clause and is the cleaner default.
`Store::describe` (mod.rs)

`Store::describe(kb_name, instruction)` pulls `(title, purpose)` from the index it already owns and forwards to `get_description`. Empty index returns `""` without an LLM call. The doc-comment notes the cost: one LLM call, input ≈ 24K chars at N=200 docs; callers should not invoke this synchronously inside indexing hot paths.

Prompt sweep
Real outputs at length ~200, four-KB / twelve-probe harness, shipped prompt:
Descriptions are forced to English by the prompt regardless of document language; kpaperqa's output is in English even though its corpus is Korean. Neither the shipped prompt nor the earlier peer-aware candidate references other KBs by name.
Other knobs from the same harness, fixed as single values in the code:
- `AgentCard.title + purpose` vs `purpose`-only. Both 12/12; purpose-only saves ~25% of input tokens, so the prompt only renders `purpose`. The fallback string still uses titles since they are the cheapest identity signal when the LLM is unavailable.

The probe set itself:
The probe set is admittedly easy: four disjoint domains. The same exercise on near-domain KBs (e.g. `finance-2023` vs `finance-2024`) might land on different values; no such pair exists yet.
12 new unit tests (parser variants, fallback shape, user-message rendering, empty-store short-circuit) plus one
`#[ignore]`-gated integration test that prefills the index via `indexer::add_document` and runs `Store::describe` end-to-end.

The integration test passes locally in ~3s. Run it with:
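The exact command did not survive extraction; a plausible invocation, assuming the ignored test sits in the `description` module and follows cargo's standard `--ignored` convention:

```shell
# Hypothetical invocation; assumes OPENAI_API_KEY is set (or present in .env)
# and that the ignored test follows cargo's standard --ignored flag.
cargo test -p speedwagon --lib description -- --ignored
```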
Notes
`OPENAI_API_KEY`), same shape as `PurposeAgent` and `TitleAgent`. Unifying the three utility agents under one provider-routing abstraction belongs in a separate issue.