
feat(speedwagon): add DescriptionAgent for KB-level description generation#47

Merged
nuri-yoo merged 4 commits into refactoring-applied from feat/agentcard-description on May 6, 2026

Conversation

@nuri-yoo
Collaborator

@nuri-yoo nuri-yoo commented Apr 28, 2026

Ticket

resolves #45

Summary

Adds a DescriptionAgent to the speedwagon crate that turns a KB's (title, purpose) list into one ~200-character description. Pairs with PurposeAgent from #41: same Ailoy stack, same JSON-with-fallback parser. Library work only — no DB column, no Manual | Auto toggle, no HTTP endpoint. A backend that wants to refresh its descriptions calls Store::describe and writes the result back.

The motivating failure mode: chat-agent reads each Speedwagon's description in two places (chat-agent/src/speedwagon/dispatch.rs:110-114 as the tool description, and backend/src/prompt.rs:66 as the system prompt KB list). Today that field is plain user input that goes stale the moment documents change. An offline harness reproduced the silent-degradation case: with a four-KB / twelve-probe harness (three hand-written cross-domain questions per KB), a single empty description drops cross-domain routing accuracy from 12/12 (all probes routed to the right KB) to 11/12. All X/12 numbers in this PR refer to this harness.

Changes

description.rs (new)

DescriptionAgent mirrors parser::PurposeAgent. Inputs: KB name, optional instruction, the full (title, purpose) list. Output: a single line, target ~200 chars.
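The rendering step can be sketched as follows. This is an illustrative stand-in, not the actual description.rs code: the function name, message layout, and signature are assumptions; only the purpose-only rendering (titles are reserved for the fallback string, per the input-format sweep below) is taken from the PR.

```rust
// Hypothetical sketch of the user-message rendering. Only `purpose`
// is rendered into the prompt body; titles stay out of the LLM input
// and are used by the fallback string instead.
fn build_user_message(
    kb_name: &str,
    instruction: Option<&str>,
    docs: &[(&str, &str)], // (title, purpose) pairs
) -> String {
    let mut msg = format!("Knowledge base: {kb_name}\n");
    if let Some(instr) = instruction {
        msg.push_str(&format!("Instruction: {instr}\n"));
    }
    msg.push_str("Document purposes:\n");
    for (_title, purpose) in docs {
        msg.push_str(&format!("- {purpose}\n"));
    }
    msg
}
```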

get_description is the env-backed entry point — same shape as parser::get_title / parser::get_purpose: dotenvy::dotenv().ok(), build provider from OPENAI_API_KEY, run the agent, swap in fallback_description if the body comes back empty.

fallback_description(N, top_titles) writes "{N} documents including: {top-5 titles}". The harness compared this against the empty-string fallback and "{top-20 titles}"; empty-string dropped to 11/12 (the no-description KB got bypassed when a query was ambiguous), and N+top-5 matched the longer fallback at a quarter of the tokens.
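A minimal sketch of the fallback shape described above; the exact wording and signature in description.rs may differ:

```rust
// Fallback when the LLM body comes back empty: "{N} documents
// including: {top-5 titles}". Titles are the cheapest identity signal
// available without an LLM call.
fn fallback_description(n: usize, top_titles: &[&str]) -> String {
    let shown: Vec<&str> = top_titles.iter().take(5).copied().collect();
    format!("{n} documents including: {}", shown.join(", "))
}
```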

Prompt body

The system instruction is hardcoded. The shipped prompt is a self-anchored revision — its opening sentence does not mention "alongside descriptions of other knowledge bases", so the model is not primed to emit comparison vocabulary. Earlier candidates that did mention peers still routed 12/12 on the harness, but the self-anchored framing keeps the prompt aligned with the later "do not compare" clause and is the cleaner default.

You write a self-contained description of a knowledge base. This description will be read by a routing agent that picks the right knowledge base for a user's question. Inputs: KB name, optional instruction, and a list of one-line document purposes. Describe what is INSIDE this knowledge base — its document types, the entities and time periods covered, and the topics it can answer. Lead with the collective identity of the documents. Describe this KB on its own terms — do not compare it to other KBs, list what it excludes, or mention neighboring KB names. Output must NOT mention dataset names, QA pairs, paper IDs, contract IDs, or any metadata about how this knowledge base was assembled. Describe ONLY what documents are inside, as if a curator wrote it. Length: ~200 characters. Output a JSON object: {"description": "<text>"}.
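The JSON-with-fallback contract mentioned in the summary can be illustrated with a dependency-free sketch. The real parser shared with PurposeAgent presumably uses a proper JSON library; this naive version (no escape handling) only shows the contract: extract the "description" value, and return None on an empty or malformed body so the caller can substitute fallback_description.

```rust
// Naive, std-only sketch: pull the "description" value out of
// {"description": "<text>"}. Does NOT handle escaped quotes; the
// shipped parser is assumed to be more robust.
fn parse_description(body: &str) -> Option<String> {
    let key = "\"description\"";
    let start = body.find(key)? + key.len();
    let rest = &body[start..];
    let open = rest.find('"')? + 1; // opening quote of the value
    let close = rest[open..].find('"')? + open; // closing quote
    let text = rest[open..close].trim();
    if text.is_empty() {
        None // empty body triggers the fallback path
    } else {
        Some(text.to_string())
    }
}
```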

Store::describe (mod.rs)

Store::describe(kb_name, instruction) pulls (title, purpose) from the index it already owns and forwards to get_description. Empty index returns "" without an LLM call. The doc-comment notes the cost: one LLM call, input ≈ 24K chars at N=200 docs — callers should not invoke this synchronously inside indexing hot paths.
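The control flow can be sketched with stand-in types; `Store` and `describe_with` here are illustrative stubs, not the real mod.rs API. The point shown is the short-circuit: an empty index returns "" without ever invoking the generator (i.e. no LLM call).

```rust
// Stand-in for the Store::describe flow: empty index short-circuits,
// otherwise (title, purpose) pairs are borrowed and forwarded.
struct Store {
    index: Vec<(String, String)>, // (title, purpose)
}

impl Store {
    fn describe_with(&self, generate: impl Fn(&[(&str, &str)]) -> String) -> String {
        if self.index.is_empty() {
            return String::new(); // no LLM call for an empty index
        }
        let pairs: Vec<(&str, &str)> = self
            .index
            .iter()
            .map(|(t, p)| (t.as_str(), p.as_str()))
            .collect();
        generate(&pairs)
    }
}
```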

Prompt sweep

Real outputs at length ~200, four-KB / twelve-probe harness, shipped prompt:

| KB | description |
| --- | --- |
| financebench | Annual reports, 10-Ks, 10-Qs, 8-Ks, and earnings releases from major public companies, covering fiscal years and quarters from 2015–2024. Includes revenue, expenses, assets, debt, cash flow, guidance, operations, stock, and board events. |
| cuad | Commercial agreements and amendments covering stock offerings, licenses, supply, services, alliances, sponsorships, distribution, consulting, transportation, franchise, and hosting, spanning 1990s–2024 across corporations, banks, universities, startups, and vendors. |
| kpaperqa | Korean academic papers spanning medicine, engineering, science, education, design, policy, and social research, covering topics from the 1990s–2020s with experimental, clinical, survey, modeling, and review studies. |
| bioasq | Biomedical literature abstracts on diseases, mechanisms, diagnostics, imaging, therapeutics, protocols, and case reports across human, animal, and cell studies from varied years. |

Descriptions are forced to English by the prompt regardless of document language; kpaperqa's output is in English even though its corpus is Korean. Neither the shipped prompt nor the earlier peer-aware candidate references other KBs by name.

Other knobs from the same harness, fixed as single values in the code:

  • Output length: scanned 100/200/400/800 characters. All 12/12. 200 was the shortest length that still preserved a coherent identity sentence.
  • Output language: forced to English. Routing held 12/12 on the four-KB mix, including the Korean-corpus kpaperqa, where the prior language-following prompt produced a Korean description that ran ~90 chars (vs ~200 for the English KBs). English-only output also sidesteps encoding and external-transmission concerns once the description reaches an AgentCard.
  • Doc-count scaling: pushed kpaperqa to 200 docs. Output stayed at 100–130 chars regardless of input N. Input cost grows linearly (~24K chars at N=200). Beyond ~500 docs per KB this will need stratified sampling or a two-pass step.
  • Input format: title + purpose vs purpose-only. Both 12/12; purpose-only saves ~25% of input tokens, so the prompt only renders purpose. The fallback string still uses titles since they are the cheapest identity signal when the LLM is unavailable.
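The linear-growth claim in the doc-count bullet is simple arithmetic: purposes average roughly 120 chars (purpose is capped at 200, ~120 average per the review discussion), so input size is about avg-chars × N, which gives the ~24K figure at N=200. A back-of-envelope estimator, with illustrative constant names:

```rust
// Rough input-size model for Store::describe: linear in doc count.
// 120 avg purpose chars * 200 docs = 24_000 chars, matching the
// figure quoted in the doc-comment. Fixed prompt overhead ignored.
const AVG_PURPOSE_CHARS: usize = 120;

fn estimated_input_chars(n_docs: usize) -> usize {
    n_docs * AVG_PURPOSE_CHARS
}
```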

The probe set itself:

| expected KB | probe |
| --- | --- |
| financebench | What was Apple's iPhone revenue in fiscal year 2021? |
| financebench | How did Walmart's e-commerce segment perform in 2023? |
| financebench | Show me Microsoft's operating margin for the last reported year. |
| cuad | What does the termination clause of this commercial agreement say? |
| cuad | Find non-compete obligations in vendor contracts. |
| cuad | Who are the parties bound by exclusivity in this licensing deal? |
| kpaperqa | 한국 학술 논문에서 머신러닝 기법을 적용한 사례를 찾아줘. (Find cases where machine-learning techniques were applied, in Korean academic papers.) |
| kpaperqa | 국내 학술지 논문에서 이 연구의 결론은 무엇이지? (What is this study's conclusion, in domestic journal papers?) |
| kpaperqa | 한국 연구자들이 발표한 학술 논문에서 이 주제를 어떻게 다뤘어? (How do academic papers published by Korean researchers treat this topic?) |
| bioasq | What does the literature say about CRISPR-Cas9 off-target effects? |
| bioasq | Find PubMed abstracts on the role of TNF-alpha in rheumatoid arthritis. |
| bioasq | Summarize biomedical findings about gut microbiome and depression. |

The probe set is admittedly easy: four disjoint domains. The same exercise on near-domain KBs (e.g. finance-2023 vs finance-2024) might land on different values; no such KBs exist yet.

Tests

12 new unit tests (parser variants, fallback shape, user-message rendering, empty-store short-circuit) plus one #[ignore]-gated integration test that prefills the index via indexer::add_document and runs Store::describe end-to-end.

cargo test -p speedwagon --lib description
test result: ok. 12 passed; 0 failed; 1 ignored

The integration test passes locally in ~3s. Run it with:

OPENAI_API_KEY=... cargo test -p speedwagon describe_round_trips -- --ignored

Notes

  • Provider is environment-driven (OPENAI_API_KEY), same shape as PurposeAgent and TitleAgent. Unifying the three utility agents under one provider-routing abstraction belongs in a separate issue.
  • No backend wiring in this PR. Manual-vs-Auto policy, an HTTP endpoint to trigger regeneration, and an indexing-finalize hook all belong to whoever owns Speedwagon CRUD on the backend side. The library has no opinion.
  • Prompt is peer-blind by construction. Cross-KB cyclic prompts would help only if near-domain KBs share a routing surface, which the current four-KB mix does not exercise. Not needed yet.
  • Doc-count compression for >500-doc KBs is also deferred until a KB at that size shows up.

…ments

- Drop the "alongside other KBs" framing from the prompt opening so the model
  isn't primed to emit comparison vocabulary; the existing "do not compare"
  clause now matches the framing.
- Note Korean output's ~1/3 char density at the same budget; per-language
  budgets are deferred (LLM can't count Korean words reliably either).
- Trim verbose doc-comments across description.rs and Store::describe; add
  cost note (~24K input chars at N=200) to discourage synchronous use on
  indexing hot paths.

Verified against the 4-KB / 12-probe harness: routing accuracy stays 12/12
across the shipped baseline, the prompt-only change, and word-budget
variants. cargo test -p speedwagon --lib description: 12 passed, 1 ignored.
@nuri-yoo nuri-yoo marked this pull request as ready for review April 29, 2026 03:09
@nuri-yoo nuri-yoo requested review from grf53 and jhlee525 April 29, 2026 03:09
@nuri-yoo
Collaborator Author

@grf53 @jhlee525

@grf53
Contributor

grf53 commented Apr 29, 2026

(Apologies for bringing up something unrelated; I discovered this while running this PR's tests.)

@jhlee525
I found the dependency knowledge-base-examples = { path = "../../knowledge-base-examples", optional = true } in speedwagon/Cargo.toml.
This requires us to have the latest knowledge-base-examples checked out at that specific location.
Maybe it's because the repository is private.

It is irregular, so we may need to fix this. How should we modify it?

@grf53 grf53 linked an issue Apr 29, 2026 that may be closed by this pull request
@grf53
Contributor

grf53 commented Apr 29, 2026

kpaperqa 한국 학술논문·학위논문·연구보고서가 모인 지식베이스로, 교육·의학·공학·농업·환경·사회과학 전반의 연구 주제, 실험·사례·설문·리뷰 자료와 한국어/영문 논문을 담고 있습니다. (Translation: A knowledge base collecting Korean academic papers, theses, and research reports, covering research topics across education, medicine, engineering, agriculture, environment, and the social sciences, with experimental, case-study, survey, and review materials in both Korean and English.)

It appears that for stores containing a Korean document set, the description may appear in Korean. I suspect this is because the titles and purposes of individual documents are written in Korean.

Since the title and purpose are included in BM25 search targets, I believe maintaining the original language is a good decision. However, regarding AgentCard descriptions, they are intended entirely for LLMs (or occasionally humans), so I do not think it is absolutely necessary to maintain the original language.

Just as an LLM thinks in English even while responding in Korean, if English descriptions help the LLM select sub-agents more accurately (I'm not certain of this), it would be better to fix the description language to English.
Given that AgentCard descriptions can be transmitted externally, fixing the language to English may also be advantageous for encoding and for the occasional human reader.

I don't have a strong preference on fixing the language; if you are interested in it, though, I'd like to add a bit more justification.

(Deciding only after verifying through benchmarks could be an approach. However, the currently available datasets are distinct enough that I doubt LLMs would be confused by them, and the scope of the decision seems too small to justify the effort, so I wouldn't recommend that.)

@nuri-yoo
Copy link
Copy Markdown
Collaborator Author


I agree with this direction. For AgentCard descriptions, English may also be more token-efficient in many cases, especially compared to Korean, while improving consistency and readability.
I will follow up with a commit applying this approach without a benchmark.

@nuri-yoo
Collaborator Author

@grf53 Applied in 287b649. Added "Write the description in English regardless of the document language." to the system instruction. PR body updated as well.

Comment thread on speedwagon/src/store/description.rs (Outdated)
Switch `&[(String, String)]` / `&[String]` to `&[(&str, &str)]` /
`&[&str]` in `generate`, `get_description`, `build_user_message`, and
`fallback_description`, and drop the upfront title/purpose clone in
`Store::describe`. Caller-side `Document` strings are borrowed directly,
and the fallback title vec is built only on the empty-LLM-response
branch.
Contributor

@grf53 grf53 left a comment


In addition to the notes mentioned, I believe we could consider using the existing description together with data from modified documents, to avoid repeatedly feeding nearly identical input into the LLM when describing.

However, since this method can skew the output toward the recent changes, and so requires observation and policy from a higher vantage point, it would be advisable to revisit it when improvements are needed.

LGTM

@nuri-yoo
Collaborator Author

nuri-yoo commented May 4, 2026


Recency bias is actually the main reason for the current policy. Purpose is capped at 200 chars (~120 avg), so the per-call cost is bearable for now. It would be good to discuss this further later on.

@nuri-yoo nuri-yoo merged commit 76839b3 into refactoring-applied May 6, 2026
@nuri-yoo nuri-yoo deleted the feat/agentcard-description branch May 6, 2026 04:36


Development

Successfully merging this pull request may close these issues.

Generate Agent card description from Source

3 participants