
feat(speedwagon): add DescriptionAgent for KB-level description generation#47

Merged
nuri-yoo merged 4 commits into refactoring-applied from feat/agentcard-description on May 6, 2026

Conversation

@nuri-yoo
Collaborator

@nuri-yoo nuri-yoo commented Apr 28, 2026

Ticket

resolves #45

Summary

Adds a DescriptionAgent to the speedwagon crate that turns a KB's (title, purpose) list into one ~200-character description. Pairs with PurposeAgent from #41: same Ailoy stack, same JSON-with-fallback parser. Library work only — no DB column, no Manual | Auto toggle, no HTTP endpoint. A backend that wants to refresh its descriptions calls Store::describe and writes the result back.

The motivating failure mode: chat-agent reads each Speedwagon's description in two places (chat-agent/src/speedwagon/dispatch.rs:110-114 as the tool description, and backend/src/prompt.rs:66 as the system prompt KB list). Today that field is plain user input that goes stale the moment documents change. An offline harness reproduced the silent-degradation case: with a four-KB / twelve-probe harness (three hand-written cross-domain questions per KB), a single empty description drops cross-domain routing accuracy from 12/12 (all probes routed to the right KB) to 11/12. All X/12 numbers in this PR refer to this harness.

Changes

description.rs (new)

DescriptionAgent mirrors parser::PurposeAgent. Inputs: KB name, optional instruction, the full (title, purpose) list. Output: a single line, target ~200 chars.
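The rendering step can be sketched as follows. This is an illustrative stand-in, not the actual description.rs code: the function name, message layout, and signature are assumptions; only the purpose-only rendering (titles are reserved for the fallback string, per the input-format sweep below) is taken from the PR.

```rust
// Hypothetical sketch of the user-message rendering. Only `purpose`
// is rendered into the prompt body; titles stay out of the LLM input
// and are used by the fallback string instead.
fn build_user_message(
    kb_name: &str,
    instruction: Option<&str>,
    docs: &[(&str, &str)], // (title, purpose) pairs
) -> String {
    let mut msg = format!("Knowledge base: {kb_name}\n");
    if let Some(instr) = instruction {
        msg.push_str(&format!("Instruction: {instr}\n"));
    }
    msg.push_str("Document purposes:\n");
    for (_title, purpose) in docs {
        msg.push_str(&format!("- {purpose}\n"));
    }
    msg
}
```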

get_description is the env-backed entry point — same shape as parser::get_title / parser::get_purpose: dotenvy::dotenv().ok(), build provider from OPENAI_API_KEY, run the agent, swap in fallback_description if the body comes back empty.

fallback_description(N, top_titles) writes "{N} documents including: {top-5 titles}". The harness compared this against the empty-string fallback and "{top-20 titles}"; empty-string dropped to 11/12 (the no-description KB got bypassed when a query was ambiguous), and N+top-5 matched the longer fallback at a quarter of the tokens.
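A minimal sketch of the fallback shape described above; the exact wording and signature in description.rs may differ:

```rust
// Fallback when the LLM body comes back empty: "{N} documents
// including: {top-5 titles}". Titles are the cheapest identity signal
// available without an LLM call.
fn fallback_description(n: usize, top_titles: &[&str]) -> String {
    let shown: Vec<&str> = top_titles.iter().take(5).copied().collect();
    format!("{n} documents including: {}", shown.join(", "))
}
```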

Prompt body

The system instruction is hardcoded. The shipped prompt is a self-anchored revision — its opening sentence does not mention "alongside descriptions of other knowledge bases", so the model is not primed to emit comparison vocabulary. Earlier candidates that did mention peers still routed 12/12 on the harness, but the self-anchored framing keeps the prompt aligned with the later "do not compare" clause and is the cleaner default.

You write a self-contained description of a knowledge base. This description will be read by a routing agent that picks the right knowledge base for a user's question. Inputs: KB name, optional instruction, and a list of one-line document purposes. Describe what is INSIDE this knowledge base — its document types, the entities and time periods covered, and the topics it can answer. Lead with the collective identity of the documents. Describe this KB on its own terms — do not compare it to other KBs, list what it excludes, or mention neighboring KB names. Output must NOT mention dataset names, QA pairs, paper IDs, contract IDs, or any metadata about how this knowledge base was assembled. Describe ONLY what documents are inside, as if a curator wrote it. Length: ~200 characters. Output a JSON object: {"description": "<text>"}.
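The JSON-with-fallback contract mentioned in the summary can be illustrated with a dependency-free sketch. The real parser shared with PurposeAgent presumably uses a proper JSON library; this naive version (no escape handling) only shows the contract: extract the "description" value, and return None on an empty or malformed body so the caller can substitute fallback_description.

```rust
// Naive, std-only sketch: pull the "description" value out of
// {"description": "<text>"}. Does NOT handle escaped quotes; the
// shipped parser is assumed to be more robust.
fn parse_description(body: &str) -> Option<String> {
    let key = "\"description\"";
    let start = body.find(key)? + key.len();
    let rest = &body[start..];
    let open = rest.find('"')? + 1; // opening quote of the value
    let close = rest[open..].find('"')? + open; // closing quote
    let text = rest[open..close].trim();
    if text.is_empty() {
        None // empty body triggers the fallback path
    } else {
        Some(text.to_string())
    }
}
```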

Store::describe (mod.rs)

Store::describe(kb_name, instruction) pulls (title, purpose) from the index it already owns and forwards to get_description. Empty index returns "" without an LLM call. The doc-comment notes the cost: one LLM call, input ≈ 24K chars at N=200 docs — callers should not invoke this synchronously inside indexing hot paths.
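The control flow can be sketched with stand-in types; `Store` and `describe_with` here are illustrative stubs, not the real mod.rs API. The point shown is the short-circuit: an empty index returns "" without ever invoking the generator (i.e. no LLM call).

```rust
// Stand-in for the Store::describe flow: empty index short-circuits,
// otherwise (title, purpose) pairs are borrowed and forwarded.
struct Store {
    index: Vec<(String, String)>, // (title, purpose)
}

impl Store {
    fn describe_with(&self, generate: impl Fn(&[(&str, &str)]) -> String) -> String {
        if self.index.is_empty() {
            return String::new(); // no LLM call for an empty index
        }
        let pairs: Vec<(&str, &str)> = self
            .index
            .iter()
            .map(|(t, p)| (t.as_str(), p.as_str()))
            .collect();
        generate(&pairs)
    }
}
```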

Prompt sweep

Real outputs at length ~200, four-KB / twelve-probe harness, shipped prompt:

| KB | description |
| --- | --- |
| financebench | Annual reports, 10-Ks, 10-Qs, 8-Ks, and earnings releases from major public companies, covering fiscal years and quarters from 2015–2024. Includes revenue, expenses, assets, debt, cash flow, guidance, operations, stock, and board events. |
| cuad | Commercial agreements and amendments covering stock offerings, licenses, supply, services, alliances, sponsorships, distribution, consulting, transportation, franchise, and hosting, spanning 1990s–2024 across corporations, banks, universities, startups, and vendors. |
| kpaperqa | Korean academic papers spanning medicine, engineering, science, education, design, policy, and social research, covering topics from the 1990s–2020s with experimental, clinical, survey, modeling, and review studies. |
| bioasq | Biomedical literature abstracts on diseases, mechanisms, diagnostics, imaging, therapeutics, protocols, and case reports across human, animal, and cell studies from varied years. |

Descriptions are forced to English by the prompt regardless of document language; kpaperqa's output is in English even though its corpus is Korean. Neither the shipped prompt nor the earlier peer-aware candidate references other KBs by name.

Other knobs from the same harness, fixed as single values in the code:

  • Output length: scanned 100/200/400/800 characters. All 12/12. 200 was the shortest length that still preserved a coherent identity sentence.
  • Output language: forced to English. Routing held 12/12 on the four-KB mix, including the Korean-corpus kpaperqa, where the prior language-following prompt produced a Korean description that ran ~90 chars (vs ~200 for the English KBs). English-only output also sidesteps encoding and external-transmission concerns once the description reaches an AgentCard.
  • Doc-count scaling: pushed kpaperqa to 200 docs. Output stayed at 100–130 chars regardless of input N. Input cost grows linearly (~24K chars at N=200). Beyond ~500 docs per KB this will need stratified sampling or a two-pass step.
  • Input format: title + purpose vs purpose-only. Both 12/12; purpose-only saves ~25% of input tokens, so the prompt only renders purpose. The fallback string still uses titles since they are the cheapest identity signal when the LLM is unavailable.
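The linear-growth claim in the doc-count bullet is simple arithmetic: purposes average roughly 120 chars (purpose is capped at 200, ~120 average per the review discussion), so input size is about avg-chars × N, which gives the ~24K figure at N=200. A back-of-envelope estimator, with illustrative constant names:

```rust
// Rough input-size model for Store::describe: linear in doc count.
// 120 avg purpose chars * 200 docs = 24_000 chars, matching the
// figure quoted in the doc-comment. Fixed prompt overhead ignored.
const AVG_PURPOSE_CHARS: usize = 120;

fn estimated_input_chars(n_docs: usize) -> usize {
    n_docs * AVG_PURPOSE_CHARS
}
```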

The probe set itself:

| expected KB | probe |
| --- | --- |
| financebench | What was Apple's iPhone revenue in fiscal year 2021? |
| financebench | How did Walmart's e-commerce segment perform in 2023? |
| financebench | Show me Microsoft's operating margin for the last reported year. |
| cuad | What does the termination clause of this commercial agreement say? |
| cuad | Find non-compete obligations in vendor contracts. |
| cuad | Who are the parties bound by exclusivity in this licensing deal? |
| kpaperqa | 한국 학술 논문에서 머신러닝 기법을 적용한 사례를 찾아줘. (Find cases where machine-learning techniques were applied, in Korean academic papers.) |
| kpaperqa | 국내 학술지 논문에서 이 연구의 결론은 무엇이지? (What is this study's conclusion, in domestic journal papers?) |
| kpaperqa | 한국 연구자들이 발표한 학술 논문에서 이 주제를 어떻게 다뤘어? (How do academic papers published by Korean researchers treat this topic?) |
| bioasq | What does the literature say about CRISPR-Cas9 off-target effects? |
| bioasq | Find PubMed abstracts on the role of TNF-alpha in rheumatoid arthritis. |
| bioasq | Summarize biomedical findings about gut microbiome and depression. |

The probe set is admittedly easy: four disjoint domains. The same exercise on near-domain KBs (e.g. finance-2023 vs finance-2024) might land on different values; no such KBs exist yet.

Tests

12 new unit tests (parser variants, fallback shape, user-message rendering, empty-store short-circuit) plus one #[ignore]-gated integration test that prefills the index via indexer::add_document and runs Store::describe end-to-end.

cargo test -p speedwagon --lib description
test result: ok. 12 passed; 0 failed; 1 ignored

The integration test passes locally in ~3s. Run it with:

OPENAI_API_KEY=... cargo test -p speedwagon describe_round_trips -- --ignored

Notes

  • Provider is environment-driven (OPENAI_API_KEY), same shape as PurposeAgent and TitleAgent. Unifying the three utility agents under one provider-routing abstraction belongs in a separate issue.
  • No backend wiring in this PR. Manual-vs-Auto policy, an HTTP endpoint to trigger regeneration, and an indexing-finalize hook all belong to whoever owns Speedwagon CRUD on the backend side. The library has no opinion.
  • Prompt is peer-blind by construction. Cross-KB cyclic prompts would help only if near-domain KBs share a routing surface, which the current four-KB mix does not exercise. Not needed yet.
  • Doc-count compression for >500-doc KBs is also deferred until a KB at that size shows up.

…ments

- Drop the "alongside other KBs" framing from the prompt opening so the model
  isn't primed to emit comparison vocabulary; the existing "do not compare"
  clause now matches the framing.
- Note Korean output's ~1/3 char density at the same budget; per-language
  budgets are deferred (LLM can't count Korean words reliably either).
- Trim verbose doc-comments across description.rs and Store::describe; add
  cost note (~24K input chars at N=200) to discourage synchronous use on
  indexing hot paths.

Verified against the 4-KB / 12-probe harness: routing accuracy stays 12/12
across the shipped baseline, the prompt-only change, and word-budget
variants. cargo test -p speedwagon --lib description: 12 passed, 1 ignored.
@nuri-yoo nuri-yoo marked this pull request as ready for review April 29, 2026 03:09
@nuri-yoo nuri-yoo requested review from grf53 and jhlee525 April 29, 2026 03:09
@nuri-yoo
Collaborator Author

@grf53 @jhlee525

@grf53
Contributor

grf53 commented Apr 29, 2026

(Apologies for bringing up something unrelated; I discovered this while running this PR's tests.)

@jhlee525
I found the dependency knowledge-base-examples = { path = "../../knowledge-base-examples", optional = true } in speedwagon/Cargo.toml.
This requires us to have the latest knowledge-base-examples checked out at that specific location.
Maybe it's because the repository is private.

It is irregular, so we may need to fix this. How should we modify it?

@grf53 grf53 linked an issue Apr 29, 2026 that may be closed by this pull request
@grf53
Contributor

grf53 commented Apr 29, 2026

kpaperqa 한국 학술논문·학위논문·연구보고서가 모인 지식베이스로, 교육·의학·공학·농업·환경·사회과학 전반의 연구 주제, 실험·사례·설문·리뷰 자료와 한국어/영문 논문을 담고 있습니다. (Translation: A knowledge base collecting Korean academic papers, theses, and research reports, covering research topics across education, medicine, engineering, agriculture, environment, and the social sciences, with experimental, case-study, survey, and review materials in both Korean and English.)

It appears that for stores containing a Korean document set, the description may appear in Korean. I suspect this is because the titles and purposes of individual documents are written in Korean.

Since the title and purpose are included in BM25 search targets, I believe maintaining the original language is a good decision. However, regarding AgentCard descriptions, they are intended entirely for LLMs (or occasionally humans), so I do not think it is absolutely necessary to maintain the original language.

Just as an LLM thinks in English even while responding in Korean, if English descriptions help the LLM select sub-agents more accurately (I'm not certain of this), it would be better to fix the description language to English.
Given that AgentCard descriptions can be transmitted externally, fixing the language to English may also be advantageous for encoding and for the occasional human reader.

I don't have a strong preference on fixing the language; if you are interested in it, though, I'd like to add a bit more justification.

(Deciding only after verifying through benchmarks could be an approach. However, the currently available datasets are distinct enough that I doubt LLMs would be confused by them, and the scope of the decision seems too small to justify the effort, so I wouldn't recommend that.)

@nuri-yoo
Copy link
Copy Markdown
Collaborator Author


I agree with this direction. For AgentCard descriptions, English may also be more token-efficient in many cases, especially compared to Korean, while improving consistency and readability.
I will follow up with a commit applying this approach without a benchmark.

@nuri-yoo
Collaborator Author

@grf53 Applied in 287b649. Added "Write the description in English regardless of the document language." to the system instruction. PR body updated as well.

Comment thread on speedwagon/src/store/description.rs (Outdated)
Switch `&[(String, String)]` / `&[String]` to `&[(&str, &str)]` /
`&[&str]` in `generate`, `get_description`, `build_user_message`, and
`fallback_description`, and drop the upfront title/purpose clone in
`Store::describe`. Caller-side `Document` strings are borrowed directly,
and the fallback title vec is built only on the empty-LLM-response
branch.
Contributor

@grf53 grf53 left a comment


In addition to the notes mentioned, I believe we could consider using the existing description together with data from modified documents, to avoid repeatedly feeding nearly identical input into the LLM when describing.

However, since this method can skew the output toward the recent changes, and so requires observation and policy from a higher vantage point, it would be advisable to revisit it when improvements are needed.

LGTM

@nuri-yoo
Collaborator Author

nuri-yoo commented May 4, 2026


Recency bias is actually the main reason for the current policy. Purpose is capped at 200 chars (~120 avg), so the per-call cost is bearable for now. It would be good to discuss this further later on.

@nuri-yoo nuri-yoo merged commit 76839b3 into refactoring-applied May 6, 2026
@nuri-yoo nuri-yoo deleted the feat/agentcard-description branch May 6, 2026 04:36


Development

Successfully merging this pull request may close these issues.

Generate Agent card description from Source

3 participants