Embedding and rerank calls currently talk to rkllama directly via qmd-client. Refactor so the memory_search skill and any RAG path submit Tasks to the scheduler with:
- capability = Capability.EMBEDDING
- preferred_resources = [npu-rk3588 with max_wait_ms=200, cpu-inference]
- priority = Priority.INTERACTIVE_AGENT
This is the load-bearing example in docs/design/resource-scheduler.md §Worked example — chat-while-generating. When image gen holds the NPU, memory lookups should transparently route to CPU embedding in ~500ms rather than blocking behind 34s of SD.
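The routing described above can be sketched as follows. This is a minimal illustration only: `Task`, `Capability`, `Priority`, and the `ResourcePreference` shape are assumed names for whatever the scheduler actually exposes, and `build_embedding_task` is a hypothetical helper, not existing code.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Optional

# Hypothetical type shapes; the real scheduler's API may differ.
class Capability(Enum):
    EMBEDDING = auto()

class Priority(Enum):
    INTERACTIVE_AGENT = auto()

@dataclass
class ResourcePreference:
    resource: str
    max_wait_ms: Optional[int] = None  # None = no deadline on this resource

@dataclass
class Task:
    capability: Capability
    priority: Priority
    preferred_resources: list  # ordered: scheduler tries these in sequence
    payload: dict = field(default_factory=dict)

def build_embedding_task(texts):
    """What memory_search / RAG paths would submit to the scheduler
    instead of calling rkllama directly via qmd-client."""
    return Task(
        capability=Capability.EMBEDDING,
        priority=Priority.INTERACTIVE_AGENT,
        preferred_resources=[
            # Prefer the NPU, but only wait 200 ms for it to free up...
            ResourcePreference("npu-rk3588", max_wait_ms=200),
            # ...then fall back to CPU embedding so a memory lookup
            # finishes in ~500 ms instead of queueing behind SD.
            ResourcePreference("cpu-inference"),
        ],
        payload={"texts": texts},
    )

task = build_embedding_task(["what did we decide about backups?"])
print(task.preferred_resources[0].max_wait_ms)  # 200
```

The key design point is that the caller expresses an ordered preference with a bounded wait, so the fallback decision lives in the scheduler rather than in each skill.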