This document describes how to measure whether opencode-evermemos-plugin improves agent performance compared to:
- no memory plugin
- supermemory
The goal is to measure actual task performance, not just whether memory retrieval "looks good" in logs.
A better memory system should deliver some or all of the following:
- higher task success rate
- fewer turns to complete tasks
- better recall of prior facts, preferences, and actions
- better continuity after long gaps or compaction
- low latency overhead
- low prompt bloat
- no privacy regressions
For this plugin, the most important dimensions are:
- recall quality
- continuity across turns and compaction
- usefulness of stored tool/action memory
- latency cost
- prompt-size cost
- privacy safety
Run the same benchmark in three modes:
- no memory plugin
- supermemory
- opencode-evermemos-plugin
Keep these fixed across runs:
- same model
- same repo/worktree
- same prompts
- same starting files
- same environment
- same permissions
- same temperature (ideally 0)
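A minimal sketch of pinning these parameters in code, assuming a hypothetical `RunConfig` shape (the model name and paths are placeholders, not part of any real opencode or plugin API):

```ts
// Hypothetical run configuration; field names and values are illustrative.
type MemoryMode = "none" | "supermemory" | "evermemos";

interface RunConfig {
  mode: MemoryMode;
  model: string;        // pin one model across all modes
  repoPath: string;     // same repo/worktree snapshot for every run
  temperature: number;  // 0 for maximum repeatability
}

const baseConfig: Omit<RunConfig, "mode"> = {
  model: "your-pinned-model",            // placeholder
  repoPath: "./fixtures/benchmark-repo", // placeholder
  temperature: 0,
};

const runConfigs: RunConfig[] = (["none", "supermemory", "evermemos"] as const)
  .map((mode) => ({ ...baseConfig, mode }));
```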
Use tasks that actually depend on memory.
Seed a user preference such as:
Prefer small focused patches and no unrelated refactors.
Later ask the agent to make a change and score whether it follows that preference.
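One way to keep such a task repeatable is to encode it as data. A sketch, assuming a hypothetical `BenchmarkTask` shape that the later examples reuse:

```ts
// Hypothetical task shape; every field name here is an assumption.
interface BenchmarkTask {
  id: string;
  seedTurns: string[];   // turns that plant the memory
  fillerTurns: string[]; // unrelated turns that push the seed out of context
  probe: string;         // the turn whose outcome depends on the memory
  check: (finalAnswer: string) => boolean; // crude automated pass/fail
}

const preferenceRecall: BenchmarkTask = {
  id: "preference-recall-01",
  seedTurns: ["Prefer small focused patches and no unrelated refactors."],
  fillerTurns: ["Summarize README.md.", "List the test files."],
  probe: "Rename the config loader and update its call sites.",
  // Toy check; a real one would score the actual git diff, and a human
  // rubric score (see below) is more reliable than string matching.
  check: (answer) => /small|focused|only the/i.test(answer),
};
```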
Seed project facts such as:
- framework
- database
- architecture conventions
- naming rules
Later ask questions that require those facts after several turns.
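A seed prompt for these facts might look like the following; the facts themselves are made up for illustration, so substitute whatever matches your benchmark repo:

```ts
// Illustrative project facts to seed in an early turn.
const projectFacts = [
  "Framework: Fastify",
  "Database: Postgres",
  "Convention: route handlers live in src/routes/",
  "Naming: service classes end in 'Service'",
];

const seedPrompt =
  "For context, remember these project facts:\n" +
  projectFacts.map((fact) => `- ${fact}`).join("\n");
```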
Have the agent make a file edit, then later ask:
What did we change in auth?
This tests whether action-derived memory is actually useful.
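A crude automated check is to require the later answer to name the file and symbol that were actually edited; the expected tokens below are per-task assumptions:

```ts
// Pass only if the answer names what was actually changed in the seed turn.
function checkActionRecall(answer: string, expected: string[]): boolean {
  return expected.every((token) => answer.includes(token));
}

// Example: the seed turn added a guard in src/auth/middleware.ts.
const answer = "We added a requireSession guard in src/auth/middleware.ts.";
checkActionRecall(answer, ["src/auth/middleware.ts", "requireSession"]); // true
```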
Seed code-specific identifiers such as:
`validateToken`, `AuthMiddleware`, `SessionCache`
Later ask about them after enough context has passed that the base model should no longer have them in short-term context.
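A probe for this might look like the sketch below; the pass condition is deliberately strict, since exact identifier recall is the point:

```ts
// Seeded identifiers (from the earlier turn) and a strict recall check.
const seededSymbols = ["validateToken", "AuthMiddleware", "SessionCache"];

const symbolProbe =
  "Without re-reading the code: which function validates tokens, and " +
  "which class caches sessions?";

function checkSymbolRecall(answer: string): boolean {
  // Require the exact identifiers, not paraphrases.
  return answer.includes("validateToken") && answer.includes("SessionCache");
}
```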
Create a long enough session to trigger compaction, then ask about facts or actions mentioned much earlier.
This is a key test for persistent memory value.
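Filler turns are an easy way to force the session past the compaction threshold; the count below is a guess to tune against your model's context window:

```ts
// Generate unrelated turns until compaction kicks in. 40 is an assumption;
// adjust until you can observe compaction actually happening.
function makeFillerTurns(count: number): string[] {
  return Array.from(
    { length: count },
    (_, i) => `Unrelated question ${i + 1}: explain HTTP status code ${400 + (i % 100)}.`
  );
}

const compactionFiller = makeFillerTurns(40);
```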
Include content such as:
`this is secret <private>abc123</private>`
Then verify that:
- the private payload is not stored
- it is not injected later
- it does not show up in memory tools
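A canary string makes all three checks mechanical. A sketch, assuming you can dump stored memories, injected prompt segments, and memory-tool outputs as arrays of strings:

```ts
// "abc123" is the canary from the seeded <private> payload above.
const PRIVATE_CANARY = "abc123";

function privacyViolations(entries: string[]): string[] {
  return entries.filter((entry) => entry.includes(PRIVATE_CANARY));
}

// A run passes only if all three dumps come back clean:
// privacyViolations(storedMemories).length === 0
// privacyViolations(injectedSegments).length === 0
// privacyViolations(memoryToolOutputs).length === 0
```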
Track these for every run:
- task_success_rate
- turns_to_success
- time_to_first_useful_answer
- extra_latency_per_turn
- memory_precision
- memory_contamination
- memory_usefulness
- prompt_overhead
- privacy_failure_rate
task_success_rate
- Whether the agent actually solved the task.
turns_to_success
- Number of user/assistant turns needed before the task is completed correctly.
time_to_first_useful_answer
- Time until the first answer that materially advances the task.
extra_latency_per_turn
- Added time caused by memory retrieval or storage behavior.
memory_precision
- Fraction of recalled memories that were relevant to the current task.
memory_contamination
- How often wrong-project, wrong-user, or stale/irrelevant memory was injected.
memory_usefulness
- Recalled memory changed the outcome for the better rather than being decorative.
prompt_overhead
- Injected characters, or estimated tokens, added per turn.
privacy_failure_rate
- How often sensitive or `<private>`-tagged content appears in stored or injected memory.
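These fit naturally into one record per (task, mode, run). A sketch with assumed field names mirroring the metric list above:

```ts
// One row per (task, mode, run); field names mirror the metrics above.
interface RunMetrics {
  taskId: string;
  mode: "none" | "supermemory" | "evermemos";
  taskSuccess: boolean;
  turnsToSuccess: number | null;      // null if never solved
  timeToFirstUsefulAnswerMs: number;
  extraLatencyPerTurnMs: number;
  memoryPrecision: 0 | 1 | 2;         // rubric score, defined below
  memoryContamination: boolean;
  memoryUsefulness: 0 | 1 | 2;        // rubric score
  promptOverheadChars: number;
  privacyFailure: boolean;
}
```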
For memory-related quality metrics, use a lightweight rubric:
- 2 = clearly helped
- 1 = neutral
- 0 = hurt, wrong, or noisy
This works well for:
- memory precision
- memory usefulness
- continuity after compaction
- action-memory usefulness
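Aggregating is then just averaging scores per mode:

```ts
type RubricScore = 0 | 1 | 2;

// Mean rubric score per mode; at ~10 runs per mode a simple mean is enough.
function meanScore(scores: RubricScore[]): number {
  return scores.reduce((sum, s) => sum + s, 0) / scores.length;
}
```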
Do not rely on casual transcripts alone. Use a repeatable structure:
- seed memory in an earlier turn or session
- push the seeded content out of short-term context by adding unrelated turns or forcing compaction
- ask a task that depends on the seeded memory
- compare the result across all three modes
This isolates memory quality from raw model capability.
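Put together, one task run is a short loop. A sketch reusing the hypothetical `BenchmarkTask` and `MemoryMode` shapes from the earlier examples; `runTurn` is a stand-in for however you drive the agent (CLI, SDK, or otherwise):

```ts
// Stand-in for the actual agent driver; its implementation is out of scope.
declare function runTurn(mode: MemoryMode, prompt: string): Promise<string>;

async function runTask(task: BenchmarkTask, mode: MemoryMode): Promise<boolean> {
  for (const turn of task.seedTurns) await runTurn(mode, turn);   // 1. seed
  for (const turn of task.fillerTurns) await runTurn(mode, turn); // 2. displace
  const answer = await runTurn(mode, task.probe);                 // 3. probe
  return task.check(answer);                                      // 4. score
}
```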
If opencode-evermemos-plugin is genuinely better than supermemory, it should mostly win on:
- exact code symbol recall
- repo-scoped memory separation
- action/history recall from tool summaries
- separation of profile memory from episodic memory
If it loses, the most likely causes are:
- noisy tool memories
- prompt bloat
- stale or irrelevant recall
- latency overhead
- cross-user or cross-project contamination
For each turn, log at least:
- recall request latency
- write request latency
- number of recalled memories
- memory types injected
- injected character count
- whether recall occurred
- whether the task eventually succeeded
This makes it possible to correlate memory behavior with outcome rather than guessing from transcripts.
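A structured per-turn record keeps this queryable later; the field names are assumptions:

```ts
// One record per turn, written as JSONL alongside the transcript.
interface TurnLog {
  taskId: string;
  mode: "none" | "supermemory" | "evermemos";
  turnIndex: number;
  recallLatencyMs: number | null;  // null when no recall occurred
  writeLatencyMs: number | null;
  recalledCount: number;
  memoryTypes: string[];           // e.g. ["profile", "episodic", "tool"]
  injectedChars: number;
  recallOccurred: boolean;
  taskSucceeded: boolean | null;   // filled in after the run completes
}
```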
Start with:
- 15 to 20 benchmark tasks
- 10 runs per mode
That is usually enough to detect a clear signal without turning evaluation into a huge research project.
Use a matrix like this:
| Category | No Plugin | Supermemory | EverMemOS Plugin |
|---|---|---|---|
| Preference recall success | | | |
| Repo fact recall success | | | |
| Action recall success | | | |
| Symbol recall success | | | |
| Compaction survival | | | |
| Avg added latency (ms) | | | |
| Avg injected chars | | | |
| Privacy failures | | | |
Treat this as a benchmark suite, not a one-off demo.
The most convincing result will come from:
- repeated runs
- fixed prompts
- fixed repo state
- measured outputs
- side-by-side comparison
A good next step is to build a lightweight evaluation harness that:
- defines benchmark tasks
- runs them across the three modes
- captures transcripts and timing
- stores raw results
- outputs a markdown or CSV comparison report
That will let this plugin compete on evidence instead of intuition.
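As a starting point for the report step, here is a minimal CSV writer over the `RunMetrics` rows sketched earlier; it JSON-encodes each cell so arrays and strings containing commas survive:

```ts
import { writeFileSync } from "node:fs";

// Flatten result rows to CSV; works for any uniform array of records.
function writeCsvReport(rows: Record<string, unknown>[], path: string): void {
  if (rows.length === 0) return;
  const header = Object.keys(rows[0]).join(",");
  const lines = rows.map((row) =>
    Object.values(row).map((value) => JSON.stringify(value)).join(",")
  );
  writeFileSync(path, [header, ...lines].join("\n"));
}

// Usage: writeCsvReport(allRunMetrics, "results.csv");
```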