# DevMemBench
DevMemBench is a benchmark developed by ProsperityPirate specifically for codexfi, available in the `benchmark/` directory of the repository. It measures how reliably codexfi recalls facts across coding sessions. The latest run scores 94.5% overall (189 correct answers out of 200 questions) across 8 categories that map directly to how you use memory day-to-day.
## What the score means for you
Each category in the benchmark corresponds to something the memory system does during a real session:
| Category | What it means in practice | Score |
|---|---|---|
| tech-stack | Remembers your framework, language, and tooling choices | 100% |
| architecture | Recalls system design decisions and component relationships | 100% |
| preference | Keeps track of your personal coding and workflow preferences | 100% |
| abstention | Doesn't hallucinate answers when something isn't in memory | 100% |
| session-continuity | Carries facts from one session into the next | 96% |
| knowledge-update | Replaces stale facts when you switch tools or change approaches | 96% |
| error-solution | Retains bug fixes, gotchas, and debugging approaches | 92% |
| cross-session-synthesis | Combines facts that were spread across multiple sessions | 72% |
| **Overall** | | **94.5%** |
The four 100% categories represent the core value proposition — stack, architecture, preferences, and abstention. These are the facts you most frequently need the agent to remember, and they are recalled perfectly in the benchmark.
## Known limitation: cross-session synthesis
cross-session-synthesis (72%) is the weakest category. This tests questions like "list every preference you've seen me mention across all our sessions" — queries that require enumerating facts scattered across many separate sessions with different phrasing.
A single semantic search surfaces the most relevant memories but can miss facts that are phrased differently from the query. codexfi's `types[]` enumeration path (used by `memory({ mode: "list" })`) improves recall for explicit list queries, but the benchmark measures fully autonomous recall, where the agent doesn't know to ask for a list.
This is a known gap and an active area of improvement. For now, if you need an exhaustive list of a specific type of memory, call the memory tool explicitly — `memory({ mode: "list", scope: "project" })` — rather than relying on semantic retrieval alone.
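The gap can be illustrated with a toy sketch. None of this is codexfi's actual implementation — the store shape, `search`, and `listByType` are invented for illustration — but it shows why a keyword-overlap search misses facts phrased differently from the query, while a type-scoped listing is exhaustive by construction:

```typescript
// Toy memory store: each fact is tagged with a type.
type Memory = { type: string; text: string };

const store: Memory[] = [
  { type: "preference", text: "prefers tabs over spaces" },
  { type: "preference", text: "likes small, focused pull requests" },
  { type: "preference", text: "avoid default exports in TypeScript" },
  { type: "error-solution", text: "fixed CORS by setting allowed origins explicitly" },
];

// Naive "semantic" search stand-in: rank by shared words with the query, keep top-k.
function search(query: string, k: number): Memory[] {
  const qWords = new Set(query.toLowerCase().split(/\W+/));
  return store
    .map((m) => ({
      m,
      score: m.text.toLowerCase().split(/\W+/).filter((w) => qWords.has(w)).length,
    }))
    .filter((r) => r.score > 0) // facts sharing no words with the query never surface
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((r) => r.m);
}

// Type-scoped listing: enumerates every fact of the type, regardless of phrasing.
function listByType(type: string): Memory[] {
  return store.filter((m) => m.type === type);
}

const fromSearch = search("list every preference you have seen", 2);
const fromList = listByType("preference");
```

Here `fromList` returns all three preferences, while the keyword search finds none of them — no stored fact shares a word with the query, even though all three are relevant.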
## Benchmark dataset
The benchmark uses a synthetic codebase history spanning two realistic projects:
- `ecommerce-api` — FastAPI + PostgreSQL + Redis + Stripe. 25 sessions covering authentication, product catalog, shopping cart, and checkout flow (Jan–Feb 2025).
- `dashboard-app` — Next.js 15 + Recharts + SWR. 25 sessions covering analytics dashboard, chart components, and data fetching (Jan–Feb 2025).
Sessions were written to reflect realistic developer conversations — including preference statements, architectural decisions, error debugging, and knowledge updates (for example, migrating from one ORM to another mid-project). The 200 questions are designed to stress memory across session boundaries, not just within a single conversation.
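To make the dataset concrete, a single synthetic session might look something like the sketch below. The `SyntheticSession` shape and all field names are assumptions for illustration, not the actual on-disk format under `benchmark/`:

```typescript
// Hypothetical shape of one synthetic session in the benchmark dataset.
interface SyntheticSession {
  project: "ecommerce-api" | "dashboard-app";
  date: string; // ISO date within the Jan–Feb 2025 window
  messages: { role: "user" | "assistant"; text: string }[];
}

// Example session containing a knowledge update (an ORM migration mid-project).
const session: SyntheticSession = {
  project: "ecommerce-api",
  date: "2025-01-14",
  messages: [
    { role: "user", text: "We're migrating the data layer to a different ORM from here on." },
    { role: "assistant", text: "Noted: future data-access work uses the new ORM." },
  ],
};
```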
## Methodology
The benchmark runs a five-phase LLM-as-judge pipeline:
- Ingest — all 50 synthetic sessions are ingested into a fresh codexfi memory store
- Search — for each of the 200 questions, codexfi's retrieval pipeline is called with the question as the query
- Answer — an LLM answers using only the retrieved memory context (no access to the original session transcripts)
- Evaluate — a judge LLM compares the answer against the ground truth and scores it `correct` or `incorrect` with an explanation
- Report — scores are aggregated by category and written to `report.json`
Both the answering model and the judge model are `claude-sonnet-4-6`. The answering step is intentionally constrained: if codexfi's retrieval doesn't surface the right memory, the answer will be wrong. This makes the benchmark a direct measure of retrieval quality, not LLM reasoning ability.
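The aggregation in the Report phase can be sketched as follows. This is illustrative only — the real implementation lives in `benchmark/src/` and these names are assumptions:

```typescript
// One judge verdict per question: which category it belongs to, and whether
// the judge scored the answer correct.
type Verdict = { category: string; correct: boolean };

// Aggregate verdicts into per-category and overall scores (in percent).
function aggregate(verdicts: Verdict[]): {
  byCategory: Record<string, number>;
  overall: number;
} {
  const totals: Record<string, { correct: number; total: number }> = {};
  for (const v of verdicts) {
    const t = (totals[v.category] ??= { correct: 0, total: 0 });
    t.total += 1;
    if (v.correct) t.correct += 1;
  }
  const byCategory: Record<string, number> = {};
  for (const [cat, t] of Object.entries(totals)) {
    byCategory[cat] = (100 * t.correct) / t.total;
  }
  const overall =
    (100 * verdicts.filter((v) => v.correct).length) / verdicts.length;
  return { byCategory, overall };
}
```

With 189 correct verdicts out of 200, `overall` comes out to 94.5, matching the headline score.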
## Running the benchmark yourself
### Prerequisites
- `ANTHROPIC_API_KEY` — used for the answering and judge LLM calls (set as an environment variable for the benchmark runner)
- `VOYAGE_API_KEY` — used for embeddings during ingest and search (set as an environment variable for the benchmark runner)

These environment variables are read by the benchmark runner directly (`benchmark/src/`), not by the codexfi plugin. The plugin reads keys from `~/.codexfi/codexfi.jsonc`. You need both: the config file for the plugin, and the env vars for the benchmark script.
### Run

```shell
cd benchmark
bun install     # first time only
bun run ingest  # ingest all 50 synthetic sessions into a fresh store
bun run bench   # run all 200 questions and produce report.json
```

Results are written to `benchmark/data/runs/<run-id>/report.json`.
To run a single category in isolation:
```shell
bun run bench --category tech-stack
bun run bench --category cross-session-synthesis
```
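Once a run completes, you can post-process the report programmatically. The sketch below assumes a simplified `category -> score` shape, which may not match the actual `report.json` schema — inspect the file from your own run first:

```typescript
// Assumed (simplified) report shape: category name mapped to score in percent.
interface Report {
  overall: number;
  categories: Record<string, number>;
}

// Find the category with the lowest score, e.g. to decide what to rerun.
function weakestCategory(report: Report): string {
  return Object.entries(report.categories).sort((a, b) => a[1] - b[1])[0][0];
}

// Sample data matching the published run above.
const sample: Report = {
  overall: 94.5,
  categories: {
    "tech-stack": 100,
    "architecture": 100,
    "preference": 100,
    "abstention": 100,
    "session-continuity": 96,
    "knowledge-update": 96,
    "error-solution": 92,
    "cross-session-synthesis": 72,
  },
};
```

Against the published scores, `weakestCategory(sample)` returns `"cross-session-synthesis"`, which is the category worth rerunning in isolation after any retrieval change.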