A diagnostic framework for memory systems — RAG, knowledge graphs, personal knowledge bases, conversational memory — that tests what the system knows about its own structure, not just whether it can retrieve. Findings are deltas under controlled conditions, not absolute leaderboard scores.
MemPalace, on the Postgres + pgvector + Apache AGE fork, is a competitive and honestly-measured memory system — and SME's job here was to find where it stands and where the work is. The verdict: it retrieves with the field, and a frontier-grade answer is already reachable on this stack. The gap to that ceiling isn't the language model's reasoning — it's how faithfully the memory system delivers the right context into the prompt. The encouraging part: that lever is in MemPalace's own hands — ingestion fidelity and retrieval breadth — not a bigger model.
Retrieval is competitive: R@5 0.927 on representative LongMemEval-S, and 0.966 in byte-identical parity with upstream ChromaDB. The move to Postgres + pgvector + Apache AGE cost zero recall — MemPalace finds the right memory.
Handed the right memory (true oracle), the reader answers at 0.868 — within 0.2pp of the field's GPT-4o oracle (0.870). A frontier-grade answer is already reachable on this stack.
The gap to that ceiling is what reaches the reader, not the model's reasoning: ~7pp ingestion fidelity + ~18pp retrieval breadth, ~0pp genuine reader. It's the memory system's to close — and it's MemPalace's to build.
The Apache AGE knowledge graph is live and non-trivial — 1,873,489 cleaned triples. It surfaces drawers vector + keyword search miss on structural cases (the structure categories read it directly); broad-recall fusion and the product path are the next integration.
SME measures structural surfaces no leaderboard captures: ingestion integrity, gap-detection, ontology coherence, and invocation discipline — whether a system knows its own shape.
The method caught its own error in public: a reader floor-lift that returned null exposed a substrate confound and turned an apparent 0.61 “reader ceiling” into the 0.87 finding. Diagnostic deltas, not a leaderboard rank.
SME was driven across nine diagnostic categories, nine
benched substrates, and a 33-system published field. Five
findings carry the report. Every number traces to a committed baselines/
artifact, and we never blur measured (our harness) against claimed
(a vendor self-report). The interactive matrix below is the data; this is the story it tells.
Oracle retrieval R@5 0.974, deployed 0.927 — yet QA tops out lower because the reader loses the points, not the retriever. Widening top-5→top-20 buys +17.3pp QA (0.567→0.740) then plateaus; the residual to the 0.868 ceiling is synthesis, not retrieval.
Age-fusion showed no significant gain on three corpora (two CI-confirmed null under FDR); the hybrid graph leg is inert (hybrid ≡ union, byte-identical); cross-encoder rerank is neutral-to-negative. The dense-vector + BM25 backbone already does the work — corroborated from outside: ai-memory (no graph) hits R@5 0.920, level with the full stack.
ChromaDB → postgres+pgvector, holding embedding/corpus/reader/judge fixed: retrieval 0.833 == 0.833 and QA 0.392 ≈ 0.384 are statistically identical. CI-confirmed — the paired QA delta is [−2.0, +2.8]pp, 9/250 discordant, p_adj 0.84. The substrate carries the answer; the storage engine does not.
A capped /graph projection reported Cat 4 “98.98% one edge type, entropy
0.020.” The real KG reads entropy 0.645, a
61.87% giant component, modularity 0.796 (hierarchy PASS).
“Monoculture / fragmented / flat” → “diverse / connected /
hierarchical.” SME found its own measurement bug, fixed it, cross-checked it three ways.
Re-typing one graph flat→moderate→fine moves Cat 4 entropy 0.000→0.842→0.856 — while Cat 5 topology (components, giant size, Betti) is byte-identical. So compare systems on Cat 5; report Cat 4 with its type-count. The same graph reads “monoculture” or “healthy” purely by ontology granularity.
Cheap-to-bench needs both no write-time LLM and fast ingest: ai-memory is $0 (R@5 0.920). The wall has two independent causes — a write-time LLM (Mem0 ~18h, Hindsight ~150h) and slow ingest throughput (agentmemory is LLM-free yet ~15h-walled). “No write-time LLM” is necessary but not sufficient.
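The granularity effect in the Cat 4 finding can be reproduced on a toy graph. The sketch below is not SME's scorer. It assumes "norm-entropy" means the Shannon entropy of the edge-type distribution divided by the log of the number of observed types, and the edge list and type labels are hypothetical.

```python
import math
from collections import Counter


def normalized_entropy(edge_types):
    """Shannon entropy of the edge-type distribution, scaled to [0, 1].

    Assumption: normalization divides by log(number of observed types);
    the report does not spell out SME's exact normalization.
    """
    counts = Counter(edge_types)
    total = sum(counts.values())
    if len(counts) <= 1:
        return 0.0  # a single type (or no edges) carries no type information
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(len(counts))


# Hypothetical toy graph: the same six edges under two typing schemes.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "d"), ("d", "e"), ("e", "f")]
flat_types = ["related_to"] * len(edges)
fine_types = ["works_at", "works_at", "located_in",
              "authored", "authored", "located_in"]

flat_entropy = normalized_entropy(flat_types)  # one type for every edge
fine_entropy = normalized_entropy(fine_types)  # three evenly used types
print(f"flat={flat_entropy:.3f} fine={fine_entropy:.3f}")
```

The edge list is identical in both runs; only the labels change. Any topology metric computed from `edges` alone cannot move, which is why the entropy reading needs its type count reported alongside it.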
Standard memory benchmarks ask "can you find a memory?" SME asks four further questions: is the system actually retrieving what it should, does its structural layer earn its complexity, does the graph match what the README says, and when it ships, does the model actually reach the memory?
Multiple passes over every system under test, across multiple corpus shapes and multiple retrieval conditions (A / B / C / D), so brittle default behaviours that hide on any single pass become visible when readings are compared side by side.
The Lookup (1), The Crossing (2), The Dissonance (3), The Threshold (4), The Missing Room (5), The Archive (6), The Abacus (7), The Blueprint (8), The Handshake (9). Each measures a structural property retrieval-only benchmarks don't reach.
The defensible reading shape is a before/after delta under matched conditions, or a within-system A/B/C/D ablation. Cross-system rankings carry an unmeasured confound — corpus, ontology, retrieval config. SME calls that out instead of papering over it.
Every number below is sourced from a JSON in
baselines/. Where the source file isn't published
yet, the card says so. No rounded estimates, no extrapolations.
Thirty-three memory systems, two kinds of number side by side: Published — what each system’s own team or a survey reports (self-reported, NOT our harness) — and SME multipass — the Cat 1–9 readings we measured ourselves. The two are kept in unmistakably separate column-groups: never read a published QA number as something we benched. Sort any column, filter by status / architecture / metric, and follow the magic links — every cell that has a source is clickable: our receipts (the baseline/doc that holds the reading) and their claims (the leaderboard or paper) are styled differently on purpose. Only 8 of the 33 carry a full SME column — that gap is the honest headline. For the narrative this data supports — the five campaign findings, the statistical rigor, and the cost-wall taxonomy — see the synthesis above.
| System | Arch | Status | Published (self-reported, NOT our harness): LME QA | Published: LoCoMo | Published: BEAM-1M | SME multipass (we measured): Cat 1 | Cat 2c | Cat 3 | Cat 4 | Cat 5 | Cat 6 | Cat 7 | Cat 8 | Cat 9a | Cat 9b |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
The explorer above is the whole field at a glance; this is the
role-grouped zoom on the nine systems we
actually benched through Cat 1–9 — five memory products
(incl. postgres_ingest, the mempalace-raw ablation), a
no-structure control, a diagnostic orchestrator arm, and two
wired-but-not-yet-benched baselines. It isn’t a leaderboard of nine
competitors: the rows are grouped by role, and the most
informative reading is where each system can even be
measured. Two graph-native products (mempalace, OMEGA) take the full
Cat 1–9; the extraction systems read N/A on the structural cats
for three different architectural reasons; and an
N/A is a measured finding, not a blank. The single-substrate
mempalace deep-dive that follows is the supporting detail under this.
| System | 1 Lookup | 2c Stairway | 3 Dissonance | 4 Threshold | 5 Missing Room | 6 Archive | 7 Abacus | 8 Blueprint | 9a Handshake | 9b Call-through |
|---|---|---|---|---|---|---|---|---|---|---|
| Memory systems — products under test | ||||||||||
| mempalace verbatim + AGE graph | R@5 0.927 | R@5 0.960 | 0.00E | entropy 0.645 · other 26.83% | largest 61.87% · iso 22.2% | 0.00E | QA 0.580 | mod 0.7961 · PASS | 0.983 | reachable |
| OMEGA embed + auto-relate | R@5 0.900 | R@5 0.920 | recall 0.50E | entropy 0.78 | 1 comp · 0 iso | 0.00E | QA 0.593 | drift 0.875 | N/A · no harness | N/A · no harness |
| Hindsight extraction competitor | QA-def | QA-def | N/A · no endpoint | N/A · no endpoint | N/A · no endpoint | N/A · no endpoint | QA-def | N/A · no endpoint | N/A · no harness | N/A · no harness |
| Mem0-OSS extraction competitor | QA-def | QA-def | N/A · graph removed | N/A · graph removed | N/A · graph removed | N/A · graph removed | QA-def | N/A · graph removed | N/A · no harness | N/A · no harness |
| postgres_ingest mempalace-raw: verbatim postgres, no graph | R@5 0.833 | R@5 0.833 | N/A · by design | N/A · by design | N/A · by design | N/A · by design | QA 0.392 | N/A · by design | N/A · no harness | N/A · no harness |
| Baseline — no-structure control (the floor the structural delta is measured against; not a competitor) | ||||||||||
| flat ChromaDB vector, no graph | R@5 0.833 | R@5 0.833 | N/A · by design | N/A · by design | N/A · by design | N/A · by design | QA 0.384 | N/A · by design | N/A · no harness | N/A · no harness |
| Diagnostic arm — Cat 9a orchestrator-invocation probe (measures the model in front of the memory, not a product) | ||||||||||
| RLM Qwen-7B / Llama-70B orchestrator | R@5 0.467 | by-hop | n/r | n/r | n/r | n/r | n/r | n/r | 0.467 · 7–27% invoke | N/A · no harness |
| Wired, not yet benched — harness has these adapters; no Cat run exists | ||||||||||
| full_context Karpathy D1 · whole-vault-in-context | not run | not run | not run | not run | not run | not run | not run | not run | not run | not run |
| karpathy_compiled Karpathy D2 · LLM-compiled wiki | not run | not run | not run | not run | not run | not run | not run | not run | not run | not run |
The only rows on this page measured under identical conditions — every system below was run by us on the same n=150 stratified LongMemEval-S subset, the same retrieval definition (session-level R@K), the same reader, and the same canonical gpt-5.3-chat judge. No self-reported numbers, no mixed answer models, no metric swaps. The field matrix below cites published leaderboards (caveat-heavy); this is the table that controls the variables.
| System | Storage | R@5 | E2E QA |
|---|---|---|---|
| mempalace deployed substrate (this work) | verbatim | 92.7% | 58.0% o4-mini reader · retrieved ctx · canonical gpt-5.3-chat judge · n=150 macro ‖ comparator = oracle-n500, not strat150-S (closest apples, not pixel-perfect) · opus deployed→oracle gradient 0.567→0.868 in field card |
| OMEGA first independent competitor run | extraction | 90.0% | 59.3% sme-rich · o4-mini reader · retrieved ctx · canonical gpt-5.3-chat judge · n=150 macro ✦ ≈ parity (+1.3pp), same reader + context |
| Hindsight extraction, cloud-extractor | extraction | — | QA deferred extraction throughput / cost-gated |
The honest read: mempalace’s edge is retrieval (R@5 +2.7pp);
reader/QA is at parity (OMEGA +1.3pp, same-reader). On
R@5, mempalace 0.927 vs OMEGA
0.900 — a fair same-rendering pair (both measured
upstream-exact, --content-rules), so the 2.7pp edge is real:
retrieval-only, no answer-model confound. On E2E QA, held to
same reader, same context (o4-mini, retrieved, macro n=150),
mempalace 0.580 vs OMEGA sme-rich 0.593
— ≈ parity (OMEGA +1.3pp): two
verbatim-vs-extraction systems land within a question of each other once
reader and context are pinned. Hindsight’s cloud-extractor throughput
keeps its QA pass cost-gated — deferred rather than guessed.
Don’t confuse the 0.580 comparator with mempalace’s
deployed→oracle gradient (deployed-E2E ladder
0.567→0.760 as retrieval widens limit 5→50 → 0.868 oracle
ceiling): that gradient is an opus reader on retrieved/oracle context and
lives in the field card below — it is not the like-for-like
OMEGA comparison. This table — not the leaderboard — is the one to
read.
✦ OMEGA’s 0.593 is the fair sme-rich number (our harness, n=150, o4-mini reader + gpt-5.3-chat judge). OMEGA’s upstream-exact run scored just 0.37 — but that was date-starved: its reader context dropped session timestamps, and the temporal category (cat_6) jumped 0.04→0.36 once we restored the dates. Scoring the date-stripped 0.37 against mempalace would be the unfair comparison, so we use the date-restored sme-rich figure. (We ran the same date-confound check on our own LoCoMo numbers — see the explainer below — and there it came back clean, which is why we trust 0.593 here.) On R@5, the comparison stays upstream-exact on both sides (mempalace 0.927, OMEGA 0.900) — OMEGA’s sme-rich R@5 of 0.953 is not a comparator here, because no same-rendering mempalace pair exists for it; mixing renderings is exactly the apples-to-oranges this table refuses.
What this table is: the field column is self-reported leaderboard numbers — mixed metrics (R@5 vs oracle QA vs E2E QA, which are three different tests), mixed answer models (answer-model choice alone swings ~24pp), and mixed verification. It is context, cited with caveats — not a controlled comparison. For that, read the apples-to-apples table above (our own harness, identical conditions). The Verif. column below tells you who measured each number:
| System | Storage | LongMemEval | LoCoMo | Answer model | Verif. | ||
|---|---|---|---|---|---|---|---|
| R@5 | oracle QA | E2E QA | |||||
| mempalace — this work (verbatim-first) | |||||||
| mempalace — deployed substrate | verbatim | 92.7% | — | 0.567→0.760 limit 5→50 ‖ | 38.8% daemon § | gpt-5.3-chat | indie |
| mempalace — ingest-fixed | verbatim | — | — | 68.4% limit=5 ‖ | — | gpt-5.3-chat | indie |
| mempalace — true-oracle ceiling | verbatim | — | 86.8% | — | — | gpt-5.3-chat | indie |
| the field — E2E QA leaderboard (sorted by LongMemEval E2E QA; storage paradigm per MemPalace survey + independent check) | |||||||
| OMEGA | extraction | 90.0% ‡ | — | 95.4% | — | GPT-4.1 | self indie |
| Mastra | unstated | — | — | 94.9% | — | GPT-5-mini | self |
| Mem0 (platform v3) | hybrid | — | — | 94.4% | 92.5% | undisclosed | self |
| Hindsight | extraction | — | — | 91.4% | 89.6% | Gemini 3 Pro | indie |
| True Memory (Pro) | verbatim | — | — | 87.8% | 93.0% | gpt-4.1-mini | paper |
| Supermemory | extraction | — | — | 85.2% | 65.4% | Gemini-3 | indie |
| EverOS / EverMind | unstated | — | — | 83.0% | 93.1% | undisclosed | self |
| ENGRAM (paper, arXiv:2511.12960) | extraction | — | — | 71.4% | 77.6% | GPT-4o-mini | paper |
| Zep / Graphiti | hybrid | — | — | 71.2% | 75.1% | GPT-4o | paper |
| Celiums | unstated | — | — | 62.3% | — | Opus | self |
| GPT-4o (reference) | reference | — | 87.0% | 60.2% | — | GPT-4o | ref |
| retrieval-only — R@K recall (LongMemEval R@5; not comparable to QA) | |||||||
| engram-2 † | verbatim | 99.0% | — | — | 74.5% QA | — | self |
| ai-memory † | verbatim | 97.8% | — | — | — | — | self |
| MemPalace (upstream) | verbatim | 96.6% | — | — | 88.9% R@10 | — | indie |
| agentmemory | verbatim | 95.2% | — | — | — | — | self |
| mcp-memory-service | verbatim | 86.0% | — | — | 49.7% R@5 | — | self |
† engram-2 = github.com/199-biotechnologies/engram-2 (Paperfot AI) — a Rust CLI: SQLite + FTS5/BM25 + Gemini Embedding 2 + RRF + Cohere rerank, self-reporting 99.0% R@5 on LongMemEval-S. It returns verbatim source chunks (with an optional LLM claim/entity extraction pass). ai-memory = github.com/alphaonedev/ai-memory-mcp (FTS5 + embeddings, 97.8% R@5). The “Engram” name is overloaded: the same org also ships 199-biotechnologies/engram (MCP server, BM25+ColBERT+KG, 98.1% R@5), and the ENGRAM (paper) row is a separate, extraction-based academic system (arXiv:2511.12960, 71.4% QA). Storage paradigms marked unstated are ones we could not pin to a primary source — left blank rather than guessed.
‡ OMEGA R@5 0.900 is our own measurement — the first
independent head-to-head: OMEGA run on the identical n=150 stratified
LongMemEval-S subset, session-level R@K, same retrieval definition as mempalace
(which scores 0.927 on that subset — OMEGA trails by 2.7pp / 4 questions).
It is not comparable to OMEGA’s self-reported 95.4%, which is
a different metric (E2E QA, not R@5) with a stronger answer model (GPT-4.1;
answer-model choice alone swings ~24pp). The on-harness R@5 is the apples-to-apples
number. (Note: OMEGA ships without its bge-small embedding model and silently falls
back to keyword-only FTS5; we ran omega setup --download-model to restore
semantic retrieval before measuring, so this is the fair number, not a crippled one.)
§ mempalace daemon on LoCoMo — age-fused QA 0.388 (n=250 stratified, isolated scratch palace, never prod), drawer-level R@5 0.556. An A/B isolating the KG: restoring the graph-hydration fix (palace-daemon#202) moved QA +1.2pp over the vector-only fallback (0.376) with identical drawer-R@5 — so on LoCoMo the age-fused graph half adds ~nothing to top-5 retrieval (the graph-only hits don’t displace the vector top-5). The flat adapter scores the same QA (0.384), confirming the substrate, not the path, sets it. LoCoMo QA here (~0.39) is an our-harness figure (gpt-5.3-chat reader) — not comparable to the field’s 0.75–0.93 (stronger answer models + looser graders; cf. the event-ordering Kendall-τ vs binary-judge mismatch).
‖ The deployed-E2E column is a retrieval-breadth ladder
(real /search→reader→judge, opus reader, n=150):
limit=5 0.567 (85/150) → limit=20 0.740
(111/150) → limit=50 0.760 (114/150). Deployed QA was
retrieval-limited at limit=5; widening to 20 recovers
+17.3pp and it plateaus by 50 — concrete proof the gap
to the 0.868 oracle ceiling is retrieval breadth, not the reader.
(One content-filtered judge call at limit=50, qid 95228167, is counted
wrong — a conservative floor; scored correct it is 0.767, which
doesn’t touch the plateau.) These figures come from a live re-query
and supersede the 05-29 cached 0.610 (pre-retrieval-drift pinned context).
Only the 0.868 ceiling sits under oracle QA (gold
handed to the reader, retrieval bypassed) — the like-for-like to the
field’s GPT-4o 0.870 oracle
(techempower-org/multipass-structural-memory-eval#117).
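The ladder arithmetic in this note can be checked directly from the raw counts it quotes. The following is a minimal sketch, not harness code; the function and variable names are illustrative, and only the counts (85, 111, 114 of 150) and the conservative-floor case come from the text.

```python
N_QUESTIONS = 150
# Correct answers at each /search limit, as quoted in the note.
CORRECT_AT_LIMIT = {5: 85, 20: 111, 50: 114}


def ladder(correct_at_limit, n):
    """Return QA accuracy and the step gain (in pp) for each limit."""
    rows = []
    previous = None
    for limit in sorted(correct_at_limit):
        correct = correct_at_limit[limit]
        gain_pp = None if previous is None else (correct - previous) * 100 / n
        rows.append({"limit": limit, "qa": correct / n, "gain_pp": gain_pp})
        previous = correct
    return rows


rows = ladder(CORRECT_AT_LIMIT, N_QUESTIONS)
for row in rows:
    gain = "n/a" if row["gain_pp"] is None else f"{row['gain_pp']:+.1f}pp"
    print(f"limit={row['limit']:>2}  QA={row['qa']:.3f}  step={gain}")

# Scoring the one content-filtered judge call as correct instead of wrong.
qa_if_filtered_call_correct = (CORRECT_AT_LIMIT[50] + 1) / N_QUESTIONS
```

The first step (limit 5 to 20) carries almost all of the gain; the second step adds three questions, which is the plateau the note describes.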
Read the Storage column first — it is the axis a LongMemEval leaderboard hides. mempalace is verbatim-first: it stores the raw turns and never summarizes (“store everything, then make it findable”), solving retrieval separately. It is not alone in that — True Memory (a paper-only design, arXiv:2605.04897, no code release) is also verbatim-first and reports 87.8% QA, so the paradigm is not what caps the leaderboard. The rest of the field sits on the other side of the split: hybrid systems that selectively extract while keeping some raw history (Mem0’s self-editing vector+graph+KV, Zep/Graphiti’s bi-temporal knowledge graph, Letta’s tiered memory), and pure extraction (Hindsight stores structured, time-aware facts instead of raw logs). So no, these are not all verbatim-first — but mempalace’s verbatim peers are real: True Memory, traditional RAG, and the R@K-only ChromaDB baselines (agentmemory, engram-2, ai-memory at 95–99% R@5). The paradigm split is taken from the MemPalace landscape survey, not assigned by us.
The honest read: on retrieval mempalace is
competitive — R@5 0.927 sits mid-pack against the field's
R@K leaders (96–99% for ChromaDB-baseline systems;
mcp-memory-service 80–86%). On QA we
deliberately publish no full-pipeline leaderboard number (that needs
a field-standard answer model + the canonical
gpt-4o-2024-08-06 judge we don't have). The
apples-to-apples axis is the oracle — and an
earlier reading of this card got it wrong, which is worth stating
plainly: we reported 0.61 and called the 26-point
gap to GPT-4o's 0.87 oracle a reader limit. It isn't. That
0.61 was retrieval-limited — the context
reaching the reader came through /search at limit=5,
and for single-session-assistant questions our upstream-parity
ingest (user-turns-only, which upstream itself recommends against)
dropped the assistant-authored gold entirely. Hand the
reader the gold instead (true oracle, evidence sessions verbatim)
and five of six categories recover, not just
assistant: single-session-assistant 0.32→0.98
(ingest), temporal 0.36→0.75,
knowledge-update 0.70→0.91, multi-session
0.71→0.87, single-session-preference
0.80→0.93 (all retrieval); only
single-session-user carried over. That lifts the overall from
0.610 to a 0.868 reader ceiling — within 0.2pp of the
field's 0.870. So the old 26-point “reader” gap
decomposes as ~7pp ingest artifact (the dropped
assistant turns; fixed → 0.684) + ~18pp retrieval
breadth (limit=5 + chunking never delivering the gold) +
~0pp genuine reader. The reader was never
the bottleneck — once the gold is in context it essentially
matches the field's oracle; getting the gold to it is the
whole game. The route here is itself a finding: an earlier
reader floor-lift — three category-specific
prompt clauses on the pinned /search context —
returned a clean null (net −2 to +3 questions),
which is precisely what flagged that the binding constraint was
substrate, not prompt; the true-oracle test then confirmed it.
That decomposition is the SME finding the leaderboards
can't show. And the structural categories — ingestion
integrity (Cat 4), gap-detection (Cat 5), ontology coherence
(Cat 8), invocation discipline (Cat 9) — have no competitor
analogue at all.
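The gap decomposition above is a chain of differences between four readings, and it can be reproduced in a few lines. This is a sketch of the arithmetic only; the labels are paraphrased from the text and the helper name `decompose` is hypothetical.

```python
# Readings quoted in the text, in the order the gap is walked.
READINGS = [
    ("substrate reading (limit=5, user-turns-only ingest)", 0.610),
    ("ingest-fixed (limit=5)", 0.684),
    ("true-oracle ceiling", 0.868),
    ("field GPT-4o oracle", 0.870),
]
STEP_LABELS = ["ingest artifact", "retrieval breadth", "genuine reader"]


def decompose(readings, labels):
    """Attribute each consecutive difference (in pp) to one cause."""
    steps = {}
    for (_, low), (_, high), label in zip(readings, readings[1:], labels):
        steps[label] = round((high - low) * 100, 1)
    return steps


steps = decompose(READINGS, STEP_LABELS)
total_pp = round(sum(steps.values()), 1)
for label, pp in steps.items():
    print(f"{label:<18} {pp:>5.1f}pp")
print(f"{'total':<18} {total_pp:>5.1f}pp")
```

The three steps sum to the 26-point gap between 0.610 and the field's 0.870 oracle, with the reader's share at 0.2pp.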
/search)
sits between the 0.610 substrate reading and the ceiling, and a
production daemon re-test is still pending. All four gradient numbers
are abstention-credited (the field's own metric), so our 0.868
ceiling and the field's 0.870 oracle are a like-for-like comparison. Cross-system LongMemEval numbers are not directly
comparable anyway: answer-model choice alone swings ~24pp, oracle QA
is a different metric from full-pipeline E2E QA, and most competitor
numbers are self-reported — so we do not place ourselves on
the QA leaderboard. Field numbers are cited from the landscape
survey; ours are sourced from baselines/.
The cross-system matrix above is the comparison — every system,
every category. This is the single-substrate deep-dive beneath
it: the same Cat 1–9 read in detail against mempalace alone,
answering what does the memory system know about its own structure?
These nine categories are SME’s unique
contribution — LongMemEval, LoCoMo, BEAM, Mem0’s and
Zep’s evaluations all measure end-to-end QA or retrieval recall.
None of them diagnose canonical-collision dedup, edge-type
monoculture, structural holes, ontology drift, or harness invocation rate.
The analogue column is
none for every row, by construction — that
is the headline. Substrate under test: mempalace
via the mempalace-daemon adapter against the live palace-daemon,
reading the real AGE knowledge graph (1,156,314 entities /
1,873,489-edge cleaned RELATION set, full-graph EXACT — not the earlier
capped /graph projection, and post the --drop-code-tokens
junk DELETE). Diagnostic readings of one substrate —
not leaderboard scores.
| Cat | What it measures | MemPalace reading | Corpus / adapter | Analogue |
|---|---|---|---|---|
| 1 Lookup | Find a specific memory from a natural-language query | R@5 0.867 full-recall 22/30; hop-1 0.889, hop-2 0.667 | jp-realm-v0.1 · daemon /search/age-fused | none |
| 2c Stairway | Multi-hop retrieval recall by hop depth — does structure scale? | (structural − flat) +10pp · A flat 0.833 / B hybrid 0.933 / C age-fused 0.900 @K=5 · grows with depth (B/A 1.11×→1.25×) B−A +9.3pp@1-hop, +16.7pp@2-hop; graph-RRF sub-layer (C) is a neutral-to-slightly-negative tax vs B — third data point for the age-fusion-null thesis (#203) | jp-realm-v0.1 · flat / daemon / age-fused | none |
| 3 Dissonance | Detect and surface conflicting facts | 0.00 emergent · the live palace KG has 0 contradicts edges: the enrichment pipeline generates no emergent contradiction structure on real content. Supersedes the earlier +1.00, which was a corpus-declared ceiling (good-dog-graph reading back hand-seeded edges, not detection). Cross-system: OMEGA auto-relate generated contradicts edges, recall 0.50 (#148/#215) | live palace (real KG) · daemon --real-kg | none |
| 4 Threshold | Is ingestion producing a clean graph? (dedup, field coverage, monoculture) | 4a collisions 248 (1.9%) · 4b coverage 1.00 · 4c norm-entropy 0.645 · other 26.83% · 40 types · full-graph EXACT (server-side cypher over the cleaned 1,873,489-edge RELATION set), RESOLVED. Overturns the prior sampled 0.020/98.98%-tunnel capped-projection artifact, then the post-relabel-pre-DELETE intermediate (0.340/55.05%/237). The re-map ran: kg_predicate_norm de-monoculture relabel (~520k edges) + a --drop-code-tokens DELETE (48,135 junk shell-cmd/stopword/DOM-method edges). Entropy 0.020→0.340→0.645; other 98.98%→55.05%→26.83% (#211) | live palace (real KG) · daemon /cypher | none |
| 5 Missing Room | Identify what’s structurally missing (components, holes, gaps) | largest component 61.87% (715,435 of 1,156,314) · isolates 22.2% (256,782) · 325,965 components · EXACT full-graph WCC (server-side, post-DELETE), replacing the bogus capped-projection 44.8%-isolate artifact. Honest note: isolates rose 20.4%→22.2% after the --drop-code-tokens DELETE: deleting 48k junk edges orphaned ~20k entities whose only edge was junk; pre-DELETE they were “connected” by noise, and the cleaner graph honestly reports them isolated. The giant component is still well-connected (#211) | live palace (real KG) · daemon | none |
| 6 Archive | Current vs historical state, supersession tracking | 0.00 emergent · the live palace KG has 0 supersedes edges: completeness 0.00. Supersedes the earlier +1.00 (good-dog-graph declared ceiling, 8/8 hand-seeded). Cross-system: OMEGA also 0.00; its temporal analogue evolution doesn’t normalize to supersedes. Emergent supersession is unsolved in both (#148/#215) | live palace (real KG) · daemon --real-kg | none |
| 7 Abacus | Does structure earn its token overhead? (graph vs no-graph) | structure earns it: +10pp recall for 1.69× context (A vs B, token-comparable: 466→789 tok/q) flat is cheaper per-correct but lands 5 fewer correct (21/30 vs 26/30); the structural lift is hybrid retrieval, not graph traversal (#203) | jp-realm-v0.1 · flat / daemon | none |
| 7b Latency | Query latency distribution (YCSB p50 / p95) | post-AGE-index golden set: p50 vector ~684ms / union ~515ms / hybrid ~689ms two honest facts, not a single-set win: on this drawer-query golden set the graph leg is inert (hybrid ≈ union, ~689ms) — the AGE-index speedup lands on entity-anchored queries, not these. The earlier hybrid p50 2064ms was a graph-firing query set; the two sets aren’t the same workload, so do not read a same-set 2064→689 improvement (#144/#227) | live palace (AGE) · daemon (candidate-strategy) | none |
| 8 Blueprint | Does the actual graph match what the system claims to do? | “hierarchical” PASSES · modularity 0.7961 (218 communities) · introspection 1.0 LIVE · EXACT full-graph (networkx, post-DELETE). Refutes the prior FAIL (modularity 0.009 was the capped-projection artifact) and replaces the verdict-only reading with a real number: 0.7961 >> 0.5. Introspection is now 1.0 on the deployed daemon (daemon restarted, /ontology serving); it was 1.0 capability / 0.0 deployed (#147/#211) | live palace (real KG) · daemon | none |
| 9a Handshake | Does the model actually invoke memory when it has access? | opus-4-8 (Tau2 99.3): 100% invocation, 98.3% recall — invokes on every question, exceeds the deterministic 78.3% ceiling. Recall monotonic in Tau2: gemma4 41.7 → qwen3.5 75.0 → opus-4-8 98.3 prior RLM Qwen-7B/Llama-70B plateaued 46.7% (7–27% invocation) — ceiling was willingness to invoke, not retrieval (#194). 4B arms’ on-harness invocation-rate is a backfill follow-up; recall carried from the validated 2026-05-15 RLM run | jp-realm-v0.1 · familiar / rlm / opus-4-8 | none |
| 9b Call-through | Given an invocation, does the tool call complete and return a valid result? | live surfaces reachable clean floor; mock-model probe path | — | none |
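The Cat 5 row reports component counts, largest-component share, and isolate share, and notes that deleting junk edges raised the isolate share. The sketch below shows that effect on a hypothetical five-node graph with a plain union-find. It is an illustration under stated assumptions, not the server-side WCC the table refers to; it assumes every edge endpoint appears in `nodes`.

```python
from collections import Counter


def component_stats(nodes, edges):
    """Weakly-connected-component stats via union-find (direction ignored)."""
    if not nodes:
        return {"components": 0, "largest_share": 0.0, "isolate_share": 0.0}
    parent = {node: node for node in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    degree = Counter()
    for src, dst in edges:
        degree[src] += 1
        degree[dst] += 1
        root_src, root_dst = find(src), find(dst)
        if root_src != root_dst:
            parent[root_src] = root_dst

    sizes = Counter(find(node) for node in nodes)
    isolates = sum(1 for node in nodes if degree[node] == 0)
    return {
        "components": len(sizes),
        "largest_share": max(sizes.values()) / len(nodes),
        "isolate_share": isolates / len(nodes),
    }


# Hypothetical graph: d and e are joined only by one junk edge.
nodes = ["a", "b", "c", "d", "e"]
edges_before = [("a", "b"), ("b", "c"), ("d", "e")]
edges_after = edges_before[:-1]  # the junk edge is deleted

before = component_stats(nodes, edges_before)
after = component_stats(nodes, edges_after)
print(before)
print(after)
```

Removing the junk edge leaves the largest component untouched while turning two entities into isolates, which mirrors the direction of the change described in the row's note.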
contradicts edges in the live KG) and
0.50 in OMEGA (auto-relate catches 1 of 2 ground-truth
themes, precision 0.25); emergent supersession is 0.00 in both.
The earlier +1.00 we published for Cat 3/6 masked
this — it was a corpus-declared ceiling (the good-dog-graph
adapter reading back hand-seeded edges), not emergent detection on
real content. Declared-ceiling and emergent are different questions; the matrix
now labels both, and the emergent column is the more honest one
(techempower-org/multipass-structural-memory-eval#215).
familiar pipeline consistently invokes and lands
at 78.3%. The lever for raising 9a is a high-Tau2 orchestrator (Opus 4.6
99.3%, GPT-5.4 98.9%, GLM-5 ~98%), not more parameters
(techempower-org/multipass-structural-memory-eval#194).
/graph sample was tunnel-dominated — Cat 4 read 0.020/98.98%,
Cat 8 “hierarchical” FAILED at 0.009) is re-anchored to the
full-graph EXACT: Cat 4 entropy 0.645 / other
26.83%, Cat 8 modularity 0.7961 (PASS).
(2) The tautological scorer (Cat 3/6 +1.00 was the
good-dog-graph reading back hand-seeded edges — a declared ceiling, not
emergent detection) is replaced by the honest emergent read, 0.00.
(3) The limit-dependent topology samples (Cat 5/8) that
used to OOM at full scale are now computed server-side over the whole graph:
Cat 5 WCC 61.87% largest / 22.2% isolates,
Cat 8 modularity 0.7961. Cat 2c / Cat 7 keep their #203 flat
Condition-A deltas (+10pp). The one open follow-up is the two 4B arms’
on-harness Cat 9a invocation rate. Per the diagnostic posture, each cell is a
controlled reading of one substrate — and we’d rather publish a
0.00 emergent than a flattering declared 1.00.
Taken next to the field’s LoCoMo QA (0.75–0.93), our ~0.38 looks alarming. It isn’t a verbatim-substrate failure — it is four measurement choices stacked on top of each other, none of which the leaderboard numbers share. Reading them in order:
1. The substrate isn’t the limit. The flat adapter (plain verbatim + vector retrieval) scores 0.384; the full age-fused daemon (pgvector + AGE knowledge graph) scores 0.388 on the same n=250 stratified subset. If the verbatim store were the weakness, the graph-augmented path would pull ahead, and it doesn’t. Whatever caps LoCoMo here sits above the substrate, in the retrieval-and-judge layer that every system shares.
2. It’s a retrieval-recall ceiling. Drawer-level R@5 on LoCoMo is ~0.44: more than half the time the gold turn isn’t in the top-5 the reader sees, so the QA number is bounded by recall, not by the reader’s ability to answer.
3. Our judge is strict. It is a binary, abstention-aware grader with no partial credit and no looser string-overlap leniency, the same discipline we hold mempalace to everywhere else. Field LoCoMo numbers frequently use softer graders (and much stronger answer models), and answer-model choice alone swings results by ~24pp.
4. We include the adversarial split. Our subset keeps the adversarial LoCoMo questions that many field reports quietly drop; those are exactly the ones designed to defeat retrieval.
| Bucket | Overall QA | Regime | Per-ability (info-ext · temporal · multi-session · summ.) |
|---|---|---|---|
| 100K full-context | 0.649 | whole bucket in context | 0.85 · 0.75 · 0.60 · 0.87 |
| 500K retrieval | 0.487 | real retrieval (needle-in-haystack) | 0.40 · 0.26 · 0.24 · 0.87 (holds) |
| 1M deep retrieval | 0.471 | retrieval-limited (plateau) | flat vs 500K; a couple abilities tick up |
BEAM is the one benchmark here with a genuine regime shift — a cliff, then a plateau. At 100K the whole haystack fits in context, so the reader operates in a full-context regime and the needle abilities are intact. The cliff is 100K→500K (0.649→0.487, −0.162): the system crosses into a real retrieval regime and the needle-dependent abilities collapse — info-extraction 0.85→0.40, temporal 0.75→0.26, multi-session 0.60→0.24 — while the retrieval-light ability holds flat (summarization 0.87 in both, because summarizing doesn’t need a specific needle). Then it plateaus: 500K→1M is essentially flat (0.487→0.471, −0.016) and a couple of needle abilities even tick up. So once you’re retrieval-limited at a low top-K, haystack size barely matters — the bottleneck is retrieval breadth, not corpus scale. That asymmetry is the tell: a retrieval-recall ceiling, not a reader ceiling — the same finding as our LongMemEval decomposition (the reader is fine once the gold is in context; getting it there is the whole game).
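The cliff-then-plateau reading follows from the table's own numbers. The sketch below recomputes the deltas and separates the abilities that collapse from the one that holds. The 0.05 cut-off used to call an ability "needle-dependent" is an illustrative assumption, not a threshold from the report.

```python
# Overall QA per bucket, from the BEAM table.
OVERALL = {"100K": 0.649, "500K": 0.487, "1M": 0.471}

# Per-ability QA at (100K full-context, 500K retrieval), from the BEAM table.
ABILITIES = {
    "info-extraction": (0.85, 0.40),
    "temporal": (0.75, 0.26),
    "multi-session": (0.60, 0.24),
    "summarization": (0.87, 0.87),
}

DROP_THRESHOLD = 0.05  # assumption: smaller drops count as "holds"


def delta(before, after, digits=3):
    """Signed change, rounded to tame floating-point noise."""
    return round(after - before, digits)


cliff = delta(OVERALL["100K"], OVERALL["500K"])
plateau = delta(OVERALL["500K"], OVERALL["1M"])

ability_drops = {
    name: delta(full, retrieved, 2) for name, (full, retrieved) in ABILITIES.items()
}
needle_dependent = [
    name for name, drop in ability_drops.items() if drop < -DROP_THRESHOLD
]

print(f"cliff 100K->500K: {cliff:+.3f}")
print(f"plateau 500K->1M: {plateau:+.3f}")
print("needle-dependent:", needle_dependent)
```

The first step is roughly ten times the size of the second, and summarization is the only ability that does not move across the regime change.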
| Model | Tau2 | Recall @ n=5 | Recall @ n=20 | Hit-rate @ n=5 |
|---|---|---|---|---|
| gemma4:e4b | baseline | 0.417 | 0.417 | 57% |
| qwen3.5:4b | +37.7pp | 0.717 | 0.750 | 93% |
| delta | +37.7pp | +30.0pp | +33.3pp | +36pp |
The published Tau2 tool-use benchmark gap between qwen3.5:4b and gemma4:e4b is +37.7 points in qwen's favour. On our independent 30-question Cat 9a-shaped RLM experiment, the empirical recall gap landed at +30.0pp at n=5 and +33.3pp at n=20, so Tau2 overshoots by 7.7pp and 4.4pp respectively and predicts the gap to within ~8pp. Tau2 is therefore a useful prior when picking an orchestrator model for any RLM-as-tool-use experiment, and a much stronger predictor than parameter count.
| SME Category | LME type | n | R@5 | QA-acc |
|---|---|---|---|---|
| cat_1 | single-session IE | 150 | 100.0% | 51.33% |
| cat_1_negative | abstention | 30 | 96.67% | 90.00% |
| cat_2c | multi-session | 121 | 98.35% | 74.38% |
| cat_3_partial | knowledge-update | 72 | 100.0% | 65.28% |
| cat_6 | temporal-reasoning | 127 | 90.55% | 40.94% |
| overall | — | 500 | 97.00% | 58.60% |
The headline flips once the R@5 matcher is correct: the daemon finds the gold session in the top 5 97% of the time. The 38-point R@5→QA gap looked like a reader failure, but the true-oracle correction (see the comparison card) showed it is substrate: “R@5 found the session” doesn't mean the gold content reached the reader (limit=5 + chunking left it out), and with the gold actually present QA rises to a 0.868 ceiling. cat_6 (temporal) has both the lowest R@5 (90.55%) and the worst QA (40.94%): its evidence is the most fragmented across the limited context. cat_1_negative's QA (90%) is the highest of any category and is largely decoupled from retrieval, because a correct abstention is the right answer regardless of what was retrieved.
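As a quick consistency check on the table above, the overall row can be recomputed as an n-weighted mean of the per-category rows. This is a minimal sketch: the numbers are copied from the table, while the `rows` layout and the `weighted_mean` helper are illustrative names, not part of the SME harness.

```python
# Per-category rows copied from the table above: (n, R@5 %, QA-acc %).
rows = {
    "cat_1": (150, 100.0, 51.33),
    "cat_1_negative": (30, 96.67, 90.00),
    "cat_2c": (121, 98.35, 74.38),
    "cat_3_partial": (72, 100.0, 65.28),
    "cat_6": (127, 90.55, 40.94),
}


def weighted_mean(rows, column):
    """n-weighted mean of one metric column (0 = R@5, 1 = QA-acc)."""
    total_n = sum(n for n, *_ in rows.values())
    return sum(n * metrics[column] for n, *metrics in rows.values()) / total_n


overall_r5 = weighted_mean(rows, 0)
overall_qa = weighted_mean(rows, 1)
gap_pp = overall_r5 - overall_qa  # the R@5 -> QA gap, in percentage points

print(f"R@5 {overall_r5:.2f}%  QA {overall_qa:.2f}%  gap {gap_pp:.1f}pp")
```

The weighted means land on the reported 97.00% and 58.60%, and their difference is the 38-point gap the paragraph discusses.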
<parent>_chunk_NNNNNN sub-drawers
and /search returns the chunk IDs, but the matcher
compared them exact-string against the parent IDs we stored at
ingest, so every hit read as a miss. techempower-org/multipass-structural-memory-eval#98
strips the suffix before comparing. The 97% is computed by
re-scoring the existing 2026-05-28 rerun records — no new
bench compute, same retrieved sets, correct matcher. QA-acc is
unchanged by the fix (it never depended on the matcher).
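The matcher bug described above can be sketched in a few lines. This is an illustration under stated assumptions, not the code from #98: it assumes chunk IDs have the shape `<parent>_chunk_` followed by digits, and the function names (`parent_id`, `hit_at_k`, `exact_hit_at_k`) and the sample IDs are hypothetical.

```python
import re

# Assumption: chunk IDs look like "<parent>_chunk_000123" (digits after "_chunk_").
_CHUNK_SUFFIX = re.compile(r"_chunk_\d+$")


def parent_id(drawer_id: str) -> str:
    """Map a chunk ID back to the parent drawer ID; parent IDs pass through unchanged."""
    return _CHUNK_SUFFIX.sub("", drawer_id)


def exact_hit_at_k(retrieved_ids, gold_parent_ids, k=5):
    """The buggy comparison: chunk IDs never equal parent IDs, so hits read as misses."""
    gold = set(gold_parent_ids)
    return any(rid in gold for rid in retrieved_ids[:k])


def hit_at_k(retrieved_ids, gold_parent_ids, k=5):
    """The corrected comparison: strip the chunk suffix before matching."""
    gold = set(gold_parent_ids)
    return any(parent_id(rid) in gold for rid in retrieved_ids[:k])


# Hypothetical IDs for illustration only.
retrieved = ["sess_42_chunk_000003", "sess_07_chunk_000001"]
gold = ["sess_42"]

print(exact_hit_at_k(retrieved, gold))  # False: the hit is scored as a miss
print(hit_at_k(retrieved, gold))        # True: same retrieved set, correct matcher
```

Because only the comparison changes, existing run records can be re-scored without re-running retrieval, which is how the 97% figure was obtained.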
age-fused vs /search default: QA −41.2pp, context per query 5.5× less.
| SME Category | n | R@5 age-fused | QA default | QA age-fused |
|---|---|---|---|---|
| cat_1 | 150 | 100.0% | 51.33% | 9.33% |
| cat_1_negative | 30 | 93.33% | 90.00% | 100.00% |
| cat_2c | 121 | 98.35% | 74.38% | 0.00% |
| cat_3_partial | 72 | 100.0% | 65.28% | 1.39% |
| cat_6 | 127 | 64.57% | 40.94% | 33.07% |
| overall | 500 | 90.20% | 58.60% | 17.40% |
The corrected matcher (techempower-org/multipass-structural-memory-eval#98)
turns this card from a mystery into a clean finding.
/search/age-fused retrieves the gold session at
R@5 = 90.2% — almost as well as the
default endpoint. Yet QA accuracy is 17.4%, a 73-point
R@5→QA gap. The cause is the snippet width: age-fused
returns ~457 chars of context per query (5.5× less than
default's 2539). The gold session is in the results, but
the reader is handed too thin a slice of it to answer.
cat_2c (multi-session) collapses to 0% QA at 98% R@5 — the
starkest illustration: every gold session retrieved, none
answerable from the snippet. cat_1_negative rises to 100%
because thin context makes the reader abstain, which is correct
for unanswerable questions.
| SME Category | n | R@5 #44 | R@5 #45 | R@5 #46 |
|---|---|---|---|---|
| cat_1 | 150 | 100.0% | 100.0% | 38.00% |
| cat_1_negative | 30 | 96.67% | 93.33% | 6.67% |
| cat_2c | 121 | 98.35% | 98.35% | 26.45% |
| cat_3_partial | 72 | 100.0% | 100.0% | 34.72% |
| cat_6 | 127 | 90.55% | 64.57% | 21.26% |
| overall R@5 | 500 | 97.00% | 90.20% | 28.60% |
Familiar.realm.watch is a separate memory system in this household — same corpus, different retrieval and reader architecture. The second-adapter discipline is the point: a single reading is a single corpus, and brittle defaults hide on any single corpus. With the corrected matcher (techempower-org/multipass-structural-memory-eval#98) the finding inverts from an earlier reading: Familiar's R@5 (28.6%) is the lowest of the three legs, not the highest. Its retrieval is genuinely weaker than daemon-direct's 97%. And unlike the daemon legs, Familiar's QA (31.0%) slightly exceeds its R@5 — the reader does fine with what it gets; the limit is what reaches it. A different shape from #44/#45, where retrieval finds the session but — as the true-oracle correction later showed — the gold content still often didn't reach the reader. Both point the same way: the limit is what reaches the reader, not its reasoning.
| In-domain stack (LongMemEval-S, n=500) | R@1 | R@5 | R@10 |
|---|---|---|---|
| MemPal raw default | 0.806 | 0.966 | 0.982 |
| + adaptmem FT-300 | 0.862 | 0.980 | 0.994 |
| + hybrid_v4 + FT-300 | 0.916 | 0.990 | 0.998 |
| katana FT-300 repro (held-out 200) | 0.925 | 1.000 | 1.000 |
| Cross-domain (jp-realm-v0.1, covered n=29) | R@1 | R@5 | R@10 |
|---|---|---|---|
| base (all-MiniLM-L6-v2) | 0.3448 | 0.5172 | 0.6207 |
| FT-300 (MNR fine-tune) | 0.3621 | 0.5172 | 0.6034 |
| delta | +1.73pp | 0.00 | −1.73pp |
Domain-adaptive encoder fine-tuning is a real retrieval
lift in-domain. On LongMemEval-S through MemPalace's own
bench, adaptmem's FT-300 moves R@5 0.966→0.980
and R@1 0.806→0.862; stacked with hybrid_v4
retrieval the gains compose to R@1 0.916 / R@5 0.990
— encoder fine-tune (layer 2) and hybrid retrieval
(layer 3) operate on different failure modes and add
independent lift. Our own katana FT-300 reproduction hit
R@5 = 1.000 on the held-out 200. So the encoder
is not useless — quite the opposite.
What our jp-realm reading shows is narrower: the lift
does not transfer cross-domain. A FT-300
encoder trained on conversational memory, dropped onto JP's
personal KB, gives best delta +1.73pp at R@1
(R@5 flat) against a predicted +30–33pp — a clean
cross-domain null, exactly the open question the orthogonal-layers
note flagged. And it doesn't change the end-to-end picture
either: on oracle LongMemEval retrieval is already
~0.974 while the reader leaves a ~45pp
gap (techempower-org/multipass-structural-memory-eval#116).
Encoder tuning is a legitimate retrieval improvement, but
it is not the lever for end-to-end QA — what reaches the reader is.
longmemeval_bench.py with a
monkey-patched encoder swap (same dataset, same encoder family,
zero changes to eval logic) and posted to
MemPalace/mempalace
discussion #1249. The raw-baseline R@5 of 0.966 exactly
reproduces MemPalace's published number. The orthogonality finding
(fine-tune and hybrid retrieval compose) is the durable claim;
the absolute percentages are attributed to that thread. Framing and
layer model: SME's
docs/research/adaptmem-orthogonal-layers.md.
On a representative, category-stratified sample, age-fusion
shows NO significant R@5 gain over plain
/search (Δ = −1 question of 150; R@1
+2 questions). The +2.0pp R@5 “win” seen on an
earlier n=100 first-100 slice did not replicate —
that slice was single-session-dominated (the S corpus is
question_type-sorted; fixed via --stratify-by in
techempower-org/multipass-structural-memory-eval#122).
Per-category (n=25 each, ±1-question noise →
directional hypothesis only): age-fusion helps
temporal-reasoning + knowledge-update, hurts single/multi-session
recall.
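The slicing problem above is easy to reproduce on a toy corpus. The sketch below is not the `--stratify-by` implementation; `stratified_sample` is a hypothetical helper, and the toy corpus (100 questions per type, sorted by type) only mimics the sorted layout the text describes.

```python
import random
from collections import defaultdict


def stratified_sample(questions, key, per_group, seed=0):
    """Draw up to `per_group` questions from each value of `key`, reproducibly."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for q in questions:
        groups[q[key]].append(q)
    sample = []
    for name in sorted(groups):  # fixed group order keeps the draw deterministic
        members = groups[name]
        sample.extend(rng.sample(members, min(per_group, len(members))))
    return sample


# Toy corpus sorted by question_type (sizes are illustrative, not the real counts).
types = [
    "single-session-user",
    "single-session-assistant",
    "single-session-preference",
    "multi-session",
    "knowledge-update",
    "temporal-reasoning",
]
corpus = [{"id": f"{t}-{i}", "question_type": t} for t in types for i in range(100)]

first_100 = corpus[:100]  # a first-N slice of a sorted corpus sees a single type
stratified = stratified_sample(corpus, "question_type", per_group=25)

print({q["question_type"] for q in first_100})
print(len(stratified))
```

With six types at 25 each the stratified sample has 150 questions, matching the n=150 / n=25-per-category shape reported above, whereas the first-100 slice contains one question type only.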
| Endpoint | o4-mini QA | gpt-5.3-chat QA |
|---|---|---|
| search-default | 50.4% | 52.2% |
| age-fused | 43.2% | 46.6% |
With the search-default context pinned (R@5 0.974), both readers answer correctly only ~50% — a ~45pp R@5→QA gap. Correction (see the comparison card): this gap is substrate, not the reader. The true-oracle test later showed that “R@5 found the session” does not mean the gold content reached the reader — limit=5 + chunking (and, for assistant-authored answers, the user-only ingest) often left the answer out of the context entirely. Hand the reader the gold verbatim and QA rises to a 0.868 ceiling, near GPT-4o's oracle. So the ~50% reflects what reached the reader, not a reasoning limit. Age-fused context is harder still (QA 43–47%). Self-judge caveat: gpt-5.3-chat judges its own family on its own reader leg, which may inflate it.
| Category | Opus QA-acc |
|---|---|
| single-session-user | 84% |
| temporal-reasoning | 48% |
| knowledge-update | 48% |
| multi-session | 44% |
| single-session-assistant | 12% |
| single-session-preference | 0% |
Opus 4.8 — the strongest reader — scored the worst overall (39.3%). But it is not a capability gap: Opus is the best reader on well-posed direct recall (single-session-user 84%, the highest of any reader on any category) and collapses only on mis-specified categories (preference 0%, assistant 12%). It is penalized for following the baseline prompt literally (over-abstaining where the gold answer is an inferred preference) and for thoroughness (reporting both old+new values → judged PARTIAL; answers 2× longer). The 45pp gap is prompt + judge design, not reader capability. Both follow-ups are now published below: the prompt-axis fix and the Opus-as-judge re-scoring.
| Reader | baseline | committed | preference |
|---|---|---|---|
| o4-mini | 0.420 | 0.507 | 0.527 |
| claude-opus-4-8 | 0.360 | 0.473 | 0.593 |
“Opus is the worst reader” was a prompt artifact. With the preference-tuned prompt, Opus goes from worst (0.360) to best of any config (0.593) — a +23pp swing from changing nothing but the prompt. The entire gap lived in one category: single-session-preference, where Opus scored 0.04 under the baseline prompt (it refused to make a recommendation — the “say I don’t know” instruction over-fired on inference questions) and 0.76 under the preference prompt (+72pp). That single category drags the overall from 0.36 to 0.59. This is the reader-harness fix; the judge is still gpt-5.3-chat (not the LongMemEval-canonical type-specific judge), so 0.593 is a floor — the canonical-judge work compounds on top. Bottom line: the #116 reader gap is prompt design, not model capability.
| Run | R@5 | R@10 | MRR base | MRR rerank | Δ MRR |
|---|---|---|---|---|---|
| #2 (confirming) | 0.909→0.909 | 1.00 | 0.748 | 0.921 | +23.1% |
| #1 (first) | 1.00→0.909 | 1.00 | 0.761 | 0.877 | +15.3% |
| movement | rerank-spike 7→1 (+6) · fallback-contract 4→1 · 7 no-change | 1 regr. | |||
A clean ordering-only A/B — retrieval is held constant, only the final sort changes (distance vs FlashRank score) on the same candidate pool. MRR lifts +15.3% (run #1) and +23.1% (run #2), driven by a buried answer rescued from rank 7 to rank 1 plus two smaller promotions; 7 of 11 queries were already optimal and rerank correctly left them alone. R@5 is a wash and R@10 is untouched, so MRR is the load-bearing metric for this known-item set. Rerank latency stayed at 47 ms mean (126 ms under host load), worst single request 557 ms — well inside a 1 s budget on CPU. Verdict: keep nano, A/B a larger model next.
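To see why MRR moves while R@10 does not in an ordering-only A/B, here is a minimal sketch of both metrics over 1-based ranks. The rank lists are toy values for an 11-query set, chosen only to illustrate the shape of the effect; they are not the recorded per-query data, and the helper names are hypothetical.

```python
def mean_reciprocal_rank(ranks):
    """MRR over 1-based ranks of the first relevant hit (None = not found)."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)


def recall_at_k(ranks, k):
    """Fraction of queries whose first relevant hit is at rank <= k."""
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)


# Toy ranks (illustrative only): seven queries already at rank 1,
# one answer buried at rank 7, a few mid-rank hits.
base_ranks = [1, 1, 1, 1, 1, 1, 1, 7, 4, 2, 3]
# Reranking the same candidate pool: two promotions to rank 1, one demotion.
rerank_ranks = [1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 8]

# Contribution of a single 7 -> 1 rescue to MRR on an 11-query set.
rescue_gain = (1 / 1 - 1 / 7) / len(base_ranks)

print(mean_reciprocal_rank(base_ranks), mean_reciprocal_rank(rerank_ranks))
print(recall_at_k(base_ranks, 10), recall_at_k(rerank_ranks, 10))
print(rescue_gain)
```

On a set this small a single 7→1 rescue is worth roughly 0.078 MRR by itself, which is why one buried answer can dominate the headline lift, while R@10 cannot change as long as reranking only reorders hits that were already inside the top 10.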
daemon-deploy-arch moved 3→8 in both runs. The
top 7 reranked passages all scored 0.9971–0.9994
(a 0.002 spread) and every one was genuinely on-topic — when
7 passages are all relevant and scored within 0.002, head ordering
is a coin-flip. This is the distilled 2-layer cross-encoder's
score-compression failure mode and the strongest case for A/B-testing
a larger model (ms-marco-MiniLM-L-12-v2; this FlashRank
build ships no L-6). Cross-validated: replaying frozen pools in
--mode candidates reproduced run #1's metrics exactly.
| config | R@5 | R@10 | MRR |
|---|---|---|---|
| vector / convex (default wts) | 0.917 | 1.000 | 0.808 |
| union / convex (default wts) | 1.000 | 1.000 | 0.785 |
| hybrid / convex (default 0.6/0.4) | 1.000 | 1.000 | 0.785 |
| hybrid / convex (0.85/0.15) | 1.000 | 1.000 | 0.833 |
The #111 acceptance criterion (a documented hybrid weight achieving R@5 ≈ 1.000 without regressing MRR vs union/vector) is met by vector_weight 0.85 / bm25_weight 0.15. The default 0.6/0.4 hybrid had regressed MRR to 0.785, the same as union and below the 0.808 vector-only floor; 0.85/0.15 lifts it to 0.833, which is +2.5pp over that floor and +4.8pp over the default-hybrid regression, while R@5 stays pinned at 1.000. MRR turns out to be non-monotonic in vector_weight: the earlier “graph promoted one query and demoted another” reading was wrong; the real mechanism is the convex blend over-weighting BM25 at 0.4. The weight ships as an env knob (mempalace#342), but the deployed default stays 0.6/0.4 because n=12 is below the n≥25 bar to flip a production default.
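The over-weighting mechanism can be illustrated with a small convex-blend sketch. The parameter names `vector_weight` and `bm25_weight` come from the text; everything else is an assumption for illustration, including the min-max normalisation, the helper names, and the toy scores. The daemon's actual fusion code may normalise differently.

```python
def min_max(scores):
    """Rescale a non-empty {id: score} map to [0, 1] (assumed normalisation)."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 1.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}


def convex_fuse(vector_scores, bm25_scores, vector_weight=0.6, bm25_weight=0.4):
    """Rank IDs by a convex blend of normalised vector and BM25 scores."""
    assert abs(vector_weight + bm25_weight - 1.0) < 1e-9, "weights must sum to 1"
    v, b = min_max(vector_scores), min_max(bm25_scores)
    fused = {
        i: vector_weight * v.get(i, 0.0) + bm25_weight * b.get(i, 0.0)
        for i in set(v) | set(b)
    }
    # Sort by descending fused score; break ties by ID for determinism.
    return sorted(fused, key=lambda i: (-fused[i], i))


# Toy scores: "gold" is the best semantic match, "kw" is a keyword-heavy distractor.
vector_scores = {"gold": 0.90, "kw": 0.70, "other": 0.30}
bm25_scores = {"gold": 1.0, "kw": 9.0, "other": 1.0}

print(convex_fuse(vector_scores, bm25_scores, 0.6, 0.4))    # distractor outranks gold
print(convex_fuse(vector_scores, bm25_scores, 0.85, 0.15))  # gold back on top
```

In this toy case the gold item stays inside the top 5 under both weightings, so R@5 is unaffected, but at 0.4 the BM25 term is large enough to push a keyword-heavy distractor above it, which is the kind of rank-1 loss that shows up as an MRR regression.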
hybrid scores identically to union
(R@5 1.000 / MRR 0.785 at default weights) — the lift comes from
BM25 pool-widening, not graph traversal. That joins the
Cat 2c/7 graph-RRF tax and the
#91 age-fusion R@5 null as a third independent reading
where the AGE graph half adds ~nothing to top-5 retrieval on
drawer-query workloads (it earns its keep on entity-anchored queries).
| leg | MRR | R@5 | R@10 | found | p50 |
|---|---|---|---|---|---|
| rerank OFF | 0.299 | 0.510 | 0.60 | 120/200 | 555 ms |
| TinyBERT-L-2-v2 (daemon default) | 0.293 | 0.515 | 0.60 | 120/200 | 475 ms |
| MiniLM-L-12-v2 (bigger) | 0.284 | 0.505 | 0.60 | 120/200 | 1523 ms |
This is the SME-side rerank A/B — distinct from the
palace-daemon FlashRank known-item A/B above (a 12-query run that found a
+15–23% MRR lift on a different, answer-buried query shape). On a
corpus-seeded scratch daemon — the git/docs targets
re-ingested so the relevant set is actually present — the verdict is
the opposite: cross-encoder rerank is neutral-to-slightly-negative.
R@10 is identical at 0.60 across all three legs (rerank only
reorders the top-K; it cannot recall what vector retrieval missed), MRR is
slightly hurt by rerank (0.299→0.293), and the bigger
MiniLM-L-12-v2 is both worse on MRR and 3× the
latency (1523 ms vs 475 ms p50). Recommendation:
keep rerank OFF / opt-in, do not promote the bigger model.
This is the 4th independent reading that the
structural/rerank layer doesn’t lift retrieval — the vector
backbone is the lever (with the age-fusion null ×3: Cat 2c/7 graph-RRF
tax, #91 age-fusion R@5, and the #111 graph-leg-inert finding).
(ms-marco-MiniLM-L-6-v2
was never in the FlashRank build; #225’s rerank leg actually ran
TinyBERT-L-2-v2, the same nano model that is the daemon default.)
| Room | n | p10 | median | p90 |
|---|---|---|---|---|
| references | 144 | 0.173 | 0.636 | 0.781 |
| architecture | 40 | 0.363 | 0.649 | 0.776 |
| discoveries | 40 | 0.126 | 0.304 | 0.705 |
| planning | 32 | 0.496 | 0.677 | 0.821 |
| problems | 24 | 0.215 | 0.637 | 0.791 |
The gzip-NCD novelty_score on the live corpus is
continuous and right-skewed — not bimodal:
a broad novel hump across 0.55–0.85, a long redundant tail
below ~0.20, and the middle band populated throughout with no empty
valley. Per-room baselines diverge by ~0.37 in median
(discoveries 0.30 vs planning 0.68), so a
single global NCD cut mislabels content; the calibration recommends
per-room percentile thresholds — redundant at
p15(room), novel at p60(room).
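A minimal sketch of the two ingredients above: a gzip-based normalized compression distance, using the standard formula NCD(x, y) = (C(xy) − min(C(x), C(y))) / max(C(x), C(y)), and per-room percentile cuts. The helper names, the linear-interpolation percentile, and the toy score lists are assumptions; how the daemon aggregates pairwise NCD into `novelty_score` is not specified here and is not modelled.

```python
import gzip


def ncd(a: str, b: str) -> float:
    """Normalized compression distance between two texts, using gzip as the compressor."""
    ca = len(gzip.compress(a.encode("utf-8")))
    cb = len(gzip.compress(b.encode("utf-8")))
    cab = len(gzip.compress((a + b).encode("utf-8")))
    return (cab - min(ca, cb)) / max(ca, cb)


def percentile(values, p):
    """Linear-interpolated percentile of a non-empty list, p in [0, 100]."""
    xs = sorted(values)
    pos = (len(xs) - 1) * p / 100
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def room_thresholds(scores_by_room, redundant_p=15, novel_p=60):
    """Per-room (redundant_cut, novel_cut) pairs from each room's own distribution."""
    return {
        room: (percentile(s, redundant_p), percentile(s, novel_p))
        for room, s in scores_by_room.items()
    }


def label(score, room, thresholds):
    redundant_cut, novel_cut = thresholds[room]
    if score <= redundant_cut:
        return "redundant"
    if score >= novel_cut:
        return "novel"
    return "middle"


# Toy distributions (illustrative): a low-baseline room and a high-baseline room.
scores_by_room = {
    "discoveries": [0.10, 0.15, 0.20, 0.30, 0.35, 0.50, 0.70],
    "planning": [0.45, 0.55, 0.62, 0.68, 0.72, 0.78, 0.82],
}
thresholds = room_thresholds(scores_by_room)

print(label(0.50, "discoveries", thresholds))
print(label(0.50, "planning", thresholds))
```

The same score of 0.50 counts as novel in the low-baseline room and redundant in the high-baseline one, which is the mislabelling a single global cut would cause.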
novelty_score=1.0 — the numbers above were captured
after the fix), and content_preview is truncated
to ~200 chars, so the write path still scores against truncated
neighbours (NCD on short prefixes is noisier and biased high). The
calibration script computes the full-content distribution by default.
| Reader (answers fixed) | QA gpt5.3-judge | QA Opus-judge | Δ |
|---|---|---|---|
| claude-opus-4-8 | 0.393 | 0.420 | +0.027 |
| o4-mini | 0.467 | 0.480 | +0.013 |
| gpt-5.3-chat | 0.447 | 0.480 | +0.033 |
To isolate the judge variable, the identical Pass B reader answers were re-graded with Opus-4.8 as judge (the original judge was gpt-5.3-chat). Every reader lifts modestly — +1–3pp — mostly by rescuing single-session-preference questions (the Opus reader's preference category moves 0.00→0.12). But the Opus judge does not change the ranking and does not rescue the Opus reader, which stays the lowest at 0.420. So the judge confound is real but small: the dominant factor in the reader gap is prompt design, not judge strictness. The prompt-fix sweep is published — see the prompt-axis card above.
| Question type | /search Δ | age-fused Δ |
|---|---|---|
| single-session-preference | +20.0pp | +26.7pp |
| single-session-user | +0.0pp | +1.4pp |
| single-session-assistant | +0.0pp | +7.1pp |
| multi-session | −3.8pp | −6.0pp |
| knowledge-update | −5.1pp | −9.0pp |
| temporal-reasoning | −2.3pp | −2.3pp |
| OVERALL (harness) | 0.510 | 0.456 |
| OVERALL (+abstention) | 0.562 | 0.510 |
The companion card swaps the judge model (Opus-as-judge) and finds it doesn't help. This card swaps the judge prompts: porting LongMemEval's verbatim type-specific templates — in particular the rubric-based preference template the old paraphrased judge lacked. That un-collapses single-session-preference (+20.0pp / +26.7pp, the largest single-category move) and removes the spurious-ABSTAIN noise (the old judge emitted 34 ABSTAIN labels, 21 on non-abstention temporal questions; the canonical binary judge can't). The stricter, more faithful templates pull KU and multi-session down a few points — that is the old judge having over-credited via a softer rubric, not a regression. The headline: the overall did NOT move toward the published 87% oracle — corrected, abstention-credited it is 0.562 / 0.510. The judge-prompt confound was real (preference collapse and label noise were genuine measurement artifacts) but it is not the bulk of the 35pp gap. The residual ~31–36pp is reader/substrate, not the scorer — and the true-oracle test (see the comparison card) later resolved which: it is substrate (what reaches the reader), not the reader's reasoning.
Disclosure. Judge is gpt-5.3-chat with canonical prompts, not the canonical gpt-4o-2024-08-06 snapshot (not deployed on this resource) — the fix is the prompts, not the model. 2 ERROR rows per run from an Azure content-filter trip (graceful retry-then-ERROR by design; counted as wrong, a 0.4% asymmetry vs the confounded denominator). Absolute numbers are understated by the aggregator’s ABSTAIN dead-code bug (#148) — it never credits a correct ABSTAIN — but that depresses both runs identically, so the deltas are valid.
The methodology is load-bearing. The headline number changes less than the condition that produced it — so the condition is what gets named, versioned, and reported.
A. System under test, default config.
B. System with the structural layer disabled
(graph-off, registry-off, etc.).
C. System with retrieval replaced by oracle
retrieval — isolates reader from retriever.
D. Karpathy baselines: full-context
(D1, whole corpus in prompt) and karpathy-compiled
(D2, LLM-compiled wiki). If your structure can't beat D1, the
structure isn't earning its complexity.
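One way to read the four conditions together is as a small set of named deltas. The sketch below is illustrative only: the `ConditionScores` container, the delta names, and the placeholder scores are hypothetical and are not taken from the SME codebase or from any run reported on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConditionScores:
    """One metric (e.g. QA accuracy) measured under each condition."""
    a_default: float            # A: system under test, default config
    b_structure_off: float      # B: structural layer disabled
    c_oracle_retrieval: float   # C: retrieval replaced by oracle retrieval
    d1_full_context: float      # D1: whole corpus in prompt
    d2_compiled_wiki: float     # D2: LLM-compiled wiki


def deltas(s: ConditionScores) -> dict:
    """Paired differences; each isolates one question about the system."""
    return {
        "structural_lift": s.a_default - s.b_structure_off,
        "retrieval_headroom": s.c_oracle_retrieval - s.a_default,
        "vs_full_context": s.a_default - s.d1_full_context,
        "vs_compiled_wiki": s.a_default - s.d2_compiled_wiki,
    }


# Placeholder scores, purely to show the shape of the report.
example = ConditionScores(0.60, 0.58, 0.85, 0.65, 0.55)
report = deltas(example)
earns_complexity = report["vs_full_context"] > 0

for name, value in report.items():
    print(f"{name}: {value:+.2f}")
print("structure beats D1:", earns_complexity)
```

Reporting the deltas alongside the named conditions keeps the condition, rather than the headline number, as the unit that gets versioned.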
SME ships two evaluation corpora — jp-realm-v0.1
(a personal-knowledge corpus with adversarial entity overlap)
and good-dog-corpus (24 notes across 6 domains
with pre-authored questions). External corpora — LongMemEval,
LoCoMo, MINE — integrate through the same adapter ABC. A
single-corpus reading is a sketch; the load-bearing readings
come from the same diagnostic run across multiple corpus shapes.
Cross-system absolute rankings are out of scope. Two systems with different ontologies, different corpora, or different retrieval conditions produce readings that are not directly comparable. SME supports controlled cross-system runs when corpora and ontologies are matched, but treats unconditioned "system X scores higher than system Y" claims as confounded.
Generalisation of deltas across corpus scale, ontology design quality, operator workflow beyond the diagnostic report, live agentic memory dynamics (read-after-write, JEPSEN-shaped questions — Cat 10 in the backlog), and human-judgment calibration without explicit calibration runs. Naming the boundaries is part of the framing — readings that fall outside SME's scope are invitations to reach for a complementary tool, not failures of the framework.
The story of SME against MemPalace is a sequence of paired readings — each one a structural answer to the last one's surprise. Below: the load-bearing ones, with the date each was confirmed.
MemPalaceDaemonAdapter targets palace-daemon on
the postgres + pgvector + AGE backend — the production
retrieval path. Live daemon at familiar.jphe.in:8085
(migrated from disks circa 2026-05-24, both
palace-daemon and postgres moved together).
c3e204e.)
jphein/ to
techempower-org/. PRs from the fork target
M0nkeyFl0wer/multipass-structural-memory-eval
(the canonical upstream, maintained by Ben West). The
MemPalace fork moves the same week. Issue references use
full org/repo#N form on both sides of the fork
boundary.
/backfill-age reports 100% on the entity pass
(142,315 entities in ~61 min, zero errors) but the
relationship layer is effectively empty (triples: 1
per kg_stats). /search/age-fused
works structurally but the graph half of the RRF fusion has
nothing to contribute. Implication: an A/B that compares
vector-only against age-fused right now measures capacity
for lift, not realised lift.
Every category in SME borrows from work done elsewhere. The diagnostic framing only works because the systems it points at and the benchmarks it borrows from are open in the first place.
Canonical upstream of the framework. SME is named, scoped, and maintained here; the fork at techempower-org submits PRs back.
M0nkeyFl0wer/multipass-structural-memory-eval
The techempower-org fork: bench harness, daemon adapter, A/B leg infrastructure, content-rules loader. Carries the readings on this page.
techempower-org/multipass-structural-memory-eval
The memory palace for AI: verbatim storage, local-first, permanent. SME's primary test target and the conversation partner for much of the structural diagnostic methodology.
MemPalace/mempalace
The MemPalace fork served by palace-daemon in this household. Carries the canonicalisation, hybrid fusion, and KG-extraction changes the readings exercise.
techempower-org/mempalace
The HTTP daemon over MemPalace. Backs the mempalace-daemon
adapter, ships the /search/age-fused endpoint
that the #45 leg exercises, and serves the live retrieval
path on this homelab.
The second-corpus discipline of SME requires more than one system under test. Familiar is the household-native memory system the #46 leg exercises through the same harness.
techempower-org/familiar.realm.watch
Wu et al. (ICLR 2025) — 500 curated questions across five
memory abilities. SME borrows the corpus and the GPT-4o judge
methodology (>97% human agreement) for the
cat_1 family. MIT licensed.
nakata-app's domain-adaptive fine-tune recipe for memory encoders. v0.6 set the FT-300 ceiling on LongMemEval; v0.7 unlocks BGE-large training on T4-class hardware via cached MNRL + gradient accumulation. SME measures the lift; AdaptMem builds the recipe.
nakata-app/adaptmem