Structural Memory Evaluation

What does a memory system know about itself?

A diagnostic framework for memory systems — RAG, knowledge graphs, personal knowledge bases, conversational memory — that tests what the system knows about its own structure, not just whether it can retrieve. Findings are deltas under controlled conditions, not absolute leaderboard scores.

Categories
9 diagnostic surfaces
Adapters
10 backends
Posture
before/after deltas
0 the conclusion
what the benchmarks say

A competitive memory system — with the headroom in its own hands.

MemPalace, on the Postgres + pgvector + Apache AGE fork, is a competitive and honestly measured memory system, and SME's job here was to find where it stands and where the work is. The verdict: it retrieves with the field, and a frontier-grade answer is already reachable on this stack. The gap to that ceiling isn't the language model's reasoning; it's how faithfully the memory system delivers the right context into the prompt. The encouraging part: that lever (ingestion fidelity and retrieval breadth, not a bigger model) is in MemPalace's own hands.

the substrate
0.927 / 0.966 R@5

Retrieval is competitive: R@5 0.927 on representative LongMemEval-S, and 0.966 in byte-identical parity with upstream ChromaDB. The move to Postgres + pgvector + Apache AGE cost zero recall — MemPalace finds the right memory.

the frontier
0.868 oracle QA

Handed the right memory (true oracle), the reader answers at 0.868 — within 0.2pp of the field's GPT-4o oracle (0.870). A frontier-grade answer is already reachable on this stack.

the lever
Memory delivery

The gap to that ceiling is what reaches the reader, not the model's reasoning: ~7pp ingestion fidelity + ~18pp retrieval breadth, ~0pp genuine reader. It's the memory system's to close — and it's MemPalace's to build.

the graph
1.87M live triples

The Apache AGE knowledge graph is live and non-trivial: 1,873,489 cleaned triples. It surfaces drawers that vector + keyword search miss on structural cases (the structure categories read it directly); broad-recall fusion and the product path are the next integration steps.

what's unique
Cat 4 / 5 / 8 / 9

SME measures structural surfaces no leaderboard captures: ingestion integrity, gap-detection, ontology coherence, and invocation discipline — whether a system knows its own shape.

honestly measured
A null we trusted

The method caught its own error in public: a reader floor-lift that returned null exposed a substrate confound and turned an apparent 0.61 “reader ceiling” into the 0.87 finding. Diagnostic deltas, not a leaderboard rank.

the posture

These are diagnostic deltas under controlled conditions, not a leaderboard claim. Cross-system LongMemEval numbers aren't directly comparable (answer-model choice alone swings ~24pp; oracle QA is a different metric from full-pipeline E2E QA; most competitor numbers are self-reported), so we deliberately publish no full-pipeline ranking. The 0.868 is the reader's ceiling once retrieval is perfect; the deployed number sits below it, pending a production daemon re-test. The readings below show the work.

the synthesis
campaign synthesis · the definitive read

What the whole campaign found.

SME was driven across nine diagnostic categories, nine benched substrates, and a 33-system published field. Five findings carry the report. Every number traces to a committed baselines/ artifact, and we never blur measured (our harness) against claimed (a vendor self-report). The interactive matrix below is the data; this is the story it tells. Read the full synthesis →

finding 1 · the bottleneck
Reader, not retrieval

Oracle retrieval R@5 0.974, deployed 0.927 — yet QA tops out lower because the reader loses the points, not the retriever. Widening top-5→top-20 buys +17.3pp QA (0.567→0.740) then plateaus; the residual to the 0.868 ceiling is synthesis, not retrieval.

finding 2 · four nulls
The vector backbone is the lever

Age-fusion showed no significant gain on three corpora (two CI-confirmed null under FDR); the hybrid graph leg is inert (hybrid ≡ union, byte-identical); cross-encoder rerank is neutral-to-negative. The dense-vector + BM25 backbone already does the work — corroborated from outside: ai-memory (no graph) hits R@5 0.920, level with the full stack.

finding 3 · storage-equivalence
The engine isn’t the variable

ChromaDB → postgres+pgvector, holding embedding/corpus/reader/judge fixed: retrieval 0.833 == 0.833 and QA 0.392 ≈ 0.384 are statistically identical. CI-confirmed — the paired QA delta is [−2.0, +2.8]pp, 9/250 discordant, p_adj 0.84. The substrate carries the answer; the storage engine does not.

finding 4 · the framework catching itself
Artifact → real KG

A capped /graph projection reported Cat 4 “98.98% one edge type, entropy 0.020.” The real KG reads entropy 0.645, a 61.87% giant component, modularity 0.796 (hierarchy PASS). “Monoculture / fragmented / flat” → “diverse / connected / hierarchical.” SME found its own measurement bug, fixed it, cross-checked it three ways.

finding 5 · ontology-sensitivity
Cat 4 moves, Cat 5 holds

Re-typing one graph flat→moderate→fine moves Cat 4 entropy 0.000→0.842→0.856 — while Cat 5 topology (components, giant size, Betti) is byte-identical. So compare systems on Cat 5; report Cat 4 with its type-count. The same graph reads “monoculture” or “healthy” purely by ontology granularity.
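
A minimal sketch of why a re-typing can move Cat 4 while leaving Cat 5 untouched. It assumes Cat 4c is a Shannon entropy over edge-type counts normalised by the log of the number of types, and that Cat 5 counts weakly connected components; neither formula is stated on this page, and the toy graph and type names are hypothetical.

```python
import math
from collections import Counter

def normalized_entropy(labels):
    """Shannon entropy of the label mix, divided by log(number of distinct labels)."""
    counts = Counter(labels)
    if len(counts) <= 1:
        return 0.0  # a single edge type carries no diversity
    total = sum(counts.values())
    h = -sum((c / total) * math.log(c / total) for c in counts.values())
    return h / math.log(len(counts))

def component_sizes(nodes, pairs):
    """Sizes of weakly connected components (union-find); edge types are never consulted."""
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for u, v in pairs:
        parent[find(u)] = find(v)
    return sorted(Counter(find(v) for v in nodes).values(), reverse=True)

# Toy graph (hypothetical): four connected entities plus two isolates.
nodes = ["a", "b", "c", "d", "e", "f"]
pairs = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "d")]

# Three typings of the same four edges, from coarse to fine.
typings = {
    "flat": ["related", "related", "related", "related"],
    "moderate": ["knows", "knows", "knows", "works_on"],
    "fine": ["knows", "knows", "mentors", "works_on"],
}

readings = {
    name: (normalized_entropy(types), component_sizes(nodes, pairs))
    for name, types in typings.items()
}
for name, (entropy, sizes) in readings.items():
    print(f"{name:9s} entropy={entropy:.3f} components={sizes}")
```

The entropy column changes with every typing while the component sizes stay the same, because the topology pass never reads the labels. That is the reason to report an entropy reading together with its type count.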

the field · two-axis cost wall
$0 vs hours to bench

Cheap-to-bench needs both no write-time LLM and fast ingest: ai-memory is $0 (R@5 0.920). The wall has two independent causes — a write-time LLM (Mem0 ~18h, Hindsight ~150h) and slow ingest throughput (agentmemory is LLM-free yet ~15h-walled). “No write-time LLM” is necessary but not sufficient.

statistical rigor · two honest tiers

The campaign’s “±1 question = noise” language is replaced by paired bootstrap CIs (10k resamples) + Benjamini-Hochberg FDR correction.

Tier 1, CI-confirmed nulls: storage-equivalence and two of the four age-fusion nulls are true paired per-question A/Bs, every CI straddling zero (p_adj 0.84).

Tier 2, descriptive-only: the cross-system mempalace-vs-OMEGA R@5 (0.920 vs 0.900) and the remaining nulls are reported as point estimates with no CI. Their per-question metrics aren’t paired-comparable, and computing a CI would launder a methodology mismatch into a rigorous-looking number.

A framework that declines a tempting cross-system CI because the units don’t match is the measured-vs-claimed discipline enforced in code, not just prose.
Source · docs/research/2026-05-31-sme-campaign-synthesis.md (Nebula, #178) · every number traces to a committed baselines/ artifact (provenance index in the full doc) · the data this narrates is the interactive field explorer below
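
For readers unfamiliar with the two tools named in the statistical-rigor note, here is a small standard-library sketch of a paired percentile bootstrap and the Benjamini-Hochberg adjustment. It is an illustration under stated assumptions, not the harness's code: the function names, the seed, and the synthetic per-question data (250 questions, 9 discordant, split 5/4) are invented for the example, as are the four raw p-values.

```python
import random

def paired_bootstrap_ci(a, b, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile CI for mean(b - a), resampling whole questions with replacement."""
    if len(a) != len(b):
        raise ValueError("paired comparison needs one score per question in each arm")
    rng = random.Random(seed)
    diffs = [y - x for x, y in zip(a, b)]
    n = len(diffs)
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

def benjamini_hochberg(p_values):
    """BH-adjusted p-values, returned in the original input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Synthetic paired outcomes (1 = judged correct), NOT the campaign's data.
arm_a = [1] * 100 + [0] * 150
arm_b = list(arm_a)
for i in range(5):           # five questions only arm A gets right
    arm_b[i] = 0
for i in range(100, 104):    # four questions only arm B gets right
    arm_b[i] = 1

discordant = sum(x != y for x, y in zip(arm_a, arm_b))
lo, hi = paired_bootstrap_ci(arm_a, arm_b)
print(f"discordant={discordant}  95% CI for the paired delta: [{lo * 100:+.1f}, {hi * 100:+.1f}]pp")

# Hypothetical raw p-values from four comparisons.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
print([round(p, 4) for p in adjusted])
```

Because the resampling unit is the question, the interval is only meaningful when both arms were scored on the same questions with the same metric, which is the condition the Tier 2 comparisons do not meet.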
i the diagnostic posture
posture · not score

A diagnostic, not a benchmark.

Standard memory benchmarks ask "can you find a memory?" SME asks four further questions: is the system actually retrieving what it should, does its structural layer earn its complexity, does the graph match what the README says, and when it ships, does the model actually reach the memory?

001 · method

Multiple passes.

Multiple passes over every system under test, across multiple corpus shapes and multiple retrieval conditions (A / B / C / D), so brittle default behaviours that hide on any single pass become visible when readings are compared side by side.

002 · surface

Nine categories.

The Lookup (1), The Crossing (2), The Dissonance (3), The Threshold (4), The Missing Room (5), The Archive (6), The Abacus (7), The Blueprint (8), The Handshake (9). Each measures a structural property retrieval-only benchmarks don't reach.

003 · output

Deltas, not ranks.

The defensible reading shape is a before/after delta under matched conditions, or a within-system A/B/C/D ablation. Cross-system rankings carry an unmeasured confound — corpus, ontology, retrieval config. SME calls that out instead of papering over it.

ii readings
verified readings · as of 2026-05-31

The published readings.

Every number below is sourced from a JSON in baselines/. Where the source file isn't published yet, the card says so. No rounded estimates, no extrapolations.

Field explorer · every memory system the survey names × every metric we have

The full field — sort it, filter it, follow the receipts

interactive

Thirty-three memory systems, two kinds of number side by side: Published — what each system’s own team or a survey reports (self-reported, NOT our harness) — and SME multipass — the Cat 1–9 readings we measured ourselves. The two are kept in unmistakably separate column-groups: never read a published QA number as something we benched. Sort any column, filter by status / architecture / metric, and follow the magic links — every cell that has a source is clickable: our receipts (the baseline/doc that holds the reading) and their claims (the leaderboard or paper) are styled differently on purpose. Only 8 of the 33 carry a full SME column — that gap is the honest headline. For the narrative this data supports — the five campaign findings, the statistical rigor, and the cost-wall taxonomy — see the synthesis above.

Legend: heatmap: low→high; E = emergent; QA-def = verified, deferred; N/A · reason = measured-inapplicable (a finding); not run = wired, not benched; links: our receipt, their claim
Columns: System · Arch · Status · Published (self-reported, NOT our harness): LME QA, LoCoMo, BEAM-1M · SME multipass (we measured, Cat 1–9): 1, 2c, 3, 4, 5, 6, 7, 8, 9a, 9b
Click any column header to sort (numeric columns sort high→low first). Chips filter live + multi-select. The left edge of each system name is a status spine: refract = benched, amber = in-flight, grey = published-only. The vertical refract rule separates “published” from “what we measured” — the two groups are never the same claim.
the honesty line · published ≠ benched-by-us

The Published group is self-reported: each system’s own team or the landscape survey. We did not run those numbers. They mix metrics (R@K recall is not QA accuracy; Celiums proved 100% retrieval can be 62% QA), mix answer models (a ~24pp swing on the same benchmark between GPT-4.1 and GPT-4o-mini), and mix subsets/judges. The SME multipass group is the only apples-to-apples axis, and only 8 of 33 have one, because benching a system through Cat 1–9 is real work. Mem0 appears twice by design: the OSS package we benched, and the cloud platform-v3 number its team reports (92–94% vs the OSS 61–68%), “almost never made explicit,” per the survey.
Source · baselines/cross_system_multipass_matrix_2026-05-30.json (field_roster, cassia-2 #165) · published cells link to each system’s cited leaderboard/paper; SME cells link to our baseline/doc artifacts. Field survey: memorypalace/docs/research/2026-05-24-memory-system-benchmarks.md · techempower-org/multipass-structural-memory-eval#165
Cross-system multipass · the 9 benched systems × Cat 1–9

The benched nine — role-grouped, every category readable

drill-down

The explorer above is the whole field at a glance; this is the role-grouped zoom on the nine systems we actually benched through Cat 1–9 — five memory products (incl. postgres_ingest, the mempalace-raw ablation), a no-structure control, a diagnostic orchestrator arm, and two wired-but-not-yet-benched baselines. It isn’t a leaderboard of nine competitors: the rows are grouped by role, and the most informative reading is where each system can even be measured. Two graph-native products (mempalace, OMEGA) take the full Cat 1–9; the extraction systems read N/A on the structural cats for three different architectural reasons; and an N/A is a measured finding, not a blank. The single-substrate mempalace deep-dive that follows is the supporting detail under this.

Legend: 0.00 = measured number; E = emergent (system-generated structure); QA-def = verified + runnable, QA deferred; N/A · reason = measured inapplicable (a finding); not run = wired adapter, never benched

System | 1 Lookup | 2c Stairway | 3 Dissonance | 4 Threshold | 5 Missing Room | 6 Archive | 7 Abacus | 8 Blueprint | 9a Handshake | 9b Call-through

Memory systems: products under test
mempalace · verbatim + AGE graph | R@5 0.927 | R@5 0.960 | 0.00 E | entropy 0.645 · other 26.83% | largest 61.87% · iso 22.2% | 0.00 E | QA 0.580 | mod 0.7961 · PASS | 0.983 | reachable
OMEGA · embed + auto-relate | R@5 0.900 | R@5 0.920 | recall 0.50 E | entropy 0.78 | 1 comp · 0 iso | 0.00 E | QA 0.593 | drift 0.875 | N/A · no harness | N/A · no harness
Hindsight · extraction competitor | QA-def | QA-def | N/A · no endpoint | N/A · no endpoint | N/A · no endpoint | N/A · no endpoint | QA-def | N/A · no endpoint | N/A · no harness | N/A · no harness
Mem0-OSS · extraction competitor | QA-def | QA-def | N/A · graph removed | N/A · graph removed | N/A · graph removed | N/A · graph removed | QA-def | N/A · graph removed | N/A · no harness | N/A · no harness
postgres_ingest · mempalace-raw: verbatim postgres, no graph | R@5 0.833 | R@5 0.833 | N/A · by design | N/A · by design | N/A · by design | N/A · by design | QA 0.392 | N/A · by design | N/A · no harness | N/A · no harness

Baseline: no-structure control (the floor the structural delta is measured against; not a competitor)
flat · ChromaDB vector, no graph | R@5 0.833 | R@5 0.833 | N/A · by design | N/A · by design | N/A · by design | N/A · by design | QA 0.384 | N/A · by design | N/A · no harness | N/A · no harness

Diagnostic arm: Cat 9a orchestrator-invocation probe (measures the model in front of the memory, not a product)
RLM · Qwen-7B / Llama-70B orchestrator | R@5 0.467 | by-hop | n/r | n/r | n/r | n/r | n/r | n/r | 0.467 · 7–27% invoke | N/A · no harness

Wired, not yet benched: harness has these adapters; no Cat run exists
full_context · Karpathy D1 · whole-vault-in-context | not run | not run | not run | not run | not run | not run | not run | not run | not run | not run
karpathy_compiled · Karpathy D2 · LLM-compiled wiki | not run | not run | not run | not run | not run | not run | not run | not run | not run | not run

Rows are grouped by role (five memory products, one no-structure control, one diagnostic orchestrator arm, two wired-not-benched baselines), so the grid reads as roles, not nine competing systems. The two “not run” rows (Karpathy D1/D2) are wired adapters that have never been benched: an absence of measurement, rendered muted/dashed so they never read as a system that scored blanks. Every N/A cell is the opposite: a measured finding with a reason. Click System to sort.
three structural-N/A systems · three different reasons · the distinction is a feature

The structural cats (3–8) read N/A for three architecturally distinct reasons, and the difference is the point. flat is N/A by design: it is the no-structure control, intentionally graphless, the baseline the structural delta is measured against. Hindsight is N/A · no graph endpoint: an extract-then-retrieve store that serves no standalone graph API at all. Mem0-OSS is N/A · graph layer removed: it once shipped graph memory and the open-source edition dropped it (the hosted platform keeps it; the OSS package under test returns isolated entities with zero edges). Same cell value, three different architectural stories. That is diagnostic signal, not noise.
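
One way to keep "N/A with a reason" distinct from "never run" is to make the reason part of the cell's type. The sketch below is hypothetical: `CellKind`, `Cell`, and `is_finding` are names invented for this illustration and are not taken from the SME codebase; only the three reason strings come from the matrix above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class CellKind(Enum):
    MEASURED = "measured"
    EMERGENT = "emergent"
    QA_DEFERRED = "QA-def"
    NOT_APPLICABLE = "N/A"
    NOT_RUN = "not run"

@dataclass(frozen=True)
class Cell:
    kind: CellKind
    value: Optional[float] = None
    reason: Optional[str] = None

    def __post_init__(self):
        # An N/A without a reason would be indistinguishable from a blank.
        if self.kind is CellKind.NOT_APPLICABLE and not self.reason:
            raise ValueError("an N/A cell must carry its reason")
        if self.kind in (CellKind.MEASURED, CellKind.EMERGENT) and self.value is None:
            raise ValueError("a measured or emergent cell must carry a value")

    def is_finding(self) -> bool:
        """Measured, emergent, and reasoned-N/A cells are findings; the rest are pending."""
        return self.kind in (CellKind.MEASURED, CellKind.EMERGENT, CellKind.NOT_APPLICABLE)

# Cat 3 (Dissonance) for the three structural-N/A systems, reasons as in the matrix.
cat3 = {
    "flat": Cell(CellKind.NOT_APPLICABLE, reason="by design"),
    "Hindsight": Cell(CellKind.NOT_APPLICABLE, reason="no endpoint"),
    "Mem0-OSS": Cell(CellKind.NOT_APPLICABLE, reason="graph removed"),
    "full_context": Cell(CellKind.NOT_RUN),
}
distinct_reasons = {c.reason for c in cat3.values() if c.kind is CellKind.NOT_APPLICABLE}
print(sorted(distinct_reasons))
```

With this shape, a reasonless N/A cannot be constructed at all, and the "not run" rows are excluded from any count of findings by type rather than by convention.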
the capstone · emergent typed structure is unsolved field-wide

Across the two graph-native products, generating typed contradiction / supersession edges from raw content is ~0: mempalace has 0 contradicts and 0 supersedes edges in the live KG (emergent 0.00 / 0.00); OMEGA catches contradiction at recall 0.50 (1 of 2 themes, precision 0.25) and supersession 0.00. SME is the only framework that even asks the question, and the honest cross-field answer is “not yet.”
Source · baselines/cross_system_multipass_matrix_2026-05-30.json (cassia-2, expanded #163) + docs/benchmarks/2026-05-30-cross-system-multipass-matrix.md · mempalace structural cells = the EXACT post-re-map finals (#211/#226); OMEGA full Cat 1–9 (#178/#183); Hindsight (#220/#184) + Mem0-OSS (#221/#185) verdict rows; flat control (#125/#82); RLM 9a probe (#194). Footnote: oracle_retrieval (ceiling) + random_retrieval (floor) adapters are present in the registry but not run as matrix cells; they are diagnostic bounds, not products. · techempower-org/multipass-structural-memory-eval#163
Head-to-head · identical harness

Apples to apples — same subset, same reader, same judge

rigorous axis

The only rows on this page measured under identical conditions — every system below was run by us on the same n=150 stratified LongMemEval-S subset, the same retrieval definition (session-level R@K), the same reader, and the same canonical gpt-5.3-chat judge. No self-reported numbers, no mixed answer models, no metric swaps. The field matrix below cites published leaderboards (caveat-heavy); this is the table that controls the variables.

System | Storage | R@5 | E2E QA
mempalace · deployed substrate (this work) | verbatim | 92.7% | 58.0% · o4-mini reader · retrieved ctx · canonical gpt-5.3-chat judge · n=150 macro ‖
  comparator = oracle-n500, not strat150-S (closest apples, not pixel-perfect) · opus deployed→oracle gradient 0.567→0.868 in field card
OMEGA · first independent competitor run | extraction | 90.0% | 59.3% sme-rich · o4-mini reader · retrieved ctx · canonical gpt-5.3-chat judge · n=150 macro ✦
  ≈ parity (+1.3pp), same reader + context
Hindsight · extraction, cloud-extractor | extraction | | QA deferred · extraction throughput / cost-gated

The honest read: mempalace’s edge is retrieval (R@5 +2.7pp); reader/QA is at parity (OMEGA +1.3pp, same-reader). On R@5, mempalace 0.927 vs OMEGA 0.900 is a fair same-rendering pair (both measured upstream-exact, --content-rules), so the 2.7pp edge is real: retrieval-only, no answer-model confound. On E2E QA, held to same reader, same context (o4-mini, retrieved, macro n=150), mempalace 0.580 vs OMEGA sme-rich 0.593 ≈ parity (OMEGA +1.3pp): two verbatim-vs-extraction systems land within a question of each other once reader and context are pinned. Hindsight’s cloud-extractor throughput keeps its QA pass cost-gated: deferred rather than guessed. Don’t confuse the 0.580 comparator with mempalace’s deployed→oracle gradient (deployed-E2E ladder 0.567→0.760 as retrieval widens limit 5→50 → 0.868 oracle ceiling): that gradient is an opus reader on retrieved/oracle context and lives in the field card below; it is not the like-for-like OMEGA comparison. This table, not the leaderboard, is the one to read.

OMEGA’s 0.593 is the fair sme-rich number (our harness, n=150, o4-mini reader + gpt-5.3-chat judge). OMEGA’s upstream-exact run scored just 0.37 — but that was date-starved: its reader context dropped session timestamps, and the temporal category (cat_6) jumped 0.04→0.36 once we restored the dates. Scoring the date-stripped 0.37 against mempalace would be the unfair comparison, so we use the date-restored sme-rich figure. (We ran the same date-confound check on our own LoCoMo numbers — see the explainer below — and there it came back clean, which is why we trust 0.593 here.) On R@5, the comparison stays upstream-exact on both sides (mempalace 0.927, OMEGA 0.900) — OMEGA’s sme-rich R@5 of 0.953 is not a comparator here, because no same-rendering mempalace pair exists for it; mixing renderings is exactly the apples-to-oranges this table refuses.

the cost wall · why two QA cells are deferred, not blank

Hindsight ~150h / Mem0 ~18h to benchmark strat150, vs verbatim-first mempalace ~0 marginal cost. That is the leaderboard-hidden headline: extraction-based memory systems are benchmark-throughput-bound. Both competitors run an LLM fact-extraction on every session ingest, and LongMemEval-S averages ~48 sessions/question, so the strat150 subset is ~7,200 ingests: Hindsight at ~60–96 s/session (local phi4, CPU) → ~150h to benchmark; Mem0-OSS at a warm ~9 s/ingest (ollama phi4 extractor) → ~18h. A verbatim-first system (mempalace) ingests raw text at ~zero marginal compute, orders of magnitude cheaper to evaluate. That is why the apples table publishes their QA as deferred rather than a guessed number: it isn’t reluctance, it is a real architectural cost. (Mem0’s extraction is also lossy by design: the verification smoke stored only 3 of 5 facts.) The field-reported 91.4% (Hindsight) / 94.4% (Mem0 platform) are their own leaderboard numbers on stronger answer models, not on-harness figures.
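
The cost-wall arithmetic can be checked in a few lines. This sketch uses only the approximate figures quoted above (150 questions, ~48 sessions per question, ~9 s and ~60–96 s per ingest); the variable and function names are illustrative.

```python
QUESTIONS = 150                 # strat150 subset
SESSIONS_PER_QUESTION = 48      # "~48 sessions/question"
ingests = QUESTIONS * SESSIONS_PER_QUESTION

def hours_to_bench(seconds_per_ingest: float) -> float:
    """Wall-clock hours to ingest the whole subset at a fixed per-session cost."""
    return ingests * seconds_per_ingest / 3600

mem0_hours = hours_to_bench(9)                                  # warm ollama phi4 extractor
hindsight_hours = (hours_to_bench(60), hours_to_bench(96))      # local phi4 on CPU

print(ingests)            # 7200
print(mem0_hours)         # 18.0
print(hindsight_hours)    # (120.0, 192.0)
```

The quoted ~150h for Hindsight sits inside the 120–192h range implied by the per-session timing, and the ~18h for Mem0-OSS follows directly from 7,200 ingests at 9 s each.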
Field context · self-reported leaderboards

Where mempalace sits — and the honest gap

context · cite-with-caveats
Benchmark
LongMemEval-S
Our retrieval
R@5 0.927 (/search), n=150 stratified
Reader ceiling
0.868 true-oracle (gold present) — Opus|preference + canonical judge, n=500 — essentially the field's GPT-4o oracle (0.870). The earlier 0.61 was the retrieval-limited substrate.
Judge
gpt-5.3-chat + canonical type-specific prompts (NOT gpt-4o-2024-08-06)
Posture
diagnostic decomposition, not a leaderboard claim
Retrieval R@5
92.7%
mid-pack vs the R@K field (80–99%)
Reader ceiling (true oracle)
86.8%
gold present — matches the field's 87.0% oracle
“0.61” was substrate
+26pp
all from ingest + retrieval; ~0pp genuine reader
Structural cats
4/5/8/9
measured here; no field analogue

What this table is: the field column is self-reported leaderboard numbers — mixed metrics (R@5 vs oracle QA vs E2E QA, which are three different tests), mixed answer models (answer-model choice alone swings ~24pp), and mixed verification. It is context, cited with caveats — not a controlled comparison. For that, read the apples-to-apples table above (our own harness, identical conditions). The Verif. column below tells you who measured each number:

Verif. legend: self = vendor’s own report; indie = independent re-run (incl. ours); paper = peer/preprint; ref = reference baseline
System | Storage | LongMemEval R@5 | LongMemEval oracle QA | LongMemEval E2E QA | LoCoMo | Answer model | Verif.

mempalace: this work (verbatim-first)
mempalace · deployed substrate | verbatim | 92.7% | | 0.567→0.760 limit 5→50 ‖ | 38.8% daemon § | gpt-5.3-chat | indie
mempalace · ingest-fixed | verbatim | | | 68.4% limit=5 ‖ | | gpt-5.3-chat | indie
mempalace · true-oracle ceiling | verbatim | | 86.8% | | | gpt-5.3-chat | indie

the field: E2E QA leaderboard (sorted by LongMemEval E2E QA; storage paradigm per MemPalace survey + independent check)
OMEGA | extraction | 90.0% | | 95.4% | | GPT-4.1 | self, indie
Mastra | unstated | | | 94.9% | | GPT-5-mini | self
Mem0 (platform v3) | hybrid | | | 94.4% | 92.5% | undisclosed | self
Hindsight | extraction | | | 91.4% | 89.6% | Gemini 3 Pro | indie
True Memory (Pro) | verbatim | | | 87.8% | 93.0% | gpt-4.1-mini | paper
Supermemory | extraction | | | 85.2% | 65.4% | Gemini-3 | indie
EverOS / EverMind | unstated | | | 83.0% | 93.1% | undisclosed | self
ENGRAM (paper, arXiv:2511.12960) | extraction | | | 71.4% | 77.6% | GPT-4o-mini | paper
Zep / Graphiti | hybrid | | | 71.2% | 75.1% | GPT-4o | paper
Celiums | unstated | | | 62.3% | | Opus | self
GPT-4o (reference) | reference | | 87.0% | 60.2% | | GPT-4o | ref

retrieval-only: R@K recall (LongMemEval R@5; not comparable to QA)
engram-2 | verbatim | 99.0% | | | 74.5% QA | | self
ai-memory | verbatim | 97.8% | | | | | self
MemPalace (upstream) | verbatim | 96.6% | | | 88.9% R@10 | | indie
agentmemory | verbatim | 95.2% | | | | | self
mcp-memory-service | verbatim | 86.0% | | | 49.7% R@5 | | self

LongMemEval columns are three different tests — R@5 (retrieval), oracle QA (gold given), E2E QA (full pipeline) — never compare across them. Click any column header to sort (groups collapse into one ranking).

engram-2 = github.com/199-biotechnologies/engram-2 (Paperfot AI) — a Rust CLI: SQLite + FTS5/BM25 + Gemini Embedding 2 + RRF + Cohere rerank, self-reporting 99.0% R@5 on LongMemEval-S. It returns verbatim source chunks (with an optional LLM claim/entity extraction pass). ai-memory = github.com/alphaonedev/ai-memory-mcp (FTS5 + embeddings, 97.8% R@5). The “Engram” name is overloaded: the same org also ships 199-biotechnologies/engram (MCP server, BM25+ColBERT+KG, 98.1% R@5), and the ENGRAM (paper) row is a separate, extraction-based academic system (arXiv:2511.12960, 71.4% QA). Storage paradigms marked unstated are ones we could not pin to a primary source — left blank rather than guessed.

OMEGA R@5 0.900 is our own measurement — the first independent head-to-head: OMEGA run on the identical n=150 stratified LongMemEval-S subset, session-level R@K, same retrieval definition as mempalace (which scores 0.927 on that subset — OMEGA trails by 2.7pp / 4 questions). It is not comparable to OMEGA’s self-reported 95.4%, which is a different metric (E2E QA, not R@5) with a stronger answer model (GPT-4.1; answer-model choice alone swings ~24pp). The on-harness R@5 is the apples-to-apples number. (Note: OMEGA ships without its bge-small embedding model and silently falls back to keyword-only FTS5; we ran omega setup --download-model to restore semantic retrieval before measuring, so this is the fair number, not a crippled one.)
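
A minimal sketch of a session-level R@K, to make the "2.7pp / 4 questions" reading concrete. It assumes a question counts as a hit when any gold evidence session appears in the top K (the page does not say whether the harness uses an any-gold or all-gold rule), and the hit counts 139 and 135 are inferred from the rounded 0.927 and 0.900 at n=150 rather than read from a baseline file. The toy session IDs are hypothetical.

```python
def hit_at_k(ranked_session_ids, gold_session_ids, k=5):
    """True if at least one gold session is among the top-k retrieved sessions."""
    return any(s in gold_session_ids for s in ranked_session_ids[:k])

def recall_at_k(runs, k=5):
    """Fraction of questions with a hit; `runs` is a list of (ranking, gold set) pairs."""
    return sum(hit_at_k(ranking, gold, k) for ranking, gold in runs) / len(runs)

# Toy example: three questions, one miss inside the top 5.
toy_runs = [
    (["s3", "s9", "s1", "s4", "s7", "s2"], {"s1"}),   # gold at rank 3: hit
    (["s5", "s6", "s8", "s2", "s0", "s1"], {"s1"}),   # gold at rank 6: miss at k=5
    (["s2", "s1", "s4", "s3", "s6", "s7"], {"s2", "s9"}),  # one of two gold at rank 1: hit
]
toy_r5 = recall_at_k(toy_runs, k=5)

# Inferred hit counts behind the reported figures (n=150).
n = 150
mempalace_hits, omega_hits = 139, 135
print(round(mempalace_hits / n, 3), round(omega_hits / n, 3))   # 0.927 0.9
print(mempalace_hits - omega_hits)                               # 4
print(round((mempalace_hits - omega_hits) / n * 100, 1))         # 2.7
```

Because the metric is computed before any reader or judge is involved, the difference carries no answer-model confound, which is why it is the comparison the paragraph above trusts.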

§ mempalace daemon on LoCoMo — age-fused QA 0.388 (n=250 stratified, isolated scratch palace, never prod), drawer-level R@5 0.556. An A/B isolating the KG: restoring the graph-hydration fix (palace-daemon#202) moved QA +1.2pp over the vector-only fallback (0.376) with identical drawer-R@5 — so on LoCoMo the age-fused graph half adds ~nothing to top-5 retrieval (the graph-only hits don’t displace the vector top-5). The flat adapter scores the same QA (0.384), confirming the substrate, not the path, sets it. LoCoMo QA here (~0.39) is an our-harness figure (gpt-5.3-chat reader) — not comparable to the field’s 0.75–0.93 (stronger answer models + looser graders; cf. the event-ordering Kendall-τ vs binary-judge mismatch).
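
A sketch of how a graph leg can add "~nothing" to a fused top-5. It uses textbook reciprocal rank fusion with the conventional constant k=60; the page mentions a graph-RRF sub-layer but does not describe the daemon's actual fusion, so treat this as an illustration of the mechanism, not of the implementation. The document IDs are hypothetical.

```python
def rrf(rankings, k=60):
    """Reciprocal rank fusion: each list contributes 1 / (k + rank) per document."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id so the order is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))

vector = ["d1", "d2", "d3", "d4", "d5", "d6"]
bm25 = ["d2", "d1", "d3", "d5", "d4", "d7"]
graph = ["g1", "g2", "d1"]   # mostly graph-only hits the other legs never return

two_leg_top5 = rrf([vector, bm25])[:5]
three_leg_top5 = rrf([vector, bm25, graph])[:5]

print(two_leg_top5)
print(three_leg_top5)
```

A document that only one leg returns earns at most 1/61, while a document that both the vector and keyword legs place in their top five earns at least 2/65. Graph-only hits therefore cannot displace a top-5 the other two legs agree on, which is consistent with identical drawer-R@5 with and without the graph half.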

The deployed-E2E column is a retrieval-breadth ladder (real /search→reader→judge, opus reader, n=150): limit=5 0.567 (85/150) → limit=20 0.740 (111/150) → limit=50 0.760 (114/150). Deployed QA was retrieval-limited at limit=5; widening to 20 recovers +17.3pp and it plateaus by 50 — concrete proof the gap to the 0.868 oracle ceiling is retrieval breadth, not the reader. (One content-filtered judge call at limit=50, qid 95228167, is counted wrong — a conservative floor; scored correct it is 0.767, which doesn’t touch the plateau.) These are a live re-query and supersede the 05-29 cached 0.610 (pre-retrieval-drift pinned context). Only the 0.868 ceiling sits under oracle QA (gold handed to the reader, retrieval bypassed) — the like-for-like to the field’s GPT-4o 0.870 oracle (techempower-org/multipass-structural-memory-eval#117).
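
The ladder figures follow from the raw counts quoted above; the short check below reproduces them, including the conservative treatment of the one content-filtered judge call. Variable names are illustrative.

```python
n = 150
correct_at_limit = {5: 85, 20: 111, 50: 114}   # counts quoted in the ladder

qa = {limit: correct / n for limit, correct in correct_at_limit.items()}
gain_5_to_20 = (qa[20] - qa[5]) * 100          # percentage points
gain_20_to_50 = (qa[50] - qa[20]) * 100

# Scoring the content-filtered call (qid 95228167) as correct instead of wrong.
lenient_at_50 = (correct_at_limit[50] + 1) / n

print({limit: round(v, 3) for limit, v in qa.items()})   # {5: 0.567, 20: 0.74, 50: 0.76}
print(round(gain_5_to_20, 1), round(gain_20_to_50, 1))   # 17.3 2.0
print(round(lenient_at_50, 3))                           # 0.767
```

Quadrupling the limit from 5 to 20 recovers 26 questions; going on to 50 recovers only three more, which is the plateau the paragraph describes.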

Read the Storage column first — it is the axis a LongMemEval leaderboard hides. mempalace is verbatim-first: it stores the raw turns and never summarizes (“store everything, then make it findable”), solving retrieval separately. It is not alone in that — True Memory (a paper-only design, arXiv:2605.04897, no code release) is also verbatim-first and reports 87.8% QA, so the paradigm is not what caps the leaderboard. The rest of the field sits on the other side of the split: hybrid systems that selectively extract while keeping some raw history (Mem0’s self-editing vector+graph+KV, Zep/Graphiti’s bi-temporal knowledge graph, Letta’s tiered memory), and pure extraction (Hindsight stores structured, time-aware facts instead of raw logs). So no, these are not all verbatim-first — but mempalace’s verbatim peers are real: True Memory, traditional RAG, and the R@K-only ChromaDB baselines (agentmemory, engram-2, ai-memory at 95–99% R@5). The paradigm split is taken from the MemPalace landscape survey, not assigned by us.

The honest read: on retrieval mempalace is competitive. R@5 0.927 sits mid-pack against the field's R@K leaders (96–99% for ChromaDB-baseline systems; mcp-memory-service 80–86%). On QA we deliberately publish no full-pipeline leaderboard number (that needs a field-standard answer model + the canonical gpt-4o-2024-08-06 judge we don't have).

The apples-to-apples axis is the oracle, and an earlier reading of this card got it wrong, which is worth stating plainly: we reported 0.61 and called the 26-point gap to GPT-4o's 0.87 oracle a reader limit. It isn't. That 0.61 was retrieval-limited: the context reaching the reader came through /search at limit=5, and for single-session-assistant questions our upstream-parity ingest (user-turns-only, which upstream itself recommends against) dropped the assistant-authored gold entirely. Hand the reader the gold instead (true oracle, evidence sessions verbatim) and five of six categories recover, not just assistant: single-session-assistant 0.32→0.98 (ingest), temporal 0.36→0.75, knowledge-update 0.70→0.91, multi-session 0.71→0.87, single-session-preference 0.80→0.93 (all retrieval); only single-session-user carried over.

That lifts the overall from 0.610 to a 0.868 reader ceiling, within 0.2pp of the field's 0.870. So the old 26-point “reader” gap decomposes as ~7pp ingest artifact (the dropped assistant turns; fixed → 0.684) + ~18pp retrieval breadth (limit=5 + chunking never delivering the gold) + ~0pp genuine reader. The reader was never the bottleneck: once the gold is in context it essentially matches the field's oracle; getting the gold to it is the whole game.

The route here is itself a finding: an earlier reader floor-lift (three category-specific prompt clauses on the pinned /search context) returned a clean null (net −2 to +3 questions), which is precisely what flagged that the binding constraint was substrate, not prompt; the true-oracle test then confirmed it.

That decomposition is the SME finding the leaderboards can't show. And the structural categories, namely ingestion integrity (Cat 4), gap-detection (Cat 5), ontology coherence (Cat 8), and invocation discipline (Cat 9), have no competitor analogue at all.
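
The decomposition is simple subtraction over four published readings. The sketch below spells it out; the stage names are labels chosen for this illustration.

```python
substrate = 0.610      # original reading: /search at limit=5, user-turns-only ingest
ingest_fixed = 0.684   # same retrieval, assistant turns restored
true_oracle = 0.868    # gold evidence sessions handed to the reader
field_oracle = 0.870   # the field's GPT-4o oracle

ingest_pp = (ingest_fixed - substrate) * 100
retrieval_pp = (true_oracle - ingest_fixed) * 100
reader_pp = (field_oracle - true_oracle) * 100
total_pp = (field_oracle - substrate) * 100

print(round(ingest_pp, 1))      # 7.4
print(round(retrieval_pp, 1))   # 18.4
print(round(reader_pp, 1))      # 0.2
print(round(total_pp, 1))       # 26.0
```

The three stages sum to the full 26-point gap, with all but 0.2pp attributed to what reaches the reader rather than to the reader itself.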

comparability discipline

True-oracle = the ceiling, not the deployed number. The 0.868 figure is measured with the evidence sessions handed to the reader verbatim, i.e. the reader's ceiling once retrieval is perfect. A live system still has to retrieve those sessions; the deployed number (real chunked /search) sits between the 0.610 substrate reading and the ceiling, and a production daemon re-test is still pending. All four gradient numbers are abstention-credited (the field's own metric), so our 0.868 ceiling and the field's 0.870 oracle are a like-for-like comparison. Cross-system LongMemEval numbers are not directly comparable anyway: answer-model choice alone swings ~24pp, oracle QA is a different metric from full-pipeline E2E QA, and most competitor numbers are self-reported, so we do not place ourselves on the QA leaderboard. Field numbers are cited from the landscape survey; ours are sourced from baselines/.
Source · ours: baselines/reader_trueoracle_{ss-assistant,temporal,multi-session}_2026-05-29.json + reader_sweep_passA_canonical-judge_opus-preference_*.json (n=500 deployed) + longmemeval_s_strat150_*.reagg.json (R@5) · docs/benchmarks/2026-05-29-true-oracle-floor.md · field: docs/research/2026-05-29-comparison-readiness.md · techempower-org/multipass-structural-memory-eval#116
Multipass · what the system knows about its own structure

SME structural diagnostics — no competitor analogue

unique contribution

The cross-system matrix above is the comparison — every system, every category. This is the single-substrate deep-dive beneath it: the same Cat 1–9 read in detail against mempalace alone, answering what does the memory system know about its own structure? These nine categories are SME’s unique contribution — LongMemEval, LoCoMo, BEAM, Mem0’s and Zep’s evaluations all measure end-to-end QA or retrieval recall. None of them diagnose canonical-collision dedup, edge-type monoculture, structural holes, ontology drift, or harness invocation rate. The analogue column is none for every row, by construction — that is the headline. Substrate under test: mempalace via the mempalace-daemon adapter against the live palace-daemon, reading the real AGE knowledge graph (1,156,314 entities / 1,873,489-edge cleaned RELATION set, full-graph EXACT — not the earlier capped /graph projection, and post the --drop-code-tokens junk DELETE). Diagnostic readings of one substrate — not leaderboard scores.

Cat | What it measures | MemPalace reading | Corpus / adapter | Analogue
1 Lookup | Find a specific memory from a natural-language query | R@5 0.867. Full-recall 22/30; hop-1 0.889, hop-2 0.667 | jp-realm-v0.1 · daemon /search/age-fused | none
2c Stairway | Multi-hop retrieval recall by hop depth: does structure scale? | (structural − flat) +10pp · A flat 0.833 / B hybrid 0.933 / C age-fused 0.900 @K=5 · grows with depth (B/A 1.11×→1.25×). B−A +9.3pp@1-hop, +16.7pp@2-hop; graph-RRF sub-layer (C) is a neutral-to-slightly-negative tax vs B, the third data point for the age-fusion-null thesis (#203) | jp-realm-v0.1 · flat / daemon / age-fused | none
3 Dissonance | Detect and surface conflicting facts | 0.00 emergent · the live palace KG has 0 contradicts edges; the enrichment pipeline generates no emergent contradiction structure on real content. Supersedes the earlier +1.00, which was a corpus-declared ceiling (good-dog-graph reading back hand-seeded edges, not detection). Cross-system: OMEGA auto-relate generated contradicts edges, recall 0.50 (#148/#215) | live palace (real KG) · daemon --real-kg | none
4 Threshold | Is ingestion producing a clean graph? (dedup, field coverage, monoculture) | 4a collisions 248 (1.9%) · 4b coverage 1.00 · 4c norm-entropy 0.645 · other 26.83% · 40 types. Full-graph EXACT (server-side cypher over the cleaned 1,873,489-edge RELATION set): RESOLVED. Overturns the prior sampled 0.020/98.98%-tunnel capped-projection artifact, then the post-relabel-pre-DELETE intermediate (0.340/55.05%/237). The re-map ran: kg_predicate_norm de-monoculture relabel (~520k edges) + a --drop-code-tokens DELETE (48,135 junk shell-cmd/stopword/DOM-method edges). Entropy 0.020→0.340→0.645; other 98.98%→55.05%→26.83% (#211) | live palace (real KG) · daemon /cypher | none
5 Missing Room | Identify what’s structurally missing (components, holes, gaps) | largest component 61.87% (715,435 of 1,156,314) · isolates 22.2% (256,782) · 325,965 components. EXACT full-graph WCC (server-side, post-DELETE); replaces the bogus capped-projection 44.8%-isolate artifact. Honest note: isolates rose 20.4%→22.2% after the --drop-code-tokens DELETE. Deleting 48k junk edges orphaned ~20k entities whose only edge was junk; pre-DELETE they were “connected” by noise, the cleaner graph honestly reports them isolated. The giant component is still well-connected (#211) | live palace (real KG) · daemon | none
6 Archive | Current vs historical state, supersession tracking | 0.00 emergent · the live palace KG has 0 supersedes edges; completeness 0.00. Supersedes the earlier +1.00 (good-dog-graph declared ceiling, 8/8 hand-seeded). Cross-system: OMEGA also 0.00; its temporal analogue evolution doesn’t normalize to supersedes. Emergent supersession is unsolved in both (#148/#215) | live palace (real KG) · daemon --real-kg | none
7 Abacus | Does structure earn its token overhead? (graph vs no-graph) | structure earns it: +10pp recall for 1.69× context (A vs B, token-comparable: 466→789 tok/q). Flat is cheaper per-correct but lands 5 fewer correct (21/30 vs 26/30); the structural lift is hybrid retrieval, not graph traversal (#203) | jp-realm-v0.1 · flat / daemon | none
7b Latency | Query latency distribution (YCSB p50 / p95) | post-AGE-index golden set: p50 vector ~684ms / union ~515ms / hybrid ~689ms. Two honest facts, not a single-set win: on this drawer-query golden set the graph leg is inert (hybrid ≈ union, ~689ms); the AGE-index speedup lands on entity-anchored queries, not these. The earlier hybrid p50 2064ms was a graph-firing query set; the two sets aren’t the same workload, so do not read a same-set 2064→689 improvement (#144/#227) | live palace (AGE) · daemon (candidate-strategy) | none
8 Blueprint | Does the actual graph match what the system claims to do? | “hierarchical” PASSES · modularity 0.7961 (218 communities) · introspection 1.0 LIVE. EXACT full-graph (networkx, post-DELETE); refutes the prior FAIL (modularity 0.009 was the capped-projection artifact) and replaces the verdict-only reading with a real number: 0.7961 >> 0.5. Introspection is now 1.0 on the deployed daemon (daemon restarted, /ontology serving); was 1.0 capability / 0.0 deployed (#147/#211) | live palace (real KG) · daemon | none
9a Handshake | Does the model actually invoke memory when it has access? | opus-4-8 (Tau2 99.3): 100% invocation, 98.3% recall; invokes on every question, exceeds the deterministic 78.3% ceiling. Recall monotonic in Tau2: gemma4 41.7 → qwen3.5 75.0 → opus-4-8 98.3. Prior RLM Qwen-7B/Llama-70B plateaued 46.7% (7–27% invocation); ceiling was willingness to invoke, not retrieval (#194). 4B arms’ on-harness invocation-rate is a backfill follow-up; recall carried from the validated 2026-05-15 RLM run | jp-realm-v0.1 · familiar / rlm / opus-4-8 | none
9b Call-through | Given an invocation, does the tool call complete and return a valid result? | live surfaces reachable. Clean floor; mock-model probe path | | none
Every row’s competitor analogue is none; that is the point of this table, not a gap in it. The readings are diagnostic deltas on a single substrate, never cross-system leaderboard scores.
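The Cat 4c monoculture reading can be illustrated with a normalized Shannon entropy over edge-type counts. This is a minimal sketch, not SME's scorer: the function name `normalized_entropy`, the choice to normalize by the log of the number of observed types, and the toy counts are all assumptions made for illustration.

```python
import math
from collections import Counter


def normalized_entropy(type_counts):
    """Shannon entropy of the edge-type distribution divided by log(k).

    Assumption: k is the number of observed edge types, so 1.0 means a
    perfectly even spread and values near 0 mean one type dominates.
    """
    total = sum(type_counts.values())
    k = len(type_counts)
    if total == 0 or k < 2:
        return 0.0
    h = -sum(
        (c / total) * math.log(c / total)
        for c in type_counts.values()
        if c > 0
    )
    return h / math.log(k)


# Toy graphs with hypothetical counts (not the palace KG).
monoculture = Counter({"other": 9898, "mentions": 51, "uses": 51})
balanced = Counter({"other": 2683, "mentions": 3700, "uses": 3617})

print(round(normalized_entropy(monoculture), 3))
print(round(normalized_entropy(balanced), 3))
```

A graph where one catch-all predicate carries nearly every edge scores close to 0, which is the shape the capped-projection reading showed; relabeling edges into distinct types pushes the value toward 1.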
cross-system headline · emergent structure is largely unsolved The honest correction this matrix surfaces: generating typed contradiction / supersession edges from raw content is largely unsolved in both substrates. Emergent contradiction is 0.00 in mempalace (0 contradicts edges in the live KG) and 0.50 in OMEGA (auto-relate catches 1 of 2 ground-truth themes, precision 0.25); emergent supersession is 0.00 in both. The earlier +1.00 we published for Cat 3/6 masked this — it was a corpus-declared ceiling (the good-dog-graph adapter reading back hand-seeded edges), not emergent detection on real content. Declared-ceiling and emergent are different questions; the matrix now labels both, and the emergent column is the more honest one (techempower-org/multipass-structural-memory-eval#215).
why Hindsight & Mem0 are N/A on the structural cats — a finding, not a blank “N/A” on Cat 3/4/5/6/8 is itself a structural finding: the system has no typed graph to evaluate. The two extraction competitors reach it by different roads. Mem0-OSS once shipped a graph-memory layer and the open-source edition dropped it — the package under test is now a flat / vector store whose snapshot returns isolated entities with zero edges (the hosted Mem0 platform keeps graph memory; the OSS one does not). Hindsight exposes no standalone graph endpoint at all — it is an extract-then-retrieve store, not a queryable typed graph, so the structural probes have nothing to read. Both also read N/A on Cat 9a/9b (driven through a library / client API with no model-in-the-loop harness manifest — the Handshake is an orchestrator property, and there is no orchestrator to measure). Per the matrix legend, an N/A is reported as a real diagnostic outcome, never a missing cell (techempower-org/multipass-structural-memory-eval#178).
Cat-9 · orchestrator-model selection (Tau2 prior) Cat-9a Handshake recall tracks the orchestrator’s Tau2 tool-agent score, not its parameter count. A +37.7pp Tau2 gap predicted a +30–33pp Cat-9a recall gap to within ~5pp. The live data bears it out: RLM with Qwen-7B and Llama-70B both plateau at the same 46.7% despite a ~10× parameter difference — both ceiling at willingness to invoke the tool, not at retrieval. The deterministic familiar pipeline consistently invokes and lands at 78.3%. The lever for raising 9a is a high-Tau2 orchestrator (Opus 4.6 99.3%, GPT-5.4 98.9%, GLM-5 ~98%), not more parameters (techempower-org/multipass-structural-memory-eval#194).
measurement honesty · the structural column was re-measured over the real KG — and the exact counts now ship The structural cells carried four measurement traps, now corrected (#215) — and the fixes have landed: the re-map, the junk-edge DELETE, and a full-graph networkx pass all ran, so the absolute counts that were “pending” are now published EXACT, not verdict-only. (1) The capped projection (the daemon /graph sample was tunnel-dominated — Cat 4 read 0.020/98.98%, Cat 8 “hierarchical” FAILED at 0.009) is re-anchored to the full-graph EXACT: Cat 4 entropy 0.645 / other 26.83%, Cat 8 modularity 0.7961 (PASS). (2) The tautological scorer (Cat 3/6 +1.00 was the good-dog-graph reading back hand-seeded edges — a declared ceiling, not emergent detection) is replaced by the honest emergent read, 0.00. (3) The limit-dependent topology samples (Cat 5/8) that used to OOM at full scale are now computed server-side over the whole graph: Cat 5 WCC 61.87% largest / 22.2% isolates, Cat 8 modularity 0.7961. Cat 2c / Cat 7 keep their #203 flat Condition-A deltas (+10pp). The one open follow-up is the two 4B arms’ on-harness Cat 9a invocation rate. Per the diagnostic posture, each cell is a controlled reading of one substrate — and we’d rather publish a 0.00 emergent than a flattering declared 1.00.
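The Cat 5 readings (component count, largest-component share, isolate share) and the effect of deleting junk edges can be sketched on a toy graph. This is an illustration only: the real measurement ran server-side over the full graph, and the function name `component_stats`, the union-find approach, and the six-node example are assumptions.

```python
from collections import Counter


def component_stats(nodes, edges):
    """Weakly connected components via union-find (edge direction ignored).

    Assumes every edge endpoint appears in `nodes`.
    """
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    sizes = Counter(find(n) for n in nodes)
    isolates = sum(1 for n in nodes if degree[n] == 0)
    return {
        "components": len(sizes),
        "largest_share": max(sizes.values()) / len(nodes),
        "isolate_share": isolates / len(nodes),
    }


nodes = ["a", "b", "c", "d", "e", "f"]
edges = [("a", "b"), ("b", "c"), ("d", "e")]
junk_edges = [("e", "f")]  # hypothetical noise edge

before = component_stats(nodes, edges + junk_edges)
after = component_stats(nodes, edges)
print(before)
print(after)
```

In the toy graph, node `f` is "connected" only through the junk edge, so removing that edge raises the isolate share while leaving the largest component untouched, the same direction as the 20.4%→22.2% isolate change reported above.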
Source · structural column re-measured over the real KG, EXACT full-graph post-DELETE: baselines/cross_system_multipass_matrix_2026-05-30.json + docs/benchmarks/2026-05-30-cross-system-multipass-matrix.md + docs/benchmarks/2026-05-31-cat458-real-kg-crossvalidation.md (#215, post #147/#210/#211 + the prod re-map + --drop-code-tokens DELETE + networkx full-graph modularity) · retrieval cats baselines/{jp_realm_v0_1_daemon_age_fused, cat2c_daemon_age, candidate_strategy_age}_2026-05-29.json · Cat 9a docs/benchmarks/2026-05-30-cat9a-tau2-orchestrator-ladder.md (#194) · Cat 2c/7 docs/benchmarks/2026-05-30-jprealm-flat-condA-cat2c-cat7.md (#203) techempower-org/multipass-structural-memory-eval#115
Comparability · reading our LoCoMo number

Our LoCoMo QA (~0.38) is not a substrate weakness

explainer · 2026-05-30

Taken next to the field’s LoCoMo QA (0.75–0.93), our ~0.38 looks alarming. It isn’t a verbatim-substrate failure — it is four measurement choices stacked on top of each other, none of which the leaderboard numbers share. Reading them in order:

Substrate is not the bottleneck
0.384 vs 0.388
flat adapter vs age-fused daemon — same QA
Retrieval-recall ceiling
R@5 0.44
the gold often isn’t in the top-5 to begin with
Judge
strict
binary, abstention-aware — no partial credit
Subset
adversarial-incl.
field reports often skip the adversarial split

1. The substrate isn’t the limit. The flat adapter (plain verbatim + vector retrieval) scores 0.384; the full age-fused daemon (pgvector + AGE knowledge graph) scores 0.388 on the same n=250 stratified subset. If the verbatim store were the weakness, the graph-augmented path would pull ahead; it doesn’t. Whatever caps LoCoMo here sits above the substrate, in the retrieval-and-judge layer that every system shares.
2. It’s a retrieval-recall ceiling. Drawer-level R@5 on LoCoMo is ~0.44: more than half the time the gold turn isn’t in the top-5 the reader sees, so the QA number is bounded by recall, not by the reader’s ability to answer.
3. Our judge is strict. It is a binary, abstention-aware grader with no partial credit and no looser string-overlap leniency, the same discipline we hold mempalace to everywhere else. Field LoCoMo numbers frequently use softer graders (and much stronger answer models), and answer-model choice alone swings results by ~24pp.
4. We include the adversarial split. Our subset keeps the adversarial LoCoMo questions that many field reports quietly drop; those are exactly the ones designed to defeat retrieval.

temporal date-confound · ruled out We checked for the same temporal date-confound that bit OMEGA here, and ruled it out. Cassia’s diagnostic (n=50 temporal, capture-context) found 100% of reader contexts carried a date — zero date-stripped. The failures break down as retrieval-miss 24 / genuine-reasoning-fail 8 / IDK-despite-evidence 5: the gold isn’t reaching the reader, or the reasoning genuinely misses — not date-starvation. So unlike OMEGA (whose upstream-exact run was date-starved — cat_6 jumped 0.04→0.36 once dates were restored), LoCoMo’s 0.384 overall / 0.26 temporal is the genuine retrieval-ceiling-plus-reasoning number, not an artifact. Treat it as a real reading under our strictest conditions — still not apples-to-apples with the field’s 0.75–0.93 (stronger answer models + looser graders), but honestly ours (techempower-org/multipass-structural-memory-eval#108 / #110).
Source · baselines/locomo_*_flat_*.json + longmemeval/locomo daemon age-fused (n=250, isolated scratch palace) · KG A/B: palace-daemon#202 techempower-org/multipass-structural-memory-eval#108
BEAM · bucket regime-shift

As the haystack grows, the regime flips to retrieval

verified · 2026-05-30
Benchmark
BEAM (bucketed by haystack size)
Substrate
flat-local mempalace
Reader
gpt-5.3-chat
Judge
o4-mini
100K — full-context regime
0.649
whole bucket fits; needle abilities intact
500K — retrieval regime
0.487
retrieval kicks in; needle abilities collapse
1M — deep retrieval
0.471
plateaus — −0.016 vs 500K; 10M deferred (cost)
Bucket | Overall QA | Regime | Per-ability (info-ext · temporal · multi-session · summ.)
100K full-context | 0.649 | whole bucket in context | 0.85 · 0.75 · 0.60 · 0.87
500K retrieval | 0.487 | real retrieval (needle-in-haystack) | 0.40 · 0.26 · 0.24 · 0.87 (holds)
1M deep retrieval | 0.471 | retrieval-limited (plateau) | flat vs 500K; a couple abilities tick up

BEAM is the one benchmark here with a genuine regime shift — a cliff, then a plateau. At 100K the whole haystack fits in context, so the reader operates in a full-context regime and the needle abilities are intact. The cliff is 100K→500K (0.649→0.487, −0.162): the system crosses into a real retrieval regime and the needle-dependent abilities collapse — info-extraction 0.85→0.40, temporal 0.75→0.26, multi-session 0.60→0.24 — while the retrieval-light ability holds flat (summarization 0.87 in both, because summarizing doesn’t need a specific needle). Then it plateaus: 500K→1M is essentially flat (0.487→0.471, −0.016) and a couple of needle abilities even tick up. So once you’re retrieval-limited at a low top-K, haystack size barely matters — the bottleneck is retrieval breadth, not corpus scale. That asymmetry is the tell: a retrieval-recall ceiling, not a reader ceiling — the same finding as our LongMemEval decomposition (the reader is fine once the gold is in context; getting it there is the whole game).

grader artifact · event_ordering 0.0 event_ordering scores 0.0 in all three buckets — that is a grader mismatch, not a substrate failure. BEAM grades ordering with Kendall-τ-b (a rank correlation); our binary judge floors any full-sequence answer that isn’t an exact match. It drags all buckets equally, so the regime-shift story is unaffected, but it means the absolute substrate is stronger than the headline overall numbers suggest. (Same Kendall-τ-vs-binary-judge mismatch flagged on LoCoMo.)
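The mismatch between a rank-correlation grader and a binary judge can be seen on a single near-correct ordering. The sketch below is a generic Kendall τ-b implementation, not BEAM's grader or our judge; the function names and the five-event example are assumptions.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall tau-b between two equal-length sequences of ranks."""
    concordant = discordant = tied_x = tied_y = pairs = 0
    for (x1, y1), (x2, y2) in combinations(list(zip(x, y)), 2):
        pairs += 1
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            tied_x += 1
        if dy == 0:
            tied_y += 1
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    denom = sqrt((pairs - tied_x) * (pairs - tied_y))
    # Undefined when one sequence is constant; report 0.0 in that case.
    return (concordant - discordant) / denom if denom else 0.0


def binary_exact(gold, predicted):
    """All-or-nothing scoring: 1.0 only for an exact sequence match."""
    return 1.0 if list(gold) == list(predicted) else 0.0


# Hypothetical five-event answer with only the last two events swapped.
gold_positions = [1, 2, 3, 4, 5]
predicted_positions = [1, 2, 3, 5, 4]

print(kendall_tau_b(gold_positions, predicted_positions))
print(binary_exact(gold_positions, predicted_positions))
```

One swapped pair out of ten leaves the rank correlation high while the exact-match score drops to zero, which is why a binary judge can floor an ability that a τ-b grader would credit.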
1M floor · reader-window overflow At 1M, 58 of 700 questions (~8%) overflowed the reader’s context window and hard-zeroed — a mechanical floor, not a retrieval verdict. A real 1M deployment needs sub-session chunking to feed the reader; until then, treat 0.471 as a conservative 1M reading carrying an ~8% overflow penalty the plateau would otherwise sit above.
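A rough sense of the overflow penalty comes from re-basing the 1M score on the questions that fit. This is back-of-envelope arithmetic, not a reported result: it assumes every overflowed question scored 0 (as "hard-zeroed" implies) and uses the rounded 0.471; the function name is hypothetical.

```python
def accuracy_excluding_overflow(overall_acc, n_total, n_overflow):
    """Accuracy over questions that fit, assuming overflowed ones scored 0."""
    correct = overall_acc * n_total
    return correct / (n_total - n_overflow)


overflow_share = 58 / 700
adjusted = accuracy_excluding_overflow(0.471, 700, 58)

print(f"{overflow_share:.1%}")  # 8.3%
print(round(adjusted, 3))       # 0.514
```

Under those assumptions the non-overflowed questions sit near 0.51, which is the sense in which 0.471 is a conservative 1M reading.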
comparability · not the field’s BEAM-1M These buckets are flat-local mempalace, not directly comparable to the field’s published BEAM-1M (Mem0 70.1 / True Memory 76.6 / Hindsight 73.9), which run different infrastructure and stronger answer models. Read the columns as our own regime-shift diagnostic, not a leaderboard placement. 10M is deferred on cost.
Source · baselines/beam_{100k,500k,1m}_flat_*.json · BEAM bucket-regime harness (techempower-org#177) techempower-org/multipass-structural-memory-eval#177
Cat 9a · The Handshake

Tau2 score predicts empirical recall gap

verified · 2026-05-15
Corpus
jp-realm-v0.1 (n=30)
Backend
postgres + pgvector + AGE
Pattern
RLM orchestrator A/B
Models
qwen3.5:4b vs gemma4:e4b
Model | Tau2 | Recall @ n=5 | Recall @ n=20 | Hit-rate @ n=5
gemma4:e4b | baseline | 0.417 | 0.417 | 57%
qwen3.5:4b | +37.7pp | 0.717 | 0.750 | 93%
delta | +37.7pp | +30.0pp | +33.3pp | +36pp

The published Tau2 tool-use benchmark gap between qwen3.5:4b and gemma4:e4b is +37.7 points in qwen's favour. On our independent 30-question Cat 9a-shaped RLM experiment, the empirical recall gap landed at +30.0pp at n=5 and +33.3pp at n=20, so Tau2 predicts the gap to within ~5pp at n=20 (4.4pp) and ~8pp at n=5 (7.7pp). Tau2 is therefore a useful prior when picking an orchestrator model for any RLM-as-tool-use experiment, and a much stronger predictor than parameter count.
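The comparison between the Tau2 gap and the observed recall gap is simple arithmetic over the table values. The snippet below just recomputes it; the variable and function names are illustrative, and the inputs are the rounded figures from the table.

```python
# Recall values from the table above (rounded as published).
recall = {
    "gemma4:e4b": {"n5": 0.417, "n20": 0.417},
    "qwen3.5:4b": {"n5": 0.717, "n20": 0.750},
}
TAU2_GAP_PP = 37.7  # published Tau2 gap in qwen's favour


def gap_pp(metric):
    """Observed recall gap (qwen minus gemma) in percentage points."""
    return (recall["qwen3.5:4b"][metric] - recall["gemma4:e4b"][metric]) * 100


for metric in ("n5", "n20"):
    observed = gap_pp(metric)
    miss = TAU2_GAP_PP - observed
    print(metric, round(observed, 1), round(miss, 1))
```

The prediction error is the difference between the Tau2 gap and each observed gap, so it is smaller at n=20 than at n=5.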

Source · jp-realm RLM A/B + published Tau2 scores (run artifacts recorded in the upstream issue, not committed to this fork) M0nkeyFl0wer/multipass-structural-memory-eval#3
LongMemEval-S 500Q · E2E QA

mempalace-daemon, default /search

headline · 2026-05-29
Corpus
longmemeval_oracle.json
n
500 questions
Adapter
mempalace-daemon (postgres + pgvector + AGE)
Endpoint
POST /search (default)
Reader
o4-mini
Judge
gpt-5.3-chat
Content rules
upstream-exact
Ingest
per-question wing isolation
R@5 overall
97.0%
drawer-id match, chunk-suffix-aware (techempower-org/multipass-structural-memory-eval#98)
QA accuracy
58.6%
judge label = CORRECT or canonical ABSTAIN
R@5 − QA gap
38pp
retrieval finds the session; the QA gap is downstream — substrate-bound (see comparison card)
SME Category | LME type | n | R@5 | QA-acc
cat_1 | single-session IE | 150 | 100.0% | 51.33%
cat_1_negative | abstention | 30 | 96.67% | 90.00%
cat_2c | multi-session | 121 | 98.35% | 74.38%
cat_3_partial | knowledge-update | 72 | 100.0% | 65.28%
cat_6 | temporal-reasoning | 127 | 90.55% | 40.94%
overall | | 500 | 97.00% | 58.60%

The headline flips once the R@5 matcher is correct: the daemon finds the gold session in the top 5 97% of the time. The 38-point R@5→QA gap looked like a reader failure, but the true-oracle correction (see the comparison card) showed it is substrate: “R@5 found the session” doesn't mean the gold content reached the reader (limit=5 + chunking left it out), and with the gold actually present QA rises to a 0.868 ceiling. cat_6 (temporal) has both the lowest R@5 (90.55%) and the worst QA (40.94%): its evidence is the most fragmented across the limited context. cat_1_negative has the highest QA (90%) because a correct abstention is the right answer regardless of what was retrieved.
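The overall row can be cross-checked as an n-weighted mean of the per-category rows. The snippet below is only that arithmetic, using the published per-category values; the names `rows` and `weighted_mean` are illustrative.

```python
rows = [
    # (category, n, R@5 %, QA-acc %)
    ("cat_1", 150, 100.0, 51.33),
    ("cat_1_negative", 30, 96.67, 90.00),
    ("cat_2c", 121, 98.35, 74.38),
    ("cat_3_partial", 72, 100.0, 65.28),
    ("cat_6", 127, 90.55, 40.94),
]


def weighted_mean(rows, col):
    """Mean of column `col`, weighted by each category's question count."""
    total_n = sum(r[1] for r in rows)
    return sum(r[1] * r[col] for r in rows) / total_n


r_at_5 = weighted_mean(rows, 2)
qa_acc = weighted_mean(rows, 3)
gap = r_at_5 - qa_acc

print(round(r_at_5, 1), round(qa_acc, 1), round(gap, 1))  # 97.0 58.6 38.4
```

Weighting by n matters here because the categories are unequal in size: cat_1 and cat_6 together hold more than half of the 500 questions.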

how R@5 went from 3.97% to 97% (techempower-org/multipass-structural-memory-eval#98) Earlier readings of this card showed R@5 = 3.97% and framed the gap as a matcher artifact. It was: the daemon chunks each drawer into <parent>_chunk_NNNNNN sub-drawers and /search returns the chunk IDs, but the matcher compared them exact-string against the parent IDs we stored at ingest, so every hit read as a miss. techempower-org/multipass-structural-memory-eval#98 strips the suffix before comparing. The 97% is computed by re-scoring the existing 2026-05-28 rerun records — no new bench compute, same retrieved sets, correct matcher. QA-acc is unchanged by the fix (it never depended on the matcher).
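The matcher fix described above can be sketched as follows. This is a minimal illustration rather than the code in #98: the regex assumes chunk ids are the parent id followed by `_chunk_` and digits, and the function names and example ids are hypothetical.

```python
import re

# Assumption: chunk ids follow the <parent>_chunk_NNNNNN pattern described above.
CHUNK_SUFFIX = re.compile(r"_chunk_\d+$")


def parent_id(drawer_id):
    """Map a chunk id back to its parent drawer id; parent ids pass through."""
    return CHUNK_SUFFIX.sub("", drawer_id)


def hit_exact(retrieved, gold, k=5):
    """The broken behaviour: exact-string compare of chunk ids to parent ids."""
    return any(d in gold for d in retrieved[:k])


def hit_suffix_aware(retrieved, gold, k=5):
    """The corrected behaviour: strip the chunk suffix before comparing."""
    return any(parent_id(d) in gold for d in retrieved[:k])


# Hypothetical record; the ids are made up for illustration.
gold = {"session_0042"}
retrieved = ["session_0007_chunk_000001", "session_0042_chunk_000003"]

print(hit_exact(retrieved, gold))         # False
print(hit_suffix_aware(retrieved, gold))  # True
```

Because only the comparison changes, the same retrieved sets can be re-scored without rerunning retrieval, which matches how the 97% was obtained from the existing rerun records.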
Source · baselines/longmemeval_mempalace_daemon_2026-05-28-rerun.reagg.json (re-scored post-techempower-org/multipass-structural-memory-eval#98; rerun post-#67 of the 2026-05-28-attempt3 run) techempower-org/multipass-structural-memory-eval#44
LongMemEval-S 500Q · A/B leg

mempalace-daemon, /search/age-fused

A/B finding · 2026-05-28
Corpus
longmemeval_oracle.json
n
500 questions
Adapter
mempalace-daemon
Endpoint
POST /search/age-fused
Reader
o4-mini
Judge
gpt-5.3-chat
R@5 overall
90.2%
vs 97.0% on /search default
QA accuracy
17.4%
vs 58.6% on /search default (−41.2pp)
Context per query
457 chars mean
vs 2539 on /search default (5.5× less)
SME Category | n | R@5 age-fused | QA default | QA age-fused
cat_1 | 150 | 100.0% | 51.33% | 9.33%
cat_1_negative | 30 | 93.33% | 90.00% | 100.00%
cat_2c | 121 | 98.35% | 74.38% | 0.00%
cat_3_partial | 72 | 100.0% | 65.28% | 1.39%
cat_6 | 127 | 64.57% | 40.94% | 33.07%
overall | 500 | 90.20% | 58.60% | 17.40%

The corrected matcher (techempower-org/multipass-structural-memory-eval#98) turns this card from a mystery into a clean finding. /search/age-fused retrieves the gold session at R@5 = 90.2% — almost as well as the default endpoint. Yet QA accuracy is 17.4%, a 73-point R@5→QA gap. The cause is the snippet width: age-fused returns ~457 chars of context per query (5.5× less than default's 2539). The gold session is in the results, but the reader is handed too thin a slice of it to answer. cat_2c (multi-session) collapses to 0% QA at 98% R@5 — the starkest illustration: every gold session retrieved, none answerable from the snippet. cat_1_negative rises to 100% because thin context makes the reader abstain, which is correct for unanswerable questions.

corrected 2026-05-29 — not a retrieval failure Earlier readings of this card showed R@5 = 0.00% and read the age-fused leg as broken retrieval, then (after the empty-triples theory was disproven) as a structural mystery. With the chunk-suffix matcher fix (techempower-org/multipass-structural-memory-eval#98), R@5 is actually 90.2%: age-fused retrieval works. The QA collapse is entirely a context-width problem — the daemon-side snippet contract investigated in techempower-org/palace-daemon#150. The empty-triples theory remains disproven (rerun against 1.83M triples gave byte-identical context_chars), and the LongMemEval-S rerun (techempower-org/multipass-structural-memory-eval#91) is still the right test for whether graph traversal earns its keep on a harder haystack — but on oracle, age-fused's retrieval is not the problem; its snippet width is.
Source · baselines/longmemeval_age_fused_2026-05-28-rerun.reagg.json (re-scored post-techempower-org/multipass-structural-memory-eval#98) techempower-org/multipass-structural-memory-eval#45
LongMemEval-S 500Q · second adapter

familiar adapter, same harness

three-way A/B · 2026-05-29
Corpus
longmemeval_oracle.json
n
500 questions
Adapter
familiar (Hybrid v4 + rerank, fork)
Reader
o4-mini
Judge
gpt-5.3-chat
R@5 overall
28.6%
lowest of three legs — Familiar's retrieval genuinely weaker
QA accuracy
31.0%
QA slightly exceeds R@5 — reader fine, retrieval is the limit
R@5 vs daemon
−68pp
28.6% vs 97.0% on daemon-direct
SME Category | n | R@5 #44 | R@5 #45 | R@5 #46
cat_1 | 150 | 100.0% | 100.0% | 38.00%
cat_1_negative | 30 | 96.67% | 93.33% | 6.67%
cat_2c | 121 | 98.35% | 98.35% | 26.45%
cat_3_partial | 72 | 100.0% | 100.0% | 34.72%
cat_6 | 127 | 90.55% | 64.57% | 21.26%
overall R@5 | 500 | 97.00% | 90.20% | 28.60%

Familiar.realm.watch is a separate memory system in this household — same corpus, different retrieval and reader architecture. The second-adapter discipline is the point: a single reading is a single corpus, and brittle defaults hide on any single corpus. With the corrected matcher (techempower-org/multipass-structural-memory-eval#98) the finding inverts from an earlier reading: Familiar's R@5 (28.6%) is the lowest of the three legs, not the highest. Its retrieval is genuinely weaker than daemon-direct's 97%. And unlike the daemon legs, Familiar's QA (31.0%) slightly exceeds its R@5 — the reader does fine with what it gets; the limit is what reaches it. A different shape from #44/#45, where retrieval finds the session but — as the true-oracle correction later showed — the gold content still often didn't reach the reader. Both point the same way: the limit is what reaches the reader, not its reasoning.

corrected 2026-05-29 — earlier "highest R@5" claim was a matcher artifact A prior reading of this card claimed Familiar had the highest R@5 of the three legs (10.94%, beating daemon's then-3.57%). That was an artifact of the broken substring matcher: Familiar happened to surface session-ids in its retrieved text where the daemon stripped them, so the substring matcher favoured Familiar while undercounting both. The chunk-suffix drawer-id matcher (techempower-org/multipass-structural-memory-eval#98) measures retrieval directly and reverses the picture: daemon 97%, age-fused 90.2%, Familiar 28.6%. Familiar's published Hybrid v4 + rerank stack is tuned for its own corpus shape, not LongMemEval's one-drawer-per-session topology — a fair cross-system caveat, not a defect.
Source · baselines/longmemeval_familiar_2026-05-28-rerun.reagg.json techempower-org/multipass-structural-memory-eval#46
encoder fine-tune · in-domain vs cross-domain

Fine-tuned encoder lifts in-domain — but retrieval isn't the bottleneck

cross-domain null · 2026-05-29
In-domain
LongMemEval-S via MemPalace's own bench (n=500)
Cross-domain
jp-realm-v0.1 (n=30, covered n=29)
Encoders
all-MiniLM-L6-v2 base vs adaptmem FT-300
Stacks
raw · +FT-300 · +hybrid_v4+FT-300
In-domain stack (LongMemEval-S, n=500) | R@1 | R@5 | R@10
MemPal raw default | 0.806 | 0.966 | 0.982
+ adaptmem FT-300 | 0.862 | 0.980 | 0.994
+ hybrid_v4 + FT-300 | 0.916 | 0.990 | 0.998
katana FT-300 repro (held-out 200) | 0.925 | 1.000 | 1.000
Cross-domain (jp-realm-v0.1, covered n=29) | R@1 | R@5 | R@10
base (all-MiniLM-L6-v2) | 0.3448 | 0.5172 | 0.6207
FT-300 (MNR fine-tune) | 0.3621 | 0.5172 | 0.6034
delta | +1.73pp | 0.00 | −1.73pp

Domain-adaptive encoder fine-tuning is a real retrieval lift in-domain. On LongMemEval-S through MemPalace's own bench, adaptmem's FT-300 moves R@5 0.966→0.980 and R@1 0.806→0.862; stacked with hybrid_v4 retrieval the gains compose to R@1 0.916 / R@5 0.990 — encoder fine-tune (layer 2) and hybrid retrieval (layer 3) operate on different failure modes and add independent lift. Our own katana FT-300 reproduction hit R@5 = 1.000 on the held-out 200. So the encoder is not useless — quite the opposite.

What our jp-realm reading shows is narrower: the lift does not transfer cross-domain. A FT-300 encoder trained on conversational memory, dropped onto JP's personal KB, gives best delta +1.73pp at R@1 (R@5 flat) against a predicted +30–33pp — a clean cross-domain null, exactly the open question the orthogonal-layers note flagged. And it doesn't change the end-to-end picture either: on oracle LongMemEval retrieval is already ~0.974 while the reader leaves a ~45pp gap (techempower-org/multipass-structural-memory-eval#116). Encoder tuning is a legitimate retrieval improvement, but it is not the lever for end-to-end QA — what reaches the reader is.

in-domain numbers credited to nakata-app · MemPalace/mempalace D#1249 The in-domain table is nakata-app's measurement, evaluated through MemPalace's own longmemeval_bench.py with a monkey-patched encoder swap (same dataset, same encoder family, zero changes to eval logic) and posted to MemPalace/mempalace discussion #1249. The raw-baseline R@5 of 0.966 exactly reproduces MemPalace's published number. The orthogonality finding (fine-tune and hybrid retrieval compose) is the durable claim; the absolute percentages attribute to that thread. Framing and layer model: SME's docs/research/adaptmem-orthogonal-layers.md.
Source · cross-domain: baselines/jp_realm_encoder_swap_{default,ft300}_2026-05-29.json + jp_realm_encoder_delta_2026-05-29.json · in-domain repro: baselines/lme_substrate_ft300_katana_test200_2026-05-17.json · nakata-app: MemPalace/mempalace D#1249 techempower-org/multipass-structural-memory-eval#84
LongMemEval-S · retrieval A/B

Age-fusion ~neutral on representative data

headline · 2026-05-29
Corpus
longmemeval_s_cleaned.json
n
150 (stratified, 25 per question_type)
Adapter
mempalace-daemon
Endpoints
/search vs /search/age-fused
Scoring
drawer-id R@K (techempower-org/multipass-structural-memory-eval#98), retrieval-only
/search R@5
92.67%
plain endpoint, drawer-id match
/search/age-fused R@5
92.00%
graph-fused endpoint
age-fusion Δ R@5
−0.67pp
−1 question of 150 (n.s.)

On a representative, category-stratified sample, age-fusion shows NO significant R@5 gain over plain /search (Δ = −1 question of 150; R@1 +2 questions). The +2.0pp R@5 “win” seen on an earlier n=100 first-100 slice did not replicate — that slice was single-session-dominated (the S corpus is question_type-sorted; fixed via --stratify-by in techempower-org/multipass-structural-memory-eval#122). Per-category (n=25 each, ±1-question noise → directional hypothesis only): age-fusion helps temporal-reasoning + knowledge-update, hurts single/multi-session recall.
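One way to see why a one-question difference at n=150 is not significant is to compare it with the sampling noise of a single R@5 estimate. The sketch below uses a plain normal-approximation standard error; this is an assumption for illustration and not necessarily the test behind the "n.s." label (a paired test over the same questions would be the stricter choice).

```python
from math import sqrt

N = 150
hits_search = round(0.9267 * N)     # 139 questions with gold in top 5
hits_age_fused = round(0.9200 * N)  # 138 questions with gold in top 5


def standard_error(hits, n):
    """Normal-approximation standard error of a proportion."""
    p = hits / n
    return sqrt(p * (1 - p) / n)


delta_pp = (hits_age_fused - hits_search) / N * 100
se_pp = standard_error(hits_search, N) * 100

print(round(delta_pp, 2), round(se_pp, 2))
```

The observed difference is well under one standard error of either estimate, so a single stratified sample of this size cannot distinguish the two endpoints on R@5.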

Source · baselines/longmemeval_s_strat150_{search,age_fused}_2026-05-29.reagg.json techempower-org/multipass-structural-memory-eval#91
LongMemEval oracle · reader sweep (Pass A)

A large R@5→QA gap — measured directly

headline · 2026-05-29
Corpus
longmemeval_oracle.json (pinned context, retrieval held fixed)
n
500
Readers
o4-mini, gpt-5.3-chat
Judge
gpt-5.3-chat
Retrieval ceiling
R@5 97.4%
Retrieval ceiling R@5
97.4%
oracle — gold pinned in context
o4-mini QA
50.4%
search-default context
gpt-5.3-chat QA
52.2%
search-default context
R@5→QA gap
~45pp
ceiling clears; QA doesn't — on this limited context
Endpoint | o4-mini QA | gpt-5.3-chat QA
search-default | 50.4% | 52.2%
age-fused | 43.2% | 46.6%

With the search-default context pinned (R@5 0.974), both readers answer correctly only ~50% — a ~45pp R@5→QA gap. Correction (see the comparison card): this gap is substrate, not the reader. The true-oracle test later showed that “R@5 found the session” does not mean the gold content reached the reader — limit=5 + chunking (and, for assistant-authored answers, the user-only ingest) often left the answer out of the context entirely. Hand the reader the gold verbatim and QA rises to a 0.868 ceiling, near GPT-4o's oracle. So the ~50% reflects what reached the reader, not a reasoning limit. Age-fused context is harder still (QA 43–47%). Self-judge caveat: gpt-5.3-chat judges its own family on its own reader leg, which may inflate it.

Source · baselines/reader_sweep_passA_{search-default,age-fused}_2026-05-29.json techempower-org/multipass-structural-memory-eval#116
LongMemEval oracle · reader sweep (Pass B)

A stronger reader does NOT close the gap

counterintuitive · 2026-05-29
Corpus
pinned oracle (stratified 150, search-default)
Readers
claude-opus-4-8 (Bedrock), o4-mini, gpt-5.3-chat
Prompt
baseline
Judge
gpt-5.3-chat
claude-opus-4-8 QA
39.3%
strongest reader, worst overall
o4-mini QA
46.7%
same 150-question slice
gpt-5.3-chat QA
44.7%
self-family judge leg
Category | Opus QA-acc
single-session-user | 84%
temporal-reasoning | 48%
knowledge-update | 48%
multi-session | 44%
single-session-assistant | 12%
single-session-preference | 0%

Opus 4.8 — the strongest reader — scored the worst overall (39.3%). But it is not a capability gap: Opus is the best reader on well-posed direct recall (single-session-user 84%, the highest of any reader on any category) and collapses only on mis-specified categories (preference 0%, assistant 12%). It is penalized for following the baseline prompt literally (over-abstaining where the gold answer is an inferred preference) and for thoroughness (reporting both old+new values → judged PARTIAL; answers 2× longer). The 45pp gap is prompt + judge design, not reader capability. Both follow-ups are now published below: the prompt-axis fix and the Opus-as-judge re-scoring.

Source · baselines/reader_sweep_passB_opus_strat150_search-default_2026-05-29.json techempower-org/multipass-structural-memory-eval#116
LongMemEval oracle · reader sweep (prompt axis)

The reader gap was the prompt, not the model

reader-prompt fix · 2026-05-29
Corpus
pinned oracle (stratified 150, 25/type, search-default)
Readers
claude-opus-4-8 (Bedrock), o4-mini
Prompt axis
baseline → committed → preference
Judge
gpt-5.3-chat (not yet canonical — floor only)
Context
full oracle
Opus baseline QA
36.0%
worst reader, baseline prompt
Opus preference QA
59.3%
best config overall, +23pp
ss-preference canary
0.04→0.76
Opus, +72pp on one category
Prompt swing
+23pp
same model, same context
Reader | baseline | committed | preference
o4-mini | 0.420 | 0.507 | 0.527
claude-opus-4-8 | 0.360 | 0.473 | 0.593

“Opus is the worst reader” was a prompt artifact. With the preference-tuned prompt, Opus goes from worst (0.360) to best of any config (0.593), a +23pp swing from changing nothing but the prompt. The largest single share of the gap lived in one category: single-session-preference, where Opus scored 0.04 under the baseline prompt (it refused to make a recommendation; the “say I don’t know” instruction over-fired on inference questions) and 0.76 under the preference prompt (+72pp). With 25 of the 150 questions in that category, that alone contributes about +12pp of the +23pp overall lift from 0.36 to 0.59. This is the reader-harness fix; the judge is still gpt-5.3-chat (not the LongMemEval-canonical type-specific judge), so 0.593 is a floor and the canonical-judge work compounds on top. Bottom line: the #116 reader gap is prompt design, not model capability.
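How far a single category can move the overall score follows from the stratification (25 questions per type out of 150). The snippet below is a quick check of that arithmetic using the published figures; the function and variable names are illustrative.

```python
N_TOTAL = 150
N_PER_TYPE = 25


def overall_contribution(cat_before, cat_after, n_cat=N_PER_TYPE, n_total=N_TOTAL):
    """Change in overall accuracy caused by one category's change alone."""
    return (cat_after - cat_before) * n_cat / n_total


pref_contribution = overall_contribution(0.04, 0.76)  # ss-preference, Opus
overall_swing = 0.593 - 0.360                         # Opus baseline -> preference
share = pref_contribution / overall_swing

print(round(pref_contribution, 3), round(overall_swing, 3), round(share, 2))
```

A +72pp move on one of six equally sized categories is worth +12pp overall, so the rest of the overall swing has to come from the other five categories.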

Source · baselines/reader_sweep_passB_promptfix_2026-05-29.json techempower-org/multipass-structural-memory-eval#116
palace-daemon · FlashRank rerank A/B

Cross-encoder rerank lifts MRR +15–23%

verified · 2026-05-27
Repo
palace-daemon (#46)
Corpus
live palace, ~375k drawers
Backend
postgres + pgvector (familiar)
Model
ms-marco-TinyBERT-L-2-v2 (nano, CPU)
Query set
12 known-item, 11 usable, pool=20
Pattern
same pool, ordering-only A/B
Run | R@5 | R@10 | MRR base | MRR rerank | Δ MRR
#2 (confirming) | 0.909→0.909 | 1.00 | 0.748 | 0.921 | +23.1%
#1 (first) | 1.00→0.909 | 1.00 | 0.761 | 0.877 | +15.3%
movement | rerank-spike 7→1 (+6) · fallback-contract 4→1 · 7 no-change | 1 regr.

A clean ordering-only A/B — retrieval is held constant, only the final sort changes (distance vs FlashRank score) on the same candidate pool. MRR lifts +15.3% (run #1) and +23.1% (run #2), driven by a buried answer rescued from rank 7 to rank 1 plus two smaller promotions; 7 of 11 queries were already optimal and rerank correctly left them alone. R@5 is a wash and R@10 is untouched, so MRR is the load-bearing metric for this known-item set. Rerank latency stayed at 47 ms mean (126 ms under host load), worst single request 557 ms — well inside a 1 s budget on CPU. Verdict: keep nano, A/B a larger model next.
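As a minimal sketch of why MRR moves while R@10 does not in an ordering-only A/B, the following computes both metrics over the same candidate pool under two sort orders. The query IDs, document IDs, and helper names are hypothetical and are not the palace-daemon query set or its eval code.

```python
from typing import Dict, List


def reciprocal_rank(ranked_ids: List[str], gold_id: str) -> float:
    """Return 1/rank of the gold item (1-indexed), or 0.0 if it is absent."""
    for position, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == gold_id:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(runs: Dict[str, List[str]], gold: Dict[str, str]) -> float:
    """Average reciprocal rank over all queries."""
    return sum(reciprocal_rank(runs[q], gold[q]) for q in runs) / len(runs)


def recall_at_k(runs: Dict[str, List[str]], gold: Dict[str, str], k: int) -> float:
    """Fraction of queries whose gold item appears in the top k."""
    hits = sum(1 for q in runs if gold[q] in runs[q][:k])
    return hits / len(runs)


# Hypothetical two-query example: same candidates, two orderings.
gold = {"q1": "d7", "q2": "d1"}
by_distance = {
    "q1": ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"],  # gold buried at rank 7
    "q2": ["d1", "d2", "d3"],                                # already optimal
}
by_rerank = {
    "q1": ["d7", "d1", "d2", "d3", "d4", "d5", "d6", "d8"],  # gold promoted to rank 1
    "q2": ["d1", "d2", "d3"],                                # left alone
}

mrr_base = mean_reciprocal_rank(by_distance, gold)
mrr_rerank = mean_reciprocal_rank(by_rerank, gold)
r10_base = recall_at_k(by_distance, gold, 10)
r10_rerank = recall_at_k(by_rerank, gold, 10)
```

Because the pool is identical in both legs, any metric that only asks "is the gold in the pool?" cannot change; only rank-sensitive metrics such as MRR can.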

the one regression: score compression
daemon-deploy-arch moved 3→8 in both runs. The top 7 reranked passages all scored 0.9971–0.9994 (a 0.002 spread) and every one was genuinely on-topic; when 7 passages are all relevant and scored within 0.002, head ordering is a coin-flip. This is the distilled 2-layer cross-encoder's score-compression failure mode and the strongest case for A/B-testing a larger model (ms-marco-MiniLM-L-12-v2; this FlashRank build ships no L-6). Cross-validated: replaying frozen pools in --mode candidates reproduced run #1's metrics exactly.
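One way to flag this failure mode automatically is to measure the score spread across the head of the reranked list. The sketch below assumes a hypothetical threshold (`epsilon=0.005`) and hypothetical intermediate scores; only the two endpoints, 0.9971 and 0.9994, come from the run above.

```python
from typing import Sequence


def score_spread(scores: Sequence[float], head: int) -> float:
    """Max minus min over the `head` highest scores."""
    top = sorted(scores, reverse=True)[:head]
    return max(top) - min(top)


def is_compressed(scores: Sequence[float], head: int = 7, epsilon: float = 0.005) -> bool:
    """True when the head of the list is packed into a band narrower than epsilon.

    epsilon is an illustrative cut, not a value taken from palace-daemon.
    """
    return score_spread(scores, head) < epsilon


# Endpoints from the report; the values in between are made up for illustration.
rerank_scores = [0.9994, 0.9990, 0.9987, 0.9983, 0.9979, 0.9975, 0.9971, 0.4100]
head_spread = score_spread(rerank_scores, head=7)
compressed = is_compressed(rerank_scores)
```

A check like this could gate whether head ordering is trusted or whether the original distance order is kept for the compressed band.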
Source · palace-daemon/docs/evals/rerank-eval-live-2026-05-27.json (run #2) · rerank-eval-2026-05-27.json (run #1) · rerank-candidates-2026-05-27.json techempower-org/palace-daemon
#111 · hybrid scorer-weight tuning

A documented weight that recovers MRR without losing R@5

verified · 2026-05-31
Corpus
live familiar palace (postgres + AGE)
Query set
12 labeled known-item (palace-daemon PR #64)
Method
in-process searcher, FlashRank OFF, weights set per sweep point
Knob
PALACE_HYBRID_VECTOR_WEIGHT / _BM25_WEIGHT (mempalace#342)
winning weight
0.85 / 0.15
vector / BM25
R@5
1.000
held — no recall lost
MRR
0.833
+2.5pp vs floor, +4.8pp vs default-hybrid
config | R@5 | R@10 | MRR
vector / convex (default wts) | 0.917 | 1.000 | 0.808
union / convex (default wts) | 1.000 | 1.000 | 0.785
hybrid / convex (default 0.6/0.4) | 1.000 | 1.000 | 0.785
hybrid / convex (0.85/0.15) | 1.000 | 1.000 | 0.833

The #111 acceptance criterion (a documented hybrid weight achieving R@5 ≈ 1.000 without regressing MRR vs union/vector) is met by vector_weight 0.85 / bm25_weight 0.15. The default 0.6/0.4 hybrid had regressed MRR to 0.785, level with union and below the 0.808 vector-only floor; 0.85/0.15 lifts it to 0.833, which is +2.5pp over the floor and +4.8pp over the default-hybrid regression, while R@5 stays pinned at 1.000. MRR turns out to be non-monotonic in vector_weight: the earlier “graph promoted one query and demoted another” reading was wrong; the real mechanism is the convex blend over-weighting BM25 at 0.4. The weight ships as an env knob (mempalace#342); the deployed default stays 0.6/0.4 because n=12 is below the n≥25 bar to flip a production default.
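To make the over-weighting mechanism concrete, here is a minimal sketch of a convex blend of vector and BM25 scores. It assumes per-leg min-max normalisation, which the report does not specify, and it assumes the abbreviated knob `_BM25_WEIGHT` expands to `PALACE_HYBRID_BM25_WEIGHT`. The scores and function names are hypothetical and are not MemPalace's implementation.

```python
import os
from typing import Dict, Tuple


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale one leg's scores to [0, 1]; a constant leg maps to all zeros."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def convex_blend(vector_scores: Dict[str, float], bm25_scores: Dict[str, float],
                 vector_weight: float, bm25_weight: float) -> Dict[str, float]:
    """Weighted sum of normalised legs, with weights renormalised to sum to 1."""
    total = vector_weight + bm25_weight
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    wv, wb = vector_weight / total, bm25_weight / total
    v, b = min_max(vector_scores), min_max(bm25_scores)
    return {doc: wv * v.get(doc, 0.0) + wb * b.get(doc, 0.0) for doc in set(v) | set(b)}


def weights_from_env(default: Tuple[float, float] = (0.6, 0.4)) -> Tuple[float, float]:
    """Read the two weights from the environment (second variable name is assumed)."""
    vw = float(os.environ.get("PALACE_HYBRID_VECTOR_WEIGHT", default[0]))
    bw = float(os.environ.get("PALACE_HYBRID_BM25_WEIGHT", default[1]))
    return vw, bw


def top_hit(blended: Dict[str, float]) -> str:
    return max(blended, key=blended.get)


# Hypothetical scores: "a" is the semantic match, "b" is a keyword-heavy distractor.
vector_scores = {"a": 0.9, "b": 0.5, "c": 0.1}  # similarity, higher is better
bm25_scores = {"a": 1.0, "b": 9.0, "c": 2.0}

default_top = top_hit(convex_blend(vector_scores, bm25_scores, 0.6, 0.4))
tuned_top = top_hit(convex_blend(vector_scores, bm25_scores, 0.85, 0.15))
```

In this toy case the 0.6/0.4 blend lets the keyword-heavy document outrank the semantic match, and 0.85/0.15 restores it, which is the shape of effect the sweep attributes to BM25 over-weighting.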

third age-fusion-null data point
On these 12 golden queries the graph leg is inert: hybrid scores identically to union (R@5 1.000 / MRR 0.785 at default weights), so the lift comes from BM25 pool-widening, not graph traversal. That joins the Cat 2c/7 graph-RRF tax and the #91 age-fusion R@5 null as a third independent reading where the AGE graph half adds ~nothing to top-5 retrieval on drawer-query workloads (it earns its keep on entity-anchored queries).
Source · baselines/hybrid-scorer-weight-tuning-2026-05-31.json · docs/benchmarks/2026-05-31-hybrid-scorer-weight-tuning.md (SME#227 + mempalace#342, both merged) techempower-org/multipass-structural-memory-eval#111
#103 · cross-encoder rerank A/B (SME side)

Cross-encoder rerank is neutral-to-negative — keep it off

verified · 2026-05-31
Probe set
200 git-derived probes (same set as the #162 fusion A/B)
Substrate
scratch daemon seeded with the mempalace git/docs corpus
Method
daemon /search/hybrid per-request rerank flag, read-only A/B
R@10 (all three legs)
0.60
identical — rerank reorders, doesn’t recall more
MRR Δ (rerank on)
−0.6pp
0.299 off → 0.293 TinyBERT — slightly hurt
bigger model (L-12)
worse & 3×
MRR 0.284, p50 1523 ms vs 475 ms
leg | MRR | R@5 | R@10 | found | p50
rerank OFF | 0.299 | 0.510 | 0.60 | 120/200 | 555 ms
TinyBERT-L-2-v2 (daemon default) | 0.293 | 0.515 | 0.60 | 120/200 | 475 ms
MiniLM-L-12-v2 (bigger) | 0.284 | 0.505 | 0.60 | 120/200 | 1523 ms

This is the SME-side rerank A/B — distinct from the palace-daemon FlashRank known-item A/B above (a 12-query run that found a +15–23% MRR lift on a different, answer-buried query shape). On a corpus-seeded scratch daemon — the git/docs targets re-ingested so the relevant set is actually present — the verdict is the opposite: cross-encoder rerank is neutral-to-slightly-negative. R@10 is identical at 0.60 across all three legs (rerank only reorders the top-K; it cannot recall what vector retrieval missed), MRR is slightly hurt by rerank (0.299→0.293), and the bigger MiniLM-L-12-v2 is both worse on MRR and 3× the latency (1523 ms vs 475 ms p50). Recommendation: keep rerank OFF / opt-in, do not promote the bigger model. This is the 4th independent reading that the structural/rerank layer doesn’t lift retrieval — the vector backbone is the lever (with the age-fusion null ×3: Cat 2c/7 graph-RRF tax, #91 age-fusion R@5, and the #111 graph-leg-inert finding).

both halves of the story · a measurement-validity catch, then the real number
The first attempt (#225, on prod familiar) ran clean but measured nothing: the 200 git-derived probes target mempalace’s own git/docs corpus, which is no longer present in conversational-only prod familiar. Only 17/200 targets surfaced (Recall@10 0.085), so the rerank had nothing to reorder. A clean exit code masked a near-empty corpus; that delta was corpus-floor noise, not a verdict. This corpus-seeded re-run supersedes it with the relevant set actually present. (It also corrects the #225 model label: ms-marco-MiniLM-L-6-v2 was never in the FlashRank build, so #225’s rerank leg actually ran TinyBERT-L-2-v2, the same nano model that is the daemon default.)
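A coverage guard of the following shape would have caught the near-empty corpus before any delta was read. This is a sketch only: the function names and the 0.5 floor are hypothetical choices, not part of the SME harness.

```python
def target_coverage(found: int, total: int) -> float:
    """Fraction of probe targets that surfaced at all in the retrieval pool."""
    if total <= 0:
        raise ValueError("total must be positive")
    return found / total


def ab_is_interpretable(found: int, total: int, min_coverage: float = 0.5) -> bool:
    """Refuse to read an A/B delta when too few targets are present to reorder.

    min_coverage is an illustrative floor, not a documented SME threshold.
    """
    return target_coverage(found, total) >= min_coverage


# Figures from the two runs described above.
prod_coverage = target_coverage(17, 200)      # first attempt, prod familiar
seeded_coverage = target_coverage(120, 200)   # corpus-seeded re-run
prod_ok = ab_is_interpretable(17, 200)
seeded_ok = ab_is_interpretable(120, 200)
```

Failing the run loudly on low coverage turns a clean exit code into a meaningful signal.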
Source · baselines/ce_rerank_corpus_seeded_2026-05-31.json (corpus-seeded; supersedes the blocked baselines/ce_rerank_ab_2026-05-30.json) · docs/benchmarks/2026-05-30-ce-rerank-ab.md techempower-org/multipass-structural-memory-eval#103
palace-daemon · NCD novelty calibration

novelty_score is continuous, not bimodal

verified · 2026-05-27
Repo
palace-daemon (#47)
Corpus
live palace, ~375k drawers
Backend
postgres (familiar:8085)
Method
gzip-NCD, block rolling window=20
Sample
35 (wing,room) groups, n=280
Scorer
production novelty.score_novelty
overall median
0.622
mean 0.558, stdev 0.222 (0=dup .. 1=novel)
redundant tail
p10 0.165
min 0.028; long tail below ~0.20
per-room median spread
~0.37
discoveries 0.30 vs planning 0.68
Room | n | p10 | median | p90
references | 144 | 0.173 | 0.636 | 0.781
architecture | 40 | 0.363 | 0.649 | 0.776
discoveries | 40 | 0.126 | 0.304 | 0.705
planning | 32 | 0.496 | 0.677 | 0.821
problems | 24 | 0.215 | 0.637 | 0.791

The gzip-NCD novelty_score on the live corpus is continuous and left-skewed, not bimodal: a broad novel hump across 0.55–0.85, a long redundant tail below ~0.20 (which pulls the mean, 0.558, below the median, 0.622), and the middle band populated throughout with no empty valley. Per-room baselines diverge by ~0.37 in median (discoveries 0.30 vs planning 0.68), so a single global NCD cut mislabels content; the calibration recommends per-room percentile thresholds: redundant at p15(room), novel at p60(room).
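The two ideas in this paragraph, a gzip-based normalized compression distance and per-room percentile cuts, can be sketched as follows. This is not the production `novelty.score_novelty`: scoring against each neighbour in the window and keeping the minimum is an assumption about how the rolling window is used, and the room score lists are made up for illustration.

```python
import gzip
from typing import Dict, List, Sequence


def _clen(data: bytes) -> int:
    """Compressed length in bytes."""
    return len(gzip.compress(data))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance: near 0 for duplicates, near 1 for unrelated text."""
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy, cxy = _clen(bx), _clen(by), _clen(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)


def novelty(text: str, window: Sequence[str], size: int = 20) -> float:
    """Distance to the closest of the last `size` neighbours, clamped to [0, 1]."""
    recent = window[-size:]
    if not recent:
        return 1.0
    return max(0.0, min(1.0, min(ncd(text, n) for n in recent)))


def percentile(values: Sequence[float], q: float) -> float:
    """Linear-interpolated percentile, q in [0, 100]."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no values")
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def room_thresholds(scores_by_room: Dict[str, List[float]]) -> Dict[str, Dict[str, float]]:
    """Per-room cuts: redundant at p15(room), novel at p60(room)."""
    return {room: {"redundant": percentile(s, 15), "novel": percentile(s, 60)}
            for room, s in scores_by_room.items()}


def label(score: float, room: str, thresholds: Dict[str, Dict[str, float]]) -> str:
    cut = thresholds[room]
    if score <= cut["redundant"]:
        return "redundant"
    return "novel" if score >= cut["novel"] else "middle"


a = "Postgres with pgvector stores the drawer embeddings and serves nearest-neighbour search."
b = "The garden needs watering twice a week in summer, and the tomatoes are staked before July."
dup_score = novelty(a, [b, a])    # an exact duplicate is in the window
fresh_score = novelty(a, [b])     # only unrelated text in the window

# Hypothetical per-room score samples, shaped loosely like the table above.
thresholds = room_thresholds({
    "discoveries": [0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.60, 0.70],
    "planning": [0.48, 0.52, 0.60, 0.65, 0.68, 0.70, 0.75, 0.80, 0.85],
})
```

With these made-up samples, a score of 0.50 is labelled novel in discoveries and redundant in planning, which is the mislabelling a single global cut would hide.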

scope: distribution, not ECE/Brier; and a silent no-op fixed
This is a distribution calibration of the NCD score, not a probabilistic (ECE/Brier) calibration. It also surfaced two bugs. First, the live scorer was a silent no-op until palace-daemon #63: it read window text from the wrong field, so every write returned novelty_score=1.0 (the numbers above were captured after the fix). Second, content_preview is truncated to ~200 chars, so the write path still scores against truncated neighbours (NCD on short prefixes is noisier and biased high). The calibration script computes the full-content distribution by default.
Source · palace-daemon/docs/evals/novelty_calibration.json (live, n=280) · offline synthetic fixture superseded · SME isotonic ECE/Brier follow-up: techempower-org/multipass-structural-memory-eval#105 techempower-org/palace-daemon
LongMemEval oracle · reader sweep (judge axis)

A stronger judge doesn't rescue the gap

judge axis · 2026-05-29
Corpus
pinned oracle (stratified 150, search-default)
Readers
claude-opus-4-8 / o4-mini / gpt-5.3-chat (same answers as Pass B)
Judge re-scored
gpt-5.3-chat → claude-opus-4-8
Method
same hypotheses, different judge
Reader (answers fixed) | QA gpt5.3-judge | QA Opus-judge | Δ
claude-opus-4-8 | 0.393 | 0.420 | +0.027
o4-mini | 0.467 | 0.480 | +0.013
gpt-5.3-chat | 0.447 | 0.480 | +0.033

To isolate the judge variable, the identical Pass B reader answers were re-graded with Opus-4.8 as judge (the original judge was gpt-5.3-chat). Every reader lifts modestly — +1–3pp — mostly by rescuing single-session-preference questions (the Opus reader's preference category moves 0.00→0.12). But the Opus judge does not change the ranking and does not rescue the Opus reader, which stays the lowest at 0.420. So the judge confound is real but small: the dominant factor in the reader gap is prompt design, not judge strictness. The prompt-fix sweep is published — see the prompt-axis card above.

Source · baselines/reader_sweep_passB_opus_REJUDGED_2026-05-29.json techempower-org/multipass-structural-memory-eval#116
LongMemEval oracle · canonical type-specific judge

Fixing the judge un-collapses preference — but the gap is the reader

canonical judge · 2026-05-29
Corpus
LongMemEval-S oracle, n=500 (all 6 types + 30 abstention)
Reader
gpt-5.3-chat (held constant, baseline prompt, full context)
Judge
gpt-5.3-chat + canonical type-specific prompts (NOT gpt-4o-2024-08-06 — not deployed here)
Method
same reader answers, paraphrased rubric → verbatim LongMemEval templates
preference /search
+20.0pp
0.133 → 0.333, un-collapsed
preference age-fused
+26.7pp
0.100 → 0.367, un-collapsed
spurious ABSTAIN
21 → 0
old judge invented them on temporal Qs
residual oracle gap
~31–36pp
reader/substrate, not the scorer
Question type | /search Δ | age-fused Δ
single-session-preference | +20.0pp | +26.7pp
single-session-user | +0.0pp | +1.4pp
single-session-assistant | +0.0pp | +7.1pp
multi-session | −3.8pp | −6.0pp
knowledge-update | −5.1pp | −9.0pp
temporal-reasoning | −2.3pp | −2.3pp
OVERALL (harness) | 0.510 | 0.456
OVERALL (+abstention) | 0.562 | 0.510

The companion card swaps the judge model (Opus-as-judge) and finds it doesn't help. This card swaps the judge prompts: porting LongMemEval's verbatim type-specific templates — in particular the rubric-based preference template the old paraphrased judge lacked. That un-collapses single-session-preference (+20.0pp / +26.7pp, the largest single-category move) and removes the spurious-ABSTAIN noise (the old judge emitted 34 ABSTAIN labels, 21 on non-abstention temporal questions; the canonical binary judge can't). The stricter, more faithful templates pull KU and multi-session down a few points — that is the old judge having over-credited via a softer rubric, not a regression. The headline: the overall did NOT move toward the published 87% oracle — corrected, abstention-credited it is 0.562 / 0.510. The judge-prompt confound was real (preference collapse and label noise were genuine measurement artifacts) but it is not the bulk of the 35pp gap. The residual ~31–36pp is reader/substrate, not the scorer — and the true-oracle test (see the comparison card) later resolved which: it is substrate (what reaches the reader), not the reader's reasoning.

Disclosure. Judge is gpt-5.3-chat with canonical prompts, not the canonical gpt-4o-2024-08-06 snapshot (not deployed on this resource) — the fix is the prompts, not the model. 2 ERROR rows per run from an Azure content-filter trip (graceful retry-then-ERROR by design; counted as wrong, a 0.4% asymmetry vs the confounded denominator). Absolute numbers are understated by the aggregator’s ABSTAIN dead-code bug (#148) — it never credits a correct ABSTAIN — but that depresses both runs identically, so the deltas are valid.

Source · baselines/reader_sweep_passA_canonical-judge_{search-default,age-fused}_2026-05-29.json · docs/benchmarks/2026-05-29-canonical-judge-passA.md techempower-org/multipass-structural-memory-eval#116
iii methodology
conditions · baselines · corpora

Four conditions. Two baselines. Multiple corpora.

The methodology is load-bearing. The headline number changes less than the condition that produced it — so the condition is what gets named, versioned, and reported.

conditions

A / B / C / D

A. System under test, default config.
B. System with the structural layer disabled (graph-off, registry-off, etc.).
C. System with retrieval replaced by oracle retrieval, which isolates reader from retriever.
D. Karpathy baselines: full-context (D1, whole corpus in prompt) and karpathy-compiled (D2, LLM-compiled wiki). If your structure can't beat D1, the structure isn't earning its complexity.
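The conditions are meant to be read as paired deltas rather than as standalone scores. A minimal sketch of that bookkeeping is below; the class, its field names, and the numbers are hypothetical and are not SME readings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConditionReadings:
    """One metric (for example QA accuracy) measured under each condition."""
    a_default: float           # A: system under test, default config
    b_structure_off: float     # B: structural layer disabled
    c_oracle_retrieval: float  # C: oracle retrieval, same reader
    d1_full_context: float     # D1: whole corpus in the prompt

    def structural_lift(self) -> float:
        """A minus B: what the structural layer adds."""
        return self.a_default - self.b_structure_off

    def retrieval_gap(self) -> float:
        """C minus A: headroom attributable to retrieval, not the reader."""
        return self.c_oracle_retrieval - self.a_default

    def beats_full_context(self) -> bool:
        """Whether the default system outperforms the D1 baseline."""
        return self.a_default > self.d1_full_context


# Made-up values, for illustration only.
reading = ConditionReadings(a_default=0.60, b_structure_off=0.58,
                            c_oracle_retrieval=0.85, d1_full_context=0.62)
lift = reading.structural_lift()
gap = reading.retrieval_gap()
earns_complexity = reading.beats_full_context()
```

Naming each delta after the pair of conditions that produced it keeps the condition attached to the number when it is reported.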

corpora

Multiple shapes.

SME ships two evaluation corpora — jp-realm-v0.1 (a personal-knowledge corpus with adversarial entity overlap) and good-dog-corpus (24 notes across 6 domains with pre-authored questions). External corpora — LongMemEval, LoCoMo, MINE — integrate through the same adapter ABC. A single-corpus reading is a sketch; the load-bearing readings come from the same diagnostic run across multiple corpus shapes.

posture

Diagnostic, not benchmark.

Cross-system absolute rankings are out of scope. Two systems with different ontologies, different corpora, or different retrieval conditions produce readings that are not directly comparable. SME supports controlled cross-system runs when corpora and ontologies are matched, but treats unconditioned "system X scores higher than system Y" claims as confounded.

scope

What SME doesn't measure.

Generalisation of deltas across corpus scale, ontology design quality, operator workflow beyond the diagnostic report, live agentic memory dynamics (read-after-write, JEPSEN-shaped questions — Cat 10 in the backlog), and human-judgment calibration without explicit calibration runs. Naming the boundaries is part of the framing — readings that fall outside SME's scope are invitations to reach for a complementary tool, not failures of the framework.

iv how we got here
research narrative

A timeline of findings.

The story of SME against MemPalace is a sequence of paired readings — each one a structural answer to the last one's surprise. Below: the load-bearing ones, with the date each was confirmed.

2026-04-11
Categories get names.
The nine categories pick up palace-nod subtitles — The Lookup (1), The Crossing (2), The Blueprint (8), The Handshake (9), and so on. Internal codes unchanged so configs don't break; the names are for the docs. (Spec v8, addendum.)
2026-04-25
The daemon adapter ships.
MemPalaceDaemonAdapter targets palace-daemon on the postgres + pgvector + AGE backend — the production retrieval path. Live daemon at familiar.jphe.in:8085 (migrated from disks circa 2026-05-24, both palace-daemon and postgres moved together).
2026-05-15
Tau2 predicts the Cat 9a gap.
On a 30-question Cat 9a-shaped RLM experiment, the empirical recall gap between gemma4:e4b and qwen3.5:4b lands at +30pp at n=5, +33pp at n=20 — matching the published +37.7pp Tau2 gap to within ~5pp. Cross-corpus generalisation to SME-shaped diagnostics works. Tau2 becomes a defensible prior for orchestrator-model selection.
2026-05-17
Substrate-floor parity verified.
R@5 = 0.9660 on LongMemEval-S, byte-identical between the postgres+pgvector backend and the legacy ChromaDB backend. The "did the migration regress retrieval?" question gets a measured "no" instead of an inferred one. (Commit c3e204e.)
2026-05-17
AdaptMem variants matter.
Two distinct encoder weights both labelled "adaptmem FT-300" exist on disk — one trained on conversational query/session pairs (LongMemEval-domain, R@5 = 1.000 on 200-test) and one trained on Python def/docstring text (CodeSearchNet-domain, R@5 = 0.928–0.966). Same fine-tune algorithm, different domains, six-point spread. Domain matters more than the algorithm.
2026-05-22
Fork moves to techempower-org.
The SME working fork transfers from jphein/ to techempower-org/. PRs from the fork target M0nkeyFl0wer/multipass-structural-memory-eval (the canonical upstream, maintained by Ben West). The MemPalace fork moves the same week. Issue references use full org/repo#N form on both sides of the fork boundary.
2026-05-25
AGE backfill: entities yes, triples no.
/backfill-age reports 100% on the entity pass (142,315 entities in ~61 min, zero errors) but the relationship layer is effectively empty (triples: 1 per kg_stats). /search/age-fused works structurally but the graph half of the RRF fusion has nothing to contribute. Implication: an A/B that compares vector-only against age-fused right now measures capacity for lift, not realised lift.
2026-05-29
Retrieval finds the session; the gold doesn't reach the reader.
500 questions through the full ingest-retrieve-read-judge pipeline. After the chunk-suffix matcher fix (techempower-org/multipass-structural-memory-eval#98), R@5 = 97.0% on daemon-direct — the gold session is in the top 5 almost every time — yet QA was only 58.6%, a 38-point gap we first charged to the reader. The true-oracle test corrected that: “R@5 found the session” doesn't mean the gold content reached the reader (limit=5 + chunking, and user-only ingest dropping assistant turns, left it out). Hand the reader the gold and QA rises to 0.868 — within 0.2pp of GPT-4o's oracle. So the bottleneck on oracle is not the reader's reasoning; it is what reaches the reader. The age-fused leg makes the point louder: R@5 = 90.2% with QA = 17.4%, because its snippets are 5.5× narrower (techempower-org/palace-daemon#150) — less reaches the reader, worse QA.
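The 2026-05-25 entry's point, that an empty graph leg leaves an RRF fusion with nothing to contribute, can be seen in a few lines. This is a generic reciprocal rank fusion sketch, not the /search/age-fused implementation; the constant k=60 is the value commonly used in the literature and is an assumption here, as are the document IDs.

```python
from collections import defaultdict
from typing import Dict, List


def rrf(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists by summing 1 / (k + rank) per document across legs."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by document id for determinism.
    return sorted(scores, key=lambda doc: (-scores[doc], doc))


vector_leg = ["d3", "d1", "d2"]

# Empty graph leg: the fused order is exactly the vector order.
fused_empty_graph = rrf([vector_leg, []])

# Populated graph leg (hypothetical): a graph hit can now promote a document.
fused_with_graph = rrf([vector_leg, ["d2"]])
```

With the graph leg empty, comparing vector-only against fused measures no difference by construction, which is why such an A/B reads capacity for lift rather than realised lift.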
v credits
acknowledgements

SME stands on borrowed shoulders.

Every category in SME borrows from work done elsewhere. The diagnostic framing only works because the systems it points at and the benchmarks it borrows from are open in the first place.

multipass-structural-memory-eval (upstream)
SME framework · maintainer

Canonical upstream of the framework. SME is named, scoped, and maintained here; the fork at techempower-org submits PRs back.

M0nkeyFl0wer/multipass-structural-memory-eval
M0nkeyFl0wer (Ben West)
multipass-structural-memory-eval (fork)
SME framework · working fork

The techempower-org fork — bench harness, daemon adapter, A/B leg infrastructure, content-rules loader. Carries the readings on this page.

techempower-org/multipass-structural-memory-eval
jphein · M0nkeyFl0wer
MemPalace (upstream)
Memory system · primary test target

The memory palace for AI: verbatim storage, local-first, permanent. SME's primary test target and the conversation partner for much of the structural diagnostic methodology.

MemPalace/mempalace
MemPalace (fork)
Memory system · daemon-served fork

The MemPalace fork served by palace-daemon in this household. Carries the canonicalisation, hybrid fusion, and KG-extraction changes the readings exercise.

techempower-org/mempalace
jphein · igorls · bensig · mvalentsev · fatkobra · tmuskal · arnoldwender · milla-jovovich · nakata-app · and many more upstream
palace-daemon
Production retrieval surface

The HTTP daemon over MemPalace. Backs the mempalace-daemon adapter, ships the /search/age-fused endpoint that the #45 leg exercises, and serves the live retrieval path on this homelab.

rboarescu/palace-daemon
rboarescu · jphein
familiar.realm.watch
Second memory system · A/B partner

The second-corpus discipline of SME requires more than one system under test. Familiar is the household-native memory system the #46 leg exercises through the same harness.

techempower-org/familiar.realm.watch
jphein
LongMemEval
External benchmark · corpus + judge

Wu et al. (ICLR 2025) — 500 curated questions across five memory abilities. SME borrows the corpus and the GPT-4o judge methodology (>97% human agreement) for the cat_1 family. MIT licensed.

xiaowu0162/LongMemEval
xiaowu0162
AdaptMem
Encoder fine-tuning recipe

nakata-app's domain-adaptive fine-tune recipe for memory encoders. v0.6 set the FT-300 ceiling on LongMemEval; v0.7 unlocks BGE-large training on T4-class hardware via cached MNRL + gradient accumulation. SME measures the lift; AdaptMem builds the recipe.

nakata-app/adaptmem
nakata-app