SME — Structural Memory Evaluation

Field explorer · every memory system the survey names × every metric we have

The full field — sort it, filter it, follow the receipts

interactive

Thirty-three memory systems, two kinds of number side by side: Published — what each system’s own team or a survey reports (self-reported, NOT our harness) — and SME multipass — the Cat 1–9 readings we measured ourselves. The two are kept in unmistakably separate column-groups: never read a published QA number as something we benched. Sort any column, filter by status / architecture / metric, and follow the magic links — every cell that has a source is clickable: our receipts (the baseline/doc that holds the reading) and their claims (the leaderboard or paper) are styled differently on purpose. Only 8 of the 33 carry a full SME column — that gap is the honest headline. For the narrative this data supports — the five campaign findings, the statistical rigor, and the cost-wall taxonomy — see the synthesis above.

Legend: heatmap low→high | E = emergent | QA-def = verified, deferred | N/A · reason = measured-inapplicable (a finding) | not run = wired, not benched | link styles: our receipt, their claim

System	Arch	Status	Published self-reported — NOT our harness			SME multipass we measured (Cat 1–9)
System	Arch	Status	LME QA	LoCoMo	BEAM-1M	1	2c	3	4	5	6	7	8	9a	9b

Click any column header to sort (numeric columns sort high→low first). Chips filter live + multi-select. The left edge of each system name is a status spine: refract = benched, amber = in-flight, grey = published-only. The vertical refract rule separates “published” from “what we measured” — the two groups are never the same claim.

the honesty line · published ≠ benched-by-us The Published group is self-reported — each system’s own team or the landscape survey. We did not run those numbers. They mix metrics (R@K recall is not QA accuracy — Celiums proved 100% retrieval can be 62% QA), mix answer models (a ~24pp swing on the same benchmark between GPT-4.1 and GPT-4o-mini), and mix subsets/judges. The SME multipass group is the only apples-to-apples axis — and only 8 of 33 have one, because benching a system through Cat 1–9 is real work. Mem0 appears twice by design: the OSS package we benched, and the cloud platform-v3 number its team reports (92–94% vs the OSS 61–68%) — “almost never made explicit,” per the survey.

Source · baselines/cross_system_multipass_matrix_2026-05-30.json (field_roster, cassia-2 #165) · published cells link to each system’s cited leaderboard/paper; SME cells link to our baseline/doc artifacts. Field survey: memorypalace/docs/research/2026-05-24-memory-system-benchmarks.md techempower-org/multipass-structural-memory-eval#165

Cross-system multipass · the 9 benched systems × Cat 1–9

The benched nine — role-grouped, every category readable

drill-down

The explorer above is the whole field at a glance; this is the role-grouped zoom on the nine systems in the Cat 1–9 matrix: five memory products (incl. postgres_ingest, the mempalace-raw ablation), a no-structure control, a diagnostic orchestrator arm, and two baselines that are wired but not yet benched. It isn’t a leaderboard of nine competitors: the rows are grouped by role, and the most informative reading is where each system can even be measured. The two graph-native products (mempalace, OMEGA) return measured readings on every structural category; of the two, only mempalace also has Cat 9a/9b readings (OMEGA is N/A there for lack of a harness). The graphless control and the extraction systems read N/A on the structural cats for three different architectural reasons, and an N/A is a measured finding, not a blank. The single-substrate mempalace deep-dive further down is the supporting detail under this.

Legend: 0.00 = measured number | E = emergent (system-generated structure) | QA-def = verified + runnable, QA deferred | N/A · reason = measured inapplicable (a finding) | not run = wired adapter, never benched

System	1 Lookup	2c Stairway	3 Dissonance	4 Threshold	5 Missing Room	6 Archive	7 Abacus	8 Blueprint	9a Handshake	9b Call-through
Memory systems — products under test
mempalace verbatim + AGE graph	R@5 0.927	R@5 0.960	0.00E	entropy 0.645 · other 26.83%	largest 61.87% · iso 22.2%	0.00E	QA 0.580	mod 0.7961 · PASS	0.983	reachable
OMEGA embed + auto-relate	R@5 0.900	R@5 0.920	recall 0.50E	entropy 0.78	1 comp · 0 iso	0.00E	QA 0.593	drift 0.875	N/A · no harness	N/A · no harness
Hindsight extraction competitor	QA-def	QA-def	N/A · no endpoint	N/A · no endpoint	N/A · no endpoint	N/A · no endpoint	QA-def	N/A · no endpoint	N/A · no harness	N/A · no harness
Mem0-OSS extraction competitor	QA-def	QA-def	N/A · graph removed	N/A · graph removed	N/A · graph removed	N/A · graph removed	QA-def	N/A · graph removed	N/A · no harness	N/A · no harness
postgres_ingest mempalace-raw: verbatim postgres, no graph	R@5 0.833	R@5 0.833	N/A · by design	N/A · by design	N/A · by design	N/A · by design	QA 0.392	N/A · by design	N/A · no harness	N/A · no harness
Baseline — no-structure control (the floor the structural delta is measured against; not a competitor)
flat ChromaDB vector, no graph	R@5 0.833	R@5 0.833	N/A · by design	N/A · by design	N/A · by design	N/A · by design	QA 0.384	N/A · by design	N/A · no harness	N/A · no harness
Diagnostic arm — Cat 9a orchestrator-invocation probe (measures the model in front of the memory, not a product)
RLM Qwen-7B / Llama-70B orchestrator	R@5 0.467	by-hop	n/r	n/r	n/r	n/r	n/r	n/r	0.467 · 7–27% invoke	N/A · no harness
Wired, not yet benched — harness has these adapters; no Cat run exists
full_context Karpathy D1 · whole-vault-in-context	not run	not run	not run	not run	not run	not run	not run	not run	not run	not run
karpathy_compiled Karpathy D2 · LLM-compiled wiki	not run	not run	not run	not run	not run	not run	not run	not run	not run	not run

Rows are grouped by role (five memory products, one no-structure control, one diagnostic orchestrator arm, two wired-not-benched baselines), so the grid reads as roles, not nine competing systems. The two “not run” rows (Karpathy D1/D2) are wired adapters that have never been benched: an absence of measurement, rendered muted/dashed so they never read as a system that scored blanks. Every N/A cell is the opposite: a measured finding with a reason. Click System to sort.

three structural-N/A systems · three different reasons · the distinction is a feature The structural cats (3–6 and 8; Cat 7 is a measured QA number even for the graphless rows) read N/A for three architecturally distinct reasons, and the difference is the point. flat is N/A by design: it is the no-structure control, intentionally graphless, the baseline the structural delta is measured against. Hindsight is N/A · no graph endpoint: an extract-then-retrieve store that serves no standalone graph API at all. Mem0-OSS is N/A · graph layer removed: it once shipped graph memory and the open-source edition dropped it (the hosted platform keeps it; the OSS package under test returns isolated entities with zero edges). Same cell value, three different architectural stories. That is diagnostic signal, not noise.

the capstone · emergent typed structure is unsolved field-wide Across the two graph-native products, generating typed contradiction / supersession edges from raw content is ~0: mempalace has 0 contradicts and 0 supersedes edges in the live KG (emergent 0.00 / 0.00); OMEGA catches contradiction at recall 0.50 (1 of 2 themes, precision 0.25) and supersession 0.00. SME is the only framework that even asks the question, and the honest cross-field answer is “not yet.”

Source · baselines/cross_system_multipass_matrix_2026-05-30.json (cassia-2, expanded #163) + docs/benchmarks/2026-05-30-cross-system-multipass-matrix.md · mempalace structural cells = the EXACT post-re-map finals (#211/#226); OMEGA full Cat 1–9 (#178/#183); Hindsight (#220/#184) + Mem0-OSS (#221/#185) verdict rows; flat control (#125/#82); RLM 9a probe (#194). Footnote: oracle_retrieval (ceiling) + random_retrieval (floor) adapters are present in the registry but not run as matrix cells — they are diagnostic bounds, not products. techempower-org/multipass-structural-memory-eval#163

Head-to-head · identical harness

Apples to apples — same subset, same reader, same judge

rigorous axis

The only rows on this page measured under identical conditions — every system below was run by us on the same n=150 stratified LongMemEval-S subset, the same retrieval definition (session-level R@K), the same reader, and the same canonical gpt-5.3-chat judge. No self-reported numbers, no mixed answer models, no metric swaps. The field matrix below cites published leaderboards (caveat-heavy); this is the table that controls the variables.

System	Storage	R@5	E2E QA
mempalace deployed substrate (this work)	verbatim	92.7%	58.0% o4-mini reader · retrieved ctx · canonical gpt-5.3-chat judge · n=150 macro ‖ comparator = oracle-n500, not strat150-S (closest apples, not pixel-perfect) · opus deployed→oracle gradient 0.567→0.868 in field card
OMEGA first independent competitor run	extraction	90.0%	59.3% sme-rich · o4-mini reader · retrieved ctx · canonical gpt-5.3-chat judge · n=150 macro ✦ ≈ parity (+1.3pp), same reader + context
Hindsight extraction, cloud-extractor	extraction	—	QA deferred extraction throughput / cost-gated

The honest read: mempalace’s edge is retrieval (R@5 +2.7pp); reader/QA is at parity (OMEGA +1.3pp, same-reader). On R@5, mempalace 0.927 vs OMEGA 0.900 — a fair same-rendering pair (both measured upstream-exact, --content-rules), so the 2.7pp edge is real: retrieval-only, no answer-model confound. On E2E QA, held to same reader, same context (o4-mini, retrieved, macro n=150), mempalace 0.580 vs OMEGA sme-rich 0.593 — ≈ parity (OMEGA +1.3pp): two verbatim-vs-extraction systems land within a question of each other once reader and context are pinned. Hindsight’s cloud-extractor throughput keeps its QA pass cost-gated — deferred rather than guessed. Don’t confuse the 0.580 comparator with mempalace’s deployed→oracle gradient (deployed-E2E ladder 0.567→0.760 as retrieval widens limit 5→50 → 0.868 oracle ceiling): that gradient is an opus reader on retrieved/oracle context and lives in the field card below — it is not the like-for-like OMEGA comparison. This table — not the leaderboard — is the one to read.

✦ OMEGA’s 0.593 is the fair sme-rich number (our harness, n=150, o4-mini reader + gpt-5.3-chat judge). OMEGA’s upstream-exact run scored just 0.37 — but that was date-starved: its reader context dropped session timestamps, and the temporal category (cat_6) jumped 0.04→0.36 once we restored the dates. Scoring the date-stripped 0.37 against mempalace would be the unfair comparison, so we use the date-restored sme-rich figure. (We ran the same date-confound check on our own LoCoMo numbers — see the explainer below — and there it came back clean, which is why we trust 0.593 here.) On R@5, the comparison stays upstream-exact on both sides (mempalace 0.927, OMEGA 0.900) — OMEGA’s sme-rich R@5 of 0.953 is not a comparator here, because no same-rendering mempalace pair exists for it; mixing renderings is exactly the apples-to-oranges this table refuses.
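The "macro" in the table's "n=150 macro" is not spelled out on this page. A minimal sketch, assuming it means an unweighted mean of per-question-type accuracies; the function name and the toy data are illustrative and are not taken from the SME harness:

```python
from collections import defaultdict

def macro_and_micro(results):
    """results: iterable of (question_type, is_correct) pairs.

    Returns (macro, micro): macro is the mean of per-type accuracies,
    micro is plain correct / total over all questions.
    """
    per_type = defaultdict(list)
    for qtype, ok in results:
        per_type[qtype].append(1.0 if ok else 0.0)
    type_means = {t: sum(v) / len(v) for t, v in per_type.items()}
    macro = sum(type_means.values()) / len(type_means)
    total = sum(len(v) for v in per_type.values())
    micro = sum(sum(v) for v in per_type.values()) / total
    return macro, micro

# Toy data (hypothetical): a small hard category and a larger easy one.
toy = [("temporal", False)] * 2 + [("single-session-user", True)] * 6
macro, micro = macro_and_micro(toy)
print(f"macro={macro:.3f} micro={micro:.3f}")
```

Under this assumption a small, hard category weighs as much as a large, easy one, so converting a macro delta such as +1.3pp into a whole number of questions is only approximate.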

the cost wall · why two QA cells are deferred, not blank Hindsight ~150h / Mem0 ~18h to benchmark strat150 — vs verbatim-first mempalace ~0 marginal cost. That is the leaderboard-hidden headline: extraction-based memory systems are benchmark-throughput-bound. Both competitors run an LLM fact-extraction on every session ingest, and LongMemEval-S averages ~48 sessions/question, so the strat150 subset is ~7,200 ingests: Hindsight at ~60–96 s/session (local phi4, CPU) → ~150h to benchmark; Mem0-OSS at a warm ~9 s/ingest (ollama phi4 extractor) → ~18h. A verbatim-first system (mempalace) ingests raw text at ~zero marginal compute — orders of magnitude cheaper to evaluate. That is why the apples table publishes their QA as deferred rather than a guessed number: it isn’t reluctance, it is a real architectural cost. (Mem0’s extraction is also lossy by design — the verification smoke stored only 3 of 5 facts.) The field-reported 91.4% (Hindsight) / 94.4% (Mem0 platform) are their own leaderboard numbers on stronger answer models, not on-harness figures.
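The cost-wall arithmetic above can be reproduced in a few lines. This is a back-of-envelope sketch that assumes strictly serial ingestion and uses only the figures quoted in the paragraph; the function name is illustrative:

```python
def benchmark_hours(questions: int, sessions_per_question: float,
                    seconds_per_ingest: float) -> float:
    """Wall-clock hours to ingest every session once, one after another."""
    ingests = questions * sessions_per_question
    return ingests * seconds_per_ingest / 3600.0

QUESTIONS = 150              # the strat150 subset
SESSIONS_PER_QUESTION = 48   # approximate LongMemEval-S average quoted above

total_ingests = QUESTIONS * SESSIONS_PER_QUESTION
mem0_hours = benchmark_hours(QUESTIONS, SESSIONS_PER_QUESTION, 9)
hindsight_low = benchmark_hours(QUESTIONS, SESSIONS_PER_QUESTION, 60)
hindsight_high = benchmark_hours(QUESTIONS, SESSIONS_PER_QUESTION, 96)

print(f"ingests: {total_ingests}")
print(f"Mem0-OSS: {mem0_hours:.0f} h")
print(f"Hindsight: {hindsight_low:.0f} to {hindsight_high:.0f} h")
```

The 60–96 s/session range brackets the quoted ~150 h figure, which sits near the middle of the resulting span.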

Field context · self-reported leaderboards

Where mempalace sits — and the honest gap

context · cite-with-caveats

Benchmark: LongMemEval-S
Our retrieval: R@5 0.927 (/search), n=150 stratified
Reader ceiling: 0.868 true-oracle (gold present) — Opus|preference + canonical judge, n=500 — essentially the field's GPT-4o oracle (0.870). The earlier 0.61 was the retrieval-limited substrate.
Judge: gpt-5.3-chat + canonical type-specific prompts (NOT gpt-4o-2024-08-06)
Posture: diagnostic decomposition, not a leaderboard claim

Retrieval R@5

92.7%

mid-pack vs the R@K field (80–99%)

Reader ceiling (true oracle)

86.8%

gold present — matches the field's 87.0% oracle

“0.61” was substrate

+26pp

all from ingest + retrieval; ~0pp genuine reader

Structural cats

4/5/8/9

measured here; no field analogue

What this table is: the field column is self-reported leaderboard numbers — mixed metrics (R@5 vs oracle QA vs E2E QA, which are three different tests), mixed answer models (answer-model choice alone swings ~24pp), and mixed verification. It is context, cited with caveats — not a controlled comparison. For that, read the apples-to-apples table above (our own harness, identical conditions). The Verif. column below tells you who measured each number:

Legend: self = vendor’s own report | indie = independent re-run (incl. ours) | paper = peer/preprint | ref = reference baseline

System	Storage	LongMemEval			LoCoMo	Answer model	Verif.
System	Storage	R@5	oracle QA	E2E QA	LoCoMo	Answer model	Verif.
mempalace — this work (verbatim-first)
mempalace — deployed substrate	verbatim	92.7%	—	0.567→0.760 limit 5→50 ‖	38.8% daemon §	gpt-5.3-chat	indie
mempalace — ingest-fixed	verbatim	—	—	68.4% limit=5 ‖	—	gpt-5.3-chat	indie
mempalace — true-oracle ceiling	verbatim	—	86.8%	—	—	gpt-5.3-chat	indie
the field — E2E QA leaderboard (sorted by LongMemEval E2E QA; storage paradigm per MemPalace survey + independent check)
OMEGA	extraction	90.0% ^‡	—	95.4%	—	GPT-4.1	self indie
Mastra	unstated	—	—	94.9%	—	GPT-5-mini	self
Mem0 (platform v3)	hybrid	—	—	94.4%	92.5%	undisclosed	self
Hindsight	extraction	—	—	91.4%	89.6%	Gemini 3 Pro	indie
True Memory (Pro)	verbatim	—	—	87.8%	93.0%	gpt-4.1-mini	paper
Supermemory	extraction	—	—	85.2%	65.4%	Gemini-3	indie
EverOS / EverMind	unstated	—	—	83.0%	93.1%	undisclosed	self
ENGRAM (paper, arXiv:2511.12960)	extraction	—	—	71.4%	77.6%	GPT-4o-mini	paper
Zep / Graphiti	hybrid	—	—	71.2%	75.1%	GPT-4o	paper
Celiums	unstated	—	—	62.3%	—	Opus	self
GPT-4o (reference)	reference	—	87.0%	60.2%	—	GPT-4o	ref
retrieval-only — R@K recall (LongMemEval R@5; not comparable to QA)
engram-2 ^†	verbatim	99.0%	—	—	74.5% QA	—	self
ai-memory ^†	verbatim	97.8%	—	—	—	—	self
MemPalace (upstream)	verbatim	96.6%	—	—	88.9% R@10	—	indie
agentmemory	verbatim	95.2%	—	—	—	—	self
mcp-memory-service	verbatim	86.0%	—	—	49.7% R@5	—	self

LongMemEval columns are three different tests — R@5 (retrieval), oracle QA (gold given), E2E QA (full pipeline) — never compare across them. Click any column header to sort (groups collapse into one ranking).

† engram-2 = github.com/199-biotechnologies/engram-2 (Paperfot AI), a Rust CLI: SQLite + FTS5/BM25 + Gemini Embedding 2 + RRF + Cohere rerank, self-reporting 99.0% R@5 on LongMemEval-S. It returns verbatim source chunks (with an optional LLM claim/entity extraction pass). ai-memory = github.com/alphaonedev/ai-memory-mcp (FTS5 + embeddings, 97.8% R@5). The “Engram” name is overloaded: the same org also ships 199-biotechnologies/engram (MCP server, BM25+ColBERT+KG, 98.1% R@5), and the ENGRAM (paper) row is a separate, extraction-based academic system (arXiv:2511.12960, 71.4% QA). Storage paradigms marked unstated are ones we could not pin to a primary source; they are labeled as such rather than guessed.

‡ OMEGA R@5 0.900 is our own measurement — the first independent head-to-head: OMEGA run on the identical n=150 stratified LongMemEval-S subset, session-level R@K, same retrieval definition as mempalace (which scores 0.927 on that subset — OMEGA trails by 2.7pp / 4 questions). It is not comparable to OMEGA’s self-reported 95.4%, which is a different metric (E2E QA, not R@5) with a stronger answer model (GPT-4.1; answer-model choice alone swings ~24pp). The on-harness R@5 is the apples-to-apples number. (Note: OMEGA ships without its bge-small embedding model and silently falls back to keyword-only FTS5; we ran omega setup --download-model to restore semantic retrieval before measuring, so this is the fair number, not a crippled one.)
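For readers unfamiliar with session-level R@K, here is a minimal sketch of the metric. It assumes a question counts as a hit when any gold evidence session appears in the top K retrieved sessions; the harness may instead require all gold sessions, and the function names and toy data are illustrative:

```python
from typing import Iterable, Sequence, Set, Tuple

def hit_at_k(ranked_sessions: Sequence[str], gold_sessions: Set[str],
             k: int = 5) -> bool:
    """True if any gold session id is among the top-k retrieved session ids."""
    return any(sid in gold_sessions for sid in ranked_sessions[:k])

def recall_at_k(runs: Iterable[Tuple[Sequence[str], Set[str]]],
                k: int = 5) -> float:
    """Fraction of questions with a hit in the top k."""
    runs = list(runs)
    if not runs:
        raise ValueError("no questions to score")
    return sum(hit_at_k(ranked, gold, k) for ranked, gold in runs) / len(runs)

# Toy runs (hypothetical): (ranked session ids, gold session ids) per question.
toy_runs = [
    (["s3", "s9", "s1"], {"s1"}),                    # gold at rank 3: hit
    (["s4", "s5", "s6", "s7", "s8", "s2"], {"s2"}),  # gold at rank 6: miss at k=5
    (["s2"], {"s2", "s7"}),                          # one of two gold found: hit
]
toy_r5 = recall_at_k(toy_runs, k=5)

# On n=150, each question is worth 1/150, so a 4-question gap is about 2.7pp.
four_questions_pp = 4 / 150 * 100
print(f"toy R@5={toy_r5:.3f}, 4 questions={four_questions_pp:.1f}pp")
```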

§ mempalace daemon on LoCoMo — age-fused QA 0.388 (n=250 stratified, isolated scratch palace, never prod), drawer-level R@5 0.556. An A/B isolating the KG: restoring the graph-hydration fix (palace-daemon#202) moved QA +1.2pp over the vector-only fallback (0.376) with identical drawer-R@5 — so on LoCoMo the age-fused graph half adds ~nothing to top-5 retrieval (the graph-only hits don’t displace the vector top-5). The flat adapter scores the same QA (0.384), confirming the substrate, not the path, sets it. LoCoMo QA here (~0.39) is an our-harness figure (gpt-5.3-chat reader) — not comparable to the field’s 0.75–0.93 (stronger answer models + looser graders; cf. the event-ordering Kendall-τ vs binary-judge mismatch).

‖ The deployed-E2E column is a retrieval-breadth ladder (real /search→reader→judge, opus reader, n=150): limit=5 0.567 (85/150) → limit=20 0.740 (111/150) → limit=50 0.760 (114/150). Deployed QA was retrieval-limited at limit=5; widening to 20 recovers +17.3pp and it plateaus by 50 — concrete proof the gap to the 0.868 oracle ceiling is retrieval breadth, not the reader. (One content-filtered judge call at limit=50, qid 95228167, is counted wrong — a conservative floor; scored correct it is 0.767, which doesn’t touch the plateau.) These are a live re-query and supersede the 05-29 cached 0.610 (pre-retrieval-drift pinned context). Only the 0.868 ceiling sits under oracle QA (gold handed to the reader, retrieval bypassed) — the like-for-like to the field’s GPT-4o 0.870 oracle (techempower-org/multipass-structural-memory-eval#117).
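The ladder figures in this footnote follow directly from the per-limit correct counts. A short check, using only numbers quoted above (variable names are illustrative):

```python
N = 150
CORRECT_BY_LIMIT = {5: 85, 20: 111, 50: 114}  # retrieval limit -> questions correct
ORACLE_CEILING = 0.868

accuracy = {limit: c / N for limit, c in CORRECT_BY_LIMIT.items()}
gain_5_to_20_pp = (accuracy[20] - accuracy[5]) * 100
gain_20_to_50_pp = (accuracy[50] - accuracy[20]) * 100
gap_to_ceiling_pp = (ORACLE_CEILING - accuracy[50]) * 100

# Lenient variant: the one content-filtered judge call scored as correct.
lenient_limit50 = (CORRECT_BY_LIMIT[50] + 1) / N

for limit, acc in accuracy.items():
    print(f"limit={limit}: {acc:.3f}")
print(f"5->20: +{gain_5_to_20_pp:.1f}pp, 20->50: +{gain_20_to_50_pp:.1f}pp")
print(f"gap to ceiling at limit=50: {gap_to_ceiling_pp:.1f}pp")
print(f"limit=50 lenient: {lenient_limit50:.3f}")
```

Most of the recoverable accuracy arrives between limit 5 and 20; the step from 20 to 50 adds only three questions.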

Read the Storage column first — it is the axis a LongMemEval leaderboard hides. mempalace is verbatim-first: it stores the raw turns and never summarizes (“store everything, then make it findable”), solving retrieval separately. It is not alone in that — True Memory (a paper-only design, arXiv:2605.04897, no code release) is also verbatim-first and reports 87.8% QA, so the paradigm is not what caps the leaderboard. The rest of the field sits on the other side of the split: hybrid systems that selectively extract while keeping some raw history (Mem0’s self-editing vector+graph+KV, Zep/Graphiti’s bi-temporal knowledge graph, Letta’s tiered memory), and pure extraction (Hindsight stores structured, time-aware facts instead of raw logs). So no, these are not all verbatim-first — but mempalace’s verbatim peers are real: True Memory, traditional RAG, and the R@K-only ChromaDB baselines (agentmemory, engram-2, ai-memory at 95–99% R@5). The paradigm split is taken from the MemPalace landscape survey, not assigned by us.

The honest read: on retrieval mempalace is competitive — R@5 0.927 sits mid-pack against the field's R@K leaders (96–99% for ChromaDB-baseline systems; mcp-memory-service 80–86%). On QA we deliberately publish no full-pipeline leaderboard number (that needs a field-standard answer model + the canonical gpt-4o-2024-08-06 judge we don't have). The apples-to-apples axis is the oracle — and an earlier reading of this card got it wrong, which is worth stating plainly: we reported 0.61 and called the 26-point gap to GPT-4o's 0.87 oracle a reader limit. It isn't. That 0.61 was retrieval-limited — the context reaching the reader came through /search at limit=5, and for single-session-assistant questions our upstream-parity ingest (user-turns-only, which upstream itself recommends against) dropped the assistant-authored gold entirely. Hand the reader the gold instead (true oracle, evidence sessions verbatim) and five of six categories recover, not just assistant: single-session-assistant 0.32→0.98 (ingest), temporal 0.36→0.75, knowledge-update 0.70→0.91, multi-session 0.71→0.87, single-session-preference 0.80→0.93 (all retrieval); only single-session-user carried over. That lifts the overall from 0.610 to a 0.868 reader ceiling — within 0.2pp of the field's 0.870. So the old 26-point “reader” gap decomposes as ~7pp ingest artifact (the dropped assistant turns; fixed → 0.684) + ~18pp retrieval breadth (limit=5 + chunking never delivering the gold) + ~0pp genuine reader. The reader was never the bottleneck — once the gold is in context it essentially matches the field's oracle; getting the gold to it is the whole game. The route here is itself a finding: an earlier reader floor-lift — three category-specific prompt clauses on the pinned /search context — returned a clean null (net −2 to +3 questions), which is precisely what flagged that the binding constraint was substrate, not prompt; the true-oracle test then confirmed it. 
That decomposition is the SME finding the leaderboards can't show. And the structural categories — ingestion integrity (Cat 4), gap-detection (Cat 5), ontology coherence (Cat 8), invocation discipline (Cat 9) — have no competitor analogue at all.
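The decomposition can be written out as plain arithmetic on the four readings quoted in the paragraph. This sketch only restates those numbers; the variable names are illustrative:

```python
SUBSTRATE = 0.610      # earlier retrieval-limited reading
INGEST_FIXED = 0.684   # after restoring the dropped assistant turns
TRUE_ORACLE = 0.868    # gold handed to the reader
FIELD_ORACLE = 0.870   # field's GPT-4o oracle

ingest_pp = (INGEST_FIXED - SUBSTRATE) * 100       # ingest artifact
retrieval_pp = (TRUE_ORACLE - INGEST_FIXED) * 100  # retrieval breadth
reader_pp = (FIELD_ORACLE - TRUE_ORACLE) * 100     # residual reader gap
total_pp = (FIELD_ORACLE - SUBSTRATE) * 100        # the old "reader" gap

print(f"ingest    {ingest_pp:5.1f}pp")
print(f"retrieval {retrieval_pp:5.1f}pp")
print(f"reader    {reader_pp:5.1f}pp")
print(f"total     {total_pp:5.1f}pp")
```

The three parts sum to the full 26-point gap, with the reader's share rounding to zero.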

comparability discipline True-oracle = the ceiling, not the deployed number. The 0.868 figure is measured with the evidence sessions handed to the reader verbatim — i.e. the reader's ceiling once retrieval is perfect. A live system still has to retrieve those sessions; the deployed number (real chunked /search) sits between the 0.610 substrate reading and the ceiling, and a production daemon re-test is still pending. All four gradient numbers are abstention-credited (the field's own metric), so our 0.868 ceiling and the field's 0.870 oracle are a like-for-like comparison. Cross-system LongMemEval numbers are not directly comparable anyway: answer-model choice alone swings ~24pp, oracle QA is a different metric from full-pipeline E2E QA, and most competitor numbers are self-reported — so we do not place ourselves on the QA leaderboard. Field numbers are cited from the landscape survey; ours are sourced from baselines/.

Source · ours: baselines/reader_trueoracle_{ss-assistant,temporal,multi-session}_2026-05-29.json + reader_sweep_passA_canonical-judge_opus-preference_*.json (n=500 deployed) + longmemeval_s_strat150_*.reagg.json (R@5) · docs/benchmarks/2026-05-29-true-oracle-floor.md · field: docs/research/2026-05-29-comparison-readiness.md techempower-org/multipass-structural-memory-eval#116

Multipass · what the system knows about its own structure

SME structural diagnostics — no competitor analogue

unique contribution

The cross-system matrix above is the comparison — every system, every category. This is the single-substrate deep-dive beneath it: the same Cat 1–9 read in detail against mempalace alone, answering what does the memory system know about its own structure? These nine categories are SME’s unique contribution — LongMemEval, LoCoMo, BEAM, Mem0’s and Zep’s evaluations all measure end-to-end QA or retrieval recall. None of them diagnose canonical-collision dedup, edge-type monoculture, structural holes, ontology drift, or harness invocation rate. The analogue column is none for every row, by construction — that is the headline. Substrate under test: mempalace via the mempalace-daemon adapter against the live palace-daemon, reading the real AGE knowledge graph (1,156,314 entities / 1,873,489-edge cleaned RELATION set, full-graph EXACT — not the earlier capped /graph projection, and post the --drop-code-tokens junk DELETE). Diagnostic readings of one substrate — not leaderboard scores.

Cat	What it measures	MemPalace reading	Corpus / adapter	Analogue
1 Lookup	Find a specific memory from a natural-language query	R@5 0.867 full-recall 22/30; hop-1 0.889, hop-2 0.667	jp-realm-v0.1 · daemon /search/age-fused	none
2c Stairway	Multi-hop retrieval recall by hop depth — does structure scale?	(structural − flat) +10pp · A flat 0.833 / B hybrid 0.933 / C age-fused 0.900 @K=5 · grows with depth (B/A 1.11×→1.25×) B−A +9.3pp@1-hop, +16.7pp@2-hop; graph-RRF sub-layer (C) is a neutral-to-slightly-negative tax vs B — third data point for the age-fusion-null thesis (#203)	jp-realm-v0.1 · flat / daemon / age-fused	none
3 Dissonance	Detect and surface conflicting facts	0.00 emergent · the live palace KG has 0 `contradicts` edges — the enrichment pipeline generates no emergent contradiction structure on real content supersedes the earlier +1.00, which was a corpus-declared ceiling (good-dog-graph reading back hand-seeded edges, not detection). Cross-system: OMEGA auto-relate generated contradicts edges, recall 0.50 (#148/#215)	live palace (real KG) · daemon --real-kg	none
4 Threshold	Is ingestion producing a clean graph? (dedup, field coverage, monoculture)	4a collisions 248 (1.9%) · 4b coverage 1.00 · 4c norm-entropy 0.645 · `other` 26.83% · 40 types full-graph EXACT (server-side cypher over the cleaned 1,873,489-edge RELATION set) — RESOLVED. Overturns the prior sampled 0.020/98.98%-tunnel capped-projection artifact, then the post-relabel-pre-DELETE intermediate (0.340/55.05%/237). The re-map ran: `kg_predicate_norm` de-monoculture relabel (~520k edges) + a `--drop-code-tokens` DELETE (48,135 junk shell-cmd/stopword/DOM-method edges). Entropy 0.020→0.340→0.645; `other` 98.98%→55.05%→26.83% (#211)	live palace (real KG) · daemon /cypher	none
5 Missing Room	Identify what’s structurally missing (components, holes, gaps)	largest component 61.87% (715,435 of 1,156,314) · isolates 22.2% (256,782) · 325,965 components EXACT full-graph WCC (server-side, post-DELETE) — replaces the bogus capped-projection 44.8%-isolate artifact. Honest note: isolates rose 20.4%→22.2% after the `--drop-code-tokens` DELETE — deleting 48k junk edges orphaned ~20k entities whose only edge was junk; pre-DELETE they were “connected” by noise, the cleaner graph honestly reports them isolated. The giant component is still well-connected (#211)	live palace (real KG) · daemon	none
6 Archive	Current vs historical state, supersession tracking	0.00 emergent · the live palace KG has 0 `supersedes` edges — completeness 0.00 supersedes the earlier +1.00 (good-dog-graph declared ceiling, 8/8 hand-seeded). Cross-system: OMEGA also 0.00 — its temporal analogue `evolution` doesn’t normalize to supersedes. Emergent supersession is unsolved in both (#148/#215)	live palace (real KG) · daemon --real-kg	none
7 Abacus	Does structure earn its token overhead? (graph vs no-graph)	structure earns it: +10pp recall for 1.69× context (A vs B, token-comparable: 466→789 tok/q) flat is cheaper per-correct but lands 5 fewer correct (21/30 vs 26/30); the structural lift is hybrid retrieval, not graph traversal (#203)	jp-realm-v0.1 · flat / daemon	none
7b Latency	Query latency distribution (YCSB p50 / p95)	post-AGE-index golden set: p50 vector ~684ms / union ~515ms / hybrid ~689ms two honest facts, not a single-set win: on this drawer-query golden set the graph leg is inert (hybrid ≈ union, ~689ms) — the AGE-index speedup lands on entity-anchored queries, not these. The earlier hybrid p50 2064ms was a graph-firing query set; the two sets aren’t the same workload, so do not read a same-set 2064→689 improvement (#144/#227)	live palace (AGE) · daemon (candidate-strategy)	none
8 Blueprint	Does the actual graph match what the system claims to do?	“hierarchical” PASSES · modularity 0.7961 (218 communities) · introspection 1.0 LIVE EXACT full-graph (networkx, post-DELETE) — refutes the prior FAIL (modularity 0.009 was the capped-projection artifact) and replaces the verdict-only reading with a real number: 0.7961 >> 0.5. Introspection is now 1.0 on the deployed daemon (daemon restarted, `/ontology` serving) — was 1.0 capability / 0.0 deployed (#147/#211)	live palace (real KG) · daemon	none
9a Handshake	Does the model actually invoke memory when it has access?	opus-4-8 (Tau2 99.3): 100% invocation, 98.3% recall — invokes on every question, exceeds the deterministic 78.3% ceiling. Recall monotonic in Tau2: gemma4 41.7 → qwen3.5 75.0 → opus-4-8 98.3 prior RLM Qwen-7B/Llama-70B plateaued 46.7% (7–27% invocation) — ceiling was willingness to invoke, not retrieval (#194). 4B arms’ on-harness invocation-rate is a backfill follow-up; recall carried from the validated 2026-05-15 RLM run	jp-realm-v0.1 · familiar / rlm / opus-4-8	none
9b Call-through	Given an invocation, does the tool call complete and return a valid result?	live surfaces reachable clean floor; mock-model probe path	—	none

Every row’s competitor analogue is none — that is the point of this table, not a gap in it. The readings are diagnostic deltas on a single substrate, never cross-system leaderboard scores. Click Cat, corpus, or analogue to sort.
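To make the Cat 4c reading concrete, here is a minimal sketch of a normalized edge-type entropy and an `other` share. It assumes Shannon entropy divided by the log of the number of observed edge types; the page does not state SME's exact normalization, and the edge-type labels and counts below are toy values:

```python
import math
from collections import Counter

def normalized_entropy(type_counts: Counter) -> float:
    """Shannon entropy of the edge-type distribution, scaled to [0, 1].

    0 means a single type dominates completely (monoculture);
    1 means all observed types are equally frequent.
    """
    total = sum(type_counts.values())
    probs = [c / total for c in type_counts.values() if c > 0]
    if len(probs) <= 1:
        return 0.0
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(probs))

def type_share(type_counts: Counter, label: str) -> float:
    """Fraction of all edges carrying the given type label."""
    return type_counts.get(label, 0) / sum(type_counts.values())

# Toy distributions (hypothetical counts, not the live KG).
monoculture = Counter({"tunnel": 98, "mentions": 1, "other": 1})
balanced = Counter({"mentions": 25, "authored": 25, "depends_on": 25, "other": 25})

print(f"monoculture: {normalized_entropy(monoculture):.3f}")
print(f"balanced:    {normalized_entropy(balanced):.3f}")
print(f"balanced 'other' share: {type_share(balanced, 'other'):.2%}")
```

A graph dominated by one edge type scores near zero, which is the pattern the capped-projection reading showed before the re-map.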

cross-system headline · emergent structure is largely unsolved The honest correction this matrix surfaces: generating typed contradiction / supersession edges from raw content is largely unsolved in both substrates. Emergent contradiction is 0.00 in mempalace (0 contradicts edges in the live KG) and 0.50 in OMEGA (auto-relate catches 1 of 2 ground-truth themes, precision 0.25); emergent supersession is 0.00 in both. The earlier +1.00 we published for Cat 3/6 masked this — it was a corpus-declared ceiling (the good-dog-graph adapter reading back hand-seeded edges), not emergent detection on real content. Declared-ceiling and emergent are different questions; the matrix now labels both, and the emergent column is the more honest one (techempower-org/multipass-structural-memory-eval#215).
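The OMEGA contradiction figures (recall 0.50, precision 0.25) are ordinary set-overlap metrics. A minimal sketch, assuming theme-level matching; the predicted count of four is an assumption chosen to be consistent with precision 0.25 at one true positive, since the page does not state it, and the theme names are hypothetical:

```python
def precision_recall(predicted: set, gold: set) -> tuple:
    """Precision and recall of predicted items against a gold set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical theme ids: two ground-truth contradiction themes.
gold_themes = {"theme_a", "theme_b"}

# One correct theme plus three spurious ones (assumed count, see lead-in).
omega_like = {"theme_a", "spurious_1", "spurious_2", "spurious_3"}

# A KG with zero contradicts edges predicts nothing at all.
no_edges = set()

print(precision_recall(omega_like, gold_themes))
print(precision_recall(no_edges, gold_themes))
```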

why Hindsight & Mem0 are N/A on the structural cats — a finding, not a blank “N/A” on Cat 3/4/5/6/8 is itself a structural finding: the system has no typed graph to evaluate. The two extraction competitors reach it by different roads. Mem0-OSS once shipped a graph-memory layer and the open-source edition dropped it — the package under test is now a flat / vector store whose snapshot returns isolated entities with zero edges (the hosted Mem0 platform keeps graph memory; the OSS one does not). Hindsight exposes no standalone graph endpoint at all — it is an extract-then-retrieve store, not a queryable typed graph, so the structural probes have nothing to read. Both also read N/A on Cat 9a/9b (driven through a library / client API with no model-in-the-loop harness manifest — the Handshake is an orchestrator property, and there is no orchestrator to measure). Per the matrix legend, an N/A is reported as a real diagnostic outcome, never a missing cell (techempower-org/multipass-structural-memory-eval#178).

Cat-9 · orchestrator-model selection (Tau2 prior) Cat-9a Handshake recall tracks the orchestrator’s Tau2 tool-agent score, not its parameter count. A +37.7pp Tau2 gap predicted a +30–33pp Cat-9a recall gap to within ~5pp. The live data bears it out: RLM with Qwen-7B and Llama-70B both plateau at the same 46.7% despite a ~10× parameter difference — both ceiling at willingness to invoke the tool, not at retrieval. The deterministic familiar pipeline consistently invokes and lands at 78.3%. The lever for raising 9a is a high-Tau2 orchestrator (Opus 4.6 99.3%, GPT-5.4 98.9%, GLM-5 ~98%), not more parameters (techempower-org/multipass-structural-memory-eval#194).

measurement honesty · the structural column was re-measured over the real KG — and the exact counts now ship The structural cells carried four measurement traps, now corrected (#215) — and the fixes have landed: the re-map, the junk-edge DELETE, and a full-graph networkx pass all ran, so the absolute counts that were “pending” are now published EXACT, not verdict-only. (1) The capped projection (the daemon /graph sample was tunnel-dominated — Cat 4 read 0.020/98.98%, Cat 8 “hierarchical” FAILED at 0.009) is re-anchored to the full-graph EXACT: Cat 4 entropy 0.645 / other 26.83%, Cat 8 modularity 0.7961 (PASS). (2) The tautological scorer (Cat 3/6 +1.00 was the good-dog-graph reading back hand-seeded edges — a declared ceiling, not emergent detection) is replaced by the honest emergent read, 0.00. (3) The limit-dependent topology samples (Cat 5/8) that used to OOM at full scale are now computed server-side over the whole graph: Cat 5 WCC 61.87% largest / 22.2% isolates, Cat 8 modularity 0.7961. Cat 2c / Cat 7 keep their #203 flat Condition-A deltas (+10pp). The one open follow-up is the two 4B arms’ on-harness Cat 9a invocation rate. Per the diagnostic posture, each cell is a controlled reading of one substrate — and we’d rather publish a 0.00 emergent than a flattering declared 1.00.
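The Cat 5 readings are weakly-connected-component statistics. The sketch below uses a small union-find over an in-memory edge list to show what "largest component share" and "isolate share" mean, and why deleting junk edges can raise the isolate share. It is an illustration on toy data, not the server-side implementation, and it assumes every edge endpoint appears in the node list:

```python
from collections import Counter

def component_stats(nodes, edges):
    """Weakly connected component summary via union-find (direction ignored)."""
    nodes = list(nodes)
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    sizes = Counter(find(n) for n in nodes)
    return {
        "components": len(sizes),
        "largest_share": max(sizes.values()) / len(nodes),
        "isolate_share": sum(1 for s in sizes.values() if s == 1) / len(nodes),
    }

toy_nodes = ["a", "b", "c", "d", "e", "f"]
toy_edges = [("a", "b"), ("b", "c"), ("d", "e")]   # "f" has no edges
before = component_stats(toy_nodes, toy_edges)

# Treat ("d", "e") as a junk edge and delete it: d and e become isolates.
after = component_stats(toy_nodes, toy_edges[:2])

# Sanity check of the published shares from the quoted entity counts.
published_largest = 715_435 / 1_156_314
published_isolates = 256_782 / 1_156_314
print(before, after)
print(f"{published_largest:.2%} largest, {published_isolates:.1%} isolates")
```

In the toy graph the largest component is unchanged by the deletion while the isolate share triples, mirroring the 20.4%→22.2% rise described for Cat 5.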

Source · structural column re-measured over the real KG, EXACT full-graph post-DELETE: baselines/cross_system_multipass_matrix_2026-05-30.json + docs/benchmarks/2026-05-30-cross-system-multipass-matrix.md + docs/benchmarks/2026-05-31-cat458-real-kg-crossvalidation.md (#215, post #147/#210/#211 + the prod re-map + --drop-code-tokens DELETE + networkx full-graph modularity) · retrieval cats baselines/{jp_realm_v0_1_daemon_age_fused, cat2c_daemon_age, candidate_strategy_age}_2026-05-29.json · Cat 9a docs/benchmarks/2026-05-30-cat9a-tau2-orchestrator-ladder.md (#194) · Cat 2c/7 docs/benchmarks/2026-05-30-jprealm-flat-condA-cat2c-cat7.md (#203) techempower-org/multipass-structural-memory-eval#115

Comparability · reading our LoCoMo number

Our LoCoMo QA (~0.38) is not a substrate weakness

explainer · 2026-05-30

Taken next to the field’s LoCoMo QA (0.75–0.93), our ~0.38 looks alarming. It isn’t a verbatim-substrate failure — it is four measurement choices stacked on top of each other, none of which the leaderboard numbers share. Reading them in order:

Substrate is not the bottleneck

0.384 ≈ 0.388

flat adapter vs age-fused daemon — same QA

Retrieval-recall ceiling

R@5 0.44

the gold often isn’t in the top-5 to begin with

Judge

strict

binary, abstention-aware — no partial credit

Subset

adversarial-incl.

field reports often skip the adversarial split

(1) The substrate isn’t the limit. The flat adapter (plain verbatim + vector retrieval) scores 0.384; the full age-fused daemon (pgvector + AGE knowledge graph) scores 0.388 on the same n=250 stratified subset. If the verbatim store were the weakness, the graph-augmented path would pull ahead. It doesn’t. Whatever caps LoCoMo here sits above the substrate, in the retrieval-and-judge layer that every system shares.

(2) It’s a retrieval-recall ceiling. Drawer-level R@5 on LoCoMo is ~0.44: more than half the time the gold turn isn’t in the top-5 the reader sees, so the QA number is bounded by recall, not by the reader’s ability to answer.

(3) Our judge is strict. It is a binary, abstention-aware grader with no partial credit and no looser string-overlap leniency, the same discipline we hold mempalace to everywhere else. Field LoCoMo numbers frequently use softer graders (and much stronger answer models), and answer-model choice alone swings results by ~24pp.

(4) We include the adversarial split. Our subset keeps the adversarial LoCoMo questions that many field reports quietly drop; those are exactly the ones designed to defeat retrieval.

temporal date-confound · ruled out We checked for the same temporal date-confound that bit OMEGA here, and ruled it out. Cassia’s diagnostic (n=50 temporal, capture-context) found 100% of reader contexts carried a date — zero date-stripped. The failures break down as retrieval-miss 24 / genuine-reasoning-fail 8 / IDK-despite-evidence 5: the gold isn’t reaching the reader, or the reasoning genuinely misses — not date-starvation. So unlike OMEGA (whose upstream-exact run was date-starved — cat_6 jumped 0.04→0.36 once dates were restored), LoCoMo’s 0.384 overall / 0.26 temporal is the genuine retrieval-ceiling-plus-reasoning number, not an artifact. Treat it as a real reading under our strictest conditions — still not apples-to-apples with the field’s 0.75–0.93 (stronger answer models + looser graders), but honestly ours (techempower-org/multipass-structural-memory-eval#108 / #110).

Source · baselines/locomo_*_flat_*.json + longmemeval/locomo daemon age-fused (n=250, isolated scratch palace) · KG A/B: palace-daemon#202 techempower-org/multipass-structural-memory-eval#108

BEAM · bucket regime-shift

As the haystack grows, the regime flips to retrieval

verified · 2026-05-30

Benchmark: BEAM (bucketed by haystack size)
Substrate: flat-local mempalace
Reader: gpt-5.3-chat
Judge: o4-mini

100K — full-context regime

0.649

whole bucket fits; needle abilities intact

500K — retrieval regime

0.487

retrieval kicks in; needle abilities collapse

1M — deep retrieval

0.471

plateaus — −0.016 vs 500K; 10M deferred (cost)

Bucket	Overall QA	Regime	Per-ability (info-ext · temporal · multi-session · summ.)
100K full-context	0.649	whole bucket in context	0.85 · 0.75 · 0.60 · 0.87
500K retrieval	0.487	real retrieval (needle-in-haystack)	0.40 · 0.26 · 0.24 · 0.87 (holds)
1M deep retrieval	0.471	retrieval-limited (plateau)	flat vs 500K; a couple abilities tick up

BEAM is the one benchmark here with a genuine regime shift — a cliff, then a plateau. At 100K the whole haystack fits in context, so the reader operates in a full-context regime and the needle abilities are intact. The cliff is 100K→500K (0.649→0.487, −0.162): the system crosses into a real retrieval regime and the needle-dependent abilities collapse — info-extraction 0.85→0.40, temporal 0.75→0.26, multi-session 0.60→0.24 — while the retrieval-light ability holds flat (summarization 0.87 in both, because summarizing doesn’t need a specific needle). Then it plateaus: 500K→1M is essentially flat (0.487→0.471, −0.016) and a couple of needle abilities even tick up. So once you’re retrieval-limited at a low top-K, haystack size barely matters — the bottleneck is retrieval breadth, not corpus scale. That asymmetry is the tell: a retrieval-recall ceiling, not a reader ceiling — the same finding as our LongMemEval decomposition (the reader is fine once the gold is in context; getting it there is the whole game).

grader artifact · event_ordering 0.0

event_ordering scores 0.0 in all three buckets. That is a grader mismatch, not a substrate failure: BEAM grades ordering with Kendall-τ-b (a rank correlation), while our binary judge floors any full-sequence answer that isn’t an exact match. It drags all buckets equally, so the regime-shift story is unaffected, but it means the absolute substrate is stronger than the headline overall numbers suggest. (Same Kendall-τ-vs-binary-judge mismatch flagged on LoCoMo.)
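To make the mismatch concrete, here is a small sketch that scores the same near-miss ordering two ways. The event labels are hypothetical, and the `kendall_tau` helper is a plain concordant-minus-discordant form, which equals τ-b only because the toy sequences contain no ties; it is not BEAM's grader.

```python
from itertools import combinations

def kendall_tau(gold, predicted):
    """Kendall rank correlation between two orderings of the same items.

    With no ties (each event appears exactly once), tau-b reduces to
    (concordant - discordant) / total pairs.
    """
    position = {item: i for i, item in enumerate(predicted)}
    concordant = discordant = 0
    for earlier, later in combinations(gold, 2):  # `earlier` precedes `later` in gold
        if position[earlier] < position[later]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

def binary_exact(gold, predicted):
    """Strict judge: full credit only for an exact sequence match."""
    return 1.0 if list(gold) == list(predicted) else 0.0

gold = ["e1", "e2", "e3", "e4", "e5"]
near_miss = ["e1", "e2", "e4", "e3", "e5"]  # a single adjacent swap

print("rank correlation:", kendall_tau(gold, near_miss))
print("binary judge:    ", binary_exact(gold, near_miss))
```

One swapped pair out of ten leaves the rank correlation high while the binary judge returns zero, which is the floor effect described above.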

1M floor · reader-window overflow

At 1M, 58 of 700 questions (~8%) overflowed the reader’s context window and hard-zeroed: a mechanical floor, not a retrieval verdict. A real 1M deployment needs sub-session chunking to feed the reader; until then, treat 0.471 as a conservative 1M reading carrying an ~8% overflow penalty that the plateau would otherwise sit above.
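The size of that penalty can be bounded with simple arithmetic. The sketch below is a back-of-envelope estimate, not a measured reading: it assumes every overflowed question scored exactly zero and asks what the accuracy is on the questions that did fit. The function name `overflow_adjusted` is hypothetical.

```python
def overflow_adjusted(overall, n_total, n_overflow):
    """Accuracy on the questions that did NOT overflow, assuming every
    overflowed question contributed a score of exactly zero."""
    implied_correct = overall * n_total
    return implied_correct / (n_total - n_overflow)

overflow_share = 58 / 700
adjusted = overflow_adjusted(0.471, 700, 58)
print(f"overflow share {overflow_share:.1%}, non-overflow accuracy {adjusted:.3f}")
```

Under that assumption the non-overflowed questions score a few points above 0.471. This says nothing about how the overflowed questions would score once chunked; it only removes the mechanical zeros.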

comparability · not the field’s BEAM-1M These buckets are flat-local mempalace — not directly comparable to the field’s published BEAM-1M (Mem0 70.1 / True Memory 76.6 / Hindsight 73.9), which run different infrastructure and stronger answer models. Read the columns as our own regime-shift diagnostic, not a leaderboard placement. 10M is deferred on cost.

Source · baselines/beam_{100k,500k,1m}_flat_*.json · BEAM bucket-regime harness (techempower-org#177) techempower-org/multipass-structural-memory-eval#177

Cat 9a · The Handshake

Tau2 score predicts empirical recall gap

verified · 2026-05-15

Corpus: jp-realm-v0.1 (n=30)
Backend: postgres + pgvector + AGE
Pattern: RLM orchestrator A/B
Models: qwen3.5:4b vs gemma4:e4b

Model	Tau2	Recall @ n=5	Recall @ n=20	Hit-rate @ n=5
gemma4:e4b	baseline	0.417	0.417	57%
qwen3.5:4b	+37.7pp	0.717	0.750	93%
delta	+37.7pp	+30.0pp	+33.3pp	+36pp

The published Tau2 tool-use benchmark gap between qwen3.5:4b and gemma4:e4b is +37.7 points in qwen's favour. On our independent 30-question Cat 9a-shaped RLM experiment, the empirical recall gap landed at +30.0pp at n=5 and +33.3pp at n=20, so the Tau2 gap overshoots the measured gap by 7.7pp at n=5 and 4.4pp at n=20: within ~5pp at the wider cutoff, within ~8pp at the narrower one. Tau2 is therefore a useful prior when picking an orchestrator model for any RLM-as-tool-use experiment, and a much stronger predictor than parameter count.
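The comparison is plain arithmetic on the table above. The sketch below recomputes the measured recall gap at each cutoff and how far the Tau2 gap sits from it; the variable and function names are illustrative only.

```python
tau2_gap_pp = 37.7  # published Tau2 gap, qwen3.5:4b minus gemma4:e4b

# Recall values copied from the table above.
recall = {
    "n=5": {"gemma4:e4b": 0.417, "qwen3.5:4b": 0.717},
    "n=20": {"gemma4:e4b": 0.417, "qwen3.5:4b": 0.750},
}

def gap_pp(row):
    """Recall gap in percentage points, qwen minus gemma."""
    return (row["qwen3.5:4b"] - row["gemma4:e4b"]) * 100

for cutoff, row in recall.items():
    measured = gap_pp(row)
    print(f"{cutoff}: measured +{measured:.1f}pp, "
          f"Tau2 prior off by {tau2_gap_pp - measured:.1f}pp")
```

The prior is tighter at n=20 than at n=5, which is worth keeping in mind when using a Tau2 gap to forecast a recall gap at a small cutoff.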

Source · jp-realm RLM A/B + published Tau2 scores (run artifacts recorded in the upstream issue, not committed to this fork) M0nkeyFl0wer/multipass-structural-memory-eval#3

LongMemEval-S 500Q · E2E QA

mempalace-daemon, default /search

headline · 2026-05-29

Corpus: longmemeval_oracle.json
n: 500 questions
Adapter: mempalace-daemon (postgres + pgvector + AGE)
Endpoint: POST /search (default)
Reader: o4-mini
Judge: gpt-5.3-chat
Content rules: upstream-exact
Ingest: per-question wing isolation

R@5 overall

97.0%

drawer-id match, chunk-suffix-aware (techempower-org/multipass-structural-memory-eval#98)

QA accuracy

58.6%

judge label = CORRECT or canonical ABSTAIN

R@5 − QA gap

38pp

retrieval finds the session; the QA gap is downstream — substrate-bound (see comparison card)

SME Category	LME type	n	R@5	QA-acc
cat_1	single-session IE	150	100.0%	51.33%
cat_1_negative	abstention	30	96.67%	90.00%
cat_2c	multi-session	121	98.35%	74.38%
cat_3_partial	knowledge-update	72	100.0%	65.28%
cat_6	temporal-reasoning	127	90.55%	40.94%
overall	—	500	97.00%	58.60%

The headline flips once the R@5 matcher is correct: the daemon finds the gold session in the top 5 97% of the time. The 38-point R@5→QA gap looked like a reader failure, but the true-oracle correction (see the comparison card) showed it is substrate: “R@5 found the session” doesn't mean the gold content reached the reader (limit=5 + chunking left it out), and with the gold actually present QA rises to a 0.868 ceiling. cat_6 (temporal) has the lowest R@5 of any category (90.55%) and also the worst QA (40.94%): its evidence is the most fragmented across the limited context. cat_1_negative has the highest QA (90%) because a correct abstention is the right answer regardless of what was retrieved.
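The overall row can be cross-checked from the per-category rows as an n-weighted mean. This is a small consistency check on the table above, not part of the harness; `weighted_mean` is a hypothetical helper.

```python
# (category, n, R@5 %, QA-acc %) copied from the table above.
rows = [
    ("cat_1", 150, 100.00, 51.33),
    ("cat_1_negative", 30, 96.67, 90.00),
    ("cat_2c", 121, 98.35, 74.38),
    ("cat_3_partial", 72, 100.00, 65.28),
    ("cat_6", 127, 90.55, 40.94),
]

def weighted_mean(rows, column):
    """Mean of `column`, weighted by each category's question count."""
    total_n = sum(r[1] for r in rows)
    return sum(r[1] * r[column] for r in rows) / total_n

r_at_5 = weighted_mean(rows, 2)
qa_acc = weighted_mean(rows, 3)
print(f"n={sum(r[1] for r in rows)}  R@5={r_at_5:.2f}%  "
      f"QA={qa_acc:.2f}%  gap={r_at_5 - qa_acc:.1f}pp")
```

The weighted means land on the published overall row, so the per-category and overall figures are mutually consistent.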

how R@5 went from 3.97% to 97% (techempower-org/multipass-structural-memory-eval#98) Earlier readings of this card showed R@5 = 3.97% and framed the gap as a matcher artifact. It was: the daemon chunks each drawer into <parent>_chunk_NNNNNN sub-drawers and /search returns the chunk IDs, but the matcher compared them exact-string against the parent IDs we stored at ingest, so every hit read as a miss. techempower-org/multipass-structural-memory-eval#98 strips the suffix before comparing. The 97% is computed by re-scoring the existing 2026-05-28 rerun records — no new bench compute, same retrieved sets, correct matcher. QA-acc is unchanged by the fix (it never depended on the matcher).
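The matcher fix can be sketched in a few lines. This is an illustration of the idea only, assuming the `<parent>_chunk_NNNNNN` suffix shape described above; the function names and the sample drawer IDs are hypothetical and are not taken from the harness.

```python
import re

# Assumed suffix shape, from the "<parent>_chunk_NNNNNN" description above.
_CHUNK_SUFFIX = re.compile(r"_chunk_\d+$")

def parent_id(drawer_id: str) -> str:
    """Map a chunk ID back to its parent drawer ID; parent IDs pass through."""
    return _CHUNK_SUFFIX.sub("", drawer_id)

def hit_at_k(retrieved_ids, gold_parent_ids, k=5):
    """True if any of the top-k retrieved drawers belongs to a gold parent."""
    top_parents = {parent_id(d) for d in retrieved_ids[:k]}
    return bool(top_parents & set(gold_parent_ids))

# Hypothetical IDs: /search returns chunk IDs, ingest stored parent IDs.
retrieved = ["sess_17_chunk_000002", "sess_03_chunk_000000", "sess_41"]
gold = ["sess_17"]

exact_string_hit = any(d in gold for d in retrieved[:5])  # the old comparison
suffix_aware_hit = hit_at_k(retrieved, gold)              # the corrected one
print(exact_string_hit, suffix_aware_hit)
```

The retrieved set is identical in both comparisons; only the ID normalisation differs, which is why the fix could be applied by re-scoring existing records.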

Source · baselines/longmemeval_mempalace_daemon_2026-05-28-rerun.reagg.json (re-scored post-techempower-org/multipass-structural-memory-eval#98; rerun post-#67 of the 2026-05-28-attempt3 run) techempower-org/multipass-structural-memory-eval#44

LongMemEval-S 500Q · A/B leg

mempalace-daemon, /search/age-fused

A/B finding · 2026-05-28

Corpus: longmemeval_oracle.json
n: 500 questions
Adapter: mempalace-daemon
Endpoint: POST /search/age-fused
Reader: o4-mini
Judge: gpt-5.3-chat

R@5 overall

90.2%

vs 97.0% on /search default

QA accuracy

17.4%

vs 58.6% on /search default (−41.2pp)

Context per query

457 chars mean

vs 2539 on /search default (5.5× less)

SME Category	n	R@5 age-fused	QA default	QA age-fused
cat_1	150	100.0%	51.33%	9.33%
cat_1_negative	30	93.33%	90.00%	100.00%
cat_2c	121	98.35%	74.38%	0.00%
cat_3_partial	72	100.0%	65.28%	1.39%
cat_6	127	64.57%	40.94%	33.07%
overall	500	90.20%	58.60%	17.40%

The corrected matcher (techempower-org/multipass-structural-memory-eval#98) turns this card from a mystery into a clean finding. /search/age-fused retrieves the gold session at R@5 = 90.2%, within 7pp of the default endpoint's 97.0%. Yet QA accuracy is 17.4%, a 73-point R@5→QA gap. The cause is the snippet width: age-fused returns ~457 chars of context per query (5.5× less than default's 2539). The gold session is in the results, but the reader is handed too thin a slice of it to answer. cat_2c (multi-session) collapses to 0% QA at 98% R@5, the starkest illustration: nearly every gold session retrieved, none answerable from the snippet. cat_1_negative rises to 100% because thin context makes the reader abstain, which is correct for unanswerable questions.

corrected 2026-05-29 — not a retrieval failure Earlier readings of this card showed R@5 = 0.00% and read the age-fused leg as broken retrieval, then (after the empty-triples theory was disproven) as a structural mystery. With the chunk-suffix matcher fix (techempower-org/multipass-structural-memory-eval#98), R@5 is actually 90.2%: age-fused retrieval works. The QA collapse is entirely a context-width problem — the daemon-side snippet contract investigated in techempower-org/palace-daemon#150. The empty-triples theory remains disproven (rerun against 1.83M triples gave byte-identical context_chars), and the LongMemEval-S rerun (techempower-org/multipass-structural-memory-eval#91) is still the right test for whether graph traversal earns its keep on a harder haystack — but on oracle, age-fused's retrieval is not the problem; its snippet width is.

Source · baselines/longmemeval_age_fused_2026-05-28-rerun.reagg.json (re-scored post-techempower-org/multipass-structural-memory-eval#98) techempower-org/multipass-structural-memory-eval#45

LongMemEval-S 500Q · second adapter

familiar adapter, same harness

three-way A/B · 2026-05-29

Corpus: longmemeval_oracle.json
n: 500 questions
Adapter: familiar (Hybrid v4 + rerank, fork)
Reader: o4-mini
Judge: gpt-5.3-chat

R@5 overall

28.6%

lowest of three legs — Familiar's retrieval genuinely weaker

QA accuracy

31.0%

QA slightly exceeds R@5 — reader fine, retrieval is the limit

R@5 vs daemon

−68pp

28.6% vs 97.0% on daemon-direct

SME Category	n	R@5 #44	R@5 #45	R@5 #46
cat_1	150	100.0%	100.0%	38.00%
cat_1_negative	30	96.67%	93.33%	6.67%
cat_2c	121	98.35%	98.35%	26.45%
cat_3_partial	72	100.0%	100.0%	34.72%
cat_6	127	90.55%	64.57%	21.26%
overall R@5	500	97.00%	90.20%	28.60%

Familiar.realm.watch is a separate memory system in this household — same corpus, different retrieval and reader architecture. The second-adapter discipline is the point: a single reading is a single corpus, and brittle defaults hide on any single corpus. With the corrected matcher (techempower-org/multipass-structural-memory-eval#98) the finding inverts from an earlier reading: Familiar's R@5 (28.6%) is the lowest of the three legs, not the highest. Its retrieval is genuinely weaker than daemon-direct's 97%. And unlike the daemon legs, Familiar's QA (31.0%) slightly exceeds its R@5 — the reader does fine with what it gets; the limit is what reaches it. A different shape from #44/#45, where retrieval finds the session but — as the true-oracle correction later showed — the gold content still often didn't reach the reader. Both point the same way: the limit is what reaches the reader, not its reasoning.

corrected 2026-05-29 — earlier "highest R@5" claim was a matcher artifact A prior reading of this card claimed Familiar had the highest R@5 of the three legs (10.94%, beating daemon's then-3.57%). That was an artifact of the broken substring matcher: Familiar happened to surface session-ids in its retrieved text where the daemon stripped them, so the substring matcher favoured Familiar while undercounting both. The chunk-suffix drawer-id matcher (techempower-org/multipass-structural-memory-eval#98) measures retrieval directly and reverses the picture: daemon 97%, age-fused 90.2%, Familiar 28.6%. Familiar's published Hybrid v4 + rerank stack is tuned for its own corpus shape, not LongMemEval's one-drawer-per-session topology — a fair cross-system caveat, not a defect.

Source · baselines/longmemeval_familiar_2026-05-28-rerun.reagg.json techempower-org/multipass-structural-memory-eval#46

encoder fine-tune · in-domain vs cross-domain

Fine-tuned encoder lifts in-domain — but retrieval isn't the bottleneck

cross-domain null · 2026-05-29

In-domain: LongMemEval-S via MemPalace's own bench (n=500)
Cross-domain: jp-realm-v0.1 (n=30, covered n=29)
Encoders: all-MiniLM-L6-v2 base vs adaptmem FT-300
Stacks: raw · +FT-300 · +hybrid_v4+FT-300

In-domain stack (LongMemEval-S, n=500)	R@1	R@5	R@10
MemPal raw default	0.806	0.966	0.982
+ adaptmem FT-300	0.862	0.980	0.994
+ hybrid_v4 + FT-300	0.916	0.990	0.998
katana FT-300 repro (held-out 200)	0.925	1.000	1.000

Cross-domain (jp-realm-v0.1, covered n=29)	R@1	R@5	R@10
base (all-MiniLM-L6-v2)	0.3448	0.5172	0.6207
FT-300 (MNR fine-tune)	0.3621	0.5172	0.6034
delta	+1.73pp	0.00	−1.73pp

Domain-adaptive encoder fine-tuning is a real retrieval lift in-domain. On LongMemEval-S through MemPalace's own bench, adaptmem's FT-300 moves R@5 0.966→0.980 and R@1 0.806→0.862; stacked with hybrid_v4 retrieval the gains compose to R@1 0.916 / R@5 0.990 — encoder fine-tune (layer 2) and hybrid retrieval (layer 3) operate on different failure modes and add independent lift. Our own katana FT-300 reproduction hit R@5 = 1.000 on the held-out 200. So the encoder is not useless — quite the opposite.

What our jp-realm reading shows is narrower: the lift does not transfer cross-domain. A FT-300 encoder trained on conversational memory, dropped onto JP's personal KB, gives best delta +1.73pp at R@1 (R@5 flat) against a predicted +30–33pp — a clean cross-domain null, exactly the open question the orthogonal-layers note flagged. And it doesn't change the end-to-end picture either: on oracle LongMemEval retrieval is already ~0.974 while the reader leaves a ~45pp gap (techempower-org/multipass-structural-memory-eval#116). Encoder tuning is a legitimate retrieval improvement, but it is not the lever for end-to-end QA — what reaches the reader is.

in-domain numbers credited to nakata-app · MemPalace/mempalace D#1249 The in-domain table is nakata-app's measurement, evaluated through MemPalace's own longmemeval_bench.py with a monkey-patched encoder swap (same dataset, same encoder family, zero changes to eval logic) and posted to MemPalace/mempalace discussion #1249. The raw-baseline R@5 of 0.966 exactly reproduces MemPalace's published number. The orthogonality finding (fine-tune and hybrid retrieval compose) is the durable claim; the absolute percentages attribute to that thread. Framing and layer model: SME's docs/research/adaptmem-orthogonal-layers.md.

Source · cross-domain: baselines/jp_realm_encoder_swap_{default,ft300}_2026-05-29.json + jp_realm_encoder_delta_2026-05-29.json · in-domain repro: baselines/lme_substrate_ft300_katana_test200_2026-05-17.json · nakata-app: MemPalace/mempalace D#1249 techempower-org/multipass-structural-memory-eval#84

LongMemEval-S · retrieval A/B

Age-fusion ~neutral on representative data

headline · 2026-05-29

Corpus: longmemeval_s_cleaned.json
n: 150 (stratified, 25 per question_type)
Adapter: mempalace-daemon
Endpoints: /search vs /search/age-fused
Scoring: drawer-id R@K (techempower-org/multipass-structural-memory-eval#98), retrieval-only

/search R@5

92.67%

plain endpoint, drawer-id match

/search/age-fused R@5

92.00%

graph-fused endpoint

age-fusion Δ R@5

−0.67pp

−1 question of 150 (n.s.)

On a representative, category-stratified sample, age-fusion shows NO significant R@5 gain over plain /search (Δ = −1 question of 150; R@1 +2 questions). The +2.0pp R@5 “win” seen on an earlier n=100 first-100 slice did not replicate — that slice was single-session-dominated (the S corpus is question_type-sorted; fixed via --stratify-by in techempower-org/multipass-structural-memory-eval#122). Per-category (n=25 each, ±1-question noise → directional hypothesis only): age-fusion helps temporal-reasoning + knowledge-update, hurts single/multi-session recall.
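Why a first-100 slice misleads on a type-sorted corpus, and what stratifying changes, can be shown on a toy corpus. The sketch below is an assumption-laden illustration: the corpus is synthetic (80 questions per type, sorted by type), and `stratified_sample` is a hypothetical helper, not the implementation behind --stratify-by.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(questions, key, per_group, seed=0):
    """Draw up to `per_group` questions from each value of `key`."""
    groups = defaultdict(list)
    for q in questions:
        groups[q[key]].append(q)
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    sample = []
    for name in sorted(groups):
        members = groups[name]
        sample.extend(rng.sample(members, min(per_group, len(members))))
    return sample

# Synthetic corpus, sorted by question_type like the S corpus is said to be.
types = [
    "single-session-user", "single-session-assistant",
    "single-session-preference", "multi-session",
    "temporal-reasoning", "knowledge-update",
]
questions = [{"id": f"{t}-{i}", "question_type": t}
             for t in types for i in range(80)]

first_100 = questions[:100]
strat_150 = stratified_sample(questions, "question_type", per_group=25)

print("first-100 types:", Counter(q["question_type"] for q in first_100))
print("stratified types:", Counter(q["question_type"] for q in strat_150))
```

On this toy corpus the first-100 slice covers only two question types, while the stratified draw covers all six evenly, so any per-type effect of age-fusion is averaged rather than over- or under-sampled.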

Source · baselines/longmemeval_s_strat150_{search,age_fused}_2026-05-29.reagg.json techempower-org/multipass-structural-memory-eval#91

LongMemEval oracle · reader sweep (Pass A)

A large R@5→QA gap — measured directly

headline · 2026-05-29

Corpus: longmemeval_oracle.json (pinned context, retrieval held fixed)
n: 500
Readers: o4-mini, gpt-5.3-chat
Judge: gpt-5.3-chat
Retrieval ceiling: R@5 97.4%

Retrieval ceiling R@5

97.4%

oracle — gold pinned in context

o4-mini QA

50.4%

search-default context

gpt-5.3-chat QA

52.2%

search-default context

R@5→QA gap

~45pp

ceiling clears; QA doesn't — on this limited context

Endpoint	o4-mini QA	gpt-5.3-chat QA
search-default	50.4%	52.2%
age-fused	43.2%	46.6%

With the search-default context pinned (R@5 0.974), both readers answer correctly only ~50% — a ~45pp R@5→QA gap. Correction (see the comparison card): this gap is substrate, not the reader. The true-oracle test later showed that “R@5 found the session” does not mean the gold content reached the reader — limit=5 + chunking (and, for assistant-authored answers, the user-only ingest) often left the answer out of the context entirely. Hand the reader the gold verbatim and QA rises to a 0.868 ceiling, near GPT-4o's oracle. So the ~50% reflects what reached the reader, not a reasoning limit. Age-fused context is harder still (QA 43–47%). Self-judge caveat: gpt-5.3-chat judges its own family on its own reader leg, which may inflate it.

Source · baselines/reader_sweep_passA_{search-default,age-fused}_2026-05-29.json techempower-org/multipass-structural-memory-eval#116

LongMemEval oracle · reader sweep (Pass B)

A stronger reader does NOT close the gap

counterintuitive · 2026-05-29

Corpus: pinned oracle (stratified 150, search-default)
Readers: claude-opus-4-8 (Bedrock), o4-mini, gpt-5.3-chat
Prompt: baseline
Judge: gpt-5.3-chat

claude-opus-4-8 QA

39.3%

strongest reader, worst overall

o4-mini QA

46.7%

same 150-question slice

gpt-5.3-chat QA

44.7%

self-family judge leg

Category	Opus QA-acc
single-session-user	84%
temporal-reasoning	48%
knowledge-update	48%
multi-session	44%
single-session-assistant	12%
single-session-preference	0%

Opus 4.8 — the strongest reader — scored the worst overall (39.3%). But it is not a capability gap: Opus is the best reader on well-posed direct recall (single-session-user 84%, the highest of any reader on any category) and collapses only on mis-specified categories (preference 0%, assistant 12%). It is penalized for following the baseline prompt literally (over-abstaining where the gold answer is an inferred preference) and for thoroughness (reporting both old+new values → judged PARTIAL; answers 2× longer). The 45pp gap is prompt + judge design, not reader capability. Both follow-ups are now published below: the prompt-axis fix and the Opus-as-judge re-scoring.

Source · baselines/reader_sweep_passB_opus_strat150_search-default_2026-05-29.json techempower-org/multipass-structural-memory-eval#116

LongMemEval oracle · reader sweep (prompt axis)

The reader gap was the prompt, not the model

reader-prompt fix · 2026-05-29

Corpus: pinned oracle (stratified 150, 25/type, search-default)
Readers: claude-opus-4-8 (Bedrock), o4-mini
Prompt axis: baseline → committed → preference
Judge: gpt-5.3-chat (not yet canonical — floor only)
Context: full oracle

Opus baseline QA

36.0%

worst reader, baseline prompt

Opus preference QA

59.3%

best config overall, +23pp

ss-preference canary

0.04→0.76

Opus, +72pp on one category

Prompt swing

+23pp

same model, same context

Reader	baseline	committed	preference
o4-mini	0.420	0.507	0.527
claude-opus-4-8	0.360	0.473	0.593

“Opus is the worst reader” was a prompt artifact. With the preference-tuned prompt, Opus goes from worst (0.360) to best of any config (0.593): a +23pp swing from changing nothing but the prompt. The largest single contribution came from one category: single-session-preference, where Opus scored 0.04 under the baseline prompt (it refused to make a recommendation; the “say I don’t know” instruction over-fired on inference questions) and 0.76 under the preference prompt (+72pp). At 25 of 150 questions, that category alone is worth about +12pp of the overall move from 0.36 to 0.59. This is the reader-harness fix; the judge is still gpt-5.3-chat (not the LongMemEval-canonical type-specific judge), so 0.593 is a floor, and the canonical-judge work compounds on top. Bottom line: the #116 reader gap is prompt design, not model capability.

Source · baselines/reader_sweep_passB_promptfix_2026-05-29.json techempower-org/multipass-structural-memory-eval#116

palace-daemon · FlashRank rerank A/B

Cross-encoder rerank lifts MRR +15–23%

verified · 2026-05-27

Repo: palace-daemon (#46)
Corpus: live palace, ~375k drawers
Backend: postgres + pgvector (familiar)
Model: ms-marco-TinyBERT-L-2-v2 (nano, CPU)
Query set: 12 known-item, 11 usable, pool=20
Pattern: same pool, ordering-only A/B

Run	R@5	R@10	MRR base	MRR rerank	Δ MRR
#2 (confirming)	0.909→0.909	1.00	0.748	0.921	+23.1%
#1 (first)	1.00→0.909	1.00	0.761	0.877	+15.3%
movement	rerank-spike 7→1 (+6) · fallback-contract 4→1 · 7 no-change				1 regr.

A clean ordering-only A/B — retrieval is held constant, only the final sort changes (distance vs FlashRank score) on the same candidate pool. MRR lifts +15.3% (run #1) and +23.1% (run #2), driven by a buried answer rescued from rank 7 to rank 1 plus two smaller promotions; 7 of 11 queries were already optimal and rerank correctly left them alone. R@5 is a wash and R@10 is untouched, so MRR is the load-bearing metric for this known-item set. Rerank latency stayed at 47 ms mean (126 ms under host load), worst single request 557 ms — well inside a 1 s budget on CPU. Verdict: keep nano, A/B a larger model next.
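The reason MRR moves while R@5 and R@10 do not is easiest to see on a small example. The ranks below are hypothetical; only the two moves (a rescue from rank 7 to rank 1, and a regression from rank 3 to rank 8) are borrowed from the text. The helper names are illustrative.

```python
def mrr(ranks):
    """Mean reciprocal rank; ranks are 1-based, None means not in the pool."""
    return sum(0.0 if r is None else 1.0 / r for r in ranks) / len(ranks)

def recall_at(ranks, k):
    """Share of queries whose gold item sits at rank k or better."""
    return sum(r is not None and r <= k for r in ranks) / len(ranks)

# Rank of the known item per query, same candidate pool, two sort orders.
base_ranks = [1, 1, 7, 3, 1, 2]    # sorted by vector distance
rerank_ranks = [1, 1, 1, 8, 1, 2]  # sorted by cross-encoder score

for name, ranks in (("base", base_ranks), ("rerank", rerank_ranks)):
    print(f"{name:7s} R@5={recall_at(ranks, 5):.3f} "
          f"R@10={recall_at(ranks, 10):.3f} MRR={mrr(ranks):.3f}")
```

One item crosses into the top 5 and another crosses out, so R@5 is unchanged, and everything stays inside the top 10. MRR still rises because a move to rank 1 adds far more reciprocal rank than a slide from 3 to 8 removes.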

the one regression — score compression daemon-deploy-arch moved 3→8 in both runs. The top 7 reranked passages all scored 0.9971–0.9994 (a 0.002 spread) and every one was genuinely on-topic — when 7 passages are all relevant and scored within 0.002, head ordering is a coin-flip. This is the distilled 2-layer cross-encoder's score-compression failure mode and the strongest case for A/B-testing a larger model (ms-marco-MiniLM-L-12-v2; this FlashRank build ships no L-6). Cross-validated: replaying frozen pools in --mode candidates reproduced run #1's metrics exactly.

Source · palace-daemon/docs/evals/rerank-eval-live-2026-05-27.json (run #2) · rerank-eval-2026-05-27.json (run #1) · rerank-candidates-2026-05-27.json techempower-org/palace-daemon

#111 · hybrid scorer-weight tuning

A documented weight that recovers MRR without losing R@5

verified · 2026-05-31

Corpus: live familiar palace (postgres + AGE)
Query set: 12 labeled known-item (palace-daemon PR #64)
Method: in-process searcher, FlashRank OFF, weights set per sweep point
Knob: PALACE_HYBRID_VECTOR_WEIGHT / _BM25_WEIGHT (mempalace#342)

winning weight

0.85 / 0.15

vector / BM25

R@5

1.000

held — no recall lost

MRR

0.833

+2.5pp vs floor, +4.8pp vs default-hybrid

config	R@5	R@10	MRR
vector / convex (default wts)	0.917	1.000	0.808
union / convex (default wts)	1.000	1.000	0.785
hybrid / convex (default 0.6/0.4)	1.000	1.000	0.785
hybrid / convex (0.85/0.15)	1.000	1.000	0.833

The #111 acceptance criterion (a documented hybrid weight achieving R@5 ≈ 1.000 without regressing MRR vs union/vector) is met by vector_weight 0.85 / bm25_weight 0.15. The default 0.6/0.4 hybrid had regressed MRR to 0.785, level with union and below the 0.808 vector-only floor; 0.85/0.15 lifts it to 0.833 (+2.5pp over the floor, +4.8pp over the default-hybrid regression) while R@5 stays pinned at 1.000. MRR turns out to be non-monotonic in vector_weight: the earlier “graph promoted one query and demoted another” reading was wrong; the real mechanism is the convex blend over-weighting BM25 at 0.4. The weight ships as an env knob (mempalace#342). The deployed default stays 0.6/0.4 because n=12 is below the n≥25 bar to flip a production default.
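A convex blend is simple enough to sketch directly. The example below is a toy, assuming both scores are already normalised to the same 0..1 range; the candidate names and scores are invented, and `convex_score` is not the searcher's actual function.

```python
def convex_score(vector_score, bm25_score, vector_weight):
    """Convex blend: the two weights are non-negative and sum to 1."""
    return vector_weight * vector_score + (1.0 - vector_weight) * bm25_score

# Hypothetical normalised scores for two candidates on one query.
candidates = {
    "gold_drawer": {"vector": 0.82, "bm25": 0.30},
    "keyword_decoy": {"vector": 0.60, "bm25": 0.95},
}

def ranking(vector_weight):
    """Candidate names sorted best-first under the given vector weight."""
    return sorted(
        candidates,
        key=lambda name: convex_score(
            candidates[name]["vector"], candidates[name]["bm25"], vector_weight
        ),
        reverse=True,
    )

print("0.60 / 0.40:", ranking(0.60))
print("0.85 / 0.15:", ranking(0.85))
```

With the heavier BM25 weight a keyword-heavy decoy can outrank the semantically closer drawer; shifting weight toward the vector score restores the order without removing either candidate from the pool, which matches the pattern of R@5 holding while MRR recovers.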

third age-fusion-null data point On these 12 golden queries the graph leg is inert: hybrid scores identically to union (R@5 1.000 / MRR 0.785 at default weights) — the lift comes from BM25 pool-widening, not graph traversal. That joins the Cat 2c/7 graph-RRF tax and the #91 age-fusion R@5 null as a third independent reading where the AGE graph half adds ~nothing to top-5 retrieval on drawer-query workloads (it earns its keep on entity-anchored queries).

Source · baselines/hybrid-scorer-weight-tuning-2026-05-31.json · docs/benchmarks/2026-05-31-hybrid-scorer-weight-tuning.md (SME#227 + mempalace#342, both merged) techempower-org/multipass-structural-memory-eval#111

#103 · cross-encoder rerank A/B (SME side)

Cross-encoder rerank is neutral-to-negative — keep it off

verified · 2026-05-31

Probe set: 200 git-derived probes (same set as the #162 fusion A/B)
Substrate: scratch daemon seeded with the mempalace git/docs corpus
Method: daemon /search/hybrid per-request rerank flag, read-only A/B

R@10 (all three legs)

0.60

identical — rerank reorders, doesn’t recall more

MRR Δ (rerank on)

−0.6pp

0.299 off → 0.293 TinyBERT — slightly hurt

bigger model (L-12)

worse & 3×

MRR 0.284, p50 1523 ms vs 475 ms

leg	MRR	R@5	R@10	found	p50
rerank OFF	0.299	0.510	0.60	120/200	555 ms
TinyBERT-L-2-v2 (daemon default)	0.293	0.515	0.60	120/200	475 ms
MiniLM-L-12-v2 (bigger)	0.284	0.505	0.60	120/200	1523 ms

This is the SME-side rerank A/B — distinct from the palace-daemon FlashRank known-item A/B above (a 12-query run that found a +15–23% MRR lift on a different, answer-buried query shape). On a corpus-seeded scratch daemon — the git/docs targets re-ingested so the relevant set is actually present — the verdict is the opposite: cross-encoder rerank is neutral-to-slightly-negative. R@10 is identical at 0.60 across all three legs (rerank only reorders the top-K; it cannot recall what vector retrieval missed), MRR is slightly hurt by rerank (0.299→0.293), and the bigger MiniLM-L-12-v2 is both worse on MRR and 3× the latency (1523 ms vs 475 ms p50). Recommendation: keep rerank OFF / opt-in, do not promote the bigger model. This is the 4th independent reading that the structural/rerank layer doesn’t lift retrieval — the vector backbone is the lever (with the age-fusion null ×3: Cat 2c/7 graph-RRF tax, #91 age-fusion R@5, and the #111 graph-leg-inert finding).

both halves of the story · a measurement-validity catch, then the real number The first attempt (#225, on prod familiar) ran clean but measured nothing: the 200 git-derived probes target mempalace’s own git/docs corpus, which is no longer present in conversational-only prod familiar — only 17/200 targets surfaced (Recall@10 0.085), so the rerank had nothing to reorder. A clean exit code masked a near-empty corpus; that delta was corpus-floor noise, not a verdict. This corpus-seeded re-run supersedes it with the relevant set actually present. (It also corrects the #225 model label: ms-marco-MiniLM-L-6-v2 was never in the FlashRank build — #225’s rerank leg actually ran TinyBERT-L-2-v2, the same nano model that is the daemon default.)

Source · baselines/ce_rerank_corpus_seeded_2026-05-31.json (corpus-seeded; supersedes the blocked baselines/ce_rerank_ab_2026-05-30.json) · docs/benchmarks/2026-05-30-ce-rerank-ab.md techempower-org/multipass-structural-memory-eval#103

palace-daemon · NCD novelty calibration

novelty_score is continuous, not bimodal

verified · 2026-05-27

Repo: palace-daemon (#47)
Corpus: live palace, ~375k drawers
Backend: postgres (familiar:8085)
Method: gzip-NCD, block rolling window=20
Sample: 35 (wing,room) groups, n=280
Scorer: production novelty.score_novelty

overall median

0.622

mean 0.558, stdev 0.222 (0=dup .. 1=novel)

redundant tail

p10 0.165

min 0.028; long tail below ~0.20

per-room median spread

~0.37

discoveries 0.30 vs planning 0.68

Room	n	p10	median	p90
references	144	0.173	0.636	0.781
architecture	40	0.363	0.649	0.776
discoveries	40	0.126	0.304	0.705
planning	32	0.496	0.677	0.821
problems	24	0.215	0.637	0.791

The gzip-NCD novelty_score on the live corpus is continuous and left-skewed (mean 0.558 below median 0.622), not bimodal: a broad novel hump across 0.55–0.85, a long redundant tail below ~0.20, and the middle band populated throughout with no empty valley. Per-room baselines diverge by ~0.37 in median (discoveries 0.30 vs planning 0.68), so a single global NCD cut mislabels content; the calibration recommends per-room percentile thresholds: redundant at p15(room), novel at p60(room).
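As a rough sketch of the two ideas in this card, the block below computes a gzip-based normalized compression distance and derives per-room p15/p60 cut-offs. It uses the standard NCD formula and a nearest-rank percentile; it is not the production novelty.score_novelty, and the sample texts, room scores, and helper names are all hypothetical.

```python
import gzip
import math

def ncd(x: str, y: str) -> float:
    """Normalized compression distance: near 0 for duplicates, near 1 for unrelated text."""
    cx = len(gzip.compress(x.encode("utf-8")))
    cy = len(gzip.compress(y.encode("utf-8")))
    cxy = len(gzip.compress((x + y).encode("utf-8")))
    return (cxy - min(cx, cy)) / max(cx, cy)

def percentile(values, p):
    """Nearest-rank percentile (p in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

def room_thresholds(scores_by_room, low=15, high=60):
    """Per-room (redundant cut, novel cut) taken at the low/high percentiles."""
    return {room: (percentile(s, low), percentile(s, high))
            for room, s in scores_by_room.items()}

# Hypothetical texts: a is compared with itself and with unrelated text b.
a = " ".join(f"drawer {i}: postgres pgvector index notes" for i in range(40))
b = " ".join(f"plan {i * 7}: reranker latency budget review" for i in range(40))
print(f"ncd(a, a)={ncd(a, a):.3f}  ncd(a, b)={ncd(a, b):.3f}")

# Hypothetical per-room novelty scores.
scores_by_room = {
    "discoveries": [0.10, 0.15, 0.30, 0.32, 0.70],
    "planning": [0.50, 0.60, 0.68, 0.70, 0.82],
}
thresholds = room_thresholds(scores_by_room)
print(thresholds)
```

In this toy data a score of 0.40 sits above the novel cut for discoveries but below the redundant cut for planning, which is the mislabelling a single global threshold would produce.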

scope — distribution, not ECE/Brier; and a silent no-op fixed This is a distribution calibration of the NCD score, not a probabilistic (ECE/Brier) calibration. It also surfaced two bugs: the live scorer was a silent no-op until palace-daemon #63 (it read window text from the wrong field, so every write returned novelty_score=1.0 — the numbers above were captured after the fix), and content_preview is truncated to ~200 chars, so the write path still scores against truncated neighbours (NCD on short prefixes is noisier and biased high). The calibration script computes the full-content distribution by default.

Source · palace-daemon/docs/evals/novelty_calibration.json (live, n=280) · offline synthetic fixture superseded · SME isotonic ECE/Brier follow-up: techempower-org/multipass-structural-memory-eval#105 techempower-org/palace-daemon

LongMemEval oracle · reader sweep (judge axis)

A stronger judge doesn't rescue the gap

judge axis · 2026-05-29

Corpus: pinned oracle (stratified 150, search-default)
Readers: claude-opus-4-8 / o4-mini / gpt-5.3-chat (same answers as Pass B)
Judge re-scored: gpt-5.3-chat → claude-opus-4-8
Method: same hypotheses, different judge

Reader (answers fixed)	QA gpt5.3-judge	QA Opus-judge	Δ
claude-opus-4-8	0.393	0.420	+0.027
o4-mini	0.467	0.480	+0.013
gpt-5.3-chat	0.447	0.480	+0.033

To isolate the judge variable, the identical Pass B reader answers were re-graded with Opus-4.8 as judge (the original judge was gpt-5.3-chat). Every reader lifts modestly, by +1.3 to +3.3pp, mostly by rescuing single-session-preference questions (the Opus reader's preference category moves 0.00→0.12). But the Opus judge does not reorder the readers (o4-mini and gpt-5.3-chat now tie at 0.480) and does not rescue the Opus reader, which stays the lowest at 0.420. So the judge confound is real but small: the dominant factor in the reader gap is prompt design, not judge strictness. The prompt-fix sweep is published; see the prompt-axis card above.
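The judge comparison can be re-derived from the table in a few lines. This is only a worked check on the published figures; the variable names are illustrative.

```python
# (QA under gpt-5.3-chat judge, QA under Opus judge), copied from the table above.
qa = {
    "claude-opus-4-8": (0.393, 0.420),
    "o4-mini": (0.467, 0.480),
    "gpt-5.3-chat": (0.447, 0.480),
}

# Per-reader lift from swapping the judge, rounded to the table's precision.
deltas = {reader: round(after - before, 3) for reader, (before, after) in qa.items()}

# The lowest-scoring reader under each judge.
lowest_before = min(qa, key=lambda reader: qa[reader][0])
lowest_after = min(qa, key=lambda reader: qa[reader][1])

print(deltas)
print(lowest_before, lowest_after)
```

Both `lowest_before` and `lowest_after` resolve to the Opus reader, so the judge swap lifts every score without changing which reader trails.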

Source · baselines/reader_sweep_passB_opus_REJUDGED_2026-05-29.json techempower-org/multipass-structural-memory-eval#116

LongMemEval oracle · canonical type-specific judge

Fixing the judge un-collapses preference — but the gap is the reader

canonical judge · 2026-05-29

Corpus: LongMemEval-S oracle, n=500 (all 6 types + 30 abstention)
Reader: gpt-5.3-chat (held constant, baseline prompt, full context)
Judge: gpt-5.3-chat + canonical type-specific prompts (NOT gpt-4o-2024-08-06 — not deployed here)
Method: same reader answers, paraphrased rubric → verbatim LongMemEval templates

preference /search

+20.0pp

0.133 → 0.333, un-collapsed

preference age-fused

+26.7pp

0.100 → 0.367, un-collapsed

spurious ABSTAIN

21 → 0

old judge invented them on temporal Qs

residual oracle gap

~31–36pp

reader/substrate, not the scorer

Question type	/search Δ	age-fused Δ
single-session-preference	+20.0pp	+26.7pp
single-session-user	+0.0pp	+1.4pp
single-session-assistant	+0.0pp	+7.1pp
multi-session	−3.8pp	−6.0pp
knowledge-update	−5.1pp	−9.0pp
temporal-reasoning	−2.3pp	−2.3pp
OVERALL QA (harness; absolute, not Δ)	0.510	0.456
OVERALL QA (+abstention; absolute, not Δ)	0.562	0.510

The companion card swaps the judge model (Opus-as-judge) and finds it doesn't help. This card swaps the judge prompts instead, porting LongMemEval's verbatim type-specific templates, in particular the rubric-based preference template the old paraphrased judge lacked. That un-collapses single-session-preference (+20.0pp / +26.7pp, the largest single-category move) and removes the spurious-ABSTAIN noise: the old judge emitted 34 ABSTAIN labels, 21 of them on non-abstention temporal questions, while the canonical binary judge cannot emit that label. The stricter, more faithful templates pull knowledge-update and multi-session down a few points; that reflects the old judge over-crediting through a softer rubric, not a regression.

The headline: the overall did NOT move toward the published 87% oracle. Corrected and abstention-credited, it is 0.562 (/search) / 0.510 (age-fused). The judge-prompt confound was real (preference collapse and label noise were genuine measurement artifacts), but it is not the bulk of the ~35pp gap. The residual ~31–36pp is reader/substrate, not the scorer, and the true-oracle test (see the comparison card) later resolved which: it is substrate (what reaches the reader), not the reader's reasoning.
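The residual-gap range follows directly from the figures quoted in this card. A short worked check, with illustrative variable names:

```python
published_oracle = 0.87  # published oracle figure quoted in this card

# OVERALL (+abstention) per endpoint, from the table above.
measured = {"/search": 0.562, "age-fused": 0.510}

# Residual gap to the published oracle, in percentage points.
residual_pp = {
    endpoint: round((published_oracle - qa) * 100, 1)
    for endpoint, qa in measured.items()
}
print(residual_pp)  # {'/search': 30.8, 'age-fused': 36.0}

# Preference category before and after the canonical judge prompts.
preference = {"/search": (0.133, 0.333), "age-fused": (0.100, 0.367)}
preference_lift_pp = {
    endpoint: round((after - before) * 100, 1)
    for endpoint, (before, after) in preference.items()
}
print(preference_lift_pp)  # {'/search': 20.0, 'age-fused': 26.7}
```

The 20 to 27pp lift is confined to one question type, while the 31 to 36pp residual applies to the overall score, which is why fixing the judge prompts cannot account for the gap.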

Disclosure. Judge is gpt-5.3-chat with canonical prompts, not the canonical gpt-4o-2024-08-06 snapshot (not deployed on this resource) — the fix is the prompts, not the model. 2 ERROR rows per run from an Azure content-filter trip (graceful retry-then-ERROR by design; counted as wrong, a 0.4% asymmetry vs the confounded denominator). Absolute numbers are understated by the aggregator’s ABSTAIN dead-code bug (#148) — it never credits a correct ABSTAIN — but that depresses both runs identically, so the deltas are valid.
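To make the scoring conventions in the disclosure concrete, here is a hypothetical aggregator sketch. The row fields (`label`, `abstention_q`), the label set, and the `credit_abstain` switch are invented for illustration and do not describe the actual harness code or the fix for #148.

```python
def accuracy(rows, credit_abstain=True):
    """Fraction of rows scored correct.

    Assumed conventions: ERROR and WRONG score zero, and an ABSTAIN label
    counts only when the question really is an abstention question.
    """
    correct = 0
    for row in rows:
        if row["label"] == "CORRECT":
            correct += 1
        elif row["label"] == "ABSTAIN" and row["abstention_q"] and credit_abstain:
            correct += 1
        # WRONG, ERROR, and ABSTAIN on answerable questions add nothing.
    return correct / len(rows)


# Toy rows, not harness output.
rows = (
    [{"label": "CORRECT", "abstention_q": False}] * 5
    + [{"label": "WRONG", "abstention_q": False}] * 2
    + [{"label": "ERROR", "abstention_q": False}] * 1
    + [{"label": "ABSTAIN", "abstention_q": True}] * 2
)
print(accuracy(rows, credit_abstain=False))  # 0.5
print(accuracy(rows, credit_abstain=True))   # 0.7
```

With `credit_abstain=False` the absolute score drops, and it does so by the same amount for any two runs with the same number of correct abstentions, which is how an aggregator that never credits them can understate absolutes while leaving deltas intact.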

Source · baselines/reader_sweep_passA_canonical-judge_{search-default,age-fused}_2026-05-29.json · docs/benchmarks/2026-05-29-canonical-judge-passA.md techempower-org/multipass-structural-memory-eval#116
