MemBench v0.1

Open benchmark for AI memory systems: 20 tasks across 5 categories. Can your memory layer beat the in-context baseline?

Overall Leaderboard
2 systems tested
#1  In-Context (Baseline) — Official
    Overall: 58% (10/20 passed)
    Recall: 100% · Temporal: 75% · Contradiction: 100% · Multi-Session: 0% · Efficiency: 0%

#2  No Memory (Floor) — Official
    Overall: 0% (0/20 passed)
    Recall: 0% · Temporal: 0% · Contradiction: 0% · Multi-Session: 0% · Efficiency: 0%

The 42.1% Gap

In-context memory hits a ceiling at 57.9%. It can't persist across sessions, can't scale beyond the context window, and can't do intelligent decay. The remaining 42.1% requires a dedicated memory layer — persistent storage, cross-session recall, thermodynamic prioritisation, and efficient retrieval at scale. That's the territory Sulcus is built for.

Run it yourself

# Clone and run
git clone https://github.com/digitalforgeca/sulcus.git
cd sulcus/packages/membench

# Baselines (no API keys needed)
python -m membench --adapter no-memory
python -m membench --adapter in-context

# Test your memory system
python -m membench --adapter sulcus --api-key sk-...
python -m membench --adapter mem0 --api-key ...
python -m membench --adapter openai --api-key ...

# Filter by category
python -m membench --adapter sulcus --api-key sk-... --categories recall temporal
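
To plug your own system into the `--adapter` flag you need some adapter object the harness can call. The page does not show MemBench's actual adapter API, so the following is an illustrative sketch only: the class name, the `store`/`recall` method names, and the session-keyed layout are all assumptions, not MemBench's real interface.

```python
# Hypothetical sketch only — MemBench's real adapter API may differ.
# Shows the shape of a minimal memory backend: store facts per session,
# recall them later by query.
from dataclasses import dataclass, field


@dataclass
class DictMemoryAdapter:
    """Toy adapter: keeps facts per session in a plain dict."""
    memory: dict = field(default_factory=dict)

    def store(self, session_id: str, text: str) -> None:
        # Append the fact to this session's memory list.
        self.memory.setdefault(session_id, []).append(text)

    def recall(self, session_id: str, query: str) -> list[str]:
        # Naive retrieval: return stored facts sharing any word with the query.
        words = set(query.lower().split())
        return [t for t in self.memory.get(session_id, [])
                if words & set(t.lower().split())]


adapter = DictMemoryAdapter()
adapter.store("s1", "The user's cat is named Miso")
print(adapter.recall("s1", "cat name"))  # the stored fact matches on "cat"
```

A real adapter would swap the word-overlap lookup for whatever retrieval your memory layer does; the cross-session and efficiency categories above are exactly where a toy dict like this falls over.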

MemBench is open-source; submit your results via PR. The task set deliberately includes cases the baselines fail, to keep the scores credible.