MemBench v0.1
Open benchmark for AI memory systems. 20 tasks across 5 categories. Can your memory layer beat in-context?
Overall Leaderboard
2 systems tested

#1 In-Context (Baseline) · Official
Overall: 58% (10/20 passed)
Recall: 100%
Temporal: 75%
Contradiction: 100%
Multi-Session: 0%
Efficiency: 0%
#2 No Memory (Floor) · Official
Overall: 0% (0/20 passed)
Recall: 0%
Temporal: 0%
Contradiction: 0%
Multi-Session: 0%
Efficiency: 0%
The 42.1% Gap
In-context memory hits a ceiling at 57.9%. It can't persist across sessions, can't scale beyond the context window, and can't do intelligent decay. The remaining 42.1% requires a dedicated memory layer — persistent storage, cross-session recall, thermodynamic prioritisation, and efficient retrieval at scale. That's the territory Sulcus is built for.
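To make the capabilities above concrete, here is a minimal sketch of what a dedicated memory layer does differently from in-context memory: it persists facts to durable storage, recalls them in a later session, and down-weights stale entries via exponential decay. This is an illustrative toy, not MemBench's adapter API or Sulcus's implementation; the class name, methods, and keyword-overlap scoring are all assumptions made for the example.

```python
import math
import sqlite3
import time


class PersistentMemory:
    """Toy memory layer (hypothetical, not the MemBench adapter API):
    durable storage + cross-session recall + recency decay."""

    def __init__(self, path=":memory:"):
        # A file path here survives process restarts, i.e. sessions.
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS memories "
            "(id INTEGER PRIMARY KEY, text TEXT, created REAL)"
        )

    def store(self, text, now=None):
        now = time.time() if now is None else now
        self.db.execute(
            "INSERT INTO memories (text, created) VALUES (?, ?)", (text, now)
        )
        self.db.commit()

    def retrieve(self, query, k=3, half_life=86400.0, now=None):
        """Score = keyword overlap x exponential recency decay.
        Newer memories outrank older ones with equal overlap."""
        now = time.time() if now is None else now
        q = set(query.lower().split())
        scored = []
        for text, created in self.db.execute("SELECT text, created FROM memories"):
            overlap = len(q & set(text.lower().split()))
            decay = math.exp(-(now - created) * math.log(2) / half_life)
            if overlap:
                scored.append((overlap * decay, text))
        return [t for _, t in sorted(scored, reverse=True)[:k]]
```

With a real file path instead of `:memory:`, two separate processes sharing the database get cross-session recall for free; the decay term is the simplest possible stand-in for what the page calls thermodynamic prioritisation.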
Run it yourself
```shell
# Clone and run
git clone https://github.com/digitalforgeca/sulcus.git
cd sulcus/packages/membench

# Baselines (no API keys needed)
python -m membench --adapter no-memory
python -m membench --adapter in-context

# Test your memory system
python -m membench --adapter sulcus --api-key sk-...
python -m membench --adapter mem0 --api-key ...
python -m membench --adapter openai --api-key ...

# Filter by category
python -m membench --adapter sulcus --api-key sk-... --categories recall temporal
```
MemBench is open-source; submit your results via pull request. The suite includes tasks that official baselines deliberately fail, which keeps the leaderboard credible.