MemBench v0.1

Open benchmark for AI memory systems: 20 tasks across 5 categories. Can your memory layer beat the in-context baseline?

Overall Leaderboard
2 systems tested
#1  In-Context (Baseline) — Official
    Overall: 58% (10/20 passed)
    Recall: 100% · Temporal: 75% · Contradiction: 100% · Multi-Session: 0% · Efficiency: 0%

#2  No Memory (Floor) — Official
    Overall: 0% (0/20 passed)
    Recall: 0% · Temporal: 0% · Contradiction: 0% · Multi-Session: 0% · Efficiency: 0%

The 42.1% Gap

In-context memory hits a ceiling at 57.9%. It can't persist across sessions, can't scale beyond the context window, and can't do intelligent decay. The remaining 42.1% requires a dedicated memory layer — persistent storage, cross-session recall, thermodynamic prioritisation, and efficient retrieval at scale. That's the territory Sulcus is built for.

Run it yourself

# Clone and run
git clone https://github.com/digitalforgeca/sulcus.git
cd sulcus/packages/membench

# Baselines (no API keys needed)
python -m membench --adapter no-memory
python -m membench --adapter in-context

# Test your memory system
python -m membench --adapter sulcus --api-key sk-...
python -m membench --adapter mem0 --api-key ...
python -m membench --adapter openai --api-key ...

# Filter by category
python -m membench --adapter sulcus --api-key sk-... --categories recall temporal
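
To plug your own system into the `--adapter` flag you need some adapter object the harness can call. The page does not show MemBench's actual adapter API, so the following is an illustrative sketch only: the class name, the `store`/`recall` method names, and the session-keyed layout are all assumptions, not MemBench's real interface.

```python
# Hypothetical sketch only — MemBench's real adapter API may differ.
# Shows the shape of a minimal memory backend: store facts per session,
# recall them later by query.
from dataclasses import dataclass, field


@dataclass
class DictMemoryAdapter:
    """Toy adapter: keeps facts per session in a plain dict."""
    memory: dict = field(default_factory=dict)

    def store(self, session_id: str, text: str) -> None:
        # Append the fact to this session's memory list.
        self.memory.setdefault(session_id, []).append(text)

    def recall(self, session_id: str, query: str) -> list[str]:
        # Naive retrieval: return stored facts sharing any word with the query.
        words = set(query.lower().split())
        return [t for t in self.memory.get(session_id, [])
                if words & set(t.lower().split())]


adapter = DictMemoryAdapter()
adapter.store("s1", "The user's cat is named Miso")
print(adapter.recall("s1", "cat name"))  # the stored fact matches on "cat"
```

A real adapter would swap the word-overlap lookup for whatever retrieval your memory layer does; the cross-session and efficiency categories above are exactly where a toy dict like this falls over.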

MemBench is open-source; submit your results via PR. The task set deliberately includes cases the baselines fail, to keep the scores credible.