Hi,
LLM agents are brilliant in the moment and amnesiac by design.
You explain your stack, your constraints, your decisions — then open a new chat and do it all again.
Mnemostroma is my attempt to fix that without changing how you work.
It's a local daemon that sits between you and your agents. It watches the conversation I/O silently, decides what's worth keeping, compresses it into structured memory, and surfaces it back when it's relevant. You never call "save". You never write a prompt to recall something. The agent just... knows.
What's unusual about the design:
The agent only reads memory — it never writes it. All observation, classification, and storage happen in a separate pipeline running in the background. This turned out to be a surprisingly important constraint: it means the memory layer is completely decoupled from the agent's behavior and can't be "confused" by the model into storing garbage.
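A minimal sketch of that split, under my own assumptions — `AgentMemoryView`, `ObserverPipeline`, and `MemoryStore` are illustrative names, not Mnemostroma's actual classes. The point is structural: the object handed to the agent exposes only a read path, while the write path lives entirely in the background pipeline.

```python
# Illustrative sketch of "agent reads, Observer writes" (not the real API).

class MemoryStore:
    """Shared storage; only the Observer holds a writable handle."""
    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)

    def search(self, query):
        # Placeholder relevance check; the real system uses embeddings.
        return [r for r in self._records if query.lower() in r.lower()]


class AgentMemoryView:
    """Read-only facade handed to the agent: no write methods exist."""
    def __init__(self, store):
        self._search = store.search  # capture only the read path

    def recall(self, query):
        return self._search(query)


class ObserverPipeline:
    """Background writer: watches I/O and decides what to persist."""
    def __init__(self, store):
        self._store = store

    def observe(self, message):
        if self._worth_keeping(message):
            self._store.append(message)

    def _worth_keeping(self, message):
        # Stand-in heuristic; the real pipeline classifies content.
        return len(message) > 20


store = MemoryStore()
observer = ObserverPipeline(store)
agent_view = AgentMemoryView(store)

observer.observe("Decision: we use SQLite WAL mode for persistence.")
observer.observe("ok")                    # too short, filtered out
print(agent_view.recall("sqlite"))        # the agent can read...
print(hasattr(agent_view, "append"))      # ...but has no write path: False
```

Because `AgentMemoryView` never receives a reference to `append`, a confused or adversarial model output can't pollute the store — only the Observer's own classifier decides what gets written.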
Under the hood:
Dual-stream async pipeline (Observer + Content), RAM-first index, SQLite WAL persistence. Five memory layers with gradual decay — important decisions stay, low-value noise fades. Semantic retrieval via numpy matmul over ONNX INT8 embeddings, ~20 ms. No torch. No transformers. No cloud. No Docker. ~420 MB RAM baseline.
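The retrieval-plus-decay idea can be sketched in plain numpy. Everything here is an assumption for illustration — the embedding dimension, half-life constant, and random vectors are made up, and the real system uses ONNX INT8 embedding models — but the shape of the computation is the same: pre-normalize the index once, then a single matmul yields cosine similarities, damped by an exponential age decay so stale entries rank lower.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy memory index: 1000 stored embeddings, dim 384 (dimension is an assumption).
memories = rng.standard_normal((1000, 384)).astype(np.float32)
memories /= np.linalg.norm(memories, axis=1, keepdims=True)  # normalize once

ages_days = rng.uniform(0, 30, size=1000)   # age of each memory, in days
half_life = 14.0                            # illustrative decay constant
decay = 0.5 ** (ages_days / half_life)      # older memories score lower

def retrieve(query_vec, k=5):
    q = query_vec / np.linalg.norm(query_vec)
    scores = memories @ q                   # one matmul = all cosine sims
    scores *= decay                         # gradual-decay weighting
    top = np.argpartition(scores, -k)[-k:]  # top-k, unordered
    return top[np.argsort(scores[top])[::-1]]  # indices, best first

query = rng.standard_normal(384).astype(np.float32)
print(retrieve(query))
```

At these sizes the matmul is a fraction of a millisecond; the ~20 ms figure quoted above presumably includes embedding the query itself.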
Try it today:
pip install "git+https://github.com/GG-QandV/mnemostroma.git"
mnemostroma setup # downloads ~300 MB ONNX models, generates TLS cert
mnemostroma on
mnemostroma status
Connects to Claude Desktop, Claude Code, Cursor, Windsurf, Zed and anything else that speaks MCP. There's also a passthrough proxy mode for Claude Code — you launch your IDE through a wrapper, the Observer starts capturing without touching your workflow.
Status: v1.8.1 beta. 400+ tests passing. Not on PyPI yet (git install only). API surface is stabilizing; breaking changes are unlikely but possible.
Privacy: everything lives in ~/.mnemostroma as plain SQLite. Local-only logging subsystem for latency/diagnostics — can be disabled or wiped anytime. Nothing leaves your machine.
A few things I'm genuinely unsure about and would love input on:
- The ~420 MB RAM footprint (400–650 MB in practice) for a background daemon — dealbreaker for you, or fine?
- The "agent reads, Observer writes" split — does this feel right, or would you want the agent to be able to annotate its own memory?
- Which integration matters most to you: VS Code, Cursor, a standalone CLI, something else?
- What's your biggest fear about persistent agent memory — wrong recalls? Stale decisions? Privacy?
I'm in the thread. Happy to go deep on architecture, share internals, or hear "this is over-engineered and here's why."
If you run it and something breaks — tell me. There's detailed local telemetry and I'd rather tune against real usage than synthetic tests.
This article was originally published by DEV Community and written by Yevhenii.