We run a lot of Claude Code sessions. At any given time there might be six or eight open — different repos, different tasks, sometimes different agents working in parallel. Each session spawns its own MCP server processes for the tools it uses. For most tools, this is fine. For Quarry, which loads a 200MB ONNX embedding model into memory, it meant six copies of the same model sitting in RAM doing nothing.
That’s the story of how we ended up writing a Go proxy, then discovered the proxy was solving the wrong problem, and eventually arrived at an architecture that works for both cases.
The problem has three faces
Memory multiplication. Quarry loads snowflake-arctic-embed-m-v1.5 (200MB) on startup. Ten sessions means 2GB of duplicate embedding models. The LanceDB index is opened separately in each process too.
Resource contention. Vox plays audio through the machine’s speakers. When Biff broadcasts a /wall message, every session’s independent Vox process synthesizes and plays the same announcement simultaneously — a chorus of identical voices. A machine has one pair of speakers; it should have one audio process.
The hook timing budget. Claude Code hooks (PreToolUse, SessionStart) have roughly 100ms before they feel sluggish. Our Python tools were taking 1.5–4.7 seconds to cold-start because of import trees. Biff’s hook script called the full CLI, which imported typer, nats, pydantic, and fastmcp before reaching the handler — which only needed stdlib [1]. Quarry’s session-start hook imported lancedb (16.2s on first load), onnxruntime, pymupdf, and beautifulsoup4 — but the handler only needed sqlite3 and subprocess.
What we tried first: thin Python clients
The obvious fix was to separate the CLI’s import tree so hooks could avoid the heavy dependencies. We tried this across several projects.
It didn’t work well. Python’s import system is infectious: importing one module that imports another that imports a heavy dependency pulls the whole tree in. We documented the pattern across projects in our hooks standard. The three-layer fix is a _stdlib.py module with only standard-library imports, a lightweight entry point, and a lazy __init__.py that defers heavy imports. Biff got its hook time from 3.7s down to 0.29s this way [2].
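The three-layer split can be sketched end to end. Everything below is illustrative, not Biff’s actual code: the package name (mytool), the handler’s shape, and the module contents are assumptions; the mechanism (a module-level __getattr__ per PEP 562) is the real one.

```python
import sys
import tempfile
import textwrap
from pathlib import Path

# Build a throwaway package on disk to demonstrate the layering.
pkg = Path(tempfile.mkdtemp()) / "mytool"
pkg.mkdir()

# Layer 3: lazy __init__.py (PEP 562). Heavy imports happen only when
# the attribute is actually requested, so hook code never pays for them.
(pkg / "__init__.py").write_text(textwrap.dedent("""
    def __getattr__(name):
        if name == "cli":
            from . import cli  # would pull in typer, nats, pydantic, ...
            return cli
        raise AttributeError(name)
"""))

# Layer 1: _stdlib.py, the hook handler with standard-library-only imports.
(pkg / "_stdlib.py").write_text(
    "def handle_hook(payload):\n"
    "    return {'ok': True, 'event': payload.get('hook_event_name')}\n"
)

# Stand-in for the heavy CLI module.
(pkg / "cli.py").write_text("HEAVY = True\n")

sys.path.insert(0, str(pkg.parent))

# Layer 2: the lightweight entry point imports only the stdlib layer.
from mytool._stdlib import handle_hook

assert "mytool.cli" not in sys.modules  # heavy layer never loaded
print(handle_hook({"hook_event_name": "SessionStart"}))
```

The point of the lazy __init__.py is that `import mytool` alone stays cheap; the heavy tree loads only when someone actually touches `mytool.cli`.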
But that only solved the hook problem. The MCP server process itself still loaded everything, and there was still one process per session. We needed a way to share expensive state across sessions.
The proxy pattern
In March 2026 we built mcp-proxy — a Go binary (~6MB, <10ms startup) that sits between Claude Code and a shared daemon process:
Claude Code ←── stdio ──→ mcp-proxy ←── WebSocket ──→ daemon (one process)
Claude Code thinks it’s talking to an MCP server over stdio. The proxy forwards every message, unmodified, to a single daemon via WebSocket. Three sessions share one daemon — one embedding model, one audio device, one NATS connection.
The proxy doesn’t parse MCP messages; they pass through as opaque bytes. This means the proxy works with any MCP server that speaks WebSocket — we didn’t have to build separate proxies for each tool.
Why WebSocket and not Unix sockets or HTTP? We considered both [3]. HTTP is request-response only — no server-initiated push. When Biff’s tool list changes (a new session joins), the daemon needs to notify all connected clients via tools/list_changed. Unix domain sockets work but require hand-rolled framing and keepalive. WebSocket gives us RFC 6455 framing, built-in ping/pong liveness detection, and bidirectional push — all standard, all tested.
What the proxy actually fixed
Quarry adopted mcp-proxy immediately. Ten sessions now share one embedding model (~220MB total instead of 2GB). The daemon opens LanceDB once; per-session state (which database is selected) is tracked via session keys passed on the WebSocket upgrade URL.
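Keying per-session state off the upgrade URL can be sketched in a few lines. The `session` query-parameter name and the `selected_db` field below are assumptions for illustration, not Quarry’s actual wire format:

```python
from urllib.parse import parse_qs, urlparse

# One shared daemon, many sessions: state lives in a dict keyed by a
# session id carried in the WebSocket upgrade URL.
sessions: dict[str, dict] = {}

def state_for(upgrade_url: str) -> dict:
    query = parse_qs(urlparse(upgrade_url).query)
    key = query.get("session", ["default"])[0]
    return sessions.setdefault(key, {"selected_db": None})

a = state_for("ws://127.0.0.1:8765/mcp?session=abc123")
a["selected_db"] = "notes"
# The same session key on a later connection sees the same state...
assert state_for("ws://127.0.0.1:8765/mcp?session=abc123")["selected_db"] == "notes"
# ...while a different session gets its own.
assert state_for("ws://127.0.0.1:8765/mcp?session=def456")["selected_db"] is None
print("session state ok")
```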
Vox adopted it next. One daemon, one audio output, no duplicate synthesis. When Biff broadcasts, the Vox daemon synthesizes once and plays once. The 5-second deduplication window (DaemonContext.should_play) catches duplicate notifications that arrive from different sessions within the same wall broadcast [4].
For hooks, the proxy’s --hook mode sends one-shot JSON-RPC messages to the daemon in about 15ms — comfortably within the 100ms budget. No Python imports at all.
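A one-shot hook delivery is just a single newline-delimited JSON-RPC 2.0 message on the wire; building it needs no imports beyond json. The `hooks/<event>` method name below is hypothetical, not the proxy’s actual protocol:

```python
import json

def hook_message(event: str, payload: dict) -> bytes:
    """Encode one hook event as a JSON-RPC 2.0 notification (no id,
    so no response is expected)."""
    msg = {"jsonrpc": "2.0", "method": f"hooks/{event}", "params": payload}
    return (json.dumps(msg) + "\n").encode()

line = hook_message("SessionStart", {"cwd": "/repo", "session_id": "abc123"})
decoded = json.loads(line)
assert decoded["method"] == "hooks/SessionStart"
print(line.decode(), end="")
```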
The plot twist
Then something surprising happened. After shipping the Vox daemon with mcp-proxy, we spent eight rounds fixing path resolution bugs. The daemon needed to know each session’s working directory to find .vox/config.md. We tried resolving the CWD from the session PID via lsof, then looking for config files relative to the project root. Every piece of this chain broke.
The root cause wasn’t a collection of implementation defects; it was architectural ambition. The daemon was trying to be “project-aware” across sessions, maintaining per-connection state via Python ContextVars. This was the wrong boundary.
Vox’s v3 architecture [5] took a different approach: make the MCP server so lightweight that spawning one per session is cheap. The per-session server handles config resolution (walk up from CWD, find .vox/config.md) and validation. It calls the daemon only for the expensive operations — TTS synthesis and audio playback. The daemon knows nothing about projects, sessions, or Claude Code. It accepts text and parameters, and it speaks.
This eliminated mcp-proxy from Vox entirely. No Go binary, no WebSocket bridge, no class of “MCP session doesn’t survive daemon restart” bugs. Memory dropped from ~100MB (daemon + proxy + MCP servers) to ~30MB (daemon + lightweight MCP servers).
Where we landed: two patterns for two problems
The MCP singleton problem turns out to have two distinct solutions, depending on what’s expensive:
Pattern 1: Proxy bridge (Quarry). When the expensive resource is the MCP server itself — an embedding model that takes seconds to load and hundreds of megabytes to hold — the proxy pattern is correct. One daemon, many sessions, shared state. mcp-proxy handles the plumbing.
Pattern 2: Lightweight client + singleton service (Vox). When the expensive resource is a downstream service (audio output, display server) but the MCP logic itself is cheap, skip the proxy. Make each session’s MCP server lightweight enough that per-session spawning is fine, and have it call the singleton service directly.
The distinction is about where the state boundary falls. Quarry’s state is the embedding model and the LanceDB index — that’s the MCP server’s core function, and sharing it across sessions is the whole point. Vox’s state is the audio device — that’s downstream of the MCP server, and the MCP server doesn’t need to share anything.
What we don’t know yet
We haven’t tested this at scale beyond a single developer machine with 10–15 sessions. The memory savings are real and measurable, but we don’t know how the WebSocket connection pool behaves under heavier concurrent load. The mcp-proxy formal specification covers the state machine (6 states, 43 transitions verified by ProB [6]), but the specification doesn’t model resource contention under load.
We also don’t know whether the two-pattern split is stable. It’s possible that as tools get more complex, a hybrid emerges where the proxy handles some sessions and direct connections handle others. We haven’t needed that yet.
References
[1] Punt Labs. “Hook Import Tax — Lightweight Entry Point.” Biff DESIGN.md, DES-028. 2026. github.com
[2] Punt Labs. “Hook Startup Performance.” punt-kit hooks standard, §12. 2026. github.com
[3] Punt Labs. “Transport — WebSocket.” mcp-proxy DESIGN.md, DES-001. 2026. github.com
[4] Punt Labs. “Daemon Mode — Single Process with mcp-proxy.” Vox DESIGN.md, DES-021. 2026. github.com
[5] Punt Labs. “Vox v3 — Audio Server Architecture.” Vox DESIGN.md, DES-028. 2026. github.com
[6] Punt Labs. “mcp-proxy Z Specification.” mcp-proxy docs/mcp-proxy.tex. 2026. github.com