LangGraph memory footprint grows unbounded on long-running graphs
Long LangGraph runs allocate 2–3× more memory than the logical state size because deepcopy builds a complete parallel object graph on every checkpoint, and GC lags behind the allocation rate. Teams report peak RSS that doesn't match their mental model of state size. The fix is to stop allocating those intermediate objects at all, which is exactly what RustSQLiteCheckpointer does.
The symptom
You run your LangGraph workload. Peak memory on a fresh start is comfortable — maybe 200 MB. You let it run for an hour of production traffic and check again: 600 MB. Your logical state size hasn’t grown proportionally. You suspect a leak. You dig in with tracemalloc and gc.get_objects() and can’t find one.
There isn’t a leak, not in the traditional sense. The memory is being freed eventually. It’s just being allocated faster than Python’s garbage collector can reclaim it.
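The "no leak, just churn" pattern is easy to reproduce: allocate checkpoint-style garbage in a loop and compare tracemalloc's current and peak counters. A minimal sketch, using an invented state shape (not LangGraph's real schema):

```python
import copy
import tracemalloc

# Hypothetical stand-in for checkpointed agent state: many small containers.
state = {"messages": [{"role": "user", "content": f"msg {i}"} for i in range(50_000)]}

tracemalloc.start()
for _ in range(5):
    snapshot = copy.deepcopy(state)  # what each checkpoint effectively does
    del snapshot                     # becomes garbage immediately
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

# "Live" memory returns to near baseline -- no leak -- but peak
# counted a full parallel copy of the state's containers.
print(f"current: {current / 1e6:.1f} MB, peak: {peak / 1e6:.1f} MB")
```

The gap between `current` and `peak` is exactly the transient allocation this article is about: it never shows up in a leak hunt, only in peak RSS.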
The root cause
Every time LangGraph checkpoints state, copy.deepcopy() walks the state structure and allocates a brand-new parallel object graph:
- New dict headers for each dict in the state
- New list buffers for each list
- New copies of all the nested strings, numbers, custom objects
- New memo dict to track already-seen references for cycle handling
All of this then gets serialized (another allocation pass) and the deep-copied object graph becomes garbage. The GC will clean it up, but not before it’s already been counted in your peak RSS.
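Each of those allocations is observable from Python. A small sketch with a toy state (not LangGraph's actual schema), passing our own memo dict so the cycle-handling bookkeeping is visible too:

```python
import copy

state = {"messages": [["hello", "world"]], "meta": {"step": 1}}
state["self_ref"] = state        # cycles are legal in Python state

memo = {}                        # the memo dict deepcopy uses for cycle handling
snapshot = copy.deepcopy(state, memo)

# Every container in the copy is a brand-new object:
assert snapshot is not state
assert snapshot["messages"] is not state["messages"]
assert snapshot["messages"][0] is not state["messages"][0]
assert snapshot["meta"] is not state["meta"]
# The cycle was reproduced via the memo instead of recursing forever:
assert snapshot["self_ref"] is snapshot
# The memo now carries an entry per already-seen object:
assert len(memo) > 0
```

Note that deepcopy leaves immutable atoms (strings, ints) shared; the cost is dominated by the new dict and list containers, which is precisely what the bullet list above enumerates.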
At a steady-state allocation rate of, say, 50 MB/sec (not unusual for non-trivial state checkpointed 20 times per second), you’ll see peak memory sit at 2–3× the actual logical state footprint. The difference is just “stuff waiting to be collected.”
Why GC can’t fix it
CPython frees most objects immediately through reference counting and runs a generational collector only for the rest. It’s efficient when given time. Under high allocation pressure — which is exactly what deepcopy on every checkpoint creates — it can’t keep up:
- Reference counts tick down immediately but only free objects when they hit zero
- Generational collection runs periodically, not on every allocation
- Bursts of temporary container objects linger in the youngest generation until the next collection pass frees them
The result is a sawtooth memory profile where peak RSS is substantially higher than mean “live” memory. Your monitoring dashboards show the peaks. Your OOM killer sees the peaks. Your bill is based on the peaks.
Who sees this most
- High-QPS agent services where checkpointing happens frequently
- Long-context conversations where state has accumulated significant size
- Workloads on constrained containers where a 2× memory headroom becomes the difference between stable and OOM-killed
- Multi-tenant deployments where several agents share a process and compete for RAM
What doesn’t help
- Calling gc.collect() manually — adds CPU without fixing the root issue
- Smaller heap sizing — invites OOMs instead of fixing them
- Per-tenant process isolation — masks the issue with more containers; doesn’t reduce per-tenant overhead
- Switching to MemorySaver — removes disk I/O but keeps the deepcopy allocation pattern
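The first item is worth quantifying: a manual gc.collect() must traverse every live tracked container, so its cost scales with heap size, while the deepcopy garbage it hoped to reclaim was mostly freed by reference counting already. A quick illustration (the heap shape is invented):

```python
import gc
import time

# A large *live* heap, as long-running accumulated agent state tends to be.
live = [{"i": i} for i in range(500_000)]

t0 = time.perf_counter()
freed = gc.collect()   # full pass over every live tracked container
dt = time.perf_counter() - t0

# CPU was spent proportional to live heap size; almost nothing here is
# cyclic garbage, so almost nothing comes back.
print(f"gc.collect() took {dt * 1000:.0f} ms, found {freed} unreachable objects")
```

Calling this on every checkpoint adds a heap-sized traversal to your hot path without lowering the allocation rate that creates the peaks.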
What actually helps
Stop allocating the intermediate object graph. The Rust checkpointer walks the state structure once, directly into a reusable byte buffer. The buffer gets reused across checkpoints (amortizing even its own allocation cost). No parallel object graph gets built. The allocation rate that was causing your peak memory problem simply doesn’t exist.
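The actual checkpointer is Rust, but the allocation argument can be sketched in pure Python: serialize straight into a reused buffer and compare peak traced memory against the deepcopy-then-serialize pattern. Here pickle is a stand-in serializer and the state shape is invented; neither is fast-langraph's real format.

```python
import copy
import io
import pickle
import tracemalloc

# Invented container-heavy state; deepcopy must clone every inner list.
state = {"rows": [[i, i + 1, i + 2] for i in range(100_000)]}

buf = io.BytesIO()  # reused across checkpoints, amortizing its own cost

def checkpoint_direct(state):
    buf.seek(0)
    buf.truncate()
    pickle.dump(state, buf)      # one walk over state, straight into bytes
    return buf.getvalue()

def checkpoint_deepcopy(state):
    return pickle.dumps(copy.deepcopy(state))  # parallel object graph first

tracemalloc.start()
blob_a = checkpoint_direct(state)
_, peak_direct = tracemalloc.get_traced_memory()
tracemalloc.stop()

tracemalloc.start()
blob_b = checkpoint_deepcopy(state)
_, peak_copy = tracemalloc.get_traced_memory()
tracemalloc.stop()

assert blob_a == blob_b  # identical snapshot bytes either way
print(f"peak direct: {peak_direct / 1e6:.1f} MB, "
      f"peak deepcopy-first: {peak_copy / 1e6:.1f} MB")
```

Both paths produce the same serialized bytes; the difference is entirely in the intermediate object graph that the direct path never builds.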
We’ve measured 20–40% peak memory reduction on realistic workloads after swapping in RustSQLiteCheckpointer. The savings are driven almost entirely by the eliminated intermediate allocations — the serialized bytes themselves are the same size regardless of who produces them.
Adoption
See the guide. One-line drop-in. Same on-disk format, no migration. Reversible.
Related
- Checkpoint serialization overhead — the related CPU pain point, same root cause
- Why deepcopy kills LangGraph — the architectural story
How fast-langraph addresses this
RustSQLiteCheckpointer serializes state into a reusable byte buffer instead of allocating a fresh Python object graph per checkpoint. Memory allocation rate drops sharply — we typically measure 20–40% lower peak memory on realistic workloads, driven almost entirely by skipping deepcopy's intermediate objects.