LangGraph checkpoint serialization overhead on large state
LangGraph checkpoints serialize state via Python's deepcopy on every super-step. For any non-trivial agent state (30 KB and up), this becomes the dominant cost of running the graph. Teams report p95 latency doubling or tripling as state grows. fast-langgraph's RustSQLiteCheckpointer is a drop-in replacement that eliminates the bottleneck.
The pain
If you’re running LangGraph in production with persistent checkpointing enabled — which is the whole point of production LangGraph — you’ve probably noticed something unsettling: the longer your agent conversations get, the slower each new step becomes. Not a little slower. A lot slower. On a 10-turn conversation, each new turn might be snappy; by the 50th turn, each step is taking hundreds of milliseconds longer than the one before.
What’s going on: the state dict LangGraph is checkpointing keeps growing. Messages accumulate. Tool outputs get stored. Intermediate reasoning scratchpads pile up. Every super-step, all of that state gets passed through copy.deepcopy() before being written to your checkpointer.
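The effect is easy to reproduce without LangGraph at all. The sketch below uses a synthetic state dict (the field names are illustrative, not LangGraph's schema) that accumulates messages and tool outputs turn by turn, timing the deepcopy that the checkpointing path effectively performs each super-step:

```python
import copy
import time

# Synthetic agent-like state that grows every turn, mimicking how
# messages and tool outputs accumulate in a real graph's state dict.
state = {"messages": [], "tool_outputs": []}

for turn in range(1, 51):
    state["messages"].append({"role": "user", "content": "x" * 500})
    state["tool_outputs"].append({"turn": turn, "result": "y" * 1000})

    # What checkpointing effectively does on every super-step.
    start = time.perf_counter()
    snapshot = copy.deepcopy(state)
    elapsed_ms = (time.perf_counter() - start) * 1000

    if turn % 10 == 0:
        print(f"turn {turn:2d}: {len(state['messages'])} messages, deepcopy {elapsed_ms:.3f} ms")
```

The per-turn deepcopy time climbs steadily because each turn adds more objects for the copy to walk.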
Why it happens
LangGraph’s checkpointing contract guarantees that each checkpoint is an independent, self-contained snapshot of state. To provide that guarantee safely, the state is deep-copied before persistence — otherwise a downstream node mutating a shared dict could corrupt a stored checkpoint.
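A small demonstration of the hazard the deep copy guards against, using a toy state dict rather than LangGraph code:

```python
import copy

# Without a defensive copy, a stored "checkpoint" merely aliases live
# state, and a later mutation silently rewrites history.
state = {"messages": ["hello"]}

aliased_snapshot = state                  # no copy: just another reference
isolated_snapshot = copy.deepcopy(state)  # independent snapshot

state["messages"].append("world")         # a downstream node mutates state

print(aliased_snapshot["messages"])   # ['hello', 'world'] -- corrupted
print(isolated_snapshot["messages"])  # ['hello'] -- still a true snapshot
```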
Deep-copying is correct. It is also, in Python, very expensive:
- The recursion walks every object in the state graph
- Each node allocates a new Python object (dict header, list buffer, etc.)
- Each allocation is a GIL cycle and a heap hit
- The whole walk runs in interpreted bytecode
For a small state (a few KB), deepcopy takes microseconds and you never notice. For a 235 KB agent state — messages, embeddings, tool outputs, scratchpads — deepcopy takes 206 ms. Per checkpoint.
On a graph that checkpoints 50 times per invocation, that’s 10 seconds of wall clock burned on serialization alone. Not LLM calls. Not tool execution. Not node work. Just deepcopy.
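You can check the per-object scaling on your own machine with a short micro-benchmark. The state shape below is synthetic and the absolute numbers will differ from the table in this article; the point is that deepcopy time grows roughly in proportion to the number of objects in the state:

```python
import copy
import time

def make_state(n_items: int) -> dict:
    # Many small nested objects: the worst case for deepcopy's
    # per-object recursion.
    return {
        "messages": [{"role": "ai", "content": "token " * 50} for _ in range(n_items)],
        "scratchpad": [[i, i * 2, str(i)] for i in range(n_items)],
    }

for n in (10, 100, 1000):
    state = make_state(n)
    start = time.perf_counter()
    copy.deepcopy(state)
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"{n:4d} items: deepcopy {elapsed_ms:.2f} ms")
```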
Who hits this first
Teams running:
- Multi-turn agent conversations where the history accumulates in state
- RAG pipelines that cache retrieved chunks in the state dict
- Tool-use agents that collect observations across many steps
- Reflection loops that maintain a scratchpad of self-critique
Anything where state is roughly monotonically growing over a graph’s lifetime. The 10-turn graph that was fine in dev becomes the 100-turn graph that times out in prod.
The symptoms to look for
- p95 latency grows sharply as conversation length or state size grows
- tracemalloc shows large allocation rates during graph execution
- Profiling shows significant wall clock inside copy.deepcopy or the checkpointer's put method
- Memory footprint balloons on long-running invocations
If any of these match your workload, this pain point is the most likely cause.
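One way to confirm the diagnosis is to profile a step and filter for deepcopy. The `run_step` function below is a hypothetical stand-in; profile your real `graph.invoke(...)` call the same way:

```python
import cProfile
import copy
import io
import pstats

# Hypothetical stand-in for one graph step that checkpoints state.
def run_step(state):
    return copy.deepcopy(state)

state = {"messages": [{"content": "x" * 1000} for _ in range(300)]}

profiler = cProfile.Profile()
profiler.enable()
for _ in range(10):
    run_step(state)
profiler.disable()

# If deepcopy dominates cumulative time, this pain point is your cause.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats("deepcopy")
print(report.getvalue())
```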
Workarounds that don’t solve it
- Pickling with protocol 5: marginal improvement, still Python-bound
- Custom __copy__ methods: help for specific classes but not the overall pattern
- MemorySaver instead of SQLite: moves the cost but doesn't eliminate it; deepcopy still runs
- Smaller checkpoint retention: reduces total storage but not per-step cost
- Schema trimming: helps if you can cut state size, but often you can’t without losing functionality
All of these are legitimate optimizations. None of them close the order-of-magnitude gap.
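To quantify the first workaround on your own state, you can compare a pickle-protocol-5 round-trip against deepcopy directly. The state below is synthetic; on typical machines the pickle path is faster but stays within the same order of magnitude, since both run in the interpreter:

```python
import copy
import pickle
import time

state = {"messages": [{"content": "x" * 1000} for _ in range(200)]}

# Baseline: the deepcopy the checkpointing path performs.
start = time.perf_counter()
copy.deepcopy(state)
deepcopy_ms = (time.perf_counter() - start) * 1000

# Workaround: snapshot via a pickle round-trip with protocol 5.
start = time.perf_counter()
snapshot = pickle.loads(pickle.dumps(state, protocol=5))
pickle_ms = (time.perf_counter() - start) * 1000

print(f"deepcopy: {deepcopy_ms:.2f} ms, pickle round-trip: {pickle_ms:.2f} ms")
```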
The real fix
RustSQLiteCheckpointer is a drop-in replacement for LangGraph’s SQLite checkpointer. It uses the same on-disk format, same API, same configuration. The difference is that its serialization path is native Rust code walking a byte buffer — no Python object allocation, no GIL contention in the hot loop, no interpreter overhead.
Measured results on the same hardware:
| State size | LangGraph (deepcopy) | RustSQLiteCheckpointer | Speedup |
|---|---|---|---|
| 3.8 KB | 15.29 ms | 0.35 ms | 43× |
| 35 KB | 52.00 ms | 0.29 ms | 178× |
| 235 KB | 206.21 ms | 0.28 ms | 737× |
Note that the Rust time is roughly flat across state sizes — that’s the structural advantage. Deepcopy’s cost is per-node; Rust’s is per-buffer-byte. As your state grows, Python’s numbers grow proportionally while Rust’s barely move.
How to adopt it
See the full guide. The short version:
```python
from fast_langgraph import RustSQLiteCheckpointer

checkpointer = RustSQLiteCheckpointer("state.db")
graph = graph.compile(checkpointer=checkpointer)
```
One line. No database migration. Reversible.
Related
- Why Python’s deepcopy kills LangGraph — the architectural deep-dive
- Rust SQLite checkpointer guide
- Benchmarks — full serialization numbers
How fast-langgraph addresses this
RustSQLiteCheckpointer replaces the serialization path entirely. The on-disk format is unchanged, so existing databases migrate for free. Measured speedups range from 43× on small state to 737× on complex 235 KB state; the gap widens as state grows because the Rust path's cost stays roughly flat while deepcopy's climbs with every object.