
Why Python's deepcopy kills LangGraph at scale

Neul Labs · Category: performance
TL;DR

Python's deepcopy allocates new objects for every node in your state graph, touches the GIL, and runs interpreted bytecode the whole way. On a 235 KB agent state, that's 206 ms per checkpoint. fast-langraph skips the object model entirely and does the equivalent walk in Rust — 0.28 ms, a 737× speedup.

There’s a question we get after every benchmark presentation: “How can a serialization library be 737× faster than the one Python ships with? That seems too good to be true.”

It’s not. Here’s the actual explanation.

What deepcopy does

When you call copy.deepcopy(state) on a LangGraph state dict, Python does something that seems simple but is deeply expensive at scale:

  1. Walks the object graph recursively. Every dict, list, tuple, custom object, and primitive gets visited.
  2. Allocates a new Python object for every node. New dict headers, new list buffers, new reference counts.
  3. Acquires the GIL for every allocation. This is uncontended in a single thread, but it’s a memory fence and a bookkeeping operation every time.
  4. Runs interpreted bytecode the whole way. The loop logic, the recursion, the isinstance checks, the memo dict for cycle detection — all of it is Python-level.

Individually, each of these operations is fast. Collectively, on a 235 KB LangGraph state object with hundreds of nested nodes, they add up to 206 ms.
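The per-node cost is easy to observe directly. The sketch below times deepcopy on a synthetic state dict (much smaller than the benchmark's 235 KB object, so absolute numbers will differ from the table below) and checks that every nested container really is a freshly allocated object:

```python
import copy
import time

# Synthetic agent-state dict with nested structure, loosely mimicking
# accumulated message history and scratchpad data. Sizes are illustrative.
state = {
    "messages": [{"role": "user", "content": "x" * 200} for _ in range(500)],
    "scratchpad": {f"step_{i}": list(range(20)) for i in range(100)},
}

start = time.perf_counter()
snapshot = copy.deepcopy(state)
elapsed_ms = (time.perf_counter() - start) * 1000

# Every nested container is a brand-new object: equal by value,
# distinct by identity. That identity split is the per-node allocation cost.
assert snapshot == state
assert snapshot is not state
assert snapshot["messages"][0] is not state["messages"][0]
print(f"deepcopy of synthetic state: {elapsed_ms:.2f} ms")
```

The identity assertions are the point: deepcopy cannot share a single nested dict or list with the original, so the allocation count scales with the node count.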

Why this is the wrong abstraction for checkpoints

The catch is that deepcopy is doing far more work than you actually need for a checkpoint. A LangGraph checkpoint is fundamentally a byte buffer. You want to get the current state onto disk or into a database row, and later get it back out. Deepcopy is building you a new in-memory Python object graph so you can then turn around and serialize that — double work.

The optimal path for checkpointing looks like this:

  1. Walk the state structure once
  2. Write its bytes directly into an output buffer
  3. Flush that buffer to storage

No intermediate Python objects. No double allocation. No GIL contention per node. Just a linear walk and a linear write.
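You can approximate this contrast in plain Python. `pickle.dumps` (implemented in C) already does roughly the single-pass walk-and-write described above, while deepcopy-then-pickle pays for the walk twice — once building throwaway objects, once serializing them. This is only an analogy for the pattern, not the fast-langraph implementation:

```python
import copy
import pickle
import time

state = {"messages": [{"role": "assistant", "content": "y" * 100}] * 300}

# Double work: build a whole new object graph, then serialize it.
t0 = time.perf_counter()
blob_double = pickle.dumps(copy.deepcopy(state))
double_ms = (time.perf_counter() - t0) * 1000

# Single pass: walk the existing structure once, writing bytes directly.
t0 = time.perf_counter()
blob_single = pickle.dumps(state)
single_ms = (time.perf_counter() - t0) * 1000

# Both paths round-trip to an equal state; only one allocated a copy.
assert pickle.loads(blob_single) == state
print(f"deepcopy+pickle: {double_ms:.2f} ms, pickle alone: {single_ms:.2f} ms")
```

Even in pure Python the deepcopy step is pure overhead for a checkpoint: the bytes that reach storage are identical either way.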

What Rust lets you do

That’s what the RustSQLiteCheckpointer does. The walk happens in Rust against a Vec<u8> buffer. No per-node object allocation. No GIL cycles inside the hot loop. The entire serialization runs in what’s effectively a tight native-code loop with cache-friendly access patterns.

| State size | Python deepcopy | Rust serializer | Speedup |
| ---------- | --------------- | --------------- | ------- |
| 3.8 KB     | 15.29 ms        | 0.35 ms         | 43×     |
| 35 KB      | 52.00 ms        | 0.29 ms         | 178×    |
| 235 KB     | 206.21 ms       | 0.28 ms         | 737×    |

Notice what happens to the Rust column: it barely moves. Going from 35 KB to 235 KB (nearly 7× more data) takes it from 0.29 ms to 0.28 ms — a difference that is pure measurement noise. The Python column grows roughly linearly. That's the whole story in one table.
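As a quick sanity check on the table's arithmetic: dividing the rounded timing columns reproduces each published speedup to within about one (the published figures presumably come from unrounded timings):

```python
# (python_ms, rust_ms, published_speedup) per row of the benchmark table.
rows = {
    "3.8 KB": (15.29, 0.35, 43),
    "35 KB": (52.00, 0.29, 178),
    "235 KB": (206.21, 0.28, 737),
}
for size, (py_ms, rust_ms, published) in rows.items():
    ratio = py_ms / rust_ms
    # Rounded inputs land within ~1x of the published multiplier.
    assert abs(ratio - published) < 2, (size, ratio)
    print(f"{size}: {ratio:.1f}x (published {published}x)")
```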

“But my state is small, does this matter?”

If your state is under ~1 KB and flat, probably not. Python's dict is already written in C and has been optimized over decades. fast-langraph's checkpointer will still work, but the speedup margin will be small because there isn't much overhead to eliminate.

The interesting inflection point is somewhere around 10–30 KB. Most production agent graphs blow past this quickly once they start accumulating message history, tool outputs, and intermediate reasoning scratchpads. The moment your state hits that zone, deepcopy latency becomes visible in your p95 numbers.

This isn’t a LangGraph bug

Important context: LangGraph is not doing anything wrong by using deepcopy. Deepcopy is the right default for a general-purpose Python framework. It’s correct, it handles arbitrary user types, and it’s what ships in the standard library. The issue is that the checkpoint serialization path is the wrong place for a general-purpose deep-copy, because checkpoint serialization has a much narrower contract: we only need bytes, not new objects.

fast-langraph’s architectural bet is that the checkpoint path is a well-isolated interface (it sits behind BaseCheckpointSaver) and can be swapped without touching the rest of LangGraph. The upstream framework stays general-purpose; we specialize the one operation where specialization pays 737×.
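The swap works because the saver interface is narrow: roughly "store these bytes under this thread id, give them back later." Here is a minimal pure-Python sketch of that shape — the `CheckpointSaver` protocol and `SQLitePickleSaver` class are simplified stand-ins invented for this post, not LangGraph's actual `BaseCheckpointSaver` API:

```python
import pickle
import sqlite3
from typing import Any, Optional, Protocol


class CheckpointSaver(Protocol):
    """Simplified stand-in for a checkpointer interface."""

    def put(self, thread_id: str, state: dict[str, Any]) -> None: ...
    def get(self, thread_id: str) -> Optional[dict[str, Any]]: ...


class SQLitePickleSaver:
    """One walk of the state straight to bytes, then straight to storage."""

    def __init__(self, path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS checkpoints"
            " (thread_id TEXT PRIMARY KEY, blob BLOB)"
        )

    def put(self, thread_id: str, state: dict[str, Any]) -> None:
        # No deepcopy: serialize the live state directly into a blob.
        self.conn.execute(
            "INSERT OR REPLACE INTO checkpoints VALUES (?, ?)",
            (thread_id, pickle.dumps(state)),
        )

    def get(self, thread_id: str) -> Optional[dict[str, Any]]:
        row = self.conn.execute(
            "SELECT blob FROM checkpoints WHERE thread_id = ?", (thread_id,)
        ).fetchone()
        return pickle.loads(row[0]) if row else None


saver: CheckpointSaver = SQLitePickleSaver()
saver.put("thread-1", {"messages": ["hello"]})
print(saver.get("thread-1"))
```

Anything implementing this narrow contract — a pure-Python saver like the one above, or a Rust-backed one — can sit behind the same interface, which is exactly what makes the specialization safe.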

The broader lesson

Whenever you see a Python framework with a “checkpoint” or “snapshot” or “clone” in its hot path, that’s usually a deepcopy-shaped hole waiting for a Rust-shaped peg. It’s the single highest-leverage optimization target in Python AI infrastructure, and it’s why fast-langraph leads with it.

For the step-by-step on adopting the Rust checkpointer, see the guide. For the full benchmark methodology, see /benchmarks.

Frequently asked questions

Why is Python's deepcopy so slow on large state?

deepcopy recursively allocates new Python objects for every node in the state graph. Each allocation takes the GIL, runs interpreted bytecode, and writes to a heap the garbage collector then has to track. On complex state with nested dicts, lists, and custom objects, you pay this cost for every node in the structure, so the total time is roughly linear in node count.

Why does fast-langraph achieve a 737× speedup specifically?

The 737× number comes from the 235 KB state benchmark — deepcopy takes 206 ms, while the Rust implementation takes 0.28 ms. The speedup grows with state size and complexity because deepcopy's overhead is paid per node, while the per-node cost of the Rust buffer walk is negligible.

Does this matter for small graphs?

No. If your state is small (a few KB) and simple (flat dicts, no deep nesting), Python's dict is already implemented in hand-optimized C and the difference is negligible. fast-langraph's advantage shows up when state gets complex, which is when production workloads hit a wall anyway.

Is there a way to keep using deepcopy and still be fast?

Not really. You can use pickle with protocol 5 for somewhat faster serialization, or build custom __copy__ methods for your state classes. Both help but neither closes the order-of-magnitude gap. The architectural issue is that Python's object model forces per-node overhead that native code simply doesn't have.
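As a sketch of the custom-copy option mentioned above: a state class can define `__deepcopy__` to skip fields that are cheap to rebuild, which trims the copied node count but still pays per-node Python overhead for everything it does copy. The `Scratchpad` class here is invented for illustration:

```python
import copy


class Scratchpad:
    """State component whose __deepcopy__ skips a rebuildable cache field."""

    def __init__(self, entries):
        self.entries = entries
        self._cache = {}  # derived data, cheap to recompute lazily

    def __deepcopy__(self, memo):
        # Copy only the entries; deliberately drop the cache.
        return Scratchpad(copy.deepcopy(self.entries, memo))


pad = Scratchpad([1, [2, 3]])
pad._cache["derived"] = object()

clone = copy.deepcopy(pad)
assert clone.entries == [1, [2, 3]]          # payload copied
assert clone.entries[1] is not pad.entries[1]  # still a true deep copy
assert clone._cache == {}                      # cache not carried over
```

This kind of trimming can shave a constant factor, but every node that does get copied still goes through the interpreted, GIL-holding allocation path — which is why it doesn't close the order-of-magnitude gap.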