
LangGraph vs fast-langraph: side-by-side benchmarks

Neul Labs · Category: comparison
TL;DR

Across every operation we measured, fast-langraph equals or beats vanilla LangGraph. The biggest wins are checkpoint serialization (737x on large state), sustained state updates (46x on quick workloads), and end-to-end graph execution (2.77x on realistic workloads with checkpointing). Small-state simple dict merges are the one place Python's hand-optimized C implementation still wins — and fast-langraph doesn't use Rust there.

Every benchmark on this page is reproducible from the public scripts/ directory in the fast-langraph repo. Same hardware, same inputs, same LangGraph version. No selective reporting.

Test environment

  • Python 3.12.3
  • Linux 6.14.0 x86_64
  • LangGraph 1.0.4 (commit 4d01e69b)
  • fast-langraph latest main
  • Generated 2025-12-10

Checkpoint serialization

Serializing a LangGraph state dict through the checkpoint path.

| State size | LangGraph (deepcopy) | fast-langraph (Rust) | fast-langraph advantage |
|---|---|---|---|
| 3.8 KB | 15.29 ms | 0.35 ms | 43× |
| 35 KB | 52.00 ms | 0.29 ms | 178× |
| 235 KB | 206.21 ms | 0.28 ms | 737× |

Rust’s advantage scales with state size because deepcopy’s cost is per-node while the Rust path is effectively a linear buffer walk. At very small state sizes the gap is still 43× — but absolute numbers are low enough (15 ms) that you might not notice.
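The scaling claim is easy to probe yourself. Here is a minimal sketch that assumes nothing about fast-langraph internals — it just contrasts per-node `deepcopy` traversal with a single-pass serializer (`pickle`) on a synthetic state dict:

```python
import copy
import pickle
import time

def make_state(n_messages: int) -> dict:
    """Synthetic LangGraph-style state: a list of message dicts plus metadata."""
    return {
        "messages": [{"role": "user", "content": f"msg {i}" * 8} for i in range(n_messages)],
        "step": n_messages,
    }

def per_op_ms(fn, arg, iters: int = 100) -> float:
    """Average wall-clock milliseconds per call of fn(arg)."""
    start = time.perf_counter()
    for _ in range(iters):
        fn(arg)
    return (time.perf_counter() - start) / iters * 1e3

state = make_state(500)
# deepcopy walks every node of the object graph; pickle streams into one buffer.
print(f"deepcopy:  {per_op_ms(copy.deepcopy, state):.3f} ms/op")
print(f"serialize: {per_op_ms(pickle.dumps, state):.3f} ms/op")
```

The absolute numbers will differ from the table (different machines, different serializers), but the shape — traversal cost growing faster than buffer-walk cost — reproduces.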

Sustained state updates

Simulating graph execution with continuous state updates.

| Workload | Steps | LangGraph | fast-langraph | Advantage |
|---|---|---|---|---|
| Quick | 1,000 | 83.98 ms | 1.83 ms | 45.9× |
| Medium | 100 | 7.56 ms | 0.57 ms | 13.2× |

The “quick” workload represents tight inner loops where state updates dominate. The 45.9× number is the kind of thing you see on graphs that stream tokens or aggregate many small observations.
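For reference, the update pattern this workload stresses looks roughly like the following — a hand-rolled sketch of per-step state merging, not fast-langraph's actual executor:

```python
def run_steps(n_steps: int) -> dict:
    """Tight loop of per-step partial updates merged into a growing state."""
    state = {"messages": [], "count": 0}
    for i in range(n_steps):
        # Each step emits a small delta; merging it back in is the hot path.
        delta = {"messages": state["messages"] + [f"obs {i}"], "count": state["count"] + 1}
        state = {**state, **delta}
    return state

final = run_steps(1_000)
print(final["count"], len(final["messages"]))  # 1000 1000
```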

End-to-end graph simulation

A realistic 20-node graph running 50 iterations with checkpointing enabled.

| Implementation | Total time | Relative |
|---|---|---|
| LangGraph baseline | 25.26 ms | 1.00× |
| fast-langraph (full) | 9.11 ms | 2.77× |

This is the number we cite as the “honest end-to-end” speedup. It’s not 737× because end-to-end time includes node execution, LLM call latency (stubbed here), and other work that fast-langraph doesn’t touch. It’s 2.77× because the parts we do touch — checkpoint, executor, apply_writes — are a significant fraction but not the whole wall clock.

For real production workloads, the end-to-end multiplier varies from ~1.5× (LLM-dominated) to ~5–8× (checkpoint-dominated) depending on where time is actually going.
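That spread follows directly from Amdahl's law: only the fraction of wall clock that fast-langraph touches can shrink. A quick model (the fractions below are illustrative, not measured):

```python
def end_to_end_speedup(fraction_accelerated: float, component_speedup: float) -> float:
    """Amdahl's law: overall speedup when only part of the runtime is accelerated."""
    return 1.0 / ((1.0 - fraction_accelerated) + fraction_accelerated / component_speedup)

# LLM-dominated graph: little of the wall clock is framework overhead.
print(f"{end_to_end_speedup(0.35, 50):.2f}x")  # 1.52x
# Checkpoint-dominated graph: most of the wall clock is in accelerated paths.
print(f"{end_to_end_speedup(0.88, 50):.2f}x")  # 7.27x
```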

LLM caching

Simulated LLM workload with 90% cache hit rate on a 100-call sequence.

| Implementation | Total time | Relative |
|---|---|---|
| No cache | 108.48 ms | 1.00× |
| fast-langraph @cached | 11.09 ms | 9.78× |

Speedup scales with hit rate roughly as 1/(1 − hit rate): at 50% it’s roughly 2×; at 90% it’s ~10×.
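The hit-rate relationship is easy to verify with a small expected-cost model, assuming uniform call cost and a near-zero cost for a cache hit:

```python
def cache_speedup(hit_rate: float, t_miss_ms: float, t_hit_ms: float) -> float:
    """Expected speedup over an uncached baseline for a given hit rate."""
    expected = hit_rate * t_hit_ms + (1.0 - hit_rate) * t_miss_ms
    return t_miss_ms / expected

print(f"{cache_speedup(0.5, 1.0, 0.01):.1f}x")  # ~2x at 50% hits
print(f"{cache_speedup(0.9, 1.0, 0.01):.1f}x")  # ~9-10x at 90% hits
```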

Channel operations

Here’s where we’re honest about a place fast-langraph does not beat LangGraph:

| Operation | LangGraph | fast-langraph (Rust) | Winner |
|---|---|---|---|
| LastValue channel update (per op) | 51.17 ns | 317.29 ns | LangGraph |

Rust loses here by about 6×. Why? Channel updates are such tiny operations that the Python↔Rust call overhead (still present with a Rust backend via PyO3) outweighs the computation itself. LangGraph’s pure-Python update, dispatched entirely inside CPython’s own C machinery, pays no such boundary cost.

This is why the shim doesn’t replace the channel update path — it only replaces apply_writes (the batch update function), where Rust’s SIMD/pipelining advantages can amortize the PyO3 boundary cost. Individual channel updates stay on the Python path.
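The trade-off reduces to a simple cost model: crossing the Python↔Rust boundary has a fixed price, so it only pays when one crossing covers many updates. A sketch with illustrative (not measured) numbers:

```python
def per_op_path_ns(n_updates: int, boundary_ns: float, op_ns: float) -> float:
    """One FFI crossing per update: the boundary cost is paid n times."""
    return n_updates * (boundary_ns + op_ns)

def batched_path_ns(n_updates: int, boundary_ns: float, op_ns: float) -> float:
    """One crossing for the whole batch (apply_writes-style): boundary amortized."""
    return boundary_ns + n_updates * op_ns

# Illustrative costs: 250 ns per boundary crossing, 50 ns per update in Rust.
print(per_op_path_ns(1, 250, 50), batched_path_ns(1, 250, 50))        # single op: a wash
print(per_op_path_ns(1000, 250, 50), batched_path_ns(1000, 250, 50))  # batch: Rust wins
```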

The lesson: we don’t rewrite anything that Python already does fast. We pick the right battles.

Simple vs. deep dict merge

Another area we’re honest about:

| Operation | Python `{**a, **b}` | Rust merge_dicts | Winner |
|---|---|---|---|
| 1000-key merge × 10,000 iters | 209.94 ms | 1,084.81 ms | Python (5.2×) |

Python’s built-in dict merge is hand-optimized C. For simple dict merges, it stays faster than anything we can reasonably do via PyO3. For deep nested merges with messages to append, the story flips and langgraph_state_update wins — but only when the structure is non-trivial.
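To make “non-trivial structure” concrete, here is a hand-rolled sketch of deep-merge semantics — nested dicts merge recursively and message lists append. The function name and the `"messages"` key convention are assumptions for illustration, not fast-langraph’s actual API:

```python
def deep_merge(a: dict, b: dict) -> dict:
    """Recursive merge: nested dicts merge, 'messages' lists append, scalars overwrite."""
    out = dict(a)
    for key, value in b.items():
        if isinstance(value, dict) and isinstance(out.get(key), dict):
            out[key] = deep_merge(out[key], value)
        elif key == "messages" and isinstance(value, list) and isinstance(out.get(key), list):
            out[key] = out[key] + value
        else:
            out[key] = value
    return out

merged = deep_merge(
    {"messages": ["hi"], "meta": {"step": 1}},
    {"messages": ["there"], "meta": {"node": "a"}},
)
print(merged)  # {'messages': ['hi', 'there'], 'meta': {'step': 1, 'node': 'a'}}
```

A flat `{**a, **b}` cannot express this — which is exactly the case where the extra work amortizes a Rust call.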

fast-langraph uses Rust where it pays and stays out of the way where it doesn’t. If you see a claim like “Rust always wins at Python operations,” that claim is wrong. Rust wins when there’s meaningful work to amortize; Python’s C layer wins when there isn’t.

In-memory checkpoint operations

| Operation | LangGraph | fast-langraph |
|---|---|---|
| PUT (1000 ops) | baseline | 1.40 ms (1.40 μs/op) |
| GET (1000 ops) | baseline | 3.73 ms (3.73 μs/op) |

In-memory checkpointer operations are already fast (no disk, no serialization). fast-langraph’s in-memory checkpointer matches or slightly improves on the baseline; the real win is on the SQLite path where serialization dominates.

Profiler overhead

| Metric | Value |
|---|---|
| Without profiling | 28.11 ms |
| With profiling | 44.28 ms |
| Overhead per op | 1.62 μs |

The profiler adds ~1.6 μs per operation. On a microbenchmark like this, that’s 57% relative overhead. On a real graph where operations take milliseconds, it’s effectively invisible.
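The arithmetic checks out — note that the operation count below is inferred from the per-op figure, not stated in the table:

```python
without_ms = 28.11
with_ms = 44.28
n_ops = 10_000  # inferred from the 1.62 us/op figure (assumption)

overhead_us_per_op = (with_ms - without_ms) * 1_000 / n_ops
relative_overhead = (with_ms - without_ms) / without_ms

print(f"{overhead_us_per_op:.2f} us/op")   # 1.62 us/op
print(f"{relative_overhead:.1%} relative") # 57.5% relative
```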

How to reproduce

```shell
git clone https://github.com/neul-labs/fast-langgraph
cd fast-langgraph
uv run python scripts/generate_benchmark_report.py
```

Or individual suites:

```shell
uv run python scripts/benchmark_rust_strengths.py
uv run python scripts/benchmark_complex_structures.py
uv run python scripts/benchmark_all_features.py
uv run python scripts/benchmark_rust_channels.py
cargo bench
```

All benchmark code is public. All inputs are deterministic. All outputs land in BENCHMARK.md.

The TL;DR for architects

  • Big win: checkpoint serialization (737×). Adopt RustSQLiteCheckpointer if your state is non-trivial.
  • Big win: sustained state updates (46×) and end-to-end (2.77×). Adopt the shim universally.
  • Big win: LLM caching (9.78×). Adopt @cached where your graph has redundant prompts.
  • No win: per-op channel updates and simple dict merges. fast-langraph knows this and stays out of those paths.

If you want to know which bucket your workload is in, run the profiler first. Or let us run it for you.