
LangGraph vs fast-langraph: side-by-side benchmarks

Neul Labs · Category: comparison
TL;DR

Across every operation we measured, fast-langraph equals or beats vanilla LangGraph. The biggest wins are checkpoint serialization (737x on large state), sustained state updates (46x on quick workloads), and end-to-end graph execution (2.77x on realistic workloads with checkpointing). Small-state simple dict merges are the one place Python's hand-optimized C implementation still wins — and fast-langraph doesn't use Rust there.

Every benchmark on this page is reproducible from the public scripts/ directory in the fast-langraph repo. Same hardware, same inputs, same LangGraph version. No selective reporting.

Test environment

  • Python 3.12.3
  • Linux 6.14.0 x86_64
  • LangGraph 1.0.4 (commit 4d01e69b)
  • fast-langraph latest main
  • Generated 2025-12-10

Checkpoint serialization

Serializing a LangGraph state dict through the checkpoint path.

| State size | LangGraph (deepcopy) | fast-langraph (Rust) | fast-langraph advantage |
|---|---|---|---|
| 3.8 KB | 15.29 ms | 0.35 ms | 43× |
| 35 KB | 52.00 ms | 0.29 ms | 178× |
| 235 KB | 206.21 ms | 0.28 ms | 737× |

Rust’s advantage scales with state size because deepcopy’s cost is per-node while the Rust path is effectively a linear buffer walk. At very small state sizes the gap is still 43× — but absolute numbers are low enough (15 ms) that you might not notice.
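The scaling claim is easy to probe yourself. Here is a minimal sketch that assumes nothing about fast-langraph internals — it just contrasts per-node `deepcopy` traversal with a single-pass serializer (`pickle`) on a synthetic state dict:

```python
import copy
import pickle
import time

def make_state(n_messages: int) -> dict:
    """Synthetic LangGraph-style state: a list of message dicts plus metadata."""
    return {
        "messages": [{"role": "user", "content": f"msg {i}" * 8} for i in range(n_messages)],
        "step": n_messages,
    }

def per_op_ms(fn, arg, iters: int = 100) -> float:
    """Average wall-clock milliseconds per call of fn(arg)."""
    start = time.perf_counter()
    for _ in range(iters):
        fn(arg)
    return (time.perf_counter() - start) / iters * 1e3

state = make_state(500)
# deepcopy walks every node of the object graph; pickle streams into one buffer.
print(f"deepcopy:  {per_op_ms(copy.deepcopy, state):.3f} ms/op")
print(f"serialize: {per_op_ms(pickle.dumps, state):.3f} ms/op")
```

The absolute numbers will differ from the table (different machines, different serializers), but the shape — traversal cost growing faster than buffer-walk cost — reproduces.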

Sustained state updates

Simulating graph execution with continuous state updates.

| Workload | Steps | LangGraph | fast-langraph | Advantage |
|---|---|---|---|---|
| Quick | 1,000 | 83.98 ms | 1.83 ms | 45.9× |
| Medium | 100 | 7.56 ms | 0.57 ms | 13.2× |

The “quick” workload represents tight inner loops where state updates dominate. The 45.9× number is the kind of thing you see on graphs that stream tokens or aggregate many small observations.
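For reference, the update pattern this workload stresses looks roughly like the following — a hand-rolled sketch of per-step state merging, not fast-langraph's actual executor:

```python
def run_steps(n_steps: int) -> dict:
    """Tight loop of per-step partial updates merged into a growing state."""
    state = {"messages": [], "count": 0}
    for i in range(n_steps):
        # Each step emits a small delta; merging it back in is the hot path.
        delta = {"messages": state["messages"] + [f"obs {i}"], "count": state["count"] + 1}
        state = {**state, **delta}
    return state

final = run_steps(1_000)
print(final["count"], len(final["messages"]))  # 1000 1000
```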

End-to-end graph simulation

A realistic 20-node graph running 50 iterations with checkpointing enabled.

| Implementation | Total time | Relative |
|---|---|---|
| LangGraph baseline | 25.26 ms | 1.00× |
| fast-langraph (full) | 9.11 ms | 2.77× |

This is the number we cite as the “honest end-to-end” speedup. It’s not 737× because end-to-end time includes node execution, LLM call latency (stubbed here), and other work that fast-langraph doesn’t touch. It’s 2.77× because the parts we do touch — checkpoint, executor, apply_writes — are a significant fraction but not the whole wall clock.

For real production workloads, the end-to-end multiplier varies from ~1.5× (LLM-dominated) to ~5–8× (checkpoint-dominated) depending on where time is actually going.
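That spread follows directly from Amdahl's law: only the fraction of wall clock that fast-langraph touches can shrink. A quick model (the fractions below are illustrative, not measured):

```python
def end_to_end_speedup(fraction_accelerated: float, component_speedup: float) -> float:
    """Amdahl's law: overall speedup when only part of the runtime is accelerated."""
    return 1.0 / ((1.0 - fraction_accelerated) + fraction_accelerated / component_speedup)

# LLM-dominated graph: little of the wall clock is framework overhead.
print(f"{end_to_end_speedup(0.35, 50):.2f}x")  # 1.52x
# Checkpoint-dominated graph: most of the wall clock is in accelerated paths.
print(f"{end_to_end_speedup(0.88, 50):.2f}x")  # 7.27x
```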

LLM caching

Simulated LLM workload with 90% cache hit rate on a 100-call sequence.

| Implementation | Total time | Relative |
|---|---|---|
| No cache | 108.48 ms | 1.00× |
| fast-langraph @cached | 11.09 ms | 9.78× |

Speedup scales with hit rate roughly as 1/(1 − hit rate): at 50% it’s roughly 2×; at 90% it’s ~10×.
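The hit-rate relationship is easy to verify with a small expected-cost model, assuming uniform call cost and a near-zero cost for a cache hit:

```python
def cache_speedup(hit_rate: float, t_miss_ms: float, t_hit_ms: float) -> float:
    """Expected speedup over an uncached baseline for a given hit rate."""
    expected = hit_rate * t_hit_ms + (1.0 - hit_rate) * t_miss_ms
    return t_miss_ms / expected

print(f"{cache_speedup(0.5, 1.0, 0.01):.1f}x")  # ~2x at 50% hits
print(f"{cache_speedup(0.9, 1.0, 0.01):.1f}x")  # ~9-10x at 90% hits
```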

Channel operations

Here’s where we’re honest about a place fast-langraph does not beat LangGraph:

| Operation | LangGraph | fast-langraph (Rust) | Winner |
|---|---|---|---|
| LastValue channel update (per op) | 51.17 ns | 317.29 ns | LangGraph |

Rust loses here by about 6×. Why? Channel updates are such tiny operations that the Python↔Rust call overhead (still present with a Rust backend via PyO3) outweighs the computation itself. LangGraph’s pure-Python update, dispatched entirely inside CPython’s own C machinery, pays no such boundary cost.

This is why the shim doesn’t replace the channel update path — it only replaces apply_writes (the batch update function), where Rust’s SIMD/pipelining advantages can amortize the PyO3 boundary cost. Individual channel updates stay on the Python path.
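The trade-off reduces to a simple cost model: crossing the Python↔Rust boundary has a fixed price, so it only pays when one crossing covers many updates. A sketch with illustrative (not measured) numbers:

```python
def per_op_path_ns(n_updates: int, boundary_ns: float, op_ns: float) -> float:
    """One FFI crossing per update: the boundary cost is paid n times."""
    return n_updates * (boundary_ns + op_ns)

def batched_path_ns(n_updates: int, boundary_ns: float, op_ns: float) -> float:
    """One crossing for the whole batch (apply_writes-style): boundary amortized."""
    return boundary_ns + n_updates * op_ns

# Illustrative costs: 250 ns per boundary crossing, 50 ns per update in Rust.
print(per_op_path_ns(1, 250, 50), batched_path_ns(1, 250, 50))        # single op: a wash
print(per_op_path_ns(1000, 250, 50), batched_path_ns(1000, 250, 50))  # batch: Rust wins
```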

The lesson: we don’t rewrite anything that Python already does fast. We pick the right battles.

Simple vs. deep dict merge

Another area we’re honest about:

| Operation | Python `{**a, **b}` | Rust merge_dicts | Winner |
|---|---|---|---|
| 1000-key merge × 10,000 iters | 209.94 ms | 1,084.81 ms | Python (5.2×) |

Python’s built-in dict merge is hand-optimized C. For simple dict merges, it stays faster than anything we can reasonably do via PyO3. For deep nested merges with messages to append, the story flips and langgraph_state_update wins — but only when the structure is non-trivial.
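To make “non-trivial structure” concrete, here is a hand-rolled sketch of deep-merge semantics — nested dicts merge recursively and message lists append. The function name and the `"messages"` key convention are assumptions for illustration, not fast-langraph’s actual API:

```python
def deep_merge(a: dict, b: dict) -> dict:
    """Recursive merge: nested dicts merge, 'messages' lists append, scalars overwrite."""
    out = dict(a)
    for key, value in b.items():
        if isinstance(value, dict) and isinstance(out.get(key), dict):
            out[key] = deep_merge(out[key], value)
        elif key == "messages" and isinstance(value, list) and isinstance(out.get(key), list):
            out[key] = out[key] + value
        else:
            out[key] = value
    return out

merged = deep_merge(
    {"messages": ["hi"], "meta": {"step": 1}},
    {"messages": ["there"], "meta": {"node": "a"}},
)
print(merged)  # {'messages': ['hi', 'there'], 'meta': {'step': 1, 'node': 'a'}}
```

A flat `{**a, **b}` cannot express this — which is exactly the case where the extra work amortizes a Rust call.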

fast-langraph uses Rust where it pays and stays out of the way where it doesn’t. If you see a claim like “Rust always wins at Python operations,” that claim is wrong. Rust wins when there’s meaningful work to amortize; Python’s C layer wins when there isn’t.

In-memory checkpoint operations

| Operation | LangGraph | fast-langraph |
|---|---|---|
| PUT (1000 ops) | baseline | 1.40 ms (1.40 μs/op) |
| GET (1000 ops) | baseline | 3.73 ms (3.73 μs/op) |

In-memory checkpointer operations are already fast (no disk, no serialization). fast-langraph’s in-memory checkpointer matches or slightly improves on the baseline; the real win is on the SQLite path where serialization dominates.

Profiler overhead

| Metric | Value |
|---|---|
| Without profiling | 28.11 ms |
| With profiling | 44.28 ms |
| Overhead per op | 1.62 μs |

The profiler adds ~1.6 μs per operation. On a microbenchmark like this, that’s 57% relative overhead. On a real graph where operations take milliseconds, it’s effectively invisible.
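The arithmetic checks out — note that the operation count below is inferred from the per-op figure, not stated in the table:

```python
without_ms = 28.11
with_ms = 44.28
n_ops = 10_000  # inferred from the 1.62 us/op figure (assumption)

overhead_us_per_op = (with_ms - without_ms) * 1_000 / n_ops
relative_overhead = (with_ms - without_ms) / without_ms

print(f"{overhead_us_per_op:.2f} us/op")   # 1.62 us/op
print(f"{relative_overhead:.1%} relative") # 57.5% relative
```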

How to reproduce

```shell
git clone https://github.com/neul-labs/fast-langgraph
cd fast-langgraph
uv run python scripts/generate_benchmark_report.py
```

Or individual suites:

```shell
uv run python scripts/benchmark_rust_strengths.py
uv run python scripts/benchmark_complex_structures.py
uv run python scripts/benchmark_all_features.py
uv run python scripts/benchmark_rust_channels.py
cargo bench
```

All benchmark code is public. All inputs are deterministic. All outputs land in BENCHMARK.md.

The TL;DR for architects

  • Big win: checkpoint serialization (737×). Adopt RustSQLiteCheckpointer if your state is non-trivial.
  • Big win: sustained state updates (46×) and end-to-end (2.77×). Adopt the shim universally.
  • Big win: LLM caching (9.78×). Adopt @cached where your graph has redundant prompts.
  • No win: per-op channel updates and simple dict merges. fast-langraph knows this and stays out of those paths.

If you want to know which bucket your workload is in, run the profiler first. Or let us run it for you.