production · scaling · architecture · cost

Scaling LangGraph in production: the three real bottlenecks

Neul Labs · Category: architecture
TL;DR

Production LangGraph workloads hit three predictable walls: deepcopy-bound checkpoints, thread-pool churn per invocation, and redundant LLM calls. Each one has a clear lever and a clear cost story. The three together typically give teams 3–8× latency improvements and 30–60% LLM cost reductions.

If you’ve run LangGraph in production for more than a month, you’ve probably hit a wall. Latency spikes. LLM bills that don’t match traffic. Memory that balloons over time. The surprising thing isn’t that these problems exist — it’s how consistent they are across very different teams and use cases.

There are three bottlenecks. They show up in the same order. They have the same fixes. Here’s the map.

Bottleneck 1: checkpoint serialization

Symptom: p95 latency jumps when you enable checkpointing. Memory allocation rate climbs. tracemalloc shows tens of MB allocated per invocation even though your state “feels small.”

Root cause: Python’s deepcopy runs on every checkpoint write. On complex state — messages, tool outputs, scratchpads, cached embeddings — deepcopy’s per-node allocation overhead dominates. We measured 206 ms for a 235 KB state (see why deepcopy kills LangGraph).
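The deepcopy cost is easy to reproduce with nothing but the standard library. A minimal sketch, with a made-up state shape standing in for a real agent's accumulated messages, tool outputs, and scratchpad:

```python
import copy
import time

# Hypothetical agent state, roughly the shape that accumulates in a
# LangGraph run: message history, tool outputs, a scratchpad.
state = {
    "messages": [{"role": "user", "content": "x" * 500} for _ in range(200)],
    "tool_outputs": [list(range(100)) for _ in range(50)],
    "scratchpad": {"step": 0, "notes": ["n" * 200] * 50},
}

start = time.perf_counter()
for _ in range(10):
    copy.deepcopy(state)  # what the checkpoint write path does on every save
elapsed_ms = (time.perf_counter() - start) * 1000 / 10

print(f"deepcopy per checkpoint: {elapsed_ms:.2f} ms")
```

Scale the message list up and watch the per-checkpoint time climb roughly linearly; that is the same curve the 206 ms / 235 KB measurement sits on.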

Fix: RustSQLiteCheckpointer replaces the serialization path end-to-end. Same on-disk format, so no data migration. Peak speedup of 737× on large state, with linear gains as state grows.

Cost math: On a workload doing 100 checkpoints per request at ~50 ms each (Python), checkpoint time alone is 5 seconds per request. Moving to Rust cuts that to ~28 ms — effectively free. At 1 QPS sustained, that’s 5 CPU-seconds saved per second of real time: you’ve freed an entire core.
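The same numbers, restated as arithmetic. Note the per-checkpoint figure on the Rust path (~0.28 ms) is implied by the ~28 ms total above, not measured independently here:

```python
checkpoints_per_request = 100
python_ms_per_checkpoint = 50.0   # measured Python deepcopy path
rust_ms_per_checkpoint = 0.28     # implied by the ~28 ms total above

python_total_s = checkpoints_per_request * python_ms_per_checkpoint / 1000
rust_total_s = checkpoints_per_request * rust_ms_per_checkpoint / 1000
saved_s = python_total_s - rust_total_s  # ~5 s saved per request

# At 1 QPS sustained, that is ~5 CPU-seconds saved per wall-clock second:
# an entire core freed.
print(python_total_s, rust_total_s, round(saved_s, 2))
```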

Bottleneck 2: executor churn

Symptom: Short graphs are slower than they “should” be. p50 latency on a trivially small invocation is 50–100 ms even though no individual step does much work. Profile shows ThreadPoolExecutor.__init__ and related setup code.

Root cause: LangGraph builds a fresh ThreadPoolExecutor for every invoke call. Thread pool creation is slow — spawning workers, registering futures, allocating job queues. We measured this taking 58% of per-invocation wall clock on short graphs.
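The churn is visible with a stdlib-only sketch. The workload and pool size here are invented; the point is the contrast between building a pool per call and reusing one:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_steps(executor, steps):
    """Fan a graph 'superstep' out over worker threads."""
    return list(executor.map(lambda s: s * 2, steps))

steps = list(range(8))

# Per-invocation pool: pay thread-spawn and setup cost on every call.
start = time.perf_counter()
for _ in range(200):
    with ThreadPoolExecutor(max_workers=4) as ex:
        run_steps(ex, steps)
fresh = time.perf_counter() - start

# Cached pool: one pool reused across all 200 invocations.
shared = ThreadPoolExecutor(max_workers=4)
start = time.perf_counter()
for _ in range(200):
    run_steps(shared, steps)
reused = time.perf_counter() - start
shared.shutdown()

print(f"fresh pool: {fresh*1000:.0f} ms, reused pool: {reused*1000:.0f} ms")
```

The fresh-pool loop spawns and joins hundreds of threads; the reused pool spawns four, once. That is the difference the shim's executor caching captures.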

Fix: The shim’s executor caching keeps the pool alive across invocations. Flip it on with fast_langgraph.shim.patch_langgraph(). Typical result: 2.3× speedup on invocation overhead, contributing to a combined ~2.8× end-to-end improvement when stacked with the Rust apply_writes.

Cost math: On a workload at 10 QPS with 50 ms per invocation (half of which is executor setup), you’re burning 250 ms of CPU per second on thread pool churn alone. Eliminating it recovers ~25% of your compute budget immediately. On high-QPS services, this is often the single largest latency improvement.

Bottleneck 3: LLM redundancy

Symptom: LLM bills grow faster than traffic. You notice the same prompt getting sent to the API multiple times during a single request. Retries, branching, and reflection phases double or triple your cost per call.

Root cause: Graphs have loops, retries, and re-entrant paths. The same llm.invoke(prompt) call with identical arguments gets executed multiple times because no caching layer sits between your node code and the LLM client.

Fix: The @cached decorator. One line on top of your LLM call function. Content-addressed by arguments, LRU-evicted, sub-microsecond lookup cost.

Cost math: Take a graph making 3 LLM calls per invocation, where one call is unique per request and the other two are repeats that hit the cache about 85% of the time. Expected paid calls drop from 3 to roughly 1 + 2 × 0.15 ≈ 1.3 per invocation, a 57% reduction in API spend. Latency-wise, the LLM-bound portion goes from 3 × 500 ms = 1500 ms to roughly 1.3 × 500 ms = 650 ms, a 2.3× improvement.

The compound effect

The three bottlenecks are orthogonal. Each one hits a different part of the pipeline, so the speedups stack multiplicatively in most cases:

  • Shim alone: ~2.8× on end-to-end wall clock
  • Plus RustSQLiteCheckpointer on a checkpoint-heavy workload: another 1.5–3×, depending on state size
  • Plus @cached on an LLM-heavy workload: another 1.5–2×, depending on cache hit rate

A production workload with a p50 latency of 2 seconds can reasonably end up at 400–600 ms after a full optimization pass. LLM bills drop 30–60%. Memory footprint drops 20–40%.

The order matters

If you’re doing this yourself, profile first (see profiling bottlenecks), then attack the biggest single line item. For most teams, that’s checkpoint serialization — because agents tend to accumulate state aggressively, and deepcopy scales the worst.
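A sketch of that first profiling step using cProfile on a simulated request. The workload below is invented purely to show where time attribution lands; swap in a real request handler:

```python
import cProfile
import copy
import io
import pstats
import time

def checkpoint(state):
    return copy.deepcopy(state)  # stand-in for the serialization path

def llm_call(prompt):
    time.sleep(0.001)  # stand-in for network-bound API latency
    return prompt

def handle_request():
    state = {"messages": ["m" * 1000] * 100}
    for i in range(20):
        llm_call(f"step {i}")
        checkpoint(state)

profiler = cProfile.Profile()
profiler.enable()
handle_request()
profiler.disable()

out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(10)
print(out.getvalue())  # top entries show where the request actually spends time
```

Sort by cumulative time and attack the biggest line item first; on real agent workloads that top entry is usually the checkpoint/serialization path.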

The wrong order (e.g., “let’s add caching first”) isn’t harmful, but it leaves value on the table. You want to eliminate fixed costs before reducing variable costs.

When you should call us

If your workload is small, your bills are flat, and your p95 is within budget — do nothing. Optimization is a cost that only pays back when you’re actually paying for the problem.

If you’re running at scale, your bills are growing, and you don’t have internal Rust expertise, we run production audits. Audit → optimize → hand off in under a month for most workloads.


Frequently asked questions

When should I start thinking about LangGraph performance?

When p95 latency exceeds your SLA budget, when LLM bills are growing faster than traffic, or when you notice memory footprint climbing with state size. Before that, don't bother — optimization without measurement is noise.

Is fast-langgraph the only option?

No. Teams sometimes solve these bottlenecks with a combination of state schema changes, checkpointer retention tuning, and custom caching layers. fast-langgraph gives you turnkey answers to these problems, but nothing stops you from building your own. We measure outcomes, not library loyalty.

How much engineering time does a full optimization pass take?

A focused optimization sprint is typically 1–3 weeks: one week of profiling and measurement, one to two weeks of implementation, plus production validation. The biggest variable is how much of your state schema needs redesign.