Tags: executor · thread-pool · invocation-overhead

Executor churn: the 58% problem in LangGraph invocations

Neul Labs · Category: performance
TL;DR

LangGraph creates a fresh ThreadPoolExecutor on every invoke call. On short graphs, we measured this consuming 58% of total wall clock time — the graph itself hasn't even started running yet. The shim's executor cache eliminates this and delivers a 2.3x speedup on the invocation path alone.

This is the finding that most surprised us when we started profiling real LangGraph workloads. If you’d asked us ahead of time where invocation overhead was, we’d have guessed: channel updates, state merging, checkpointer overhead, maybe Pydantic validation. All reasonable guesses. None of them came close.

The actual answer: thread pool construction.

What we measured

We ran a minimal LangGraph workload in a tight loop — a small state, a handful of nodes, no real LLM calls (we stubbed them), no checkpointing. Pure invocation overhead. The kind of profile you’d use to answer “how much does LangGraph itself cost, ignoring what my nodes do?”
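The shape of that harness is easy to approximate without LangGraph at all. The sketch below is a stand-in, not the real API: node work is stubbed as a trivial function, and the framework's fresh-executor-per-invoke pattern is reproduced directly.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def stub_node(state):
    # Stand-in for a graph node: trivial work, no LLM call.
    return state + 1

def invoke_once(state):
    # Mimic the pattern under test: a fresh executor per invocation,
    # torn down when the invocation ends.
    with ThreadPoolExecutor(max_workers=4) as pool:
        for _ in range(5):  # a handful of nodes
            state = pool.submit(stub_node, state).result()
    return state

start = time.perf_counter()
for _ in range(100):  # 100-invocation hot loop
    invoke_once(0)
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"100 invocations: {elapsed_ms:.1f} ms")
```

Running something like this under cProfile is what surfaces the executor-setup share; the absolute numbers will differ from ours, but the shape of the profile should not.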

On a 100-invocation hot loop, we got this breakdown:

Total wall: 4320 ms
  ThreadPoolExecutor setup   2506 ms   (58.0%)
  apply_writes                482 ms   (11.2%)
  Node dispatch              1055 ms   (24.4%)
  Channel bookkeeping         277 ms   ( 6.4%)

More than half the wall clock — 58% — was spent inside ThreadPoolExecutor.__init__ and the associated futures registration. The actual graph logic got the minority share.

This was not a synthetic microbenchmark meant to make a point. This was a profile of a LangGraph workload that a real team was running in production and couldn’t figure out why it was slow.

Why this happens

LangGraph uses concurrent.futures.ThreadPoolExecutor to dispatch node work. On every call to graph.invoke(...), a fresh executor is constructed. That means:

  1. Thread workers are spawned (os.fork isn’t used on Linux here, but the Python-level setup has non-trivial fixed cost)
  2. Internal work queues are allocated
  3. Future objects are allocated
  4. Worker threads enter their wait loops

All of this is O(1) per invocation, but the constant is large relative to the work a short graph actually does. A 5-node graph might execute its nodes in 2 ms but spend 50 ms setting up the executor around them. That’s a 25× overhead ratio.

As graphs get longer, the overhead ratio falls. If your graph runs for 10 seconds, 50 ms of executor setup is 0.5%. But most production LangGraph workloads aren’t 10-second graphs — they’re sub-second responses to user requests, and that’s where the 58% number comes from.
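The fixed-cost effect is easy to see side by side. This sketch (again stdlib-only, not LangGraph itself) times the same trivial work under both patterns; on short invocations the fresh-pool variant is consistently slower, and the gap is entirely executor construction and teardown.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def node(x):
    return x + 1

def run_fresh(n):
    # Fresh executor per invocation (the default pattern).
    for _ in range(n):
        with ThreadPoolExecutor(max_workers=4) as pool:
            pool.submit(node, 0).result()

def run_cached(n):
    # One executor reused across all invocations.
    pool = ThreadPoolExecutor(max_workers=4)
    for _ in range(n):
        pool.submit(node, 0).result()
    pool.shutdown()

for fn in (run_fresh, run_cached):
    t0 = time.perf_counter()
    fn(200)
    print(f"{fn.__name__}: {(time.perf_counter() - t0) * 1000:.1f} ms")
```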

Why the default isn’t “keep it alive”

Reasonable question: if reusing the executor is this much faster, why doesn’t LangGraph just keep it alive across invocations?

Short answer: shared-state safety. A cached executor means cancellation, error handling, and cleanup semantics all change subtly. If an invocation errors, should the shared pool recover cleanly? What if the user is running multi-tenant? What about memory leaks from accumulated future references? These are solvable problems but they’re invasive enough that LangGraph’s default — a fresh pool per call — is the conservative, correct choice for a general-purpose framework.
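The narrowest version of the error question has a benign answer in plain Python: a ThreadPoolExecutor survives a task that raises, because the exception is captured in the Future rather than crashing the worker thread. A quick stdlib-only check:

```python
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=2)

def boom():
    raise RuntimeError("node failed")

f = pool.submit(boom)
try:
    f.result()  # re-raises the stored exception here, in the caller
except RuntimeError:
    pass

# The pool is still usable after a failed task: the exception lived
# in the Future, and the worker thread went back to its wait loop.
assert pool.submit(lambda: 42).result() == 42
```

The harder parts are the ones plain Python doesn’t answer for you: cancellation mid-flight, tenant isolation, and making sure no invocation keeps references to stale futures alive. Those are the problems a cached pool actually has to own.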

fast-langgraph takes on the cost of solving them because we can afford to: our target is production workloads where the trade-off is clearly worth it.

How the shim fixes it

The shim replaces LangGraph’s executor construction with a call into a cached pool. The first invocation creates the pool; subsequent invocations reuse it. The pool lives for the process lifetime.

Safety guarantees:

  • Isolation between invocations via the standard Future API — individual futures don’t leak state
  • Bounded concurrency — the pool size is configurable; we default to a reasonable cpu-count-based value
  • Clean shutdown — on process exit, the pool is shut down cleanly via atexit
  • Observable — fast_langgraph.shim.print_status() tells you whether the cached pool is active
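A minimal sketch of the pattern, under stated assumptions: the names here (get_cached_executor, a print_status analogue) are hypothetical, and the shim's real internals are more involved than a lazy singleton.

```python
import atexit
import os
import threading
from concurrent.futures import ThreadPoolExecutor

_lock = threading.Lock()
_pool = None  # process-lifetime cached executor


def get_cached_executor(max_workers=None):
    """Return the shared pool, creating it on first use.

    Double-checked locking keeps the fast path lock-free once the
    pool exists; atexit handles clean shutdown at process exit.
    """
    global _pool
    if _pool is None:
        with _lock:
            if _pool is None:
                # Same cpu-count-based default Python itself uses.
                workers = max_workers or min(32, (os.cpu_count() or 1) + 4)
                _pool = ThreadPoolExecutor(max_workers=workers)
                atexit.register(_pool.shutdown)
    return _pool


def print_status():
    # Rough analogue of the shim's status helper (hypothetical shape).
    print("cached pool active" if _pool is not None else "no cached pool")
```

The first call pays the construction cost once; every later invocation gets the same pool back and goes straight to submitting work.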

On our benchmark workload, enabling the cached executor alone delivered a 2.3× speedup on the invocation path. Combined with Rust apply_writes (another 1.2×), the shim’s end-to-end win is ~2.8× on realistic workloads.

What this means for you

If your graphs are short and your per-invocation latency is dominated by “LangGraph itself” — i.e., the profiler shows a lot of time in executor and channel code and not much in your own nodes — the shim will be a disproportionately large win for you. Enable it and move on.

If your graphs are long or LLM-bound, the shim still helps, but it’s dwarfed by other savings you’ll get from RustSQLiteCheckpointer or @cached. Different bottleneck; different lever.

The broader pattern

The executor churn problem is a specific case of a general pattern: fixed costs per invocation matter disproportionately in systems that invoke often. LangGraph invocations are relatively expensive at the fixed-cost level because the framework is doing a lot of setup work (building execution graphs, registering channel listeners, spinning up workers) that the typical user neither sees nor thinks about. For teams running one-off graphs in development, this fixed cost is irrelevant. For teams running graphs thousands of times per minute in production, it’s the whole game.

Whenever you profile a system with high invocation frequency and find significant wall-clock in “setup” code, you’ve probably found an opportunity like this one. The fix is usually a cache with careful lifetime management.


Frequently asked questions

Why does LangGraph recreate the executor every invocation?

It's a safe default — starting fresh avoids shared-state issues between calls and matches the Python concurrent.futures conventions most developers expect. The downside is that executor construction is expensive relative to the work most graph nodes actually do, so short-lived invocations pay a disproportionate fixed cost.

Why doesn't the built-in executor reuse solve this?

Python's concurrent.futures module has no built-in executor reuse across calls — you'd have to maintain the pool as a process-level singleton. LangGraph doesn't do this because passing a shared executor changes semantics around clean-up, cancellation, and error isolation. fast-langgraph's shim takes on that responsibility explicitly.

Is the cached executor safe for concurrent invocations?

Yes. The cached pool is a standard ThreadPoolExecutor sized to your workload, shared across invocations. Individual invocations get their own futures and don't share state. The only risk is if you were relying on the old semantics of getting a fresh pool per call — we haven't seen a production workload that does this intentionally.