Executor churn: the 58% problem in LangGraph invocations
LangGraph creates a fresh ThreadPoolExecutor on every invoke call. On short graphs, we measured this consuming 58% of total wall clock time — the graph itself hasn't even started running yet. The shim's executor cache eliminates this and delivers a 2.3x speedup on the invocation path alone.
This is the finding that most surprised us when we started profiling real LangGraph workloads. If you’d asked us ahead of time where invocation overhead was, we’d have guessed: channel updates, state merging, checkpointer overhead, maybe Pydantic validation. All reasonable guesses. None of them came close.
The actual answer: thread pool construction.
What we measured
We ran a minimal LangGraph workload in a tight loop — a small state, a handful of nodes, no real LLM calls (we stubbed them), no checkpointing. Pure invocation overhead. The kind of profile you’d use to answer “how much does LangGraph itself cost, ignoring what my nodes do?”
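For reference, the harness was roughly this shape. This is a sketch, not the actual benchmark code: `StubGraph` is a hypothetical stand-in for a compiled graph (the real runs used an actual LangGraph graph with stubbed node bodies):

```python
import time

class StubGraph:
    """Hypothetical stand-in for a compiled LangGraph graph.
    Swap in your own compiled graph to reproduce the measurement."""
    def invoke(self, state):
        # Node bodies stubbed out: no LLM calls, no checkpointing.
        return state

def time_invocations(graph, n=100):
    """Wall-clock time, in ms, for n back-to-back invoke calls."""
    state = {"messages": []}  # small state, as in the benchmark
    start = time.perf_counter()
    for _ in range(n):
        graph.invoke(state)
    return (time.perf_counter() - start) * 1000
```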
On a 100-invocation hot loop, we got this breakdown:
| Phase | Time | Share |
|---|---|---|
| ThreadPoolExecutor setup | 2506 ms | 58.0% |
| Node dispatch | 1055 ms | 24.4% |
| `apply_writes` | 482 ms | 11.2% |
| Channel bookkeeping | 277 ms | 6.4% |
| **Total wall clock** | **4320 ms** | **100%** |
More than half the wall clock — 58% — was spent inside ThreadPoolExecutor.__init__ and the associated futures registration. The actual graph logic got the minority share.
This was not a synthetic microbenchmark meant to make a point. It was a profile of a LangGraph workload that a real team was running in production and couldn’t figure out why it was slow.
Why this happens
LangGraph uses concurrent.futures.ThreadPoolExecutor to dispatch node work. On every call to graph.invoke(...), a fresh executor is constructed. That means:
- Thread workers are spawned (`os.fork` isn’t used on Linux here, but the Python-level setup has non-trivial fixed cost)
- Internal work queues are allocated
- Future objects are allocated
- Worker threads enter their wait loops
All of this is O(1) per invocation, but the constant is large relative to the work a short graph actually does. A 5-node graph might execute its nodes in 2 ms but spend 50 ms setting up the executor around them. That’s a 25× overhead ratio.
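The fixed cost is easy to reproduce outside LangGraph entirely. Here is a minimal stdlib-only sketch comparing a fresh pool per call (LangGraph's default behavior) with a single reused pool; the worker count of 4 and the toy task are arbitrary choices for illustration:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def tiny_task(x):
    return x + 1  # stands in for a cheap graph node

N = 100

# Fresh executor per "invocation": pay construction + teardown every time.
start = time.perf_counter()
for _ in range(N):
    with ThreadPoolExecutor(max_workers=4) as pool:
        pool.submit(tiny_task, 1).result()
fresh_ms = (time.perf_counter() - start) * 1000

# One cached executor reused across all invocations.
cached_pool = ThreadPoolExecutor(max_workers=4)
start = time.perf_counter()
for _ in range(N):
    cached_pool.submit(tiny_task, 1).result()
cached_ms = (time.perf_counter() - start) * 1000
cached_pool.shutdown()

print(f"fresh: {fresh_ms:.1f} ms  cached: {cached_ms:.1f} ms")
```

The gap between the two numbers is pure construction and teardown cost: the work submitted is identical in both loops.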
As graphs get longer, the overhead ratio falls. If your graph runs for 10 seconds, 50 ms of executor setup is 0.5%. But most production LangGraph workloads aren’t 10-second graphs — they’re sub-second responses to user requests, and that’s where the 58% number comes from.
Why the default isn’t “keep it alive”
Reasonable question: if reusing the executor is this much faster, why doesn’t LangGraph just keep it alive across invocations?
Short answer: shared-state safety. A cached executor means cancellation, error handling, and cleanup semantics all change subtly. If an invocation errors, should the shared pool recover cleanly? What if the user is running multi-tenant? What about memory leaks from accumulated future references? These are solvable problems but they’re invasive enough that LangGraph’s default — a fresh pool per call — is the conservative, correct choice for a general-purpose framework.
fast-langraph takes on the cost of solving them because we can afford to: our target is production workloads where the trade-off is clearly worth it.
How the shim fixes it
The shim replaces LangGraph’s executor construction with a call into a cached pool. The first invocation creates the pool; subsequent invocations reuse it. The pool lives for the process lifetime.
Safety guarantees:
- Isolation between invocations via the standard `Future` API — individual futures don’t leak state
- Bounded concurrency — the pool size is configurable; we default to a reasonable cpu-count-based value
- Clean shutdown — on process exit, the pool is shut down cleanly via `atexit`
- Observable — `fast_langgraph.shim.print_status()` tells you whether the cached pool is active
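This is not the shim's actual implementation, but the cached-pool pattern it describes reduces to something like the following sketch. The name `get_cached_executor` and the sizing formula are illustrative, not part of fast-langraph's API:

```python
import atexit
import os
from concurrent.futures import ThreadPoolExecutor

_POOL = None  # process-level singleton; lives for the process lifetime

def get_cached_executor():
    """Create the shared pool on first call, reuse it afterwards."""
    global _POOL
    if _POOL is None:
        # Illustrative cpu-count-based default, mirroring Python 3.8+'s
        # own ThreadPoolExecutor heuristic.
        workers = min(32, (os.cpu_count() or 1) + 4)
        _POOL = ThreadPoolExecutor(max_workers=workers)
        atexit.register(_POOL.shutdown)  # clean shutdown at process exit
    return _POOL
```

Every call after the first returns the same warm pool, so the construction cost is paid exactly once per process.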
On our benchmark workload, enabling the cached executor alone delivered a 2.3× speedup on the invocation path. Combined with Rust apply_writes (another 1.2×), the shim’s end-to-end win is ~2.8× on realistic workloads.
What this means for you
If your graphs are short and your per-invocation latency is dominated by “LangGraph itself” — i.e., the profiler shows a lot of time in executor and channel code and not much in your own nodes — the shim will be a disproportionately large win for you. Enable it and move on.
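To check, wrap a hot loop in `cProfile` and sort by the cumulative column; the helper name below is ours, not part of any library. If `ThreadPoolExecutor.__init__` sits near the top, executor churn is your bottleneck:

```python
import cProfile
import io
import pstats

def profile_invocations(fn, n=100):
    """Profile n back-to-back calls to fn and return the top-10
    entries sorted by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    for _ in range(n):
        fn()
    profiler.disable()
    out = io.StringIO()
    pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(10)
    return out.getvalue()

# Usage: print(profile_invocations(lambda: graph.invoke(state)))
```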
If your graphs are long or LLM-bound, the shim still helps, but it’s dwarfed by other savings you’ll get from RustSQLiteCheckpointer or @cached. Different bottleneck; different lever.
The broader pattern
The executor churn problem is a specific case of a general pattern: fixed costs per invocation matter disproportionately in systems that invoke often. LangGraph invocations are relatively expensive at the fixed-cost level because the framework is doing a lot of setup work (building execution graphs, registering channel listeners, spinning up workers) that the typical user neither sees nor thinks about. For teams running one-off graphs in development, this fixed cost is irrelevant. For teams running graphs thousands of times per minute in production, it’s the whole game.
Whenever you profile a system with high invocation frequency and find significant wall-clock in “setup” code, you’ve probably found an opportunity like this one. The fix is usually a cache with careful lifetime management.
See also
- Automatic shim mode — how to enable it
- Scaling LangGraph in production — the other two bottlenecks
- Profiling bottlenecks — how to check if this affects you
Frequently asked questions
Why does LangGraph recreate the executor every invocation?
It's a safe default — starting fresh avoids shared-state issues between calls and matches the Python concurrent.futures conventions most developers expect. The downside is that executor construction is expensive relative to the work most graph nodes actually do, so short-lived invocations pay a disproportionate fixed cost.
Why doesn't the built-in executor reuse solve this?
Python's concurrent.futures module has no built-in executor reuse across calls — you'd have to maintain the pool as a process-level singleton. LangGraph doesn't do this because passing a shared executor changes semantics around clean-up, cancellation, and error isolation. fast-langraph's shim takes on that responsibility explicitly.
Is the cached executor safe for concurrent invocations?
Yes. The cached pool is a standard ThreadPoolExecutor sized to your workload, shared across invocations. Individual invocations get their own futures and don't share state. The only risk is if you were relying on the old semantics of getting a fresh pool per call — we haven't seen a production workload that does this intentionally.