profiling · diagnostics · performance

Find LangGraph bottlenecks with GraphProfiler

Neul Labs · Level: intermediate · Read: 5 min
TL;DR

Wrap graph.invoke in profiler.profile_run(). Get a per-node breakdown of time spent in each step and channel update. Use it to decide which fast-langgraph component to adopt next.

The best optimization is the one you didn’t need to make. Before swapping in RustSQLiteCheckpointer, wrapping calls in @cached, or enabling the shim, you should know where the time is actually going in your graph.

GraphProfiler is a low-overhead profiler built into fast-langgraph that gives you exactly that.

Usage

from fast_langgraph.profiler import GraphProfiler

profiler = GraphProfiler()

with profiler.profile_run():
    result = graph.invoke(input_data)

profiler.print_report()

Output looks roughly like:

Graph Profile Report
--------------------
Total wall: 412.3 ms
  node_retrieve       118.2 ms  (28.7%)
  node_rerank           4.1 ms  ( 1.0%)
  node_generate       182.5 ms  (44.3%)
  checkpoint_put       87.4 ms  (21.2%)
  channel_update        9.3 ms  ( 2.3%)
  executor_setup       10.8 ms  ( 2.6%)

Already, you can see the picture: 44% of the wall clock is in the LLM node, 29% in retrieval, 21% in checkpoint persistence, and the rest is small change.
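The report is just arithmetic over the per-item timings. A quick sanity check, using the numbers from the sample report above, reproduces the percentages and ranks the line items:

```python
# Per-item timings from the sample report above (milliseconds).
timings_ms = {
    "node_retrieve": 118.2,
    "node_rerank": 4.1,
    "node_generate": 182.5,
    "checkpoint_put": 87.4,
    "channel_update": 9.3,
    "executor_setup": 10.8,
}

total = sum(timings_ms.values())  # 412.3 ms total wall
ranked = sorted(timings_ms.items(), key=lambda kv: kv[1], reverse=True)

for name, ms in ranked:
    print(f"{name:<16} {ms:>7.1f} ms  ({100 * ms / total:4.1f}%)")
```

Sorting first means the biggest line item is always the first line you read, which is the only one that matters for picking your next fix.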

Interpreting the output

Use this decision tree once you have numbers:

  • Checkpoint dominates (>15% of wall clock) → adopt RustSQLiteCheckpointer. This is the single biggest lever in fast-langgraph.
  • LLM calls dominate and have high duplication → wrap them in @cached. Check the hit rate afterwards.
  • Executor setup is >5% of wall clock → enable the shim. You’ll never see executor_setup again.
  • State update hot path is visible → look at langgraph_state_update for direct Rust state merging.
  • Your own node code dominates → fast-langgraph won’t help. Profile your node internals with cProfile or py-spy instead.
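The tree is mechanical enough to sketch as a function. This is illustrative, not a fast-langgraph API: the key names follow the sample report, the thresholds are the ones stated above, and the duplication check is a flag you would set yourself from your own cache-hit estimates:

```python
def recommend_fix(timings_ms: dict, llm_has_duplication: bool = False) -> str:
    """Map a profile to the single biggest fix, per the decision tree above.

    Hypothetical helper -- key names match the sample report in this post.
    """
    total = sum(timings_ms.values())
    pct = {name: 100 * ms / total for name, ms in timings_ms.items()}

    if pct.get("checkpoint_put", 0) > 15:
        return "adopt RustSQLiteCheckpointer"   # biggest lever
    if llm_has_duplication:
        return "wrap LLM calls in @cached"
    if pct.get("executor_setup", 0) > 5:
        return "enable the shim"
    if pct.get("channel_update", 0) > 5:
        return "look at langgraph_state_update"
    return "profile node internals with cProfile or py-spy"

# The sample report: checkpoint_put is 21.2% of wall clock.
sample = {"node_generate": 182.5, "node_retrieve": 118.2, "checkpoint_put": 87.4,
          "executor_setup": 10.8, "channel_update": 9.3, "node_rerank": 4.1}
print(recommend_fix(sample))  # → adopt RustSQLiteCheckpointer
```

Note the ordering encodes the priorities from the list: checkpointing is checked first because it is the biggest lever, and the cProfile/py-spy fallback catches the case where your own code dominates.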

Overhead

We measured the profiler at 1.6 μs per operation — well below the noise floor of anything you’d care about. On a 10,000-iteration microbenchmark we saw 16 ms of total profiling overhead (57% relative, because the underlying operation was already ~3 μs). In real graphs with 100+ ms per super-step, the overhead is invisible.
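The microbenchmark numbers hang together; here is the back-of-the-envelope check (the 2.8 μs base cost is implied by the stated ~3 μs and 57% figures, not measured here):

```python
overhead_us = 1.6   # profiler cost per operation, as measured above
iterations = 10_000
base_us = 2.8       # implied per-op cost of the bare operation (~3 us)

total_overhead_ms = overhead_us * iterations / 1000
relative = overhead_us / base_us

print(f"total overhead: {total_overhead_ms:.0f} ms")  # → 16 ms
print(f"relative overhead: {relative:.0%}")           # → 57%
```

Against a 100 ms super-step, the same 1.6 μs is a relative overhead of roughly 0.002% — which is why it only looks alarming on microbenchmarks.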

Run the profiler in your staging environment against a representative workload. Don’t run it in prod.

Common surprises

“Checkpoint is taking how much?” On large agent state (messages + tool outputs + memory scratchpad), we’ve seen teams discover they were spending 40% of wall clock on deepcopy alone. The profiler makes this visible — and then RustSQLiteCheckpointer erases it.
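You can estimate your own deepcopy cost before reaching for the profiler. The state shape below is invented for illustration (messages + tool outputs + scratchpad, per the paragraph above); substitute your real state object:

```python
import copy
import timeit

# Made-up agent state, roughly shaped like a real one.
state = {
    "messages": [{"role": "user", "content": "x" * 500} for _ in range(200)],
    "tool_outputs": [list(range(100)) for _ in range(50)],
    "scratchpad": {"notes": ["n" * 100 for _ in range(100)]},
}

# Average cost of one full deepcopy -- what a naive checkpointer pays per step.
per_copy_ms = timeit.timeit(lambda: copy.deepcopy(state), number=50) / 50 * 1000
print(f"deepcopy: {per_copy_ms:.2f} ms per checkpoint")
```

Multiply that number by your super-step count and compare it against total wall clock; if it lands anywhere near the 40% figure above, checkpointing is your first fix.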

“I thought my LLM was the bottleneck.” Sometimes it is. Sometimes it’s 30% LLM, 30% checkpoint, 30% executor setup, and 10% everything else. You won’t know until you measure.

“Why is executor_setup so high on my short graph?” That’s the 58% executor churn problem. Shorter graphs spend a disproportionate fraction of their wall clock in thread pool construction because the actual work is cheap. The shim fixes it.
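The effect is easy to reproduce outside LangGraph: compare constructing a fresh thread pool per step against reusing one. This toy benchmark is ours, not the source of the 58% figure:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def step():
    return 1 + 1  # stand-in for a cheap graph step

def churn(n):
    """New pool per step -- pays thread construction every iteration."""
    t0 = time.perf_counter()
    for _ in range(n):
        with ThreadPoolExecutor(max_workers=4) as ex:
            ex.submit(step).result()
    return time.perf_counter() - t0

def reuse(n):
    """One shared pool -- construction cost is paid once."""
    t0 = time.perf_counter()
    with ThreadPoolExecutor(max_workers=4) as ex:
        for _ in range(n):
            ex.submit(step).result()
    return time.perf_counter() - t0

n = 200
print(f"per-step pool: {churn(n):.3f}s   shared pool: {reuse(n):.3f}s")
```

The cheaper the step, the worse the ratio — exactly the pattern the profiler surfaces as a fat executor_setup line on short graphs.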

What’s next

Armed with a profile, pick the single biggest line item and adopt the targeted fix. Re-profile. Rinse and repeat. Most teams get their first 2–3× win from a single change.

If you’d rather have us do it, we offer production audits that include profiling, prioritization, and implementation.