Find LangGraph bottlenecks with GraphProfiler
Wrap graph.invoke() in profiler.profile_run() to get a per-node breakdown of time spent in each step and channel update, then use it to decide which fast-langgraph component to adopt next.
The best optimization is the one you didn’t need to make. Before swapping in RustSQLiteCheckpointer, enabling @cached, or turning on the shim, you should know where the time in your graph is actually going.
GraphProfiler is a low-overhead profiler built into fast-langgraph that gives you exactly that.
Usage
from fast_langgraph.profiler import GraphProfiler
profiler = GraphProfiler()
with profiler.profile_run():
    result = graph.invoke(input_data)
profiler.print_report()
Output looks roughly like:
Graph Profile Report
--------------------
Total wall: 412.3 ms
node_retrieve 118.2 ms (28.7%)
node_rerank 4.1 ms ( 1.0%)
node_generate 182.5 ms (44.3%)
checkpoint_put 87.4 ms (21.2%)
channel_update 9.3 ms ( 2.3%)
executor_setup 10.8 ms ( 2.6%)
The picture is already clear: 44% is in the LLM node, 21% is in checkpoint persistence, and the rest is small change.
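Under the hood, a report like this only requires accumulating perf_counter deltas per labeled section. Here is a minimal sketch of the idea — MiniProfiler is an illustrative toy, not fast-langgraph's actual implementation:

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class MiniProfiler:
    """Toy wall-clock profiler: accumulates elapsed time per labeled section."""
    def __init__(self):
        self.timings = defaultdict(float)

    @contextmanager
    def section(self, label):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.timings[label] += time.perf_counter() - start

    def report(self):
        total = sum(self.timings.values())
        return {label: (t, 100 * t / total) for label, t in self.timings.items()}

profiler = MiniProfiler()
with profiler.section("node_work"):
    time.sleep(0.01)   # stand-in for a graph node
with profiler.section("checkpoint"):
    time.sleep(0.005)  # stand-in for checkpoint persistence

for label, (t, pct) in profiler.report().items():
    print(f"{label:12s} {t * 1000:7.1f} ms ({pct:4.1f}%)")
```

A real profiler instruments the graph runtime so you don't annotate sections by hand, but the accounting is the same.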
Interpreting the output
Use this decision tree once you have numbers:
- Checkpoint dominates (>15% of wall clock) → adopt RustSQLiteCheckpointer. This is the single biggest lever in fast-langgraph.
- LLM calls dominate and have high duplication → wrap them in @cached. Check the hit rate afterwards.
- Executor setup is >5% of wall clock → enable the shim. You’ll never see executor_setup again.
- State update hot path is visible → look at langgraph_state_update for direct Rust state merging.
- Your own node code dominates → fast-langgraph won’t help. Profile your node internals with cProfile or py-spy instead.
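The branches above can be mechanized. A hypothetical helper — the function name is invented, the thresholds are the ones quoted in the list, and the keys match the sample report:

```python
def recommend(breakdown):
    """Map a {component: fraction_of_wall_clock} profile to the next fix.

    Mirrors the decision tree above. Fractions are in [0, 1].
    """
    if breakdown.get("checkpoint_put", 0) > 0.15:
        return "adopt RustSQLiteCheckpointer"
    if breakdown.get("executor_setup", 0) > 0.05:
        return "enable the shim"
    llm = breakdown.get("node_generate", 0)
    if llm >= max(breakdown.values()):
        return "if calls are duplicated, wrap LLM nodes in @cached"
    return "profile your own node code with cProfile or py-spy"

# The sample report from above: checkpoint_put clears the 15% bar first.
print(recommend({
    "node_retrieve": 0.287, "node_rerank": 0.010, "node_generate": 0.443,
    "checkpoint_put": 0.212, "channel_update": 0.023, "executor_setup": 0.026,
}))
```

The ordering encodes a judgment call: checkpoint and executor fixes are cheap wins, so check them before touching LLM nodes.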
Overhead
We measured the profiler at 1.6 μs per operation — well below the noise floor of anything you’d care about. On a 10,000-iteration microbenchmark we saw 16 ms of total profiling overhead (57% relative, because the underlying operation was already ~3 μs). In real graphs with 100+ ms per super-step, the overhead is invisible.
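You can sanity-check numbers like these on your own machine. Here is a generic harness for estimating the per-call cost a wrapper adds; per_call_overhead and traced_work are hypothetical names, and the absolute figures will vary by hardware:

```python
import time

def per_call_overhead(base_fn, wrapped_fn, n=100_000):
    """Estimate the extra wall time per call that wrapping adds."""
    t0 = time.perf_counter()
    for _ in range(n):
        base_fn()
    base = time.perf_counter() - t0

    t0 = time.perf_counter()
    for _ in range(n):
        wrapped_fn()
    wrapped = time.perf_counter() - t0
    return (wrapped - base) / n

def work():
    return sum(range(10))  # a microsecond-scale "operation"

timings = []
def traced_work():
    # Same operation, plus the kind of bookkeeping a profiler does per call.
    start = time.perf_counter()
    result = work()
    timings.append(time.perf_counter() - start)
    return result

overhead = per_call_overhead(work, traced_work)
print(f"~{overhead * 1e6:.2f} µs of tracing overhead per call")
```

The relative overhead looks enormous only because the base operation is tiny; against a 100 ms super-step the same absolute cost disappears.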
Run the profiler in your staging environment against a representative workload. Don’t run it in prod.
Common surprises
“Checkpoint is taking how much?”
On large agent state (messages + tool outputs + memory scratchpad), we’ve seen teams discover they were spending 40% of wall clock on deepcopy alone. The profiler makes this visible — and then RustSQLiteCheckpointer erases it.
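If you want to feel that cost directly, deep-copy a realistically sized state yourself. The state shape below is made up for illustration:

```python
import copy
import time

# A made-up agent state: long message history, tool outputs, scratchpad.
state = {
    "messages": [{"role": "user", "content": "x" * 500} for _ in range(2_000)],
    "tool_outputs": [list(range(100)) for _ in range(200)],
    "scratchpad": {"notes": ["n" * 200] * 500},
}

start = time.perf_counter()
snapshot = copy.deepcopy(state)  # what a naive checkpointer does every super-step
elapsed = time.perf_counter() - start
print(f"deepcopy of agent state: {elapsed * 1000:.1f} ms")
```

Multiply that by one checkpoint per super-step and the 40% figure stops being surprising.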
“I thought my LLM was the bottleneck.” Sometimes it is. Sometimes it’s 30% LLM, 30% checkpoint, 30% executor setup, and 10% everything else. You won’t know until you measure.
“Why is executor_setup so high on my short graph?” That’s the 58% executor churn problem. Shorter graphs spend a disproportionate fraction of their wall clock in thread pool construction because the actual work is cheap. The shim fixes it.
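The churn is easy to reproduce with the standard library: construct a fresh ThreadPoolExecutor per step versus reusing one. The ratio, not the absolute numbers, is the point:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def tiny_task():
    return 1 + 1  # the "cheap actual work"

N = 200

# Fresh pool per step: thread construction dominates when the work is trivial.
start = time.perf_counter()
for _ in range(N):
    with ThreadPoolExecutor(max_workers=4) as pool:
        pool.submit(tiny_task).result()
churn = time.perf_counter() - start

# One long-lived pool, which is what shim-style reuse buys you.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    for _ in range(N):
        pool.submit(tiny_task).result()
reused = time.perf_counter() - start

print(f"fresh pool each step: {churn * 1000:.0f} ms")
print(f"reused pool:          {reused * 1000:.0f} ms")
```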
What’s next
Armed with a profile, pick the single biggest line item and adopt the targeted fix. Re-profile. Rinse and repeat. Most teams get their first 2–3× win from a single change.
If you’d rather have us do it, we offer production audits that include profiling, prioritization, and implementation.