LangGraph retry and branching loops re-issue identical LLM calls
Graphs with retries, reflection, branching, or tool routing routinely send the same prompt to an LLM multiple times within a single invocation. Without caching, that duplicates both your API spend and your latency. The fast-langgraph @cached decorator is a one-line wrapper that eliminates the waste, at ~1.4 μs of lookup overhead per call.
The pain
LangGraph’s graph model encourages patterns that re-issue LLM calls:
- Retry policies re-invoke a failed node, which typically re-invokes its LLM call
- Branching reflection (e.g. “judge node” evaluating a draft) re-runs LLM critique on the same content
- Tool-routing nodes classify intent, and the same user input can end up re-classified across a fan-out
- Self-consistency sampling issues the same prompt multiple times to pick a majority answer
On a typical production graph, you might end up making 3–5 LLM calls per request where only 1–2 are logically distinct. The rest are the same prompt, running the same model, getting the same answer. You pay for all of them.
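To make the duplication concrete, here is a toy version of the pattern above: a self-consistency fan-out that sends one prompt three times, plus a retry that sends it a fourth. The LLM is a stub and the names are illustrative; in a real graph each invocation would be a paid API call.

```python
# Toy illustration: every invocation hits the "API", even though
# only one prompt is logically distinct.
llm_calls = []

def call_llm(prompt: str) -> str:
    llm_calls.append(prompt)           # stands in for llm.invoke(prompt)
    return f"answer to: {prompt}"

prompt = "Is this ticket a billing question? Answer yes or no."
samples = [call_llm(prompt) for _ in range(3)]  # self-consistency fan-out
retry = call_llm(prompt)                        # retry after a transient error

print(len(llm_calls), len(set(llm_calls)))      # 4 calls, 1 distinct prompt
```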
Why it’s structural, not a bug
LangGraph can’t automatically dedupe LLM calls — it has no visibility into what “the same call” means. Two call_llm(prompt) invocations might have subtly different prompts (different timestamps, session IDs, or streaming context) and produce legitimately different results. The framework has to assume every call is unique, because for many workloads it is.
But your graph knows which calls are idempotent. You’re the one deciding when to retry, when to reflect, when to self-consistency-sample. You’re also the one seeing the bills and the latency.
The impact
On a RAG workload with reflection and self-consistency enabled, we measured a 90% cache hit rate across a 100-call sequence. Without caching:
- 100 LLM calls
- ~50 seconds of cumulative LLM latency
- ~$0.50 of API spend (rough estimate at GPT-4 prices)
With @cached enabled:
- 10 actual LLM calls (10 unique prompts)
- ~5.5 seconds of cumulative LLM latency
- ~$0.05 of API spend
That’s a 9.78× speedup on the LLM-bound portion and a 90% cost reduction. For a single workload. Running thousands of times per day.
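The arithmetic behind that number: when cache lookups are effectively free relative to LLM latency, the expected speedup is roughly 1 / (1 − hit rate). A quick sanity check:

```python
# Expected speedup when lookups cost ~nothing next to an LLM round trip:
# N calls collapse to N * (1 - hit_rate) real calls.
def expected_speedup(hit_rate: float) -> float:
    return 1.0 / (1.0 - hit_rate)

print(round(expected_speedup(0.90), 2))  # ~10x, close to the measured 9.78x
```

The small gap between the ideal 10× and the measured 9.78× comes from lookup overhead and per-call latency variance in the real workload.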
The fix
```python
from fast_langgraph import cached

@cached(max_size=1000)
def call_llm(prompt: str, **kwargs) -> str:
    # identical (prompt, kwargs) pairs return the cached result
    return llm.invoke(prompt, **kwargs)
```
That’s all it takes. The cache is keyed by the argument tuple, LRU-evicted once it reaches max_size, and adds only ~1.4 μs per lookup.
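If you want to see the caching semantics without the library installed, the stdlib’s functools.lru_cache is a reasonable behavioral stand-in: slower, but the same argument-tuple keying and LRU eviction described above. The stub below is illustrative, not fast-langgraph’s implementation.

```python
# Behavioral sketch of @cached(max_size=1000) using functools.lru_cache.
from functools import lru_cache

calls = 0

@lru_cache(maxsize=1000)
def call_llm(prompt: str) -> str:
    global calls
    calls += 1                          # counts real "API" invocations
    return f"answer to: {prompt}"       # stand-in for llm.invoke(prompt)

for _ in range(5):
    call_llm("classify this intent")    # same prompt, five times

print(calls)  # 1 real call; the other 4 were served from the cache
```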
Full guide with tuning tips: Caching LLM calls with @cached.
When not to cache
- If your prompts always include a fresh timestamp or ID, caching won’t help (hit rate zero)
- If you’re running `temperature > 0` and you want variety across retries, caching defeats that
- If your calls have contextual differences that are meaningful even when the “main” prompt is identical
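The first case is easy to demonstrate: if every prompt embeds a fresh ID, no two cache keys ever match, and the cache is pure overhead. This sketch uses a counter as a stand-in for a timestamp or session ID, and lru_cache as a stand-in for @cached.

```python
# Prompts that embed a fresh ID never repeat, so every lookup misses.
from functools import lru_cache
from itertools import count

fresh_id = count()  # stands in for a timestamp or session ID
calls = 0

@lru_cache(maxsize=100)
def call_llm(prompt: str) -> str:
    global calls
    calls += 1
    return "response"

for _ in range(3):
    call_llm(f"[request {next(fresh_id)}] summarize the report")

print(calls)  # 3 misses out of 3 calls: hit rate is zero
```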
Run call_llm.cache_stats() after a benchmark. If the hit rate is under 20%, remove the decorator and save yourself the complexity. If it’s over 50%, you’re seeing real cost savings and latency improvements.
Related
- Caching LLM calls with @cached — the full guide
- When not to use fast-langgraph — including when caching doesn’t help
How fast-langgraph addresses this
The @cached decorator wraps any LLM-call function and serves identical prompts from an in-process cache with ~1.4 μs lookup overhead. On a 90%-hit-rate workload we measured a 9.78× speedup on the cache-eligible path and a proportional cost reduction.