
LangGraph retry and branching loops re-issue identical LLM calls

Neul Labs · Severity: medium · Affects: all
TL;DR

Graphs with retries, reflection, branching, or tool routing routinely send the same prompt to an LLM multiple times during a single invocation. Without caching, that duplicates your API spend and latency. The fast-langgraph @cached decorator is a one-line wrapper that eliminates the waste with ~1.4 μs of lookup overhead per call.

The pain

LangGraph’s graph model encourages patterns that re-issue LLM calls:

  • Retry policies re-invoke a failed node, which typically re-invokes its LLM call
  • Branching reflection (e.g. “judge node” evaluating a draft) re-runs LLM critique on the same content
  • Tool-routing nodes classify intent, and the same user input can end up re-classified across a fan-out
  • Self-consistency sampling issues the same prompt multiple times to pick a majority answer

On a typical production graph, you might end up making 3–5 LLM calls per request where only 1–2 are logically distinct. The rest are the same prompt, running the same model, getting the same answer. You pay for all of them.
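The self-consistency pattern makes the duplication easy to see. Here is a minimal stand-in sketch (hypothetical call_llm, not from any specific graph) where a counter tracks how many paid calls one logical question costs:

```python
from collections import Counter

calls = {"n": 0}

def call_llm(prompt: str) -> str:
    # Stand-in for a paid model call; the counter shows the duplication.
    calls["n"] += 1
    return "Paris"

def self_consistency(prompt: str, samples: int = 3) -> str:
    # The *identical* prompt is sent `samples` times and the majority
    # answer wins. Without caching, you pay for every sample.
    answers = [call_llm(prompt) for _ in range(samples)]
    return Counter(answers).most_common(1)[0][0]

answer = self_consistency("Capital of France?")
print(answer, calls["n"])  # → Paris 3 — three paid calls, one distinct prompt
```

Retry and reflection loops have the same shape: the prompt going back to the model on attempt two is byte-for-byte the prompt from attempt one.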

Why it’s structural, not a bug

LangGraph can’t automatically dedupe LLM calls — it has no visibility into what “the same call” means. Two call_llm(prompt) invocations might have subtly different prompts (different timestamps, session IDs, or streaming context) and produce legitimately different results. The framework has to assume every call is unique, because for many workloads it is.

But your graph knows which calls are idempotent. You’re the one deciding when to retry, when to reflect, when to self-consistency-sample. You’re also the one seeing the bills and the latency.

The impact

On a RAG workload with reflection and self-consistency enabled, we measured a 90% cache hit rate across a 100-call sequence. Without caching:

  • 100 LLM calls
  • ~50 seconds of cumulative LLM latency
  • ~$0.50 of API spend (rough estimate at GPT-4 prices)

With @cached enabled:

  • 10 actual LLM calls (10 unique prompts)
  • ~5.5 seconds of cumulative LLM latency
  • ~$0.05 of API spend

That’s a 9.78× speedup on the LLM-bound portion and a 90% cost reduction. For a single workload. Running thousands of times per day.
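The arithmetic generalizes to any hit rate. A back-of-envelope model (illustrative per-call price and latency, treating cache hits as free, which is close enough at ~1.4 μs per lookup):

```python
def savings(total_calls: int, hit_rate: float,
            cost_per_call: float = 0.005, latency_s: float = 0.5) -> dict:
    """Illustrative cost model: only cache misses reach the API."""
    actual = round(total_calls * (1 - hit_rate))
    return {
        "actual_calls": actual,
        "cost_usd": actual * cost_per_call,
        "latency_s": actual * latency_s,
    }

result = savings(100, 0.90)
print(result["actual_calls"])  # → 10
```

Plug in your own traffic volume and prices; the cost reduction tracks the hit rate directly.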

The fix

from fast_langgraph import cached

@cached(max_size=1000)
def call_llm(prompt: str, **kwargs) -> str:
    # llm is your model client (e.g. a LangChain chat model)
    return llm.invoke(prompt, **kwargs)

That’s all it takes. The cache is keyed by the argument tuple, LRU-evicted at max_size, and costs ~1.4 μs per lookup.
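If you want to see the mechanism without installing anything, the standard library’s functools.lru_cache has the same keyed-by-arguments, LRU-evicted semantics (fast_langgraph’s @cached is its own implementation; this is just a conceptual stand-in):

```python
from functools import lru_cache

calls = {"n": 0}

@lru_cache(maxsize=1000)  # same idea as @cached(max_size=1000)
def call_llm(prompt: str) -> str:
    # Stand-in for the real model call; counts how often it actually runs.
    calls["n"] += 1
    return f"answer:{prompt}"

call_llm("classify this intent")  # miss -> real call
call_llm("classify this intent")  # hit  -> served from cache
call_llm("critique this draft")   # miss -> real call
print(calls["n"])                 # → 2

info = call_llm.cache_info()
print(info.hits, info.misses)     # → 1 2
```

Note that argument-tuple keying means every argument, including kwargs, must be hashable and stable for a lookup to hit.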

Full guide with tuning tips: Caching LLM calls with @cached.

When not to cache

  • If your prompts always include a fresh timestamp or ID, caching won’t help (hit rate zero)
  • If you’re running temperature > 0 and you want variety across retries, caching defeats that
  • If your calls have significant contextual differences that are meaningful even when the “main” prompt is identical
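The first bullet sometimes has a workaround: if the only thing varying is a volatile field like a timestamp or request ID, you can key the cache on the stable part of the prompt. A sketch with a plain dict cache and a hypothetical normalize step (adjust the pattern to your own prompt format):

```python
import re

cache: dict[str, str] = {}
calls = {"n": 0}

def normalize(prompt: str) -> str:
    # Strip volatile fields before keying; pattern is illustrative only.
    return re.sub(r"\[ts:\d+\]\s*", "", prompt)

def call_llm_cached(prompt: str) -> str:
    key = normalize(prompt)
    if key not in cache:
        calls["n"] += 1                # stand-in for the real model call
        cache[key] = f"answer:{key}"
    return cache[key]

call_llm_cached("[ts:1700000000] summarize the doc")
call_llm_cached("[ts:1700000042] summarize the doc")  # new timestamp, same question
print(calls["n"])  # → 1
```

Only do this when the stripped fields genuinely don’t affect the answer; otherwise you’re serving stale responses.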

Run call_llm.cache_stats() after a benchmark. If the hit rate is under 20%, remove the decorator and save yourself the complexity. If it’s over 50%, you’re seeing real cost savings and latency improvements.

How fast-langgraph addresses this

The @cached decorator wraps any LLM call function and returns identical prompts from an in-process cache with ~1.4 μs lookup overhead. On a 90%-hit-rate workload we measured a 9.78× speedup on the cache-eligible path and a proportional cost reduction.