caching · llm · cost-optimization

Cache LLM calls with @cached for a 10x speedup

Neul Labs · Level: beginner · Read: 5 min
TL;DR

Wrap your LLM call in @cached. Identical prompts return from an in-process cache in ~1.4 μs instead of hitting the API. On a 90%-hit-rate workload, that's a 9.78× speedup and a direct cost reduction.

LangGraph graphs have loops, retries, branching, and tool routing. The same prompts end up getting sent to the LLM over and over — especially during RAG, planning, and reflection phases. You’re burning latency and API spend on calls you’ve already made.

@cached is a one-line wrapper that fixes this.

The problem in one picture

from langgraph.graph import StateGraph

def call_llm(prompt: str) -> str:
    return llm.invoke(prompt)

# Inside your graph, called multiple times per invocation:
response = call_llm("Summarize the following transcript: ...")
# ... branch, retry, reflect ...
response = call_llm("Summarize the following transcript: ...")  # Same prompt!

Each call hits the LLM API: ~500 ms of latency and ~$0.002 in GPT-4 spend. Even a modest graph that makes two such calls per request wastes half a second and 20% of its budget on redundancy.

The fix

from fast_langgraph import cached

@cached(max_size=1000)
def call_llm(prompt: str) -> str:
    return llm.invoke(prompt)

That’s the entire change. The decorator gives you:

  • Content-addressed lookups — hashed by the full argument tuple, so the same prompt returns the same result
  • LRU eviction once max_size is reached
  • Sub-microsecond overhead on cache hits (1.38 μs in our benchmarks)
  • Per-function cache_stats() so you can see exactly what’s hitting
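
Under the hood this behaves like a content-addressed LRU. Here is a minimal sketch of the idea — `cached_sketch` is a hypothetical stand-in for illustration, not fast_langgraph's actual implementation:

```python
import functools
from collections import OrderedDict

def cached_sketch(max_size=1000):
    """Illustrative content-addressed LRU decorator (not the real @cached)."""
    def decorator(fn):
        store = OrderedDict()
        stats = {"hits": 0, "misses": 0}

        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            # Key on the full argument tuple: identical prompts → same entry.
            key = (args, tuple(sorted(kwargs.items())))
            if key in store:
                stats["hits"] += 1
                store.move_to_end(key)       # refresh LRU position
                return store[key]
            stats["misses"] += 1
            result = fn(*args, **kwargs)
            store[key] = result
            if len(store) > max_size:        # evict least-recently-used
                store.popitem(last=False)
            return result

        wrapper.cache_stats = lambda: {**stats, "size": len(store)}
        return wrapper
    return decorator
```

The key detail is that lookups are by value, not identity: any call site that produces the same prompt string shares the entry.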

Measure it

# First call — misses
response = call_llm("What is LangGraph?")   # ~500 ms

# Second call — hits
response = call_llm("What is LangGraph?")   # ~0.01 ms

print(call_llm.cache_stats())
# {'hits': 1, 'misses': 1, 'size': 1}

On our benchmark workload with a 90% cache hit rate (see benchmarks):

              Without cache    With cache    Speedup
Total time    108.48 ms        11.09 ms      9.78×

The speedup grows as 1 / (1 − hit rate): a graph where 50% of LLM calls are redundant will see ~2× on the LLM-bound portion, and at 90% it approaches 10×.
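
The arithmetic behind that scaling, as a small helper (`expected_speedup` is hypothetical; the 500 ms miss cost and 1.4 μs hit cost are the per-call figures used above):

```python
def expected_speedup(hit_rate, miss_ms=500.0, hit_ms=0.0014):
    """Amdahl-style estimate: average per-call time without vs. with the cache."""
    uncached = miss_ms
    cached = hit_rate * hit_ms + (1 - hit_rate) * miss_ms
    return uncached / cached

# expected_speedup(0.5) ≈ 2×, expected_speedup(0.9) ≈ 10×
```

Because the hit cost is negligible next to the miss cost, the miss fraction dominates: halving the misses roughly doubles the throughput of the LLM-bound portion.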

What to cache (and what not to)

Good candidates:

  • Tool-routing prompts that classify user intent
  • RAG retrieval-augmentation prompts with stable passages
  • Reflection and self-critique passes on static content
  • Test and evaluation runs on fixed fixtures

Bad candidates:

  • Conversational turns where context always differs
  • Calls with timestamps, IDs, or random seeds baked into the prompt
  • Calls where the temperature is non-zero and you want variety
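
If a prompt is uncacheable only because it embeds volatile tokens, you can sometimes rescue it by normalizing before the call. A hypothetical pre-processing step — `normalize_prompt` is not part of fast_langgraph, and the patterns should be adapted to your prompts:

```python
import re

def normalize_prompt(prompt: str) -> str:
    """Strip volatile tokens so otherwise-identical prompts share a cache entry."""
    # Replace ISO-8601 timestamps like 2024-05-01T12:30:00
    prompt = re.sub(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}", "<TIME>", prompt)
    # Replace UUID-style request IDs
    prompt = re.sub(
        r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
        "<ID>", prompt)
    return prompt
```

Call the LLM with the original prompt but key the cache on the normalized one — only if the stripped tokens genuinely don't change the answer you want.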

When in doubt, add @cached, run your test suite, and check cache_stats(). If the hit rate is near zero, it’s not helping — remove it.

Cache invalidation

@cached is an in-process LRU. Restart the process and the cache is gone. For cross-process sharing, wrap your LLM client in a Redis-backed layer first, then stack @cached on top for the hot path.
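
That layering might look like the following sketch. `DictStore` is a dict-backed stand-in with the same get/set shape as a redis-py client, so you can swap in `redis.Redis()` for real cross-process sharing — an assumption about your deployment, not fast_langgraph API:

```python
import hashlib

class DictStore:
    """Stand-in for a redis-py client (same get/set interface), so this runs anywhere."""
    def __init__(self):
        self.d = {}
    def get(self, key):
        return self.d.get(key)
    def set(self, key, value, ex=None):
        self.d[key] = value  # a real Redis client would honor the `ex` TTL

def shared_cached_call(store, call_api, prompt: str) -> str:
    """Check the shared store first; fall through to the API only on a miss."""
    key = "llm:" + hashlib.sha256(prompt.encode()).hexdigest()
    hit = store.get(key)
    if hit is not None:
        return hit
    result = call_api(prompt)            # the slow API round-trip
    store.set(key, result, ex=3600)      # share across processes for an hour
    return result
```

With @cached stacked on top, hot prompts resolve in-process in microseconds, warm prompts resolve from the shared store, and only cold prompts pay the API round-trip.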

Combining with other fast-langgraph features

The LLM cache composes cleanly with the Rust SQLite checkpointer — caching saves LLM round-trips; faster checkpointing saves state-persistence overhead. Together they attack the two biggest sources of production latency.

For a real deep-dive on measuring the impact of both, see scaling LangGraph in production.