# Cache LLM calls with @cached for a 10x speedup
Wrap your LLM call in @cached. Identical prompts return from an in-process cache in ~1.4 μs instead of hitting the API. On a 90%-hit-rate workload, that's a 9.78x speedup and a direct cost reduction.
LangGraph graphs have loops, retries, branching, and tool routing. The same prompts end up getting sent to the LLM over and over — especially during RAG, planning, and reflection phases. You’re burning latency and API spend on calls you’ve already made.
@cached is a one-line wrapper that fixes this.
## The problem in one picture

```python
from langgraph.graph import StateGraph

def call_llm(prompt: str) -> str:
    return llm.invoke(prompt)

# Inside your graph, called multiple times per invocation:
response = call_llm("Summarize the following transcript: ...")
# ... branch, retry, reflect ...
response = call_llm("Summarize the following transcript: ...")  # Same prompt!
```

Each call hits the LLM API: ~500 ms of latency and ~$0.002 of GPT-4 spend. Even a modest graph that makes two such calls per request wastes half a second and 20% of its budget on redundancy.
## The fix

```python
from fast_langgraph import cached

@cached(max_size=1000)
def call_llm(prompt: str) -> str:
    return llm.invoke(prompt)
```
That’s the entire change. The decorator gives you:
- Content-addressed lookups: results are keyed by a hash of the full argument tuple, so the same prompt returns the same result
- LRU eviction once `max_size` is reached
- Sub-microsecond overhead on cache hits (1.38 μs in our benchmarks)
- Per-function `cache_stats()` so you can see exactly what's hitting
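To make those properties concrete, here is a sketch of how such a decorator can be built from an `OrderedDict`. This is illustrative only, not fast_langgraph's actual implementation; `cached_sketch` and its internals are assumptions for the example:

```python
import functools
from collections import OrderedDict

def cached_sketch(max_size=1000):
    """Illustrative LRU cache decorator keyed by the full argument tuple."""
    def decorator(fn):
        cache = OrderedDict()
        stats = {"hits": 0, "misses": 0}

        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            key = (args, tuple(sorted(kwargs.items())))  # content-addressed key
            if key in cache:
                stats["hits"] += 1
                cache.move_to_end(key)       # refresh LRU position
                return cache[key]
            stats["misses"] += 1
            result = fn(*args, **kwargs)
            cache[key] = result
            if len(cache) > max_size:        # evict the least-recently-used entry
                cache.popitem(last=False)
            return result

        wrapper.cache_stats = lambda: {**stats, "size": len(cache)}
        return wrapper
    return decorator
```

The real decorator presumably does more (thread safety, faster hashing), but the shape — hash the arguments, hit the map, evict LRU — is the whole idea.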
## Measure it

```python
# First call — misses
response = call_llm("What is LangGraph?")  # ~500 ms

# Second call — hits
response = call_llm("What is LangGraph?")  # ~0.01 ms

print(call_llm.cache_stats())
# {'hits': 1, 'misses': 1, 'size': 1}
```
On our benchmark workload with a 90% cache hit rate (see benchmarks):
| | Without cache | With cache | Speedup |
|---|---|---|---|
| Total time | 108.48 ms | 11.09 ms | 9.78× |
The speedup grows with your hit rate — roughly 1 / (1 − hit rate) when hits are near-free. A graph with 50% redundant LLM calls will see ~2× on the LLM-bound portion.
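You can sanity-check that relationship with a few lines of arithmetic. The latencies below are the illustrative figures from this post (500 ms per miss, ~1.4 μs per hit), not new measurements:

```python
def expected_speedup(hit_rate, miss_ms=500.0, hit_ms=0.0014):
    """Average per-call latency with the cache vs. always calling the API."""
    avg_cached = hit_rate * hit_ms + (1 - hit_rate) * miss_ms
    return miss_ms / avg_cached

print(round(expected_speedup(0.90), 2))  # → 10.0
print(round(expected_speedup(0.50), 2))  # → 2.0
```

Because hits are five orders of magnitude cheaper than misses, the hit cost barely registers and the speedup is dominated by 1 / (1 − hit rate).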
## What to cache (and what not to)
Good candidates:
- Tool-routing prompts that classify user intent
- RAG retrieval-augmentation prompts with stable passages
- Reflection and self-critique passes on static content
- Test and evaluation runs on fixed fixtures
Bad candidates:
- Conversational turns where context always differs
- Calls with timestamps, IDs, or random seeds baked into the prompt
- Calls where the temperature is non-zero and you want variety
When in doubt, add @cached, run your test suite, and check cache_stats(). If the hit rate is near zero, it’s not helping — remove it.
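For prompts that are *almost* cacheable — identical except for a timestamp or request ID — one workaround (my suggestion, not a fast_langgraph feature) is to normalize the volatile fields before the prompt reaches the cached function. The patterns below are illustrative; adapt them to your own templates:

```python
import re

def normalize_prompt(prompt: str) -> str:
    """Replace volatile tokens so identical requests hash to the same key."""
    prompt = re.sub(r"\d{4}-\d{2}-\d{2}T[\d:.]+Z?", "<TIMESTAMP>", prompt)  # ISO timestamps
    prompt = re.sub(r"request_id=\S+", "request_id=<ID>", prompt)           # request IDs
    return prompt
```

Then call `call_llm(normalize_prompt(raw_prompt))`. This is only safe when the stripped fields genuinely don't affect the expected answer.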
## Cache invalidation
@cached is an in-process LRU. Restart the process and the cache is gone. For cross-process sharing, wrap your LLM client in a Redis-backed layer first, then stack @cached on top for the hot path.
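A minimal sketch of that layering is below. The `DictStore` stands in for the Redis client so the example is self-contained; any object exposing `get`/`set` (such as `redis.Redis` with string keys) fits the same shape:

```python
class DictStore:
    """Stand-in for a shared store; swap in a redis.Redis client in production."""
    def __init__(self):
        self._d = {}
    def get(self, key):
        return self._d.get(key)
    def set(self, key, value):
        self._d[key] = value

class TwoTierCache:
    """A per-process dict for the hot path, backed by a shared cross-process store."""
    def __init__(self, shared):
        self.local = {}
        self.shared = shared

    def get_or_compute(self, key, compute):
        if key in self.local:             # tier 1: in-process, microsecond lookups
            return self.local[key]
        value = self.shared.get(key)      # tier 2: shared across processes
        if value is None:
            value = compute()             # miss in both tiers: pay the LLM round-trip
            self.shared.set(key, value)
        self.local[key] = value           # promote to the hot path
        return value
```

A second process with an empty in-process tier still avoids the LLM call as long as the shared store has the entry.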
## Combining with other fast-langgraph features
The LLM cache composes cleanly with the Rust SQLite checkpointer — caching saves LLM round-trips; faster checkpointing saves state-persistence overhead. Together they attack the two biggest sources of production latency.
For a real deep-dive on measuring the impact of both, see scaling LangGraph in production.