
Stop Paying Twice

Every time a user asks “What’s the capital of France?” you pay for a new API call. With Raptor’s semantic cache, you pay once.

How It Works

Request: "What is the capital of France?"
→ Compute embedding
→ Check cache for similar queries
→ HIT! Return cached response in ~5ms
Unlike traditional caches, we don’t require exact text matches:
Query                               Cache Result
"What is the capital of France?"    Original query
"What's France's capital?"          Cache HIT
"capital of france?"                Cache HIT
"Tell me the capital of France"     Cache HIT
All four return the same cached response: three of the four calls never reach the upstream API, a 75% reduction just from query variations.
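Under the hood this is a nearest-neighbor search over query embeddings. Here is a minimal sketch of the idea, with a toy embed() stand-in and a made-up similarity threshold (Raptor's actual embedding model and threshold are internal details):

import numpy as np

SIMILARITY_THRESHOLD = 0.92  # hypothetical; real thresholds are tuned per workload

cache = []  # list of (embedding, cached response) pairs

def embed(text):
    # Stand-in for a real embedding model; a normalized character-frequency
    # vector is enough to demonstrate the lookup flow.
    v = np.zeros(128)
    for ch in text.lower():
        v[ord(ch) % 128] += 1.0
    norm = np.linalg.norm(v)
    return v / norm if norm else v

def lookup(query):
    # Return a cached response if any stored query is semantically close.
    q = embed(query)
    for emb, response in cache:
        if float(np.dot(emb, q)) >= SIMILARITY_THRESHOLD:  # cosine: vectors are unit-length
            return response  # cache HIT
    return None  # cache MISS

def store(query, response):
    cache.append((embed(query), response))

store("What is the capital of France?", "Paris")
print(lookup("What's France's capital?"))  # HIT if the toy vectors clear the threshold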

Typical Savings

Use Case            Cache Hit Rate    Monthly Savings
FAQ bots            60-80%            ~$400 per $500 spend
Customer support    40-60%            ~$250 per $500 spend
Search/RAG          30-50%            ~$200 per $500 spend
Code assistants     20-40%            ~$150 per $500 spend
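These estimates are simple arithmetic: savings roughly equal spend times hit rate, since each hit avoids one paid upstream call. A back-of-the-envelope helper (assumed formula; it ignores the small cost of the embedding lookup itself):

def estimated_savings(monthly_spend, hit_rate):
    # Every cache hit avoids one paid upstream call.
    return monthly_spend * hit_rate

print(estimated_savings(500, 0.80))  # FAQ bot at the top of its range -> 400.0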

Response Headers

Check cache status in every response:
X-Raptor-Cache: hit              # or "miss"
X-Raptor-Latency-Ms: 5           # Total Raptor time
X-Raptor-Upstream-Latency-Ms: 0  # 0 on cache hit
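If you call Raptor through the OpenAI Python SDK, one way to inspect these headers is the SDK's raw-response mode. A sketch, assuming a hypothetical Raptor endpoint URL:

from openai import OpenAI

client = OpenAI(base_url="https://<your-raptor-endpoint>/v1")  # point the SDK at Raptor

raw = client.chat.completions.with_raw_response.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(raw.headers.get("X-Raptor-Cache"))        # "hit" or "miss"
print(raw.headers.get("X-Raptor-Latency-Ms"))   # e.g. "5"
completion = raw.parse()  # the usual ChatCompletion object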

Streaming Support

Cache hits work with streaming too. We replay cached responses with realistic timing (~15ms between words) so users still see the “typing” effect.
# This hits cache and streams just like the original
# (client: the OpenAI SDK client pointed at Raptor, as configured above)
stream = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "What's the capital of France?"}],
    stream=True
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
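Conceptually, the replay side is just the cached text re-emitted word by word with a short pause. A toy illustration of that pacing (the ~15ms figure comes from above; everything else is illustrative, not Raptor's implementation):

import time

def replay(cached_text, delay_s=0.015):
    # Yield a cached response word by word, mimicking live token streaming.
    for word in cached_text.split():
        yield word + " "
        time.sleep(delay_s)

for token in replay("The capital of France is Paris."):
    print(token, end="", flush=True)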

Cache Behavior

Request Type        Caches?    Serves from Cache?
Non-streaming       Yes        Yes
Streaming           Yes        Yes (replayed)
With tools          No         No
Image generation    No         No

Best Practices

  1. Normalize inputs - remove extra whitespace and lowercase when possible (see the helper below)
  2. Use specific system prompts - helps the cache recognize similar queries
  3. Monitor hit rates - check your dashboard for optimization opportunities
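For item 1, a minimal normalization helper might look like this (illustrative; lowercasing can hurt case-sensitive inputs such as code, so adapt it to your domain):

import re

def normalize(query):
    # Collapse whitespace and lowercase so trivially different queries collide.
    return re.sub(r"\s+", " ", query).strip().lower()

print(normalize("  What's  the capital   of France? "))  # "what's the capital of france?"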
Your first request is always a cache miss. Test with realistic query patterns to see actual savings.