# Stop Paying Twice
Every time a user asks “What’s the capital of France?” you pay for a new API call. With Raptor’s semantic cache, you pay once.
## How It Works
```
Request: "What is the capital of France?"
  → Compute embedding
  → Check cache for similar queries
  → HIT! Return cached response in ~5ms
```
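Under the hood, the lookup step amounts to a nearest-neighbor search over stored query embeddings. Here is a minimal sketch of that idea in Python; the `lookup` helper and the similarity threshold are hypothetical, not Raptor's actual internals:

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.92  # hypothetical cutoff; Raptor's real value is internal

def lookup(query_vec: np.ndarray, cache: list[tuple[np.ndarray, str]]) -> str | None:
    """Return the cached response whose embedding best matches the query."""
    best_sim, best_response = 0.0, None
    for vec, response in cache:
        sim = float(np.dot(query_vec, vec))  # unit vectors -> cosine similarity
        if sim > best_sim:
            best_sim, best_response = sim, response
    if best_sim >= SIMILARITY_THRESHOLD:
        return best_response  # HIT: skip the upstream call entirely
    return None  # MISS: call the model, then store (query_vec, response)
```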
Unlike traditional caches, we don’t require exact text matches:
| Query | Cache Result |
|---|---|
| "What is the capital of France?" | Original query (stored on miss) |
| "What's France's capital?" | Cache HIT |
| "capital of france?" | Cache HIT |
| "Tell me the capital of France" | Cache HIT |
All four return the same cached response: three of the four calls are served from cache, which is 75% fewer API calls from query variations alone.
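To see why these variants match, you can compare their embeddings directly. Below is a sketch using OpenAI's embeddings API purely for illustration (the embedding model Raptor uses internally is not documented here):

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # uses OPENAI_API_KEY from the environment

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)  # returned vectors are unit-normalized

base = embed("What is the capital of France?")
for q in ["What's France's capital?", "capital of france?", "Tell me the capital of France"]:
    print(f"{q!r}: similarity {np.dot(base, embed(q)):.3f}")  # all close to 1.0
```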
## Typical Savings
| Use Case | Cache Hit Rate | Monthly Savings |
|---|---|---|
| FAQ bots | 60-80% | ~$400 per $500 of spend |
| Customer support | 40-60% | ~$250 per $500 of spend |
| Search/RAG | 30-50% | ~$200 per $500 of spend |
| Code assistants | 20-40% | ~$150 per $500 of spend |
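The arithmetic behind these figures is linear: every cache hit avoids one paid upstream call, so savings scale directly with the hit rate. For example:

```python
def estimated_savings(monthly_spend: float, hit_rate: float) -> float:
    """Each cache hit avoids one paid upstream call, so savings scale linearly."""
    return monthly_spend * hit_rate

estimated_savings(500, 0.80)  # 400.0 -> the FAQ-bot upper bound in the table above
```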
Check cache status in every response:
```
X-Raptor-Cache: hit               # or "miss"
X-Raptor-Latency-Ms: 5            # Total Raptor time
X-Raptor-Upstream-Latency-Ms: 0   # 0 on cache hit
```
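With the OpenAI Python SDK you can read these headers through the raw-response wrapper; the Raptor base URL below is illustrative:

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.raptor.example/v1")  # illustrative URL

raw = client.chat.completions.with_raw_response.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(raw.headers.get("X-Raptor-Cache"))        # "hit" or "miss"
print(raw.headers.get("X-Raptor-Latency-Ms"))   # total time spent in Raptor
completion = raw.parse()                        # the usual ChatCompletion object
```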
## Streaming Support
Cache hits work with streaming too. We replay cached responses with realistic timing (~15ms between words) so users still see the “typing” effect.
```python
from openai import OpenAI

client = OpenAI(base_url="https://api.raptor.example/v1")  # illustrative Raptor URL

# This hits cache and streams just like the original
stream = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "What's the capital of France?"}],
    stream=True,
)
```
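For intuition, the replay pacing amounts to a small delay between words. Raptor does this server-side; the client-side sketch below only illustrates the ~15ms-per-word effect described above:

```python
import time

def replay(cached_text: str, delay_s: float = 0.015) -> None:
    """Print a cached answer word by word with ~15ms pacing."""
    for word in cached_text.split():
        print(word, end=" ", flush=True)
        time.sleep(delay_s)
    print()

replay("The capital of France is Paris.")
```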
## Cache Behavior
| Request Type | Caches? | Serves from Cache? |
|---|---|---|
| Non-streaming | Yes | Yes |
| Streaming | Yes | Yes (replayed) |
| With tools | No | No |
| Image generation | No | No |
## Best Practices
- **Normalize inputs** - strip extra whitespace and lowercase where your domain allows (see the sketch after this list)
- **Use specific system prompts** - consistent, specific prompts help the cache recognize similar queries
- **Monitor hit rates** - check your dashboard for optimization opportunities
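A possible client-side normalizer for the first tip; whether lowercasing is safe depends on your inputs (case-sensitive code snippets, for example, should be left alone):

```python
import re

def normalize(query: str) -> str:
    """Collapse whitespace and lowercase before sending to the API."""
    return re.sub(r"\s+", " ", query).strip().lower()

normalize("  What's   the capital  of France? ")  # "what's the capital of france?"
```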
Your first request is always a cache miss. Test with realistic query patterns to see actual savings.