Built for Speed
Raptor is written in Rust using the Axum web framework and Tokio async runtime. Every component is optimized for minimal latency.
Request Flow
┌──────────────────────────────────────────────────────────────┐
│ Your App │
└───────────────────────────┬──────────────────────────────────┘
│ HTTPS request
▼
┌──────────────────────────────────────────────────────────────┐
│ Raptor Proxy │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Auth │ → │ Firewall │ → │ Cache │ │
│ │ ~0.5ms │ │ ~2ms │ │ ~1ms │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │ │
│ Cache hit? │ │
│ ┌────────────────┼───────┐ │
│ │ Yes │ No │ │
│ ▼ ▼ │ │
│ ┌─────────────┐ ┌─────────────┐│ │
│ │ Return │ │ Forward to ││ │
│ │ cached │ │ upstream ││ │
│ └─────────────┘ └─────────────┘│ │
│ │ │ │
│ ┌───────────────────────────────────────────┼───────┘ │
│ │ Evidence logging (async, non-blocking) │ │
│ └───────────────────────────────────────────┘ │
└───────────────────────────────────────────────────────────────┘
│
▼
OpenAI / Anthropic / etc.
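The same flow, as a compressed sketch in Rust. Every name here (`handle_request`, `authenticate`, `cache_lookup`, and so on) is a hypothetical stand-in; in the real proxy these stages are Axum middleware and handlers running on Tokio:

```rust
// Illustrative per-request pipeline, in the order shown above.
struct ProxyError;

fn handle_request(req: &str) -> Result<String, ProxyError> {
    authenticate(req)?;                        // ~0.5ms: API key lookup
    firewall_check(req)?;                      // ~2ms: embedding + similarity
    if let Some(cached) = cache_lookup(req) {  // ~1ms: hot cache, then Redis
        log_evidence_async(req);               // never blocks the response
        return Ok(cached);
    }
    let resp = forward_upstream(req)?;         // 200-2000ms at the provider
    cache_store(req, &resp);
    log_evidence_async(req);
    Ok(resp)
}

// Hypothetical stubs so the sketch stands alone.
fn authenticate(_req: &str) -> Result<(), ProxyError> { Ok(()) }
fn firewall_check(_req: &str) -> Result<(), ProxyError> { Ok(()) }
fn cache_lookup(_req: &str) -> Option<String> { None }
fn forward_upstream(_req: &str) -> Result<String, ProxyError> { Ok("response".into()) }
fn cache_store(_req: &str, _resp: &str) {}
fn log_evidence_async(_req: &str) {}
```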
Latency Breakdown
| Stage | Time | Notes |
|---|---|---|
| Auth validation | ~0.5ms | API key lookup with connection pooling |
| Firewall check | ~2ms | ONNX embedding + cosine similarity |
| Cache lookup | ~1ms | Hot cache (memory) + Redis |
| Evidence logging | ~0ms | Async, doesn’t block response |
| Total overhead | ~5ms | Compared to 50-100ms for Python/Node |
Why This Matters
A typical GPT-4 request takes 500-2000ms. Adding 50-100ms of proxy overhead (common with Python/Node) is noticeable. Adding 5ms is not.
Without Raptor: ████████████████████████████████████ 500ms
With Raptor: █████████████████████████████████████ 505ms (+1%)
Python proxy: ██████████████████████████████████████████ 600ms (+20%)
Three-Tier Caching
┌──────────────────────────────────────────────────────┐
│ HOT CACHE (In-Memory LRU) │
│ • Response time: <1ms │
│ • Size: 10,000 entries (configurable) │
│ • Perfect for high-frequency queries │
└───────────────────────────┬──────────────────────────┘
│ Miss
▼
┌──────────────────────────────────────────────────────┐
│ REDIS CACHE (Distributed) │
│ • Response time: 1-5ms │
│ • Shared across instances │
│ • Promotes hits to hot cache │
└───────────────────────────┬──────────────────────────┘
│ Miss
▼
┌──────────────────────────────────────────────────────┐
│ UPSTREAM (OpenAI/Anthropic) │
│ • Response time: 200-2000ms │
│ • Response cached for next time │
└──────────────────────────────────────────────────────┘
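A minimal sketch of that lookup order, with a plain `HashMap` standing in for the in-memory LRU and hypothetical `redis_get` / `call_upstream` stubs in place of the real clients:

```rust
use std::collections::HashMap;

// Hot tier: the real proxy uses a bounded LRU; a HashMap keeps the sketch
// dependency-free.
struct TieredCache {
    hot: HashMap<String, String>,
}

impl TieredCache {
    fn get_or_fetch(&mut self, key: &str) -> String {
        // 1. Hot cache: in-memory, sub-millisecond.
        if let Some(hit) = self.hot.get(key) {
            return hit.clone();
        }
        // 2. Redis: shared across instances; hits are promoted to the hot tier.
        if let Some(hit) = redis_get(key) {
            self.hot.insert(key.to_string(), hit.clone());
            return hit;
        }
        // 3. Upstream: slowest path; cache the response for next time
        //    (the real proxy also writes it back to Redis).
        let response = call_upstream(key);
        self.hot.insert(key.to_string(), response.clone());
        response
    }
}

// Hypothetical stubs standing in for the Redis client and the provider call.
fn redis_get(_key: &str) -> Option<String> { None }
fn call_upstream(_key: &str) -> String { "fresh response".to_string() }
```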
Semantic Cache vs Exact Match
Traditional caches require exact matches. Raptor uses semantic hashing:
"What's the capital of France?" → hash: abc123
"What is the capital of France?" → hash: abc123 ✓ Same!
"Tell me France's capital city" → hash: abc123 ✓ Same!
We compute a vector embedding, quantize the first 64 dimensions, and hash the result. Semantically similar queries produce the same hash.
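As an illustration of the idea (the exact quantization scheme is an assumption here; the description above only says the first 64 dimensions are quantized before hashing), a sign-bit version looks like this:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Quantize the first 64 dimensions of the embedding and hash the resulting
// pattern. Sign-bit quantization is one plausible choice: small wording
// differences that barely move the embedding land on the same hash.
fn semantic_hash(embedding: &[f32]) -> u64 {
    let mut hasher = DefaultHasher::new();
    for &dim in embedding.iter().take(64) {
        let bucket: i8 = if dim >= 0.0 { 1 } else { -1 };
        bucket.hash(&mut hasher);
    }
    hasher.finish()
}
```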
Firewall Architecture
The firewall runs before the request is forwarded upstream:
- Extract text from request body (messages, prompt, etc.)
- Compute embedding using local ONNX model (~1ms)
- Compare against threat patterns via cosine similarity
- Block/warn/log based on configured thresholds
```rust
// Simplified firewall check
if cosine_similarity(request_embedding, pattern_embedding) > 0.85 {
    return Err(BlockedByFirewall);
}
```
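For reference, cosine similarity is the dot product of the two vectors normalized by their magnitudes. A dependency-free sketch, not necessarily the exact signature Raptor uses internally:

```rust
// Cosine similarity between two embedding vectors: dot product divided by the
// product of their magnitudes. Values near 1.0 mean "very similar".
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b.iter()).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}
```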
For streaming responses, we also monitor the output and can terminate mid-stream if the AI starts generating policy-violating content.
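A rough sketch of that idea, with a hypothetical `policy_violation_score` standing in for the embedding-and-similarity check on the accumulated output:

```rust
// Accumulate streamed chunks and stop forwarding once the output crosses the
// policy threshold, instead of passing the rest of the stream to the client.
fn forward_stream(chunks: impl Iterator<Item = String>) -> Vec<String> {
    let mut forwarded = Vec::new();
    let mut output_so_far = String::new();
    for chunk in chunks {
        output_so_far.push_str(&chunk);
        if policy_violation_score(&output_so_far) > 0.85 {
            break; // terminate mid-stream
        }
        forwarded.push(chunk);
    }
    forwarded
}

fn policy_violation_score(_text: &str) -> f32 { 0.0 } // hypothetical stub
```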
Evidence Pipeline
All requests are logged asynchronously:
Request → MPSC Channel → Background Worker → PostgreSQL
│
└── Non-blocking, ~10,000-entry buffer
Evidence is never on the critical path. Your requests don’t wait for logging.
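A thread-based sketch of the same shape. The real pipeline is async on Tokio and writes to PostgreSQL; `EvidenceRecord` and its fields are illustrative, and a thread plus `println!` stand in for the background worker:

```rust
use std::sync::mpsc;
use std::thread;

// Illustrative evidence record; the real schema lives in PostgreSQL.
struct EvidenceRecord {
    api_key_id: String,
    latency_ms: u64,
}

fn main() {
    // Bounded channel (~10,000 entries, mirroring the diagram above).
    let (tx, rx) = mpsc::sync_channel::<EvidenceRecord>(10_000);

    // Background worker drains the channel off the critical path.
    let worker = thread::spawn(move || {
        for record in rx {
            println!("logged {} ({} ms)", record.api_key_id, record.latency_ms);
        }
    });

    // On the request path: try_send never blocks. If the buffer is somehow
    // full, the record is dropped rather than delaying the response.
    let _ = tx.try_send(EvidenceRecord {
        api_key_id: "key_123".into(),
        latency_ms: 5,
    });

    drop(tx); // close the channel so the worker drains and exits
    worker.join().unwrap();
}
```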
Tech Stack
| Component | Technology |
|---|---|
| Language | Rust 1.75+ |
| Web framework | Axum 0.7 |
| Async runtime | Tokio |
| Database | PostgreSQL + pgvector |
| Cache | Redis + in-memory LRU |
| Embeddings | ONNX Runtime |
| Deployment | Docker / Kubernetes |
Resilience
- Rate limiting: Per API key, configurable
- Circuit breakers: Automatic failover on upstream errors (see the sketch below)
- Connection pooling: Efficient database/Redis connections
- Graceful shutdown: In-flight requests complete
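The circuit breaker, for instance, follows the standard pattern of tripping after a run of consecutive upstream failures and failing over for a cooldown period. A generic sketch of that pattern, not Raptor's exact implementation:

```rust
use std::time::{Duration, Instant};

// Trip open after N consecutive failures; reject (or fail over) until the
// cooldown expires, then let traffic probe the upstream again.
struct CircuitBreaker {
    consecutive_failures: u32,
    failure_threshold: u32,
    open_until: Option<Instant>,
    cooldown: Duration,
}

impl CircuitBreaker {
    fn allow_request(&self) -> bool {
        match self.open_until {
            Some(until) => Instant::now() >= until, // open: wait out the cooldown
            None => true,
        }
    }

    fn record_success(&mut self) {
        self.consecutive_failures = 0;
        self.open_until = None;
    }

    fn record_failure(&mut self) {
        self.consecutive_failures += 1;
        if self.consecutive_failures >= self.failure_threshold {
            self.open_until = Some(Instant::now() + self.cooldown);
        }
    }
}
```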
Raptor is designed to be invisible. If we add latency you notice, that’s a bug.