Built for Speed
Raptor is written in Rust using the Axum web framework and Tokio async runtime. Every component is optimized for minimal latency.
Request Flow
┌──────────────────────────────────────────────────────────────┐
│ Your App │
└───────────────────────────┬──────────────────────────────────┘
│ HTTPS request
▼
┌──────────────────────────────────────────────────────────────┐
│ Raptor Proxy │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Auth │ → │ Firewall │ → │ Cache │ │
│ │ ~0.5ms │ │ ~2ms │ │ ~1ms │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │ │
│ Cache hit? │ │
│ ┌────────────────┼───────┐ │
│ │ Yes │ No │ │
│ ▼ ▼ │ │
│ ┌─────────────┐ ┌─────────────┐│ │
│ │ Return │ │ Forward to ││ │
│ │ cached │ │ upstream ││ │
│ └─────────────┘ └─────────────┘│ │
│ │ │ │
│ ┌───────────────────────────────────────────┼───────┘ │
│ │ Evidence logging (async, non-blocking) │ │
│ └───────────────────────────────────────────┘ │
└───────────────────────────────────────────────────────────────┘
│
▼
OpenAI / Anthropic / etc.
Latency Breakdown
| Stage | Time | Notes |
|---|---|---|
| Auth validation | ~0.5ms | API key lookup with connection pooling |
| Firewall check | ~2ms | ONNX embedding + cosine similarity |
| Cache lookup | ~1ms | Hot cache (memory) + Redis |
| Evidence logging | ~0ms | Async, doesn’t block response |
| Total overhead | ~5ms | Versus 50-100ms for a typical Python/Node proxy |
Why This Matters
A typical GPT-4 request takes 500-2000ms. Adding 50-100ms of proxy overhead (common with Python/Node) is noticeable. Adding 5ms is not.
Without Raptor: ████████████████████████████████████ 500ms
With Raptor: █████████████████████████████████████ 505ms (+1%)
Python proxy: ██████████████████████████████████████████ 600ms (+20%)
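The percentages in the bars above follow from dividing proxy overhead by upstream latency. A minimal sketch of that arithmetic, using the figures quoted in this section (the function names are illustrative, not part of Raptor):

```rust
/// Relative overhead a proxy adds to an upstream call, as a percentage.
fn overhead_percent(upstream_ms: f64, proxy_ms: f64) -> f64 {
    proxy_ms / upstream_ms * 100.0
}

/// Overhead at the fast (500ms) and slow (2000ms) ends of the
/// upstream latency range quoted above.
fn overhead_range(proxy_ms: f64) -> (f64, f64) {
    (
        overhead_percent(500.0, proxy_ms),
        overhead_percent(2000.0, proxy_ms),
    )
}
```

At 500ms upstream latency, 5ms of overhead is about 1% and 100ms is about 20%. On slower requests the relative cost shrinks further.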
Three-Tier Caching
┌──────────────────────────────────────────────────────┐
│ HOT CACHE (In-Memory LRU) │
│ • Response time: <1ms │
│ • Size: 10,000 entries (configurable) │
│ • Perfect for high-frequency queries │
└───────────────────────────┬──────────────────────────┘
│ Miss
▼
┌──────────────────────────────────────────────────────┐
│ REDIS CACHE (Distributed) │
│ • Response time: 1-5ms │
│ • Shared across instances │
│ • Promotes hits to hot cache │
└───────────────────────────┬──────────────────────────┘
│ Miss
▼
┌──────────────────────────────────────────────────────┐
│ UPSTREAM (OpenAI/Anthropic) │
│ • Response time: 200-2000ms │
│ • Response cached for next time │
└──────────────────────────────────────────────────────┘
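The lookup order in the diagram can be sketched as follows. This is an illustration under stated assumptions, not Raptor's implementation: both tiers are plain `HashMap`s (the `shared` map stands in for Redis), there is no LRU eviction, and writing upstream responses to both tiers is a guess, since the diagram only says the response is cached for next time. `TieredCache`, `Source`, and `get_or_fetch` are hypothetical names.

```rust
use std::collections::HashMap;

/// Which tier ultimately served the response.
#[derive(Debug, PartialEq)]
enum Source {
    Hot,
    Shared,
    Upstream,
}

/// Two cache tiers in front of an upstream call.
/// `shared` is an in-process stand-in for Redis (assumption).
struct TieredCache {
    hot: HashMap<String, String>,
    shared: HashMap<String, String>,
}

impl TieredCache {
    fn new() -> Self {
        TieredCache { hot: HashMap::new(), shared: HashMap::new() }
    }

    /// Try the hot cache, then the shared cache, then call upstream.
    fn get_or_fetch<F>(&mut self, key: &str, fetch_upstream: F) -> (String, Source)
    where
        F: FnOnce() -> String,
    {
        if let Some(v) = self.hot.get(key) {
            return (v.clone(), Source::Hot);
        }
        if let Some(v) = self.shared.get(key).cloned() {
            // Promote shared-tier hits to the hot cache.
            self.hot.insert(key.to_string(), v.clone());
            return (v, Source::Shared);
        }
        let v = fetch_upstream();
        // Assumption: upstream responses are written to both tiers.
        self.shared.insert(key.to_string(), v.clone());
        self.hot.insert(key.to_string(), v.clone());
        (v, Source::Upstream)
    }
}
```

The upstream closure only runs when both tiers miss, so the expensive call is skipped on any hit.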
Semantic Cache vs Exact Match
Traditional caches require exact matches. Raptor uses semantic hashing:
"What's the capital of France?" → hash: abc123
"What is the capital of France?" → hash: abc123 ✓ Same!
"Tell me France's capital city" → hash: abc123 ✓ Same!
We compute a vector embedding, quantize the first 64 dimensions, and hash the result. Queries whose embeddings land in the same quantization buckets produce the same hash, so semantically similar phrasings typically share a cache entry.
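A minimal sketch of the quantize-then-hash step, assuming the embedding is already available as a slice of `f32`. The bucket width `step` and the use of the standard library's `DefaultHasher` are assumptions made for illustration; the text above does not say how Raptor quantizes or which hash function it uses.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Number of leading embedding dimensions that feed the hash.
const HASH_DIMS: usize = 64;

/// Quantize the first `HASH_DIMS` dimensions into integer buckets of
/// width `step` (a hypothetical tuning knob), then hash the buckets.
fn semantic_hash(embedding: &[f32], step: f32) -> u64 {
    // DefaultHasher::new() is deterministic within a build, but its
    // algorithm is not guaranteed stable across Rust releases.
    let mut hasher = DefaultHasher::new();
    for &x in embedding.iter().take(HASH_DIMS) {
        let bucket = (x / step).round() as i32;
        bucket.hash(&mut hasher);
    }
    hasher.finish()
}
```

Two embeddings that differ slightly but fall into the same buckets get the same key. Embeddings that straddle a bucket boundary will not, so this kind of scheme trades some missed hits for a cheap exact-key lookup. A cache shared across instances would also need a hash function with a fixed, documented output.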
Firewall Architecture
The firewall runs before forwarding to upstream:
- Extract text from request body (messages, prompt, etc.)
- Compute embedding using local ONNX model (~1ms)
- Compare against threat patterns via cosine similarity
- Block/warn/log based on configured thresholds
// Simplified firewall check
if cosine_similarity(request_embedding, pattern_embedding) > 0.85 {
    return Err(BlockedByFirewall);
}
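To make the simplified check concrete, here is a self-contained sketch of cosine similarity against a list of pattern embeddings. `check_request` and the shape of `BlockedByFirewall` are hypothetical, and only the block action is modeled (warn and log are left out). The 0.85 threshold is the one from the snippet above.

```rust
/// Error returned when a request matches a threat pattern.
#[derive(Debug, PartialEq)]
struct BlockedByFirewall;

/// Cosine similarity of two equal-length vectors.
/// Returns 0.0 if either vector has zero norm.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Block the request if it is closer than `threshold` to any pattern.
fn check_request(
    request: &[f32],
    patterns: &[Vec<f32>],
    threshold: f32,
) -> Result<(), BlockedByFirewall> {
    for pattern in patterns {
        if cosine_similarity(request, pattern) > threshold {
            return Err(BlockedByFirewall);
        }
    }
    Ok(())
}
```

Because cosine similarity ignores vector length, only the direction of the request embedding relative to each pattern matters.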
For streaming responses, we also monitor the output and can terminate mid-stream if the AI starts generating policy-violating content.
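One way mid-stream termination could work is to check the accumulated output before relaying each chunk. The sketch below is an assumption-heavy illustration: `relay_stream`, `StreamOutcome`, and the `violates_policy` predicate are hypothetical, and a real check would presumably use the same embedding comparison as the request firewall rather than a caller-supplied closure.

```rust
/// Outcome of relaying a streamed response.
#[derive(Debug, PartialEq)]
enum StreamOutcome {
    Completed,
    Terminated { chunks_sent: usize },
}

/// Relay chunks to `sink` until the stream ends or the accumulated
/// output trips `violates_policy` (a hypothetical predicate).
fn relay_stream<I, P, S>(chunks: I, violates_policy: P, mut sink: S) -> StreamOutcome
where
    I: IntoIterator<Item = String>,
    P: Fn(&str) -> bool,
    S: FnMut(&str),
{
    let mut seen = String::new();
    let mut sent = 0;
    for chunk in chunks {
        // Check the full text so far, so matches split across chunks are caught.
        seen.push_str(&chunk);
        if violates_policy(&seen) {
            return StreamOutcome::Terminated { chunks_sent: sent };
        }
        sink(&chunk);
        sent += 1;
    }
    StreamOutcome::Completed
}
```

The offending chunk is never forwarded, but everything relayed before it has already reached the client.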
Evidence Pipeline
All requests are logged asynchronously:
Request → MPSC Channel → Background Worker → PostgreSQL
              │
              └── Non-blocking, buffer of ~10,000 messages
Evidence is never on the critical path. Your requests don’t wait for logging.
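The pattern can be sketched with the standard library alone. This is a stand-in, not Raptor's code: it uses `std::sync::mpsc::sync_channel` and a thread where a Tokio service would more likely use an async channel and task, the worker collects records into a `Vec` instead of writing to PostgreSQL, and dropping records when the buffer is full is an assumption. `Evidence`, `spawn_logger`, and `log_evidence` are hypothetical names.

```rust
use std::sync::mpsc::{sync_channel, SyncSender};
use std::thread::{self, JoinHandle};

/// One logged request (fields are illustrative).
struct Evidence {
    request_id: u64,
    status: u16,
}

/// Channel capacity, matching the ~10,000 figure above.
const BUFFER: usize = 10_000;

/// Start the background worker and return the sending half.
fn spawn_logger() -> (SyncSender<Evidence>, JoinHandle<Vec<Evidence>>) {
    let (tx, rx) = sync_channel::<Evidence>(BUFFER);
    let worker = thread::spawn(move || {
        let mut stored = Vec::new();
        // Stand-in for writing rows to PostgreSQL.
        // The loop ends once every sender has been dropped.
        for ev in rx {
            stored.push(ev);
        }
        stored
    });
    (tx, worker)
}

/// Enqueue without blocking the request path. Returns false if the
/// record was dropped (buffer full or worker gone); dropping on a
/// full buffer is an assumption.
fn log_evidence(tx: &SyncSender<Evidence>, ev: Evidence) -> bool {
    tx.try_send(ev).is_ok()
}
```

`try_send` returns immediately whether or not there is room, which is what keeps logging off the critical path.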
Tech Stack
| Component | Technology |
|---|---|
| Language | Rust 1.75+ |
| Web framework | Axum 0.7 |
| Async runtime | Tokio |
| Database | PostgreSQL + pgvector |
| Cache | Redis + in-memory LRU |
| Embeddings | ONNX Runtime |
| Deployment | Docker / Kubernetes |
Resilience
- Rate limiting: Per API key, configurable
- Circuit breakers: Automatic failover on upstream errors
- Connection pooling: Efficient database/Redis connections
- Graceful shutdown: In-flight requests complete
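As an illustration of the circuit-breaker item, here is a minimal consecutive-failure breaker. The threshold, cooldown, and the `CircuitBreaker` type are hypothetical. The sketch only decides whether to send a request upstream, and does not model which fallback is chosen on failover.

```rust
use std::time::{Duration, Instant};

/// Opens after `failure_threshold` consecutive upstream failures and
/// allows traffic again once `cooldown` has elapsed.
struct CircuitBreaker {
    failure_threshold: u32,
    cooldown: Duration,
    consecutive_failures: u32,
    opened_at: Option<Instant>,
}

impl CircuitBreaker {
    fn new(failure_threshold: u32, cooldown: Duration) -> Self {
        CircuitBreaker {
            failure_threshold,
            cooldown,
            consecutive_failures: 0,
            opened_at: None,
        }
    }

    /// Whether a request may be sent upstream at time `now`.
    fn allow(&self, now: Instant) -> bool {
        match self.opened_at {
            Some(opened) => now.duration_since(opened) >= self.cooldown,
            None => true,
        }
    }

    /// A successful call closes the breaker and resets the count.
    fn record_success(&mut self) {
        self.consecutive_failures = 0;
        self.opened_at = None;
    }

    /// A failed call may open (or re-open) the breaker.
    fn record_failure(&mut self, now: Instant) {
        self.consecutive_failures += 1;
        if self.consecutive_failures >= self.failure_threshold {
            self.opened_at = Some(now);
        }
    }
}
```

Passing `now` in explicitly keeps the logic deterministic and easy to test without sleeping.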
Raptor is designed to be invisible. If we add latency you notice, that’s a bug.