Built for Speed

Raptor is written in Rust using the Axum web framework and Tokio async runtime. Every component is optimized for minimal latency.

Request Flow

┌──────────────────────────────────────────────────────────────┐
│                        Your App                               │
└───────────────────────────┬──────────────────────────────────┘
                            │ HTTPS request

┌──────────────────────────────────────────────────────────────┐
│                     Raptor Proxy                              │
│                                                               │
│   ┌─────────────┐   ┌─────────────┐   ┌─────────────┐       │
│   │   Auth      │ → │  Firewall   │ → │   Cache     │       │
│   │   ~0.5ms    │   │   ~2ms      │   │   ~1ms      │       │
│   └─────────────┘   └─────────────┘   └─────────────┘       │
│                                               │               │
│                                    Cache hit? │               │
│                              ┌────────────────┼───────┐       │
│                              │ Yes            │ No    │       │
│                              ▼                ▼       │       │
│                      ┌─────────────┐  ┌─────────────┐│       │
│                      │ Return      │  │ Forward to  ││       │
│                      │ cached      │  │ upstream    ││       │
│                      └─────────────┘  └─────────────┘│       │
│                                               │       │       │
│   ┌───────────────────────────────────────────┼───────┘       │
│   │ Evidence logging (async, non-blocking)    │               │
│   └───────────────────────────────────────────┘               │
└───────────────────────────────────────────────────────────────┘


                            │ Forward on cache miss
                            ▼
                   OpenAI / Anthropic / etc.

Latency Breakdown

Stage               Time      Notes
Auth validation     ~0.5ms    API key lookup with connection pooling
Firewall check      ~2ms      ONNX embedding + cosine similarity
Cache lookup        ~1ms      Hot cache (memory) + Redis
Evidence logging    ~0ms      Async, doesn’t block response
Total overhead      ~5ms      Compared to 50-100ms for Python/Node

Why This Matters

A typical GPT-4 request takes 500-2000ms. Adding 50-100ms of proxy overhead (common with Python/Node) is noticeable. Adding 5ms is not.
Without Raptor:    ████████████████████████████████████ 500ms
With Raptor:       █████████████████████████████████████ 505ms  (+1%)
Python proxy:      ██████████████████████████████████████████ 600ms  (+20%)

Three-Tier Caching

┌──────────────────────────────────────────────────────┐
│ HOT CACHE (In-Memory LRU)                            │
│ • Response time: <1ms                                │
│ • Size: 10,000 entries (configurable)                │
│ • Perfect for high-frequency queries                 │
└───────────────────────────┬──────────────────────────┘
                            │ Miss

┌──────────────────────────────────────────────────────┐
│ REDIS CACHE (Distributed)                            │
│ • Response time: 1-5ms                               │
│ • Shared across instances                            │
│ • Promotes hits to hot cache                         │
└───────────────────────────┬──────────────────────────┘
                            │ Miss

┌──────────────────────────────────────────────────────┐
│ UPSTREAM (OpenAI/Anthropic)                          │
│ • Response time: 200-2000ms                          │
│ • Response cached for next time                      │
└──────────────────────────────────────────────────────┘
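
A minimal sketch of how a lookup could walk these tiers, with redis_get and call_upstream as stand-ins for the real Redis client and upstream request (the names and types here are illustrative, not Raptor's actual API):

// Sketch of a three-tier lookup: hot in-memory cache, then Redis, then upstream.
use std::collections::HashMap;

struct TieredCache {
    hot: HashMap<String, String>, // in practice an LRU capped at ~10,000 entries
}

impl TieredCache {
    async fn get_or_fetch(&mut self, key: &str) -> String {
        // Tier 1: hot cache, sub-millisecond.
        if let Some(v) = self.hot.get(key) {
            return v.clone();
        }
        // Tier 2: Redis, shared across instances; promote hits into the hot cache.
        if let Some(v) = redis_get(key).await {
            self.hot.insert(key.to_string(), v.clone());
            return v;
        }
        // Tier 3: upstream provider; cache the response for next time
        // (a real implementation would also write it back to Redis).
        let v = call_upstream(key).await;
        self.hot.insert(key.to_string(), v.clone());
        v
    }
}

async fn redis_get(_key: &str) -> Option<String> { None }      // placeholder
async fn call_upstream(_key: &str) -> String { String::new() } // placeholder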

Semantic Cache vs Exact Match

Traditional caches require exact matches. Raptor uses semantic hashing:
"What's the capital of France?"     → hash: abc123
"What is the capital of France?"    → hash: abc123  ✓ Same!
"Tell me France's capital city"     → hash: abc123  ✓ Same!
We compute a vector embedding, quantize the first 64 dimensions, and hash the result. Semantically similar queries produce the same hash.
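
A rough sketch of that step, assuming sign-quantization of the first 64 dimensions into a 64-bit key (the actual quantization scheme may differ):
// Sketch: derive a cache key by sign-quantizing the first 64 dimensions of an
// embedding into a bitmask. Nearby embeddings tend to land in the same bucket.
fn semantic_hash(embedding: &[f32]) -> u64 {
    let mut bits: u64 = 0;
    for (i, &value) in embedding.iter().take(64).enumerate() {
        if value > 0.0 {
            bits |= 1u64 << i;
        }
    }
    bits
}

fn main() {
    // Two near-identical embeddings (e.g. from paraphrased prompts) share a key.
    let a = vec![0.12_f32; 64];
    let b = vec![0.11_f32; 64];
    assert_eq!(semantic_hash(&a), semantic_hash(&b));
}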

Firewall Architecture

The firewall runs before forwarding to upstream:
  1. Extract text from request body (messages, prompt, etc.)
  2. Compute embedding using local ONNX model (~1ms)
  3. Compare against threat patterns via cosine similarity
  4. Block/warn/log based on configured thresholds
// Simplified firewall check
if cosine_similarity(request_embedding, pattern_embedding) > 0.85 {
    return Err(BlockedByFirewall);
}
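
The cosine_similarity helper assumed above could look roughly like this (a straightforward version over f32 slices, not necessarily Raptor's own):
// Cosine similarity: dot product divided by the product of vector magnitudes,
// yielding a value in [-1.0, 1.0]; higher means more semantically similar.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    dot / (norm_a * norm_b)
}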
For streaming responses, we also monitor the output and can terminate mid-stream if the AI starts generating policy-violating content.
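
One way that output-side screening could be structured, sketched with Tokio channels standing in for the real streaming plumbing (relay_stream and firewall_flags are illustrative placeholders, not Raptor's actual code):
// Sketch: scan the accumulating transcript as chunks stream back from the
// upstream model, and stop forwarding if the firewall flags the output.
async fn relay_stream(
    mut chunks: tokio::sync::mpsc::Receiver<String>,
    client: tokio::sync::mpsc::Sender<String>,
) {
    let mut transcript = String::new();
    while let Some(chunk) = chunks.recv().await {
        transcript.push_str(&chunk);
        if firewall_flags(&transcript) {
            break; // dropping the sender ends the client's stream early
        }
        if client.send(chunk).await.is_err() {
            break; // client disconnected
        }
    }
}

fn firewall_flags(_text: &str) -> bool { false } // placeholder for the embedding check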

Evidence Pipeline

All requests are logged asynchronously:
Request → MPSC Channel → Background Worker → PostgreSQL
              └── Non-blocking, ~10,000-entry buffer
Evidence is never on the critical path. Your requests don’t wait for logging.
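
A minimal sketch of that handoff, assuming a bounded Tokio MPSC channel and try_send so the request path never waits on logging (EvidenceRecord and the worker body are illustrative):
// Sketch: hand evidence records to a background worker over a bounded channel
// created with mpsc::channel(10_000). try_send never blocks; if the buffer is
// full, the record is dropped (or counted) rather than delaying the response.
use tokio::sync::mpsc;

#[derive(Debug)]
struct EvidenceRecord {
    request_id: String,
    latency_ms: u64,
}

fn log_evidence(tx: &mpsc::Sender<EvidenceRecord>, record: EvidenceRecord) {
    if tx.try_send(record).is_err() {
        // Buffer full or worker gone: never block the request over logging.
    }
}

async fn evidence_worker(mut rx: mpsc::Receiver<EvidenceRecord>) {
    while let Some(record) = rx.recv().await {
        // In the real pipeline this would batch-insert into PostgreSQL.
        println!("persist {:?}", record);
    }
}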

Tech Stack

Component        Technology
Language         Rust 1.75+
Web framework    Axum 0.7
Async runtime    Tokio
Database         PostgreSQL + pgvector
Cache            Redis + in-memory LRU
Embeddings       ONNX Runtime
Deployment       Docker / Kubernetes

Resilience

  • Rate limiting: Per API key, configurable
  • Circuit breakers: Automatic failover on upstream errors
  • Connection pooling: Efficient database/Redis connections
  • Graceful shutdown: In-flight requests complete (see the sketch below)
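
A minimal sketch of the graceful-shutdown pattern with Axum 0.7 and Tokio (the route is illustrative; this shows the standard Axum pattern, not necessarily Raptor's exact wiring):
// Sketch: stop accepting new connections on Ctrl-C, but let in-flight
// requests run to completion before the process exits.
use axum::{routing::get, Router};

#[tokio::main]
async fn main() {
    let app = Router::new().route("/health", get(|| async { "ok" }));
    let listener = tokio::net::TcpListener::bind("0.0.0.0:8080").await.unwrap();

    axum::serve(listener, app)
        .with_graceful_shutdown(async {
            tokio::signal::ctrl_c().await.unwrap();
        })
        .await
        .unwrap();
}
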
Raptor is designed to be invisible. If we add latency you notice, that’s a bug.