Built for Speed

Raptor is written in Rust using the Axum web framework and Tokio async runtime. Every component is optimized for minimal latency.

Request Flow

┌──────────────────────────────────────────────────────────────┐
│                        Your App                               │
└───────────────────────────┬──────────────────────────────────┘
                            │ HTTPS request

┌──────────────────────────────────────────────────────────────┐
│                     Raptor Proxy                              │
│                                                               │
│   ┌─────────────┐   ┌─────────────┐   ┌─────────────┐       │
│   │   Auth      │ → │  Firewall   │ → │   Cache     │       │
│   │   ~0.5ms    │   │   ~2ms      │   │   ~1ms      │       │
│   └─────────────┘   └─────────────┘   └─────────────┘       │
│                                               │               │
│                                    Cache hit? │               │
│                              ┌────────────────┼───────┐       │
│                              │ Yes            │ No    │       │
│                              ▼                ▼       │       │
│                      ┌─────────────┐  ┌─────────────┐│       │
│                      │ Return      │  │ Forward to  ││       │
│                      │ cached      │  │ upstream    ││       │
│                      └─────────────┘  └─────────────┘│       │
│                                               │       │       │
│   ┌───────────────────────────────────────────┼───────┘       │
│   │ Evidence logging (async, non-blocking)    │               │
│   └───────────────────────────────────────────┘               │
└───────────────────────────────────────────────────────────────┘


                            │ Forward on cache miss
                            ▼
                   OpenAI / Anthropic / etc.

Latency Breakdown

Stage               Time      Notes
Auth validation     ~0.5ms    API key lookup with connection pooling
Firewall check      ~2ms      ONNX embedding + cosine similarity
Cache lookup        ~1ms      Hot cache (memory) + Redis
Evidence logging    ~0ms      Async, doesn’t block response
Total overhead      ~5ms      Compared to 50-100ms for Python/Node

Why This Matters

A typical GPT-4 request takes 500-2000ms. Adding 50-100ms of proxy overhead (common with Python/Node) is noticeable. Adding 5ms is not.
Without Raptor:    ████████████████████████████████████ 500ms
With Raptor:       █████████████████████████████████████ 505ms  (+1%)
Python proxy:      ██████████████████████████████████████████ 600ms  (+20%)

Three-Tier Caching

┌──────────────────────────────────────────────────────┐
│ HOT CACHE (In-Memory LRU)                            │
│ • Response time: <1ms                                │
│ • Size: 10,000 entries (configurable)                │
│ • Perfect for high-frequency queries                 │
└───────────────────────────┬──────────────────────────┘
                            │ Miss

┌──────────────────────────────────────────────────────┐
│ REDIS CACHE (Distributed)                            │
│ • Response time: 1-5ms                               │
│ • Shared across instances                            │
│ • Promotes hits to hot cache                         │
└───────────────────────────┬──────────────────────────┘
                            │ Miss

┌──────────────────────────────────────────────────────┐
│ UPSTREAM (OpenAI/Anthropic)                          │
│ • Response time: 200-2000ms                          │
│ • Response cached for next time                      │
└──────────────────────────────────────────────────────┘
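
A minimal sketch of how a lookup could walk these tiers, with redis_get and call_upstream as stand-ins for the real Redis client and upstream request (the names and types here are illustrative, not Raptor's actual API):

// Sketch of a three-tier lookup: hot in-memory cache, then Redis, then upstream.
use std::collections::HashMap;

struct TieredCache {
    hot: HashMap<String, String>, // in practice an LRU capped at ~10,000 entries
}

impl TieredCache {
    async fn get_or_fetch(&mut self, key: &str) -> String {
        // Tier 1: hot cache, sub-millisecond.
        if let Some(v) = self.hot.get(key) {
            return v.clone();
        }
        // Tier 2: Redis, shared across instances; promote hits into the hot cache.
        if let Some(v) = redis_get(key).await {
            self.hot.insert(key.to_string(), v.clone());
            return v;
        }
        // Tier 3: upstream provider; cache the response for next time
        // (a real implementation would also write it back to Redis).
        let v = call_upstream(key).await;
        self.hot.insert(key.to_string(), v.clone());
        v
    }
}

async fn redis_get(_key: &str) -> Option<String> { None }      // placeholder
async fn call_upstream(_key: &str) -> String { String::new() } // placeholder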

Semantic Cache vs Exact Match

Traditional caches require exact matches. Raptor uses semantic hashing:
"What's the capital of France?"     → hash: abc123
"What is the capital of France?"    → hash: abc123  ✓ Same!
"Tell me France's capital city"     → hash: abc123  ✓ Same!
We compute a vector embedding, quantize the first 64 dimensions, and hash the result. Semantically similar queries produce the same hash.
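
A rough sketch of that step, assuming sign-quantization of the first 64 dimensions into a 64-bit key (the actual quantization scheme may differ):
// Sketch: derive a cache key by sign-quantizing the first 64 dimensions of an
// embedding into a bitmask. Nearby embeddings tend to land in the same bucket.
fn semantic_hash(embedding: &[f32]) -> u64 {
    let mut bits: u64 = 0;
    for (i, &value) in embedding.iter().take(64).enumerate() {
        if value > 0.0 {
            bits |= 1u64 << i;
        }
    }
    bits
}

fn main() {
    // Two near-identical embeddings (e.g. from paraphrased prompts) share a key.
    let a = vec![0.12_f32; 64];
    let b = vec![0.11_f32; 64];
    assert_eq!(semantic_hash(&a), semantic_hash(&b));
}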

Firewall Architecture

The firewall runs before forwarding to upstream:
  1. Extract text from request body (messages, prompt, etc.)
  2. Compute embedding using local ONNX model (~1ms)
  3. Compare against threat patterns via cosine similarity
  4. Block/warn/log based on configured thresholds
// Simplified firewall check
if cosine_similarity(request_embedding, pattern_embedding) > 0.85 {
    return Err(BlockedByFirewall);
}
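
The cosine_similarity helper assumed above could look roughly like this (a straightforward version over f32 slices, not necessarily Raptor's own):
// Cosine similarity: dot product divided by the product of vector magnitudes,
// yielding a value in [-1.0, 1.0]; higher means more semantically similar.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    dot / (norm_a * norm_b)
}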
For streaming responses, we also monitor the output and can terminate mid-stream if the AI starts generating policy-violating content.
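
One way that output-side screening could be structured, sketched with Tokio channels standing in for the real streaming plumbing (relay_stream and firewall_flags are illustrative placeholders, not Raptor's actual code):
// Sketch: scan the accumulating transcript as chunks stream back from the
// upstream model, and stop forwarding if the firewall flags the output.
async fn relay_stream(
    mut chunks: tokio::sync::mpsc::Receiver<String>,
    client: tokio::sync::mpsc::Sender<String>,
) {
    let mut transcript = String::new();
    while let Some(chunk) = chunks.recv().await {
        transcript.push_str(&chunk);
        if firewall_flags(&transcript) {
            break; // dropping the sender ends the client's stream early
        }
        if client.send(chunk).await.is_err() {
            break; // client disconnected
        }
    }
}

fn firewall_flags(_text: &str) -> bool { false } // placeholder for the embedding check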

Evidence Pipeline

All requests are logged asynchronously:
Request → MPSC Channel → Background Worker → PostgreSQL
              └── Non-blocking, ~10,000-entry buffer
Evidence is never on the critical path. Your requests don’t wait for logging.
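
A minimal sketch of that handoff, assuming a bounded Tokio MPSC channel and try_send so the request path never waits on logging (EvidenceRecord and the worker body are illustrative):
// Sketch: hand evidence records to a background worker over a bounded channel
// created with mpsc::channel(10_000). try_send never blocks; if the buffer is
// full, the record is dropped (or counted) rather than delaying the response.
use tokio::sync::mpsc;

#[derive(Debug)]
struct EvidenceRecord {
    request_id: String,
    latency_ms: u64,
}

fn log_evidence(tx: &mpsc::Sender<EvidenceRecord>, record: EvidenceRecord) {
    if tx.try_send(record).is_err() {
        // Buffer full or worker gone: never block the request over logging.
    }
}

async fn evidence_worker(mut rx: mpsc::Receiver<EvidenceRecord>) {
    while let Some(record) = rx.recv().await {
        // In the real pipeline this would batch-insert into PostgreSQL.
        println!("persist {:?}", record);
    }
}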

Tech Stack

Component        Technology
Language         Rust 1.75+
Web framework    Axum 0.7
Async runtime    Tokio
Database         PostgreSQL + pgvector
Cache            Redis + in-memory LRU
Embeddings       ONNX Runtime
Deployment       Docker / Kubernetes

Resilience

  • Rate limiting: Per API key, configurable
  • Circuit breakers: Automatic failover on upstream errors
  • Connection pooling: Efficient database/Redis connections
  • Graceful shutdown: In-flight requests complete (see the sketch below)
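
A minimal sketch of the graceful-shutdown pattern with Axum 0.7 and Tokio (the route is illustrative; this shows the standard Axum pattern, not necessarily Raptor's exact wiring):
// Sketch: stop accepting new connections on Ctrl-C, but let in-flight
// requests run to completion before the process exits.
use axum::{routing::get, Router};

#[tokio::main]
async fn main() {
    let app = Router::new().route("/health", get(|| async { "ok" }));
    let listener = tokio::net::TcpListener::bind("0.0.0.0:8080").await.unwrap();

    axum::serve(listener, app)
        .with_graceful_shutdown(async {
            tokio::signal::ctrl_c().await.unwrap();
        })
        .await
        .unwrap();
}
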
Raptor is designed to be invisible. If we add latency you notice, that’s a bug.