Built for Speed

Raptor is written in Rust using the Axum web framework and Tokio async runtime. Every component is optimized for minimal latency.

Request Flow

┌──────────────────────────────────────────────────────────────┐
│                        Your App                               │
└───────────────────────────┬──────────────────────────────────┘
                            │ HTTPS request
                            ▼
┌──────────────────────────────────────────────────────────────┐
│                     Raptor Proxy                              │
│                                                               │
│   ┌─────────────┐   ┌─────────────┐   ┌─────────────┐       │
│   │   Auth      │ → │  Firewall   │ → │   Cache     │       │
│   │   ~0.5ms    │   │   ~2ms      │   │   ~1ms      │       │
│   └─────────────┘   └─────────────┘   └─────────────┘       │
│                                               │               │
│                                    Cache hit? │               │
│                              ┌────────────────┼───────┐       │
│                              │ Yes            │ No    │       │
│                              ▼                ▼       │       │
│                      ┌─────────────┐  ┌─────────────┐│       │
│                      │ Return      │  │ Forward to  ││       │
│                      │ cached      │  │ upstream    ││       │
│                      └─────────────┘  └─────────────┘│       │
│                                               │       │       │
│   ┌───────────────────────────────────────────┼───────┘       │
│   │ Evidence logging (async, non-blocking)    │               │
│   └───────────────────────────────────────────┘               │
└───────────────────────────────────────────────────────────────┘
                               │
                               ▼
                   OpenAI / Anthropic / etc.
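
The stage order in the diagram can be sketched as a single function. This is a minimal illustration and not Raptor's actual code: `Outcome`, `handle`, and the closure parameters standing in for each stage are hypothetical names chosen for the example.

```rust
/// Hypothetical result of one proxied request (names are illustrative).
#[derive(Debug, PartialEq)]
pub enum Outcome {
    Unauthorized,
    Blocked,
    CacheHit(String),
    Forwarded(String),
}

/// Runs the stages in the order shown in the diagram:
/// auth -> firewall -> cache -> upstream.
/// Each stage is passed in as a closure so the sketch stays self-contained.
pub fn handle(
    api_key: &str,
    prompt: &str,
    is_valid_key: impl Fn(&str) -> bool,
    is_threat: impl Fn(&str) -> bool,
    cache_get: impl Fn(&str) -> Option<String>,
    upstream: impl Fn(&str) -> String,
) -> Outcome {
    if !is_valid_key(api_key) {
        return Outcome::Unauthorized;
    }
    if is_threat(prompt) {
        return Outcome::Blocked;
    }
    // A cache hit short-circuits the upstream call entirely.
    if let Some(cached) = cache_get(prompt) {
        return Outcome::CacheHit(cached);
    }
    Outcome::Forwarded(upstream(prompt))
}
```

Each early return corresponds to one exit in the diagram: a request only reaches the upstream provider after it has passed auth and the firewall and missed the cache.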

Latency Breakdown

| Stage | Time | Notes |
| --- | --- | --- |
| Auth validation | ~0.5ms | API key lookup with connection pooling |
| Firewall check | ~2ms | ONNX embedding + cosine similarity |
| Cache lookup | ~1ms | Hot cache (memory) + Redis |
| Evidence logging | ~0ms | Async, doesn’t block response |
| Total overhead | ~5ms | Compared to 50-100ms for Python/Node |

Why This Matters

A typical GPT-4 request takes 500-2000ms. Adding 50-100ms of proxy overhead (common with Python/Node) is noticeable. Adding 5ms is not.
Without Raptor:    ████████████████████████████████████ 500ms
With Raptor:       █████████████████████████████████████ 505ms  (+1%)
Python proxy:      ██████████████████████████████████████████ 600ms  (+20%)
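
The percentages in the chart are simply proxy overhead divided by upstream latency. A small sketch of that arithmetic, with `overhead_percent` as a hypothetical helper name:

```rust
/// Relative overhead a proxy adds on top of the upstream latency, in percent.
pub fn overhead_percent(upstream_ms: f64, proxy_ms: f64) -> f64 {
    proxy_ms * 100.0 / upstream_ms
}
```

For a 500ms upstream call, `overhead_percent(500.0, 5.0)` gives 1% and `overhead_percent(500.0, 100.0)` gives 20%, matching the chart. At the slow end of the range, 5ms on a 2000ms call is 0.25%.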

Three-Tier Caching

┌──────────────────────────────────────────────────────┐
│ HOT CACHE (In-Memory LRU)                            │
│ • Response time: <1ms                                │
│ • Size: 10,000 entries (configurable)                │
│ • Perfect for high-frequency queries                 │
└───────────────────────────┬──────────────────────────┘
                            │ Miss
                            ▼
┌──────────────────────────────────────────────────────┐
│ REDIS CACHE (Distributed)                            │
│ • Response time: 1-5ms                               │
│ • Shared across instances                            │
│ • Promotes hits to hot cache                         │
└───────────────────────────┬──────────────────────────┘
                            │ Miss
                            ▼
┌──────────────────────────────────────────────────────┐
│ UPSTREAM (OpenAI/Anthropic)                          │
│ • Response time: 200-2000ms                          │
│ • Response cached for next time                      │
└──────────────────────────────────────────────────────┘
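
The lookup order and the promotion step can be sketched as follows. This is a simplified illustration under stated assumptions: `TieredCache`, `SharedTier`, and `Source` are hypothetical names, a plain `HashMap` stands in for Redis, the hot tier has no LRU eviction, and writing upstream responses to both tiers is a guess at what "cached for next time" means.

```rust
use std::collections::HashMap;

/// Stand-in for the distributed tier. A real deployment would talk to Redis;
/// a HashMap keeps this sketch self-contained.
#[derive(Default)]
pub struct SharedTier {
    entries: HashMap<String, String>,
}

/// Which tier served the response.
#[derive(Debug, PartialEq)]
pub enum Source {
    Hot,
    Shared,
    Upstream,
}

#[derive(Default)]
pub struct TieredCache {
    hot: HashMap<String, String>, // no LRU eviction in this sketch
    shared: SharedTier,
}

impl TieredCache {
    /// Simulates another proxy instance writing to the shared tier.
    pub fn seed_shared(&mut self, key: &str, value: &str) {
        self.shared.entries.insert(key.to_string(), value.to_string());
    }

    /// Checks the hot tier, then the shared tier, then calls upstream.
    pub fn get_or_fetch(
        &mut self,
        key: &str,
        upstream: impl FnOnce(&str) -> String,
    ) -> (String, Source) {
        if let Some(v) = self.hot.get(key) {
            return (v.clone(), Source::Hot);
        }
        if let Some(v) = self.shared.entries.get(key).cloned() {
            // Promote the shared-tier hit so the next lookup stays in memory.
            self.hot.insert(key.to_string(), v.clone());
            return (v, Source::Shared);
        }
        let v = upstream(key);
        // Assumption: upstream responses are written to both tiers.
        self.shared.entries.insert(key.to_string(), v.clone());
        self.hot.insert(key.to_string(), v.clone());
        (v, Source::Upstream)
    }
}
```

Calling `get_or_fetch` twice with the same key returns `Source::Upstream` the first time and `Source::Hot` the second. A key written by another instance via `seed_shared` is served as `Source::Shared` once and from the hot tier afterwards.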

Semantic Cache vs Exact Match

Traditional caches require exact matches. Raptor uses semantic hashing:
"What's the capital of France?"     → hash: abc123
"What is the capital of France?"    → hash: abc123  ✓ Same!
"Tell me France's capital city"     → hash: abc123  ✓ Same!
We compute a vector embedding of the query, quantize its first 64 dimensions, and hash the quantized values. Queries that are close enough in embedding space quantize identically and therefore produce the same hash.
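
A minimal sketch of the quantize-then-hash idea, assuming the embedding is already available as a slice of `f32`. The 64-dimension prefix comes from the text. The bucket width `BUCKET`, the function name `semantic_hash`, and the use of the standard library's `DefaultHasher` are assumptions made for illustration; the quantization scheme and hash function Raptor actually uses are not specified here.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Number of leading embedding dimensions that feed the hash (from the text).
const HASHED_DIMS: usize = 64;

/// Hypothetical bucket width: values in the same bucket quantize identically.
const BUCKET: f32 = 0.1;

/// Quantizes the first 64 dimensions into integer buckets and hashes them.
pub fn semantic_hash(embedding: &[f32]) -> u64 {
    let mut hasher = DefaultHasher::new();
    for &x in embedding.iter().take(HASHED_DIMS) {
        // Coarse quantization: nearby values collapse to the same integer.
        let bucket = (x / BUCKET).floor() as i32;
        bucket.hash(&mut hasher);
    }
    hasher.finish()
}
```

With this scheme, two embeddings whose first 64 values fall into the same buckets hash identically, and dimensions past the 64th are ignored. Values that straddle a bucket boundary land in different buckets, so a scheme like this trades some missed matches for a constant-time lookup. `DefaultHasher` output is not guaranteed stable across Rust releases, so a key shared through Redis would need a hash with a fixed specification.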

Firewall Architecture

The firewall runs before forwarding to upstream:
  1. Extract text from request body (messages, prompt, etc.)
  2. Compute embedding using local ONNX model (~1ms)
  3. Compare against threat patterns via cosine similarity
  4. Block/warn/log based on configured thresholds
// Simplified firewall check
if cosine_similarity(request_embedding, pattern_embedding) > 0.85 {
    return Err(BlockedByFirewall);
}
For streaming responses, we also monitor the output and can terminate the stream mid-response if the model starts generating policy-violating content.
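
Steps 3 and 4 above can be sketched in a self-contained form. The 0.85 block threshold comes from the simplified check. The `Verdict` enum, the `check` function, and the separate warn threshold are hypothetical, and the embeddings are assumed to have been computed already.

```rust
/// Action taken for a request, mirroring "block/warn/log" in step 4.
#[derive(Debug, PartialEq)]
pub enum Verdict {
    Allow,
    Warn,
    Block,
}

/// Cosine similarity of two vectors; returns 0.0 if either has zero length.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    dot / (norm_a * norm_b)
}

/// Compares the request against every threat pattern and applies the
/// thresholds to the highest similarity found.
pub fn check(request: &[f32], patterns: &[Vec<f32>], warn_at: f32, block_at: f32) -> Verdict {
    let max_sim = patterns
        .iter()
        .map(|p| cosine_similarity(request, p))
        .fold(f32::MIN, f32::max);
    if max_sim > block_at {
        Verdict::Block
    } else if max_sim > warn_at {
        Verdict::Warn
    } else {
        Verdict::Allow
    }
}
```

Taking the maximum over all patterns means a single close match is enough to trigger a verdict. With an empty pattern list the maximum stays at `f32::MIN`, so every request is allowed.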

Evidence Pipeline

All requests are logged asynchronously:
Request → MPSC Channel → Background Worker → PostgreSQL
             │
             └── Non-blocking, ~10,000 buffer
Evidence is never on the critical path. Your requests don’t wait for logging.
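
The pattern can be sketched with the standard library's bounded channel. This is an illustration only: Raptor runs on Tokio, so the real pipeline presumably uses an async channel, and `std::sync::mpsc` plus a thread are swapped in here to keep the example dependency-free. The `Evidence` struct, `spawn_worker`, and `log_evidence` are hypothetical names. Dropping records when the buffer is full is an assumption, since the behavior on overflow is not described.

```rust
use std::sync::mpsc::{sync_channel, SyncSender, TrySendError};
use std::thread::{self, JoinHandle};

/// Hypothetical evidence record; the real schema is not described here.
#[derive(Debug, Clone, PartialEq)]
pub struct Evidence {
    pub request_id: u64,
    pub status: u16,
}

/// Spawns the background worker. The returned handle yields everything the
/// worker "persisted" once all senders have been dropped.
pub fn spawn_worker(buffer: usize) -> (SyncSender<Evidence>, JoinHandle<Vec<Evidence>>) {
    let (tx, rx) = sync_channel::<Evidence>(buffer);
    let handle = thread::spawn(move || {
        let mut stored = Vec::new();
        // A real worker would write each record to PostgreSQL here.
        for record in rx {
            stored.push(record);
        }
        stored
    });
    (tx, handle)
}

/// Called on the request path. Never blocks: returns false if the record
/// could not be queued (buffer full or worker gone).
pub fn log_evidence(tx: &SyncSender<Evidence>, record: Evidence) -> bool {
    match tx.try_send(record) {
        Ok(()) => true,
        Err(TrySendError::Full(_)) | Err(TrySendError::Disconnected(_)) => false,
    }
}
```

A caller would create the channel once with `spawn_worker(10_000)` and call `log_evidence` per request. Because `try_send` returns immediately, a slow database can only fill the buffer and never delays a response.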

Tech Stack

| Component | Technology |
| --- | --- |
| Language | Rust 1.75+ |
| Web framework | Axum 0.7 |
| Async runtime | Tokio |
| Database | PostgreSQL + pgvector |
| Cache | Redis + in-memory LRU |
| Embeddings | ONNX Runtime |
| Deployment | Docker / Kubernetes |

Resilience

  • Rate limiting: Per API key, configurable
  • Circuit breakers: Automatic failover on upstream errors
  • Connection pooling: Efficient database/Redis connections
  • Graceful shutdown: In-flight requests complete
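
As one example of the items above, per-key rate limiting can be sketched with a fixed-window counter. The algorithm Raptor actually uses is not stated, so the fixed-window approach, the `RateLimiter` type, and its parameters are all assumptions made for illustration.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Fixed-window limiter keyed by API key (hypothetical design).
pub struct RateLimiter {
    limit: u32,
    window: Duration,
    windows: HashMap<String, (Instant, u32)>, // (window start, requests seen)
}

impl RateLimiter {
    pub fn new(limit: u32, window: Duration) -> Self {
        RateLimiter {
            limit,
            window,
            windows: HashMap::new(),
        }
    }

    /// Returns true if the request is allowed. `now` is passed in explicitly
    /// so the logic can be exercised without sleeping.
    pub fn allow(&mut self, api_key: &str, now: Instant) -> bool {
        let entry = self.windows.entry(api_key.to_string()).or_insert((now, 0));
        // Start a fresh window once the current one has elapsed.
        if now.duration_since(entry.0) >= self.window {
            *entry = (now, 0);
        }
        if entry.1 < self.limit {
            entry.1 += 1;
            true
        } else {
            false
        }
    }
}
```

With `RateLimiter::new(2, Duration::from_secs(1))`, the third request from the same key inside one second is rejected while other keys are unaffected, and the key is allowed again once the window has passed.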
Raptor is designed to be invisible. If we add latency you notice, that’s a bug.