Deduplication

Deduplication is Raptor’s core cost-saving feature: it identifies reusable chunks between document versions so unchanged content is never re-embedded.

The Problem

Traditional RAG systems re-embed entire documents on updates:
Document v1: 200 chunks → 200 embeddings
Document v2: 205 chunks (5 new pages) → 205 NEW embeddings

Total cost: 405 embeddings
Wasted: 195 duplicate embeddings (48%)

The Solution

Raptor performs sentence-level diff analysis to identify reusable chunks:
Document v1: 200 chunks → 200 embeddings
Document v2: 205 chunks
  - 180 exact matches → REUSE embeddings
  - 15 high reuse (>80%) → REUSE embeddings
  - 10 new content → 10 NEW embeddings

Total cost: 210 embeddings
Savings: 195 embeddings (95% fewer new embeddings for v2 — 10 instead of 205)
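The arithmetic above can be sketched in a few lines (`dedupSavings` is an illustrative helper, not part of the Raptor API):

```typescript
// Back-of-envelope embedding cost with and without deduplication.
function dedupSavings(v1Chunks: number, v2Chunks: number, newChunks: number) {
  const naive = v1Chunks + v2Chunks;      // re-embed every chunk of both versions
  const withDedup = v1Chunks + newChunks; // embed only v2's genuinely new chunks
  return {
    naive,
    withDedup,
    saved: naive - withDedup,
    v2ReuseRatio: (v2Chunks - newChunks) / v2Chunks,
  };
}

// Matches the example: 200 v1 chunks, 205 v2 chunks, 10 genuinely new
const r = dedupSavings(200, 205, 10);
// naive: 405, withDedup: 210, saved: 195
```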

How It Works

Step 1: Version Linking

Documents must be linked (parent-child relationship):
// Auto-linking (recommended)
const v2 = await raptor.process(file_v2, {
  autoLink: true,
  versionLabel: 'v2.0'
});

// Manual linking
const v2 = await raptor.process(file_v2, {
  parentDocumentId: v1.document_id,
  versionLabel: 'v2.0'
});

Step 2: Sentence Extraction

Raptor extracts sentences from each chunk:
Chunk A (v1): [
  "The product includes three components.",
  "Each component serves a specific purpose.",
  "Installation takes approximately 30 minutes."
]

Chunk B (v2): [
  "The product includes three components.",
  "Each component serves a specific purpose.",
  "Installation takes approximately 45 minutes." // CHANGED
]
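Raptor's internal extractor is not exposed, but a rough approximation of this step, assuming simple punctuation-boundary splitting, would be:

```typescript
// Naive sentence splitter (illustrative only; a production extractor
// also handles abbreviations, lists, decimal numbers, etc.).
function extractSentences(chunkText: string): string[] {
  return chunkText
    .split(/(?<=[.!?])\s+/) // break after sentence-ending punctuation
    .map(s => s.trim())
    .filter(s => s.length > 0);
}
```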

Step 3: Fuzzy Matching

Sentences are compared using fuzzy matching (Levenshtein distance + semantic similarity):
Sentence 1: "The product includes three components."
  v1 vs v2: 100% match → REUSE

Sentence 2: "Each component serves a specific purpose."
  v1 vs v2: 100% match → REUSE

Sentence 3: "Installation takes approximately 30 minutes."
           "Installation takes approximately 45 minutes."
  v1 vs v2: 92% match → HIGH SIMILARITY
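For intuition, here is a minimal sketch of the edit-distance half of that comparison; Raptor also factors in semantic similarity, which this omits:

```typescript
// Character-level Levenshtein distance using a single-row DP table.
function levenshtein(a: string, b: string): number {
  const dp = Array.from({ length: a.length + 1 }, (_, i) => i);
  for (let j = 1; j <= b.length; j++) {
    let prev = dp[0];
    dp[0] = j;
    for (let i = 1; i <= a.length; i++) {
      const tmp = dp[i];
      dp[i] = Math.min(
        dp[i] + 1,                             // deletion
        dp[i - 1] + 1,                         // insertion
        prev + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution
      );
      prev = tmp;
    }
  }
  return dp[a.length];
}

// Normalize to a 0-1 similarity score.
function similarity(a: string, b: string): number {
  const maxLen = Math.max(a.length, b.length);
  return maxLen === 0 ? 1 : 1 - levenshtein(a, b) / maxLen;
}
```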

Step 4: Reuse Scoring

Raptor assigns a reuse strategy based on similarity:
| Strategy | Reuse Ratio | Embedding Recommendation | Description |
| --- | --- | --- | --- |
| exact | 100% | reuse | Identical content hash - definitely reuse |
| high_reuse | 80-100% | reuse or consider_reuse | Most sentences unchanged |
| partial_reuse | 50-80% | consider_reuse | Significant overlap |
| mixed_content | 30-50% | regenerate | Mixed old/new content |
| fuzzy | Below 30% | regenerate | Mostly new content |
| new | 0% | regenerate | Completely new chunk |
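The table's thresholds can be read as a simple classifier; `classifyStrategy` below is an illustrative encoding of those bands, not Raptor's internal logic:

```typescript
type Strategy = 'exact' | 'high_reuse' | 'partial_reuse' | 'mixed_content' | 'fuzzy' | 'new';

// Maps a content reuse ratio (plus an exact-content-hash flag) onto
// the strategy bands from the table above.
function classifyStrategy(reuseRatio: number, exactHashMatch: boolean): Strategy {
  if (exactHashMatch) return 'exact';
  if (reuseRatio === 0) return 'new';
  if (reuseRatio >= 0.8) return 'high_reuse';
  if (reuseRatio >= 0.5) return 'partial_reuse';
  if (reuseRatio >= 0.3) return 'mixed_content';
  return 'fuzzy';
}
```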

Response Format

Each chunk includes deduplication metadata:
{
  id: "chunk-uuid",
  text: "Chunk content...",
  chunk_index: 5,
  tokens: 245,

  // Deduplication metadata
  dedup_strategy: "high_reuse",
  dedup_confidence: 0.92,
  is_reused: true,
  dedup_source_chunk_id: "source-chunk-uuid",

  // Sentence-level metrics
  total_sentences: 5,
  reused_sentences_count: 4,
  new_sentences_count: 1,
  content_reuse_ratio: 0.8,

  // Embedding recommendation
  embedding_recommendation: "reuse", // or "consider_reuse", "regenerate"
  recommendation_confidence: "high" // or "medium", "low"
}

Deduplication Summary

Get aggregate statistics for a variant:
const summary = await raptor.getDedupSummary(variantId);

console.log(summary);
Response:
{
  "variant_id": "variant-uuid",
  "total_chunks": 205,

  "chunk_breakdown": {
    "exact": 120,
    "high_reuse": 60,
    "partial_reuse": 15,
    "new": 10
  },

  "total_sentences": 1025,
  "reused_sentences": 850,
  "new_sentences": 175,
  "sentence_reuse_ratio": 0.83,

  "embedding_recommendations": {
    "reuse": 180,
    "consider_reuse": 15,
    "regenerate": 10
  },

  "parent_version_id": "parent-uuid",
  "has_parent": true
}
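A summary like this translates directly into a dollar estimate. The token count and price below are assumptions (roughly text-embedding-3-small's published rate); substitute your own numbers:

```typescript
interface RecommendationCounts {
  reuse: number;
  consider_reuse: number;
  regenerate: number;
}

// Rough cost estimate from a dedup summary. Conservatively re-embeds
// both "regenerate" and "consider_reuse" chunks.
function estimateSavings(
  totalChunks: number,
  recs: RecommendationCounts,
  avgTokensPerChunk = 250,
  pricePerMillionTokens = 0.02
) {
  const costOf = (chunks: number) =>
    (chunks * avgTokensPerChunk * pricePerMillionTokens) / 1_000_000;
  const withDedup = costOf(recs.consider_reuse + recs.regenerate);
  const withoutDedup = costOf(totalChunks);
  return { withoutDedup, withDedup, saved: withoutDedup - withDedup };
}

// Using the summary above: 205 chunks, 180 reuse / 15 consider / 10 regenerate
const est = estimateSavings(205, { reuse: 180, consider_reuse: 15, regenerate: 10 });
```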

Implementing Cost Savings

Pattern 1: Simple Reuse

Only generate embeddings for new content:
const chunks = await raptor.getDocumentChunks(docId);

for (const chunk of chunks) {
  if (chunk.embedding_recommendation === 'reuse') {
    // Reuse existing embedding
    const sourceEmbedding = await db.getEmbedding(chunk.dedup_source_chunk_id);
    await db.storeEmbedding(chunk.id, sourceEmbedding);
  } else {
    // Generate a new embedding (the API returns a response object;
    // the vector itself is in response.data[0].embedding)
    const response = await openai.embeddings.create({
      input: chunk.text,
      model: 'text-embedding-3-small'
    });
    await db.storeEmbedding(chunk.id, response.data[0].embedding);
  }
}

Pattern 2: Conservative Approach

Only reuse exact matches:
for (const chunk of chunks) {
  if (chunk.dedup_strategy === 'exact') {
    // Reuse embedding for 100% identical chunks
    await reuseEmbedding(chunk);
  } else {
    // Regenerate for any changes
    await generateNewEmbedding(chunk);
  }
}

Pattern 3: Aggressive Reuse

Reuse embeddings for high similarity:
for (const chunk of chunks) {
  if (chunk.content_reuse_ratio >= 0.8) {
    // Reuse if 80%+ sentences are unchanged
    await reuseEmbedding(chunk);
  } else {
    await generateNewEmbedding(chunk);
  }
}

Real-World Examples

Example 1: Software Manual Update

// v1.0: 50 pages → 100 chunks
const v1 = await raptor.process(manual_v1);

// v2.0: Added 2 pages, updated 3 sections
const v2 = await raptor.process(manual_v2, {
  parentDocumentId: v1.document_id,
  versionLabel: 'v2.0'
});

const summary = await raptor.getDedupSummary(v2.variant_id);
// {
//   total_chunks: 105,
//   embedding_recommendations: {
//     reuse: 85,
//     consider_reuse: 10,
//     regenerate: 10
//   },
//   sentence_reuse_ratio: 0.89
// }

// Cost savings: 89% fewer embeddings needed
Example 2: Contract Revision

// Original contract
const contract = await raptor.process(contract_original);

// Revised contract (changed 2 clauses)
const revised = await raptor.process(contract_revised, {
  parentDocumentId: contract.document_id
});

const summary = await raptor.getDedupSummary(revised.variant_id);
// {
//   sentence_reuse_ratio: 0.96
// }

// 96% cost savings!

Example 3: Policy Document Timeline

// Upload multiple versions
const v1 = await raptor.process(policy_2022);
const v2 = await raptor.process(policy_2023, {
  parentDocumentId: v1.document_id
});
const v3 = await raptor.process(policy_2024, {
  parentDocumentId: v2.document_id
});

// Compare v3 to v1 (original)
const comparison = await raptor.compareDocuments(v1.document_id, v3.document_id);
console.log('Changes over 2 years:', comparison.changes);

Chunk-Level Metadata

dedup_metadata (JSONB)

For deep analysis, enable full metadata:
const chunks = await raptor.getDocumentChunks(docId, {
  includeFullMetadata: true
});

chunks.forEach(chunk => {
  console.log(chunk.dedup_metadata);
  // {
  //   matched_sentences: [
  //     { sentence: "...", source_chunk_id: "...", similarity: 0.98 },
  //     ...
  //   ],
  //   new_sentences: ["...", "..."],
  //   changed_sentences: [
  //     { old: "...", new: "...", similarity: 0.85 }
  //   ]
  // }
});
Full metadata can be large (5-50KB per chunk). Only enable when needed for debugging or detailed analysis.
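When full metadata is enabled, a small helper can condense it for logging; `describeChanges` and its field types below mirror the shape shown above but are otherwise hypothetical:

```typescript
interface DedupMetadata {
  matched_sentences: { sentence: string; source_chunk_id: string; similarity: number }[];
  new_sentences: string[];
  changed_sentences: { old: string; new: string; similarity: number }[];
}

// One-line change summary for a chunk's dedup_metadata.
function describeChanges(meta: DedupMetadata): string {
  return [
    `${meta.matched_sentences.length} matched`,
    `${meta.changed_sentences.length} changed`,
    `${meta.new_sentences.length} new`,
  ].join(', ');
}
```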

Performance Considerations

Processing Time

Deduplication adds 2-5 seconds per document depending on:
  • Document size
  • Number of parent versions
  • Chunk count

Storage Impact

Metadata adds ~1KB per chunk:
  • 100 chunks = ~100KB metadata
  • 1000 chunks = ~1MB metadata

Accuracy vs Speed

Tune fuzzy matching threshold (internal parameter):
  • High accuracy (default): 90%+ similarity required for high_reuse
  • Fast processing: 80%+ similarity threshold

Limitations

When Deduplication Doesn’t Help

  1. First version: No parent to compare against
    const v1 = await raptor.process(first_doc);
    // All chunks marked as "new"
    
  2. Complete rewrites: No content overlap
    const v2 = await raptor.process(completely_different_doc, {
      parentDocumentId: v1.document_id
    });
    // All chunks marked as "new"
    
  3. Different chunking strategies: Chunk boundaries change
    const v1 = await raptor.process(doc, { chunkSize: 512 });
    const v2 = await raptor.process(doc, {
      chunkSize: 1024, // Different chunking
      parentDocumentId: v1.document_id
    });
    // Lower reuse ratio due to boundary misalignment
    

Best Practices

Keep chunking parameters consistent across versions:
const config = {
  chunkSize: 512,
  chunkOverlap: 50,
  strategy: 'semantic'
};

const v1 = await raptor.process(doc_v1, config);
const v2 = await raptor.process(doc_v2, { ...config, parentDocumentId: v1.document_id });
Track savings over time:
const summary = await raptor.getDedupSummary(variantId);

if (summary.sentence_reuse_ratio < 0.5) {
  console.warn('Low reuse ratio - check if documents are related');
} else {
  console.log(`${(summary.sentence_reuse_ratio * 100).toFixed(0)}% cost savings`);
}
Choose embedding reuse strategy based on your needs:
  • High accuracy RAG: Only reuse exact matches
  • Balanced: Reuse exact + high_reuse
  • Cost-optimized: Reuse all with content_reuse_ratio >= 0.8
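The three policies above can be collapsed into one predicate; `shouldReuse` and the `Policy` type are illustrative names, not part of the Raptor API:

```typescript
type Policy = 'high_accuracy' | 'balanced' | 'cost_optimized';

// Decide per chunk whether to reuse its parent's embedding,
// using the chunk-level dedup fields documented above.
function shouldReuse(
  chunk: { dedup_strategy: string; content_reuse_ratio: number },
  policy: Policy
): boolean {
  if (policy === 'high_accuracy') return chunk.dedup_strategy === 'exact';
  if (policy === 'balanced')
    return chunk.dedup_strategy === 'exact' || chunk.dedup_strategy === 'high_reuse';
  return chunk.content_reuse_ratio >= 0.8; // cost_optimized
}
```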

Next Steps