Deduplication

Deduplication is Raptor’s core cost-saving feature: it identifies reusable chunks between document versions so unchanged content is never re-embedded.

The Problem

Traditional RAG systems re-embed entire documents on updates:
Document v1: 200 chunks → 200 embeddings
Document v2: 205 chunks (5 new pages) → 205 NEW embeddings

Total cost: 405 embeddings
Wasted: 195 duplicate embeddings (48%)

The Solution

Raptor performs sentence-level diff analysis to identify reusable chunks:
Document v1: 200 chunks → 200 embeddings
Document v2: 205 chunks
  - 180 exact matches → REUSE embeddings
  - 15 high reuse (>80%) → REUSE embeddings
  - 10 new content → 10 NEW embeddings

Total cost: 210 embeddings
Savings: 195 embeddings (95% fewer new embeddings for v2 — 10 instead of 205)
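The arithmetic above can be sketched in a few lines (`dedupSavings` is an illustrative helper, not part of the Raptor API):

```typescript
// Back-of-envelope embedding cost with and without deduplication.
function dedupSavings(v1Chunks: number, v2Chunks: number, newChunks: number) {
  const naive = v1Chunks + v2Chunks;      // re-embed every chunk of both versions
  const withDedup = v1Chunks + newChunks; // embed only v2's genuinely new chunks
  return {
    naive,
    withDedup,
    saved: naive - withDedup,
    v2ReuseRatio: (v2Chunks - newChunks) / v2Chunks,
  };
}

// Matches the example: 200 v1 chunks, 205 v2 chunks, 10 genuinely new
const r = dedupSavings(200, 205, 10);
// naive: 405, withDedup: 210, saved: 195
```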

How It Works

Step 1: Version Linking

Documents must be linked (parent-child relationship):
// Auto-linking (recommended)
const v2 = await raptor.process(file_v2, {
  autoLink: true,
  versionLabel: 'v2.0'
});

// Manual linking
const v2 = await raptor.process(file_v2, {
  parentDocumentId: v1.document_id,
  versionLabel: 'v2.0'
});

Step 2: Sentence Extraction

Raptor extracts sentences from each chunk:
Chunk A (v1): [
  "The product includes three components.",
  "Each component serves a specific purpose.",
  "Installation takes approximately 30 minutes."
]

Chunk B (v2): [
  "The product includes three components.",
  "Each component serves a specific purpose.",
  "Installation takes approximately 45 minutes." // CHANGED
]
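Raptor's internal extractor is not exposed, but a rough approximation of this step, assuming simple punctuation-boundary splitting, would be:

```typescript
// Naive sentence splitter (illustrative only; a production extractor
// also handles abbreviations, lists, decimal numbers, etc.).
function extractSentences(chunkText: string): string[] {
  return chunkText
    .split(/(?<=[.!?])\s+/) // break after sentence-ending punctuation
    .map(s => s.trim())
    .filter(s => s.length > 0);
}
```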

Step 3: Fuzzy Matching

Sentences are compared using fuzzy matching (Levenshtein distance + semantic similarity):
Sentence 1: "The product includes three components."
  v1 vs v2: 100% match → REUSE

Sentence 2: "Each component serves a specific purpose."
  v1 vs v2: 100% match → REUSE

Sentence 3: "Installation takes approximately 30 minutes."
           "Installation takes approximately 45 minutes."
  v1 vs v2: 92% match → HIGH SIMILARITY
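For intuition, here is a minimal sketch of the edit-distance half of that comparison; Raptor also factors in semantic similarity, which this omits:

```typescript
// Character-level Levenshtein distance using a single-row DP table.
function levenshtein(a: string, b: string): number {
  const dp = Array.from({ length: a.length + 1 }, (_, i) => i);
  for (let j = 1; j <= b.length; j++) {
    let prev = dp[0];
    dp[0] = j;
    for (let i = 1; i <= a.length; i++) {
      const tmp = dp[i];
      dp[i] = Math.min(
        dp[i] + 1,                             // deletion
        dp[i - 1] + 1,                         // insertion
        prev + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution
      );
      prev = tmp;
    }
  }
  return dp[a.length];
}

// Normalize to a 0-1 similarity score.
function similarity(a: string, b: string): number {
  const maxLen = Math.max(a.length, b.length);
  return maxLen === 0 ? 1 : 1 - levenshtein(a, b) / maxLen;
}
```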

Step 4: Reuse Scoring

Raptor assigns a reuse strategy based on similarity:
| Strategy | Reuse Ratio | Embedding Recommendation | Description |
| --- | --- | --- | --- |
| exact | 100% | reuse | Identical content hash - definitely reuse |
| high_reuse | 80-100% | reuse or consider_reuse | Most sentences unchanged |
| partial_reuse | 50-80% | consider_reuse | Significant overlap |
| mixed_content | 30-50% | regenerate | Mixed old/new content |
| fuzzy | Below 30% | regenerate | Mostly new content |
| new | 0% | regenerate | Completely new chunk |
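The table's thresholds can be read as a simple classifier; `classifyStrategy` below is an illustrative encoding of those bands, not Raptor's internal logic:

```typescript
type Strategy = 'exact' | 'high_reuse' | 'partial_reuse' | 'mixed_content' | 'fuzzy' | 'new';

// Maps a content reuse ratio (plus an exact-content-hash flag) onto
// the strategy bands from the table above.
function classifyStrategy(reuseRatio: number, exactHashMatch: boolean): Strategy {
  if (exactHashMatch) return 'exact';
  if (reuseRatio === 0) return 'new';
  if (reuseRatio >= 0.8) return 'high_reuse';
  if (reuseRatio >= 0.5) return 'partial_reuse';
  if (reuseRatio >= 0.3) return 'mixed_content';
  return 'fuzzy';
}
```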

Response Format

Each chunk includes deduplication metadata:
{
  id: "chunk-uuid",
  text: "Chunk content...",
  chunk_index: 5,
  tokens: 245,

  // Deduplication metadata
  dedup_strategy: "high_reuse",
  dedup_confidence: 0.92,
  is_reused: true,
  dedup_source_chunk_id: "source-chunk-uuid",

  // Sentence-level metrics
  total_sentences: 5,
  reused_sentences_count: 4,
  new_sentences_count: 1,
  content_reuse_ratio: 0.8,

  // Embedding recommendation
  embedding_recommendation: "reuse", // or "consider_reuse", "regenerate"
  recommendation_confidence: "high" // or "medium", "low"
}

Deduplication Summary

Get aggregate statistics for a variant:
const summary = await raptor.getDedupSummary(variantId);

console.log(summary);
Response:
{
  "variant_id": "variant-uuid",
  "total_chunks": 205,

  "chunk_breakdown": {
    "exact": 120,
    "high_reuse": 60,
    "partial_reuse": 15,
    "new": 10
  },

  "total_sentences": 1025,
  "reused_sentences": 850,
  "new_sentences": 175,
  "sentence_reuse_ratio": 0.83,

  "embedding_recommendations": {
    "reuse": 180,
    "consider_reuse": 15,
    "regenerate": 10
  },

  "parent_version_id": "parent-uuid",
  "has_parent": true
}
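A summary like this translates directly into a dollar estimate. The token count and price below are assumptions (roughly text-embedding-3-small's published rate); substitute your own numbers:

```typescript
interface RecommendationCounts {
  reuse: number;
  consider_reuse: number;
  regenerate: number;
}

// Rough cost estimate from a dedup summary. Conservatively re-embeds
// both "regenerate" and "consider_reuse" chunks.
function estimateSavings(
  totalChunks: number,
  recs: RecommendationCounts,
  avgTokensPerChunk = 250,
  pricePerMillionTokens = 0.02
) {
  const costOf = (chunks: number) =>
    (chunks * avgTokensPerChunk * pricePerMillionTokens) / 1_000_000;
  const withDedup = costOf(recs.consider_reuse + recs.regenerate);
  const withoutDedup = costOf(totalChunks);
  return { withoutDedup, withDedup, saved: withoutDedup - withDedup };
}

// Using the summary above: 205 chunks, 180 reuse / 15 consider / 10 regenerate
const est = estimateSavings(205, { reuse: 180, consider_reuse: 15, regenerate: 10 });
```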

Implementing Cost Savings

Pattern 1: Simple Reuse

Only generate embeddings for new content:
const chunks = await raptor.getDocumentChunks(docId);

for (const chunk of chunks) {
  if (chunk.embedding_recommendation === 'reuse') {
    // Reuse existing embedding
    const sourceEmbedding = await db.getEmbedding(chunk.dedup_source_chunk_id);
    await db.storeEmbedding(chunk.id, sourceEmbedding);
  } else {
    // Generate a new embedding (the API returns a response object;
    // the vector itself is in response.data[0].embedding)
    const response = await openai.embeddings.create({
      input: chunk.text,
      model: 'text-embedding-3-small'
    });
    await db.storeEmbedding(chunk.id, response.data[0].embedding);
  }
}

Pattern 2: Conservative Approach

Only reuse exact matches:
for (const chunk of chunks) {
  if (chunk.dedup_strategy === 'exact') {
    // Reuse embedding for 100% identical chunks
    await reuseEmbedding(chunk);
  } else {
    // Regenerate for any changes
    await generateNewEmbedding(chunk);
  }
}

Pattern 3: Aggressive Reuse

Reuse embeddings for high similarity:
for (const chunk of chunks) {
  if (chunk.content_reuse_ratio >= 0.8) {
    // Reuse if 80%+ sentences are unchanged
    await reuseEmbedding(chunk);
  } else {
    await generateNewEmbedding(chunk);
  }
}

Real-World Examples

Example 1: Software Manual Update

// v1.0: 50 pages → 100 chunks
const v1 = await raptor.process(manual_v1);

// v2.0: Added 2 pages, updated 3 sections
const v2 = await raptor.process(manual_v2, {
  parentDocumentId: v1.document_id,
  versionLabel: 'v2.0'
});

const summary = await raptor.getDedupSummary(v2.variant_id);
// {
//   total_chunks: 105,
//   embedding_recommendations: {
//     reuse: 85,
//     consider_reuse: 10,
//     regenerate: 10
//   },
//   sentence_reuse_ratio: 0.89
// }

// Cost savings: 89% fewer embeddings needed
Example 2: Contract Revision

// Original contract
const contract = await raptor.process(contract_original);

// Revised contract (changed 2 clauses)
const revised = await raptor.process(contract_revised, {
  parentDocumentId: contract.document_id
});

const summary = await raptor.getDedupSummary(revised.variant_id);
// {
//   sentence_reuse_ratio: 0.96
// }

// 96% cost savings!

Example 3: Policy Document Timeline

// Upload multiple versions
const v1 = await raptor.process(policy_2022);
const v2 = await raptor.process(policy_2023, {
  parentDocumentId: v1.document_id
});
const v3 = await raptor.process(policy_2024, {
  parentDocumentId: v2.document_id
});

// Compare v3 to v1 (original)
const comparison = await raptor.compareDocuments(v1.document_id, v3.document_id);
console.log('Changes over 2 years:', comparison.changes);

Chunk-Level Metadata

dedup_metadata (JSONB)

For deep analysis, enable full metadata:
const chunks = await raptor.getDocumentChunks(docId, {
  includeFullMetadata: true
});

chunks.forEach(chunk => {
  console.log(chunk.dedup_metadata);
  // {
  //   matched_sentences: [
  //     { sentence: "...", source_chunk_id: "...", similarity: 0.98 },
  //     ...
  //   ],
  //   new_sentences: ["...", "..."],
  //   changed_sentences: [
  //     { old: "...", new: "...", similarity: 0.85 }
  //   ]
  // }
});
Full metadata can be large (5-50KB per chunk). Only enable when needed for debugging or detailed analysis.
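When full metadata is enabled, a small helper can condense it for logging; `describeChanges` and its field types below mirror the shape shown above but are otherwise hypothetical:

```typescript
interface DedupMetadata {
  matched_sentences: { sentence: string; source_chunk_id: string; similarity: number }[];
  new_sentences: string[];
  changed_sentences: { old: string; new: string; similarity: number }[];
}

// One-line change summary for a chunk's dedup_metadata.
function describeChanges(meta: DedupMetadata): string {
  return [
    `${meta.matched_sentences.length} matched`,
    `${meta.changed_sentences.length} changed`,
    `${meta.new_sentences.length} new`,
  ].join(', ');
}
```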

Performance Considerations

Processing Time

Deduplication adds 2-5 seconds per document depending on:
  • Document size
  • Number of parent versions
  • Chunk count

Storage Impact

Metadata adds ~1KB per chunk:
  • 100 chunks = ~100KB metadata
  • 1000 chunks = ~1MB metadata

Accuracy vs Speed

Tune fuzzy matching threshold (internal parameter):
  • High accuracy (default): 90%+ similarity required for high_reuse
  • Fast processing: 80%+ similarity threshold

Limitations

When Deduplication Doesn’t Help

  1. First version: No parent to compare against
    const v1 = await raptor.process(first_doc);
    // All chunks marked as "new"
    
  2. Complete rewrites: No content overlap
    const v2 = await raptor.process(completely_different_doc, {
      parentDocumentId: v1.document_id
    });
    // All chunks marked as "new"
    
  3. Different chunking strategies: Chunk boundaries change
    const v1 = await raptor.process(doc, { chunkSize: 512 });
    const v2 = await raptor.process(doc, {
      chunkSize: 1024, // Different chunking
      parentDocumentId: v1.document_id
    });
    // Lower reuse ratio due to boundary misalignment
    

Best Practices

Keep chunking parameters consistent across versions:
const config = {
  chunkSize: 512,
  chunkOverlap: 50,
  strategy: 'semantic'
};

const v1 = await raptor.process(doc_v1, config);
const v2 = await raptor.process(doc_v2, { ...config, parentDocumentId: v1.document_id });
Track savings over time:
const summary = await raptor.getDedupSummary(variantId);

if (summary.sentence_reuse_ratio < 0.5) {
  console.warn('Low reuse ratio - check if documents are related');
} else {
  console.log(`${(summary.sentence_reuse_ratio * 100).toFixed(0)}% cost savings`);
}
Choose embedding reuse strategy based on your needs:
  • High accuracy RAG: Only reuse exact matches
  • Balanced: Reuse exact + high_reuse
  • Cost-optimized: Reuse all with content_reuse_ratio >= 0.8
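The three policies above can be collapsed into one predicate; `shouldReuse` and the `Policy` type are illustrative names, not part of the Raptor API:

```typescript
type Policy = 'high_accuracy' | 'balanced' | 'cost_optimized';

// Decide per chunk whether to reuse its parent's embedding,
// using the chunk-level dedup fields documented above.
function shouldReuse(
  chunk: { dedup_strategy: string; content_reuse_ratio: number },
  policy: Policy
): boolean {
  if (policy === 'high_accuracy') return chunk.dedup_strategy === 'exact';
  if (policy === 'balanced')
    return chunk.dedup_strategy === 'exact' || chunk.dedup_strategy === 'high_reuse';
  return chunk.content_reuse_ratio >= 0.8; // cost_optimized
}
```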

Next Steps