Deduplication
Deduplication is Raptor’s core cost-saving feature: it identifies reusable chunks across document versions so that unchanged content is never re-embedded.
The Problem
Traditional RAG systems re-embed entire documents on updates:
Document v1: 200 chunks → 200 embeddings
Document v2: 205 chunks (5 new pages) → 205 NEW embeddings
Total cost: 405 embeddings
Wasted: 195 duplicate embeddings (48%)
The Solution
Raptor performs sentence-level diff analysis to identify reusable chunks:
Document v1: 200 chunks → 200 embeddings
Document v2: 205 chunks
- 180 exact matches → REUSE embeddings
- 15 high reuse (>80%) → REUSE embeddings
- 10 new content → 10 NEW embeddings
Total cost: 210 embeddings
Savings: 195 embeddings (95% fewer new embeddings for v2)
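The arithmetic behind these figures can be written out explicitly. This is illustrative only; `dedupSavings` is not part of the Raptor API:

```javascript
// Illustrative arithmetic only; dedupSavings is not a Raptor SDK function.
function dedupSavings(v1Chunks, v2Chunks, reusedChunks) {
  const naive = v1Chunks + v2Chunks;                      // re-embed everything
  const withDedup = v1Chunks + (v2Chunks - reusedChunks); // embed only v2's new chunks
  return {
    naive,
    withDedup,
    saved: naive - withDedup,
    v2Reduction: reusedChunks / v2Chunks  // fraction of v2 embeddings avoided
  };
}

const example = dedupSavings(200, 205, 195);
// example.naive === 405, example.withDedup === 210, example.saved === 195
```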
How It Works
Step 1: Version Linking
Documents must be linked (parent-child relationship):
// Auto-linking (recommended)
const v2 = await raptor.process(file_v2, {
  autoLink: true,
  versionLabel: 'v2.0'
});

// Manual linking
const v2 = await raptor.process(file_v2, {
  parentDocumentId: v1.document_id,
  versionLabel: 'v2.0'
});
Step 2: Sentence Extraction
Raptor extracts sentences from each chunk:
Chunk A (v1): [
  "The product includes three components.",
  "Each component serves a specific purpose.",
  "Installation takes approximately 30 minutes."
]

Chunk B (v2): [
  "The product includes three components.",
  "Each component serves a specific purpose.",
  "Installation takes approximately 45 minutes."  // CHANGED
]
Step 3: Fuzzy Matching
Sentences are compared using fuzzy matching (Levenshtein distance + semantic similarity):
Sentence 1: "The product includes three components."
v1 vs v2: 100% match → REUSE
Sentence 2: "Each component serves a specific purpose."
v1 vs v2: 100% match → REUSE
Sentence 3: "Installation takes approximately 30 minutes."
"Installation takes approximately 45 minutes."
v1 vs v2: 92% match → HIGH SIMILARITY
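The string-distance half of the comparison can be sketched as a normalized Levenshtein similarity. The semantic-similarity component is omitted here, so this character-level metric will not exactly reproduce the 92% score Raptor reports:

```javascript
// Minimal Levenshtein distance via dynamic programming.
function levenshtein(a, b) {
  const dp = Array.from({ length: a.length + 1 }, (_, i) => [i]);
  for (let j = 1; j <= b.length; j++) dp[0][j] = j;
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,                                   // deletion
        dp[i][j - 1] + 1,                                   // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1)  // substitution
      );
    }
  }
  return dp[a.length][b.length];
}

// Normalize distance into a 0-1 similarity score.
function similarity(a, b) {
  const maxLen = Math.max(a.length, b.length);
  return maxLen === 0 ? 1 : 1 - levenshtein(a, b) / maxLen;
}

const sim = similarity(
  "Installation takes approximately 30 minutes.",
  "Installation takes approximately 45 minutes."
);
// ≈ 0.95 under this purely character-level metric
```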
Step 4: Reuse Scoring
Raptor assigns a reuse strategy based on similarity:
| Strategy | Reuse Ratio | Embedding Recommendation | Description |
|---|---|---|---|
| exact | 100% | reuse | Identical content hash - definitely reuse |
| high_reuse | 80-100% | reuse or consider_reuse | Most sentences unchanged |
| partial_reuse | 50-80% | consider_reuse | Significant overlap |
| mixed_content | 30-50% | regenerate | Mixed old/new content |
| fuzzy | Below 30% | regenerate | Mostly new content |
| new | 0% | regenerate | Completely new chunk |
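The table reads naturally as a decision rule. Here is one plausible mapping as code; the exact threshold boundaries and the tie-break for `high_reuse` (we pick `consider_reuse`) are our own assumptions, not Raptor's internals:

```javascript
// One plausible reading of the strategy table as code (assumed thresholds).
function classifyChunk(reuseRatio, exactHashMatch) {
  if (exactHashMatch)    return { strategy: 'exact',         recommendation: 'reuse' };
  if (reuseRatio >= 0.8) return { strategy: 'high_reuse',    recommendation: 'consider_reuse' };
  if (reuseRatio >= 0.5) return { strategy: 'partial_reuse', recommendation: 'consider_reuse' };
  if (reuseRatio >= 0.3) return { strategy: 'mixed_content', recommendation: 'regenerate' };
  if (reuseRatio > 0)    return { strategy: 'fuzzy',         recommendation: 'regenerate' };
  return { strategy: 'new', recommendation: 'regenerate' };
}

classifyChunk(0.85, false);
// → { strategy: 'high_reuse', recommendation: 'consider_reuse' }
```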
Each chunk includes deduplication metadata:
{
  id: "chunk-uuid",
  text: "Chunk content...",
  chunk_index: 5,
  tokens: 245,

  // Deduplication metadata
  dedup_strategy: "high_reuse",
  dedup_confidence: 0.92,
  is_reused: true,
  dedup_source_chunk_id: "source-chunk-uuid",

  // Sentence-level metrics
  total_sentences: 5,
  reused_sentences_count: 4,
  new_sentences_count: 1,
  content_reuse_ratio: 0.8,

  // Embedding recommendation
  embedding_recommendation: "reuse",  // or "consider_reuse", "regenerate"
  recommendation_confidence: "high"   // or "medium", "low"
}
Deduplication Summary
Get aggregate statistics for a variant:
const summary = await raptor.getDedupSummary(variantId);
console.log(summary);
Response:
{
  "variant_id": "variant-uuid",
  "total_chunks": 205,
  "chunk_breakdown": {
    "exact": 120,
    "high_reuse": 60,
    "partial_reuse": 15,
    "new": 10
  },
  "total_sentences": 1025,
  "reused_sentences": 850,
  "new_sentences": 175,
  "sentence_reuse_ratio": 0.83,
  "embedding_recommendations": {
    "reuse": 180,
    "consider_reuse": 15,
    "regenerate": 10
  },
  "parent_version_id": "parent-uuid",
  "has_parent": true
}
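As an illustration of what you can derive from this response, here is a hypothetical helper that conservatively re-embeds `consider_reuse` chunks. Both `estimateSavings` and the per-embedding price are assumptions, not part of the Raptor API:

```javascript
// Hypothetical helper (not part of the Raptor SDK): estimates embeddings
// avoided, conservatively regenerating 'consider_reuse' chunks as well.
// costPerEmbedding is a made-up placeholder price.
function estimateSavings(summary, costPerEmbedding = 0.0001) {
  const { consider_reuse = 0, regenerate = 0 } = summary.embedding_recommendations;
  const avoided = summary.total_chunks - (regenerate + consider_reuse);
  return {
    embeddingsAvoided: avoided,
    estimatedDollarsSaved: avoided * costPerEmbedding
  };
}

const est = estimateSavings({
  total_chunks: 205,
  embedding_recommendations: { reuse: 180, consider_reuse: 15, regenerate: 10 }
});
// est.embeddingsAvoided === 180
```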
Implementing Cost Savings
Pattern 1: Simple Reuse
Only generate embeddings for new content:
const chunks = await raptor.getDocumentChunks(docId);

for (const chunk of chunks) {
  if (chunk.embedding_recommendation === 'reuse') {
    // Reuse existing embedding
    const sourceEmbedding = await db.getEmbedding(chunk.dedup_source_chunk_id);
    await db.storeEmbedding(chunk.id, sourceEmbedding);
  } else {
    // Generate new embedding
    const response = await openai.embeddings.create({
      input: chunk.text,
      model: 'text-embedding-3-small'
    });
    await db.storeEmbedding(chunk.id, response.data[0].embedding);
  }
}
Pattern 2: Conservative Approach
Only reuse exact matches:
for (const chunk of chunks) {
  if (chunk.dedup_strategy === 'exact') {
    // Reuse embedding for 100% identical chunks
    await reuseEmbedding(chunk);
  } else {
    // Regenerate for any changes
    await generateNewEmbedding(chunk);
  }
}
Pattern 3: Aggressive Reuse
Reuse embeddings for high similarity:
for (const chunk of chunks) {
  if (chunk.content_reuse_ratio >= 0.8) {
    // Reuse if 80%+ sentences are unchanged
    await reuseEmbedding(chunk);
  } else {
    await generateNewEmbedding(chunk);
  }
}
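A middle-ground variant between these patterns reuses `consider_reuse` chunks only when the recommendation confidence is high. The field names follow the chunk metadata shown earlier; the policy itself is our own sketch:

```javascript
// Hybrid policy sketch: trust 'reuse' outright, accept 'consider_reuse'
// only at high confidence, regenerate everything else.
function shouldReuse(chunk) {
  if (chunk.embedding_recommendation === 'reuse') return true;
  if (chunk.embedding_recommendation === 'consider_reuse') {
    return chunk.recommendation_confidence === 'high';
  }
  return false;
}

// Usage mirrors the patterns above:
// shouldReuse(chunk) ? await reuseEmbedding(chunk) : await generateNewEmbedding(chunk);
```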
Real-World Examples
Example 1: Software Manual Update
// v1.0: 50 pages → 100 chunks
const v1 = await raptor.process(manual_v1);

// v2.0: Added 2 pages, updated 3 sections
const v2 = await raptor.process(manual_v2, {
  parentDocumentId: v1.document_id,
  versionLabel: 'v2.0'
});

const summary = await raptor.getDedupSummary(v2.variant_id);
// {
//   total_chunks: 105,
//   embedding_recommendations: {
//     reuse: 85,
//     consider_reuse: 10,
//     regenerate: 10
//   },
//   sentence_reuse_ratio: 0.89
// }
// Cost savings: ~89% fewer embeddings needed
Example 2: Legal Contract Revision
// Original contract
const contract = await raptor.process(contract_original);

// Revised contract (changed 2 clauses)
const revised = await raptor.process(contract_revised, {
  parentDocumentId: contract.document_id
});

const summary = await raptor.getDedupSummary(revised.variant_id);
// {
//   sentence_reuse_ratio: 0.96
// }
// 96% cost savings!
Example 3: Policy Document Timeline
// Upload multiple versions
const v1 = await raptor.process(policy_2022);
const v2 = await raptor.process(policy_2023, {
  parentDocumentId: v1.document_id
});
const v3 = await raptor.process(policy_2024, {
  parentDocumentId: v2.document_id
});

// Compare v3 to v1 (original)
const comparison = await raptor.compareDocuments(v1.document_id, v3.document_id);
console.log('Changes over 2 years:', comparison.changes);
For deep analysis, enable full metadata:
const chunks = await raptor.getDocumentChunks(docId, {
  includeFullMetadata: true
});

chunks.forEach(chunk => {
  console.log(chunk.dedup_metadata);
  // {
  //   matched_sentences: [
  //     { sentence: "...", source_chunk_id: "...", similarity: 0.98 },
  //     ...
  //   ],
  //   new_sentences: ["...", "..."],
  //   changed_sentences: [
  //     { old: "...", new: "...", similarity: 0.85 }
  //   ]
  // }
});
Full metadata can be large (5-50KB per chunk). Only enable when needed for debugging or detailed analysis.
Processing Time
Deduplication adds 2-5 seconds per document depending on:
Document size
Number of parent versions
Chunk count
Storage Impact
Metadata adds ~1KB per chunk:
100 chunks = ~100KB metadata
1000 chunks = ~1MB metadata
Accuracy vs Speed
Tune fuzzy matching threshold (internal parameter):
High accuracy (default): 90%+ similarity required for high_reuse
Fast processing : 80%+ similarity threshold
Limitations
When Deduplication Doesn’t Help
First version : No parent to compare against
const v1 = await raptor.process(first_doc);
// All chunks marked as "new"
Complete rewrites : No content overlap
const v2 = await raptor.process(completely_different_doc, {
  parentDocumentId: v1.document_id
});
// All chunks marked as "new"
Different chunking strategies : Chunk boundaries change
const v1 = await raptor.process(doc, { chunkSize: 512 });
const v2 = await raptor.process(doc, {
  chunkSize: 1024,  // Different chunking
  parentDocumentId: v1.document_id
});
// Lower reuse ratio due to boundary misalignment
Best Practices
Deduplication requires parent-child relationships:
// Enable auto-linking globally
await raptor.updateAutoLinkSettings({
  autoLinkEnabled: true,
  autoLinkThreshold: 0.85
});
Keep chunking parameters consistent across versions:
const config = {
  chunkSize: 512,
  chunkOverlap: 50,
  strategy: 'semantic'
};

const v1 = await raptor.process(doc_v1, config);
const v2 = await raptor.process(doc_v2, { ...config, parentDocumentId: v1.document_id });
Track savings over time:
const summary = await raptor.getDedupSummary(variantId);

if (summary.sentence_reuse_ratio < 0.5) {
  console.warn('Low reuse ratio - check if documents are related');
} else {
  console.log(`${(summary.sentence_reuse_ratio * 100).toFixed(0)}% cost savings`);
}
Balance accuracy vs performance
Next Steps