Duplicates API
Raptor automatically detects duplicate uploads (files with the same content hash) and processes them at zero cost. This API helps you identify and manage duplicate documents.
Get Document Duplicates
`GET /api/documents/{document_id}/duplicates`

Get all duplicates of a specific document, i.e. documents with an identical content hash.

Path Parameters

- `document_id` (string, required): ID of the document whose duplicates to fetch
Response

- `document_id`: ID of the queried document
- `duplicates`: Array of duplicate document info objects
- `count`: Number of duplicates (including the canonical)

Duplicate Document Info

- `id`: Document ID
- `filename`: Filename of the uploaded document
- `is_canonical`: Whether this is the canonical (original) document
- `created_at`: Upload timestamp (ISO 8601)
- `latest_version_id`: ID of the document's latest version
```json
{
  "document_id": "doc-001",
  "duplicates": [
    {
      "id": "doc-001",
      "filename": "contract.pdf",
      "is_canonical": true,
      "created_at": "2024-01-15T10:00:00Z",
      "latest_version_id": "version-001"
    },
    {
      "id": "doc-002",
      "filename": "contract_copy.pdf",
      "is_canonical": false,
      "created_at": "2024-01-15T11:30:00Z",
      "latest_version_id": "version-002"
    },
    {
      "id": "doc-003",
      "filename": "contract_backup.pdf",
      "is_canonical": false,
      "created_at": "2024-01-15T14:20:00Z",
      "latest_version_id": "version-003"
    }
  ],
  "count": 3
}
```
```bash
curl https://api.raptordata.dev/api/documents/doc-001/duplicates \
  -H "Authorization: Bearer rd_live_xxx"
```
List All Duplicates
`GET /api/documents/duplicates`

List all duplicate groups for the authenticated user.
Response

- `duplicate_groups`: Array of duplicate group objects
- `total_groups`: Total number of duplicate groups
- `total_duplicates`: Total number of duplicate documents (excluding canonicals)

Duplicate Group Object

- `content_hash`: SHA-256 content hash for this group
- `canonical_document_id`: ID of the canonical (original) document
- `canonical_filename`: Filename of the canonical document
- `duplicates`: Array of duplicate documents (excluding the canonical)
- `total_count`: Total duplicates in this group
```json
{
  "duplicate_groups": [
    {
      "content_hash": "sha256-abc123def456...",
      "canonical_document_id": "doc-001",
      "canonical_filename": "contract.pdf",
      "duplicates": [
        {
          "id": "doc-002",
          "filename": "contract_copy.pdf",
          "is_canonical": false,
          "created_at": "2024-01-15T11:30:00Z",
          "latest_version_id": "version-002"
        },
        {
          "id": "doc-003",
          "filename": "contract_backup.pdf",
          "is_canonical": false,
          "created_at": "2024-01-15T14:20:00Z",
          "latest_version_id": "version-003"
        }
      ],
      "total_count": 2
    },
    {
      "content_hash": "sha256-ghi789jkl012...",
      "canonical_document_id": "doc-010",
      "canonical_filename": "report.pdf",
      "duplicates": [
        {
          "id": "doc-011",
          "filename": "report_duplicate.pdf",
          "is_canonical": false,
          "created_at": "2024-01-20T09:00:00Z",
          "latest_version_id": "version-011"
        }
      ],
      "total_count": 1
    }
  ],
  "total_groups": 2,
  "total_duplicates": 3
}
```
```bash
curl https://api.raptordata.dev/api/documents/duplicates \
  -H "Authorization: Bearer rd_live_xxx"
```
How Duplicate Detection Works
Raptor detects duplicates using content hash comparison, as sketched after this list:
- Upload: When you upload a file, Raptor calculates its SHA-256 hash
- Detection: If a file with the same hash exists, it’s marked as a duplicate
- Zero Cost: Duplicate documents don’t consume processing credits
- Variant Linking: Duplicate documents reference the canonical document’s chunks
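A minimal sketch of this flow, using Node's built-in `crypto` module. The in-memory registry is illustrative; Raptor maintains the real hash index server-side:

```typescript
import { createHash } from 'crypto';

// Illustrative in-memory registry: content hash -> canonical document ID.
const hashRegistry = new Map<string, string>();

function detectDuplicate(fileBytes: Buffer, documentId: string) {
  // 1. Upload: compute the SHA-256 hash of the file's content
  const hash = createHash('sha256').update(fileBytes).digest('hex');

  // 2. Detection: an existing entry means this content was seen before
  const canonicalId = hashRegistry.get(hash);
  if (canonicalId) {
    // 3-4. Zero cost: skip processing and reuse the canonical's chunks
    return { isDuplicate: true, canonicalDocumentId: canonicalId };
  }

  // First upload with this hash becomes the canonical document
  hashRegistry.set(hash, documentId);
  return { isDuplicate: false, canonicalDocumentId: documentId };
}
```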
Duplicate Upload Response
When you upload a duplicate file:
```json
{
  "variant_id": "variant-002",
  "document_id": "doc-002",
  "is_duplicate": true,
  "canonical_document_id": "doc-001",
  "processing_skipped": true,
  "cost_saved": 15,
  "status": "completed",
  "chunks_count": 47
}
```
- `is_duplicate`: Set to true for duplicate uploads
- `canonical_document_id`: ID of the original document
- `processing_skipped`: Processing was skipped (chunks reused)
- `cost_saved`: Pages saved (not charged)
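In application code you can branch on `is_duplicate` when handling an upload result (a short sketch, assuming the response fields above are surfaced by the SDK):

```typescript
const result = await raptor.process(file);

if (result.is_duplicate) {
  console.log(`Duplicate of ${result.canonical_document_id}`);
  console.log(`Saved ${result.cost_saved} pages of processing`);
} else {
  console.log(`Processed ${result.chunks_count} chunks`);
}
```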
Cleanup Duplicates
Delete duplicate documents while keeping the canonical:
```typescript
async function cleanupDuplicates() {
  const allDuplicates = await raptor.listAllDuplicates();
  console.log(`Found ${allDuplicates.total_groups} duplicate groups`);

  for (const group of allDuplicates.duplicate_groups) {
    console.log(`\nProcessing group: ${group.canonical_filename}`);
    console.log(`Keeping canonical: ${group.canonical_document_id}`);

    // Delete every duplicate; the canonical document is untouched
    for (const dup of group.duplicates) {
      console.log(`Deleting duplicate: ${dup.filename}`);
      await raptor.deleteDocument(dup.id);
    }
  }

  console.log('\nCleanup complete!');
}

await cleanupDuplicates();
```
Find Specific Duplicates
Find duplicates of a specific document:
```typescript
async function findDuplicatesOf(documentId: string) {
  const result = await raptor.getDuplicates(documentId);

  // count includes the canonical, so 1 means the document stands alone
  if (result.count === 1) {
    console.log('No duplicates found');
    return;
  }

  console.log(`Found ${result.count - 1} duplicates:`);

  const canonical = result.duplicates.find(d => d.is_canonical);
  const duplicates = result.duplicates.filter(d => !d.is_canonical);

  console.log(`\nOriginal: ${canonical?.filename}`);
  console.log(`Uploaded: ${canonical?.created_at}`);
  console.log(`\nDuplicates:`);

  duplicates.forEach((dup, index) => {
    // How long after the original was this copy uploaded?
    const timeDiff = new Date(dup.created_at).getTime() -
      new Date(canonical!.created_at).getTime();
    const hoursDiff = Math.floor(timeDiff / (1000 * 60 * 60));
    console.log(`${index + 1}. ${dup.filename}`);
    console.log(`   Uploaded ${hoursDiff} hours after original`);
  });
}

await findDuplicatesOf('doc-001');
```
Audit Storage Usage
Calculate storage saved by duplicate detection:
```typescript
async function auditDuplicateSavings() {
  const allDuplicates = await raptor.listAllDuplicates();

  let totalDuplicates = 0;
  let totalFileSize = 0;

  for (const group of allDuplicates.duplicate_groups) {
    // Duplicates share the canonical's content, so its size applies to each copy
    const canonical = await raptor.getDocument(group.canonical_document_id);
    totalDuplicates += group.duplicates.length;
    totalFileSize += canonical.file_size_bytes * group.duplicates.length;
  }

  const totalSizeMB = (totalFileSize / (1024 * 1024)).toFixed(2);

  console.log('Duplicate Detection Savings:');
  console.log(`Total duplicates: ${totalDuplicates}`);
  console.log(`Storage saved: ${totalSizeMB} MB`);
  // Rough estimate: adjust the per-document credit figure for your documents
  console.log(`Processing cost saved: ~${totalDuplicates * 10} credits`);
}

await auditDuplicateSavings();
```
Prevent Duplicates
Check for duplicates before uploading:
```typescript
// SHA-256 of the file's contents, hex-encoded
// (Web Crypto; available in modern browsers and Node 18+)
async function calculateFileHash(file: File): Promise<string> {
  const digest = await crypto.subtle.digest('SHA-256', await file.arrayBuffer());
  return Array.from(new Uint8Array(digest))
    .map(b => b.toString(16).padStart(2, '0'))
    .join('');
}

async function uploadWithDuplicateCheck(file: File) {
  const hash = await calculateFileHash(file);

  // Check whether a document with this content hash already exists
  const allDocs = await raptor.listDocuments({ limit: 1000 });
  for (const doc of allDocs) {
    // content_hash values are prefixed with "sha256-" (see the examples above)
    if (doc.content_hash === `sha256-${hash}`) {
      console.log(`Duplicate found: ${doc.filename}`);
      console.log('Upload skipped');
      return { isDuplicate: true, existingDocId: doc.id };
    }
  }

  // No match: upload and process the new document
  const result = await raptor.process(file);
  return { isDuplicate: false, result };
}
```
Duplicate vs Version
Understanding the difference:
| Feature | Duplicate | Version |
|---------|-----------|---------|
| Content | Identical hash | Different hash |
| Relationship | No lineage link | Parent-child link |
| Processing | Zero cost (reuses chunks) | Full processing |
| Use Case | Same file uploaded twice | Updated file |
Example: Duplicate
```typescript
// Upload same file twice
const upload1 = await raptor.process('contract.pdf');
const upload2 = await raptor.process('contract.pdf');

console.log(upload2.is_duplicate); // true
console.log(upload2.cost_saved);   // 15 pages
```
Example: Version
```typescript
// Upload original
const v1 = await raptor.process('contract_v1.pdf');

// Upload modified version
const v2 = await raptor.process('contract_v2.pdf', {
  parentDocumentId: v1.documentId
});

console.log(v2.is_duplicate);            // false
console.log(v2.deduplication_available); // true (chunk-level dedup)
```
Error Responses
404: The document doesn't exist or doesn't belong to the authenticated user.
```json
{
  "detail": "Document not found",
  "status_code": 404
}
```
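A minimal handling sketch with plain `fetch` (the SDK may wrap this in its own error type):

```typescript
const res = await fetch(
  'https://api.raptordata.dev/api/documents/doc-001/duplicates',
  { headers: { Authorization: 'Bearer rd_live_xxx' } }
);

if (res.status === 404) {
  const err = await res.json();
  console.error(err.detail); // "Document not found"
} else {
  const data = await res.json();
  console.log(`${data.count} documents in this duplicate group`);
}
```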
Best Practices
- Clean up regularly: Run duplicate cleanup monthly to save storage and keep your workspace organized.
- Canonicals first: The first uploaded document becomes the canonical; all subsequent uploads with the same hash are marked as duplicates.
- Deleting canonicals: If you delete the canonical document, Raptor automatically promotes the oldest duplicate to be the new canonical (see the sketch below).
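A short sketch of that promotion behavior, reusing the SDK calls and document IDs from the examples above:

```typescript
// Before deletion: doc-001 is canonical; doc-002 and doc-003 are duplicates
await raptor.deleteDocument('doc-001');

// Raptor promotes the oldest remaining duplicate automatically
const group = await raptor.getDuplicates('doc-002');
const canonical = group.duplicates.find(d => d.is_canonical);
console.log(canonical?.id); // "doc-002" (uploaded before doc-003)
```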