
Duplicates API

Raptor automatically detects duplicate uploads (files with the same content hash) and processes them at zero cost. This API helps you identify and manage duplicate documents.

Get Document Duplicates

GET /api/documents/{document_id}/duplicates

Returns all documents that share the specified document's content hash, including the canonical document itself.

Path Parameters

document_id
string
required
Document ID

Response

document_id
string
Requested document ID
duplicates
array
Array of duplicate document info objects
count
integer
Number of documents in the group (including the canonical)

Duplicate Document Info

id
string
Document ID
filename
string
Document filename
is_canonical
boolean
Whether this is the canonical (original) document
created_at
string
ISO 8601 timestamp
latest_version_id
string
Latest version ID

Example response:
{
  "document_id": "doc-001",
  "duplicates": [
    {
      "id": "doc-001",
      "filename": "contract.pdf",
      "is_canonical": true,
      "created_at": "2024-01-15T10:00:00Z",
      "latest_version_id": "version-001"
    },
    {
      "id": "doc-002",
      "filename": "contract_copy.pdf",
      "is_canonical": false,
      "created_at": "2024-01-15T11:30:00Z",
      "latest_version_id": "version-002"
    },
    {
      "id": "doc-003",
      "filename": "contract_backup.pdf",
      "is_canonical": false,
      "created_at": "2024-01-15T14:20:00Z",
      "latest_version_id": "version-003"
    }
  ],
  "count": 3
}

Example request:
curl https://api.raptordata.dev/api/documents/doc-001/duplicates \
  -H "Authorization: Bearer rd_live_xxx"

List All Duplicates

GET /api/documents/duplicates

List all duplicate groups for the authenticated user.

Response

duplicate_groups
array
Array of duplicate group objects
total_groups
integer
Total number of duplicate groups
total_duplicates
integer
Total number of duplicate documents (excluding canonicals)

Duplicate Group Object

content_hash
string
SHA-256 content hash for this group
canonical_document_id
string
ID of the canonical (original) document
canonical_filename
string
Filename of canonical document
duplicates
array
Array of duplicate documents (excluding canonical)
total_count
integer
Total duplicates in this group

Example response:
{
  "duplicate_groups": [
    {
      "content_hash": "sha256-abc123def456...",
      "canonical_document_id": "doc-001",
      "canonical_filename": "contract.pdf",
      "duplicates": [
        {
          "id": "doc-002",
          "filename": "contract_copy.pdf",
          "is_canonical": false,
          "created_at": "2024-01-15T11:30:00Z",
          "latest_version_id": "version-002"
        },
        {
          "id": "doc-003",
          "filename": "contract_backup.pdf",
          "is_canonical": false,
          "created_at": "2024-01-15T14:20:00Z",
          "latest_version_id": "version-003"
        }
      ],
      "total_count": 2
    },
    {
      "content_hash": "sha256-ghi789jkl012...",
      "canonical_document_id": "doc-010",
      "canonical_filename": "report.pdf",
      "duplicates": [
        {
          "id": "doc-011",
          "filename": "report_duplicate.pdf",
          "is_canonical": false,
          "created_at": "2024-01-20T09:00:00Z",
          "latest_version_id": "version-011"
        }
      ],
      "total_count": 1
    }
  ],
  "total_groups": 2,
  "total_duplicates": 3
}

Example request:
curl https://api.raptordata.dev/api/documents/duplicates \
  -H "Authorization: Bearer rd_live_xxx"

How Duplicate Detection Works

Raptor detects duplicates using content hash comparison:
  1. Upload: When you upload a file, Raptor calculates its SHA-256 hash
  2. Detection: If a file with the same hash exists, it’s marked as a duplicate
  3. Zero Cost: Duplicate documents don’t consume processing credits
  4. Variant Linking: Duplicate documents reference the canonical document’s chunks
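
To make step 1 concrete, here is a minimal client-side sketch using the Web Crypto API (available in browsers and Node 18+). The sha256- prefix mirrors the content_hash values shown in this page's examples; Raptor's exact server-side encoding is an assumption here:
async function calculateFileHash(file: File): Promise<string> {
  // Read the file into memory and hash it with SHA-256
  const buffer = await file.arrayBuffer();
  const digest = await crypto.subtle.digest('SHA-256', buffer);

  // Hex-encode the digest and add the "sha256-" prefix seen in the examples
  const hex = Array.from(new Uint8Array(digest))
    .map(b => b.toString(16).padStart(2, '0'))
    .join('');
  return `sha256-${hex}`;
}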

Duplicate Upload Response

When you upload a duplicate file:
{
  "variant_id": "variant-002",
  "document_id": "doc-002",
  "is_duplicate": true,
  "canonical_document_id": "doc-001",
  "processing_skipped": true,
  "cost_saved": 15,
  "status": "completed",
  "chunks_count": 47
}
is_duplicate
boolean
Set to true for duplicate uploads
canonical_document_id
string
ID of the original document
processing_skipped
boolean
Processing was skipped (chunks reused)
cost_saved
integer
Pages saved (not charged)
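
To act on these fields in code, a minimal sketch (assuming a configured raptor client, as in the examples below):
const result = await raptor.process('contract.pdf');

if (result.is_duplicate) {
  // Chunks were reused from the canonical document; no pages were charged
  console.log(`Duplicate of ${result.canonical_document_id}`);
  console.log(`Saved ${result.cost_saved} pages of processing`);
} else {
  console.log(`Processed into ${result.chunks_count} chunks`);
}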

Clean Up Duplicates

Delete duplicate documents while keeping the canonical:
async function cleanupDuplicates() {
  const allDuplicates = await raptor.listAllDuplicates();

  console.log(`Found ${allDuplicates.total_groups} duplicate groups`);

  for (const group of allDuplicates.duplicate_groups) {
    console.log(`\nProcessing group: ${group.canonical_filename}`);
    console.log(`Keeping canonical: ${group.canonical_document_id}`);

    // Delete duplicates
    for (const dup of group.duplicates) {
      console.log(`Deleting duplicate: ${dup.filename}`);
      await raptor.deleteDocument(dup.id);
    }
  }

  console.log('\nCleanup complete!');
}

await cleanupDuplicates();

Find Specific Duplicates

Find duplicates of a specific document:
async function findDuplicatesOf(documentId: string) {
  const result = await raptor.getDuplicates(documentId);

  if (result.count === 1) {
    console.log('No duplicates found');
    return;
  }

  console.log(`Found ${result.count - 1} duplicates:`);

  const canonical = result.duplicates.find(d => d.is_canonical);
  const duplicates = result.duplicates.filter(d => !d.is_canonical);

  // Guard once so we can use canonical without non-null assertions below
  if (!canonical) {
    console.log('No canonical document in result');
    return;
  }

  console.log(`\nOriginal: ${canonical.filename}`);
  console.log(`Uploaded: ${canonical.created_at}`);

  console.log(`\nDuplicates:`);
  duplicates.forEach((dup, index) => {
    const timeDiff = new Date(dup.created_at).getTime() -
                     new Date(canonical.created_at).getTime();
    const hoursDiff = Math.floor(timeDiff / (1000 * 60 * 60));

    console.log(`${index + 1}. ${dup.filename}`);
    console.log(`   Uploaded ${hoursDiff} hours after original`);
  });
}

await findDuplicatesOf('doc-001');

Audit Storage Usage

Calculate storage saved by duplicate detection:
async function auditDuplicateSavings() {
  const allDuplicates = await raptor.listAllDuplicates();

  let totalDuplicates = 0;
  let totalFileSize = 0;

  for (const group of allDuplicates.duplicate_groups) {
    const canonical = await raptor.getDocument(group.canonical_document_id);

    // Duplicates share the canonical's content, so each one would have
    // stored another copy of the canonical's file size
    totalDuplicates += group.duplicates.length;
    totalFileSize += canonical.file_size_bytes * group.duplicates.length;
  }

  const totalSizeMB = (totalFileSize / (1024 * 1024)).toFixed(2);

  console.log('Duplicate Detection Savings:');
  console.log(`Total duplicates: ${totalDuplicates}`);
  console.log(`Storage saved: ${totalSizeMB} MB`);
  // The per-document credit figure here is a rough estimate
  console.log(`Processing cost saved: ~${totalDuplicates * 10} credits`);
}

await auditDuplicateSavings();

Prevent Duplicates

Check for duplicates before uploading. The server already detects duplicates at zero processing cost, so this client-side check mainly saves upload bandwidth on large files:
async function uploadWithDuplicateCheck(file: File) {
  // Compute a client-side hash (see the sketch under "How Duplicate Detection
  // Works"); its encoding must match the server's content_hash format
  const hash = await calculateFileHash(file);

  // Check the hash against existing documents (only covers the first 1,000 here)
  const allDocs = await raptor.listDocuments({ limit: 1000 });

  for (const doc of allDocs) {
    if (doc.content_hash === hash) {
      console.log(`Duplicate found: ${doc.filename}`);
      console.log('Upload skipped');
      return { isDuplicate: true, existingDocId: doc.id };
    }
  }

  // Upload new document
  const result = await raptor.process(file);
  return { isDuplicate: false, result };
}

Duplicate vs Version

Understanding the difference:
Feature         Duplicate                    Version
Content         Identical hash               Different hash
Relationship    No lineage link              Parent-child link
Processing      Zero cost (reuses chunks)    Full processing
Use Case        Same file uploaded twice     Updated file

Example: Duplicate

// Upload same file twice
const upload1 = await raptor.process('contract.pdf');
const upload2 = await raptor.process('contract.pdf');

console.log(upload2.is_duplicate); // true
console.log(upload2.cost_saved); // 15 pages

Example: Version

// Upload original
const v1 = await raptor.process('contract_v1.pdf');

// Upload modified version
const v2 = await raptor.process('contract_v2.pdf', {
  parentDocumentId: v1.documentId
});

console.log(v2.is_duplicate); // false
console.log(v2.deduplication_available); // true (chunk-level dedup)

Error Responses

404
Not Found
Document not found or doesn't belong to the authenticated user
{
  "detail": "Document not found",
  "status_code": 404
}
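
A sketch of handling this error; the exact error shape thrown by the SDK is an assumption here:
try {
  await raptor.getDuplicates('doc-missing');
} catch (err: any) {
  // Assumed: the SDK exposes the HTTP status code on the thrown error
  if (err.status_code === 404) {
    console.log('Document not found or not owned by this user');
  } else {
    throw err;
  }
}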

Best Practices

Clean up regularly: Run duplicate cleanup monthly to reclaim storage and keep your document library organized.
Canonicals first: The first uploaded document becomes the canonical. All subsequent uploads with the same hash are marked as duplicates.
Deleting canonicals: If you delete the canonical document, Raptor automatically promotes the oldest duplicate to be the new canonical.
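
A short walkthrough of that promotion behavior, reusing the document IDs from the examples above:
// doc-001 is canonical; doc-002 and doc-003 are duplicates of it
await raptor.deleteDocument('doc-001');

// The oldest remaining duplicate (doc-002) should now be the canonical
const after = await raptor.getDuplicates('doc-002');
const newCanonical = after.duplicates.find(d => d.is_canonical);
console.log(newCanonical?.id); // expected: "doc-002"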