Variants API
Processing variants represent different chunking configurations of the same document version. This API lets you retrieve variant chunks and deduplication information.
Get Variant Chunks
Retrieve chunks from a specific processing variant with pagination.
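Path Parameters
The variant ID, supplied as the path segment after /variants/ (variant-001 in the example request).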
Query Parameters
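limit: Maximum number of chunks to return (1-1000)
offset: Number of chunks to skip before the first returned chunk
include_full_metadata: Include the full deduplication metadata JSONB (can be large)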
Response
chunks: Array of chunk objects
total: Total number of chunks available
limit: Limit used for this request
offset: Offset used for this request
Chunk Object
page_number: Page number (1-indexed, null if not applicable)
page_range: Array of page numbers this chunk spans
section_hierarchy: Nested section names (e.g., ["Chapter 1", "Section 1.1"])
tokens: Token count for this chunk
chunk_index: Zero-indexed position in the document
chunk_type: Type of chunk (text, table, image, or code)
contains_table: Whether the chunk contains a table
table_metadata: Table extraction metadata
chunking_strategy: Chunking strategy used (semantic or recursive)
section_number: Extracted section number (e.g., "1.2.3")
synthetic_context: AI-generated context for orphan tables
parent_chunk_id: Parent chunk ID if hierarchical
bounding_box: Bounding box coordinates for images
Deduplication Fields
dedup_strategy: Deduplication strategy (exact, high_reuse, partial_reuse, mixed_content, fuzzy, or new)
dedup_confidence: Similarity score (0.0-1.0)
is_reused: Whether the chunk was reused exactly from the parent version
dedup_source_chunk_id: Source chunk ID if deduplicated
reused_sentences_count: Number of sentences reused from the parent
content_reuse_ratio: Fraction of sentences reused (0.0-1.0)
embedding_recommendation: Recommendation (reuse, consider_reuse, or regenerate)
recommendation_confidence: Confidence level (high, medium, or low)
dedup_metadata: Full deduplication JSONB (only if include_full_metadata=true)
{
  "chunks": [
    {
      "id": "chunk-001",
      "text": "This Agreement is entered into as of January 1, 2024...",
      "page_number": 1,
      "page_range": [1],
      "section_hierarchy": ["Contract Terms", "Effective Date"],
      "tokens": 512,
      "chunk_index": 0,
      "metadata": {},
      "chunk_type": "text",
      "contains_table": false,
      "table_metadata": null,
      "chunking_strategy": "semantic",
      "section_number": "1.1",
      "quality_score": 0.92,
      "synthetic_context": null,
      "parent_chunk_id": null,
      "bounding_box": null,
      "dedup_strategy": "high_reuse",
      "dedup_confidence": 0.95,
      "is_reused": false,
      "dedup_source_chunk_id": "chunk-parent-001",
      "total_sentences": 8,
      "reused_sentences_count": 7,
      "new_sentences_count": 1,
      "content_reuse_ratio": 0.875,
      "embedding_recommendation": "consider_reuse",
      "recommendation_confidence": "high",
      "dedup_metadata": null
    }
  ],
  "total": 47,
  "limit": 100,
  "offset": 0
}
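The deduplication fields in the response above can drive a re-embedding decision. The following is a minimal sketch of one way to do that, not a prescribed workflow: the ChunkDedupInfo and EmbeddingPlan types and the planEmbeddings helper are hypothetical names, and the policy of accepting consider_reuse only at high confidence is an assumption chosen for illustration.

```typescript
// Hypothetical subset of the chunk object, limited to documented fields.
type EmbeddingRecommendation = "reuse" | "consider_reuse" | "regenerate";
type RecommendationConfidence = "high" | "medium" | "low";

interface ChunkDedupInfo {
  id: string;
  dedup_source_chunk_id: string | null;
  embedding_recommendation: EmbeddingRecommendation;
  recommendation_confidence: RecommendationConfidence;
}

interface EmbeddingPlan {
  reuse: string[];      // chunk IDs whose source embedding could be copied
  regenerate: string[]; // chunk IDs that need a fresh embedding
}

// Assumed policy (not defined by the API): "reuse" always reuses,
// "consider_reuse" reuses only at high confidence, anything else regenerates.
// A chunk without a source chunk ID has nothing to copy, so it regenerates.
function planEmbeddings(chunks: ChunkDedupInfo[]): EmbeddingPlan {
  const plan: EmbeddingPlan = { reuse: [], regenerate: [] };
  for (const chunk of chunks) {
    const canReuse =
      chunk.dedup_source_chunk_id !== null &&
      (chunk.embedding_recommendation === "reuse" ||
        (chunk.embedding_recommendation === "consider_reuse" &&
          chunk.recommendation_confidence === "high"));
    (canReuse ? plan.reuse : plan.regenerate).push(chunk.id);
  }
  return plan;
}
```

Under this policy the example chunk above (consider_reuse, high confidence, with a source chunk ID) would land in the reuse list. A stricter pipeline might send every consider_reuse chunk to regenerate instead.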
curl "https://api.raptordata.dev/api/documents/variants/variant-001/chunks?limit=50&offset=0" \
  -H "Authorization: Bearer rd_live_xxx"
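When calling the endpoint without an SDK, it can help to build the URL in one place and validate the pagination values before sending the request. The sketch below assumes the query parameter names limit, offset, and include_full_metadata; ChunkQuery, isWholeNumber, and buildChunksUrl are hypothetical helpers, and rejecting negative offsets is an assumption rather than documented behavior.

```typescript
// Hypothetical options object; camelCase mirrors the SDK examples.
interface ChunkQuery {
  limit?: number;
  offset?: number;
  includeFullMetadata?: boolean;
}

const VARIANTS_BASE = "https://api.raptordata.dev/api/documents/variants";

function isWholeNumber(n: number): boolean {
  return isFinite(n) && Math.floor(n) === n;
}

// Builds the chunks URL, adding only the parameters that were supplied.
function buildChunksUrl(variantId: string, query: ChunkQuery = {}): string {
  const params: string[] = [];
  if (query.limit !== undefined) {
    // The documented range for limit is 1-1000.
    if (!isWholeNumber(query.limit) || query.limit < 1 || query.limit > 1000) {
      throw new RangeError("limit must be an integer from 1 to 1000");
    }
    params.push(`limit=${query.limit}`);
  }
  if (query.offset !== undefined) {
    // Assumption: offsets are non-negative integers.
    if (!isWholeNumber(query.offset) || query.offset < 0) {
      throw new RangeError("offset must be a non-negative integer");
    }
    params.push(`offset=${query.offset}`);
  }
  if (query.includeFullMetadata) {
    params.push("include_full_metadata=true");
  }
  const path = `${VARIANTS_BASE}/${encodeURIComponent(variantId)}/chunks`;
  return params.length > 0 ? `${path}?${params.join("&")}` : path;
}
```

The resulting string can be passed to any HTTP client together with the Authorization header shown in the curl request.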
Get Deduplication Summary
Get aggregate deduplication statistics for a variant.
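Path Parameters
The variant ID, supplied as the path segment after /variants/ (variant-001 in the example request).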
Response
chunk_breakdown: Breakdown by deduplication strategy
total_sentences: Total sentences across all chunks
reused_sentences: Number of sentences reused from the parent
sentence_reuse_ratio: Ratio of sentences reused (0.0-1.0)
embedding_recommendations: Breakdown of embedding recommendations
parent_version_id: Parent version ID (null if no parent)
has_parent: Whether the variant has a parent version
{
  "variant_id": "variant-001",
  "total_chunks": 47,
  "chunk_breakdown": {
    "exact": 15,
    "high_reuse": 18,
    "partial_reuse": 8,
    "new": 6
  },
  "total_sentences": 376,
  "reused_sentences": 298,
  "new_sentences": 78,
  "sentence_reuse_ratio": 0.79,
  "embedding_recommendations": {
    "reuse": 15,
    "consider_reuse": 18,
    "regenerate": 14
  },
  "parent_version_id": "version-000",
  "has_parent": true
}
curl https://api.raptordata.dev/api/documents/variants/variant-001/dedup-summary \
-H "Authorization: Bearer rd_live_xxx"
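In the example response the numbers are internally consistent: reused plus new sentences equal the total (298 + 78 = 376), the ratio is 298 / 376 ≈ 0.79, and the recommendation counts add up to total_chunks (15 + 18 + 14 = 47). The sketch below checks those relationships on a summary object. It assumes they hold in general, which the example suggests but this page does not state; DedupSummary and checkSummary are hypothetical names.

```typescript
// Subset of the summary response. The recommendation keys are optional
// on the assumption that a zero count may be omitted.
interface DedupSummary {
  total_chunks: number;
  total_sentences: number;
  reused_sentences: number;
  new_sentences: number;
  sentence_reuse_ratio: number;
  embedding_recommendations: {
    reuse?: number;
    consider_reuse?: number;
    regenerate?: number;
  };
}

// Returns a list of inconsistencies; an empty list means the totals agree.
function checkSummary(summary: DedupSummary, tolerance: number = 0.01): string[] {
  const problems: string[] = [];

  if (summary.reused_sentences + summary.new_sentences !== summary.total_sentences) {
    problems.push("reused_sentences + new_sentences != total_sentences");
  }

  // The reported ratio is rounded, so compare within a tolerance.
  const expectedRatio =
    summary.total_sentences === 0 ? 0 : summary.reused_sentences / summary.total_sentences;
  if (Math.abs(expectedRatio - summary.sentence_reuse_ratio) > tolerance) {
    problems.push("sentence_reuse_ratio does not match reused / total");
  }

  const rec = summary.embedding_recommendations;
  const recommended = (rec.reuse ?? 0) + (rec.consider_reuse ?? 0) + (rec.regenerate ?? 0);
  if (recommended !== summary.total_chunks) {
    problems.push("embedding_recommendations do not sum to total_chunks");
  }

  return problems;
}
```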
Pagination
Retrieve all chunks with pagination:
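// Assumes `raptor` is an already-initialized SDK client.
async function getAllChunks(variantId: string) {
  const allChunks = [];
  let offset = 0;
  const limit = 100;

  while (true) {
    // Fetch one page; `total` is the chunk count for the whole variant.
    const { chunks, total } = await raptor.getChunks(variantId, {
      limit,
      offset
    });
    allChunks.push(...chunks);

    // Advance by the number of chunks actually returned, so a short
    // page cannot cause chunks to be skipped.
    offset += chunks.length;

    // Stop at the end, or on an empty page to avoid looping forever.
    if (chunks.length === 0 || offset >= total) {
      break;
    }
  }

  return allChunks;
}

const chunks = await getAllChunks('variant-001');
console.log(`Retrieved ${chunks.length} chunks`);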
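Because every response reports total, the offsets of the remaining pages can be computed once the first page has arrived, which allows those pages to be requested concurrently instead of one at a time. A minimal sketch of that calculation follows; pageOffsets is a hypothetical helper, and it assumes total does not change between requests.

```typescript
// Returns the offset of every page needed to cover `total` chunks
// when each request asks for `limit` chunks.
function pageOffsets(total: number, limit: number): number[] {
  // The documented range for limit is 1-1000.
  if (Math.floor(limit) !== limit || limit < 1 || limit > 1000) {
    throw new RangeError("limit must be an integer from 1 to 1000");
  }
  const offsets: number[] = [];
  for (let offset = 0; offset < total; offset += limit) {
    offsets.push(offset);
  }
  return offsets;
}

// With the example variant (47 chunks) and a page size of 20,
// three requests are needed, starting at offsets 0, 20, and 40.
const examplePageOffsets = pageOffsets(47, 20);
```

After the first call has returned total, the offsets from the second entry onward could each be passed to raptor.getChunks and awaited together with Promise.all. Whether concurrent requests are appropriate depends on rate limits, which this page does not cover.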
Filter by Quality
Get only high-quality chunks:
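// A single request returns at most 1000 chunks; paginate for larger variants.
const { chunks } = await raptor.getChunks('variant-001', {
  limit: 1000,
  includeFullMetadata: false
});

// Keep chunks scoring 0.8 or higher; chunks without a score are excluded.
const highQuality = chunks.filter(chunk =>
  chunk.quality_score != null && chunk.quality_score >= 0.8
);

console.log(`${highQuality.length} high-quality chunks`);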
Analyze Reuse
Analyze content reuse patterns:
const summary = await raptor.getDedupSummary('variant-001');

console.log('Chunk-level reuse:');
Object.entries(summary.chunk_breakdown).forEach(([strategy, count]) => {
  const percent = (count / summary.total_chunks * 100).toFixed(1);
  console.log(`  ${strategy}: ${count} chunks (${percent}%)`);
});

console.log('\nSentence-level reuse:');
console.log(`  Reused: ${summary.reused_sentences} sentences`);
console.log(`  New: ${summary.new_sentences} sentences`);
console.log(`  Ratio: ${(summary.sentence_reuse_ratio * 100).toFixed(1)}%`);

console.log('\nEmbedding recommendations:');
Object.entries(summary.embedding_recommendations).forEach(([rec, count]) => {
  console.log(`  ${rec}: ${count} chunks`);
});
Working with Tables
Extract table chunks:
const { chunks } = await raptor.getChunks('variant-001', {
  limit: 1000
});

const tableChunks = chunks.filter(chunk => chunk.contains_table);

tableChunks.forEach(chunk => {
  // page_number is null when a page does not apply to the chunk.
  console.log(`Table on page ${chunk.page_number ?? 'unknown'}:`);

  if (chunk.table_metadata) {
    console.log(`  Rows: ${chunk.table_metadata.rows}`);
    console.log(`  Columns: ${chunk.table_metadata.columns}`);
  }

  // synthetic_context is only set for orphan tables.
  if (chunk.synthetic_context) {
    console.log(`  Context: ${chunk.synthetic_context}`);
  }

  console.log(`  Text: ${chunk.text.substring(0, 200)}...`);
});
Error Responses
404: Variant not found or does not belong to the user
Invalid pagination parameters (e.g., limit > 1000)
{
  "detail": "Variant not found",
  "status_code": 404
}
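Client code can check that an error body has the shape shown above before relying on its fields. The sketch below is one way to do that, assuming error bodies always carry a string detail and a numeric status_code as in the example; ApiError, isApiError, and describeError are hypothetical names.

```typescript
// Shape of the error body shown in the example above.
interface ApiError {
  detail: string;
  status_code: number;
}

// Type guard: true only when both fields are present with the expected types.
function isApiError(value: unknown): value is ApiError {
  if (typeof value !== "object" || value === null) {
    return false;
  }
  const candidate = value as { detail?: unknown; status_code?: unknown };
  return typeof candidate.detail === "string" && typeof candidate.status_code === "number";
}

// Turns a parsed error body into a short message for logs or UI.
function describeError(body: unknown): string {
  if (!isApiError(body)) {
    return "Unexpected error response";
  }
  if (body.status_code === 404) {
    return `Not found: ${body.detail}`;
  }
  return `Request failed (${body.status_code}): ${body.detail}`;
}
```

A 404 covers both a missing variant and one owned by another user, so a client cannot tell the two cases apart from the response alone.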