POST /api/documents

Documents API

The Documents API handles document upload, processing, version control, and retrieval.

Upload Document

Upload a document for processing with version control and auto-linking.

Request

file
file
required
Document file to process (PDF, DOCX, TXT, etc.)
store_content
boolean
default:"true"
Whether to store document content after processing. If false, content is auto-deleted while metadata is preserved.
chunk_size
integer
default:"512"
Target chunk size in tokens (128-2048)
chunk_overlap
integer
default:"50"
Overlap between chunks in tokens (0-512)
strategy
string
default:"semantic"
Chunking strategy: semantic or recursive
extract_section_numbers
boolean
default:"false"
Extract section numbers from legal documents
process_images
boolean
default:"true"
Create chunks for images and figures
calculate_quality_scores
boolean
default:"true"
Calculate quality scores for chunks
min_chunk_quality
number
default:"0.0"
Minimum quality threshold (0.0-1.0)
enable_smart_context
boolean
default:"true"
Generate AI context for orphan tables
table_extraction
boolean
default:"true"
Extract tables from documents
table_context_generation
boolean
default:"false"
Generate context for tables using AI
parent_document_id
string
Parent document ID for manual version linking
version_label
string
Human-readable version label (e.g., “v2.0”, “draft”)
Enable/disable auto-linking. null uses account setting
Confidence threshold for auto-linking (0.0-1.0)
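The documented ranges can be checked on the client before a file is uploaded. The following is a minimal Python sketch, not part of the API: `build_upload_fields` is a hypothetical helper, and serialising booleans as lowercase `true`/`false` is an assumption based on how the defaults are written above.

```python
from typing import Any

# Ranges and allowed values copied from the parameter list above.
RANGES = {
    "chunk_size": (128, 2048),
    "chunk_overlap": (0, 512),
    "min_chunk_quality": (0.0, 1.0),
}
STRATEGIES = ("semantic", "recursive")


def build_upload_fields(**options: Any) -> dict[str, str]:
    """Validate documented options and return them as form-field strings."""
    fields: dict[str, str] = {}
    for name, value in options.items():
        if name in RANGES:
            low, high = RANGES[name]
            is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
            if not is_number or not low <= value <= high:
                raise ValueError(f"{name} must be between {low} and {high}")
        if name == "strategy" and value not in STRATEGIES:
            raise ValueError(f"strategy must be one of {', '.join(STRATEGIES)}")
        if isinstance(value, bool):
            # Assumption: the form expects lowercase booleans.
            fields[name] = "true" if value else "false"
        else:
            fields[name] = str(value)
    return fields


fields = build_upload_fields(chunk_size=512, strategy="semantic", version_label="v2.0")
```

Each entry in `fields` corresponds to one `-F "name=value"` flag in a curl upload; the `file` part is sent separately.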

Response

variant_id
string
Processing variant ID
document_id
string
Document ID (unique across versions)
version_id
string
Version ID
version_number
integer
Version number in lineage (starts at 1)
is_new_document
boolean
Whether this is a new document
is_new_version
boolean
Whether this is a new version of an existing document
is_new_variant
boolean
Whether this is a new processing variant
is_duplicate
boolean
Whether the document is a duplicate (exact content hash match)
canonical_document_id
string
ID of canonical document if duplicate
processing_skipped
boolean
Whether processing was skipped (duplicate or existing variant)
status
string
Processing status: pending, processing, completed, or failed
task_id
string
Celery task ID for async processing
chunks_count
integer
Number of chunks (if completed)
estimated_pages
integer
Estimated page count for billing
cost_saved
integer
Pages saved by duplicate detection
deduplication_available
boolean
Whether chunk deduplication is available
auto_linked
boolean
Whether auto-linking detected a parent document and linked to it
auto_link_confidence
number
Auto-link confidence score (0.0-1.0)
auto_link_explanation
array
Human-readable reasons for the auto-link decision
auto_link_method
string
Detection method: metadata, metadata_and_content, or none
{
  "variant_id": "550e8400-e29b-41d4-a716-446655440000",
  "document_id": "6ba7b810-9dad-11d1-80b4-00c04fd430c8",
  "version_id": "6ba7b811-9dad-11d1-80b4-00c04fd430c8",
  "version_number": 2,
  "is_new_document": false,
  "is_new_version": true,
  "is_new_variant": true,
  "is_duplicate": false,
  "existing_match": false,
  "status": "pending",
  "task_id": "abc-123-def-456",
  "estimated_pages": 15,
  "deduplication_available": true,
  "auto_linked": true,
  "auto_link_confidence": 0.92,
  "auto_link_explanation": [
    "High filename similarity: 0.95",
    "Upload time proximity: 2 hours apart",
    "Content similarity: 94% overlap"
  ],
  "auto_link_method": "metadata_and_content"
}
curl -X POST https://api.raptordata.dev/api/documents \
  -H "Authorization: Bearer rd_live_xxx" \
  -F "file=@contract.pdf" \
  -F "chunk_size=512" \
  -F "strategy=semantic" \
  -F "version_label=v2.0"
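A client usually has to decide what to do next based on the flags in the upload response. The sketch below is one possible reading of those fields, not documented behaviour: `next_action` and its return labels are hypothetical, and treating `processing_skipped` as "nothing to wait for" is an assumption.

```python
def next_action(upload: dict) -> str:
    """Map an upload response to 'reuse', 'poll', 'done', or 'failed'."""
    # Duplicates and existing variants are not reprocessed.
    if upload.get("is_duplicate") or upload.get("processing_skipped"):
        return "reuse"
    status = upload.get("status")
    if status in ("pending", "processing"):
        return "poll"  # processing is asynchronous; check status later
    if status == "completed":
        return "done"
    return "failed"


# Subset of the example response shown above.
example = {
    "is_new_version": True,
    "is_duplicate": False,
    "status": "pending",
    "auto_linked": True,
    "auto_link_confidence": 0.92,
}
action = next_action(example)
```

For the example response, `action` is `"poll"`, since the document is a new version whose processing has not started yet.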

Get Document

Retrieve document information and processing status.

Path Parameters

document_id
string
required
Document ID

Query Parameters

version
integer
Specific version number to retrieve (default: latest)
variant_id
string
Specific variant ID to retrieve

Response

id
string
Variant ID (for compatibility)
filename
string
Document filename
mime_type
string
MIME type (e.g., application/pdf)
file_size_bytes
integer
File size in bytes
status
string
Processing status: pending, processing, completed, or failed
chunks_count
integer
Number of processed chunks
total_tokens
integer
Total token count across all chunks
created_at
string
ISO 8601 timestamp
store_content
boolean
Whether original content is stored
deleted_at
string
Deletion timestamp (null if not deleted)
extractor_version
string
Version of extraction engine used
{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "filename": "contract.pdf",
  "mime_type": "application/pdf",
  "file_size_bytes": 1048576,
  "status": "completed",
  "chunks_count": 47,
  "total_tokens": 23450,
  "created_at": "2024-01-15T10:30:00Z",
  "store_content": true,
  "deleted_at": null,
  "extractor_version": "1.0.0"
}
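Because `status` moves from `pending` through `processing` to `completed` or `failed`, a client can poll this endpoint until a terminal state is reached. The sketch below assumes a caller-supplied `fetch` function that performs the Get Document request and returns the parsed JSON; the helper name, interval, and attempt limit are illustrative choices, not API recommendations.

```python
import time
from typing import Callable

TERMINAL_STATES = {"completed", "failed"}


def wait_for_document(
    fetch: Callable[[], dict],
    interval: float = 2.0,
    max_attempts: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch() until the document reaches a terminal status."""
    for attempt in range(max_attempts):
        doc = fetch()
        if doc["status"] in TERMINAL_STATES:
            return doc
        if attempt < max_attempts - 1:
            sleep(interval)  # wait before asking again
    raise TimeoutError(f"document still not finished after {max_attempts} checks")
```

Passing `sleep` as a parameter keeps the helper easy to exercise without real delays.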

List Documents

List all documents for the authenticated user.

Query Parameters

limit
integer
default:"20"
Maximum number of results (1-100)
offset
integer
default:"0"
Pagination offset

Response

Returns an array of document objects.
[
  {
    "id": "550e8400-e29b-41d4-a716-446655440000",
    "filename": "contract.pdf",
    "mime_type": "application/pdf",
    "file_size_bytes": 1048576,
    "status": "completed",
    "chunks_count": 47,
    "total_tokens": 23450,
    "created_at": "2024-01-15T10:30:00Z",
    "store_content": true,
    "deleted_at": null
  }
]
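The `limit` and `offset` parameters can be combined to walk the full list page by page. In this sketch, `fetch_page` is a hypothetical callable that performs the List Documents request for a given `limit` and `offset` and returns the parsed array. Since the response shown here carries no total count, the loop assumes that a page shorter than `limit` marks the end.

```python
from typing import Callable, Iterator


def iter_documents(
    fetch_page: Callable[[int, int], list],
    limit: int = 20,
) -> Iterator[dict]:
    """Yield every document by advancing the offset one page at a time."""
    if not 1 <= limit <= 100:
        raise ValueError("limit must be between 1 and 100")
    offset = 0
    while True:
        page = fetch_page(limit, offset)
        yield from page
        if len(page) < limit:
            return  # assumption: a short page is the last page
        offset += limit
```

When the total is an exact multiple of `limit`, this makes one extra request that returns an empty page before stopping.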

Get Document Chunks

Retrieve processed chunks for a document.

Path Parameters

document_id
string
required
Document ID

Query Parameters

include_full_metadata
boolean
default:"false"
Include full deduplication metadata (can be large)

Response

Returns an array of chunk objects.
[
  {
    "id": "chunk-001",
    "text": "This is the content of the first chunk...",
    "page_number": 1,
    "page_range": [1],
    "section_hierarchy": ["Introduction", "Overview"],
    "tokens": 512,
    "chunk_index": 0,
    "metadata": {},
    "chunk_type": "text",
    "contains_table": false,
    "quality_score": 0.87,
    "is_reused": false
  }
]
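Chunk objects carry enough metadata to filter them on the client, for example before indexing. The helpers below are a hypothetical sketch using only fields from the example above; treating a missing `quality_score` as 0.0 is an assumption.

```python
def select_chunks(chunks: list[dict], min_quality: float = 0.5,
                  skip_reused: bool = False) -> list[dict]:
    """Keep chunks at or above min_quality, ordered by chunk_index."""
    kept = []
    for chunk in chunks:
        score = chunk.get("quality_score") or 0.0  # assumption: missing means 0.0
        if score < min_quality:
            continue
        if skip_reused and chunk.get("is_reused"):
            continue
        kept.append(chunk)
    return sorted(kept, key=lambda c: c["chunk_index"])


def total_tokens(chunks: list[dict]) -> int:
    """Sum the token counts of the given chunks."""
    return sum(c["tokens"] for c in chunks)


sample = [
    {"id": "chunk-002", "tokens": 300, "chunk_index": 1, "quality_score": 0.42, "is_reused": False},
    {"id": "chunk-001", "tokens": 512, "chunk_index": 0, "quality_score": 0.87, "is_reused": False},
]
good = select_chunks(sample, min_quality=0.5)
```

Here `good` contains only `chunk-001`, and `total_tokens(good)` is 512.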

Delete Document

Delete a document, cascading the deletion to all of its versions and variants.

Path Parameters

document_id
string
required
Document ID to delete

Response

id
string
Deleted document ID
type
string
Resource type (document)
status
string
Deletion status (deleted)
cascaded
object
Cascade deletion information
deleted_at
string
Deletion timestamp
{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "type": "document",
  "status": "deleted",
  "cascaded": {
    "versions": 3,
    "variants": 5,
    "chunks": 147
  },
  "deleted_at": "2024-01-15T15:30:00Z"
}

Compare Documents

Compare two documents and return a detailed diff.

Query Parameters

doc1
string
required
First document ID
doc2
string
required
Second document ID

Response

summary
string
Human-readable summary of changes
similarityScore
number
Overall similarity score (0.0-1.0)
diff
object
Detailed diff information
{
  "summary": "Document updated with client revisions",
  "similarityScore": 0.87,
  "diff": {
    "addedCount": 12,
    "removedCount": 5,
    "modifiedCount": 8,
    "addedChunks": [...],
    "removedChunks": [...],
    "modifiedChunks": [...]
  }
}
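The two required query parameters and the count fields in `diff` are enough for a short change report. This sketch only builds the query string and formats a parsed response; both function names are hypothetical, and the endpoint path is deliberately left out because it is not stated in this section.

```python
from urllib.parse import urlencode


def compare_query(doc1: str, doc2: str) -> str:
    """Encode the two required document IDs as a query string."""
    return urlencode({"doc1": doc1, "doc2": doc2})


def summarize_diff(result: dict) -> str:
    """Format similarity and change counts from a compare response."""
    diff = result["diff"]
    added, removed, modified = diff["addedCount"], diff["removedCount"], diff["modifiedCount"]
    total = added + removed + modified
    return (f"{result['similarityScore']:.0%} similar, {total} chunks changed "
            f"(+{added} / -{removed} / ~{modified})")


# Counts taken from the example response above.
report = summarize_diff({
    "similarityScore": 0.87,
    "diff": {"addedCount": 12, "removedCount": 5, "modifiedCount": 8},
})
# report == "87% similar, 25 chunks changed (+12 / -5 / ~8)"
```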

Error Responses

400
Bad Request
Invalid parameters (e.g., chunk_size out of range)
401
Unauthorized
Invalid or missing API key
404
Not Found
Document not found
413
Payload Too Large
File size exceeds maximum (100MB)
429
Too Many Requests
Rate limit exceeded
500
Internal Server Error
Server error
{
  "detail": "chunk_size must be between 128 and 2048",
  "status_code": 400
}
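Client code can turn these responses into exceptions and decide which ones are worth retrying. The sketch below is illustrative only: `ApiError`, `parse_error`, and `backoff_delay` are hypothetical names, and retrying on 429 and 500 with exponential backoff is an assumed policy, not one this API documents.

```python
import json

# Assumption: only rate limits and server errors are worth retrying.
RETRYABLE = {429, 500}


class ApiError(Exception):
    def __init__(self, status_code: int, detail: str):
        super().__init__(f"{status_code}: {detail}")
        self.status_code = status_code
        self.detail = detail

    @property
    def retryable(self) -> bool:
        return self.status_code in RETRYABLE


def parse_error(status_code: int, body: str) -> ApiError:
    """Build an ApiError, using the 'detail' field when the body is JSON."""
    detail = body
    try:
        payload = json.loads(body)
        if isinstance(payload, dict) and "detail" in payload:
            detail = str(payload["detail"])
    except json.JSONDecodeError:
        pass  # non-JSON body: keep the raw text
    return ApiError(status_code, detail)


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Exponential backoff in seconds: base * 2**attempt, capped at cap."""
    return min(cap, base * 2 ** attempt)


err = parse_error(400, '{"detail": "chunk_size must be between 128 and 2048", "status_code": 400}')
```

For the example body above, `err.retryable` is `False`, so the request should be corrected rather than resent.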