Document Processing
The process() method is the core of the Raptor SDK. It handles document upload, chunking, and optional version control in a single API call.
Basic Usage
```typescript
import Raptor from '@raptor-data/ts-sdk';

const raptor = new Raptor({ apiKey: process.env.RAPTOR_API_KEY });

// Process a document
const result = await raptor.process('document.pdf');

console.log(`Document ID: ${result.documentId}`);
console.log(`Chunks: ${result.chunks.length}`);
console.log(`Version: ${result.versionNumber}`);
```
Method Signature
```typescript
raptor.process(source, options?)
```
Parameters
source
string | File | Blob
required
File path (Node.js) or File/Blob object (browser)
options
object
Processing configuration options
Processing Options
Chunking Options
chunkSize
number
Target chunk size in tokens (128-2048)

chunkOverlap
number
Overlap between chunks in tokens (0-512)
strategy
'semantic' | 'recursive'
default:"semantic"
Chunking strategy:
semantic: Content-aware chunking (recommended)
recursive: Fixed-size chunking
Advanced Processing Options
processImages
boolean
Create chunks for images and figures with captions

extractSectionNumbers
boolean
Extract section numbers from legal documents

calculateQualityScores
boolean
Calculate quality scores for each chunk

minChunkQuality
number
Minimum quality threshold (0.0-1.0). Chunks below this score are filtered out

Generate AI context for orphan tables and figures

tableExtraction
boolean
Extract tables from documents

tableContextGeneration
boolean
Generate context for extracted tables using AI

storeContent
boolean
Store original document content. Set to false to auto-delete after processing (metadata is preserved)
Polling Options
wait
boolean
Wait for processing to complete before returning
Polling interval in milliseconds (1 second default)
Override global max polling attempts
Override global polling timeout in milliseconds
Version Control Options
Parent document ID for manual version linking
versionLabel
string
Human-readable version label (e.g., “v2.0”, “Final Draft”)
Auto-Linking Options
autoLink
boolean | null
default:"null"
Enable/disable auto-linking. null uses account setting
Confidence threshold for auto-linking (0.0-1.0). null uses account default
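When auto-linking is enabled, the response reports whether a link was made and how confident the match was (see the `autoLinked`, `autoLinkConfidence`, and `parentDocumentId` response fields). A minimal sketch of an app-side check — `shouldAcceptAutoLink` is a hypothetical helper, not an SDK method, and the 0.8 default is an illustrative choice:

```typescript
// Subset of the ProcessResult fields relevant to auto-linking
interface AutoLinkInfo {
  autoLinked?: boolean;
  autoLinkConfidence?: number;
  parentDocumentId?: string;
}

// Hypothetical helper: trust an auto-link only above a local confidence bar
function shouldAcceptAutoLink(result: AutoLinkInfo, minConfidence = 0.8): boolean {
  if (!result.autoLinked || !result.parentDocumentId) return false;
  return (result.autoLinkConfidence ?? 0) >= minConfidence;
}
```

Links that fall below your bar can then be reviewed manually or re-linked with the manual version-control options above.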
Response
```typescript
interface ProcessResult {
  documentId: string;
  chunks: string[];
  metadata: ChunkMetadata[];

  // Version control
  versionId: string;
  versionNumber: number;
  isNewDocument: boolean;
  isNewVersion: boolean;
  isNewVariant: boolean;

  // Auto-linking
  autoLinked?: boolean;
  autoLinkConfidence?: number;
  autoLinkExplanation?: string[];
  autoLinkMethod?: 'metadata' | 'metadata_and_content' | 'none';
  parentDocumentId?: string;

  // Deduplication
  deduplicationAvailable: boolean;
  isDuplicate?: boolean;
  canonicalDocumentId?: string;
  processingSkipped?: boolean;
  costSaved?: number;
}
```
Response Fields
documentId
string
Unique document identifier

chunks
string[]
Array of chunk text content

metadata
ChunkMetadata[]
Array of chunk metadata objects

versionNumber
number
Version number in document lineage (starts at 1)

autoLinked
boolean
Whether auto-linking detected a parent document

autoLinkConfidence
number
Confidence score for auto-linking (0.0-1.0)

deduplicationAvailable
boolean
Whether chunk-level deduplication is available for this document
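The deduplication fields let you log when a re-upload was detected and skipped. A sketch of a logging helper over a `ProcessResult`-shaped object — `dedupSummary` is an app-side illustration, not part of the SDK:

```typescript
// Subset of the ProcessResult deduplication fields
interface DedupFields {
  isDuplicate?: boolean;
  canonicalDocumentId?: string;
  costSaved?: number;
}

// Hypothetical helper: one-line summary of the deduplication outcome
function dedupSummary(result: DedupFields): string {
  if (!result.isDuplicate) return 'new content processed';
  const canonical = result.canonicalDocumentId ?? 'unknown';
  const saved = result.costSaved != null ? ` (saved $${result.costSaved.toFixed(2)})` : '';
  return `duplicate of ${canonical}${saved}`;
}
```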
Examples
Basic Processing
```typescript
const result = await raptor.process('contract.pdf');

console.log(`Processed ${result.chunks.length} chunks`);

result.chunks.forEach((chunk, i) => {
  console.log(`Chunk ${i}: ${chunk.substring(0, 100)}...`);
});
```
Custom Chunking
```typescript
const result = await raptor.process('document.pdf', {
  chunkSize: 1024,     // Larger chunks
  chunkOverlap: 100,   // More overlap
  strategy: 'semantic' // Content-aware
});
```
Advanced Features
```typescript
const result = await raptor.process('legal-contract.pdf', {
  extractSectionNumbers: true,  // Extract section numbers
  calculateQualityScores: true, // Calculate quality
  minChunkQuality: 0.5,         // Filter low-quality chunks
  processImages: true,          // Include images
  tableExtraction: true,        // Extract tables
  tableContextGeneration: true  // Generate table context
});

// Check chunk quality
result.metadata.forEach(chunk => {
  if (chunk.qualityScore) {
    console.log(`Chunk ${chunk.chunkIndex}: Quality ${chunk.qualityScore.toFixed(2)}`);
  }
});
```
Browser Upload
```typescript
// In a browser with file input
async function handleFileUpload(file: File) {
  const result = await raptor.process(file, {
    wait: true,
    chunkSize: 512
  });

  console.log(`Uploaded ${file.name}`);
  console.log(`Got ${result.chunks.length} chunks`);
}
```
Next.js API Route
```typescript
// app/api/upload/route.ts
import Raptor from '@raptor-data/ts-sdk';

export async function POST(request: Request) {
  const formData = await request.formData();
  const file = formData.get('file') as File;

  const raptor = new Raptor({ apiKey: process.env.RAPTOR_API_KEY });

  const result = await raptor.process(file, {
    wait: true,
    versionLabel: formData.get('versionLabel') as string
  });

  return Response.json({
    documentId: result.documentId,
    chunks: result.chunks.length,
    versionNumber: result.versionNumber
  });
}
```
Async Processing
For large documents, you can process asynchronously and poll for status:
```typescript
// Start processing (don't wait)
const result = await raptor.process('large-document.pdf', {
  wait: false
});

console.log(`Processing started: ${result.documentId}`);

// Poll for completion
while (true) {
  const variant = await raptor.getVariant(result.variantId);

  if (variant.status === 'completed') {
    console.log(`Completed! ${variant.chunksCount} chunks`);
    break;
  } else if (variant.status === 'failed') {
    console.error(`Failed: ${variant.error}`);
    break;
  }

  console.log(`Status: ${variant.status}`);
  await new Promise(resolve => setTimeout(resolve, 2000));
}
```
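The polling loop above can be factored into a reusable helper. This is a sketch, not an SDK API: it takes any status-fetching function (for example, a closure over raptor.getVariant), so the interval and attempt limits stay in one place:

```typescript
// Minimal shape of a variant status, matching the loop above
interface VariantStatus {
  status: string;
  error?: string;
  chunksCount?: number;
}

// Generic poller: calls fetchStatus until it reports 'completed' or 'failed',
// sleeping intervalMs between attempts, up to maxAttempts
async function pollUntilDone(
  fetchStatus: () => Promise<VariantStatus>,
  intervalMs = 2000,
  maxAttempts = 150
): Promise<VariantStatus> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const status = await fetchStatus();
    if (status.status === 'completed' || status.status === 'failed') return status;
    await new Promise(resolve => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Polling timed out after ${maxAttempts} attempts`);
}
```

Usage would then be a single call: `const variant = await pollUntilDone(() => raptor.getVariant(result.variantId));`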
Streaming Progress
```typescript
for await (const progress of raptor.processStream('document.pdf')) {
  console.log(`${progress.stage}: ${progress.percent}%`);
}

// Output:
// upload: 0%
// upload: 100%
// processing: 25%
// processing: 50%
// processing: 75%
// complete: 100%
```
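Because the stream is an async iterable, any consumer that accepts one will work. A sketch of a generic consumer — the `ProgressEvent` shape is assumed from the output above, and `trackProgress` is not an SDK function:

```typescript
// Assumed event shape, based on the stage/percent output above
interface ProgressEvent {
  stage: string;
  percent: number;
}

// Drain a progress stream (e.g. raptor.processStream) and return the last event
async function trackProgress(
  stream: AsyncIterable<ProgressEvent>
): Promise<ProgressEvent | undefined> {
  let last: ProgressEvent | undefined;
  for await (const event of stream) {
    console.log(`${event.stage}: ${event.percent}%`);
    last = event;
  }
  return last;
}
```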
Chunk Metadata
Each chunk includes rich metadata:
```typescript
interface ChunkMetadata {
  id: string;
  pageNumber: number | null;
  pageRange: number[];
  sectionHierarchy: string[];
  tokens: number;
  chunkIndex: number;
  metadata: Record<string, any>;

  // Advanced fields
  chunkType: string | null;
  containsTable: boolean;
  tableMetadata: Record<string, any> | null;
  chunkingStrategy: string | null;
  sectionNumber: string | null;
  qualityScore: number | null;
  syntheticContext: string | null;
  parentChunkId: string | null;
  boundingBox: Record<string, any> | null;

  // Deduplication metadata
  dedupStrategy?: string | null;
  dedupConfidence?: number | null;
  isReused?: boolean;
  totalSentences?: number | null;
  reusedSentencesCount?: number | null;
  newSentencesCount?: number | null;
  contentReuseRatio?: number | null;
  embeddingRecommendation?: string | null;
}
```
```typescript
const result = await raptor.process('document.pdf');

result.metadata.forEach(chunk => {
  console.log(`Chunk ${chunk.chunkIndex}:`);
  console.log(`  Page: ${chunk.pageNumber}`);
  console.log(`  Tokens: ${chunk.tokens}`);
  console.log(`  Section: ${chunk.sectionHierarchy.join(' > ')}`);

  if (chunk.containsTable) {
    console.log('  Contains table');
  }
  if (chunk.qualityScore) {
    console.log(`  Quality: ${chunk.qualityScore.toFixed(2)}`);
  }
});
```
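When deduplication metadata is present, it can drive client-side embedding decisions, such as skipping chunks whose content was reused from an earlier version. A sketch under stated assumptions: `selectChunksForEmbedding` is not an SDK function, and treating `'skip'` as a possible `embeddingRecommendation` value is an assumption, since the SDK's actual recommendation values are not documented here:

```typescript
// Subset of ChunkMetadata relevant to embedding decisions
interface DedupChunkMeta {
  chunkIndex: number;
  isReused?: boolean;
  embeddingRecommendation?: string | null;
}

// Hypothetical helper: keep only chunks that still need embedding —
// drop chunks marked as reused, or with an assumed 'skip' recommendation
function selectChunksForEmbedding(metadata: DedupChunkMeta[]): number[] {
  return metadata
    .filter(m => !m.isReused && m.embeddingRecommendation !== 'skip')
    .map(m => m.chunkIndex);
}
```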
Best Practices
Choose the right chunk size: 512 tokens is optimal for most RAG applications. Use larger chunks (1024+) for semantic search, smaller chunks (256) for precise retrieval.
Set storeContent: false for privacy: If you don’t need to re-download the original file, set storeContent: false to automatically delete it after processing. Metadata and chunks are preserved.
Processing is idempotent: Uploading the same file multiple times with the same config reuses the existing variant (zero cost).