Document Processing

The process() method is the core of the Raptor SDK. It handles document upload, chunking, and optional version control in a single API call.

Basic Usage

import Raptor from '@raptor-data/ts-sdk';

const raptor = new Raptor({ apiKey: process.env.RAPTOR_API_KEY });

// Process a document
const result = await raptor.process('document.pdf');

console.log(`Document ID: ${result.documentId}`);
console.log(`Chunks: ${result.chunks.length}`);
console.log(`Version: ${result.versionNumber}`);

Method Signature

raptor.process(source, options?)

Parameters

source (string | File | Blob, required): File path (Node.js) or File/Blob object (browser)
options (ProcessOptions, optional): Processing configuration options

Processing Options

Chunking Options

chunkSize (number, default: 512): Target chunk size in tokens (128-2048)
chunkOverlap (number, default: 50): Overlap between chunks in tokens (0-512)
strategy ('semantic' | 'recursive', default: 'semantic'): Chunking strategy:
  • semantic: Content-aware chunking (recommended)
  • recursive: Fixed-size chunking
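
For example, a minimal sketch using the recursive strategy for fixed-size chunks (filename and values are illustrative, and the raptor client is assumed to be initialized as in Basic Usage):

// Fixed-size chunking with the recursive strategy
const result = await raptor.process('document.pdf', {
  strategy: 'recursive',
  chunkSize: 256,
  chunkOverlap: 0
});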

Advanced Processing Options

processImages (boolean, default: false): Create chunks for images and figures with captions
extractSectionNumbers (boolean, default: true): Extract section numbers from legal documents
calculateQualityScores (boolean, default: true): Calculate quality scores for each chunk
minChunkQuality (number, default: 0.0): Minimum quality threshold (0.0-1.0). Chunks below this score are filtered out
enableSmartContext (boolean, default: true): Generate AI context for orphan tables and figures
tableExtraction (boolean, default: true): Extract tables from documents
tableContextGeneration (boolean, default: false): Generate context for extracted tables using AI
storeContent (boolean, default: true): Store original document content. Set to false to auto-delete after processing (metadata is preserved)
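
For instance, a minimal sketch that skips storing the original content after processing (the filename is illustrative):

// Process without retaining the original file; chunks and metadata are preserved
const result = await raptor.process('confidential.pdf', {
  storeContent: false
});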

Polling Options

wait (boolean, default: true): Wait for processing to complete before returning
pollInterval (number, default: 1000): Polling interval in milliseconds
maxPollAttempts (number): Override global max polling attempts
pollTimeout (number): Override global polling timeout in milliseconds
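
For example, a sketch that tunes polling for a long-running document (the interval and timeout values are illustrative):

// Poll every 2 seconds and give up after 10 minutes
const result = await raptor.process('large-report.pdf', {
  wait: true,
  pollInterval: 2000,
  pollTimeout: 600000
});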

Version Control Options

parentDocumentId (string): Parent document ID for manual version linking
versionLabel (string): Human-readable version label (e.g., "v2.0", "Final Draft")
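
For example, a sketch that manually links a new upload to an existing document (the parent ID shown is a placeholder):

// Upload a revised contract as a new version of an existing document
const result = await raptor.process('contract-v2.pdf', {
  parentDocumentId: 'doc_abc123',   // placeholder ID
  versionLabel: 'v2.0'
});

console.log(`Version: ${result.versionNumber}, new version: ${result.isNewVersion}`);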

Auto-Linking Options

• Enable or disable auto-linking; null uses the account setting
• Confidence threshold for auto-linking (0.0-1.0); null uses the account default

Response

interface ProcessResult {
  documentId: string;
  chunks: string[];
  metadata: ChunkMetadata[];

  // Version control
  versionId: string;
  versionNumber: number;
  isNewDocument: boolean;
  isNewVersion: boolean;
  isNewVariant: boolean;

  // Auto-linking
  autoLinked?: boolean;
  autoLinkConfidence?: number;
  autoLinkExplanation?: string[];
  autoLinkMethod?: 'metadata' | 'metadata_and_content' | 'none';
  parentDocumentId?: string;

  // Deduplication
  deduplicationAvailable: boolean;
  isDuplicate?: boolean;
  canonicalDocumentId?: string;
  processingSkipped?: boolean;
  costSaved?: number;
}

Response Fields

documentId (string): Unique document identifier
chunks (string[]): Array of chunk text content
metadata (ChunkMetadata[]): Array of chunk metadata objects
versionNumber (number): Version number in document lineage (starts at 1)
autoLinked (boolean): Whether auto-linking detected a parent document
autoLinkConfidence (number): Confidence score for auto-linking (0.0-1.0)
deduplicationAvailable (boolean): Whether chunk-level deduplication is available for this document
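
For example, a short sketch that inspects the auto-linking and deduplication fields on the result (all fields come from the ProcessResult interface above):

const result = await raptor.process('document.pdf');

if (result.autoLinked) {
  console.log(`Auto-linked to ${result.parentDocumentId} (confidence ${result.autoLinkConfidence})`);
}

if (result.isDuplicate) {
  console.log(`Duplicate of ${result.canonicalDocumentId}; cost saved: ${result.costSaved}`);
}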

Examples

Basic Processing

const result = await raptor.process('contract.pdf');

console.log(`Processed ${result.chunks.length} chunks`);
result.chunks.forEach((chunk, i) => {
  console.log(`Chunk ${i}: ${chunk.substring(0, 100)}...`);
});

Custom Chunking

const result = await raptor.process('document.pdf', {
  chunkSize: 1024,        // Larger chunks
  chunkOverlap: 100,      // More overlap
  strategy: 'semantic'    // Content-aware
});

Advanced Features

const result = await raptor.process('legal-contract.pdf', {
  extractSectionNumbers: true,     // Extract section numbers
  calculateQualityScores: true,    // Calculate quality
  minChunkQuality: 0.5,            // Filter low-quality chunks
  processImages: true,             // Include images
  tableExtraction: true,           // Extract tables
  tableContextGeneration: true     // Generate table context
});

// Check chunk quality
result.metadata.forEach(chunk => {
  if (chunk.qualityScore) {
    console.log(`Chunk ${chunk.chunkIndex}: Quality ${chunk.qualityScore.toFixed(2)}`);
  }
});

Browser Upload

// In a browser with file input
async function handleFileUpload(file: File) {
  const result = await raptor.process(file, {
    wait: true,
    chunkSize: 512
  });

  console.log(`Uploaded ${file.name}`);
  console.log(`Got ${result.chunks.length} chunks`);
}

Next.js API Route

// app/api/upload/route.ts
import Raptor from '@raptor-data/ts-sdk';

export async function POST(request: Request) {
  const formData = await request.formData();
  const file = formData.get('file') as File;

  const raptor = new Raptor({ apiKey: process.env.RAPTOR_API_KEY });

  const result = await raptor.process(file, {
    wait: true,
    versionLabel: formData.get('versionLabel') as string
  });

  return Response.json({
    documentId: result.documentId,
    chunks: result.chunks.length,
    versionNumber: result.versionNumber
  });
}

Async Processing

For large documents, you can process asynchronously and poll for status:
// Start processing (don't wait)
const result = await raptor.process('large-document.pdf', {
  wait: false
});

console.log(`Processing started: ${result.documentId}`);

// Poll for completion
while (true) {
  const variant = await raptor.getVariant(result.variantId);

  if (variant.status === 'completed') {
    console.log(`Completed! ${variant.chunksCount} chunks`);
    break;
  } else if (variant.status === 'failed') {
    console.error(`Failed: ${variant.error}`);
    break;
  }

  console.log(`Status: ${variant.status}`);
  await new Promise(resolve => setTimeout(resolve, 2000));
}

Streaming Progress

for await (const progress of raptor.processStream('document.pdf')) {
  console.log(`${progress.stage}: ${progress.percent}%`);
}

// Output:
// upload: 0%
// upload: 100%
// processing: 25%
// processing: 50%
// processing: 75%
// complete: 100%

Chunk Metadata

Each chunk includes rich metadata:
interface ChunkMetadata {
  id: string;
  pageNumber: number | null;
  pageRange: number[];
  sectionHierarchy: string[];
  tokens: number;
  chunkIndex: number;
  metadata: Record<string, any>;

  // Advanced fields
  chunkType: string | null;
  containsTable: boolean;
  tableMetadata: Record<string, any> | null;
  chunkingStrategy: string | null;
  sectionNumber: string | null;
  qualityScore: number | null;
  syntheticContext: string | null;
  parentChunkId: string | null;
  boundingBox: Record<string, any> | null;

  // Deduplication metadata
  dedupStrategy?: string | null;
  dedupConfidence?: number | null;
  isReused?: boolean;
  totalSentences?: number | null;
  reusedSentencesCount?: number | null;
  newSentencesCount?: number | null;
  contentReuseRatio?: number | null;
  embeddingRecommendation?: string | null;
}

Using Metadata

const result = await raptor.process('document.pdf');

result.metadata.forEach(chunk => {
  console.log(`Chunk ${chunk.chunkIndex}:`);
  console.log(`  Page: ${chunk.pageNumber}`);
  console.log(`  Tokens: ${chunk.tokens}`);
  console.log(`  Section: ${chunk.sectionHierarchy.join(' > ')}`);

  if (chunk.containsTable) {
    console.log('  Contains table');
  }

  if (chunk.qualityScore) {
    console.log(`  Quality: ${chunk.qualityScore.toFixed(2)}`);
  }
});

Best Practices

• Choose the right chunk size: 512 tokens is optimal for most RAG applications. Use larger chunks (1024+) for semantic search and smaller chunks (256) for precise retrieval.
• Set storeContent: false for privacy: If you don't need to re-download the original file, set storeContent: false to automatically delete it after processing. Metadata and chunks are preserved.
• Processing is idempotent: Uploading the same file multiple times with the same config reuses the existing variant at zero cost (see the sketch below).
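
A minimal sketch of that idempotent behaviour, assuming the reused variant resolves to the same document and surfaces the deduplication fields documented on ProcessResult:

// Processing the same file twice with the same config
const first = await raptor.process('report.pdf', { chunkSize: 512 });
const second = await raptor.process('report.pdf', { chunkSize: 512 });

// Assumption: a reused variant points at the same document and may
// report processingSkipped/costSaved (see ProcessResult above)
console.log(second.documentId === first.documentId);   // expected: true
if (second.processingSkipped) {
  console.log(`Reused existing variant; cost saved: ${second.costSaved}`);
}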