Document Processing

The process() method is the core of the Raptor SDK. It handles document upload, chunking, and optional version control in a single API call.

Basic Usage

import Raptor from '@raptor-data/ts-sdk';

const raptor = new Raptor({ apiKey: process.env.RAPTOR_API_KEY });

// Process a document
const result = await raptor.process('document.pdf');

console.log(`Document ID: ${result.documentId}`);
console.log(`Chunks: ${result.chunks.length}`);
console.log(`Version: ${result.versionNumber}`);

Method Signature

raptor.process(source, options?)

Parameters

source (string | File | Blob, required): File path (Node.js) or File/Blob object (browser)
options (ProcessOptions, optional): Processing configuration options

Processing Options

Chunking Options

chunkSize (number, default: 512): Target chunk size in tokens (128-2048)
chunkOverlap (number, default: 50): Overlap between chunks in tokens (0-512)
strategy ('semantic' | 'recursive', default: 'semantic'): Chunking strategy:
  • semantic: Content-aware chunking (recommended)
  • recursive: Fixed-size chunking
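
For example, a minimal sketch using the recursive strategy for fixed-size chunks (filename and values are illustrative, and the raptor client is assumed to be initialized as in Basic Usage):

// Fixed-size chunking with the recursive strategy
const result = await raptor.process('document.pdf', {
  strategy: 'recursive',
  chunkSize: 256,
  chunkOverlap: 0
});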

Advanced Processing Options

processImages (boolean, default: false): Create chunks for images and figures with captions
extractSectionNumbers (boolean, default: true): Extract section numbers from legal documents
calculateQualityScores (boolean, default: true): Calculate quality scores for each chunk
minChunkQuality (number, default: 0.0): Minimum quality threshold (0.0-1.0). Chunks below this score are filtered out
enableSmartContext (boolean, default: true): Generate AI context for orphan tables and figures
tableExtraction (boolean, default: true): Extract tables from documents
tableContextGeneration (boolean, default: false): Generate context for extracted tables using AI
storeContent (boolean, default: true): Store original document content. Set to false to auto-delete after processing (metadata is preserved)
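
For instance, a minimal sketch that skips storing the original content after processing (the filename is illustrative):

// Process without retaining the original file; chunks and metadata are preserved
const result = await raptor.process('confidential.pdf', {
  storeContent: false
});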

Polling Options

wait (boolean, default: true): Wait for processing to complete before returning
pollInterval (number, default: 1000): Polling interval in milliseconds
maxPollAttempts (number): Override global max polling attempts
pollTimeout (number): Override global polling timeout in milliseconds
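
For example, a sketch that tunes polling for a long-running document (the interval and timeout values are illustrative):

// Poll every 2 seconds and give up after 10 minutes
const result = await raptor.process('large-report.pdf', {
  wait: true,
  pollInterval: 2000,
  pollTimeout: 600000
});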

Version Control Options

parentDocumentId (string): Parent document ID for manual version linking
versionLabel (string): Human-readable version label (e.g., "v2.0", "Final Draft")
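
For example, a sketch that manually links a new upload to an existing document (the parent ID shown is a placeholder):

// Upload a revised contract as a new version of an existing document
const result = await raptor.process('contract-v2.pdf', {
  parentDocumentId: 'doc_abc123',   // placeholder ID
  versionLabel: 'v2.0'
});

console.log(`Version: ${result.versionNumber}, new version: ${result.isNewVersion}`);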

Auto-Linking Options

• Enable or disable auto-linking; null uses the account setting
• Confidence threshold for auto-linking (0.0-1.0); null uses the account default

Response

interface ProcessResult {
  documentId: string;
  chunks: string[];
  metadata: ChunkMetadata[];

  // Version control
  versionId: string;
  versionNumber: number;
  isNewDocument: boolean;
  isNewVersion: boolean;
  isNewVariant: boolean;

  // Auto-linking
  autoLinked?: boolean;
  autoLinkConfidence?: number;
  autoLinkExplanation?: string[];
  autoLinkMethod?: 'metadata' | 'metadata_and_content' | 'none';
  parentDocumentId?: string;

  // Deduplication
  deduplicationAvailable: boolean;
  isDuplicate?: boolean;
  canonicalDocumentId?: string;
  processingSkipped?: boolean;
  costSaved?: number;
}

Response Fields

documentId (string): Unique document identifier
chunks (string[]): Array of chunk text content
metadata (ChunkMetadata[]): Array of chunk metadata objects
versionNumber (number): Version number in document lineage (starts at 1)
autoLinked (boolean): Whether auto-linking detected a parent document
autoLinkConfidence (number): Confidence score for auto-linking (0.0-1.0)
deduplicationAvailable (boolean): Whether chunk-level deduplication is available for this document
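
For example, a short sketch that inspects the auto-linking and deduplication fields on the result (all fields come from the ProcessResult interface above):

const result = await raptor.process('document.pdf');

if (result.autoLinked) {
  console.log(`Auto-linked to ${result.parentDocumentId} (confidence ${result.autoLinkConfidence})`);
}

if (result.isDuplicate) {
  console.log(`Duplicate of ${result.canonicalDocumentId}; cost saved: ${result.costSaved}`);
}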

Examples

Basic Processing

const result = await raptor.process('contract.pdf');

console.log(`Processed ${result.chunks.length} chunks`);
result.chunks.forEach((chunk, i) => {
  console.log(`Chunk ${i}: ${chunk.substring(0, 100)}...`);
});

Custom Chunking

const result = await raptor.process('document.pdf', {
  chunkSize: 1024,        // Larger chunks
  chunkOverlap: 100,      // More overlap
  strategy: 'semantic'    // Content-aware
});

Advanced Features

const result = await raptor.process('legal-contract.pdf', {
  extractSectionNumbers: true,     // Extract section numbers
  calculateQualityScores: true,    // Calculate quality
  minChunkQuality: 0.5,            // Filter low-quality chunks
  processImages: true,             // Include images
  tableExtraction: true,           // Extract tables
  tableContextGeneration: true     // Generate table context
});

// Check chunk quality
result.metadata.forEach(chunk => {
  if (chunk.qualityScore) {
    console.log(`Chunk ${chunk.chunkIndex}: Quality ${chunk.qualityScore.toFixed(2)}`);
  }
});

Browser Upload

// In a browser with file input
async function handleFileUpload(file: File) {
  const result = await raptor.process(file, {
    wait: true,
    chunkSize: 512
  });

  console.log(`Uploaded ${file.name}`);
  console.log(`Got ${result.chunks.length} chunks`);
}

Next.js API Route

// app/api/upload/route.ts
import Raptor from '@raptor-data/ts-sdk';

export async function POST(request: Request) {
  const formData = await request.formData();
  const file = formData.get('file') as File;

  const raptor = new Raptor({ apiKey: process.env.RAPTOR_API_KEY });

  const result = await raptor.process(file, {
    wait: true,
    versionLabel: formData.get('versionLabel') as string
  });

  return Response.json({
    documentId: result.documentId,
    chunks: result.chunks.length,
    versionNumber: result.versionNumber
  });
}

Async Processing

For large documents, you can process asynchronously and poll for status:
// Start processing (don't wait)
const result = await raptor.process('large-document.pdf', {
  wait: false
});

console.log(`Processing started: ${result.documentId}`);

// Poll for completion
while (true) {
  const variant = await raptor.getVariant(result.variantId);

  if (variant.status === 'completed') {
    console.log(`Completed! ${variant.chunksCount} chunks`);
    break;
  } else if (variant.status === 'failed') {
    console.error(`Failed: ${variant.error}`);
    break;
  }

  console.log(`Status: ${variant.status}`);
  await new Promise(resolve => setTimeout(resolve, 2000));
}

Streaming Progress

for await (const progress of raptor.processStream('document.pdf')) {
  console.log(`${progress.stage}: ${progress.percent}%`);
}

// Output:
// upload: 0%
// upload: 100%
// processing: 25%
// processing: 50%
// processing: 75%
// complete: 100%

Chunk Metadata

Each chunk includes rich metadata:
interface ChunkMetadata {
  id: string;
  pageNumber: number | null;
  pageRange: number[];
  sectionHierarchy: string[];
  tokens: number;
  chunkIndex: number;
  metadata: Record<string, any>;

  // Advanced fields
  chunkType: string | null;
  containsTable: boolean;
  tableMetadata: Record<string, any> | null;
  chunkingStrategy: string | null;
  sectionNumber: string | null;
  qualityScore: number | null;
  syntheticContext: string | null;
  parentChunkId: string | null;
  boundingBox: Record<string, any> | null;

  // Deduplication metadata
  dedupStrategy?: string | null;
  dedupConfidence?: number | null;
  isReused?: boolean;
  totalSentences?: number | null;
  reusedSentencesCount?: number | null;
  newSentencesCount?: number | null;
  contentReuseRatio?: number | null;
  embeddingRecommendation?: string | null;
}

Using Metadata

const result = await raptor.process('document.pdf');

result.metadata.forEach(chunk => {
  console.log(`Chunk ${chunk.chunkIndex}:`);
  console.log(`  Page: ${chunk.pageNumber}`);
  console.log(`  Tokens: ${chunk.tokens}`);
  console.log(`  Section: ${chunk.sectionHierarchy.join(' > ')}`);

  if (chunk.containsTable) {
    console.log('  Contains table');
  }

  if (chunk.qualityScore) {
    console.log(`  Quality: ${chunk.qualityScore.toFixed(2)}`);
  }
});

Best Practices

• Choose the right chunk size: 512 tokens is optimal for most RAG applications. Use larger chunks (1024+) for semantic search and smaller chunks (256) for precise retrieval.
• Set storeContent: false for privacy: If you don't need to re-download the original file, set storeContent: false to automatically delete it after processing. Metadata and chunks are preserved.
• Processing is idempotent: Uploading the same file multiple times with the same config reuses the existing variant at zero cost (see the sketch below).
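
A minimal sketch of that idempotent behaviour, assuming the reused variant resolves to the same document and surfaces the deduplication fields documented on ProcessResult:

// Processing the same file twice with the same config
const first = await raptor.process('report.pdf', { chunkSize: 512 });
const second = await raptor.process('report.pdf', { chunkSize: 512 });

// Assumption: a reused variant points at the same document and may
// report processingSkipped/costSaved (see ProcessResult above)
console.log(second.documentId === first.documentId);   // expected: true
if (second.processingSkipped) {
  console.log(`Reused existing variant; cost saved: ${second.costSaved}`);
}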