Introduction
Imagine uploading a batch of invoices, purchase orders, and shipping documents - and having an AI automatically understand what each document is, extract all the important fields, and create reusable schemas for future processing. That's exactly what I built with the Document Schema Builder.
But here's the unique part: before publishing any schema, users can test it against the original documents to verify that new fields can be extracted correctly. This "Test Document" feature bridges the gap between schema definition and validation, eliminating the guesswork in document processing workflows.
In this post, I'll walk through the architecture, key challenges, and the innovative solutions that make this system work.
The Problem
Traditional document processing systems require manual schema definition upfront. You need to:
- Know all the fields in advance
- Manually define field types and structures
- Set up extraction rules without validation
- Hope everything works when you deploy
This is slow, error-prone, and requires multiple iterations to get right. I wanted to build something smarter.
System Overview
The Document Schema Builder is a full-stack intelligent document processing system with three core capabilities:
- AI-Powered Discovery: Upload documents and let Gemini LLM automatically identify document types
- Interactive Schema Editor: Visual canvas with OCR overlay for precise field mapping
- Test-Before-Publish: Validate schema changes against original documents with real-time feedback
Key Statistics:
- Processes up to 30 pages in a single batch
- Supports 27 pre-configured document templates
- 31 REST API endpoints
- Real-time updates via WebSocket
- ~4,200 lines of backend code
- ~2,200 lines of frontend code
Architecture Deep Dive
High-Level Flow
User Upload (Batch) → Backend Processing → Schema Discovery →
Interactive Editor → Test Document → Publish
The system uses a 3-phase architecture:
Phase 1: Upload & AI Discovery
The Challenge: Process multiple files efficiently while automatically understanding document structure.
The Solution: True batch processing with a single OCR and AI call.
// Discovery pipeline
async discoverFromBatch(files: FileWithBuffer[]): Promise<DiscoveredSchemaDocument[]> {
// 1. Extract images from PDFs using pdftoppm
const pageImages = await extractAllPageImages(files);
// 2. Validate 30-page limit across all files
if (pageImages.length > 30) throw new Error('Batch too large');
// 3. Run OCR on ALL pages together (Google Document AI)
const ocrResults = await documentAI.batchProcessDocuments(pageImages);
// 4. Format OCR response with block structure
const formattedOCR = formatOCRBlocks(ocrResults);
// 5. Call Gemini LLM with discovery prompt
const discoveredSchemas = await gemini.discoverSchemas(formattedOCR);
// 6. Return schemas with instances
return discoveredSchemas;
}
Key Design Decision: Instead of processing each file separately, I batch everything into a single OCR call and a single AI call. This:
- Reduces API costs significantly
- Allows the AI to see all documents together for better type discovery
- Processes faster than sequential operations
Discovery Output Example:
[
{
"documentType": "Invoice",
"instances": [
{
"structure": {
"labels": {
"invoice_number": {
"value": "INV-001",
"pageNumber": 1,
"wordIds": ["113-121", "126-128"]
},
"total_amount": {
"value": "1000",
"pageNumber": 1,
"wordIds": ["133-140"]
}
},
"tables": {
"line_items": {
"headers": {
"description": { /* ... */ },
"amount": { /* ... */ }
},
"rows": { /* ... */ }
}
}
}
}
]
}
]
Notice the wordIds? These are critical for the Test Document feature - more on that later.
Phase 2: Smart Schema Merging
The Challenge: When processing multiple invoices together, they might have different fields. Some might have a "PO Number", others might not.
The Solution: Create a unified schema that's the union of all fields from all instances.
function mergeSchemaStructures(instances: any[]): {
labels: Record<string, any>
tables: Record<string, any>
} {
// Takes ALL fields from ALL instances
// Creates mega-schema with complete structure
const mergedLabels = {};
const mergedTables = {};
for (const instance of instances) {
    // Union operation - add any new fields (labels/tables live under each
    // instance's structure, matching the discovery output above)
    Object.assign(mergedLabels, instance.structure.labels);
    Object.assign(mergedTables, instance.structure.tables);
}
return { labels: mergedLabels, tables: mergedTables };
}
For extracted values, I select the "main instance" (the most complete one) to populate the schema. This ensures users see actual data, not just field definitions.
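To make that concrete, here's a minimal sketch of how main-instance selection could work. The helper name and the scoring heuristic (populated labels plus table rows) are my illustration, not necessarily the production logic:

// Hypothetical helper: pick the most complete instance to populate the schema
function selectMainInstance(instances: any[]): any {
  let best = instances[0];
  let bestScore = -1;
  for (const instance of instances) {
    // Score = populated labels + total table rows
    const labelCount = Object.values(instance.structure.labels)
      .filter((l: any) => l?.value != null && l.value !== '').length;
    const rowCount = Object.values(instance.structure.tables)
      .reduce((sum: number, t: any) => sum + Object.keys(t.rows ?? {}).length, 0);
    const score = labelCount + rowCount;
    if (score > bestScore) {
      bestScore = score;
      best = instance;
    }
  }
  return best;
}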
Phase 3: Interactive Visual Editor
The frontend is built with React + Konva.js for canvas rendering:
Key Components:
- DocumentViewer: Canvas with OCR text overlay and interactive bounding boxes
- LabelsPanel: Right sidebar for editing schema fields and values
- PagesPreview: Left sidebar with page thumbnails for navigation
- SchemaBuilderHeader: Top bar with publish, revert, and test actions
State Management with Zustand (1,752 lines):
interface DocumentReviewStore {
// Document data
documentData: DocumentData | null
ocrData: BatchOcrDataResponse | null
schemaId: string
// Canvas states per page
pageStates: Record<string, PageState>
// Actions
addNewLabelSchema: (params) => void
updateLabelContent: (params) => void
deleteSchemaLabel: (params) => void
// Canvas interactions
startDrawing: (pageId, pos) => void
updateDrawing: (pageId, pos) => void
finishDrawing: (pageId) => void
startResize: (pageId, labelId, corner, pos) => void
// ... more actions
}
Each page maintains its own state for zoom, pan, and interaction modes. This allows smooth navigation between multi-page documents.
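For reference, here's a sketch of what each per-page state might hold; the exact field names are an assumption on my part:

// Illustrative per-page canvas state (field names are assumptions)
interface PageState {
  zoom: number                          // Current zoom factor
  pan: { x: number; y: number }         // Canvas pan offset
  mode: 'idle' | 'drawing' | 'resizing' // Active interaction mode
  activeLabelId: string | null          // Label being drawn or resized
  drawingBox: BBox | null               // In-progress bounding box, if any
}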
Canvas Interactions:
// Drawing new bounding box
onMouseDown → startDrawing()
onMouseMove → updateDrawing()
onMouseUp → finishDrawing() → extractTextFromArea()
// Resizing existing box
onCornerMouseDown → startResize()
onMouseMove → updateResize()
onMouseUp → finishResize() → updateCoordinatesById()
All coordinate changes are immediately persisted to the backend via API calls.
The Innovation: Test Document Feature
This is where things get really interesting. The problem I wanted to solve:
How do you know if a new field you added to a schema can actually be extracted by the AI before you publish it to production?
Traditional approach: Publish and hope for the best. If it fails, rollback and try again.
My approach: Test it first.
How It Works
When a user clicks "Test Document", here's what happens:
async testSchema(
schemaId: string,
projectId: ProjectId,
socket: Socket,
log: FastifyBaseLogger
): Promise<{ success: boolean; message: string }> {
// 1. Retrieve draft schema and original document instance
const draftSchema = await this.getById(schemaId, projectId, 'draft');
const documentInstance = await getDocumentInstance(draftSchema.documentInstanceId);
// 2. Fetch stored OCR data and page images
const ocrData = await getOCRData(documentInstance.id);
const pageImages = await getPageImages(documentInstance.id);
// 3. Build wordId-to-bounding-box map from OCR data
const wordMap = buildWordIdToBboxMapFromFormatted(ocrData);
// 4. Create pageNumber→instancePageId mapping
const pageMapping = createPageMapping(documentInstance);
// 5. Call AI service with updated schema
const aiResults = await gemini.extractWithSchema(draftSchema, pageImages);
// 6. Update extracted labels with new values and coordinates
for (const [labelKey, extraction] of Object.entries(aiResults.labels)) {
// Calculate bounding box from AI-returned wordIds
const bbox = calculateBoundingBoxFromWordIds(
extraction.wordIds,
wordMap,
log
);
// Update database
await updateExtractedLabel(labelKey, {
text: extraction.value,
coordinates: bbox,
instancePageId: pageMapping[extraction.pageNumber]
});
}
// 7. Do the same for tables
for (const [tableName, table] of Object.entries(aiResults.tables)) {
for (const [rowId, row] of Object.entries(table.rows)) {
for (const [cellKey, cell] of Object.entries(row.cells)) {
const bbox = calculateBoundingBoxFromWordIds(cell.wordIds, wordMap, log);
await updateExtractedCell(rowId, cellKey, {
text: cell.value,
coordinates: bbox
});
}
}
}
// 8. Emit SCHEMA_TEST_COMPLETED socket event
await notifySchemaTestCompleted({
socket,
projectId,
schemaId,
success: true,
message: 'Test completed successfully'
});
return { success: true, message: 'Extraction validated' };
}
The Magic: Coordinate Calculation
The AI doesn't return exact pixel coordinates. It returns wordIds - references to words in the OCR output. Here's how I convert those to bounding boxes:
function calculateBoundingBoxFromWordIds(
wordIds: string[] | undefined,
wordMap: Map<string, BBox>,
log: Logger
): BBox | null {
if (!wordIds?.length) return null;
const foundBoxes: BBox[] = [];
for (const wordId of wordIds) {
    // Try direct lookup first
    const direct = wordMap.get(wordId);
    if (direct) {
      foundBoxes.push(direct);
      continue;
    }
    // Handle range IDs like "357-380" (Number.isFinite also rejects NaN
    // from malformed IDs, unlike a simple truthiness check)
    const [start, end] = wordId.split('-').map(Number);
    if (Number.isFinite(start) && Number.isFinite(end)) {
// Find all wordIds that overlap with this range
for (const [mapWordId, bbox] of wordMap.entries()) {
const [mapStart, mapEnd] = mapWordId.split('-').map(Number);
if (rangesOverlap(start, end, mapStart, mapEnd)) {
foundBoxes.push(bbox);
}
}
}
}
if (foundBoxes.length === 0) return null;
// Calculate minimum enclosing rectangle
const xMin = Math.min(...foundBoxes.map(b => b.xMin));
const yMin = Math.min(...foundBoxes.map(b => b.yMin));
const xMax = Math.max(...foundBoxes.map(b => b.xMax));
const yMax = Math.max(...foundBoxes.map(b => b.yMax));
return { xMin, yMin, xMax, yMax };
}
This approach is robust to OCR variations - even if the AI returns slightly different wordId ranges than the original extraction, we can still calculate accurate coordinates.
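For completeness, here's roughly how the wordId-to-bbox map could be built from the formatted OCR blocks. The block and word shapes below are assumptions inferred from the structures shown earlier:

// Assumed shape of formatted OCR data; adjust to the real block structure
interface OCRWord { id: string; bbox: BBox }
interface OCRBlock { words: OCRWord[] }

function buildWordIdToBboxMapFromFormatted(
  ocrData: Record<string, { blocks: OCRBlock[] }>
): Map<string, BBox> {
  const wordMap = new Map<string, BBox>();
  for (const page of Object.values(ocrData)) {
    for (const block of page.blocks) {
      for (const word of block.words) {
        wordMap.set(word.id, word.bbox); // e.g. "113-121" → its bounding box
      }
    }
  }
  return wordMap;
}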
Real-Time Feedback
The test runs asynchronously, but users get immediate feedback via WebSocket:
// Frontend
socket.on(WebsocketClientEvent.SCHEMA_TEST_COMPLETED, (data) => {
if (data.schemaId === currentSchemaId) {
// Show notification
toast({
title: data.success ? 'Test Completed' : 'Test Failed',
description: data.message,
});
// Automatically refresh data
if (data.success) {
refetchSchemaData();
}
}
});
No polling, no page reloads. Just instant updates when the test completes.
Database Design
The system uses a hierarchical schema structure:
document_schema (parent)
├── document_schema_labels (child - 1:N)
├── document_schema_tables (child - 1:N)
│ └── document_schema_table_headers (child - 1:N)
Key Innovation: The fieldType column enables the draft/published workflow:
type FieldVersionType = 'draft' | 'published';
// Same label can exist in both versions
{
id: 'label-123',
name: 'Invoice Number',
key: 'invoice_number',
fieldType: 'draft', // or 'published'
documentSchemaId: 'schema-456'
}
When you publish, the system:
- Deletes all old fieldType='published' fields
- Promotes all fieldType='draft' fields to 'published'
- Returns success
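In code, the current publish path is roughly this (a simplified sketch; labelRepo stands in for the TypeORM repository, and as noted under Future Enhancements it doesn't yet run in a transaction):

// Simplified, non-transactional publish flow
async function publishSchema(schemaId: string): Promise<void> {
  // 1. Delete the currently published fields
  await labelRepo.delete({ documentSchemaId: schemaId, fieldType: 'published' });
  // 2. Promote drafts to published
  await labelRepo.update(
    { documentSchemaId: schemaId, fieldType: 'draft' },
    { fieldType: 'published' }
  );
}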
Extracted Data Schema:
extracted_labels
├── labelKey: string
├── text: string (extracted value)
├── coordinatesId: FK → coordinates
├── instancePageId: FK → instance_page
coordinates
├── xMin, yMin, xMax, yMax: float
The instance_page table maps page numbers to specific document instances, ensuring labels appear on the correct pages even across multiple documents.
Strategic Indices:
- idx_document_schema_name_project: UNIQUE(name, projectId) - prevents duplicate schemas
- idx_extracted_labels_instance: (documentInstanceId, labelKey) - fast label lookups
- idx_coordinates: (id) - coordinate retrieval
API Design
I designed the API with flexibility and efficiency in mind:
Paginated Listing:
GET /v1/document/schemas/paginated?cursor=abc&limit=10&published=true
Response: {
data: DocumentSchema[],
next: 'cursor-xyz',
previous: 'cursor-def'
}
Batch Uploads:
POST /v1/document/schemas/batch-upload
Content-Type: multipart/form-data
{
file-0-invoice.pdf: File,
file-1-po.pdf: File,
projectId: 'proj-123',
length: 2
}
Test Endpoint:
POST /v1/document/schemas/:id/test
Response: {
success: true,
message: 'Test completed successfully'
}
The test endpoint returns immediately while processing happens asynchronously.
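A sketch of how that can work: the handler kicks off the test without awaiting it, and completion is signaled over WebSocket. This is Fastify-style pseudocode; the response text and where projectId and socket come from are elided:

// Fire-and-forget: respond immediately, run the test in the background
app.post('/v1/document/schemas/:id/test', async (request, reply) => {
  const { id } = request.params as { id: string };
  // Deliberately not awaited; SCHEMA_TEST_COMPLETED is emitted when done
  schemaService
    .testSchema(id, projectId, socket, request.log)
    .catch((err) => request.log.error(err, 'Schema test failed'));
  return reply.send({ success: true, message: 'Test started' });
});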
Draft Management:
// Create draft from published
POST /v1/document/schemas/:id/draft
// Modify draft
POST /v1/document/schemas/:id/draft/label
PATCH /v1/document/schemas/:id/draft/label/:labelKey
DELETE /v1/document/schemas/:id/draft/label/:labelKey
// Publish changes
POST /v1/document/schemas/:id/publish
// Discard changes
POST /v1/document/schemas/:id/revert
Performance Optimizations
1. Cursor-Based Pagination
Instead of offset pagination, I use cursor-based:
async listPaginated(params: {
  projectId: ProjectId,
  cursor?: string,
  limit: number,
  orderBy: string,
  order: 'ASC' | 'DESC'
}): Promise<SeekPage<DocumentSchema>> {
  const { projectId, cursor, limit, orderBy, order } = params;
  const query = this.createQueryBuilder('schema')
    .where('schema.projectId = :projectId', { projectId })
    .orderBy(`schema.${orderBy}`, order)
    .limit(limit + 1); // Fetch one extra to determine if there's a next page
  if (cursor) {
    // Seek in the sort direction: '>' for ASC, '<' for DESC
    const comparator = order === 'ASC' ? '>' : '<';
    query.andWhere(`schema.${orderBy} ${comparator} :cursor`, { cursor });
  }
const results = await query.getMany();
const hasMore = results.length > limit;
if (hasMore) results.pop(); // Remove the extra item
return {
data: results,
next: hasMore ? results[results.length - 1][orderBy] : null,
previous: cursor
};
}
This scales to millions of records without performance degradation.
2. Lazy Loading OCR Data
OCR data is large (formatted blocks for every word). I store it separately and fetch only when needed:
// Schema retrieval doesn't include OCR
GET /v1/document/schemas/:id → { schema, labels, tables }
// OCR fetched separately when canvas loads
GET /v1/document/schemas/:id/ocr-data → { [pageId]: { blocks, width, height } }
This keeps the main API fast while allowing the canvas to load OCR on demand.
3. React Query Caching
const { data: ocrData } = useQuery({
queryKey: ['ocr-schema-data', batchId],
queryFn: () => documentTypesApi.fetchOcrData(batchId),
staleTime: 5 * 60 * 1000, // 5 minutes
cacheTime: 10 * 60 * 1000 // 10 minutes
});
OCR data rarely changes, so aggressive caching makes navigation between pages instant.
4. Debounced Coordinate Updates
const debouncedUpdateCoordinates = useMemo(
() => debounce((coordinateId, coordinates) => {
documentTypesApi.updateCoordinatesById(coordinateId, coordinates);
}, 500),
[]
);
When dragging bounding boxes, updates are debounced to avoid flooding the server.
Challenges & Solutions
Challenge 1: 30-Page Limit
Problem: Google Document AI has processing limits, and large batches cause timeouts.
Solution: Enforce a 30-page limit upfront and guide users to split larger batches. This keeps processing under 2 minutes while still being useful.
async validatePageLimit(files: FileWithBuffer[]): Promise<void> {
let totalPages = 0;
for (const file of files) {
if (file.mimetype === 'application/pdf') {
const pageCount = await countPDFPages(file.buffer);
totalPages += pageCount;
} else {
totalPages += 1; // Images are single page
}
}
if (totalPages > 30) {
throw new Error(`Batch has ${totalPages} pages. Maximum is 30.`);
}
}
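The countPDFPages helper isn't shown above; one way to implement it is with pdf-lib, which is my choice for this sketch rather than necessarily what the project uses:

import { PDFDocument } from 'pdf-lib';

// Parse the PDF and return its page count
async function countPDFPages(buffer: Buffer): Promise<number> {
  const pdf = await PDFDocument.load(buffer, { ignoreEncryption: true });
  return pdf.getPageCount();
}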
Challenge 2: Coordinate Accuracy with WordId Ranges
Problem: The AI sometimes returns wordId ranges that don't exactly match the OCR output.
Solution: Implement range overlap detection:
function rangesOverlap(
start1: number, end1: number,
start2: number, end2: number
): boolean {
return start1 <= end2 && start2 <= end1;
}
This allows flexible matching even when wordIds are approximate.
Challenge 3: Page Mapping Across Instances
Problem: When merging multiple documents, page numbers restart for each document. How do you track which page a label belongs to?
Solution: Use an instance_page join table:
// Maps logical page numbers to instance-specific page IDs
instance_page {
id: string
documentInstanceId: string // Which document instance
pageId: string // Which OCR page
}
// Labels reference instance pages, not global pages
extracted_labels {
instancePageId: string // FK to instance_page
}
This allows accurate page tracking even across multi-document batches.
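The createPageMapping helper referenced in testSchema then becomes a simple lookup table; the instance shape here is assumed:

// Map logical page numbers (as the AI reports them) to instance_page IDs
function createPageMapping(
  documentInstance: { pages: Array<{ id: string; pageNumber: number }> }
): Record<number, string> {
  const mapping: Record<number, string> = {};
  for (const page of documentInstance.pages) {
    mapping[page.pageNumber] = page.id;
  }
  return mapping;
}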
Challenge 4: Real-Time Updates Without Polling
Problem: Test Document can take 30+ seconds. Polling is inefficient and adds latency.
Solution: WebSocket notifications:
// Backend emits when test completes
await notifySchemaTestCompleted({
socket,
projectId,
schemaId,
success: true,
message: 'Test completed successfully'
});
// Frontend listens and auto-refreshes
socket.on('SCHEMA_TEST_COMPLETED', (data) => {
if (data.success) refetchSchemaData();
});
Users see results instantly without any manual refresh.
Lessons Learned
1. AI Integration Requires Careful Prompt Engineering
Getting Gemini to return consistent, structured schemas required multiple iterations:
const discoveryPrompt = `
You are analyzing documents to discover their types and structure.
For each document:
1. Identify the document type (e.g., "Invoice", "Purchase Order")
2. Extract all labels (key-value pairs)
3. Extract all tables with headers and rows
4. For each extracted value, provide the wordIds from the OCR data
Return JSON with this structure:
[
{
"documentType": "Invoice",
"instances": [...]
}
]
Rules:
- Use lowercase_with_underscores for keys
- Include page numbers for all extractions
- Provide wordIds for coordinate calculation
- Group similar documents together
`;
The prompt evolved significantly as I discovered edge cases.
2. Coordinate Calculation Is Harder Than It Looks
My first implementation just took the first wordId's bounding box. This failed when:
- The value spanned multiple lines
- The AI returned approximate ranges
- OCR word segmentation varied
The solution was to:
- Build a complete wordId→bbox map from OCR
- Handle range overlaps gracefully
- Calculate the minimum enclosing rectangle of all matched words
3. Draft/Published Workflow Prevents Production Incidents
Before adding this feature, schema changes went live immediately. This caused:
- Accidentally breaking production workflows
- No way to validate changes before deployment
- Difficult rollbacks when things went wrong
The two-phase workflow solved all of these issues.
4. WebSockets Are Essential for Long-Running Operations
Initially, I had the frontend poll for test results every 2 seconds. This:
- Added unnecessary server load
- Introduced 2-second latency in the best case
- Felt sluggish to users
Switching to WebSockets made the experience feel instant and eliminated polling overhead.
5. Database Indexing Matters at Scale
Early on, queries for large projects were slow. Adding proper indices:
CREATE INDEX idx_document_schema_name_project
ON document_schema(name, project_id);
CREATE INDEX idx_extracted_labels_instance
ON extracted_labels(document_instance_id, label_key);
Reduced query times from seconds to milliseconds.
Future Enhancements
1. Transaction Safety
Currently, publishing isn't atomic. If it fails halfway through, you can end up with partially published schemas. The fix:
await dataSource.transaction(async (manager) => {
// Delete old published fields
await manager.delete(DocumentLabelEntity, {
documentSchemaId: schemaId,
fieldType: 'published'
});
// Promote draft to published
await manager.update(DocumentLabelEntity, {
documentSchemaId: schemaId,
fieldType: 'draft'
}, {
fieldType: 'published'
});
});
2. Schema Versioning
Track historical changes:
document_schema_version {
id: string
documentSchemaId: string
version: number
snapshot: jsonb // Full schema at this version
publishedBy: string
publishedAt: timestamp
}
This would enable:
- Rollback to any previous version
- Audit trail of all changes
- Diff view between versions
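Capturing the snapshot on publish could look like this; the entity, the buildSchemaSnapshot helper, and the EntityManager wiring are all assumptions for the sketch:

import { EntityManager } from 'typeorm';

// On each publish, record a full snapshot for rollback and diffing
async function recordSchemaVersion(
  manager: EntityManager,
  schemaId: string,
  userId: string
): Promise<void> {
  const latest = await manager.findOne(DocumentSchemaVersionEntity, {
    where: { documentSchemaId: schemaId },
    order: { version: 'DESC' }
  });
  await manager.insert(DocumentSchemaVersionEntity, {
    documentSchemaId: schemaId,
    version: (latest?.version ?? 0) + 1,
    snapshot: await buildSchemaSnapshot(manager, schemaId), // full labels + tables
    publishedBy: userId,
    publishedAt: new Date()
  });
}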
3. Confidence Scores
The AI can return confidence scores for extractions. Surfacing these would help users identify fields that need manual review:
{
"invoice_number": {
"value": "INV-001",
"confidence": 0.95, // High confidence
"pageNumber": 1,
"wordIds": ["113-121"]
},
"po_number": {
"value": "PO-XYZ",
"confidence": 0.62, // Low confidence - needs review
"pageNumber": 1,
"wordIds": ["145-150"]
}
}
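On the frontend, surfacing these could be as simple as flagging anything below a threshold; the 0.8 cutoff here is arbitrary:

// Flag low-confidence extractions for manual review (threshold is arbitrary)
const REVIEW_THRESHOLD = 0.8;

function fieldsNeedingReview(
  labels: Record<string, { value: string; confidence?: number }>
): string[] {
  return Object.entries(labels)
    .filter(([, label]) => (label.confidence ?? 0) < REVIEW_THRESHOLD)
    .map(([key]) => key);
}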
4. Collaborative Editing
Add locking to prevent concurrent edits:
document_schema_lock {
schemaId: string
lockedBy: string
lockedAt: timestamp
expiresAt: timestamp
}
When Alice is editing a schema, Bob sees a read-only view with a banner: "This schema is currently being edited by Alice."
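Acquiring the lock could work roughly like this (a sketch; lockRepo is a stand-in repository, and the lease-based expiry keeps a crashed client from holding the lock forever):

// Try to take the lock; stale locks (past expiresAt) can be reclaimed
async function acquireLock(schemaId: string, userId: string): Promise<boolean> {
  const now = new Date();
  const existing = await lockRepo.findOne({ where: { schemaId } });
  if (existing && existing.lockedBy !== userId && existing.expiresAt > now) {
    return false; // Someone else holds a live lock
  }
  await lockRepo.save({
    schemaId,
    lockedBy: userId,
    lockedAt: now,
    expiresAt: new Date(now.getTime() + 5 * 60 * 1000) // 5-minute lease
  });
  return true;
}

A production version would use an upsert or a unique constraint on schemaId to close the check-then-set race.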
5. Batch Testing
Allow testing multiple schemas at once:
POST /v1/document/schemas/batch-test
{
schemaIds: ['schema-1', 'schema-2', 'schema-3']
}
Useful when making changes to related schemas together.
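The implementation could fan out to the existing single-schema test; Promise.allSettled keeps one failure from aborting the rest (names reused from earlier sketches):

// Run tests concurrently; each one still emits its own socket event
async function batchTest(schemaIds: string[], projectId: ProjectId) {
  const results = await Promise.allSettled(
    schemaIds.map((id) => schemaService.testSchema(id, projectId, socket, log))
  );
  return results.map((result, i) => ({
    schemaId: schemaIds[i],
    success: result.status === 'fulfilled'
  }));
}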
Technical Metrics
Backend Performance:
- Average upload processing: 45-90 seconds for 30 pages
- Schema retrieval: <100ms
- Test Document: 20-40 seconds depending on page count
- Publish operation: <500ms
Frontend Performance:
- Canvas render: 60 FPS even with 100+ bounding boxes
- OCR overlay: Uses virtualization for large documents
- State updates: Optimistic UI with automatic rollback on errors
Code Quality:
- TypeScript strict mode enabled
- 100% type coverage
- Shared types between frontend and backend (@docxster/shared)
- Comprehensive error handling
Conclusion
Building the Document Schema Builder taught me that the best validation happens before you ship to production. The Test Document feature embodies this principle - it lets users verify their schemas against real documents before committing changes.
Key takeaways:
- AI integration requires thoughtful design: Prompt engineering, structured outputs, and error handling are critical
- Real-time feedback matters: WebSockets eliminate the waiting game and make UX feel instant
- Coordinate-based extraction is powerful: Storing bounding boxes enables visual validation and correction
- Draft/published workflows prevent incidents: Separate experimentation from production
- Batch processing is more efficient: Single API calls for multiple items reduce overhead
The system now processes hundreds of documents daily, automatically creating schemas that would have taken hours to define manually. And with the Test Document feature, users can confidently publish schemas knowing they'll work correctly.
If you're building a document processing system, I hope these patterns and lessons save you from some of the mistakes I made along the way.
Want to learn more about AI-powered document processing or the Docxster platform? Feel free to reach out.