Introduction
Docxster Drive is an enterprise-grade file storage and management system I built as a core component of the Docxster platform. It's designed to handle document-heavy workflows with features like OCR processing, full-text search, version control, and real-time collaboration - all while maintaining the performance and reliability expected from modern cloud applications.
In this post, I'll walk through the architecture decisions, implementation challenges, and lessons learned while building a system that handles file uploads, downloads, search, and collaboration for thousands of concurrent users.
System Overview
At its core, Docxster Drive is a multi-layered system consisting of:
- Piece Layer: A pluggable integration interface using the Docxster Pieces Framework
- API Layer: RESTful APIs built with Fastify and TypeScript
- Storage Layer: AWS S3 for object storage with signed URLs
- Search Layer: Meilisearch for full-text search across files and metadata
- Processing Layer: Background workers for OCR, thumbnail generation, and indexing
- Database Layer: PostgreSQL with TypeORM for relational data
The system supports hierarchical folder structures, soft deletes, bulk operations, and event-driven automation workflows.
Architecture Components
1. The Piece Layer: Modular Integration
The piece layer provides a clean, action-based interface for interacting with the drive:
export const docxsterDrive = createPiece({
displayName: "Docxster Drive",
auth: PieceAuth.None(),
logoUrl: "https://cdn.docxster.ai/pieces/docxster-drive.svg",
actions: [createfolder, uploadfile, deletefile, getfile],
triggers: [fileupload],
})
Key Actions:
- Create Folder: Hierarchical folder creation with metadata support
- Upload File: 3-step upload process with S3 integration
- Get File: Retrieves signed URLs for immediate download
- Delete File: Soft delete with permanent deletion option
Triggers:
- File Upload Webhook: Event-based notifications when files are uploaded
This modular design allows the drive to be easily integrated into automation workflows and third-party applications.
2. The 3-Step Upload Process
One of the most interesting challenges was designing a reliable file upload system that could handle large files while maintaining data integrity. I implemented a 3-step process:
Step 1: Initiate Upload
POST /v1/drive/items
{
operation: "upload_file",
name: "document.pdf",
size: 5242880,
mimeType: "application/pdf",
checksum: "md5-hash-here",
parentId: "folder-id"
}
The server:
- Validates storage quotas and constraints
- Creates a pending DriveItem with PENDING_UPLOAD status
- Generates a 15-minute presigned S3 URL with Content-MD5 validation
- Returns the upload session
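For illustration, here's a minimal sketch of how Step 1 can generate that presigned URL with the AWS SDK v3 (the bucket name, key layout, and helper name are assumptions, not the actual Docxster code):
import { S3Client, PutObjectCommand } from '@aws-sdk/client-s3'
import { getSignedUrl } from '@aws-sdk/s3-request-presigner'

const s3 = new S3Client({ region: process.env.AWS_REGION })

// Sketch: issue a 15-minute presigned PUT URL for a pending DriveItem.
// The checksum must be the base64-encoded MD5 digest so S3 can enforce Content-MD5.
async function createUploadSession(item: { id: string; name: string; mimeType: string; checksum: string }) {
  const storageKey = `drive/${item.id}/${item.name}` // assumed key layout
  const command = new PutObjectCommand({
    Bucket: process.env.DRIVE_BUCKET,
    Key: storageKey,
    ContentType: item.mimeType,
    ContentMD5: item.checksum, // S3 rejects the PUT if the uploaded bytes don't match
  })
  const uploadUrl = await getSignedUrl(s3, command, { expiresIn: 900 }) // 15 minutes
  return { storageKey, uploadUrl }
}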
Step 2: Direct Upload to S3
The client uploads directly to the presigned S3 URL, bypassing the application server entirely. This approach:
- Reduces server load
- Improves upload speed
- Leverages S3's reliability and bandwidth
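A client-side sketch of Step 2, assuming a browser fetch and the checksum returned by Step 1:
// The Content-MD5 header must match the base64 MD5 the upload session was signed with.
async function uploadToS3(uploadUrl: string, file: Blob, checksum: string, mimeType: string) {
  const res = await fetch(uploadUrl, {
    method: 'PUT',
    headers: { 'Content-Type': mimeType, 'Content-MD5': checksum },
    body: file,
  })
  if (!res.ok) throw new Error(`S3 upload failed with status ${res.status}`)
}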
Step 3: Confirm Upload
POST /v1/drive/items/confirm-upload
{
fileId: "item-id",
checksum: "md5-hash-here"
}
The server:
- Verifies the file exists in S3
- Validates the checksum matches
- Updates the item to UPLOAD_CONFIRMED
- Triggers background processing (OCR, thumbnail, indexing)
- Emits file upload events for workflows
This pattern ensures data integrity while maximizing upload performance.
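The existence and checksum checks in Step 3 can be sketched like this (it assumes a single-part, non-KMS upload, where S3's ETag equals the hex MD5 of the object; it is not the exact Docxster implementation):
import { S3Client, HeadObjectCommand } from '@aws-sdk/client-s3'

// Verify the object landed in S3 and its content matches the expected MD5.
async function verifyUpload(s3: S3Client, bucket: string, storageKey: string, expectedMd5Hex: string) {
  const head = await s3.send(new HeadObjectCommand({ Bucket: bucket, Key: storageKey }))
  const etag = head.ETag?.replace(/"/g, '') // ETag of a single-part PUT is the hex MD5
  return etag === expectedMd5Hex
}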
3. Database Design: Hierarchical Storage
The core entity is DriveItem, which represents both files and folders:
{
id: string
resourceType: 'FILE' | 'FOLDER'
name: string
parentId: string | null
path: string | null // ID-based: "root.folder1.folder2"
materializedPath: string | null // Human-readable: "/Documents/Projects"
projectId: string
ownerId: string
mimeType: string | null
size: number
storageKey: string | null // S3 key
checksum: string | null // MD5 for integrity
isDeleted: boolean
isStarred: boolean
metadata: Record<string, unknown> // Flexible JSONB
thumbnailUrl: string | null
}
Key Design Decisions:
Dual Path Storage: I store both ID-based paths (root.folder1.folder2) and human-readable paths (/Documents/Projects). This allows:
- Fast hierarchical queries using the ID-based path
- User-friendly breadcrumb navigation
- Efficient parent-child relationship lookups
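As an example of the fast hierarchical query, a subtree lookup can lean on the ID-based path with a simple prefix match (a sketch using TypeORM's query builder; the entity import path is assumed):
import { Repository } from 'typeorm'
import { DriveItemEntity } from './drive-item.entity' // the entity described in this section

// Fetch every active descendant of a folder via its dot-separated ID path.
async function getSubtree(repo: Repository<DriveItemEntity>, folderPath: string, projectId: string) {
  return repo
    .createQueryBuilder('item')
    .where('item.projectId = :projectId', { projectId })
    .andWhere('item.isDeleted = false')
    .andWhere('item.path LIKE :prefix', { prefix: `${folderPath}.%` })
    .getMany()
}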
Self-Referential Relationships: Folders can contain other folders, creating a tree structure. TypeORM's self-referential relations make this elegant:
@ManyToOne(() => DriveItemEntity, item => item.children)
parent: DriveItemEntity
@OneToMany(() => DriveItemEntity, item => item.parent)
children: DriveItemEntity[]
Strategic Indices: Performance is critical, so I added several indices:
- idx_drive_item_parent_project: Fast parent lookups within a project
- idx_drive_item_project_deleted: Efficient queries for active items
- idx_drive_item_name_parent: UNIQUE constraint preventing duplicate names in the same folder
- idx_drive_item_path: Hierarchical path queries
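In TypeORM these indices can be declared directly on the entity; a condensed sketch (the column choices follow the descriptions above and may differ from the real schema):
import { Entity, PrimaryGeneratedColumn, Column, Index, Unique } from 'typeorm'

@Entity('drive_item')
@Index('idx_drive_item_parent_project', ['parentId', 'projectId'])
@Index('idx_drive_item_project_deleted', ['projectId', 'isDeleted'])
@Index('idx_drive_item_path', ['path'])
@Unique('idx_drive_item_name_parent', ['name', 'parentId', 'projectId']) // no duplicate names in a folder
export class DriveItemEntity {
  @PrimaryGeneratedColumn('uuid')
  id: string

  @Column()
  name: string

  @Column({ type: 'varchar', nullable: true })
  parentId: string | null

  @Column()
  projectId: string

  @Column({ type: 'varchar', nullable: true })
  path: string | null

  @Column({ default: false })
  isDeleted: boolean
}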
4. Full-Text Search with Meilisearch
Search is a critical feature for document management. I integrated Meilisearch for lightning-fast full-text search across:
- File names
- OCR-extracted text from PDFs and images
- Owner information
- Folder paths
Index Configuration:
{
searchableAttributes: ['name', 'content', 'ownerName', 'ownerEmail', 'path'],
filterableAttributes: ['projectId', 'resourceType', 'isDeleted', 'isStarred'],
rankingRules: ['words', 'typo', 'proximity', 'attribute', 'sort', 'exactness']
}
The ranking rules ensure relevant results appear first, with typo tolerance and proximity matching.
Search Query Example:
GET /v1/drive/items/search?q=contract&filter=resourceType=FILE
This searches across all indexed content and returns results in milliseconds, even with thousands of documents.
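With the official JavaScript client, the same query looks roughly like this (the index name drive_items and the added projectId filter are assumptions):
import { MeiliSearch } from 'meilisearch'

const client = new MeiliSearch({ host: process.env.MEILI_HOST!, apiKey: process.env.MEILI_API_KEY })

// Full-text search scoped to a project, excluding deleted items.
async function searchFiles(query: string, projectId: string) {
  return client.index('drive_items').search(query, {
    filter: ['resourceType = FILE', `projectId = "${projectId}"`, 'isDeleted = false'],
    limit: 20,
  })
}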
5. Background Processing Pipeline
Large files and compute-intensive operations are handled asynchronously by background workers:
OCR Worker
Input: { driveItemId, storageKey, mimeType }
Process:
1. Download file from S3
2. Send to Google Document AI for text extraction
3. Store extracted text in metadata
4. Update status to READY
5. Index content in Meilisearch
Supports: PDF, PNG, JPEG, GIF, TIFF, WEBP, BMP
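A condensed sketch of the extraction step with Google Document AI (the processor name, bucket wiring, and error handling are simplified placeholders):
import { DocumentProcessorServiceClient } from '@google-cloud/documentai'
import { S3Client, GetObjectCommand } from '@aws-sdk/client-s3'

const docai = new DocumentProcessorServiceClient()

// Pull the file from S3 and ask Document AI for its text.
async function extractText(s3: S3Client, bucket: string, storageKey: string, mimeType: string) {
  const obj = await s3.send(new GetObjectCommand({ Bucket: bucket, Key: storageKey }))
  const content = Buffer.from(await obj.Body!.transformToByteArray())

  const [result] = await docai.processDocument({
    name: process.env.DOCAI_PROCESSOR_NAME, // projects/{project}/locations/{location}/processors/{id}
    rawDocument: { content: content.toString('base64'), mimeType },
  })
  return result.document?.text ?? ''
}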
Thumbnail Worker
Input: { driveItemId, storageKey, mimeType }
Process:
1. Generate thumbnail from image/video/PDF
2. Upload to S3 under thumbnail/{storageKey}
3. Mark as generated in metadata
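The image branch of this worker could look like the following sketch (using sharp, which is an assumption on my part here; PDFs and videos need different tooling):
import sharp from 'sharp'
import { S3Client, PutObjectCommand } from '@aws-sdk/client-s3'

// Resize the original into a small WebP and store it under the thumbnail/ prefix.
async function generateImageThumbnail(s3: S3Client, bucket: string, storageKey: string, original: Buffer) {
  const thumbnail = await sharp(original).resize(320, 320, { fit: 'inside' }).webp().toBuffer()
  await s3.send(new PutObjectCommand({
    Bucket: bucket,
    Key: `thumbnail/${storageKey}`,
    Body: thumbnail,
    ContentType: 'image/webp',
  }))
}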
Download Worker
Input: { folderId, projectId, jobId }
Process:
1. Recursively fetch all items in folder
2. Create ZIP file maintaining hierarchy
3. Upload ZIP to S3
4. Emit progress via Socket.IO
5. Generate signed download URL
This worker handles bulk downloads of entire folder structures, creating properly organized ZIP files on-demand.
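A sketch of the streaming approach, using archiver piped into a multipart S3 upload (the item listing and Socket.IO progress events are omitted for brevity, and the library choices are assumptions):
import archiver from 'archiver'
import { PassThrough, Readable } from 'stream'
import { Upload } from '@aws-sdk/lib-storage'
import { S3Client, GetObjectCommand } from '@aws-sdk/client-s3'

// Stream each file into a ZIP while the archive itself streams up to S3,
// so the whole folder never has to sit in memory or on disk.
async function zipFolderToS3(
  s3: S3Client,
  bucket: string,
  zipKey: string,
  files: { storageKey: string; relativePath: string }[],
) {
  const archive = archiver('zip', { zlib: { level: 6 } })
  const passthrough = new PassThrough()
  archive.pipe(passthrough)

  const upload = new Upload({
    client: s3,
    params: { Bucket: bucket, Key: zipKey, Body: passthrough, ContentType: 'application/zip' },
  })

  for (const file of files) {
    const obj = await s3.send(new GetObjectCommand({ Bucket: bucket, Key: file.storageKey }))
    archive.append(obj.Body as Readable, { name: file.relativePath }) // keeps the folder hierarchy
  }

  await archive.finalize()
  await upload.done()
}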
6. Event-Driven Architecture
The system emits events for all major operations, enabling automation workflows:
enum DriveEventName {
FILE_UPLOADED = 'drive.file.uploaded',
FILE_DELETED = 'drive.file.deleted',
FILE_RENAMED = 'drive.file.renamed',
FILE_MOVED = 'drive.file.moved',
FILE_SHARED = 'drive.file.shared',
FOLDER_CREATED = 'drive.folder.created',
}
Event Flow:
- An operation completes (e.g., a file upload)
- A side effect is triggered
- Listeners registered for this event type are queried
- Job payloads are created with the event data
- A job is queued for each listener
- The automation flows execute
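In simplified form, the emit/handle shape looks like this (a stand-in using Node's EventEmitter; the real system persists listeners and queues jobs, and enqueueWorkflowJobs is a hypothetical placeholder for that step):
import { EventEmitter } from 'events'

const driveEvents = new EventEmitter()

// Placeholder: look up registered listeners and queue one job per listener.
async function enqueueWorkflowJobs(event: string, payload: Record<string, unknown>) {
  console.log(`queueing jobs for ${event}`, payload)
}

driveEvents.on('drive.file.uploaded', (payload) => {
  void enqueueWorkflowJobs('drive.file.uploaded', payload)
})

// Emitted from the upload-confirmation handler once the item is UPLOAD_CONFIRMED.
driveEvents.emit('drive.file.uploaded', { itemId: 'item-id', projectId: 'project-id' })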
This allows users to build complex document workflows like:
- Auto-process invoices when uploaded to specific folders
- Send notifications when files are shared
- Archive old files to cold storage
- Extract data from PDFs and populate databases
Key Features
Soft Delete with Bin Management
Instead of immediately deleting files, they're moved to a recycle bin:
{
isDeleted: true,
deletedAt: timestamp,
deletedBy: userId
}
Benefits:
- Users can recover accidentally deleted files
- Audit trail for compliance
- Configurable retention period (default: 30 days)
- Auto-cleanup cron job runs daily at 2 AM
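The cleanup job itself is small; a sketch with node-cron ("0 2 * * *" is 02:00 daily, the 30-day window matches the default above, and purgeItem stands in for the S3 and database deletions):
import cron from 'node-cron'
import { LessThan, Repository } from 'typeorm'

type DeletedItem = { id: string; isDeleted: boolean; deletedAt: Date }

// Run at 02:00 every day and permanently remove items deleted more than 30 days ago.
function scheduleBinCleanup(repo: Repository<DeletedItem>, purgeItem: (id: string) => Promise<void>) {
  cron.schedule('0 2 * * *', async () => {
    const cutoff = new Date(Date.now() - 30 * 24 * 60 * 60 * 1000)
    const expired = await repo.find({ where: { isDeleted: true, deletedAt: LessThan(cutoff) } })
    for (const item of expired) {
      await purgeItem(item.id)
    }
  })
}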
Version Control
Every file update creates a new version record:
{
versionNumber: 2,
storageKey: 's3-key-v2',
checksum: 'md5-hash',
size: 1048576,
createdBy: userId,
createdAt: timestamp
}
This enables:
- Rolling back to previous versions
- Audit trails of who changed what
- Storage optimization through deduplication
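A sketch of how the dedup decision can work when a new version arrives: if the checksum matches an existing version, the new record points at the already-stored object instead of a new S3 key (the FileVersion shape follows the record above; the persistence wiring is omitted):
interface FileVersion {
  versionNumber: number
  storageKey: string
  checksum: string
  size: number
  createdBy: string
  createdAt: Date
}

// Build the next version record, reusing storage when the content is unchanged.
function buildNextVersion(
  existing: FileVersion[],
  upload: { storageKey: string; checksum: string; size: number; userId: string },
): FileVersion {
  const duplicate = existing.find(v => v.checksum === upload.checksum)
  return {
    versionNumber: existing.length + 1,
    storageKey: duplicate ? duplicate.storageKey : upload.storageKey, // dedup: reuse the stored object
    checksum: upload.checksum,
    size: upload.size,
    createdBy: upload.userId,
    createdAt: new Date(),
  }
}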
Bulk Operations
The API supports batching multiple operations in a single request:
PATCH /v1/drive/items
{
operations: [
{ type: 'move', itemIds: [...], targetParentId: '...' },
{ type: 'delete', itemIds: [...] },
{ type: 'star', itemIds: [...] }
]
}
This reduces network round trips and enables atomic-like processing of related changes.
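The handler behind this endpoint can stay simple: dispatch each operation to its service call and report per-operation results so the client can see partial failures (the types and service names below are illustrative):
type BulkOperation =
  | { type: 'move'; itemIds: string[]; targetParentId: string }
  | { type: 'delete'; itemIds: string[] }
  | { type: 'star'; itemIds: string[] }

interface DriveService {
  move: (ids: string[], parentId: string) => Promise<void>
  softDelete: (ids: string[]) => Promise<void>
  star: (ids: string[]) => Promise<void>
}

// Apply each operation in order and collect a result per entry.
async function applyBulkOperations(ops: BulkOperation[], svc: DriveService) {
  const results: { type: string; ok: boolean; error?: string }[] = []
  for (const op of ops) {
    try {
      if (op.type === 'move') await svc.move(op.itemIds, op.targetParentId)
      else if (op.type === 'delete') await svc.softDelete(op.itemIds)
      else await svc.star(op.itemIds)
      results.push({ type: op.type, ok: true })
    } catch (err) {
      results.push({ type: op.type, ok: false, error: (err as Error).message })
    }
  }
  return results
}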
Smart MIME Type Detection
Files are validated and categorized using binary signature detection:
function detectMimeType(buffer: Buffer): string | null {
  // PNG: 0x89 followed by the ASCII letters "PNG"
  if (buffer[0] === 0x89 && buffer.toString('ascii', 1, 4) === 'PNG') {
    return 'image/png'
  }
  // JPEG: FF D8 FF
  if (buffer[0] === 0xFF && buffer[1] === 0xD8 && buffer[2] === 0xFF) {
    return 'image/jpeg'
  }
  // PDF: the "%PDF" header
  if (buffer.toString('ascii', 0, 4) === '%PDF') {
    return 'application/pdf'
  }
  // ... more signatures
  return null // unknown signature
}
This helps block spoofed or malicious uploads and ensures processing decisions are based on actual file content, not just extensions.
Technology Stack
Backend:
- Fastify (TypeScript) - High-performance web framework
- TypeORM - Type-safe ORM with PostgreSQL
- PostgreSQL - Relational database with JSONB support
- AWS S3 - Object storage with presigned URLs
- Meilisearch - Full-text search engine
- Google Document AI - OCR and text extraction
- Socket.IO - Real-time progress updates
Frontend/Integration:
- Docxster Pieces Framework - Modular integration layer
- TypeScript - Type safety across the stack
Design Patterns & Best Practices
1. Direct-to-S3 Uploads
By generating presigned URLs, clients upload directly to S3, bypassing the application server. This:
- Reduces server bandwidth costs
- Improves upload speed
- Scales naturally with S3's infrastructure
2. Checksum Validation
MD5 checksums are calculated client-side and validated:
- On S3 upload (via Content-MD5 header)
- On upload confirmation (server-side verification)
- On version creation (deduplication)
This ensures data integrity throughout the upload pipeline.
3. Graceful Degradation
Not all features are always available:
- OCR processing degrades gracefully if Google Document AI isn't configured
- Thumbnail generation skips unsupported formats
- Search falls back to database queries if Meilisearch is unavailable
This makes the system resilient to partial failures.
4. Multi-Tenancy
Data is strictly scoped by projectId:
- All queries filter by project
- Database indices include projectId
- Storage quotas are per-project
This ensures data isolation in a multi-tenant environment.
5. Async Processing with Status Tracking
Files progress through processing stages:
PENDING_UPLOAD → UPLOAD_CONFIRMED → PROCESSING_THUMBNAIL →
PROCESSING_TEXT → PROCESSING_OCR → READY
Users can track progress, and the system can retry failed stages independently.
Performance Considerations
Database Optimization
- Partial indices for sparse columns (e.g., only index non-deleted items)
- Composite indices for common query patterns
- JSONB indices for metadata queries
- Connection pooling with TypeORM
Search Optimization
- Async indexing (doesn't block upload confirmation)
- Incremental updates (only changed fields)
- Faceted search for aggregations
- Ranking rules tuned for relevance
Caching Strategy
- Signed URLs cached for 15 minutes
- Thumbnail URLs generated once and stored
- Owner information lazy-loaded when needed
- Folder hierarchies computed once per query
Security Features
Upload Validation
- File size limits per project plan
- MIME type whitelisting
- Checksum verification
- Malware scanning integration
Access Control
- Project-scoped authorization
- User-based permissions (VIEWER, EDITOR)
- Public link generation with passwords
- Audit trails for all operations
Data Protection
- Soft delete with recovery window
- Audit logs (who deleted what, when)
- Time-limited signed URLs
- Checksums for integrity verification
Challenges & Solutions
Challenge 1: Handling Large File Uploads
Problem: Large files (>100MB) timing out, consuming server resources.
Solution: 3-step upload process with direct-to-S3 uploads. The server only handles metadata, while S3 handles the actual file transfer.
Challenge 2: Keeping Search in Sync
Problem: Search index getting out of sync with database during high load.
Solution: Async indexing with job queue. Failed indexing jobs are retried automatically. Status tracking in metadata allows re-indexing if needed.
Challenge 3: Folder Download Performance
Problem: Downloading large folder structures was slow and memory-intensive.
Solution: Background worker that streams files into a ZIP archive incrementally, with real-time progress updates via Socket.IO.
Challenge 4: Duplicate File Names
Problem: Users uploading files with the same name to the same folder.
Solution: UNIQUE constraint on (name, parentId, projectId). The API returns a clear error, allowing the client to handle it (rename, replace, etc.).
Lessons Learned
1. Design for Observability from Day One
Adding structured logging, event emission, and status tracking early made debugging and monitoring much easier as the system scaled.
2. Embrace Async Processing
Moving compute-intensive operations (OCR, thumbnails) to background workers improved API response times and made the system more resilient to processing failures.
3. Use Presigned URLs Liberally
Direct-to-S3 uploads and downloads reduced server load significantly. The application server only handles metadata and orchestration.
4. Plan for Multi-Tenancy Early
Adding projectId to every entity and query from the start made data isolation natural and prevented many potential security issues.
5. Event-Driven Architecture Enables Extensibility
By emitting events for all major operations, I made it easy to add new features (workflows, webhooks, analytics) without modifying core business logic.
Future Enhancements
- Real-time Collaboration: Live editing of documents with operational transforms
- Smart Deduplication: Content-based deduplication across the entire project
- Advanced Permissions: Fine-grained ACLs with groups and roles
- Audit Logs: Comprehensive audit trail for compliance
- AI-Powered Search: Semantic search using vector embeddings
- Mobile SDK: Native mobile SDKs for iOS and Android
Conclusion
Building Docxster Drive taught me valuable lessons about distributed systems, storage architecture, and API design. The key takeaways:
- Simple, composable pieces are easier to reason about and maintain
- Direct integrations (like S3 presigned URLs) reduce complexity and improve performance
- Async processing makes systems more resilient and responsive
- Event-driven architecture enables extensibility without coupling
- Multi-tenancy must be designed in from the start
The system now handles thousands of file operations daily, providing reliable document management for the Docxster platform. If you're building a similar system, I hope these patterns and lessons help you avoid some of the pitfalls I encountered along the way.
Interested in learning more about distributed systems and storage architecture? Follow me for more deep dives into building scalable systems.