OCRBase is a self-hostable document OCR and structured extraction system that turns PDFs into machine-usable outputs at scale, bridging the gap between raw text extraction and production-ready pipelines. Instead of treating OCR as a one-off script, it exposes an API-driven workflow: documents are submitted as jobs and processed through a queue-based architecture built for high throughput. The core output targets downstream automation, with structured JSON that follows user-defined schemas, alongside Markdown for human review or indexing. Real-time job progress updates over WebSockets make it easier to integrate into UIs, dashboards, or ingestion systems where users need feedback on long-running document processing.
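The job-and-queue workflow described above can be pictured with a small in-memory model. This is only a sketch under stated assumptions: `InMemoryJobQueue`, `JobEvent`, and the status names are hypothetical and are not taken from OCRBase's API. In a real deployment the events would presumably travel over a WebSocket rather than a local callback.

```typescript
// Hypothetical model of a job queue with progress events.
// None of these names come from OCRBase itself.
type JobStatus = "queued" | "processing" | "completed";

interface JobEvent {
  jobId: string;
  status: JobStatus;
  progress: number; // percentage, 0 to 100
}

interface QueuedJob {
  jobId: string;
  documentName: string;
  pageCount: number;
}

class InMemoryJobQueue {
  private pending: QueuedJob[] = [];
  private listeners: Array<(event: JobEvent) => void> = [];
  private nextId = 1;

  // Register a callback that receives every job event.
  subscribe(listener: (event: JobEvent) => void): void {
    this.listeners.push(listener);
  }

  // Enqueue a document and return its job ID immediately.
  submit(documentName: string, pageCount: number): string {
    const jobId = `job-${this.nextId++}`;
    // Treat every document as having at least one page.
    this.pending.push({ jobId, documentName, pageCount: Math.max(1, pageCount) });
    this.emit({ jobId, status: "queued", progress: 0 });
    return jobId;
  }

  // Process the oldest queued job, emitting one event per page.
  // Returns false when the queue is empty.
  processNext(): boolean {
    const job = this.pending.shift();
    if (job === undefined) return false;
    for (let page = 1; page <= job.pageCount; page++) {
      const progress = Math.round((page / job.pageCount) * 100);
      const status: JobStatus = page === job.pageCount ? "completed" : "processing";
      this.emit({ jobId: job.jobId, status, progress });
    }
    return true;
  }

  private emit(event: JobEvent): void {
    for (const listener of this.listeners) listener(event);
  }
}

// Usage: submit a four-page document and log each progress event.
const queue = new InMemoryJobQueue();
queue.subscribe((e) => console.log(`${e.jobId}: ${e.status} (${e.progress}%)`));
queue.submit("invoice.pdf", 4);
queue.processNext();
```

Returning the job ID at submission time, before any processing happens, is what lets a client show progress for long-running documents instead of blocking on a single request.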
Features
- OCR pipeline using PaddleOCR-VL-0.9B for text extraction
- Schema-driven structured extraction that returns JSON outputs
- Queue-based processing designed for high-volume document workloads
- Type-safe TypeScript SDK including React hooks for integration
- Real-time WebSocket updates for job progress and completion
- Self-hostable deployment model built around Docker and Bun
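To illustrate what schema-driven extraction implies for a consumer, the sketch below checks an extracted JSON object against a user-defined schema. The schema format (`ExtractionSchema`) and the `validateExtraction` helper are assumptions made for this example; OCRBase's actual schema syntax and SDK types may differ.

```typescript
// Hypothetical schema format for illustration only.
type FieldType = "string" | "number" | "boolean";
type ExtractionSchema = Record<string, { type: FieldType; required: boolean }>;

// Return a list of human-readable problems; an empty list means the data conforms.
function validateExtraction(
  schema: ExtractionSchema,
  data: Record<string, unknown>,
): string[] {
  const errors: string[] = [];
  for (const [field, rule] of Object.entries(schema)) {
    const value = data[field];
    if (value === undefined || value === null) {
      // Absent values are only a problem for required fields.
      if (rule.required) errors.push(`missing required field: ${field}`);
      continue;
    }
    if (typeof value !== rule.type) {
      errors.push(`field ${field}: expected ${rule.type}, got ${typeof value}`);
    }
  }
  return errors;
}

// Example schema for an invoice, with one optional field.
const invoiceSchema: ExtractionSchema = {
  invoiceNumber: { type: "string", required: true },
  total: { type: "number", required: true },
  paid: { type: "boolean", required: false },
};

// The total came back as a string, so validation reports one problem.
const extracted = { invoiceNumber: "INV-001", total: "129.50" };
console.log(validateExtraction(invoiceSchema, extracted));
// prints [ 'field total: expected number, got string' ]
```

A check like this is useful at the boundary between extraction and downstream automation, since a pipeline can reject or flag a document before malformed fields reach later stages.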