OCRBase

OCRBase is a self-hostable document OCR and structured extraction system that turns PDFs into machine-usable output at scale, bridging the gap between raw text extraction and production-ready pipelines. Rather than treating OCR as a one-off script, it exposes an API-driven workflow: documents are submitted as jobs and processed through a queue-based architecture designed for high throughput. The output targets downstream automation, with structured JSON that follows user-defined schemas alongside Markdown for human review or indexing. Job progress is streamed in real time over WebSockets, which simplifies integration with UIs, dashboards, or ingestion systems where users need feedback on long-running document processing.
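The job-based workflow described above can be illustrated with a small in-memory sketch. This is not the OCRBase API: `InMemoryJobQueue`, `JobEvent`, `submit`, `drain`, and `subscribe` are hypothetical names chosen for illustration, and the listener callback stands in for what would be a WebSocket subscription in a real deployment.

```typescript
// Hypothetical sketch: none of these names come from the OCRBase API.
type JobStatus = "queued" | "processing" | "completed";

interface JobEvent {
  jobId: string;
  status: JobStatus;
  progress: number; // percentage, 0 to 100
}

type Listener = (event: JobEvent) => void;

class InMemoryJobQueue {
  private pending: { jobId: string; fileName: string }[] = [];
  private listeners: Listener[] = [];
  private nextId = 1;

  // Register a progress callback (stand-in for a WebSocket subscription).
  subscribe(listener: Listener): void {
    this.listeners.push(listener);
  }

  // Enqueue a document and return its job id immediately.
  submit(fileName: string): string {
    const jobId = `job-${this.nextId++}`;
    this.pending.push({ jobId, fileName });
    this.emit({ jobId, status: "queued", progress: 0 });
    return jobId;
  }

  // Process jobs in FIFO order, emitting one event per simulated page.
  drain(pagesPerJob: number): void {
    while (this.pending.length > 0) {
      const { jobId } = this.pending.shift()!;
      for (let page = 1; page <= pagesPerJob; page++) {
        const progress = Math.round((page / pagesPerJob) * 100);
        const status: JobStatus =
          page === pagesPerJob ? "completed" : "processing";
        this.emit({ jobId, status, progress });
      }
    }
  }

  private emit(event: JobEvent): void {
    for (const listener of this.listeners) listener(event);
  }
}

// Usage: submit one document and collect its progress events.
const queue = new InMemoryJobQueue();
const events: JobEvent[] = [];
queue.subscribe((event) => events.push(event));
const jobId = queue.submit("invoice.pdf");
queue.drain(4);
```

The point of the design is that `submit` returns at once while the work happens later, so a caller learns about progress only through events rather than by blocking on the request.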

Features

OCR pipeline using PaddleOCR-VL-0.9B for text extraction
Schema-driven structured extraction that returns JSON outputs
Queue-based processing designed for high-volume document workloads
Type-safe TypeScript SDK including React hooks for integration
Real-time WebSocket updates for job progress and completion
Self-hostable deployment model built around Docker and Bun
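
As a rough illustration of what schema-driven extraction implies for a consumer, the sketch below checks a JSON result against a user-defined schema. The schema shape (`Schema`, `FieldType`) and the `validateAgainstSchema` helper are assumptions made for this example; the schema format OCRBase actually accepts may differ.

```typescript
// Hypothetical schema shape; the real OCRBase schema format may differ.
type FieldType = "string" | "number" | "boolean";
type Schema = Record<string, FieldType>;

// Return a list of problems; an empty list means the data matches the schema.
function validateAgainstSchema(
  schema: Schema,
  data: Record<string, unknown>
): string[] {
  const errors: string[] = [];
  for (const [field, expected] of Object.entries(schema)) {
    if (!(field in data)) {
      errors.push(`missing field: ${field}`);
    } else if (typeof data[field] !== expected) {
      errors.push(
        `field ${field}: expected ${expected}, got ${typeof data[field]}`
      );
    }
  }
  return errors;
}

// Illustrative schema and two made-up extraction results.
const invoiceSchema: Schema = {
  invoiceNumber: "string",
  total: "number",
  paid: "boolean",
};

const good = validateAgainstSchema(invoiceSchema, {
  invoiceNumber: "INV-001",
  total: 129.5,
  paid: false,
});

const bad = validateAgainstSchema(invoiceSchema, {
  invoiceNumber: "INV-002",
  total: "129.50", // wrong type: string instead of number
});
```

Here `good` is an empty list, while `bad` reports the mistyped `total` and the missing `paid` field. A check like this is a cheap guard before structured output is handed to downstream automation.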

License

MIT License