OCRBase is a self-hostable document OCR and structured extraction system that turns PDFs into machine-usable output at scale, bridging the gap between raw text extraction and production-ready pipelines. Rather than treating OCR as a one-off script, it exposes an API-driven workflow: documents are submitted as jobs and processed through a queue-based architecture designed for high throughput. The core output targets downstream automation, with structured JSON that follows user-defined schemas, plus Markdown for human review or indexing. Real-time job progress updates over WebSockets make it easier to integrate into UIs, dashboards, or ingestion systems where users need feedback on long-running document processing.
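
The job-and-queue workflow described above can be illustrated with a minimal in-memory sketch. It does not use OCRBase's actual API or SDK: the `InMemoryJobQueue` class, the `Job` shape, and the method names are all hypothetical, and a plain callback stands in for the WebSocket channel that would carry progress events in a real deployment.

```typescript
// Hypothetical types for illustration only; these are not OCRBase's actual API.
type JobStatus = "queued" | "processing" | "completed";

interface Job {
  id: number;
  fileName: string;
  status: JobStatus;
  progress: number; // percentage, 0 to 100
}

type ProgressListener = (job: Readonly<Job>) => void;

class InMemoryJobQueue {
  private pending: Job[] = [];
  private listeners: ProgressListener[] = [];
  private nextId = 1;

  // Register a callback; a real system might push these events over a WebSocket.
  onProgress(listener: ProgressListener): void {
    this.listeners.push(listener);
  }

  // Enqueue a document and return its job record immediately.
  submit(fileName: string): Job {
    const job: Job = { id: this.nextId++, fileName, status: "queued", progress: 0 };
    this.pending.push(job);
    this.emit(job);
    return job;
  }

  // Process the oldest pending job (FIFO), reporting progress once per page.
  processNext(pageCount: number): Job | undefined {
    const job = this.pending.shift();
    if (job === undefined) return undefined;
    job.status = "processing";
    for (let page = 1; page <= pageCount; page++) {
      // OCR and extraction work for one page would happen here.
      job.progress = Math.round((page / pageCount) * 100);
      this.emit(job);
    }
    job.status = "completed";
    job.progress = 100;
    this.emit(job);
    return job;
  }

  private emit(job: Job): void {
    // Send a copy so listeners cannot mutate queue state.
    for (const listener of this.listeners) listener({ ...job });
  }
}

const queue = new InMemoryJobQueue();
const events: string[] = [];
queue.onProgress((job) => events.push(`${job.id}:${job.status}:${job.progress}`));
queue.submit("invoice.pdf");
queue.processNext(4);
console.log(events);
```

The sketch runs synchronously so the event order is easy to follow. A production queue would process jobs asynchronously in separate workers and persist job state, which is where a queue-based architecture earns its throughput.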

Features

  • OCR pipeline using PaddleOCR-VL-0.9B for text extraction
  • Schema-driven structured extraction that returns JSON outputs
  • Queue-based processing designed for high-volume document workloads
  • Type-safe TypeScript SDK including React hooks for integration
  • Real-time WebSocket updates for job progress and completion
  • Self-hostable deployment model built around Docker and Bun
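
To make the schema-driven extraction feature more concrete, here is a small sketch of checking an extracted JSON result against a user-defined schema. The flat `FlatSchema` format and the `validateExtraction` function are assumptions made for illustration; OCRBase's own schema syntax and validation behavior may differ.

```typescript
// Hypothetical schema format for illustration; not OCRBase's actual schema syntax.
type FieldType = "string" | "number" | "boolean";
type FlatSchema = Record<string, FieldType>;

// Return a list of problems; an empty list means the value matches the schema.
function validateExtraction(schema: FlatSchema, value: unknown): string[] {
  if (typeof value !== "object" || value === null || Array.isArray(value)) {
    return ["result is not a JSON object"];
  }
  const record = value as Record<string, unknown>;
  const errors: string[] = [];
  for (const [field, expected] of Object.entries(schema)) {
    if (!(field in record)) {
      errors.push(`missing field: ${field}`);
    } else if (typeof record[field] !== expected) {
      errors.push(`field ${field} should be ${expected}`);
    }
  }
  return errors;
}

// Example: the extracted total arrived as a string and the "paid" flag is absent.
const invoiceSchema: FlatSchema = { invoiceNumber: "string", total: "number", paid: "boolean" };
const extracted: unknown = JSON.parse('{"invoiceNumber":"INV-7","total":"129.50"}');
const problems = validateExtraction(invoiceSchema, extracted);
console.log(problems);
```

Validating at the boundary like this lets downstream automation reject or retry a malformed extraction instead of silently ingesting it.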

Categories

PDF

License

MIT License

Additional Project Details

Operating Systems

Linux, Mac, Windows

Programming Language

TypeScript

Related Categories

TypeScript PDF Software

Registered

14 hours ago