OCRBase is a self-hostable document OCR and structured extraction system that turns PDFs into machine-usable output at scale, bridging the gap between raw text extraction and production-ready pipelines. Rather than treating OCR as a one-off script, it exposes an API-driven workflow: documents are submitted as jobs and processed through a queue-based architecture designed for high-volume workloads. The core output targets downstream automation: structured JSON that follows user-defined schemas, plus Markdown for human review or indexing. Real-time job progress updates over WebSockets make it easier to integrate into UIs, dashboards, or ingestion systems where users need feedback on long-running document processing.
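
The job workflow described above can be sketched from the client's side as a small state reducer that folds incoming WebSocket messages into a job status. This is only an illustration: the event names, the payload fields (`jobId`, `pagesDone`, `pagesTotal`), and the helpers `parseEvent` and `applyEvent` are all assumptions made for the example and are not taken from OCRBase's actual API or SDK.

```typescript
// Hypothetical event shapes. OCRBase's real WebSocket payloads may differ.
type JobEvent =
  | { type: "queued"; jobId: string }
  | { type: "progress"; jobId: string; pagesDone: number; pagesTotal: number }
  | { type: "completed"; jobId: string }
  | { type: "failed"; jobId: string; error: string };

interface JobState {
  jobId: string;
  status: "queued" | "processing" | "completed" | "failed";
  percent: number; // 0..100, derived from page counts
  error?: string;
}

// Parse a raw WebSocket message; return null for anything unrecognized.
function parseEvent(raw: string): JobEvent | null {
  try {
    const data = JSON.parse(raw);
    const known = ["queued", "progress", "completed", "failed"];
    if (
      typeof data === "object" &&
      data !== null &&
      typeof data.jobId === "string" &&
      known.includes(data.type)
    ) {
      return data as JobEvent;
    }
    return null;
  } catch {
    return null; // not valid JSON
  }
}

// Fold one event into the current job state (pure function, no I/O).
function applyEvent(state: JobState | undefined, event: JobEvent): JobState {
  switch (event.type) {
    case "queued":
      return { jobId: event.jobId, status: "queued", percent: 0 };
    case "progress": {
      const percent =
        event.pagesTotal > 0
          ? Math.round((event.pagesDone / event.pagesTotal) * 100)
          : 0;
      return { jobId: event.jobId, status: "processing", percent };
    }
    case "completed":
      return { jobId: event.jobId, status: "completed", percent: 100 };
    case "failed":
      return {
        jobId: event.jobId,
        status: "failed",
        percent: state?.percent ?? 0, // keep the last known progress
        error: event.error,
      };
  }
}
```

In a UI, a handler attached to a WebSocket's `message` event could pass each message through `parseEvent` and, when the result is not null, through `applyEvent` to update a progress bar. Keeping the reducer free of I/O makes it easy to test without a running server.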

Features

  • OCR pipeline using PaddleOCR-VL-0.9B for text extraction
  • Schema-driven structured extraction that returns JSON outputs
  • Queue-based processing designed for high-volume document workloads
  • Type-safe TypeScript SDK including React hooks for integration
  • Real-time WebSocket updates for job progress and completion
  • Self-hostable deployment model built around Docker and Bun
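
To illustrate what schema-driven extraction means for a consumer of the JSON output, here is a minimal sketch that checks an extraction result against a user-defined schema. The schema format (a flat map from field name to primitive type), the `validateResult` helper, and the invoice fields are hypothetical and chosen only for the example; OCRBase's own schema format is not described in this listing and may be quite different.

```typescript
// Hypothetical flat schema: field name -> expected primitive type.
type FieldType = "string" | "number" | "boolean";
type Schema = Record<string, FieldType>;

// Return a list of human-readable problems; an empty list means the result conforms.
function validateResult(schema: Schema, result: unknown): string[] {
  if (typeof result !== "object" || result === null) {
    return ["result is not an object"];
  }
  const record = result as Record<string, unknown>;
  const errors: string[] = [];
  for (const [field, expected] of Object.entries(schema)) {
    const value = record[field];
    if (value === undefined) {
      errors.push(`missing field: ${field}`);
    } else if (typeof value !== expected) {
      errors.push(`field ${field}: expected ${expected}, got ${typeof value}`);
    }
  }
  return errors;
}

// Example schema for an invoice (illustrative field names).
const invoiceSchema: Schema = {
  invoiceNumber: "string",
  total: "number",
  paid: "boolean",
};
```

Calling `validateResult(invoiceSchema, { invoiceNumber: "A-17", total: 99.5, paid: false })` returns an empty array, while a result with a missing or mistyped field returns one message per problem. A check like this is useful at the boundary between the extraction service and downstream automation, whatever schema format the service itself uses.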

Categories

PDF

License

MIT License

Additional Project Details

Operating Systems

Linux, Mac, Windows

Programming Language

TypeScript

Related Categories

TypeScript PDF Software

Registered

2026-01-27