OCRBase is a self-hostable document OCR and structured extraction system built to turn PDFs into machine-usable outputs at scale, aiming to bridge the gap between raw text extraction and production-ready pipelines. Rather than treating OCR as a one-off script, it exposes an API-driven workflow: documents are submitted as jobs and processed through a queue-based architecture designed for high throughput. The output is aimed at downstream automation, with structured JSON that follows user-defined schemas alongside Markdown for human review or indexing. Real-time job progress updates over WebSockets make it easier to integrate into UIs, dashboards, or ingestion systems where users need feedback on long-running document processing.
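
The job-based workflow described above can be illustrated with a small in-memory sketch. This is not OCRBase's API: the names `InMemoryJobQueue`, `submit`, `processNext`, and `JobEvent` are hypothetical, the OCR step is a stand-in function, and listeners are plain callbacks where a real deployment would presumably push events over a WebSocket. It only shows the general pattern of submitting a document, receiving a job record immediately, and observing status changes as a worker picks the job up.

```typescript
// Hypothetical job lifecycle for illustration; not OCRBase's actual API.
type JobStatus = "queued" | "processing" | "completed" | "failed";

interface Job {
  id: number;
  fileName: string;
  status: JobStatus;
  markdown?: string;
  error?: string;
}

interface JobEvent {
  jobId: number;
  status: JobStatus;
}

type JobListener = (event: JobEvent) => void;

class InMemoryJobQueue {
  private pending: Job[] = [];
  private listeners: JobListener[] = [];
  private nextId = 1;

  // Register a callback that receives every status change.
  subscribe(listener: JobListener): void {
    this.listeners.push(listener);
  }

  // Enqueue a document and return immediately with its job record.
  submit(fileName: string): Job {
    const job: Job = { id: this.nextId++, fileName, status: "queued" };
    this.pending.push(job);
    this.emit(job);
    return job;
  }

  // Take the oldest queued job and run the supplied OCR function on it.
  processNext(ocr: (fileName: string) => string): Job | undefined {
    const job = this.pending.shift();
    if (!job) return undefined;
    job.status = "processing";
    this.emit(job);
    try {
      job.markdown = ocr(job.fileName);
      job.status = "completed";
    } catch (err) {
      job.error = err instanceof Error ? err.message : String(err);
      job.status = "failed";
    }
    this.emit(job);
    return job;
  }

  private emit(job: Job): void {
    const event: JobEvent = { jobId: job.id, status: job.status };
    for (const listener of this.listeners) listener(event);
  }
}

// Usage: record the status changes a UI would display for one document.
const queue = new InMemoryJobQueue();
const seen: string[] = [];
queue.subscribe((e) => seen.push(`${e.jobId}:${e.status}`));
queue.submit("invoice.pdf");
const done = queue.processNext((name) => `# Text of ${name}`);
```

Separating `submit` from `processNext` is the point of the sketch: the caller never waits on OCR, and progress reaches it only through events.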

Features

  • OCR pipeline using PaddleOCR-VL-0.9B for text extraction
  • Schema-driven structured extraction that returns JSON outputs
  • Queue-based processing designed for high-volume document workloads
  • Type-safe TypeScript SDK including React hooks for integration
  • Real-time WebSocket updates for job progress and completion
  • Self-hostable deployment model built around Docker and Bun
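
As a rough illustration of what schema-driven extraction implies for a consumer, the sketch below checks a parsed JSON result against a user-defined schema. The schema format here (a flat map from field name to a primitive type name) and the function `validateExtraction` are assumptions made for this example; OCRBase's real schema format and SDK types may look quite different.

```typescript
// Hypothetical schema format for illustration; OCRBase's real format may differ.
type FieldType = "string" | "number" | "boolean";
type ExtractionSchema = Record<string, FieldType>;

interface ValidationResult {
  ok: boolean;
  errors: string[];
}

// Check that a parsed JSON value contains every schema field
// with the declared primitive type.
function validateExtraction(schema: ExtractionSchema, value: unknown): ValidationResult {
  if (typeof value !== "object" || value === null || Array.isArray(value)) {
    return { ok: false, errors: ["result is not a JSON object"] };
  }
  const record = value as Record<string, unknown>;
  const errors: string[] = [];
  for (const [field, expected] of Object.entries(schema)) {
    if (!(field in record)) {
      errors.push(`missing field: ${field}`);
    } else if (typeof record[field] !== expected) {
      errors.push(`field ${field}: expected ${expected}, got ${typeof record[field]}`);
    }
  }
  return { ok: errors.length === 0, errors };
}

// Usage: one result that matches the schema and one that does not.
const invoiceSchema: ExtractionSchema = { vendor: "string", total: "number", paid: "boolean" };
const good = validateExtraction(
  invoiceSchema,
  JSON.parse('{"vendor":"Acme","total":120.5,"paid":false}'),
);
const bad = validateExtraction(
  invoiceSchema,
  JSON.parse('{"vendor":"Acme","total":"120.5"}'),
);
```

Validating at the boundary like this lets downstream automation trust field types instead of re-checking them at every use.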

Categories

PDF

License

MIT License

Additional Project Details

Operating Systems

Linux, Mac, Windows

Programming Language

TypeScript

Related Categories

TypeScript PDF Software

Registered

2026-01-27