OCRBase

OCRBase is a self-hostable document OCR and structured extraction system that turns PDFs into machine-usable outputs at scale, bridging the gap between raw text extraction and production-ready pipelines. Instead of treating OCR as a one-off script, it exposes an API-driven workflow: documents are submitted as jobs and processed through a queue-based architecture designed for high-volume workloads. The core output targets downstream automation, producing structured JSON that follows user-defined schemas, alongside readable Markdown for human review or indexing. Real-time job progress updates over WebSockets make it easier to integrate into UIs, dashboards, or ingestion systems where users need feedback on long-running document processing.
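
The job workflow described above can be sketched in a few lines of TypeScript. This is only an illustrative in-memory model, not the OCRBase API: the names InMemoryJobQueue, JobEvent, submit, subscribe, and drain are all hypothetical, and the per-page callback stands in for whatever OCR step a real worker would run. It is kept synchronous for brevity; a real deployment would presumably process jobs asynchronously and push events to clients over a WebSocket.

```typescript
// Hypothetical sketch: none of these names come from the OCRBase SDK.
type JobStatus = "queued" | "processing" | "completed";

interface JobEvent {
  jobId: string;
  status: JobStatus;
  progress: number; // fraction of pages finished, from 0 to 1
}

type Listener = (event: JobEvent) => void;

class InMemoryJobQueue {
  private pending: { jobId: string; pages: number }[] = [];
  private listeners: Listener[] = [];
  private nextId = 1;

  // Register a callback that receives every progress event.
  subscribe(listener: Listener): void {
    this.listeners.push(listener);
  }

  // Enqueue a document with a known page count and return its job id.
  submit(pages: number): string {
    if (pages < 1) {
      throw new Error("a job needs at least one page");
    }
    const jobId = `job-${this.nextId++}`;
    this.pending.push({ jobId, pages });
    this.emit({ jobId, status: "queued", progress: 0 });
    return jobId;
  }

  // Process queued jobs in FIFO order, emitting one event per finished page.
  drain(processPage: (jobId: string, page: number) => void): void {
    let job = this.pending.shift();
    while (job !== undefined) {
      for (let page = 1; page <= job.pages; page++) {
        processPage(job.jobId, page);
        this.emit({
          jobId: job.jobId,
          status: page === job.pages ? "completed" : "processing",
          progress: page / job.pages,
        });
      }
      job = this.pending.shift();
    }
  }

  private emit(event: JobEvent): void {
    for (const listener of this.listeners) {
      listener(event);
    }
  }
}
```

A dashboard would subscribe before submitting a document, then render each event as it arrives; separating submission from processing is what lets the worker side scale independently of the clients.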

Features

OCR pipeline using PaddleOCR-VL-0.9B for text extraction
Schema-driven structured extraction that returns JSON outputs
Queue-based processing designed for high-volume document workloads
Type-safe TypeScript SDK including React hooks for integration
Real-time WebSocket updates for job progress and completion
Self-hostable deployment model built around Docker and Bun
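
To illustrate what schema-driven extraction means in practice, here is a minimal TypeScript sketch that checks an extracted JSON object against a user-defined schema. The schema format (a flat map from field name to primitive type) and the names ExtractionSchema and checkExtraction are assumptions made for this example; the listing does not describe OCRBase's actual schema format, which may well be richer (for instance, JSON Schema).

```typescript
// Hypothetical sketch: this flat schema format is an assumption, not OCRBase's.
type FieldType = "string" | "number" | "boolean";
type ExtractionSchema = Record<string, FieldType>;

// Return a list of human-readable problems; an empty list means the
// extracted result contains every schema field with the expected type.
function checkExtraction(
  schema: ExtractionSchema,
  result: Record<string, unknown>,
): string[] {
  const problems: string[] = [];
  for (const [field, expected] of Object.entries(schema)) {
    if (!(field in result)) {
      problems.push(`missing field: ${field}`);
      continue;
    }
    const actual = typeof result[field];
    if (actual !== expected) {
      problems.push(`field ${field}: expected ${expected}, got ${actual}`);
    }
  }
  return problems;
}

// Example schema for an invoice; the field names are illustrative only.
const invoiceSchema: ExtractionSchema = {
  invoiceNumber: "string",
  total: "number",
};
```

Running a check like this on each result before handing it to downstream automation catches missing or mistyped fields early, which is the main benefit of tying extraction to a schema rather than to free-form text.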

License

MIT License

Additional Project Details

Operating Systems

Linux, Mac, Windows

Programming Language

TypeScript

Related Categories

TypeScript PDF Software

Registered

2026-01-27
