.MD/.JSON Document OCR and structured data extraction API
OCRBase is a self-hostable document OCR and structured extraction system that turns PDFs into machine-usable outputs at scale, aiming to bridge the gap between raw text extraction and production-ready pipelines. Rather than treating OCR as a one-off script, it exposes an API-driven workflow: documents are submitted as jobs and processed through a queue-based architecture designed for high throughput.
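To make the job-and-queue workflow concrete, here is a minimal in-memory sketch of the pattern in Python. It is not OCRBase's actual API: `JobQueue`, `submit`, `status`, `wait_all`, and `fake_ocr` are hypothetical names chosen for illustration, and a real deployment would presumably use a persistent queue and an HTTP layer rather than a thread and a dictionary.

```python
import queue
import threading
import uuid


class JobQueue:
    """In-memory stand-in for a job queue. All names are hypothetical."""

    def __init__(self, process):
        self._process = process        # callable: bytes -> dict
        self._pending = queue.Queue()  # jobs waiting for the worker
        self._jobs = {}                # job_id -> {"status", "result"}
        self._lock = threading.Lock()
        threading.Thread(target=self._worker, daemon=True).start()

    def submit(self, document: bytes) -> str:
        """Register a job and return its id immediately, without blocking."""
        job_id = uuid.uuid4().hex
        with self._lock:
            self._jobs[job_id] = {"status": "queued", "result": None}
        self._pending.put((job_id, document))
        return job_id

    def status(self, job_id: str) -> dict:
        """Return a copy of the job record so callers can poll safely."""
        with self._lock:
            return dict(self._jobs[job_id])

    def wait_all(self) -> None:
        """Block until every submitted job has been processed."""
        self._pending.join()

    def _worker(self) -> None:
        while True:
            job_id, document = self._pending.get()
            try:
                update = {"status": "done", "result": self._process(document)}
            except Exception as exc:
                # A failing document marks its own job as failed
                # instead of stopping the worker.
                update = {"status": "failed", "result": str(exc)}
            with self._lock:
                self._jobs[job_id] = update
            self._pending.task_done()


def fake_ocr(document: bytes) -> dict:
    # Placeholder for the real OCR step: echoes the bytes back as text.
    return {"markdown": document.decode("utf-8", errors="replace")}
```

A caller submits a document, keeps the returned id, and polls `status` until the job reports `done` or `failed`. Decoupling submission from processing this way is what lets a queue-based design absorb bursts of uploads without blocking clients.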
...It also supports clickable links so generated documents can include interactive URLs, and it can create multi-page documents with custom page sizes. A notable convenience is built-in markdown-to-PDF conversion for common structures like headers and lists, letting you go from formatted text to a PDF layout quickly.
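As a rough illustration of the first step in a markdown-to-PDF conversion, the sketch below classifies lines into headings, list items, and paragraphs that a layout stage could then draw. It is an assumption about how such a conversion might be structured, not the tool's own implementation; `Block` and `parse_markdown` are hypothetical names, and only `#`-style headers and `-`, `*`, `+` bullets are handled.

```python
import re
from dataclasses import dataclass


@dataclass
class Block:
    kind: str   # "heading", "list_item", or "paragraph"
    level: int  # heading level 1-6; 0 for other kinds
    text: str


_HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
_BULLET = re.compile(r"^[-*+]\s+(.*)$")


def parse_markdown(source: str) -> list[Block]:
    """Split markdown text into typed blocks, skipping blank lines."""
    blocks = []
    for raw in source.splitlines():
        line = raw.strip()
        if not line:
            continue
        if m := _HEADING.match(line):
            blocks.append(Block("heading", len(m.group(1)), m.group(2)))
        elif m := _BULLET.match(line):
            blocks.append(Block("list_item", 0, m.group(1)))
        else:
            blocks.append(Block("paragraph", 0, line))
    return blocks
```

A renderer could then map each block kind to a font size and indent, for example larger type for lower heading levels and a bullet glyph plus indent for list items.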