DocStrange is an open-source document understanding and extraction library that converts complex files into structured, LLM-ready outputs such as Markdown, JSON, CSV, and HTML. Developed by Nanonets, the project combines OCR, layout detection, table understanding, and structured extraction in a single end-to-end pipeline, which reduces the need to stitch together separate services. It is built for developers who need high-quality parsing of scans, photos, PDFs, office files, and other document sources while keeping privacy and control over the processing flow. A key differentiator is deployment flexibility: it offers a cloud API for managed usage as well as a fully private offline mode that runs locally on a GPU. The library also supports synchronous extraction, streaming responses, and asynchronous processing for larger documents, so it suits both interactive workflows and heavier back-end pipelines.
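To make the difference between the three processing modes concrete, here is a minimal Python sketch built only on the standard library. It does not use DocStrange's actual API: `parse_page`, `extract_sync`, `extract_stream`, and `run_all_modes` are hypothetical names, and the "parser" is a stub that only trims whitespace. The point is the calling pattern, not the extraction itself.

```python
import asyncio
from typing import AsyncIterator, List, Tuple


def parse_page(page: str) -> str:
    """Hypothetical stand-in for a real parser: trims one page of text."""
    return page.strip() + "\n"


def extract_sync(pages: List[str]) -> str:
    """Synchronous mode: block until every page is parsed, then return it all."""
    return "".join(parse_page(p) for p in pages)


async def extract_stream(pages: List[str]) -> AsyncIterator[str]:
    """Streaming mode: yield each page's output as soon as it is ready."""
    for p in pages:
        await asyncio.sleep(0)  # hand control back to the event loop between pages
        yield parse_page(p)


async def run_all_modes(pages: List[str]) -> Tuple[str, List[str], str]:
    """Run the same input through all three calling patterns."""
    # 1. Synchronous: one blocking call, one complete result.
    full = extract_sync(pages)

    # 2. Streaming: consume partial results as they arrive.
    chunks = [chunk async for chunk in extract_stream(pages)]

    # 3. Asynchronous: start a background job, do other work, collect it later.
    job = asyncio.create_task(asyncio.to_thread(extract_sync, pages))
    background = await job

    return full, chunks, background
```

Calling `asyncio.run(run_all_modes(pages))` with a list of page strings returns the full result, the list of streamed chunks, and the background job's result. In a real deployment the asynchronous pattern would typically be a submit-then-poll exchange with a server; the task handle here only imitates that shape. `asyncio.to_thread` requires Python 3.9 or later.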
Features
- Extraction from PDFs, images, Word files, Excel files, PowerPoint files, and URLs
- Output generation in Markdown, JSON, CSV, and HTML formats
- End-to-end OCR, layout analysis, and table extraction pipeline
- Private offline GPU mode in addition to managed cloud API access
- Streaming support for real-time extraction results
- Asynchronous processing for larger multi-page documents
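
As an illustration of what the four output formats look like for the same extracted table, the following sketch renders a small list of rows as JSON, CSV, Markdown, and HTML using only the Python standard library. This is not DocStrange code: the function names and the row structure are assumptions chosen for the example, and a real extraction result would carry far more structure than a flat table.

```python
import csv
import html
import io
import json
from typing import Dict, List

Row = Dict[str, str]  # assumed shape: one dict per table row, all values strings


def to_json(rows: List[Row]) -> str:
    """Serialize the rows as a JSON array of objects."""
    return json.dumps(rows, indent=2)


def to_csv(rows: List[Row]) -> str:
    """Serialize the rows as CSV with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def to_markdown(rows: List[Row]) -> str:
    """Render a pipe table. Cells containing '|' are not escaped in this sketch."""
    headers = list(rows[0].keys())
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(row[h] for h in headers) + " |")
    return "\n".join(lines)


def to_html(rows: List[Row]) -> str:
    """Render an HTML table, escaping special characters in every cell."""
    headers = list(rows[0].keys())
    head = "".join(f"<th>{html.escape(h)}</th>" for h in headers)
    body = "".join(
        "<tr>" + "".join(f"<td>{html.escape(row[h])}</td>" for h in headers) + "</tr>"
        for row in rows
    )
    return f"<table><tr>{head}</tr>{body}</table>"
```

Given `rows = [{"item": "Widget", "price": "9.50"}]`, each function returns the same data in a different target format. CSV and JSON suit downstream code, while Markdown and HTML are easier to place in an LLM prompt or a web page.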