NuMarkdown-8B-Thinking is the first reasoning OCR vision-language model (VLM) designed to convert documents into clean Markdown optimized for retrieval-augmented generation (RAG). Built on Qwen 2.5-VL-7B and fine-tuned on synthetic Doc → Reasoning → Markdown examples, it generates thinking tokens before emitting the final Markdown, which helps it handle complex layouts and tables. Depending on task difficulty, thinking-token usage ranges from 20% to 500% of the length of the final answer.

Training proceeds in two phases: supervised fine-tuning (SFT) followed by reinforcement learning (GRPO) with a layout-centric reward that targets accuracy on challenging documents. The model excels at non-standard layouts and complex table structures, outperforming non-reasoning OCR systems such as GPT-4o and OCRFlux and competing with large closed-source reasoning models such as Gemini 2.5.

NuMarkdown-8B-Thinking is released under the MIT license and supports both vLLM and Transformers for deployment.
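Because the model emits reasoning before the answer, downstream code usually needs to strip the thinking tokens out of the raw generation. A minimal sketch, assuming the reasoning is wrapped in `<think>...</think>` tags (the delimiter is an assumption; check the model's chat template for the actual markers):

```python
import re


def split_reasoning(output: str) -> tuple[str, str]:
    """Split raw model output into (reasoning, markdown).

    Assumes the reasoning block is delimited by <think>...</think>;
    everything after the closing tag is the final Markdown answer.
    """
    match = re.search(r"<think>(.*?)</think>", output, flags=re.DOTALL)
    if match is None:
        return "", output.strip()
    reasoning = match.group(1).strip()
    markdown = output[match.end():].strip()
    return reasoning, markdown


def thinking_ratio(output: str) -> float:
    """Rough thinking-to-answer size ratio using whitespace tokens."""
    reasoning, markdown = split_reasoning(output)
    if not markdown:
        return 0.0
    return len(reasoning.split()) / len(markdown.split())
```

With a ratio helper like this you can observe the 20%–500% spread mentioned above on your own documents.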
## Features
- 8.29B parameter reasoning-enabled OCR VLM
- Converts documents to clean, structured Markdown
- Generates thinking tokens to plan before output
- Excels at complex layouts and merged table cells
- Trained with SFT + RL (GRPO) and layout rewards
- Outperforms GPT-4o and OCRFlux on Markdown conversion tasks
- MIT license for unrestricted use
- Supports vLLM and Transformers for inference
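For Transformers-based inference, a hedged sketch is below. The chat-message schema follows the standard Transformers vision chat template; the model id `numind/NuMarkdown-8B-Thinking`, the prompt wording, and the dtype/device settings are assumptions to adapt to your setup, and the heavy imports are done lazily so the message builder can be used on its own:

```python
def build_messages(image) -> list:
    """Chat-style request asking the model to transcribe a document image.

    The role/content schema is the common Transformers vision chat
    template; the prompt text is an illustrative assumption.
    """
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": "Convert this document to clean Markdown."},
            ],
        }
    ]


def transcribe(image, model_id: str = "numind/NuMarkdown-8B-Thinking") -> str:
    """Run one document image (a PIL.Image) through the model.

    Requires `torch` and `transformers` plus a GPU with enough memory;
    imported lazily so the rest of this sketch stays lightweight.
    """
    import torch
    from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )
    text = processor.apply_chat_template(
        build_messages(image), tokenize=False, add_generation_prompt=True
    )
    inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)
    generated = model.generate(**inputs, max_new_tokens=4096)
    # Decode only the newly generated tokens (reasoning + final Markdown).
    new_tokens = generated[:, inputs["input_ids"].shape[1]:]
    return processor.batch_decode(new_tokens, skip_special_tokens=True)[0]
```

The returned string still contains the thinking tokens; strip them before indexing the Markdown for RAG.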