Extractous is a Rust-based unstructured data extraction library focused on fast local parsing of documents and other content-heavy files. Its purpose is to extract text and metadata efficiently from formats such as PDF, Word, HTML, email archives, images, and more, without depending on external APIs or separate parsing servers. The project emphasizes performance and low memory usage, and its maintainers describe it as a local-first alternative to heavier extraction stacks. For broader format support, the system combines its Rust core with ahead-of-time compiled Apache Tika shared libraries, which allows it to extend parsing coverage while still avoiding traditional server-based overhead. It also supports OCR for images and scanned documents through Tesseract, making it useful for document ingestion pipelines that include image-based or scanned inputs.

Features

  • Fast text and metadata extraction
  • Automatic document type detection
  • Support for many file formats
  • OCR for images and scanned documents
  • Rust core with Python bindings
  • Local execution without external extraction services

Project Samples

Project Activity

See All Activity >

License

Apache License V2.0

Follow Extractous

Extractous Web Site

Other Useful Business Software
Gemini 3 and 200+ AI Models on One Platform Icon
Gemini 3 and 200+ AI Models on One Platform

Access Google's best plus Claude, Llama, and Gemma. Fine-tune and deploy from one console.

Build generative AI apps with Vertex AI. Switch between models without switching platforms.
Start Free
Rate This Project
Login To Rate This Project

User Reviews

Be the first to post a review of Extractous!

Additional Project Details

Programming Language

Rust

Related Categories

Rust Large Language Models (LLM)

Registered

2026-03-06