Showing 2020 open source projects for "documents"

View related business solutions
  • MongoDB Atlas runs apps anywhere Icon
    MongoDB Atlas runs apps anywhere

    Deploy in 115+ regions with the modern database for every enterprise.

    MongoDB Atlas gives you the freedom to build and run modern applications anywhere—across AWS, Azure, and Google Cloud. With global availability in over 115 regions, Atlas lets you deploy close to your users, meet compliance needs, and scale with confidence across any geography.
    Start Free
  • Full-stack observability with actually useful AI | Grafana Cloud Icon
    Full-stack observability with actually useful AI | Grafana Cloud

    Our generous forever free tier includes the full platform, including the AI Assistant, for 3 users with 10k metrics, 50GB logs, and 50GB traces.

    Built on open standards like Prometheus and OpenTelemetry, Grafana Cloud includes Kubernetes Monitoring, Application Observability, Incident Response, plus the AI-powered Grafana Assistant. Get started with our generous free tier today.
    Create free account
  • 1
    Chandra

    Chandra

    OCR model for complex documents with layout-aware structured outputs

    Chandra is an advanced OCR model designed to extract and structure information from complex documents such as tables, forms, handwritten notes, and mathematical content. It focuses on preserving full document layout, meaning that extracted text is accompanied by positional metadata like bounding boxes for each element. Chandra supports multiple output formats including Markdown, HTML, and JSON, making it suitable for downstream processing and integration into data pipelines.
    Downloads: 3 This Week
    Last Update:
    See Project
  • 2
    PageIndex

    PageIndex

    Document Index for Vectorless, Reasoning-based RAG

    ...Rather than chunking text and embedding it into a vector database, PageIndex constructs a tree-structured index — similar to a detailed, AI-enhanced table of contents — that a large language model can traverse to locate the most relevant sections of long documents. This reasoning-driven retrieval aligns more naturally with how humans explore complex texts, improving relevance and traceability, especially in professional domains like financial reports, legal contracts, and technical manuals. The project includes example notebooks, scripts for tree generation and search, and support for multiple document formats including PDF and markdown, with tools designed to preserve context and semantic boundaries.
    Downloads: 3 This Week
    Last Update:
    See Project
  • 3
    TaxHacker

    TaxHacker

    Self-hosted AI accounting app. LLM analyzer for receipts

    TaxHacker is an open-source, self-hosted accounting application that uses artificial intelligence to automate financial record management for freelancers, independent developers, and small businesses. The system is designed to simplify bookkeeping by automatically processing financial documents such as receipts, invoices, and transaction records. It integrates large language models to analyze these documents, extract relevant financial information, and categorize expenses or income based on configurable rules. Users can deploy the application on their own infrastructure, ensuring that financial data remains private and under their control rather than being processed by external services. ...
    Downloads: 4 This Week
    Last Update:
    See Project
  • 4
    MegaParse

    MegaParse

    File Parser optimised for LLM Ingestion with no loss

    MegaParse is a file parser optimized for Large Language Model (LLM) ingestion, ensuring no loss of information. It efficiently parses various document formats, such as PDFs, DOCX, and PPTX, converting them into formats ideal for processing by LLMs. This tool is essential for applications that require accurate and comprehensive data extraction from diverse document types.
    Downloads: 0 This Week
    Last Update:
    See Project
  • Our Free Plans just got better! | Auth0 Icon
    Our Free Plans just got better! | Auth0

    With up to 25k MAUs and unlimited Okta connections, our Free Plan lets you focus on what you do best—building great apps.

    You asked, we delivered! Auth0 is excited to expand our Free and Paid plans to include more options so you can focus on building, deploying, and scaling applications without having to worry about your security. Auth0 now, thank yourself later.
    Try free now
  • 5
    Doctrine JSON ODM

    Doctrine JSON ODM

    An object document mapper for Doctrine ORM using JSON

    ...Did you ever dream of a tool creating powerful data models mixing traditional, efficient relational mappings with modern schema-less and NoSQL-like ones? With Doctrine JSON ODM, it's now possible to create and query such hybrid data models with ease. Thanks to modern JSON types of RDBMS, querying schema-less documents is easy, powerful and fast as hell (similar in performance to a MongoDB database)! You can even define indexes for those documents. Doctrine JSON ODM allows to store PHP objects as JSON documents in modern, dynamic columns of an RDBMS. It works with JSON and JSONB columns of PostgreSQL (>= 9.4) and the JSON column type of MySQL.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 6
    unioffice

    unioffice

    Pure go library for creating and processing Office Word documents

    unioffice is a library for creation of Office Open XML documents (.docx, .xlsx and .pptx). Its goal is to be the most compatible and highest-performance Go library for the creation and editing of docx/xlsx/pptx files. Every release of our libraries is automatically tested against known vulnerabilities and do not pass unless everything is remediated. All changes are carefully reviewed by our team.
    Downloads: 2 This Week
    Last Update:
    See Project
  • 7
    Pimcore

    Pimcore

    Open Source Data & Experience Management Platform

    No matter if you're dealing with unstructured web documents or structured data for MDM/PIM, you define the UI design (web documents by a template and structured data with an intuitive graphical editor), Pimcore knows how to persist the data efficiently and optimized for fast access. Due to the framework approach, Pimcore is very flexible and adapts perfectly to your needs. Built on top of the well-known Symfony Framework you have a solid and modern foundation for your project. ...
    Downloads: 1 This Week
    Last Update:
    See Project
  • 8
    ContextGem

    ContextGem

    ContextGem: Effortless LLM extraction from documents

    ContextGem is an open-source framework designed to simplify the extraction of structured data and insights from documents using large language models (LLMs). It provides a flexible, intuitive API that minimizes boilerplate code, enabling developers to build complex extraction workflows efficiently. ContextGem supports various document formats and integrates with multiple LLM providers, making it a versatile tool for tasks like contract analysis, anomaly detection, and information retrieval.​
    Downloads: 0 This Week
    Last Update:
    See Project
  • 9
    Umi-OCR

    Umi-OCR

    OCR software, free and offline

    ...It includes a highly efficient offline OCR engine with built-in multilingual recognition libraries, so users can extract text across multiple languages with high accuracy directly on their machines. The software supports flexible usage patterns including screenshot capture OCR, batch processing of large sets of images or documents, PDF parsing, QR code detection, and layout-aware paragraph output. Users can interact with Umi-OCR through a graphical interface, command-line options, or HTTP interfaces, making it adaptable to both casual desktop usage and programmatic automation. Because the project is open source, developers can inspect, modify, and extend its capabilities, and plugins allow for different recognition engines or enhanced features.
    Downloads: 49 This Week
    Last Update:
    See Project
  • $300 in Free Credit Towards Top Cloud Services Icon
    $300 in Free Credit Towards Top Cloud Services

    Build VMs, containers, AI, databases, storage—all in one place.

    Start your project in minutes. After credits run out, 20+ products include free monthly usage. Only pay when you're ready to scale.
    Get Started
  • 10
    Local File Organizer

    Local File Organizer

    An AI-powered file management tool that ensures privacy

    ...The project focuses on privacy-first file organization by performing all processing locally rather than sending data to external cloud services. It uses language and vision models to understand the contents of documents, images, and other file types so that files can be grouped intelligently according to their meaning or context. The system scans directories, extracts relevant information from files, and restructures folder hierarchies to make content easier to locate and manage. Through AI-driven analysis, the software can detect themes, topics, and metadata in files, allowing it to organize information in ways that traditional rule-based file managers cannot achieve. ...
    Downloads: 4 This Week
    Last Update:
    See Project
  • 11
    DeepSeek-OCR

    DeepSeek-OCR

    Contexts Optical Compression

    ...It supports local deployment, enabling organizations concerned about privacy or latency to run the pipeline on-premises rather than send sensitive documents to third-party cloud services. The codebase is written in Python with a focus on modularity: you can swap preprocessing, recognition, and post-processing components as needed for custom workflows.
    Downloads: 4 This Week
    Last Update:
    See Project
  • 12
    Typst

    Typst

    A new markup-based typesetting system that is powerful and easy

    ...Typst supercharges templates: They react to your content and format everything instantly while you type. Select from a wide range of community templates or create your own. Store shared documents in team workspaces to bring everyone in your working group on the same page. Whether in the classroom, the faculty office, or at home. Typst runs in your browser, so everyone on the team can just start writing.
    Downloads: 21 This Week
    Last Update:
    See Project
  • 13
    OpenAPI Generator

    OpenAPI Generator

    OpenAPI Generator allows generation of API client libraries

    ...Some generators support Inversion of Control, allowing you to iterate on design via your OpenAPI document without worrying about blowing away your entire domain layer when you regenerate code. Ever wanted to iteratively design a MySQL database, but writing table declarations was too tedious? OpenAPI documents allow you to convert the metadata about your API into some other format.
    Downloads: 7 This Week
    Last Update:
    See Project
  • 14
    Tongyi DeepResearch

    Tongyi DeepResearch

    Tongyi Deep Research, the Leading Open-source Deep Research Agent

    DeepResearch (Tongyi DeepResearch) is an open-source “deep research agent” developed by Alibaba’s Tongyi Lab designed for long-horizon, information-seeking tasks. It’s built to act like a research agent: synthesizing, reasoning, retrieving information via the web and documents, and backing its outputs with evidence. The model is about 30.5 billion parameters in size, though at any given token only ~3.3B parameters are active. It uses a mix of synthetic data generation, fine-tuning and reinforcement learning; supports benchmarks like web search, document understanding, question answering, “agentic” tasks; provides inference tools, evaluation scripts, and “web agent” style interfaces. ...
    Downloads: 5 This Week
    Last Update:
    See Project
  • 15
    Automatic text summarizer

    Automatic text summarizer

    Module for automatic summarization of text documents and HTML pages

    Sumy is an automatic text summarization library that provides multiple algorithms for extracting key content from documents and articles. Simple library and command line utility for extracting summary from HTML pages or plain texts. The package also contains a simple evaluation framework for text summaries. Implemented summarization methods are described in the documentation. I also maintain a list of alternative implementations of the summarizers in various programming languages.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 16
    DevOps Basics

    DevOps Basics

    Practical and document place for DevOps toolchain

    You are new to DevOps or want to learn some DevOps tools, or you are already a DevOps engineer, and you are looking for DevOps documents and a place to practice DevOps tools? This repository will assist you in enhancing your DevOps skills and serve as a bookmark for documents related to DevOps.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 17
    ShareDB

    ShareDB

    Realtime database backend based on Operational Transformation (OT)

    ShareDB is a realtime database backend based on Operational Transformation (OT) of JSON documents. It is the realtime backend for the DerbyJS web application framework. Realtime synchronization of any JSON document. Synchronous editing API with asynchronous eventual consistency. Projections to select desired fields from documents and operations. Middleware for implementing access control and custom extensions. Ideal for use in browsers or on the server.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 18
    eXist-db

    eXist-db

    eXist Native XML Database and Application Platform

    eXist-db is an open-source, native XML database and application platform that provides a powerful environment for storing, querying, and managing XML documents. It is designed for complex data management needs, offering XQuery, XSLT, and RESTful web services for interacting with structured data.
    Downloads: 12 This Week
    Last Update:
    See Project
  • 19
    OpenSign

    OpenSign

    The free & Open Source DocuSign alternative

    Welcome to OpenSign, the premier open source docusign alternative - document e-signing solution designed to provide a secure, reliable and free alternative to commercial esign platforms like DocuSign, PandaDoc, SignNow, Adobe Sign, Smartwaiver, SignRequest, HelloSign & Zoho sign. Our mission is to democratize the document signing process, making it accessible and straightforward for everyone.
    Downloads: 5 This Week
    Last Update:
    See Project
  • 20
    LongWriter

    LongWriter

    Unleashing 10,000+ Word Generation from Long Context LLMs

    ...LongWriter addresses this challenge by introducing a specialized dataset and training approach that encourages models to produce longer responses. The system uses an agent-based pipeline called AgentWrite that decomposes large writing tasks into smaller subtasks, allowing the model to produce long documents section by section. Researchers also created the LongWriter-6k dataset containing thousands of examples with outputs ranging from a few thousand to tens of thousands of words.
    Downloads: 2 This Week
    Last Update:
    See Project
  • 21
    Scribe.js

    Scribe.js

    JavaScript OCR and text extraction for images and PDFs

    Scribe.js is a JavaScript library that provides Optical Character Recognition (OCR) and text extraction capabilities for both images and PDF documents, aimed at developers who want to build OCR features directly into their applications. The library can take image files (such as PNG or JPEG) and recognize the text they contain, and it can also extract text from PDF files that either already contain text or are image-based scans, using modern web standards and WebAssembly under the hood. ...
    Downloads: 2 This Week
    Last Update:
    See Project
  • 22
    DocuSeal

    DocuSeal

    Open source DocuSign alternative

    ...Use embeddable code snippets to seamlessly implement the document signing workflows directly on your website or app. Build fillable document forms using our pixel-perfect HTML API, reducing the time for creating personalized documents. DocuSeal Pro users can connect their Gmail or Outlook accounts to send all signing invitation emails from their connected email address under a custom domain. All email messages sent via DocuSeal are customizable. With our step-by-step form users have a smooth signing experience on small mobile screens. Also, you can upload documents and request signatures directly from your device.
    Downloads: 2 This Week
    Last Update:
    See Project
  • 23
    crwlr

    crwlr

    Library for Rapid (Web) Crawler and Scraper Development

    ...For most real-world use cases, those two things go hand in hand, which is why this library helps with and combines both. A (web) crawler is a program that (down)loads documents and follows the links in it to load them as well. A crawler could just load actually all links it is finding (and is allowed to load according to the robots.txt file), then it would just load the whole internet (if the URL(s) it starts with are no dead end). Or it can be restricted to load only links matching certain criteria (on same domain/host, URL path starts with "/foo",...) or only to a certain depth. ...
    Downloads: 2 This Week
    Last Update:
    See Project
  • 24
    GROBID

    GROBID

    A machine learning software for extracting information

    GROBID is a machine learning library for extracting, parsing, and re-structuring raw documents such as PDF into structured XML/TEI encoded documents with a particular focus on technical and scientific publications. First developments started in 2008 as a hobby. In 2011 the tool has been made available in open source. Work on GROBID has been steady as a side project since the beginning and is expected to continue as such.
    Downloads: 2 This Week
    Last Update:
    See Project
  • 25
    Better BibTeX for Zotero

    Better BibTeX for Zotero

    Make Zotero effective for us LaTeX holdouts

    Better BibTeX (BBT) is a plugin for Zotero and Juris-M that makes it easier to manage bibliographic data, especially for people authoring documents using text-based toolchains (e.g. based on LaTeX / Markdown). Zotero does all its work in UTF-8 Unicode, which is absolutely the right thing to do. Unfortunately, for those shackled to BibTeX and who cannot (yet) move to BibLaTeX, unicode is a major PITA. Also, Zotero supports some simple HTML markup in your items that Bib(La)TeX won’t understand.
    Downloads: 38 This Week
    Last Update:
    See Project
MongoDB Logo MongoDB