Showing 169 open source projects for "documents"

View related business solutions
  • AI-generated apps that pass security review Icon
    AI-generated apps that pass security review

    Stop waiting on engineering. Build production-ready internal tools with AI—on your company data, in your cloud.

    Retool lets you generate dashboards, admin panels, and workflows directly on your data. Type something like “Build me a revenue dashboard on my Stripe data” and get a working app with security, permissions, and compliance built in from day one. Whether on our cloud or self-hosted, create the internal software your team needs without compromising enterprise standards or control.
    Try Retool free
  • Atera all-in-one platform IT management software with AI agents Icon
    Atera all-in-one platform IT management software with AI agents

    Ideal for internal IT departments or managed service providers (MSPs)

    Atera’s AI agents don’t just assist, they act. From detection to resolution, they handle incidents and requests instantly, taking your IT management from automated to autonomous.
    Learn More
  • 1
    PromethAI

    PromethAI

    Open-source framework that gives you AI Agents

    PromethAI-Backend is a backend framework for AI-driven automation and knowledge extraction. It is designed to integrate with large language models (LLMs) to provide AI-enhanced workflows, including content generation, summarization, and data analysis.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 2
    textacy

    textacy

    NLP, before and after spaCy

    textacy is a Python library for performing a variety of natural language processing (NLP) tasks, built on the high-performance spaCy library. With the fundamentals, tokenization, part-of-speech tagging, dependency parsing, etc., delegated to another library, textacy focuses primarily on the tasks that come before and follow after.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 3
    Clarity AI

    Clarity AI

    A Perplexity clone

    ...The codebase (TypeScript) leverages LLMs / embeddings to process user queries, retrieve relevant data or context, and respond conversationally; this makes it useful as a personal knowledge assistant, research helper, or Q&A front end over arbitrary datasets or web-available info. Because Clarity AI is open-source, developers can adapt the backend or retrieval logic, integrate their own data sources (databases, documents, APIs), and build custom assistants or knowledge bots tailored to their needs. It can serve as a starting platform for building AI-powered internal tools, knowledge bases, or public-facing “smart search” features.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 4
    Common Resource Grep - crgrep

    Common Resource Grep - crgrep

    Common Resource Grep

    CRGREP searches for matching text in databases, various document formats, archives and other difficult to access resources. A command line tool for name and content text matching in database tables, plain files, MS Office documents, PDF, archives, MP3 audio, image meta-data, scanned documents, maven dependencies and web resources. CRGREP will search resources within resources of any arbitrary combination or depth, so text within a document within a zip archive, and so on. Here you will find binary downloads and discussion (https://sourceforge.net/p/crgrep/discussion/) . ...
    Downloads: 4 This Week
    Last Update:
    See Project
  • Outgrown Windows Task Scheduler? Icon
    Outgrown Windows Task Scheduler?

    Free diagnostic identifies where your workflow is breaking down—with instant analysis of your scheduling environment.

    Windows Task Scheduler wasn't built for complex, cross-platform automation. Get a free diagnostic that shows exactly where things are failing and provides remediation recommendations. Interactive HTML report delivered in minutes.
    Download Free Tool
  • 5
    File-sharing-Bot

    File-sharing-Bot

    Telegram Bot to store Posts and Documents

    Telegram Bot to store posts and documents and it can be accessed by special links.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 6
    Automatic text summarizer

    Automatic text summarizer

    Module for automatic summarization of text documents and HTML pages

    Sumy is an automatic text summarization library that provides multiple algorithms for extracting key content from documents and articles. Simple library and command line utility for extracting summary from HTML pages or plain texts. The package also contains a simple evaluation framework for text summaries. Implemented summarization methods are described in the documentation. I also maintain a list of alternative implementations of the summarizers in various programming languages.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 7
    OpenKM Document Management - DMS

    OpenKM Document Management - DMS

    Document Management System and Content Management System

    OpenKM is a electronic document management system and record management system EDRMS ( DMS, RMS, CMS ). It provides modern and flexible architecture that meet today's IT demands, based on open technology (Java, Tomcat, GWT, Lucene, Hibernate, Spring and jBPM), powerful and scalable multiplatform application. OpenKM is a Web 2.0 application that works with Internet Explorer, Firefox, Safari and Opera. Can be configured in major DMBS like Oracle, PostgreSQL and MySQL among...
    Leader badge
    Downloads: 547 This Week
    Last Update:
    See Project
  • 8
    PRESENTA Lib

    PRESENTA Lib

    The javascript presentation library for the automation era

    PRESENTA Lib is a config-driven presentation library that creates modern web documents for the automation era. PRESENTA Lib requires a serializable object on purpose, to facilitate interoperability, and data transformation as well as fostering novel tools to create presentational documents. PRESENTA Lib is a javascript library without external dependencies. It comes as UMD, thus, you can install it in several ways.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 9
    e-Dokyumento

    e-Dokyumento

    e-Dokyumento is web-based Document Management System (DMS)

    e-Dokyumento is opensource web-based Document Management System (DMS) A Document Management which automates the basic office document workflow such as receiving, filing, routing, and approving through capturing (scanning), digitizing (OCR Reading), storing, tagging, and electronically routing and approving (e-signature) of electronic documents. # Demo : https://e-dokyumento.herokuapp.com/ https://edokyu.seillig.com/ (refer to Readme.md for the accounts) #Dockerhub: https://hub.docker.com/r/nelsonmaligro/edokyumento # Install using the ISO: 1. Download: https://sourceforge.net/projects/e-dokyumento/files/Releases/e-DokyuV3.iso/download 2. ...
    Downloads: 4 This Week
    Last Update:
    See Project
  • Rezku Point of Sale Icon
    Rezku Point of Sale

    Designed for Real-World Restaurant Operations

    Rezku is an all-inclusive ordering platform and management solution for all types of restaurant and bar concepts. You can now get a fully custom branded downloadable smartphone ordering app for your restaurant exclusively from Rezku.
    Learn More
  • 10
    gImageReader

    gImageReader

    A graphical frontend to tesseract-ocr

    gImageReader is a simple Gtk/Qt front-end to tesseract. Features include: - Import PDF documents and images from disk, scanning devices, clipboard and screenshots - Process multiple images and documents in one go - Manual or automatic recognition area definition - Recognize to plain text or to hOCR documents - Recognized text displayed directly next to the image - Post-process the recognized text, including spellchecking - Generate PDF documents from hOCR documents **Note**: This page is only a mirror for the downloads. ...
    Leader badge
    Downloads: 218 This Week
    Last Update:
    See Project
  • 11

    MITRE Annotation Toolkit

    A toolkit for managing and manipulating text annotations

    The MITRE Annotation Toolkit (MAT) is a suite of tools which can be used for automated and human tagging of annotations. Annotation is a process, used mostly by researchers in natural language processing, of enhancing documents with information about the various phrase types the documents contain. MAT supports both UI interaction and command-line interaction, and provides various levels of control over the overall annotation process. It can be customized for specific tasks (e.g., named entity identification, de-identification of medical records). ...
    Downloads: 12 This Week
    Last Update:
    See Project
  • 12
    Next Generation Programming

    Next Generation Programming

    Compose Software Without Writing Any Programing Code

    ...Please start by looking at examples from the website first. In this way, you can learn the features of the software and how to use the software in a very short time. More Information (Documents, Videos, Examples ...) : negep.epizy.com
    Downloads: 1 This Week
    Last Update:
    See Project
  • 13
    Paperless-ng

    Paperless-ng

    A supercharged version of paperless, scan, index and archive docs

    ...I feed documents right from the post box into the scanner and then shred them. Perhaps you might find it useful too. Paperless-ng is a fork of the original paperless project. It changes many things both on the surface and under the hood. Paperless-ng was created because I feel that these changes are too big to be pushed into the main repository right away.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 14
    DrQA

    DrQA

    Reading Wikipedia to Answer Open-Domain Questions

    ...It follows a two-stage pipeline: a fast document retriever first narrows down candidate articles, and a neural machine reader then predicts the exact answer span from those passages. The retriever relies on classic IR features (like TF-IDF and n-gram statistics) to remain lightweight and scalable to millions of documents. The reader is a neural model trained on supervised QA data to estimate start and end positions within a paragraph, and it can be adapted to new domains through fine-tuning or distant supervision. The repository includes scripts to build the Wikipedia index, train the reader, and evaluate end-to-end performance. DrQA popularized a practical recipe for combining IR and neural reading, and it remains a strong baseline for open-domain QA research and production prototypes.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 15
    Universal Data Tool

    Universal Data Tool

    Collaborate & label any type of data, images, text, or documents etc.

    An open-source tool and library for creating and labeling datasets of images, audio, text, documents and video in an open data format. The Universal Data Tool can be used by anyone on your team, no data or programming skills needed. Simplicity without sacrificing any powerful developer features and integrations. Use the Universal Data Tool directly from a web browser or with a Windows, Mac or Linux desktop application. Join a link to a collaborative session and see dataset samples from team members complete in real-time. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 16
    Vector AI

    Vector AI

    A platform for building vector based applications

    Vector AI is a framework designed to make the process of building production-grade vector-based applications as quick and easily as possible. Create, store, manipulate, search and analyze vectors alongside json documents to power applications such as neural search, semantic search, personalized recommendations etc. Image2Vec, Audio2Vec, etc (Any data can be turned into vectors through machine learning). Store your vectors alongside documents without having to do a db lookup for metadata about the vectors. Enable searching of vectors and rich multimedia with vector similarity search. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 17
    Parsr

    Parsr

    Transforms PDF, Documents and Images into Enriched Structured Data

    Parsr is an open-source document parsing tool that converts PDFs, scanned images, and other structured documents into structured, machine-readable data formats.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 18
    CC-Net

    CC-Net

    Tools to download and cleanup Common Crawl data

    ...It includes pipelines to fetch snapshots, extract text, de-duplicate, identify language, and apply quality filtering based on heuristics and language models. The outputs are intended for pretraining language models and for creating standardized corpora that can be reproduced or updated with new crawls. The repository documents practical concerns like HTTP failures, snapshot differences, and stats JSONs, reflecting community use across many languages. While powerful, the repo has been archived and is read-only, so users should expect to run it as-is or fork for maintenance. Even in archived state, issues and releases pages remain useful references for implementation details and dataset lineage.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 19
    GluonNLP

    GluonNLP

    NLP made easy

    GluonNLP is a toolkit that helps you solve NLP problems. It provides easy-to-use tools that helps you load the text data, process the text data, and train models. To facilitate both the engineers and researchers, we provide command-line-toolkits for downloading and processing the NLP datasets. Gluon NLP makes it easy to evaluate and train word embeddings. Here are examples to evaluate the pre-trained embeddings included in the Gluon NLP toolkit as well as example scripts for training...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 20
    fastText

    fastText

    Library for fast text classification and representation

    ...In order to build such classifiers, we need labeled data, which consists of documents and their corresponding categories (or tags, or labels).
    Downloads: 1 This Week
    Last Update:
    See Project
  • 21
    perkun

    perkun

    two experimental AI languages + zubr

    Two experimental AI languages - Perkun and its successor Wlodkowic. Attempt to maximize the expected value of the payoff function by appropriate choosing the actions (output variables values). The package contains also a tool called zubr - a Java code generator based on Perkun. Take also a look at my blog: http://pawel-biernacki.blogspot.fi/ For Windows users there is an installer: http://www.pawelbiernacki.net/perkun.msi
    Downloads: 1 This Week
    Last Update:
    See Project
  • 22

    Safe Harbor Deidentification

    Safe Harbor Deidentification for medical documents

    Phalanx - Deidentify Safe Harbor Deidentification Mode of Phalanx is an abridged pipeline of NLP annotators culminating in NER annotators which write output of text offsets. It uses the Safe Harbor deidentification method.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 23
    TIES

    TIES

    A smart search engine for medical documents

    TIES (Text Information Extraction System) is a clinical text search engine that uses Natural Language Processing techniques to extract medical concepts from free text clinical reports. It provides secure de-identified access to this information and has in built collaboration tools and honest broker functionality. It is licensed for academic use under the BSD license. For commercial use please contact Nexi at http://nexihub.com *** NOTICE: this software and forum are no longer...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 24
    TEXT2DATA

    TEXT2DATA

    Text Analytics Platform

    Bring Text Analytics Platform that uses NLP (Natural Language Processing) and Machine Learning to your work environment. Extract essential information from your text documents and let Artificial Intelligence save your time. Get detailed and agile reports on your unstructured data.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 25

    Arabic Corpus

    Text categorization, arabic language processing, language modeling

    The Arabic Corpus {compiled by Dr. Mourad Abbas ( http://sites.google.com/site/mouradabbas9/corpora ) The corpus Khaleej-2004 contains 5690 documents. It is divided to 4 topics (categories). The corpus Watan-2004 contains 20291 documents organized in 6 topics (categories). Researchers who use these two corpora would mention the two main references: (1) For Watan-2004 corpus ---------------------- M. Abbas, K. Smaili, D. Berkani, (2011) Evaluation of Topic Identification Methods on Arabic Corpora,JOURNAL OF DIGITAL INFORMATION MANAGEMENT,vol. 9, N. 5, pp.185-192. 2) For Khaleej-2004 corpus --------------------------------- M. ...
    Leader badge
    Downloads: 3 This Week
    Last Update:
    See Project