Showing 24 open source projects for "semantic documents"

  • 1
    Semantra

    Multi-tool for semantic search

    Semantra is an open-source semantic search tool designed to help users explore large collections of documents by meaning rather than simple keyword matching. The software analyzes text and PDF documents stored locally and creates embeddings that allow queries to retrieve results based on conceptual similarity. It is primarily intended for individuals who need to extract insights from large document collections, including researchers, journalists, students, and historians. ...
    Downloads: 1 This Week
  • 2
    ChatGPT Retrieval Plugin

    The ChatGPT Retrieval Plugin lets you easily find personal documents

    The chatgpt-retrieval-plugin repository implements a semantic retrieval backend that lets ChatGPT (or GPT-powered tools) access private or organizational documents in natural language by combining vector search, embedding models, and plugin infrastructure. It can serve as a custom GPT plugin or function-calling backend so that a chat session can “look up” relevant documents based on user queries, inject those results into context, and respond more knowledgeably about a private knowledge base. ...
    Downloads: 0 This Week
  • 3
    LEANN

    Local RAG engine for private multimodal knowledge search on devices

    ...By recomputing embeddings during queries and using compact graph-based indexing structures, LEANN can maintain high search accuracy while minimizing disk usage. It aims to act as a unified personal knowledge layer that connects different types of data such as documents, code, images, and other local files into a searchable context for language models.
    Downloads: 0 This Week
  • 4
    Haystack

    Haystack is an open source NLP framework to interact with your data

    Apply the latest NLP technology to your own data with the use of Haystack's pipeline architecture. Implement production-ready semantic search, question answering, summarization and document ranking for a wide range of NLP applications. Evaluate components and fine-tune models. Ask questions in natural language and find granular answers in your documents using the latest QA models with the help of Haystack pipelines. Perform semantic search and retrieve ranked documents according to meaning, not just keywords! ...
    Downloads: 0 This Week
  • 5
    txtai

    Build AI-powered semantic search applications

    txtai executes machine-learning workflows to transform data and build AI-powered semantic search applications. Traditional search systems use keywords to find data. Semantic search applications have an understanding of natural language and identify results that have the same meaning, not necessarily the same keywords. Backed by state-of-the-art machine learning models, data is transformed into vector representations for search (also known as embeddings).
    Downloads: 0 This Week
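
The idea of ranking by meaning rather than by shared keywords can be illustrated with a minimal sketch. This is not txtai's API: the three-dimensional vectors below are hand-made stand-ins for real model embeddings, and the names `cosine`, `rank`, `TOY_DOCS`, and `TOY_QUERY` are hypothetical. It only shows how a query vector is compared with document vectors by cosine similarity.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank(query_vec, doc_vecs):
    """Return (document, score) pairs sorted from most to least similar."""
    scored = [(doc, cosine(query_vec, vec)) for doc, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Assumption: toy vectors invented for illustration, not produced by a model.
TOY_DOCS = {
    "feline care guide": [0.9, 0.1, 0.0],
    "tax filing checklist": [0.0, 0.2, 0.9],
    "dog training tips": [0.7, 0.3, 0.1],
}
# Stand-in vector for a query such as "how to look after a cat".
TOY_QUERY = [0.8, 0.2, 0.0]

results = rank(TOY_QUERY, TOY_DOCS)
for doc, score in results:
    print(f"{score:.3f}  {doc}")
```

The query shares no keyword with "feline care guide", yet that document ranks first because its vector points in nearly the same direction as the query vector. In a real system an embedding model would supply the vectors.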
  • 6
    docext

    An on-premises, OCR-free unstructured data extraction toolkit

    ...This allows the system to detect and extract structured elements such as tables, signatures, key fields, and layout information while maintaining semantic understanding of the document content. The toolkit can also convert complex documents into structured markdown representations that preserve formatting and contextual relationships.
    Downloads: 1 This Week
  • 7
    DeepSeek-OCR 2

    Visual Causal Flow

    DeepSeek-OCR-2 is the second-generation optical character recognition system developed to improve document understanding by introducing a “visual causal flow” mechanism, enabling the encoder to reorder visual tokens in a way that better reflects semantic structure rather than strict raster scan order. It is designed to handle complex layouts and noisy documents by giving the model causal reasoning capabilities that mimic human visual scanning behavior, enhancing OCR performance on documents with rich spatial structure. The repository provides model code and inference scripts that let researchers and developers run and benchmark the system on both images and PDFs, with support for batch evaluation and optimized pipelines leveraging vLLM and transformers.
    Downloads: 4 This Week
  • 8
    RAG API

    ID-based RAG FastAPI: Integration with Langchain and PostgreSQL

    rag_api is an open-source REST API for building Retrieval-Augmented Generation (RAG) systems using LLMs like GPT. It lets users index documents, search semantically, and retrieve relevant content for use in generative AI workflows. Designed for rapid prototyping, it is ideal for chatbot development, document assistants, and knowledge-based LLM apps.
    Downloads: 0 This Week
  • 9
    Paperless-AI

    AI-powered document analysis and tagging for Paperless-ngx

    Paperless-AI is an AI-powered extension designed to enhance document management within Paperless-ngx by automating analysis, classification, and organization tasks. It continuously monitors incoming documents and processes them using various AI backends, enabling automatic assignment of titles, tags, document types, and correspondents. It integrates with multiple OpenAI-compatible services as well as local models, giving users flexibility in how document intelligence is handled. A key capability is its use of retrieval-augmented generation, which enables semantic search and natural language interaction across an entire document archive. ...
    Downloads: 1 This Week
  • 10
    WeKnora

    LLM framework for document understanding and semantic retrieval

    WeKnora is an open source framework developed for deep document understanding and semantic information retrieval using large language models. It focuses on analyzing complex and heterogeneous documents by combining multiple processing stages such as multimodal document parsing, vector indexing, and intelligent retrieval. It follows the Retrieval-Augmented Generation (RAG) paradigm, where relevant document segments are retrieved and used by language models to generate accurate, context-aware responses. ...
    Downloads: 1 This Week
  • 11
    Pixeltable

    Data Infrastructure providing an approach to multimodal AI workloads

    Pixeltable is an open-source Python data infrastructure framework designed to support the development of multimodal AI applications. The system provides a declarative interface for managing the entire lifecycle of AI data pipelines, including storage, transformation, indexing, retrieval, and orchestration of datasets. Unlike traditional architectures that require multiple tools such as databases, vector stores, and workflow orchestrators, Pixeltable unifies these functions within a...
    Downloads: 1 This Week
  • 12
    kg-gen

    Knowledge Graph Generation from Any Text

    ...This allows the generated graphs to be denser, more coherent, and easier to use for downstream tasks such as retrieval-augmented generation, semantic search, and reasoning systems.
    Downloads: 0 This Week
  • 13
    FlagEmbedding

    Retrieval and Retrieval-augmented LLMs

    ...It also includes reranker models that refine search results by re-evaluating candidate documents using cross-encoder architectures, improving retrieval accuracy in complex queries.
    Downloads: 0 This Week
  • 14
    Controllable-RAG-Agent

    This repository provides an advanced RAG

    Controllable-RAG-Agent is an advanced Retrieval-Augmented Generation (RAG) system designed specifically for complex, multi-step question answering over your own documents. Instead of relying solely on simple semantic search, it builds a deterministic control graph that acts as the “brain” of the agent, orchestrating planning, retrieval, reasoning, and verification across many steps. The pipeline ingests PDFs, splits them into chapters, cleans and preprocesses text, then constructs vector stores for fine-grained chunks, chapter summaries, and book quotes to support nuanced queries. ...
    Downloads: 0 This Week
  • 15
    UForm

    Multi-Modal Neural Networks for Semantic Search, based on Mid-Fusion

    UForm is a Multi-Modal Model Inference package designed to encode Multi-Lingual Texts, Images, and, soon, Audio, Video, and Documents into a shared vector space. It comes with a set of homonymous pre-trained networks available on the HuggingFace portal and extends the transformers package to support Mid-fusion Models. Late-fusion models encode each modality independently, but into one shared vector space. Due to independent encoding, late-fusion models are good at capturing coarse-grained...
    Downloads: 0 This Week
  • 16
    PageIndex

    Document Index for Vectorless, Reasoning-based RAG

    PageIndex is an innovative open-source framework that reimagines retrieval-augmented generation (RAG) by eliminating conventional vector similarity search and instead building hierarchical semantic indexes that mirror a document’s natural structure. Rather than chunking text and embedding it into a vector database, PageIndex constructs a tree-structured index — similar to a detailed, AI-enhanced table of contents — that a large language model can traverse to locate the most relevant sections of long documents. This reasoning-driven retrieval aligns more naturally with how humans explore complex texts, improving relevance and traceability, especially in professional domains like financial reports, legal contracts, and technical manuals. ...
    Downloads: 0 This Week
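
The tree-structured retrieval described above can be sketched roughly as follows. This is not PageIndex's code or API: the `Node` class, the `overlap` scorer, and the toy report are all hypothetical, and a simple word-overlap count stands in for the language model that would decide which branch to follow.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One section in a table-of-contents style index."""
    title: str
    summary: str
    children: list["Node"] = field(default_factory=list)


def overlap(query, node):
    """Assumption: word overlap is a stand-in for an LLM relevance judgment."""
    query_words = set(query.lower().split())
    node_words = set((node.title + " " + node.summary).lower().split())
    return len(query_words & node_words)


def locate(root, query):
    """Walk from the root to a leaf, following the best-scoring child each step."""
    path = [root.title]
    node = root
    while node.children:
        node = max(node.children, key=lambda child: overlap(query, child))
        path.append(node.title)
    return path


# Toy index invented for illustration.
report = Node("Annual Report", "full company report", [
    Node("Financial Statements", "balance sheet income cash flow", [
        Node("Balance Sheet", "assets liabilities equity"),
        Node("Income Statement", "revenue expenses net income"),
    ]),
    Node("Risk Factors", "market legal operational risk"),
])

path = locate(report, "net income and revenue")
print(" > ".join(path))
```

The returned path doubles as a trace of how the section was reached, which is the kind of traceability the description attributes to reasoning-driven retrieval.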
  • 17
    SAG

    SQL-Driven RAG Engine

    ...Instead of relying on a static knowledge graph prepared in advance, the system automatically builds relational structures between entities while processing user queries. Documents are first decomposed into atomic semantic events, which are then represented using multidimensional natural language vectors. These vectors allow the system to identify relationships between concepts and construct a graph representation of knowledge at runtime. The architecture also includes a three-stage retrieval pipeline consisting of recall, expansion, and reranking steps to improve search accuracy. ...
    Downloads: 0 This Week
  • 18
    LOTUS

    AI-Powered Data Processing: Use LOTUS to process all of your datasets

    LOTUS is an open-source framework and query engine designed to enable efficient processing of structured and unstructured datasets using large language models. The system provides a declarative programming model that allows developers to express complex AI data operations using high-level commands rather than manually orchestrating model calls. It offers a Python interface with a Pandas-like API, making it familiar for data scientists and engineers already working with data analysis...
    Downloads: 0 This Week
  • 19
    Cherche

    Neural Search

    Cherche allows the creation of efficient neural search pipelines using retrievers and pre-trained language models as rankers. Cherche's main strength is its ability to build diverse and end-to-end pipelines from lexical matching, semantic matching, and collaborative filtering-based models. Cherche provides modules dedicated to summarization and question answering. These modules are compatible with Hugging Face's pre-trained models and fully integrated into neural search pipelines. Search is fully compatible with the collaborative filtering library Implicit. It is advantageous if you have a history associated with users and you want to retrieve / re-rank documents based on user preferences.
    Downloads: 0 This Week
  • 20
    DeepSearcher

    Open Source Deep Research Alternative to Reason and Search

    ...It is designed around the idea that high-quality answers require more than top-k retrieval, so it orchestrates multi-step search, evidence collection, and synthesis into a comprehensive response. The project integrates with vector databases (including Milvus and related options) so organizations can index internal documents and query them with semantic retrieval. It also supports flexible embeddings, making it easier to choose different embedding models depending on domain requirements, latency targets, or accuracy goals. The overall workflow aims to minimize hallucinations by grounding outputs in retrieved material and then applying structured reasoning over that evidence before generating a final report.
    Downloads: 0 This Week
  • 21
    marqo

    Tensor search for humans

    ...Marqo is a versatile and robust search and analytics engine that can be integrated into any website or application. Due to horizontal scalability, Marqo provides lightning-fast query times, even with millions of documents. Marqo helps you configure deep-learning models like CLIP to pull semantic meaning from images. It can seamlessly handle image-to-image, image-to-text and text-to-image search and analytics. Marqo adapts and stores your data in a fully schemaless manner. It combines tensor search with a query DSL that provides efficient pre-filtering. ...
    Downloads: 1 This Week
  • 22
    LexiFinder

    AI-powered semantic indexing: automating the creation of book indexes

    LexiFinder is a tool to generate analytic indexes from documents automatically. Given one or more source documents and a set of keywords, it extracts all nouns, compares them semantically to the keywords using a pretrained NLP model, and produces a structured, hierarchical index ready to be included in a book or manuscript. LexiFinder works in two ways: as a command-line tool for scripting, automation, and batch processing, and as a graphical application for a guided, point-and-click...
    Downloads: 0 This Week
  • 23
    Vector AI

    A platform for building vector based applications

    Vector AI is a framework designed to make building production-grade vector-based applications as quick and easy as possible. Create, store, manipulate, search, and analyze vectors alongside JSON documents to power applications such as neural search, semantic search, and personalized recommendations. Any data can be turned into vectors through machine learning (Image2Vec, Audio2Vec, etc.). Store your vectors alongside documents without having to do a database lookup for metadata about the vectors. Enable searching of vectors and rich multimedia with vector similarity search. ...
    Downloads: 0 This Week
  • 24
    Crow

    Computational Representation Of Whatever

    A platform for the integration and mining of complex and distributed data. It represents cross-linked semantic web documents as a network of software objects and offers easy ways to filter and sort them.
    Downloads: 0 This Week