Search Results for "indexing documents"

Showing 72 open source projects for "indexing documents"

1

fess

Open source enterprise search server for websites, files, and data

...It enables organizations to quickly deploy a scalable search environment without requiring deep knowledge of underlying search technologies. Fess is built on top of OpenSearch and offers an integrated solution for crawling, indexing, and searching documents from websites, file systems, and various data stores. Fess includes a built-in crawler that can collect content from sources such as databases, CSV files, and shared storage, making it suitable for centralized knowledge discovery. It supports indexing and searching across many document formats including office documents, PDFs, and compressed archives. ...

Downloads: 2 This Week

Last Update: 2026-04-18
See Project
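
The crawl, index, and search flow described for Fess can be illustrated with a toy inverted index. This is only a minimal sketch in plain Python, not Fess or OpenSearch code; the names `tokenize`, `build_index`, and `search`, and the sample corpus, are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every query term (AND semantics)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical corpus standing in for crawled content.
docs = {
    "a.pdf": "Quarterly report on search quality",
    "b.csv": "Search logs exported from the crawler",
    "c.txt": "Meeting notes about the quarterly budget",
}
index = build_index(docs)
print(sorted(search(index, "quarterly search")))  # ['a.pdf']
```

A production search server adds text extraction per file format, relevance scoring, and persistence on top of this basic term-to-document mapping.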
2

Cognita

Open source RAG framework for building scalable modular AI apps

...Cognita provides reusable components such as parsers, data loaders, embedders, retrievers, and query controllers, allowing teams to customize each stage of the RAG pipeline independently. It includes both a backend service and a frontend interface, enabling users to upload documents, experiment with configurations, and perform question-answering tasks interactively. Cognita supports incremental indexing, meaning it processes only new or updated data to reduce computational overhead and improve efficiency.

Downloads: 3 This Week

Last Update: 1 day ago
See Project
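
The incremental indexing mentioned in the Cognita description (processing only new or updated data) might look roughly like the sketch below. The listing does not say how Cognita detects changes, so comparing content hashes is an assumption here, and `content_hash`, `incremental_index`, and `index_fn` are hypothetical names rather than Cognita APIs.

```python
import hashlib


def content_hash(text):
    """Fingerprint a document so that changes can be detected cheaply."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def incremental_index(docs, seen_hashes, index_fn):
    """Index only new or changed documents; return the ids that were processed.

    seen_hashes maps doc id -> hash from the previous run and is updated in place.
    """
    processed = []
    for doc_id, text in docs.items():
        digest = content_hash(text)
        if seen_hashes.get(doc_id) == digest:
            continue  # unchanged since the last run, skip the expensive work
        index_fn(doc_id, text)
        seen_hashes[doc_id] = digest
        processed.append(doc_id)
    return processed


seen = {}
indexed = []  # stands in for a real parse/embed/store step


def record(doc_id, text):
    indexed.append(doc_id)


first = incremental_index({"a": "hello", "b": "world"}, seen, record)
second = incremental_index({"a": "hello", "b": "world!", "c": "new"}, seen, record)
print(first, second)  # ['a', 'b'] ['b', 'c']
```

On the second run only the changed document `b` and the new document `c` reach the indexing step, which is where the savings in computation come from.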
3

BuntDB

Database for Go with custom indexing and geospatial support

BuntDB is an embeddable, in-memory key/value database written in Go, with optional persistence to disk. It is built for scenarios where you want a lightweight, fast store (reads and writes in memory) but also durability (via append-only file format) and transactional semantics (ACID with single-writer, multiple-reader locking). Among its distinguishing features are support for custom indexing (even within JSON values), spatial (geospatial) indexes with support up to 20 dimensions, flexible...

Downloads: 0 This Week

Last Update: 2025-11-23
See Project
4

SemTools

Semantic search and document parsing tools for the command line

SemTools is an open-source command-line toolkit designed for document parsing, semantic indexing, and semantic search workflows. The project focuses on enabling developers and AI agents to process large document collections and extract meaningful semantic representations that can be searched efficiently. Built with Rust for performance and reliability, the toolchain provides fast processing of text and structured documents while maintaining low system overhead.

Downloads: 1 This Week

Last Update: 2026-03-13
See Project
5

bleve

A modern text indexing library for Go

Import one package, build an index with three lines of code, and query for documents with another three. Bleve includes general-purpose analyzers as well as pre-built text analyzers for the following languages: Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Persian, Portuguese, Romanian, Russian, Sorani, Spanish, Swedish, Thai, and Turkish. It also supports aggregating facet information across search results.

Downloads: 0 This Week

Last Update: 2025-12-16
See Project
6

LEANN

Local RAG engine for private multimodal knowledge search on devices

...LEANN introduces a storage-efficient approximate nearest neighbor index combined with on-the-fly embedding recomputation to avoid storing large embedding vectors. By recomputing embeddings during queries and using compact graph-based indexing structures, LEANN can maintain high search accuracy while minimizing disk usage. It aims to act as a unified personal knowledge layer that connects different types of data such as documents, code, images, and other local files into a searchable context for language models.

Downloads: 0 This Week

Last Update: 2026-03-13
See Project
7

Morphia

MongoDB object-document mapper in Java

MongoDB Object Document Mapping for the JVM. Bidirectional mapping to and from the database. Transparently map your Java entities to MongoDB documents and back.

Downloads: 0 This Week

Last Update: 2025-04-27
See Project
8

Open Semantic Search

Open source semantic search and text analytics for large document sets

Open Semantic Search is an open source research and analytics platform designed for searching, analyzing, and exploring large collections of documents using semantic search technologies. It provides an integrated search server combined with a document processing pipeline that supports crawling, text extraction, and automated analysis of content from many different sources. Open Semantic Search includes an ETL framework that can ingest documents, process them through analysis steps, and...

Downloads: 3 This Week

Last Update: 4 days ago
See Project
9

RAG API

ID-based RAG FastAPI: Integration with Langchain and PostgreSQL

rag_api is an open-source REST API for building Retrieval-Augmented Generation (RAG) systems using LLMs like GPT. It lets users index documents, search semantically, and retrieve relevant content for use in generative AI workflows. Designed for rapid prototyping, it is ideal for chatbot development, document assistants, and knowledge-based LLM apps.

Downloads: 3 This Week

Last Update: 4 days ago
See Project
10

WeKnora

LLM framework for document understanding and semantic retrieval

WeKnora is an open source framework developed for deep document understanding and semantic information retrieval using large language models. It focuses on analyzing complex and heterogeneous documents by combining multiple processing stages such as multimodal document parsing, vector indexing, and intelligent retrieval. It follows the Retrieval-Augmented Generation (RAG) paradigm, where relevant document segments are retrieved and used by language models to generate accurate, context-aware responses. This approach enables the system to provide more reliable answers by grounding model reasoning in the content of uploaded documents. ...

Downloads: 0 This Week

Last Update: 2026-04-15
See Project
11

MiniRAG

Making RAG Simpler with Small and Open-Sourced Language Models

MiniRAG is a lightweight retrieval-augmented generation tool designed to bring the benefits of RAG workflows to smaller datasets, edge environments, and constrained compute settings by simplifying embedding, indexing, and retrieval. It extracts text from documents, code, or other structured inputs and converts it into embeddings using efficient models, then stores these vectors for fast nearest-neighbor search without requiring huge databases or separate vector servers. When a query is issued, MiniRAG retrieves the most relevant contexts and feeds them into a generative model to produce an answer that is grounded in the source material rather than hallucinated. ...

Downloads: 0 This Week

Last Update: 2026-02-03
See Project
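
The embed, store, and retrieve loop in the MiniRAG description can be sketched in a few lines. This is not MiniRAG's code: the `embed` function below is a toy bag-of-words stand-in for a real embedding model, the search is a brute-force cosine comparison, and all names and sample chunks are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a sparse term-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(store, query, k=1):
    """Return the k stored chunks whose vectors are closest to the query."""
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


chunks = [
    "The index is rebuilt every night",
    "Invoices are archived after ninety days",
    "Embeddings are stored in a local file",
]
store = [(chunk, embed(chunk)) for chunk in chunks]  # in-memory vector store

context = retrieve(store, "when is the index rebuilt")
prompt = "Answer using only this context:\n" + "\n".join(context)
print(context)  # ['The index is rebuilt every night']
```

In a real pipeline the `prompt` would be passed to a generative model; that call is left out here because it depends on the model being used.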
12

Scribe.js

JavaScript OCR and text extraction for images and PDFs

...In addition to simple text extraction, Scribe.js supports writing or injecting a high-quality invisible text layer back into PDFs, effectively making them searchable and improving usability for indexing or accessibility. It is written in modern ECMAScript Modules (ESM), so it can be imported in both browser and Node.js environments without a build step, though browser usage requires same-origin hosting of the files.

Downloads: 2 This Week

Last Update: 2026-03-14
See Project
13

Elastica

Elastica is a PHP client for Elasticsearch

Elastica is a PHP client for Elasticsearch, providing a rich object-oriented API to interact with and query Elasticsearch indexes. It simplifies the process of building, updating, and querying search indexes, making Elasticsearch integration more accessible to PHP developers. Elastica is suitable for applications requiring full-text search, analytics, and real-time data exploration.

Downloads: 0 This Week

Last Update: 2025-10-01
See Project
14

Pixeltable

Data Infrastructure providing an approach to multimodal AI workloads

Pixeltable is an open-source Python data infrastructure framework designed to support the development of multimodal AI applications. The system provides a declarative interface for managing the entire lifecycle of AI data pipelines, including storage, transformation, indexing, retrieval, and orchestration of datasets. Unlike traditional architectures that require multiple tools such as databases, vector stores, and workflow orchestrators, Pixeltable unifies these functions within a...

Downloads: 0 This Week

Last Update: 2026-04-17
See Project
15

mgrep

A calm, CLI-native way to semantically grep everything, like code

This project is a modern, semantic search tool that brings the simplicity of traditional command-line grep to the world of natural language and multimodal content, enabling users to search across codebases, documents, PDFs, and even images using meaning-aware queries. Built with a focus on calm CLI experiences, it lets you index and query your local files with semantic understanding, delivering results that are relevant to your intent rather than simple pattern matches, which is especially powerful in large or diverse projects. It also includes features such as background indexing to keep your search index up to date without interrupting your workflow and web search integration to expand the scope of queries beyond local files. ...

Downloads: 0 This Week

Last Update: 2026-04-14
See Project
16

PageIndex

Document Index for Vectorless, Reasoning-based RAG

PageIndex is an innovative open-source framework that reimagines retrieval-augmented generation (RAG) by eliminating conventional vector similarity search and instead building hierarchical semantic indexes that mirror a document’s natural structure. Rather than chunking text and embedding it into a vector database, PageIndex constructs a tree-structured index — similar to a detailed, AI-enhanced table of contents — that a large language model can traverse to locate the most relevant sections...

Downloads: 0 This Week

Last Update: 2026-04-08
See Project
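
The tree-structured index that PageIndex describes can be pictured as a table of contents that is walked from the top down. The sketch below is an illustration only and not PageIndex's implementation: the `Node` class, the sample document, and `locate` are hypothetical, and the `score` function uses simple word overlap as a placeholder for the judgment a large language model would make at each level.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    """One section of a document, with a short summary and its subsections."""
    title: str
    summary: str
    children: list = field(default_factory=list)


def words(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def score(query, node):
    """Placeholder for an LLM relevance judgment: count of shared words."""
    return len(words(query) & words(node.title + " " + node.summary))


def locate(root, query):
    """Descend from the root, following the best-scoring child at each level."""
    path = [root.title]
    node = root
    while node.children:
        node = max(node.children, key=lambda child: score(query, child))
        path.append(node.title)
    return path


doc = Node("Annual Report", "Company overview", [
    Node("Financials", "Revenue, costs, and cash flow", [
        Node("Revenue", "Revenue by region and product"),
        Node("Costs", "Operating costs and headcount"),
    ]),
    Node("Operations", "Factories, logistics, and staffing", [
        Node("Logistics", "Shipping routes and warehouses"),
    ]),
])
print(locate(doc, "revenue by region"))  # ['Annual Report', 'Financials', 'Revenue']
```

No embeddings or vector database appear anywhere in this flow; retrieval is a sequence of choices over the document's own hierarchy.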
17

Haystack

Haystack is an open source NLP framework to interact with your data

Apply the latest NLP technology to your own data with the use of Haystack's pipeline architecture. Implement production-ready semantic search, question answering, summarization and document ranking for a wide range of NLP applications. Evaluate components and fine-tune models. Ask questions in natural language and find granular answers in your documents using the latest QA models with the help of Haystack pipelines. Perform semantic search and retrieve ranked documents according to meaning,...

Downloads: 3 This Week

Last Update: 5 days ago
See Project
18

LangChain-ChatGLM-Webui

Automatic question answering for local knowledge bases based on LLM

LangChain-ChatGLM-Webui is an open-source web interface that integrates the ChatGLM large language model with the LangChain framework to create an interactive conversational AI platform. The project provides a graphical interface that allows users to interact with language models through chat sessions while also connecting those models to external knowledge sources. It supports retrieval-augmented generation workflows that enable the system to answer questions based on local documents or...

Downloads: 0 This Week

Last Update: 2026-03-05
See Project
19

Supermemory

Memory engine and app that is extremely fast and scalable

Supermemory is an ambitious and extensible AI-powered personal knowledge management system that aims to help users capture, organize, retrieve, and reason over information in a manner that mimics human memory structures. The platform allows individuals to ingest text, documents, and other content forms, then uses advanced retrieval and embedding techniques to index and relate information intelligently so that users can recall relevant knowledge in context rather than just by keyword match....

Downloads: 1 This Week

Last Update: 2 days ago
See Project
20

QMD

mini cli search engine for your docs, knowledge bases, etc.

QMD is a powerful and lightweight command-line tool that acts as an on-device search engine for your personal knowledge base, allowing you to index and search files like Markdown notes, meeting transcripts, technical documentation, and other text collections without depending on cloud services. Designed to keep all search activity local, it combines classic full-text search techniques with modern semantic features such as vector similarity and hybrid ranking so that queries return not just...

Downloads: 3 This Week

Last Update: 2026-04-05
See Project
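
The QMD description mentions hybrid ranking over full-text and vector results without saying how the two are merged. One common way to combine ranked lists is reciprocal rank fusion (RRF); the sketch below uses it purely as an assumption, and the function name, file names, and the constant `k=60` (a conventional default for RRF) are not taken from QMD.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of ids into one list.

    Each id earns 1 / (k + rank) from every list it appears in, so items
    ranked well by more than one retriever rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a keyword search and a vector search.
keyword_hits = ["notes.md", "todo.md", "api.md"]
vector_hits = ["api.md", "notes.md", "design.md"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)  # ['notes.md', 'api.md', 'todo.md', 'design.md']
```

Because RRF works on ranks rather than raw scores, it avoids having to normalize keyword scores and vector similarities onto a shared scale.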
21

Everything cURL

The book documenting the curl project, the curl tool, libcurl

Everything curl is an extensive, continuously maintained book that documents the entire curl ecosystem: the curl command-line tool, the libcurl library, the project’s history and development practices, and practical guidance for using and contributing to curl. The project is written as an open source book (CC-BY-4.0) and is available in multiple formats and locations, including an online website, PDF, and ePub so readers can pick the format that suits them. Content ranges from...

Downloads: 2 This Week

Last Update: 1 day ago
See Project
22

txtai

Build AI-powered semantic search applications

txtai executes machine-learning workflows to transform data and build AI-powered semantic search applications. Traditional search systems use keywords to find data. Semantic search applications have an understanding of natural language and identify results that have the same meaning, not necessarily the same keywords. Backed by state-of-the-art machine learning models, data is transformed into vector representations for search (also known as embeddings). Innovation is happening at a rapid...

Downloads: 0 This Week

Last Update: 2026-03-17
See Project
23

Kernel Memory

Research project. A Memory solution for users, teams, and applications

Kernel Memory is an open-source reference architecture developed by Microsoft to help developers build memory systems for AI applications powered by large language models. The project focuses on enabling applications to store, index, and retrieve information so that AI systems can incorporate external knowledge when generating responses. It supports scenarios such as document ingestion, semantic search, and retrieval-augmented generation, allowing language models to answer questions using...

Downloads: 0 This Week

Last Update: 2026-03-06
See Project
24

OCRBase

MD/.JSON Document OCR and structured data extraction API

OCRBase is a self-hostable document OCR and structured extraction system built to turn PDFs into machine-usable outputs at scale, aiming to bridge the gap between raw text extraction and production-ready pipelines. Instead of treating OCR as a one-off script, it presents an API-driven workflow where documents are submitted as jobs and processed through a queue-based architecture that can handle high throughput. The core output is designed for downstream automation, producing structured results like JSON according to user-defined schemas while also providing readable formats like Markdown for human review or indexing. It includes real-time job progress updates via WebSockets, which makes it easier to integrate into UIs, dashboards, or ingestion systems where users need feedback on long-running document processing.

Downloads: 1 This Week

Last Update: 2026-04-16
See Project
25

tldw Server

Your Personal Research Multi-Tool

tldw-server (mirror) is a mirrored distribution of an open-source backend service designed to store, process, and serve summarized information extracted from long pieces of content. The name “tldw” reflects the phrase “too long; didn’t watch,” which refers to tools that condense lengthy videos, articles, or documents into concise summaries. The server component typically acts as the core infrastructure that manages summaries, metadata, and retrieval operations for client applications or user...

Downloads: 0 This Week

Last Update: 2026-03-15
See Project


