Search Results for "pdf document text search engine"

Showing 96 open source projects for "pdf document text search engine"

1

Open Semantic Search

Open source semantic search and text analytics for large document sets

Open Semantic Search includes an ETL framework that can ingest documents, process them through analysis steps, and enrich the data with extracted information such as named entities and metadata. It also supports optical character recognition to extract text from images and scanned documents, including images embedded inside PDF files. It integrates text mining and analytics capabilities that allow users to examine relationships, topics, and structured data within document collections.

Downloads: 3 This Week

Last Update: 5 days ago
See Project
2

Search-Index

A persistent, network resilient, full text search library

Search-Index is a lightweight and fast JavaScript-based search engine that enables full-text search indexing and retrieval for web applications.

Downloads: 0 This Week

Last Update: 2025-03-12
See Project
3

Nano PDF Editor

Edit PDF files with Nano Banana

Nano PDF Editor is a minimalist, portable PDF viewer and toolkit that focuses on simplicity, speed, and ease of integration for applications that need basic PDF rendering without heavy dependencies. It provides core functionality such as page navigation, zooming, text selection, and rendering directly to native graphics surfaces, making it suitable for lightweight PDF viewing scenarios on desktop or embedded platforms. Designed to be easily embedded into larger software projects, Nano-PDF...

Downloads: 22 This Week

Last Update: 2026-02-05
See Project
4

AnyTXT Searcher

A Powerful Desktop Full-Text Search Engine, Just Like Local Google.

AnyTXT Searcher is a powerful file full-text search engine, a desktop search application for fast document retrieval. Just like a local disk Google search engine, much faster than Windows Search, it is your ideal desktop file content full-text search engine. It has a powerful document parsing engine built in, which extracts the text of commonly used file formats without installing any other software, and combines the built-in high-speed indexing system to store the metadata of the text. ...

14 Reviews

Downloads: 5,746 This Week

Last Update: 2025-06-19
See Project
5

Semantra

Multi-tool for semantic search

Semantra is an open-source semantic search tool designed to help users explore large collections of documents by meaning rather than simple keyword matching. The software analyzes text and PDF documents stored locally and creates embeddings that allow queries to retrieve results based on conceptual similarity. It is primarily intended for individuals who need to extract insights from large document collections, including researchers, journalists, students, and historians. ...

Downloads: 1 This Week

Last Update: 2026-03-11
See Project
6

Papermerge

Open Source Document Management System for Digital Archives

...OCR technology is a vital part of Papermerge. It extracts text from scanned documents and from PDF, JPEG, and TIFF files.

Downloads: 20 This Week

Last Update: 2025-07-24
See Project
7

PaperQA2

High accuracy RAG for answering questions from scientific documents

PaperQA2 is a package for doing high-accuracy retrieval augmented generation (RAG) on PDFs or text files, with a focus on the scientific literature. See our recent 2024 paper to see examples of PaperQA2's superhuman performance in scientific tasks like question answering, summarization, and contradiction detection. In this example we take a folder of research paper PDFs, magically get their metadata - including citation counts and a retraction check, then parse and cache PDFs into a...

Downloads: 1 This Week

Last Update: 2026-03-18
See Project
8

PageIndex

Document Index for Vectorless, Reasoning-based RAG

...This reasoning-driven retrieval aligns more naturally with how humans explore complex texts, improving relevance and traceability, especially in professional domains like financial reports, legal contracts, and technical manuals. The project includes example notebooks, scripts for tree generation and search, and support for multiple document formats including PDF and markdown, with tools designed to preserve context and semantic boundaries.

Downloads: 0 This Week

Last Update: 2026-04-08
See Project
9

ripgrep

Regex pattern directory search tool that respects your .gitignore

ripgrep is a line-oriented search tool that recursively searches the current directory for a regex pattern. By default, ripgrep will respect your .gitignore and automatically skip hidden files, hidden directories, and binary files. ripgrep has first-class support on Windows, macOS and Linux, with binary downloads available for every release. ripgrep is similar to other popular search tools like The Silver Searcher, ack and grep. ripgrep supports arbitrary input preprocessing filters which...

Downloads: 83 This Week

Last Update: 2025-10-22
See Project
10

Paperless-ngx

A community-supported supercharged version of paperless

Paperless-ngx is a community-supported open-source document management system that transforms your physical documents into a searchable online archive so you can keep, well, less paper.

Downloads: 18 This Week

Last Update: 2026-04-14
See Project
11

abogen

Generate audiobooks from EPUBs, PDFs and text with captions

abogen is a tool designed to generate audiobooks (or speech narrations) from textual sources such as EPUBs, PDFs, or plain text, with synchronized captions. In other words, it automates the pipeline of reading a digital book (or document), converting its text into speech via a TTS engine, and packaging the result into an audiobook format — likely along with timestamped captions or subtitles that align with the spoken audio. This can be very useful for accessibility, content consumption on...

Downloads: 6 This Week

Last Update: 2026-02-06
See Project
12

pdf-to-podcast

PDF to Podcast transforms any PDF document into a podcast-ready audio

PDF to Podcast transforms any PDF document into a podcast-ready audio episode using advanced AI text-to-speech (TTS) providers. Upload a PDF, select your preferred voice and provider, and receive an MP3 and a ready-to-use RSS feed for your podcast app.

Downloads: 0 This Week

Last Update: 2025-08-20
See Project
13

shuyuan

Reading book source

shuyuan is a project oriented around reading and knowledge consumption, especially targeting large-scale text content such as books, articles, or educational material. The name suggests “academy” or “study hall,” and the tool aims to help users ingest, organize, and manage reading content — possibly offering features like text parsing, annotation, metadata generation, translation, or storage for later reference. The repository is set up to support document ingestion, indexing, and maybe some...

Downloads: 1 This Week

Last Update: 2025-11-28
See Project
14

marqo

Tensor search for humans

A tensor-based search and analytics engine that seamlessly integrates with your applications, websites, and workflows. Marqo is a versatile and robust search and analytics engine that can be integrated into any website or application. Due to horizontal scalability, Marqo provides lightning-fast query times, even with millions of documents. Marqo helps you configure deep-learning models like CLIP to pull semantic meaning from images. It can seamlessly handle image-to-image, image-to-text and...

Downloads: 0 This Week

Last Update: 2026-04-02
See Project
15

SAG

SQL-Driven RAG Engine

SAG is an open-source SQL-driven retrieval-augmented generation engine that dynamically constructs knowledge graphs during query processing. Instead of relying on a static knowledge graph prepared in advance, the system automatically builds relational structures between entities while processing user queries. Documents are first decomposed into atomic semantic events, which are then represented using multidimensional natural language vectors. These vectors allow the system to identify...

Downloads: 0 This Week

Last Update: 2026-03-09
See Project
16

fess

Open source enterprise search server for websites, files, and data

Fess is an open source enterprise search server designed to provide powerful full-text search capabilities across multiple data sources. It enables organizations to quickly deploy a scalable search environment without requiring deep knowledge of underlying search technologies. Fess is built on top of OpenSearch and offers an integrated solution for crawling, indexing, and searching documents from websites, file systems, and various data stores. Fess includes a built-in crawler that can...

Downloads: 2 This Week

Last Update: 2026-04-18
See Project
17

SILE

The SILE Typesetter — Simon’s Improved Layout Engine

SILE is a typesetting system; its job is to produce beautiful printed documents. Conceptually, SILE is similar to TeX, from which it borrows some concepts and even syntax and algorithms, but the similarities end there. Rather than being a derivative of the TeX family, SILE is a new typesetting and layout engine written from the ground up using modern technologies and borrowing some ideas from graphical systems such as InDesign.

Downloads: 0 This Week

Last Update: 2025-05-31
See Project
18

Laila.Pdf

A .NET6 WPF Pdfium-based viewer control and printer object.

Experience seamless PDF viewing, printing, and interaction with this .NET 6 Pdfium-powered solution! Enjoy: ✅ Ultra-smooth scrolling for effortless navigation ✅ Precision text selection & copying ✅ Powerful search capabilities to find what you need instantly ✅ Basic PDF form support for interactive documents ✅ Reliable .NET 6 PDF printing for crisp, professional output Built on an enhanced version of PDFiumSharp, featuring added PDF form support for a more complete document experience...

Downloads: 0 This Week

Last Update: 2025-04-06
See Project
19

PandaWiki

AI-powered open source platform for building intelligent wiki bases

PandaWiki is an open source knowledge base system designed to help users build intelligent documentation platforms powered by large language models. It combines traditional wiki functionality with modern AI capabilities, allowing teams and individuals to create and manage product documentation, technical manuals, FAQs, and blog-style knowledge resources. PandaWiki provides tools for managing knowledge bases through an administrative interface while also generating public-facing wiki sites...

Downloads: 0 This Week

Last Update: 2026-04-08
See Project
20

OpenKM Document Management - DMS

Document Management System and Content Management System

OpenKM Community Edition is a free Document Management System (DMS) that helps businesses control the production, storage, management and distribution of electronic documents, boosting effectiveness and productivity. It integrates document management, collaboration and advanced search into one easy-to-use solution, including administration tools for user roles, access control, security levels, activity logs and automation setup. With OpenKM Community Edition you can: Collect information...

32 Reviews

Downloads: 631 This Week

Last Update: 2026-04-17
See Project
21

ChatGPT Academic

ChatGPT extension for scientific research work

A ChatGPT extension for scientific research work, with an experience specially optimized for polishing academic papers. It supports custom shortcut buttons, custom function plug-ins, markdown table display, dual display of TeX formulas, and complete code display. Newer features include project tree analysis for local Python/C++/Go projects, self-translation of the project source code, batch summarization of PDF and Word documents, and full-text translation of PDF papers. All buttons are...

Downloads: 0 This Week

Last Update: 2024-12-19
See Project
22

Create Index from PDF

PDF Indexing Script: Searches PDF for words, records page numbers

This Python script helps automate the process of creating an index for a PDF document. It reads a list of words from a text file, searches through each page of the PDF, and records the page numbers where each word appears. The script accounts for the first 24 pages of the PDF that use Roman numerals (i-xxiv) and adjusts the page numbers accordingly. It is designed to be case-insensitive, ensuring that variations in capitalization do not affect the search results. ...

Downloads: 2 This Week

Last Update: 2025-03-03
See Project
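The indexing procedure described for Create Index from PDF can be illustrated with a minimal sketch. This is not the project's actual script: it assumes the page text has already been extracted into a list of strings (one per physical page), it assumes whole-word matching, and the names `to_roman`, `page_label`, and `build_index` are hypothetical. Only the 24 Roman-numeral front-matter pages and the case-insensitive search come from the description above.

```python
import re


def to_roman(n: int) -> str:
    """Convert a positive integer to a lowercase Roman numeral."""
    numerals = [
        (1000, "m"), (900, "cm"), (500, "d"), (400, "cd"),
        (100, "c"), (90, "xc"), (50, "l"), (40, "xl"),
        (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i"),
    ]
    parts = []
    for value, symbol in numerals:
        while n >= value:
            parts.append(symbol)
            n -= value
    return "".join(parts)


def page_label(physical_page: int, front_matter: int = 24) -> str:
    """Map a 1-based physical page to its printed label.

    The first `front_matter` pages are labeled i, ii, ...; the pages
    after them restart at 1.
    """
    if physical_page <= front_matter:
        return to_roman(physical_page)
    return str(physical_page - front_matter)


def build_index(pages, words, front_matter=24):
    """Return {word: [page labels]} using case-insensitive matching.

    Assumption: a word matches only as a whole word (regex \\b
    boundaries); the original script may match differently.
    """
    index = {word: [] for word in words}
    for number, text in enumerate(pages, start=1):
        for word in words:
            pattern = r"\b" + re.escape(word) + r"\b"
            if re.search(pattern, text, flags=re.IGNORECASE):
                index[word].append(page_label(number, front_matter))
    return index
```

For example, given 24 front-matter pages followed by body pages, a word found on the first physical page and on physical page 25 would be indexed under the labels `i` and `1`.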
23

PdfgrepGui

This is a simple GUI for the command line tool grep and pdfgrep

THIS PROJECT HAS MOVED TO: https://sourceforge.net/projects/documentgrep/ This program is a GUI for the command line tools grep and pdfgrep. pdfgrep searches for text in multiple PDF files, and grep searches for text in multiple text files. You can use regular expressions for the search (https://en.wikipedia.org/wiki/Regular_expression). This GUI and the command line tools work without indexing. The following options are used: -i (ignore case) and -F (fixed strings), -n (Print page number or...

Downloads: 8 This Week

Last Update: 2026-01-13
See Project
24

DocumentGrep

Search text or a regular expression in multiple documents

This is a GUI for the command line tools grep, pdfgrep, pdftotext, unrtf, odt2txt, antiword, docx2txt, html2text and libreoffice. DocumentGrep searches for text in multiple file types. You can use regular expressions for the search (https://en.wikipedia.org/wiki/Regular_expression). This GUI and the command line tools work without indexing. Either the document is converted into text and processed by the RegExpr library of Andrey V. Sorokin or handled by the CLI command itself (like...

Downloads: 5 This Week

Last Update: 2026-01-13
See Project
25

rqlite

The lightweight, distributed relational database built on SQLite

rqlite is an easy-to-use, lightweight, distributed relational database, which uses SQLite as its storage engine. rqlite is simple to deploy, operating it is very straightforward, and its clustering capabilities provide you with fault-tolerance and high availability. rqlite is available for Linux, macOS, and Microsoft Windows. rqlite gives you the functionality of a rock solid, fault-tolerant, replicated relational database, but with very easy installation, deployment, and operation. With it...

Downloads: 1 This Week

Last Update: 2026-03-10
See Project