Search Results for "pdf to text" - Page 2

Showing 94 open source projects for "pdf to text"

1

LlamaParse

Parse files for optimal RAG

LlamaParse is a GenAI-native document parser that can parse complex document data for any downstream LLM use case (RAG, agents). Load 160+ data sources and data formats, from unstructured and semi-structured to structured data (APIs, PDFs, documents, SQL, etc.). Store and index your data for different use cases. Integrate with 40+ vector stores, document stores, graph stores, and SQL database providers.

Downloads: 5 This Week

Last Update: 2026-02-13
2

kb

A minimalist command line knowledge base manager

kb is a minimalist command-line knowledge base manager that gives users a fast, organized way to collect, store, search, and retrieve notes, documents, cheatsheets, procedures, and other artifacts directly from the terminal. It was created to solve the common problem of scattered text files and reference materials on disk that are hard to search or categorize, and it exposes a simple CLI with intuitive commands for adding, viewing, editing, and deleting knowledge items. Each...

Downloads: 0 This Week

Last Update: 2026-02-16
3

Google Open Source Project Style Guide

Chinese version of Google open source project style guide

...If the project you are modifying originates from Google, you may be directed to the English version of the project page to understand the style used by the project. The Chinese version of the project uses reStructuredText plain text markup syntax, and uses Sphinx to generate document formats such as HTML / CHM / PDF.

Downloads: 1 This Week

Last Update: 2024-12-08
4

PageIndex

Document Index for Vectorless, Reasoning-based RAG

...The project includes example notebooks, scripts for tree generation and search, and support for multiple document formats including PDF and markdown, with tools designed to preserve context and semantic boundaries.

Downloads: 3 This Week

Last Update: 5 days ago
5

Sphinx

Main repository for the Sphinx documentation builder

...It was originally created for the Python documentation, and it has excellent facilities for the documentation of software projects in a range of languages. Of course, this site is also created from reStructuredText sources using Sphinx! Output formats: HTML (including Windows HTML Help), LaTeX (for printable PDF versions), ePub, Texinfo, manual pages, and plain text. Semantic markup and automatic links for functions, classes, citations, glossary terms, and similar pieces of information. Easy definition of a document tree, with automatic links to siblings, parents, and children. General index as well as a language-specific module index. Automatic highlighting using the Pygments highlighter. ...

Downloads: 22 This Week

Last Update: 2025-12-31
6

Pdf_tools

✅ Image to PDF: Convert multiple image files into a single PDF. Supports formats: JPG, JPEG, PNG, BMP, TIFF.
✅ PDF Merger: Merge multiple PDF files into one. Reorder PDF files before merging.
✅ PDF Splitter: Split PDF files by range or into individual pages.
✅ Page Remover: Remove specific pages from a PDF.
✅ Fill & Sign: Add text and signature to a PDF.

Downloads: 6 This Week

Last Update: 2025-03-20
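
The "split by range" and "remove specific pages" features listed for Pdf_tools both come down to turning a user-supplied page selection into a list of page indices. The sketch below illustrates that step only. It is not taken from Pdf_tools: the `"1-3,5"` selection syntax and the function names `parse_page_ranges` and `pages_after_removal` are assumptions made for illustration, and no PDF library is involved.

```python
def parse_page_ranges(spec: str, page_count: int) -> list[int]:
    """Expand a selection such as "1-3,5" into sorted, unique 0-based page indices.

    The "1-3,5" syntax is an assumption for this sketch, not Pdf_tools' format.
    """
    selected = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas such as "1,,3"
        start_text, _, end_text = part.partition("-")
        start = int(start_text)
        end = int(end_text) if end_text else start
        if start < 1 or end > page_count or start > end:
            raise ValueError(f"invalid page range: {part!r}")
        # Pages are 1-based for the user, 0-based for most PDF libraries.
        selected.update(range(start - 1, end))
    return sorted(selected)


def pages_after_removal(remove_spec: str, page_count: int) -> list[int]:
    """Return the 0-based indices of the pages that remain after removal."""
    removed = set(parse_page_ranges(remove_spec, page_count))
    return [index for index in range(page_count) if index not in removed]
```

For a six-page document, `parse_page_ranges("1-3,5", 6)` yields `[0, 1, 2, 4]` and `pages_after_removal("2,4-5", 6)` yields `[0, 2, 5]`. The resulting indices could then be handed to whichever PDF library performs the actual split or removal.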
7

NeMo Retriever Library

Document content and metadata extraction microservice

NeMo Retriever Library is a scalable microservice framework designed for extracting, structuring, and enriching content from documents to support downstream generative AI applications. It processes various document types by splitting them into components such as text, tables, charts, and images, and then applies OCR and contextual analysis to convert them into structured data formats. The system is built on NVIDIA NIM microservices, enabling high-performance parallel processing and efficient...

Downloads: 1 This Week

Last Update: 2026-03-18
8

deepdoctection

A Repo For Document AI

deepdoctection is a document AI framework that applies deep learning techniques to analyze and extract structured data from scanned documents, PDFs, and images. It is a Python library that orchestrates document extraction and document layout analysis tasks using deep learning models. It does not implement models but enables you to build pipelines using well-established libraries for object detection, OCR, and selected NLP tasks, and provides an integrated framework for...

Downloads: 0 This Week

Last Update: 2026-04-09
9

DeepSeek-OCR 2

Visual Causal Flow

DeepSeek-OCR-2 is the second-generation optical character recognition system developed to improve document understanding by introducing a “visual causal flow” mechanism, enabling the encoder to reorder visual tokens in a way that better reflects semantic structure rather than strict raster scan order. It is designed to handle complex layouts and noisy documents by giving the model causal reasoning capabilities that mimic human visual scanning behavior, enhancing OCR performance on documents...

Downloads: 7 This Week

Last Update: 2026-02-03
10

ChatGPT Academic

ChatGPT extension for scientific research work

A ChatGPT extension for scientific research work, optimized for polishing academic papers. It supports custom shortcut buttons, custom function plug-ins, markdown table display, dual display of TeX formulas, and full code display. It also adds analysis of local Python/C++/Go project trees, self-translation of project source code, batch summarization of PDF and Word documents, and full-text translation of PDF papers. All buttons are generated dynamically by reading functional.py, so you can add custom functions at will and free up the clipboard. Markdown tables output by GPT are supported. If the output contains a formula, it is displayed in both TeX form and rendered form, which is convenient for copying and reading.

Downloads: 0 This Week

Last Update: 2024-12-19
11

myGPTReader

AI Slack bot for reading, summarizing, and chatting with content

myGPTReader is an AI-powered Slack bot designed to help users read, summarize, and interact with various types of digital content through conversational interfaces. It enables users to quickly understand web pages, documents, and even video content by transforming them into interactive discussions rather than static reading experiences. myGPTReader supports a wide range of file formats, including eBooks, PDFs, and text-based documents, making it flexible for both casual and professional use...

Downloads: 1 This Week

Last Update: 5 days ago
12

ArXiv MCP Server

A Model Context Protocol server for searching and analyzing arXiv

arxiv-mcp-server bridges AI assistants and the arXiv repository through a clean MCP interface, enabling search, metadata retrieval, and content access without bespoke scraping. With simple tools like “search” and “fetch,” an agent can find papers, pull abstracts, and download PDFs for downstream summarization or analysis. The project includes packaging and CI to publish to PyPI, plus tests and linting for reliability. Issue threads show feature requests such as extracting embedded LaTeX and...

Downloads: 1 This Week

Last Update: 2026-04-06
13

Controllable-RAG-Agent

This repository provides an advanced RAG

Controllable-RAG-Agent is an advanced Retrieval-Augmented Generation (RAG) system designed specifically for complex, multi-step question answering over your own documents. Instead of relying solely on simple semantic search, it builds a deterministic control graph that acts as the “brain” of the agent, orchestrating planning, retrieval, reasoning, and verification across many steps. The pipeline ingests PDFs, splits them into chapters, cleans and preprocesses text, then constructs vector...

Downloads: 0 This Week

Last Update: 2025-11-13
14

Jina

Build cross-modal and multimodal applications on the cloud

...Jina handles the infrastructure complexity, making advanced solution engineering and cloud-native technologies accessible to every developer. Build applications that deliver fresh insights from multiple data types such as text, image, audio, video, 3D mesh, PDF with Jina AI’s DocArray. Polyglot gateway that supports gRPC, Websockets, HTTP, GraphQL protocols with TLS. Intuitive design pattern for high-performance microservices. Seamless Docker container integration: sharing, exploring, sandboxing, versioning and dependency control via Jina Hub. Fast deployment to Kubernetes, Docker Compose and Jina Cloud. ...

Downloads: 0 This Week

Last Update: 2024-11-12
15

realwatermark

A Python application to add watermarks (text or image) to PDF files

A Python application that adds watermarks (text or image) to PDF files, converting them to images and back to PDF, with options for OCR and compression.

Downloads: 1 This Week

Last Update: 2025-01-27
16

Create Index from PDF

PDF Indexing Script: Searches PDF for words, records page numbers

This Python script helps automate the process of creating an index for a PDF document. It reads a list of words from a text file, searches through each page of the PDF, and records the page numbers where each word appears. The script accounts for the first 24 pages of the PDF that use Roman numerals (i-xxiv) and adjusts the page numbers accordingly. It is designed to be case-insensitive, ensuring that variations in capitalization do not affect the search results. ...

Downloads: 0 This Week

Last Update: 2025-03-03
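
The indexing procedure described for Create Index from PDF (read a word list, search every page case-insensitively, record page numbers, and offset for 24 Roman-numeral front-matter pages) can be sketched roughly as follows. This is not the project's code. It assumes the page text has already been extracted into a list of strings, one per page. Whole-word matching and reporting front-matter hits as Roman numerals are also assumptions, and the names `build_index`, `page_label`, and `to_roman` are hypothetical.

```python
import re
from collections import defaultdict

FRONT_MATTER_PAGES = 24  # pages i-xxiv, per the project description


def to_roman(n: int) -> str:
    """Convert a positive integer to a lowercase Roman numeral."""
    pairs = [(1000, "m"), (900, "cm"), (500, "d"), (400, "cd"),
             (100, "c"), (90, "xc"), (50, "l"), (40, "xl"),
             (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i")]
    out = []
    for value, numeral in pairs:
        while n >= value:
            out.append(numeral)
            n -= value
    return "".join(out)


def page_label(physical_page: int) -> str:
    """Map a 1-based physical page number to its printed page label."""
    if physical_page <= FRONT_MATTER_PAGES:
        return to_roman(physical_page)  # assumption: front matter keeps Roman labels
    return str(physical_page - FRONT_MATTER_PAGES)


def build_index(page_texts: list[str], words: list[str]) -> dict[str, list[str]]:
    """Record, for each word, the labels of the pages on which it appears."""
    index = defaultdict(list)
    for physical_page, text in enumerate(page_texts, start=1):
        for word in words:
            # Whole-word, case-insensitive match (whole-word is an assumption).
            pattern = r"\b" + re.escape(word) + r"\b"
            if re.search(pattern, text, flags=re.IGNORECASE):
                index[word].append(page_label(physical_page))
    return dict(index)
```

With 24 front-matter pages followed by three body pages, a word found on physical pages 1, 25, and 27 is indexed under the labels `i`, `1`, and `3`.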
17

Scribus

Powerful desktop publishing software

Scribus is an Open Source program that brings professional page layout to Linux, BSD UNIX, Solaris, OpenIndiana, GNU/Hurd, Mac OS X, OS/2 Warp 4, eComStation, and Windows desktops with a combination of press-ready output and new approaches to page design. Underneath a modern and user-friendly interface, Scribus supports professional publishing features, such as color separations, CMYK and spot colors, ICC color management, and versatile PDF creation.

143 Reviews

Downloads: 21,245 This Week

Last Update: 2026-04-13
18

LangChain Extract

Did you say you like data?

LangChain Extract is an open-source reference application designed to demonstrate how large language models can be used to extract structured data from unstructured text and document files. The project implements a lightweight web service that allows developers to define extraction schemas and apply them to various sources such as plain text, HTML, or PDF documents. Built using FastAPI and the LangChain framework, the application exposes a REST API that can process documents and return structured outputs that match user-defined JSON schemas. ...

Downloads: 0 This Week

Last Update: 2026-03-09
19

bridgex

Convert files like docx, xlsx, pptx, html, and more to Markdown

...
- Support for multiple input formats.
- Lightweight editing prior to saving.

Supported Formats 📂

Bridgex supports conversion of the following file formats:
- PDF (.pdf)
- Word (.docx)
- PowerPoint (.pptx)
- Excel (.xlsx, .xls, .csv)
- Outlook Messages (.msg)
- Text (.txt, .text)
- Markdown (.md, .markdown)
- JSON (.json, .jsonl)
- XML (.xml)
- RSS/Atom (.rss, .atom)
- HTML/MHTML (.html, .htm, .mhtml)
- ePub (.epub)
- Compressed files (.zip)
- Jupyter Notebooks (.ipynb)
- Other formats supported by Markitdown

Bridgex is not an IDE, text editor, Markdown editor, or document viewer.

Downloads: 3 This Week

Last Update: 2026-01-11
20

shortcutnotes

copy but NO Paste and make presentations with PDF support.

The Modern Notes & Presentation Creator is a Python desktop application built using CustomTkinter, designed for managing rich text notes and creating PowerPoint presentations. It features a clean, modern GUI with dark/light theme toggling and adjustable font sizes. Users can copy text from the clipboard, automatically add serial numbers, and organize content in a text area with support for Unicode, including Hindi text and emojis. The app allows saving notes as PPT, PDF, or TXT files, ensuring text formatting is preserved. ...

Downloads: 0 This Week

Last Update: 2024-11-09
21

pdf combiner merger converter splitter

PDF Combiner is a user-friendly, GUI-based tool built in

PDF Combiner is a user-friendly, free, open source GUI tool for combining, splitting, and annotating PDF files, converting PDF to Excel, PDF to Word, and images to PDF, and zipping and unzipping files. It is easy to use, supports inserting, deleting, and processing multiple files, and lets you adjust the order of files before combining.

1 Review

Downloads: 2 This Week

Last Update: 2024-05-03
22

MediaWiki to LaTeX

MediaWiki to LaTeX converts MediaWiki markup to LaTeX and generates a PDF, providing an export path from MediaWiki to LaTeX. It works with any project running MediaWiki, especially Wikipedia and Wikibooks.

1 Review

Downloads: 1 This Week

Last Update: 2026-01-01
23

Eugraphios

Free, portable desktop Computer-Assisted Translation (CAT) tool.

Eugraphios is a free, portable desktop Computer-Assisted Translation (CAT) tool designed for freelancers. Whether you're translating documents, websites, or software, Eugraphios is designed to meet your needs and exceed your expectations. With a focus on intuitive design and user-friendly interfaces, Eugraphios aims to eliminate the complexity that often hinders professionals and beginners in the translation field. By providing a seamless and enjoyable experience, this tool empowers users...

1 Review

Downloads: 2 This Week

Last Update: 2026-01-04
24

CiteFlow

Desktop research workspace for PDFs, notes, citations, bibliographies.

CiteFlow is a focused desktop research workspace for students, researchers, and academic writers who want to manage PDFs, notes, citations, and bibliographies in one place. Create project-based workspaces for essays, articles, reports, literature reviews, and long-form research. Import PDFs, read them inside the app, search within documents, compare files side by side, highlight key passages, and add page-based notes. CiteFlow can assist with DOI metadata detection, keeps citation history...

Downloads: 3 This Week

Last Update: 15 hours ago
25

LexiFinder

AI-powered semantic indexing: automating the creation of book indexes

...LexiFinder works in two ways: as a command-line tool for scripting, automation, and batch processing, and as a graphical application for a guided, point-and-click experience. Both interfaces share the same underlying engine and support the same features. Supported input formats are PDF, DOCX, and ODT. The index can be exported as plain text, JSON, CSV, or HTML.

Downloads: 0 This Week

Last Update: 2026-03-04