python pdf scaper free download

Showing 390 open source projects for "python pdf scaper"

1

PDF Arranger

Small python-gtk application, to merge or split PDFs

PDF Arranger is a small python-gtk application, which helps the user to merge or split PDF documents and rotate, crop and rearrange their pages using an interactive and intuitive graphical interface. It is a front end for pikepdf. PDF Arranger is a fork of Konstantinos Poulios’s PDF Shuffler (see Savannah or Sourceforge). It’s a humble attempt to make the project a bit more active.

1 Review

Downloads: 510 This Week

Last Update: 2026-05-23
See Project
2

OpenDataLoader PDF

PDF Parser for AI-ready data. Automate PDF accessibility

OpenDataLoader PDF is an open-source document processing system designed to convert complex PDF files into structured, AI-ready formats such as Markdown, JSON, and HTML while preserving layout, hierarchy, and semantic meaning. It focuses on enabling downstream use cases like retrieval-augmented generation (RAG), knowledge extraction, and document intelligence pipelines by maintaining accurate reading order and spatial metadata through bounding boxes. The tool combines deterministic parsing...

Downloads: 3 This Week

Last Update: 2026-07-14
See Project
3

Nano PDF Editor

Edit PDF files with Nano Banana

Nano PDF Editor is a minimalist, portable PDF viewer and toolkit that focuses on simplicity, speed, and ease of integration for applications that need basic PDF rendering without heavy dependencies. It provides core functionality such as page navigation, zooming, text selection, and rendering directly to native graphics surfaces, making it suitable for lightweight PDF viewing scenarios on desktop or embedded platforms. Designed to be easily embedded into larger software projects, Nano-PDF...

Downloads: 16 This Week

Last Update: 2026-02-05
See Project
4

py-pdf-parser

A Python tool to help extract information from structured PDFs

py-pdf-parser is a Python tool designed to help extract information from structured PDFs. It provides a simple interface to define parsing rules and extract data from PDF documents.

Downloads: 0 This Week

Last Update: 2025-04-28
See Project
5

Malicious PDF Generator

Generate a bunch of malicious pdf files with phone-home functionality

Generate ten different malicious PDF files with phone-home functionality. Can be used with Burp Collaborator or Interact.sh. Used for penetration testing and/or red-teaming etc. I created this tool because I needed a third-party tool to generate a bunch of PDF files with various links.

Downloads: 1 This Week

Last Update: 2026-04-20
See Project
6

OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files

OCRmyPDF adds an optical character recognition (OCR) text layer to scanned PDF files, allowing them to be searched. PDF is the best format for storing and exchanging scanned documents. Unfortunately, PDFs can be difficult to modify. OCRmyPDF makes it easy to apply image processing and OCR (recognized, searchable text) to existing PDFs.

Downloads: 119 This Week

Last Update: 2026-07-17
See Project
7

pikepdf

A Python library for reading and writing PDF, powered by QPDF

pikepdf is a Python library allowing the creation, manipulation, and repair of PDFs. It provides a Pythonic wrapper around the C++ PDF content transformation library, QPDF. Python + QPDF = “py” + “qpdf” = “pyqpdf”, which looks like a dyslexia test and is no fun to type. But say “pyqpdf” out loud, and it sounds like “pikepdf”. pikepdf is a library intended for developers who want to create, manipulate, parse, repair, and abuse the PDF format.

Downloads: 8 This Week

Last Update: 2026-07-10
See Project
8

PyPDF

A pure-python PDF library capable of splitting, merging, cropping

pypdf is a pure Python library for working with PDF files, allowing developers to split, merge, rotate, encrypt, and extract content from PDFs. It’s an actively maintained fork of PyPDF2, improving performance, compatibility, and support for modern PDF standards. Suitable for both automation scripts and full-featured applications, pypdf handles PDFs without requiring external dependencies.

Downloads: 6 This Week

Last Update: 2026-06-23
See Project
9

MinerU

A high-quality tool for converting PDF to Markdown and JSON

MinerU is an open-source, high-quality document extraction toolkit focused on converting PDFs (and other document formats) into structured Markdown and JSON. It leverages OCR and layout analysis to preserve semantic structure and metadata, ideal for research and data science workflows.

Downloads: 19 This Week

Last Update: 2026-07-10
See Project
10

borb

borb is a library for reading, creating and manipulating PDF files

borb is a pure Python library to read, write, and manipulate PDF documents. It represents a PDF document as a JSON-like data structure of nested lists, dictionaries, and primitives (numbers, strings, booleans, etc.). This is currently a one-man project, so the focus will always be on supporting common use cases over rare ones.

Downloads: 0 This Week

Last Update: 2026-06-14
See Project
11

pdfly

CLI tool to extract (meta)data from PDF and manipulate PDF files

A Python library designed for manipulating PDF files with functionalities for extraction, transformation, and document generation.

Downloads: 0 This Week

Last Update: 2025-10-13
See Project
12

fpdf2

Simple PDF generation for Python

fpdf2 is a library for simple & fast PDF document generation in Python. It is a fork and the successor of PyFPDF. Compared with other PDF libraries, fpdf2 is fast, versatile, easy to learn and to extend. It is also entirely written in Python and has very few dependencies: Pillow, defusedxml, and fontTools.

Downloads: 6 This Week

Last Update: 2026-02-28
See Project
13

xhtml2pdf

A library for converting HTML into PDFs using ReportLab

xhtml2pdf enables users to generate PDF documents from HTML content easily, with automated flow control such as pagination and keeping text together. The Python module can be used in any Python environment, including Django. The command-line tool is a stand-alone program that can be run from a shell.

Downloads: 0 This Week

Last Update: 2025-02-23
See Project
14

PyMuPDF

Python bindings for MuPDF's rendering library.

MuPDF is a lightweight PDF, XPS, and E-book viewer. MuPDF consists of a software library, command line tools, and viewers for various platforms. The renderer in MuPDF is tailored for high-quality anti-aliased graphics. It renders text with metrics and spacing accurate to within fractions of a pixel for the highest fidelity in reproducing the look of a printed page on the screen. The viewer is small, fast, yet complete. It supports many document formats, such as PDF, XPS, OpenXPS, CBZ, EPUB,...

Downloads: 10 This Week

Last Update: 2026-06-26
See Project
15

PDFStitcher

Code repository for PDFStitcher, a utility to stitch together PDFs

The open source PDF stitching software for sewists, by sewists. PDFStitcher is a utility for stitching together many PDF pages from one document into a single page. This is also called "N-Up" or page imposition. This program was created in order to convert sewing patterns into a convenient format for projecting, though it could be used to stitch together any PDF. Since version 0.4, it is also possible to select layers for inclusion/exclusion in the final output. Additionally, line properties...

Downloads: 5 This Week

Last Update: 2025-06-26
See Project
16

WeasyPrint

The awesome document factory

WeasyPrint is a smart solution helping people to create PDF documents. You can generate gorgeous statistical reports, invoices, tickets, and anything you want as long as you have some webdesign skills! Design your documents just as you design your websites! WeasyPrint follows the widely used HTML and CSS specifications from the W3C. You can use your usual web tools, languages and frameworks, but for print. Creating high-quality digital documents requires features that you love to use as...

Downloads: 24 This Week

Last Update: 2026-06-02
See Project
17

Unlimited OCR Works

Welcome the Era of One-shot Long-horizon Parsing

Unlimited-OCR is an OCR and document parsing model project focused on one-shot long-horizon parsing. It is designed to push OCR beyond short, isolated image recognition and into longer document understanding workflows. The project supports single-image parsing as well as multi-page and PDF-style parsing by converting pages into images. It provides inference paths for Hugging Face Transformers, vLLM, and SGLang, which gives users several deployment options. The repository also includes...

Downloads: 27 This Week

Last Update: 3 days ago
See Project
18

book-to-skill

Turn any technical book PDF into a Claude Code skill

book-to-skill is a Claude Code skill that turns technical books and documents into reusable AI reference skills. It extracts content from PDFs and EPUBs, then organizes the material so an assistant can study, reference, and apply it while working. The project is useful for transforming dense manuals, textbooks, internal documentation, or technical guides into practical agent-accessible knowledge. It includes an extraction script and a SKILL.md workflow that guides how the resulting content...

Downloads: 6 This Week

Last Update: 2026-06-17
See Project
19

TikZ

TikZ figures for concepts in physics/chemistry/ML

Collection of 111 standalone TikZ figures for illustrating concepts in physics, chemistry, and machine learning. Check out janosh.github.io to search, sort, open in Overleaf, and download figures (PDF/SVG/PNG) from this collection.

Downloads: 8 This Week

Last Update: 2025-01-25
See Project
20

python-toolbox

Offline, efficient & beautiful desktop toolbox

Offline, efficient & beautiful desktop toolbox: image compression, format conversion, image stitching, image-to-PDF, scaling, PDF merge, PDF split, file dedup

1 Review

Downloads: 0 This Week

Last Update: 2026-07-10
See Project
21

Docling

Get your documents ready for gen AI

...The project focuses on converting and parsing many document formats into a unified structured representation that downstream systems can easily consume. It supports advanced PDF understanding, including layout detection, table extraction, and reading order analysis, enabling high-fidelity document intelligence pipelines. Docling is designed to run efficiently on commodity hardware and can be used both as a Python API and a command-line tool. Its modular architecture allows developers to extend functionality and integrate specialized models for tasks such as OCR and audio transcription. ...

Downloads: 11 This Week

Last Update: 12 hours ago
See Project
22

ShredOS

ShredOS Disk Eraser 64 bit for all Intel 64 bit processors

ShredOS is a lightweight, bootable Linux-based operating system designed specifically for secure disk erasure and data destruction. It enables users to permanently wipe hard drives, SSDs, and NVMe devices using the powerful nwipe utility and multiple industry-recognized wiping methods. Compatible with both BIOS and UEFI systems, ShredOS supports PCs, servers, and Intel-based Macs running on 32-bit and 64-bit processors. The platform can erase multiple drives simultaneously while generating...

Downloads: 573 This Week

Last Update: 2026-07-16
See Project
23

Umi-OCR

OCR software, free and offline

Umi-OCR is a free and open-source optical character recognition (OCR) tool designed to provide fast, offline text extraction from images, screenshots, PDFs, and more without requiring a network connection. It includes a highly efficient offline OCR engine with built-in multilingual recognition libraries, so users can extract text across multiple languages with high accuracy directly on their machines. The software supports flexible usage patterns including screenshot capture OCR, batch...

Downloads: 106 This Week

Last Update: 2026-01-15
See Project
24

Ollama RAG Chatbot

Chat with multiple PDFs locally

Ollama RAG Chatbot is a local-first retrieval chatbot project built to let users chat with the contents of multiple PDF documents through a simple interface. The project is framed as an experiment, but its setup and packaging make it approachable for practical local use as well. It supports running on a local machine or in Kaggle, which lowers the barrier for users who want to test RAG workflows without building everything from scratch. Model support is flexible, with compatibility for both...

Downloads: 2 This Week

Last Update: 2026-04-20
See Project
25

WeebCentral Downloader

A powerful manga downloader for WeebCentral with both GUI and CLI

...Users can select specific chapters, adjust download speed, and configure output formats such as PDF or CBZ, making it adaptable to different reading preferences. The tool also incorporates progress tracking and background worker threads to ensure a responsive experience during large downloads. Its modular structure separates scraping logic, interface components, and configuration management, making it maintainable and extensible.

Downloads: 20 This Week

Last Update: 2026-03-24
See Project