Search Results for "python pdf scaper" - Page 2

Showing 333 open source projects for "python pdf scaper"

View related business solutions
  • AI-generated apps that pass security review Icon
    AI-generated apps that pass security review

    Stop waiting on engineering. Build production-ready internal tools with AI—on your company data, in your cloud.

    Retool lets you generate dashboards, admin panels, and workflows directly on your data. Type something like “Build me a revenue dashboard on my Stripe data” and get a working app with security, permissions, and compliance built in from day one. Whether on our cloud or self-hosted, create the internal software your team needs without compromising enterprise standards or control.
    Try Retool free
  • Gen AI apps are built with MongoDB Atlas Icon
    Gen AI apps are built with MongoDB Atlas

    Build gen AI apps with an all-in-one modern database: MongoDB Atlas

    MongoDB Atlas provides built-in vector search and a flexible document model so developers can build, scale, and run gen AI apps without stitching together multiple databases. From LLM integration to semantic search, Atlas simplifies your AI architecture—and it’s free to get started.
    Start Free
  • 1
    RenderCV

    RenderCV

    LaTeX CV generator from a YAML/JSON input file

    RenderCV is a LaTeX CV/resume framework. It allows you to create a high-quality CV as a PDF from a YAML file with full Markdown syntax support and complete control over the LaTeX code. RenderCV offers built-in LaTeX and Markdown templates ready to produce high-quality CVs. However, the templates are entirely arbitrary and can easily be updated to leverage RenderCV's capabilities with your custom CV themes.
    Downloads: 2 This Week
    Last Update:
    See Project
  • 2
    Pix2Text

    Pix2Text

    Open-Source Python3 tool for recognizing layouts, tables, and math

    ...A free alternative to Mathpix, empowering seamless conversion of visual content into text-based representations. 80+ languages are supported. Pix2Text (P2T) aims to be a free and open-source Python alternative to Mathpix, and it can already accomplish Mathpix's core functionality. Pix2Text (P2T) can recognize layouts, tables, images, text, and mathematical formulas, and integrate all of these contents into Markdown format. P2T can also convert an entire PDF file (which can contain scanned images or any other format) into Markdown format.
    Downloads: 21 This Week
    Last Update:
    See Project
  • 3
    Memvid

    Memvid

    Video-based AI memory library. Store millions of text chunks in MP4

    Memvid encodes text chunks as QR codes within MP4 frames to build a portable “video memory” for AI systems. This innovative approach uses standard video containers and offers millisecond-level semantic search across large corpora with dramatically less storage than vector DBs. It's self-contained—no DB needed—and supports features like PDF indexing, chat integration, and cloud dashboards.
    Downloads: 7 This Week
    Last Update:
    See Project
  • 4
    Zerox OCR

    Zerox OCR

    PDF to Markdown with vision models

    A dead simple way of OCR-ing a document for AI ingestion. Documents are meant to be a visual representation after all. With weird layouts, tables, charts, etc. The vision models just make sense. ZeroX is an open-source machine learning framework designed for fast experimentation and production deployment, optimized for speed and ease of use.
    Downloads: 8 This Week
    Last Update:
    See Project
  • Find Hidden Risks in Windows Task Scheduler Icon
    Find Hidden Risks in Windows Task Scheduler

    Free diagnostic script reveals configuration issues, error patterns, and security risks. Instant HTML report.

    Windows Task Scheduler might be hiding critical failures. Download the free JAMS diagnostic tool to uncover problems before they impact production—get a color-coded risk report with clear remediation steps in minutes.
    Download Free Tool
  • 5
    ChatGPT Academic

    ChatGPT Academic

    ChatGPT extension for scientific research work

    ChatGPT extension for scientific research work, specially optimized academic paper polishing experience, supports custom shortcut buttons, supports custom function plug-ins, supports markdown table display, double display of Tex formulas, complete code display function, new local Python/C++/Go project tree Analysis function/Project source code self-translation ability, newly added PDF and Word document batch summary function/PDF paper full-text translation function. All buttons are dynamically generated by reading functional.py, you can add custom functions at will, and liberate the pasteboard. Support for markdown tables output by GPT. ...
    Downloads: 1 This Week
    Last Update:
    See Project
  • 6
    PageIndex

    PageIndex

    Document Index for Vectorless, Reasoning-based RAG

    PageIndex is an innovative open-source framework that reimagines retrieval-augmented generation (RAG) by eliminating conventional vector similarity search and instead building hierarchical semantic indexes that mirror a document’s natural structure. Rather than chunking text and embedding it into a vector database, PageIndex constructs a tree-structured index — similar to a detailed, AI-enhanced table of contents — that a large language model can traverse to locate the most relevant sections...
    Downloads: 2 This Week
    Last Update:
    See Project
  • 7
    changedetection.io

    changedetection.io

    The best free open source website change detection and restock service

    Loved by smart shoppers, data journalists, research engineers, data scientists, security researchers, and more. From simply monitoring website pages that have a change (such as watching prices, and restocking notifications), to deep inspection such as PDF text support, JSON and XML monitoring, and extensive text triggers. Monitor out-of-stock products and get alerts when those products are back in stock, get restock alerts via Discord, Slack, email, and many other platforms. Using the...
    Downloads: 1 This Week
    Last Update:
    See Project
  • 8
    PDF-utility

    PDF-utility

    PDF Utility is a tool designed to efficiently manipulate PDF files

    Digna PDF Utility is a tool designed to efficiently manipulate PDF documents. It offers a range of functionalities including adding page numbers, deleting unwanted pages, merging multiple PDFs into a single file, converting PDF to DOCX and vice versa, protect a PDF file with password and displaying PDF content.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 9
    Papermerge

    Papermerge

    Open Source Document Management System for Digital Archives

    Papermerge is an open source document management system (DMS) primarily designed for archiving and retrieving your digital documents. Instead of having piles of paper documents all over your desk, office or drawers - you can quickly scan them and configure your scanner to directly upload to Papermerge DMS. Store, organize and index scanned documents in PDF, JPEG and TIFF formats. Instantly find relevant information using full text, tags and metadata-based search. Papermerge is free and...
    Downloads: 22 This Week
    Last Update:
    See Project
  • AI-powered service management for IT and enterprise teams Icon
    AI-powered service management for IT and enterprise teams

    Enterprise-grade ITSM, for every business

    Give your IT, operations, and business teams the ability to deliver exceptional services—without the complexity. Maximize operational efficiency with refreshingly simple, AI-powered Freshservice.
    Try it Free
  • 10
    Google Open Source Project Style Guide

    Google Open Source Project Style Guide

    Chinese version of Google open source project style guide

    Each larger open source project has its own style guide, a series of conventions on how to write code for the project (sometimes more arbitrary). When all the code maintains a consistent style, it is more important when understanding large code bases. easy. The meaning of "style" covers a wide range, from "variables use camelCase" to "never use global variables" to "never use exceptions". The English version of the project maintains the programming style guidelines used in Google. If the...
    Downloads: 5 This Week
    Last Update:
    See Project
  • 11
    DeepSeek-OCR 2

    DeepSeek-OCR 2

    Visual Causal Flow

    DeepSeek-OCR-2 is the second-generation optical character recognition system developed to improve document understanding by introducing a “visual causal flow” mechanism, enabling the encoder to reorder visual tokens in a way that better reflects semantic structure rather than strict raster scan order. It is designed to handle complex layouts and noisy documents by giving the model causal reasoning capabilities that mimic human visual scanning behavior, enhancing OCR performance on documents...
    Downloads: 19 This Week
    Last Update:
    See Project
  • 12
    Remarkable for Linux

    Remarkable for Linux

    The Markdown Editor for Linux

    With Live Preview you can see your changes as you make them. There is no need to export first to check your syntax. This is accompanied by synchronized scrolling. Remarkable has Github Flavoured Markdown. This has a simple, easy-to-learn syntax with features like checklists, highlighting, links, images and more. Remarkable allows you to export your files to PDF and HTML from within the app. The HTML code is even prettified and PDFs have a TOC. You can style your markdown documents however...
    Downloads: 2 This Week
    Last Update:
    See Project
  • 13
    Pdf_tools
    ✅ Image to PDF Convert multiple image files into a single PDF. Supports formats: JPG, JPEG, PNG, BMP, TIFF. ✅ PDF Merger Merge multiple PDF files into one. Reorder PDF files before merging. ✅ PDF Splitter Split PDF files by range or into individual pages. ✅ Page Remover Remove specific pages from a PDF. ✅ Fill & Sign Add text and signature to a PDF. Preview the PDF and adjust signature position.
    Downloads: 5 This Week
    Last Update:
    See Project
  • 14
    ArchiveBox

    ArchiveBox

    Open source self-hosted web archiving

    ArchiveBox is a powerful, self-hosted internet archiving solution to collect, save, and view websites offline. Without active preservation effort, everything on the internet eventually disappears or degrades. Archive.org does a great job as a centralized service, but saved URLs have to be public, and they can't save every type of content. ArchiveBox is an open source tool that lets organizations & individuals archive both public & private web content while retaining control over their data....
    Downloads: 9 This Week
    Last Update:
    See Project
  • 15
    rmarkdown

    rmarkdown

    Dynamic Documents for R

    R Markdown is an R package for creating dynamic, reproducible documents that combine code (R, Python, SQL, etc.), results (figures, tables), and narrative text. Built on Knitr and Pandoc, it supports generating HTML, PDF, Word, slideshows, dashboards, and more. It’s widely used in data science and reproducible reporting workflows.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 16
    DB-GPT

    DB-GPT

    Revolutionizing Database Interactions with Private LLM Technology

    DB-GPT is an experimental open-source project that uses localized GPT large models to interact with your data and environment. With this solution, you can be assured that there is no risk of data leakage, and your data is 100% private and secure.
    Downloads: 9 This Week
    Last Update:
    See Project
  • 17
    LlamaParse

    LlamaParse

    Parse files for optimal RAG

    LlamaParse is a GenAI-native document parser that can parse complex document data for any downstream LLM use case (RAG, agents). Load in 160+ data sources and data formats, from unstructured, and semi-structured, to structured data (API's, PDFs, documents, SQL, etc.) Store and index your data for different use cases. Integrate with 40+ vector stores, document stores, graph stores, and SQL db providers.
    Downloads: 4 This Week
    Last Update:
    See Project
  • 18
    JupyterLab

    JupyterLab

    JupyterLab computational environment

    JupyterLab is the next-generation web-based user interface for Project Jupyter. Try it on Binder. JupyterLab follows the Jupyter Community Guides. JupyterLab enables you to work with documents and activities such as Jupyter notebooks, text editors, terminals, and custom components in a flexible, integrated, and extensible manner. You can arrange multiple documents and activities side by side in the work area using tabs and splitters. Documents and activities integrate with each other,...
    Downloads: 272 This Week
    Last Update:
    See Project
  • 19
    Jupyter Notebook Tools for Sphinx

    Jupyter Notebook Tools for Sphinx

    Sphinx source parser for Jupyter notebooks

    nbsphinx is a Sphinx extension that provides a source parser for *.ipynb files. Custom Sphinx directives are used to show Jupyter Notebook code cells (and of course their results) in both HTML and LaTeX output. Un-evaluated notebooks – i.e. notebooks without stored output cells – will be automatically executed during the Sphinx build process.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 20
    Tarjamento de Dados Pessoais e Sigilosos

    Tarjamento de Dados Pessoais e Sigilosos

    Ferramenta de Tarjamento de Dados Pessoais e Sigilosos

    ATENÇÃO: Alguns antivirus (ex: Windows Defender) detectam, equivocadamente, o arquivo como um malware. Basta desabilitar ou adicionar exceção e baixar tranquilamente. Um dos maiores desafios da edição de arquivos PDF é algo extremamente simples: o tarjamento de dados. A falta de uma funcionalidade de tarjamento adequada impacta, por exemplo, na falta de transparência pública, uma vez que documentos inteiros são indisponibilizados, quando, na realidade, bastava que o dado sigiloso...
    Downloads: 6 This Week
    Last Update:
    See Project
  • 21
    deepdoctection

    deepdoctection

    A Repo For Document AI

    DeepDoctection is a document AI framework that applies deep learning techniques to analyze and extract structured data from scanned documents, PDFs, and images. deepdoctection is a Python library that orchestrates document extraction and document layout analysis tasks using deep learning models. It does not implement models but enables you to build pipelines using highly acknowledged libraries for object detection, OCR and selected NLP tasks and provides an integrated frameworks for...
    Downloads: 1 This Week
    Last Update:
    See Project
  • 22
    WingetUI

    WingetUI

    A better UI for your package managers

    WingetUI is an intuitive, responsive and nice-looking user interface for the most common (command-line) package managers for Windows 10 and Windows 11. With this program, you'll be able to easily download, install, update and uninstall any software that's published on those package managers. The available packages range from well-known software such as browsers, pdf readers and other common programs to other less-known utilities such as vim for windows, which I bet you didn't know it...
    Downloads: 45 This Week
    Last Update:
    See Project
  • 23
    Napkin

    Napkin

    An Infinitely Large Napkin

    Napkin (also titled “An Infinitely Large Napkin”) is a lightweight, semi-formal introduction to higher mathematics, aimed at giving readers a bird’s-eye view over various mathematical fields. It is not a polished textbook full of full proofs; rather it offers clean definitions, theorem statements, intuitive motivations, and informal sketches of why things work, with the goal of building conceptual understanding. The coverage spans undergraduate and early graduate topics, designed to show how...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 24
    abogen

    abogen

    Generate audiobooks from EPUBs, PDFs and text with captions

    abogen is a tool designed to generate audiobooks (or speech narrations) from textual sources such as EPUBs, PDFs, or plain text, with synchronized captions. In other words, it automates the pipeline of reading a digital book (or document), converting its text into speech via a TTS engine, and packaging the result into an audiobook format — likely along with timestamped captions or subtitles that align with the spoken audio. This can be very useful for accessibility, content consumption on...
    Downloads: 6 This Week
    Last Update:
    See Project
  • 25
    PaperQA2

    PaperQA2

    High accuracy RAG for answering questions from scientific documents

    PaperQA2 is a package for doing high-accuracy retrieval augmented generation (RAG) on PDFs or text files, with a focus on the scientific literature. See our recent 2024 paper to see examples of PaperQA2's superhuman performance in scientific tasks like question answering, summarization, and contradiction detection. In this example we take a folder of research paper PDFs, magically get their metadata - including citation counts and a retraction check, then parse and cache PDFs into a...
    Downloads: 5 This Week
    Last Update:
    See Project