extraction free download

Showing 247 open source projects for "extraction"

1

Video-subtitle-extractor

A GUI tool for extracting hard-coded subtitles (hardsubs) from videos

Extracts hard-coded subtitles from videos and generates SRT files. Text recognition runs locally, so there is no need to apply for a third-party API. The deep learning-based framework covers both subtitle region detection and subtitle content extraction.

1 Review

Downloads: 90 This Week

Last Update: 2026-04-05
See Project
2

pdfly

CLI tool to extract (meta)data from PDF and manipulate PDF files

A Python library designed for manipulating PDF files with functionalities for extraction, transformation, and document generation.

Downloads: 8 This Week

Last Update: 2025-10-13
See Project
3

Sparrow

Structured data extraction and instruction calling with ML, LLM

...It combines several components, including OCR pipelines, vision-language models, and LLM-based reasoning modules to identify and extract meaningful data fields from heterogeneous document layouts. The architecture is modular, allowing developers to build customizable processing pipelines that integrate with external tools and data extraction frameworks. Sparrow also includes workflow orchestration tools that allow multiple extraction tasks to be combined into automated pipelines for large-scale document processing.

Downloads: 7 This Week

Last Update: 2026-06-05
See Project
4

ExtractThinker

ExtractThinker is a Document Intelligence library for LLMs

ExtractThinker is a tool designed to facilitate the extraction and analysis of information from various data sources, aiding in data processing and knowledge discovery.

Downloads: 6 This Week

Last Update: 2025-06-09
See Project
5

Unstract

No-code LLM Platform to launch APIs and ETL Pipelines

Unstract is a powerful open-source, no-code platform built to automate the extraction and structuring of unstructured documents using large language models and flexible workflows, enabling developers and data teams to turn messy files into organized JSON content without complex coding. It integrates a visual Prompt Studio environment where users can iteratively design extraction schemas, compare outputs from different models, and monitor costs and accuracy side by side, making it easier to refine prompts and extraction logic before deploying at scale. ...

Downloads: 4 This Week

Last Update: 3 hours ago
See Project
6

zpdf

Zero-copy PDF text extraction library written in Zig

zpdf is a high-performance PDF text extraction library written in Zig that focuses on speed, low overhead, and modern parsing techniques. It leans heavily on memory-mapped file reading and zero-copy patterns where possible, so it can scan large PDFs without repeatedly copying data around in memory. The library supports streaming extraction using efficient arena allocation, making it well suited for workloads that need to process big documents quickly or in batches.

Downloads: 2 This Week

Last Update: 2026-02-01
See Project
7

text-extract-api

Document (PDF, Word, PPTX ...) extraction and parse API

...The project focuses on converting complex files such as PDFs, images, scanned documents, and office files into structured plain text that can be processed by downstream applications or language models. Instead of requiring developers to integrate multiple document parsing libraries individually, the system centralizes text extraction capabilities into a unified API that standardizes the output. The platform supports automated processing pipelines that detect file types and apply the appropriate extraction method to obtain the most accurate text representation possible. It can be integrated into document analysis systems, knowledge retrieval tools, and AI pipelines that rely on clean textual data. ...

Downloads: 6 This Week

Last Update: 2026-03-05
See Project
8

ContextGem

ContextGem: Effortless LLM extraction from documents

ContextGem is an open-source framework designed to simplify the extraction of structured data and insights from documents using large language models (LLMs). It provides a flexible, intuitive API that minimizes boilerplate code, enabling developers to build complex extraction workflows efficiently. ContextGem supports various document formats and integrates with multiple LLM providers, making it a versatile tool for tasks like contract analysis, anomaly detection, and information retrieval.

Downloads: 5 This Week

Last Update: 2026-06-06
See Project
9

MinerU

A high-quality tool for converting PDF to Markdown and JSON

MinerU is an open-source, high-quality document extraction toolkit focused on converting PDFs (and other document formats) into structured Markdown and JSON. It leverages OCR and layout analysis to preserve semantic structure and metadata, ideal for research and data science workflows.

Downloads: 33 This Week

Last Update: 7 days ago
See Project
10

NLP

Open source NLP guide with models, methods, and real use cases

...It explains how machines process and understand human language, combining theory with practical examples. It covers core NLP concepts such as text representation, feature extraction, and model evaluation, alongside hands-on implementations using tools like Word2Vec, TF-IDF, and FastText. It also introduces topic modeling with LDA, keyword extraction techniques, and document similarity methods. NLP extends into real-world applications, including sentiment analysis and text classification, helping readers connect concepts to use cases. ...

Downloads: 7 This Week

Last Update: 5 days ago
See Project
11

Trafilatura

Python & command-line tool to gather text on the Web

Trafilatura is a Python package and command-line tool designed to gather text on the Web. It includes discovery, extraction, and text-processing components. Its main applications are web crawling, downloads, scraping, and extraction of main texts, metadata, and comments. It aims to stay handy and modular: no database is required, and the output can be converted to various commonly used formats. Going from raw HTML to essential parts can alleviate many problems related to text quality, first by avoiding the noise caused by recurring elements (headers, footers, links/blogroll, etc.) and second by including information such as author and date in order to make sense of the data. ...

Downloads: 5 This Week

Last Update: 2026-06-07
See Project
12

NeMo Retriever Library

Document content and metadata extraction microservice

...It processes various document types by splitting them into components such as text, tables, charts, and images, and then applies OCR and contextual analysis to convert them into structured data formats. The system is built on NVIDIA NIM microservices, enabling high-performance parallel processing and efficient handling of large datasets. It supports multiple extraction strategies for different document formats, balancing accuracy and throughput depending on the use case. Additionally, it can generate embeddings for extracted content and integrate with vector databases like Milvus, making it well-suited for retrieval-augmented generation pipelines.

Downloads: 1 This Week

Last Update: 2026-05-29
See Project
13

watercrawl

AI-ready web crawler that extracts and structures website content

WaterCrawl is an open source web crawling and data extraction platform designed to transform website content into structured data suitable for machine learning and AI workflows. It enables developers and researchers to crawl web pages, extract meaningful information, and convert it into formats that are easier to process and analyze. It provides a modern crawling system that can automatically navigate links, control crawl depth, and collect content from targeted sections of a website. ...

Downloads: 5 This Week

Last Update: 2026-05-20
See Project
14

Web RPA

Web Robotics Process Automation Tool

Web RPA is a browser automation framework designed to perform robotic process automation tasks directly within web environments. It enables users to automate repetitive actions such as form filling, data extraction, and workflow execution through programmable scripts. The system focuses on simplicity and flexibility, allowing automation without requiring complex infrastructure. It supports interaction with web elements, navigation flows, and dynamic content handling, making it suitable for scraping and automation scenarios. WebRPA can be integrated into larger systems or used as a standalone tool for automating browser-based operations. ...

Downloads: 8 This Week

Last Update: 2 days ago
See Project
15

book-to-skill

Turn any technical book PDF into a Claude Code skill

...The project is useful for transforming dense manuals, textbooks, internal documentation, or technical guides into practical agent-accessible knowledge. It includes an extraction script and a SKILL.md workflow that guides how the resulting content should be used. The goal is not simply to summarize a book, but to make its knowledge available during problem solving and implementation. book-to-skill is best suited for developers and researchers who want AI assistants to work from specific long-form source material.

Downloads: 8 This Week

Last Update: 2026-06-17
See Project
16

pyAudioAnalysis

Python Audio Analysis Library: Feature Extraction, Classification

...The project provides a collection of tools that allow developers to extract meaningful features from audio files and use those features for classification, segmentation, and analysis. The library supports multiple audio processing workflows, including feature extraction from raw audio signals, training of machine learning models, and automatic audio segmentation. It also includes utilities for visualizing audio features and analyzing patterns within sound recordings, which can be useful in applications such as speech recognition, music classification, and acoustic event detection. Because the library integrates machine learning algorithms with signal processing tools, it enables researchers to develop complete audio analysis pipelines using a single framework.

Downloads: 0 This Week

Last Update: 2026-03-10
See Project
17

video2robot

End-to-end pipeline converting generative videos

video2robot is an end-to-end open-source pipeline that converts generative video or prompt-driven motion content into executable humanoid robot motion sequences, enabling researchers and developers to go from high-level action descriptions or videos to robot-ready motion data. The pipeline supports both prompt-to-video generation using models like Veo/Sora and video upload processing, followed by human pose extraction through a 3D pose model and retargeting of that motion to robot joints using a general motion retargeting system. This workflow allows users to generate robot motion files that specify joint angles, root positions, and orientations that can be deployed on supported robot platforms (e.g., Unitree models). Video2robot includes scripts for each stage of the pipeline (generation, extraction, conversion, visualization) and can run as a CLI or through a basic web UI.

Downloads: 0 This Week

Last Update: 2026-01-30
See Project
18

yt-dlp-gui

A cross-platform GUI wrapper for yt-dlp written in PySide6

...Written in PySide6 (Python with Qt bindings), it wraps the powerful yt-dlp engine in a visual application that lets users paste video URLs, choose formats, apply presets, and start downloads with a click, while still exposing options for advanced tweaks via configuration files. The project supports preset definitions and global arguments through a config file, so users can customize their most common download workflows—like audio extraction, quality ranking, or embedding thumbnails—without retyping arguments each time. Downloads can be initiated from a portable app bundle or run manually with Python, making it flexible across platforms including Windows and Linux.

Downloads: 426 This Week

Last Update: 2026-01-20
See Project
19

claude-video

Give Claude the ability to watch any video

Claude Video is an agent skill that gives Claude and compatible coding assistants the ability to analyze video content. It accepts public video URLs or local video files, then extracts the information needed to answer user questions about what happened on screen and in the audio. The workflow checks captions first, downloads only what is necessary, extracts timestamped frames, and produces a transcript through native captions or Whisper fallback. It supports different detail levels so users...

Downloads: 3 This Week

Last Update: 2026-07-06
See Project
20

DINOv2

PyTorch code and models for the DINOv2 self-supervised learning

...The core promise is that a single pretrained backbone can transfer well to many downstream tasks—from linear probing on classification to retrieval, detection, and segmentation—often requiring little or no fine-tuning. The repository includes code for training, evaluating, and feature extraction, with utilities to run k-NN or linear evaluation baselines to assess representation quality. Pretrained checkpoints cover multiple model sizes so practitioners can trade accuracy for speed and memory depending on their deployment constraints.

Downloads: 6 This Week

Last Update: 2026-06-03
See Project
21

docext

An on-premises, OCR-free unstructured data extraction

docext is a document intelligence toolkit that uses vision-language models to extract structured information from documents such as PDFs, forms, and scanned images. The system is designed to operate entirely on-premises, allowing organizations to process sensitive documents without relying on external cloud services. Unlike traditional document processing pipelines that rely heavily on optical character recognition, docext leverages multimodal AI models capable of understanding both visual...

Downloads: 3 This Week

Last Update: 2026-03-12
See Project
22

Python Client For NLP Cloud

NLP Cloud serves high performance pre-trained or custom models for NER

NLP Cloud serves high-performance pre-trained or custom models for NER, sentiment analysis, classification, summarization, dialogue summarization, paraphrasing, intent classification, product description and ad generation, chatbot, grammar and spelling correction, keyword and keyphrase extraction, text generation, image generation, blog post generation, source code generation, question answering, automatic speech recognition, machine translation, language detection, semantic search, semantic similarity, tokenization, POS tagging, embeddings, and dependency parsing. It is ready for production and served through a REST API. ...

Downloads: 3 This Week

Last Update: 2024-11-27
See Project
23

Fapello.Downloader

NSFW Windows app to batch download images and videos

Fapello.Downloader is a Python-based desktop application designed to automate the bulk downloading of images and videos from the Fapello platform through a simple graphical interface. The tool allows users to paste a content URL and retrieve all associated media in a single operation, eliminating the need for manual downloading of individual files. It is built entirely in Python and leverages libraries such as BeautifulSoup and requests for scraping and data retrieval, while using a...

Downloads: 76 This Week

Last Update: 2026-03-18
See Project
24

Bespoke Curator

Synthetic data curation for post-training and data extraction

...It supports workflows where models are used to produce synthetic examples that can later be refined into reliable training datasets for reasoning, question answering, or structured information extraction tasks. Curator includes tools for monitoring data generation processes and managing dataset quality while large batches of examples are being created. The framework also integrates with multiple inference systems and APIs, allowing users to generate data using different model providers or open-source inference engines.

Downloads: 2 This Week

Last Update: 2026-03-14
See Project
25

Chonkie

The no-nonsense RAG chunking library

Chonkie is a chunking library for retrieval-augmented generation (RAG) pipelines: it splits text into chunks for downstream retrieval.

Downloads: 9 This Week

Last Update: 2025-03-01
See Project