Showing 219 open source projects for "extraction"

View related business solutions
  • $300 in Free Credit Towards Top Cloud Services Icon
    $300 in Free Credit Towards Top Cloud Services

    Build VMs, containers, AI, databases, storage—all in one place.

    Start your project in minutes. After credits run out, 20+ products include free monthly usage. Only pay when you're ready to scale.
    Get Started
  • Gemini 3 and 200+ AI Models on One Platform Icon
    Gemini 3 and 200+ AI Models on One Platform

    Access Google's best plus Claude, Llama, and Gemma. Fine-tune and deploy from one console.

    Build generative AI apps with Vertex AI. Switch between models without switching platforms.
    Start Free
  • 1
    X-osint

    X-osint

    Open source OSINT tool for gathering data on emails, phones, and IPs

    X-osint is an open source intelligence framework designed to collect and analyze publicly available information from multiple sources. It focuses on gathering useful and credible data about entities such as phone numbers, email addresses, and IP addresses using a range of automated OSINT techniques. It provides investigators and researchers with a centralized interface for running information-gathering tasks that would normally require multiple separate tools. X-osint can also perform...
    Downloads: 41 This Week
    Last Update:
    See Project
  • 2
    HeartMuLa

    HeartMuLa

    A Family of Open Sourced Music Foundation Models

    ...The project also includes HeartCodec, a music codec optimized for high reconstruction fidelity, enabling efficient tokenization and reconstruction workflows that are critical for training and generation pipelines. For text extraction from audio, it provides HeartTranscriptor, a Whisper-based model tuned specifically for lyrics transcription, which helps bridge generated or recorded audio back into structured text. It also introduces HeartCLAP, which aligns audio and text into a shared embedding space.
    Downloads: 17 This Week
    Last Update:
    See Project
  • 3
    Unredact

    Unredact

    A simple tool for reading in poorly redacted documents

    Unredact is a specialized tool that attempts to reconstruct redacted or obscured text in images, PDFs, or screenshots using a combination of image processing and generative AI inference to suggest plausible completions of blurred, black-boxed, or jumbled content. Unlike traditional optical character recognition (OCR), which only reads visible text, Unredact focuses on inferring missing content where redaction has been applied by analyzing surrounding context, font characteristics, and...
    Downloads: 16 This Week
    Last Update:
    See Project
  • 4
    yt-dlp

    yt-dlp

    A youtube-dl fork with additional features and fixes

    yt-dlp is a youtube-dl fork based on the now inactive youtube-dlc. The main focus of this project is adding new features and patches while also keeping up to date with the original project
    Downloads: 618 This Week
    Last Update:
    See Project
  • MongoDB Atlas runs apps anywhere Icon
    MongoDB Atlas runs apps anywhere

    Deploy in 115+ regions with the modern database for every enterprise.

    MongoDB Atlas gives you the freedom to build and run modern applications anywhere—across AWS, Azure, and Google Cloud. With global availability in over 115 regions, Atlas lets you deploy close to your users, meet compliance needs, and scale with confidence across any geography.
    Start Free
  • 5
    TextFSM

    TextFSM

    Python module for parsing semi-structured text into python tables

    ...By defining parsing logic through reusable template files, TextFSM transforms unstructured text into structured data like lists or tables without requiring complex regular expression code. Each template defines states, transitions, and regex patterns that determine how to interpret text line by line, enabling precise extraction of key information from varied sources. This modular approach allows users to maintain a library of templates for different data formats, improving automation in network operations and system administration. Widely used in network automation workflows, TextFSM integrates easily with Python scripts, making it an essential tool for engineers.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 6
    DINOv3

    DINOv3

    Reference PyTorch implementation and models for DINOv3

    DINOv3 is the third-generation iteration of Meta’s self-supervised visual representation learning framework, building upon the ideas from DINO and DINOv2. It continues the paradigm of learning strong image representations without labels using teacher–student distillation, but introduces a simplified and more scalable training recipe that performs well across datasets and architectures. DINOv3 removes the need for complex augmentations or momentum encoders, streamlining the pipeline while...
    Downloads: 15 This Week
    Last Update:
    See Project
  • 7
    River ML

    River ML

    Online machine learning in Python

    River is a Python library for online machine learning. It aims to be the most user-friendly library for doing machine learning on streaming data. River is the result of a merger between creme and scikit-multiflow.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 8
    pyLoad

    pyLoad

    The free and open-source Download Manager written in pure Python

    ...It uses a plugin-driven architecture that supports hundreds of hosters, link decrypters, and extensions that extend its capabilities. pyLoad includes a modern web-based interface that allows users to remotely manage downloads from a browser, enabling full control over queues, links, and download settings. The system supports features such as premium account integration, automated captcha solving, and link extraction from container files or encrypted link lists.
    Downloads: 14 This Week
    Last Update:
    See Project
  • 9
    GLM-OCR

    GLM-OCR

    Accurate × Fast × Comprehensive

    GLM-OCR is an open-source multimodal optical character recognition (OCR) model built on a GLM-V encoder–decoder foundation that brings robust, accurate document understanding to complex real-world layouts and modalities. Designed to handle text recognition, table parsing, formula extraction, and general information retrieval from documents containing mixed content, GLM-OCR excels across major benchmarks while remaining highly efficient with a relatively compact parameter size (~0.9B), enabling deployment in high-concurrency services and edge environments. The model’s multimodal capabilities allow it to reason across image and text content holistically, capturing structured and unstructured information from pages that include dense tables, seals, code snippets, and varied document graphics. ...
    Downloads: 20 This Week
    Last Update:
    See Project
  • Forever Free Full-Stack Observability | Grafana Cloud Icon
    Forever Free Full-Stack Observability | Grafana Cloud

    Our generous forever free tier includes the full platform, including the AI Assistant, for 3 users with 10k metrics, 50GB logs, and 50GB traces.

    Built on open standards like Prometheus and OpenTelemetry, Grafana Cloud includes Kubernetes Monitoring, Application Observability, Incident Response, plus the AI-powered Grafana Assistant. Get started with our generous free tier today.
    Create free account
  • 10
    spaCy models

    spaCy models

    Models for the spaCy Natural Language Processing (NLP) library

    ...The library respects your time, and tries to avoid wasting it. It's easy to install, and its API is simple and productive. spaCy excels at large-scale information extraction tasks. It's written from the ground up in carefully memory-managed Cython. If your application needs to process entire web dumps, spaCy is the library you want to be using. Since its release in 2015, spaCy has become an industry standard with a huge ecosystem. Choose from a variety of plugins, integrate with your machine learning stack and build custom components and workflows.
    Downloads: 13 This Week
    Last Update:
    See Project
  • 11
    Scrapy

    Scrapy

    A fast, high-level web crawling and web scraping framework

    Scrapy is a fast, open source, high-level framework for crawling websites and extracting structured data from these websites. Portable and written in Python, it can run on Windows, Linux, macOS and BSD. Scrapy is powerful, fast and simple, and also easily extensible. Simply write the rules to extract the data, and add new functionality if you wish without having to touch the core. Scrapy does the rest, and can be used in a number of applications. It can be used for data mining, monitoring...
    Downloads: 25 This Week
    Last Update:
    See Project
  • 12
    Transformers

    Transformers

    State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX

    ...Using pre-trained models can reduce your compute costs, carbon footprint, and save you the time and resources required to train a model from scratch. These models support common tasks in different modalities. Text, for tasks like text classification, information extraction, question answering, summarization, translation, text generation, in over 100 languages. Images, for tasks like image classification, object detection, and segmentation. Audio, for tasks like speech recognition and audio classification. Transformers provides APIs to quickly download and use those pretrained models on a given text, fine-tune them on your own datasets and then share them with the community on our model hub. ...
    Downloads: 23 This Week
    Last Update:
    See Project
  • 13
    MarkPDFDown

    MarkPDFDown

    A high-quality PDF to Markdown tool based on large language model

    MarkPDFdown is an open-source document processing tool designed to convert PDF files into structured Markdown output that can be easily used for documentation, content pipelines, and AI processing workflows. The project focuses on extracting text, formatting, and structural information from complex PDF documents and transforming that information into clean Markdown that preserves the original hierarchy of headings, paragraphs, tables, and lists. By producing Markdown rather than raw text,...
    Downloads: 10 This Week
    Last Update:
    See Project
  • 14
    Superlinked

    Superlinked

    Superlinked is a Python framework for AI Engineers

    Superlinked is a Python framework designed for AI engineers to build high-performance search and recommendation applications that combine structured and unstructured data.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 15
    Docling

    Docling

    Get your documents ready for gen AI

    ...The project focuses on converting and parsing many document formats into a unified structured representation that downstream systems can easily consume. It supports advanced PDF understanding, including layout detection, table extraction, and reading order analysis, enabling high-fidelity document intelligence pipelines. Docling is designed to run efficiently on commodity hardware and can be used both as a Python API and a command-line tool. Its modular architecture allows developers to extend functionality and integrate specialized models for tasks such as OCR and audio transcription. ...
    Downloads: 6 This Week
    Last Update:
    See Project
  • 16
    Prompt Engineering Interactive Tutorial

    Prompt Engineering Interactive Tutorial

    Anthropic's Interactive Prompt Engineering Tutorial

    ...The course leans heavily on realistic failure modes (ambiguity, hallucination, brittle instructions) and shows how to iteratively debug prompts the way you would debug code. Lessons include building prompts from scratch for common tasks like extraction, classification, transformation, and step-by-step reasoning, with checkpoints that let you compare your outputs against solid baselines. You’ll also practice advanced patterns such as tool use, constrained generation, and response validation so outputs are trustworthy and machine-consumable.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 17
    ClatScope

    ClatScope

    OSINT reconnaissance tool for IP, domain, email, and username lookups

    ClatScope is a Python-based OSINT (open source intelligence) utility designed to gather and analyze publicly available information from multiple online sources. It is primarily aimed at investigators, cybersecurity professionals, penetration testers, and researchers who need a centralized platform for reconnaissance tasks. It integrates with numerous public APIs and internet services to retrieve detailed data about IP addresses, domains, email addresses, phone numbers, usernames, and other...
    Downloads: 8 This Week
    Last Update:
    See Project
  • 18
    RAG Anything

    RAG Anything

    RAG-Anything: All-in-One RAG Framework

    RAG-Anything is an open-source unified framework that extends the Retrieval-Augmented Generation (RAG) paradigm to fully multimodal document and knowledge retrieval, enabling systems to ingest, parse, represent, and query rich content that includes text, images, tables, formulas, and other structured or visual elements. Traditional RAG systems are typically limited to text and cannot effectively work across heterogeneous document layouts, but RAG-Anything addresses this by modeling...
    Downloads: 8 This Week
    Last Update:
    See Project
  • 19
    MegaParse

    MegaParse

    File Parser optimised for LLM Ingestion with no loss

    ...It efficiently parses various document formats, such as PDFs, DOCX, and PPTX, converting them into formats ideal for processing by LLMs. This tool is essential for applications that require accurate and comprehensive data extraction from diverse document types.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 20
    Nano PDF Editor

    Nano PDF Editor

    Edit PDF files with Nano Banana

    Nano PDF Editor is a minimalist, portable PDF viewer and toolkit that focuses on simplicity, speed, and ease of integration for applications that need basic PDF rendering without heavy dependencies. It provides core functionality such as page navigation, zooming, text selection, and rendering directly to native graphics surfaces, making it suitable for lightweight PDF viewing scenarios on desktop or embedded platforms. Designed to be easily embedded into larger software projects, Nano-PDF...
    Downloads: 7 This Week
    Last Update:
    See Project
  • 21
    HunyuanOCR

    HunyuanOCR

    OCR expert VLM powered by Hunyuan's native multimodal architecture

    HunyuanOCR is an open-source, end-to-end OCR (optical character recognition) Vision-Language Model (VLM) developed by Tencent‑Hunyuan. It’s designed to unify the entire OCR pipeline, detection, recognition, layout parsing, information extraction, translation, and even subtitle or structured output generation, into a single model inference instead of a cascade of separate tools. Despite being fairly lightweight (about 1 billion parameters), it delivers state-of-the-art performance across a wide variety of OCR tasks, outperforming many traditional OCR systems and even other multimodal models on benchmark suites. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 22
    Unrud Video Downloader

    Unrud Video Downloader

    Download videos from websites like YouTube and many others

    ...The application supports a wide range of features, including downloading entire playlists, handling private or password-protected content, and automatically selecting optimal formats based on user preferences. It also allows users to convert videos into audio files such as MP3, making it useful for media extraction workflows. The software is distributed across multiple platforms, including Linux package managers and containerized environments, ensuring broad accessibility. It includes configuration options and debugging capabilities for advanced users who want more control over the download process.
    Downloads: 15 This Week
    Last Update:
    See Project
  • 23
    Matcha-TTS

    Matcha-TTS

    A fast TTS architecture with conditional flow matching

    Matcha-TTS is a non-autoregressive neural text-to-speech architecture that uses conditional flow matching to generate speech quickly while maintaining natural quality. It models speech as an ODE-based generative process, and conditional flow matching lets it reach high-quality audio in only a few synthesis steps, which greatly reduces latency compared to score-matching diffusion approaches. The model is fully probabilistic, so it can generate diverse realizations of the same text while still...
    Downloads: 15 This Week
    Last Update:
    See Project
  • 24
    Director

    Director

    AI video agents framework for next-gen video interactions

    Director is a video database management system designed to organize, search, and retrieve large collections of video content efficiently.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 25
    LangExtract

    LangExtract

    A Python library for extracting structured information

    ...LangExtract supports a wide range of models, including Google Gemini, OpenAI GPT, and local LLMs via Ollama, making it adaptable to different deployment environments and compliance needs. The system excels at handling long documents using optimized chunking, multi-pass extraction, and parallel processing to ensure both high recall and structured consistency.
    Downloads: 10 This Week
    Last Update:
    See Project
MongoDB Logo MongoDB