Showing 774 open source projects for "extraction"

View related business solutions
  • Earn up to 16% annual interest with Nexo. Icon
    Earn up to 16% annual interest with Nexo.

    Access competitive interest rates on your digital assets.

    Generate interest, borrow against your crypto, and trade a range of cryptocurrencies — all in one platform. Geographic restrictions, eligibility, and terms apply.
    Get started with Nexo.
  • Gemini 3 and 200+ AI Models on One Platform Icon
    Gemini 3 and 200+ AI Models on One Platform

    Access Google's best plus Claude, Llama, and Gemma. Fine-tune and deploy from one console.

    Build generative AI apps with Vertex AI. Switch between models without switching platforms.
    Start Free
  • 1
    X-osint

    X-osint

    Open source OSINT tool for gathering data on emails, phones, and IPs

    X-osint is an open source intelligence framework designed to collect and analyze publicly available information from multiple sources. It focuses on gathering useful and credible data about entities such as phone numbers, email addresses, and IP addresses using a range of automated OSINT techniques. It provides investigators and researchers with a centralized interface for running information-gathering tasks that would normally require multiple separate tools. X-osint can also perform...
    Downloads: 47 This Week
    Last Update:
    See Project
  • 2
    Adversarial Robustness Toolbox

    Adversarial Robustness Toolbox

    Adversarial Robustness Toolbox (ART) - Python Library for ML security

    ...ART provides tools that enable developers and researchers to evaluate, defend, certify and verify Machine Learning models and applications against the adversarial threats of Evasion, Poisoning, Extraction, and Inference. ART supports all popular machine learning frameworks (TensorFlow, Keras, PyTorch, MXNet, sci-kit-learn, XGBoost, LightGBM, CatBoost, GPy, etc.), all data types (images, tables, audio, video, etc.) and machine learning tasks (classification, object detection, generation, certification, etc.).
    Downloads: 9 This Week
    Last Update:
    See Project
  • 3
    Pot Desktop

    Pot Desktop

    A cross-platform software for text translation and recognition

    ...It supports picking text via mouse selection (“highlight-and-translate”), clipboard listening, or screenshot-based OCR; this makes it ideal for reading webpages, documents, images — or any on-screen text — and instantly getting translations or text extraction. The tool supports external plugin extensions, which means its functionality can be expanded far beyond the built-in options: you can add translation engines, OCR backends, TTS engines, vocabulary export (e.g. for language learning), and more. Pot-Desktop works on Windows, macOS, and Linux (including Wayland environments), and offers convenient installers or package-manager installation methods (e.g. via brew or .deb, etc.), so it’s accessible for users on all major desktop OSes.
    Downloads: 61 This Week
    Last Update:
    See Project
  • 4
    DINOv3

    DINOv3

    Reference PyTorch implementation and models for DINOv3

    DINOv3 is the third-generation iteration of Meta’s self-supervised visual representation learning framework, building upon the ideas from DINO and DINOv2. It continues the paradigm of learning strong image representations without labels using teacher–student distillation, but introduces a simplified and more scalable training recipe that performs well across datasets and architectures. DINOv3 removes the need for complex augmentations or momentum encoders, streamlining the pipeline while...
    Downloads: 22 This Week
    Last Update:
    See Project
  • $300 in Free Credit Towards Top Cloud Services Icon
    $300 in Free Credit Towards Top Cloud Services

    Build VMs, containers, AI, databases, storage—all in one place.

    Start your project in minutes. After credits run out, 20+ products include free monthly usage. Only pay when you're ready to scale.
    Get Started
  • 5
    CL4R1T4S

    CL4R1T4S

    Archive of leaked AI system prompts and internal instruction sets

    ...According to the its description, the initiative is motivated by the idea that understanding the input instructions behind an AI system helps users better interpret its outputs. Contributors are encouraged to add newly discovered or extracted prompts, along with information such as the model version, extraction date, and etc.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 6
    newspaper4k

    newspaper4k

    Python library for scraping and analyzing online news articles easily

    ...Newspaper4k also includes natural language processing capabilities that can generate summaries and identify keywords from extracted article text. Newspaper4k supports both single-article extraction and full news site processing, allowing users to build sources representing entire publications and iterate through their articles. It maintains compatibility with the original project so that existing code written for newspaper3k can continue working with minimal changes.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 7
    JS Analyzer

    JS Analyzer

    Burp Suite extension for JavaScript static analysis

    JS Analyzer is a powerful static analysis tool implemented as a Burp Suite extension that helps security researchers and web developers automatically uncover important artifacts in JavaScript files during web application testing. It parses JavaScript responses intercepted by Burp Suite and intelligently extracts API endpoints, full URLs (including cloud storage links), secrets like API keys or tokens, and email addresses while filtering out noise from irrelevant code patterns. The extension...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 8
    3FS

    3FS

    A high-performance distributed file system

    The 3FS repository (standing likely for “Feature 3F System” or similar) is focused on providing a feature extraction and transformation framework tailored to deep and large models, especially in token-based systems. Its primary aim is to support efficient and scalable feature transformation pipelines—especially for inference environments—by batching, caching, and integrating feature-based modules like segmenters, sparse retrievers, and scorers seamlessly.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 9
    Liferea

    Liferea

    Liferea (Linux Feed Reader), a news reader for GTK/GNOME

    Liferea is a feed reader/news aggregator that brings together all of the content from your favorite subscriptions into a simple interface that makes it easy to organize and browse feeds. Its GUI is similar to a desktop mail/news client, with an embedded web browser.
    Downloads: 3 This Week
    Last Update:
    See Project
  • AI-powered service management for IT and enterprise teams Icon
    AI-powered service management for IT and enterprise teams

    Enterprise-grade ITSM, for every business

    Give your IT, operations, and business teams the ability to deliver exceptional services—without the complexity. Maximize operational efficiency with refreshingly simple, AI-powered Freshservice.
    Try it Free
  • 10
    Certificate Ripper

    Certificate Ripper

    A CLI tool to extract server certificates

    ...It can be used with or without Java, native executables are present in the releases. Extracts all the sub-fields of the certificate. Certificates can be formatted to PEM format. Bulk extraction of multiple different URLs with a single command is possible. Extracted certificates can be stored automatically in a p12 trust store. Works also behind a proxy.
    Downloads: 3 This Week
    Last Update:
    See Project
  • 11
    pyLoad

    pyLoad

    The free and open-source Download Manager written in pure Python

    ...It uses a plugin-driven architecture that supports hundreds of hosters, link decrypters, and extensions that extend its capabilities. pyLoad includes a modern web-based interface that allows users to remotely manage downloads from a browser, enabling full control over queues, links, and download settings. The system supports features such as premium account integration, automated captcha solving, and link extraction from container files or encrypted link lists.
    Downloads: 17 This Week
    Last Update:
    See Project
  • 12
    yt-dlp

    yt-dlp

    A youtube-dl fork with additional features and fixes

    yt-dlp is a youtube-dl fork based on the now inactive youtube-dlc. The main focus of this project is adding new features and patches while also keeping up to date with the original project
    Downloads: 606 This Week
    Last Update:
    See Project
  • 13
    Unredact

    Unredact

    A simple tool for reading in poorly redacted documents

    Unredact is a specialized tool that attempts to reconstruct redacted or obscured text in images, PDFs, or screenshots using a combination of image processing and generative AI inference to suggest plausible completions of blurred, black-boxed, or jumbled content. Unlike traditional optical character recognition (OCR), which only reads visible text, Unredact focuses on inferring missing content where redaction has been applied by analyzing surrounding context, font characteristics, and...
    Downloads: 16 This Week
    Last Update:
    See Project
  • 14
    HeartMuLa

    HeartMuLa

    A Family of Open Sourced Music Foundation Models

    ...The project also includes HeartCodec, a music codec optimized for high reconstruction fidelity, enabling efficient tokenization and reconstruction workflows that are critical for training and generation pipelines. For text extraction from audio, it provides HeartTranscriptor, a Whisper-based model tuned specifically for lyrics transcription, which helps bridge generated or recorded audio back into structured text. It also introduces HeartCLAP, which aligns audio and text into a shared embedding space.
    Downloads: 16 This Week
    Last Update:
    See Project
  • 15
    GenAIScript

    GenAIScript

    Automatable GenAI Scripting

    JavaScript-ish environment with convenient tooling for file ingestion, prompt development, and structured data extraction. A Microsoft tool that generates AI-powered text based on prompts, useful for content creation and automation.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 16
    GLM-OCR

    GLM-OCR

    Accurate × Fast × Comprehensive

    GLM-OCR is an open-source multimodal optical character recognition (OCR) model built on a GLM-V encoder–decoder foundation that brings robust, accurate document understanding to complex real-world layouts and modalities. Designed to handle text recognition, table parsing, formula extraction, and general information retrieval from documents containing mixed content, GLM-OCR excels across major benchmarks while remaining highly efficient with a relatively compact parameter size (~0.9B), enabling deployment in high-concurrency services and edge environments. The model’s multimodal capabilities allow it to reason across image and text content holistically, capturing structured and unstructured information from pages that include dense tables, seals, code snippets, and varied document graphics. ...
    Downloads: 22 This Week
    Last Update:
    See Project
  • 17
    TextFSM

    TextFSM

    Python module for parsing semi-structured text into python tables

    ...By defining parsing logic through reusable template files, TextFSM transforms unstructured text into structured data like lists or tables without requiring complex regular expression code. Each template defines states, transitions, and regex patterns that determine how to interpret text line by line, enabling precise extraction of key information from varied sources. This modular approach allows users to maintain a library of templates for different data formats, improving automation in network operations and system administration. Widely used in network automation workflows, TextFSM integrates easily with Python scripts, making it an essential tool for engineers.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 18
    Read Frog

    Read Frog

    Open Source Immersive Translate

    Read Frog is an open-source browser extension designed to transform everyday web reading into an immersive language learning experience powered by artificial intelligence. The tool integrates translation, contextual explanations, and content analysis directly into the browsing workflow so users can learn languages naturally while reading authentic online content. Instead of forcing learners to switch between translation tools and the original text, the extension displays translations...
    Downloads: 20 This Week
    Last Update:
    See Project
  • 19
    Scrapy

    Scrapy

    A fast, high-level web crawling and web scraping framework

    Scrapy is a fast, open source, high-level framework for crawling websites and extracting structured data from these websites. Portable and written in Python, it can run on Windows, Linux, macOS and BSD. Scrapy is powerful, fast and simple, and also easily extensible. Simply write the rules to extract the data, and add new functionality if you wish without having to touch the core. Scrapy does the rest, and can be used in a number of applications. It can be used for data mining, monitoring...
    Downloads: 28 This Week
    Last Update:
    See Project
  • 20
    River ML

    River ML

    Online machine learning in Python

    River is a Python library for online machine learning. It aims to be the most user-friendly library for doing machine learning on streaming data. River is the result of a merger between creme and scikit-multiflow.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 21
    LibPDF

    LibPDF

    A modern PDF library for TypeScript

    ...The library offers full read and write manipulation, including support for encryption with RC4 and modern AES cipher suites, form filling and flattening, digital signature creation and verification, page merging/splitting, rich text extraction with layout information, and font embedding with subsetting.
    Downloads: 8 This Week
    Last Update:
    See Project
  • 22
    Markdownify MCP Server

    Markdownify MCP Server

    Convert files and web content into clean, usable Markdown easily

    ...It supports formats such as PDFs, images, audio with transcription, DOCX, XLSX, and PPTX, along with web sources like YouTube transcripts, Bing results, and general webpages. Markdownify MCP is designed to simplify content extraction and make data easier to read, share, and reuse in structured workflows. Developers can install dependencies, build, and run the server locally, then extend functionality by modifying its TypeScript-based tools and server logic. It also allows retrieval of existing Markdown files, making it useful for documentation, research, and AI-assisted workflows. ...
    Downloads: 12 This Week
    Last Update:
    See Project
  • 23
    BrowserOS

    BrowserOS

    Agentic browser; privacy-first alternative to ChatGPT Atlas

    BrowserOS is an open-source, agentic web browser built on a Chromium base that integrates AI agents directly into the browsing experience. Rather than just doing standard browsing, it places AI intelligence at the core: you can connect your own API keys (for e.g., OpenAI, Anthropic, Google Gemini) or run local models (via e.g., Ollama) so that your browsing data and automation stay on your machine — privacy and control are emphasized throughout. The interface remains familiar to users of...
    Downloads: 33 This Week
    Last Update:
    See Project
  • 24
    spaCy models

    spaCy models

    Models for the spaCy Natural Language Processing (NLP) library

    ...The library respects your time, and tries to avoid wasting it. It's easy to install, and its API is simple and productive. spaCy excels at large-scale information extraction tasks. It's written from the ground up in carefully memory-managed Cython. If your application needs to process entire web dumps, spaCy is the library you want to be using. Since its release in 2015, spaCy has become an industry standard with a huge ecosystem. Choose from a variety of plugins, integrate with your machine learning stack and build custom components and workflows.
    Downloads: 11 This Week
    Last Update:
    See Project
  • 25
    Transformers

    Transformers

    State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX

    ...Using pre-trained models can reduce your compute costs, carbon footprint, and save you the time and resources required to train a model from scratch. These models support common tasks in different modalities. Text, for tasks like text classification, information extraction, question answering, summarization, translation, text generation, in over 100 languages. Images, for tasks like image classification, object detection, and segmentation. Audio, for tasks like speech recognition and audio classification. Transformers provides APIs to quickly download and use those pretrained models on a given text, fine-tune them on your own datasets and then share them with the community on our model hub. ...
    Downloads: 23 This Week
    Last Update:
    See Project
MongoDB Logo MongoDB