Showing 987 open source projects for "extraction"

  • 1
    MiroFish

    A Simple and Universal Swarm Intelligence Engine

    MiroFish is a next-generation artificial intelligence prediction engine that leverages multi-agent technology and swarm-intelligence simulation to model, simulate, and forecast complex real-world scenarios. The system extracts “seed” information from sources such as breaking news, policy documents, and market signals to construct a high-fidelity digital parallel world populated by thousands of virtual agents with independent memory and behavior rules. Users can inject variables or conditions...
    Downloads: 880 This Week
  • 2
    NLP

    Open source NLP guide with models, methods, and real use cases

    ...It explains how machines process and understand human language, combining theory with practical examples. It covers core NLP concepts such as text representation, feature extraction, and model evaluation, alongside hands-on implementations using tools like Word2Vec, TF-IDF, and FastText. It also introduces topic modeling with LDA, keyword extraction techniques, and document similarity methods. NLP extends into real-world applications, including sentiment analysis and text classification, helping readers connect concepts to use cases. ...
    Downloads: 1 This Week
  • 3
    pyAudioAnalysis

    Python Audio Analysis Library: Feature Extraction, Classification

    ...The project provides a collection of tools that allow developers to extract meaningful features from audio files and use those features for classification, segmentation, and analysis. The library supports multiple audio processing workflows, including feature extraction from raw audio signals, training of machine learning models, and automatic audio segmentation. It also includes utilities for visualizing audio features and analyzing patterns within sound recordings, which can be useful in applications such as speech recognition, music classification, and acoustic event detection. Because the library integrates machine learning algorithms with signal processing tools, it enables researchers to develop complete audio analysis pipelines using a single framework.
    Downloads: 1 This Week
  • 4
    Vectorize MCP Server

    Official Vectorize MCP Server

    The Vectorize MCP Server is a Model Context Protocol server that integrates with Vectorize, offering advanced vector retrieval and text extraction capabilities.
    Downloads: 0 This Week
  • 5
    X-Crawl

    Flexible Node.js AI-assisted crawler library

    A high-performance web crawling and scraping framework for Node.js, designed for large-scale data extraction.
    Downloads: 0 This Week
  • 6
    FModel

    Unreal Engine Archives Explorer

    FModel is a freeware application for exploring the archives of Unreal Engine games, allowing users to delve into the assets and structures of games developed with Unreal Engine.
    Downloads: 89 This Week
  • 7
    Wiseflow

    Enhance any agent's browser use skill

    Wiseflow is an open-source information extraction and knowledge discovery system designed to collect, filter, and organize valuable information from large volumes of online content. The platform continuously monitors specified sources such as websites, social platforms, and other digital channels to identify relevant data according to user-defined interests or topics. By combining web crawling, content parsing, and large language model analysis, the system extracts concise insights from raw information streams and converts them into structured data that can be stored or analyzed. ...
    Downloads: 13 This Week
  • 8
    watercrawl

    AI-ready web crawler that extracts and structures website content

    WaterCrawl is an open source web crawling and data extraction platform designed to transform website content into structured data suitable for machine learning and AI workflows. It enables developers and researchers to crawl web pages, extract meaningful information, and convert it into formats that are easier to process and analyze. It provides a modern crawling system that can automatically navigate links, control crawl depth, and collect content from targeted sections of a website. ...
    Downloads: 2 This Week
  • 9
    tsfresh

    Automatic extraction of relevant features from time series

    tsfresh is a Python package. It automatically calculates a large number of time series characteristics, the so-called features. tsfresh is used to extract characteristics from time series. Without tsfresh, you would have to calculate all characteristics by hand. With tsfresh this process is automated and all your features can be calculated automatically. Further, tsfresh is compatible with Python's pandas and scikit-learn APIs, two important packages for data science endeavours in Python....
    Downloads: 1 This Week
  • 10
    LiteParse

    A fast, helpful, and open-source document parser

    ...It also includes mechanisms for validation and error handling, ensuring that outputs conform to expected schemas and reducing the need for manual postprocessing. The library is particularly useful for tasks such as data extraction, document processing, and building pipelines that require structured outputs from natural language input.
    Downloads: 5 This Week
  • 11
    video2robot

    End-to-end pipeline converting generative videos

    video2robot is an end-to-end open-source pipeline that converts generative video or prompt-driven motion content into executable humanoid robot motion sequences, enabling researchers and developers to go from high-level action descriptions or videos to robot-ready motion data. The pipeline supports both prompt-to-video generation using models like Veo/Sora and video upload processing, followed by human pose extraction through a 3D pose model and retargeting of that motion to robot joints using a general motion retargeting system. This workflow allows users to generate robot motion files that specify joint angles, root positions, and orientations that can be deployed on supported robot platforms (e.g., Unitree models). Video2robot includes scripts for each stage of the pipeline (generation, extraction, conversion, visualization) and can run as a CLI or through a basic web UI.
    Downloads: 0 This Week
  • 12
    Trafilatura

    Python & command-line tool to gather text on the Web

    Trafilatura is a Python package and command-line tool designed to gather text on the Web. It includes discovery, extraction and text-processing components. Its main applications are web crawling, downloads, scraping, and extraction of main texts, metadata and comments. It aims at staying handy and modular: no database is required, the output can be converted to various commonly used formats. Going from raw HTML to essential parts can alleviate many problems related to text quality, first by avoiding the noise caused by recurring elements (headers, footers, links/blogroll etc.) and second by including information such as author and date in order to make sense of the data. ...
    Downloads: 0 This Week
  • 13
    AgentQL MCP

    Model Context Protocol server that integrates AgentQL's data

    The AgentQL MCP Server is a Model Context Protocol (MCP) server that integrates AgentQL's data extraction capabilities, enabling users to extract structured data from web pages using natural language prompts.
    Downloads: 0 This Week
  • 14
    newpipeextractor

    Library for extracting streaming site data without official APIs

    NewPipeExtractor is an open source Java library designed to extract data from streaming platforms by analyzing their web interfaces instead of relying on official APIs. It serves as the core extraction component used by the NewPipe Android application, but it is built as a standalone library that can also be integrated into other software projects. NewPipeExtractor provides a unified framework for retrieving information such as video streams, playlists, channels, and search results from supported streaming services. It handles many low-level tasks involved in web data extraction, including parsing responses, managing platform-specific logic, and handling errors, allowing developers to focus on implementing application features rather than scraping mechanics. ...
    Downloads: 1 This Week
  • 15
    MatImage

    Image Processing library for Matlab

    matImage is an open-source MATLAB library for image processing and analysis. It provides a variety of tools for image enhancement, segmentation, and feature extraction. It’s especially useful for users working on biomedical images or those needing detailed image analysis in MATLAB.
    Downloads: 0 This Week
  • 16
    Scribe.js

    JavaScript OCR and text extraction for images and PDFs

    ...In addition to simple text extraction, Scribe.js supports writing or injecting a high-quality invisible text layer back into PDFs, effectively making them searchable and improving usability for indexing or accessibility. It is written in modern ECMAScript Modules (ESM), so it can be imported in both browser and Node.js environments without a build step, though browser usage requires same-origin hosting of the files.
    Downloads: 0 This Week
  • 17
    Symfony DomCrawler

    Eases DOM navigation for HTML and XML documents

    Symfony DomCrawler is a PHP component that provides powerful tools for navigating and extracting data from HTML and XML documents. It allows developers to parse, filter, and manipulate web pages using CSS selectors and XPath expressions. DomCrawler is widely used for web scraping, testing, and processing structured content, and integrates well with other Symfony components like BrowserKit.
    Downloads: 0 This Week
  • 18
    Auto-Deep-Research

    Your Fully-Automated Personal AI Assistant

    ...Users provide a research topic or multifaceted goal, and the system autonomously breaks the objective down into subtasks like literature collection, critical summarization, cross-comparison, citation extraction, metric evaluation, and structured writing. Auto-Deep-Research integrates retrieval from academic and web sources, processes document corpora for relevance and key insights, and organizes outputs into coherent chapters or sections according to research standards. It also embeds validation loops, where intermediate drafts are self-checked for consistency, coverage, and alignment with sound reasoning practices, reducing reliance on raw generation alone.
    Downloads: 2 This Week
  • 19
    yt-dlp-gui

    A cross-platform GUI wrapper for yt-dlp written in PySide6

    ...Written in PySide6 (Python with Qt bindings), it wraps the powerful yt-dlp engine in a visual application that lets users paste video URLs, choose formats, apply presets, and start downloads with a click, while still exposing options for advanced tweaks via configuration files. The project supports preset definitions and global arguments through a config file, so users can customize their most common download workflows—like audio extraction, quality ranking, or embedding thumbnails—without retyping arguments each time. Downloads can be initiated from a portable app bundle or run manually with Python, making it flexible across platforms including Windows and Linux.
    Downloads: 272 This Week
  • 20
    Magnitude

    Vision AI browser agent for automation, testing, and extraction

    ...This approach allows the agent to generalize better across complex and modern websites, making it more robust than traditional selector-based automation tools. Browser Agent by Magnitude supports a wide range of capabilities including navigation, interaction, data extraction, and automated verification through built-in testing features. Developers can use it to automate repetitive web tasks, integrate services without APIs, or build advanced browser-based agents. It also provides flexible abstraction levels, allowing both high-level task execution and precise low-level control of actions like mouse movements and keyboard input.
    Downloads: 0 This Week
  • 21
    skycaiji

    Open source web scraping system for automated data collection tasks

    SkyCaiji is an open source web scraping and data collection system designed to gather information from websites through configurable extraction rules. It focuses on simplifying the process of building crawlers by allowing users to visually define scraping rules rather than writing complex code. It can collect structured or unstructured data from many types of webpages and automate the extraction process for large datasets. SkyCaiji is designed to run on a variety of hosting environments including local machines, shared hosting environments, and cloud servers. ...
    Downloads: 3 This Week
  • 22
    Scanopy

    Clean network diagrams, One-time setup, zero upkeep

    Scanopy is a powerful multi-modal data capture and analysis toolkit that enables users to collect, process, and visualize structured and unstructured information from a variety of sources in a flexible pipeline. It is built to handle complex scanning tasks — such as OCR, document analysis, audio transcription, network data capture, and image extraction — while providing unified APIs and workflows that make managing heterogeneous data sources seamless. Developers can compose custom pipelines that chain together transforms, filters, and exporters, enabling automation of tedious data preparation steps and accelerating insights with minimal code. The system places a premium on extensibility, allowing contributors to add new extractors or analysis modules tailored to specific industries or datasets. ...
    Downloads: 3 This Week
  • 23
    MemProcFS Analyzer

    Automated Forensic Analysis of Windows Memory Dumps for DFIR

    ...By exposing process memory, kernel objects, and derived artifacts as regular files, the framework lets analysts use familiar filesystem operations and standard tools (editors, grep, diff) to explore memory snapshots. The Analyzer layer adds higher-level parsing and extraction routines—for example, carving strings, locating injected modules, enumerating handles, or reconstructing network sockets—so investigators can go from raw memory to actionable evidence more quickly. It emphasizes automation and reproducibility: parsers can be chained, results exported, and reports templated to fit incident workflows. ...
    Downloads: 18 This Week
  • 24
    Hacks

    A collection of hacks and one-off scripts

    ...Rather than being a single cohesive application, it serves as a repository of practical command-line tools that can be used independently or combined into workflows. The scripts cover a wide range of tasks, including URL manipulation, parameter replacement, data extraction, and reconnaissance automation. Many of the tools in the repository are designed for efficiency and simplicity, enabling users to perform complex operations with minimal overhead. It is particularly popular among security researchers and developers who need quick, flexible solutions for niche problems.
    Downloads: 0 This Week
  • 25
    KaraKeep

    A self-hostable bookmark-everything app

    ...Automatic fetching of link titles, descriptions, and images streamlines saving content without manual edits, while rule-based management lets users define customized workflows. With support for image OCR and structured data extraction, Karakeep functions as a flexible personal knowledge base for researchers, content creators, and heavy bookmarkers.
    Downloads: 1 This Week