Showing 56 open source projects for "extraction"

  • 1
    Scribe.js

    JavaScript OCR and text extraction for images and PDFs

    ...In addition to simple text extraction, Scribe.js supports writing or injecting a high-quality invisible text layer back into PDFs, effectively making them searchable and improving usability for indexing or accessibility. It is written in modern ECMAScript Modules (ESM), so it can be imported in both browser and Node.js environments without a build step, though browser usage requires same-origin hosting of the files.
    Downloads: 4 This Week
  • 2
    newpipeextractor

    Library for extracting streaming site data without official APIs

    NewPipeExtractor is an open source Java library designed to extract data from streaming platforms by analyzing their web interfaces instead of relying on official APIs. It serves as the core extraction component used by the NewPipe Android application, but it is built as a standalone library that can also be integrated into other software projects. NewPipeExtractor provides a unified framework for retrieving information such as video streams, playlists, channels, and search results from supported streaming services. It handles many low-level tasks involved in web data extraction, including parsing responses, managing platform-specific logic, and handling errors, allowing developers to focus on implementing application features rather than scraping mechanics. ...
    Downloads: 6 This Week
  • 3
    Vectorize MCP Server

    Official Vectorize MCP Server

    The Vectorize MCP Server is a Model Context Protocol server that integrates with Vectorize, offering advanced vector retrieval and text extraction capabilities.
    Downloads: 0 This Week
  • 4
    ytDownloader

    Desktop App for downloading Videos and Audios from hundreds of sites

    ...The application supports downloading from major platforms such as YouTube, Facebook, TikTok, Instagram, Twitch, and Twitter, offering users the ability to retrieve content in multiple formats and resolutions including MP4, MP3, and WebM. It includes advanced features such as playlist downloading, subtitle extraction, and range selection for partial downloads, making it useful for both casual users and power users. Additionally, ytDownloader incorporates hardware-accelerated video compression, multiple UI themes, and localization support, enhancing both performance and usability.
    Downloads: 12 This Week
  • 5
    AgentQL MCP

    Model Context Protocol server that integrates AgentQL's data

    The AgentQL MCP Server is a Model Context Protocol (MCP) server that integrates AgentQL's data extraction capabilities, enabling users to extract structured data from web pages using natural language prompts.
    Downloads: 0 This Week
  • 6
    Magnitude

    Vision AI browser agent for automation, testing, and extraction

    ...This approach allows the agent to generalize better across complex and modern websites, making it more robust than traditional selector-based automation tools. Browser Agent by Magnitude supports a wide range of capabilities including navigation, interaction, data extraction, and automated verification through built-in testing features. Developers can use it to automate repetitive web tasks, integrate services without APIs, or build advanced browser-based agents. It also provides flexible abstraction levels, allowing both high-level task execution and precise low-level control of actions like mouse movements and keyboard input.
    Downloads: 0 This Week
  • 7
    Wiseflow

    Enhance any agent's browser use skill

    Wiseflow is an open-source information extraction and knowledge discovery system designed to collect, filter, and organize valuable information from large volumes of online content. The platform continuously monitors specified sources such as websites, social platforms, and other digital channels to identify relevant data according to user-defined interests or topics. By combining web crawling, content parsing, and large language model analysis, the system extracts concise insights from raw information streams and converts them into structured data that can be stored or analyzed. ...
    Downloads: 1 This Week
  • 8
    WanGP

    AI video generator optimized for low-VRAM and older GPUs

    ...Wan2GP provides a full web-based interface that simplifies interaction with complex generative pipelines, making it easier to configure prompts, models, and rendering settings. It also integrates a wide range of utilities such as prompt enhancement, mask editing, motion design, and extraction tools for pose, depth, and flow data to support advanced video workflows.
    Downloads: 60 This Week
  • 9
    TikTok MCP

    Model Context Protocol (MCP) with TikTok integration

    The TikTok MCP integrates TikTok access into AI applications like Claude AI via TikNeuron. It enables analysis and interaction with TikTok content to determine virality factors and extract video content.
    Downloads: 3 This Week
  • 10
    web-access

    Skill for installing full networking capabilities for Claude Code

    web-access is a tool designed to give AI agents structured and controlled access to web content, enabling them to retrieve, navigate, and process information from online sources in real time. It abstracts common web interactions such as page loading, data extraction, and navigation into reusable functions that can be invoked by agents. The system emphasizes safety and control, likely including mechanisms to manage permissions, rate limits, and content filtering. This allows agents to operate within defined boundaries while still benefiting from dynamic, up-to-date information. The architecture supports integration with broader agent frameworks, making it a key component for building systems that require external knowledge. ...
    Downloads: 0 This Week
  • 11
    The Web MCP

    A powerful Model Context Protocol (MCP) server

    Bright Data’s Web MCP server gives AI assistants robust, real-time web capabilities through an MCP interface designed to avoid blocks, rate limits, and CAPTCHAs. It presents search, crawl, navigate, and extraction tools that agents can call directly, replacing brittle scraping prompts with typed operations. The README markets it as a “gateway” to the live web so assistants don’t fall back to stale training data. Bright Data also advertises a getting-started tier with a free monthly allotment, plus options for remote or self-hosted operation depending on governance needs. ...
    Downloads: 0 This Week
  • 12
    Zotero

    Tool to help you collect, organize, annotate, cite, and share research

    Zotero is a powerful, free, open-source research management application designed to help students, academics, and professionals collect, organize, annotate, cite, and share research sources and materials for papers, projects, or books. It can save web pages, PDFs, books, articles, and more with metadata, automatically extract bibliographic information, and organize items into collections and tag systems, while supporting notes and annotations directly alongside references. Zotero’s interface...
    Downloads: 7 This Week
  • 13
    Article Extractor

    Extract the main article from a given URL with Node.js

    A Node.js library for extracting main content from web articles, removing unnecessary clutter like ads and navigation elements.
    Downloads: 0 This Week
  • 14
    sharp

    High performance Node.js image processing module

    ...Colour spaces, embedded ICC profiles and alpha transparency channels are all handled correctly. Lanczos resampling ensures quality is not sacrificed for speed. As well as image resizing, operations such as rotation, extraction, compositing and gamma correction are available. Most modern macOS, Windows and Linux systems running Node.js v10+ do not require any additional install or runtime dependencies. This module supports reading JPEG, PNG, WebP, AVIF, TIFF, GIF and SVG images. Output images can be in JPEG, PNG, WebP, AVIF and TIFF formats as well as uncompressed raw pixel data. ...
    Downloads: 0 This Week
  • 15
    npm-pdfreader

    Parse text and tables from PDF files.

    npm-pdfreader is a Node.js library for reading text and parsing tables from PDF files. It supports tabular data with automatic column detection and rule-based parsing, making it useful for extracting structured data from PDFs.
    Downloads: 0 This Week
  • 16
    Pot Desktop

    A cross-platform software for text translation and recognition

    ...It supports picking text via mouse selection (“highlight-and-translate”), clipboard listening, or screenshot-based OCR; this makes it ideal for reading webpages, documents, images — or any on-screen text — and instantly getting translations or text extraction. The tool supports external plugin extensions, which means its functionality can be expanded far beyond the built-in options: you can add translation engines, OCR backends, TTS engines, vocabulary export (e.g. for language learning), and more. Pot-Desktop works on Windows, macOS, and Linux (including Wayland environments), and offers convenient installers or package-manager installation methods (e.g. via brew or .deb, etc.), so it’s accessible for users on all major desktop OSes.
    Downloads: 16 This Week
  • 17
    DeepCamera

    Open-Source AI Camera. Empower any camera/CCTV

    ...SharpAI yolov7_reid is an open-source Python application that leverages AI technologies to detect intruders with traditional surveillance cameras. It uses Yolov7 as a person detector, FastReID for person feature extraction, Milvus as the local vector database for self-supervised learning to identify unseen persons, and Labelstudio to host images locally and for further uses such as labeling data and training your own classifier. It also integrates with Home-Assistant to empower smart homes with AI technology.
    Downloads: 4 This Week
  • 18
    Open Semantic Search

    Open source semantic search and text analytics for large document sets

    Open Semantic Search is an open source research and analytics platform designed for searching, analyzing, and exploring large collections of documents using semantic search technologies. It provides an integrated search server combined with a document processing pipeline that supports crawling, text extraction, and automated analysis of content from many different sources. Open Semantic Search includes an ETL framework that can ingest documents, process them through analysis steps, and enrich the data with extracted information such as named entities and metadata. It also supports optical character recognition to extract text from images and scanned documents, including images embedded inside PDF files. ...
    Downloads: 5 This Week
  • 19
    Browserless

    The headless Chrome/Chromium driver on top of Puppeteer

    Browserless is an open-source headless browser automation library and service built on top of Puppeteer that simplifies the process of running and scaling Chromium-based browser tasks in production environments. It provides a high-level API for interacting with headless Chrome, allowing developers to perform operations such as generating PDFs, capturing screenshots, extracting text or HTML, and automating web navigation. The project is designed to act as a production-ready abstraction layer...
    Downloads: 2 This Week
  • 20
    designlang

    Extract any website's complete design system with one command

    designlang is a powerful tool that extracts complete design systems from existing websites using automated analysis and converts them into reusable assets and tokens. It generates structured outputs such as design tokens, semantic components, and styling systems that can be used across multiple platforms. The tool supports exporting to frameworks like Tailwind, SwiftUI, Flutter, and WordPress, making it highly versatile for cross-platform development. It also integrates with tools like Figma...
    Downloads: 2 This Week
  • 21
    h265web.js

    A HEVC/H.265 Web Player

    h265web.js is a WebAssembly-powered video decoding library designed to enable playback and processing of H.265/HEVC video streams directly in web browsers without relying on native browser codec support. It provides a low-level decoding API that allows developers to build custom video players capable of handling raw H.265 streams, which are typically not widely supported natively in browsers. The project includes components for parsing H.265 bitstreams into NAL units and decoding them into...
    Downloads: 1 This Week
  • 22
    Browserbase Skills

    Claude Agent SDK with a web browsing tool

    Browserbase Skills is a collection of reusable automation “skills” designed to enable AI agents to interact with web environments programmatically. It provides structured workflows that abstract browser actions such as navigation, form filling, and data extraction into composable building blocks. The system is intended to simplify the development of browser-based agents by offering prebuilt capabilities that can be orchestrated together. It integrates with headless browser infrastructure, allowing scalable automation across multiple sessions. The design emphasizes reliability and repeatability, reducing the complexity of handling dynamic web interfaces. ...
    Downloads: 0 This Week
  • 23
    katana

    Fast CLI web crawler for discovering endpoints in modern web apps

    Katana is an open source command-line web crawling and spidering framework developed by ProjectDiscovery. It is designed to efficiently crawl websites and web applications in order to discover endpoints, resources, and other useful information that may not be easily visible through manual browsing. Katana focuses on speed and automation, making it suitable for use in security reconnaissance workflows and automated pipelines. Katana supports both standard HTTP crawling and headless browser...
    Downloads: 1 This Week
  • 24
    Firecrawl MCP Server

    Adds powerful web scraping and search to Cursor and Claude

    firecrawl-mcp-server is the official MCP integration for Firecrawl that brings high-recall web scraping, crawling, and search into IDEs and agent runtimes. It exposes tools for single-page scrape, multi-URL batch jobs, site discovery, and search enrichment, returning cleaned, structured content suitable for downstream LLM reasoning. The server is designed to run with Firecrawl’s hosted API or self-hosted deployments, making it flexible for enterprise data-governance requirements. Built-in...
    Downloads: 1 This Week
  • 25
    Markdownify MCP Server

    Convert files and web content into clean, usable Markdown easily

    ...It supports formats such as PDFs, images, audio with transcription, DOCX, XLSX, and PPTX, along with web sources like YouTube transcripts, Bing results, and general webpages. Markdownify MCP is designed to simplify content extraction and make data easier to read, share, and reuse in structured workflows. Developers can install dependencies, build, and run the server locally, then extend functionality by modifying its TypeScript-based tools and server logic. It also allows retrieval of existing Markdown files, making it useful for documentation, research, and AI-assisted workflows. ...
    Downloads: 0 This Week