Search Results for "data extraction" - Page 5

Showing 386 open source projects for "data extraction"

  • 1
    BrowserOS

    Agentic browser; privacy-first alternative to ChatGPT Atlas

    BrowserOS is an open-source, agentic web browser built on a Chromium base that integrates AI agents directly into the browsing experience. Rather than offering only standard browsing, it places AI at the core: you can connect your own API keys (e.g., OpenAI, Anthropic, Google Gemini) or run local models (e.g., via Ollama) so that your browsing data and automation stay on your machine; privacy and control are emphasized throughout. The interface remains familiar to users of...
    Downloads: 14 This Week
  • 2
    Skyvern

    Automate browser-based workflows with LLMs and Computer Vision

    Skyvern uses a combination of computer vision and AI to understand content on a webpage, making it adaptable to any website. Skyvern takes instructions in natural language, allowing it to execute complex objectives with simple commands. Skyvern is an API-first product. Workflows execute in the cloud, allowing it to run hundreds of workflows at the same time. Skyvern's AI decisions come with built-in explanations, providing clear summaries and justifications for every action. Support for...
    Downloads: 2 This Week
  • 3
    Dungbeetle

    A distributed job server

    Dungbeetle is a distributed job server developed by Zerodha for queuing and executing heavy SQL read jobs, such as reports, asynchronously. The results of each job are written to a results database, from which applications can read them once the job completes.
    Downloads: 0 This Week
  • 4
    spider_collection

    Collection of Python web scraping scripts for data extraction tasks

    spider_collection is a collection of Python web crawler scripts created primarily for experimentation, learning, and practical scraping tasks. spider_collection gathers multiple independent spiders designed to collect data from different platforms and services, demonstrating a variety of scraping techniques and workflows. These crawlers make use of common Python scraping tools such as requests, parsel, BeautifulSoup, and the Scrapy framework to extract structured information from web pages....
    Downloads: 2 This Week
  • 5
    Spider

    High-performance Rust web crawler and scraper for large-scale data

    ...Spider can operate concurrently across many pages, allowing it to gather large datasets in a short period of time. Spider also provides mechanisms for subscribing to crawl events so developers can process page data such as URLs, status codes, or HTML content as it is discovered. It supports advanced capabilities such as headless browser rendering, background crawling tasks, and configurable rules that control crawl depth or ignored paths. These capabilities make the project suitable for building search indexers, data extraction pipelines, and SEO analysis tools.
    Downloads: 0 This Week
  • 6
    QueryList

    Progressive PHP web crawler framework with jQuery-like DOM parsing

    ...QueryList supports common data extraction scenarios such as retrieving lists of titles, links, images, and other page elements from structured or semi-structured content. It also includes a powerful HTTP request system that enables complex operations such as simulated logins, proxy usage, and customized request headers. QueryList has a modular architecture that lets developers extend its capabilities through plugins.
    Downloads: 0 This Week
  • 7
    Viral-Clips-Crew

    Your CrewAI Powered Video Editing Assistant

    Viral-Clips-Crew is an AI-driven video processing pipeline designed to generate short-form, engaging clips from long-form video content automatically. It analyzes transcripts and video data to identify the most engaging or “viral” moments, reducing the need for manual editing. The system integrates tools like FFmpeg and AI models to handle segmentation, cropping, and formatting for vertical video platforms. It supports automation workflows that allow creators to produce multiple clips...
    Downloads: 5 This Week
  • 8
    AgenticSeek

    Fully Local Manus AI. No APIs, No $200 monthly bills

    AgenticSeek is a fully local autonomous AI assistant designed as a privacy-focused alternative to cloud-based agent platforms. It runs entirely on the user’s hardware and can autonomously browse the web, write code, and plan multi-step tasks without sending data to external services. The system is optimized for local reasoning models and emphasizes zero cloud dependency to maintain full user control. AgenticSeek includes intelligent agent selection, allowing it to determine the best internal agent to handle a given request. It also supports hands-free workflows such as automated web form interaction and information extraction. ...
    Downloads: 5 This Week
  • 9
    Symfony Panther

    A browser testing and web crawling library for PHP and Symfony

    Symfony Panther is a browser testing and web scraping tool that allows developers to interact with websites programmatically. It uses headless Chrome or Firefox to automate browser tasks, making it suitable for end-to-end testing and data extraction. Panther integrates well with Symfony and PHPUnit, allowing developers to write comprehensive tests for web applications.
    Downloads: 0 This Week
  • 10
    DevHub Application

    A feature-rich offline application

    A feature-rich offline application, carefully crafted to support developers' daily tasks and ensure the highest security for their data. I am actively developing it with a bold goal in mind: to release updates weekly. I strive to maintain a lean footprint, aiming to curate an extensive collection comprising over 100 utilities, providing developers with a diverse array of tools. This initiative reflects my commitment to continuous improvement, offering rich tools to empower developers. DevHub...
    Downloads: 0 This Week
  • 11
    h265web.js

    A HEVC/H.265 Web Player

    h265web.js is a WebAssembly-powered video decoding library designed to enable playback and processing of H.265/HEVC video streams directly in web browsers without relying on native browser codec support. It provides a low-level decoding API that allows developers to build custom video players capable of handling raw H.265 streams, which are typically not widely supported natively in browsers. The project includes components for parsing H.265 bitstreams into NAL units and decoding them into...
    Downloads: 0 This Week
  • 12
    MCiSEE

    All of Minecraft, EASILY get Minecraft resources

    MCiSEE is an open-source project designed to integrate Minecraft with computer vision and artificial intelligence experiments. The system focuses on capturing visual information from the game environment and exposing it to external programs for analysis or machine learning research. By converting gameplay data into visual or structured formats, MCiSEE enables researchers and developers to build AI agents capable of interacting with the Minecraft environment. The project can be used as a...
    Downloads: 0 This Week
  • 13
    chrome-cdp

    Give your AI agent access to your live Chrome session

    chrome-cdp-skill is a specialized integration that enables AI agents to control and interact with web browsers through the Chrome DevTools Protocol (CDP). It allows agents to perform tasks such as navigating pages, extracting data, interacting with elements, and executing scripts in a browser environment. The project is designed to extend the capabilities of AI systems beyond static knowledge by giving them real-time access to web content and interactive interfaces. Its architecture likely...
    Downloads: 3 This Week
  • 14
    HyperAgent

    AI Browser Automation

    ...Built on top of Playwright, the framework allows developers to automate complex browser interactions using natural language commands rather than fragile selectors or hard-coded scripts. Instead of manually writing logic for clicking elements, extracting data, or navigating web pages, developers can instruct the agent in plain language and allow the AI layer to interpret and execute the task. This approach reduces the brittleness commonly associated with traditional automation scripts that break when the DOM structure changes. HyperAgent includes APIs such as page.ai() and page.extract() that allow structured data extraction and dynamic task execution through AI reasoning.
    Downloads: 1 This Week
  • 15
    Docling

    Get your documents ready for gen AI

    Docling is an open-source document processing toolkit built to prepare diverse content types for modern generative AI and data workflows. The project focuses on converting and parsing many document formats into a unified structured representation that downstream systems can easily consume. It supports advanced PDF understanding, including layout detection, table extraction, and reading order analysis, enabling high-fidelity document intelligence pipelines.
    Downloads: 0 This Week
  • 16
    Instaloader

    Download pictures (or videos) along with their captions

    Instaloader is a mature open-source utility for downloading and archiving Instagram content along with rich metadata. It enables users to retrieve posts, stories, reels, highlights, profile pictures, and associated information such as captions, comments, timestamps, and geotags. The tool supports both public and permitted private content when proper authentication is provided, making it useful for research, digital archiving, and social media analysis. Instaloader can be run as a simple...
    Downloads: 6 This Week
  • 17
    DrissionPage

    Python based web automation tool. Powerful and elegant

    DrissionPage is a Python-based automation framework that blends the capabilities of Selenium for browser automation with Requests-HTML for fast, headless web data extraction. It enables seamless switching between browser-controlled and headless HTTP sessions within the same interface. Ideal for web scraping, testing, and automation, DrissionPage is lightweight and highly efficient, offering more flexibility than standard Selenium or Requests usage alone.
    Downloads: 0 This Week
  • 18
    Actors MCP Server

    Model Context Protocol (MCP) Server for Apify's Actors

    The Apify Actors MCP Server is a Model Context Protocol (MCP) server that enables AI assistants to interact with Apify Actors. This integration allows AI models to utilize various web scraping and automation tools provided by Apify, facilitating tasks such as data extraction and web automation.
    Downloads: 0 This Week
  • 19
    DeepCamera

    Open-Source AI Camera. Empower any camera/CCTV

    ...SharpAI yolov7_reid is an open-source Python application that leverages AI technologies to detect intruders with traditional surveillance cameras. It uses Yolov7 as a person detector, FastReID for person feature extraction, Milvus as a local vector database for self-supervised learning to identify unseen persons, and Label Studio to host images locally for further use, such as labeling data and training your own classifier. It also integrates with Home Assistant to bring AI technology to smart homes.
    Downloads: 10 This Week
  • 20
    Scweet

    Scrape tweets, profiles, followers and following from Twitter/X

    Scweet is a Python-based Twitter/X scraping library and CLI designed to collect tweets, profile timelines, followers, following lists, and user profile data without requiring the official Twitter/X API or a developer account. Instead of depending on deprecated unauthenticated scraping methods, it works by using X’s web GraphQL API together with authenticated browser cookies, which gives it a more current and practical approach for data extraction. The project supports a broad set of collection patterns, including searches by keyword, hashtag, user, date range, engagement thresholds, language, and location, making it useful for research, monitoring, and data gathering workflows. ...
    Downloads: 0 This Week
  • 21
    sharp

    High performance Node.js image processing module

    ...Colour spaces, embedded ICC profiles and alpha transparency channels are all handled correctly. Lanczos resampling ensures quality is not sacrificed for speed. As well as image resizing, operations such as rotation, extraction, compositing and gamma correction are available. Most modern macOS, Windows and Linux systems running Node.js v10+ do not require any additional install or runtime dependencies. This module supports reading JPEG, PNG, WebP, AVIF, TIFF, GIF and SVG images. Output images can be in JPEG, PNG, WebP, AVIF and TIFF formats as well as uncompressed raw pixel data. ...
    Downloads: 0 This Week
  • 22
    txtai

    Build AI-powered semantic search applications

    ...Innovation is happening at a rapid pace; models can understand concepts in documents, audio, images and more. txtai provides machine-learning pipelines to run extractive question-answering, zero-shot labeling, transcription, translation, summarization and text extraction. Its cloud-native architecture scales out with container orchestration systems (e.g., Kubernetes). Applications range from similarity search to complex NLP-driven data extraction that generates structured databases.
    Downloads: 0 This Week
  • 23
    Adversarial Robustness Toolbox

    Adversarial Robustness Toolbox (ART) - Python Library for ML security

    ...ART provides tools that enable developers and researchers to evaluate, defend, certify and verify Machine Learning models and applications against the adversarial threats of Evasion, Poisoning, Extraction, and Inference. ART supports all popular machine learning frameworks (TensorFlow, Keras, PyTorch, MXNet, scikit-learn, XGBoost, LightGBM, CatBoost, GPy, etc.), all data types (images, tables, audio, video, etc.) and machine learning tasks (classification, object detection, generation, certification, etc.).
    Downloads: 0 This Week
  • 24
    Browserbase MCP Server

    Allow LLMs to control a browser with Browserbase and Stagehand

    Browserbase MCP Server is a server implementation of the Model Context Protocol (MCP) that enables large language models to interact with web browsers programmatically through cloud-based automation. The project provides a standardized interface for connecting AI systems to real-world web environments, allowing them to navigate pages, extract structured data, and perform user-like actions such as clicking, typing, and form submission. It leverages Browserbase infrastructure along with...
    Downloads: 0 This Week
  • 25
    FlexLLMGen

    Running large language models on a single GPU

    FlexLLMGen is an open-source inference engine designed to run large language models efficiently on limited hardware resources such as a single GPU. The system focuses on high-throughput generation workloads where large batches of text must be processed quickly, such as large-scale data extraction or document analysis tasks. Instead of requiring expensive multi-GPU systems, the framework uses techniques such as memory offloading, compression, and optimized batching to run large models on commodity hardware. The architecture distributes computation and memory usage across the GPU, CPU, and disk in order to maximize the number of tokens processed during inference. ...
    Downloads: 0 This Week