Showing 117 open source projects for "extract"

View related business solutions
  • Try Google Cloud Risk-Free With $300 in Credit Icon
    Try Google Cloud Risk-Free With $300 in Credit

    No hidden charges. No surprise bills. Cancel anytime.

    Use your credit across every product. Compute, storage, AI, analytics. When it runs out, 20+ products stay free. You only pay when you choose to.
    Start Free
  • Gemini 3 and 200+ AI Models on One Platform Icon
    Gemini 3 and 200+ AI Models on One Platform

    Access Google's best plus Claude, Llama, and Gemma. Fine-tune and deploy from one console.

    Build generative AI apps with Vertex AI. Switch between models without switching platforms.
    Start Free
  • 1
    ScrapeGraphAI

    ScrapeGraphAI

    Python scraper based on AI

    ...ScrapeGraphAI is a web scraping python library that uses LLM and direct graph logic to create scraping pipelines for websites and local documents (XML, HTML, JSON, Markdown, etc.). Just say which information you want to extract and the library will do it for you.
    Downloads: 4 This Week
    Last Update:
    See Project
  • 2
    OpenAPI.NET

    OpenAPI.NET

    Object model for OpenAPI documents in .NET

    The OpenAPI.NET SDK contains a useful object model for OpenAPI documents in .NET along with common serializers to extract raw OpenAPI JSON and YAML documents from the model. The OpenAPI.NET project holds the base object model for representing OpenAPI documents as .NET objects. Some developers have found the need to write processors that convert other data formats into this OpenAPI.NET object model. We'd like to curate that list of processors in this section of the readme.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 3
    Scrapy

    Scrapy

    A fast, high-level web crawling and web scraping framework

    ...Portable and written in Python, it can run on Windows, Linux, macOS and BSD. Scrapy is powerful, fast and simple, and also easily extensible. Simply write the rules to extract the data, and add new functionality if you wish without having to touch the core. Scrapy does the rest, and can be used in a number of applications. It can be used for data mining, monitoring and automated testing.
    Downloads: 33 This Week
    Last Update:
    See Project
  • 4
    Geziyor

    Geziyor

    Blazing fast Go framework for web crawling and data scraping tasks

    Geziyor is a high-performance web crawling and web scraping framework built for the Go programming language. It is designed to help developers crawl websites and extract structured information from web pages efficiently. It focuses on speed and scalability, allowing large numbers of requests to be processed concurrently. Geziyor supports use cases such as data mining, monitoring web content, and automated testing workflows. It provides a flexible architecture where developers define parsing functions that process responses and extract the desired data. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • MongoDB Atlas runs apps anywhere Icon
    MongoDB Atlas runs apps anywhere

    Deploy in 115+ regions with the modern database for every enterprise.

    MongoDB Atlas gives you the freedom to build and run modern applications anywhere—across AWS, Azure, and Google Cloud. With global availability in over 115 regions, Atlas lets you deploy close to your users, meet compliance needs, and scale with confidence across any geography.
    Start Free
  • 5
    Nyxt

    Nyxt

    The hacker's power-browser

    Out of the box Nyxt ships with tens of features that allow you to quickly analyze, navigate, and extract information from the Internet. Plus, Nyxt is fully hackable- all of its source code can be introspected, modified, and tweaked to your exact specification. Navigate large documents with ease. Utilize the power of running commands against multiple objects to avoid repeating yourself. You can select and close all buffers that match the string "ele".
    Downloads: 6 This Week
    Last Update:
    See Project
  • 6
    Weibo Crawler

    Weibo Crawler

    Python crawler for collecting and downloading Sina Weibo user data

    weibo-crawler is a Python-based data collection tool designed to retrieve information from Sina Weibo user accounts. It automates the process of gathering posts, user profile details, and engagement metrics from one or more target accounts. weibo-crawler can extract comprehensive information about users, including profile attributes such as nickname, follower count, following count, and account metadata. It also captures detailed data about each post, including the content, publishing time, topics, mentions, likes, reposts, and comments. In addition to textual data, the project can download original media from posts, such as images, videos, and Live Photo content. ...
    Downloads: 5 This Week
    Last Update:
    See Project
  • 7
    Hugo Theme Stack

    Hugo Theme Stack

    Card-style Hugo theme designed for bloggers

    Card-style Hugo theme designed for bloggers. Stack is a simple card-style Hugo theme designed for bloggers, some of its features are responsive images support, lazy load images, dark mode, local search, PhotoSwipe integration, archive page template, full native JavaScript, and no jQuery or any other frameworks are used, no CSS framework, keep it simple and minimal, properly cropped thumbnails. Subsection support, table of contents, multilingual mode and RTL support. It's necessary to use...
    Downloads: 1 This Week
    Last Update:
    See Project
  • 8
    CommunityScrapers

    CommunityScrapers

    This is a public repository containing scrapers

    Stash Community Scrapers is a large open-source collection of metadata extraction tools designed to work with the Stash media management platform, enabling automated scraping of content information from various online sources. The repository contains hundreds of scraper definitions written primarily in YAML and Python, each tailored to extract structured metadata such as titles, performers, tags, and media details from specific websites. These scrapers integrate directly into Stash, allowing users to enrich their media libraries with accurate and detailed information without manual entry. The project supports both automatic installation through in-app feeds and manual configuration for advanced use cases. ...
    Downloads: 1 This Week
    Last Update:
    See Project
  • 9
    HeadlessX

    HeadlessX

    The undetected self-hosted browser automation platform

    HeadlessX is an open-source, self-hosted browser automation platform designed to run headless browsers for tasks such as web scraping, automation, and testing. The system provides a centralized service that allows developers to programmatically control browser sessions and extract data from websites through a structured API. It is built using modern technologies including Node.js, Next.js, TypeScript, and Playwright, and uses a specialized browser engine called Camoufox based on Firefox. One of the platform’s goals is to bypass common bot-detection systems by implementing advanced fingerprint spoofing and stealth techniques. ...
    Downloads: 3 This Week
    Last Update:
    See Project
  • AI-generated apps that pass security review Icon
    AI-generated apps that pass security review

    Stop waiting on engineering. Build production-ready internal tools with AI—on your company data, in your cloud.

    Retool lets you generate dashboards, admin panels, and workflows directly on your data. Type something like “Build me a revenue dashboard on my Stripe data” and get a working app with security, permissions, and compliance built in from day one. Whether on our cloud or self-hosted, create the internal software your team needs without compromising enterprise standards or control.
    Try Retool free
  • 10
    watercrawl

    watercrawl

    AI-ready web crawler that extracts and structures website content

    WaterCrawl is an open source web crawling and data extraction platform designed to transform website content into structured data suitable for machine learning and AI workflows. It enables developers and researchers to crawl web pages, extract meaningful information, and convert it into formats that are easier to process and analyze. It provides a modern crawling system that can automatically navigate links, control crawl depth, and collect content from targeted sections of a website. WaterCrawl supports customizable extraction rules so users can focus only on relevant elements while ignoring unnecessary page components. ...
    Downloads: 2 This Week
    Last Update:
    See Project
  • 11
    crawley

    crawley

    The unix-way web crawler

    Crawls web pages and prints any link it can find. Fast HTML SAX-parser (powered by golang.org/x/net/html) Small (below 1500 SLOC), idiomatic, 100% test-covered codebase. Grabs most of useful resources URLs (pics, videos, audios, forms, etc...) Found URLs are streamed to stdout and guaranteed to be unique (with fragments omitted) Scan depth (limited by starting host and path, by default - 0) can be configured. Can crawl rules and sitemaps from robots.txt. Brute mode - scan HTML comments for...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 12
    jsoup

    jsoup

    Java library for working with real-world HTML

    jsoup is a Java library for working with real-world HTML. It provides a very convenient API for fetching URLs and extracting and manipulating data, using the best of HTML5 DOM methods and CSS selectors. jsoup implements the WHATWG HTML5 specification, and parses HTML to the same DOM as modern browsers do. jsoup is designed to deal with all varieties of HTML found in the wild; from pristine and validating, to invalid tag-soup; jsoup will create a sensible parse tree. The parser will make...
    Downloads: 1 This Week
    Last Update:
    See Project
  • 13
    newspaper4k

    newspaper4k

    Python library for scraping and analyzing online news articles easily

    ...It is a continuation and active fork of the original newspaper3k library, which had stopped receiving updates, with the goal of keeping the ecosystem maintained while adding improvements and bug fixes. It provides developers with tools to automatically download web pages, extract the main article content, and collect associated metadata such as titles, authors, images, and publication dates. Newspaper4k also includes natural language processing capabilities that can generate summaries and identify keywords from extracted article text. Newspaper4k supports both single-article extraction and full news site processing, allowing users to build sources representing entire publications and iterate through their articles. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 14
    videodl

    videodl

    Lightweight Python tool for downloading videos from many platforms

    ...It supports numerous video platforms across both Chinese and international streaming ecosystems, enabling users to fetch content from many popular services through a unified interface. Videodl works by implementing platform-specific client modules that extract video information and download links from supported services. Videodl can integrate with external command-line utilities to improve downloading performance, handle streaming formats such as HLS, and manage encrypted or segmented media streams. Additional utilities can also enable faster downloads, resume interrupted transfers, and process complex playlist structures.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 15
    spider_collection

    spider_collection

    Collection of Python web scraping scripts for data extraction tasks

    spider_collection is a collection of Python web crawler scripts created primarily for experimentation, learning, and practical scraping tasks. spider_collection gathers multiple independent spiders designed to collect data from different platforms and services, demonstrating a variety of scraping techniques and workflows. These crawlers make use of common Python scraping tools such as requests, parsel, BeautifulSoup, and the Scrapy framework to extract structured information from web pages. Several scripts also incorporate multi-threading and proxy usage to improve scraping efficiency and help avoid common anti-scraping limitations. In addition to raw data collection, some spiders include basic data processing and analysis using tools such as pandas and simple visualization with matplotlib. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 16
    Spider

    Spider

    High-performance Rust web crawler and scraper for large-scale data

    ...It focuses on speed, concurrency, and reliability by using asynchronous and multi-threaded processing to handle large volumes of web pages. It can rapidly crawl websites to collect links, retrieve page content, and extract structured information from HTML documents. Spider can operate concurrently across many pages, allowing it to gather large datasets in a short period of time. Spider also provides mechanisms for subscribing to crawl events so developers can process page data such as URLs, status codes, or HTML content as it is discovered. It supports advanced capabilities such as headless browser rendering, background crawling tasks, and configurable rules that control crawl depth or ignored paths. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 17
    QueryList

    QueryList

    Progressive PHP web crawler framework with jQuery-like DOM parsing

    QueryList is an extensible PHP web scraping and crawling framework designed to extract and process data from web pages. It provides a simple and expressive API that allows developers to collect structured information from HTML documents using familiar DOM traversal techniques. It is built on top of phpQuery and uses CSS3 selectors similar to those found in jQuery, making it easy for developers to query and manipulate page elements during scraping tasks.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 18
    owllook

    owllook

    Vertical novel search engine with unified reading and tracking tools

    ...It focuses on providing a simple and comfortable reading experience with features such as searching for books, following updates, bookmarking chapters, and maintaining a personal bookshelf. It aggregates results from multiple search engines and applies parsing rules to extract novel metadata, chapters, and content in a consistent format. Owllook also includes functionality for tracking reading history, displaying rankings based on search activity, and recommending books using a similarity-based approach. Owllook is built using asynchronous technologies to support efficient data retrieval and responsive interactions while reading or searching.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 19
    Search by Image

    Search by Image

    Browser extension for reverse image search, available for Chrome

    Search by Image is a powerful browser extension for Safari that makes effortless reverse image searches possible, and comes with support for more than 30 search engines, such as Google, Bing, Yandex, Baidu and TinEye. Search by Image is an open source project made possible thanks to a community of awesome supporters. By purchasing the extension on the Mac App Store, you help support the continued development of the extension. The extension helps journalists and researchers verify the...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 20
    Web Spider, Web Crawler, Email Extractor

    Web Spider, Web Crawler, Email Extractor

    Free Extracts Emails, Phones and custom text from Web using JAVA Regex

    In Files there is WebCrawlerMySQL.jar which supports MySql Connection Free Web Spider & Crawler. Extracts Information from Web by parsing millions of pages. Store data into Derby Database and data are not being lost after force closing the spider. - Free Web Spider , Parser, Extractor, Crawler - Extraction of Emails , Phones and Custom Text from Web - Export to Excel File - Data Saved into Derby and MySQL Database - Written in Java Cross Platform Also See Free email Sender :...
    Downloads: 1 This Week
    Last Update:
    See Project
  • 21

    Media Dock

    Extract and download media from any website with ease and speed.

    Media Extractor is a powerful and easy-to-use Chrome extension designed to help users quickly extract images and videos from any website. Whether you need media for reference, presentations, or personal use, this tool simplifies the process by scanning web pages and providing direct download links. The extension supports a wide range of formats and sources while maintaining a clean, user-friendly interface. It operates securely without tracking your activity and ensures smooth performance without slowing down your browsing experience. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 22
    MBR-Bulk-WP-Detector

    MBR-Bulk-WP-Detector

    A free WP plugin that lets you check unlimited URLs

    MBR Bulk WP Detector is a free WordPress plugin that lets you check unlimited URLs right from your own dashboard. No subscriptions, no URL limits, and your data stays completely private on your server. What Can You Do With It? The basics are simple: Paste a list of URLs (or upload a CSV file), click a button, and boom—you’ve got a clear breakdown of which sites are running WordPress and which aren’t. But it gets better… Turn on Deep Scan mode, and you’ll also discover what...
    Downloads: 8 This Week
    Last Update:
    See Project
  • 23
    ai-scrapper
    🚀 Discover AI Web Scraper! 🚀 Tired of copying and pasting data from websites? I developed a desktop application with Electron and Gemini AI to extract structured data easily and efficiently! 🤖✨
    Downloads: 2 This Week
    Last Update:
    See Project
  • 24
    WFDownloader App

    WFDownloader App

    Free batch downloader for image, wallpaper, video, audio, document,

    Use as an image gallery, wallpaper, audio/music, video, document, and other media bulk downloader from supported websites. Also use to download sequential website urls that have a certain pattern (e.g. image01.png to image100.png). Also use app's built-in site crawler for advanced link search or extraction. There is also special support for forum media and open directory downloading. It's a programmable downloader and also works with password protected sites. Say goodbye to downloading one...
    Leader badge
    Downloads: 316 This Week
    Last Update:
    See Project
  • 25
    OmniPull

    OmniPull

    Just pull anything

    OmniPull is a powerful, cross-platform download manager built with Python and PySide6. It provides a modern, intuitive interface for managing downloads with advanced features like multi-threading, queue management, and media extraction.
    Downloads: 26 This Week
    Last Update:
    See Project
  • Previous
  • You're on page 1
  • 2
  • 3
  • 4
  • 5
  • Next
MongoDB Logo MongoDB