Browse free open source Python Web Scrapers and projects below. Use the toggles on the left to filter open source Python Web Scrapers by OS, license, language, programming language, and project status.

  • $300 Free Credits to Build on Google Cloud Icon
    $300 Free Credits to Build on Google Cloud

    New to Google Cloud? Get $300 in credits to explore Compute Engine, BigQuery, Cloud Run, Gemini Enterprise Agent Platform, and more.

    Start your next project with $300 in free Google Cloud credit. Spin up VMs, run containers, query petabytes in BigQuery, or build agents with Gemini Enterprise Agent Platform. Once your credits are used, keep building with 20+ always-free tier products including Compute Engine, Cloud Storage, GKE, and Cloud Run functions. No commitment required—just sign up and start building.
    Claim $300 Free
  • Secure File Transfer for Windows with Cerberus by Redwood Icon
    Secure File Transfer for Windows with Cerberus by Redwood

    Protect and share files over FTP/S, SFTP, HTTPS and SCP with the #1 rated Windows file transfer server.

    Cerberus supports unlimited users and connections on a single IP, with built-in encryption, 2FA, and a browser-based web client — all deployable in under 15 minutes with a 25-day free trial.
    Try for Free
  • 1
    Kemono Downloader

    Kemono Downloader

    Kemono Downloader - A cross-platform Python app built with PyQt6

    Welcome to Kemono Downloader, a versatile Python-based desktop application built with PyQt6, designed to download content from Kemono.su. This tool enables users to archive individual posts or entire creator profiles from services like Patreon, Fanbox, and more, supporting a wide range of file types with customizable settings and advanced features.
    Leader badge
    Downloads: 820 This Week
    Last Update:
    See Project
  • 2
    Scrapy

    Scrapy

    A fast, high-level web crawling and web scraping framework

    Scrapy is a fast, open source, high-level framework for crawling websites and extracting structured data from these websites. Portable and written in Python, it can run on Windows, Linux, macOS and BSD. Scrapy is powerful, fast and simple, and also easily extensible. Simply write the rules to extract the data, and add new functionality if you wish without having to touch the core. Scrapy does the rest, and can be used in a number of applications. It can be used for data mining, monitoring and automated testing.
    Downloads: 25 This Week
    Last Update:
    See Project
  • 3
    Bili23 Downloader

    Bili23 Downloader

    Cross platform GUI tool for downloading videos from Bilibili sites

    Bili23-Downloader is an open source desktop application designed for downloading video content from the Bilibili platform. It provides a graphical interface that allows users to download various types of media including user-uploaded videos, series episodes, movies, and other hosted content. It focuses on ease of use with a zero-configuration setup, making it accessible to both beginners and experienced users. It supports high performance downloads through multi-threading and includes resume capabilities so interrupted downloads can continue without starting over. It can parse different types of links such as standard video pages, short links, and collection or activity pages to automatically retrieve downloadable media. It also allows users to choose video resolution, audio quality, and encoding format based on the available sources. Additional features include downloading subtitles, comments, metadata, and artwork associated with videos.
    Downloads: 24 This Week
    Last Update:
    See Project
  • 4
    Scylla

    Scylla

    Intelligent proxy pool for collecting and managing public proxies

    Scylla is an open source proxy pool system designed to collect, validate, and manage large numbers of public proxy servers for use in web scraping and data extraction workflows. It automatically crawls the internet to discover proxy IP addresses and evaluates their availability and reliability before adding them to a usable pool. It includes a JSON API that allows developers and applications to retrieve proxy information programmatically, making it easier to integrate proxy rotation into scraping tools or automation scripts. Scylla also runs a built-in HTTP forward proxy server that can dynamically select a recently validated proxy whenever a request is made. In addition to the API, the system provides a web-based interface where users can view available proxies and monitor their global distribution through a visual dashboard. It is commonly used by developers who need scalable proxy management when gathering data from the internet or building datasets for machine learning.
    Downloads: 17 This Week
    Last Update:
    See Project
  • Ship Agents Faster Icon
    Ship Agents Faster

    Transform your applications and workflows into powerful agentic systems at global scale.

    Gemini Enterprise Agent Platform lets you rapidly build, scale, govern and optimize production-ready agents grounded in your organization's data. The platform enables developers to build custom or pre-built agents for virtually any use case. New customers get $300 in free credits.
    Get Started Free
  • 5
    bilibili-manga-downloader

    bilibili-manga-downloader

    Download and manage Bilibili Manga chapters with GUI downloader

    BiliBili-Manga-Downloader is an open source desktop application designed to download manga chapters from the Bilibili Manga platform for offline reading and local management. It was created to address limitations of the web reading experience, such as intrusive advertisements, inconvenient image zooming, and inconsistent navigation during reading sessions. It provides a graphical user interface that allows users to search for manga titles using keywords, view detailed information about available series, and select chapters to download. BiliBili-Manga-Downloader supports multi-threaded downloading to improve performance and includes progress tracking with estimated time remaining for active downloads. It also offers multiple output formats, allowing chapters to be saved as image folders or compressed comic archive formats suitable for local manga readers.
    Downloads: 11 This Week
    Last Update:
    See Project
  • 6
    bilili

    bilili

    Command-line Bilibili video and danmaku downloader with batch support

    bilili is a command-line tool designed to download videos and related content from the Bilibili video platform. It focuses on enabling users to retrieve user-uploaded videos as well as serialized content such as bangumi episodes directly from the terminal environment. It provides automated downloading capabilities that handle video streams and associated data efficiently while minimizing manual interaction. bilili supports retrieving both the video files and danmaku comments, which are the scrolling overlay comments commonly associated with the platform’s videos. These danmaku comments can be automatically converted into ASS subtitle format for playback compatibility with media players. bilili also implements multi-threaded and segmented downloading techniques to improve download performance and reliability. Additionally, it supports resuming interrupted downloads so partially downloaded content can continue without restarting from the beginning.
    Downloads: 9 This Week
    Last Update:
    See Project
  • 7
    MDCx

    MDCx

    Movie metadata scraper and organizer for media libraries and NFO

    MDCx is an open source media metadata scraping and organization tool designed to automate the process of collecting detailed information for movie files. It retrieves metadata from multiple online sources and applies it to local media collections, helping users maintain structured and well-organized libraries. MDCx can download information such as titles, cast data, artwork, and other metadata, then generate standardized NFO files compatible with media management systems. It also supports image processing tasks such as downloading and cropping artwork used by media centers. It includes several interfaces, allowing users to operate it through a graphical desktop application, a browser-based web interface, or command-line utilities depending on their workflow. Its architecture separates core scraping logic from the user interfaces, allowing the same metadata processing system to be reused across different modes.
    Downloads: 7 This Week
    Last Update:
    See Project
  • 8
    Snoop Project

    Snoop Project

    This is the most powerful software taking into account CIS location

    Snoop is an open data intelligence tool (OSINT world). Snoop Project is one of the most promising OSINT tools for finding nicknames. This is the most powerful software taking into account the CIS location. Is your life slideshow? Ask Snoop. Snoop project is developed without taking into account the opinions of the NSA and their friends, that is, it is available to the average user. Snoop is a research work (own database / closed bugbounty) in the field of searching and processing public data on the Internet. In terms of specialized search, Snoop is able to compete with traditional search engines.
    Downloads: 7 This Week
    Last Update:
    See Project
  • 9
    grab-site

    grab-site

    Web crawler for archiving and backing up sites into WARC archives

    grab-site is an open source web crawling tool designed to archive and back up websites by recursively downloading their content. It works by taking a starting URL and systematically following links across the site, capturing pages and resources and saving them into WARC archive files for long-term preservation. Internally, the crawler uses a fork of the wpull engine to fetch and process web pages efficiently during large-scale crawls. grab-site includes a built-in dashboard that displays real-time crawl activity, including which URLs are currently being processed and how many remain in the queue. Users can dynamically apply ignore patterns during an active crawl, allowing them to skip problematic or unnecessary URLs that could slow down or block the archiving process. grab-site also provides predefined ignore sets for common site structures such as forums and other complex web platforms. Additional mechanisms like duplicate page detection help avoid re-crawling identical content.
    Downloads: 5 This Week
    Last Update:
    See Project
  • Error to trace to log to deploy. One click. No SSH. Icon
    Error to trace to log to deploy. One click. No SSH.

    Catch the cause before the pager goes off.

    AppSignal links every error to the trace, the trace to the log, the log to the deploy that shipped it.
    Free 30 days.
  • 10
    sqliv

    sqliv

    Massive SQL injection vulnerability scanner for automated web testing

    SQLiv is a command-line security tool designed to identify SQL injection vulnerabilities in web applications through automated scanning techniques. Written primarily in Python, the project focuses on discovering potentially vulnerable web pages by analyzing URLs that contain database query parameters. It can perform large-scale scanning by using search engine queries known as SQL injection dorks to collect candidate websites and then test them for vulnerabilities. In addition to bulk scanning, SQLiv supports targeted analysis of specific domains or individual URLs, allowing security researchers to focus on particular web applications. When a domain is supplied, the scanner can crawl the site to gather URLs with parameters and evaluate them for potential SQL injection weaknesses. SQLiv also supports reverse domain scanning to locate other websites hosted on the same server, which can then be examined for similar vulnerabilities.
    Downloads: 5 This Week
    Last Update:
    See Project
  • 11
    finvizfinance

    finvizfinance

    Finviz analysis python library

    finvizfinance is a package that collects financial information from FinViz website. Stock charts, fundamental & technical information, insider information and stock news. Forex charts and performance. Crypto charts and performance. Screener and Group provide data frames for comparing stocks according to different filters and trading signals. Getting information (fundament, description, outer rating, stock news, inside trader) of an individual stock.
    Downloads: 4 This Week
    Last Update:
    See Project
  • 12
    Scrapling

    Scrapling

    An adaptive Web Scraping framework

    Scrapling is an adaptive web scraping framework designed to handle everything from a single HTTP request to large-scale, concurrent crawls. Built for modern websites, it intelligently adapts to structural changes by automatically relocating elements when page layouts update. The framework includes advanced fetchers capable of bypassing anti-bot protections such as Cloudflare Turnstile using stealth and browser automation techniques. Its powerful spider system supports multi-session crawling, pause and resume functionality, and real-time streaming of scraped data. Scrapling combines high performance, memory efficiency, and extensive async support to deliver blazing-fast scraping workflows. With a developer-friendly API, CLI tools, MCP server integration for AI-assisted extraction, and Docker support, it offers a complete solution for modern web scrapers.
    Downloads: 3 This Week
    Last Update:
    See Project
  • 13
    SimpDL

    SimpDL

    A tool to scrape images from SimpCity

    SimpDL is an open-source media downloading tool designed to retrieve content from subscription-based or creator platforms, focusing on simplicity and ease of use. It enables users to download images, videos, and other media associated with specific creators or accounts, often through authenticated sessions. The project emphasizes a straightforward workflow where users provide login credentials or tokens, and the tool handles the retrieval and storage of content automatically. It is designed to reduce the complexity of manual downloading while still offering flexibility in how content is saved and organized. SimpDL typically supports batch downloads, allowing users to archive entire profiles or content collections efficiently. The tool is often used for offline access or backup purposes, especially for platforms where content may be time-limited.
    Downloads: 3 This Week
    Last Update:
    See Project
  • 14
    douyin

    douyin

    Open source Douyin crawler for collecting and downloading public data

    DouyinCrawler is an open source data collection tool designed to gather publicly available information from the Douyin platform. It demonstrates how to build a Python-based web crawler combined with a graphical interface and command line functionality. It allows users to collect data from various types of Douyin content, including user profiles, videos, hashtags, and music pages. DouyinCrawler supports both automated scraping and batch operations to process multiple targets efficiently. It also integrates with the Aria2 download utility to enable large-scale downloading of videos and images associated with collected content. It includes multiple usage modes such as a desktop GUI, a web service interface, and a command line tool for flexible deployment. In addition to data collection, it supports incremental updates so users can track and gather newly published content without reprocessing previously collected data.
    Downloads: 3 This Week
    Last Update:
    See Project
  • 15
    Crawl4AI

    Crawl4AI

    Open-source LLM Friendly Web Crawler & Scraper

    Crawl4AI is a high-performance, AI‑ready web crawler tailored for LLM data ingestion and RAG pipelines. It supports adaptive crawling heuristics (stopping when enough info is gathered), structured markdown output, and high-speed parallel execution. Designed to operate at scale with optional Docker deployment and framework integrations.
    Downloads: 2 This Week
    Last Update:
    See Project
  • 16
    ECommerceCrawlers

    ECommerceCrawlers

    Collection of Python ecommerce and website crawler examples projects

    ECommerceCrawlers is a collection of practical Python web crawler projects designed to gather data from a variety of ecommerce platforms, websites, and online services. It aggregates many independent crawler examples created by contributors and organized into separate subprojects that target specific sites or data sources. These examples demonstrate how to build and operate web scrapers capable of collecting structured information such as product listings, news content, job postings, social media data, and other publicly available web data. It aims to help developers understand the full workflow of web scraping, including request simulation, data extraction, storage, and handling anti-scraping techniques. It includes crawlers for platforms such as ecommerce marketplaces, blogging platforms, recruitment sites, and social networks, providing real-world practice scenarios. Developers can study the individual project documentation to understand the analysis process.
    Downloads: 2 This Week
    Last Update:
    See Project
  • 17
    ProxyBroker

    ProxyBroker

    Asynchronous tool for finding and checking public proxy servers

    ProxyBroker is an open source Python tool designed to automatically discover and verify public proxy servers from many online sources. It operates asynchronously, allowing it to gather and test large numbers of proxies efficiently while performing multiple checks concurrently. It collects proxy addresses from dozens of providers and evaluates whether they are functional and suitable for use. It supports several proxy protocols, including HTTP, HTTPS, SOCKS4, and SOCKS5, making it flexible for different networking and scraping scenarios. ProxyBroker can filter proxies based on criteria such as anonymity level, response time, country of origin, and DNS blacklist status. In addition to discovering and validating proxies, it can also function as a proxy server that distributes incoming requests across a rotating pool of working proxies. This capability allows users to route traffic through multiple proxies automatically, which can help with tasks.
    Downloads: 2 This Week
    Last Update:
    See Project
  • 18
    Python API for JMComic

    Python API for JMComic

    Python crawler and API for downloading JMComic albums and images

    JMComic-Crawler-Python is a Python library and crawler framework designed to programmatically access and download comic content from the JMComic platform. It provides a structured API that allows developers to retrieve albums, chapters, and images using simple Python code while handling the necessary network requests and data processing behind the scenes. It supports both web-based and mobile API interfaces, enabling flexible interaction with the platform depending on the available endpoints. Its architecture includes components for configuration management, download orchestration, and client communication, allowing users to automate the retrieval of manga chapters or entire albums. It includes command-line functionality and configuration files so users can customize download behavior, directory structures, and performance settings without modifying code. It also supports plugin-based extensions that allow additional processing.
    Downloads: 2 This Week
    Last Update:
    See Project
  • 19
    Toapi

    Toapi

    Convert websites into structured APIs automatically with Python tool

    Toapi is a Python library designed to transform ordinary websites into usable API services. Instead of building a traditional web crawler that collects and stores data before exposing it through an API, Toapi simplifies the process by allowing developers to define data structures that automatically generate an API layer from existing web pages. It works by parsing HTML content from a source site and mapping selected elements into structured data that can be returned as JSON through API endpoints. Developers define items and routes that determine how web pages are parsed and how the resulting data is exposed through the API interface. It also includes mechanisms for caching both page content and API requests, helping reduce repeated network calls and improving performance. Because the generated service is built on top of a Flask application, it can be deployed like any other Flask-based project and integrated into existing Python workflows.
    Downloads: 2 This Week
    Last Update:
    See Project
  • 20
    crawler

    crawler

    Collection of JS reverse engineering examples for web scraping study

    crawler is a collection of web scraping and JavaScript reverse engineering examples designed for learning how modern websites protect their data and how those protections can be analyzed. It contains many case studies that demonstrate how to analyze and replicate request parameters, cookies, and encryption logic used by real websites. Each directory in the project focuses on a specific target service or scenario, showing how browser network requests and JavaScript code can be studied to reproduce API calls programmatically. Many examples illustrate techniques such as debugging scripts, intercepting requests, analyzing encrypted parameters, and understanding authentication flows. crawler also explores common anti-scraping defenses and demonstrates how developers can examine them through debugging tools and reverse engineering techniques.
    Downloads: 2 This Week
    Last Update:
    See Project
  • 21
    dirhunt

    dirhunt

    Web crawler that finds hidden web directories without brute force

    Dirhunt is an open source security tool designed to discover web directories and analyze website structures without relying on brute-force techniques. Instead of sending large numbers of guess-based requests, it operates as a specialized crawler that intelligently explores websites to identify accessible or hidden directories. Dirhunt can detect directories that expose “Index Of” listings, which may reveal files and other resources that were not intended to be publicly visible. It can also identify situations where directories are intentionally hidden through empty index files or servers that return misleading responses such as fake 404 errors. Dirhunt processes HTML pages and other available sources to discover additional paths and directories while minimizing the number of requests sent to the server, making scans faster and less intrusive. It supports scanning multiple targets at the same time and allows results to be filtered, analyzed, and exported for further review.
    Downloads: 2 This Week
    Last Update:
    See Project
  • 22
    owllook

    owllook

    Vertical novel search engine with unified reading and tracking tools

    Owllook is an open source vertical search engine designed for discovering and reading online novels from multiple sources. Instead of redirecting users to different sites, the system parses content from many novel platforms and presents it in a unified reading interface. It focuses on providing a simple and comfortable reading experience with features such as searching for books, following updates, bookmarking chapters, and maintaining a personal bookshelf. It aggregates results from multiple search engines and applies parsing rules to extract novel metadata, chapters, and content in a consistent format. Owllook also includes functionality for tracking reading history, displaying rankings based on search activity, and recommending books using a similarity-based approach. Owllook is built using asynchronous technologies to support efficient data retrieval and responsive interactions while reading or searching.
    Downloads: 2 This Week
    Last Update:
    See Project
  • 23
    videodl

    videodl

    Lightweight Python tool for downloading videos from many platforms

    Videodl is a lightweight video downloader implemented entirely in Python that allows users to retrieve videos from a wide range of online media platforms. It focuses on providing a fast and simple way to parse video pages and download media files, often prioritizing high-definition versions without watermarks when available. It supports numerous video platforms across both Chinese and international streaming ecosystems, enabling users to fetch content from many popular services through a unified interface. Videodl works by implementing platform-specific client modules that extract video information and download links from supported services. Videodl can integrate with external command-line utilities to improve downloading performance, handle streaming formats such as HLS, and manage encrypted or segmented media streams. Additional utilities can also enable faster downloads, resume interrupted transfers, and process complex playlist structures.
    Downloads: 2 This Week
    Last Update:
    See Project
  • 24
    Crawlab

    Crawlab

    Distributed web crawler admin platform for spiders management

    Golang-based distributed web crawler management platform, supporting various languages including Python, NodeJS, Go, Java, PHP and various web crawler frameworks including Scrapy, Puppeteer, Selenium. Please use docker-compose to one-click to start up. By doing so, you don't even have to configure MongoDB database. The frontend app interacts with the master node, which communicates with other components such as MongoDB, SeaweedFS and worker nodes. Master node and worker nodes communicate with each other via gRPC (a RPC framework). Tasks are scheduled by the task scheduler module in the master node, and received by the task handler module in worker nodes, which executes these tasks in task runners. Task runners are actually processes running spider or crawler programs, and can also send data through gRPC (integrated in SDK) to other data sources, e.g. MongoDB.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 25
    CyberScraper 2077

    CyberScraper 2077

    A Powerful web scraper powered by LLM | OpenAI, Gemini & Ollama

    CyberScraper 2077 is not just another web scraping tool – it's a glimpse into the future of data extraction. Born from the neon-lit streets of a cyberpunk world, this AI-powered scraper uses OpenAI, Gemini and LocalLLM Models to slice through the web's defenses, extracting the data you need with unparalleled precision and style.
    Downloads: 1 This Week
    Last Update:
    See Project
  • Previous
  • You're on page 1
  • 2
  • 3
  • 4
  • Next
Auth0 Logo