Showing 148 open source projects for "crawl"

  • 1
    X-Crawl

    Flexible Node.js AI-assisted crawler library

    A high-performance web crawling and scraping framework for Node.js, designed for large-scale data extraction.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 2
    Dungeon Crawl: Stone Soup

    A game of dungeon exploration, combat and magic

    Crawl also sports a number of systems and mechanics that diverge from more traditional roguelikes. For example, there's less emphasis on character classes (or backgrounds, in Crawl parlance); your character is defined more by their skills, species, and choice of deity than by their background. Crawl includes an in-game tutorial and manual, with the aim of providing something playable without the aid of a guide or wiki.
    Downloads: 11 This Week
    Last Update:
    See Project
  • 3
    Firecrawl

    Turn entire websites into LLM-ready markdown or structured data

    Crawl and convert any website into LLM-ready markdown or structured data. Built by Mendable.ai and the Firecrawl community, it includes powerful scraping, crawling, and data extraction capabilities. Firecrawl is an API service that takes a URL, crawls it, and converts it into clean markdown or structured data. We crawl all accessible subpages and give you clean data for each (an example request sketch follows this entry).
    Downloads: 13 This Week
    Last Update:
    See Project
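Since Firecrawl is exposed as a hosted API, a single HTTP request is enough to get markdown back. The sketch below is a hedged Python illustration using the requests package: the endpoint URL, the "formats" field, and the response shape are assumptions based on Firecrawl's public documentation rather than details from this listing, and the API key is a placeholder.

```python
# Hedged sketch of calling Firecrawl's hosted scrape API. The endpoint, payload
# fields, and response shape below are assumptions; consult the official docs.
import os
import requests

API_KEY = os.environ.get("FIRECRAWL_API_KEY", "fc-...")  # placeholder key

resp = requests.post(
    "https://api.firecrawl.dev/v1/scrape",                         # assumed endpoint
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"url": "https://example.com", "formats": ["markdown"]},  # assumed payload
    timeout=60,
)
resp.raise_for_status()
print(resp.json())  # expected to contain the page rendered as clean markdown
```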
  • 4
    Nebula libp2p DHT

    A libp2p DHT crawler, monitor, and measurement tool

    ...The crawler can store its results as JSON documents or in a Postgres database; the --dry-run flag prevents it from doing either, and Nebula will instead print a summary of the crawl at the end. A crawl takes ~5-10 min depending on your internet connection. You can also specify the network you want to crawl by appending, e.g., --network FILECOIN, and limit the number of peers to crawl by providing the --limit flag.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 5
    Spatie Crawler

    An easy-to-use, powerful crawler implemented in PHP

    Spatie Crawler is a PHP library that allows developers to crawl websites and extract information efficiently. It can be used for web scraping, link checking, or automated testing of web pages. The library is simple to use and supports customizable crawling strategies, including controlling crawl depth and handling redirects. It's suitable for building crawlers that navigate large or dynamically generated websites (a generic depth-limited crawl sketch follows this entry).
    Downloads: 0 This Week
    Last Update:
    See Project
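Spatie Crawler itself is PHP, so the snippet below is not its API. It is a minimal, generic Python sketch of the depth-limited crawling idea the description mentions, using the third-party requests and beautifulsoup4 packages; the start URL, depth limit, and same-host rule are illustrative choices only.

```python
# Generic depth-limited, breadth-first crawl. This is NOT Spatie Crawler (a PHP
# library); it only illustrates the "control crawl depth" idea in Python.
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(start_url, max_depth=2):
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # depth limit reached, do not expand further
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue  # skip pages that fail to load
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"]).split("#")[0]
            # stay on the starting host, similar to a typical crawl profile
            if urlparse(link).netloc == urlparse(start_url).netloc and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return seen

if __name__ == "__main__":
    for found in sorted(crawl("https://example.com", max_depth=2)):
        print(found)
```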
  • 6
    Spider

    High-performance Rust web crawler and scraper for large-scale data

    ...Spider also provides mechanisms for subscribing to crawl events so developers can process page data such as URLs, status codes, or HTML content as it is discovered. It supports advanced capabilities such as headless browser rendering, background crawling tasks, and configurable rules that control crawl depth or ignored paths. These capabilities make the project suitable for building search indexers, data extraction pipelines, and SEO analysis tools.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 7
    Heritrix

    Internet Archive's open-source, web-scale, web crawler project

    ...Since our crawler seeks to collect and preserve the digital artifacts of our culture for the benefit of future researchers and generations, this name seemed apt. Heritrix is designed to respect robots.txt exclusion directives and META nofollow tags. Please consider the load your crawl will place on seed sites and set politeness policies accordingly. Also, always identify your crawl with contact information in the User-Agent so sites that may be adversely affected by your crawl can contact you or adapt their server behavior accordingly (a short politeness sketch follows this entry).
    Downloads: 0 This Week
    Last Update:
    See Project
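Heritrix is a Java application, so the following is not its API. It is a small standard-library Python sketch of the two politeness practices the entry highlights: checking robots.txt before fetching and identifying the crawl with contact information in the User-Agent. The domain and contact address are illustrative.

```python
# Not Heritrix itself; a stdlib sketch of the politeness practices named above:
# honor robots.txt and identify your crawl with contact info in the User-Agent.
import urllib.robotparser
import urllib.request

USER_AGENT = "example-archival-crawler/0.1 (+mailto:crawl-ops@example.org)"  # illustrative
TARGET = "https://example.com/some/page"

robots = urllib.robotparser.RobotFileParser("https://example.com/robots.txt")
robots.read()  # fetch and parse the exclusion rules

if robots.can_fetch(USER_AGENT, TARGET):
    req = urllib.request.Request(TARGET, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req, timeout=10) as resp:
        print(f"fetched {len(resp.read())} bytes from {TARGET}")
else:
    print("robots.txt disallows this URL for our crawler; skipping")
```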
  • 8
    GPT Crawler

    Crawl a site to generate knowledge files to create your own custom GPT

    GPT Crawler is an open-source tool designed to automatically crawl websites and generate structured knowledge that can be used to build AI assistants and retrieval systems. It focuses on extracting high-quality textual content from web pages and preparing it in formats suitable for embedding, indexing, or fine-tuning workflows. The project is especially useful for teams that want to turn documentation sites or knowledge bases into conversational AI backends without building custom scrapers from scratch. ...
    Downloads: 2 This Week
    Last Update:
    See Project
  • 9
    Laravel Sitemap

    Create and generate sitemaps with ease

    ...This works by crawling your entire site. The generator has the ability to execute JavaScript on each page, so links injected into the DOM by JavaScript will be crawled as well. The easiest way is to crawl the given domain and generate a sitemap with all found links. The destination of the sitemap should be specified by $path. If you don't want a crawled link to appear in the sitemap, just don't return it in the callable you pass to hasCrawled. You can also instruct the underlying crawler not to crawl some pages by passing a callable to shouldCrawl. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 10
    watercrawl

    AI-ready web crawler that extracts and structures website content

    WaterCrawl is an open source web crawling and data extraction platform designed to transform website content into structured data suitable for machine learning and AI workflows. It enables developers and researchers to crawl web pages, extract meaningful information, and convert it into formats that are easier to process and analyze. It provides a modern crawling system that can automatically navigate links, control crawl depth, and collect content from targeted sections of a website. WaterCrawl supports customizable extraction rules so users can focus only on relevant elements while ignoring unnecessary page components. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 11
    AI-Crawler

    Crawl a website starting from a URL, find relevant pages

    ...Unlike traditional web scrapers that rely on static selectors and manual scripting, it uses AI to dynamically identify and prioritize pages based on user intent, making it more flexible and resilient to changes in website structure. Users can define their data requirements in plain English, and the system will interpret those instructions to crawl a domain and extract structured data. The tool supports output formats such as JSON and Markdown, and it can generate or accept schemas to ensure that extracted data is structured according to application needs. It is designed as a low-code solution, reducing the complexity of building and maintaining custom scraping pipelines.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 12
    Douyin TikTok Download API

    ...You can deploy or adapt this project yourself to add more functionality, call scraper.py directly in your own project, or install the existing pip package as a parsing library to crawl data easily. It supports inputting a Douyin or TikTok user homepage to crawl the author's homepage video data (watermark-free links), liked video list (the list must be public), video comment data, background music video list data, and more.
    Downloads: 4 This Week
    Last Update:
    See Project
  • 13
    The Web MCP

    A powerful Model Context Protocol (MCP) server

    Bright Data’s Web MCP server gives AI assistants robust, real-time web capabilities through an MCP interface designed to avoid blocks, rate limits, and CAPTCHAs. It presents search, crawl, navigate, and extraction tools that agents can call directly, replacing brittle scraping prompts with typed operations. The README markets it as a “gateway” to the live web so assistants don’t fall back to stale training data. Bright Data also advertises a getting-started tier with a free monthly allotment, plus options for remote or self-hosted operation depending on governance needs. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 14
    Search1API MCP

    A Model Context Protocol (MCP) server

    The Search1API MCP Server is a Model Context Protocol server that provides search and crawl functionality using Search1API. It enables web and news searches, content extraction, and sitemap retrieval, integrating seamlessly with MCP clients.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 15
    Web-Check

    All-in-one OSINT tool for analysing any website

    ...Get an insight into the inner-workings of a given website: uncover potential attack vectors, analyse server architecture, view security configurations, and learn what technologies a site is using. Currently the dashboard will show: IP info, SSL chain, DNS records, cookies, headers, domain info, search crawl rules, page map, server location, redirect ledger, open ports, traceroute, DNS security extensions, site performance, trackers, associated hostnames, carbon footprint. Stay tuned, as I'll add more soon. The aim is to help you easily understand, optimize and secure your website.
    Downloads: 6 This Week
    Last Update:
    See Project
  • 16
    crawley

    The unix-way web crawler

    ...Found URLs are streamed to stdout and guaranteed to be unique (with fragments omitted). Scan depth (limited by starting host and path; 0 by default) can be configured. It can crawl rules and sitemaps from robots.txt. A brute mode scans HTML comments for URLs (this can lead to bogus results). It makes use of the HTTP_PROXY / HTTPS_PROXY environment variables and handles proxy auth, and offers a directory-only scan mode (aka fast-scan).
    Downloads: 3 This Week
    Last Update:
    See Project
  • 17
    Scrapy

    A fast, high-level web crawling and web scraping framework

    Scrapy is a fast, open source, high-level framework for crawling websites and extracting structured data from them. Portable and written in Python, it can run on Windows, Linux, macOS, and BSD. Scrapy is powerful, fast, and simple, and also easily extensible: simply write the rules to extract the data (a minimal spider sketch follows this entry), and add new functionality if you wish without having to touch the core. Scrapy does the rest, and can be used in a number of applications. It can be used for data mining, monitoring...
    Downloads: 14 This Week
    Last Update:
    See Project
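As a concrete example of the "write the rules to extract the data" workflow, here is a minimal Scrapy spider. The target site and CSS selectors are the ones used in Scrapy's own tutorial, not details from this listing.

```python
# Minimal Scrapy spider: scrape quotes and follow pagination links.
# Run with: scrapy runspider quotes_spider.py -o quotes.json
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # yield one item per quote block on the page
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # follow the "next" link until pagination runs out
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```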
  • 18
    StringZilla

    10x faster string search, split, sort, and shuffle for long strings

    ...The contents of that file would remain immutable, and the mapping can be shared by multiple Python processes simultaneously. A standard dataset pre-processing use case would be to map a sizeable textual dataset like Common Crawl into memory, spawn child processes, and split the job between them (a standard-library sketch of this pattern follows this entry).
    Downloads: 4 This Week
    Last Update:
    See Project
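The memory-mapping workflow described above can be sketched with the Python standard library alone; the snippet below does not use StringZilla's own API, and the file name, needle, and worker count are illustrative. Each worker maps the same file and scans only its slice, so the mapped pages are shared by the operating system rather than copied per process.

```python
# Stdlib sketch of the pattern described above: memory-map one large text file
# and split substring counting across worker processes. Not StringZilla's API;
# matches that straddle a slice boundary are ignored in this simplified version.
import mmap
import os
from multiprocessing import Pool

PATH = "common_crawl_sample.txt"  # illustrative path to a large text dump
NEEDLE = b"crawl"

def count_in_slice(bounds):
    start, end = bounds
    hits = 0
    with open(PATH, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        pos = mm.find(NEEDLE, start, end)
        while pos != -1:
            hits += 1
            pos = mm.find(NEEDLE, pos + 1, end)
    return hits

if __name__ == "__main__":
    size = os.path.getsize(PATH)
    workers = 4
    step = size // workers + 1
    slices = [(i, min(i + step, size)) for i in range(0, size, step)]
    with Pool(workers) as pool:
        print(sum(pool.map(count_in_slice, slices)))
```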
  • 19
    Puppeteer

    Headless Chrome Node.js API

    ...However, it can also be set to run full (non-headless) Chrome or Chromium; simply set the headless option when launching a browser. Many of the things you can do manually in the browser you can also do with Puppeteer, such as generating page screenshots and PDFs, crawling a Single-Page Application, testing Chrome extensions, and more.
    Downloads: 6 This Week
    Last Update:
    See Project
  • 20
    katana

    Fast CLI web crawler for discovering endpoints in modern web apps

    Katana is an open source command-line web crawling and spidering framework developed by ProjectDiscovery. It is designed to efficiently crawl websites and web applications in order to discover endpoints, resources, and other useful information that may not be easily visible through manual browsing. Katana focuses on speed and automation, making it suitable for use in security reconnaissance workflows and automated pipelines. Katana supports both standard HTTP crawling and headless browser crawling, allowing it to navigate modern web applications that rely heavily on JavaScript. ...
    Downloads: 4 This Week
    Last Update:
    See Project
  • 21
    Deep Research

    Use any LLMs (Large Language Models) for Deep Research

    Deep Research is a local-first research agent that orchestrates multiple LLMs to generate in-depth reports in minutes. It combines “thinking” and “task” model roles with live internet access to plan, search, read, and synthesize findings into structured outputs. The project emphasizes privacy: processing and storage happen locally, avoiding server-side retention of your queries and notes. A simple web UI lets you enter topics and configure models, while the backend streams progress as...
    Downloads: 1 This Week
    Last Update:
    See Project
  • 22
    tumblr-crawler

    Python crawler to download photos and videos from Tumblr blogs

    ...It provides a script that automatically retrieves photos and videos from specified Tumblr sites and saves them locally for offline access. Users can specify one or multiple blogs to crawl by editing a configuration file or by passing parameters through the command line. Once executed, the script fetches media from the Tumblr API and stores the downloaded files in folders named after each blog. tumblr-crawler avoids re-downloading files that have already been saved, making repeated runs safe and useful for recovering missing media. ...
    Downloads: 2 This Week
    Last Update:
    See Project
  • 23
    Geziyor

    Blazing fast Go framework for web crawling and data scraping tasks

    Geziyor is a high-performance web crawling and web scraping framework built for the Go programming language. It is designed to help developers crawl websites and extract structured information from web pages efficiently. It focuses on speed and scalability, allowing large numbers of requests to be processed concurrently. Geziyor supports use cases such as data mining, monitoring web content, and automated testing workflows. It provides a flexible architecture where developers define parsing functions that process responses and extract the desired data. ...
    Downloads: 2 This Week
    Last Update:
    See Project
  • 24
    Scope Sentry

    Cyberspace asset mapping and vulnerability scanning platform

    ...ScopeSentry combines multiple reconnaissance and vulnerability assessment capabilities such as subdomain enumeration, port scanning, directory scanning, and sensitive information detection. ScopeSentry can automatically identify assets and services, extract URLs, and crawl websites to collect useful security data for further analysis. It also includes vulnerability scanning and subdomain takeover detection to help identify common security weaknesses across web infrastructure. It supports distributed scanning with multiple nodes, allowing large scanning tasks to be performed efficiently across different systems.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 25
    BettaFish

    Public opinion analysis system

    BettaFish is an open-source, multi-agent public opinion analysis system built to automate the collection, deep analysis, and reporting of social media data at scale through conversational queries. It uses a modular architecture of specialized agents that collaborate to crawl mainstream platforms, extract multimodal content like text and short video, and synthesize insights through both statistical and large language model techniques. With a design that lets users pose questions in natural language and receive structured reports, charts, and visualizations, the system aims to break information cocoons and provide comprehensive views of trends and public sentiment. ...
    Downloads: 0 This Week
    Last Update:
    See Project