web crawler java free download

Showing 11003 open source projects for "web crawler java"

1

crawler

Collection of JS reverse engineering examples for web scraping study

crawler is a collection of web scraping and JavaScript reverse engineering examples designed for learning how modern websites protect their data and how those protections can be analyzed. It contains many case studies that demonstrate how to analyze and replicate request parameters, cookies, and encryption logic used by real websites. Each directory in the project focuses on a specific target service or scenario, showing how browser network requests and JavaScript code can be studied to reproduce API calls programmatically. ...

Downloads: 1 This Week

Last Update: 11 hours ago
2

AI-Crawler

Crawl a website starting from a URL, find relevant pages

AI Crawler is an experimental AI-powered web crawling and data extraction tool that uses natural language prompts to guide the discovery and retrieval of relevant information across websites. Unlike traditional web scrapers that rely on static selectors and manual scripting, it uses AI to dynamically identify and prioritize pages based on user intent, making it more flexible and resilient to changes in website structure.

Downloads: 5 This Week

Last Update: 2 days ago
3

Weibo Crawler

Python crawler for collecting and downloading Sina Weibo user data

weibo-crawler is a Python-based data collection tool designed to retrieve information from Sina Weibo user accounts. It automates the process of gathering posts, user profile details, and engagement metrics from one or more target accounts. weibo-crawler can extract comprehensive information about users, including profile attributes such as nickname, follower count, following count, and account metadata. It also captures detailed data about each post, including the content, publishing time,...

Downloads: 2 This Week

Last Update: 3 days ago
4

tumblr-crawler

Python crawler to download photos and videos from Tumblr blogs

tumblr-crawler is an open source Python-based utility designed to download media content from Tumblr blogs. It provides a script that automatically retrieves photos and videos from specified Tumblr sites and saves them locally for offline access. Users can specify one or multiple blogs to crawl by editing a configuration file or by passing parameters through the command line. Once executed, the script fetches media from the Tumblr API and stores the downloaded files in folders named after...

Downloads: 1 This Week

Last Update: 2 days ago
5

Spatie Crawler

An easy to use, powerful crawler implemented in PHP

Spatie Crawler is a PHP library that allows developers to crawl websites and extract information efficiently. It can be used for web scraping, link checking, or automated testing of web pages. The library is simple to use and supports customizable crawling strategies, including controlling crawl depth and handling redirects. It’s suitable for building crawlers that navigate large or dynamically generated websites.

Downloads: 0 This Week

Last Update: 2026-03-20
6

WebMagic

A scalable web crawler framework for Java

WebMagic is a simple but scalable crawler framework. It covers the whole crawler lifecycle: downloading, URL management, content extraction, and persistence, which simplifies the development of a specific crawler. It has a simple core with high flexibility and a simple API for HTML extraction. It also provides POJO annotations to customize a crawler, and no configuration is needed. Some other...

Downloads: 0 This Week

Last Update: 2025-02-10
7

GPT Crawler

Crawl a site to generate knowledge files to create your own custom GPT

GPT Crawler is an open-source tool designed to automatically crawl websites and generate structured knowledge that can be used to build AI assistants and retrieval systems. It focuses on extracting high-quality textual content from web pages and preparing it in formats suitable for embedding, indexing, or fine-tuning workflows. The project is especially useful for teams that want to turn documentation sites or knowledge bases into conversational AI backends without building custom scrapers from scratch. ...

Downloads: 0 This Week

Last Update: 2026-03-02
8

EasySpider

A visual no-code/code-free web crawler/spider

A visual code-free/no-code web crawler/spider, supporting both Chinese and English.

Downloads: 5 This Week

Last Update: 2025-01-01
9

Heritrix

Internet Archive's open-source, web-scale, web crawler project

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. Heritrix (sometimes spelled heretrix, or misspelled or missaid as heratrix/heritix/heretix/heratix) is an archaic word for heiress (woman who inherits). Since our crawler seeks to collect and preserve the digital artifacts of our culture for the benefit of future researchers and generations, this name seemed apt.

Downloads: 1 This Week

Last Update: 2026-02-06
10

Java JWT

Java implementation of JSON Web Token (JWT)

A Java implementation of JSON Web Token (JWT) - RFC 7519. This library requires Java 8 or higher. The last version that supported Java 7 was 3.11.0. The library implements JWT Verification and Signing using several algorithms. The Algorithm defines how a token is signed and verified. It can be instantiated with the raw value of the secret in the case of HMAC algorithms, or the key pairs or KeyProvider in the case of RSA and ECDSA algorithms.

Downloads: 10 This Week

Last Update: 2026-02-16
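
The Java JWT entry above says an HMAC algorithm is instantiated from the raw secret and then used to sign and verify tokens. The following is a minimal sketch of that idea using only JDK classes. It is not the library's API: the class name `Hs256Sketch`, its methods, and the demo secret are hypothetical, and it checks only the signature, not the header or any claims.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.util.Base64;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical illustration of HS256 signing; not the Java JWT library's API.
final class Hs256Sketch {
    private static final Base64.Encoder B64 = Base64.getUrlEncoder().withoutPadding();

    // Builds header.payload.signature, where the signature is an HMAC-SHA256
    // over the first two base64url-encoded segments.
    static String sign(String payloadJson, byte[] secret) {
        String header = "{\"alg\":\"HS256\",\"typ\":\"JWT\"}";
        String signingInput = encode(header) + "." + encode(payloadJson);
        return signingInput + "." + B64.encodeToString(hmac(signingInput, secret));
    }

    // Recomputes the HMAC and compares it with the token's signature segment.
    // Assumption: only the signature is checked, not "alg" or any claims.
    static boolean verify(String token, byte[] secret) {
        int lastDot = token.lastIndexOf('.');
        if (lastDot < 0) {
            return false;
        }
        byte[] actual;
        try {
            actual = Base64.getUrlDecoder().decode(token.substring(lastDot + 1));
        } catch (IllegalArgumentException e) {
            return false; // signature segment is not valid base64url
        }
        byte[] expected = hmac(token.substring(0, lastDot), secret);
        return MessageDigest.isEqual(expected, actual); // constant-time comparison
    }

    private static String encode(String text) {
        return B64.encodeToString(text.getBytes(StandardCharsets.UTF_8));
    }

    private static byte[] hmac(String data, byte[] secret) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(secret, "HmacSHA256"));
            return mac.doFinal(data.getBytes(StandardCharsets.UTF_8));
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HmacSHA256 unavailable or key rejected", e);
        }
    }

    public static void main(String[] args) {
        byte[] secret = "demo-secret".getBytes(StandardCharsets.UTF_8); // made-up secret
        String token = sign("{\"sub\":\"crawler\"}", secret);
        System.out.println(token);
        System.out.println(verify(token, secret));
    }
}
```

A real application would normally rely on the library itself, which also handles claim validation and the RSA and ECDSA algorithms mentioned in the entry.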
11

Web Spider, Web Crawler, Email Extractor

Free tool that extracts emails, phone numbers, and custom text from the web using Java regex

Free web spider and crawler that extracts information from the web by parsing millions of pages. The Files section includes WebCrawlerMySQL.jar, which supports a MySQL connection. Data is stored in a Derby database and is not lost if the spider is force-closed.
- Free web spider, parser, extractor, and crawler
- Extraction of emails, phone numbers, and custom text from the web
- Export to Excel file
- Data saved into Derby and MySQL databases
- Written in Java, cross-platform
Also see the free email sender: https://sourceforge.net/projects/gitst-free-email-ender/
Please install Microsoft OpenJDK to start the application: https://www.microsoft.com/openjdk

Downloads: 6 This Week

Last Update: 2025-11-23
12

dxy-covid-19-crawler

Realtime crawler for COVID-19 outbreak statistics from DXY data

DXY-COVID-19-Crawler is a Python-based project designed to collect real-time COVID-19 infection data from the public dataset provided by Ding Xiang Yuan (DXY). The crawler periodically retrieves pandemic statistics and stores them in a database so that historical changes in the outbreak can be preserved and analyzed later. It was created to make up-to-date infection data more accessible for developers, researchers, and analysts who wanted to build visualizations or conduct data analysis...

Downloads: 1 This Week

Last Update: 1 day ago
13

Playwright for Java

Java version of the Playwright testing and automation library

Playwright Java is the Java version of the Playwright testing and automation library, enabling reliable end-to-end testing for modern web applications.

Downloads: 6 This Week

Last Update: 2026-01-28
14

FEAPDER

Powerful Python crawler framework for scalable web scraping tasks

...It also integrates monitoring and alerting capabilities to help developers track crawler performance and detect issues during execution. feapder includes browser rendering support for handling dynamic web pages and provides mechanisms for large-scale data deduplication during crawling.

Downloads: 0 This Week

Last Update: 2026-03-10
15

Java WebSockets

A barebones WebSocket client and server implementation

This repository contains a barebones WebSocket server and client implementation written in 100% Java. The underlying classes are implemented using java.nio, which allows for a non-blocking event-driven model (similar to the WebSocket API for web browsers). The org.java_websocket.server.WebSocketServer abstract class implements the server side of the WebSocket Protocol. A WebSocket server by itself doesn't do anything except establish socket connections through HTTP.

Downloads: 5 This Week

Last Update: 2024-12-15
16

crwlr

Library for Rapid (Web) Crawler and Scraper Development

...Before diving into the library, let's have a look at the terms crawling and scraping. For most real-world use cases, those two things go hand in hand, which is why this library helps with and combines both. A (web) crawler is a program that (down)loads documents and follows the links in them to load those as well. A crawler could simply load every link it finds (and is allowed to load according to the robots.txt file), in which case it would load the whole internet (provided the URL(s) it starts with are not dead ends). Or it can be restricted to load only links matching certain criteria (on same domain/host, URL path starts with "/foo",...) or only to a certain depth. ...

Downloads: 0 This Week

Last Update: 2026-01-05
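
The crwlr entry above describes a crawler that follows links but can be restricted to the same host or to a certain depth. Here is a minimal Java sketch of that restriction logic, under stated assumptions: the class `CrawlSketch` and its `crawl` method are hypothetical, pages are represented by an in-memory map from URL to outgoing links rather than fetched over the network, and robots.txt handling is left out entirely.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of a depth- and host-restricted crawl.
final class CrawlSketch {

    // Breadth-first traversal from start. A link is followed only if it is on
    // the same host as start and its source page is shallower than maxDepth.
    // Assumption: links stands in for "download the page and extract its links".
    static List<String> crawl(String start, int maxDepth, Map<String, List<String>> links) {
        String host = URI.create(start).getHost();
        List<String> visited = new ArrayList<>();
        Map<String, Integer> depthOf = new HashMap<>(); // doubles as the "seen" set
        Deque<String> frontier = new ArrayDeque<>();

        depthOf.put(start, 0);
        frontier.add(start);
        while (!frontier.isEmpty()) {
            String url = frontier.poll();
            visited.add(url);
            int depth = depthOf.get(url);
            if (depth == maxDepth) {
                continue; // depth limit reached, do not follow this page's links
            }
            for (String next : links.getOrDefault(url, List.of())) {
                boolean sameHost = host != null && host.equals(URI.create(next).getHost());
                if (sameHost && !depthOf.containsKey(next)) {
                    depthOf.put(next, depth + 1);
                    frontier.add(next);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // Made-up link graph for demonstration.
        Map<String, List<String>> links = Map.of(
            "https://example.com/", List.of("https://example.com/a", "https://other.example/x"),
            "https://example.com/a", List.of("https://example.com/b"));
        System.out.println(crawl("https://example.com/", 1, links));
    }
}
```

Without the host check and depth limit, the loop would keep following every link it discovers, which is the "load the whole internet" case the description mentions.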
17

fess

Open source enterprise search server for websites, files, and data

...Fess is built on top of OpenSearch and offers an integrated solution for crawling, indexing, and searching documents from websites, file systems, and various data stores. Fess includes a built-in crawler that can collect content from sources such as databases, CSV files, and shared storage, making it suitable for centralized knowledge discovery. It supports indexing and searching across many document formats including office documents, PDFs, and compressed archives. It also provides a web-based administrative interface that allows administrators to configure crawling targets, manage indexing tasks, and adjust search settings from a graphical dashboard.

Downloads: 6 This Week

Last Update: 2026-03-11
18

Spider

High-performance Rust web crawler and scraper for large-scale data

Spider is a high-performance web crawler and web scraping library written in Rust that enables developers to crawl and index websites efficiently. It focuses on speed, concurrency, and reliability by using asynchronous and multi-threaded processing to handle large volumes of web pages. It can rapidly crawl websites to collect links, retrieve page content, and extract structured information from HTML documents.

Downloads: 3 This Week

Last Update: 3 days ago
19

Python API for JMComic

Python crawler and API for downloading JMComic albums and images

JMComic-Crawler-Python is a Python library and crawler framework designed to programmatically access and download comic content from the JMComic platform. It provides a structured API that allows developers to retrieve albums, chapters, and images using simple Python code while handling the necessary network requests and data processing behind the scenes.

Downloads: 3 This Week

Last Update: 6 days ago
20

Java JWT JSON

Java JWT: JSON Web Token for Java and Android

JJWT aims to be the easiest-to-use and understand library for creating and verifying JSON Web Tokens (JWTs) and JSON Web Keys (JWKs) on the JVM and Android. JJWT is a pure Java implementation based exclusively on the JOSE Working Group RFC specifications.

Downloads: 1 This Week

Last Update: 2025-08-20
21

katana

Fast CLI web crawler for discovering endpoints in modern web apps

Katana is an open source command-line web crawling and spidering framework developed by ProjectDiscovery. It is designed to efficiently crawl websites and web applications in order to discover endpoints, resources, and other useful information that may not be easily visible through manual browsing. Katana focuses on speed and automation, making it suitable for use in security reconnaissance workflows and automated pipelines. Katana supports both standard HTTP crawling and headless browser...

Downloads: 9 This Week

Last Update: 2026-03-10
22

Testcontainers Java

Testcontainers is a Java library that supports JUnit tests

Testcontainers for Java is a Java library that supports JUnit tests, providing lightweight, throwaway instances of common databases, Selenium web browsers, or anything else that can run in a Docker container. Use a containerized instance of a MySQL, PostgreSQL or Oracle database to test your data access layer code for complete compatibility, but without requiring complex setup on developers' machines and safe in the knowledge that your tests will always start with a known DB state. ...

Downloads: 0 This Week

Last Update: 2026-03-19
23

Crawl4AI

Open-source LLM Friendly Web Crawler & Scraper

Crawl4AI is a high-performance, AI‑ready web crawler tailored for LLM data ingestion and RAG pipelines. It supports adaptive crawling heuristics (stopping when enough info is gathered), structured markdown output, and high-speed parallel execution. It is designed to operate at scale, with optional Docker deployment and framework integrations.

Downloads: 0 This Week

Last Update: 2026-03-18
24

Snap Lens Web Crawler

Crawl and download Snap Lenses from lens.snapchat.com with ease.

Crawl and download Snap Lenses from lens.snapchat.com with ease. This crawler is a dependency of Snap Camera Server https://snap-camera-server.sourceforge.io

Downloads: 0 This Week

Last Update: 2025-07-18
25

Useful Java links

A list of useful Java frameworks, libraries, software and hello worlds

...These resources cover many areas of software development, including web frameworks, testing libraries, concurrency tools, build systems, microservices architectures, and development best practices. By grouping links into categorized sections, the repository allows developers to quickly discover relevant technologies and learning materials for building Java applications. The project is maintained as a living reference library that evolves alongside the Java ecosystem as new frameworks and development tools emerge.

Downloads: 0 This Week

Last Update: 2026-03-10