The archive-crawler project is building Heritrix: a flexible, extensible, robust, and scalable web crawler capable of fetching, archiving, and analyzing the full diversity and breadth of internet-accessible content.

Features

  • deeply and thoroughly harvests website content
  • works on any Java platform (Linux recommended)
  • stores content to ARC or ISO WARC aggregate/transcript format
  • provides a web interface for operator control and monitoring of crawls

License

GNU Library or Lesser General Public License version 2.0 (LGPLv2), Apache License V2.0

User Ratings

5 stars: 21
4 stars: 0
3 stars: 0
2 stars: 0
1 star: 0

ease: 4 / 5
features: 4 / 5
design: 4 / 5
support: 4 / 5

Additional Project Details

Operating Systems

Linux

Languages

English

Intended Audience

Non-Profit Organizations, Government, Information Technology, Education, Advanced End Users, Developers

User Interface

Web-based

Programming Language

Java

Database Environment

Berkeley/Sleepycat/Gdbm (DBM)

Related Categories

Java Library Management Software, Java Archiving Software, Java Web Scrapers

Registered

2003-02-12