The archive-crawler project is building Heritrix: a flexible, extensible, robust, and scalable web crawler capable of fetching, archiving, and analyzing the full diversity and breadth of internet-accessible content.

Features

  • deeply and thoroughly harvests website content
  • works on any Java platform (Linux recommended)
  • stores content to ARC or ISO WARC aggregate/transcript format
  • web interface for operator control and monitoring of crawls

User Ratings

  • 5 stars: 21
  • 4 stars: 0
  • 3 stars: 0
  • 2 stars: 0
  • 1 star: 0

  • ease: 4 / 5
  • features: 4 / 5
  • design: 4 / 5
  • support: 4 / 5

User Reviews

  • Cool
  • Cool.
  • Useful project. Thanks
  • Great software, thank you.
  • The app works well in my PC. Serves its purpose too, so no regrets for me.

Additional Project Details

Languages

English

Intended Audience

Advanced End Users, Developers, Education, Government, Information Technology, Non-Profit Organizations

User Interface

Web-based

Programming Language

Java

Database Environment

Berkeley/Sleepycat/Gdbm (DBM)

Registered

2003-02-11