The archive-crawler project is building Heritrix: a flexible, extensible, robust, and scalable web crawler capable of fetching, archiving, and analyzing the full diversity and breadth of internet-accessible content.

Features

  • deeply and thoroughly harvests website content
  • works on any Java platform (Linux recommended)
  • stores content to ARC or ISO WARC aggregate/transcript format
  • web interface for operator control and monitoring of crawls

License

GNU Library or Lesser General Public License version 2.0 (LGPLv2), Apache License V2.0

User Ratings

  • 5 stars: 21
  • 4 stars: 0
  • 3 stars: 0
  • 2 stars: 0
  • 1 star: 0

  • ease: 4 / 5
  • features: 4 / 5
  • design: 4 / 5
  • support: 4 / 5

User Reviews

  • Cool
  • Cool.
  • Useful project. Thanks
  • Great software, thank you.
  • The app works well in my PC. Serves its purpose too, so no regrets for me.

Additional Project Details

Operating Systems

Linux

Languages

English

Intended Audience

Non-Profit Organizations, Government, Information Technology, Education, Advanced End Users, Developers

User Interface

Web-based

Programming Language

Java

Database Environment

Berkeley/Sleepycat/Gdbm (DBM)

Related Categories

Java Library Management Software, Java Archiving Software, Java Web Scrapers

Registered

2003-02-12