Heritrix: Internet Archive Web Crawler
gojomo, ia_igor, iasf-admin, jlee-archive, johnerik, kristinn_sig, nlevitt, paul_jack, rstata, stack-sf, szznax
Description
The archive-crawler project is building Heritrix: a flexible, extensible, robust, and scalable web crawler capable of fetching, archiving, and analyzing the full diversity and breadth of internet-accessible content.
Features
- deeply and thoroughly harvests website content
- works on any Java platform (Linux recommended)
- stores content to ARC or ISO WARC aggregate/transcript format
- web interface for operator control and monitoring of crawls
User Reviews
- Nice and easy to use.
- The best program for sharing
- Awesome job, thank you so much for posting
- good job
- very good project