The archive-crawler project is building Heritrix: a flexible, extensible, robust, and scalable web crawler capable of fetching, archiving, and analyzing the full diversity and breadth of internet-accessible content.
Features
- deeply and thoroughly harvests website content
- works on any Java platform (Linux recommended)
- stores content to ARC or ISO WARC aggregate/transcript format
- web interface for operator control and monitoring of crawls
License
Apache License V2.0, GNU Library or Lesser General Public License version 2.0 (LGPLv2)
User Reviews
- Cool
- Cool.
- Useful project. Thanks
- Great software, thank you.
- The app works well in my PC. Serves its purpose too, so no regrets for me.