The DeDuplicator is an add-on module (plug-in) for the web crawler Heritrix. It offers a means to reduce the amount of duplicate data collected in a series of snapshot crawls.

License

GNU Library or Lesser General Public License version 2.0 (LGPLv2)

Additional Project Details

Languages

English

Intended Audience

Advanced End Users, Developers, System Administrators

User Interface

Plugins

Programming Language

Java

Related Categories

Java Internet Software, Java Web Scrapers

Registered

2006-11-06