From: Alex O. <no...@gi...> - 2025-01-13 06:40:36
Branch: refs/heads/rel-nofollow
Home:   https://github.com/internetarchive/heritrix3
Commit: 4e8bda1a0757de7fd46485f0c94f8eadb241e1ec
    https://github.com/internetarchive/heritrix3/commit/4e8bda1a0757de7fd46485f0c94f8eadb241e1ec
Author: Alex Osborne <aos...@nl...>
Date:   2025-01-13 (Mon, 13 Jan 2025)

Changed paths:
  M engine/src/main/resources/org/archive/crawler/restlet/profile-crawler-beans.cxml
  M engine/src/main/resources/org/archive/crawler/restlet/profile-crawler-beans.groovy
  M modules/src/main/java/org/archive/modules/extractor/ExtractorHTML.java
  M modules/src/main/java/org/archive/modules/extractor/JerichoExtractorHTML.java
  M modules/src/test/java/org/archive/modules/extractor/ExtractorHTMLTest.java

Log Message:
-----------
ExtractorHTML: Add obeyRelNofollow option

When enabled this option causes regular links annotated with
rel=nofollow to not be extracted. This is useful for sites that use
rel=nofollow to hint crawler traps.
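
For readers unfamiliar with the behaviour described in the log message, here is a minimal Java sketch of the idea. It is not the code from this commit: the class RelNofollowSketch and its methods are hypothetical, and the regex-based tag parsing is a simplification chosen only for illustration. The one name taken from the commit is the obeyRelNofollow flag; how ExtractorHTML actually applies it may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only; this is NOT Heritrix's ExtractorHTML code.
// All names here are hypothetical except the obeyRelNofollow flag name.
class RelNofollowSketch {
    // Matches an opening <a ...> tag and captures its attribute text.
    private static final Pattern ANCHOR =
            Pattern.compile("<a\\s+([^>]*)>", Pattern.CASE_INSENSITIVE);
    // Match quoted href / rel attribute values (simplified: quoted values only).
    private static final Pattern HREF = Pattern.compile(
            "(?:^|\\s)href\\s*=\\s*[\"']([^\"']*)[\"']", Pattern.CASE_INSENSITIVE);
    private static final Pattern REL = Pattern.compile(
            "(?:^|\\s)rel\\s*=\\s*[\"']([^\"']*)[\"']", Pattern.CASE_INSENSITIVE);

    // True if the rel attribute contains the token "nofollow".
    static boolean hasNofollow(String attrs) {
        Matcher rel = REL.matcher(attrs);
        if (!rel.find()) {
            return false;
        }
        // rel holds a space-separated list of tokens, compared case-insensitively.
        for (String token : rel.group(1).trim().split("\\s+")) {
            if (token.equalsIgnoreCase("nofollow")) {
                return true;
            }
        }
        return false;
    }

    // Collects href values; skips rel=nofollow anchors when the flag is set.
    static List<String> extractLinks(String html, boolean obeyRelNofollow) {
        List<String> links = new ArrayList<>();
        Matcher anchor = ANCHOR.matcher(html);
        while (anchor.find()) {
            String attrs = anchor.group(1);
            Matcher href = HREF.matcher(attrs);
            if (!href.find()) {
                continue; // anchor without a quoted href
            }
            if (obeyRelNofollow && hasNofollow(attrs)) {
                continue; // link is annotated rel=nofollow: do not extract
            }
            links.add(href.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"/a\">A</a> "
                + "<a rel=\"nofollow\" href=\"/trap\">T</a>";
        System.out.println(extractLinks(html, true));
        System.out.println(extractLinks(html, false));
    }
}
```

With the flag off, every anchor's href is collected as before; with it on, anchors whose rel token list includes nofollow are skipped, which is how a site could steer a crawler away from trap URLs.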