From: Alex O. <no...@gi...> - 2025-01-13 06:26:42
Branch: refs/heads/rel-nofollow
Home:   https://github.com/internetarchive/heritrix3
Commit: 46fc1ab085a186ebbbee580f3bc7f1e78f06b294
    https://github.com/internetarchive/heritrix3/commit/46fc1ab085a186ebbbee580f3bc7f1e78f06b294
Author: Alex Osborne <aos...@nl...>
Date:   2025-01-13 (Mon, 13 Jan 2025)

Changed paths:
  M engine/src/main/resources/org/archive/crawler/restlet/profile-crawler-beans.cxml
  M engine/src/main/resources/org/archive/crawler/restlet/profile-crawler-beans.groovy
  M modules/src/main/java/org/archive/modules/extractor/ExtractorHTML.java
  M modules/src/test/java/org/archive/modules/extractor/ExtractorHTMLTest.java

Log Message:
-----------
ExtractorHTML: Add obeyRelNofollow option

When enabled this option causes regular links annotated with rel=nofollow
to not be extracted. This is useful for sites that use rel=nofollow to
hint crawler traps.
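
For readers who want a concrete picture of the behaviour the log message describes, here is a minimal, self-contained Java sketch. It is not the code from this commit and does not use Heritrix classes: the class `RelNofollowSketch`, the methods `hasNofollow` and `extractLinks`, and the regex-based parsing are all hypothetical, chosen only to illustrate the idea that an anchor whose `rel` attribute contains the `nofollow` token is skipped when the option is on.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: a hypothetical stand-in, not Heritrix's ExtractorHTML.
final class RelNofollowSketch {
    // Simplified patterns; they assume quoted attribute values.
    private static final Pattern ANCHOR =
            Pattern.compile("<a\\s+([^>]*)>", Pattern.CASE_INSENSITIVE);
    private static final Pattern HREF =
            Pattern.compile("\\bhref\\s*=\\s*[\"']([^\"']*)[\"']", Pattern.CASE_INSENSITIVE);
    private static final Pattern REL =
            Pattern.compile("\\brel\\s*=\\s*[\"']([^\"']*)[\"']", Pattern.CASE_INSENSITIVE);

    // rel holds a space-separated token list, so test each token on its own.
    public static boolean hasNofollow(String attributes) {
        Matcher rel = REL.matcher(attributes);
        if (!rel.find()) {
            return false;
        }
        for (String token : rel.group(1).trim().split("\\s+")) {
            if (token.equalsIgnoreCase("nofollow")) {
                return true;
            }
        }
        return false;
    }

    // Collects href values from <a> tags, skipping rel=nofollow anchors
    // when obeyRelNofollow is true.
    public static List<String> extractLinks(String html, boolean obeyRelNofollow) {
        List<String> links = new ArrayList<>();
        Matcher anchor = ANCHOR.matcher(html);
        while (anchor.find()) {
            String attributes = anchor.group(1);
            Matcher href = HREF.matcher(attributes);
            if (!href.find()) {
                continue; // anchor without a quoted href
            }
            if (obeyRelNofollow && hasNofollow(attributes)) {
                continue; // treated as a possible crawler-trap hint
            }
            links.add(href.group(1));
        }
        return links;
    }
}
```

With the flag off, the sketch returns every anchor's `href`; with it on, anchors such as `<a rel="nofollow noopener" href="/trap">` are left out. How the real option is wired into the crawler beans is defined by the changed files listed above, not by this sketch.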