From: Alex O. <no...@gi...> - 2025-01-22 03:21:09
  Branch: refs/heads/master
  Home:   https://github.com/internetarchive/heritrix3
  Commit: 4e8bda1a0757de7fd46485f0c94f8eadb241e1ec
      https://github.com/internetarchive/heritrix3/commit/4e8bda1a0757de7fd46485f0c94f8eadb241e1ec
  Author: Alex Osborne <aos...@nl...>
  Date:   2025-01-13 (Mon, 13 Jan 2025)

  Changed paths:
    M engine/src/main/resources/org/archive/crawler/restlet/profile-crawler-beans.cxml
    M engine/src/main/resources/org/archive/crawler/restlet/profile-crawler-beans.groovy
    M modules/src/main/java/org/archive/modules/extractor/ExtractorHTML.java
    M modules/src/main/java/org/archive/modules/extractor/JerichoExtractorHTML.java
    M modules/src/test/java/org/archive/modules/extractor/ExtractorHTMLTest.java

  Log Message:
  -----------
  ExtractorHTML: Add obeyRelNofollow option

When enabled this option causes regular links annotated with rel=nofollow to not be extracted. This is useful for sites that use rel=nofollow to hint crawler traps.


  Commit: a4c98e4d27c97a8f2927b4cec30d03e3d72f5195
      https://github.com/internetarchive/heritrix3/commit/a4c98e4d27c97a8f2927b4cec30d03e3d72f5195
  Author: Alex Osborne <aos...@nl...>
  Date:   2025-01-22 (Wed, 22 Jan 2025)

  Changed paths:
    M engine/src/main/resources/org/archive/crawler/restlet/profile-crawler-beans.cxml
    M engine/src/main/resources/org/archive/crawler/restlet/profile-crawler-beans.groovy
    M modules/src/main/java/org/archive/modules/extractor/ExtractorHTML.java
    M modules/src/main/java/org/archive/modules/extractor/JerichoExtractorHTML.java
    M modules/src/test/java/org/archive/modules/extractor/ExtractorHTMLTest.java

  Log Message:
  -----------
  Merge pull request #638 from internetarchive/rel-nofollow

ExtractorHTML: Add obeyRelNofollow option


Compare: https://github.com/internetarchive/heritrix3/compare/4c4510a36485...a4c98e4d27c9

To unsubscribe from these emails, change your notification settings at https://github.com/internetarchive/heritrix3/settings/notifications