From: Alex O. <no...@gi...> - 2024-11-30 15:13:51
Branch: refs/heads/groovy-config
Home: https://github.com/internetarchive/heritrix3
Commit: 7a57afad45f382c86bca804c63ac3836a7e018c7
https://github.com/internetarchive/heritrix3/commit/7a57afad45f382c86bca804c63ac3836a7e018c7
Author: Alex Osborne <aos...@nl...>
Date: 2024-12-01 (Sun, 01 Dec 2024)

Changed paths:
M commons/pom.xml
M commons/src/main/java/org/archive/spring/PathSharingContext.java
A commons/src/test/java/org/archive/spring/PathSharingContextTest.java
A commons/src/test/resources/org/archive/spring/PathSharingContextTestBeans.cxml
A commons/src/test/resources/org/archive/spring/PathSharingContextTestBeans.groovy
M engine/src/main/java/org/archive/crawler/framework/CrawlJob.java
M engine/src/main/java/org/archive/crawler/framework/Engine.java
A engine/src/main/resources/org/archive/crawler/restlet/profile-crawler-beans.groovy
A engine/src/test/java/org/archive/crawler/restlet/ProfileCrawlerBeansTest.java
M modules/pom.xml
M pom.xml

Log Message:
-----------
Add Groovy crawl configs

This enables crawl configuration files to use Spring's [Groovy Bean Definition DSL] as an optional alternative to Spring XML. It uses the same bean configuration model but the syntax is more terse and human-readable. No more need for `&amp;` in seed URLs. :-)

```groovy
checkpointService(CheckpointService) {
    checkpointIntervalMinutes = 15
    checkpointsDir = 'checkpoints'
    forgetAllButLatest = true
}
```

It also enables some powerful scripting capabilities. For example, defining a custom DecideRule directly in the crawl scope:

```groovy
scope(DecideRuleSequence) {
    rules = [
        new RejectDecideRule(),
        // ACCEPT everything linked from a .pdf file
        new PredicatedDecideRule() {
            boolean evaluate(CrawlURI uri) {
                return uri.via?.path?.endsWith(".pdf")
            }
        },
        // ...
    ]
}
```

The main downsides are defining nested inner beans can be a bit awkward, some of the errors can be cryptic, and you can't just manipulate the config files with an XML parser.

This commit includes a Groovy version of the default crawl profile for reference, but doesn't expose a way to use it in the UI yet. For now, you need to manually create a `crawler-beans.groovy` file in your job directory.

[Groovy Bean Definition DSL]: https://docs.spring.io/spring-framework/reference/core/beans/basics.html#beans-factory-groovy
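
As a rough illustration of that manual step, a hand-written `crawler-beans.groovy` might begin like the sketch below. It reuses the `checkpointService` bean from the log message. The enclosing `beans { }` closure follows the Spring Groovy DSL documentation, and the import's package name is an assumption, so check both against the bundled `profile-crawler-beans.groovy`. A working job needs the full set of beans from that profile, not just this one.

```groovy
// Hypothetical fragment of <job directory>/crawler-beans.groovy.
// Assumption: CheckpointService lives in org.archive.crawler.framework.
import org.archive.crawler.framework.CheckpointService

// Assumption: bean definitions sit inside the standard Spring `beans { }` closure.
beans {
    // Bean name first, then the bean class, then its properties.
    checkpointService(CheckpointService) {
        checkpointIntervalMinutes = 15
        checkpointsDir = 'checkpoints'
        forgetAllButLatest = true
    }
}
```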