[Archive-crawler-cvs] [internetarchive/heritrix3] 7a57af: Add Groovy crawl configs

  Branch: refs/heads/groovy-config
  Home:   https://github.com/internetarchive/heritrix3
  Commit: 7a57afad45f382c86bca804c63ac3836a7e018c7
      https://github.com/internetarchive/heritrix3/commit/7a57afad45f382c86bca804c63ac3836a7e018c7
  Author: Alex Osborne <aos...@nl...>
  Date:   2024-12-01 (Sun, 01 Dec 2024)

  Changed paths:
    M commons/pom.xml
    M commons/src/main/java/org/archive/spring/PathSharingContext.java
    A commons/src/test/java/org/archive/spring/PathSharingContextTest.java
    A commons/src/test/resources/org/archive/spring/PathSharingContextTestBeans.cxml
    A commons/src/test/resources/org/archive/spring/PathSharingContextTestBeans.groovy
    M engine/src/main/java/org/archive/crawler/framework/CrawlJob.java
    M engine/src/main/java/org/archive/crawler/framework/Engine.java
    A engine/src/main/resources/org/archive/crawler/restlet/profile-crawler-beans.groovy
    A engine/src/test/java/org/archive/crawler/restlet/ProfileCrawlerBeansTest.java
    M modules/pom.xml
    M pom.xml

  Log Message:
  -----------
  Add Groovy crawl configs

This enables crawl configuration files to use Spring's [Groovy Bean Definition DSL] as an optional alternative to Spring XML. It uses the same bean configuration model but the syntax is more terse and human-readable. No more need for `&amp;` in seed URLs. :-)

```groovy
checkpointService(CheckpointService) {
    checkpointIntervalMinutes = 15
    checkpointsDir = 'checkpoints'
    forgetAllButLatest = true
}
```
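
To illustrate the `&amp;` remark, here is a minimal sketch of a seed list in the Groovy DSL. It is not copied from the bundled profile and rests on some assumptions: the `seeds` bean name, `TextSeedModule` and `ConfigString` mirror the stock XML profile, the URLs are placeholders, and the enclosing `beans { }` block is Spring's usual wrapper for Groovy bean definition files.

```groovy
import org.archive.modules.seeds.TextSeedModule
import org.archive.spring.ConfigString

beans {
    seeds(TextSeedModule) {
        // Groovy map-style construction: no-arg constructor, then setValue(...).
        // Query strings keep their literal '&'; the XML form would need '&amp;'.
        textSource = new ConfigString(value: '''
            https://example.org/search?q=heritrix&page=2
            https://example.org/list?sort=date&order=desc
        '''.stripIndent())
    }
}
```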

It also enables some powerful scripting capabilities. For example, defining a custom DecideRule directly in the crawl scope:

```groovy
scope(DecideRuleSequence) {
    rules = [
        new RejectDecideRule(),
        // ACCEPT everything linked from a .pdf file
        new PredicatedDecideRule() {
            boolean evaluate(CrawlURI uri) {
                return uri.via?.path?.endsWith(".pdf")
            }
        },
        // ...
    ]
}
```

The main downsides are that defining nested inner beans can be a bit awkward, some of the errors can be cryptic, and you can't just manipulate the config files with an XML parser.
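
As a rough sketch of the inner-bean awkwardness: in Spring's Groovy DSL an anonymous inner bean is written as a closure assigned to a property, with the closure's typed parameter naming the bean class. The `seeds`, `TextSeedModule` and `ConfigString` names below are assumptions borrowed from the stock XML profile, the URL is a placeholder, and the bundled Groovy profile may do this differently.

```groovy
import org.archive.modules.seeds.TextSeedModule
import org.archive.spring.ConfigString

beans {
    seeds(TextSeedModule) {
        // Inner bean: the parameter type (ConfigString) selects the class.
        // The parameter itself is typically unused; properties are assigned
        // by bare name inside the closure.
        textSource = { ConfigString bean ->
            value = 'https://example.org/'
        }
    }
}
```

Writing `new ConfigString(...)` instead is shorter, but the result is then a plain object rather than a bean definition handled by the Spring container.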

This commit includes a Groovy version of the default crawl profile for reference, but doesn't expose a way to use it in the UI yet. For now, you need to manually create a `crawler-beans.groovy` file in your job directory.

[Groovy Bean Definition DSL]: https://docs.spring.io/spring-framework/reference/core/beans/basics.html#beans-factory-groovy
