Heritrix 3.10.0

Name                       Modified    Size
3.10.0 source code.tar.gz  2025-06-12  2.3 MB
3.10.0 source code.zip     2025-06-12  3.0 MB
README.md                  2025-06-12  3.4 kB
Totals: 3 items, 5.3 MB

Download distribution zip (or tar.gz)

Full Changelog | Javadoc | Maven Central

New features

  • BrowserProcessor: Loads fetched pages in a local browser (Firefox/ChromeDriver), records all browser requests, and runs pluggable behaviors (e.g. scrolling, link extraction). #653
      • Uses the WebDriver BiDi protocol for browser automation.
      • The recording proxy is built on Jetty's ProxyHandler and the FetchHTTP2 module.
      • Status: Working for small crawls but needs more robust error handling (browser crashes, resource limits).

  • Basic web auth: You can now switch the web interface from Digest authentication to Basic authentication with the --web-auth basic command-line option. This is useful when running Heritrix behind a reverse proxy that adds external authentication. #654

  • Robots.txt wildcards: The * and $ wildcard rules from RFC 9309 are now supported. #656

  • FetchHTTP2: Added HTTP proxy support. #657
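
As a rough illustration of the robots.txt wildcard item above, the following Java sketch matches a URL path (including any query string) against a single rule path. It treats * as any run of characters and a trailing $ as an end-of-path anchor, as RFC 9309 describes. It is not Heritrix's implementation: the class name RobotsPathPattern and the regex-based approach are assumptions made for this example. It also ignores percent-encoding normalization and the longest-match precedence between Allow and Disallow rules.

```java
import java.util.regex.Pattern;

/** Illustrative matcher for one robots.txt rule path (hypothetical helper, not a Heritrix class). */
final class RobotsPathPattern {
    private final Pattern regex;

    RobotsPathPattern(String rule) {
        // A trailing '$' anchors the rule to the end of the path.
        // Assumption: a '$' anywhere else is treated as a literal character.
        boolean anchored = rule.endsWith("$");
        String body = anchored ? rule.substring(0, rule.length() - 1) : rule;

        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < body.length(); i++) {
            char c = body.charAt(i);
            if (c == '*') {
                sb.append(".*"); // '*' matches zero or more characters
            } else {
                sb.append(Pattern.quote(String.valueOf(c))); // everything else is literal
            }
        }
        if (anchored) {
            sb.append("\\z"); // end of input
        }
        regex = Pattern.compile(sb.toString(), Pattern.DOTALL);
    }

    /** Returns true if the rule matches the given path, starting at its first character. */
    boolean matches(String path) {
        // lookingAt() anchors at the start only, which gives prefix semantics
        // for rules that do not end in '$'.
        return regex.matcher(path).lookingAt();
    }
}
```

With this sketch, `new RobotsPathPattern("/private/*.pdf$").matches("/private/a/b.pdf")` is true, while the same pattern does not match `/private/b.pdf?x=1` because the `$` requires the path to end after `.pdf`. A rule without `$`, such as `/tmp`, still behaves as a plain prefix match.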

Fixes

  • Code editor: The configuration editor and script console were upgraded to CodeMirror 6. This resolves some browser incompatibilities, allowing CodeMirror’s own find function to be re-enabled for reliable text search of content far outside the viewport. #651

  • BDB shutdown interrupt handling: The thread’s interrupted flag is now cleared before some BDB interactions to reduce the likelihood of environment invalidation when requestCrawlStop() is called repeatedly. #659

  • FetchHTTP2: Fixed gzip alert log messages by configuring HttpClient so that it does not decode gzip-encoded responses.
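
The BDB shutdown item above relies on a general Java idiom: Thread.interrupted() both reads and clears the calling thread's interrupt flag. A minimal sketch of wrapping an operation that way follows. The class InterruptShield and its method name are hypothetical. Restoring the flag afterwards is an assumption of this example, not something these release notes state about Heritrix.

```java
import java.util.function.Supplier;

/** Hypothetical helper showing how to run an operation with the interrupt flag cleared. */
final class InterruptShield {
    private InterruptShield() {}

    /**
     * Clears the current thread's interrupt flag, runs the operation, and
     * re-asserts the flag afterwards if it had been set on entry.
     */
    static <T> T callUninterrupted(Supplier<T> op) {
        // Thread.interrupted() returns the flag's value AND resets it to false.
        boolean wasInterrupted = Thread.interrupted();
        try {
            return op.get();
        } finally {
            if (wasInterrupted) {
                // Assumption of this sketch: callers further up the stack
                // still want to observe the pending interrupt.
                Thread.currentThread().interrupt();
            }
        }
    }
}
```

A caller would pass the sensitive operation as a lambda, for example `InterruptShield.callUninterrupted(() -> store.flush())`, where `store.flush()` stands in for whatever storage call should not see a pending interrupt.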

Removals

  • Removed Apache HttpClient 3: If you have custom Heritrix modules you may need to update the following class references in your code:
    Removed                                     Replacement
    org.apache.commons.httpclient.URIException  org.archive.url.URIException
    org.apache.commons.httpclient.Header        org.archive.format.http.HttpHeader

Note that Apache HttpClient 4 (org.apache.http) was not removed. #652

Dependency Upgrades

  • codemirror: 2.23 → 6
  • easymock: 5.5.0 → removed
  • groovy: 4.0.26 → 4.0.27
  • junit: 5.12.2 → 5.13.1
  • kafka-clients: 3.9.0 → 3.9.1
  • spring: 6.2.6 → 6.2.7
  • webarchive-commons: 1.3.0 → 2.0.1
Source: README.md, updated 2025-06-12