From: Alex O. <no...@gi...> - 2025-06-12 01:50:16
|
Branch: refs/heads/adam/handle-bdb-shutdown-interrupts Home: https://github.com/internetarchive/heritrix3 To unsubscribe from these emails, change your notification settings at https://github.com/internetarchive/heritrix3/settings/notifications |
From: Alex O. <no...@gi...> - 2025-06-12 01:50:14
|
Branch: refs/heads/master Home: https://github.com/internetarchive/heritrix3 Commit: 0494d63ea281908cb15af9b051da99498cdfa262 https://github.com/internetarchive/heritrix3/commit/0494d63ea281908cb15af9b051da99498cdfa262 Author: Adam Miller <ad...@ar...> Date: 2025-06-11 (Wed, 11 Jun 2025) Changed paths: M engine/src/main/java/org/archive/crawler/framework/CrawlController.java Log Message: ----------- fix: Avoid a NullPointerException when terminate is called again during shutdown Commit: 1ab9b4a2067c3fa51bac4e74f71329d880454dc0 https://github.com/internetarchive/heritrix3/commit/1ab9b4a2067c3fa51bac4e74f71329d880454dc0 Author: Adam Miller <ad...@ar...> Date: 2025-06-11 (Wed, 11 Jun 2025) Changed paths: M commons/src/main/java/org/archive/bdb/StoredQueue.java M modules/src/main/java/org/archive/modules/fetcher/BdbCookieStore.java Log Message: ----------- fix: avoid bdb environment corruption when terminate is called during shutdown Commit: 5601453a45096db01e377853090c73dc03fac9de https://github.com/internetarchive/heritrix3/commit/5601453a45096db01e377853090c73dc03fac9de Author: Alex Osborne <aos...@nl...> Date: 2025-06-12 (Thu, 12 Jun 2025) Changed paths: M commons/src/main/java/org/archive/bdb/StoredQueue.java M engine/src/main/java/org/archive/crawler/framework/CrawlController.java M modules/src/main/java/org/archive/modules/fetcher/BdbCookieStore.java Log Message: ----------- Merge pull request #659 from internetarchive/adam/handle-bdb-shutdown-interrupts fix: handle bdb shutdown interrupts Compare: https://github.com/internetarchive/heritrix3/compare/61af1d83c48d...5601453a4509 To unsubscribe from these emails, change your notification settings at https://github.com/internetarchive/heritrix3/settings/notifications |
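The notification above only carries the log messages, not the diff, so the actual change to CrawlController is not visible here. As a rough illustration of the general problem (a second terminate call arriving while shutdown has already released state), the sketch below makes a terminate request idempotent. The class `ShutdownGuardSketch` and all of its members are hypothetical names for this example and are not Heritrix code.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical illustration only; this is not Heritrix's CrawlController. */
final class ShutdownGuardSketch {
    // Flips to true exactly once, on the first terminate request.
    private final AtomicBoolean terminating = new AtomicBoolean(false);
    // Stand-in for state that shutdown releases and sets to null.
    private volatile Runnable cleanup;
    // Counts how often the cleanup actually ran (for demonstration).
    int cleanupRuns = 0;

    ShutdownGuardSketch() {
        cleanup = () -> cleanupRuns++;
    }

    /** Returns true only for the call that actually performed the shutdown work. */
    boolean requestTerminate() {
        if (!terminating.compareAndSet(false, true)) {
            return false; // a repeated call during shutdown becomes a no-op
        }
        Runnable c = cleanup; // take a local copy before clearing the field
        cleanup = null;
        if (c != null) {
            c.run();
        }
        return true;
    }
}
```

The compare-and-set guard means later callers never touch the cleared field, which is one common way to avoid a NullPointerException of this kind.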
From: Adam M. <no...@gi...> - 2025-06-12 00:45:49
|
Branch: refs/heads/adam/handle-bdb-shutdown-interrupts Home: https://github.com/internetarchive/heritrix3 Commit: 0494d63ea281908cb15af9b051da99498cdfa262 https://github.com/internetarchive/heritrix3/commit/0494d63ea281908cb15af9b051da99498cdfa262 Author: Adam Miller <ad...@ar...> Date: 2025-06-11 (Wed, 11 Jun 2025) Changed paths: M engine/src/main/java/org/archive/crawler/framework/CrawlController.java Log Message: ----------- fix: Avoid a NullPointerException when terminate is called again during shutdown Commit: 1ab9b4a2067c3fa51bac4e74f71329d880454dc0 https://github.com/internetarchive/heritrix3/commit/1ab9b4a2067c3fa51bac4e74f71329d880454dc0 Author: Adam Miller <ad...@ar...> Date: 2025-06-11 (Wed, 11 Jun 2025) Changed paths: M commons/src/main/java/org/archive/bdb/StoredQueue.java M modules/src/main/java/org/archive/modules/fetcher/BdbCookieStore.java Log Message: ----------- fix: avoid bdb environment corruption when terminate is called during shutdown Compare: https://github.com/internetarchive/heritrix3/compare/5b48d0607075...1ab9b4a2067c To unsubscribe from these emails, change your notification settings at https://github.com/internetarchive/heritrix3/settings/notifications |
From: Adam M. <no...@gi...> - 2025-06-12 00:41:13
|
Branch: refs/heads/adam/handle-bdb-shutdown-interrupts Home: https://github.com/internetarchive/heritrix3 Commit: 866d167da003cc3b1b330d69c82596a216c2f470 https://github.com/internetarchive/heritrix3/commit/866d167da003cc3b1b330d69c82596a216c2f470 Author: Adam Miller <ad...@ar...> Date: 2025-06-11 (Wed, 11 Jun 2025) Changed paths: M engine/src/main/java/org/archive/crawler/framework/CrawlController.java Log Message: ----------- fix: Avoid a NullPointerException when terminate is called again during shutdown Commit: 5b48d0607075d8da2bd5cb2005ee3c48bb9deaa4 https://github.com/internetarchive/heritrix3/commit/5b48d0607075d8da2bd5cb2005ee3c48bb9deaa4 Author: Adam Miller <ad...@ar...> Date: 2025-06-11 (Wed, 11 Jun 2025) Changed paths: M commons/src/main/java/org/archive/bdb/StoredQueue.java M modules/src/main/java/org/archive/modules/fetcher/BdbCookieStore.java Log Message: ----------- fix: avoid bdb environment corruption when terminate is called during shutdown Compare: https://github.com/internetarchive/heritrix3/compare/866d167da003%5E...5b48d0607075 To unsubscribe from these emails, change your notification settings at https://github.com/internetarchive/heritrix3/settings/notifications |
From: Alex O. <no...@gi...> - 2025-06-11 11:39:30
|
Branch: refs/heads/master Home: https://github.com/internetarchive/heritrix3 Commit: 61af1d83c48dccba80c6563bb4a5f4095a17f167 https://github.com/internetarchive/heritrix3/commit/61af1d83c48dccba80c6563bb4a5f4095a17f167 Author: Alex Osborne <aos...@nl...> Date: 2025-06-11 (Wed, 11 Jun 2025) Changed paths: M commons/src/main/java/org/archive/net/MitmProxy.java A commons/src/test/java/org/archive/net/MitmProxyTest.java M engine/src/main/java/org/archive/crawler/processor/BrowserProcessor.java Log Message: ----------- MitmProxy: Handle URLs generated by browsers but disallowed by java.net.URI To unsubscribe from these emails, change your notification settings at https://github.com/internetarchive/heritrix3/settings/notifications |
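For context on the commit above: java.net.URI follows a strict grammar and throws URISyntaxException for characters such as `|` or an unescaped space, which browsers will happily put on the wire. The diff is not part of this message, so the following is only a sketch of one possible workaround (percent-encoding the offending bytes and retrying), not a description of what MitmProxy actually does. `LenientUriSketch`, `parseLeniently` and `escapeIllegal` are hypothetical names, and the set of escaped characters is an assumption that does not cover every case (square brackets in a path, for example).

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.nio.charset.StandardCharsets;

/** Hypothetical illustration only; this is not Heritrix's MitmProxy. */
final class LenientUriSketch {
    // Assumed set of ASCII characters that java.net.URI rejects when unescaped.
    private static final String ILLEGAL = " \"<>\\^`{|}";

    /** Tries the strict parser first, then retries with illegal bytes escaped. */
    static URI parseLeniently(String raw) throws URISyntaxException {
        try {
            return new URI(raw);
        } catch (URISyntaxException e) {
            return new URI(escapeIllegal(raw));
        }
    }

    /** Percent-encodes control, non-ASCII and listed bytes; keeps existing %XX escapes. */
    static String escapeIllegal(String raw) {
        StringBuilder out = new StringBuilder();
        for (byte b : raw.getBytes(StandardCharsets.UTF_8)) {
            int c = b & 0xff;
            if (c <= 0x20 || c >= 0x7f || ILLEGAL.indexOf(c) >= 0) {
                out.append('%').append(String.format("%02X", c));
            } else {
                out.append((char) c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws URISyntaxException {
        System.out.println(parseLeniently("http://example.com/a|b?q=x y"));
    }
}
```

Trying the strict parser first keeps already-valid URLs byte-for-byte unchanged; only inputs that fail are rewritten.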
From: Alex O. <no...@gi...> - 2025-06-11 08:17:37
|
Branch: refs/heads/master Home: https://github.com/internetarchive/heritrix3 Commit: 8a15156fa1b38cc8070552787d57fa6cd402800e https://github.com/internetarchive/heritrix3/commit/8a15156fa1b38cc8070552787d57fa6cd402800e Author: Alex Osborne <aos...@nl...> Date: 2025-06-11 (Wed, 11 Jun 2025) Changed paths: M commons/src/main/java/org/archive/net/webdriver/LocalWebDriverBiDi.java A commons/src/main/java/org/archive/net/webdriver/WebDriverTimeoutException.java M engine/src/main/java/org/archive/crawler/processor/BrowserProcessor.java Log Message: ----------- BrowserProcessor: Catch WebDriver timeouts and exceptions To unsubscribe from these emails, change your notification settings at https://github.com/internetarchive/heritrix3/settings/notifications |
From: Alex O. <no...@gi...> - 2025-06-11 07:55:47
|
Branch: refs/heads/master Home: https://github.com/internetarchive/heritrix3 Commit: e0bae9a1b50caf90f09f09ff8be0fed2959e31f9 https://github.com/internetarchive/heritrix3/commit/e0bae9a1b50caf90f09f09ff8be0fed2959e31f9 Author: Alex Osborne <aos...@nl...> Date: 2025-06-11 (Wed, 11 Jun 2025) Changed paths: M modules/src/main/java/org/archive/modules/fetcher/FetchHTTP2.java Log Message: ----------- FetchHTTP2: Fallback to async DNS resolver |
From: Alex O. <no...@gi...> - 2025-06-11 02:55:07
|
Branch: refs/heads/fetchhttp2-http-proxy Home: https://github.com/internetarchive/heritrix3 |
From: Alex O. <no...@gi...> - 2025-06-11 02:55:02
|
Branch: refs/heads/master Home: https://github.com/internetarchive/heritrix3 Commit: cc50af72be0d7352305279e0f53707b15519dcf0 https://github.com/internetarchive/heritrix3/commit/cc50af72be0d7352305279e0f53707b15519dcf0 Author: Alex Osborne <aos...@nl...> Date: 2025-06-09 (Mon, 09 Jun 2025) Changed paths: M CHANGELOG.md M modules/src/main/java/org/archive/modules/fetcher/FetchHTTP2.java M modules/src/test/java/org/archive/modules/fetcher/FetchHTTP2Test.java Log Message: ----------- FetchHTTP2: Add HTTP proxy support Commit: c3f9afe64bda41817ad72f30367dfdd2b8ff88e9 https://github.com/internetarchive/heritrix3/commit/c3f9afe64bda41817ad72f30367dfdd2b8ff88e9 Author: Alex Osborne <aos...@nl...> Date: 2025-06-10 (Tue, 10 Jun 2025) Changed paths: M commons/src/main/java/org/archive/net/MitmProxy.java M engine/src/main/java/org/archive/crawler/processor/BrowserProcessor.java M engine/src/test/java/org/archive/crawler/processor/BrowserProcessorTest.java M modules/src/main/java/org/archive/modules/fetcher/FetchHTTP2.java Log Message: ----------- BrowserProcessor: use FetchHTTP2's configured proxy as an upstream proxy Commit: 9a1bc7f28793aec1af0b495e0f0bf5625225c3ea https://github.com/internetarchive/heritrix3/commit/9a1bc7f28793aec1af0b495e0f0bf5625225c3ea Author: Alex Osborne <aos...@nl...> Date: 2025-06-11 (Wed, 11 Jun 2025) Changed paths: M CHANGELOG.md M commons/src/main/java/org/archive/net/MitmProxy.java M engine/src/main/java/org/archive/crawler/processor/BrowserProcessor.java M engine/src/test/java/org/archive/crawler/processor/BrowserProcessorTest.java M modules/src/main/java/org/archive/modules/fetcher/FetchHTTP2.java M modules/src/test/java/org/archive/modules/fetcher/FetchHTTP2Test.java Log Message: ----------- Merge pull request #657 from internetarchive/fetchhttp2-http-proxy FetchHTTP2: Add HTTP proxy support Compare: https://github.com/internetarchive/heritrix3/compare/020ea34cb61e...9a1bc7f28793 To unsubscribe from these emails, change your notification 
settings at https://github.com/internetarchive/heritrix3/settings/notifications |
From: Alex O. <no...@gi...> - 2025-06-10 00:01:48
|
Branch: refs/heads/fetchhttp2-http-proxy Home: https://github.com/internetarchive/heritrix3 Commit: cc50af72be0d7352305279e0f53707b15519dcf0 https://github.com/internetarchive/heritrix3/commit/cc50af72be0d7352305279e0f53707b15519dcf0 Author: Alex Osborne <aos...@nl...> Date: 2025-06-09 (Mon, 09 Jun 2025) Changed paths: M CHANGELOG.md M modules/src/main/java/org/archive/modules/fetcher/FetchHTTP2.java M modules/src/test/java/org/archive/modules/fetcher/FetchHTTP2Test.java Log Message: ----------- FetchHTTP2: Add HTTP proxy support Commit: c3f9afe64bda41817ad72f30367dfdd2b8ff88e9 https://github.com/internetarchive/heritrix3/commit/c3f9afe64bda41817ad72f30367dfdd2b8ff88e9 Author: Alex Osborne <aos...@nl...> Date: 2025-06-10 (Tue, 10 Jun 2025) Changed paths: M commons/src/main/java/org/archive/net/MitmProxy.java M engine/src/main/java/org/archive/crawler/processor/BrowserProcessor.java M engine/src/test/java/org/archive/crawler/processor/BrowserProcessorTest.java M modules/src/main/java/org/archive/modules/fetcher/FetchHTTP2.java Log Message: ----------- BrowserProcessor: use FetchHTTP2's configured proxy as an upstream proxy Compare: https://github.com/internetarchive/heritrix3/compare/54853d59f20b...c3f9afe64bda To unsubscribe from these emails, change your notification settings at https://github.com/internetarchive/heritrix3/settings/notifications |
From: Alex O. <no...@gi...> - 2025-06-09 06:42:56
|
Branch: refs/heads/master Home: https://github.com/internetarchive/heritrix3 Commit: e68e4975c95bad2ca3452ab7e264719c73e0f517 https://github.com/internetarchive/heritrix3/commit/e68e4975c95bad2ca3452ab7e264719c73e0f517 Author: Alex Osborne <aos...@nl...> Date: 2025-06-09 (Mon, 09 Jun 2025) Changed paths: M CHANGELOG.md M docs/bean-reference.rst R engine/src/main/java/org/archive/crawler/processor/Browser.java A engine/src/main/java/org/archive/crawler/processor/BrowserProcessor.java M engine/src/main/resources/org/archive/crawler/restlet/profile-crawler-beans.cxml A engine/src/test/java/org/archive/crawler/processor/BrowserProcessorTest.java R engine/src/test/java/org/archive/crawler/processor/BrowserTest.java R modules/src/main/java/org/archive/modules/behaviors/ExtractLinks.java A modules/src/main/java/org/archive/modules/behaviors/ExtractLinksBehavior.java R modules/src/main/java/org/archive/modules/behaviors/ScrollDown.java A modules/src/main/java/org/archive/modules/behaviors/ScrollDownBehavior.java Log Message: ----------- Rename to BrowserProcessor and -Behavior After I added them to the Bean reference they seemed out of place with the DecideRules and RecordBuilders. We had two classes called 'Browser' which is potentially confusing. It's also probably good to more clearly differentiate the ExtractLinks browser behavior from the Extractor processors. Commit: 020ea34cb61e335e908347e451fb75fc11b405a3 https://github.com/internetarchive/heritrix3/commit/020ea34cb61e335e908347e451fb75fc11b405a3 Author: Alex Osborne <aos...@nl...> Date: 2025-06-09 (Mon, 09 Jun 2025) Changed paths: M docs/bean-reference.rst Log Message: ----------- Bean reference: Add record builders under WARCWriterChainProcessor Compare: https://github.com/internetarchive/heritrix3/compare/cf447f3bd957...020ea34cb61e To unsubscribe from these emails, change your notification settings at https://github.com/internetarchive/heritrix3/settings/notifications |
From: Alex O. <no...@gi...> - 2025-06-09 06:19:46
|
Branch: refs/heads/master Home: https://github.com/internetarchive/heritrix3 Commit: cf447f3bd9579f099aee211ee9479ea650625637 https://github.com/internetarchive/heritrix3/commit/cf447f3bd9579f099aee211ee9479ea650625637 Author: Alex Osborne <aos...@nl...> Date: 2025-06-09 (Mon, 09 Jun 2025) Changed paths: M docs/bean-reference.rst M engine/src/main/java/org/archive/crawler/processor/Browser.java M modules/src/main/java/org/archive/modules/behaviors/ExtractLinks.java M modules/src/main/java/org/archive/modules/behaviors/ScrollDown.java Log Message: ----------- Add Browser processor and behaviors to bean reference To unsubscribe from these emails, change your notification settings at https://github.com/internetarchive/heritrix3/settings/notifications |
From: Alex O. <no...@gi...> - 2025-06-09 05:43:08
|
Branch: refs/heads/master Home: https://github.com/internetarchive/heritrix3 Commit: 8df957d77a1763eb1fc1f5b43186a876eac62040 https://github.com/internetarchive/heritrix3/commit/8df957d77a1763eb1fc1f5b43186a876eac62040 Author: Alex Osborne <aos...@nl...> Date: 2025-06-09 (Mon, 09 Jun 2025) Changed paths: M CHANGELOG.md Log Message: ----------- Add CHANGELOG.md entry for Browser processor |
From: Alex O. <no...@gi...> - 2025-06-09 02:49:42
|
Branch: refs/heads/master Home: https://github.com/internetarchive/heritrix3 Commit: 1aa6b82d32cc32131930535baf84388dcd85f8c0 https://github.com/internetarchive/heritrix3/commit/1aa6b82d32cc32131930535baf84388dcd85f8c0 Author: Alex Osborne <aos...@nl...> Date: 2025-06-09 (Mon, 09 Jun 2025) Changed paths: M README.md Log Message: ----------- Remove stray footnote reference |
From: Alex O. <no...@gi...> - 2025-06-09 01:17:51
|
Branch: refs/heads/master Home: https://github.com/internetarchive/heritrix3 Commit: 6782fc065d988b72b9fff18ee91faf414a6f4c25 https://github.com/internetarchive/heritrix3/commit/6782fc065d988b72b9fff18ee91faf414a6f4c25 Author: Alex Osborne <aos...@nl...> Date: 2025-06-09 (Mon, 09 Jun 2025) Changed paths: M CHANGELOG.md M commons/pom.xml M engine/pom.xml M pom.xml Log Message: ----------- Update dependencies - **junit-jupiter**: 5.12.2 → 5.13.1 - **spring**: 6.2.6 → 6.2.7 - **codemirror__language**: 6.11.0 → 6.11.1 - **codemirror__view**: 6.36.5 → 6.37.1 Commit: faf251359dd0d692487023fb00a463f3eb86fbc6 https://github.com/internetarchive/heritrix3/commit/faf251359dd0d692487023fb00a463f3eb86fbc6 Author: Alex Osborne <aos...@nl...> Date: 2025-06-09 (Mon, 09 Jun 2025) Changed paths: M CHANGELOG.md M contrib/pom.xml M contrib/src/test/java/org/archive/modules/recrawl/wbm/WbmPersistLoadProcessorTest.java Log Message: ----------- WbmPersistLoadProcessorTest: Remove dependency on easymock and re-enable test Instead of mocking the HttpClient let's just run a test web server. This means we can drop easymock entirely as it was only being used for this one disabled test. Compare: https://github.com/internetarchive/heritrix3/compare/52bbd8091b14...faf251359dd0 To unsubscribe from these emails, change your notification settings at https://github.com/internetarchive/heritrix3/settings/notifications |
From: Alex O. <no...@gi...> - 2025-06-09 00:36:39
|
Branch: refs/heads/master Home: https://github.com/internetarchive/heritrix3 Commit: 722552feef0f7882f569a4d971b415b38797bdc4 https://github.com/internetarchive/heritrix3/commit/722552feef0f7882f569a4d971b415b38797bdc4 Author: Alex Osborne <aos...@nl...> Date: 2025-06-04 (Wed, 04 Jun 2025) Changed paths: M commons/pom.xml A commons/src/main/java/org/archive/net/MitmProxy.java A commons/src/main/java/org/archive/net/webdriver/BiDiEvent.java A commons/src/main/java/org/archive/net/webdriver/BiDiJson.java A commons/src/main/java/org/archive/net/webdriver/BiDiModule.java A commons/src/main/java/org/archive/net/webdriver/Browser.java A commons/src/main/java/org/archive/net/webdriver/BrowsingContext.java A commons/src/main/java/org/archive/net/webdriver/LocalWebDriverBiDi.java A commons/src/main/java/org/archive/net/webdriver/Network.java A commons/src/main/java/org/archive/net/webdriver/Script.java A commons/src/main/java/org/archive/net/webdriver/Session.java A commons/src/main/java/org/archive/net/webdriver/WebDriverBiDi.java A commons/src/main/java/org/archive/net/webdriver/WebDriverException.java A commons/src/main/java/org/archive/util/IdleBarrier.java M engine/pom.xml A engine/src/main/java/org/archive/crawler/processor/Browser.java M engine/src/main/resources/org/archive/crawler/restlet/profile-crawler-beans.cxml A engine/src/test/java/org/archive/crawler/processor/BrowserTest.java A engine/src/test/resources/logging.properties M modules/pom.xml A modules/src/main/java/org/archive/modules/behaviors/Behavior.java A modules/src/main/java/org/archive/modules/behaviors/ExtractLinks.java A modules/src/main/java/org/archive/modules/behaviors/Page.java A modules/src/main/java/org/archive/modules/behaviors/ScrollDown.java M modules/src/main/java/org/archive/modules/fetcher/FetchHTTP2.java Log Message: ----------- Add Browser processor using WebDriver BiDi The Browser processor can load a fetched page in a local web browser, record any requests the browser makes and run 
behaviors that interact with the page such as scrolling down and extracting links.
This differs from my previous attempt (ExtractorChrome) in a few ways:
- Uses the new WebDriver BiDi standard instead of the Chrome DevTools Protocol. The new protocol is mostly browser-agnostic, more consistent and hopefully more stable.
- Uses a MITM proxy instead of CDP request interception for recording sub-resources. That's partly because BiDi is still missing some key interception APIs. Even so, in practice I found the proxy method loads pages faster and more reliably, likely because responses can be streamed incrementally, which helps a lot for large resources or server-sent events.
- Even when HTTP/2 is unavailable, the new FetchHTTP2 module does connection pooling which makes loading browser requests a lot faster. The original FetchHTTP opened a new connection for every request.
- The Browser processor can be configured with a list of behavior beans making it more customizable and extensible.
Obvious areas for future development:
- More Behavior beans: take screenshots, save the rendered DOM, run Browsertrix-compatible behavior scripts
- Support for remote WebDrivers (e.g. Selenium Server or cloud services)
Commit: 5131ea4a80e77f77ca5703a183fc03407d50d810 https://github.com/internetarchive/heritrix3/commit/5131ea4a80e77f77ca5703a183fc03407d50d810 Author: Alex Osborne <aos...@nl...> Date: 2025-06-04 (Wed, 04 Jun 2025) Changed paths: M commons/src/main/java/org/archive/net/webdriver/LocalWebDriverBiDi.java M engine/src/main/java/org/archive/crawler/processor/Browser.java M engine/src/test/java/org/archive/crawler/processor/BrowserTest.java Log Message: ----------- Browser: Disable downloads in Firefox and Chrome This hopefully will stop us filling up ~/Downloads with random junk.
Commit: 2210cb72628b18520ef0fc4b7f75d09eefa8b32a https://github.com/internetarchive/heritrix3/commit/2210cb72628b18520ef0fc4b7f75d09eefa8b32a Author: Alex Osborne <aos...@nl...> Date: 2025-06-04 (Wed, 04 Jun 2025) Changed paths: M commons/src/main/java/org/archive/net/webdriver/LocalWebDriverBiDi.java M engine/src/main/java/org/archive/crawler/processor/Browser.java M engine/src/test/java/org/archive/crawler/processor/BrowserTest.java Log Message: ----------- Browser: Handle navigation abort from downloads starting Commit: 1161d87217c083bd9f01e9e77b1f4188be3673ef https://github.com/internetarchive/heritrix3/commit/1161d87217c083bd9f01e9e77b1f4188be3673ef Author: Alex Osborne <aos...@nl...> Date: 2025-06-05 (Thu, 05 Jun 2025) Changed paths: M engine/src/main/java/org/archive/crawler/processor/Browser.java M modules/src/main/java/org/archive/modules/behaviors/Behavior.java M modules/src/main/java/org/archive/modules/behaviors/ExtractLinks.java Log Message: ----------- Browser: Add processor report Commit: 52bbd8091b14e2cdd23b5c659a94dd5ce031731b https://github.com/internetarchive/heritrix3/commit/52bbd8091b14e2cdd23b5c659a94dd5ce031731b Author: Alex Osborne <aos...@nl...> Date: 2025-06-09 (Mon, 09 Jun 2025) Changed paths: M commons/pom.xml A commons/src/main/java/org/archive/net/MitmProxy.java A commons/src/main/java/org/archive/net/webdriver/BiDiEvent.java A commons/src/main/java/org/archive/net/webdriver/BiDiJson.java A commons/src/main/java/org/archive/net/webdriver/BiDiModule.java A commons/src/main/java/org/archive/net/webdriver/Browser.java A commons/src/main/java/org/archive/net/webdriver/BrowsingContext.java A commons/src/main/java/org/archive/net/webdriver/LocalWebDriverBiDi.java A commons/src/main/java/org/archive/net/webdriver/Network.java A commons/src/main/java/org/archive/net/webdriver/Script.java A commons/src/main/java/org/archive/net/webdriver/Session.java A commons/src/main/java/org/archive/net/webdriver/WebDriverBiDi.java A 
commons/src/main/java/org/archive/net/webdriver/WebDriverException.java A commons/src/main/java/org/archive/util/IdleBarrier.java M engine/pom.xml A engine/src/main/java/org/archive/crawler/processor/Browser.java M engine/src/main/resources/org/archive/crawler/restlet/profile-crawler-beans.cxml A engine/src/test/java/org/archive/crawler/processor/BrowserTest.java A engine/src/test/resources/logging.properties M modules/pom.xml A modules/src/main/java/org/archive/modules/behaviors/Behavior.java A modules/src/main/java/org/archive/modules/behaviors/ExtractLinks.java A modules/src/main/java/org/archive/modules/behaviors/Page.java A modules/src/main/java/org/archive/modules/behaviors/ScrollDown.java M modules/src/main/java/org/archive/modules/fetcher/FetchHTTP2.java Log Message: ----------- Merge pull request #653 from internetarchive/bidi Add Browser processor using WebDriver BiDi Compare: https://github.com/internetarchive/heritrix3/compare/7e15d2fda1bc...52bbd8091b14 To unsubscribe from these emails, change your notification settings at https://github.com/internetarchive/heritrix3/settings/notifications |
From: Alex O. <no...@gi...> - 2025-06-09 00:36:36
|
Branch: refs/heads/bidi Home: https://github.com/internetarchive/heritrix3 |
From: Alex O. <no...@gi...> - 2025-06-09 00:35:10
|
Branch: refs/heads/robotstxt-wildcard Home: https://github.com/internetarchive/heritrix3 |
From: Alex O. <no...@gi...> - 2025-06-09 00:34:43
|
Branch: refs/heads/master Home: https://github.com/internetarchive/heritrix3 Commit: c7b7ee18420de21c5deee15a7216e035b0d65e59 https://github.com/internetarchive/heritrix3/commit/c7b7ee18420de21c5deee15a7216e035b0d65e59 Author: Alex Osborne <aos...@nl...> Date: 2025-06-07 (Sat, 07 Jun 2025) Changed paths: M CHANGELOG.md M README.md M docs/configuring-jobs.rst M modules/src/main/java/org/archive/modules/net/RobotsDirectives.java M modules/src/test/java/org/archive/modules/net/RobotstxtTest.java Log Message: ----------- Support * and $ wildcards in robots.txt Commit: 7e15d2fda1bcac53e6b0be125910c02dd8ecffba https://github.com/internetarchive/heritrix3/commit/7e15d2fda1bcac53e6b0be125910c02dd8ecffba Author: Alex Osborne <aos...@nl...> Date: 2025-06-09 (Mon, 09 Jun 2025) Changed paths: M CHANGELOG.md M README.md M docs/configuring-jobs.rst M modules/src/main/java/org/archive/modules/net/RobotsDirectives.java M modules/src/test/java/org/archive/modules/net/RobotstxtTest.java Log Message: ----------- Merge pull request #656 from internetarchive/robotstxt-wildcard Support * and $ wildcards in robots.txt Compare: https://github.com/internetarchive/heritrix3/compare/497d79500575...7e15d2fda1bc To unsubscribe from these emails, change your notification settings at https://github.com/internetarchive/heritrix3/settings/notifications |
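For readers unfamiliar with the wildcard syntax mentioned in the commit above: in robots.txt path rules, `*` conventionally matches any sequence of characters and a trailing `$` anchors the rule to the end of the path; without `$` a rule is a prefix match. The sketch below shows one simple way to evaluate such rules by translating them to a regular expression. It is an assumption-laden illustration, not the code in RobotsDirectives.java; `RobotsWildcardSketch`, `compile` and `matches` are hypothetical names.

```java
import java.util.regex.Pattern;

/** Hypothetical illustration only; this is not Heritrix's RobotsDirectives. */
final class RobotsWildcardSketch {

    /** Translates a robots.txt path rule into an equivalent regular expression. */
    static Pattern compile(String rule) {
        boolean anchored = rule.endsWith("$");
        String body = anchored ? rule.substring(0, rule.length() - 1) : rule;
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < body.length(); i++) {
            char c = body.charAt(i);
            if (c == '*') {
                regex.append(".*"); // wildcard: any run of characters
            } else {
                regex.append(Pattern.quote(String.valueOf(c))); // literal character
            }
        }
        if (!anchored) {
            regex.append(".*"); // rules without '$' match any path with that prefix
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    /** True if the URL path (plus query, if any) is covered by the rule. */
    static boolean matches(String rule, String path) {
        return compile(rule).matcher(path).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("/*.php$", "/index.php"));
        System.out.println(matches("/*.php$", "/index.php?x=1"));
    }
}
```

A regex translation is the shortest way to show the semantics; a production matcher would more likely cache compiled rules or match by hand to avoid backtracking on rules with many wildcards.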
From: Alex O. <no...@gi...> - 2025-06-09 00:34:02
|
Branch: refs/heads/fetchhttp2-http-proxy Home: https://github.com/internetarchive/heritrix3 Commit: 54853d59f20b4ad20113c8952c20d794c0c5b5fb https://github.com/internetarchive/heritrix3/commit/54853d59f20b4ad20113c8952c20d794c0c5b5fb Author: Alex Osborne <aos...@nl...> Date: 2025-06-09 (Mon, 09 Jun 2025) Changed paths: M CHANGELOG.md M modules/src/main/java/org/archive/modules/fetcher/FetchHTTP2.java M modules/src/test/java/org/archive/modules/fetcher/FetchHTTP2Test.java Log Message: ----------- FetchHTTP2: Add HTTP proxy support To unsubscribe from these emails, change your notification settings at https://github.com/internetarchive/heritrix3/settings/notifications |
From: Alex O. <no...@gi...> - 2025-06-07 01:37:43
|
Branch: refs/heads/robotstxt-wildcard Home: https://github.com/internetarchive/heritrix3 Commit: c7b7ee18420de21c5deee15a7216e035b0d65e59 https://github.com/internetarchive/heritrix3/commit/c7b7ee18420de21c5deee15a7216e035b0d65e59 Author: Alex Osborne <aos...@nl...> Date: 2025-06-07 (Sat, 07 Jun 2025) Changed paths: M CHANGELOG.md M README.md M docs/configuring-jobs.rst M modules/src/main/java/org/archive/modules/net/RobotsDirectives.java M modules/src/test/java/org/archive/modules/net/RobotstxtTest.java Log Message: ----------- Support * and $ wildcards in robots.txt To unsubscribe from these emails, change your notification settings at https://github.com/internetarchive/heritrix3/settings/notifications |
From: Alex O. <no...@gi...> - 2025-06-07 01:14:17
|
Branch: refs/heads/robotstxt-wildcard Home: https://github.com/internetarchive/heritrix3 Commit: 1644d054a1cba08d6e0f5e5b7e04fd1db53d6f8f https://github.com/internetarchive/heritrix3/commit/1644d054a1cba08d6e0f5e5b7e04fd1db53d6f8f Author: Alex Osborne <aos...@nl...> Date: 2025-06-07 (Sat, 07 Jun 2025) Changed paths: M README.md M docs/configuring-jobs.rst M modules/src/main/java/org/archive/modules/net/RobotsDirectives.java M modules/src/test/java/org/archive/modules/net/RobotstxtTest.java Log Message: ----------- Support * and $ wildcards in robots.txt To unsubscribe from these emails, change your notification settings at https://github.com/internetarchive/heritrix3/settings/notifications |
From: Alex O. <no...@gi...> - 2025-06-07 01:06:23
|
Branch: refs/heads/robotstxt-wildcard Home: https://github.com/internetarchive/heritrix3 Commit: f16db3997b9af6b6ff5292aa2763ee1280ea7b49 https://github.com/internetarchive/heritrix3/commit/f16db3997b9af6b6ff5292aa2763ee1280ea7b49 Author: Alex Osborne <aos...@nl...> Date: 2025-06-06 (Fri, 06 Jun 2025) Changed paths: M README.md M docs/configuring-jobs.rst M modules/src/main/java/org/archive/modules/net/RobotsDirectives.java M modules/src/test/java/org/archive/modules/net/RobotstxtTest.java Log Message: ----------- Support * and $ wildcards in robots.txt To unsubscribe from these emails, change your notification settings at https://github.com/internetarchive/heritrix3/settings/notifications |
From: Alex O. <no...@gi...> - 2025-06-05 01:49:02
|
Branch: refs/heads/bidi
Home: https://github.com/internetarchive/heritrix3

Commit: 722552feef0f7882f569a4d971b415b38797bdc4
https://github.com/internetarchive/heritrix3/commit/722552feef0f7882f569a4d971b415b38797bdc4
Author: Alex Osborne <aos...@nl...>
Date: 2025-06-04 (Wed, 04 Jun 2025)

Changed paths:
  M commons/pom.xml
  A commons/src/main/java/org/archive/net/MitmProxy.java
  A commons/src/main/java/org/archive/net/webdriver/BiDiEvent.java
  A commons/src/main/java/org/archive/net/webdriver/BiDiJson.java
  A commons/src/main/java/org/archive/net/webdriver/BiDiModule.java
  A commons/src/main/java/org/archive/net/webdriver/Browser.java
  A commons/src/main/java/org/archive/net/webdriver/BrowsingContext.java
  A commons/src/main/java/org/archive/net/webdriver/LocalWebDriverBiDi.java
  A commons/src/main/java/org/archive/net/webdriver/Network.java
  A commons/src/main/java/org/archive/net/webdriver/Script.java
  A commons/src/main/java/org/archive/net/webdriver/Session.java
  A commons/src/main/java/org/archive/net/webdriver/WebDriverBiDi.java
  A commons/src/main/java/org/archive/net/webdriver/WebDriverException.java
  A commons/src/main/java/org/archive/util/IdleBarrier.java
  M engine/pom.xml
  A engine/src/main/java/org/archive/crawler/processor/Browser.java
  M engine/src/main/resources/org/archive/crawler/restlet/profile-crawler-beans.cxml
  A engine/src/test/java/org/archive/crawler/processor/BrowserTest.java
  A engine/src/test/resources/logging.properties
  M modules/pom.xml
  A modules/src/main/java/org/archive/modules/behaviors/Behavior.java
  A modules/src/main/java/org/archive/modules/behaviors/ExtractLinks.java
  A modules/src/main/java/org/archive/modules/behaviors/Page.java
  A modules/src/main/java/org/archive/modules/behaviors/ScrollDown.java
  M modules/src/main/java/org/archive/modules/fetcher/FetchHTTP2.java

Log Message:
-----------
Add Browser processor using WebDriver BiDi

The Browser processor can load a fetched page in a local web browser, record any requests the browser makes and run behaviors that interact with the page such as scrolling down and extracting links.

This differs from my previous attempt (ExtractorChrome) in a few ways:

- Uses the new WebDriver BiDi standard instead of the Chrome Devtools Protocol. The new protocol is mostly browser-agnostic, more consistent and hopefully more stable.
- Uses a MITM proxy instead of CDP request interception for recording sub-resources. That's partly because BiDi is still missing some key interception APIs. Even so, in practice I found the proxy method loads pages faster and more reliably, likely because responses can be streamed incrementally, which helps a lot for large resources or server-sent events.
- Even when HTTP/2 is unavailable, the new FetchHTTP2 module does connection pooling which makes loading browser requests a lot faster. The original FetchHTTP opened a new connection for every request.
- The Browser processor can be configured with a list of behavior beans making it more customizable and extensible.

Obvious areas for future development:

- More Behavior beans: take screenshots, save the rendered DOM, run Browsertrix-compatible behavior scripts
- Support for remote WebDrivers (e.g. Selenium Server or cloud services)

Commit: 5131ea4a80e77f77ca5703a183fc03407d50d810
https://github.com/internetarchive/heritrix3/commit/5131ea4a80e77f77ca5703a183fc03407d50d810
Author: Alex Osborne <aos...@nl...>
Date: 2025-06-04 (Wed, 04 Jun 2025)

Changed paths:
  M commons/src/main/java/org/archive/net/webdriver/LocalWebDriverBiDi.java
  M engine/src/main/java/org/archive/crawler/processor/Browser.java
  M engine/src/test/java/org/archive/crawler/processor/BrowserTest.java

Log Message:
-----------
Browser: Disable downloads in Firefox and Chrome

This hopefully will stop us filling up ~/Downloads with random junk.

Commit: 2210cb72628b18520ef0fc4b7f75d09eefa8b32a
https://github.com/internetarchive/heritrix3/commit/2210cb72628b18520ef0fc4b7f75d09eefa8b32a
Author: Alex Osborne <aos...@nl...>
Date: 2025-06-04 (Wed, 04 Jun 2025)

Changed paths:
  M commons/src/main/java/org/archive/net/webdriver/LocalWebDriverBiDi.java
  M engine/src/main/java/org/archive/crawler/processor/Browser.java
  M engine/src/test/java/org/archive/crawler/processor/BrowserTest.java

Log Message:
-----------
Browser: Handle navigation abort from downloads starting

Commit: 1161d87217c083bd9f01e9e77b1f4188be3673ef
https://github.com/internetarchive/heritrix3/commit/1161d87217c083bd9f01e9e77b1f4188be3673ef
Author: Alex Osborne <aos...@nl...>
Date: 2025-06-05 (Thu, 05 Jun 2025)

Changed paths:
  M engine/src/main/java/org/archive/crawler/processor/Browser.java
  M modules/src/main/java/org/archive/modules/behaviors/Behavior.java
  M modules/src/main/java/org/archive/modules/behaviors/ExtractLinks.java

Log Message:
-----------
Browser: Add processor report

Compare: https://github.com/internetarchive/heritrix3/compare/ad3b99eef191...1161d87217c0

To unsubscribe from these emails, change your notification settings at https://github.com/internetarchive/heritrix3/settings/notifications |
From: Alex O. <no...@gi...> - 2025-06-04 07:01:49
|
Branch: refs/heads/bidi
Home: https://github.com/internetarchive/heritrix3

Commit: 5b5b754799568b8f324054bea6505998fc1e17e0
https://github.com/internetarchive/heritrix3/commit/5b5b754799568b8f324054bea6505998fc1e17e0
Author: Alex Osborne <aos...@nl...>
Date: 2025-06-04 (Wed, 04 Jun 2025)

Changed paths:
  M commons/src/main/java/org/archive/net/webdriver/LocalWebDriverBiDi.java
  M engine/src/main/java/org/archive/crawler/processor/Browser.java
  M engine/src/test/java/org/archive/crawler/processor/BrowserTest.java

Log Message:
-----------
Browser: Disable downloads in Firefox and Chrome

This hopefully will stop us filling up ~/Downloads with random junk.

Commit: ad3b99eef19129074a3581a205158c498808878f
https://github.com/internetarchive/heritrix3/commit/ad3b99eef19129074a3581a205158c498808878f
Author: Alex Osborne <aos...@nl...>
Date: 2025-06-04 (Wed, 04 Jun 2025)

Changed paths:
  M commons/src/main/java/org/archive/net/webdriver/LocalWebDriverBiDi.java
  M engine/src/main/java/org/archive/crawler/processor/Browser.java
  M engine/src/test/java/org/archive/crawler/processor/BrowserTest.java

Log Message:
-----------
Browser: Handle navigation abort from downloads starting

Compare: https://github.com/internetarchive/heritrix3/compare/396f7a06c6cb...ad3b99eef191

To unsubscribe from these emails, change your notification settings at https://github.com/internetarchive/heritrix3/settings/notifications |