From: Alex O. <no...@gi...> - 2025-06-09 00:36:39
Branch: refs/heads/master
Home: https://github.com/internetarchive/heritrix3

Commit: 722552feef0f7882f569a4d971b415b38797bdc4
    https://github.com/internetarchive/heritrix3/commit/722552feef0f7882f569a4d971b415b38797bdc4
Author: Alex Osborne <aos...@nl...>
Date: 2025-06-04 (Wed, 04 Jun 2025)

Changed paths:
    M commons/pom.xml
    A commons/src/main/java/org/archive/net/MitmProxy.java
    A commons/src/main/java/org/archive/net/webdriver/BiDiEvent.java
    A commons/src/main/java/org/archive/net/webdriver/BiDiJson.java
    A commons/src/main/java/org/archive/net/webdriver/BiDiModule.java
    A commons/src/main/java/org/archive/net/webdriver/Browser.java
    A commons/src/main/java/org/archive/net/webdriver/BrowsingContext.java
    A commons/src/main/java/org/archive/net/webdriver/LocalWebDriverBiDi.java
    A commons/src/main/java/org/archive/net/webdriver/Network.java
    A commons/src/main/java/org/archive/net/webdriver/Script.java
    A commons/src/main/java/org/archive/net/webdriver/Session.java
    A commons/src/main/java/org/archive/net/webdriver/WebDriverBiDi.java
    A commons/src/main/java/org/archive/net/webdriver/WebDriverException.java
    A commons/src/main/java/org/archive/util/IdleBarrier.java
    M engine/pom.xml
    A engine/src/main/java/org/archive/crawler/processor/Browser.java
    M engine/src/main/resources/org/archive/crawler/restlet/profile-crawler-beans.cxml
    A engine/src/test/java/org/archive/crawler/processor/BrowserTest.java
    A engine/src/test/resources/logging.properties
    M modules/pom.xml
    A modules/src/main/java/org/archive/modules/behaviors/Behavior.java
    A modules/src/main/java/org/archive/modules/behaviors/ExtractLinks.java
    A modules/src/main/java/org/archive/modules/behaviors/Page.java
    A modules/src/main/java/org/archive/modules/behaviors/ScrollDown.java
    M modules/src/main/java/org/archive/modules/fetcher/FetchHTTP2.java

Log Message:
-----------
Add Browser processor using WebDriver BiDi

The Browser processor can load a fetched page in a local web browser, record any requests the browser makes and run behaviors that interact with the page such as scrolling down and extracting links.

This differs from my previous attempt (ExtractorChrome) in a few ways:

- Uses the new WebDriver BiDi standard instead of the Chrome DevTools Protocol. The new protocol is mostly browser-agnostic, more consistent and hopefully more stable.
- Uses a MITM proxy instead of CDP request interception for recording sub-resources. That's partly because BiDi is still missing some key interception APIs. Even so, in practice I found the proxy method loads pages faster and more reliably, likely because responses can be streamed incrementally, which helps a lot for large resources or server-sent events.
- Even when HTTP/2 is unavailable, the new FetchHTTP2 module does connection pooling which makes loading browser requests a lot faster. The original FetchHTTP opened a new connection for every request.
- The Browser processor can be configured with a list of behavior beans making it more customizable and extensible.

Obvious areas for future development:

- More Behavior beans: take screenshots, save the rendered DOM, run Browsertrix-compatible behavior scripts
- Support for remote WebDrivers (e.g. Selenium Server or cloud services)

Commit: 5131ea4a80e77f77ca5703a183fc03407d50d810
    https://github.com/internetarchive/heritrix3/commit/5131ea4a80e77f77ca5703a183fc03407d50d810
Author: Alex Osborne <aos...@nl...>
Date: 2025-06-04 (Wed, 04 Jun 2025)

Changed paths:
    M commons/src/main/java/org/archive/net/webdriver/LocalWebDriverBiDi.java
    M engine/src/main/java/org/archive/crawler/processor/Browser.java
    M engine/src/test/java/org/archive/crawler/processor/BrowserTest.java

Log Message:
-----------
Browser: Disable downloads in Firefox and Chrome

This hopefully will stop us filling up ~/Downloads with random junk.

Commit: 2210cb72628b18520ef0fc4b7f75d09eefa8b32a
    https://github.com/internetarchive/heritrix3/commit/2210cb72628b18520ef0fc4b7f75d09eefa8b32a
Author: Alex Osborne <aos...@nl...>
Date: 2025-06-04 (Wed, 04 Jun 2025)

Changed paths:
    M commons/src/main/java/org/archive/net/webdriver/LocalWebDriverBiDi.java
    M engine/src/main/java/org/archive/crawler/processor/Browser.java
    M engine/src/test/java/org/archive/crawler/processor/BrowserTest.java

Log Message:
-----------
Browser: Handle navigation abort from downloads starting

Commit: 1161d87217c083bd9f01e9e77b1f4188be3673ef
    https://github.com/internetarchive/heritrix3/commit/1161d87217c083bd9f01e9e77b1f4188be3673ef
Author: Alex Osborne <aos...@nl...>
Date: 2025-06-05 (Thu, 05 Jun 2025)

Changed paths:
    M engine/src/main/java/org/archive/crawler/processor/Browser.java
    M modules/src/main/java/org/archive/modules/behaviors/Behavior.java
    M modules/src/main/java/org/archive/modules/behaviors/ExtractLinks.java

Log Message:
-----------
Browser: Add processor report

Commit: 52bbd8091b14e2cdd23b5c659a94dd5ce031731b
    https://github.com/internetarchive/heritrix3/commit/52bbd8091b14e2cdd23b5c659a94dd5ce031731b
Author: Alex Osborne <aos...@nl...>
Date: 2025-06-09 (Mon, 09 Jun 2025)

Changed paths:
    M commons/pom.xml
    A commons/src/main/java/org/archive/net/MitmProxy.java
    A commons/src/main/java/org/archive/net/webdriver/BiDiEvent.java
    A commons/src/main/java/org/archive/net/webdriver/BiDiJson.java
    A commons/src/main/java/org/archive/net/webdriver/BiDiModule.java
    A commons/src/main/java/org/archive/net/webdriver/Browser.java
    A commons/src/main/java/org/archive/net/webdriver/BrowsingContext.java
    A commons/src/main/java/org/archive/net/webdriver/LocalWebDriverBiDi.java
    A commons/src/main/java/org/archive/net/webdriver/Network.java
    A commons/src/main/java/org/archive/net/webdriver/Script.java
    A commons/src/main/java/org/archive/net/webdriver/Session.java
    A commons/src/main/java/org/archive/net/webdriver/WebDriverBiDi.java
    A commons/src/main/java/org/archive/net/webdriver/WebDriverException.java
    A commons/src/main/java/org/archive/util/IdleBarrier.java
    M engine/pom.xml
    A engine/src/main/java/org/archive/crawler/processor/Browser.java
    M engine/src/main/resources/org/archive/crawler/restlet/profile-crawler-beans.cxml
    A engine/src/test/java/org/archive/crawler/processor/BrowserTest.java
    A engine/src/test/resources/logging.properties
    M modules/pom.xml
    A modules/src/main/java/org/archive/modules/behaviors/Behavior.java
    A modules/src/main/java/org/archive/modules/behaviors/ExtractLinks.java
    A modules/src/main/java/org/archive/modules/behaviors/Page.java
    A modules/src/main/java/org/archive/modules/behaviors/ScrollDown.java
    M modules/src/main/java/org/archive/modules/fetcher/FetchHTTP2.java

Log Message:
-----------
Merge pull request #653 from internetarchive/bidi

Add Browser processor using WebDriver BiDi

Compare: https://github.com/internetarchive/heritrix3/compare/7e15d2fda1bc...52bbd8091b14

To unsubscribe from these emails, change your notification settings at https://github.com/internetarchive/heritrix3/settings/notifications
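
Note on commit 722552f: its log message says the Browser processor "can be configured with a list of behavior beans". A crawl configuration might wire that up roughly as in the fragment below. This is only a sketch. The class names are taken from the changed paths listed in that commit, but the bean id `browser`, the property name `behaviors` and the order of the two behaviors are assumptions that have not been checked against the merged profile-crawler-beans.cxml.

```xml
<!-- Hypothetical Spring bean fragment; the "behaviors" property name is assumed, not confirmed -->
<bean id="browser" class="org.archive.crawler.processor.Browser">
  <property name="behaviors">
    <list>
      <!-- Assumed ordering: scroll first, then extract links from the resulting page -->
      <bean class="org.archive.modules.behaviors.ScrollDown"/>
      <bean class="org.archive.modules.behaviors.ExtractLinks"/>
    </list>
  </property>
</bean>
```

The bean would presumably also need to be referenced from one of the crawl's processor chains. The commit message does not say where, so the profile-crawler-beans.cxml changed in this commit is the place to check for the actual wiring.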