From: Alex O. <no...@gi...> - 2025-05-23 08:33:57
  Branch: refs/heads/bidi
  Home:   https://github.com/internetarchive/heritrix3
  Commit: e4e62ed737d6616ba5e86da3e008aaaadcfed819
      https://github.com/internetarchive/heritrix3/commit/e4e62ed737d6616ba5e86da3e008aaaadcfed819
  Author: Alex Osborne <aos...@nl...>
  Date:   2025-05-23 (Fri, 23 May 2025)

  Changed paths:
    M commons/pom.xml
    A commons/src/main/java/org/archive/net/MitmProxy.java
    A commons/src/main/java/org/archive/net/webdriver/BiDiEvent.java
    A commons/src/main/java/org/archive/net/webdriver/BiDiJson.java
    A commons/src/main/java/org/archive/net/webdriver/BiDiModule.java
    A commons/src/main/java/org/archive/net/webdriver/Browser.java
    A commons/src/main/java/org/archive/net/webdriver/BrowsingContext.java
    A commons/src/main/java/org/archive/net/webdriver/LocalWebDriverBiDi.java
    A commons/src/main/java/org/archive/net/webdriver/Network.java
    A commons/src/main/java/org/archive/net/webdriver/Script.java
    A commons/src/main/java/org/archive/net/webdriver/Session.java
    A commons/src/main/java/org/archive/net/webdriver/WebDriverBiDi.java
    A commons/src/main/java/org/archive/net/webdriver/WebDriverException.java
    A commons/src/main/java/org/archive/util/IdleBarrier.java
    M engine/pom.xml
    A engine/src/main/java/org/archive/crawler/processor/Browser.java
    M engine/src/main/resources/org/archive/crawler/restlet/profile-crawler-beans.cxml
    A engine/src/test/java/org/archive/crawler/processor/BrowserTest.java
    A engine/src/test/resources/logging.properties
    M modules/pom.xml
    A modules/src/main/java/org/archive/modules/behaviors/Behavior.java
    A modules/src/main/java/org/archive/modules/behaviors/ExtractLinks.java
    A modules/src/main/java/org/archive/modules/behaviors/Page.java
    A modules/src/main/java/org/archive/modules/behaviors/ScrollDown.java
    M modules/src/main/java/org/archive/modules/fetcher/FetchHTTP2.java

  Log Message:
  -----------
  Add Browser processor using WebDriver BiDi

The Browser processor can load a fetched page in a local web browser,
record any requests the browser makes and run behaviors that interact
with the page, such as scrolling down and extracting links.

This differs from my previous attempt (ExtractorChrome) in a few ways:

- Uses the new WebDriver BiDi standard instead of the Chrome DevTools
  Protocol (CDP). The new protocol is mostly browser-agnostic, more
  consistent and hopefully more stable.
- Uses a MITM proxy instead of CDP request interception for recording
  sub-resources. That's partly because BiDi is still missing some key
  interception APIs. Even so, in practice I found the proxy method loads
  pages faster and more reliably, likely because responses can be
  streamed incrementally, which helps a lot for large resources or
  server-sent events.
- Even when HTTP/2 is unavailable, the new FetchHTTP2 module does
  connection pooling, which makes loading browser requests a lot faster.
  The original FetchHTTP opened a new connection for every request.
- The Browser processor can be configured with a list of behavior beans,
  making it more customizable and extensible.

Obvious areas for future development:

- More Behavior beans: take screenshots, save the rendered DOM, run
  Browsertrix-compatible behavior scripts
- Support for remote WebDrivers (e.g. Selenium Server or cloud services)
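
For readers unfamiliar with WebDriver BiDi: commands are JSON messages sent over a WebSocket, each carrying an `id`, a `method` and a `params` object. The following is only a rough sketch of what the messages for loading a page and scrolling it down might look like. The class `BiDiCommandSketch` and its methods are hypothetical and are not the API of the `org.archive.net.webdriver` classes added in this commit; the method and parameter names (`browsingContext.navigate`, `script.evaluate`) are taken from the BiDi specification, and the scroll expression is an assumption for illustration.

```java
// Hypothetical sketch: builds WebDriver BiDi command frames as JSON strings.
// It does NOT reflect the classes in this commit and opens no connection.
final class BiDiCommandSketch {

    /** Quotes a string as a JSON literal (handles only backslash and double quote). */
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    /** Wraps a method name and an already-serialized params object in a command envelope. */
    static String command(long id, String method, String paramsJson) {
        return "{\"id\":" + id + ",\"method\":" + quote(method)
                + ",\"params\":" + paramsJson + "}";
    }

    /** browsingContext.navigate: load a URL and wait until the load completes. */
    static String navigate(long id, String context, String url) {
        return command(id, "browsingContext.navigate",
                "{\"context\":" + quote(context)
                + ",\"url\":" + quote(url)
                + ",\"wait\":\"complete\"}");
    }

    /** script.evaluate: run a scroll expression in the page (expression is an assumption). */
    static String scrollDown(long id, String context) {
        return command(id, "script.evaluate",
                "{\"expression\":" + quote("window.scrollBy(0, window.innerHeight)")
                + ",\"target\":{\"context\":" + quote(context) + "}"
                + ",\"awaitPromise\":false}");
    }

    public static void main(String[] args) {
        // "ctx-1" stands in for a browsing context id returned by the browser.
        System.out.println(navigate(1, "ctx-1", "https://example.org/"));
        System.out.println(scrollDown(2, "ctx-1"));
    }
}
```

In a real client each frame would be sent over the WebSocket, and the reply with the matching `id` would be awaited before the next step.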