From: Neil M. <no...@gi...> - 2024-03-14 01:40:06
Branch: refs/heads/ait-qa
Home: https://github.com/internetarchive/heritrix3

Commit: 6de3580794d70838d73612aa9c200bbbdde3a7c6
  https://github.com/internetarchive/heritrix3/commit/6de3580794d70838d73612aa9c200bbbdde3a7c6
Author: Neil Minton <ne...@ar...>
Date: 2024-03-13 (Wed, 13 Mar 2024)

Changed paths:
  M contrib/src/main/java/org/archive/modules/extractor/ExtractorYoutubeDL.java

Log Message:
-----------
Modify process array.

Dealing with:
> java.io.IOException: Cannot run program "nice -n10": error=2, No such file or directory

To unsubscribe from these emails, change your notification settings at https://github.com/internetarchive/heritrix3/settings/notifications
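The exception quoted in the log message above is the one Java reports when a whole command string such as "nice -n10" is passed as a single element of a process argument array: the JVM then looks for an executable literally named `nice -n10`. The sketch below illustrates that failure mode and the usual remedy of one token per array element. It is not the code from this commit; the class name `NiceCommandSketch`, the helpers `withNice` and `canStart`, and the `echo` command are hypothetical and used for illustration only.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; not taken from ExtractorYoutubeDL.java.
public class NiceCommandSketch {

    /** Prefixes a command with nice, keeping every token in its own list element. */
    static List<String> withNice(int level, List<String> command) {
        List<String> full = new ArrayList<>();
        full.add("nice");          // the executable name on its own
        full.add("-n" + level);    // its option as a separate argument
        full.addAll(command);
        return full;
    }

    /** Returns true if the process could be started, false if start() threw IOException. */
    static boolean canStart(List<String> command) {
        try {
            Process p = new ProcessBuilder(command).redirectErrorStream(true).start();
            p.destroy();
            return true;
        } catch (IOException e) {
            System.out.println("failed: " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) {
        // Broken: "nice -n10" is treated as the name of one executable.
        canStart(Arrays.asList("nice -n10", "echo", "hello"));
        // Fixed: executable and option are separate elements.
        canStart(withNice(10, Arrays.asList("echo", "hello")));
    }
}
```

On a system where `nice` and `echo` are on the PATH, the first call should fail with an IOException like the one quoted above, while the second should start normally.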
From: Neil M. <no...@gi...> - 2024-03-13 16:54:44
Branch: refs/heads/ait-qa
Home: https://github.com/internetarchive/heritrix3

Commit: d608bb0b43ba19a2b2e5ad1338f4d32928216216
  https://github.com/internetarchive/heritrix3/commit/d608bb0b43ba19a2b2e5ad1338f4d32928216216
Author: Neil Minton <ne...@ar...>
Date: 2024-03-13 (Wed, 13 Mar 2024)

Changed paths:
  M contrib/src/main/java/org/archive/modules/extractor/ExtractorYoutubeDL.java

Log Message:
-----------
Make video extraction nicer.

Commit: 7eed2bc03caf25660c819b6d88b26e82da8f9e79
  https://github.com/internetarchive/heritrix3/commit/7eed2bc03caf25660c819b6d88b26e82da8f9e79
Author: Neil Minton <ne...@ar...>
Date: 2024-03-13 (Wed, 13 Mar 2024)

Changed paths:
  M contrib/src/main/java/org/archive/modules/extractor/ExtractorYoutubeDL.java

Log Message:
-----------
Update logging and comments.

Compare: https://github.com/internetarchive/heritrix3/compare/a1500586a88c...7eed2bc03caf
From: Alex O. <no...@gi...> - 2024-02-21 05:51:45
Branch: refs/heads/pdfbox
Home: https://github.com/internetarchive/heritrix3

Commit: c467065b7981b5a88cad675cc78a987d6f9e15eb
  https://github.com/internetarchive/heritrix3/commit/c467065b7981b5a88cad675cc78a987d6f9e15eb
Author: Alex Osborne <aos...@nl...>
Date: 2024-02-21 (Wed, 21 Feb 2024)

Changed paths:
  A modules/src/test/java/org/archive/modules/extractor/PDFParserTest.java
  A modules/src/test/resources/org/archive/crawler/modules/extractor/PDFParserTest.pdf

Log Message:
-----------
Add PDFParserTest

Commit: 3700628cb242f9788bd758beff95eb0c986e6305
  https://github.com/internetarchive/heritrix3/commit/3700628cb242f9788bd758beff95eb0c986e6305
Author: Alex Osborne <aos...@nl...>
Date: 2024-02-21 (Wed, 21 Feb 2024)

Changed paths:
  M commons/pom.xml
  R dist/src/main/licenses/itext.LICENSE
  M modules/pom.xml
  M modules/src/main/java/org/archive/modules/extractor/ExtractorPDF.java
  M modules/src/main/java/org/archive/modules/extractor/PDFParser.java

Log Message:
-----------
Switch PDFParser and ExtractorPDF to pdfbox

Commit: 19dd5c00de076f242997e6f79741fd181bfd2a2b
  https://github.com/internetarchive/heritrix3/commit/19dd5c00de076f242997e6f79741fd181bfd2a2b
Author: Alex Osborne <aos...@nl...>
Date: 2024-02-21 (Wed, 21 Feb 2024)

Changed paths:
  M contrib/pom.xml
  M contrib/src/main/java/org/archive/modules/extractor/ExtractorPDFContent.java
  M contrib/src/main/resources/log4j.xml
  M contrib/src/test/resources/log4j.xml
  M dist/src/main/conf/logging.properties

Log Message:
-----------
Switch ExtractorPDFContent to pdfbox

Compare: https://github.com/internetarchive/heritrix3/compare/c467065b7981%5E...19dd5c00de07
From: Neil M. <no...@gi...> - 2024-02-09 11:31:55
Branch: refs/heads/master-ait-contrib
Home: https://github.com/internetarchive/heritrix3

Commit: 3bf1c70eef0a4511594a72b719823f98db1a3ba9
  https://github.com/internetarchive/heritrix3/commit/3bf1c70eef0a4511594a72b719823f98db1a3ba9
Author: Barbara Miller <ga...@us...>
Date: 2024-02-09 (Fri, 09 Feb 2024)

Changed paths:
  M contrib/src/main/java/org/archive/modules/extractor/ExtractorYoutubeDL.java

Log Message:
-----------
Update ExtractorYoutubeDL.java

Commit: 75b18e6117ffac7bc1f557281b641a71a1e1d4ea
  https://github.com/internetarchive/heritrix3/commit/75b18e6117ffac7bc1f557281b641a71a1e1d4ea
Author: Barbara Miller <ba...@ar...>
Date: 2024-02-09 (Fri, 09 Feb 2024)

Changed paths:
  M contrib/src/main/java/org/archive/modules/extractor/ExtractorYoutubeDL.java

Log Message:
-----------
yt-dlp (not youtube-dl)

Commit: 4b390404f016e5b353250121554a3c36fba64258
  https://github.com/internetarchive/heritrix3/commit/4b390404f016e5b353250121554a3c36fba64258
Author: Barbara Miller <ba...@ar...>
Date: 2024-02-09 (Fri, 09 Feb 2024)

Changed paths:
  M contrib/src/main/java/org/archive/modules/extractor/ExtractorYoutubeDL.java

Log Message:
-----------
more updates for newer yt-dlp

Commit: ca079845c81843f0ecf394158e999d4ae2f56635
  https://github.com/internetarchive/heritrix3/commit/ca079845c81843f0ecf394158e999d4ae2f56635
Author: Neil Minton <ne...@ar...>
Date: 2024-02-09 (Fri, 09 Feb 2024)

Changed paths:
  M contrib/src/main/java/org/archive/modules/extractor/ExtractorYoutubeDL.java

Log Message:
-----------
Merge pull request #574 from internetarchive/yt-dlp

Update ExtractorYoutubeDL.java for yt-dlp v.2023.07.06 and better

Compare: https://github.com/internetarchive/heritrix3/compare/b25d71339e3c...ca079845c818
From: Neil M. <no...@gi...> - 2024-02-09 11:29:23
Branch: refs/heads/yt-dlp
Home: https://github.com/internetarchive/heritrix3

Commit: 3bf1c70eef0a4511594a72b719823f98db1a3ba9
  https://github.com/internetarchive/heritrix3/commit/3bf1c70eef0a4511594a72b719823f98db1a3ba9
Author: Barbara Miller <ga...@us...>
Date: 2024-02-09 (Fri, 09 Feb 2024)

Changed paths:
  M contrib/src/main/java/org/archive/modules/extractor/ExtractorYoutubeDL.java

Log Message:
-----------
Update ExtractorYoutubeDL.java

Commit: 75b18e6117ffac7bc1f557281b641a71a1e1d4ea
  https://github.com/internetarchive/heritrix3/commit/75b18e6117ffac7bc1f557281b641a71a1e1d4ea
Author: Barbara Miller <ba...@ar...>
Date: 2024-02-09 (Fri, 09 Feb 2024)

Changed paths:
  M contrib/src/main/java/org/archive/modules/extractor/ExtractorYoutubeDL.java

Log Message:
-----------
yt-dlp (not youtube-dl)

Commit: 4b390404f016e5b353250121554a3c36fba64258
  https://github.com/internetarchive/heritrix3/commit/4b390404f016e5b353250121554a3c36fba64258
Author: Barbara Miller <ba...@ar...>
Date: 2024-02-09 (Fri, 09 Feb 2024)

Changed paths:
  M contrib/src/main/java/org/archive/modules/extractor/ExtractorYoutubeDL.java

Log Message:
-----------
more updates for newer yt-dlp

Compare: https://github.com/internetarchive/heritrix3/compare/3bf1c70eef0a%5E...4b390404f016
From: Barbara M. <no...@gi...> - 2024-02-06 18:21:43
Branch: refs/heads/master
Home: https://github.com/internetarchive/heritrix3

Commit: d50158c0124ae99207f8a21bc7e87f6c49c72521
  https://github.com/internetarchive/heritrix3/commit/d50158c0124ae99207f8a21bc7e87f6c49c72521
Author: Barbara Miller <ga...@us...>
Date: 2023-09-06 (Wed, 06 Sep 2023)

Changed paths:
  M contrib/src/main/java/org/archive/modules/extractor/ExtractorYoutubeDL.java

Log Message:
-----------
Update ExtractorYoutubeDL.java

Commit: bcfed0c1f7b638a3386f2e48351bcaa802c410a1
  https://github.com/internetarchive/heritrix3/commit/bcfed0c1f7b638a3386f2e48351bcaa802c410a1
Author: Barbara Miller <ba...@ar...>
Date: 2023-09-29 (Fri, 29 Sep 2023)

Changed paths:
  M contrib/src/main/java/org/archive/modules/extractor/ExtractorYoutubeDL.java

Log Message:
-----------
yt-dlp (not youtube-dl)

Commit: e4c9eb210ad03d592cab67cce0d5e57cd556e3c6
  https://github.com/internetarchive/heritrix3/commit/e4c9eb210ad03d592cab67cce0d5e57cd556e3c6
Author: Barbara Miller <ba...@ar...>
Date: 2023-11-07 (Tue, 07 Nov 2023)

Changed paths:
  M contrib/src/main/java/org/archive/modules/extractor/ExtractorYoutubeDL.java

Log Message:
-----------
more updates for newer yt-dlp

Commit: 8549031e59a43d7b4f2763f42d84b589806c9176
  https://github.com/internetarchive/heritrix3/commit/8549031e59a43d7b4f2763f42d84b589806c9176
Author: Barbara Miller <325...@us...>
Date: 2024-02-06 (Tue, 06 Feb 2024)

Changed paths:
  M contrib/src/main/java/org/archive/modules/extractor/ExtractorYoutubeDL.java

Log Message:
-----------
Merge pull request #565 from internetarchive/yt-dlp-20230706

update ExtractorYoutubeDL.java for yt-dlp v.2023.07.06 and better

Compare: https://github.com/internetarchive/heritrix3/compare/d3ca691f8d11...8549031e59a4
From: Barbara M. <no...@gi...> - 2023-11-07 22:10:49
Branch: refs/heads/ait-qa
Home: https://github.com/internetarchive/heritrix3

Commit: a1500586a88cbb28badfd32823b3a82766fb91a7
  https://github.com/internetarchive/heritrix3/commit/a1500586a88cbb28badfd32823b3a82766fb91a7
Author: Barbara Miller <ba...@ar...>
Date: 2023-11-07 (Tue, 07 Nov 2023)

Changed paths:
  M contrib/src/main/java/org/archive/modules/extractor/ExtractorYoutubeDL.java

Log Message:
-----------
more updates for newer yt-dlp
From: Barbara M. <no...@gi...> - 2023-11-07 22:09:48
Branch: refs/heads/yt-dlp-20230706
Home: https://github.com/internetarchive/heritrix3

Commit: e4c9eb210ad03d592cab67cce0d5e57cd556e3c6
  https://github.com/internetarchive/heritrix3/commit/e4c9eb210ad03d592cab67cce0d5e57cd556e3c6
Author: Barbara Miller <ba...@ar...>
Date: 2023-11-07 (Tue, 07 Nov 2023)

Changed paths:
  M contrib/src/main/java/org/archive/modules/extractor/ExtractorYoutubeDL.java

Log Message:
-----------
more updates for newer yt-dlp
From: Alex O. <no...@gi...> - 2023-10-11 02:22:50
Branch: refs/heads/master
Home: https://github.com/internetarchive/heritrix3

Commit: d3ca691f8d111d9f9bd1f4ec9eecbbcf02942679
  https://github.com/internetarchive/heritrix3/commit/d3ca691f8d111d9f9bd1f4ec9eecbbcf02942679
Author: Alex Osborne <aos...@nl...>
Date: 2023-10-11 (Wed, 11 Oct 2023)

Changed paths:
  M .github/workflows/maven.yml

Log Message:
-----------
Update CI tests from JDK 20 to 21
From: Alex O. <no...@gi...> - 2023-10-11 02:13:44
Branch: refs/heads/jdk21
Home: https://github.com/internetarchive/heritrix3

Commit: 3d25da919b5153f8b7684bb1a77ff080f8fbea86
  https://github.com/internetarchive/heritrix3/commit/3d25da919b5153f8b7684bb1a77ff080f8fbea86
Author: Alex Osborne <aos...@nl...>
Date: 2023-10-11 (Wed, 11 Oct 2023)

Changed paths:
  M .github/workflows/maven.yml

Log Message:
-----------
test jdk21 branch
From: Alex O. <no...@gi...> - 2023-10-11 02:12:09
Branch: refs/heads/jdk21
Home: https://github.com/internetarchive/heritrix3

Commit: 50bc53a6fd8a957b8c235c7958dd1a48a1b325aa
  https://github.com/internetarchive/heritrix3/commit/50bc53a6fd8a957b8c235c7958dd1a48a1b325aa
Author: Alex Osborne <aos...@nl...>
Date: 2023-10-11 (Wed, 11 Oct 2023)

Changed paths:
  M .github/workflows/maven.yml

Log Message:
-----------
Update CI tests from JDK 20 to 21
From: Alex O. <no...@gi...> - 2023-10-11 02:10:18
Branch: refs/heads/jdk20
Home: https://github.com/internetarchive/heritrix3
From: Alex O. <no...@gi...> - 2023-10-11 02:10:14
Branch: refs/heads/master
Home: https://github.com/internetarchive/heritrix3

Commit: 1c07e43ade91b4586d00b5c978b596e72b4f5801
  https://github.com/internetarchive/heritrix3/commit/1c07e43ade91b4586d00b5c978b596e72b4f5801
Author: Alex Osborne <aos...@nl...>
Date: 2023-09-19 (Tue, 19 Sep 2023)

Changed paths:
  M modules/pom.xml

Log Message:
-----------
Update groovy to 4.0.15 for JDK 20 support

Commit: 1a27551e60cff9ba1f0eab7e8a0a83e5ea00b0e0
  https://github.com/internetarchive/heritrix3/commit/1a27551e60cff9ba1f0eab7e8a0a83e5ea00b0e0
Author: Alex Osborne <aos...@nl...>
Date: 2023-09-19 (Tue, 19 Sep 2023)

Changed paths:
  M .github/workflows/maven.yml

Log Message:
-----------
Update CI tests from JDK 19 to 20

Commit: bd6c6bbff58832ad8c0fc69ac1dd6d442f69c450
  https://github.com/internetarchive/heritrix3/commit/bd6c6bbff58832ad8c0fc69ac1dd6d442f69c450
Author: Alex Osborne <aos...@nl...>
Date: 2023-10-11 (Wed, 11 Oct 2023)

Changed paths:
  M .github/workflows/maven.yml
  M modules/pom.xml

Log Message:
-----------
Merge pull request #566 from internetarchive/jdk20

Update groovy to 4.0.15 for JDK 20 support

Compare: https://github.com/internetarchive/heritrix3/compare/8563f491a5b3...bd6c6bbff588
From: Barbara M. <no...@gi...> - 2023-09-29 23:13:19
|
Branch: refs/heads/ait-qa Home: https://github.com/internetarchive/heritrix3 Commit: e86a5fedfe1529029120751abe86f73b7f6363c2 https://github.com/internetarchive/heritrix3/commit/e86a5fedfe1529029120751abe86f73b7f6363c2 Author: Barbara Miller <ba...@ar...> Date: 2023-09-29 (Fri, 29 Sep 2023) Changed paths: R .github/workflows/docker-image.yml R .github/workflows/m2-settings.xml R .github/workflows/maven.yml A .travis.yml M CHANGELOG.md M README.md R SECURITY.md M commons/pom.xml M commons/src/main/java/org/archive/bdb/BdbModule.java M commons/src/main/java/org/archive/io/CrawlerJournal.java R commons/src/main/java/org/archive/net/ClientSFTP.java M commons/src/main/java/org/archive/spring/PathSharingContext.java M commons/src/main/java/org/archive/util/BloomFilter64bit.java A commons/src/main/java/org/archive/util/CLibrary.java A commons/src/main/java/org/archive/util/FilesystemLinkMaker.java M commons/src/main/java/org/archive/util/KeyTool.java M commons/src/main/java/org/archive/util/UriUtils.java M commons/src/test/java/org/archive/util/BloomFilterTest.java M commons/src/test/java/org/archive/util/ObjectIdentityBdbManualCacheTest.java M commons/src/test/java/org/archive/util/UriUtilsTest.java M contrib/pom.xml M contrib/src/main/assembly/dist.xml M contrib/src/main/java/org/archive/crawler/frontier/AMQPUrlReceiver.java R contrib/src/main/java/org/archive/modules/extractor/ExtractorChrome.java M contrib/src/main/java/org/archive/modules/extractor/ExtractorPDFContent.java M contrib/src/main/java/org/archive/modules/extractor/ExtractorYoutubeDL.java R contrib/src/main/java/org/archive/net/chrome/ChromeClient.java R contrib/src/main/java/org/archive/net/chrome/ChromeException.java R contrib/src/main/java/org/archive/net/chrome/ChromeProcess.java R contrib/src/main/java/org/archive/net/chrome/ChromeRequest.java R contrib/src/main/java/org/archive/net/chrome/ChromeWindow.java R contrib/src/main/java/org/archive/net/chrome/InterceptedRequest.java R 
contrib/src/test/java/org/archive/modules/extractor/ExtractorChromeTest.java R contrib/src/test/java/org/archive/net/chrome/ChromeClientTest.java A dist/README.txt M dist/pom.xml M dist/src/main/assembly/dist.xml A dist/src/main/licenses/mg4j-1.0.1.LICENSE R docker/Dockerfile R docker/Dockerfile.contrib R docker/Makefile R docker/README.md R docker/docker-compose.yml R docker/entrypoint.sh R docs/_ext/beandoc.py M docs/api.rst R docs/bean-reference.rst M docs/conf.py R docs/configuring-jobs.rst R docs/getting-started.rst R docs/glossary.rst M docs/index.rst R docs/operating.rst M docs/requirements.txt M engine/pom.xml M engine/src/main/java/org/archive/crawler/Heritrix.java M engine/src/main/java/org/archive/crawler/framework/ActionDirectory.java M engine/src/main/java/org/archive/crawler/framework/CheckpointService.java M engine/src/main/java/org/archive/crawler/framework/CrawlController.java M engine/src/main/java/org/archive/crawler/framework/CrawlJob.java M engine/src/main/java/org/archive/crawler/framework/Engine.java M engine/src/main/java/org/archive/crawler/framework/Frontier.java M engine/src/main/java/org/archive/crawler/framework/Scoper.java M engine/src/main/java/org/archive/crawler/framework/ToeThread.java M engine/src/main/java/org/archive/crawler/frontier/AbstractFrontier.java M engine/src/main/java/org/archive/crawler/frontier/BdbFrontier.java M engine/src/main/java/org/archive/crawler/frontier/BdbMultipleWorkQueues.java M engine/src/main/java/org/archive/crawler/frontier/QueueAssignmentPolicy.java M engine/src/main/java/org/archive/crawler/frontier/WorkQueueFrontier.java M engine/src/main/java/org/archive/crawler/frontier/precedence/BaseQueuePrecedencePolicy.java M engine/src/main/java/org/archive/crawler/frontier/precedence/BaseUriPrecedencePolicy.java M engine/src/main/java/org/archive/crawler/frontier/precedence/HopsUriPrecedencePolicy.java M engine/src/main/java/org/archive/crawler/frontier/precedence/PreloadedUriPrecedencePolicy.java M 
engine/src/main/java/org/archive/crawler/frontier/precedence/SuccessCountsQueuePrecedencePolicy.java M engine/src/main/java/org/archive/crawler/monitor/DiskSpaceMonitor.java M engine/src/main/java/org/archive/crawler/postprocessor/CandidatesProcessor.java M engine/src/main/java/org/archive/crawler/postprocessor/DispositionProcessor.java M engine/src/main/java/org/archive/crawler/postprocessor/ReschedulingProcessor.java M engine/src/main/java/org/archive/crawler/postprocessor/SupplementaryLinksScoper.java M engine/src/main/java/org/archive/crawler/prefetch/FrontierPreparer.java M engine/src/main/java/org/archive/crawler/prefetch/PreconditionEnforcer.java M engine/src/main/java/org/archive/crawler/prefetch/Preselector.java M engine/src/main/java/org/archive/crawler/prefetch/QuotaEnforcer.java M engine/src/main/java/org/archive/crawler/processor/HashCrawlMapper.java M engine/src/main/java/org/archive/crawler/reporting/CrawlSummaryReport.java M engine/src/main/java/org/archive/crawler/reporting/StatisticsTracker.java M engine/src/main/java/org/archive/crawler/restlet/BaseResource.java M engine/src/main/java/org/archive/crawler/restlet/BeanBrowseResource.java M engine/src/main/java/org/archive/crawler/restlet/EngineApplication.java M engine/src/main/java/org/archive/crawler/restlet/EngineResource.java M engine/src/main/java/org/archive/crawler/restlet/JobRelatedResource.java M engine/src/main/java/org/archive/crawler/restlet/JobResource.java M engine/src/main/java/org/archive/crawler/restlet/PagedRepresentation.java M engine/src/main/java/org/archive/crawler/restlet/RateLimitGuard.java M engine/src/main/java/org/archive/crawler/restlet/ScriptResource.java M engine/src/main/java/org/archive/crawler/restlet/ScriptingConsole.java M engine/src/main/java/org/archive/crawler/util/RecoveryLogMapper.java M engine/src/main/resources/org/archive/crawler/migrate/migrate-template-crawler-beans.cxml M engine/src/main/resources/org/archive/crawler/restlet/Beans.ftl M 
engine/src/main/resources/org/archive/crawler/restlet/profile-crawler-beans.cxml M engine/src/test/java/org/archive/crawler/prefetch/QuotaEnforcerTest.java M engine/src/test/java/org/archive/crawler/restlet/XmlMarshallerTest.java M engine/src/test/java/org/archive/crawler/selftest/StatisticsSelfTest.java M modules/pom.xml M modules/src/main/java/org/archive/modules/CoreAttributeConstants.java M modules/src/main/java/org/archive/modules/CrawlMetadata.java M modules/src/main/java/org/archive/modules/CrawlURI.java M modules/src/main/java/org/archive/modules/Processor.java M modules/src/main/java/org/archive/modules/canonicalize/RegexRule.java M modules/src/main/java/org/archive/modules/credential/CredentialStore.java M modules/src/main/java/org/archive/modules/deciderules/ContentLengthDecideRule.java M modules/src/main/java/org/archive/modules/deciderules/DecideRuleSequence.java M modules/src/main/java/org/archive/modules/deciderules/IpAddressSetDecideRule.java M modules/src/main/java/org/archive/modules/deciderules/MatchesListRegexDecideRule.java M modules/src/main/java/org/archive/modules/deciderules/PathologicalPathDecideRule.java M modules/src/main/java/org/archive/modules/deciderules/ResourceNoLongerThanDecideRule.java M modules/src/main/java/org/archive/modules/deciderules/SchemeNotInSetDecideRule.java M modules/src/main/java/org/archive/modules/deciderules/TooManyHopsDecideRule.java M modules/src/main/java/org/archive/modules/deciderules/TooManyPathSegmentsDecideRule.java M modules/src/main/java/org/archive/modules/deciderules/TransclusionDecideRule.java M modules/src/main/java/org/archive/modules/deciderules/surt/SurtPrefixedDecideRule.java M modules/src/main/java/org/archive/modules/extractor/Extractor.java M modules/src/main/java/org/archive/modules/extractor/ExtractorHTML.java M modules/src/main/java/org/archive/modules/extractor/ExtractorImpliedURI.java M modules/src/main/java/org/archive/modules/extractor/ExtractorPDF.java R 
modules/src/main/java/org/archive/modules/extractor/ExtractorRobotsTxt.java R modules/src/main/java/org/archive/modules/extractor/ExtractorSitemap.java M modules/src/main/java/org/archive/modules/extractor/ExtractorUniversal.java M modules/src/main/java/org/archive/modules/extractor/HTMLLinkContext.java M modules/src/main/java/org/archive/modules/extractor/HTTPContentDigest.java M modules/src/main/java/org/archive/modules/extractor/Hop.java M modules/src/main/java/org/archive/modules/extractor/JerichoExtractorHTML.java M modules/src/main/java/org/archive/modules/extractor/LinkContext.java M modules/src/main/java/org/archive/modules/fetcher/AbstractCookieStore.java M modules/src/main/java/org/archive/modules/fetcher/BdbCookieStore.java M modules/src/main/java/org/archive/modules/fetcher/FetchDNS.java M modules/src/main/java/org/archive/modules/fetcher/FetchFTP.java M modules/src/main/java/org/archive/modules/fetcher/FetchHTTP.java M modules/src/main/java/org/archive/modules/fetcher/FetchHTTPRequest.java R modules/src/main/java/org/archive/modules/fetcher/FetchSFTP.java M modules/src/main/java/org/archive/modules/fetcher/FetchWhois.java M modules/src/main/java/org/archive/modules/fetcher/SimpleCookieStore.java R modules/src/main/java/org/archive/modules/fetcher/SocksSSLSocketFactory.java R modules/src/main/java/org/archive/modules/fetcher/SocksSocketFactory.java M modules/src/main/java/org/archive/modules/forms/ExtractorHTMLForms.java M modules/src/main/java/org/archive/modules/forms/FormLoginProcessor.java M modules/src/main/java/org/archive/modules/net/RobotsPolicy.java R modules/src/main/java/org/archive/modules/net/RobotsTxtOnlyPolicy.java M modules/src/main/java/org/archive/modules/net/Robotstxt.java M modules/src/main/java/org/archive/modules/recrawl/AbstractPersistProcessor.java M modules/src/main/java/org/archive/modules/warc/BaseWARCRecordBuilder.java M modules/src/main/java/org/archive/modules/warc/DnsResponseRecordBuilder.java M 
modules/src/main/java/org/archive/modules/warc/FtpControlConversationRecordBuilder.java M modules/src/main/java/org/archive/modules/warc/FtpResponseRecordBuilder.java M modules/src/main/java/org/archive/modules/warc/HttpResponseRecordBuilder.java M modules/src/main/java/org/archive/modules/warc/MetadataRecordBuilder.java M modules/src/main/java/org/archive/modules/warc/RevisitRecordBuilder.java M modules/src/main/java/org/archive/modules/warc/WhoisResponseRecordBuilder.java M modules/src/main/java/org/archive/modules/writer/WARCWriterProcessor.java M modules/src/main/java/org/archive/modules/writer/WriterPoolProcessor.java R modules/src/test/java/org/archive/modules/deciderules/MatchesListRegexDecideRuleTest.java M modules/src/test/java/org/archive/modules/extractor/ExtractorHTMLTest.java M modules/src/test/java/org/archive/modules/extractor/JerichoExtractorHTMLTest.java M modules/src/test/java/org/archive/modules/extractor/UnitTestUriLoggerModule.java M modules/src/test/java/org/archive/modules/fetcher/CookieFetchHTTPIntegrationTest.java M modules/src/test/java/org/archive/modules/fetcher/CookieStoreTest.java M modules/src/test/java/org/archive/modules/fetcher/FetchFTPTest.java M modules/src/test/java/org/archive/modules/fetcher/FetchHTTPTest.java R modules/src/test/java/org/archive/modules/fetcher/FetchHTTPTestServers.java A modules/src/test/java/org/archive/modules/fetcher/FetchHTTPTests.java M modules/src/test/java/org/archive/modules/forms/FormLoginProcessorTest.java M modules/src/test/java/org/archive/modules/recrawl/ContentDigestHistoryTest.java M pom.xml Log Message: ----------- Revert "Merge branch 'yt-dlp-20230706' into ait-qa" This reverts commit 1074d95e2d5850ec026377c3fb68b77e8fa7caba, reversing changes made to 122417c0fbc2ee3cfd99f08a06dc334279ce5dd4. 

Commit: e19bdc0af1b8c73d7de062b2b730d127e79978fc
  https://github.com/internetarchive/heritrix3/commit/e19bdc0af1b8c73d7de062b2b730d127e79978fc
Author: Barbara Miller <ba...@ar...>
Date: 2023-09-29 (Fri, 29 Sep 2023)

Changed paths:
  M contrib/src/main/java/org/archive/modules/extractor/ExtractorYoutubeDL.java

Log Message:
-----------
yt-dlp (not youtube-dl)

Compare: https://github.com/internetarchive/heritrix3/compare/1074d95e2d58...e19bdc0af1b8
From: Barbara M. <no...@gi...> - 2023-09-29 22:49:05
|
Branch: refs/heads/ait-qa Home: https://github.com/internetarchive/heritrix3 Commit: 8ddaf41db27a358f99e989656b885a00fc7faa45 https://github.com/internetarchive/heritrix3/commit/8ddaf41db27a358f99e989656b885a00fc7faa45 Author: kristinn <kri...@la...> Date: 2019-07-05 (Fri, 05 Jul 2019) Changed paths: M commons/pom.xml M modules/pom.xml Log Message: ----------- Add crawler-commons dependency. Bump commons-io dependency version to match crawler-commons. Commit: 0f7675ab5edce6c78a85e2c657beab229161bc70 https://github.com/internetarchive/heritrix3/commit/0f7675ab5edce6c78a85e2c657beab229161bc70 Author: kristinn <kri...@la...> Date: 2019-07-05 (Fri, 05 Jul 2019) Changed paths: M modules/src/main/java/org/archive/modules/extractor/Hop.java M modules/src/main/java/org/archive/modules/extractor/LinkContext.java Log Message: ----------- Add support for new hopptype, (M)anifest Commit: 5941d59b0e4684b0d3f3bc643f17228086f78671 https://github.com/internetarchive/heritrix3/commit/5941d59b0e4684b0d3f3bc643f17228086f78671 Author: kristinn <kri...@la...> Date: 2019-07-05 (Fri, 05 Jul 2019) Changed paths: A modules/src/main/java/org/archive/modules/extractor/ExtractorRobotsTxt.java Log Message: ----------- Add extractor to get sitemap url from robots.txt Commit: ba8f669e1c2802683695348d3d0fd21390c100ee https://github.com/internetarchive/heritrix3/commit/ba8f669e1c2802683695348d3d0fd21390c100ee Author: kristinn <kri...@la...> Date: 2019-07-05 (Fri, 05 Jul 2019) Changed paths: A modules/src/main/java/org/archive/modules/extractor/ExtractorSitemap.java Log Message: ----------- Add extractor that handles sitemaps Commit: 308cee85a3c0217e5c7f8f71df5eacb5ba9835d9 https://github.com/internetarchive/heritrix3/commit/308cee85a3c0217e5c7f8f71df5eacb5ba9835d9 Author: kristinn <kri...@la...> Date: 2019-07-05 (Fri, 05 Jul 2019) Changed paths: M engine/src/main/resources/org/archive/crawler/restlet/profile-crawler-beans.cxml Log Message: ----------- Add sitemap extraction to default profile 
Commit: 45cd46e030ad035a9acb6a3b8b6690b7025be75d https://github.com/internetarchive/heritrix3/commit/45cd46e030ad035a9acb6a3b8b6690b7025be75d Author: Andrew Jackson <And...@bl...> Date: 2019-08-11 (Sun, 11 Aug 2019) Changed paths: M commons/pom.xml M pom.xml Log Message: ----------- Try upping to BDB JE 7.5.11 Commit: 7dab6ec2750095bdc6439daaf09458bb37522e36 https://github.com/internetarchive/heritrix3/commit/7dab6ec2750095bdc6439daaf09458bb37522e36 Author: Andrew Jackson <And...@bl...> Date: 2019-08-11 (Sun, 11 Aug 2019) Changed paths: M modules/src/test/java/org/archive/modules/fetcher/CookieStoreTest.java Log Message: ----------- Avoid using Thread.interrupt as this freaks BDB-JE. Commit: f63570f9778fc06fac4be5f19eb8818e4a1954e9 https://github.com/internetarchive/heritrix3/commit/f63570f9778fc06fac4be5f19eb8818e4a1954e9 Author: Andrew Jackson <And...@bl...> Date: 2019-08-12 (Mon, 12 Aug 2019) Changed paths: M README.md M engine/src/main/java/org/archive/crawler/Heritrix.java M engine/src/main/java/org/archive/crawler/restlet/RateLimitGuard.java Log Message: ----------- Merge branch 'master' into upgrade-bdb-je Commit: 73db3ac6288b7f372eadd59916bcc111c3289c4f https://github.com/internetarchive/heritrix3/commit/73db3ac6288b7f372eadd59916bcc111c3289c4f Author: Andrew Jackson <And...@bl...> Date: 2019-10-17 (Thu, 17 Oct 2019) Changed paths: M engine/src/main/java/org/archive/crawler/frontier/AssignmentLevelSurtQueueAssignmentPolicy.java M engine/src/main/java/org/archive/crawler/frontier/URIAuthorityBasedQueueAssignmentPolicy.java M engine/src/main/java/org/archive/crawler/prefetch/QuotaEnforcer.java M engine/src/main/java/org/archive/crawler/restlet/EnhDirectoryResource.java M modules/src/test/java/org/archive/modules/fetcher/CookieStoreTest.java Log Message: ----------- Merge branch 'master' into upgrade-bdb-je Commit: c611c46c8a931edc295e8ceae4deed846cc8cd58 https://github.com/internetarchive/heritrix3/commit/c611c46c8a931edc295e8ceae4deed846cc8cd58 Author: 
Nicholas Clarke <ni...@kb...> Date: 2019-11-15 (Fri, 15 Nov 2019) Changed paths: M engine/src/main/java/org/archive/crawler/framework/CrawlJob.java M engine/src/main/java/org/archive/crawler/framework/Frontier.java M engine/src/main/java/org/archive/crawler/frontier/BdbFrontier.java M engine/src/main/java/org/archive/crawler/frontier/BdbMultipleWorkQueues.java M engine/src/test/java/org/archive/crawler/prefetch/QuotaEnforcerTest.java Log Message: ----------- Merged frontier-management with upgrade to bdb 7 Commit: f22a97875d4d1d7e6517df9c01e4ac096eb11896 https://github.com/internetarchive/heritrix3/commit/f22a97875d4d1d7e6517df9c01e4ac096eb11896 Author: Colin Rosenthal <cs...@st...> Date: 2019-11-15 (Fri, 15 Nov 2019) Changed paths: A dist/heritrix.iml M modules/src/main/java/org/archive/modules/deciderules/MatchesListRegexDecideRule.java A modules/src/test/java/org/archive/modules/deciderules/MatchesListRegexDecideRuleTest.java Log Message: ----------- Added a timeout to crawlertrap regex matching Commit: 718cc5ae3258f93752d999eab35b53f0871c0810 https://github.com/internetarchive/heritrix3/commit/718cc5ae3258f93752d999eab35b53f0871c0810 Author: Colin Rosenthal <cs...@st...> Date: 2019-11-15 (Fri, 15 Nov 2019) Changed paths: M modules/src/main/java/org/archive/modules/deciderules/MatchesListRegexDecideRule.java A modules/src/test/java/org/archive/modules/deciderules/MatchesListRegexDecideRuleTest.java Log Message: ----------- Updated manually to new SNAPSHOT version Commit: f76ec2f3623e7fe58da4ed91d90e30691c82d39d https://github.com/internetarchive/heritrix3/commit/f76ec2f3623e7fe58da4ed91d90e30691c82d39d Author: Colin Rosenthal <cs...@st...> Date: 2019-11-15 (Fri, 15 Nov 2019) Changed paths: A dist/heritrix.iml Log Message: ----------- Merge remote-tracking branch 'origin/crawltrap-regex-timeout' into crawltrap-regex-timeout Commit: bf5842c02ed3516713d792b358c7c0a7c8bc52bd https://github.com/internetarchive/heritrix3/commit/bf5842c02ed3516713d792b358c7c0a7c8bc52bd 
Author: Colin Rosenthal <cs...@st...> Date: 2019-11-15 (Fri, 15 Nov 2019) Changed paths: A .idea/libraries/Maven__com_101tec_zkclient_0_7.xml A .idea/libraries/Maven__com_google_code_gson_gson_2_2_4.xml A .idea/libraries/Maven__com_googlecode_json_simple_json_simple_1_1_1.xml A .idea/libraries/Maven__com_rethinkdb_rethinkdb_driver_2_3_3.xml A .idea/libraries/Maven__com_sleepycat_je_4_1_6.xml A .idea/libraries/Maven__com_sun_istack_istack_commons_runtime_3_0_7.xml A .idea/libraries/Maven__com_sun_xml_fastinfoset_FastInfoset_1_2_15.xml A .idea/libraries/Maven__javax_activation_javax_activation_api_1_2_0.xml A .idea/libraries/Maven__javax_servlet_javax_servlet_api_3_1_0.xml A .idea/libraries/Maven__javax_xml_bind_jaxb_api_2_3_1.xml A .idea/libraries/Maven__junit_junit_4_10.xml A .idea/libraries/Maven__org_apache_avro_avro_1_7_6_cdh5_3_5.xml A .idea/libraries/Maven__org_apache_curator_curator_client_2_6_0.xml A .idea/libraries/Maven__org_apache_curator_curator_framework_2_6_0.xml A .idea/libraries/Maven__org_apache_curator_curator_recipes_2_6_0.xml A .idea/libraries/Maven__org_apache_directory_api_api_asn1_api_1_0_0_M20.xml A .idea/libraries/Maven__org_apache_directory_api_api_util_1_0_0_M20.xml A .idea/libraries/Maven__org_apache_directory_server_apacheds_i18n_2_0_0_M15.xml A .idea/libraries/Maven__org_apache_directory_server_apacheds_kerberos_codec_2_0_0_M15.xml A .idea/libraries/Maven__org_apache_hadoop_hadoop_annotations_2_5_0_cdh5_3_5.xml A .idea/libraries/Maven__org_apache_hadoop_hadoop_auth_2_5_0_cdh5_3_5.xml A .idea/libraries/Maven__org_apache_hadoop_hadoop_common_2_5_0_cdh5_3_5.xml A .idea/libraries/Maven__org_apache_hadoop_hadoop_core_2_5_0_mr1_cdh5_3_5.xml A .idea/libraries/Maven__org_apache_hbase_hbase_client_0_98_6_cdh5_3_5.xml A .idea/libraries/Maven__org_apache_hbase_hbase_common_0_98_6_cdh5_3_5.xml A .idea/libraries/Maven__org_apache_hbase_hbase_protocol_0_98_6_cdh5_3_5.xml A .idea/libraries/Maven__org_apache_kafka_kafka_2_10_0_9_0_0.xml A 
.idea/libraries/Maven__org_apache_kafka_kafka_clients_0_9_0_0.xml A .idea/libraries/Maven__org_apache_zookeeper_zookeeper_3_4_5_cdh5_3_5.xml A .idea/libraries/Maven__org_cloudera_htrace_htrace_core_2_04.xml A .idea/libraries/Maven__org_eclipse_jetty_jetty_http_9_4_19_v20190610.xml A .idea/libraries/Maven__org_eclipse_jetty_jetty_io_9_4_19_v20190610.xml A .idea/libraries/Maven__org_eclipse_jetty_jetty_security_9_4_19_v20190610.xml A .idea/libraries/Maven__org_eclipse_jetty_jetty_server_9_4_19_v20190610.xml A .idea/libraries/Maven__org_eclipse_jetty_jetty_servlet_9_4_19_v20190610.xml A .idea/libraries/Maven__org_eclipse_jetty_jetty_util_9_4_19_v20190610.xml A .idea/libraries/Maven__org_glassfish_jaxb_jaxb_runtime_2_3_1.xml A .idea/libraries/Maven__org_glassfish_jaxb_txw2_2_3_1.xml A .idea/libraries/Maven__org_hamcrest_hamcrest_core_1_1.xml A .idea/libraries/Maven__org_jvnet_staxex_stax_ex_1_8.xml A .idea/libraries/Maven__org_mortbay_jetty_jetty_6_1_26_cloudera_4.xml A .idea/libraries/Maven__org_netpreserve_commons_webarchive_commons_1_1_8.xml A .idea/libraries/Maven__org_restlet_jse_org_restlet_2_4_0.xml A .idea/libraries/Maven__org_restlet_jse_org_restlet_ext_crypto_2_4_0.xml A .idea/libraries/Maven__org_restlet_jse_org_restlet_ext_jetty_2_4_0.xml A .idea/libraries/Maven__org_restlet_jse_org_restlet_ext_xml_2_4_0.xml A .idea/libraries/Maven__org_scala_lang_scala_library_2_10_5.xml A .idea/libraries/Maven__org_slf4j_slf4j_api_1_7_12.xml A .idea/libraries/Maven__org_slf4j_slf4j_log4j12_1_7_6.xml A .idea/libraries/Maven__org_xerial_snappy_snappy_java_1_1_1_7.xml R dist/heritrix.iml Log Message: ----------- removed .iml file Commit: 5e536862db47d6d3f4d28c853e56e09c646d42bb https://github.com/internetarchive/heritrix3/commit/5e536862db47d6d3f4d28c853e56e09c646d42bb Author: Colin Rosenthal <cs...@st...> Date: 2019-11-15 (Fri, 15 Nov 2019) Changed paths: R .idea/libraries/Maven__com_101tec_zkclient_0_7.xml R .idea/libraries/Maven__com_google_code_gson_gson_2_2_4.xml R 
.idea/libraries/Maven__com_googlecode_json_simple_json_simple_1_1_1.xml R .idea/libraries/Maven__com_rethinkdb_rethinkdb_driver_2_3_3.xml R .idea/libraries/Maven__com_sleepycat_je_4_1_6.xml R .idea/libraries/Maven__com_sun_istack_istack_commons_runtime_3_0_7.xml R .idea/libraries/Maven__com_sun_xml_fastinfoset_FastInfoset_1_2_15.xml R .idea/libraries/Maven__javax_activation_javax_activation_api_1_2_0.xml R .idea/libraries/Maven__javax_servlet_javax_servlet_api_3_1_0.xml R .idea/libraries/Maven__javax_xml_bind_jaxb_api_2_3_1.xml R .idea/libraries/Maven__junit_junit_4_10.xml R .idea/libraries/Maven__org_apache_avro_avro_1_7_6_cdh5_3_5.xml R .idea/libraries/Maven__org_apache_curator_curator_client_2_6_0.xml R .idea/libraries/Maven__org_apache_curator_curator_framework_2_6_0.xml R .idea/libraries/Maven__org_apache_curator_curator_recipes_2_6_0.xml R .idea/libraries/Maven__org_apache_directory_api_api_asn1_api_1_0_0_M20.xml R .idea/libraries/Maven__org_apache_directory_api_api_util_1_0_0_M20.xml R .idea/libraries/Maven__org_apache_directory_server_apacheds_i18n_2_0_0_M15.xml R .idea/libraries/Maven__org_apache_directory_server_apacheds_kerberos_codec_2_0_0_M15.xml R .idea/libraries/Maven__org_apache_hadoop_hadoop_annotations_2_5_0_cdh5_3_5.xml R .idea/libraries/Maven__org_apache_hadoop_hadoop_auth_2_5_0_cdh5_3_5.xml R .idea/libraries/Maven__org_apache_hadoop_hadoop_common_2_5_0_cdh5_3_5.xml R .idea/libraries/Maven__org_apache_hadoop_hadoop_core_2_5_0_mr1_cdh5_3_5.xml R .idea/libraries/Maven__org_apache_hbase_hbase_client_0_98_6_cdh5_3_5.xml R .idea/libraries/Maven__org_apache_hbase_hbase_common_0_98_6_cdh5_3_5.xml R .idea/libraries/Maven__org_apache_hbase_hbase_protocol_0_98_6_cdh5_3_5.xml R .idea/libraries/Maven__org_apache_kafka_kafka_2_10_0_9_0_0.xml R .idea/libraries/Maven__org_apache_kafka_kafka_clients_0_9_0_0.xml R .idea/libraries/Maven__org_apache_zookeeper_zookeeper_3_4_5_cdh5_3_5.xml R .idea/libraries/Maven__org_cloudera_htrace_htrace_core_2_04.xml R 
.idea/libraries/Maven__org_eclipse_jetty_jetty_http_9_4_19_v20190610.xml R .idea/libraries/Maven__org_eclipse_jetty_jetty_io_9_4_19_v20190610.xml R .idea/libraries/Maven__org_eclipse_jetty_jetty_security_9_4_19_v20190610.xml R .idea/libraries/Maven__org_eclipse_jetty_jetty_server_9_4_19_v20190610.xml R .idea/libraries/Maven__org_eclipse_jetty_jetty_servlet_9_4_19_v20190610.xml R .idea/libraries/Maven__org_eclipse_jetty_jetty_util_9_4_19_v20190610.xml R .idea/libraries/Maven__org_glassfish_jaxb_jaxb_runtime_2_3_1.xml R .idea/libraries/Maven__org_glassfish_jaxb_txw2_2_3_1.xml R .idea/libraries/Maven__org_hamcrest_hamcrest_core_1_1.xml R .idea/libraries/Maven__org_jvnet_staxex_stax_ex_1_8.xml R .idea/libraries/Maven__org_mortbay_jetty_jetty_6_1_26_cloudera_4.xml R .idea/libraries/Maven__org_netpreserve_commons_webarchive_commons_1_1_8.xml R .idea/libraries/Maven__org_restlet_jse_org_restlet_2_4_0.xml R .idea/libraries/Maven__org_restlet_jse_org_restlet_ext_crypto_2_4_0.xml R .idea/libraries/Maven__org_restlet_jse_org_restlet_ext_jetty_2_4_0.xml R .idea/libraries/Maven__org_restlet_jse_org_restlet_ext_xml_2_4_0.xml R .idea/libraries/Maven__org_scala_lang_scala_library_2_10_5.xml R .idea/libraries/Maven__org_slf4j_slf4j_api_1_7_12.xml R .idea/libraries/Maven__org_slf4j_slf4j_log4j12_1_7_6.xml R .idea/libraries/Maven__org_xerial_snappy_snappy_java_1_1_1_7.xml Log Message: ----------- Removed mistaken additions Commit: bc6a15b41dd7d5b48f51b8ee5fb40884717df276 https://github.com/internetarchive/heritrix3/commit/bc6a15b41dd7d5b48f51b8ee5fb40884717df276 Author: Jonathan Leitschuh <Jon...@gm...> Date: 2020-02-11 (Tue, 11 Feb 2020) Changed paths: M commons/pom.xml M engine/pom.xml Log Message: ----------- Use HTTPS to resolve dependencies in Maven Build where possible Commit: 6fcc5e808a790ff9e8bf640e9f88bab7b71214ab https://github.com/internetarchive/heritrix3/commit/6fcc5e808a790ff9e8bf640e9f88bab7b71214ab Author: Alex Osborne <aos...@nl...> Date: 2020-02-13 (Thu, 13 Feb 
2020) Changed paths: M contrib/pom.xml Log Message: ----------- Exclude hbase-client's guava 12 transitive dependency Guava 12 from hbase-client is closer to the root of the dependency tree than guava 17 from webarchive-commons so Maven prefers it. But recent changes to heritrix-commons rely on classes in the newer version of Guava. So let's ensure webarchive-commons wins. Hopefully this doesn't break the hbase module, I have no way of testing it. Fixes #311 Commit: 66535370ff64dd3ff9334664fca876e6e2052fce https://github.com/internetarchive/heritrix3/commit/66535370ff64dd3ff9334664fca876e6e2052fce Author: Alex Osborne <aos...@nl...> Date: 2020-02-28 (Fri, 28 Feb 2020) Changed paths: M contrib/pom.xml Log Message: ----------- Merge pull request #312 from internetarchive/guava-dep Exclude hbase-client's guava 12 transitive dependency Commit: 60e40531802ca84eef89b6c6fcd04e5540adc390 https://github.com/internetarchive/heritrix3/commit/60e40531802ca84eef89b6c6fcd04e5540adc390 Author: Colin Rosenthal <cs...@st...> Date: 2020-03-03 (Tue, 03 Mar 2020) Changed paths: M modules/src/test/java/org/archive/modules/deciderules/MatchesListRegexDecideRuleTest.java Log Message: ----------- Commented this test back in to make Travis happy Commit: e54258756a10af47b8ae5ce40ff65c21ef3da835 https://github.com/internetarchive/heritrix3/commit/e54258756a10af47b8ae5ce40ff65c21ef3da835 Author: Colin Rosenthal <cs...@st...> Date: 2020-03-03 (Tue, 03 Mar 2020) Changed paths: M modules/src/main/java/org/archive/modules/deciderules/MatchesListRegexDecideRule.java Log Message: ----------- Fixed regex timeout handling following suggestion https://github.com/internetarchive/heritrix3/pull/290#discussion_r366711640 Commit: 977f608c746202e27d1160d02d9a54008a203564 https://github.com/internetarchive/heritrix3/commit/977f608c746202e27d1160d02d9a54008a203564 Author: Alex Osborne <aos...@nl...> Date: 2020-03-03 (Tue, 03 Mar 2020) Changed paths: M 
modules/src/main/java/org/archive/modules/deciderules/MatchesListRegexDecideRule.java A modules/src/test/java/org/archive/modules/deciderules/MatchesListRegexDecideRuleTest.java Log Message: ----------- Merge pull request #290 from netarchivesuite/crawltrap-regex-timeout Crawltrap regex timeout Commit: 7f80b4fd983f1fa72f37bd43491974f3f9557ca5 https://github.com/internetarchive/heritrix3/commit/7f80b4fd983f1fa72f37bd43491974f3f9557ca5 Author: Andy Jackson <And...@bl...> Date: 2020-03-04 (Wed, 04 Mar 2020) Changed paths: M commons/pom.xml M modules/src/test/java/org/archive/modules/fetcher/CookieStoreTest.java M pom.xml Log Message: ----------- Merge pull request #281 from ukwa/upgrade-bdb-je WIP: Upgrade BDB JE to version 7.5.11 Commit: e0c82f70e98ea0845614b2057e2f47139ff25b0e https://github.com/internetarchive/heritrix3/commit/e0c82f70e98ea0845614b2057e2f47139ff25b0e Author: Andy Jackson <And...@bl...> Date: 2020-03-04 (Wed, 04 Mar 2020) Changed paths: M engine/src/main/java/org/archive/crawler/framework/CrawlJob.java M engine/src/main/java/org/archive/crawler/framework/Frontier.java M engine/src/main/java/org/archive/crawler/frontier/BdbFrontier.java M engine/src/main/java/org/archive/crawler/frontier/BdbMultipleWorkQueues.java M engine/src/test/java/org/archive/crawler/prefetch/QuotaEnforcerTest.java Log Message: ----------- Merge pull request #289 from netarchivesuite/bdb-frontier-access Bdb frontier access Commit: 348b5330bc21cbdae0554f3ef75c1991baf328ca https://github.com/internetarchive/heritrix3/commit/348b5330bc21cbdae0554f3ef75c1991baf328ca Author: Tim Hennekey <the...@hu...> Date: 2020-03-04 (Wed, 04 Mar 2020) Changed paths: M commons/src/main/java/org/archive/util/BloomFilter64bit.java Log Message: ----------- Utilize the `d` parameter There are uses of this class that necessitate a lower false positive rate than the value that was hard-coded. 
By making use of the parameter, as indicated in the JavaDoc, the resulting delegate should meet the demands of the caller. Commit: b4494a7ffb3fb51f3759e1ff8783eb8b3c5ce793 https://github.com/internetarchive/heritrix3/commit/b4494a7ffb3fb51f3759e1ff8783eb8b3c5ce793 Author: Andrew Jackson <And...@bl...> Date: 2020-03-04 (Wed, 04 Mar 2020) Changed paths: M commons/src/main/java/org/archive/bdb/BdbModule.java Log Message: ----------- Use the Wayback Machine to repair a link to Oracle docs. Commit: d5fad011221b5c2312fa112596c412a9b8119432 https://github.com/internetarchive/heritrix3/commit/d5fad011221b5c2312fa112596c412a9b8119432 Author: Andrew Jackson <And...@bl...> Date: 2020-03-04 (Wed, 04 Mar 2020) Changed paths: M CHANGELOG.md Log Message: ----------- Re-sync of changelog prior to release. Commit: 9f7887a5fa49a5cb052edc3285578d80e2245873 https://github.com/internetarchive/heritrix3/commit/9f7887a5fa49a5cb052edc3285578d80e2245873 Author: Andy Jackson <And...@bl...> Date: 2020-03-04 (Wed, 04 Mar 2020) Changed paths: M commons/src/main/java/org/archive/util/BloomFilter64bit.java Log Message: ----------- Merge pull request #314 from hennekey/parameterize-bloom-filter Utilize the `d` parameter Commit: fd514b7bf65f3d9eaed2ad5f76cdcd666875664c https://github.com/internetarchive/heritrix3/commit/fd514b7bf65f3d9eaed2ad5f76cdcd666875664c Author: Alex Osborne <aos...@nl...> Date: 2020-03-05 (Thu, 05 Mar 2020) Changed paths: M commons/src/main/java/org/archive/bdb/BdbModule.java Log Message: ----------- Merge pull request #315 from ukwa/fix-oracle-doc-link Use the Wayback Machine to repair a link to Oracle docs. 
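Note on the `d` parameter change above (pull request #314): the log message says some callers need a lower false positive rate than the hard-coded value allowed. As a rough illustration of why the number of hash functions matters, the sketch below uses the textbook Bloom filter estimate. It is not taken from BloomFilter64bit; the function names `bits_for` and `false_positive_rate` are hypothetical, and the sizing rule (bits = n * d / ln 2) is an assumption made so that d hash functions are optimal for n expected elements.

```python
import math


def bits_for(n: int, d: int) -> int:
    """Bits needed so that d hash functions are optimal for n expected elements.

    Hypothetical helper using the textbook rule m = n * d / ln 2.
    """
    return math.ceil(n * d / math.log(2))


def false_positive_rate(n: int, d: int, m: int) -> float:
    """Textbook estimate (1 - e^(-d*n/m))^d for m bits, n elements, d hashes."""
    return (1.0 - math.exp(-d * n / m)) ** d


if __name__ == "__main__":
    n = 1_000_000  # assumed number of expected elements, for illustration only
    for d in (10, 20, 30):
        m = bits_for(n, d)
        print(d, m, false_positive_rate(n, d, m))
```

Under this sizing rule the estimate works out to roughly 2^-d, so each additional hash function (with correspondingly more bits) halves the expected false positive rate. That is why a fixed value of d caps how low a caller can push the rate.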
Commit: 799c58e291f35dd35b01af042b0f8f360237ed12 https://github.com/internetarchive/heritrix3/commit/799c58e291f35dd35b01af042b0f8f360237ed12 Author: Andrew Jackson <And...@bl...> Date: 2020-03-04 (Wed, 04 Mar 2020) Changed paths: M commons/src/main/java/org/archive/bdb/BdbModule.java M commons/src/main/java/org/archive/util/BloomFilter64bit.java Log Message: ----------- Merge branch 'master' of github.com:internetarchive/heritrix3 Commit: 384da2e82f58236646fd64472ec53179513a530d https://github.com/internetarchive/heritrix3/commit/384da2e82f58236646fd64472ec53179513a530d Author: Andrew Jackson <And...@bl...> Date: 2020-03-04 (Wed, 04 Mar 2020) Changed paths: M pom.xml Log Message: ----------- Disable doclint Commit: a1bdcb1be73a76b0a4811c3a1df76175dc1a3d11 https://github.com/internetarchive/heritrix3/commit/a1bdcb1be73a76b0a4811c3a1df76175dc1a3d11 Author: Andrew Jackson <And...@bl...> Date: 2020-03-04 (Wed, 04 Mar 2020) Changed paths: M commons/pom.xml M contrib/pom.xml M dist/pom.xml M engine/pom.xml M modules/pom.xml M pom.xml Log Message: ----------- [maven-release-plugin] prepare release 3.4.0-20200304 Commit: ff05de44b6f5906ff5deef9aaf3a436a51f5c17d https://github.com/internetarchive/heritrix3/commit/ff05de44b6f5906ff5deef9aaf3a436a51f5c17d Author: Andrew Jackson <And...@bl...> Date: 2020-03-04 (Wed, 04 Mar 2020) Changed paths: M commons/pom.xml M contrib/pom.xml M dist/pom.xml M engine/pom.xml M modules/pom.xml M pom.xml Log Message: ----------- [maven-release-plugin] prepare for next development iteration Commit: fc1de3066866c8a73437a6f39ac7c30c21bf8579 https://github.com/internetarchive/heritrix3/commit/fc1de3066866c8a73437a6f39ac7c30c21bf8579 Author: Andrew Jackson <And...@bl...> Date: 2020-03-04 (Wed, 04 Mar 2020) Changed paths: M CHANGELOG.md Log Message: ----------- Updated changelog post release Commit: 40725372d67d9b30cafe86defa954c066e2d52ef https://github.com/internetarchive/heritrix3/commit/40725372d67d9b30cafe86defa954c066e2d52ef Author: Leslie 
Bellony <les...@bn...> Date: 2020-04-01 (Wed, 01 Apr 2020) Changed paths: M commons/pom.xml A commons/src/main/java/org/archive/net/ClientSFTP.java M contrib/src/main/java/org/archive/modules/recrawl/hbase/HBasePersistLoadProcessor.java M modules/pom.xml M modules/src/main/java/org/archive/modules/deciderules/SchemeNotInSetDecideRule.java A modules/src/main/java/org/archive/modules/fetcher/FetchSFTP.java M modules/src/main/java/org/archive/modules/recrawl/AbstractPersistProcessor.java M modules/src/main/java/org/archive/modules/warc/FtpControlConversationRecordBuilder.java M modules/src/main/java/org/archive/modules/warc/FtpResponseRecordBuilder.java M modules/src/main/java/org/archive/modules/warc/MetadataRecordBuilder.java M modules/src/main/java/org/archive/modules/warc/RevisitRecordBuilder.java M modules/src/main/java/org/archive/modules/writer/WARCWriterProcessor.java M modules/src/main/java/org/archive/modules/writer/WriterPoolProcessor.java Log Message: ----------- Add support for the SFTP protocol Commit: 0a2f57fa73e65ee36c82c3556b849669b590c793 https://github.com/internetarchive/heritrix3/commit/0a2f57fa73e65ee36c82c3556b849669b590c793 Author: Barbara Miller <ga...@me...> Date: 2020-04-03 (Fri, 03 Apr 2020) Changed paths: M contrib/src/main/java/org/archive/modules/extractor/ExtractorYoutubeDL.java Log Message: ----------- best medium-ish size Commit: 6710c8dba38cf8fce6281e36fc6ae0dd1bece5ed https://github.com/internetarchive/heritrix3/commit/6710c8dba38cf8fce6281e36fc6ae0dd1bece5ed Author: Clara Wiatrowski <cla...@bn...> Date: 2020-04-10 (Fri, 10 Apr 2020) Changed paths: M modules/src/main/java/org/archive/modules/extractor/ExtractorHTML.java M modules/src/main/java/org/archive/modules/extractor/HTMLLinkContext.java M modules/src/test/java/org/archive/modules/extractor/ExtractorHTMLTest.java Log Message: ----------- Add parsing for HTML tags (data-src, data-srcset, data-original and data-original-set) Commit: b61f7e574e964a44273567ac7f3d08d709b029e9 
https://github.com/internetarchive/heritrix3/commit/b61f7e574e964a44273567ac7f3d08d709b029e9 Author: Clara Wiatrowski <cla...@bn...> Date: 2020-04-17 (Fri, 17 Apr 2020) Changed paths: M engine/src/main/java/org/archive/crawler/reporting/CrawlSummaryReport.java M engine/src/main/java/org/archive/crawler/reporting/StatisticsTracker.java Log Message: ----------- Add real crawlStatus in the crawlReport Commit: 75dc16e8ca59ff92f611ab88b697eda2ff066d43 https://github.com/internetarchive/heritrix3/commit/75dc16e8ca59ff92f611ab88b697eda2ff066d43 Author: Alex Osborne <aos...@nl...> Date: 2020-04-20 (Mon, 20 Apr 2020) Changed paths: M docs/api.rst Log Message: ----------- Document 'Get Job Status' endpoint Commit: 33e637b887ae9436663256808f32a438e705f24f https://github.com/internetarchive/heritrix3/commit/33e637b887ae9436663256808f32a438e705f24f Author: Alex Osborne <aos...@nl...> Date: 2020-04-20 (Mon, 20 Apr 2020) Changed paths: M docs/api.rst Log Message: ----------- Correct indentation to 3-spaces for reST Commit: f442f76784464e187a2e4635abcf5b2cf66dd4b4 https://github.com/internetarchive/heritrix3/commit/f442f76784464e187a2e4635abcf5b2cf66dd4b4 Author: Alex Osborne <aos...@nl...> Date: 2020-04-20 (Mon, 20 Apr 2020) Changed paths: M docs/api.rst Log Message: ----------- Add missing newline Commit: f19dacfe679c17c3322a8b354e38cb83fa5e2eed https://github.com/internetarchive/heritrix3/commit/f19dacfe679c17c3322a8b354e38cb83fa5e2eed Author: Alex Osborne <aos...@nl...> Date: 2020-04-20 (Mon, 20 Apr 2020) Changed paths: M docs/api.rst Log Message: ----------- Document GET /engine Commit: 70a77d931a3456c99278348724f9a14796daae10 https://github.com/internetarchive/heritrix3/commit/70a77d931a3456c99278348724f9a14796daae10 Author: Neil Minton <ne...@ar...> Date: 2020-04-20 (Mon, 20 Apr 2020) Changed paths: M contrib/src/main/java/org/archive/modules/extractor/ExtractorYoutubeDL.java Log Message: ----------- Merge pull request #325 from internetarchive/yt-dl-format youtube-dl: 
request best medium-ish size format Commit: eea3914059e5245d52b5c122fbdda66a50ea5883 https://github.com/internetarchive/heritrix3/commit/eea3914059e5245d52b5c122fbdda66a50ea5883 Author: Clara Wiatrowski <cla...@bn...> Date: 2020-04-21 (Tue, 21 Apr 2020) Changed paths: M engine/src/main/java/org/archive/crawler/reporting/CrawlSummaryReport.java Log Message: ----------- Add missing import Commit: cc29bb97d1582d7a9966fa70c84b00d451dee861 https://github.com/internetarchive/heritrix3/commit/cc29bb97d1582d7a9966fa70c84b00d451dee861 Author: Clara Wiatrowski <cla...@bn...> Date: 2020-04-22 (Wed, 22 Apr 2020) Changed paths: M modules/src/main/java/org/archive/modules/extractor/ExtractorHTML.java Log Message: ----------- Revert ExtractorHTML Commit: 0d1f4ea84ffc708f075a5c2a1ef0c651671361c8 https://github.com/internetarchive/heritrix3/commit/0d1f4ea84ffc708f075a5c2a1ef0c651671361c8 Author: Clara Wiatrowski <cla...@bn...> Date: 2020-04-22 (Wed, 22 Apr 2020) Changed paths: M modules/src/main/java/org/archive/modules/extractor/ExtractorHTML.java Log Message: ----------- Add parsing case data-* Commit: 5e948883fa0fed0b7a325efed2463c2137c8b736 https://github.com/internetarchive/heritrix3/commit/5e948883fa0fed0b7a325efed2463c2137c8b736 Author: Clara Wiatrowski <cla...@bn...> Date: 2020-04-22 (Wed, 22 Apr 2020) Changed paths: M modules/src/test/java/org/archive/modules/extractor/JerichoExtractorHTMLTest.java Log Message: ----------- Add overrides in JerichoExtractorHTMLTest Commit: 7a6ea789dbe53d2572e646e7aa1e73aa41a695a9 https://github.com/internetarchive/heritrix3/commit/7a6ea789dbe53d2572e646e7aa1e73aa41a695a9 Author: Alex Osborne <aos...@nl...> Date: 2020-04-22 (Wed, 22 Apr 2020) Changed paths: M engine/src/main/java/org/archive/crawler/reporting/CrawlSummaryReport.java M engine/src/main/java/org/archive/crawler/reporting/StatisticsTracker.java Log Message: ----------- Merge pull request #326 from clawia/crawl-status-and-report Add real crawlStatus in the crawlReport Commit: 
aad54c3cb30501e033d2f0c94d347dbdf426a432 https://github.com/internetarchive/heritrix3/commit/aad54c3cb30501e033d2f0c94d347dbdf426a432 Author: Leslie Bellony <les...@bn...> Date: 2020-04-23 (Thu, 23 Apr 2020) Changed paths: M modules/src/main/java/org/archive/modules/fetcher/FetchSFTP.java Log Message: ----------- rename variable + javadoc Commit: 63ca9d4c6a778060c0a87281f17e38cdebee7d5e https://github.com/internetarchive/heritrix3/commit/63ca9d4c6a778060c0a87281f17e38cdebee7d5e Author: Leslie Bellony <les...@bn...> Date: 2020-04-23 (Thu, 23 Apr 2020) Changed paths: M commons/src/main/java/org/archive/net/ClientSFTP.java Log Message: ----------- Remove static fields and byte loop + rename variable Commit: 8dae1960992124077ccc53ae8d033d1bd6715c88 https://github.com/internetarchive/heritrix3/commit/8dae1960992124077ccc53ae8d033d1bd6715c88 Author: Alex Osborne <aos...@nl...> Date: 2020-04-24 (Fri, 24 Apr 2020) Changed paths: M modules/src/main/java/org/archive/modules/extractor/ExtractorHTML.java M modules/src/main/java/org/archive/modules/extractor/HTMLLinkContext.java M modules/src/test/java/org/archive/modules/extractor/ExtractorHTMLTest.java M modules/src/test/java/org/archive/modules/extractor/JerichoExtractorHTMLTest.java Log Message: ----------- Merge pull request #323 from clawia/master Add parsing for HTML tags (data-*) Commit: 7764f9265f32656fa3eedccd47e6eb00da6b4063 https://github.com/internetarchive/heritrix3/commit/7764f9265f32656fa3eedccd47e6eb00da6b4063 Author: Alex Osborne <aos...@nl...> Date: 2020-04-24 (Fri, 24 Apr 2020) Changed paths: M commons/pom.xml A commons/src/main/java/org/archive/net/ClientSFTP.java M contrib/src/main/java/org/archive/modules/recrawl/hbase/HBasePersistLoadProcessor.java M modules/pom.xml M modules/src/main/java/org/archive/modules/deciderules/SchemeNotInSetDecideRule.java A modules/src/main/java/org/archive/modules/fetcher/FetchSFTP.java M modules/src/main/java/org/archive/modules/recrawl/AbstractPersistProcessor.java M 
modules/src/main/java/org/archive/modules/warc/FtpControlConversationRecordBuilder.java M modules/src/main/java/org/archive/modules/warc/FtpResponseRecordBuilder.java M modules/src/main/java/org/archive/modules/warc/MetadataRecordBuilder.java M modules/src/main/java/org/archive/modules/warc/RevisitRecordBuilder.java M modules/src/main/java/org/archive/modules/writer/WARCWriterProcessor.java M modules/src/main/java/org/archive/modules/writer/WriterPoolProcessor.java Log Message: ----------- Merge pull request #320 from bnfleb/sftp Add support for the SFTP protocol Commit: 18bee39a4c8164de4b861d41c5392d2e02fde471 https://github.com/internetarchive/heritrix3/commit/18bee39a4c8164de4b861d41c5392d2e02fde471 Author: morokosi <mor...@gm...> Date: 2020-04-30 (Thu, 30 Apr 2020) Changed paths: M modules/src/main/java/org/archive/modules/deciderules/MatchesListRegexDecideRule.java M modules/src/test/java/org/archive/modules/deciderules/MatchesListRegexDecideRuleTest.java Log Message: ----------- fix discarded future value Commit: 13ba0b2b67c842904176d7a3c273d5240443a6e9 https://github.com/internetarchive/heritrix3/commit/13ba0b2b67c842904176d7a3c273d5240443a6e9 Author: morokosi <mor...@gm...> Date: 2020-04-30 (Thu, 30 Apr 2020) Changed paths: M modules/src/main/java/org/archive/modules/deciderules/MatchesListRegexDecideRule.java M modules/src/test/java/org/archive/modules/deciderules/MatchesListRegexDecideRuleTest.java Log Message: ----------- optimize imports Commit: b84d443f00a75d68fa022c61ba5b8a106bd6344a https://github.com/internetarchive/heritrix3/commit/b84d443f00a75d68fa022c61ba5b8a106bd6344a Author: Alex Osborne <aos...@nl...> Date: 2020-05-01 (Fri, 01 May 2020) Changed paths: M modules/src/main/java/org/archive/modules/deciderules/MatchesListRegexDecideRule.java M modules/src/test/java/org/archive/modules/deciderules/MatchesListRegexDecideRuleTest.java Log Message: ----------- Merge pull request #328 from morokosi/fix-matchesregexdeciderule Fix match result is always 
false in MatchesListRegexDecideRule Commit: b4962d25a8b4110beec1c121ab9923b2663ef3b1 https://github.com/internetarchive/heritrix3/commit/b4962d25a8b4110beec1c121ab9923b2663ef3b1 Author: dengliming <lim...@gm...> Date: 2020-05-12 (Tue, 12 May 2020) Changed paths: M .travis.yml Log Message: ----------- Remove deprecated sudo setting. Commit: 176f796bebb208d3cf5f604654596f49e2eb5e38 https://github.com/internetarchive/heritrix3/commit/176f796bebb208d3cf5f604654596f49e2eb5e38 Author: Andrew Jackson <And...@bl...> Date: 2020-05-18 (Mon, 18 May 2020) Changed paths: M CHANGELOG.md Log Message: ----------- Updated changelog. Commit: ef38dc08c357bcb2353bbfbcd00a2c3370aad9c7 https://github.com/internetarchive/heritrix3/commit/ef38dc08c357bcb2353bbfbcd00a2c3370aad9c7 Author: Andrew Jackson <And...@bl...> Date: 2020-05-18 (Mon, 18 May 2020) Changed paths: M commons/pom.xml M contrib/pom.xml M dist/pom.xml M engine/pom.xml M modules/pom.xml M pom.xml Log Message: ----------- [maven-release-plugin] prepare release 3.4.0-20200518 Commit: e215b89963915315c124a9373caf92512c68cfa7 https://github.com/internetarchive/heritrix3/commit/e215b89963915315c124a9373caf92512c68cfa7 Author: Andrew Jackson <And...@bl...> Date: 2020-05-18 (Mon, 18 May 2020) Changed paths: M commons/pom.xml M contrib/pom.xml M dist/pom.xml M engine/pom.xml M modules/pom.xml M pom.xml Log Message: ----------- [maven-release-plugin] prepare for next development iteration Commit: 87c349fe13dd76dd28ee1e6079fedc8f5e2b51ce https://github.com/internetarchive/heritrix3/commit/87c349fe13dd76dd28ee1e6079fedc8f5e2b51ce Author: Andrew Jackson <And...@bl...> Date: 2020-05-18 (Mon, 18 May 2020) Changed paths: M CHANGELOG.md Log Message: ----------- Re-sync change log. 
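Note on the MatchesListRegexDecideRule fix above (pull request #328, "fix discarded future value"): the titles suggest the timed regex match was submitted as a task but its result was never read, so the rule always reported false. The sketch below shows that bug class and its fix in generic form. It is not the Heritrix code; `matches_any` and its parameters are hypothetical, and Python worker threads cannot be interrupted, so the timeout here only bounds how long the caller waits.

```python
import re
from concurrent.futures import ThreadPoolExecutor, TimeoutError


def matches_any(patterns, text, timeout_s=1.0):
    """Return True if any pattern fully matches text within the time budget."""

    def run():
        return any(re.fullmatch(p, text) for p in patterns)

    executor = ThreadPoolExecutor(max_workers=1)
    try:
        future = executor.submit(run)
        try:
            # The fix: return the value the future produced.
            # The buggy shape would submit the task, ignore the future,
            # and fall through to "return False" every time.
            return future.result(timeout=timeout_s)
        except TimeoutError:
            # Treat a match that runs out of time as "no match".
            return False
    finally:
        # Do not block the caller on a still-running match.
        executor.shutdown(wait=False)


if __name__ == "__main__":
    print(matches_any([r"a+b", r"\d+"], "aaab"))
    print(matches_any([r"a+b"], "xyz"))
```

Treating a timeout as "no match" is a design choice made for this sketch; whether the real rule does the same is not stated in the log messages.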
Commit: c0fe2064a1c3aad6ec7316c7d7f6cdbc7bef1b6b https://github.com/internetarchive/heritrix3/commit/c0fe2064a1c3aad6ec7316c7d7f6cdbc7bef1b6b Author: Adam Miller <ad...@ar...> Date: 2020-05-20 (Wed, 20 May 2020) Changed paths: M modules/src/main/java/org/archive/modules/warc/FtpResponseRecordBuilder.java Log Message: ----------- Merge branch 'fixes-ftp-response-record-builder' Commit: 60daef90b2a7881d35737ed64155fcdf401df33a https://github.com/internetarchive/heritrix3/commit/60daef90b2a7881d35737ed64155fcdf401df33a Author: Adam Miller <ad...@ar...> Date: 2020-05-20 (Wed, 20 May 2020) Changed paths: M CHANGELOG.md M commons/pom.xml M commons/src/main/java/org/archive/bdb/BdbModule.java A commons/src/main/java/org/archive/net/ClientSFTP.java M commons/src/main/java/org/archive/util/Base32.java M commons/src/main/java/org/archive/util/BloomFilter64bit.java M contrib/pom.xml M contrib/src/main/java/org/archive/modules/extractor/ExtractorYoutubeDL.java M contrib/src/main/java/org/archive/modules/recrawl/hbase/HBasePersistLoadProcessor.java M docs/api.rst M engine/pom.xml M engine/src/main/java/org/archive/crawler/framework/CrawlJob.java M engine/src/main/java/org/archive/crawler/framework/Frontier.java M engine/src/main/java/org/archive/crawler/frontier/BdbFrontier.java M engine/src/main/java/org/archive/crawler/frontier/BdbMultipleWorkQueues.java M engine/src/main/java/org/archive/crawler/reporting/CrawlSummaryReport.java M engine/src/main/java/org/archive/crawler/reporting/StatisticsTracker.java M engine/src/main/java/org/archive/crawler/restlet/PagedRepresentation.java M engine/src/test/java/org/archive/crawler/prefetch/QuotaEnforcerTest.java M engine/src/test/java/org/archive/crawler/util/BloomUriUniqFilterTest.java M modules/pom.xml M modules/src/main/java/org/archive/modules/deciderules/MatchesListRegexDecideRule.java M modules/src/main/java/org/archive/modules/deciderules/SchemeNotInSetDecideRule.java M 
modules/src/main/java/org/archive/modules/extractor/ExtractorHTML.java M modules/src/main/java/org/archive/modules/extractor/HTMLLinkContext.java A modules/src/main/java/org/archive/modules/fetcher/FetchSFTP.java M modules/src/main/java/org/archive/modules/recrawl/AbstractPersistProcessor.java M modules/src/main/java/org/archive/modules/warc/FtpControlConversationRecordBuilder.java M modules/src/main/java/org/archive/modules/warc/FtpResponseRecordBuilder.java M modules/src/main/java/org/archive/modules/warc/MetadataRecordBuilder.java M modules/src/main/java/org/archive/modules/warc/RevisitRecordBuilder.java M modules/src/main/java/org/archive/modules/writer/WARCWriterProcessor.java M modules/src/main/java/org/archive/modules/writer/WriterPoolProcessor.java A modules/src/test/java/org/archive/modules/deciderules/MatchesListRegexDecideRuleTest.java M modules/src/test/java/org/archive/modules/extractor/ExtractorHTMLTest.java M modules/src/test/java/org/archive/modules/extractor/JerichoExtractorHTMLTest.java M modules/src/test/java/org/archive/modules/fetcher/CookieStoreTest.java M pom.xml Log Message: ----------- Merge branch 'master' into fixes-extractor-multiple-regex-matcher-recycle Commit: c4644d4e5c10e3172b965d11a6ef73221be4d49b https://github.com/internetarchive/heritrix3/commit/c4644d4e5c10e3172b965d11a6ef73221be4d49b Author: Alex Osborne <aos...@nl...> Date: 2020-06-01 (Mon, 01 Jun 2020) Changed paths: M modules/src/main/java/org/archive/modules/extractor/ExtractorMultipleRegex.java M modules/src/main/java/org/archive/modules/warc/FtpResponseRecordBuilder.java Log Message: ----------- Merge pull request #335 from internetarchive/fixes-extractor-multiple-regex-matcher-recycle Fixes extractor multiple regex matcher recycle Commit: 2e6b76d517f9708355b2e6d12adbb2001eac7306 https://github.com/internetarchive/heritrix3/commit/2e6b76d517f9708355b2e6d12adbb2001eac7306 Author: Alex Osborne <aos...@nl...> Date: 2020-06-01 (Mon, 01 Jun 2020) Changed paths: M .travis.yml 
Log Message: ----------- Merge pull request #333 from dengliming/patch-1 Remove deprecated sudo setting. Commit: 33a3f9a58f6d8873cccc6fcbebfa50c196dd2335 https://github.com/internetarchive/heritrix3/commit/33a3f9a58f6d8873cccc6fcbebfa50c196dd2335 Author: Andrew Jackson <And...@bl...> Date: 2020-06-01 (Mon, 01 Jun 2020) Changed paths: M CHANGELOG.md Log Message: ----------- Updated changelog based on master. Commit: 2505effadc09acb05ee2fc1a38347b155c251c1a https://github.com/internetarchive/heritrix3/commit/2505effadc09acb05ee2fc1a38347b155c251c1a Author: Neil Minton <ne...@ar...> Date: 2020-09-01 (Tue, 01 Sep 2020) Changed paths: M modules/src/main/java/org/archive/modules/extractor/ExtractorHTML.java Log Message: ----------- Strip quotes from URL value. Commit: adac067ea74b5a89f631ef771e2f598819bac6c4 https://github.com/internetarchive/heritrix3/commit/adac067ea74b5a89f631ef771e2f598819bac6c4 Author: Kristinn Sigurðsson <kri...@la...> Date: 2020-09-02 (Wed, 02 Sep 2020) Changed paths: M README.md Log Message: ----------- Update link to robots.txt meta documentation Commit: 0649751456f66e512bab01782604ef421d0ce586 https://github.com/internetarchive/heritrix3/commit/0649751456f66e512bab01782604ef421d0ce586 Author: Barbara Miller <ga...@me...> Date: 2021-01-04 (Mon, 04 Jan 2021) Changed paths: M contrib/src/main/java/org/archive/modules/extractor/ExtractorYoutubeDL.java Log Message: ----------- --no-cache-dir, --no-playlist Commit: ba0d51882a3a19705dd98781dcfc6d5adbdc261f https://github.com/internetarchive/heritrix3/commit/ba0d51882a3a19705dd98781dcfc6d5adbdc261f Author: Alex Osborne <aos...@nl...> Date: 2021-02-15 (Mon, 15 Feb 2021) Changed paths: M contrib/src/main/java/org/archive/modules/extractor/ExtractorYoutubeDL.java Log Message: ----------- Merge pull request #359 from internetarchive/ait-ydl-updates ait youtube-dl options Commit: 5a641413029aecce353337aa7b03a6003359dff0 
https://github.com/internetarchive/heritrix3/commit/5a641413029aecce353337aa7b03a6003359dff0 Author: Alex Osborne <aos...@nl...> Date: 2021-02-15 (Mon, 15 Feb 2021) Changed paths: M modules/src/main/java/org/archive/modules/extractor/ExtractorHTML.java Log Message: ----------- Merge pull request #352 from BitBaron/absolute-meta-urls Strip quotes from URL value. Commit: 5f52a14afcb08b788213eed986d9cc3e47438d80 https://github.com/internetarchive/heritrix3/commit/5f52a14afcb08b788213eed986d9cc3e47438d80 Author: Alex Osborne <aos...@nl...> Date: 2021-02-15 (Mon, 15 Feb 2021) Changed paths: M commons/pom.xml Log Message: ----------- Upgrade to Spring 5.3.3 Spring 3.0.5 is 10 years old now and is incompatible with classes using Java 9+ bytecode and even on Java 8 classes using lambdas (see #337). We remove spring-asm as it is now part of spring-core module. Commit: b10b33821f4a71eecd6b8656d2c52325684fa075 https://github.com/internetarchive/heritrix3/commit/b10b33821f4a71eecd6b8656d2c52325684fa075 Author: Alex Osborne <aos...@nl...> Date: 2021-02-15 (Mon, 15 Feb 2021) Changed paths: M README.md Log Message: ----------- Fix broken robots.txt link Commit: 35bb3286f4d441538ac7d74cd65835abaaff2736 https://github.com/internetarchive/heritrix3/commit/35bb3286f4d441538ac7d74cd65835abaaff2736 Author: Alex Osborne <aos...@nl...> Date: 2021-02-15 (Mon, 15 Feb 2021) Changed paths: M README.md Log Message: ----------- Clarify robots protocol support Closes #351, #353 Commit: 37ce8d694590b0cf8cbe0a38a58c5f8ee719c4f0 https://github.com/internetarchive/heritrix3/commit/37ce8d694590b0cf8cbe0a38a58c5f8ee719c4f0 Author: Alex Osborne <aos...@nl...> Date: 2021-02-23 (Tue, 23 Feb 2021) Changed paths: M commons/pom.xml Log Message: ----------- Merge pull request #366 from ato/upstream-spring-5.3.3 Upgrade to Spring 5.3.3 Commit: 5ec7fb87d996f3b4f92f32deff10f2c15f7e5df0 https://github.com/internetarchive/heritrix3/commit/5ec7fb87d996f3b4f92f32deff10f2c15f7e5df0 Author: webdev4422 <674...@us...> 
Date: 2021-02-26 (Fri, 26 Feb 2021) Changed paths: M engine/src/main/java/org/archive/crawler/Heritrix.java Log Message: ----------- Fix misspell in comments Commit: 39e9e72b31b821ecefb5a14cf3bbbd8a490dc770 https://github.com/internetarchive/heritrix3/commit/39e9e72b31b821ecefb5a14cf3bbbd8a490dc770 Author: Phani Dharmavarapu <vdh...@hu...> Date: 2021-04-14 (Wed, 14 Apr 2021) Changed paths: M contrib/src/main/java/org/archive/modules/recrawl/FetchHistoryHelper.java M contrib/src/main/java/org/archive/modules/recrawl/hbase/MultiColumnRecrawlDataSchema.java M contrib/src/main/java/org/archive/modules/recrawl/hbase/SingleColumnJsonRecrawlDataSchema.java A contrib/src/test/java/org/archive/modules/recrawl/hbase/MultiColumnRecrawlDataSchemaTest.java A contrib/src/test/java/org/archive/modules/recrawl/hbase/SingleColumnJsonRecrawlDataSchemaTest.java Log Message: ----------- Use seconds from the FetchCompletedTime for LastModified time LastModified field in the FetchHistory table is expected to be in seconds, but sometimes is stored in millis causing the parsing logic to give a date in the future Commit: 52093eb25f8f17b48eea5acad06bb6eaeb99adf4 https://github.com/internetarchive/heritrix3/commit/52093eb25f8f17b48eea5acad06bb6eaeb99adf4 Author: Kristinn Sigurdsson <kri...@la...> Date: 2021-04-20 (Tue, 20 Apr 2021) Changed paths: M modules/src/main/java/org/archive/modules/extractor/ExtractorRobotsTxt.java Log Message: ----------- Only copy source tag if not null. 
Commit: 145d46dd8dc261c83718dcd56720c794a09e59e9 https://github.com/internetarchive/heritrix3/commit/145d46dd8dc261c83718dcd56720c794a09e59e9 Author: Phani Dharmavarapu <vdh...@hu...> Date: 2021-04-21 (Wed, 21 Apr 2021) Changed paths: M contrib/src/main/java/org/archive/modules/recrawl/FetchHistoryHelper.java M contrib/src/main/java/org/archive/modules/recrawl/hbase/MultiColumnRecrawlDataSchema.java M contrib/src/main/java/org/archive/modules/recrawl/hbase/SingleColumnJsonRecrawlDataSchema.java Log Message: ----------- Remove change to helper class as it can be done in schema Commit: 71b306517294287f8599bd2b9d5c8e34322e984b https://github.com/internetarchive/heritrix3/commit/71b306517294287f8599bd2b9d5c8e34322e984b Author: Alex Osborne <aos...@nl...> Date: 2021-05-11 (Tue, 11 May 2021) Changed paths: M engine/src/main/java/org/archive/crawler/restlet/JobRelatedResource.java Log Message: ----------- Handle empty Optionals when browsing beans addPresentableNestedNames() recursively walks the properties of beans, calling getters. Likely unintentionally, this includes calling getClass() and recursing into the reflection API. When running on JDK 11 the reflection API has some methods (e.g. in java.lang.module) that return instances of Optional. When the newer version of Spring's BeanWrapperImpl encounters an Optional, it attempts to unwrap it and throws IllegalArgumentException if it is empty. Let's fix this in two ways. Firstly, let's avoid walking the reflection API entirely, as it's irrelevant for the purposes of bean browsing, by not inspecting the properties of instances of java.lang.Class. Secondly, in case any Heritrix beans start using Optional in future, let's also handle the empty case by not inspecting it, the same as we do for null. 
Fixes #376 Reported-By: Lauren Ko <lau...@un...> Commit: ac33e2ef554c85f904806b92c099b32bf95cb46e https://github.com/internetarchive/heritrix3/commit/ac33e2ef554c85f904806b92c099b32bf95cb46e Author: Lauren Ko <lau...@un...> Date: 2021-05-11 (Tue, 11 May 2021) Changed paths: M engine/src/main/resources/org/archive/crawler/restlet/Beans.ftl Log Message: ----------- Avoid error when bean properties have no url available Commit: 065e62fc1fb65eee5beeb88ecfa15eed245dc912 https://github.com/internetarchive/heritrix3/commit/065e62fc1fb65eee5beeb88ecfa15eed245dc912 Author: Alex Osborne <aos...@nl...> Date: 2021-05-12 (Wed, 12 May 2021) Changed paths: M engine/src/main/java/org/archive/crawler/restlet/JobRelatedResource.java Log Message: ----------- Merge pull request #377 from internetarchive/bean-browser-optional-fix Handle empty Optionals when browsing beans Commit: a09479364a5b1ebabc025d7048dd00a0a3ae9c28 https://github.com/internetarchive/heritrix3/commit/a09479364a5b1ebabc025d7048dd00a0a3ae9c28 Author: Alex Osborne <aos...@nl...> Date: 2021-05-12 (Wed, 12 May 2021) Changed paths: M engine/src/main/resources/org/archive/crawler/restlet/Beans.ftl Log Message: ----------- Merge pull request #379 from ldko/bean-url-missing-fix Avoid error when bean properties have no url available Commit: 045b2516da91768af42ddd6736b4b8ff9455f691 https://github.com/internetarchive/heritrix3/commit/045b2516da91768af42ddd6736b4b8ff9455f691 Author: Andy Jackson <an...@an...> Date: 2021-05-20 (Thu, 20 May 2021) Changed paths: M commons/pom.xml M engine/pom.xml M modules/src/main/java/org/archive/modules/fetcher/FetchDNS.java Log Message: ----------- Update to latest version of dnsjava, for #344 This commit updates to version 3.3.1 of the dnsjava library. Incidentally, the restlet.org certificate is currently invalid, so this patch switched to maven.restlet.com. 
Commit: c7b1dfa1982ce4f9d1a905912cf4b3c003302cec https://github.com/internetarchive/heritrix3/commit/c7b1dfa1982ce4f9d1a905912cf4b3c003302cec Author: Andy Jackson <And...@bl...> Date: 2021-05-20 (Thu, 20 May 2021) Changed paths: M engine/src/main/java/org/archive/crawler/Heritrix.java Log Message: ----------- Merge pull request #368 from webdev4422/patch-1 Fix misspell in comments Commit: b2a07b290f0b537fe936c31fb7386de4f047e04d https://github.com/internetarchive/heritrix3/commit/b2a07b290f0b537fe936c31fb7386de4f047e04d Author: Andy Jackson <And...@bl...> Date: 2021-05-20 (Thu, 20 May 2021) Changed paths: M contrib/src/main/java/org/archive/modules/extractor/ExtractorYoutubeDL.java M modules/src/main/java/org/archive/modules/writer/WARCWriterChainProcessor.java Log Message: ----------- Merge pull request #348 from internetarchive/fixes-leaky-file-handles Fixes leaky file handles Commit: c1bcdd9f9eef0c1f2c6081fe98b0710ddab66f4a https://github.com/internetarchive/heritrix3/commit/c1bcdd9f9eef0c1f2c6081fe98b0710ddab66f4a Author: Andy Jackson <And...@bl...> Date: 2021-05-20 (Thu, 20 May 2021) Changed paths: M contrib/src/main/java/org/archive/modules/extractor/ExtractorYoutubeDL.java M modules/src/main/java/org/archive/modules/warc/FtpResponseRecordBuilder.java Log Message: ----------- Merge pull request #341 from internetarchive/noplaylist-ydl youtube-dl --no-playlist Commit: 396467c535e7ab655d89bb6b8dcad00f058b47b2 https://github.com/internetarchive/heritrix3/commit/396467c535e7ab655d89bb6b8dcad00f058b47b2 Author: Andy Jackson <And...@bl...> Date: 2021-05-20 (Thu, 20 May 2021) Changed paths: M .gitignore M .travis.yml M CHANGELOG.md A LICENSE M README.md M commons/pom.xml M commons/src/main/java/org/archive/bdb/BdbModule.java A commons/src/main/java/org/archive/net/ClientSFTP.java M commons/src/main/java/org/archive/util/Base32.java M commons/src/main/java/org/archive/util/BloomFilter64bit.java R commons/src/main/java/org/archive/util/IdentityCacheableWrapper.java R 
commons/src/main/java/org/archive/util/ObjectIdentityBdbCache.java A commons/src/test/java/org/archive/util/IdentityCacheableWrapper.java R commons/src/test/java/org/archive/util/ObjectIdentityBdbCacheTest.java M commons/src/test/java/org/archive/util/ObjectIdentityBdbManualCacheTest.java M commons/src/test/resources/log4j.xml M contrib/pom.xml M contrib/src/main/java/org/archive/crawler/reporting/XmlCrawlSummaryReport.java M contrib/src/main/java/org/archive/modules/extractor/ExtractorYoutubeDL.java M contrib/src/main/java/org/archive/modules/postprocessor/WARCLimitEnforcer.java M contrib/src/main/java/org/archive/modules/recrawl/FetchHistoryHelper.java M contrib/src/main/java/org/archive/modules/recrawl/TroughContentDigestHistory.java M contrib/src/main/java/org/archive/modules/recrawl/hbase/HBasePersistLoadProcessor.java M contrib/src/main/resources/log4j.xml M contrib/src/test/java/org/archive/modules/recrawl/wbm/WbmPersistLoadProcessorTest.java M contrib/src/test/resources/log4j.xml M dist/pom.xml M dist/src/main/conf/logging.properties M dist/src/test/resources/log4j.xml M docs/api.rst M engine/pom.xml M engine/src/main/java/org/archive/crawler/Heritrix.java M engine/src/main/java/org/archive/crawler/framework/CrawlJob.java M engine/src/main/java/org/archive/crawler/framework/Frontier.java M engine/src/main/java/org/archive/crawler/frontier/AssignmentLevelSurtQueueAssignmentPolicy.java M engine/src/main/java/org/archive/crawler/frontier/BdbFrontier.java M engine/src/main/java/org/archive/crawler/frontier/BdbMultipleWorkQueues.java M engine/src/main/java/org/archive/crawler/frontier/URIAuthorityBasedQueueAssignmentPolicy.java M engine/src/main/java/org/archive/crawler/prefetch/QuotaEnforcer.java M engine/src/main/java/org/archive/crawler/reporting/CrawlSummaryReport.java M engine/src/main/java/org/archive/crawler/reporting/StatisticsTracker.java M engine/src/main/java/org/archive/crawler/restlet/BaseResource.java M 
engine/src/main/java/org/archive/crawler/restlet/BeanBrowseResource.java M engine/src/main/java/org/archive/crawler/restlet/EditRepresentation.java M engine/src/main/java/org/archive/crawler/restlet/EngineApplication.java M engine/src/main/java/org/archive/crawler/restlet/EngineResource.java M engine/src/main/java/org/archive/crawler/restlet/EnhDirectory.java M engine/src/main/java/org/archive/crawler/restlet/EnhDirectoryResource.java M engine/src/main/java/org/archive/crawler/restlet/Flash.java M engine/src/main/java/org/archive/crawler/restlet/JobRelatedResource.java M engine/src/main/java/org/archive/crawler/restlet/JobResource.java M engine/src/main/java/org/archive/crawler/restlet/PagedRepresentation.java M engine/src/main/java/org/archive/crawler/restlet/RateLimitGuard.java M engine/src/main/java/org/archive/crawler/restlet/ReportGenResource.java M engine/src/main/java/org/archive/crawler/restlet/ScriptResource.java M engine/src/main/java/org/archive/crawler/restlet/XmlMarshaller.java M engine/src/main/resources/org/archive/crawler/restlet/Beans.ftl M engine/src/main/resources/org/archive/crawler/restlet/profile-crawler-beans.cxml M engine/src/test/java/org/archive/crawler/prefetch/QuotaEnforcerTest.java M engine/src/test/java/org/archive/crawler/selftest/CheckpointSelfTest.java M engine/src/test/java/org/archive/crawler/selftest/FormAuthSelfTest.java M engine/src/test/java/org/archive/crawler/selftest/FormLoginSelfTest.java M engine/src/test/java/org/archive/crawler/selftest/HttpAuthSelfTest.java M engine/src/test/java/org/archive/crawler/selftest/SelfTestBase.java M engine/src/test/java/org/archive/crawler/selftest/StatisticsSelfTest.java M engine/src/test/java/org/archive/crawler/selftest/UserAgentSelfTest.java M engine/src/test/java/org/archive/crawler/util/BloomUriUniqFilterTest.java M engine/src/test/java/org/archive/modules/fetcher/FormAuthTest.java M engine/src/test/resources/log4j.xml M modules/pom.xml M 
modules/src/main/java/org/archive/modules/CrawlURI.java M modules/src/main/java/org/archive/modules/deciderules/MatchesListRegexDecideRule.java M modules/src/main/java/org/archive/modules/deciderules/SchemeNotInSetDecideRule.java M modules/src/main/java/org/archive/modules/extractor/ExtractorHTML.java M modules/src/main/java/org/archive/modules/extractor/ExtractorMultipleRegex.java M modules/src/main/java/org/archive/modules/extractor/HTMLLinkContext.java A modules/src/main/java/org/archive/modules/fetcher/FetchSFTP.java M modules/src/main/java/org/archive/modules/recrawl/AbstractPersistProcessor.java M modules/src/main/java/org/archive/modules/recrawl/FetchHistoryProcessor.java M modules/src/main/java/org/archive/modules/recrawl/RecrawlAttributeConstants.java A modules/src/main/java/org/archive/modules/warc/BaseWARCRecordBuilder.java A modules/src/main/java/org/archive/modules/warc/DnsResponseRecordBuilder.java A modules/src/main/java/org/archive/modules/warc/FtpControlConversationRecordBuilder.java A modules/src/main/java/org/archive/modules/warc/FtpResponseRecordBuilder.java A modules/src/main/java/org/archive/modules/warc/HttpRequestRecordBuilder.java A modules/src/main/java/org/archive/modules/warc/HttpResponseRecordBuilder.java A modules/src/main/java/org/archive/modules/warc/MetadataRecordBuilder.java A modules/src/main/java/org/archive/modules/warc/RevisitRecordBuilder.java A modules/src/main/java/org/archive/modules/warc/WARCRecordBuilder.java A modules/src/main/java/org/archive/modules/warc/WhoisResponseRecordBuilder.java A modules/src/main/java/org/archive/modules/writer/BaseWARCWriterProcessor.java A modules/src/main/java/org/archive/modules/writer/WARCWriterChainProcessor.java M modules/src/main/java/org/archive/modules/writer/WARCWriterProcessor.java M modules/src/main/java/org/archive/modules/writer/WriterPoolProcessor.java A modules/src/test/java/org/archive/modules/deciderules/MatchesListRegexDecideRuleTest.java M 
modules/src/test/java/org/archive/modules/extractor/ExtractorHTMLTest.java M modules/src/test/java/org/archive/modules/extractor/JerichoExtractorHTMLTest.java M modules/src/test/java/org/archive/modules/fetcher/CookieFetchHTTPIntegrationTest.java M modules/src/test/java/org/archive/modules/fetcher/CookieStoreTest.java M modules/src/test/java/org/archive/modules/fetcher/FetchHTTPTest.java M modules/src/test/java/org/archive/modules/fetcher/FetchHTTPTests.java M modules/src/test/java/org/archive/modules/recrawl/ContentDigestHistoryTest.java A modules/src/test/java/org/archive/modules/writer/WARCWriterChainProcessorTest.java M modules/src/test/resources/log4j.xml M pom.xml Log Message: ----------- Merge branch 'master' into sitemaps Commit: d7869de8d978c062d1f720ab34818c0a3e64ae5e https://github.com/internetarchive/heritrix3/commit/d7869de8d978c062d1f720ab34818c0a3e64ae5e Author: Andy Jackson <And...@bl...> Date: 2021-05-20 (Thu, 20 May 2021) Changed paths: M commons/pom.xml M engine/src/main/resources/org/archive/crawler/restlet/profile-crawler-beans.cxml M modules/pom.xml A modules/src/main/java/org/archive/modules/extractor/ExtractorRobotsTxt.java A modules/src/main/java/org/archive/modules/extractor/ExtractorSitemap.java M modules/src/main/java/org/archive/modules/extractor/Hop.java M modules/src/main/java/org/archive/modules/extractor/LinkContext.java Log Message: ----------- Merge pull request #262 from kris-sigur/sitemaps Support for extracting URLs in sitemaps Commit: c0dbfb2a0a22c31fb6f14d990bc68baaf4d556fb https://github.com/internetarchive/heritrix3/commit/c0dbfb2a0a22c31fb6f14d990bc68baaf4d556fb Author: Andy Jackson <And...@bl...> Date: 2021-05-20 (Thu, 20 May 2021) Changed paths: M commons/pom.xml M engine/pom.xml M modules/src/main/java/org/archive/modules/fetcher/FetchDNS.java Log Message: ----------- Merge pull request #383 from ukwa/upgrade-dnsjava Update to latest version of dnsja... [truncated message content] |
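The "Handle empty Optionals when browsing beans" commit in the message above describes two guards: skip instances of java.lang.Class, and treat an empty Optional the same as null. A minimal sketch of that check follows. The class name `BeanBrowseGuard` and the method `shouldInspect` are hypothetical and are not taken from JobRelatedResource.java; the actual change in the commit may be structured differently.

```java
import java.util.Optional;

public class BeanBrowseGuard {

    // Hypothetical helper: decides whether a property value should be
    // walked further when browsing nested bean properties.
    static boolean shouldInspect(Object value) {
        if (value == null) {
            return false; // nothing to inspect
        }
        if (value instanceof Class) {
            return false; // avoid recursing into the reflection API
        }
        if (value instanceof Optional && !((Optional<?>) value).isPresent()) {
            return false; // empty Optional is treated like null
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(shouldInspect(null));
        System.out.println(shouldInspect(String.class));
        System.out.println(shouldInspect(Optional.empty()));
        System.out.println(shouldInspect(Optional.of("value")));
    }
}
```

Checking before handing the value to a property walker means the walker never has to unwrap an empty Optional, which is the failure the commit message reports.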
From: Barbara M. <no...@gi...> - 2023-09-29 22:41:56
|
Branch: refs/heads/yt-dlp-20230706
Home: https://github.com/internetarchive/heritrix3
Commit: bcfed0c1f7b638a3386f2e48351bcaa802c410a1
https://github.com/internetarchive/heritrix3/commit/bcfed0c1f7b638a3386f2e48351bcaa802c410a1
Author: Barbara Miller <ba...@ar...>
Date: 2023-09-29 (Fri, 29 Sep 2023)
Changed paths:
M contrib/src/main/java/org/archive/modules/extractor/ExtractorYoutubeDL.java
Log Message:
-----------
yt-dlp (not youtube-dl) |
From: Alex O. <no...@gi...> - 2023-09-19 02:21:39
|
Branch: refs/heads/jdk20
Home: https://github.com/internetarchive/heritrix3
Commit: 1c07e43ade91b4586d00b5c978b596e72b4f5801
https://github.com/internetarchive/heritrix3/commit/1c07e43ade91b4586d00b5c978b596e72b4f5801
Author: Alex Osborne <aos...@nl...>
Date: 2023-09-19 (Tue, 19 Sep 2023)
Changed paths:
M modules/pom.xml
Log Message:
-----------
Update groovy to 4.0.15 for JDK 20 support

Commit: 1a27551e60cff9ba1f0eab7e8a0a83e5ea00b0e0
https://github.com/internetarchive/heritrix3/commit/1a27551e60cff9ba1f0eab7e8a0a83e5ea00b0e0
Author: Alex Osborne <aos...@nl...>
Date: 2023-09-19 (Tue, 19 Sep 2023)
Changed paths:
M .github/workflows/maven.yml
Log Message:
-----------
Update CI tests from JDK 19 to 20

Compare: https://github.com/internetarchive/heritrix3/compare/5006efb0e382...1a27551e60cf |
From: Alex O. <no...@gi...> - 2023-09-19 02:11:48
|
Branch: refs/heads/jdk20
Home: https://github.com/internetarchive/heritrix3
Commit: fa346716a8f2d6d0727de9ed688f6a8af9947d8f
https://github.com/internetarchive/heritrix3/commit/fa346716a8f2d6d0727de9ed688f6a8af9947d8f
Author: Alex Osborne <aos...@nl...>
Date: 2023-09-19 (Tue, 19 Sep 2023)
Changed paths:
M modules/pom.xml
Log Message:
-----------
Update groovy to 4.0.15 for JDK 20 support

Commit: 5006efb0e3828bb8a47f6dd47c104d471d6aa52f
https://github.com/internetarchive/heritrix3/commit/5006efb0e3828bb8a47f6dd47c104d471d6aa52f
Author: Alex Osborne <aos...@nl...>
Date: 2023-09-19 (Tue, 19 Sep 2023)
Changed paths:
M .github/workflows/maven.yml
Log Message:
-----------
Update CI tests from JDK 19 to 20

Compare: https://github.com/internetarchive/heritrix3/compare/0aab1498d9d1...5006efb0e382 |
From: Barbara M. <no...@gi...> - 2023-09-06 18:48:41
|
Branch: refs/heads/ait-qa
Home: https://github.com/internetarchive/heritrix3
Commit: 122417c0fbc2ee3cfd99f08a06dc334279ce5dd4
https://github.com/internetarchive/heritrix3/commit/122417c0fbc2ee3cfd99f08a06dc334279ce5dd4
Author: Barbara Miller <ga...@us...>
Date: 2023-09-06 (Wed, 06 Sep 2023)
Changed paths:
M contrib/src/main/java/org/archive/modules/extractor/ExtractorYoutubeDL.java
Log Message:
-----------
Update ExtractorYoutubeDL.java |
From: Barbara M. <no...@gi...> - 2023-09-06 18:25:32
|
Branch: refs/heads/yt-dlp-20230706
Home: https://github.com/internetarchive/heritrix3
Commit: d50158c0124ae99207f8a21bc7e87f6c49c72521
https://github.com/internetarchive/heritrix3/commit/d50158c0124ae99207f8a21bc7e87f6c49c72521
Author: Barbara Miller <ga...@us...>
Date: 2023-09-06 (Wed, 06 Sep 2023)
Changed paths:
M contrib/src/main/java/org/archive/modules/extractor/ExtractorYoutubeDL.java
Log Message:
-----------
Update ExtractorYoutubeDL.java |
From: Alex O. <no...@gi...> - 2023-05-11 07:51:33
|
Branch: refs/heads/jdk20
Home: https://github.com/internetarchive/heritrix3
Commit: 0aab1498d9d118ea8b5a1ae2741d397ce1b9e6d4
https://github.com/internetarchive/heritrix3/commit/0aab1498d9d118ea8b5a1ae2741d397ce1b9e6d4
Author: Alex Osborne <aos...@nl...>
Date: 2023-05-11 (Thu, 11 May 2023)
Changed paths:
M .github/workflows/maven.yml
Log Message:
-----------
Github action: test against JDK 20 instead of 19 |
From: Andy J. <no...@gi...> - 2023-03-10 13:12:57
|
Branch: refs/heads/master
Home: https://github.com/internetarchive/heritrix3
Commit: 7cd4a914c308542ffdd0699ffb7c92042fd0126d
https://github.com/internetarchive/heritrix3/commit/7cd4a914c308542ffdd0699ffb7c92042fd0126d
Author: claire9910 <115...@us...>
Date: 2022-10-15 (Sat, 15 Oct 2022)
Changed paths:
M contrib/pom.xml
Log Message:
-----------
update com.itextpdf:itextpdf 5.5.0 to 5.5.12

Commit: 8563f491a5b355c39a89f51b17c76aaa84752a8a
https://github.com/internetarchive/heritrix3/commit/8563f491a5b355c39a89f51b17c76aaa84752a8a
Author: Andy Jackson <an...@an...>
Date: 2023-03-10 (Fri, 10 Mar 2023)
Changed paths:
M contrib/pom.xml
Log Message:
-----------
Merge pull request #536 from claire9910/oscs_fix_cd5c2voau51t5k2gfi30
fix(sec): upgrade com.itextpdf:itextpdf to 5.5.12

Compare: https://github.com/internetarchive/heritrix3/compare/a3952be18165...8563f491a5b3 |
From: Andy J. <no...@gi...> - 2023-03-10 13:06:17
|
Branch: refs/heads/revert-375-fix-last-modified-millis
Home: https://github.com/internetarchive/heritrix3 |
From: Andy J. <no...@gi...> - 2023-03-10 13:06:17
|
Branch: refs/heads/master
Home: https://github.com/internetarchive/heritrix3
Commit: d073b43b485dd58e14fe96a8cbc372477f2dfc53
https://github.com/internetarchive/heritrix3/commit/d073b43b485dd58e14fe96a8cbc372477f2dfc53
Author: Andy Jackson <And...@bl...>
Date: 2023-03-10 (Fri, 10 Mar 2023)
Changed paths:
M contrib/src/main/java/org/archive/modules/recrawl/hbase/MultiColumnRecrawlDataSchema.java
M contrib/src/main/java/org/archive/modules/recrawl/hbase/SingleColumnJsonRecrawlDataSchema.java
R contrib/src/test/java/org/archive/modules/recrawl/hbase/MultiColumnRecrawlDataSchemaTest.java
R contrib/src/test/java/org/archive/modules/recrawl/hbase/SingleColumnJsonRecrawlDataSchemaTest.java
Log Message:
-----------
Revert "Use seconds from the FetchCompletedTime for LastModified time"

Commit: a3952be18165ab9f28b68b9a37b69089f7363d4b
https://github.com/internetarchive/heritrix3/commit/a3952be18165ab9f28b68b9a37b69089f7363d4b
Author: Andy Jackson <an...@an...>
Date: 2023-03-10 (Fri, 10 Mar 2023)
Changed paths:
M contrib/src/main/java/org/archive/modules/recrawl/hbase/MultiColumnRecrawlDataSchema.java
M contrib/src/main/java/org/archive/modules/recrawl/hbase/SingleColumnJsonRecrawlDataSchema.java
R contrib/src/test/java/org/archive/modules/recrawl/hbase/MultiColumnRecrawlDataSchemaTest.java
R contrib/src/test/java/org/archive/modules/recrawl/hbase/SingleColumnJsonRecrawlDataSchemaTest.java
Log Message:
-----------
Merge pull request #552 from internetarchive/revert-375-fix-last-modified-millis
Revert "Use seconds from the FetchCompletedTime for LastModified time"

Compare: https://github.com/internetarchive/heritrix3/compare/d948364105bf...a3952be18165 |
From: Andy J. <no...@gi...> - 2023-03-10 13:05:56
|
Branch: refs/heads/revert-375-fix-last-modified-millis
Home: https://github.com/internetarchive/heritrix3
Commit: d073b43b485dd58e14fe96a8cbc372477f2dfc53
https://github.com/internetarchive/heritrix3/commit/d073b43b485dd58e14fe96a8cbc372477f2dfc53
Author: Andy Jackson <And...@bl...>
Date: 2023-03-10 (Fri, 10 Mar 2023)
Changed paths:
M contrib/src/main/java/org/archive/modules/recrawl/hbase/MultiColumnRecrawlDataSchema.java
M contrib/src/main/java/org/archive/modules/recrawl/hbase/SingleColumnJsonRecrawlDataSchema.java
R contrib/src/test/java/org/archive/modules/recrawl/hbase/MultiColumnRecrawlDataSchemaTest.java
R contrib/src/test/java/org/archive/modules/recrawl/hbase/SingleColumnJsonRecrawlDataSchemaTest.java
Log Message:
-----------
Revert "Use seconds from the FetchCompletedTime for LastModified time" |
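The change reverted above was introduced with the explanation that the LastModified field is expected to be in seconds but is sometimes stored in milliseconds, which makes the parsing logic produce a date in the future. The sketch below only illustrates that units mismatch with java.time. The class name `LastModifiedUnits`, the method `parseAsSeconds`, and the sample timestamp are hypothetical and do not come from the Heritrix schema classes.

```java
import java.time.Instant;

public class LastModifiedUnits {

    // Hypothetical reader that always interprets the stored value as
    // seconds since the Unix epoch.
    static Instant parseAsSeconds(long stored) {
        return Instant.ofEpochSecond(stored);
    }

    public static void main(String[] args) {
        // Illustrative fetch-completed time in milliseconds (April 2021).
        long fetchCompletedMillis = 1618400000000L;
        long fetchCompletedSeconds = fetchCompletedMillis / 1000L;

        // Stored in seconds: the reader recovers the intended instant.
        System.out.println(parseAsSeconds(fetchCompletedSeconds));

        // Stored in milliseconds but read as seconds: the value is
        // 1000 times too large, so the result lies far in the future.
        System.out.println(parseAsSeconds(fetchCompletedMillis));
    }
}
```

Dividing by 1000 at the point of storage is only one way to keep the units consistent; the sketch takes no position on why the original change was later reverted.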