-
The ARCWriter cannot handle records larger than 2 GB.
There are 3 sub-problems here:
1) maxsize is an int, which limits an ARC file to at most 2 GB.
2) The write() and getMetaLine() methods take recordLength as an int, which limits each record to 2 GB.
3) The write() methods give strange error messages
when given a negative recordLength (due to...
2007-03-07 13:42:52 UTC in Heritrix: Internet Archive Web Crawler
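The int limit described in this report can be illustrated with a minimal sketch. The class and method names below (RecordLengthDemo, narrow, checkRecordLength) are hypothetical and are not part of ARCWriter; the sketch only shows how a length above 2 GB turns negative when narrowed to an int, and how a long-based check could report that clearly.

```java
class RecordLengthDemo {
    /** Narrowing a long to an int keeps only the low 32 bits. */
    static int narrow(long length) {
        return (int) length;
    }

    /** Hypothetical validator: rejects negative lengths with an explicit message. */
    static long checkRecordLength(long recordLength) {
        if (recordLength < 0) {
            throw new IllegalArgumentException(
                "recordLength must be non-negative, got " + recordLength);
        }
        return recordLength;
    }

    public static void main(String[] args) {
        long threeGb = 3L * 1024 * 1024 * 1024;          // 3221225472 bytes
        System.out.println(narrow(threeGb));              // prints -1073741824
        System.out.println(checkRecordLength(threeGb));   // prints 3221225472
    }
}
```

Keeping the length as a long end to end avoids the wrap-around; the explicit check is only an assumption about how a clearer error message might be produced.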
-
It is no longer possible to override the method
CrawlLogIterator.parseLine(), because the method is
now static, even though the javadoc for the method says
it can be overridden.
2006-11-09 16:17:50 UTC in DeDuplicator (Heritrix add-on)
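A small sketch of why a static method cannot be overridden in Java. The classes here (BaseIterator, CustomIterator) are hypothetical stand-ins and do not reproduce the DeDuplicator API; they only show that a static method in a subclass hides the parent's method instead of overriding it, so calls made from the base class never reach the subclass version.

```java
class StaticHidingDemo {
    static class BaseIterator {
        static String parseLineStatic(String line) { return "base:" + line; }

        String parseLineInstance(String line) { return "base:" + line; }

        // The static call is bound at compile time to BaseIterator.parseLineStatic.
        String nextStatic(String line) { return parseLineStatic(line); }

        // The instance call is dispatched at run time on the actual object.
        String nextInstance(String line) { return parseLineInstance(line); }
    }

    static class CustomIterator extends BaseIterator {
        // Same signature, but this only hides the base method.
        static String parseLineStatic(String line) { return "custom:" + line; }

        @Override
        String parseLineInstance(String line) { return "custom:" + line; }
    }

    public static void main(String[] args) {
        BaseIterator it = new CustomIterator();
        System.out.println(it.nextStatic("x"));    // prints base:x
        System.out.println(it.nextInstance("x"));  // prints custom:x
    }
}
```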
-
Logged In: YES
user_id=19063
The correct fix should of course be:
protected boolean checkQuotas(final CrawlURI curi,
        final CrawlSubstats.HasCrawlSubstats hasStats,
        final int CAT) {
    if (hasStats == null) {
        return false;
    }
    CrawlSubstats substats = hasStats.getSubstats();
2006-11-09 16:05:59 UTC in Heritrix: Internet Archive Web Crawler
-
I get an NPE in QuotaEnforcer.checkQuotas whenever I
crawl the Danish site: http://tv.sputnik.dk.
2006-11-07T15:39:50.692Z -5 -
clsid:A9FC132B-096D-460B-B7D5-1DB0FAE0C062 XRE
http://tv.sputnik.dk/?returnurl=http://tv.sputnik.dk/player/license/channel/2089519/clip/1782918.html&cancelurl=http://tv.sputnik.dk/page/2040550/channel/2089519/category/720677-0/clip/1782918/index.html...
2006-11-09 16:01:44 UTC in Heritrix: Internet Archive Web Crawler
-
The summary was cut short. It was supposed to
say:
The UURI class may throw NullPointerException in
getReferencedHost().
2006-08-21 10:49:25 UTC in Heritrix: Internet Archive Web Crawler
-
When a parsed URI returns null from both
getHost() and getScheme(),
a NullPointerException is thrown, because the
method assumes that if getHost() returns null,
getScheme() will not.
The method is evidently written specifically for dns:-type
URIs, but it fails on
some malformed URLs, for instance
"http//www.test.foo" (where the : is missing).
The method...
2006-08-21 10:16:21 UTC in Heritrix: Internet Archive Web Crawler
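A null-safe variant of the lookup described in this report might look like the sketch below. It uses java.net.URI from the JDK as a stand-in, not Heritrix's UURI class, and the helper name referencedHost is hypothetical; the point is only that writing the comparison as "dns".equals(scheme) tolerates a null scheme.

```java
import java.net.URI;

class ReferencedHostDemo {
    /**
     * Returns the host a URI refers to, or null if none can be determined.
     * Assumption: for dns: URIs the scheme-specific part is taken as the host.
     */
    static String referencedHost(URI uri) {
        String host = uri.getHost();
        if (host != null) {
            return host;
        }
        // Constant-first equals() is safe when getScheme() returns null.
        if ("dns".equals(uri.getScheme())) {
            return uri.getSchemeSpecificPart();
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(referencedHost(URI.create("http://www.test.foo/"))); // prints www.test.foo
        System.out.println(referencedHost(URI.create("dns:www.test.foo")));     // prints www.test.foo
        // Missing ':' means no scheme and no host; no exception is thrown.
        System.out.println(referencedHost(URI.create("http//www.test.foo")));   // prints null
    }
}
```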
-
svc committed patchset 25 of module searchengine to the NWA Toolset CVS repository, changing 4 files.
2004-08-10 14:18:34 UTC in NWA Toolset
-
svc committed patchset 24 of module searchengine to the NWA Toolset CVS repository, changing 2 files.
2004-08-09 13:46:40 UTC in NWA Toolset
-
svc committed patchset 23 of module searchengine to the NWA Toolset CVS repository, changing 1 file.
2004-08-09 13:45:22 UTC in NWA Toolset
-
svc committed patchset 19 of module retriever to the NWA Toolset CVS repository, changing 1 file.
2004-08-03 10:59:41 UTC in NWA Toolset