
Heritrix: Internet Archive Web Crawler

Tracker: Bugs

7 document size limit not working - ID: 1014732
Last Update: Comment added ( karl-ia )

I notice that setting the document size limit does not take
effect.


Submitted: Igor Ranitovic ( ia_igor ) - 2004-08-23 21:30
Priority: 7
Status: Closed
Resolution: Fixed
Assigned To: Michael Stack
Category: General
Group: 1.0.1
Visibility: Public


Comments ( 8 )

Date: 2007-03-14 00:15
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-238 -- please add further
comments at that location.


Date: 2004-08-25 02:06
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Added a closeConnection onto our HttpMethod overrides.

Here is commit message used on HEAD and when backporting to
heritrix_1_0:


* src/java/org/archive/crawler/fetcher/FetchHTTP.java
  (innerProcess): Call new closeConnection if we exceed
  time or size limits.
* src/java/org/archive/httpclient/HttpRecorderMethod.java
* src/java/org/archive/httpclient/HttpRecorderGetMethod.java
  Factored out common code to new class HttpRecorderMethod.
* src/java/org/archive/httpclient/CloseConnectionMarker.java
* src/java/org/archive/httpclient/HttpRecorderMethod.java
  Added.
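
The idea in the commit message above (close the connection instead of letting release drain it once a limit is exceeded) might look roughly like the sketch below. This is not the Heritrix or HttpClient code: the class ReleaseSketch and its methods readUpTo, release, and fetch are hypothetical stand-ins, and only the name closeConnection is borrowed from the commit message. The body size 128027 and the limit 102400 are the figures quoted elsewhere in this report.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

/** Hypothetical stand-in for an HTTP method object; not the HttpClient or Heritrix API. */
final class ReleaseSketch {
    private final InputStream body;
    private long bytesPulled = 0;
    private boolean closeRequested = false;

    ReleaseSketch(InputStream body) {
        this.body = body;
    }

    /** Reads at most limit bytes of the body and returns how many were read. */
    long readUpTo(long limit) {
        byte[] buf = new byte[4096];
        long total = 0;
        try {
            while (total < limit) {
                int want = (int) Math.min(buf.length, limit - total);
                int n = body.read(buf, 0, want);
                if (n == -1) {
                    break;
                }
                total += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        bytesPulled += total;
        return total;
    }

    /** Marks the connection as not worth reusing (assumed analogue of closeConnection). */
    void closeConnection() {
        closeRequested = true;
    }

    /** Drains the remainder unless a close was requested, then closes the stream. */
    void release() {
        try {
            if (!closeRequested) {
                byte[] buf = new byte[4096];
                int n;
                while ((n = body.read(buf)) != -1) {
                    bytesPulled += n;
                }
            }
            body.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Simulates one fetch and returns the total bytes pulled off the stream. */
    static long fetch(int bodySize, long limit, boolean closeOnLimit) {
        ReleaseSketch m = new ReleaseSketch(new ByteArrayInputStream(new byte[bodySize]));
        long read = m.readUpTo(limit);
        if (closeOnLimit && read >= limit) {
            m.closeConnection();
        }
        m.release();
        return m.bytesPulled;
    }

    public static void main(String[] args) {
        System.out.println(fetch(128027, 102400, false)); // prints 128027
        System.out.println(fetch(128027, 102400, true));  // prints 102400
    }
}
```

The trade-off in this sketch is that a closed connection cannot be reused for the next request, which seems acceptable when the alternative is downloading the rest of an oversized document.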


Date: 2004-08-24 23:39
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

(Closed by mistake. Reopening).

So although we notice we've read too much and jump out of the
readFullyOrUntil method, method.releaseConnection keeps
reading the content; it doesn't give up. (Did this ever
work?) Looking for a way to force the connection down,
though HttpClient does its best to hide the Connection from
direct manipulation.
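
One way to observe the behaviour described above is to count the bytes pulled off the stream. The sketch below is a simulation under stated assumptions, not Heritrix code: CountingStream and simulate are hypothetical names, the in-memory body stands in for a network response, and the second loop only imitates a release step that drains to end of stream.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

final class DrainOnReleaseSketch {

    /** Counts every byte that callers pull through the wrapped stream. */
    static final class CountingStream extends FilterInputStream {
        long count = 0;

        CountingStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            int b = super.read();
            if (b != -1) {
                count++;
            }
            return b;
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            int n = super.read(b, off, len);
            if (n > 0) {
                count += n;
            }
            return n;
        }
    }

    /** Returns {bytes seen by the size-limited reader, bytes pulled off the stream in total}. */
    static long[] simulate(int bodySize, int limit) {
        CountingStream s = new CountingStream(new ByteArrayInputStream(new byte[bodySize]));
        byte[] buf = new byte[4096];
        try {
            long seen = 0;
            int n;
            // Size-limited read: stop once the limit is reached or passed.
            while (seen < limit && (n = s.read(buf)) != -1) {
                seen += n;
            }
            // Imitation of a draining release: keep reading to end of stream.
            while (s.read(buf) != -1) {
                // discard
            }
            return new long[] {seen, s.count};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        long[] r = simulate(128027, 102400);
        System.out.println(r[0]); // prints 102400
        System.out.println(r[1]); // prints 128027
    }
}
```

In this simulation the limited reader stops at the cap, yet the whole body is still transferred, which is the symptom reported here.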


Date: 2004-08-24 21:33
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Just downloaded 1.0.0 and tried it against archive.org. I
let it run ten minutes with upper bound of 100k (102400).
It works most of the time
(http://crawler.archive.org/checkstyle-report.html) Found a
page where it ain't working. Investigating.


Date: 2004-08-24 20:42
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Tried 10k and 100k and it seems to be working fine. Here is
log of 100k test:

20040824203333235 200 103622 http://www.archive.org/audio/netlabels.php?PHPSESSID=92977bec5b4d22cb481dc57689d1e249 L http://www.archive.org/ text/html #001 56203 C773M6Z63Q7BNZR3I6ESHSXUTD5KWEUI lengthTrunc

Here is the arc file entry metadata line:

http://www.archive.org/audio/netlabels.php?PHPSESSID=92977bec5b4d22cb481dc57689d1e249 209.237.235.228 20040824203237 text/html 104002

Mozilla info says page is 125.03 KB (128027 bytes).

Calling Igor for more info. Maybe it's in 1.0.0 only?
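
For checking many log lines like the one in this comment, a small parser can flag entries annotated lengthTrunc and report how far the logged size sits above the configured limit. This is only a sketch: the class CrawlLogLineSketch is hypothetical, and the field positions (third field is the size, fourth the URI, last the comma-separated annotations) are assumptions read off the sample lines in this report, not taken from Heritrix documentation.

```java
import java.util.Arrays;
import java.util.List;

final class CrawlLogLineSketch {
    final long size;
    final String uri;
    final List<String> annotations;

    private CrawlLogLineSketch(long size, String uri, List<String> annotations) {
        this.size = size;
        this.uri = uri;
        this.annotations = annotations;
    }

    /** Parses one whitespace-separated crawl log line by field position. */
    static CrawlLogLineSketch parse(String line) {
        String[] f = line.trim().split("\\s+");
        if (f.length < 5) {
            throw new IllegalArgumentException("too few fields: " + line);
        }
        // Positions assumed from the sample lines: [2] = size, [3] = URI, last = annotations.
        List<String> notes = Arrays.asList(f[f.length - 1].split(","));
        return new CrawlLogLineSketch(Long.parseLong(f[2]), f[3], notes);
    }

    /** True if the line carries the lengthTrunc annotation. */
    boolean truncatedByLength() {
        return annotations.contains("lengthTrunc");
    }

    /** Bytes by which the logged size exceeds the limit, or 0 if it does not. */
    long overshoot(long limit) {
        return Math.max(0, size - limit);
    }

    public static void main(String[] args) {
        String line = "20040824203333235 200 103622 "
                + "http://www.archive.org/audio/netlabels.php?PHPSESSID=92977bec5b4d22cb481dc57689d1e249 "
                + "L http://www.archive.org/ text/html #001 56203 "
                + "C773M6Z63Q7BNZR3I6ESHSXUTD5KWEUI lengthTrunc";
        CrawlLogLineSketch e = parse(line);
        System.out.println(e.size);                // prints 103622
        System.out.println(e.truncatedByLength()); // prints true
        System.out.println(e.overshoot(102400));   // prints 1222
    }
}
```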


Date: 2004-08-24 20:17
Sender: nobody

Logged In: NO

Seems to be working for small sizes:

20040824200840769 200 4096 http://www.archive.org/ - - text/html #001 43091 YI56HW65M2VFZA3GI7XYXYU2F2GNJ2SJ lengthTrunc,3t
20040824200947709 200 4395 http://www.archive.org/images/logo.jpg E http://www.archive.org/ image/jpeg #001 13402 N4GANKKA4FOUVMOC5FYPZ45YKMDY7CZU lengthTrunc

Will try other sizes.


Date: 2004-08-24 00:04
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

This is the FetchHTTP max-length-bytes setting.

(I also see evidence in the NARA-MIL test crawl that the
timeout-seconds setting may not be working, either. (A
thread has been stuck in a normal HTTP fetch for hours.)
This timeout problem may, however, be a problem with the
socket timeouts on initial connect, rather than the
during-content-body-read timeout that we implement.)
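
The distinction drawn above, between a socket-level timeout on the initial connect and a timeout enforced while reading the body, can be sketched with plain java.net and java.io. This is an illustration only, not the FetchHTTP implementation: TimeoutSketch, open, and readUntilDeadline are hypothetical names, and the injectable clock exists purely so the deadline logic can be exercised without a network.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.function.LongSupplier;

final class TimeoutSketch {

    /** Opens a socket with a connect timeout and a per-read timeout, both in milliseconds. */
    static Socket open(String host, int port, int connectMillis, int readMillis) throws IOException {
        Socket s = new Socket();
        try {
            s.connect(new InetSocketAddress(host, port), connectMillis); // bounds the initial connect
            s.setSoTimeout(readMillis); // bounds each blocking read, not the whole body
            return s;
        } catch (IOException e) {
            s.close();
            throw e;
        }
    }

    /** Reads until EOF or until the overall deadline passes; returns the bytes read. */
    static long readUntilDeadline(InputStream in, long timeoutMillis, LongSupplier clockMillis) {
        long deadline = clockMillis.getAsLong() + timeoutMillis;
        byte[] buf = new byte[4096];
        long total = 0;
        try {
            int n;
            while ((n = in.read(buf)) != -1) {
                total += n;
                if (clockMillis.getAsLong() > deadline) {
                    break; // overall timeout during the body read
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    public static void main(String[] args) {
        // Fake clock that advances 500 ms on every call, so the run is deterministic.
        long[] now = {0};
        LongSupplier clock = () -> {
            long t = now[0];
            now[0] += 500;
            return t;
        };
        long got = readUntilDeadline(new ByteArrayInputStream(new byte[100000]), 1000, clock);
        System.out.println(got); // prints 12288
    }
}
```

Note that the deadline check in this sketch only runs between reads, so a read that blocks indefinitely would still hang unless a per-read socket timeout is also set.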


Date: 2004-08-23 23:32
Sender: ia_igor (Project Admin)

Logged In: YES
user_id=715474

I forgot to mention that this was with 1.0 and the document
size limit was 102400 (100KB).



Changes ( 11 )

Field          Old Value         Date              By
close_date     2004-08-24 21:33  2004-08-25 02:06  stack-sf
status_id      Open              2004-08-25 02:06  stack-sf
resolution_id  None              2004-08-25 02:06  stack-sf
status_id      Closed            2004-08-24 23:39  stack-sf
resolution_id  Works For Me      2004-08-24 23:39  stack-sf
close_date     -                 2004-08-24 21:33  stack-sf
status_id      Open              2004-08-24 21:33  stack-sf
resolution_id  None              2004-08-24 21:33  stack-sf
assigned_to    gojomo            2004-08-24 00:00  gojomo
assigned_to    nobody            2004-08-23 23:53  gojomo
priority       6                 2004-08-23 23:52  gojomo