Heritrix: Internet Archive Web Crawler

Tracker: Feature Requests

5 CrawlURI serialization bloated; should be slimmed - ID: 1208747
Last Update: Comment added ( karl-ia )

Users have commented about the size of crawler state (BDB)
directories on disk. In one case, where the crawl only
saves primary text, state has been observed at 2X the size
of the ARCs. That's a little odd, even given ARC
compression, as links are only a small fraction of all
text.

Some monitoring in BDBMultipleWorkQueues suggests
CrawlURI instances serialized to the database are
taking 1K or more in 1.4, growing over time (as
deeper/longer URIs come to predominate).

Stepping through in a debugger suggests 50% or more of the size
is due to very bloated UURI serialization (mostly due
to inherited URI state). A smaller hit is taken
serializing empty AList and outLinks collections.
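
One way to reproduce this kind of measurement outside the crawler is to serialize an object to a byte array and compare sizes. The sketch below is not Heritrix code: ParsedUri and QueueItem are hypothetical stand-ins for a URI class that carries parsed parts as serialized fields and for a queue entry with usually-empty collections, and SerializedSizeProbe is an illustrative name.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;

public class SerializedSizeProbe {

    // Hypothetical stand-in for a URI class that serializes every parsed part.
    static class ParsedUri implements Serializable {
        private static final long serialVersionUID = 1L;
        final char[] whole, scheme, authority, path, query;

        ParsedUri(String s) {
            URI u = URI.create(s);
            whole = chars(s);
            scheme = chars(u.getScheme());
            authority = chars(u.getAuthority());
            path = chars(u.getPath());
            query = chars(u.getQuery());
        }

        private static char[] chars(String s) {
            return s == null ? null : s.toCharArray();
        }
    }

    // Hypothetical queue item: a URI plus collections that are usually empty.
    static class QueueItem implements Serializable {
        private static final long serialVersionUID = 1L;
        final ParsedUri uri;
        final HashMap<String, Object> alist = new HashMap<>();
        final ArrayList<String> outLinks = new ArrayList<>();

        QueueItem(String s) {
            uri = new ParsedUri(s);
        }
    }

    // Size in bytes of the default Java serialization of one object.
    static int serializedSize(Serializable o) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(o);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    public static void main(String[] args) {
        String url = "http://www.example.com/a/deep/path/page.html?x=1";
        System.out.println("queue item: " + serializedSize(new QueueItem(url)) + " bytes");
        System.out.println("URI string: " + serializedSize(url) + " bytes");
    }
}
```

The gap between the two printed numbers is overhead from class descriptors, redundant parsed fields, and empty collections, which is the sort of cost described above.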




Submitted: Gordon Mohr ( gojomo ) - 2005-05-25 20:58
Priority: 5
Status: Closed
Resolution: None
Assigned: Gordon Mohr
Category: None
Group: 1.6.0
Visibility: Public


Comments ( 3 )

Date: 2007-03-14 01:42
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-939 -- please add further
comments at that location.


Date: 2005-08-04 22:21
Sender: gojomo (Project Admin)

Closing. Easy reductions in serialized size have been
implemented.


Date: 2005-05-25 21:14
Sender: gojomo (Project Admin)

Several steps have made a 60%+ improvement in 10-20K URL
crawl tests (driving average serialized size down from ~1100
bytes to ~420 bytes). As serialized CrawlURIs are the largest
component of disk-based state (and of the BDB in-memory queue),
this could offer significant memory/performance benefits.

Commit comment:

Improvement for [ 1208747 ] CrawlURI serialization bloated;
should be slimmed
 * CandidateURI.java
     use custom serialization which more compactly stores
     UURI instances and empty alist
 * CrawlURI.java
     use custom serialization which more compactly stores
     empty outLinks
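
For readers unfamiliar with the technique, custom serialization of this kind can be sketched roughly as follows. This is not the committed Heritrix code: CompactItem is a hypothetical class, java.net.URI stands in for UURI, and the choice to write the URI as a UTF string and an empty list as null is an assumption made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

public class CompactItem implements Serializable {
    private static final long serialVersionUID = 1L;

    // transient: skipped by default serialization, written by hand below
    private transient URI uri;
    private transient List<String> outLinks = new ArrayList<>();

    public CompactItem(String uri) {
        this.uri = URI.create(uri);
    }

    public URI getUri() { return uri; }
    public List<String> getOutLinks() { return outLinks; }

    private void writeObject(ObjectOutputStream out) throws IOException {
        out.defaultWriteObject();
        // Store only the URI text (writeUTF is limited to 65535 encoded bytes).
        out.writeUTF(uri.toString());
        // An empty collection is written as null, which costs a single byte.
        out.writeObject(outLinks.isEmpty() ? null : outLinks);
    }

    @SuppressWarnings("unchecked")
    private void readObject(ObjectInputStream in)
            throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        uri = URI.create(in.readUTF());
        // Field initializers do not run on deserialization, so rebuild the list here.
        List<String> stored = (List<String>) in.readObject();
        outLinks = (stored != null) ? stored : new ArrayList<>();
    }

    // Helper for experiments: serialize to memory and read the copy back.
    static CompactItem roundTrip(CompactItem item) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(item);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (CompactItem) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The trade-off is that the URI has to be re-parsed on every read, in exchange for not storing its parsed state on disk.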


Changes ( 4 )

Field              Old Value                          Date              By
artifact_group_id  None                               2005-09-23 21:08  gojomo
status_id          Open                               2005-08-04 22:21  gojomo
close_date         -                                  2005-08-04 22:21  gojomo
summary            CrawlURI serialization is bloated  2005-05-25 21:14  gojomo