Users have commented on the size of the crawler state
(BDB) directories on disk. In one case, where the crawl
saves only primary text, the state was observed to be
2X the size of the ARCs. That's a little odd, even given
ARC compression, as links are only a small fraction of
all text.
Some monitoring in BDBMultipleWorkQueues suggests
CrawlURI instances serialized to the database are
taking 1K or more in 1.4, growing over time (as
deeper/longer URIs come to predominate).
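A rough way to reproduce this kind of measurement outside the crawler is to serialize an object into a byte buffer and count the bytes. The sketch below is only an illustration: `SerializedSizeProbe` and `serializedSize` are hypothetical names, not part of Heritrix, and it uses plain Java serialization, so the figures it reports will not necessarily match what a particular BDB binding writes to disk.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

/** Hypothetical helper (not Heritrix code) for estimating serialized size. */
public class SerializedSizeProbe {

    /** Returns the number of bytes standard Java serialization emits for obj. */
    public static int serializedSize(Serializable obj) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        // Closing the ObjectOutputStream flushes any buffered data into bytes.
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        }
        return bytes.size();
    }

    public static void main(String[] args) throws IOException {
        // Compare a bare URI string with an empty collection to see how much
        // fixed overhead each object type carries before holding any data.
        System.out.println("string:     "
                + serializedSize("http://example.com/a/b/c"));
        System.out.println("empty list: "
                + serializedSize(new java.util.ArrayList<String>()));
    }
}
```

Calling the same method on a queued item and on each of its fields in turn gives a quick breakdown of where the bytes go.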
Stepping through in the debugger suggests 50% or more of
the size is due to very bloated UURI serialization
(mostly due to inherited URI state). A smaller hit is
taken serializing empty AList and outLinks collections.
Gordon Mohr
None
1.6.0
Public
Date: 2007-03-14 01:42
Date: 2005-08-04 22:21
Date: 2005-05-25 21:14
| Field | Old Value | Date | By |
|---|---|---|---|
| artifact_group_id | None | 2005-09-23 21:08 | gojomo |
| status_id | Open | 2005-08-04 22:21 | gojomo |
| close_date | - | 2005-08-04 22:21 | gojomo |
| summary | CrawlURI serialization is bloated | 2005-05-25 21:14 | gojomo |