
Heritrix: Internet Archive Web Crawler

Tracker: Feature Requests

8 Quick resume without real recovery / Checkpointing - ID: 1190974
Last Update: Comment added ( karl-ia )

Heritrix should be able to resume crashed crawls
quickly, without actually having to recover each URI
line-by-line (which basically is re-crawling without
network I/O, impractical if you already have crawled
millions of pages).

Especially for the BdbFrontier, this might be easy if
the set of URIs already included (BdbUriUniqFilter) and
the queue containing all the pending URIs
(BdbMultipleWorkQueues) can be opened and re-used.

The number of URIs included, pending etc. could be set
by re-counting the queue's contents (hard with Bdb) or
simply taken out of the progress-statistics logfile
(fast, but probably a bit inaccurate).
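
A minimal sketch of the "fast, but probably a bit inaccurate" option: scan the statistics log backwards and take the counters from the last parsable line. The column layout used here (timestamp, discovered, queued, downloaded as the first four whitespace-separated fields) is an assumption and should be checked against the real progress-statistics file; the class and method names are hypothetical.

```java
import java.util.List;

/** Hypothetical helper: recover approximate counters from the last statistics line. */
final class ProgressStatsTail {
    /** Counters as last reported in the log. */
    static final class Counters {
        final long discovered;
        final long queued;
        final long downloaded;

        Counters(long discovered, long queued, long downloaded) {
            this.discovered = discovered;
            this.queued = queued;
            this.downloaded = downloaded;
        }
    }

    /**
     * Walks the lines from the end and returns the first one that parses.
     * ASSUMPTION: fields 1..3 (0-based) are discovered, queued, downloaded.
     */
    static Counters fromLastLine(List<String> lines) {
        for (int i = lines.size() - 1; i >= 0; i--) {
            String[] f = lines.get(i).trim().split("\\s+");
            if (f.length < 4) {
                continue; // blank or truncated line (e.g. cut off by the crash)
            }
            try {
                return new Counters(Long.parseLong(f[1]),
                        Long.parseLong(f[2]), Long.parseLong(f[3]));
            } catch (NumberFormatException e) {
                // header or garbled line; keep scanning backwards
            }
        }
        throw new IllegalArgumentException("no parsable statistics line found");
    }
}
```

The counters returned this way miss whatever happened between the last statistics line and the crash, which is the inaccuracy mentioned above.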


Christian Kohlschütter ( ck-heritrix ) - 2005-04-27 12:53

8

Closed

None

Karl Thiessen

None

1.6.0

Public


Comments ( 20 )

Date: 2007-03-14 01:40
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-922 -- please add further
comments at that location.


Date: 2005-10-07 17:31
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

From sleepycat on the way I'd implemented getCount (since
fixed as part of
http://sourceforge.net/tracker/?group_id=73833&atid=539099&func=detail&aid=1312934):

> Recovering a crawler checkpoint, I've observed the following phenomenon
> in 2.0.54 and 2.0.83. In a few instances, on recover, I (used to) get
> count of bdbje db elements. I used the below getCount method after
> reading Linda's blog entry from 2004/12/17/12.46.07. If the dbs are
> small, the getCount works grand. But, oddly, if db is big (say > 500k
> entries), I seem to go into the db.getStats call and never come back out
> (I let it run overnight and it never emerged).
>
>
I wonder if it's just taking a very long time, because
dbstat is not actually fast at all. It currently does a full
table scan, which can require a lot of random i/o, and also
does a variety of extra checks to generate stats. We
recently added a -v (verbose) flag to a number of our
utilities because we found that we needed a progress
indicator; it sounds like we should have added it to the
call to generate stats too.

We are working on a new tree walker, and today found that it
cut a table scan that was taking about 1.5 hours down to 3
minutes. But that's in our current code line, and is still
experimental, because it returns values in non-key order. We
are thinking of using this new construct for activities like
count, or stats. That won't be ready for a while though, so
I'm afraid that you're left with your workaround for now.

Regards,

Linda


Date: 2005-09-30 22:58
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

I've been doing some testing on crawling008. The copy of
bdb files might be getting stuck when we're up around the
4 million docs-downloaded mark. Checking. Meantime, it's
taking an hour to checkpoint when 2.5m downloaded and 25m
discovered. Negligible time to serialize the various
classes, about 4 minutes for bdb to checkpoint itself. Bulk
of time, almost 50 minutes, is consumed copying the 2336 10meg
bdb files out to the checkpoint directory. Karl has given
me crawling013 to do more checkpoint testing.
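
As a rough check on those figures, the copy rate they imply can be worked out directly. This sketch assumes "10meg" means 10 MB per file and rounds "almost 50 minutes" to 50; the class name is hypothetical.

```java
/** Hypothetical helper for the back-of-envelope copy rate. */
final class CheckpointCopyRate {
    /** Average throughput in MB/s when copying fileCount files of fileSizeMb each. */
    static double megabytesPerSecond(int fileCount, int fileSizeMb, double minutes) {
        double totalMb = fileCount * (double) fileSizeMb; // 2336 * 10 = 23360 MB
        return totalMb / (minutes * 60.0);                // 50 min = 3000 s
    }
}
```

With the numbers above, `megabytesPerSecond(2336, 10, 50)` comes to a little under 8 MB/s for roughly 23 GB of log files.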


Date: 2005-09-14 01:09
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Assigning to Karl to test.

HOW TO TEST:

See user_manual.html for how checkpointing works.

Things to try:

+ Big crawl. Pause. Checkpoint. Save the reports.
Terminate job. Terminate crawler. Restart crawler.
Restore from checkpoint and make the setting 'Pause on
start'. Compare the reports and logs to ensure we're at the
very point at which the checkpoint was done.
+ Continue crawling. Does it seem to be picking up from the
checkpoint ok?
+ Would be good to set up regular checkpointing from JMX.
Then after 7 or 8 checkpoints, does recovery from checkpoint
8 work ok?
+ Would be good to get some timings checkpointing after
we've been crawling a few days: i.e. Checkpointing big
dataset (FYI: The bloom filter gets checkpointed if you're
using this).


Date: 2005-09-14 00:50
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Made bloom filter checkpoint.

More on [ 1190974 ] Quick resume without real recovery
Add checkpointing of bloom filter already seen.
* src/java/org/archive/crawler/frontier/BdbFrontier.java
If bloom filter already-seen is in use, serialize it on
checkpoint and
then deserialize it when initializing if we're doing a
checkpoint recover.
Takes a good while doing both operations.
* src/java/org/archive/crawler/frontier/WorkQueueFrontier.java
Added 'this.'
* src/java/org/archive/crawler/util/BloomUriUniqFilter.java
Added serializable interface.



Date: 2005-09-13 22:19
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Added doc. and UI.

More on '[ 1190974 ] Quick resume without real recovery'.
Added UI. Put recover-log and checkpoint recovery into a
single 'Based on
recovery' screen.
* src/articles/user_manual.xml
Added doc. on checkpointing.
* src/java/org/archive/crawler/admin/CrawlJob.java
Formatting.
* src/java/org/archive/crawler/admin/CrawlJobHandler.java
Refactoring so that the setting of the recovery happens
after crawljob
creation rather than down deep just before we create the
job.
This allows me to populate recover-path with recover.gz
or checkpoint dir
up in the method where I know which of the two I'm to do
(When it was
being set down deep all I had was a directory with no
means to know if
it was the recovery log dir or a checkpoint dir).
(createNewJob, newJob): Changed API.
(updateRecoveryPaths): Added override that only throws FCE.
* src/java/org/archive/crawler/datamodel/CrawlOrder.java
Line lengths.
Changed help text on recover-path so it allows
checkpoint dir.
* src/webapps/admin/jobs.jsp
Added link to new page 'Based on recovery'.
* src/webapps/admin/jobs/basedon.jsp
Don't show 'recover' link in here any more. Its
functionality has been
moved to the 'recovery.jsp' page.
* src/webapps/admin/jobs/new.jsp
Pass checkpoint name if we're to recover a checkpoint.
* src/webapps/admin/jobs/recovery.jsp
New recovery pages.



Date: 2005-09-08 18:00
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Advice from sleepycat Mark Hayes on how to do checkpoint of
bdbje:

stack wrote:

> Yes. We definitely want this.
>
> Looks like close and reopen Environment won't work. I was
> thinking that since the environment close takes care of
> stuff like shutting down background threads and the running
> of a checkpoint, a call to close the
> environment followed by a backup of all bdbje logs finishing
> with a reopen of the environment would be a good way to go.
> But I need to be able to shut down dbs first -- and get
> their config. so that I reopen them after w/ the right
> setup. I can get list of db names from the Environment but
> doesn't seem to be a way of obtaining the db instances
> (Maybe I'm missing something in the API).


Right, to get a Database instance you would have to open it,
and you need the config to open it, so I think that's not
going to work!

> So, looks like I'll be relying on your new checkpoint
> config API additions to get the newly added 'no-delta'
> checkpoint.


Before the new API is available you can use the internal
code I sent earlier -- it does the same thing.

> Meantime, some questions:
>
> 1. Reading code, Environment.sync() is a checkpoint.
> Looks like I don't need to call a checkpoint after calling a
> sync?


Right, they're almost the same thing -- just do one or the
other, not both. With the new API, you'll need to do a
checkpoint not a sync, because the sync() method has no
config parameter. Don't worry -- it's fine to do a
checkpoint even though you're not using transactions.

> 2. I've forgotten how I disable cleaner thread: Setting
> je.cleaner.bytesInterval to Long.MAX_VALUE? Is it same for
> the Checkpointer thread? How do I disable the evictor?
> Looks like not possible since 2.0?


The config parameters for disabling the daemon threads are:

je.env.runCheckpointer=false
je.env.runCleaner=false
je.env.runINCompressor=false

The evictor is disabled by default in 2.0 so you don't need
this:
je.env.runEvictor=false

> 3. How's this for an order in which to go about a bdbje
> checkpoint:
>
> Save off the evictor setting and then disable it.


Don't need to disable the evictor.

> Save off the checkpointer setting and then disable it.
> Save off the cleaner setting and then disable it.


Yes (above) and also disable the INCompressor.

> Environment.sync()


Better to use the new checkpoint API, or the internal code I
sent you.

> Rotate off bdbje logs in order.
> Restore evictor, checkpointer, and cleaner.


Yes, sounds right.

> Good on you Mark.
> St.Ack


You too!
Mark
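
The order agreed in this exchange can be sketched as follows. This is only an outline under stated assumptions: `EnvHandle` is a hypothetical stand-in for whatever wraps the bdbje environment, and its four methods are not bdbje API calls. Only the three `je.env.*` parameter names come from the reply above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for the parts of a bdbje environment this sketch needs. */
interface EnvHandle {
    String getParam(String name);
    void setParam(String name, String value);
    void checkpoint();      // the forced checkpoint discussed in the thread
    void copyLogsInOrder(); // rotate/copy the je log files out, oldest first
}

final class BdbCheckpointOrder {
    /** Daemon switches named in the reply above (the evictor is left alone). */
    static final String[] DAEMONS = {
        "je.env.runCheckpointer", "je.env.runCleaner", "je.env.runINCompressor"
    };

    static void run(EnvHandle env) {
        // 1. Save off each daemon setting, then disable it.
        Map<String, String> saved = new LinkedHashMap<>();
        for (String param : DAEMONS) {
            saved.put(param, env.getParam(param));
            env.setParam(param, "false");
        }
        try {
            // 2. Checkpoint, then 3. copy the logs while nothing rewrites them.
            env.checkpoint();
            env.copyLogsInOrder();
        } finally {
            // 4. Restore the daemons even if the checkpoint or copy failed.
            for (Map.Entry<String, String> e : saved.entrySet()) {
                env.setParam(e.getKey(), e.getValue());
            }
        }
    }
}
```

Restoring in a `finally` block is a design choice of this sketch, so that a failed checkpoint does not leave the cleaner and checkpointer switched off for the rest of the crawl.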


Date: 2005-09-07 22:34
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Another commit:

More on '[ 1190974 ] Quick resume without real recovery'.
Add rotating of recover log. Add option to disable recover
log. Add
saving of settings on checkpoint.
* src/java/org/archive/crawler/admin/CrawlJob.java
(rotateLogs): Deprecated. Go via checkpointing to get
logs rotated.
*
src/java/org/archive/crawler/checkpoint/CheckpointContext.java
Javadoc.
* src/java/org/archive/crawler/framework/CrawlController.java
Javadoc.
(copySettings): Add copying over of settings on checkpoint.
(checkpointBdb): Remove the clean log suggested by bdbje
doc. Takes too
long. Remove checkpointing too since looks like this is
what the sync
does (Waiting on feedback from bdbje folks).
* src/java/org/archive/crawler/framework/ToeThread.java
Allow for frontierjournal being null.
* src/java/org/archive/crawler/frontier/AbstractFrontier.java
Add option that turns off recover log. Add checkpointing
(rotation) of recover log.
* src/java/org/archive/crawler/frontier/BdbFrontier.java
Call parent checkpointing method.
* src/java/org/archive/crawler/frontier/FrontierJournal.java
(checkpoint): Added. Tell those interested in frontier
when it's checkpointing.
* src/java/org/archive/crawler/frontier/RecoveryJournal.java
Add implementation of new checkpoint method. Rotate off
the recover
log and open a new one.
Linelengths.
* src/java/org/archive/io/GenerationFileHandler.java
Javadoc.



Date: 2005-07-07 21:30
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Datapoint: Crawl of ~3 million crawled took 3.5 hours to
checkpoint. Too long. On the other hand, the checkpoint
cleaned up about half the log files: 1671 of ~3400 total.

Need to look at how long recovers take with
non-cleaned/checkpointed bdb db.


Date: 2005-07-07 16:24
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Below is current status:

To checkpoint, you must first pause the crawler. You then
access '/console/action.jsp?action=checkpoint' -- I'll
add a link to the UI that shows when the crawler is paused
later -- or you can run the JMX checkpoint invocation.

To recover from a checkpoint, set the crawl order
'/crawl-order/recover-path' to point at the checkpoint
directory you want to recover from.

When checkpointing is invoked, the CrawlController takes
care of syncing and persisting out 'core' objects. 'Core'
objects include the overall bdb environment, the BigMap
instances, the StatisticsTracker and the CheckpointContext
itself. Non-core objects register as being interested in
CrawlStatusListener events -- e.g. start, stop, pause, etc.
-- and do their checkpoint persisting actions on invocation
of the newly added CrawlStatusListener#crawlCheckpoint(File
checkpointDir). For example, on crawlCheckpoint invocation,
the ARCWriterProcessor writes a state file into the
checkpoint directory w/ the current ARC serial number and
BdbFrontier persists its in-memory queue state.

If at any stage an exception is thrown, the checkpoint is
aborted and marked invalid. Recovery against an invalid
checkpoint is not possible.

When checkpointing, all bdb logs are copied to the checkpoint
directory. The crawl logs are rotated and just left in the
logs directory, distinguished by a checkpoint-name suffix
(e.g. crawl.log.00003).
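
A minimal sketch of that rotation, assuming the suffix is a zero-padded five-digit checkpoint number as in the crawl.log.00003 example; the class name is hypothetical and this is not the GenerationFileHandler code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper: rotate a log in place, leaving it in the logs directory. */
final class LogRotation {
    /** Renames e.g. crawl.log to crawl.log.00003 and starts a fresh, empty crawl.log. */
    static Path rotate(Path log, int checkpointNumber) throws IOException {
        String suffix = String.format("%05d", checkpointNumber); // assumed format
        Path rotated = log.resolveSibling(log.getFileName() + "." + suffix);
        Files.move(log, rotated); // fails rather than overwrite an earlier rotation
        Files.createFile(log);
        return rotated;
    }
}
```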

When recovering, when the CrawlController determines we're
in a checkpoint recover mode, it will take care of
resuscitating core objects and state. Non-core objects ask
the CrawlController if it's in checkpoint recovery mode when
it suits -- usually somewhere in Module#initialize or in
Processor#initialTasks -- and take care of their own revival
of previous state (CrawlController#getCheckpointRecover
returns the current checkpoint being recovered).
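
To illustrate the non-core side of this, here is a sketch of a module that writes a small state file on checkpoint and reads it back on recovery, in the spirit of the ARC serial number example. `CheckpointListener` is a hypothetical stand-in for the checkpoint part of CrawlStatusListener, and the state file name is an assumption.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

/** Hypothetical stand-in for the checkpoint method on CrawlStatusListener. */
interface CheckpointListener {
    void crawlCheckpoint(File checkpointDir) throws IOException;
}

/** Hypothetical non-core module that persists a single serial number. */
final class SerialNumberModule implements CheckpointListener {
    private static final String STATE_FILE = "SerialNumberModule.state"; // assumed name
    private int serialNo;

    int next() {
        return serialNo++;
    }

    /** Invoked by the controller while the crawl is paused for a checkpoint. */
    @Override
    public void crawlCheckpoint(File checkpointDir) throws IOException {
        File state = new File(checkpointDir, STATE_FILE);
        byte[] bytes = Integer.toString(serialNo).getBytes(StandardCharsets.UTF_8);
        Files.write(state.toPath(), bytes);
    }

    /** Called from the module's own initialization when in checkpoint-recover mode. */
    void recoverFrom(File checkpointDir) throws IOException {
        File state = new File(checkpointDir, STATE_FILE);
        String text = new String(Files.readAllBytes(state.toPath()), StandardCharsets.UTF_8);
        serialNo = Integer.parseInt(text.trim());
    }
}
```

Any IOException thrown from `crawlCheckpoint` here would propagate to the caller, which fits the rule above that an exception aborts the checkpoint and marks it invalid.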

Outstanding:

+ Needs lots of testing (There is at least one issue where
state of seeds is not being properly recovered that needs
fixing).
+ Add bloomfilter checkpointing.
+ Checkpointing and recover need to be fast. Here are issues
in current approach that hamper fast-checkpointing/recover:
++ At each checkpoint, I'm currently asking bdb to do a
clean of its logs. This is taking a long time -- e.g.
20 minutes on a crawl of 1.1 million urls -- and after the
cleaning has completed, perhaps 5-10% of logs have been
removed. The general idea is to compact the 'state' that needs
to be saved. It's looking like the bdb cleaner thread is
managing to keep up -- or, we're just manufacturing logs and
releasing few -- so this time-consuming task may not be
worth the small log compaction it achieves. Needs more
research.
++ Another time-consuming task when checkpointing is the
java-copy of often thousands of bdb log files to the
checkpoint backup directory, and again, on recover, copy of
the log files back under the new crawljob state dir. This
latter copy is being done so that on recover we don't
destroy the checkpoint. Since bdb corruption is unlikely,
probably shouldn't copy bdb logs at all: The recovery recipe
would then include the operator setting the
/crawl-order/state-path directory to point at the old crawl's
state directory (or a copy of the old state directory).
++ BdbFrontier persists its queue state by writing to bdb
databases. Probably better serializing the state Maps and
Sets to disk; may be faster and the bdb dbs have enough to
do as it is.
+ TODO: Saving settings/seeds at time of checkpoint.
+ TODO: Option to disable recover log writing.
+ For non-core objects, would be better if CrawlController
'told' registered items that we were in checkpoint recover
mode rather than have the non-core objects poll. There is a
bit of chicken-and-egg issue though in that non-core objects
interested in checkpoint recover need to tell/register their
interest first. Usually this is done in
ModuleType#initialize meaning the crawlCheckpointRecover
notice would have to arrive sometime after the initialize
registration -- a notice too-late in most instances. One
thing to try would be serializing the Set of registered
CrawlStatusListeners; the CrawlController could read this
Set back in when in checkpoint recover mode and send out a
CrawlStatusListener#crawlCheckpointRecover(CrawlController
cc, Checkpoint cp) BEFORE invocation of
ModuleType#initialize/Processor#initialTasks.
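
The "serialize the Set of registered listeners" idea from the last Outstanding item could look roughly like this. It is a sketch only: `RecoverListener`, `NamedListener` and `ListenerRegistry` are hypothetical names, the callback takes a plain checkpoint name instead of the CrawlController and Checkpoint arguments proposed above, and real listeners would need any references back to the controller marked transient.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical listener type; the real CrawlStatusListener has more methods. */
interface RecoverListener extends Serializable {
    void crawlCheckpointRecover(String checkpointName);
}

/** Hypothetical listener that just remembers which checkpoint it was revived from. */
final class NamedListener implements RecoverListener {
    private static final long serialVersionUID = 1L;
    final String name;
    transient String recoveredFrom; // not persisted; set when notified

    NamedListener(String name) {
        this.name = name;
    }

    @Override
    public void crawlCheckpointRecover(String checkpointName) {
        this.recoveredFrom = checkpointName;
    }
}

final class ListenerRegistry {
    /** At checkpoint time: persist the set of registered listeners. */
    static byte[] save(Set<RecoverListener> listeners) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(new LinkedHashSet<>(listeners));
        }
        return bytes.toByteArray();
    }

    /** At recover time: read the set back before any module is initialized. */
    @SuppressWarnings("unchecked")
    static Set<RecoverListener> load(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (Set<RecoverListener>) in.readObject();
        }
    }

    /** Send the recover notice to every revived listener. */
    static void notifyRecover(Set<RecoverListener> listeners, String checkpointName) {
        for (RecoverListener listener : listeners) {
            listener.crawlCheckpointRecover(checkpointName);
        }
    }
}
```

Because the set is read back by the controller itself, the notice can go out before the modules have had a chance to register, which is the ordering problem described above.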

Nice-to-haves:
+ Checkpointing that doesn't require a pause.


Date: 2005-07-07 05:56
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Another related commit.

Testing shows the seed status not making it across (Seems
broken since today).

Write up how it works for user and for developer who wants
to join in checkpointing (and recovering).

Javadoc, comments and cleanup of '[ 1190974 ] Quick resume
without real
recovery'
* src/java/org/archive/crawler/admin/StatisticsTracker.java
Move management of StatisticsTracker checkpointing and recovery to
CrawlController (CC does ST and Bdb environment -- all other modules
have to do their own checkpoint persistence and recovery).
* src/java/org/archive/crawler/checkpoint/CheckpointContext.java
Javadoc. Add a 'r' prefix to checkpoints made in a crawl that's already
been recovered.
* src/java/org/archive/crawler/datamodel/CrawlOrder.java
Mention the (experimental) recovery from checkpoint. Later add more
exposition on what checkpointing's about.
* src/java/org/archive/crawler/framework/CrawlController.java
Javadoc and comments around checkpointing. Cleanup.
Move in here
management of StatisticsTracker checkpointing and recovery.
* src/java/org/archive/util/FileUtils.java
(copyFiles): Take a Set instead of a List.


Date: 2005-07-07 01:09
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

More on this. See commit below.

All needs testing now.

May need a 'quick checkpoint', a checkpoint that doesn't do
bdb clean and perhaps relies on something else to do the bdb
file copy (20 minutes to checkpoint a crawl of 1000 bdb
files and 1.2 million urls crawled -- using bdb alreadyseen).

More toward '[ 1190974 ] Quick resume without real recovery'
Added serializing/deserializing StatisticsTracker. Made
checkpointing instead
be a CrawlStatusListener method (Removed interface
Checkpointable). Have UI show we're running checkpointing.
Moved checkpoint utility methods to CheckpointContext from
Checkpoint (Let Checkpoint be just for describing completed
checkpoints -- previous I was using this object during
checkpointing. Rather, use CheckpointContext for this).
* src/java/org/archive/crawler/admin/CrawlJob.java
New crawl job state 'Checkpointing'.
* src/java/org/archive/crawler/admin/CrawlJobHandler.java
Let checkpointing be legit state. On crawlCheckpoint,
set our state
to be checkpointing.
* src/java/org/archive/crawler/admin/SeedRecord.java
Make serializable.
* src/java/org/archive/crawler/admin/StatisticsTracker.java
Removed references to Checkpointable. It's been removed.
Set timers in crawlStarted.
Removed unthrown exceptions. Javadoc fixups.
* src/java/org/archive/crawler/checkpoint/Checkpoint.java
Move utility to CheckpointContext. Removed storing of
arc writer serial number (Let that be done by arc writer
processor when it checkpoints).
(writeObjectToFile, readObjectFromFile,
getBdbSubDirectory, getJeLogsFilter,
getClassCheckpointFilename): Removed. Equivs. added
to CheckpointContext.
* src/java/org/archive/crawler/checkpoint/CheckpointContext.java
Utility to aid checkpointing (Mostly moved from Checkpoint).
(getBdbSubDirectory, getJeLogsFilter,
getClassCheckpointFile, getClassCheckpointFilename,
writeObjectToFile, readObjectFromFile): Added.
* src/java/org/archive/crawler/datamodel/BigMapFactory.java
Removed unused checkpoint param.
* src/java/org/archive/crawler/event/CrawlStatusListener.java
* src/java/org/archive/crawler/fetcher/FetchHTTP.java
Added crawlCheckpoint.
* src/java/org/archive/crawler/framework/AbstractTracker.java
Formatting.
* src/java/org/archive/crawler/framework/CrawlController.java
Remove references to Checkpointable.
Added revival of StatisticsTracker.
Javadoc.
(sendCheckpointEvent): Added.
* src/java/org/archive/crawler/frontier/BdbFrontier.java
No need to register with CrawlController fact that we're
checkpointable because it's gone away.
(getCheckpointDbName): Removed. Unused.
* src/java/org/archive/crawler/frontier/HostQueuesFrontier.java
(crawlCheckpoint): Added.
* src/java/org/archive/crawler/settings/ComplexType.java
Make parent and settings transient. They'll be restored if
on resuscitation, the object revived is added to
appropriate place in the
CrawlOrder.
* src/java/org/archive/crawler/settings/CrawlerSettingsTest.java
Added test for TextField serialization.
* src/java/org/archive/crawler/settings/ModuleType.java
Formatting.
* src/java/org/archive/crawler/settings/TextField.java
Made it serializable.
* src/java/org/archive/crawler/writer/ARCWriterProcessor.java
Write out ARCWriter serial number on checkpointing.
Read it back in
if checkpoint recover.



Date: 2005-07-06 17:31
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Yet another commit related to this issue. Mostly cleanup.
Ran into serialization wall trying to serialize out
StatisticsTracker. That's the next obstacle to tackle:

* src/conf/heritrix.properties
Removed duplicate entries.
Removed section on bdb frontier and on cached big map
(Let section on
big map factory cover the latter and the former is only
logging levels
mentioned earlier in log).
* src/java/org/archive/crawler/admin/CrawlJob.java
(configureForResume): Unused. Removed.
* src/java/org/archive/crawler/admin/CrawlJobHandler.java
(resumeJobFromCheckpoint): Removed. Unused.
* src/java/org/archive/crawler/admin/StatisticsTracker.java
Started to add checkpointing but ran into serialization
issues.
Not yet resolved.
* src/java/org/archive/crawler/checkpoint/Checkpoint.java
Made serializable.
(writeObjectPlusToFile, readObjectPlusFromFile): Renamed.
(readObject): Added custom that checks validity.
(getClassCheckpointFilename): Added here from
CrawlController.
* src/java/org/archive/crawler/checkpoint/CheckpointContext.java
Added serial version id. Changed param name.
* src/java/org/archive/crawler/framework/AbstractTracker.java
Made crawlcontroller transient.
* src/java/org/archive/crawler/framework/CrawlController.java
Removed unused imports.
Moved getClassCheckpointFilename from here to Checkpoint.
Use new names of Checkpoint object
serializers/deserializers.
* src/java/org/archive/crawler/frontier/BdbFrontier.java
Added FINER logging.


Date: 2005-07-06 02:40
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Another commit related to this issue:

More on '[ 1190974 ] Quick resume without real recovery'.
Checkpointing and then recover from checkpointing basically
working. Needs
loads of testing which I'm doing next. Currently you ask
for checkpointing
by doing '/console/action.jsp?action=checkpoint' or by
calling JMX checkpoint.
For now, to recover, pass path to the checkpoint directory
in the recover
path setting.
* src/conf/heritrix.properties
Remove Heritrix.recover property. Flag is instead
whether the recover
log parameter is a directory. If it is, we assume
checkpoint recover.
(Needs better mechanism).
* src/java/org/archive/crawler/Heritrix.java
Remove sympathetic isCheckpointRecover. No longer
needed. Instead
we test if recover log is a directory.
* src/java/org/archive/crawler/admin/CrawlJob.java
Allow checkpointing via JMX.
* src/java/org/archive/crawler/admin/CrawlJobHandler.java
Checkpointing needs cleaning up here in CrawlJobHandler.
Assumes falsely
that CrawlController is deserialized when recovering
from checkpoint.
* src/java/org/archive/crawler/admin/StatisticsTracker.java
removed line.
* src/java/org/archive/crawler/checkpoint/Checkpoint.java
Added utility method that returns filename filter for
bdb je log files.
Added carrying of arc writer serial number here in the
checkpoint
(Too awkward serializing/deserializing ARCWriter*
classes). Foresee
Checkpoint carrying other state info.
* src/java/org/archive/crawler/checkpoint/CheckpointContext.java
Made more data members transient.
(noteResumed) Renamed as checkpointRecoverFixup.
Add a 'rcvr-' prefix to checkpoint dirs made from
checkpoints.
* src/java/org/archive/crawler/checkpoint/Checkpointable.java
(recover): Removed. Recover will be a task distributed
over components
-- at least for now.
* src/java/org/archive/crawler/datamodel/BigMapFactory.java
Remove recover. Recover is done when we reopen bigmaps atop
a bdb db that hasn't been cleared.
* src/java/org/archive/crawler/datamodel/CachedBdbBigMap.java
* src/java/org/archive/crawler/event/CrawlStatusListener.java
Added empty line.
* src/java/org/archive/crawler/datamodel/CrawlOrder.java
Added TODO to help message for ATTR_RECOVER_PATH.
* src/java/org/archive/crawler/framework/CrawlController.java
Write out CheckpointContext and Checkpoint when
checkpointing.
Copy bdblogs out to checkpoint dir and then back under
the state dir
when recovering (Might not be the best idea when loads
of logs).
(checkpointRecover): New data member. Set if we're in
recover mode.
(setupCheckpointRecover, restoreCheckpointingContext,
getClassCheckpointName, runFrontierRecover,
getCheckpointRecover,
isCheckpointRecover): Added.
(readFrom): Removed.
* src/java/org/archive/crawler/framework/Scoper.java
Spacing.
* src/java/org/archive/crawler/frontier/BdbFrontier.java
Implement checkpointable. If we are in checkpoint
recover, don't
truncate bdbs before opening and resurrect queue state.
(initialize): Register that we are checkpointable.
(checkpoint): Added.
*
src/java/org/archive/crawler/frontier/BdbMultipleWorkQueues.java
* src/java/org/archive/crawler/util/BdbUriUniqFilter.java
Don't truncate db if asked to recycle content.
* src/java/org/archive/crawler/writer/ARCWriterProcessor.java
Mark metadata as transient.
* src/java/org/archive/io/arc/ARCWriter.java
Added accessors/setters for serialno.
* src/java/org/archive/util/CachedBdbMap.java
Formatting. Also sometimes db is empty.
* src/java/org/archive/util/FileUtils.java
Added workaround file copy for when FileChannel#writeTo
fails.
(Needs investigation).
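
The FileUtils workaround mentioned at the end of this commit can be pictured as a channel copy with a stream-copy fallback. This sketch assumes the channel call meant is `FileChannel#transferTo`, since java.nio's FileChannel has no `writeTo` method; it is not the actual Heritrix FileUtils code and the class name is hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper: try a channel transfer, fall back to a buffered copy. */
final class CopyWithFallback {
    static void copy(Path src, Path dest) throws IOException {
        try (FileChannel in = FileChannel.open(src, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(dest, StandardOpenOption.CREATE,
                     StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
            long pos = 0;
            long size = in.size();
            while (pos < size) {
                // transferTo may move fewer bytes than asked, so loop until done.
                long moved = in.transferTo(pos, size - pos, out);
                if (moved <= 0) {
                    throw new IOException("transferTo made no progress");
                }
                pos += moved;
            }
            return;
        } catch (IOException channelFailure) {
            // Fall through to the plain stream copy below.
        }
        try (InputStream in = Files.newInputStream(src);
             OutputStream out = Files.newOutputStream(dest)) { // truncates dest
            byte[] buf = new byte[64 * 1024];
            for (int n; (n = in.read(buf)) != -1; ) {
                out.write(buf, 0, n);
            }
        }
    }
}
```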



Date: 2005-06-25 02:26
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

First cut at this feature just committed. See message below.

Works for simple crawl. Need to test for large crawl. Need
to see how long persisting mem to disk takes and how long
writing new dbs to save queue state takes.

Assumption: Moved in-process content into ready queues.

Need to better integrate w/ current 'recover' option also
need to look at stats and make sure they're coherent after
restart (Some stats currently are continued, others will be
starting fresh).

Currently switchable by setting isBdbRecover in
heritrix.properties. To make it work, set the 'state'
option pointing at the bdb directory you want to recover from.

First cut at '[ 1190974 ] Quick resume without real recovery'
Basically working. Needs more testing.
* src/conf/heritrix.properties
Added commented out class logging levels set to info.
Added temporary flag, isCheckpointing, so can turn
on/off checkpointing
while testing..
* src/java/org/archive/crawler/Heritrix.java
(isCheckpointing): Added.
* src/java/org/archive/crawler/datamodel/CachedBdbBigMap.java
Add switchable syncing of BigMaps with disk instead of
clearing the
map content before closing.
* src/java/org/archive/crawler/frontier/AbstractFrontier.java
(crawlEnded): Added logging.
* src/java/org/archive/crawler/frontier/BdbFrontier.java
Add switchable writing of all queue state info to bdb
and then
rereading from bdb on initialization.
(Putter): Interface that allows me to treat a Set, a
LinkedQueue, and
a Map all the same.
(restoreLinkedQueue, put,
saveStringKeys,saveStringKeysToBdb): Added.
*
src/java/org/archive/crawler/frontier/BdbMultipleWorkQueues.java
Don't truncate db if checkpointing.
(getCount): Added.
* src/java/org/archive/crawler/frontier/WorkQueueFrontier.java
Added logging of allqueues content.
Null out queue instances (We weren't nulling all).
* src/java/org/archive/crawler/util/BdbUriUniqFilter.java
Added logging.
src/java/org/archive/util/CachedBdbMap.java
Used System.currentTimeMillis rather than construct a
date object.


Date: 2005-06-22 16:33
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Upped priority. Assigned to myself.


Date: 2005-06-21 13:41
Sender: kristinn_sig (Project Admin)

Logged In: YES
user_id=892643

What Christian describes here is (more or less) what the
ARFrontier does. When created it opens up a specially named
database. If that database already exists (meaning that the
state directory was pointed at an old crawl) it will iterate
through the list of queues and open each of them in turn. This
returns the Frontier to its original state (minus progress
statistics).

A similar approach would work for the BdbFrontier. It would
impose the need to maintain some state information in a Bdb,
namely what queues exist (this may be more complex in the
BdbFrontier than the ARFrontier due to how BdbWorkQueues are
constructed). The stats db could even include progress info.

If done correctly, there would be no need for
"Checkpointing" since Heritrix could rely on Bdb's crash
recovery mechanisms.

I would STRONGLY urge that this approach be taken to cope
with crash recovery and allowing crawls to be fully
suspended and resumed.


Date: 2005-04-28 15:56
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Yes. We'll put proposal for checkpointing up on the list
for discussion (Up to this we've mostly shuttled about
proposals amongst ourselves).

Bdb has its own notion of checkpointing. Yes, I'd imagine
that while it's running its checkpointing, the bdb db is
unavailable but there'd be benefits checkpointing bdb on an
interval (One minor one would be a checkpointed bdb db will
go much faster. My understanding also has it that
checkpointing gives bdb db a chance to discard old log
entries -- shrink its state). We should try out both options.

Thanks for comments C.


Date: 2005-04-28 09:02
Sender: ck-heritrix

Logged In: YES
user_id=1220421

Yes and no. Checkpointing is one solution to the problem.
I just thought about a simple re-opening of the bdb databases plus
parsing the last few lines of progress-statistics.
Checkpointing instead would probably require bdb to keep large
transactions during a checkpoint interval, eventually slowing things
down, wouldn't it?

We should discuss details on how to implement such recovery
functionality (reopening bdb and/or checkpointing functionality as
mentioned in RFE [ 1020778 ]) on the mailing list.



Date: 2005-04-27 15:07
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Hey Christian:

Would the feature we call 'checkpointing' describe what the
above is asking for? Checkpointing is a facility whereby
Heritrix on a configurable period serializes all of its
state in a manner that facilitates easy crawl resumption.
W/ checkpointing in place, recovering from a crash would
only require replay from the last checkpoint rather than
replay from crawl start (The amount of replay should be
relatively little when much of the crawl state is kept out
in bdb). We intend adding checkpointing after 1.4.0.



Changes ( 10 )

Field              Old Value                           Date              By
status_id          Open                                2005-12-02 17:29  stack-sf
close_date         -                                   2005-12-02 17:29  stack-sf
artifact_group_id  None                                2005-09-23 22:10  gojomo
artifact_group_id  1.6.0                               2005-09-23 20:58  gojomo
summary            Quick resume without real recovery  2005-09-23 20:58  gojomo
artifact_group_id  None                                2005-09-23 20:53  gojomo
priority           7                                   2005-09-23 20:40  gojomo
assigned_to        stack-sf                            2005-09-14 01:09  stack-sf
priority           5                                   2005-06-22 16:33  stack-sf
assigned_to        nobody                              2005-06-22 16:33  stack-sf