From: "bergmark_d" <bergmark@C...>
Date: Sun Jan 16, 2005 12:29 pm
Subject: Postselector
Hi, all -- One nice feature of Mercator was the ability for the CrawlURI (docBundle in Mercator) to carry history along with it. Thus you could not only make the crawl uri pass information from one processor to the next, but a parent URI could pass information to child URIs.
For adaptive focused crawling, it is very helpful to be able to pass miscellaneous information from parent pages to children.
In Heritrix, I can see where attributes could be copied into child URIs -- in the handleLinkCollection method of the Postselector class, for example, when the CandidateURI is formed. So I would like to extend Postselector and override just this one method, but it is private.
How do you all decide which methods to make protected and which private? Maybe it would be nice to allow the application programmer to add attributes to the CandidateURI before it goes onto the frontier, perhaps by some callback or by factoring out a protected, overridable method that decorates the CandidateURI.
Or perhaps there is some better way to add historical information to candidate URIs before the parent CrawlURI is lost forever and before the child CandidateURIs are added to the frontier?
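To make the "protected, overridable method" idea concrete, here is a rough sketch of the shape I have in mind. It does not use the real Heritrix classes: SketchUri, SketchPostselector, HistoryPostselector, and the decorateCandidate hook are all hypothetical stand-ins, and the attribute keys are made up. The real handleLinkCollection presumably does more (and schedules onto the frontier rather than returning a list), so treat this only as an illustration of the hook.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for CrawlURI / CandidateURI; NOT the Heritrix class.
class SketchUri {
    final String uri;
    final Map<String, Object> attributes = new HashMap<>();

    SketchUri(String uri) {
        this.uri = uri;
    }
}

// Hypothetical, much simplified stand-in for Postselector.
class SketchPostselector {
    // Forms one candidate per outlink. Returns the candidates here; a real
    // processor would hand them to the frontier instead (assumption).
    final List<SketchUri> handleLinkCollection(SketchUri parent, List<String> outlinks) {
        List<SketchUri> candidates = new ArrayList<>();
        for (String link : outlinks) {
            SketchUri candidate = new SketchUri(link);
            // The proposed hook: called once per child, while the parent is still available.
            decorateCandidate(parent, candidate);
            candidates.add(candidate);
        }
        return candidates;
    }

    // Default does nothing, so existing behaviour is unchanged.
    protected void decorateCandidate(SketchUri parent, SketchUri candidate) {
    }
}

// What an application programmer would write: override only the hook.
class HistoryPostselector extends SketchPostselector {
    @Override
    protected void decorateCandidate(SketchUri parent, SketchUri candidate) {
        // Illustrative attribute keys, not Heritrix constants.
        candidate.attributes.put("parent-uri", parent.uri);
        Object relevance = parent.attributes.get("relevance");
        if (relevance != null) {
            candidate.attributes.put("parent-relevance", relevance);
        }
    }
}
```

With something like this, the link-handling loop stays in one place and a subclass only supplies the few lines that copy history from parent to child.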
Thanks for any thoughts/info before I go copying humongous swatches of code out of the Postselector into my own custom Postselector.
D.