Heritrix: Internet Archive Web Crawler

Tracker: Feature Requests

7 Work with Kris to integrate revisiting frontier - ID: 1119580
Last Update: Comment added ( karl-ia )

Will require changes to Heritrix -- to CrawlURI in
particular.


Submitted: Michael Stack ( stack-sf ) - 2005-02-09 20:14
Priority: 7
Status: Closed
Resolution: None
Assigned: Kristinn Sigurdsson
Category: API
Group: None
Visibility: Public


Comments ( 17 )

Date: 2007-03-14 01:38
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-892 -- please add further
comments at that location.


Date: 2005-04-01 00:38
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Talked w/ Kris this morning. Below is excerpt from yahoo
transcript (Also talked on phone). Discussed containing the
ARF changes inside of WaitEvaluator (Kris explained why
Refinements makes more sense than doing it all in WE).

Later discussed options w/ Gordon. Sent Kris email --
included below -- on our preference: That we not make the
Refinements change. Kris said OK and will tomorrow look
more at WaitEvaluator -- perhaps adding a few.

I then backed out the Refinements changes.

Opened new issue on revisiting passing richer object than
UURI into Refinements: '[ 1174533 ] Pass object richer than
UURI to Refinements.'.

Closing this issue as done (Added note to release notes
mentioning the new frontier).


EMAIL:
Chatted w/ Gordon. We went over the options.

General thought was that this is a bunch of change just
to get a richer object out to a single refinement criterion.
True, it could lead to interesting Refinements usage later
if new Refinements took advantage of the richer objects
available to them, but there has been no call for such a
facility before this particular ARF need.
Other thoughts were that this change is happening too close
to release (our fault for waiting on the integration), and
that because of its fundamental nature, it would be better
to take more time trying the various approaches
(e.g. what are the repercussions of using Refinements in this
fashion and is this appropriate use, should the settings system
be taking CrawlURIs rather than UURIs, should CrawlURI
subclass UURI, etc.).

We also thought that even a single WaitEvaluator, one
that had the same rules for all types, is a significant
addition, sufficient for a first release (I know you
think differently). We also talked about your suggestion of
multiple WEs -- e.g. named WE1, WE2, WE3 -- w/ people adding
them as they need them and configuring each w/ a regex for the
mimetype it is to handle, or custom WEs such as a TextWE, a
GraphicsWE, etc. (these would either have the regex
hardcoded or a default regex adjustable by the user). Either
of these options seemed preferable to adding the Refinements
change. (Gordon is playing w/ settings trying to make it show
multiple types -- you may be able to have a WE that has a
subconfiguration per type to handle --
like filters -- again so you could contain the WE needs w/i
WE rather than have them bleed over into Refinements. More
on this later if you're interested.) True, multiple WEs
don't have the flexibility of the Refinements version of
settings, but they seem to cover most usage scenarios
and contain the ARF w/i its processors.

The patch as is has spiralled all over Heritrix. It could
be made to work but seems ill-considered and smells strongly of
'hack'.
St.Ack
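
Editorial sketch of the "multiple WEs" option described in the email above: each evaluator carries its own mimetype regex and the first match wins. This is a minimal illustration only. The class names, the `initialWaitSeconds` field and the values are hypothetical stand-ins, not the Heritrix WaitEvaluator API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical stand-in for one configured WaitEvaluator; not the Heritrix class. */
final class MimeWaitEvaluator {
    final String name;
    final Pattern contentTypeRegex;  // which mimetypes this evaluator handles
    final long initialWaitSeconds;   // illustrative setting only

    MimeWaitEvaluator(String name, String regex, long initialWaitSeconds) {
        this.name = name;
        this.contentTypeRegex = Pattern.compile(regex);
        this.initialWaitSeconds = initialWaitSeconds;
    }

    boolean handles(String contentType) {
        return contentType != null && contentTypeRegex.matcher(contentType).matches();
    }
}

/** Ordered chain: the first evaluator whose regex matches wins, else the fallback. */
final class WaitEvaluatorChain {
    private final List<MimeWaitEvaluator> evaluators = new ArrayList<>();
    private final MimeWaitEvaluator fallback;

    WaitEvaluatorChain(MimeWaitEvaluator fallback) {
        this.fallback = fallback;
    }

    WaitEvaluatorChain add(MimeWaitEvaluator we) {
        evaluators.add(we);
        return this;
    }

    MimeWaitEvaluator select(String contentType) {
        for (MimeWaitEvaluator we : evaluators) {
            if (we.handles(contentType)) {
                return we;
            }
        }
        return fallback;
    }
}
```

All settings for a mimetype live in one evaluator here, which keeps the ARF configuration inside its processors, at the cost of one evaluator instance per mimetype group.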


YAHOO:
...
kristsi25: Just to clarify, in my books CandidateURI and
CrawlURI are effectively the same thing and both represent a
UURI or rather a URI. I tend to feel we would have been
better off making CrawlURI or CandidateURI subclass UURI but
that's neither here nor there.
kristsi25: So 'CandidateURIs' do exist outside the
Postselector, even if only as CrawlURIs in practice
kristsi25: Most calls (non null) to the canon start with
getting the UURI out of the Candidate or CrawlURI so passing
them straight in seems (at some level) logical.
stackarchiveorg: CandidateURI is a CrawlURI in embryo. It
should have life only to do the postselector scope test. If
it passes it becomes a full CrawlURI.
stackarchiveorg: Are you addressing my WaitEvaluator
question -- why can't it have all the config. it needs
rather than change Refinements?
kristsi25: Re:WaitEvaluator. Look at it logically. We want
to be able to assign, based on content mimetype, different
parameters for each of about 4-5 settings. Just think about
how you can squeeze that into the current configuration
framework.
kristsi25: It is easy to have like, this is initial wait
interval for text and this for pictures and hard code it
like that but that is extremely inflexible (not to mention
annoying if you do not wish to differentiate)
stackarchiveorg: Please explain more so I can better understand?
stackarchiveorg: If it can be done inside in WaitEvaluator
w/o having to have these ugly ripple changes all over, let's
do that.
kristsi25: I'm trying to explain but it is not going well.
stackarchiveorg: Smile
kristsi25: My point is this: Doing it inside the
WaitEvaluator will EITHER mean MUCH less flexibility
(limiting its usefulness) OR be EXTREMELY complex (and ugly)
and probably still not as flexible and most likely BOTH.
kristsi25: Excuse the shouting.
stackarchiveorg: Ok. Tell me more about the usage scenarios
(I turned down the volume over here so shout all you want).
stackarchiveorg: Maybe then I can see why it can't live
inside in WaitEvaluator.
stackarchiveorg: What did you try to do inside WE and it
didn't work?
kristsi25: Ok, the basic premise is this. Text documents
(html) have much higher change rates than say image
documents. Basically the document's file type has a lot to
say about the expected change rate.
stackarchiveorg: yep
kristsi25: So the WaitEvaluator has generic settings,
initial wait, change factors etc.
kristsi25: The idea is that by using refinements with
content type regular expressions you can selectively (based
on your own experience, preferences etc.) set different
values for different mime types. With me so far?
stackarchiveorg: i is
kristsi25: Ok, now can you think of a way that will allow
you full flexibility in deciding that for this mime type I'm
going to overwrite like this and for that mime type I'm
overriding like that, without, in any way, limiting what
mime types you are dealing with?
kristsi25: Basically, you'd need key value pairs for EACH
and EVERY setting, plus a default if none of the reg expr.
keys fit. This gets extremely messy and it can be very hard
to figure out what rules apply to say image/*
kristsi25: I'm not even sure if the key value pair setting
is still in the configuration framework. It was but it
hasn't been used I think.
stackarchiveorg: Its there. But hard part would be you need
key to many values.
stackarchiveorg: A regex key to many values.
kristsi25: Right, and you can't do that. You'd need to
replicate the bl***y key over each setting. Like I said,
very messy.
kristsi25: The refinements were supposed to handle this sort
of thing. And they do, except they didn't have access to the
CrawlURIs.
kristsi25: None of this would be a problem if CandidateURI
subclassed UURI...
kristsi25: (which I've always felt it should)
kristsi25: This is taking too long. Can I call you?
stackarchiveorg: Yep. 415 864 3571
kristsi25: Ok, I just need to "go through channels" takes a
few mins.
stackarchiveorg: no prob.
stackarchiveorg: Having C*URI subclass UURI would
necessitate major rewrite.
stackarchiveorg: Isn't using Refinements going to be messy
too -- how can you tell what waitevaluator settings are in
place for image/jpeg on domain archive.org?
kristsi25: Your phone should be ringing any moment now
kristsi25: (I hope)
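
Editorial sketch of the key-value alternative discussed in the transcript above, where every setting carries its own regex-keyed overrides plus a default. It is a minimal illustration with hypothetical names and values (`RegexKeyedSetting`, `initialWaitSeconds`, `changeFactor`), not the Heritrix settings framework. Note how the same `image/.*` key has to be repeated on each setting, which is the messiness Kris describes.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical model of ONE setting with per-content-type overrides and a default. */
final class RegexKeyedSetting<T> {
    private final Map<Pattern, T> overrides = new LinkedHashMap<>();
    private final T defaultValue;

    RegexKeyedSetting(T defaultValue) {
        this.defaultValue = defaultValue;
    }

    RegexKeyedSetting<T> override(String contentTypeRegex, T value) {
        overrides.put(Pattern.compile(contentTypeRegex), value);
        return this;
    }

    /** First matching regex wins; the default applies if none of the keys fit. */
    T valueFor(String contentType) {
        if (contentType != null) {
            for (Map.Entry<Pattern, T> e : overrides.entrySet()) {
                if (e.getKey().matcher(contentType).matches()) {
                    return e.getValue();
                }
            }
        }
        return defaultValue;
    }
}

/** The regex key is replicated over each and every setting. */
final class KeyValueWaitSettings {
    final RegexKeyedSetting<Long> initialWaitSeconds =
            new RegexKeyedSetting<Long>(86400L).override("image/.*", 604800L);
    final RegexKeyedSetting<Double> changeFactor =
            new RegexKeyedSetting<Double>(1.5).override("image/.*", 1.2);
}
```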


Date: 2005-03-31 10:42
Sender: kristinn_sig (Project Admin)

Logged In: YES
user_id=892643

More on "Make ARFrontier subclass AbstractFrontier"

I've gone through the AbstractFrontier and discovered the
following issues (some trivial, others more complex) which
make it very difficult to have ARFrontier subclass it.

1. IP politeness setting.

The ARFrontier doesn't react very well to IP politeness. This
is because initially, when no IP is known, host-named queues
are created for seed, robots.txt and DNS lookup, THEN IP-named
queues are created for subsequent URIs. The original
host-named queue disappears in a snapshot frontier, but the
ARFrontier keeps it around. Would need to be able to rename
queues. Even that might not be sufficient. So IP-based
politeness is not available for the ARFrontier and won't be
for the foreseeable future.
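
Editorial sketch of the queue-naming problem in item 1: with IP politeness on, the queue key for the same host changes once the DNS lookup has completed, so a frontier that never discards queues ends up with both. The method and class names are hypothetical and only model the behaviour described above.

```java
/** Hypothetical model of how a queue key is chosen; not the Heritrix implementation. */
final class QueueKeys {
    /**
     * @param host         host name taken from the URI
     * @param resolvedIp   IP address if DNS has completed, otherwise null
     * @param ipPoliteness whether queues should be keyed by IP
     */
    static String queueKeyFor(String host, String resolvedIp, boolean ipPoliteness) {
        if (ipPoliteness && resolvedIp != null) {
            return resolvedIp;   // subsequent URIs land in an IP-named queue
        }
        return host;             // seed, robots.txt and DNS lookup land here first
    }
}
```

In a snapshot crawl the host-named queue simply drains and disappears; a revisiting frontier keeps rescheduling the URIs in it, which is why renaming queues would be needed.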


2. Bandwidth limiters

Not as much of an issue; the current ARFrontier does not use
this. Unsure how this would fit.


3. Preference embeds

The ARFrontier currently uses the same default (1) as the
AbstractFrontier, however this should be changed to 0 since
preferencing embeds leads to excessive crawling of images
embedded in frequently changing pages. Preferencing embeds
should of course remain possible, just not the default for
AR. I do not believe that the default value can be overridden
by a subclass.


4. Use of recovery journal

The ARFrontier does not support this feature nor does it have
any need of it. Recovery of existing jobs can be achieved by
creating a new job with the same "state" directory as an
older crawl. This should reopen the BDBs used by the
previous job.


5. Statistics

There are some general issues with the various statistics
counters. These could no doubt be overcome by overriding
certain methods.


6. noteAboutToEmit

This method accepts two parameters CrawlURI and
BdbWorkQueue. The ARFrontier does not use BdbWorkQueues, nor
does it need to do any of what is done in this method. It
could probably just ignore this method. In fact, the very
implementation specific nature of this method indicates to
me that it belongs in BdbFrontier.


There may be further issues, these are just the ones that
came up during a quick look. Ultimately I agree that the
ARFrontier should subclass AbstractFrontier, but I suggest
that this be done post 1.4.0.



Date: 2005-03-31 09:29
Sender: kristinn_sig (Project Admin)

Logged In: YES
user_id=892643

Following up on:

>Made below commit. Refinements taking other UURIs is still
>not right and will back out the change tomorrow unless
>addressed.

I've fixed it. (Patch sent to Stack.)

There are 4 places where we will still have to make a
CandidateURI. One of them is in the tests, two in the
considerAlreadyIncluded methods of Bdb and HostFrontiers.
The BdbFrontier was already creating CrawlURIs so that's
hardly an issue, the HostQueuesFrontier is deprecated so
THATs hardly an issue, and besides, these methods are only
called by the import of recovery logs.

That leaves the creation of CandidateURIs when doing
comparisons against the via. There is simply no CandidateURI
in that situation since the via is a UURI. Changing it to a
CandidateURI was a bigger step than I was willing to take.
This seems to be called with some frequency, but not for
every URI. About 50 occurrences with 2300 URIs discovered
(1500 crawled), crawling the unchanged default profile. That's
something like 1 CandidateURI created for every 50 that are
discovered (ratio confirmed again at 6500 discovered URIs).
I believe this is acceptable, but you may feel differently.
The only way around this is to carry a CandidateURI in the
via. I believe that is more costly (as it would apply to all
discovered CandidateURIs) than this 1 in 50 business.


Date: 2005-03-31 08:35
Sender: kristinn_sig (Project Admin)

Logged In: YES
user_id=892643

In response to:

>On Kris's 4 above, can you not add lists of mimetype regexes
>to WaitEvaluator rather than have the regexes out as
>refinements?

It becomes very complicated or very inflexible that way.
Sure you can have a fixed number of 'types' that each have
different assignments, but there is no flexibility in that
at all. Using the refinements is a much more elegant solution.

I'm working on making the Canon stuff pass CandidateURIs.
Looks like it should be possible. I'll update on this later.


Date: 2005-03-31 04:20
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Made below commit. Refinements taking other UURIs is still
not right and will back out the change tomorrow unless
addressed.

The below commit passes CandidateURIs into refinements which
is better than having to pass CrawlURI but is still not good
because it means CandidateURIs have a life outside of
postprocessing -- makes it harder to do away with them.

But main complaint is that Canonicalization is still using
the settings system passing UURIs; unless we ripple the
CandidateURI change up into the Canonicalization system,
we're making new CandidateURI for each Canonicalization
call. That's a lot of calls. Even if this gets addressed,
ComplexType needs review. There is redundancy in the info
passed when we do a getSettings that takes a host name
string and a candidateuri (The CandidateURI has all the host
info that could ever be needed).

On Kris's 4 above, can you not add lists of mimetype regexes
to WaitEvaluator rather than have the regexes out as
refinements?

Update of
/cvsroot/archive-crawler/ArchiveOpenCrawler/src/java/org/archive/crawler/settings/refinements
In directory
sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv19311/src/java/org/archive/crawler/settings/refinements

Modified Files:
Criteria.java PortnumberCriteria.java Refinement.java
RegularExpressionCriteria.java TimespanCriteria.java
TimespanCriteriaTest.java
Added Files:
ContentTypeRegExprCriteria.java
Log Message:
More on ARF integration. This commit is so Kris has something to look at in
the morning. It restores the passing of a richer object than UURI down into
refinements, only we're passing CandidateURI rather than CrawlURI.

Still needs work. We're doing too many creations of CandidateURI objects
inside the ComplexType#getCandidateURI method (Canonicalization is culprit).
Needs to be fixed before release.
* src/java/org/archive/crawler/datamodel/CrawlServer.java
(getSettings): Takes a CrawlURI instead of UURI.
* src/java/org/archive/crawler/datamodel/CrawlURI.java
Javadoc. Formatting.
* src/java/org/archive/crawler/settings/ComplexType.java
Pass CandidateURI rather than UURI down to settings. Changed Context
to carry CandidateURI instead of UURI.
(getCandidateURI): Added. Can log how often we create a new
CandidateURI (TOO OFTEN).
* src/java/org/archive/crawler/settings/CrawlSettingsSAXHandler.java
* src/java/org/archive/crawler/settings/CrawlSettingsSAXSource.java
Restore the parse for ContentType handler.
* src/java/org/archive/crawler/settings/CrawlerSettings.java
(getParent): Take a CandidateURI.
* src/java/org/archive/crawler/settings/SettingsHandler.java
Take a CandidateURI in getSettings.
* src/java/org/archive/crawler/settings/refinements/ContentTypeRegExprCriteria.java
Readded and take a CandidateURI instead of CrawlURI.
* src/java/org/archive/crawler/settings/refinements/Criteria.java
* src/java/org/archive/crawler/settings/refinements/PortnumberCriteria.java
* src/java/org/archive/crawler/settings/refinements/Refinement.java
* src/java/org/archive/crawler/settings/refinements/RegularExpressionCriteria.java
* src/java/org/archive/crawler/settings/refinements/TimespanCriteria.java
Take a CandidateURI rather than a UURI.
* src/java/org/archive/crawler/settings/refinements/TimespanCriteriaTest.java
Formatting.
* src/webapps/admin/jobs/refinements/criteria.jsp
Add in content type refinement.


Date: 2005-03-30 09:03
Sender: kristinn_sig (Project Admin)

Logged In: YES
user_id=892643

Responding to:

>(1) Ideally, anything that's only of use to some crawl
>scenarios would go into the CrawlURI attribute/keyed 'AList'
>data -- only things necessary for 'all' (or almost all)
>crawls become instance variables. (If we felt the idioms for
>storing things into the AList were stable, even more things
>could go there -- like 'outlinks' which is for now an
>instance variable.) So I'd prefer this 'change-detected
>state' go into the AList.

I believe this DOES apply to all crawls (potentially). That
is to say, all processors should be aware of the possibility
that they are being handed an unchanged document. Writers,
indexers, extractors, evaluators etc. could ALL use this
information (with default behavior on UNKNOWN state). Having
them dig up an AList item that may be used by some (but not
all) crawl strategies isn't going to work. If the state is
in the CrawlURI on the other hand, processors can much more
effectively respect this setting.
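
Editorial sketch of the argument above: if the change-detected state is a first-class field with an UNKNOWN default, every processor can consult it without knowing which crawl strategy set it. The enum, the `FetchedUri` stand-in for CrawlURI and the writer rule are all hypothetical illustrations, not Heritrix code.

```java
/** Hypothetical change-detection states; UNKNOWN is the default for ordinary crawls. */
enum ContentState { UNKNOWN, CHANGED, UNCHANGED }

/** Hypothetical stand-in for CrawlURI carrying the state as an instance variable. */
final class FetchedUri {
    private ContentState contentState = ContentState.UNKNOWN;

    ContentState getContentState() {
        return contentState;
    }

    void setContentState(ContentState state) {
        this.contentState = state;
    }
}

/** Example of a processor (here a writer) respecting the state. */
final class WriterSketch {
    /** UNKNOWN falls back to default behaviour: write as if the document may have changed. */
    static boolean shouldWrite(FetchedUri curi) {
        return curi.getContentState() != ContentState.UNCHANGED;
    }
}
```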


Date: 2005-03-30 08:33
Sender: kristinn_sig (Project Admin)

Logged In: YES
user_id=892643

Ok, replying in order to Stack's latest comments:

1. Make ARFrontier subclass AbstractFrontier or BdbFrontier.

Answer: BdbFrontier, never. Too incompatible. I looked at
the AbstractFrontier way back in October and decided that it
was (at that time) unsuitable for this purpose. This MAY
have changed. I have a feeling though, that at least SOME
modifications would be needed for the AbstractFrontier for
this to work. I get the feeling that it's designed a little
too much with the BdbFrontier in mind.

The truth is however that an incremental or continuous
(whichever term you prefer) frontier handles things (even
statistics) differently. Which is why I made a separate
frontier rather than trying to squeeze everything into the
BdbFrontier with all the hassle that would entail.

Maintaining the code shouldn't be a huge issue unless there
is an API change to the Frontier however...


2. The CandidateURI/CrawlURI changes.

It works as it is, so I'm happy enough. I still think it's
messier than it needs to be, but much MUCH better than the
quick fix I'd made.


3. CrawlURIs to refinements

The short of it is, this is NEEDED. Just want to make that
clear. Being able to refine settings based on content type
is essential.

Ultimately, at some point in the settings we are going to
have to (under some circumstances) create a CrawlURI from a
UURI if this is to work. The simple truth is that there may
be no CrawlURI behind it. This is very RARE in practice.
Most calls to the settings framework come up via a path that
provides a CrawlURI.



>>Third, CrawlURIs to refinements. The fix wasn't all that
>>pretty, but I didn't find any problem with it (other then
>>the ones I handled).
>>
>Calling the getParent method (
>http://crawler.archive.org/xref/org/archive/crawler/settings/CrawlerSettings.html#315),
>from the core getDataContainerRecursive method
>(http://crawler.archive.org/xref/org/archive/crawler/settings/ComplexType.html#256),
>you were doing a 'new CrawlURI' just to wrap the passed in
>UURI (Integrating there were other places I had to do this).
>This is ugly. To do it right, looked like more refactoring
>needed -- changing Context to take CrawlURI rather than UURI.

Trust me, I looked into it. It just spiraled out of control
and you STILL needed at some point to wrap a URI in a CrawlURI!

Let's look at this in detail. There are 2 places where
(potentially) new CrawlURIs are created:

In ComplexType.getSettingsFromObject(Object o)
context.uri = (o instanceof CandidateURI) ?
CrawlURI.from((CandidateURI)o,0) : new CrawlURI((UURI)o);

Clearly the object is actually a CrawlURI most of the time,
making this cheap.


The other occurrence is in the Context constructor that
accepts a UURI (there is an empty constructor as well). A
CrawlURI is constructed from the supplied UURI. So, if we
can change the constructor to simply accept a CrawlURI, we
should be fine?

I looked at all the instances that call that constructor.
Most pass it a null value (this is fine, CrawlURI can be
null as well). The only ones that pass a value are:

AttributeIterator(Object ctxt)

Using the following code:
Context c = new Context(context.settings, context.uri);

This is NO problem since (with my changes) context.uri
becomes a CrawlURI

Then there is
getDataContainerRecursive(Context context, String key)

Using the code:
Context c = new Context(context.settings, context.uri);

Again, it's using a context.uri, which should now be a CrawlURI


So, no problem. We just make Context accept a CrawlURI. That
should remove all the ugliness. I suppose I should have done
this in the first place, but somehow this one got by me.
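
Editorial sketch of the proposal in item 3: the settings entry point converts whatever it is handed into a CrawlURI exactly once, and the Context only ever accepts a CrawlURI (possibly null). The `Sketch*` classes are hypothetical stand-ins for UURI, CandidateURI, CrawlURI and Context. The sketch assumes that converting something that already is a CrawlURI returns the same instance, which is what makes the common case cheap.

```java
/** Hypothetical stand-in for UURI. */
class SketchUuri {
    final String uri;
    SketchUuri(String uri) { this.uri = uri; }
}

/** Hypothetical stand-in for CandidateURI. */
class SketchCandidateUri {
    final SketchUuri uuri;
    SketchCandidateUri(SketchUuri uuri) { this.uuri = uuri; }
}

/** Hypothetical stand-in for CrawlURI. */
class SketchCrawlUri extends SketchCandidateUri {
    String contentType; // null until the document has been fetched

    SketchCrawlUri(SketchUuri uuri) { super(uuri); }

    /** Assumption: an existing CrawlURI is returned as-is, so no new object is created. */
    static SketchCrawlUri from(SketchCandidateUri c) {
        return (c instanceof SketchCrawlUri) ? (SketchCrawlUri) c : new SketchCrawlUri(c.uuri);
    }
}

/** Hypothetical stand-in for the settings Context, now taking a CrawlURI only. */
final class SketchContext {
    final SketchCrawlUri uri; // may be null, as before

    SketchContext(SketchCrawlUri uri) { this.uri = uri; }

    /** Single place where a bare UURI still has to be wrapped. */
    static SketchContext forObject(Object o) {
        if (o == null) {
            return new SketchContext(null);
        }
        if (o instanceof SketchCandidateUri) {
            return new SketchContext(SketchCrawlUri.from((SketchCandidateUri) o));
        }
        return new SketchContext(new SketchCrawlUri((SketchUuri) o));
    }
}
```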


4. What if there is no content type?

The ContentTypeCriteria is designed to skip (not match) any null
CrawlURI as well as CrawlURIs that don't have a content
type. This is acceptable. If you are using this criterion, you
must accept that it will only apply in the post fetch
portion of the processing chain. This is sufficient for
refining settings on post fetch processors (which is exactly
what I need).
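
Editorial sketch of the behaviour described in item 4: a content type criterion that simply does not match when there is no CrawlURI or no content type yet, so the refinement only takes effect post fetch. The class and method names are hypothetical and do not reproduce the Heritrix Criteria interface.

```java
import java.util.regex.Pattern;

/** Hypothetical content-type criterion; names are illustrative only. */
final class ContentTypeCriteriaSketch {
    private final Pattern regex;

    ContentTypeCriteriaSketch(String contentTypeRegex) {
        this.regex = Pattern.compile(contentTypeRegex);
    }

    /**
     * @param contentType the fetched document's content type, or null when there is
     *                    no CrawlURI or the URI has not been fetched yet
     * @return true only when a content type is known and matches the regex
     */
    boolean applies(String contentType) {
        return contentType != null && regex.matcher(contentType).matches();
    }
}
```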


5. Why is this important?

The WaitEvaluator has a number of settings to determine the
adaptive revisiting behavior. A core requirement is that it
can be tuned differently based on file types (images are far
less likely to change than html files etc.). The most
flexible way of achieving this (by far) is to use the
refinements with a content type criterion.


6. The proposed alternative.

Never mind, what I outlined above should remove the ugliness
from my fix and make it acceptable.




Date: 2005-03-29 17:48
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Response:

Kristinn Sigurðsson wrote:

>Hey Michael,
>
>I've been away on holiday (we get a few days off for Easter
>here).
>
Figured. Of course, this is when I get the time to work on
integration.

>
>First, there is a bunch of problems with the modifications
>you made for the integration. Mostly simple stuff (the load
>seeds now uses schedule rather than innerSchedule which
>causes none of the seeds to be queued since the batch is
>never flushed) and your removal of the
>discardUnneededCrawlURIInfo() has left the Frontier in a
>state where it doesn't function. (Each 'disposition' method,
>including reschedule, needs to invoke the processingCleanup
>now, instead of the old method). I've fixed this stuff and
>committed it.
>
Thanks for doing this. I removed your
discardUnneededCrawlURIInfo because it was an ARF
specialization of CrawlURI#processingCleanup. Makes sense I
should have replaced the removes with calls to
processingCleanup. The seeds stuff was done at end of day
to fix broken build. Gordon committed refactoring of seeds
at same time as my ARF commit so there was some syncing-up
needed. Thanks for fixing.

Any chance of you getting ARF atop AbstractFrontier at least
-- or even on top of BdbFrontier -- any time soon? (Yeah, I
know that you have different queueing, etc., but there is a
bunch of duplicated code; subclassing AF or BF would make
maintenance easier).

>
>Second, the CandidateURI/CrawlURI changes.
>
>I still prefer the approach I outlined awhile back (recapped
>here)
>
See the issue for discussion of why the current approach.

>...
>
>The addAlistPersistentMember() method seems messy to me. It
>just makes sense that you decide if an object is to be
>persistent or transient when you set it in the Alist, not at
>some other arbitrary time. In fact, force people to decide.
>If Heritrix is to support iterative crawling, this is a
>vital issue.
>
addAlistPersistentMember() is done up front, before crawler
goes to work, in constructor/initialization code; not an
'arbitrary' location.

>It DOES work though. The current approach puts
>responsibility on the CrawlURI to cleanup/reset all its
>member variables and on each module to register the items it
>needs to be persistent. It just feels messy.
>
Yeah. Seems to work. Sorry you think it messy. For sure,
ARF is 'less' integrated when all is kept in alist rather
than as data members in CrawlURI but I think this ok for now
-- till we get more experience w/ revisiting.

(Don't you think it an improvement that we're encapsulating
cleanup inside CrawlURI?)

>
>
>Third, CrawlURIs to refinements. The fix wasn't all that
>pretty, but I didn't find any problem with it (other than
>the ones I handled).
>
Calling the getParent method (
http://crawler.archive.org/xref/org/archive/crawler/settings/CrawlerSettings.html#315),
from the core getDataContainerRecursive method
(http://crawler.archive.org/xref/org/archive/crawler/settings/ComplexType.html#256),
you were doing a 'new CrawlURI' just to wrap the passed in
UURI (Integrating there were other places I had to do this).
This is ugly. To do it right, looked like more refactoring
needed -- changing Context to take CrawlURI rather than UURI.



>
>It IS needed to be able to discriminate in the settings
>based on a document's content type! I would classify this as
>a key feature. It can be achieved (poorly) by using reg.expr
>on the URI 'file endings', but I would very much like to
>avoid that.
>

Can you tell me more about this feature -- how it's used --
and what the story is when settings are checked prefetch
when there is no mimetype available? The criteria returns a
false. Means, refinement doesn't apply generally. This is OK?

>
>A possible alternative (that I don't like too much myself
>and would require a good deal of effort but might work) is
>to make the CandidateURI a subclass of UURI. This way the
>ContentType criteria can simply do an instanceof comparison
>to see if they've actually been handed a CrawlURI.
>
Might be an idea but not so close to a release.

>
>It _is_ a bit messy, the way it is, but it felt like the
>simplest solution. Creating 'dummy' CrawlURIs from UURIs is
>a simple enough action, maybe it can be simplified. I'm
>going to dig through this again and see if there is another
>way of achieving this.
>
Yes. Please take another look. I need some help here.

St.Ack



Date: 2005-03-29 09:36
Sender: kristinn_sig (Project Admin)

Logged In: YES
user_id=892643

First, there is a bunch of problems with the modifications
you made for the integration. Mostly simple stuff (the load
seeds now uses schedule rather than innerSchedule which
causes none of the seeds to be queued since the batch is
never flushed) and your removal of the
discardUnneededCrawlURIInfo() has left the Frontier in a
state where it doesn't function. (Each 'disposition' method,
including reschedule, needs to invoke the processingCleanup
now, instead of the old method). I've fixed this stuff and
committed it. The ARFrontier should now work as intended. I'm
quite frankly astonished that you reported that "ARF can do
a basic crawl"?? Well, it can now :smile:


Second, the CandidateURI/CrawlURI changes.

I still prefer the approach I outlined awhile back (recapped
here)

---
setObject(key, value, boolean transient)
{
    if(transient == false && (value instanceof Serializable == false)){
        //throw exception
        //or simply put it in the transient section?
    }
    // put in transient or permanent AList, making sure the key remains
    // unique over both of them
}

We would then also provide the following 'convenience method':

setObject(key, Serializable value){
    setObject(key, value, false);
}

setString, setInt etc. would be handled in the same manner,
except the convenience method would accept the proper type, rather than
Serializable (they are after all Serializable).


Then there would be the getters:

getObject(key){
    // Check persistent AList, if exists return element
    // Check transient AList, if exists return element
    // else throw NoSuchElementException probably
}

getString, getInt etc. would behave in the same way.

This should cover everything.
---
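
Editorial sketch of the proposal recapped above, written as compilable Java. It is one possible reading, not code from Heritrix: the class name `TwoTierAList`, the choice to throw rather than silently demote a non-Serializable value, and the `clearTransient` helper are assumptions. The flag is called `isTransient` because `transient` is a reserved word in Java.

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;
import java.util.NoSuchElementException;

/** Hypothetical AList with separate persistent and transient sections. */
final class TwoTierAList {
    private final Map<String, Object> persistent = new HashMap<>();
    private final Map<String, Object> transientItems = new HashMap<>();

    /** Caller must decide up front whether the value survives between processings. */
    void setObject(String key, Object value, boolean isTransient) {
        if (!isTransient && !(value instanceof Serializable)) {
            // Assumption: reject, rather than quietly putting it in the transient section.
            throw new IllegalArgumentException("Persistent value must be Serializable: " + key);
        }
        // Keep the key unique over both sections.
        persistent.remove(key);
        transientItems.remove(key);
        (isTransient ? transientItems : persistent).put(key, value);
    }

    /** Convenience method: Serializable values default to persistent. */
    void setObject(String key, Serializable value) {
        setObject(key, value, false);
    }

    Object getObject(String key) {
        if (persistent.containsKey(key)) {
            return persistent.get(key);
        }
        if (transientItems.containsKey(key)) {
            return transientItems.get(key);
        }
        throw new NoSuchElementException(key);
    }

    /** Would be called when a processing round ends; only transient items are dropped. */
    void clearTransient() {
        transientItems.clear();
    }
}
```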

The addAlistPersistentMember() method seems messy to me. It
just makes sense that you decide if an object is to be
persistent or transient when you set it in the Alist, not at
some other arbitrary time. In fact, force people to decide.
If Heritrix is to support iterative crawling, this is a
vital issue.

It DOES work though. The current approach puts
responsibility on the CrawlURI to cleanup/reset all its
member variables and on each module to register the items it
needs to be persistent. It just feels messy.
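
Editorial sketch of the current approach described in the paragraph above: modules register the keys they need kept, and cleanup removes everything else from the AList. Only the method names `addAlistPersistentMember` and `processingCleanup` come from this thread. The class, the use of a plain map and the key names are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for a CrawlURI's AList handling. */
final class AListCleanupSketch {
    // Keys registered up front, e.g. in module constructor/initialization code.
    private static final Set<String> persistentKeys = new HashSet<>();

    private final Map<String, Object> alist = new HashMap<>();

    static void addAlistPersistentMember(String key) {
        persistentKeys.add(key);
    }

    void put(String key, Object value) {
        alist.put(key, value);
    }

    boolean containsKey(String key) {
        return alist.containsKey(key);
    }

    /** Remove everything from the AList except the registered persistent items. */
    void processingCleanup() {
        alist.keySet().retainAll(persistentKeys);
    }
}
```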



Third, CrawlURIs to refinements. The fix wasn't all that
pretty, but I didn't find any problem with it (other than
the ones I handled).

It IS needed to be able to discriminate in the settings
based on a document's content type! I would classify this as
a key feature. It can be achieved (poorly) by using reg.expr
on the URI 'file endings', but I would very much like to
avoid that.

A possible alternative (that I don't like too much myself
and would require a good deal of effort but might work) is
to make the CandidateURI a subclass of UURI. This way the
ContentType criteria can simply do an instanceof comparison
to see if they've actually been handed a CrawlURI.

It _is_ a bit messy, the way it is, but it felt like the
simplest solution. Creating 'dummy' CrawlURIs from UURIs is
a simple enough action, maybe it can be simplified. I'm
going to dig through this again and see if there is another
way of achieving this.

This really needs to be dealt with!


Date: 2005-03-29 03:17
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Decided to revert Refinements taking CrawlURIs. Went back to
their taking UURI. Did this because having things in state
of flux makes me antsy. Also removed the ContentType
Criteria, the one Criteria that needed a CrawlURI to do its
job. We can add back when we have cleaner way of getting
CrawlURIs into refinements.

Below is commit:

More on '[ 1119580 ] Work with Kris to integrate revisiting frontier'.
Reverting Refinements taking CrawlURIs until fully worked through.
Meantime removed the ContentType Criteria that relies on its
getting a CrawlURI.


* src/java/org/archive/crawler/settings/ComplexType.java
Pass a UURI to getSettings rather than a CrawlURI.
* src/java/org/archive/crawler/settings/CrawlSettingsSAXHandler.java
* src/java/org/archive/crawler/settings/CrawlSettingsSAXSource.java
Removed check for ContentType Refinements.
* src/java/org/archive/crawler/settings/SettingsHandler.java
Removed the blatantly dumb getSettings that took a UURI and did a
new CrawlURI so it could pass getSettings a CrawlURI.
* src/java/org/archive/crawler/settings/refinements/Criteria.java
* src/java/org/archive/crawler/settings/refinements/PortnumberCriteria.java
* src/java/org/archive/crawler/settings/refinements/Refinement.java
* src/java/org/archive/crawler/settings/refinements/TimespanCriteria.java
* src/java/org/archive/crawler/settings/refinements/RegularExpressionCriteria.java
Revert to taking a UURI rather than a CrawlURI.
* src/java/org/archive/crawler/settings/refinements/TimespanCriteriaTest.java
Removed empty test.
* src/webapps/admin/jobs/refinements/criteria.jsp
Removed reference to ContentType criteria.
* src/java/org/archive/crawler/postprocessor/WaitEvaluator.java
Was supposed to have been added on last commit.
* src/java/org/archive/crawler/settings/refinements/ContentTypeRegExprCriteria.java
Removed till better way of passing down CrawlURIs.



Date: 2005-03-28 23:45
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

The AdaptiveRevisitFrontier can do a basic crawl. Next
order of business is Kris taking a look see.

The thing that still needs resolution is passing a CrawlURI
into settings instead of UURI. There are a few places -- the
SettingsHandler#getParent method -- where it's not possible
to pass in a CrawlURI. I see in CrawlServer where you did a
new CrawlURI(UURI) just to get a UURI to pass into the
settings which is a little dirty (I removed it for now and
made a #getSettings that takes a UURI which then turns
around and does the dumb new CrawlURI(UURI) to be explicit
that we're doing something whack). Making the settings
system take CrawlURI instead of UURI is a pretty big change;
would need to make lots of sympathetic changes.

I was trying to figure a way that the refinement criteria
could ask the system for the CrawlURI that goes w/ the
passed UURI so we didn't have to pass down the UURI. What if
it gave the UURI to the frontier and asked it for the
corresponding CrawlURI? Could it figure the instance to
give back?

Otherwise, seems like a good idea passing a CrawlURI rather
than a UURI.

Another thing that could be looked into: there seemed
to be a bunch of places where we went to the alist w/o first
doing a containsKey check, so we were getting
NoSuchElementExceptions. I fixed a few of these but may not
have got all.
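
The guard in question can be sketched as follows. AttributeListSketch is a hypothetical map-backed stand-in written for this example, not the real AList; it is only assumed to throw NoSuchElementException on a missing key, as the comment above describes.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NoSuchElementException;

// Hypothetical stand-in for an attribute list with typed accessors.
final class AttributeListSketch {
    private final Map<String, Object> values = new HashMap<String, Object>();

    void putLong(String key, long value) { values.put(key, value); }

    boolean containsKey(String key) { return values.containsKey(key); }

    /** Strict accessor: throws if the key was never set. */
    long getLong(String key) {
        Object v = values.get(key);
        if (v == null) {
            throw new NoSuchElementException(key);
        }
        return (Long) v;
    }

    /** Guarded accessor: checks containsKey first and falls back to a default. */
    long getLongOrDefault(String key, long fallback) {
        return containsKey(key) ? getLong(key) : fallback;
    }
}
```

Putting the check in one helper means callers cannot forget it; the key name and default used at a call site are up to the caller.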

Here is commit message adding the ARF:

Part of '[ 1119580 ] Work with Kris to integrate revisiting frontier'
Commit of the AdaptiveRevisit frontier. In its current state, can do basic
crawls. Hasn't been tested doing AR. Still to be figured is how to get a
CrawlURI down into the settings system down to the refinements. Not
fully worked out in the kris branch (Part of this commit is
change to refinementcriteria so they take a CrawlURI).
* src/articles/user_manual.xml
Add mention of new AR frontier as well as its dependent processors.
Refactoring of the frontiers section. Added more doc. on bdbfrontier
and that HQF is deprecated.
* src/conf/heritrix.properties
Added commented logging lines for ARF.

* src/conf/modules/filters.options
Added mention of midfetch filter that looks at timestamp and etags.
* src/conf/modules/processors.options
Added mention of new extractors.
* src/conf/modules/urifrontiers.options
Added ARF.
* src/java/org/archive/crawler/datamodel/CandidateURI.java
(keys): Added.
* src/java/org/archive/crawler/datamodel/CrawlURI.java
javadoc.
(alistPersistentMember): Added. Keeps list of keys of items to persist
across processings.
(processingCleanup): Remove all from the alist except items mentioned in
alistPersistentMember.
(isPersistentAlistMember, addAlistPersistentMember,
removeAlistPersistentMember): Added.
* src/java/org/archive/crawler/extractor/Link.java
Make it so Link is serializable.
* src/java/org/archive/crawler/settings/ComplexType.java
Formatting. Javadoc.
(getSettings): Added override that takes a UURI. This override then
does a new CrawlURI(UURI) so can go down into refinements. This override
needs to be replaced w/ something more sane after figure how to get
CrawlURIs into refinements (Made this override so it's explicit that we're
doing something whack).
* src/java/org/archive/crawler/settings/CrawlSettingsSAXHandler.java
* src/java/org/archive/crawler/settings/CrawlSettingsSAXSource.java
* src/webapps/admin/jobs/refinements/criteria.jsp
Added handling of new criteria, ContentTypeMatcherHandler.
* src/java/org/archive/crawler/settings/CrawlerSettings.java
* src/java/org/archive/crawler/frontier/AdaptiveRevisitAttributeConstants.java
Formatting.
* src/java/org/archive/crawler/settings/SettingsHandler.java
(getSettings): Added override that takes a UURI. THIS NEEDS TO BE FIXED
if refinements are supposed to be getting CrawlURI (This method just
does new CrawlURI(UURI). It exists to make more blatant the fact that
there is an inadequacy regards getting CrawlURIs down to refinements).
* src/java/org/archive/crawler/settings/XMLSettingsHandler.java
(XML_ELEMENT_CONTENTMATCHES): New define.
* src/java/org/archive/crawler/settings/refinements/Criteria.java
* src/java/org/archive/crawler/settings/refinements/PortnumberCriteria.java
* src/java/org/archive/crawler/settings/refinements/Refinement.java
* src/java/org/archive/crawler/settings/refinements/RegularExpressionCriteria.java
* src/java/org/archive/crawler/settings/refinements/TimespanCriteria.java
(isWithinRefinementsBounds): Takes a crawluri instead of uuri (May want
to change this back).
* src/java/org/archive/crawler/extractor/ChangeEvaluator.java
* src/java/org/archive/crawler/extractor/HTTPContentDigest.java
* src/java/org/archive/crawler/filter/HTTPMidFetchUnchangedFilter.java
* src/java/org/archive/crawler/frontier/AdaptiveRevisitAttributeConstants.java
* src/java/org/archive/crawler/frontier/AdaptiveRevisitFrontier.java
* src/java/org/archive/crawler/frontier/AdaptiveRevisitHostQueue.java
* src/java/org/archive/crawler/frontier/AdaptiveRevisitHostQueueTest.java
* src/java/org/archive/crawler/frontier/AdaptiveRevisitQueueList.java
* src/java/org/archive/crawler/postprocessor/WaitEvaluator.java
* src/java/org/archive/crawler/settings/refinements/ContentTypeRegExprCriteria.java
Added.



Date: 2005-03-24 00:19
Sender: gojomoProject Admin

Logged In: YES
user_id=144912

Would like composite patch from Kris for review against
current code.... we'll apply it here as we review and work
through any items that come up.




Date: 2005-03-22 21:15
Sender: gojomoProject Admin

Logged In: YES
user_id=144912

Assignment to Raymie was in error.


Date: 2005-03-22 19:02
Sender: stack-sfProject Admin

Logged In: YES
user_id=924942

Ok on 1. and 2. above.

On 3., let's continue along the path of shutting down access
to AList till we have need of a '...possibly-nested hashmap
with typed accessors, and perhaps gets used elsewhere...'.

On 4., sure, let's have one AList w/ a list of the keys that
are permanent.

On versioning of a CrawlURI's data, that's a nice idea. Would
make revisiting-type frontiers easier to implement. Let's
make an RFE to do it.

Do you want to assign this issue back to me rather than to
Raymie so I can go ahead and implement (w/ Kris's help?).





Date: 2005-03-22 00:28
Sender: gojomoProject Admin

Logged In: YES
user_id=144912

Kris' proposals were roughly:
(1) add a 'changed' state ENUM to CrawlURI
(2) changing refinements so they take a CrawlURI rather
than a UURI
(3) making AList private so processors have to go via accessors
(4) introducing a distinction between a persistent and
transient 'AList' -- but making a single accessor method
check both

My comments on each:
(1) Ideally, anything that's only of use to some crawl
scenarios would go into the CrawlURI attribute/keyed 'AList'
data -- only things necessary for 'all' (or almost all)
crawls become instance variables. (If we felt the idioms for
storing things into the AList were stable, even more things
could go there -- like 'outlinks' which is for now an
instance variable.) So I'd prefer this 'change-detected
state' go into the AList.

(2) Making refinements be able to switch on CrawlURI state
sounds good to me, as long as the implementation isn't too
hairy. (Again, we hit the issue that the 'usual' case should
work on simple input -- just a CharSequence/String -- but
we'd like to be able to use richer object state if available
-- full CrawlURI parameters. So there's probably some
instanceof/casting/cascading in the interface.)

(3) We've already taken steps in this direction. I'm OK with
that, but if AList is considered generally useful as a
possibly-nested hashmap with typed accessors, and perhaps
gets used elsewhere, there's no harm in letting CrawlURI
clients at the raw AList.

(4) My main feedback is that this same issue came up in a
long-ago design discussion with Raymie Stata, and I'd
proposed a similar approach: two forks of attribute-value
per-CrawlURI data, one persisting across retries/revisits,
one not.

He suggested, based specifically on experience, that a
better approach was to have a single AList, but mark some
of the keys as persistent. (For example, by keeping an array
of those key names in a 'persistKeys' attribute.) Then, the
module which sets the key can be agnostic about whether it's
persistent -- it just analyzes/extracts/etc. Later modules
which want/need something to persist (or not) can make the
necessary changes for their own purposes.

I buy the reasoning: modules inserting data shouldn't have
to care about where it's put any more than those looking up
data want to (as the suggestion to have an accessor that
checks both grants). Only when they, or some other
loosely-coupled module, has strong persistence prefs should
the issue be considered. Also, having a bit of complicated
code at persist-or-not time, making a decision about each
key's persistence, seems easier to understand and
less-error-prone than exposing the dual nature in many
places (or in fallthrough accessors).

So rather than 2 ALists I would like Kris to consider if the
some-transient, some-persist need can be met by a single
AList where some keys are marked to persist.
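
A minimal Java sketch of the single-AList idea, under stated assumptions: PersistKeysSketch and its method names are hypothetical and written only for this example, and the attribute store is a plain HashMap rather than the real AList. Modules write attributes without caring about persistence; a separate set of key names decides what survives cleanup.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: one attribute map plus a set of keys marked to persist.
final class PersistKeysSketch {
    private final Map<String, Object> attributes = new HashMap<String, Object>();
    private final Set<String> persistKeys = new HashSet<String>();

    /** Writers stay agnostic: every value goes into the same map. */
    void put(String key, Object value) { attributes.put(key, value); }

    /** Readers use a single lookup; there is no second list to fall through to. */
    Object get(String key) { return attributes.get(key); }

    void markPersistent(String key) { persistKeys.add(key); }

    void unmarkPersistent(String key) { persistKeys.remove(key); }

    boolean isPersistent(String key) { return persistKeys.contains(key); }

    /** At the end of a processing pass, drop everything not marked to persist. */
    void processingCleanup() {
        attributes.keySet().retainAll(persistKeys);
    }
}
```

The persist-or-not decision lives in one method, so a module that cares about persistence only has to mark a key; modules that set or read values never see the distinction.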

(Raymie also suggested keeping the last X sets of AList
attributes for a CrawlURI inside its AList, so each
CrawlURI would have a record of its own history for analysis
on later revisits. I think that's roughly what the numbered
DocAttrGroups on Mercator tips page
http://mercator.comm.nsdlib.org/Mercator/attributes.html are
referring to -- the 0th group is the current visit; the 1st
the previous, the 2nd before that, etc.)
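
One way to picture the numbered attribute groups is the sketch below. It is an assumption-laden illustration only: VisitHistorySketch is a hypothetical class, and it keeps the groups in a separate bounded deque rather than nested inside the AList as suggested above. Group 0 is the current visit, group 1 the previous one, and so on.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical sketch: keeps the attribute sets of the last N visits.
final class VisitHistorySketch {
    private final int maxGroups;
    private final Deque<Map<String, Object>> groups =
            new ArrayDeque<Map<String, Object>>();

    VisitHistorySketch(int maxGroups) {
        if (maxGroups < 1) {
            throw new IllegalArgumentException("maxGroups must be >= 1");
        }
        this.maxGroups = maxGroups;
        groups.addFirst(new HashMap<String, Object>());
    }

    /** Attributes of the visit in progress (group 0). */
    Map<String, Object> current() { return groups.peekFirst(); }

    /** Begin a revisit: the old group 0 becomes group 1, oldest is dropped. */
    void startNewVisit() {
        groups.addFirst(new HashMap<String, Object>());
        while (groups.size() > maxGroups) {
            groups.removeLast();
        }
    }

    /** Returns the group that is 'age' visits old, or null if not retained. */
    Map<String, Object> group(int age) {
        if (age < 0 || age >= groups.size()) {
            return null;
        }
        Iterator<Map<String, Object>> it = groups.iterator();
        for (int i = 0; i < age; i++) {
            it.next();
        }
        return it.next();
    }
}
```

A revisiting frontier could then compare a value in group 0 against the same key in group 1 to decide whether a document changed between visits.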


Date: 2005-03-02 20:38
Sender: gojomoProject Admin

Logged In: YES
user_id=144912

Gordon: review suggested CrawlURI changes (per email)...
assign back to Stack when ready.


Attached File

No Files Currently Attached

Changes ( 6 )

Field        Old Value  Date              By
status_id    Open       2005-04-01 00:38  stack-sf
close_date   -          2005-04-01 00:38  stack-sf
assigned_to  stack-sf   2005-03-28 23:45  stack-sf
assigned_to  rstata     2005-03-22 21:15  gojomo
assigned_to  gojomo     2005-03-22 00:28  gojomo
assigned_to  stack-sf   2005-03-02 20:38  gojomo