From: Antoni M. <ant...@gm...> - 2008-04-23 13:37:03
Christiaan Fluit wrote:
> Antoni Myłka wrote:
>> Christiaan Fluit wrote:
>>> Warning: referred IDs are already used by the WebCrawler to record link
>>> structure. In practice these two forms of use may co-exist without
>>> conflicts, as ZIP files do not have hyperlinks and HTML files do not
>>> have nested DataObjects, but it may be more future-proof and
>>> semantically clear to introduce a different key for storing this data.
>>> Note that AccessData allows arbitrary (id, key, value) tuples.
>>
>> This is important. What I need is a means to express 'cascade delete' and
>> 'cascade touch'. The first one would implement 'object removed =>
>> subobjects removed'. The second one would implement 'object unchanged =>
>> subobjects unchanged'. In this respect referred IDs cannot be used. I
>> have to introduce a different concept: 'containedIDs', 'childIDs',
>> 'aggregatedIDs' - whatever.
>
> This can be done simply by invoking AccessData.put(id, key, value) with
> a key expressing the hasPart relation.
>
> Another issue is who is doing the cascading delete and touch. You can
> add convenience methods to AccessData (I think this is what you
> proposed, right?), which may be the easiest and most efficient way to
> implement it. This will mean an API change though, having implications
> for those implementing their own AccessData implementations.
> Alternatively, you could also let the CrawlerBase handle it completely,
> using only the AccessData.put(id, key, value) and .get(id, key) methods.
> What's your view on this?

I'd rather include those methods in the AccessData interface, but OK, it
can also be done in the Crawler. If you have doubts about this, we may
postpone it until Aperture 2.0.

>> I'm deliberately not talking about timestamps. The touched/untouched
>> distinction might be implemented with timestamps OR with a
>> deprecatedUrls set.
>
> OK, so we're talking about the conceptual and technical side, right?
> I thought you meant setting an "untouched" flag on every item in the
> AccessData at the start of the crawl process, which gets toggled to
> "touched" once it is encountered during the crawl. My main point was to
> skip this lengthy initialization process by using timestamps instead.
>
>>> What's important is to come up with a scheme that does not result in
>>> (too much) damage when the crawl process is suddenly interrupted [snip]
>>
>> I guess it's enough to assume that when the crawl has been interrupted,
>> we have not crawled all objects and therefore cannot imply that anything
>> has been deleted - simply:
>>
>> if (exitCode.equals(ExitCode.COMPLETED)) { reportRemoved(); }
>>
>> just like it is now. The next crawl will not be able to continue from
>> where the previous one ended; it can't do that now either. It's a major
>> requirement that will need some serious analysis. I'd rather not try to
>> do it now.
>
> I didn't mean to say that the Crawler can take up exactly where it left
> off without having to redo some work, only that it should be able to
> produce the full remaining crawl results in a next crawl.
>
> For example, it is not a good idea to have a crawl initialization
> process that changes something in the AccessData in such a way that it
> is *vital* for it to complete the crawl process, or else the next
> invocation will mistakenly complete too soon. Basically you then have a
> kind of data corruption that cannot be remedied without throwing away
> all crawl results and access data.
>
> The approaches we have in mind don't seem to have that problem, though.

I meant "touched" as "entry timestamp equal to the timestamp of the
crawl start".

Antoni Mylka
ant...@gm...
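PS: to make the 'cascade delete' idea concrete, here is a rough sketch of how a crawler could cascade removals using only put/get-style access. None of this is real Aperture API: the 'aggregatedIDs' key, the SimpleAccessData stand-in, and all method names are invented for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Minimal stand-in for an (id, key, value) store; hypothetical, for illustration. */
class SimpleAccessData {
    private final Map<String, Map<String, String>> store = new HashMap<>();

    public void put(String id, String key, String value) {
        store.computeIfAbsent(id, k -> new HashMap<>()).put(key, value);
    }

    public String get(String id, String key) {
        Map<String, String> entry = store.get(id);
        return entry == null ? null : entry.get(key);
    }

    public void remove(String id) {
        store.remove(id);
    }

    public boolean isKnown(String id) {
        return store.containsKey(id);
    }
}

/** Sketch of a cascading delete over a made-up 'aggregatedIDs' key. */
public class CascadeDemo {
    static final String AGGREGATED_IDS = "aggregatedIDs";

    /** Remove an object and, recursively, everything it aggregates. */
    static void removeWithChildren(SimpleAccessData data, String id) {
        String children = data.get(id, AGGREGATED_IDS);
        if (children != null) {
            for (String child : children.split(" ")) {
                removeWithChildren(data, child);
            }
        }
        data.remove(id);
    }

    public static void main(String[] args) {
        SimpleAccessData data = new SimpleAccessData();
        // a ZIP file aggregating a plain entry and a nested archive
        data.put("zip:file.zip", AGGREGATED_IDS,
                "zip:file.zip!/a.txt zip:file.zip!/inner.zip");
        data.put("zip:file.zip!/inner.zip", AGGREGATED_IDS,
                "zip:file.zip!/inner.zip!/b.txt");
        data.put("zip:file.zip!/a.txt", "date", "123");
        data.put("zip:file.zip!/inner.zip!/b.txt", "date", "456");

        removeWithChildren(data, "zip:file.zip");
        // the deeply nested entry is gone too
        System.out.println(data.isKnown("zip:file.zip!/inner.zip!/b.txt")); // false
    }
}
```

The same key would drive 'cascade touch': when the parent is unchanged, walk the aggregated IDs and mark each child as touched without re-crawling it.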
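PS: a sketch of the timestamp-based 'touched' test combined with the COMPLETED-only removal reporting discussed above. Purely illustrative: the class, the method, and the flat Map in place of a real AccessData are all made up; only the idea (entry timestamp equals crawl start => touched, and deletions implied only after a completed crawl) comes from the thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Illustrative-only sketch of timestamp-based change detection. */
public class TimestampCrawlSketch {

    /**
     * Stamp every visited id with the crawl start time; an entry is then
     * "touched" iff its timestamp equals the start of this crawl. Entries
     * left untouched are reported as removed, but only when the crawl
     * COMPLETED; an interrupted crawl implies nothing about deletions.
     */
    static List<String> crawl(Map<String, Long> accessData,
                              Collection<String> visitedIds,
                              long crawlStart,
                              boolean completed) {
        for (String id : visitedIds) {
            accessData.put(id, crawlStart);           // touch = stamp with crawl start
        }
        List<String> removed = new ArrayList<>();
        if (completed) {                              // cf. ExitCode.COMPLETED check
            Iterator<Map.Entry<String, Long>> it = accessData.entrySet().iterator();
            while (it.hasNext()) {
                Map.Entry<String, Long> e = it.next();
                if (e.getValue() != crawlStart) {     // untouched => removed at source
                    removed.add(e.getKey());
                    it.remove();
                }
            }
        }
        return removed;
    }

    public static void main(String[] args) {
        Map<String, Long> accessData = new HashMap<>();
        accessData.put("file:/a.txt", 100L);     // seen in a previous crawl
        accessData.put("file:/gone.txt", 100L);  // seen before, gone now
        List<String> removed =
                crawl(accessData, Arrays.asList("file:/a.txt", "file:/new.txt"), 200L, true);
        System.out.println(removed); // [file:/gone.txt]
    }
}
```

The point of the timestamp variant is exactly the one made above: there is no "set everything to untouched" initialization pass, so an interruption before or during stamping cannot corrupt anything; the next completed crawl simply restamps with its own start time.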