From: Christiaan F. <chr...@ad...> - 2008-03-19 10:14:13
Dan...@em... wrote:
> Hi Christiaan,
>
> thanks for your elaborate answer. Some comments inline. If you have
> any questions/comments I'd like to invite you to share them on the
> EILF newsgroup
> (http://www.eclipse.org/newsportal/thread.php?group=eclipse.technology.eilf).
> Also, if you have any requirements concerning EILF, features you would
> like to see in EILF, or open issues not addressed in Aperture, please
> join our newsgroup.

This is a tricky issue: which discussion takes place where? After all,
parts of this discussion are also relevant to the Aperture community.
Is there any way we can cross-post to our mailing list and your
newsgroup?

> Yes, we follow the same approach. New/changed objects can be detected
> at any time; deleted objects can only be identified at the end of a
> crawl. Besides crawling we also want to provide active monitoring of a
> data source (we call it Agent). An Agent would be able to report any
> event at any time.

I agree with Leo's comment about the term "Agent". Observer is much
better.

> We are planning to store just a hash code for each object. How the
> hash code is created (on what attributes, e.g. last modification date
> or access rights) is configurable.

In other words: each resource identifier (file path, URI, whatever ID
type you use) has a hashcode registered with it, and a different
hashcode implies a changed resource, correct? I see how this can be
used to allow for different file change detection strategies, which is
a good thing. When the hashcode algorithm is sufficiently robust (e.g.
a checksum over the entire binary file contents), it also allows for
detecting moved resources. Do you plan to facilitate this somehow in
your design? In our architecture, a moved resource is reported as a
deleted and a new resource; there is no connection between these two
events.

Note that the term "hashcode" is a bit overloaded.
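To make the comparison of strategies concrete, here is a minimal sketch of the delta-indexing scheme discussed above (hashcode per resource identifier, visited-marking during the crawl, deletion of unvisited identifiers at the end). The class and method names are illustrative only, not Aperture or EILF API:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative sketch, not Aperture/EILF code: one stored hashcode per
// resource identifier; a differing hashcode implies a changed resource.
public class DeltaIndex {

    // resource identifier (file path, URI, ...) -> stored hashcode
    private final Map<String, String> stored = new HashMap<String, String>();
    // identifiers seen during the current crawl
    private final Set<String> visited = new HashSet<String>();

    // Report a resource encountered during a crawl and classify it.
    public String report(String id, String hash) {
        visited.add(id);
        String previous = stored.put(id, hash);
        if (previous == null) {
            return "NEW";
        }
        return previous.equals(hash) ? "UNCHANGED" : "CHANGED";
    }

    // At the end of a crawl, every stored identifier that was not
    // visited belongs to a deleted resource.
    public Set<String> finishCrawl() {
        Set<String> deleted = new HashSet<String>(stored.keySet());
        deleted.removeAll(visited);
        for (String id : deleted) {
            stored.remove(id);
        }
        visited.clear();
        return deleted;
    }
}
```

With a content checksum as the hash, a move could additionally be detected by matching a "deleted" hash against a "new" hash at the end of the crawl, instead of reporting the two events separately.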
Some systems (especially P2P systems) use the hashcode as the primary
resource identifier, rather than as an attribute of the resource
identifier. Could this be confusing?

> The idea of storing outgoing links for HTML files to optimize
> performance sounds interesting. But this again increases the amount
> of data to store, especially if you think about high volumes of data.
> In practice I'm not sure if last modification dates returned from web
> servers are reliable.

We do this by setting the If-Modified-Since header on the HTTP
connection. In practice, not many servers support it (probably because
of generated pages: hard to tell), meaning that your unchanged pages
will still be reported as changed. When they do support it, they seem
to be telling the truth though :) I have not seen cases where servers
incorrectly reported URLs as unmodified (a 304 response code).

> We consider storing delta indexing information not only for data
> objects (the documents) but also for hierarchy objects (like folders)
> to improve runtime performance. In this way complete hierarchies
> (like subfolders) need not be crawled.

Can you elaborate on that? AFAIK, the file system metadata of a folder
does not change when something changes in a folder nested inside it, at
least not in the part that is accessible through java.io.File. How can
you detect when something has changed?

> Yes, holding the information in memory is not feasible for high
> volumes of data. Our approach is to permanently store all the delta
> indexing information, e.g. in a database, a search index or some
> other data store (implementations will be interchangeable). During a
> crawl, new/changed/unchanged objects are marked as visited and at the
> end of a crawl all unvisited objects are to be deleted.

OK, so our strategies are very similar. Will your design also take
multi-threaded crawling into account, e.g. for use in a clustered
environment?

Regards,

Chris

--
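For reference, the If-Modified-Since technique described above can be sketched with the standard java.net API; the helper names here are illustrative, not taken from Aperture:

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Illustrative sketch of conditional fetching with If-Modified-Since.
public class ConditionalFetch {

    // Prepare a request that asks the server to answer 304 Not Modified
    // when the resource is unchanged since the given time (epoch millis).
    public static HttpURLConnection prepare(String url, long lastCrawlMillis)
            throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(url).openConnection();
        conn.setIfModifiedSince(lastCrawlMillis); // sends If-Modified-Since
        return conn;
    }

    // True when the server reports the resource as unchanged. Many
    // servers ignore the header (e.g. for generated pages) and always
    // answer 200, so a 200 does not prove the page actually changed.
    public static boolean unchanged(HttpURLConnection conn)
            throws IOException {
        return conn.getResponseCode() == HttpURLConnection.HTTP_NOT_MODIFIED;
    }
}
```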