Re: [larm-dev] This project
From: Leo G. <leo...@eg...> - 2003-07-08 02:21:53
otisg wrote:
> Leo wrote a text indexing/searching library, a la Lucene, and it
> looks like he also wrote or wants to write a web crawler.

Yes, that's true. My aim is a universal API for IR (incl. metasearchers and peer-to-peer nets), while Lucene is tuned for DB services. On the other hand, the main algorithms are the same, and the aim is identical: full-text services. I guess we could cooperate on a robot, or on open standards which could open the ``hidden web''.

Abstract: any (web) server could store its small index (a.k.a. barrel = lucene_segment = optimized_lucene_index) as http://server/dbindex, and the central server could download these mini indices on board [it's something like what the Harvest engine did]. If you wanted to implement this, LARM (incl. the new one) would have real problems IMHO. On the other hand, this approach is very important for you if you want to create an index for DB sources - see below.

Now, let's say that a server offers many indices, i.e. dbindex1, ..., dbindexN. Your model of LARM is based on the assumption that you know:

(1) all indices are of size 1, meaning each index contains just 1 document (i.e. a web page can be transformed to an index and stored under URL = original_URL + ".barrel") - it does not matter whether the index is prepared on the central server (after gathering, during the indexing phase of the central server) or closer to the original source, as I suggest here

(2) you can always say what content is saved in a barrel (i.e., in "/index.html.barrel" you will always find the inverted index of "/index.html")

BTW: Obviously, the barrels may be filtered (in linear time) for each specific access, i.e.:

a) all records which are related to "local documents" are filtered out when the dbindex is accessed from outside your intranet

b) the central server can pass an "If-Modified-Since" value, and the barrel would then contain just the records related to documents changed after the specified date

BTW2: All dbindex-es may be generated on the fly, so you can model them as virtual objects.

The two paragraphs above also describe the model of the classic web crawler (then all dbindex* are prepared on the central machine which runs the crawler; If-Modified-Since is the well-known HTTP header; point b) can be identical to 304 HTTP responses) - I think you see the analogy.

And now, the most important point: on the central server, when a barrel (a.k.a. index, segment, or whatever you call it) comes in, you must filter out records which are already up to date - and that's the issue. If I understand your design correctly, this decision is made before you gather the pages (or barrels, in the general case) [due to (1)+(2)], so the timestamp records may be left in the main index, and you need not care about the issue. On the other hand, when you want to crawl barrels of size > 1, the decision must be made elsewhere, after you analyze the incoming barrel. Then the timestamp values must be stored outside the main index - in a hashtable, I guess.

Moreover, the load related to modifications of the main index cannot be handled if you pass all update requests through the standard Lucene API (that is, in the "into" direction); you would rather use the reverse direction. If you batched the updates and considered their amortized complexity, overall performance would be better, I guess. See the sketch below.

I'm not sure whether you want to develop ``a crawler'' or something more general; that's why I asked whether you would stop your effort. I tried to describe a situation in which LARM might not work efficiently.
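A minimal sketch of what I mean, in Java against a recent Lucene API (the Page record, the "url"/"contents" field names, and the in-memory map are my placeholders, not existing LARM code): the timestamps live in a hashtable outside the main index, stale records are dropped while the barrel is analyzed, and the accepted ones are replaced in one batch.

    import java.io.IOException;
    import java.nio.file.Path;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.store.FSDirectory;

    public class BarrelMerger {

        /** One record of an incoming barrel; a placeholder, not a LARM type. */
        record Page(String url, long lastModified, String contents) {}

        /** Timestamps kept outside the Lucene index, as suggested above. */
        private final Map<String, Long> timestamps = new HashMap<>();

        void applyBarrel(Iterable<Page> barrel, Path indexPath) throws IOException {
            IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
            try (IndexWriter writer = new IndexWriter(FSDirectory.open(indexPath), cfg)) {
                for (Page p : barrel) {
                    Long seen = timestamps.get(p.url());
                    if (seen != null && seen >= p.lastModified()) {
                        continue; // record already up to date: the filtering
                                  // happens here, after the barrel arrived
                    }
                    Document doc = new Document();
                    doc.add(new StringField("url", p.url(), Field.Store.YES));
                    doc.add(new TextField("contents", p.contents(), Field.Store.NO));
                    // delete-by-term + add in one call; no commit yet
                    writer.updateDocument(new Term("url", p.url()), doc);
                    timestamps.put(p.url(), p.lastModified());
                }
                writer.commit(); // one commit per barrel amortizes the update cost
            }
        }
    }

If you did not need per-record filtering, you could merge a whole barrel wholesale with IndexWriter.addIndexes(Directory...); the point here is that the accept/reject decision happens after the barrel arrives, not before gathering.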
And I have more examples, but I think this one is in your direction - I mean towards RDBMS technology using Lucene for full-text services in the background. Then "dbindex" could be an index of a DB table, or something like that.

All my thoughts about LARM are based on the following data flow. I tried to read the documentation; unfortunately, I am not familiar with the old LARM, so if I miss a point, please correct me:

1. you want to store timestamps in the main Lucene index
2. the scheduler periodically retrieves the URLs which must be updated (the URLs are read from the main index)
3. the scheduler prepares a pipe for the gatherer
4. the gatherer gets the pages
5. filters do something, and everything ends up in the main index
6. the old document is replaced with the new one

(A sketch of step 2 is in the PS below.)

BTW: As far as I know Lucene, the next weak point could be step 6 - this action could take a lot of time when the main index is huge.

Ufff, I'm off. Now it is your turn.

BTW: I'm sorry for the long letter and my English. If you read this line, your nerves need a drink. Cheers! :-)

-g-
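PS: To make step 2 of the data flow concrete, here is a sketch of how the scheduler could scan the main index for the URLs that are due for a refresh (again a recent Lucene API; "url" and "lastModified" are assumed stored field names, not anything LARM defines). Note that it has to touch every live document:

    import java.io.IOException;
    import java.nio.file.Path;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.MultiBits;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Bits;

    public class Scheduler {

        /** Collect (url, timestamp) pairs of documents last fetched before olderThan. */
        static Map<String, Long> urlsToRecheck(Path indexPath, long olderThan)
                throws IOException {
            Map<String, Long> due = new HashMap<>();
            try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(indexPath))) {
                Bits live = MultiBits.getLiveDocs(reader); // null if no deletions
                for (int i = 0; i < reader.maxDoc(); i++) {
                    if (live != null && !live.get(i)) {
                        continue; // skip deleted documents
                    }
                    Document d = reader.document(i); // stored fields only
                    long ts = Long.parseLong(d.get("lastModified"));
                    if (ts < olderThan) {
                        due.put(d.get("url"), ts);
                    }
                }
            }
            return due;
        }
    }

With the hashtable from the first sketch, this full scan disappears, and step 6 stays the only place where the main index is touched.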