Re: Re: [larm-dev] This project
From: otisg <ot...@ur...> - 2003-07-08 22:25:00
I don't think I follow everything in your email, but I think what you
want to do (distributed indexes a la Harvest, etc.) is different enough
from what LARM wants to do (be a 'vanilla' web crawler, plus a DB and
file-system indexer) that the two projects should continue their lives
independently.

Otis

----
On Tue, 08 Jul 2003, Leo Galambos (leo...@eg...) wrote:

> otisg wrote:
> > Leo wrote a text indexing/searching library, a la Lucene, and it
> > looks like he also wrote or wants to write a web crawler.
>
> Yes, that's true. My aim is a universal API for IR (including
> metasearchers and peer-to-peer nets), while Lucene is tuned for DB
> services. On the other hand, the main algorithms are the same, and the
> aim is identical: full-text services.
>
> I guess we could cooperate on a robot, or on open standards that could
> open up the ``hidden web''.
>
> Abstract: any (web) server could store its small index (a.k.a.
> barrel = Lucene segment = optimized Lucene index) at
> http://server/dbindex, and the central server could download these
> mini-indices [something like the Harvest engine did]. If you wanted to
> implement this, LARM (including the new one) would have real problems,
> IMHO. On the other hand, this approach is very important for you if
> you want to create an index for DB sources - see below.
>
> Now, let's say a server offers many indices, i.e. dbindex1, ...,
> dbindexN. Your model of LARM is based on the assumption that you know:
> (1) all indices are of size 1, i.e. each index contains just one
> document (a web page can be transformed into an index and stored under
> URL = original_URL + ".barrel"); it does not matter whether the index
> is prepared on the central server (after gathering, during the central
> server's indexing phase) or closer to the original source, as I
> suggest here;
> (2) you can always say what content is stored in a barrel (i.e. in
> "/index.html.barrel" you will always find the inverted index of
> "/index.html").
>
> BTW: Obviously, the barrels may be filtered (in linear time) for each
> specific access, e.g.:
> a) all records related to "local documents" are filtered out when the
> dbindex is accessed from outside your intranet;
> b) the central server can pass a "Modified-Since" meta value, and the
> barrel would then contain only the records for documents that changed
> after the specified date.
> BTW2: All dbindexes may be generated on the fly, so you can model them
> as virtual objects.
>
> ^^^ These two paragraphs also describe the model of the classic web
> crawler (all dbindex* are then prepared on the central machine that
> runs the crawler; Modified-Since is the well-known HTTP meta value;
> point b) can be identical to 4xx HTTP responses) - I think you see the
> analogy.
>
> And now, the most important point:
> On the central server, when a barrel (a.k.a. index, segment, or
> whatever you call it) arrives, you must filter out the records that
> are already up to date - and that's the issue. If I understand your
> design correctly, this decision is made before you gather the pages
> (or barrels, in the general case) [due to (1)+(2)], so the timestamp
> records may be left in the main index and you need not care about the
> issue.
> On the other hand, when you want to crawl barrels of size > 1, the
> decision must be made elsewhere, after you analyze the incoming
> barrel. Then the timestamp values must be stored outside the main
> index, in a hashtable, I guess.
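A minimal sketch of that hashtable idea, in plain Java of the period
(BarrelRecord, lastSeen and filterStale are purely illustrative names,
not LARM or Lucene classes): the central server keeps the
URL-to-timestamp map outside the main index and, in one linear pass,
drops every incoming barrel record that is not newer than the version
it already holds.

    import java.util.*;

    /** Illustrative record describing one document in an incoming barrel. */
    class BarrelRecord {
        String url;
        long lastModified;   // timestamp reported by the remote barrel

        BarrelRecord(String url, long lastModified) {
            this.url = url;
            this.lastModified = lastModified;
        }
    }

    class BarrelFilter {
        /** URL -> timestamp of the version already in the main index,
            kept outside the index itself (the hashtable above). */
        private final Map lastSeen = new HashMap();

        /** Keep only the records that are newer than what we already have. */
        public List filterStale(List incoming) {
            List fresh = new ArrayList();
            for (Iterator it = incoming.iterator(); it.hasNext();) {
                BarrelRecord rec = (BarrelRecord) it.next();
                Long seen = (Long) lastSeen.get(rec.url);
                if (seen == null || seen.longValue() < rec.lastModified) {
                    fresh.add(rec);                      // needs (re)indexing
                    lastSeen.put(rec.url, new Long(rec.lastModified));
                }
                // otherwise the record is already up to date - drop it
            }
            return fresh;
        }
    }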
> Moreover, the load related to modifications of the main index cannot
> be handled if you pass all update requests through the standard Lucene
> API (that is, in the "into" direction). You would rather use the
> reverse direction. If you calculated the amortized complexity of the
> update operations, it would be better for overall performance, I
> guess.
>
> I'm not sure whether you want to develop ``a crawler'' or something
> more general. That's why I asked whether you are stopping your effort.
>
> I tried to describe a situation where LARM need not work effectively.
> I have more examples, but I think this one is in your direction - I
> mean towards RDBMS technology using Lucene for full-text services in
> the background. Then "dbindex" could be an index of a DB table, or
> something like that.
>
> All my thoughts about LARM are based on the following data flow. I
> tried to read the documentation, but unfortunately I am not familiar
> with the old LARM, so if I miss a point, please correct me:
>
> 1. you want to store timestamps in the main Lucene index
> 2. the scheduler periodically retrieves the URLs that must be updated
>    (the URLs are read from the main index)
> 3. the scheduler prepares a pipe for the gatherer
> 4. the gatherer gets the pages
> 5. filters do something, and everything ends up in the main index
> 6. the old document is replaced with the new one
>
> BTW: AFAIK, the next weak point of Lucene could be in step 6; this
> action could take a lot of time when the main index is huge.
>
> Ufff. I'm off. Now it is your turn.
>
> BTW: I'm sorry for the long letter and my English. If you read this
> line, your nerves need a drink. Cheers! :-)
>
> -g-
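Step 6 of that flow is, in the Lucene of the day, a delete-then-add. A
rough sketch, assuming the 1.x-style API (IndexReader.delete(Term),
IndexWriter.addDocument()) and purely illustrative field names "url"
and "contents"; the add side, with its eventual segment merges, is the
part that gets expensive on a huge index:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    public class ReplaceDocument {
        /** Replace the stored version of a page: delete the old entry by
            its URL term, then add the re-fetched document. */
        public static void replace(String indexDir, String url,
                                   String newContents) throws Exception {
            // 1. delete the old document(s) carrying this URL
            IndexReader reader = IndexReader.open(indexDir);
            reader.delete(new Term("url", url));
            reader.close();

            // 2. add the new version (appending to the existing index)
            IndexWriter writer =
                new IndexWriter(indexDir, new StandardAnalyzer(), false);
            Document doc = new Document();
            doc.add(Field.Keyword("url", url));           // stored, untokenized key
            doc.add(Field.Text("contents", newContents)); // indexed page body
            writer.addDocument(doc);
            writer.close();
        }
    }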