Re: [larm-dev] This project
From: Leo G. <leo...@eg...> - 2003-07-08 02:21:53
otisg wrote:
> Leo wrote a text indexing/searching library, a la Lucene, and it
> looks like he also wrote or wants to write a web crawler.

Yes, that's true. My aim is a universal API for IR (incl. metasearchers and peer-to-peer nets), while Lucene is tuned for DB services. On the other hand, the main algorithms are the same, and the aim is identical: full-text services. I guess we could cooperate on a robot, or on open standards which could open the ``hidden web''.

Abstract: any (web) server could store its small index (a.k.a. barrel = lucene_segment = optimized_lucene_index) as http://server/dbindex, and the central server could download these mini indices on board [it's something like what the Harvest engine did]. If you wanted to implement this, LARM (incl. the new one) would have real problems IMHO. On the other hand, this approach is very important for you if you want to create an index for DB sources - see below.

Now, let's say that a server offers many indices, i.e. dbindex1, ..., dbindexN. Your model of LARM is based on the assumption that you know:

(1) all indices are of size 1, meaning each index contains just 1 document (i.e. a web page can be transformed to an index and stored under URL = original_URL + ".barrel") - it does not matter whether the index is prepared on the central server (after gathering, during the indexing phase of the central server) or closer to the original source, as I suggest here

(2) you can always say what content is saved in a barrel (i.e., in "/index.html.barrel" you will always find the inverted index of "/index.html")

BTW: Obviously, the barrels may be filtered (in linear time) for each specific access, i.e.:

a) all records which are related to "local documents" are filtered out when the dbindex is accessed from outside your intranet

b) the central server can pass an "If-Modified-Since" value, and the barrel would then contain just the records related to documents changed after the specified date

BTW2: All dbindex-es may be generated on the fly, so you can model them as virtual objects.

The two paragraphs above also describe the model of the classic web crawler (then all dbindex* are prepared on the central machine which runs the crawler; If-Modified-Since is the well-known HTTP header; point b) can be identical to 304 HTTP responses) - I think you see the analogy.

And now, the most important point: on the central server, when a barrel (a.k.a. index, segment, or whatever you call it) comes in, you must filter out records which are already up to date - and that's the issue. If I understand your design correctly, this decision is made before you gather the pages (or barrels, in the general case) [due to (1)+(2)], so the timestamp records may be left in the main index, and you need not care about the issue. On the other hand, when you want to crawl barrels of size > 1, the decision must be made elsewhere, after you analyze the incoming barrel. Then the timestamp values must be stored outside the main index - in a hashtable, I guess.

Moreover, the load related to modifications of the main index cannot be handled if you pass all update requests through the standard Lucene API (that is, in the "into" direction); you would rather use the reverse direction. If you batched the updates and considered their amortized complexity, overall performance would be better, I guess. See the sketch below.

I'm not sure whether you want to develop ``a crawler'' or something more general; that's why I asked whether you would stop your effort. I tried to describe a situation in which LARM might not work efficiently.
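A minimal sketch of what I mean, in Java against a recent Lucene API (the Page record, the "url"/"contents" field names, and the in-memory map are my placeholders, not existing LARM code): the timestamps live in a hashtable outside the main index, stale records are dropped while the barrel is analyzed, and the accepted ones are replaced in one batch.

    import java.io.IOException;
    import java.nio.file.Path;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.store.FSDirectory;

    public class BarrelMerger {

        /** One record of an incoming barrel; a placeholder, not a LARM type. */
        record Page(String url, long lastModified, String contents) {}

        /** Timestamps kept outside the Lucene index, as suggested above. */
        private final Map<String, Long> timestamps = new HashMap<>();

        void applyBarrel(Iterable<Page> barrel, Path indexPath) throws IOException {
            IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
            try (IndexWriter writer = new IndexWriter(FSDirectory.open(indexPath), cfg)) {
                for (Page p : barrel) {
                    Long seen = timestamps.get(p.url());
                    if (seen != null && seen >= p.lastModified()) {
                        continue; // record already up to date: the filtering
                                  // happens here, after the barrel arrived
                    }
                    Document doc = new Document();
                    doc.add(new StringField("url", p.url(), Field.Store.YES));
                    doc.add(new TextField("contents", p.contents(), Field.Store.NO));
                    // delete-by-term + add in one call; no commit yet
                    writer.updateDocument(new Term("url", p.url()), doc);
                    timestamps.put(p.url(), p.lastModified());
                }
                writer.commit(); // one commit per barrel amortizes the update cost
            }
        }
    }

If you did not need per-record filtering, you could merge a whole barrel wholesale with IndexWriter.addIndexes(Directory...); the point here is that the accept/reject decision happens after the barrel arrives, not before gathering.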
And I have more examples, but I think this one is in your direction - I mean towards RDBMS technology using Lucene for full-text services in the background. Then "dbindex" could be an index of a DB table, or something like that.

All my thoughts about LARM are based on the following data flow. I tried to read the documentation; unfortunately, I am not familiar with the old LARM, so if I miss a point, please correct me:

1. you want to store timestamps in the main Lucene index
2. the scheduler periodically retrieves the URLs which must be updated (the URLs are read from the main index)
3. the scheduler prepares a pipe for the gatherer
4. the gatherer gets the pages
5. filters do something, and everything ends up in the main index
6. the old document is replaced with the new one

(A sketch of step 2 is in the PS below.)

BTW: As far as I know Lucene, the next weak point could be step 6 - this action could take a lot of time when the main index is huge.

Ufff, I'm off. Now it is your turn.

BTW: I'm sorry for the long letter and my English. If you read this line, your nerves need a drink. Cheers! :-)

-g-
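PS: To make step 2 of the data flow concrete, here is a sketch of how the scheduler could scan the main index for the URLs that are due for a refresh (again a recent Lucene API; "url" and "lastModified" are assumed stored field names, not anything LARM defines). Note that it has to touch every live document:

    import java.io.IOException;
    import java.nio.file.Path;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.MultiBits;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Bits;

    public class Scheduler {

        /** Collect (url, timestamp) pairs of documents last fetched before olderThan. */
        static Map<String, Long> urlsToRecheck(Path indexPath, long olderThan)
                throws IOException {
            Map<String, Long> due = new HashMap<>();
            try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(indexPath))) {
                Bits live = MultiBits.getLiveDocs(reader); // null if no deletions
                for (int i = 0; i < reader.maxDoc(); i++) {
                    if (live != null && !live.get(i)) {
                        continue; // skip deleted documents
                    }
                    Document d = reader.document(i); // stored fields only
                    long ts = Long.parseLong(d.get("lastModified"));
                    if (ts < olderThan) {
                        due.put(d.get("url"), ts);
                    }
                }
            }
            return due;
        }
    }

With the hashtable from the first sketch, this full scan disappears, and step 6 stays the only place where the main index is touched.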