Re: Re: [larm-dev] This project
From: otisg <ot...@ur...> - 2003-07-08 22:25:00
I don't think I follow everything in your email, but I think what you
want to do (distributed indexes a la Harvest, etc.) is different enough
from what LARM wants to do (be a 'vanilla' web crawler, plus a DB and
file-system indexer) that the two projects should continue their lives
independently.

Otis

----
On Tue, 08 Jul 2003, Leo Galambos (leo...@eg...) wrote:

> otisg wrote:
> > Leo wrote a text indexing/searching library, a la Lucene, and it
> > looks like he also wrote or wants to write a web crawler.
>
> Yes, that's true. My aim is a universal API for IR (including
> metasearchers and peer-to-peer nets), while Lucene is tuned for DB
> services. On the other hand, the main algorithms are the same, and the
> aim is identical: full-text services.
>
> I guess we could cooperate on a robot, or on open standards that could
> open up the ``hidden web''.
>
> Abstract: any (web) server could store its small index (a.k.a.
> barrel = Lucene segment = optimized Lucene index) at
> http://server/dbindex, and the central server could download these
> mini-indices [something like the Harvest engine did]. If you wanted to
> implement this, LARM (including the new one) would have real problems,
> IMHO. On the other hand, this approach is very important for you if
> you want to create an index for DB sources - see below.
>
> Now, let's say a server offers many indices, i.e. dbindex1, ...,
> dbindexN. Your model of LARM is based on the assumption that you know:
> (1) all indices are of size 1, i.e. each index contains just one
> document (a web page can be transformed into an index and stored under
> URL = original_URL + ".barrel"); it does not matter whether the index
> is prepared on the central server (after gathering, during the central
> server's indexing phase) or closer to the original source, as I
> suggest here;
> (2) you can always say what content is stored in a barrel (i.e. in
> "/index.html.barrel" you will always find the inverted index of
> "/index.html").
>
> BTW: Obviously, the barrels may be filtered (in linear time) for each
> specific access, e.g.:
> a) all records related to "local documents" are filtered out when the
> dbindex is accessed from outside your intranet;
> b) the central server can pass a "Modified-Since" meta value, and the
> barrel would then contain only the records for documents that changed
> after the specified date.
> BTW2: All dbindexes may be generated on the fly, so you can model them
> as virtual objects.
>
> ^^^ These two paragraphs also describe the model of the classic web
> crawler (all dbindex* are then prepared on the central machine that
> runs the crawler; Modified-Since is the well-known HTTP meta value;
> point b) can be identical to 4xx HTTP responses) - I think you see the
> analogy.
>
> And now, the most important point:
> On the central server, when a barrel (a.k.a. index, segment, or
> whatever you call it) arrives, you must filter out the records that
> are already up to date - and that's the issue. If I understand your
> design correctly, this decision is made before you gather the pages
> (or barrels, in the general case) [due to (1)+(2)], so the timestamp
> records may be left in the main index and you need not care about the
> issue.
> On the other hand, when you want to crawl barrels of size > 1, the
> decision must be made elsewhere, after you analyze the incoming
> barrel. Then the timestamp values must be stored outside the main
> index, in a hashtable, I guess.
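A minimal sketch of that hashtable idea, in plain Java of the period
(BarrelRecord, lastSeen and filterStale are purely illustrative names,
not LARM or Lucene classes): the central server keeps the
URL-to-timestamp map outside the main index and, in one linear pass,
drops every incoming barrel record that is not newer than the version
it already holds.

    import java.util.*;

    /** Illustrative record describing one document in an incoming barrel. */
    class BarrelRecord {
        String url;
        long lastModified;   // timestamp reported by the remote barrel

        BarrelRecord(String url, long lastModified) {
            this.url = url;
            this.lastModified = lastModified;
        }
    }

    class BarrelFilter {
        /** URL -> timestamp of the version already in the main index,
            kept outside the index itself (the hashtable above). */
        private final Map lastSeen = new HashMap();

        /** Keep only the records that are newer than what we already have. */
        public List filterStale(List incoming) {
            List fresh = new ArrayList();
            for (Iterator it = incoming.iterator(); it.hasNext();) {
                BarrelRecord rec = (BarrelRecord) it.next();
                Long seen = (Long) lastSeen.get(rec.url);
                if (seen == null || seen.longValue() < rec.lastModified) {
                    fresh.add(rec);                      // needs (re)indexing
                    lastSeen.put(rec.url, new Long(rec.lastModified));
                }
                // otherwise the record is already up to date - drop it
            }
            return fresh;
        }
    }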
> Moreover, the load related to modifications of the main index cannot
> be handled if you pass all update requests through the standard Lucene
> API (that is, in the "into" direction). You would rather use the
> reverse direction. If you calculated the amortized complexity of the
> update operations, it would be better for overall performance, I
> guess.
>
> I'm not sure whether you want to develop ``a crawler'' or something
> more general. That's why I asked whether you are stopping your effort.
>
> I tried to describe a situation where LARM need not work effectively.
> I have more examples, but I think this one is in your direction - I
> mean towards RDBMS technology using Lucene for full-text services in
> the background. Then "dbindex" could be an index of a DB table, or
> something like that.
>
> All my thoughts about LARM are based on the following data flow. I
> tried to read the documentation, but unfortunately I am not familiar
> with the old LARM, so if I miss a point, please correct me:
>
> 1. you want to store timestamps in the main Lucene index
> 2. the scheduler periodically retrieves the URLs that must be updated
>    (the URLs are read from the main index)
> 3. the scheduler prepares a pipe for the gatherer
> 4. the gatherer gets the pages
> 5. filters do something, and everything ends up in the main index
> 6. the old document is replaced with the new one
>
> BTW: AFAIK, the next weak point of Lucene could be in step 6; this
> action could take a lot of time when the main index is huge.
>
> Ufff. I'm off. Now it is your turn.
>
> BTW: I'm sorry for the long letter and my English. If you read this
> line, your nerves need a drink. Cheers! :-)
>
> -g-
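Step 6 of that flow is, in the Lucene of the day, a delete-then-add. A
rough sketch, assuming the 1.x-style API (IndexReader.delete(Term),
IndexWriter.addDocument()) and purely illustrative field names "url"
and "contents"; the add side, with its eventual segment merges, is the
part that gets expensive on a huge index:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    public class ReplaceDocument {
        /** Replace the stored version of a page: delete the old entry by
            its URL term, then add the re-fetched document. */
        public static void replace(String indexDir, String url,
                                   String newContents) throws Exception {
            // 1. delete the old document(s) carrying this URL
            IndexReader reader = IndexReader.open(indexDir);
            reader.delete(new Term("url", url));
            reader.close();

            // 2. add the new version (appending to the existing index)
            IndexWriter writer =
                new IndexWriter(indexDir, new StandardAnalyzer(), false);
            Document doc = new Document();
            doc.add(Field.Keyword("url", url));           // stored, untokenized key
            doc.add(Field.Text("contents", newContents)); // indexed page body
            writer.addDocument(doc);
            writer.close();
        }
    }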