larm-developer Mailing List for Lucene Advanced Retrieval Machine (LARM)
Brought to you by: cmarschner, otis
2003 | Jan | Feb | Mar | Apr | May | Jun (27) | Jul (8) | Aug | Sep (2) | Oct | Nov | Dec |
2004 | Jan | Feb | Mar | Apr | May (1) | Jun | Jul | Aug | Sep | Oct | Nov | Dec |
From: <ben...@id...> - 2004-05-22 13:21:20
|
Dear Open Source developer,

I am doing a research project on "Fun and Software Development", in which I kindly invite you to participate. You will find the online survey at http://fasd.ethz.ch/qsf/. The questionnaire consists of 53 questions and you will need about 15 minutes to complete it.

With the FASD project (Fun and Software Development) we want to determine the motivational significance of fun when software developers decide to engage in Open Source projects. What is special about our research project is that a similar survey is planned with software developers in commercial firms. This procedure allows an immediate comparison between the involved individuals and the conditions of production of these two development models. Thus we hope to obtain substantial new insights into the phenomenon of Open Source development.

With many thanks for your participation,
Benno Luthiger

PS: The results of the survey will be published at http://www.isu.unizh.ch/fuehrung/blprojects/FASD/. We have set up the mailing list fa...@we... for this study. Please see http://fasd.ethz.ch/qsf/mailinglist_en.html for registration to this mailing list.

_______________________________________________________________________
Benno Luthiger
Swiss Federal Institute of Technology Zurich
8092 Zurich
Mail: benno.luthiger(at)id.ethz.ch
_______________________________________________________________________
|
From: otisg <ot...@ur...> - 2003-09-12 13:18:39
|
Try checking out the top-level 'docs' directory under the larm CVS module. I think that may be different from what you've got. If not, see Lucene-Sandbox on the Jakarta site.

Otis

---- On Thu, 11 Sep 2003, antonin bonte (ab...@ph...) wrote:
> Hi,
>
> I downloaded the LARM docs from CVS:
> jakarta-lucene-sandbox/projects/larm/docs
>
> They don't have the diagrams included. Where can I find full versions with diagrams?
>
> Thanks.
>
> Antonin Bonte
|
From: antonin b. <ab...@ph...> - 2003-09-11 16:27:41
|
Hi,

I downloaded the LARM docs from CVS:
jakarta-lucene-sandbox/projects/larm/docs

They don't have the diagrams included. Where can I find full versions with diagrams?

Thanks.

Antonin Bonte
|
From: Leo G. <leo...@eg...> - 2003-07-10 02:47:07
|
otisg wrote:
> Hello Leo,
>
> Yes, indexing an RDBMS is lower on the priority list than web crawling and indexing. Personally, I am not interested in that at all. I am only interested in crawling and indexing the web. If by 'doing more' you are referring to local indices a la Harvest, then again, I don't think we want to do that with LARM.

ok.

> Harvest already failed attempting to get people to index their own servers. There is no need to repeat that mistake :).

Hi Otis. Harvest had other goals, and we have the 21st century ;-)

-g-
|
From: otisg <ot...@ur...> - 2003-07-09 12:05:05
|
Hello Leo,

Yes, indexing an RDBMS is lower on the priority list than web crawling and indexing. Personally, I am not interested in that at all. I am only interested in crawling and indexing the web. If by 'doing more' you are referring to local indices a la Harvest, then again, I don't think we want to do that with LARM. Harvest already failed attempting to get people to index their own servers. There is no need to repeat that mistake :).

Otis

---- On Wed, 09 Jul 2003, Leo Galambos (leo...@eg...) wrote:
> Otis,
>
> the point I tried to express is that I do not know how you would like to keep the index behind LARM in sync with a remote DB/SQL. As far as I read the documentation, you can do this in your LARM model if you fetch a table tuple by tuple. On the other hand, if you want to crawl local directories or remote web servers, the design is perfect.
>
> BTW, I tried to show that you could easily modify your design and also do something more in a web space. You know, I'm not interested in RDBMS features, but the web has a high priority for me ;-).
>
> -g-
>
> otisg wrote:
> > I don't think I follow everything in your email, but I think what you want to do (the distributed indexes a la Harvest, etc.) is different enough from what LARM wants to do (be a 'vanilla' web crawler (also DB and file system indexer)), that the two projects should continue their lives independently.
|
From: Leo G. <leo...@eg...> - 2003-07-09 03:10:09
|
Otis,

the point I tried to express is that I do not know how you would like to keep the index behind LARM in sync with a remote DB/SQL. As far as I read the documentation, you can do this in your LARM model if you fetch a table tuple by tuple. On the other hand, if you want to crawl local directories or remote web servers, the design is perfect.

BTW, I tried to show that you could easily modify your design and also do something more in a web space. You know, I'm not interested in RDBMS features, but the web has a high priority for me ;-).

-g-

otisg wrote:
> I don't think I follow everything in your email, but I think what you want to do (the distributed indexes a la Harvest, etc.) is different enough from what LARM wants to do (be a 'vanilla' web crawler (also DB and file system indexer)), that the two projects should continue their lives independently.
|
From: otisg <ot...@ur...> - 2003-07-08 22:25:00
|
I don't think I follow everything in your email, but I think what you want to do (the distributed indexes a la Harvest, etc.) is different enough from what LARM wants to do (be a 'vanilla' web crawler (also DB and file system indexer)), that the two projects should continue their lives independently.

Otis

---- On Tue, 08 Jul 2003, Leo Galambos (leo...@eg...) wrote:
> otisg wrote:
> > Leo wrote a text indexing/searching library, a la Lucene, and it looks like he also wrote or wants to write a web crawler.
>
> Yes, that's true. My aim is a universal API for IR (incl. metasearchers and peer-to-peer nets), while Lucene is tuned for DB services. On the other hand, the main algorithms are the same, and the aim is identical - full-text services.
>
> I guess we could co-operate on a robot, or on open standards which could open the ``hidden web''.
>
> Abstract: any (web) server could store its small index (a.k.a. barrel = lucene_segment = optimalized_lucene_index) as http://server/dbindex, and the central server could download these mini indices on board [it's something like the Harvest engine did]. If you wanted to implement this, LARM (incl. the new one) would have real problems IMHO. On the other hand, this approach is very important for you if you want to create an index for DB sources - see below.
>
> Now, let's say that a server offers many indices, i.e. dbindex1, ... dbindexN. Your model of LARM is based on the assumption that you know:
>
> (1) all indices are of size 1, meaning the index contains just one document (i.e. a web page can be transformed to an index and stored under URL=original_URL+".barrel") - it does not matter whether the index is prepared on the central server (after gathering, during the indexing phase of the central server) or closer to the original source, as I suggest here
>
> (2) you can always say what content is saved in a barrel (i.e., in "/index.html.barrel" you will always find the inverted index of "/index.html")
>
> BTW: Obviously, the barrels may be filtered (in linear time) for each specific access, i.e.:
> a) all records which are related to "local documents" are filtered out when the dbindex is accessed from outside your intranet
> b) the central server can pass a "Modified-since" meta value, and the barrel would then contain just the records related to documents which were changed after the specified date
> BTW2: All dbindexes may be generated on the fly, so you can model them as virtual objects
>
> ^^^ The two paragraphs above also describe the model of the classic web crawler (then all dbindex* are prepared on the central machine which runs the crawler; Modified-Since is the well-known HTTP meta value; point b) can be identical to 4xx HTTP responses) - I think you see the analogy.
>
> And now, the most important point: on the central server, when a barrel (a.k.a. index, segment or whatever you call it) comes in, you must filter out records which are already up-to-date - and that's the issue. If I understand your design correctly, this decision is made before you gather the pages (or barrels in the general case) [due to (1)+(2)], so the timestamp records may be left in the main index, and you need not care about the issue. On the other hand, when you want to crawl barrels of size > 1, the decision must be made elsewhere, after you analyze the incoming barrel. Then the timestamp values must be stored outside the main index, in a hashtable, I guess.
>
> Moreover, the load related to modifications of the main index cannot be handled if you pass all update requests via the standard Lucene API (that is, in the "into" direction). You would rather use the reverse direction. If you calculated the amortized complexity of the update operations, it would be better for overall performance, I guess.
>
> I'm not sure whether you want to develop ``a crawler'' or something more general. That's why I asked whether you have stopped your effort.
>
> I tried to describe a situation in which LARM need not work effectively. And I have more examples, but I think this one is in your direction - I mean towards the RDBMS technology using Lucene for full-text services in the background. Then "dbindex" could be an index of a DB table, or something like that.
>
> All my thoughts about LARM are based on the following data flow. I tried to read the documentation; unfortunately, I am not familiar with the old LARM, so if I miss a point, please correct me if I'm wrong:
>
> 1. you want to store timestamps in the main Lucene index
> 2. the scheduler periodically retrieves URLs which must be updated (the URLs are read from the main index)
> 3. the scheduler prepares a pipe for the gatherer
> 4. the gatherer gets pages
> 5. filters do something and everything ends up in the main index
> 6. the old document is replaced with a new one
>
> BTW: AFAIK in Lucene, the next weak point could be in point 6; this action could take a lot of time when the main index is huge.
>
> Ufff. I'm off. Now it is your turn.
>
> BTW: I'm sorry for the long letter and my English. If you read this line, your nerves need a drink. Cheers! :-)
>
> -g-
|
From: Leo G. <leo...@eg...> - 2003-07-08 02:21:53
|
otisg wrote:
> Leo wrote a text indexing/searching library, a la Lucene, and it looks like he also wrote or wants to write a web crawler.

Yes, that's true. My aim is a universal API for IR (incl. metasearchers and peer-to-peer nets), while Lucene is tuned for DB services. On the other hand, the main algorithms are the same, and the aim is identical - full-text services.

I guess we could co-operate on a robot, or on open standards which could open the ``hidden web''.

Abstract: any (web) server could store its small index (a.k.a. barrel = lucene_segment = optimalized_lucene_index) as http://server/dbindex, and the central server could download these mini indices on board [it's something like the Harvest engine did]. If you wanted to implement this, LARM (incl. the new one) would have real problems IMHO. On the other hand, this approach is very important for you if you want to create an index for DB sources - see below.

Now, let's say that a server offers many indices, i.e. dbindex1, ... dbindexN. Your model of LARM is based on the assumption that you know:

(1) all indices are of size 1, meaning the index contains just one document (i.e. a web page can be transformed to an index and stored under URL=original_URL+".barrel") - it does not matter whether the index is prepared on the central server (after gathering, during the indexing phase of the central server) or closer to the original source, as I suggest here

(2) you can always say what content is saved in a barrel (i.e., in "/index.html.barrel" you will always find the inverted index of "/index.html")

BTW: Obviously, the barrels may be filtered (in linear time) for each specific access, i.e.:
a) all records which are related to "local documents" are filtered out when the dbindex is accessed from outside your intranet
b) the central server can pass a "Modified-since" meta value, and the barrel would then contain just the records related to documents which were changed after the specified date
BTW2: All dbindexes may be generated on the fly, so you can model them as virtual objects

^^^ The two paragraphs above also describe the model of the classic web crawler (then all dbindex* are prepared on the central machine which runs the crawler; Modified-Since is the well-known HTTP meta value; point b) can be identical to 4xx HTTP responses) - I think you see the analogy.

And now, the most important point: on the central server, when a barrel (a.k.a. index, segment or whatever you call it) comes in, you must filter out records which are already up-to-date - and that's the issue. If I understand your design correctly, this decision is made before you gather the pages (or barrels in the general case) [due to (1)+(2)], so the timestamp records may be left in the main index, and you need not care about the issue. On the other hand, when you want to crawl barrels of size > 1, the decision must be made elsewhere, after you analyze the incoming barrel. Then the timestamp values must be stored outside the main index, in a hashtable, I guess.

Moreover, the load related to modifications of the main index cannot be handled if you pass all update requests via the standard Lucene API (that is, in the "into" direction). You would rather use the reverse direction. If you calculated the amortized complexity of the update operations, it would be better for overall performance, I guess.

I'm not sure whether you want to develop ``a crawler'' or something more general. That's why I asked whether you have stopped your effort.

I tried to describe a situation in which LARM need not work effectively. And I have more examples, but I think this one is in your direction - I mean towards the RDBMS technology using Lucene for full-text services in the background. Then "dbindex" could be an index of a DB table, or something like that.

All my thoughts about LARM are based on the following data flow. I tried to read the documentation; unfortunately, I am not familiar with the old LARM, so if I miss a point, please correct me if I'm wrong:

1. you want to store timestamps in the main Lucene index
2. the scheduler periodically retrieves URLs which must be updated (the URLs are read from the main index)
3. the scheduler prepares a pipe for the gatherer
4. the gatherer gets pages
5. filters do something and everything ends up in the main index
6. the old document is replaced with a new one

BTW: AFAIK in Lucene, the next weak point could be in point 6; this action could take a lot of time when the main index is huge.

Ufff. I'm off. Now it is your turn.

BTW: I'm sorry for the long letter and my English. If you read this line, your nerves need a drink. Cheers! :-)

-g-
|
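A minimal sketch of the index-update step (6) discussed in the message above, written against the Lucene 1.x API of the time. Lucene has no in-place update, so the old entry is deleted via an IndexReader and the new one re-added via an IndexWriter; the "url" field name and the helper class are illustrative assumptions, not LARM code.

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

// Replace a document identified by its (assumed) "url" field. The
// delete/close/reopen cycle is exactly the per-update cost that becomes
// expensive on a huge index, as pointed out in the message above.
public class IndexUpdater {
    public static void replaceDocument(String indexDir, String url, Document newDoc)
            throws IOException {
        IndexReader reader = IndexReader.open(indexDir);
        reader.delete(new Term("url", url));   // mark the old version as deleted
        reader.close();

        IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), false);
        writer.addDocument(newDoc);            // add the freshly fetched version
        writer.close();
    }
}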
From: otisg <ot...@ur...> - 2003-07-07 22:24:55
|
Not sure. Clemens, the original project leader, had to stop working on it because of work. I cannot contribute code currently, because I just have too many other things going. David ported the original LARM to run on top of Avalon. I am not sure what the status of that code is. It is not in CVS. Maybe David will want to put it in a separate CVS module under the LARM project. Jeff expressed interest in working on LARM. Leo wrote a text indexing/searching library, a la Lucene, and it looks like he also wrote or wants to write a web crawler.

There, a summary for everyone :)

Otis

---- On Mon, 7 Jul 2003, David Worms (da...@si...) wrote:
> Look at the mail archive. For sure, it isn't stopped; the question is, for how long is it paused?
>
> d.
>
> Begin forwarded message:
>
> > From: Leo Galambos <leo...@eg...>
> > Date: Mon Jul 7, 2003 9:39:18 AM US/Pacific
> > To: lar...@li...
> > Subject: [larm-dev] This project
> > Reply-To: lar...@li...
> >
> > Is this project stopped?
> >
> > -g-
|
From: David W. <da...@si...> - 2003-07-07 17:10:00
|
Look at the mail archive. For sure, it isn't stopped; the question is, for how long is it paused?

d.

Begin forwarded message:

> From: Leo Galambos <leo...@eg...>
> Date: Mon Jul 7, 2003 9:39:18 AM US/Pacific
> To: lar...@li...
> Subject: [larm-dev] This project
> Reply-To: lar...@li...
>
> Is this project stopped?
>
> -g-
|
From: Leo G. <leo...@eg...> - 2003-07-07 16:38:31
|
Is this project stopped? -g- |
From: Clemens M. <Cle...@in...> - 2003-06-30 11:09:26
|
http://larm.sourceforge.net/ |
From: otisg <ot...@ur...> - 2003-06-25 01:50:31
|
> > primaryURI is normalized (e.g. hosts lowercased, port included only if != 80, I suppose, the path part of the URL cleaned up, etc.). Maybe add this, although it's pretty obvious.
>
> Sounds reasonable. Some remarks on that one: I would say the URL should still allow for opening the file. This should be the case with the actions you mentioned. It was sometimes not the case with the old LARM. larm-old contained the following normalizations:
>
> 1. http://host --> http://host/ (always include a path)
> 2. + ---> %20 (%20 instead of + for space)
> 3. %af %aF %Af --> %AF (all escape sequences uppercase)
> 4. all unsafe chars --> escaped version (see URL RFC)
> 5. http://Host/ --> http://host/ (host.tolower)
> 6. http://host/? --> http://host/ (remove empty query)
> 7. http://host:80/ --> http://host/ (remove default ports)
> 8. https://host:443/ --> https://host/ (remove default ports)
> 9. http://host/./ --> http://host/ (remove redundant ./)
> 10. http://host/path/../ --> http://host/ (resolve ../ - not implemented in larm-old)
>
> In addition, old LARM did the following:
>
> 11. http://www.host.com/ --> http://host.com (remove www.)
> 12. http://www.host.com/index.* --> http://www.host.com/
> 13. http://www.host.com/default.* --> http://www.host.com/ (remove redundant (?) index.* / default.*)
> 14. http://host1/ --> http://host2/ (if host1 and host2 are in an alias table)
>
> 11-14 may produce false positives, that is, two URLs merged into one that point to different files. Furthermore, the URL may lead to an error page or to a non-existing server (if www. is cut), although most of the time it will work out. And 14 is complicated to handle during the crawl itself, so I would say we should leave this out.
>
> I think now 11-14 should be handled using document fingerprints. However, since we have to avoid that a host with different names is crawled from two different threads, I would say at least that "www." is cut off the keys that are used for identifying the thread responsible for a host.

I agree. I'm pro 1-10 and against 11-14, because they are based on assumptions that will not always yield correct results.

As for using the domain name as the key when assigning hosts to fetcher threads, I agree. Note that I said 'using the domain name'. I think that may be the easiest thing to do and sufficient for this purpose. Examples:

foo.bar.domain.com -> domain.com
a.b.otherdomain.com -> otherdomain.com
www.blah.thirddomain.com -> thirddomain.com
www.fourthdomain.com -> fourthdomain.com

> > > 3. secondaryURIs: Collection: A list of secondary URIs of the document. If the URI is ambiguous (e.g. if a document is represented by more than one URL) this contains a list of all URIs (found) under which this document is accessible.
> >
> > This can also be null. Obvious, too.
>
> null or an empty collection? I would say the latter, to avoid making this distinction.

empty is fine.

> > > 6. lastChangedDate: Date: The time the last change has occurred as
> >
> > Something seems to be missing here...
>
> the timestamp in the index. Oh, we need a timestamp in the index...!

Yes we do. I thought you had already mentioned that somewhere. Doesn't matter; we need both the time of the last fetch and the time of the last change (we can look for changed fingerprints to determine whether the page changed; however, this will result in false positives when pages include things like counters or the current date/time. To get around that we could try using Nilsimsa (http://www.google.com/search?q=nilsimsa)). We need both of these dates, I think, in order to adjust the fetching frequency dynamically, based on the frequency at which each page changes. Pages that change less frequently can also be crawled less frequently.

> > Isn't "lastChanged" date the same as "indexedDate"?

It is the same thing, isn't it? If the page is not changed, we don't re-index it, so indexedDate will always be lastChangedDate, no?

> > In the Fetcher paragraph there is a hint about Fetchers polling the FetcherManager. I am not sure what you have in mind, but if that is not such an important feature, I'd remove references to it, in order to keep things simple and avoid over-engineering.
>
> If we follow a "push" model here, it means the FetcherManager prepares lists of URLs for each thread. The thread, when it is ready, should poll for new messages, and go idle when there are no new messages. This avoids synchronization. Then the thread gets a list of CrawlRequests and downloads the files.

OK.

> Then we can still decide whether it pushes each file along the queue when it gets it or whether it collects some files and pushes them all together. That's constrained by the file sizes and the memory available.

I don't know the answer yet. Too early for me to say. I am leaning towards dealing with pages in batches. I have a feeling it may be more efficient. I also remember reading about some crawler implementation (Uni. of Indiana, I believe) that mentioned that they implemented this type of stuff in batches for efficiency purposes.

Otis
|
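A small illustrative sketch of the "domain name as thread key" idea from the exchange above: a host is mapped to the key used for assigning it to a fetcher thread by keeping only its last two labels. The class name and the naive two-label heuristic are assumptions for illustration (country-code suffixes such as .co.uk would need a suffix list), not LARM code.

// Naive sketch: foo.bar.domain.com and www.domain.com both map to "domain.com",
// so all hosts of one domain end up with the same fetcher thread.
public final class HostKey {
    static String threadKey(String host) {
        String h = host.toLowerCase();
        String[] labels = h.split("\\.");
        if (labels.length <= 2) {
            return h;                       // already "domain.com" or shorter
        }
        return labels[labels.length - 2] + "." + labels[labels.length - 1];
    }

    public static void main(String[] args) {
        System.out.println(threadKey("foo.bar.domain.com"));       // domain.com
        System.out.println(threadKey("www.blah.thirddomain.com"));  // thirddomain.com
    }
}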
From: Clemens M. <Cle...@in...> - 2003-06-24 11:53:27
|
> primaryURI is normalized (e.g. hosts lowercased, port included only if != 80, I suppose, the path part of the URL cleaned up, etc.). Maybe add this, although it's pretty obvious.

Sounds reasonable. Some remarks on that one: I would say the URL should still allow for opening the file. This should be the case with the actions you mentioned. It was sometimes not the case with the old LARM. larm-old contained the following normalizations:

1. http://host --> http://host/ (always include a path)
2. + ---> %20 (%20 instead of + for space)
3. %af %aF %Af --> %AF (all escape sequences uppercase)
4. all unsafe chars --> escaped version (see URL RFC)
5. http://Host/ --> http://host/ (host.tolower)
6. http://host/? --> http://host/ (remove empty query)
7. http://host:80/ --> http://host/ (remove default ports)
8. https://host:443/ --> https://host/ (remove default ports)
9. http://host/./ --> http://host/ (remove redundant ./)
10. http://host/path/../ --> http://host/ (resolve ../ - not implemented in larm-old)

In addition, old LARM did the following:

11. http://www.host.com/ --> http://host.com (remove www.)
12. http://www.host.com/index.* --> http://www.host.com/
13. http://www.host.com/default.* --> http://www.host.com/ (remove redundant (?) index.* / default.*)
14. http://host1/ --> http://host2/ (if host1 and host2 are in an alias table)

11-14 may produce false positives, that is, two URLs merged into one that point to different files. Furthermore, the URL may lead to an error page or to a non-existing server (if www. is cut), although most of the time it will work out. And 14 is complicated to handle during the crawl itself, so I would say we should leave this out.

I think now 11-14 should be handled using document fingerprints. However, since we have to avoid that a host with different names is crawled from two different threads, I would say at least that "www." is cut off the keys that are used for identifying the thread responsible for a host.

> > 3. secondaryURIs: Collection: A list of secondary URIs of the document. If the URI is ambiguous (e.g. if a document is represented by more than one URL) this contains a list of all URIs (found) under which this document is accessible.
>
> This can also be null. Obvious, too.

null or an empty collection? I would say the latter, to avoid making this distinction.

> ...
> I'd call this a 'fingerprint'. That term is implementation-agnostic.

fine with me

> > 6. lastChangedDate: Date: The time the last change has occurred as
>
> Something seems to be missing here...

the timestamp in the index. Oh, we need a timestamp in the index...!

> > documentWeight
> What is this going to be used for? Lucene field boosting?

yes. That's very important.

> CHECK_FOR_SERVER_RUNNING sounded bogus to me. What is that method for?

You're right, I just had a feeling that it should be in. I'll take it out.

> Regarding "deferedURL", I'd call the two URLs "initialURL" and "finalURL" or some such.

ok, you're more of a native speaker than me...

> Isn't "lastChanged" date the same as "indexedDate"?

> In the Fetcher paragraph there is a hint about Fetchers polling the FetcherManager. I am not sure what you have in mind, but if that is not such an important feature, I'd remove references to it, in order to keep things simple and avoid over-engineering.

If we follow a "push" model here, it means the FetcherManager prepares lists of URLs for each thread. The thread, when it is ready, should poll for new messages, and go idle when there are no new messages. This avoids synchronization. Then the thread gets a list of CrawlRequests and downloads the files.

Then we can still decide whether it pushes each file along the queue when it gets it or whether it collects some files and pushes them all together. That's constrained by the file sizes and the memory available.

Clemens
|
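A minimal illustrative sketch of a few of the normalizations listed above (rules 1 and 5-8), built on java.net.URL. The class and method names are assumptions for illustration; rules 2-4 and 9-10 (escaping and path resolution) are deliberately omitted here.

import java.net.MalformedURLException;
import java.net.URL;

// Partial sketch: lower-case the host (5), drop default ports (7, 8),
// drop an empty query (6), and make sure the path is at least "/" (1).
public final class UrlNormalizer {
    public static String normalize(String spec) throws MalformedURLException {
        URL url = new URL(spec);
        String protocol = url.getProtocol().toLowerCase();
        String host = url.getHost().toLowerCase();              // rule 5

        int port = url.getPort();
        boolean defaultPort =
            (port == -1)
            || ("http".equals(protocol) && port == 80)           // rule 7
            || ("https".equals(protocol) && port == 443);        // rule 8

        String path = url.getPath();
        if (path == null || path.length() == 0) {
            path = "/";                                          // rule 1
        }

        String query = url.getQuery();                           // rule 6: drop if empty
        StringBuffer out = new StringBuffer(protocol).append("://").append(host);
        if (!defaultPort) {
            out.append(':').append(port);
        }
        out.append(path);
        if (query != null && query.length() > 0) {
            out.append('?').append(query);
        }
        return out.toString();
    }
}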
From: Jeff L. <je...@gr...> - 2003-06-19 05:19:59
|
Thanks for taking a look at it. I'm still not completely up to speed on the LARM architecture yet. I would definitely think that this module would be one of the last ones implemented, since it is not a core module.

Clemens Marschner wrote:
> Thanks for these thoughts, Jeff. Let's see how we can get them together with the current design. Maybe we find out that the design must be changed. So I'll just go through what I have written and ponder whether it fits together.
>
> Ok, let's see. I take the documents in the larm-cvs / doc as a basis.
>
> There you see under Part III in contents.txt the record processors. A "record" is something to be indexed, in this case a document that contains DC metadata that is to be extracted. At the start of the pipeline, the content can be anything from binary PDF to HTML. The record processors are responsible for changing this into a format that the indexer understands - a Lucene document that is already divided into fields which are marked stored, tokenized, indexed. So I suppose the Dublin Core Metadata Indexing (DCMI) can be developed as a record processor.
>
> Record processors may form a pipeline like
>
>             +------------ PDFToText --> TextToField ---------+
>             !                                                !
>      [contentType=pdf]                                       !
>             !                                                !
>     ------->+--[=html]--- HTMLLinkExtractor -> HTMLToField ->+--> ProcessorX --> ...
>             !                                                !
>             +--[=xml]--------------------------------------->+
>
> The output of a processor may be a converted version of the original document and a set of fields to be indexed. I suppose you have to do different things if you have an RDF file in contrast to HTML or other formats, which means different extractors.
>
> RecordProcessors are passed instances of the following classes:
>
> interface IndexRecord
> {
>     // enum Command
>     final static byte CMD_ADD = (byte)'a';
>     final static byte CMD_UPDATE = (byte)'u'; // maybe unnecessary
>     final static byte CMD_DELETE = (byte)'d';
>
>     byte command;            // type: Command
>     URI primaryURI;          // identifier
>     ArrayList secondaryURIs; // an ArrayList<URI>
>     MD5Hash MD5Hash;
>     Date indexedDate;
>     Date lastChangedDate;
>     float documentWeight;
>     String MIMEtype;
>     ArrayList fields;        // an ArrayList<FieldInfo>
> }
>
> Maybe we should add the original document here as well:
> Object record;

I think this is a good idea, insofar as one of the record processors may be interested in working with the original (un-parsed) content, and not everything may end up in the fields.

> interface FieldInfo
> {
>     // enum MethodTypes (?)
>     final static byte MT_INDEX = 0x01;
>     final static byte MT_STORE = 0x02;
>     final static byte MT_TOKENIZE = 0x04;
>
>     // enum FieldType
>     final static byte FT_TEXT = (byte)'t';
>     final static byte FT_DATE = (byte)'d';
>
>     byte methods;     // type: MethodTypes
>     byte type;        // type: FieldType
>     String fieldName;
>     float weight;
>     char[] contents;
> }
>
> Now it is crucial that we define what the input and output of the different record processors look like, since this is not modelled on the Java level but forms an important, non-negligible dependency.
>
> > 1. The Metadata Retriever
> >
> > This retriever can read the Dublin Core metadata from a content element. It will support HTML, XML using the Dublin Core schema, and RDF files using the Dublin Core schema. It will not be responsible for getting the content element or RDF file from its location, but it will extract the relevant metadata from the pages. The retriever will be pluggable to support additional content formats.
>
> These are DC Extractors put into the branches of the pipeline that cope with the different file formats. They produce fields that are saved within the IndexRecord.fields array.

This sounds like what I was thinking. The extractors would be pluggable in the chain. Only this combines both the retriever and the builder.

> > 2. The Metadata Engine
> >
> > The retriever will feed the data to the engine, which is responsible for any validation rules that may be configured for the metadata to prevent spamming the search engine or inappropriate results. In addition, some metadata elements may not be allowed, and they can be removed here. Other metadata elements may only be relevant to a certain subset of URLs, and that filter may be applied here as well.
>
> This would be a processor applied after the format conversion is done, which may alter IndexRecords or delete them from the pipe.

I think so.

> > 3. The Metadata Builder
> >
> > The builder retrieves the metadata from the engine and adds it to the Lucene document as a set of fields. The fields on the document will be mapped to metadata elements using a configuration, or defaults will be used.
>
> This would be the generic Lucene indexer, no need to develop that. Or am I wrong?

This would sit in between the indexer and the metadata, which wouldn't necessarily be added as fields; maybe only some are, or the field names are different. This could also be done as an extractor.

> > Title
> > Creator
> > Subject
> > Description
> > Publisher
> > Contributor
> > Date
> > Type
> > Format
> > Identifier
> > Source
> > Language
> > Relation
> > Coverage
> > Rights
>
> I could imagine extracting some of these fields from the text itself, using linguistic analysis. A primitive example would be "Subject", which could be extracted from HTML title or H1 tags. Language is also a feature that can be detected by comparing the words in a text with lexicons of different languages. I think this will be necessary at some point, since meta tags are only used in very restricted areas (news, medical information, etc.).

I see this as more useful in an intranet environment, where the metadata may already be in a CMS, rather than for a google-like search engine that searches arbitrary data. Adding the linguistic/semantic analysis would be an interesting project to work on.

> Clemens
|
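A small sketch of the kind of configurable element-to-field mapping the Metadata Builder discussion above describes ("mapped to metadata elements using a configuration, or defaults will be used"). The property keys, the "dc." default prefix, and the class name are illustrative assumptions, not an agreed format.

import java.util.Properties;

// Sketch: map a Dublin Core element name to a Lucene field name, falling
// back to a default of "dc." + element when nothing is configured.
public class FieldNameMapper {
    private final Properties mapping;

    public FieldNameMapper(Properties mapping) {
        this.mapping = mapping;
    }

    public String fieldNameFor(String dcElement) {
        String key = dcElement.toLowerCase();
        return mapping.getProperty(key, "dc." + key);
    }

    public static void main(String[] args) {
        Properties p = new Properties();
        p.setProperty("title", "title");     // index DC.Title into the normal title field
        p.setProperty("creator", "author");  // hypothetical mapping
        FieldNameMapper m = new FieldNameMapper(p);
        System.out.println(m.fieldNameFor("Creator"));  // author
        System.out.println(m.fieldNameFor("Rights"));   // dc.rights (default)
    }
}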
From: otisg <ot...@ur...> - 2003-06-17 05:23:31
|
Clemens,

Brief and minor comments.

primaryURI is normalized (e.g. hosts lowercased, port included only if != 80, I suppose, the path part of the URL cleaned up, etc.). Maybe add this, although it's pretty obvious.

> 3. secondaryURIs: Collection: A list of secondary URIs of the document. If the URI is ambiguous (e.g. if a document is represented by more than one URL) this contains a list of all URIs (found) under which this document is accessible.

This can also be null. Obvious, too.

> 4. MD5Hash: MD5Hash: The MD5 hash of the doc. In case of a recrawl this hash will be sent to the gatherer to determine whether the document contents have changed.

I'd call this a 'fingerprint'. That term is implementation-agnostic.

> 6. lastChangedDate: Date: The time the last change has occurred as

Something seems to be missing here...

> 7. documentWeight: float. It is left to the processing pipeline to

What is this going to be used for? Lucene field boosting?

Comments about the Crawler document follow.

CHECK_FOR_SERVER_RUNNING sounded bogus to me. What is that method for?

Regarding "deferedURL", I'd call the two URLs "initialURL" and "finalURL" or some such.

Isn't the "lastChanged" date the same as "indexedDate"?

In the Fetcher paragraph there is a hint about Fetchers polling the FetcherManager. I am not sure what you have in mind, but if that is not such an important feature, I'd remove references to it, in order to keep things simple and avoid over-engineering.

That's all for now.

Otis
|
From: otisg <ot...@ur...> - 2003-06-16 18:48:06
|
> > If the caching stuff is in such core JDK classes, can we really avoid them? How do we connect to a host without Socket and InetAddress? (This may sound basic, I just haven't done enough networking stuff in Java yet, I guess.)
>
> You can avoid the cache by using IP addresses. To avoid the mechanism completely, we'll have to use a different DNS resolver than the one Sun uses, and call Socket only with the resolved addresses.

Aha, ok. I thought the stuff would be looked up and cached internally even if the IPs were used. Got it.

> > I know. I won't be able to mail them soon. Just got back from Vermont last night, have to work 2 more days, then go to Las Vegas, then pack and change continents. If I find 30 minutes of peace, I'll comment. I don't have any major comments, so carry on without worries that you are missing something super important.
>
> Don't you have any REAL reasons?? ;-)
>
> Suppose we'll work in the same timezone then. Didn't you say you would be in Munich for Oktoberfest? That would be very nice.

Yes. I'll email before going to Munich. Will you be there? And now that I have a moment to send my comments, I don't have my notes :(

Otis
|
From: Clemens M. <Cle...@in...> - 2003-06-16 16:39:55
|
> If the caching stuff is in such core JDK classes, can we really avoid them? How do we connect to a host without Socket and InetAddress? (This may sound basic, I just haven't done enough networking stuff in Java yet, I guess.)

You can avoid the cache by using IP addresses. To avoid the mechanism completely, we'll have to use a different DNS resolver than the one Sun uses, and call Socket only with the resolved addresses.

> I know. I won't be able to mail them soon. Just got back from Vermont last night, have to work 2 more days, then go to Las Vegas, then pack and change continents. If I find 30 minutes of peace, I'll comment. I don't have any major comments, so carry on without worries that you are missing something super important.

Don't you have any REAL reasons?? ;-)

Suppose we'll work in the same timezone then. Didn't you say you would be in Munich for Oktoberfest? That would be very nice.

Clemens
|
From: otisg <ot...@ur...> - 2003-06-16 13:43:44
|
> It's in the sources of java.net.Socket and java.net.InetAddress of 1.4.1_02

If the caching stuff is in such core JDK classes, can we really avoid them? How do we connect to a host without Socket and InetAddress? (This may sound basic, I just haven't done enough networking stuff in Java yet, I guess.)

> > (please don't take this comment as a bad criticism, I'm trying to be constructive here :))
>
> No, go ahead. That's why it's important to go public. We all need sparring partners, that's why we're here. I'm still waiting for your comments on the docs. If it's easier for you, write them into the files themselves (I checked the changes in now).

I know. I won't be able to mail them soon. Just got back from Vermont last night, have to work 2 more days, then go to Las Vegas, then pack and change continents. If I find 30 minutes of peace, I'll comment. I don't have any major comments, so carry on without worries that you are missing something super important.

Otis
|
From: Clemens M. <Cle...@in...> - 2003-06-16 10:00:07
|
> Do you think it would be easier and maybe faster if we reported your observations to httpclient-dev and suggested the alternative approach that you described below?

This could be a viable approach. I would also like a method like getResponseData(int maxBytes, int timeout), since getResponseData() may a) produce an indefinitely large byte array and b) may not return at all.

> Also, in which sources are the caches that you mentioned? JDK 1.4.* or HTTPClient?

It's in the sources of java.net.Socket and java.net.InetAddress of 1.4.1_02

> If in HTTPClient, is that in the CVS version or some released version?

I used the beta 1 of HTTPClient 2.0, which was released on May 25.

> If this cache stuff is in JDK 1.4.*, then maybe we should see what 1.5 brings when it comes out. I heard that it should be out this Fall. That may be worth waiting for.

I would like to use the JDK 1.5 features as soon as possible (enums, generics).

> Also, with all the LARM things, I think we should try not to get stuck on 'details' (this is not a detail in the long run, but I think you want to try to put more pieces together before thinking about how to improve individual components, tune them, etc.).

I know, I switch from the broad picture to the details and back. Don't worry.

> (please don't take this comment as a bad criticism, I'm trying to be constructive here :))

No, go ahead. That's why it's important to go public. We all need sparring partners, that's why we're here. I'm still waiting for your comments on the docs. If it's easier for you, write them into the files themselves (I checked the changes in now).

Clemens
|
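A rough sketch of the kind of bounded getResponseData(maxBytes, timeout) asked for above, written against a plain java.net.Socket rather than HTTPClient; the class and method are illustrative assumptions, and note that SO_TIMEOUT only bounds each individual read, not the whole transfer.

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.Socket;

// Sketch: read at most maxBytes from a socket, and let a blocked read fail
// after timeoutMillis, so a slow or huge response cannot stall the fetcher
// thread or exhaust memory.
public class BoundedReader {
    public static byte[] getResponseData(Socket socket, int maxBytes, int timeoutMillis)
            throws IOException {
        socket.setSoTimeout(timeoutMillis);        // read() throws SocketTimeoutException
        InputStream in = socket.getInputStream();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (out.size() < maxBytes) {
            int want = Math.min(buf.length, maxBytes - out.size());
            int read = in.read(buf, 0, want);
            if (read < 0) {
                break;                             // end of stream
            }
            out.write(buf, 0, read);
        }
        return out.toByteArray();
    }
}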
From: otisg <ot...@ur...> - 2003-06-15 23:47:42
|
Clemens,

Thanks for looking through the sources and summarizing things for us.

Do you think it would be easier and maybe faster if we reported your observations to httpclient-dev and suggested the alternative approach that you described below?

Also, in which sources are the caches that you mentioned? JDK 1.4.* or HTTPClient? If in HTTPClient, is that in the CVS version or some released version? If it's in a released version, then maybe we should check the CVS version. I've been on httpclient-dev for a long time, and although I don't actively monitor the list, I seem to recall seeing mentions, or maybe bugs in Bugzilla, related to HTTP requests that use IP addresses instead of host names.

If this cache stuff is in JDK 1.4.*, then maybe we should see what 1.5 brings when it comes out. I heard that it should be out this Fall. That may be worth waiting for.

Also, with all the LARM things, I think we should try not to get stuck on 'details' (this is not a detail in the long run, but I think you want to try to put more pieces together before thinking about how to improve individual components, tune them, etc.).

(please don't take this comment as a bad criticism, I'm trying to be constructive here :))

Otis

---- On Sun, 15 Jun 2003, Clemens Marschner (Cle...@in...) wrote:
> I looked at the source code of Jakarta's HTTPClient for use in the crawler. It seems OK except that it creates a lot of objects on the way until a page is loaded.
>
> The main thing I don't like is that it opens a java.net.Socket using the host name.
>
> In the Socket class this host name is resolved into an IP address using InetAddress.getByName. This method uses a completely awkward caching mechanism that seems likely to become a bottleneck if 100s or 1000s of hosts are in the cache.
>
> getByName calls getAllByName0(), which performs a getCachedAddress(host) lookup. This method first performs a linear (!) scan through the whole cache, builds a Vector of entries that are expired, and then does a second linear scan through that vector to remove these expired entries. And all this is done in a synchronized section. Since this is done as a side effect of each cache lookup, it happens for each connection opened by the crawler.
>
> In short, this won't work. We have to look up the IP address ourselves, using a mechanism that can cope with hundreds of host names without blocking other threads. Then the host name and the IP address have to be provided to the HTTP class. The socket has to be opened via the IP address, and the HTTP header has to contain the host name.
>
> This will end up in a rewrite of the HTTPClient....
>
> If we want to use NIO for this, it will again be a different situation. I suppose we'll have to write the HTTPClient from scratch some day.
>
> Clemens
|
From: Clemens M. <Cle...@in...> - 2003-06-15 23:19:47
|
I looked at the source code of Jakarta's HTTPClient for use in the crawler. It seems OK except that it creates a lot of objects on the way until a page is loaded.

The main thing I don't like is that it opens a java.net.Socket using the host name.

In the Socket class this host name is resolved into an IP address using InetAddress.getByName. This method uses a completely awkward caching mechanism that seems likely to become a bottleneck if 100s or 1000s of hosts are in the cache.

getByName calls getAllByName0(), which performs a getCachedAddress(host) lookup. This method first performs a linear (!) scan through the whole cache, builds a Vector of entries that are expired, and then does a second linear scan through that vector to remove these expired entries. And all this is done in a synchronized section. Since this is done as a side effect of each cache lookup, it happens for each connection opened by the crawler.

In short, this won't work. We have to look up the IP address ourselves, using a mechanism that can cope with hundreds of host names without blocking other threads. Then the host name and the IP address have to be provided to the HTTP class. The socket has to be opened via the IP address, and the HTTP header has to contain the host name.

This will end up in a rewrite of the HTTPClient....

If we want to use NIO for this, it will again be a different situation. I suppose we'll have to write the HTTPClient from scratch some day.

Clemens
|
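A minimal sketch of the approach proposed above: resolve the address separately, open the socket by IP, and still send the original host name in the HTTP/1.1 Host header so name-based virtual hosting keeps working. The class names are illustrative, and the resolve() method is only a placeholder for whatever custom, non-blocking DNS cache LARM would actually use.

import java.io.IOException;
import java.io.OutputStream;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.UnknownHostException;

// Sketch: connect with a pre-resolved InetAddress so no name lookup (and no
// trip through java.net's internal host-name cache) happens per connection.
public class IpConnectingFetcher {

    // Placeholder: in a real crawler this would be a custom, non-blocking resolver/cache.
    public InetAddress resolve(String host) throws UnknownHostException {
        return InetAddress.getByName(host);
    }

    public Socket connect(String host, InetAddress resolved, int port, int timeoutMillis)
            throws IOException {
        Socket socket = new Socket();
        // Connecting to an InetAddress we already hold avoids a lookup here.
        socket.connect(new InetSocketAddress(resolved, port), timeoutMillis);
        OutputStream out = socket.getOutputStream();
        String request = "GET / HTTP/1.1\r\n"
                + "Host: " + host + "\r\n"        // the host name travels in the header
                + "Connection: close\r\n\r\n";
        out.write(request.getBytes("ISO-8859-1"));
        out.flush();
        return socket;
    }
}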
From: Clemens M. <Cle...@in...> - 2003-06-15 10:04:48
|
Thanks for these thoughts, Jeff. Let's see how we can get them together with the current design. Maybe we find out that the design must be changed. So I'll just go through what I have written and ponder whether it fits together.

Ok, let's see. I take the documents in the larm-cvs / doc as a basis.

There you see under Part III in contents.txt the record processors. A "record" is something to be indexed, in this case a document that contains DC metadata that is to be extracted. At the start of the pipeline, the content can be anything from binary PDF to HTML. The record processors are responsible for changing this into a format that the indexer understands - a Lucene document that is already divided into fields which are marked stored, tokenized, indexed. So I suppose the Dublin Core Metadata Indexing (DCMI) can be developed as a record processor.

Record processors may form a pipeline like

            +------------ PDFToText --> TextToField ---------+
            !                                                !
     [contentType=pdf]                                       !
            !                                                !
    ------->+--[=html]--- HTMLLinkExtractor -> HTMLToField ->+--> ProcessorX --> ...
            !                                                !
            +--[=xml]--------------------------------------->+

The output of a processor may be a converted version of the original document and a set of fields to be indexed. I suppose you have to do different things if you have an RDF file in contrast to HTML or other formats, which means different extractors.

RecordProcessors are passed instances of the following classes:

interface IndexRecord
{
    // enum Command
    final static byte CMD_ADD = (byte)'a';
    final static byte CMD_UPDATE = (byte)'u'; // maybe unnecessary
    final static byte CMD_DELETE = (byte)'d';

    byte command;            // type: Command
    URI primaryURI;          // identifier
    ArrayList secondaryURIs; // an ArrayList<URI>
    MD5Hash MD5Hash;
    Date indexedDate;
    Date lastChangedDate;
    float documentWeight;
    String MIMEtype;
    ArrayList fields;        // an ArrayList<FieldInfo>
}

Maybe we should add the original document here as well:

Object record;

interface FieldInfo
{
    // enum MethodTypes (?)
    final static byte MT_INDEX = 0x01;
    final static byte MT_STORE = 0x02;
    final static byte MT_TOKENIZE = 0x04;

    // enum FieldType
    final static byte FT_TEXT = (byte)'t';
    final static byte FT_DATE = (byte)'d';

    byte methods;     // type: MethodTypes
    byte type;        // type: FieldType
    String fieldName;
    float weight;
    char[] contents;
}

Now it is crucial that we define what the input and output of the different record processors look like, since this is not modelled on the Java level but forms an important, non-negligible dependency.

> 1. The Metadata Retriever
>
> This retriever can read the Dublin Core metadata from a content element. It will support HTML, XML using the Dublin Core schema, and RDF files using the Dublin Core schema. It will not be responsible for getting the content element or RDF file from its location, but it will extract the relevant metadata from the pages. The retriever will be pluggable to support additional content formats.

These are DC Extractors put into the branches of the pipeline that cope with the different file formats. They produce fields that are saved within the IndexRecord.fields array.

> 2. The Metadata Engine
>
> The retriever will feed the data to the engine, which is responsible for any validation rules that may be configured for the metadata to prevent spamming the search engine or inappropriate results. In addition, some metadata elements may not be allowed, and they can be removed here. Other metadata elements may only be relevant to a certain subset of URLs, and that filter may be applied here as well.

This would be a processor applied after the format conversion is done, which may alter IndexRecords or delete them from the pipe.

> 3. The Metadata Builder
>
> The builder retrieves the metadata from the engine and adds it to the Lucene document as a set of fields. The fields on the document will be mapped to metadata elements using a configuration, or defaults will be used.

This would be the generic Lucene indexer, no need to develop that. Or am I wrong?

> Title
> Creator
> Subject
> Description
> Publisher
> Contributor
> Date
> Type
> Format
> Identifier
> Source
> Language
> Relation
> Coverage
> Rights

I could imagine extracting some of these fields from the text itself, using linguistic analysis. A primitive example would be "Subject", which could be extracted from HTML title or H1 tags. Language is also a feature that can be detected by comparing the words in a text with lexicons of different languages. I think this will be necessary at some point, since meta tags are only used in very restricted areas (news, medical information, etc.).

Clemens
|
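A rough sketch of what an HTML Dublin Core extractor might look like as one record processor in the pipeline described above. The Field class, the process() signature, and the regex-based meta-tag parsing are illustrative assumptions standing in for the IndexRecord/FieldInfo design in this message; they are not existing LARM code.

import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch: pull Dublin Core elements out of HTML meta tags and
// turn them into simple name/value fields for the indexer.
public class DublinCoreHtmlProcessor {

    public static class Field {
        public final String name;
        public final String value;
        public Field(String name, String value) { this.name = name; this.value = value; }
    }

    // Matches e.g. <meta name="DC.Title" content="LARM design notes">
    private static final Pattern DC_META = Pattern.compile(
        "<meta\\s+name=\"DC\\.([A-Za-z]+)\"\\s+content=\"([^\"]*)\"\\s*/?>",
        Pattern.CASE_INSENSITIVE);

    public ArrayList process(String html) {
        ArrayList fields = new ArrayList();
        Matcher m = DC_META.matcher(html);
        while (m.find()) {
            String element = m.group(1).toLowerCase();  // title, creator, subject, ...
            String content = m.group(2);
            fields.add(new Field("dc." + element, content));
        }
        return fields;
    }
}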
From: Jeff L. <je...@gr...> - 2003-06-14 16:14:22
|
Hi,

This would go under record processing. Let me know what you think about the design.

Jeff

Dublin Core Metadata Indexing

The record processor should be able to optionally handle Dublin Core metadata elements inside the content that is being indexed, or as part of an RDF record that is external to the content. Because these metadata elements are standard, we can use that to add fields to the Lucene Document for each of the metadata elements. This support is entirely optional and can be configured.

1. The Metadata Retriever

This retriever can read the Dublin Core metadata from a content element. It will support HTML, XML using the Dublin Core schema, and RDF files using the Dublin Core schema. It will not be responsible for getting the content element or RDF file from its location, but it will extract the relevant metadata from the pages. The retriever will be pluggable to support additional content formats.

2. The Metadata Engine

The retriever will feed the data to the engine, which is responsible for any validation rules that may be configured for the metadata to prevent spamming the search engine or inappropriate results. In addition, some metadata elements may not be allowed, and they can be removed here. Other metadata elements may only be relevant to a certain subset of URLs, and that filter may be applied here as well.

3. The Metadata Builder

The builder retrieves the metadata from the engine and adds it to the Lucene document as a set of fields. The fields on the document will be mapped to metadata elements using a configuration, or defaults will be used.

4. Dublin Core metadata elements (from http://www.dublincore.org/documents/dces/)

Title
Creator
Subject
Description
Publisher
Contributor
Date
Type
Format
Identifier
Source
Language
Relation
Coverage
Rights

5. References

http://www.dublincore.org/
http://www.dublincore.org/documents/dces/
|
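A small sketch of the Metadata Engine step described above: drop disallowed elements, cap suspiciously long values as a crude anti-spam measure, and keep some elements only for configured URL prefixes. All rule values, names, and the class itself are made-up examples of the idea, not a proposed API.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;

// Sketch of metadata validation: the caller supplies the allowed element
// names, optional per-element URL-prefix restrictions, and a maximum value
// length; filter() returns the metadata that survives those rules.
public class MetadataEngine {
    private final Set allowedElements = new HashSet();    // e.g. "title", "subject", ...
    private final Map urlScopedElements = new HashMap();  // element -> required URL prefix
    private final int maxValueLength;

    public MetadataEngine(Set allowed, Map urlScoped, int maxValueLength) {
        this.allowedElements.addAll(allowed);
        this.urlScopedElements.putAll(urlScoped);
        this.maxValueLength = maxValueLength;
    }

    /** Returns a filtered copy of element -> value metadata for the given document URL. */
    public Map filter(String url, Map metadata) {
        Map result = new HashMap();
        for (Iterator it = metadata.entrySet().iterator(); it.hasNext(); ) {
            Map.Entry e = (Map.Entry) it.next();
            String element = ((String) e.getKey()).toLowerCase();
            String value = (String) e.getValue();
            if (!allowedElements.contains(element)) continue;       // element not allowed
            String requiredPrefix = (String) urlScopedElements.get(element);
            if (requiredPrefix != null && !url.startsWith(requiredPrefix)) continue;
            if (value.length() > maxValueLength) {
                value = value.substring(0, maxValueLength);         // crude anti-spam cap
            }
            result.put(element, value);
        }
        return result;
    }
}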
From: Clemens M. <Cle...@in...> - 2003-06-12 00:21:04
|
larm-cvs is set up now. Happy committing.

--> http://sourceforge.net/mail/?group_id=80648

Clemens
|