Re: URLNormalizer // Re: [larm-dev] Re: [larm-developer] Specs: Indexer
From: otisg <ot...@ur...> - 2003-06-25 01:50:31
> > primaryURI is normalized (e.g. hosts lowercased, port included
> > only if != 80, I suppose, the path part of the URL cleaned up,
> > etc.). Maybe add this, although it's pretty obvious.
>
> Sounds reasonable. Some remarks on that one:
> I would say the URL should still allow for opening the file. This should
> be the case with the actions you mentioned, but it was sometimes not the
> case with the old LARM. larm-old contained the following normalizations:
>
> 1. http://host --> http://host/ (always include a path)
> 2. + --> %20 (%20 instead of + for space)
> 3. %af %aF %Af --> %AF (all escape sequences uppercase)
> 4. all unsafe chars --> escaped version (see the URL RFC)
> 5. http://Host/ --> http://host/ (host.toLowerCase())
> 6. http://host/? --> http://host/ (remove empty query)
> 7. http://host:80/ --> http://host/ (remove default port)
> 8. https://host:443/ --> https://host/ (remove default port)
> 9. http://host/./ --> http://host/ (remove redundant ./)
> 10. http://host/path/../ --> http://host/ (resolve ../ - not implemented
>     in larm-old)
>
> In addition, old-LARM did the following:
>
> 11. http://www.host.com/ --> http://host.com/ (remove www.)
> 12. http://www.host.com/index.* --> http://www.host.com/
> 13. http://www.host.com/default.* --> http://www.host.com/
>     (remove redundant (?) index.* / default.*)
> 14. http://host1/ --> http://host2/ (if host1 and host2 are in an alias
>     table)
>
> 11-14 may produce false positives, that is, two URLs merged into one even
> though they point to different files. Furthermore, the URL may lead to an
> error page or to a non-existing server (if www. is cut), although most of
> the time it will work out. And 14. is complicated to handle during the
> crawl itself, so I would say we should leave this out.
>
> I think 11-14 should now be handled using document fingerprints. However,
> since we have to avoid a host with different names being crawled from two
> different threads, I would say at least that "www." is cut off the keys
> that are used for identifying the thread responsible for a host.

I agree. I'm pro 1-10 and against 11-14, because the latter are based on
assumptions that will not always yield correct results.

As for using the domain name as the key when assigning hosts to fetcher
threads, I agree. Note that I said 'using the domain name'. I think that
may be the easiest thing to do, and sufficient for this purpose. Examples:

foo.bar.domain.com       -> domain.com
a.b.otherdomain.com      -> otherdomain.com
www.blah.thirddomain.com -> thirddomain.com
www.fourthdomain.com     -> fourthdomain.com
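Just to make 1-10 above concrete, here is a rough sketch of what such a
normalizer could look like in Java. All names are made up, rule 4 (escaping
the remaining unsafe chars per the URL RFC) is only stubbed out, and corner
cases are ignored:

import java.net.MalformedURLException;
import java.net.URL;

public class URLNormalizer {

    public static String normalize(String spec) throws MalformedURLException {
        URL url = new URL(spec);

        String scheme = url.getProtocol().toLowerCase();
        String host = url.getHost().toLowerCase();       // 5. Host -> host
        int port = url.getPort();
        if (port == url.getDefaultPort()) {              // 7./8. drop :80, :443
            port = -1;
        }

        String path = url.getPath();
        if (path.length() == 0) {                        // 1. always include a path
            path = "/";
        }
        while (path.indexOf("/./") != -1) {              // 9. remove redundant ./
            path = path.replaceAll("/\\./", "/");
        }
        path = resolveParentDirs(path);                  // 10. resolve ../
        path = path.replaceAll("\\+", "%20");            // 2. + --> %20
        path = upperCaseEscapes(path);                   // 3. %af --> %AF
        // 4. escaping of remaining unsafe chars omitted in this sketch

        String query = url.getQuery();                   // 6. empty query is dropped
        StringBuffer buf = new StringBuffer(scheme).append("://").append(host);
        if (port != -1) {
            buf.append(':').append(port);
        }
        buf.append(path);
        if (query != null && query.length() > 0) {
            buf.append('?').append(query);
        }
        return buf.toString();
    }

    // collapse "/a/b/../c" into "/a/c"
    private static String resolveParentDirs(String path) {
        int i;
        while ((i = path.indexOf("/../")) != -1) {
            int j = path.lastIndexOf('/', i - 1);
            if (j < 0) {
                break;                    // "../" above the root -- leave it alone
            }
            path = path.substring(0, j) + path.substring(i + 3);
        }
        return path;
    }

    // uppercase all %xx escape sequences
    private static String upperCaseEscapes(String path) {
        StringBuffer buf = new StringBuffer(path);
        for (int i = 0; i + 2 < buf.length(); i++) {
            if (buf.charAt(i) == '%') {
                buf.setCharAt(i + 1, Character.toUpperCase(buf.charAt(i + 1)));
                buf.setCharAt(i + 2, Character.toUpperCase(buf.charAt(i + 2)));
            }
        }
        return buf.toString();
    }
}

java.net.URL takes care of the parsing, so rules 1-10 reduce to a handful
of string fixups.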
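And the 'domain name as key' part could then be as simple as taking the
last two labels of the host name. Note that this naive version is wrong for
two-level country domains like .co.uk; those would need an exception list,
which I leave out here:

public class HostKey {

    // "foo.bar.domain.com" -> "domain.com"; hypothetical helper, not in the spec
    public static String domainKey(String host) {
        host = host.toLowerCase();
        int last = host.lastIndexOf('.');
        if (last <= 0) {
            return host;                              // e.g. "localhost"
        }
        int secondLast = host.lastIndexOf('.', last - 1);
        return (secondLast == -1) ? host : host.substring(secondLast + 1);
    }

    public static void main(String[] args) {
        System.out.println(domainKey("foo.bar.domain.com"));       // domain.com
        System.out.println(domainKey("www.blah.thirddomain.com")); // thirddomain.com
        System.out.println(domainKey("www.fourthdomain.com"));     // fourthdomain.com
    }
}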
> > > 3. secondaryURIs: Collection: A list of secondary URIs of the
> > > document. If the URI is ambiguous (e.g. if a document is represented
> > > by more than one URL) this contains a list of all URIs (found) under
> > > which this document is accessible.
> >
> > This can also be null. Obvious, too.
>
> null or an empty collection? I would say the latter, to avoid making this
> distinction.

Empty is fine.

> > > 6. lastChangedDate: Date: The time the last change has occurred as
> >
> > Something seems to be missing here...
>
> the timestamp in the index. Oh, we need a timestamp in the index...!

Yes, we do. I thought you had already mentioned that somewhere. It doesn't
matter; we need both the time of the last fetch and the time of the last
change. We can look for changed fingerprints to determine whether a page
changed, but this will produce false positives when pages include things
like counters or the current date/time. To get around that we could try
using Nilsimsa (http://www.google.com/search?q=nilsimsa).

We need both of these dates, I think, in order to adjust the fetching
frequency dynamically, based on the frequency at which each page changes.
Pages that change less frequently can also be crawled less frequently.

> > Isn't "lastChanged" date the same as "indexedDate"?

It is the same thing, isn't it? If the page is not changed, we don't
re-index it, so indexedDate will always be lastChangedDate, no?

> > In the Fetcher paragraph there is a hint about Fetchers polling
> > the FetcherManager. I am not sure what you have in mind, but if
> > that is not such an important feature, I'd remove references to
> > it, in order to keep things simple and avoid over-engineering.
>
> If we follow a "push" model here, it means the FetcherManager prepares
> lists of URLs for each thread. The thread, when it is ready, should poll
> for new messages and go idle when there are no new messages. This avoids
> synchronization. Then the thread gets a list of CrawlRequests and
> downloads the files.

OK.

> Then we can still decide whether it pushes each file along the queue when
> it gets it, or whether it collects some files and pushes them all
> together. That's constrained by the file sizes and the memory available.

I don't know the answer yet; it's too early for me to say. I am leaning
towards dealing with pages in batches. I have a feeling it may be more
efficient. I also remember reading about a crawler implementation (Uni. of
Indiana, I believe) whose authors mentioned that they implemented this kind
of thing in batches for efficiency reasons.

Otis
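P.S.: Just so we are talking about the same thing, here is roughly how I
picture the poll + batch variant on the fetcher side. Every class and
method name below is invented for illustration; none of this is in the
spec:

import java.util.ArrayList;
import java.util.List;

// Invented collaborators, for illustration only.
interface FetcherManager {
    List nextRequests();           // blocks while idle, returns null on shutdown
    void pushBatch(List documents);
}

class CrawlRequest {
    String url;
    CrawlRequest(String url) { this.url = url; }
}

public class FetcherThread implements Runnable {

    private static final int BATCH_SIZE = 50; // bounded by memory and file sizes
    private final FetcherManager manager;

    public FetcherThread(FetcherManager manager) {
        this.manager = manager;
    }

    public void run() {
        List batch = new ArrayList();
        List requests;
        // poll the manager for the next list of CrawlRequests; no shared
        // URL queue, hence no synchronization needed in this loop
        while ((requests = manager.nextRequests()) != null) {
            for (int i = 0; i < requests.size(); i++) {
                CrawlRequest request = (CrawlRequest) requests.get(i);
                batch.add(fetch(request));
                if (batch.size() >= BATCH_SIZE) {
                    manager.pushBatch(batch); // push whole batches down the queue
                    batch = new ArrayList();
                }
            }
        }
        if (!batch.isEmpty()) {
            manager.pushBatch(batch);         // flush the last partial batch
        }
    }

    private Object fetch(CrawlRequest request) {
        // download request.url and return the document -- omitted here
        return request.url;
    }
}

The manager hands out lists, so the only shared points are nextRequests()
and pushBatch(), and the threads never touch a common URL queue directly.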