Re: URLNormalizer // Re: [larm-dev] Re: [larm-developer] Specs: Indexer
From: otisg <ot...@ur...> - 2003-06-25 01:50:31
> > primaryURI is normalized (e.g. hosts lowercased, port included
> > only if != 80, I suppose, the path part of the URL cleaned up,
> > etc.). Maybe add this, although it's pretty obvious.
>
> Sounds reasonable. Some remarks on that one:
> I would say the URL should still allow for opening the file. This should
> be the case with the actions you mentioned, but it was sometimes not the
> case with the old LARM. larm-old contained the following normalizations:
>
> 1. http://host --> http://host/ (always include a path)
> 2. + --> %20 (%20 instead of + for space)
> 3. %af %aF %Af --> %AF (all escape sequences uppercase)
> 4. all unsafe chars --> escaped version (see the URL RFC)
> 5. http://Host/ --> http://host/ (host.toLowerCase())
> 6. http://host/? --> http://host/ (remove empty query)
> 7. http://host:80/ --> http://host/ (remove default port)
> 8. https://host:443/ --> https://host/ (remove default port)
> 9. http://host/./ --> http://host/ (remove redundant ./)
> 10. http://host/path/../ --> http://host/ (resolve ../ - not implemented
>     in larm-old)
>
> In addition, old-LARM did the following:
>
> 11. http://www.host.com/ --> http://host.com/ (remove www.)
> 12. http://www.host.com/index.* --> http://www.host.com/
> 13. http://www.host.com/default.* --> http://www.host.com/
>     (remove redundant (?) index.* / default.*)
> 14. http://host1/ --> http://host2/ (if host1 and host2 are in an alias
>     table)
>
> 11-14 may produce false positives, that is, two URLs merged into one even
> though they point to different files. Furthermore, the URL may lead to an
> error page or to a non-existing server (if www. is cut), although most of
> the time it will work out. And 14. is complicated to handle during the
> crawl itself, so I would say we should leave this out.
>
> I think 11-14 should now be handled using document fingerprints. However,
> since we have to avoid a host with different names being crawled from two
> different threads, I would say at least that "www." is cut off the keys
> that are used for identifying the thread responsible for a host.

I agree. I'm pro 1-10 and against 11-14, because the latter are based on
assumptions that will not always yield correct results.

As for using the domain name as the key when assigning hosts to fetcher
threads, I agree. Note that I said 'using the domain name'. I think that
may be the easiest thing to do, and sufficient for this purpose. Examples:

foo.bar.domain.com       -> domain.com
a.b.otherdomain.com      -> otherdomain.com
www.blah.thirddomain.com -> thirddomain.com
www.fourthdomain.com     -> fourthdomain.com
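Just to make 1-10 above concrete, here is a rough sketch of what such a
normalizer could look like in Java. All names are made up, rule 4 (escaping
the remaining unsafe chars per the URL RFC) is only stubbed out, and corner
cases are ignored:

import java.net.MalformedURLException;
import java.net.URL;

public class URLNormalizer {

    public static String normalize(String spec) throws MalformedURLException {
        URL url = new URL(spec);

        String scheme = url.getProtocol().toLowerCase();
        String host = url.getHost().toLowerCase();       // 5. Host -> host
        int port = url.getPort();
        if (port == url.getDefaultPort()) {              // 7./8. drop :80, :443
            port = -1;
        }

        String path = url.getPath();
        if (path.length() == 0) {                        // 1. always include a path
            path = "/";
        }
        while (path.indexOf("/./") != -1) {              // 9. remove redundant ./
            path = path.replaceAll("/\\./", "/");
        }
        path = resolveParentDirs(path);                  // 10. resolve ../
        path = path.replaceAll("\\+", "%20");            // 2. + --> %20
        path = upperCaseEscapes(path);                   // 3. %af --> %AF
        // 4. escaping of remaining unsafe chars omitted in this sketch

        String query = url.getQuery();                   // 6. empty query is dropped
        StringBuffer buf = new StringBuffer(scheme).append("://").append(host);
        if (port != -1) {
            buf.append(':').append(port);
        }
        buf.append(path);
        if (query != null && query.length() > 0) {
            buf.append('?').append(query);
        }
        return buf.toString();
    }

    // collapse "/a/b/../c" into "/a/c"
    private static String resolveParentDirs(String path) {
        int i;
        while ((i = path.indexOf("/../")) != -1) {
            int j = path.lastIndexOf('/', i - 1);
            if (j < 0) {
                break;                    // "../" above the root -- leave it alone
            }
            path = path.substring(0, j) + path.substring(i + 3);
        }
        return path;
    }

    // uppercase all %xx escape sequences
    private static String upperCaseEscapes(String path) {
        StringBuffer buf = new StringBuffer(path);
        for (int i = 0; i + 2 < buf.length(); i++) {
            if (buf.charAt(i) == '%') {
                buf.setCharAt(i + 1, Character.toUpperCase(buf.charAt(i + 1)));
                buf.setCharAt(i + 2, Character.toUpperCase(buf.charAt(i + 2)));
            }
        }
        return buf.toString();
    }
}

java.net.URL takes care of the parsing, so rules 1-10 reduce to a handful
of string fixups.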
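And the 'domain name as key' part could then be as simple as taking the
last two labels of the host name. Note that this naive version is wrong for
two-level country domains like .co.uk; those would need an exception list,
which I leave out here:

public class HostKey {

    // "foo.bar.domain.com" -> "domain.com"; hypothetical helper, not in the spec
    public static String domainKey(String host) {
        host = host.toLowerCase();
        int last = host.lastIndexOf('.');
        if (last <= 0) {
            return host;                              // e.g. "localhost"
        }
        int secondLast = host.lastIndexOf('.', last - 1);
        return (secondLast == -1) ? host : host.substring(secondLast + 1);
    }

    public static void main(String[] args) {
        System.out.println(domainKey("foo.bar.domain.com"));       // domain.com
        System.out.println(domainKey("www.blah.thirddomain.com")); // thirddomain.com
        System.out.println(domainKey("www.fourthdomain.com"));     // fourthdomain.com
    }
}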
> > > 3. secondaryURIs: Collection: A list of secondary URIs of the
> > > document. If the URI is ambiguous (e.g. if a document is represented
> > > by more than one URL) this contains a list of all URIs (found) under
> > > which this document is accessible.
> >
> > This can also be null. Obvious, too.
>
> null or an empty collection? I would say the latter, to avoid making this
> distinction.

Empty is fine.

> > > 6. lastChangedDate: Date: The time the last change has occurred as
> >
> > Something seems to be missing here...
>
> the timestamp in the index. Oh, we need a timestamp in the index...!

Yes, we do. I thought you had already mentioned that somewhere. It doesn't
matter; we need both the time of the last fetch and the time of the last
change. We can look for changed fingerprints to determine whether a page
changed, but this will produce false positives when pages include things
like counters or the current date/time. To get around that we could try
using Nilsimsa (http://www.google.com/search?q=nilsimsa).

We need both of these dates, I think, in order to adjust the fetching
frequency dynamically, based on the frequency at which each page changes.
Pages that change less frequently can also be crawled less frequently.

> > Isn't "lastChanged" date the same as "indexedDate"?

It is the same thing, isn't it? If the page is not changed, we don't
re-index it, so indexedDate will always be lastChangedDate, no?

> > In the Fetcher paragraph there is a hint about Fetchers polling
> > the FetcherManager. I am not sure what you have in mind, but if
> > that is not such an important feature, I'd remove references to
> > it, in order to keep things simple and avoid over-engineering.
>
> If we follow a "push" model here, it means the FetcherManager prepares
> lists of URLs for each thread. The thread, when it is ready, should poll
> for new messages and go idle when there are no new messages. This avoids
> synchronization. Then the thread gets a list of CrawlRequests and
> downloads the files.

OK.

> Then we can still decide whether it pushes each file along the queue when
> it gets it, or whether it collects some files and pushes them all
> together. That's constrained by the file sizes and the memory available.

I don't know the answer yet; it's too early for me to say. I am leaning
towards dealing with pages in batches. I have a feeling it may be more
efficient. I also remember reading about a crawler implementation (Uni. of
Indiana, I believe) whose authors mentioned that they implemented this kind
of thing in batches for efficiency reasons.

Otis
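P.S.: Just so we are talking about the same thing, here is roughly how I
picture the poll + batch variant on the fetcher side. Every class and
method name below is invented for illustration; none of this is in the
spec:

import java.util.ArrayList;
import java.util.List;

// Invented collaborators, for illustration only.
interface FetcherManager {
    List nextRequests();           // blocks while idle, returns null on shutdown
    void pushBatch(List documents);
}

class CrawlRequest {
    String url;
    CrawlRequest(String url) { this.url = url; }
}

public class FetcherThread implements Runnable {

    private static final int BATCH_SIZE = 50; // bounded by memory and file sizes
    private final FetcherManager manager;

    public FetcherThread(FetcherManager manager) {
        this.manager = manager;
    }

    public void run() {
        List batch = new ArrayList();
        List requests;
        // poll the manager for the next list of CrawlRequests; no shared
        // URL queue, hence no synchronization needed in this loop
        while ((requests = manager.nextRequests()) != null) {
            for (int i = 0; i < requests.size(); i++) {
                CrawlRequest request = (CrawlRequest) requests.get(i);
                batch.add(fetch(request));
                if (batch.size() >= BATCH_SIZE) {
                    manager.pushBatch(batch); // push whole batches down the queue
                    batch = new ArrayList();
                }
            }
        }
        if (!batch.isEmpty()) {
            manager.pushBatch(batch);         // flush the last partial batch
        }
    }

    private Object fetch(CrawlRequest request) {
        // download request.url and return the document -- omitted here
        return request.url;
    }
}

The manager hands out lists, so the only shared points are nextRequests()
and pushBatch(), and the threads never touch a common URL queue directly.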