URLNormalizer // Re: [larm-dev] Re: [larm-developer] Specs: Indexer
From: Clemens M. <Cle...@in...> - 2003-06-24 11:53:27
> primaryURI is normalized (e.g. hosts lowercased, port included
> only if != 80, I suppose, the path part of the URL cleaned up,
> etc.). Maybe add this, although it's pretty obvious.

Sounds reasonable. Some remarks on that one: the normalized URL should
still allow for opening the file. This should be the case with the
actions you mentioned; it was sometimes not the case with the old LARM.

larm-old contained the following normalizations:

 1. http://host --> http://host/ (always include a path)
 2. + --> %20 (%20 instead of + for space)
 3. %af %aF %Af --> %AF (all escape sequences uppercase)
 4. all unsafe chars --> escaped version (see URL RFC)
 5. http://Host/ --> http://host/ (host.toLowerCase)
 6. http://host/? --> http://host/ (remove empty query)
 7. http://host:80/ --> http://host/ (remove default ports)
 8. https://host:443/ --> https://host/ (remove default ports)
 9. http://host/./ --> http://host/ (remove redundant ./)
10. http://host/path/../ --> http://host/ (resolve ../ - not implemented in larm-old)

In addition, old LARM did the following:

11. http://www.host.com/ --> http://host.com/ (remove www.)
12. http://www.host.com/index.* --> http://www.host.com/
13. http://www.host.com/default.* --> http://www.host.com/
    (remove redundant (?) index.* / default.*)
14. http://host1/ --> http://host2/ (if host1 and host2 are in an alias table)

11-14 may produce false positives, that is, two URLs that point to
different files get merged into one. Furthermore, the resulting URL may
lead to an error page or to a non-existent server (if "www." is cut
off), although most of the time it will work out. Rule 14 is also
complicated to handle during the crawl itself, so I would say we should
leave it out. I now think 11-14 should be handled using document
fingerprints. However, since we have to avoid that a host with
different names is crawled from two different threads, I would say at
least that "www." is cut off the keys that are used for identifying the
thread responsible for a host.
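For illustration only -- this is a sketch, not code from larm-old or the
current URLNormalizer, and the class name is made up -- rules 1, 3 and
5-8 could be expressed in Java roughly like this, using java.net.URL.
Rules 2, 4, 9 and 10 would need additional path and query rewriting and
are left out.

import java.net.MalformedURLException;
import java.net.URL;

public class UrlNormalizerSketch {

    public static String normalize(String spec) throws MalformedURLException {
        URL url = new URL(spec);

        String scheme = url.getProtocol().toLowerCase();
        String host   = url.getHost().toLowerCase();   // rule 5: host.toLowerCase
        int    port   = url.getPort();

        // rules 7/8: drop the default port (80 for http, 443 for https)
        if (port == url.getDefaultPort()) {
            port = -1;
        }

        // rule 1: always include a path
        String path = url.getPath();
        if (path == null || path.length() == 0) {
            path = "/";
        }

        // rule 3: uppercase all escape sequences, e.g. %af -> %AF
        path = uppercaseEscapes(path);

        // rule 6: remove an empty query ("http://host/?" -> "http://host/")
        String query = url.getQuery();
        if (query != null && query.length() == 0) {
            query = null;
        } else if (query != null) {
            query = uppercaseEscapes(query);
        }

        StringBuffer out = new StringBuffer(scheme).append("://").append(host);
        if (port != -1) {
            out.append(':').append(port);
        }
        out.append(path);
        if (query != null) {
            out.append('?').append(query);
        }
        return out.toString();
    }

    // uppercase the two hex digits after every '%'
    private static String uppercaseEscapes(String s) {
        StringBuffer out = new StringBuffer(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            out.append(c);
            if (c == '%' && i + 2 < s.length()) {
                out.append(Character.toUpperCase(s.charAt(i + 1)));
                out.append(Character.toUpperCase(s.charAt(i + 2)));
                i += 2;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws MalformedURLException {
        System.out.println(normalize("HTTP://Host.example.com:80"));    // http://host.example.com/
        System.out.println(normalize("http://host.example.com/a%af?")); // http://host.example.com/a%AF
    }
}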
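Also just as a sketch of the idea, not of any existing LARM class: a
content digest could serve as the document fingerprint that replaces
rules 11-13, and a small helper could strip "www." from the key that
maps a host to its fetcher thread. All names here are hypothetical.

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class DuplicateKeysSketch {

    // fingerprint of the downloaded document bytes; equal fingerprints
    // mean two URLs very likely point to the same file
    public static String fingerprint(byte[] content) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("MD5").digest(content);
        StringBuffer hex = new StringBuffer();
        for (int i = 0; i < digest.length; i++) {
            int b = digest[i] & 0xff;
            if (b < 0x10) {
                hex.append('0');
            }
            hex.append(Integer.toHexString(b));
        }
        return hex.toString();
    }

    // key used to assign a host to a fetcher thread; only "www." is
    // stripped, the URL itself is left untouched
    public static String hostKey(String host) {
        String h = host.toLowerCase();
        return h.startsWith("www.") ? h.substring(4) : h;
    }
}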
> > 3. secondaryURIs: Collection: A list of secondary URIs of the document.
> > If the URI is ambiguous (e.g. if a document is represented by more than
> > one URL) this contains a list of all URIs (found) under which this
> > document is accessible.
>
> This can also be null. Obvious, too.

null or an empty collection? I would say the latter, to avoid having to
make this distinction.

>...
> I'd call this a 'fingerprint'. That term is implementation-agnostic.

Fine with me.

> > 6. lastChangedDate: Date: The time the last change has occurred as
>
> Something seems to be missing here...

...the timestamp in the index. Oh, we need a timestamp in the index...!

> > documentWeight
>
> What is this going to be used for? Lucene field boosting?

Yes. That's very important.

> CHECK_FOR_SERVER_RUNNING sounded bogus to me. What is that method for?

You're right, I just had a feeling that it should be in. I'll take it out.

> Regarding "deferedURL", I'd call the two URLs "initialURL" and
> "finalURL" or some such.

OK, you're more of a native speaker than I am...

> Isn't "lastChanged" date the same as "indexedDate"?

> In the Fetcher paragraph there is a hint about Fetchers polling the
> FetcherManager. I am not sure what you have in mind, but if that is not
> such an important feature, I'd remove references to it, in order to keep
> things simple and avoid over-engineering.

If we follow a "push" model here, it means the FetcherManager prepares
lists of URLs for each thread. The thread, when it is ready, should poll
for new messages and go idle when there are no new messages. This avoids
synchronization. The thread then gets a list of CrawlRequests and
downloads the files. We can still decide whether it pushes each file
along the queue as soon as it gets it, or whether it collects some files
and pushes them all together. That is constrained by the file sizes and
the memory available.

Clemens
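To make the poll model above concrete, here is a rough sketch under my
own assumptions; CrawlRequest, Mailbox, FetcherThread and
FetcherPollSketch are made-up names, not part of the spec. The
FetcherManager side puts a batch of requests into a per-thread mailbox,
and the fetcher thread polls that mailbox, sleeping while it is empty.

import java.util.LinkedList;
import java.util.List;

class CrawlRequest {
    final String url;
    CrawlRequest(String url) { this.url = url; }
}

// One mailbox per fetcher thread: the FetcherManager puts batches in,
// the owning thread takes them out.
class Mailbox {
    private final LinkedList batches = new LinkedList();

    synchronized void put(List batch) { batches.addLast(batch); }

    // non-blocking poll: null means "nothing to do, go idle"
    synchronized List poll() {
        return batches.isEmpty() ? null : (List) batches.removeFirst();
    }
}

class FetcherThread extends Thread {
    private final Mailbox mailbox;
    FetcherThread(Mailbox mailbox) { this.mailbox = mailbox; }

    public void run() {
        while (!isInterrupted()) {
            List batch = mailbox.poll();
            if (batch == null) {
                // idle until the manager hands out new work
                try { Thread.sleep(500); } catch (InterruptedException e) { return; }
                continue;
            }
            for (int i = 0; i < batch.size(); i++) {
                CrawlRequest req = (CrawlRequest) batch.get(i);
                // download req.url here; results can be pushed along the
                // queue one by one or collected and pushed as a larger
                // batch, depending on file sizes and available memory
                System.out.println("fetching " + req.url);
            }
        }
    }
}

public class FetcherPollSketch {
    public static void main(String[] args) throws InterruptedException {
        Mailbox box = new Mailbox();
        FetcherThread fetcher = new FetcherThread(box);
        fetcher.setDaemon(true);
        fetcher.start();

        // FetcherManager side: prepare a list of CrawlRequests for this thread
        List batch = new LinkedList();
        batch.add(new CrawlRequest("http://host.example.com/"));
        box.put(batch);

        Thread.sleep(1000); // give the fetcher a moment before the JVM exits
    }
}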