#11 Two sites on same physical machine considered equal

open
nobody
None
5
2004-08-18
2004-08-18
No

If I have two different sites on the same physical machine,
j-spider considers them the same site. In my particular case,
I tried to spider the site http://www.ddc-deutsch.de/, which
links to http://www.ddb.de/, to search for broken links. Both
sites run on the same Apache server. I used the default
checkError configuration, which should prevent
http://www.ddb.de from being spidered, since it is an external
site that should never be parsed. However, it was.
The problem is that j-spider compares sites using
java.net.URL.equals(), which is incompatible with virtual
hosting: it treats two hosts as equivalent if their names
resolve to the same IP address (documented behaviour, cf.
http://java.sun.com/j2se/1.4.2/docs/api/java/net/URL.html#equals(java.lang.Object) ).
This makes URLs unsuitable as keys in maps, which is how they
are used e.g. in net.javacoding.jspider.storage.memory.SiteDAOImpl.
The solution here is to use something else as the key, e.g. the
String representation of the URL as returned by
java.net.URL.toString().
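A minimal sketch of the String-keyed map idea described above. It is not the attached patch: the class SiteRegistry, its methods, and the use of plain String values in place of site objects are hypothetical and only illustrate the technique.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a DAO that stores sites by URL.
public class SiteRegistry {
    // Keyed by the URL's string form, so map lookups never call
    // URL.equals() or URL.hashCode() and never resolve host names.
    private final Map<String, String> sitesByUrl = new HashMap<>();

    // The map key is simply java.net.URL.toString().
    public static String key(URL url) {
        return url.toString();
    }

    public void register(URL url, String site) {
        sitesByUrl.put(key(url), site);
    }

    public String find(URL url) {
        return sitesByUrl.get(key(url));
    }

    public int size() {
        return sitesByUrl.size();
    }

    // Helper that parses a URL without forcing callers to handle
    // the checked MalformedURLException.
    public static URL url(String spec) {
        try {
            return URI.create(spec).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        SiteRegistry registry = new SiteRegistry();
        registry.register(url("http://www.ddc-deutsch.de/"), "base");
        registry.register(url("http://www.ddb.de/"), "external");
        // Two entries are kept even if both hosts share one IP address.
        System.out.println(registry.size());
    }
}
```

With URL objects as keys, the two entries could collapse into one whenever both host names resolve to the same address. With String keys they stay separate regardless of DNS.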
The other place where this matters is when registering new
sites in the SpiderContext. In the current implementation of
net.javacoding.jspider.core.impl.SpiderContextImpl.registerNewSite(),
the new site is marked as the base site if its URL equals the
current baseURL according to java.net.URL.equals(), which is
likewise incompatible with virtual hosting. Here, the solution
would be to compare the string representations of the URLs
rather than the URL objects themselves.
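The string comparison suggested above could look roughly like this. The class BaseSiteCheck and the method isBaseSite are hypothetical names chosen for illustration. They are not taken from SpiderContextImpl or from the attached patch.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

// Hypothetical helper showing a DNS-free base-site check.
public class BaseSiteCheck {
    // Compares the textual forms of the URLs instead of calling
    // URL.equals(), so no host name is ever resolved.
    public static boolean isBaseSite(URL candidate, URL baseURL) {
        return candidate.toString().equals(baseURL.toString());
    }

    // Helper that parses a URL without a checked exception.
    public static URL url(String spec) {
        try {
            return URI.create(spec).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        URL base = url("http://www.ddc-deutsch.de/");
        System.out.println(isBaseSite(url("http://www.ddc-deutsch.de/"), base));
        System.out.println(isBaseSite(url("http://www.ddb.de/"), base));
    }
}
```

Note that a plain string comparison is stricter than it may need to be: it is case-sensitive and treats http://www.ddb.de and http://www.ddb.de/ as different, so some normalisation may be wanted before comparing.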
I attach two patches, one for SiteDAOImpl and one for
SpiderContextImpl, which solved the above problems for me. I
did not have time to check whether there are other places
where URLs are compared.

Discussion

  • Lars G. Svensson

    Logged In: YES
    user_id=1030003

    While testing, I noticed that
    net.javacoding.jspider.core.storage.memory.ResourceDAOImpl
    also uses java.net.URL objects as keys in maps, so here is
    another patch.

     
  • merlin lain

    merlin lain - 2004-10-12

    Logged In: YES
    user_id=659928

    I also ran into this bug with jspider-src-0.5.0-dev.
    When I start downloading "http://quova.com", two download
    folders are created, "quova.com" and "www.quova.com", and
    most of their content is the same.

     
