If I have two different sites on the same physical
machine, j-spider considers them to be the same site. In my
particular case, I tried to spider the site
http://www.ddc-deutsch.de/, which refers to
http://www.ddb.de/, searching for broken links. Both
sites run on the same Apache server. I used the default
checkError configuration, which should prevent
http://www.ddb.de from being spidered, since it is an
external site that should never be parsed. However, it
was spidered anyway.
The problem is that j-spider compares sites using
java.net.URL.equals(), which is incompatible with virtual
hosting: two URLs are considered equal iff they resolve
to the same IP address (a behavior known to Sun, cf.
http://java.sun.com/j2se/1.4.).
This, however, makes URLs unsuitable for use as keys in
maps, which is the case e.g. in
net.javacoding.jspider.storage.memory.SiteDAOImpl. The
solution here is to use something else as the key, e.g.
the String representation of the URL as returned by
java.net.URL.toString().
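Not the actual patch, just a minimal sketch of the idea (the class and method names below are hypothetical, not JSpider's API): keying a map by the URL's String form instead of the URL object itself, so that lookups never go through URL.equals()/hashCode() and their host-name resolution.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.HashMap;
import java.util.Map;

public class SiteKeyDemo {
    // Hypothetical stand-in for the site lookup a DAO like SiteDAOImpl keeps.
    static final Map<String, String> sitesByUrl = new HashMap<>();

    // Key the map by the URL's String form; unlike URL.equals()/hashCode(),
    // String comparison never resolves host names to IP addresses.
    static void registerSite(String url, String siteName) {
        sitesByUrl.put(url, siteName);
    }

    static String findSite(String url) {
        return sitesByUrl.get(url);
    }

    public static void main(String[] args) throws MalformedURLException {
        URL a = new URL("http://www.ddc-deutsch.de/");
        URL b = new URL("http://www.ddb.de/");
        registerSite(a.toString(), "DDC");
        registerSite(b.toString(), "DDB");
        // Even if both virtual hosts resolve to the same IP address,
        // the string keys keep the two sites distinct.
        System.out.println(findSite(a.toString())); // DDC
        System.out.println(findSite(b.toString())); // DDB
    }
}
```

Note that constructing a URL does not trigger DNS resolution; only equals() and hashCode() do, which is exactly why they are unsafe as map keys here.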
The other place where this matters is when registering
new sites in the SpiderContext. In the current
implementation of
net.javacoding.jspider.core.impl.SpiderContextImpl.registerNewSite(),
the new site is set as the base site if the new site's
URL equals the current baseURL according to
java.net.URL.equals(), which is likewise incompatible
with virtual hosting. Here, the solution is not to
compare the URLs for equality but their string
representations.
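Again a sketch rather than the patch itself (class and method names are my own, not SpiderContextImpl's): the base-site check compares the URLs' external forms as strings, so two virtual hosts on the same IP are no longer conflated.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class BaseSiteCheck {
    // Hypothetical sketch of the fix: compare the URLs' external forms
    // instead of calling URL.equals(), which would resolve both host
    // names and treat virtual hosts on one IP address as equal.
    static boolean isBaseSite(URL siteUrl, URL baseUrl) {
        return isBaseSite(siteUrl.toExternalForm(), baseUrl.toExternalForm());
    }

    // String-based variant; the actual comparison is plain string equality.
    static boolean isBaseSite(String siteUrl, String baseUrl) {
        return siteUrl.equals(baseUrl);
    }

    public static void main(String[] args) throws MalformedURLException {
        URL base = new URL("http://www.ddc-deutsch.de/");
        URL external = new URL("http://www.ddb.de/");
        System.out.println(isBaseSite(base, base));     // true
        System.out.println(isBaseSite(external, base)); // false
    }
}
```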
I attach two patches, one for SiteDAOImpl and one for
SpiderContextImpl, which solved the above problems for
me. I didn't have the time to check whether there are
other places where URLs are compared.