#84 URL rewriting not quite adequate.

htdig (31)

I'm having a problem with URL rewriting when it
comes to the domain-name part. Some servers have
aliases which are "partially separate" virtual hosts,
where some URL paths are the same regardless of the
domain actually used, while others are not. For example,
all files under the URL location /foo/ are the same, but
those under /bar/ are not. Of course, I don't want to
index these identical resources multiple times.

Server_aliases cannot be used, as these virtual hosts
do return different information for some URLs. Also,
in some other cases (more than 5 aliases), a regular
expression would be nice, but is not supported. Note
also that it doesn't handle subdomains (e.g. as a
convenience to strip off a leading but optional "www."),
so this actually doubles the size of the ruleset needed.
I mention this because there are also some hostnames in
the spider set which are identical virtual hosts, and
which are sometimes "wildcard-CNAMEd" at the DNS level.
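
To illustrate the kind of rule meant here, a minimal Python sketch follows.
It is not htdig code, and the hostnames, the ALIAS_HOST pattern and the
canonical_host helper are all made up for the example. A single regular
expression covers several aliases and an optional leading "www." at once:

```python
import re

# Hypothetical alias set: alias<N>.example.com and example.org, each with
# an optional leading "www.". The lookahead stops the match at the end of
# the hostname so that e.g. "example.organic.com" is not caught.
ALIAS_HOST = re.compile(
    r"^(https?)://(?:www\.)?(?:alias[0-9]+\.example\.com|example\.org)(?=[:/]|$)",
    re.IGNORECASE,
)

def canonical_host(url, canonical="www.example.com"):
    """Map any alias hostname to one canonical hostname (assumed name)."""
    return ALIAS_HOST.sub(
        lambda m: f"{m.group(1).lower()}://{canonical}", url, count=1
    )
```

With a plain list of aliases, each hostname would need one entry with
"www." and one without; here the optional group handles both.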

URL_rewrite_rules does support regex, but isn't
working in my case, as it appears to be applied to
pre-normalized URLs, not post-normalized ones. Therefore,
relative URLs escape the rules, and since I am
indexing multiple domains, I cannot limit a rule to
just the domain it pertains to: I need the full,
normalized URL to rewrite. If the rules were applied to
post-normalized URLs, I could limit them. I can't split
that one site off into its own htdig configuration
file, as my intent is to crawl a set of sites whose
URLs reference one another.
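
The behaviour being asked for can be sketched in Python as below. This only
illustrates the ordering (resolve first, rewrite second) and is not how
htdig is implemented; the hosts a.example.com and b.example.com, the RULES
table and the rewrite_normalized helper are assumptions for the example.

```python
import re
from urllib.parse import urljoin

# Hypothetical rule: /foo/ is identical on both alias hosts, so index it
# under a.example.com only. /bar/ differs per host and is left alone.
RULES = [
    (re.compile(r"^http://(?:www\.)?(?:a|b)\.example\.com/foo/"),
     "http://a.example.com/foo/"),
]

def rewrite_normalized(base_url, link):
    """Resolve link against the page it came from, then apply the rules."""
    absolute = urljoin(base_url, link)  # normalize first
    for pattern, replacement in RULES:
        absolute, hits = pattern.subn(replacement, absolute, count=1)
        if hits:
            break  # first matching rule wins in this sketch
    return absolute
```

A relative link such as "../foo/page.html" found on
http://b.example.com/bar/index.html contains no hostname, so a
host-specific rule cannot match it before resolution. After resolution it
becomes http://b.example.com/foo/page.html and the rule applies.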

I'm about to hack my htdig source to give me
a "url_rewrite_normalized" equivalent, but would
prefer some sort of official change, or at least
consideration of adding this feature. It may be useful
to someone.

I'm using HTDig 3.1.6 with the https/SSL patch applied.

