Heritrix: Internet Archive Web Crawler

Tracker: Feature Requests

9 canonicalization of URIs for alreadyIncluded testing - ID: 900004
Last Update: Comment added ( karl-ia )

It should be possible to apply a sort of
"canonicalization" to URIs before they are added to the
alreadyIncluded structure, for example case-flattening
or session-id-removing, so that essentially-alike URIs
will be treated as having already been handled.

Both Alexa and Mercator do this in some form or another.

Custom canonicalization for individual sites should be
possible; a canonicalization rule could perhaps be
specified as a regexp and a replace-string.
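
As a rough illustration of the request (not Heritrix code), the sketch below keeps an already-included set keyed by a canonical form of each URI. The class name SeenUris, its methods, and the jsessionid rule passed in main are all made up for this example. Note that case-flattening the whole URI will also merge paths that differ only by case.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch: none of these names come from Heritrix.
final class SeenUris {
    private final Set<String> seen = new HashSet<>();
    // One assumed rule: a regexp and a replace-string.
    private final String regex;
    private final String replacement;

    SeenUris(String regex, String replacement) {
        this.regex = regex;
        this.replacement = replacement;
    }

    // Canonical form: apply the rule, then case-flatten the result.
    String canonicalize(String uri) {
        return uri.replaceFirst(regex, replacement).toLowerCase(Locale.ROOT);
    }

    // Returns true if the URI is new, false if an essentially-alike
    // URI has already been added.
    boolean add(String uri) {
        return seen.add(canonicalize(uri));
    }

    public static void main(String[] args) {
        SeenUris seen = new SeenUris("(?i);jsessionid=[0-9a-z]+", "");
        System.out.println(seen.add("http://Example.com/a;jsessionid=ABC123"));
        System.out.println(seen.add("http://example.com/A"));
    }
}
```

Only the canonical string is stored; the original URI is still the one that would be fetched.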


Gordon Mohr ( gojomo ) - 2004-02-19 00:38

9

Closed

None

Michael Stack

None

None

Public


Comments ( 6 )

Date: 2007-03-14 01:26
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-719 -- please add further
comments at that location.


Date: 2004-10-08 02:12
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Finished. See below for commit message used.


New feature: [ 900004 ] canonicalization of URIs for
alreadyIncluded testing
* src/articles/developer_manual.xml
Added an id so can refer to this section.
* src/articles/user_manual.xml
Added documentation of new 'URL Canonicalization Rule'
screens.
* src/conf/heritrix.properties
Added commented out enabling of canonicalizer logging.
* src/conf/profiles/Simple/order.xml
* src/conf/selftest/order.xml
Added base set of canonicalizing rules.
* src/java/org/archive/crawler/admin/CrawlJobHandler.java
Added constant for name of the canonicalization rules file.
* src/java/org/archive/crawler/admin/ui/JobConfigureUtils.java
Moved code from jsp to here. Refactoring allowed me to
reuse the same
code in four-plus different pages (two are new).
(handleJobAction): Added.
* src/java/org/archive/crawler/datamodel/CandidateURI.java
* src/java/org/archive/crawler/datamodel/UURI.java
Removed the HasUri interface, so removed all references
to it here, along with its method,
getUri.
* src/java/org/archive/crawler/datamodel/CrawlOrder.java
Added new url-canonicalization-rules under
/crawlorder/controller.
* src/java/org/archive/crawler/datamodel/UURIFactory.java
Minor formatting.
* src/java/org/archive/crawler/datamodel/UriUniqFilter.java
Refactoring so that for each operation it takes a
CandidateURI and
a canonicalized string of the passed CandidateURI.
Removed the HasUri interface. Instead made it explicit
that it's a
CandidateURI that is passing through.
* src/java/org/archive/crawler/framework/Filter.java
Removed commented out code.
* src/java/org/archive/crawler/framework/Processor.java
Removed hanging javadoc -- javadoc about a non-existent
data member.
* src/java/org/archive/crawler/frontier/AbstractFrontier.java
Added utility canonicalize method for use by all subclasses.
* src/java/org/archive/crawler/frontier/BdbFrontier.java
* src/java/org/archive/crawler/frontier/ExperimentalFrontier.java
* src/java/org/archive/crawler/frontier/HostQueuesFrontier.java
Refactoring so all communication with the
alreadyincluded list passes
also a canonicalized version of the thing to be looked
up, added or
forgotten.
* src/java/org/archive/crawler/util/BdbUriUniqFilter.java
* src/java/org/archive/crawler/util/BdbUriUniqFilterTest.java
* src/java/org/archive/crawler/util/FPUriUniqFilter.java
* src/java/org/archive/crawler/util/FPUriUniqFilterTest.java
* src/java/org/archive/crawler/util/MemUriUniqFilter.java
Refactoring because implemented interface changed; now
takes a
canonicalized version of the object being looked up,
added, or forgotten,
etc.
* src/webapps/admin/include/jobfilters.jsp
This jsp is like the newly added /include/filters.jsp.
They differ
slightly. With some work they could be made the same
and we could
save on duplicated jsp code.
(printFilter): Added arguments so could include this jsp
in filters.jsp
and in url-canonicalization-rules.jsp.
* src/webapps/admin/include/jobnav.jsp
* src/webapps/admin/include/jobpernav.jsp
Added in the new 'url' page. These two jsp pages look
like they could be
refactored and much of the duplicate code factored out
into a new include.
* src/webapps/admin/jobs/filters.jsp
Factored out the guts of this file into a new include,
/include/filters.jsp
and /include/filters_js.jsp. These two new files are
then used to
build other pages and in place of duplicated code.
* src/webapps/admin/jobs/new.jsp
Added in new 'url' page button.
* src/webapps/admin/jobs/per/filters.jsp
Use new method in JobConfigureUtils in place of
duplicated code.
* src/conf/modules/url-canonicalization-rules.options
Added file of all canonicalization rule options.
* src/java/org/archive/crawler/url/CanonicalizationRule.java
Added interface that all canonicalization rules implement.
* src/java/org/archive/crawler/url/Canonicalizer.java
* src/java/org/archive/crawler/url/CanonicalizerTest.java
Added class to run all canonicalization rules.
* src/java/org/archive/crawler/url/canonicalize/BaseRule.java
* src/java/org/archive/crawler/url/canonicalize/FixupQueryStr.java
* src/java/org/archive/crawler/url/canonicalize/FixupQueryStrTest.java
* src/java/org/archive/crawler/url/canonicalize/LowercaseRule.java
* src/java/org/archive/crawler/url/canonicalize/LowercaseRuleTest.java
* src/java/org/archive/crawler/url/canonicalize/RegexRule.java
* src/java/org/archive/crawler/url/canonicalize/RegexRuleTest.java
* src/java/org/archive/crawler/url/canonicalize/StripSessionIDs.java
* src/java/org/archive/crawler/url/canonicalize/StripSessionIDsTest.java
* src/java/org/archive/crawler/url/canonicalize/StripUserinfoRule.java
* src/java/org/archive/crawler/url/canonicalize/StripUserinfoRuleTest.java
* src/java/org/archive/crawler/url/canonicalize/StripWWWRule.java
* src/java/org/archive/crawler/url/canonicalize/StripWWWRuleTest.java
Added canonicalization rules and accompanying unit tests.
* src/webapps/admin/include/filters.jsp
* src/webapps/admin/include/filters_js.jsp
New includes to use in place of duplicating jsp code.
* src/webapps/admin/jobs/url-canonicalization-rules.jsp
* src/webapps/admin/jobs/per/url-canonicalization-rules.jsp
Added new 'URL' screens.



Date: 2004-09-29 00:50
Sender: ia_igor (Project Admin)

Logged In: YES
user_id=715474

These are the most common session ids that I have been seeing
around the Web. There are many, many other session ids, but
these are produced by the software that is most commonly used
on the Web.

1.
(?i).*jsessionid=[0-9a-zA-Z]{32}.*
Example: jsessionid=999A9EF028317A82AC83F0FDFE59385A

2.
(?i).*PHPSESSID=[0-9a-zA-Z]{32}.*
Example: PHPSESSID=9682993c8daa2c5497996114facdc805

3.
(?i).*sid=[0-9a-zA-Z]{32}.*
Example: sid=9682993c8daa2c5497996114facdc805

'sid=' can be tricky, but every sid= followed by a 32-character
string that I have seen so far has been a session id.

4.
(?i).*ASPSESSIONID[a-zA-Z]{8}=[a-zA-Z]{24}.*
Example: ASPSESSIONIDAQBSDSRT=EOHBLBDDPFCLHKPGGKLILNAM

5.
(?i).*CFID=[0-9]{4}\&CFTOKEN=[0-9]{8}.*
Example: CFID=3313&CFTOKEN=38433225

Usually, these two ids go together. I believe that I have
seen these ids being used separately. I am not sure if this
usage is intentional or not. I will have to find more examples.
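
For anyone wanting to try the five patterns above, here is a small sketch that compiles them with java.util.regex and checks a URI against each. The class and method names are invented for illustration, and the fourth pattern uses [a-zA-Z] for the suffix. This only detects a session id; it does not strip it.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper; the patterns are the ones listed in this comment.
final class SessionIdPatterns {
    static final List<Pattern> PATTERNS = Arrays.asList(
        Pattern.compile("(?i).*jsessionid=[0-9a-zA-Z]{32}.*"),
        Pattern.compile("(?i).*PHPSESSID=[0-9a-zA-Z]{32}.*"),
        Pattern.compile("(?i).*sid=[0-9a-zA-Z]{32}.*"),
        Pattern.compile("(?i).*ASPSESSIONID[a-zA-Z]{8}=[a-zA-Z]{24}.*"),
        Pattern.compile("(?i).*CFID=[0-9]{4}&CFTOKEN=[0-9]{8}.*"));

    // True if any of the patterns matches the whole URI string.
    static boolean hasSessionId(String uri) {
        for (Pattern p : PATTERNS) {
            if (p.matcher(uri).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String uri = "http://example.com/a?CFID=3313&CFTOKEN=38433225";
        System.out.println(hasSessionId(uri));
    }
}
```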



Date: 2004-09-02 02:04
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

I believe this could be done as a list of rules, each
consisting of:
(1) a java-style regexp
(2) a replace string in the style of String.replaceFirst()
(3) an optional name/comment explaining its rationale

Every CrawlURI would have not just its URI string but also a
'canonicalizedUriString'. This string would be used, rather
than the regular UriString, for the purposes of the
alreadyIncluded test (UriUniqFilter). In the absence of any
canonicalization rules, this is the same as the URI string.
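
A minimal sketch of that shape, under assumed names (RegexReplaceRule and CanonicalizedUri are not real Heritrix classes): each rule carries a regexp, a replace string, and a comment, and the canonicalized string falls back to the plain URI string when the rule list is empty. The index.html rule is a made-up example.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical rule: regexp, replace string, optional comment.
final class RegexReplaceRule {
    final String regex;
    final String replacement;
    final String comment;

    RegexReplaceRule(String regex, String replacement, String comment) {
        this.regex = regex;
        this.replacement = replacement;
        this.comment = comment;
    }

    String apply(String uri) {
        return uri.replaceFirst(regex, replacement);
    }
}

// Hypothetical holder for a URI string and its canonicalized form.
final class CanonicalizedUri {
    final String uriString;
    final String canonicalizedUriString;

    CanonicalizedUri(String uriString, List<RegexReplaceRule> rules) {
        this.uriString = uriString;
        String canonical = uriString;
        // Apply the rules in list order; with no rules the string is unchanged.
        for (RegexReplaceRule rule : rules) {
            canonical = rule.apply(canonical);
        }
        this.canonicalizedUriString = canonical;
    }

    public static void main(String[] args) {
        List<RegexReplaceRule> none = Collections.<RegexReplaceRule>emptyList();
        List<RegexReplaceRule> rules = Arrays.asList(
            new RegexReplaceRule("/index\\.html$", "/", "made-up rule: drop index.html"));
        String uri = "http://example.com/index.html";
        System.out.println(new CanonicalizedUri(uri, none).canonicalizedUriString);
        System.out.println(new CanonicalizedUri(uri, rules).canonicalizedUriString);
    }
}
```

The alreadyIncluded test would then compare canonicalizedUriString values instead of uriString values.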

However, on a global or per-domain/per-site basis, rules
could be specified. (John-Erik tells me the MapType,
currently used with Filters, would also allow additive
transformation rules in overrides, or even a per-host
override that disables the parent transformation rule.)
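
To make the override idea concrete, here is a sketch of global rules plus per-host additions and per-host disabling. It does not use MapType or any Heritrix settings class; PerHostRules and all of its methods are assumptions for illustration, and hosts are matched by exact string.

```java
import java.net.URI;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.function.UnaryOperator;

// Hypothetical sketch; not the MapType mechanism and not Heritrix code.
final class PerHostRules {
    // Named global rules, applied in insertion order.
    private final Map<String, UnaryOperator<String>> global = new LinkedHashMap<>();
    // Extra rules that apply to one host only.
    private final Map<String, Map<String, UnaryOperator<String>>> added = new HashMap<>();
    // Names of global rules switched off for one host.
    private final Map<String, Set<String>> disabled = new HashMap<>();

    void addGlobal(String name, UnaryOperator<String> rule) {
        global.put(name, rule);
    }

    void addForHost(String host, String name, UnaryOperator<String> rule) {
        added.computeIfAbsent(host, h -> new LinkedHashMap<>()).put(name, rule);
    }

    void disableForHost(String host, String name) {
        disabled.computeIfAbsent(host, h -> new HashSet<>()).add(name);
    }

    String canonicalize(String uri) {
        // URI.create throws IllegalArgumentException on a malformed URI.
        String host = URI.create(uri).getHost();
        Set<String> off = disabled.getOrDefault(host, Collections.<String>emptySet());
        String result = uri;
        // Global rules first, skipping any disabled for this host.
        for (Map.Entry<String, UnaryOperator<String>> e : global.entrySet()) {
            if (!off.contains(e.getKey())) {
                result = e.getValue().apply(result);
            }
        }
        // Then the rules added for this host only.
        Map<String, UnaryOperator<String>> extra = added.get(host);
        if (extra != null) {
            for (UnaryOperator<String> rule : extra.values()) {
                result = rule.apply(result);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        PerHostRules rules = new PerHostRules();
        rules.addGlobal("lowercase", s -> s.toLowerCase(Locale.ROOT));
        rules.disableForHost("case.example.org", "lowercase");
        System.out.println(rules.canonicalize("http://other.example.org/Page"));
        System.out.println(rules.canonicalize("http://case.example.org/Page"));
    }
}
```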

I would expect the most common canonicalization rules to be:

- toLowercase the full URI (not just the host part, which
is case-insensitive by spec)
- Remove www. For example (untested):

^(http://)(?:www\.)(.*)$ $1$2 # strip www.

- Remove an obvious session ID. For example (untested):

^(http://.*?)(?:(.*)&)+?PHPSESSID=\p{XDigit}{32} $1$2 # strips trailing PHPSESSID
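
The two example rules can be tried directly with String.replaceFirst(), where the replace string refers to capture groups as $1 and $2. The wrapper class and method names below are invented. As written, the session-id pattern only matches when PHPSESSID follows an '&', so a URI where it is the first query parameter is left untouched.

```java
// Hypothetical wrapper around the two example rules from this comment.
final class WwwAndSessionRules {
    // Strip a leading "www." from the host of an http URI.
    static String stripWww(String uri) {
        return uri.replaceFirst("^(http://)(?:www\\.)(.*)$", "$1$2");
    }

    // Strip a PHPSESSID parameter that is preceded by '&'.
    static String stripPhpSessionId(String uri) {
        return uri.replaceFirst(
            "^(http://.*?)(?:(.*)&)+?PHPSESSID=\\p{XDigit}{32}", "$1$2");
    }

    public static void main(String[] args) {
        System.out.println(stripWww("http://www.site.com/"));
        System.out.println(stripPhpSessionId(
            "http://site.com/page?a=1&PHPSESSID=9682993c8daa2c5497996114facdc805"));
    }
}
```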




Date: 2004-09-01 23:21
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Needs to be a general system, usable outside of the crawler.

Should work per-host as well as globally.

Look at what wb does.

Thoughts are that it would be a list of regexes to apply to
a URI.


Date: 2004-08-25 23:51
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

This also goes for stripping the 'www.' from the beginning
of URIs, when appropriate, to avoid duplicate harvesting of
site.com and www.site.com. (For focused crawls, I'd prefer
to do this on a case-by-case basis, once it has been
determined the sites are the same, but for large/broad
crawls it might have to be automatic and miss some content.)


Changes ( 5 )

Field Old Value Date By
status_id Open 2004-10-08 02:12 stack-sf
close_date - 2004-10-08 02:12 stack-sf
assigned_to nobody 2004-09-01 23:19 gojomo
priority 6 2004-09-01 21:49 gojomo
priority 5 2004-07-29 00:55 gojomo