Share

Heritrix: Internet Archive Web Crawler

Tracker: Bugs

7 BdbUriUniqFilter URI fingerprint somewhat collision prone - ID: 1226387
Last Update: Comment added ( karl-ia )

Testing BdbUriUniqFilter against other implementations
revealed that out of a set of 4.9 million URLs,
BdbUrlUniqFilter rejects as duplicates 4 URLs accepted
by the other implementations.

The following code shows the collisions (if you make
the necessary constructor & createKey() temporarily
public):

String[] collisions = {
"dns:mail.daps.dla.mil",
"dns:militaryreview.army.mil",
"dns:wwwmil.dover.af.mil",
"dns:204mibn.army.mil",
"dns:dix-emh5.army.mil",
"dns:www.iach.knox.amedd.army.mil",
"dns:familysupport.acw.mil",
"dns:www.wr.disa.mil"};
BdbUriUniqFilter uuf = new BdbUriUniqFilter();
for(int i = 0; i<collisions.length; i++) {
System.out.println(uuf.createKey(collisions[i])+"
"+collisions[i]);
}

output:

8901596914508212125 dns:mail.daps.dla.mil
8901596914508212125 dns:militaryreview.army.mil
235382603819624349 dns:wwwmil.dover.af.mil
235382603819624349 dns:204mibn.army.mil
-4544075694866847843 dns:dix-emh5.army.mil
-4544075694866847843 dns:www.iach.knox.amedd.army.mil
-2439269590698973283 dns:familysupport.acw.mil
-2439269590698973283 dns:www.wr.disa.mil

Looking more closely at the algorithm, which was
inspired by the relevant section of the Chakrabarti
'Mining the Web' book, I see the problem: with only 24
bits assigned to the scheme+host section, there's a
likelihood of collision (by the birthday paradox) in
just 2^12 items -- 4096 items. If the path portion of
these URLs is identical, the whole 64-bit fingerprint
will collide too.

We're using an empty string for the path part on DNS
URLs, and it's highly likely that HTTP/HTTPS URLs have
identical paths (especially '/', '/robots.txt',
'index.html'. etc) as well.

In fact, I'm surprised there aren't more collisions
among the 4.9 million test set. It includes ~6K unique
DNS URLs (which alone would be expected to generate 1-2
collisions), and almost 12000 unique hosts in
HTTP/HTTPS URLs (which, for the root/robots URLs that
are sure to be present, would be expected to create
several more). The only odd thing is that all the
collisions are DNS URLs, but that could just be a
coincidence... the total of 4 sounds about as expected.

We can probable continue to

(FYI, the test input of almost 11 million URLs, only
4.9 million unique, is from a test crawl recovery log,
and current resides on
crawling015:/0/gojomo/recovertest.uris. The outputs
from the Bdb and Bloom UriUniqFilters are in bdb.out
and bloom.out; bdb2.out placed the first set of
known-missng 4 items at the front so that it would miss
and thus identify their original expressions instead.)

A suitable fix would be to redundantly include the
whole string in the second 40-bit range as well, so
that there's no clumping around empty/common paths.
(There'd still be the desired locality around shared
host-parts, and some collisions in the first 24 bits,
but the whole fingerprint should be a lot closer to the
64bit ideal of a collision expected only after 2^32 items.)


Gordon Mohr ( gojomo ) - 2005-06-23 17:00

7

Closed

None

Karl Thiessen

Frontier

1.6.0

Public


Comments ( 3 )

Date: 2007-03-14 00:56
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-451 -- please add further
comments at that location.


Date: 2005-07-09 00:53
Sender: gojomoProject Admin

Logged In: YES
user_id=144912

Fix for [ 1226387 ] BdbUriUniqFilter URI fingerprint
somewhat collision prone
* BdbUriUniqFilter.java
make createKey use scheme+host (if separable) for 24
bits, then whole URI for next 40bits, to avoid too-frequent
collisions for common (empty) paths
* BdbUriUniqFilterTest.java
add test which will catch a collision found in old
method; update expected FP values for other test

There's still something a bit suspicious about the
fingerprints created... I don't seem to get collisions as
fast as theory would predict, when feeding test data
(ascending integers printed in various bases). Deserves
further investigation at some point if this remains the
primary/default alreadyIncluded mechanism.

For now assigning to Karl for verification/closing.



Date: 2005-07-06 21:37
Sender: gojomoProject Admin

Logged In: YES
user_id=144912

I'll fix this.


Attached File

No Files Currently Attached

Changes ( 6 )

Field Old Value Date By
status_id Open 2005-12-02 17:14 stack-sf
close_date - 2005-12-02 17:14 stack-sf
artifact_group_id None 2005-09-23 18:29 gojomo
priority 9 2005-07-09 00:53 gojomo
assigned_to gojomo 2005-07-09 00:53 gojomo
assigned_to stack-sf 2005-07-06 21:37 gojomo