Testing BdbUriUniqFilter against other implementations
revealed that out of a set of 4.9 million URLs,
BdbUrlUniqFilter rejects as duplicates 4 URLs accepted
by the other implementations.
The following code shows the collisions (if you make
the necessary constructor & createKey() temporarily
public):
String[] collisions = {
"dns:mail.daps.dla.mil",
"dns:militaryreview.army.mil",
"dns:wwwmil.dover.af.mil",
"dns:204mibn.army.mil",
"dns:dix-emh5.army.mil",
"dns:www.iach.knox.amedd.army.mil",
"dns:familysupport.acw.mil",
"dns:www.wr.disa.mil"};
BdbUriUniqFilter uuf = new BdbUriUniqFilter();
for(int i = 0; i<collisions.length; i++) {
System.out.println(uuf.createKey(collisions[i])+"
"+collisions[i]);
}
output:
8901596914508212125 dns:mail.daps.dla.mil
8901596914508212125 dns:militaryreview.army.mil
235382603819624349 dns:wwwmil.dover.af.mil
235382603819624349 dns:204mibn.army.mil
-4544075694866847843 dns:dix-emh5.army.mil
-4544075694866847843 dns:www.iach.knox.amedd.army.mil
-2439269590698973283 dns:familysupport.acw.mil
-2439269590698973283 dns:www.wr.disa.mil
Looking more closely at the algorithm, which was
inspired by the relevant section of the Chakrabarti
'Mining the Web' book, I see the problem: with only 24
bits assigned to the scheme+host section, there's a
likelihood of collision (by the birthday paradox) in
just 2^12 items -- 4096 items. If the path portion of
these URLs is identical, the whole 64-bit fingerprint
will collide too.
We're using an empty string for the path part on DNS
URLs, and it's highly likely that HTTP/HTTPS URLs have
identical paths (especially '/', '/robots.txt',
'index.html'. etc) as well.
In fact, I'm surprised there aren't more collisions
among the 4.9 million test set. It includes ~6K unique
DNS URLs (which alone would be expected to generate 1-2
collisions), and almost 12000 unique hosts in
HTTP/HTTPS URLs (which, for the root/robots URLs that
are sure to be present, would be expected to create
several more). The only odd thing is that all the
collisions are DNS URLs, but that could just be a
coincidence... the total of 4 sounds about as expected.
We can probable continue to
(FYI, the test input of almost 11 million URLs, only
4.9 million unique, is from a test crawl recovery log,
and current resides on
crawling015:/0/gojomo/recovertest.uris. The outputs
from the Bdb and Bloom UriUniqFilters are in bdb.out
and bloom.out; bdb2.out placed the first set of
known-missng 4 items at the front so that it would miss
and thus identify their original expressions instead.)
A suitable fix would be to redundantly include the
whole string in the second 40-bit range as well, so
that there's no clumping around empty/common paths.
(There'd still be the desired locality around shared
host-parts, and some collisions in the first 24 bits,
but the whole fingerprint should be a lot closer to the
64bit ideal of a collision expected only after 2^32 items.)
Karl Thiessen
Frontier
1.6.0
Public
|
Date: 2007-03-14 00:56
|
|
Date: 2005-07-09 00:53 Logged In: YES |
|
Date: 2005-07-06 21:37 Logged In: YES |
| Field | Old Value | Date | By |
|---|---|---|---|
| status_id | Open | 2005-12-02 17:14 | stack-sf |
| close_date | - | 2005-12-02 17:14 | stack-sf |
| artifact_group_id | None | 2005-09-23 18:29 | gojomo |
| priority | 9 | 2005-07-09 00:53 | gojomo |
| assigned_to | gojomo | 2005-07-09 00:53 | gojomo |
| assigned_to | stack-sf | 2005-07-06 21:37 | gojomo |