FROM LIST:
Thanks for the report -- my investigation reveals this is actually a bug
in OnHostsDecideRule, in how it updates itself when a new seed is added.
It fails to override its superclass method of deducing a SURT prefix
from the seed.
So in your case, it gets a default SURT prefix like:
http://(de,bund,bnd,www,)/cln_027/DE/Home__Vorschaltseite/
...instead of the proper/desired host-centric behavior...
http://(de,bund,bnd,www,)
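To make the difference between the two prefixes concrete, here is a minimal sketch that treats scope membership as a plain string-prefix test on SURT-form URIs. The class name SurtScopeSketch, the inScope helper, and the "Other/page.html" sibling path are illustrative assumptions, not Heritrix code; the real check goes through SurtPrefixSet.

```java
/** Illustrative sketch only; not part of Heritrix. */
class SurtScopeSketch {
    // Prefix deduced from the full seed URI (path included), as in the report.
    static final String PATH_PREFIX =
        "http://(de,bund,bnd,www,)/cln_027/DE/Home__Vorschaltseite/";
    // Host-centric prefix that OnHostsDecideRule is meant to use.
    static final String HOST_PREFIX = "http://(de,bund,bnd,www,)";

    /** Assumed simplification: a URI is in scope if its SURT form starts with the prefix. */
    static boolean inScope(String surtPrefix, String surtUri) {
        return surtUri.startsWith(surtPrefix);
    }

    public static void main(String[] args) {
        // Hypothetical page on the same host but outside the seed's directory.
        String sibling = "http://(de,bund,bnd,www,)/cln_027/DE/Other/page.html";
        System.out.println(inScope(PATH_PREFIX, sibling)); // false
        System.out.println(inScope(HOST_PREFIX, sibling)); // true
    }
}
```

With the path-bearing prefix, only URIs under the seed's own directory match; with the host prefix, anything on the host does.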
I've attached a patch that clears this up in my testing, by refactoring
SurtPrefixedDecideRule and SurtPrefixSet a little (to make the
prefix-ification reusable/overridable), and changing
OnHostsDecideRule/OnDomainsDecideRule (to override their superclass
prefix-ification).
Let me know if it works for you, and we'll integrate this into CVS-HEAD
post-1.8.
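The shape of that refactoring can be sketched as an overridable hook called from the seed-added path. This is a simplified stand-in, not the patched classes: SeedPrefixSketch, BaseRule and HostRule are hypothetical names, the seed is assumed to be in SURT form already, and the base prefixFrom (trim to the last '/') only approximates what SurtPrefixSet.prefixFromPlain does.

```java
import java.util.Set;
import java.util.TreeSet;

/** Illustrative sketch only; class names are hypothetical. */
class SeedPrefixSketch {
    static class BaseRule {
        private Set<String> prefixes = new TreeSet<>();

        /** Copy-then-swap update, calling the overridable hook. */
        public synchronized void addedSeed(String surtFormSeed) {
            Set<String> updated = new TreeSet<>(prefixes);
            updated.add(prefixFrom(surtFormSeed));
            prefixes = updated;
        }

        /** Assumed default: keep everything up to and including the last '/'. */
        protected String prefixFrom(String surtFormSeed) {
            return surtFormSeed.substring(0, surtFormSeed.lastIndexOf('/') + 1);
        }

        public synchronized Set<String> prefixes() {
            return prefixes;
        }
    }

    static class HostRule extends BaseRule {
        /** Override: cut the default prefix back to the host part. */
        @Override
        protected String prefixFrom(String surtFormSeed) {
            String prefix = super.prefixFrom(surtFormSeed);
            int close = prefix.indexOf(')');
            return close < 0 ? prefix : prefix.substring(0, close + 1);
        }
    }

    public static void main(String[] args) {
        String seed = "http://(de,bund,bnd,www,)/cln_027/DE/Home__Vorschaltseite/"
            + "home__node.html__nnn=true";
        BaseRule base = new BaseRule();
        BaseRule host = new HostRule();
        base.addedSeed(seed);
        host.addedSeed(seed);
        System.out.println(base.prefixes());
        System.out.println(host.prefixes());
    }
}
```

Because addedSeed calls prefixFrom rather than a static helper directly, a subclass only needs to override that one method to change what gets added for every new seed.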
- Gordon @ IA
pandae667 wrote:
> > Hi everyone,
> >
> > I'm using an up-to-date CVS version of heritrix and came across the
> > following problem:
> > I'm trying to spider the site http://www.bundesnachrichtendienst.de
> > which issues an immediate redirect via http-equiv="refresh".
> > As I want to follow this redirect I added an
> > AddRedirectFromRootServerToScope to my decide rules.
> > It adds a new seed into my seeds.txt as below
> >
> > # Heritrix added seed redirect from http://www.bundesnachrichtendienst.de/.
> > http://www.bnd.bund.de/cln_027/DE/Home__Vorschaltseite/home__node.html__nnn=true
> >
> > but still I get the redirected link itself and all subsequent links
> > reported as outOfScope.
> >
> > 04/19/2006 19:37:59 +0000 INFO org.archive.crawler.deciderules.AddRedirectFromRootServerToScope evaluate Adding http://www.bnd.bund.de/cln_027/DE/Home__Vorschaltseite/home__node.html__nnn=true to seeds via http://www.bundesnachrichtendienst.de/
> >
> > 04/19/2006 19:37:59 +0000 INFO org.archive.crawler.framework.Scoper outOfScope http://www.bnd.bund.de/cln_027/DE/Home__Vorschaltseite/home__node.html__nnn=true
> >
> > 04/19/2006 19:37:59 +0000 INFO org.archive.crawler.postprocessor.LinksScoper outOfScope http://www.bnd.bund.de/cln_027/DE/Home__Vorschaltseite/home__node.html__nnn=true
> >
> > here are the config settings for my scope if those help anyone in
> > tracking down the problem:
> > <newObject name="scope" class="org.archive.crawler.deciderules.DecidingScope">
> >   <boolean name="enabled">true</boolean>
> >   <string name="seedsfile">seeds.txt</string>
> >   <boolean name="reread-seeds-on-config">true</boolean>
> >
> >   <newObject name="decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence">
> >
> >     <map name="rules">
> >       <newObject name="RejectDecideRule" class="org.archive.crawler.deciderules.RejectDecideRule">
> >       </newObject>
> >
> >       <newObject name="AcceptRootRedirects" class="org.archive.crawler.deciderules.AddRedirectFromRootServerToScope">
> >         <string name="decision">ACCEPT</string>
> >       </newObject>
> >
> >       <newObject name="AcceptHostRule" class="org.archive.crawler.deciderules.OnHostsDecideRule">
> >         <string name="decision">ACCEPT</string>
> >         <string name="surts-dump-file"/>
> >         <boolean name="also-check-via">false</boolean>
> >         <boolean name="rebuild-on-reconfig">true</boolean>
> >       </newObject>
> >
> >       <newObject name="RejectTooManyhops" class="org.archive.crawler.deciderules.TooManyHopsDecideRule">
> >         <integer name="max-hops">3</integer>
> >       </newObject>
> >
> >       <newObject name="RejectPathologicalRule" class="org.archive.crawler.deciderules.PathologicalPathDecideRule">
> >         <integer name="max-repetitions">2</integer>
> >       </newObject>
> >       <newObject name="AcceptPrerequisiteRule" class="org.archive.crawler.deciderules.PrerequisiteAcceptDecideRule">
> >       </newObject>
> >
> >       <newObject name="RejectFileTypes" class="org.archive.crawler.deciderules.MatchesRegExpDecideRule">
> >         <string name="decision">REJECT</string>
> >
> >         <string name="regexp">
> > .*(?i)\.(a|ai|aif|aifc|aiff|asc|avi|bcpio|bin|bmp|bz2|c|cdf|cgi|cgm|class|cpio|cpp?|cpt|csh|css|cxx|dcr|dif|dir|djv|djvu|dll|dmg|dms|doc|dtd|dv|dvi|dxr|eps|etx|exe|ez|gif|gram|grxml|gtar|h|hdf|hqx|ice|ico|ics|ief|ifb|iges|igs|iso|jnlp|jp2|jpe|jpeg|jpg|js|kar|latex|lha|lzh|m3u|mac|man|mathml|me|mesh|mid|midi|mif|mov|movie|mp2|mp3|mp4|mpe|mpeg|mpg|mpga|ms|msh|mxu|nc|o|oda|ogg|pbm|pct|pdb|pgm|pgn|pic|pict|pl|png|pnm|pnt|pntg|ppm|ppt|ps|py|qt|qti|qtif|ra|ram|ras|rdf|rgb|rm|roff|rpm|rtf|rtx|s|sgm|sgml|sh|shar|silo|sit|skd|skm|skp|skt|smi|smil|snd|so|spl|src|srpm|sv4cpio|sv4crc|svg|swf|t|tar|tcl|tex|texi|texinfo|tgz|tif|tiff|tr|tsv|ustar|vcd|vrml|vxml|wav|wbmp|wbxml|wml|wmlc|wmls|wmlsc|wrl|xbm|xht|xhtml|xls|xml|xpm|xsl|xslt|xwd|xyz|z|zip)$
> >         </string>
> >       </newObject>
> >     </map>
> >   </newObject>
> > </newObject>
> >
> > I wonder if this might be in any way related to Bug "[ 1219262 ]
> > 'treat seed redirects as new seeds' not working" - as it sounds like
> > there was a similar issue that was fixed in there.
> >
> > Thanks in advance for any help
> > Olaf Freyer
Index: src/java/org/archive/crawler/deciderules/OnDomainsDecideRule.java
===================================================================
RCS file: /cvsroot/archive-crawler/ArchiveOpenCrawler/src/java/org/archive/crawler/deciderules/OnDomainsDecideRule.java,v
retrieving revision 1.3
diff -u -r1.3 OnDomainsDecideRule.java
--- src/java/org/archive/crawler/deciderules/OnDomainsDecideRule.java  14 Jun 2005 04:06:10 -0000  1.3
+++ src/java/org/archive/crawler/deciderules/OnDomainsDecideRule.java  19 Apr 2006 22:48:32 -0000
@@ -26,6 +26,8 @@
 import java.util.logging.Logger;
+import org.archive.util.SurtPrefixSet;
+
 /**
  * Rule applies configured decision to any URIs that
  * are on one of the domains in the configured set of
@@ -64,4 +66,8 @@
         surtPrefixes.convertAllPrefixesToDomains();
         dumpSurtPrefixSet();
     }
+
+    protected String prefixFrom(String uri) {
+        return SurtPrefixSet.convertPrefixToDomain(super.prefixFrom(uri));
+    }
 }
Index: src/java/org/archive/crawler/deciderules/OnHostsDecideRule.java
===================================================================
RCS file: /cvsroot/archive-crawler/ArchiveOpenCrawler/src/java/org/archive/crawler/deciderules/OnHostsDecideRule.java,v
retrieving revision 1.2
diff -u -r1.2 OnHostsDecideRule.java
--- src/java/org/archive/crawler/deciderules/OnHostsDecideRule.java  12 Apr 2005 21:46:43 -0000  1.2
+++ src/java/org/archive/crawler/deciderules/OnHostsDecideRule.java  19 Apr 2006 22:48:32 -0000
@@ -26,6 +26,8 @@
 import java.util.logging.Logger;
+import org.archive.util.SurtPrefixSet;
+
 /**
@@ -66,4 +68,8 @@
         surtPrefixes.convertAllPrefixesToHosts();
         dumpSurtPrefixSet();
     }
+
+    protected String prefixFrom(String uri) {
+        return SurtPrefixSet.convertPrefixToHost(super.prefixFrom(uri));
+    }
 }
Index: src/java/org/archive/crawler/deciderules/SurtPrefixedDecideRule.java
===================================================================
RCS file: /cvsroot/archive-crawler/ArchiveOpenCrawler/src/java/org/archive/crawler/deciderules/SurtPrefixedDecideRule.java,v
retrieving revision 1.15
diff -u -r1.15 SurtPrefixedDecideRule.java
--- src/java/org/archive/crawler/deciderules/SurtPrefixedDecideRule.java  18 Apr 2006 01:28:43 -0000  1.15
+++ src/java/org/archive/crawler/deciderules/SurtPrefixedDecideRule.java  19 Apr 2006 22:48:32 -0000
@@ -276,7 +276,11 @@
     public synchronized void addedSeed(final CandidateURI curi) {
         SurtPrefixSet newSurtPrefixes = (SurtPrefixSet) surtPrefixes.clone();
-        newSurtPrefixes.add(SurtPrefixSet.prefixFromPlain(curi.toString()));
+        newSurtPrefixes.add(prefixFrom(curi.toString()));
         surtPrefixes = newSurtPrefixes;
     }
+
+    protected String prefixFrom(String uri) {
+        return SurtPrefixSet.prefixFromPlain(uri);
+    }
 }
Index: src/java/org/archive/util/SurtPrefixSet.java
===================================================================
RCS file: /cvsroot/archive-crawler/ArchiveOpenCrawler/src/java/org/archive/util/SurtPrefixSet.java,v
retrieving revision 1.17
diff -u -r1.17 SurtPrefixSet.java
--- src/java/org/archive/util/SurtPrefixSet.java  2 Mar 2006 01:27:31 -0000  1.17
+++ src/java/org/archive/util/SurtPrefixSet.java  19 Apr 2006 22:48:34 -0000
@@ -316,22 +316,30 @@
         Iterator iter = iterCopy.iterator();
         while (iter.hasNext()) {
             String prefix = (String) iter.next();
-            if(prefix.endsWith(")")) {
-                continue; // no change necessary
+            String convPrefix = convertPrefixToHost(prefix);
+            if(prefix!=convPrefix) {
+                // if returned value not unchanged, update set
+                this.remove(prefix);
+                this.add(convPrefix);
             }
-            this.remove(prefix);
-            if(prefix.indexOf(')')<0) {
-                // open-ended domain prefix
-                if(!prefix.endsWith(",")) {
-                    prefix += ",";
-                }
-                prefix += ")";
-            } else {
-                // prefix with excess path-info
-                prefix = prefix.substring(0,prefix.indexOf(')')+1);
+        }
+    }
+
+    public static String convertPrefixToHost(String prefix) {
+        if(prefix.endsWith(")")) {
+            return prefix; // no change necessary
+        }
+        if(prefix.indexOf(')')<0) {
+            // open-ended domain prefix
+            if(!prefix.endsWith(",")) {
+                prefix += ",";
             }
-            this.add(prefix);
+            prefix += ")";
+        } else {
+            // prefix with excess path-info
+            prefix = prefix.substring(0,prefix.indexOf(')')+1);
         }
+        return prefix;
     }
     /**
@@ -346,18 +354,26 @@
         Iterator iter = iterCopy.iterator();
         while (iter.hasNext()) {
             String prefix = (String) iter.next();
-            if(prefix.indexOf(')')<0) {
-                continue; // no change necessary
-            }
-            this.remove(prefix);
-            prefix = prefix.substring(0,prefix.indexOf(')'));
-            if(prefix.endsWith("www,")) {
-                prefix = prefix.substring(0,prefix.length()-4);
+            String convPrefix = convertPrefixToDomain(prefix);
+            if(prefix!=convPrefix) {
+                // if returned value not unchanged, update set
+                this.remove(prefix);
+                this.add(convPrefix);
             }
-            this.add(prefix);
         }
     }
+    public static String convertPrefixToDomain(String prefix) {
+        if(prefix.indexOf(')')<0) {
+            return prefix; // no change necessary
+        }
+        prefix = prefix.substring(0,prefix.indexOf(')'));
+        if(prefix.endsWith("www,")) {
+            prefix = prefix.substring(0,prefix.length()-4);
+        }
+        return prefix;
+    }
+
     /**
      * Allow class to be used as a command-line tool for converting
      * URL lists (or naked host or host/path fragments implied
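
For anyone who wants to try the two new static helpers without a Heritrix checkout, the following standalone sketch copies their logic from the SurtPrefixSet.java diff above and runs it on the prefix from this report. The wrapper class PrefixConversionSketch and the method names toHost/toDomain are assumptions made for the sketch; only the method bodies follow the patch.

```java
/** Illustrative sketch only; method bodies mirror the patch, names are hypothetical. */
class PrefixConversionSketch {
    /** Mirrors convertPrefixToHost from the patch. */
    static String toHost(String prefix) {
        if (prefix.endsWith(")")) {
            return prefix; // already a host prefix
        }
        if (prefix.indexOf(')') < 0) {
            // open-ended domain prefix: close it off
            if (!prefix.endsWith(",")) {
                prefix += ",";
            }
            prefix += ")";
        } else {
            // prefix with excess path-info: cut after ')'
            prefix = prefix.substring(0, prefix.indexOf(')') + 1);
        }
        return prefix;
    }

    /** Mirrors convertPrefixToDomain from the patch. */
    static String toDomain(String prefix) {
        if (prefix.indexOf(')') < 0) {
            return prefix; // already open-ended
        }
        prefix = prefix.substring(0, prefix.indexOf(')'));
        if (prefix.endsWith("www,")) {
            prefix = prefix.substring(0, prefix.length() - 4);
        }
        return prefix;
    }

    public static void main(String[] args) {
        String fromSeed = "http://(de,bund,bnd,www,)/cln_027/DE/Home__Vorschaltseite/";
        System.out.println(toHost(fromSeed));   // http://(de,bund,bnd,www,)
        System.out.println(toDomain(fromSeed)); // http://(de,bund,bnd,
    }
}
```

Note that the patched loops compare strings with != rather than equals(). That relies on the helpers returning the very same String object when nothing changes, which both do in their early-return branch.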
Gordon Mohr
General
1.8.0
Public
Date: 2007-03-14 01:05
Date: 2006-04-24 21:40
Date: 2006-04-24 19:58
Date: 2006-04-24 19:58
Date: 2006-04-24 19:44
Date: 2006-04-24 19:10
| Field | Old Value | Date | By |
|---|---|---|---|
| status_id | Open | 2006-04-24 21:40 | gojomo |
| close_date | - | 2006-04-24 21:40 | gojomo |
| resolution_id | None | 2006-04-24 21:40 | gojomo |
| assigned_to | stack-sf | 2006-04-24 19:58 | stack-sf |
| assigned_to | nobody | 2006-04-24 19:10 | gojomo |
| priority | 6 | 2006-04-24 19:10 | gojomo |
| artifact_group_id | None | 2006-04-24 19:10 | gojomo |