Heritrix: Internet Archive Web Crawler

Tracker: Bugs

Domain names in 'overrides' are not in alphabetical order - ID: 1044527
Last Update: Comment added ( karl-ia )

Under domain overrides there is a recursive list of
domains. At each level the order of these domains is
undetermined, or rather, it follows the order in which the
file system reports them.

On Windows this is in fact alphabetical order, but
under Linux it is usually order of creation.
In short, as is, the order is indeterminate.

I've sent Michael a patch that sorts the domains as they
are read from the file system. If it is accepted, this
bug can be closed.
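
To illustrate the behaviour described above: `File.listFiles()` makes no promise about the order of the entries it returns, so code that needs alphabetical order has to sort explicitly. The following is only a minimal sketch of that idea, not the patch that was submitted; the class `SortedSubdirs` and the method `sortedSubdirNames` are hypothetical names and are not part of Heritrix.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, for illustration only (not Heritrix code).
class SortedSubdirs {
    /**
     * Returns the names of the immediate subdirectories of dir,
     * sorted by String's natural (case-sensitive) ordering.
     */
    static List<String> sortedSubdirNames(File dir) {
        List<String> names = new ArrayList<String>();
        // listFiles() returns null if dir is not a directory or cannot be read;
        // the order of the returned array is unspecified.
        File[] entries = dir.listFiles();
        if (entries != null) {
            for (File entry : entries) {
                if (entry.isDirectory()) {
                    names.add(entry.getName());
                }
            }
        }
        // Sort explicitly instead of relying on file system order.
        Collections.sort(names);
        return names;
    }
}
```

Calling `SortedSubdirs.sortedSubdirNames(someDir)` should give the same order on Windows and Linux, whatever order the directories were created in.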


Submitted: Kristinn Sigurdsson ( kristinn_sig ) - 2004-10-11 11:45
Priority: 5
Status: Closed
Resolution: Fixed
Assigned: Kristinn Sigurdsson
Category: configuration
Group: None
Visibility: Public


Comments ( 2 )

Date: 2007-03-14 00:16
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-256 -- please add further
comments at that location.


Date: 2004-10-12 19:43
Sender: stack-sf (Project Admin)

Submitted Kri's patch with a few minor changes. Patch used
is below. Tested it on Linux. Looks good. Closing.

Index: src/java/org/archive/crawler/settings/SettingsHandler.java
===================================================================
RCS file: /cvsroot/archive-crawler/ArchiveOpenCrawler/src/java/org/archive/crawler/settings/SettingsHandler.java,v
retrieving revision 1.3
diff -u -r1.3 SettingsHandler.java
--- src/java/org/archive/crawler/settings/SettingsHandler.java 31 Aug 2004 21:26:08 -0000 1.3
+++ src/java/org/archive/crawler/settings/SettingsHandler.java 12 Oct 2004 19:40:13 -0000
@@ -28,7 +28,7 @@
import java.lang.reflect.Constructor;
import java.lang.reflect.InvocationTargetException;
import java.text.ParseException;
-import java.util.ArrayList;
+import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
@@ -451,8 +451,10 @@
public abstract File getPathRelativeToWorkingDirectory(String path);

/**
- * Will return an array of strings with domains that contain 'per' domain
- * overrides (or their subdomains contain them). The domains considered are
+ * Will return a Collection of strings with domains that contain 'per'
+ * domain overrides (or their subdomains contain them).
+ *
+ * The domains considered are
* limited to those that are subdomains of the supplied domain. If null or
* empty string is supplied the TLDs will be considered.
* @param rootDomain The domain to get domain overrides for. Examples:
@@ -460,7 +462,7 @@
* @return An array of domains that contain overrides. If rootDomain does not
* exist an empty array will be returned.
*/
- public abstract ArrayList getDomainOverrides(String rootDomain);
+ public abstract Collection getDomainOverrides(String rootDomain);

/**
* Unregister an instance of {@link ValueErrorHandler}.
Index: src/java/org/archive/crawler/settings/XMLSettingsHandler.java
===================================================================
RCS file: /cvsroot/archive-crawler/ArchiveOpenCrawler/src/java/org/archive/crawler/settings/XMLSettingsHandler.java,v
retrieving revision 1.6
diff -u -r1.6 XMLSettingsHandler.java
--- src/java/org/archive/crawler/settings/XMLSettingsHandler.java 31 Aug 2004 01:29:21 -0000 1.6
+++ src/java/org/archive/crawler/settings/XMLSettingsHandler.java 12 Oct 2004 19:40:13 -0000
@@ -32,7 +32,10 @@
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
+import java.util.Collection;
+import java.util.Comparator;
import java.util.List;
+import java.util.TreeSet;
import java.util.logging.Logger;

import javax.management.Attribute;
@@ -363,7 +366,7 @@
return f;
}

- public ArrayList getDomainOverrides(String rootDomain) {
+ public Collection getDomainOverrides(String rootDomain) {
File settingsDir = getSettingsDirectory();

//Find the right start directory.
@@ -389,7 +392,17 @@
}
//Then we move to the approprite directory.
settingsDir = new File(settingsDir.getPath()+subDir);
- ArrayList confirmedSubDomains = new ArrayList();
+ TreeSet confirmedSubDomains = new TreeSet(new Comparator() {
+ public int compare(Object o1, Object o2) {
+ if(o1 instanceof String && o2 instanceof String){
+ return ((String)o1).compareTo(o2);
+ } else {
+ // We only account for strings.
+ return 0;
+ }
+ }
+ }
+ );
if(settingsDir.exists()){
// Found our place! Search through it's subdirs.
File[] possibleSubDomains = settingsDir.listFiles();
Index: src/webapps/admin/jobs/per/overview.jsp
===================================================================
RCS file: /cvsroot/archive-crawler/ArchiveOpenCrawler/src/webapps/admin/jobs/per/overview.jsp,v
retrieving revision 1.6
diff -u -r1.6 overview.jsp
--- src/webapps/admin/jobs/per/overview.jsp 21 Apr 2004 22:52:07 -0000 1.6
+++ src/webapps/admin/jobs/per/overview.jsp 12 Oct 2004 19:40:13 -0000
@@ -9,7 +9,8 @@
<%@include file="/include/handler.jsp"%>^M
<%@include file="/include/secure.jsp"%>^M
^M
-<%@page import="java.util.ArrayList"%>^M
+<%@page import="java.util.Collection"%>^M
+<%@page import="java.util.Iterator"%>^M
^M
<%@page import="org.archive.crawler.admin.CrawlJob"%>^M
<%@page import="org.archive.crawler.settings.CrawlerSettings"%>^M
@@ -167,9 +168,9 @@
<li> <a href="javascript:doGotoDomain('<%=parentDomain%>')">- Up -</a>^M
<%^M
}^M
- ArrayList subs = settingsHandler.getDomainOverrides(currDomain);^M
- for(int i=0 ; i < subs.size() ; i++){^M
- String printDomain = (String)subs.get(i);^M
+ Collection subs = settingsHandler.getDomainOverrides(currDomain);^M
+ for (Iterator i = subs.iterator(); i.hasNext();) {^M
+ String printDomain = (String)i.next();^M
if(currDomain.length()>0){^M
printDomain += "."+currDomain;^M
}^M
@@ -203,4 +204,4 @@
<input type="button" value="Delete" onClick="doDelete()">^M
<% } %>^M
</form>^M
-<%@include file="/include/foot.jsp"%>
\ No newline at end of file
+<%@include file="/include/foot.jsp"%>^M
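
A side note on the comparator in the patch above: `String` implements `Comparable`, so a `TreeSet` created without a comparator orders strings by that same `compareTo` method. The sketch below shows the idea with generics, which postdate this patch; `NaturalOrderDemo` and `sorted` are hypothetical names and are not part of Heritrix, and this is not what was committed.

```java
import java.util.Collection;
import java.util.TreeSet;

// Hypothetical illustration only (not Heritrix code).
class NaturalOrderDemo {
    /**
     * Copies the given domain names into a TreeSet, which keeps them in
     * String's natural (case-sensitive) order and drops duplicates.
     */
    static Collection<String> sorted(Collection<String> domains) {
        return new TreeSet<String>(domains);
    }
}
```

If case-insensitive ordering were wanted, `String.CASE_INSENSITIVE_ORDER` could be passed to the `TreeSet` constructor; note that names differing only in case would then count as duplicates.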



Changes ( 3 )

Field          Old Value  Date              By
status_id      Open       2004-10-12 19:43  stack-sf
resolution_id  None       2004-10-12 19:43  stack-sf
close_date     -          2004-10-12 19:43  stack-sf