Heritrix: Internet Archive Web Crawler

Tracker: Bugs

stop alerts 'line in seed file ignored' for mixed seed/surt - ID: 1442207
Last Update: Comment added ( karl-ia )

Stop alerting "line in seed file ignored: gnu.inet.encoding.IDNAException: Contains non-LDH characters. +http" when the seeds file uses the mixed seeds/SURTs format.

To reproduce, create a seeds file containing:
http://archive.org
+http://(org,archive,)/
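
The message indicates that the string `+http` from the SURT line reached the domain-label check, where `+` is rejected because host labels may contain only letters, digits, and hyphens (LDH). The sketch below illustrates that kind of check. It uses the JDK's `java.net.IDN` with `USE_STD3_ASCII_RULES` as a stand-in for the gnu.inet.encoding library named in the alert, and the class and method names are hypothetical, not Heritrix code.

```java
import java.net.IDN;

// Hypothetical illustration only: java.net.IDN stands in for the
// gnu.inet.encoding library that appears in the alert.
class LdhCheckSketch {

    /** Returns true if the label passes the JDK's STD3 (letters, digits, hyphen) check. */
    static boolean isLdhLabel(String label) {
        try {
            // With USE_STD3_ASCII_RULES, non-LDH ASCII characters such as '+'
            // cause an IllegalArgumentException.
            IDN.toASCII(label, IDN.USE_STD3_ASCII_RULES);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("archive -> " + isLdhLabel("archive"));
        System.out.println("+http   -> " + isLdhLabel("+http"));
    }
}
```

A label such as `archive` passes, while `+http` fails on the leading `+`, which matches the label quoted at the end of the alert message.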

-------------------------------------
Time: Mar. 3, 2006 01:58:05 GMT
Level: WARNING
Message:

line in seed file ignored: gnu.inet.encoding.IDNAException: Contains non-LDH characters. +http

Exception:

org.apache.commons.httpclient.URIException: gnu.inet.encoding.IDNAException: Contains non-LDH characters. +http
Stacktrace: org.apache.commons.httpclient.URIException: gnu.inet.encoding.IDNAException: Contains non-LDH characters. +http
    at org.archive.net.UURIFactory.fixupDomainlabel(UURIFactory.java:623)
    at org.archive.net.UURIFactory.fixupAuthority(UURIFactory.java:577)
    at org.archive.net.UURIFactory.fixup(UURIFactory.java:476)
    at org.archive.net.UURIFactory.create(UURIFactory.java:320)
    at org.archive.net.UURIFactory.create(UURIFactory.java:310)
    at org.archive.net.UURIFactory.getInstance(UURIFactory.java:263)
    at org.archive.crawler.scope.SeedFileIterator.transform(SeedFileIterator.java:90)
    at org.archive.util.iterator.TransformingIteratorWrapper.lookahead(TransformingIteratorWrapper.java:47)
    at org.archive.util.iterator.LookaheadIterator.hasNext(LookaheadIterator.java:48)
    at org.archive.crawler.admin.StatisticsTracker.getSeeds(StatisticsTracker.java:735)
    at org.archive.crawler.admin.StatisticsTracker.getSeedRecordsSortedByStatusCode(StatisticsTracker.java:742)
    at org.archive.crawler.jspc.admin.reports.seeds_jsp._jspService(Unknown Source)
    at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:137)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:853)
    at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:358)
    at org.mortbay.jetty.servlet.WebApplicationHandler$Chain.doFilter(WebApplicationHandler.java:342)
    at org.archive.crawler.admin.ui.RootFilter.doFilter(RootFilter.java:67)
    at org.mortbay.jetty.servlet.WebApplicationHandler$Chain.doFilter(WebApplicationHandler.java:334)
    at org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:286)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:567)
    at org.mortbay.http.HttpContext.handle(HttpContext.java:1807)
    at org.mortbay.jetty.servlet.WebApplicationContext.handle(WebApplicationContext.java:525)
    at org.mortbay.http.HttpContext.handle(HttpContext.java:1757)
    at org.mortbay.http.HttpServer.service(HttpServer.java:879)
    at org.mortbay.http.HttpConnection.service(HttpConnection.java:789)
    at org.mortbay.http.HttpConnection.handleNext(HttpConnection.java:960)
    at org.mortbay.http.HttpConnection.handle(HttpConnection.java:806)
    at org.mortbay.http.SocketListener.handleConnection(SocketListener.java:218)
    at org.mortbay.util.ThreadedServer.handle(ThreadedServer.java:300)
    at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:511)


Submitted: Igor Ranitovic ( ia_igor ) - 2006-03-03 02:05
Priority: 9
Status: Closed
Resolution: Fixed
Assigned: Gordon Mohr
Category: General
Group: 1.8.0
Visibility: Public


Comments ( 4 )

Date: 2007-03-14 01:04
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-542 -- please add further
comments at that location.


Date: 2006-03-17 23:11
Sender: gojomo (Project Admin)

Commit comment:

Fix for [ 1442207 ] stop alerts 'line in seed file ignored' for mixed seed/surt
* SeedFileIterator.java
  downgrade logging of ignored line to INFO so Alert isn't generated


Date: 2006-03-17 23:04
Sender: stack-sf (Project Admin)

Patch looks good to me. Clear to commit.


Date: 2006-03-17 22:46
Sender: gojomo (Project Admin)

To fix this annoyance for the crawl engineers, I'd like to
commit the following one-line fix through the pre-1.8 code
freeze.

It logs ignored non-comment lines from the seed file as INFO
rather than WARNING -- so no Alert gets created. The ignored
lines are still collected for reporting in the seeds report,
so problems (like corrupt intended-seed URIs) can still be
found after the fact.

Please review for inclusion.


Index: SeedFileIterator.java
===================================================================
RCS file: /cvsroot/archive-crawler/ArchiveOpenCrawler/src/java/org/archive/crawler/scope/SeedFileIterator.java,v
retrieving revision 1.7
diff -u -r1.7 SeedFileIterator.java
--- SeedFileIterator.java 11 Oct 2005 21:52:14 -0000 1.7
+++ SeedFileIterator.java 17 Mar 2006 22:43:54 -0000
@@ -89,7 +89,7 @@
             // TODO: ignore lines beginning with non-word char
             return UURIFactory.getInstance(uri);
         } catch (URIException e) {
-            logger.log(Level.WARNING, "line in seed file ignored: "
+            logger.log(Level.INFO, "line in seed file ignored: "
                     + e.getMessage(), e);
             if(ignored!=null) {
                 try {
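
To illustrate why the level change is enough, here is a minimal, self-contained sketch using `java.util.logging`. It assumes that alerts come from a handler that only accepts records at WARNING or above. `AlertCollector` and `AlertLevelSketch` are hypothetical names for this illustration, not Heritrix classes, and the real alert mechanism may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

// Hypothetical sketch: models an alert mechanism as a handler with a
// WARNING threshold. Not taken from the Heritrix source.
class AlertLevelSketch {

    /** Stand-in for an alert collector: keeps only WARNING and above. */
    static class AlertCollector extends Handler {
        final List<String> alerts = new ArrayList<>();

        AlertCollector() {
            setLevel(Level.WARNING);
        }

        @Override
        public void publish(LogRecord record) {
            // isLoggable() applies this handler's level threshold.
            if (isLoggable(record)) {
                alerts.add(record.getMessage());
            }
        }

        @Override
        public void flush() { }

        @Override
        public void close() { }
    }

    /** Logs one ignored seed line at the given level and returns the alert count. */
    static int alertsFor(Level level) {
        Logger logger = Logger.getAnonymousLogger();
        logger.setUseParentHandlers(false); // keep the demo quiet on the console
        logger.setLevel(Level.ALL);

        AlertCollector collector = new AlertCollector();
        logger.addHandler(collector);

        logger.log(level, "line in seed file ignored: +http://(org,archive,)/");
        return collector.alerts.size();
    }

    public static void main(String[] args) {
        System.out.println("WARNING alerts: " + alertsFor(Level.WARNING));
        System.out.println("INFO alerts:    " + alertsFor(Level.INFO));
    }
}
```

Under that assumption, the same message logged at INFO never reaches the collector, while the collection of ignored lines for the seeds report is unaffected because it happens independently of the log level.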



Changes ( 8 )

Field              Old Value  Date              By
close_date         -          2006-03-17 23:11  gojomo
status_id          Open       2006-03-17 23:11  gojomo
resolution_id      None       2006-03-17 23:11  gojomo
artifact_group_id  None       2006-03-17 23:11  gojomo
assigned_to        stack-sf   2006-03-17 23:04  stack-sf
assigned_to        gojomo     2006-03-17 22:46  gojomo
assigned_to        nobody     2006-03-17 21:43  gojomo
priority           7          2006-03-17 21:43  gojomo