Heritrix: Internet Archive Web Crawler

Tracker: Bugs

7 NPE in ArcWriterProcessor.writeDns() - ID: 1036720
Last Update: Comment added ( karl-ia )

Title: Problem occured processing
'dns:www.classicvacations.com'
Time: Sep. 29, 2004 01:42:21 GMT
Level: SEVERE
Message:

Problem java.lang.NullPointerException occured when
trying to process 'dns:www.classicvacations.com' at
step ABOUT_TO_BEGIN_PROCESSOR


Associated Throwable: java.lang.NullPointerException

Stacktrace:
java.lang.NullPointerException
    at org.archive.crawler.writer.ARCWriterProcessor.writeDns(ARCWriterProcessor.java:439)
    at org.archive.crawler.writer.ARCWriterProcessor.innerProcess(ARCWriterProcessor.java:369)
    at org.archive.crawler.framework.Processor.process(Processor.java(Compiled Code))
    at org.archive.crawler.framework.ToeThread.processCrawlUri(ToeThread.java(Compiled Code))
    at org.archive.crawler.framework.ToeThread.run(ToeThread.java:118)


Submitted: Gordon Mohr ( gojomo ) - 2004-09-29 01:56
Priority: 7
Status: Closed
Resolution: Fixed
Assigned To: Michael Stack
Category: None
Group: 1.4.2
Visibility: Public


Comments ( 22 )

Date: 2007-03-14 00:16
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-252 -- please add further
comments at that location.


Date: 2005-04-20 00:16
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Actually close.


Date: 2005-04-19 23:43
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Later debugging showed that there were actual instances where we
had a reference that had been nulled by the GC but hadn't yet made
it to the ReferenceQueue. That meant the referent hadn't yet
been persisted, so lookups on disk would come back null. In
this case, we'd say there was no such object, and ServerCache
would go create a new instance.

Closing. Commit message below.

Among other things, fixes '[ 1036720 ] NPE in
ArcWriterProcessor.writeDns()'
Below is aggregation of a St.Ack and Gordon patch (Tested by
St.Ack).
* src/java/org/archive/util/CachedBdbMap.java
Javadoc. Line lengths.
(referent): Renamed as referentField.
(get): There was a hole in get in that a referent may have been
nulled by the GC but hadn't yet made it to the reference queue. In
this case, we wouldn't find a reference to the extant referent because
it was null in the memMap reference, and we hadn't yet persisted the
cleared object to disk, so it wouldn't be found on disk either. Fixed
by doing a direct call to the new method expungeStaleEntry, persisting
the referent we know for sure had been cleared.
(put, remove): Refactored to use get.
(expungeStaleEntries): Refactored to call expungeStaleEntry with each
entry found.
(expungeStaleEntry): Conditional persistence of phantomly referenced
object and general clean up of phantoms.
Added accessor methods to our phantom and soft reference overrides at
eclipse's suggestion ('Performance' improvement).
Added method doctoredGet to our phantom reference override.
Added to our soft reference a method to do cleanup of the held phantom
reference.
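
The hole described in the commit message above can be illustrated with a
minimal sketch. This is not the CachedBdbMap code: the class name, the
HashMap standing in for the on-disk store, and the write-through put are
all assumptions made to keep the example small (the actual patch persists
the referent when a stale entry is expunged). Reference.clear() is used
here only to simulate a referent that has been nulled but whose reference
has not yet reached a ReferenceQueue.

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

public class SoftCacheSketch {
    /** Stand-in for the disk store (hypothetical; a plain map, not BDB). */
    private final Map<String, String> disk = new HashMap<>();
    /** In-memory layer: values are only softly reachable. */
    private final Map<String, SoftReference<String>> memMap = new HashMap<>();

    public synchronized void put(String key, String value) {
        disk.put(key, value); // write-through: an assumption of this sketch
        memMap.put(key, new SoftReference<>(value));
    }

    /** Naive lookup: a cleared reference is reported as "no such key". */
    public synchronized String naiveGet(String key) {
        SoftReference<String> ref = memMap.get(key);
        if (ref != null) {
            return ref.get(); // null once the referent has been cleared
        }
        return disk.get(key);
    }

    /** Lookup that drops a stale in-memory entry and falls through to disk. */
    public synchronized String get(String key) {
        SoftReference<String> ref = memMap.get(key);
        if (ref != null) {
            String value = ref.get();
            if (value != null) {
                return value;
            }
            memMap.remove(key); // cleared but never expunged: treat as stale
        }
        String value = disk.get(key);
        if (value != null) {
            memMap.put(key, new SoftReference<>(value)); // re-cache the disk copy
        }
        return value;
    }

    /** Test hook: clears the referent without enqueueing the reference. */
    public synchronized void simulateClearedReferent(String key) {
        SoftReference<String> ref = memMap.get(key);
        if (ref != null) {
            ref.clear();
        }
    }
}
```

With naiveGet, a caller that sees null would conclude the key is unknown
and build a fresh object, which is the symptom reported in this issue.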



Date: 2005-04-19 00:13
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

With object identity now logged, we see the creation of the
CrawlHost for www.cinu.org.mx at 18:59:55, w/ the IP being set in
the same second.

52 seconds later we try to get the CrawlHost for
www.cinu.org.mx and we're returned a new object instead of
the one created at 18:59:55. See below.

bash-2.05$ grep -r 'www.cinu.org.mx' heritrix_out.log
04/18/2005 18:59:55 +0000 FINE
org.archive.crawler.datamodel.ServerCache createHostFor
Created host www.cinu.org.mx
org.archive.crawler.datamodel.CrawlHost@3656fd
04/18/2005 18:59:55 +0000 FINE
org.archive.crawler.datamodel.CrawlHost setIP
org.archive.crawler.datamodel.CrawlHost@3656fd:
www.cinu.org.mx: /148.243.163.138
04/18/2005 19:00:47 +0000 FINE
org.archive.crawler.datamodel.ServerCache createHostFor
Created host www.cinu.org.mx
org.archive.crawler.datamodel.CrawlHost@13fe47c
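
A minimal sketch of this kind of identity logging, using
java.util.logging. The class, the Host type and the createHostFor helper
are hypothetical stand-ins, not the Heritrix ServerCache or CrawlHost
code; the point is only that printing the identity hash makes two
distinct instances for the same host name visible in the log.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class IdentityLogSketch {
    private static final Logger LOGGER =
        Logger.getLogger(IdentityLogSketch.class.getName());

    /** Hypothetical stand-in for a cached host record. */
    public static final class Host {
        public final String name;
        public Host(String name) { this.name = name; }
    }

    /**
     * Class name, '@', identity hash in hex. For classes that override
     * neither toString() nor hashCode(), this equals Object.toString().
     */
    public static String identityOf(Object o) {
        return o.getClass().getName() + "@"
            + Integer.toHexString(System.identityHashCode(o));
    }

    /** Creates a host and logs which instance was created, at FINE level. */
    public static Host createHostFor(String name) {
        Host host = new Host(name);
        if (LOGGER.isLoggable(Level.FINE)) {
            LOGGER.fine("Created host " + name + " " + identityOf(host));
        }
        return host;
    }
}
```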



Date: 2005-04-16 04:23
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Here is evidence that cache lookup is broken. First, here is
an instance of our exception:

Title: Problem occured processing
'http://cdn.digitalcity.com/web_mexico/channels/noticias/internas/12h'
Time: Apr. 16, 2005 04:12:55 GMT
Level: SEVERE
Message:

Problem java.lang.NullPointerException: Address is null for
http://cdn.digitalcity.com/web_mexico/channels/noticias/internas/12h
http://canales.aol.com.mx/nacional/notas/cfcpe/?id=993.
Address was set 0 times and was never looked up. occured
when trying to process
'http://cdn.digitalcity.com/web_mexico/channels/noticias/internas/12h'
at step ABOUT_TO_BEGIN_PROCESSOR in Archiver


Associated Throwable: java.lang.NullPointerException:
Address is null for
http://cdn.digitalcity.com/web_mexico/channels/noticias/internas/12h
http://canales.aol.com.mx/nacional/notas/cfcpe/?id=993.
Address was set 0 times and was never looked up.

Message:
Address is null for
http://cdn.digitalcity.com/web_mexico/channels/noticias/internas/12h
http://canales.aol.com.mx/nacional/notas/cfcpe/?id=993.
Address was set 0 times and was never looked up.

Stacktrace:
java.lang.NullPointerException: Address is null for
http://cdn.digitalcity.com/web_mexico/channels/noticias/internas/12h
http://canales.aol.com.mx/nacional/notas/cfcpe/?id=993.
Address was set 0 times and was never looked up.
    at org.archive.crawler.writer.ARCWriterProcessor.getHostAddress(ARCWriterProcessor.java:481)
    at org.archive.crawler.writer.ARCWriterProcessor.writeHttp(ARCWriterProcessor.java:379)
    at org.archive.crawler.writer.ARCWriterProcessor.innerProcess(ARCWriterProcessor.java:356)
    at org.archive.crawler.framework.Processor.process(Processor.java:103)
    at org.archive.crawler.framework.ToeThread.processCrawlUri(ToeThread.java:272)
    at org.archive.crawler.framework.ToeThread.run(ToeThread.java:142)


Next, here is logging in heritrix_out.log of the creation of
the 'cdn.digitalcity.com' CrawlHost object.

bash-2.05$ grep -r 'cdn.digitalcity.com' heritrix_out.log
04/16/2005 04:12:49 +0000 FINE
org.archive.crawler.datamodel.ServerCache createHostFor
Created host cdn.digitalcity.com
04/16/2005 04:12:49 +0000 FINE
org.archive.crawler.datamodel.CrawlHost setIP
cdn.digitalcity.com: /64.236.44.80
04/16/2005 04:12:55 +0000 FINE
org.archive.crawler.datamodel.ServerCache createHostFor
Created host cdn.digitalcity.com

Here are crawl log entries:

bash-2.05$ grep cdn.digitalcity.com crawl.log|more
2005-04-16T04:12:49.388Z 1 356 dns:cdn.digitalcity.com EP http://cdn.digitalcity.com/web_mexico/channels/noticias/brdazul2 text/dns #010 20050416041249279+103 - -
2005-04-16T04:12:50.096Z 404 538 http://cdn.digitalcity.com/robots.txt EP http://cdn.digitalcity.com/web_mexico/channels/noticias/brdazul2 text/html #010 20050416041249912+180 TS7YQLKBQRZKAMC2LPPOGJE6PZRSZBSM -
2005-04-16T04:12:51.112Z 200 43 http://cdn.digitalcity.com/web_mexico/channels/noticias/brdazul2 E http://canales.aol.com.mx/nacional/notas/cfcpe/?id=993 image/gif #043 20050416041251013+96 4QXM3PB4BSOVX5MX5BRRDMTE3SQ5QNRN 3t
2005-04-16T04:12:55.132Z -5 254 http://cdn.digitalcity.com/web_mexico/channels/noticias/internas/12h E http://canales.aol.com.mx/nacional/notas/cfcpe/?id=993 image/gif #006 20050416041251616+3402 GLCWTCIDSULUD6JUEQFMFKGNHB6EEUXV err=java.lang.NullPointerException
2005-04-16T04:13:00.273Z 200 225 http://cdn.digitalcity.com/web_mexico/channels/noticias/internas/bktitdes E http://canales.aol.com.mx/nacional/notas/cfcpe/?id=993 image/gif #034 20050416041300178+93 XIDR4VBYGUMUPAD2SKIMN5T6FFW2D34L -


Date: 2005-04-16 03:19
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

What's odd is that we fetched from this host fine 3 minutes before
the failure and again 9 seconds after it, w/o a new DNS lookup in
between. It's as though the cache stumbled and gave out a new
instance of CrawlHost once, but just before and just after gave out
the good CrawlHost, the one laden w/ the right IP. See below.

Adding more debugging.

2005-04-16T00:33:25.292Z 200 5928
http://cnh.gob.mx/documentos/8/2/art/archivos/w0gpnnl7.html
LLXL http://cnh.gob.mx/index.php?idseccion=64 text/html #036
20050416003324975+313 ALUYG6HUXBJEW25UX7EGTVI65AA5AU3D -
2005-04-16T00:33:37.396Z 200 83800
http://cnh.gob.mx/documentos/8/2/art/archivos/ikmukm0b.pdf
LLXL http://cnh.gob.mx/index.php?idseccion=64
application/pdf #014 20050416003328110+9259
LG2GETX735Z5I26TEDBHGELPDII3OFL2 -
2005-04-16T00:34:09.322Z 200 180666
http://cnh.gob.mx/documentos/8/2/art/archivos/jabje8uj.pdf
LLXL http://cnh.gob.mx/index.php?idseccion=64
application/pdf #045 20050416003343499+20400
FDR5JGBGKEXBPGMP35YGVKJOBFXM6U3F -
2005-04-16T00:35:34.214Z 200 659178
http://cnh.gob.mx/documentos/8/2/art/archivos/uwjdbltb.pdf
LLXL http://cnh.gob.mx/index.php?idseccion=64
application/pdf #041 20050416003417103+76023
TLFBR52IJ4EKHG7AXW7J5V5VL2GFMZLG -
2005-04-16T00:35:45.807Z 200 43257
http://cnh.gob.mx/documentos/8/2/art/archivos/nzxt9w6v.html
LLXL http://cnh.gob.mx/index.php?idseccion=64 text/html #005
20050416003541107+4248 UCSQBATSKOHU665GGFGN54RJIC6CZCT6 -
2005-04-16T00:38:08.796Z -5 1166586
http://cnh.gob.mx/documentos/8/2/art/archivos/movbm0f2.pdf
LLXL http://cnh.gob.mx/index.php?idseccion=64
application/pdf #015 20050416003553084+135589
5HM5QYXIIHUN2CDO2KY6CJJIAHMNVNXM
err=java.lang.NullPointerException
2005-04-16T00:38:17.083Z 200 23581
http://cnh.gob.mx/documentos/8/2/art/archivos/2wl5zom2.html
LLXL http://cnh.gob.mx/index.php?idseccion=64 text/html #048
20050416003814793+2279 VWKRECBLT2WAE2PCYZ62RIJQY7PTKZ7Y -
2005-04-16T00:38:23.507Z 200 5925
http://cnh.gob.mx/documentos/8/2/art/archivos/t69zqhss.html
LLXL http://cnh.gob.mx/index.php?idseccion=64 text/html #041
20050416003823254+248 KZ3F62OHJQCYFGFDNSYAH5LRGNIOWYYE -
2005-04-16T00:38:27.076Z 200 2557
http://cnh.gob.mx/documentos/8/2/art/archivos/4swfdqlv.html
LLXL http://cnh.gob.mx/index.php?idseccion=64 text/html #042
20050416003826891+182 2GLK5XB2ZJLKO5V53GLWDDIVUYN3MIGE -
2005-04-16T00:38:30.739Z 200 11540
http://cnh.gob.mx/documentos/8/2/art/archivos/0urlrvbk.html
LLXL http://cnh.gob.mx/index.php?idseccion=64 text/html #026
20050416003829187+1260 O6AAXX5DOU6LS4UPTONHHGYDDLDJAFLX -



Date: 2005-04-16 01:22
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Using build 1.3.0-200504151516, the one w/ the revised
CachedBdbMap, I'm still seeing these exceptions:

Title: Problem occured processing
'http://segundo.informe.presidencia.gob.mx/docs/pdfs/2info_anexo_435-450.pdf'
Time: Apr. 16, 2005 00:18:04 GMT
Level: SEVERE
Message:

Problem java.lang.NullPointerException: Address is null for
http://segundo.informe.presidencia.gob.mx/docs/pdfs/2info_anexo_435-450.pdf
http://segundo.informe.presidencia.gob.mx/index.php?idseccion=296.
Address was set 0 times and was never looked up. occured
when trying to process
'http://segundo.informe.presidencia.gob.mx/docs/pdfs/2info_anexo_435-450.pdf'
at step ABOUT_TO_BEGIN_PROCESSOR in Archiver


Associated Throwable: java.lang.NullPointerException:
Address is null for
http://segundo.informe.presidencia.gob.mx/docs/pdfs/2info_anexo_435-450.pdf
http://segundo.informe.presidencia.gob.mx/index.php?idseccion=296.
Address was set 0 times and was never looked up.

Message:
Address is null for
http://segundo.informe.presidencia.gob.mx/docs/pdfs/2info_anexo_435-450.pdf
http://segundo.informe.presidencia.gob.mx/index.php?idseccion=296.
Address was set 0 times and was never looked up.

Stacktrace:
java.lang.NullPointerException: Address is null for
http://segundo.informe.presidencia.gob.mx/docs/pdfs/2info_anexo_435-450.pdf
http://segundo.informe.presidencia.gob.mx/index.php?idseccion=296.
Address was set 0 times and was never looked up.
    at org.archive.crawler.writer.ARCWriterProcessor.getHostAddress(ARCWriterProcessor.java:481)
    at org.archive.crawler.writer.ARCWriterProcessor.writeHttp(ARCWriterProcessor.java:379)
    at org.archive.crawler.writer.ARCWriterProcessor.innerProcess(ARCWriterProcessor.java:356)
    at org.archive.crawler.framework.Processor.process(Processor.java:103)
    at org.archive.crawler.framework.ToeThread.processCrawlUri(ToeThread.java:272)
    at org.archive.crawler.framework.ToeThread.run(ToeThread.java:142)



Title: Problem occured processing
'http://cnh.gob.mx/documentos/8/2/art/archivos/movbm0f2.pdf'
Time: Apr. 16, 2005 00:38:08 GMT
Level: SEVERE
Message:

Problem java.lang.NullPointerException: Address is null for
http://cnh.gob.mx/documentos/8/2/art/archivos/movbm0f2.pdf
http://cnh.gob.mx/index.php?idseccion=64. Address was set 0
times and was never looked up. occured when trying to
process
'http://cnh.gob.mx/documentos/8/2/art/archivos/movbm0f2.pdf'
at step ABOUT_TO_BEGIN_PROCESSOR in Archiver


Associated Throwable: java.lang.NullPointerException:
Address is null for
http://cnh.gob.mx/documentos/8/2/art/archivos/movbm0f2.pdf
http://cnh.gob.mx/index.php?idseccion=64. Address was set 0
times and was never looked up.

Message:
Address is null for
http://cnh.gob.mx/documentos/8/2/art/archivos/movbm0f2.pdf
http://cnh.gob.mx/index.php?idseccion=64. Address was set 0
times and was never looked up.

Stacktrace:
java.lang.NullPointerException: Address is null for
http://cnh.gob.mx/documentos/8/2/art/archivos/movbm0f2.pdf
http://cnh.gob.mx/index.php?idseccion=64. Address was set 0
times and was never looked up.
    at org.archive.crawler.writer.ARCWriterProcessor.getHostAddress(ARCWriterProcessor.java:481)
    at org.archive.crawler.writer.ARCWriterProcessor.writeHttp(ARCWriterProcessor.java:379)
    at org.archive.crawler.writer.ARCWriterProcessor.innerProcess(ARCWriterProcessor.java:356)
    at org.archive.crawler.framework.Processor.process(Processor.java:103)
    at org.archive.crawler.framework.ToeThread.processCrawlUri(ToeThread.java:272)
    at org.archive.crawler.framework.ToeThread.run(ToeThread.java:142)


The new debugging seems to indicate that the CrawlHost
instances are being created anew -- when we go to do the
lookup of ARCWriter. I'll enable logging of CrawlHost
instantiation to verify this is indeed the case.


Date: 2005-04-15 00:49
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Speculation has it that what's happening here is that the
CrawlHost is being flushed from the cache of hosts, and that
a hole in our cache's check for whether an object already exists
in memory is allowing the construction of a brand new, empty
CrawlHost object -- one absent the looked-up IP.

Gordon's edit of CachedBigMap may have just closed the hole.


Date: 2005-04-07 18:33
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Tried reproducing w/ the pope crawl on dev04. No luck.
Committed debugging code that will need to be removed before
release. Flagging igor and dan to come get me if they see
new style exceptions.


Date: 2005-04-05 22:33
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Another example is over in
https://sourceforge.net/tracker/?func=detail&atid=539099&aid=1176739&group_id=73833

G speculates the entry may have been dropped from the hostmap.
That'd be a little odd since we've just gotten a hard
reference to the host -- but it's something odd like this
that is happening.


Date: 2005-03-23 00:44
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Removed the set-to-null as suggested above and added in more
debugging code. See commit below.

Work on '[ 1036720 ] NPE in ArcWriterProcessor.writeDns()'.
Index: src/java/org/archive/crawler/fetcher/FetchDNS.java
Removed a set to null. Could result in NPE but probably not the cause
of the NPEs 1036720 describes. Refactoring so we set to null only after
host judged unresolveable.
(setUnresolveable): Added.
* src/java/org/archive/crawler/writer/ARCWriterProcessor.java
Added more debugging code.
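
The ordering change described in this commit message can be sketched as
follows. This is an illustration under assumptions, not the FetchDNS or
CrawlHost code: the class, its fields and both refresh methods are
hypothetical, and the IP is held as a plain String to keep the example
small.

```java
public class HostIpSketch {
    private String ip;
    private boolean unresolvable;

    public synchronized String getIP() { return ip; }

    public synchronized boolean isUnresolvable() { return unresolvable; }

    /**
     * Risky ordering: the address is cleared before the new value is
     * stored, so a reader calling getIP() in between observes null even
     * though the host resolves fine.
     */
    public void riskyRefresh(String lookedUp) {
        synchronized (this) {
            ip = null; // window in which readers see null opens here
        }
        if (lookedUp != null) {
            synchronized (this) {
                ip = lookedUp;
            }
        }
    }

    /**
     * Safer ordering: the old address stays visible until the lookup has
     * a verdict; null is stored only once the host is judged unresolvable.
     */
    public synchronized void refresh(String lookedUp) {
        if (lookedUp == null) {
            ip = null;
            unresolvable = true;
        } else {
            ip = lookedUp;
            unresolvable = false;
        }
    }
}
```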


Date: 2005-03-05 03:08
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

One thought was that if the setting of the IP to null on line
147 of FetchDNS happened due to a duplicate DNS lookup
(possible when two CrawlServers on different ports have the
same host) and occurred at just the wrong time, it could
trigger an NPE like this. Examining the logs of where this bug
has occurred rules this out in this case (there are no culprit
DNS operations or alt-port servers evident in the logs), but
it still could be an NPE risk that should be foreclosed.


Date: 2005-03-05 02:09
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

Not sure the latest report (which is like the 12/30 reports)
is the same problem as the original reports (sept/october)
-- the NPEs are coming from different methods -- though they
might be manifestations of the same problem.

Would suggest a new bug with only the new stack dumps. (This would
also establish if recent occurrences are all only in writeDns
or in both writeDns and writeHttp.) Also suggest adding extra logging
around ArcWriterProcessor.getHostAddress() so that on future
occurrences it's clear which object in the dispatch chain is
null. I can investigate if you'd like.




Date: 2005-03-04 19:37
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Igor just got one of these NPEs in his new .fr crawl -- and
I got some yesterday testing bigmap -- so synchronization of
servercache access is not sufficient. Upping priority.

Title: Problem occured processing
'dns:lyc-vallee-gif.ac-versailles.fr'
Time: Mar. 4, 2005 08:08:32 GMT
Level: SEVERE
Message:

Problem java.lang.NullPointerException occured when trying
to process 'dns:lyc-vallee-gif.ac-versailles.fr' at step
ABOUT_TO_BEGIN_PROCESSOR in Archiver


Associated Throwable: java.lang.NullPointerException

Stacktrace:
java.lang.NullPointerException
    at org.archive.crawler.writer.ARCWriterProcessor.getHostAddress(ARCWriterProcessor.java:426)
    at org.archive.crawler.writer.ARCWriterProcessor.write(ARCWriterProcessor.java:406)
    at org.archive.crawler.writer.ARCWriterProcessor.writeDns(ARCWriterProcessor.java:385)
    at org.archive.crawler.writer.ARCWriterProcessor.innerProcess(ARCWriterProcessor.java:333)
    at org.archive.crawler.framework.Processor.process(Processor.java:102)
    at org.archive.crawler.framework.ToeThread.processCrawlUri(ToeThread.java:273)
    at org.archive.crawler.framework.ToeThread.run(ToeThread.java:143)


Date: 2004-12-30 17:32
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

After taking a closer look, the last two comments on NPEs
from the french crawl are a different issue -- the get of a
host from the servercache returning null. This was
addressed by serializing servercache accesses. The last two
comments should therefore be disregarded.
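
Serializing accesses in this sense can be sketched as a synchronized
get-or-create. The class and the Host type are hypothetical stand-ins,
not the Heritrix ServerCache; the sketch only shows why a single lock
around lookup and creation keeps concurrent callers from seeing a
missing or duplicate host for one name.

```java
import java.util.HashMap;
import java.util.Map;

public class SerializedHostCacheSketch {
    /** Hypothetical stand-in for a cached host record. */
    public static final class Host {
        public final String name;
        Host(String name) { this.name = name; }
    }

    private final Map<String, Host> hosts = new HashMap<>();

    /**
     * Lookup and creation happen under one lock, so two threads asking
     * for the same name always receive the same, non-null instance.
     */
    public synchronized Host getHostFor(String name) {
        Host host = hosts.get(name);
        if (host == null) {
            host = new Host(name);
            hosts.put(name, host);
        }
        return host;
    }
}
```

As the 2005-03-04 comment notes, this locking on its own did not stop
the NPEs tracked in this issue.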


Date: 2004-12-30 16:24
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Here is from same FR crawl:

java.lang.NullPointerException
    at org.archive.crawler.writer.ARCWriterProcessor.getHostAddress(ARCWriterProcessor.java:409)
    at org.archive.crawler.writer.ARCWriterProcessor.writeHttp(ARCWriterProcessor.java:355)
    at org.archive.crawler.writer.ARCWriterProcessor.innerProcess(ARCWriterProcessor.java:329)
    at org.archive.crawler.framework.Processor.process(Processor.java:102)
    at org.archive.crawler.framework.ToeThread.processCrawlUri(ToeThread.java:272)
    at org.archive.crawler.framework.ToeThread.run(ToeThread.java:143)



Date: 2004-12-30 16:23
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Seen in french crawl:

java.lang.NullPointerException
    at org.archive.crawler.writer.ARCWriterProcessor.getHostAddress(ARCWriterProcessor.java:409)
    at org.archive.crawler.writer.ARCWriterProcessor.writeHttp(ARCWriterProcessor.java:355)
    at org.archive.crawler.writer.ARCWriterProcessor.innerProcess(ARCWriterProcessor.java:329)
    at org.archive.crawler.framework.Processor.process(Processor.java:102)
    at org.archive.crawler.framework.ToeThread.processCrawlUri(ToeThread.java:272)
    at org.archive.crawler.framework.ToeThread.run(ToeThread.java:143)

Probably a variant on the stack trace that opened this issue.


Date: 2004-10-28 00:33
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

I got 5 of these over the course of my ~2 week umich.edu
crawl, spread out over the full time:

Time of alert               Level   Alert title
Oct. 27, 2004 19:08:25 GMT  SEVERE  Problem occured processing 'dns:hrs.isr.umich.edu'
Oct. 26, 2004 12:07:45 GMT  SEVERE  Problem occured processing 'dns:ai.eecs.umich.edu'
Oct. 17, 2004 21:13:00 GMT  SEVERE  Problem occured processing 'dns:x509.citi.umich.edu'
Oct. 17, 2004 07:33:00 GMT  SEVERE  Problem occured processing 'dns:tmi.engin.umich.edu'
Oct. 16, 2004 14:33:15 GMT  SEVERE  Problem occured processing 'dns:www.icpsr.umich.edu'

I was using a 20041013 build, after this bug was closed, but
the stacktrace looks the same, so the inserted null-check
may not be on the right variable.

Title: Problem occured processing 'dns:hrs.isr.umich.edu'
Time: Oct. 27, 2004 19:08:25 GMT
Level: SEVERE
Message:

Problem java.lang.NullPointerException occured when trying
to process 'dns:hrs.isr.umich.edu' at step
ABOUT_TO_BEGIN_PROCESSOR


Associated Throwable: java.lang.NullPointerException

Stacktrace:
java.lang.NullPointerException
    at org.archive.crawler.writer.ARCWriterProcessor.writeDns(ARCWriterProcessor.java(Compiled Code))
    at org.archive.crawler.writer.ARCWriterProcessor.innerProcess(ARCWriterProcessor.java(Compiled Code))
    at org.archive.crawler.framework.Processor.process(Processor.java(Compiled Code))
    at org.archive.crawler.framework.ToeThread.processCrawlUri(ToeThread.java(Compiled Code))
    at org.archive.crawler.framework.ToeThread.run(ToeThread.java(Compiled Code))

There was no disk space problem; the crawl definitely
continued generally OK; if the ARCWriter pool was exhausted,
there were no other symptoms of such.

It may be one of the other accesses of the CrawlURI, in
which case something else is corrupting the CrawlURI while
it is in process.

Reopening.




Date: 2004-10-08 22:26
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

This must be happening under catastrophic circumstances --
the disk is full or there was a timeout getting a writer from
the pool. Added throw of catastrophic exception.
Closing.


Changes ( 11 )

Field              Old Value         Date              By
artifact_group_id  None              2005-09-23 18:27  gojomo
status_id          Open              2005-04-20 00:16  stack-sf
close_date         2004-10-08 22:26  2005-04-20 00:16  stack-sf
resolution_id      None              2005-04-20 00:16  stack-sf
priority           5                 2005-03-04 19:37  stack-sf
assigned_to        nobody            2005-03-04 19:37  stack-sf
resolution_id      Fixed             2004-10-28 00:33  gojomo
status_id          Closed            2004-10-28 00:33  gojomo
resolution_id      None              2004-10-08 22:26  stack-sf
close_date         -                 2004-10-08 22:26  stack-sf
status_id          Open              2004-10-08 22:26  stack-sf