#50 NPE thrown by NekoHtmlDocumentHandler

Status: closed-fixed
Labels: Core (65)
Priority: 5
Updated: 2007-08-30
Created: 2007-07-16
Private: No

The following exception is thrown by the NekoHtmlDocumentHandler when loading some HTML files.

Attached is a zip containing 4 HTML documents (all of which Firefox opens without problems).

-----------------------------------------------

java.lang.NullPointerException
at gate.html.NekoHtmlDocumentHandler.characters(NekoHtmlDocumentHandler.java:191)
at org.cyberneko.html.filters.DefaultFilter.characters(DefaultFilter.java:195)
at org.cyberneko.html.HTMLTagBalancer.characters(HTMLTagBalancer.java:821)
at org.cyberneko.html.HTMLScanner$ContentScanner.scanStartElement(HTMLScanner.java:2319)
at org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:1881)
at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:809)
at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:478)
at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:431)
at gate.corpora.NekoHtmlDocumentFormat.unpackMarkup(NekoHtmlDocumentFormat.java:188)
at gate.corpora.NekoHtmlDocumentFormat.unpackMarkup(NekoHtmlDocumentFormat.java:89)
at gate.corpora.DocumentImpl.init(DocumentImpl.java:240)
at gate.Factory.createResource(Factory.java:302)

Discussion

  • Julien (GATE)

    Julien (GATE) - 2007-07-16

    4 HTML documents that cause problems for Neko

     
  • Ian Roberts

    Ian Roberts - 2007-07-16

    Logged In: YES
    user_id=1157323
    Originator: NO

    In document 11169.htm, this appears to be what is causing the problem:

    <strong>< Investors</strong>

    That is, there is a bare less-than sign in the text, and NekoHTML does not give us location information (row and column offsets in the original document) for it. We only need this information when parsing with collectRepositioningInfo=true, so we could change the handler to do one of the following:

    1) ignore the error when parsing without repositioning info, but leave it as a fatal error when parsing with it.

    2) ignore the error completely and accept that, if you save preserving format, any annotations that include the problematic character(s) will not necessarily be written out in the correct place. Limited testing suggests that annotations that don't touch the problem bits will be OK.
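
To make the difference between the two proposals concrete, here is a minimal sketch. It is not GATE's handler code: `EventLocation`, `MissingLocationPolicy`, and `recordLocation` are hypothetical names invented for illustration, and the real handler receives its location data from NekoHTML rather than from a plain object like this.

```java
/** Hypothetical stand-in for the row/column data the parser may or may not supply. */
final class EventLocation {
    final int row;
    final int column;

    EventLocation(int row, int column) {
        this.row = row;
        this.column = column;
    }
}

/** Hypothetical sketch of the two proposals; not GATE's actual handler code. */
final class MissingLocationPolicy {
    enum Option { FATAL_WHEN_REPOSITIONING, ALWAYS_IGNORE }

    /**
     * Returns true if a position could be recorded for this chunk of text,
     * false if the chunk was accepted without one.
     */
    static boolean recordLocation(EventLocation loc,
                                  boolean collectRepositioningInfo,
                                  Option option) {
        if (!collectRepositioningInfo) {
            // Location is not needed, so a missing one is harmless under both options.
            return false;
        }
        if (loc != null) {
            return true; // normal case: location is available
        }
        if (option == Option.FATAL_WHEN_REPOSITIONING) {
            // Option 1: still treat the missing location as fatal.
            throw new IllegalStateException(
                "No location info for text; cannot collect repositioning info");
        }
        // Option 2: carry on; annotations over this text may be saved in the wrong place.
        return false;
    }
}
```

Under both options a document parsed without repositioning info loads normally; they differ only in whether a missing location aborts the load when repositioning info is requested.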

     
  • Julien (GATE)

    Julien (GATE) - 2007-07-16

    Logged In: YES
    user_id=1283756
    Originator: YES

    I think (2) is probably the best option. IMHO it is better to have a slightly incorrect document than nothing at all.
    Maybe we could also leave a short message in the console log as a trace of the problem (e.g. "Problem with positioning of characters in document xxxx")?

     
  • Ian Roberts

    Ian Roberts - 2007-08-30
    • assigned_to: nobody --> ian_roberts
    • status: open --> closed-fixed
     
  • Ian Roberts

    Ian Roberts - 2007-08-30

    Logged In: YES
    user_id=1157323
    Originator: NO

    I've committed a fix for this (revision 9042). We now ignore the problem when not collecting repositioning info, and warn the user when we are collecting it.
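
The behaviour described in this comment can be sketched roughly as follows. This is an illustration only, not the code from revision 9042: `LenientTextHandler` and its methods are hypothetical, positions are modelled as a plain `{row, column}` array, and the warning text simply reuses the wording suggested earlier in the thread.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of "ignore when not collecting, warn when collecting". */
final class LenientTextHandler {
    private final String documentName;
    private final boolean collectRepositioningInfo;
    private final StringBuilder content = new StringBuilder();
    private final List<int[]> positions = new ArrayList<>();
    private final List<String> warnings = new ArrayList<>();

    LenientTextHandler(String documentName, boolean collectRepositioningInfo) {
        this.documentName = documentName;
        this.collectRepositioningInfo = collectRepositioningInfo;
    }

    /** rowAndColumn is {row, column}, or null when the parser supplied no location. */
    void characters(String text, int[] rowAndColumn) {
        content.append(text); // the text itself is always kept
        if (!collectRepositioningInfo) {
            return; // location not needed: a missing one is silently ignored
        }
        if (rowAndColumn == null) {
            // Location needed but missing: warn instead of failing.
            warnings.add("Problem with positioning of characters in document " + documentName);
            return;
        }
        positions.add(rowAndColumn.clone());
    }

    String content() { return content.toString(); }
    List<String> warnings() { return warnings; }
    int recordedPositions() { return positions.size(); }
}
```

With this shape, a bare `<` in the text no longer prevents the document from loading. At worst it leaves a gap in the recorded positions and a warning for the user.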

     