BUG - trying to parse META tag with quotes

  • ljuba

    ljuba - 2011-12-26

    I'm getting an empty List while trying to do:

    List<Element> listElement = sourceSearchPage.getAllElements("meta");

    This is the HTML page that I'm trying to parse:

    <TITLE>Single Document</TITLE>
    <META HTTP-EQUIV="REFRESH" CONTENT="1;URL=/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&u=%2Fnetahtml%2FPTO%2Fsearch-adv.htm&r=1&p=1&f=G&l=50&d=PTXT&S1=(%22sino-american+electronic%22.ASNM.)&OS=an/"sino-american+electronic"&RS=AN/"sino-american+electronic"">

    Clearly the problem is the quotes in the URL inside the CONTENT attribute of the "refresh" META tag. The URL of the test page is:


    Note: the browser will automatically refresh the page based on the URL in the META tag. In my code I'm trying to get that URL and mimic the browser's "redirect" behavior.

    I'm using v3.1 of the Jericho HTML Parser, so I'm wondering whether this bug is fixed in the new v3.2.

  • Martin Jericho

    Martin Jericho - 2011-12-27

    Hi Nljuba,

    This is not a bug, as the HTML is invalid: the CONTENT attribute value contains unescaped double quotes.

    You can however use the static Attributes.setDefaultMaxErrorCount method to make the parser more tolerant of errors when parsing attributes.

    You should also use the latest version to avoid other bugs that have been fixed since 3.1.
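
If raising the attribute error tolerance still does not yield the full refresh URL, one possible fallback is to pull the target straight out of the raw markup. The sketch below is only an illustration under stated assumptions: it uses plain java.util.regex rather than any Jericho API, the class and method names (MetaRefreshFallback, refreshTarget) are hypothetical, it assumes the CONTENT value is double-quoted and contains no '>' character, and replacing the stray quotes with %22 is an assumption about what the server expects. The sample HTML is a shortened stand-in for the page quoted above.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Jericho: a regex-based fallback for
// META refresh tags whose CONTENT value contains unescaped double quotes.
class MetaRefreshFallback {
    // Captures everything between CONTENT=" and the last double quote
    // before the closing '>', so stray inner quotes stay in the value.
    private static final Pattern META_REFRESH = Pattern.compile(
        "<meta\\b[^>]*?http-equiv\\s*=\\s*[\"']?refresh[\"']?"
            + "[^>]*?content\\s*=\\s*\"([^>]*?)\"\\s*/?>",
        Pattern.CASE_INSENSITIVE);

    // Splits a "delay;URL=target" value and captures the target part.
    private static final Pattern CONTENT = Pattern.compile(
        "\\s*\\d+\\s*[;,]\\s*(?:url\\s*=\\s*)?(.*)",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    public static Optional<String> refreshTarget(String html) {
        Matcher tag = META_REFRESH.matcher(html);
        if (!tag.find()) {
            return Optional.empty();
        }
        Matcher content = CONTENT.matcher(tag.group(1));
        if (!content.matches()) {
            return Optional.empty();
        }
        // Assumption: percent-encode the stray quotes so the result can be
        // used as a request target.
        String target = content.group(1).trim().replace("\"", "%22");
        return target.isEmpty() ? Optional.empty() : Optional.of(target);
    }

    public static void main(String[] args) {
        // Shortened, made-up stand-in for the page in the original post.
        String html = "<TITLE>Single Document</TITLE>\n"
            + "<META HTTP-EQUIV=\"REFRESH\" "
            + "CONTENT=\"1;URL=/search?OS=an/\"sino-american\"\">";
        System.out.println(
            refreshTarget(html).orElse("(no refresh target found)"));
    }
}
```

The returned string is still relative, so it would need to be resolved against the page's own URL before issuing the follow-up request. A regex like this is brittle compared with a real parser and is best kept as a last resort for this one malformed tag.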


