I'm getting an empty List while trying to do:
List<Element> listElement = sourceSearchPage.getAllElements("meta");
This is the HTML page that I'm trying to parse:
<META HTTP-EQUIV="REFRESH" CONTENT="1;URL=/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&u=%2Fnetahtml%2FPTO%2Fsearch-adv.htm&r=1&p=1&f=G&l=50&d=PTXT&S1=(%22sino-american+electronic%22.ASNM.)&OS=an/"sino-american+electronic"&RS=AN/"sino-american+electronic"">
Clearly the problem is the unescaped quotes in the URL inside the CONTENT attribute of the META "refresh" tag. The URL of the test page is:
Note: the browser will automatically refresh the page based on the URL in the META tag; in my code I'm trying to get that URL and mimic the browser's "redirect" behavior.
I'm using v3.1 of the Jericho HTML parser, so I'm wondering whether this bug is fixed in the new v3.2.
This is not a bug: the HTML is invalid, because the unescaped double quotes inside the CONTENT value end the attribute early.
You can however use the static Attributes.setDefaultMaxErrorCount method to make the parser more tolerant of errors when parsing attributes.
You should also use the latest version to avoid other bugs that have been fixed since 3.1.
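For the redirect itself, once the CONTENT value can be read, it still has to be split into the delay and the target, and the target resolved against the page URL. Below is a minimal sketch of that step using only the JDK. MetaRefresh, parse and resolve are hypothetical names for illustration and are not part of Jericho, and example.com stands in for the real host. Note that java.net.URI rejects raw double quotes like the ones in your URL, so those would have to be percent-encoded (%22) before resolving.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical helper (not a Jericho class): splits a META refresh CONTENT value.
final class MetaRefresh {
    final int delaySeconds;
    final String url; // null when the value carries no URL part

    private MetaRefresh(int delaySeconds, String url) {
        this.delaySeconds = delaySeconds;
        this.url = url;
    }

    // Parses values of the form "<seconds>" or "<seconds>;URL=<target>".
    static MetaRefresh parse(String content) {
        int semi = content.indexOf(';');
        String delayPart = (semi < 0 ? content : content.substring(0, semi)).trim();
        int delay = Integer.parseInt(delayPart);
        if (semi < 0) {
            return new MetaRefresh(delay, null);
        }
        String rest = content.substring(semi + 1).trim();
        // The "URL=" prefix is matched case-insensitively.
        if (rest.toLowerCase(Locale.ROOT).startsWith("url=")) {
            rest = rest.substring(4).trim();
        }
        return new MetaRefresh(delay, rest.isEmpty() ? null : rest);
    }

    // Resolves the target against the page URL, as a browser does for a relative path.
    static URI resolve(String pageUrl, String target) {
        return URI.create(pageUrl).resolve(target);
    }

    public static void main(String[] args) {
        MetaRefresh r = parse("1;URL=/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF");
        System.out.println(r.delaySeconds + " -> "
                + resolve("http://example.com/search/page.htm", r.url));
    }
}
```

The string passed to parse would be whatever the parser returns for the CONTENT attribute of the META element.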