Example web page: http://annieology.com/2009/06/disappointed/
I am using this snippet to do simple text extraction: source.getTextExtractor().toString();
which doesn't give us the desired results.
So I looked at the page source and tinkered with the markup, and it seems the <![CDATA[ inside the script tags is throwing off the parser. I removed the <![CDATA[ section from the script and everything works.
Is this a known problem? What is a workaround?
Thanks for any help.
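For anyone stuck on an older release, the manual fix described above (removing the CDATA markers from the script blocks) could be automated before the markup is handed to the parser. This is only a minimal sketch under that assumption: CdataStripper and stripCdataMarkers are hypothetical names, not part of Jericho, and the regex assumes reasonably regular script tags. It has not been checked against the example page.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not a Jericho class): drops CDATA markers found inside script elements. */
final class CdataStripper {
    // One script element: opening tag, body, closing tag.
    private static final Pattern SCRIPT = Pattern.compile(
            "(<script\\b[^>]*>)(.*?)(</script\\s*>)",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // A CDATA open or close marker, optionally preceded by a "//" comment prefix.
    private static final Pattern CDATA_MARKER = Pattern.compile(
            "(?://[ \\t]*)?(?:<!\\[CDATA\\[|\\]\\]>)");

    /** Returns the markup with CDATA markers removed from script bodies only. */
    static String stripCdataMarkers(String html) {
        Matcher m = SCRIPT.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Keep the script code itself; remove only the markers around it.
            String body = CDATA_MARKER.matcher(m.group(2)).replaceAll("");
            m.appendReplacement(out,
                    Matcher.quoteReplacement(m.group(1) + body + m.group(3)));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

The cleaned string could then be passed to new Source(cleaned) before calling getTextExtractor().toString(). CDATA sections outside script elements are left untouched.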
Hi noomaan,
The page contains some invalidly nested comments, which seem to be causing the problem. The latest development version of the parser seems to cope with them.
http://jericho.htmlparser.net/temp/jericho-html-3.2-dev.zip
Cheers
Martin
Thanks Martin, it works.
What is the release schedule for 3.2?
A release is probably overdue, but it takes a few hours to go through the release process and I just don't have those hours to spare. The development versions are stable enough to use in production environments.