Jericho HTML Parser / Discussion / Open Discussion: problem with text extraction for some web pag

problem with text extraction for some web pag

Forum: Open Discussion


Created: 2010-07-21

Updated: 2013-01-03

- 2010-07-21

example web page: http://annieology.com/2009/06/disappointed/

i am using this snippet to do simple text extraction: source.getTextExtractor().toString();

which doesn't seem to get us the desired results.

so i looked at the page source and tinkered with the markup, and it seems the <![CDATA[ inside of script tags is throwing off the parser. i removed the <![CDATA[ section in the script and everything works.

is this a known problem? what is a workaround?

thanks for any help.
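
For anyone landing here with the same symptom, here is a minimal sketch of the manual edit described above, done in code: it removes the <![CDATA[ and ]]> markers inside script elements before the markup is handed to the parser. It uses only the Java standard library; the class name CdataStripper and the method stripCdataMarkersInScripts are made up for illustration and are not part of Jericho. It is regex-based, so it assumes reasonably regular script tags, and it only mirrors the poster's observation, so it may not address whatever the underlying cause turns out to be.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the Jericho API.
public class CdataStripper {
    // Matches a whole script element: opening tag, body, closing tag.
    private static final Pattern SCRIPT = Pattern.compile(
            "(<script\\b[^>]*>)(.*?)(</script\\s*>)",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Matches a CDATA open or close marker, optionally preceded by
    // the "//" line comment that often hides it from JavaScript.
    private static final Pattern CDATA_MARKER = Pattern.compile(
            "(//[ \\t]*)?(<!\\[CDATA\\[|\\]\\]>)");

    // Returns the markup with CDATA markers removed from script bodies only.
    // Everything outside script elements is copied through unchanged.
    public static String stripCdataMarkersInScripts(String html) {
        Matcher m = SCRIPT.matcher(html);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(html, last, m.start());
            String body = CDATA_MARKER.matcher(m.group(2)).replaceAll("");
            out.append(m.group(1)).append(body).append(m.group(3));
            last = m.end();
        }
        out.append(html, last, html.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<p>before</p><script>//<![CDATA[\nvar x = 1;\n//]]></script><p>after</p>";
        System.out.println(stripCdataMarkersInScripts(html));
    }
}
```

The cleaned string could then be passed to the parser in place of the raw page source before calling getTextExtractor().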


Martin Jericho - 2010-07-21

Hi noomaan,

The page contains some invalidly nested comments, which seem to be causing the problem. The latest development version of the parser seems to cope with it.

http://jericho.htmlparser.net/temp/jericho-html-3.2-dev.zip

Cheers
Martin


- 2010-07-23

thanks martin. it works.

what is the release schedule for 3.2?


Martin Jericho - 2010-07-28

It is probably overdue but it takes a few hours to go through the process and I just don't have those hours to spare. The development versions are stable enough to use in production environments.
