problem with text extraction for some web pages

2010-07-21
2013-01-03
  • - 2010-07-21

    example web page: http://annieology.com/2009/06/disappointed/

    i am using this snippet to do simple text extraction: source.getTextExtractor().toString();

    which doesn't seem to get us the desired results.

    so i looked at the page source and tinkered with the markup, and it seems the <![CDATA[ inside of script tags is throwing off the parser. i removed the <![CDATA[ section in the script and everything works.

    is this a known problem? what is a workaround?

    thanks for any help.

     
  • Martin Jericho - 2010-07-21

    Hi noomaan,

    The page contains some invalidly nested comments, which seem to be causing the problem. The latest development version of the parser seems to cope with it.

    http://jericho.htmlparser.net/temp/jericho-html-3.2-dev.zip

    Cheers 
    Martin

     
  • - 2010-07-23

    thanks martin.  it works.

    what is the release schedule for 3.2?

     
  • Martin Jericho - 2010-07-28

    It is probably overdue but it takes a few hours to go through the process and I just don't have those hours to spare.  The development versions are stable enough to use in production environments.
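
For anyone who cannot move to the development build, the manual fix described in the first post (removing the CDATA section from the script) can be roughly automated by blanking out script bodies before the markup reaches the parser. The following is only a sketch under assumptions: `ScriptStripper` and `stripScriptBodies` are hypothetical names, not part of Jericho, and the regex is a simple approximation that will not handle every page (for example a `>` inside a script attribute value, or a self-closing script tag). It also will not help if the invalidly nested comments sit outside the script elements.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Jericho: empties <script> bodies before parsing.
final class ScriptStripper {
    // Group 1 is the opening <script ...> tag, group 2 the closing </script> tag.
    // DOTALL lets the script body span several lines; CASE_INSENSITIVE covers <SCRIPT>.
    private static final Pattern SCRIPT_BODY = Pattern.compile(
            "(<script\\b[^>]*>).*?(</script\\s*>)",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the markup with every script body removed and the script tags kept.
    static String stripScriptBodies(String html) {
        return SCRIPT_BODY.matcher(html).replaceAll("$1$2");
    }
}
```

Assuming Jericho is on the classpath, the cleaned string could then be passed to the same call used in the first post, e.g. `new Source(ScriptStripper.stripScriptBodies(html)).getTextExtractor().toString()`. Using the development version mentioned above is probably the more robust option.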

     
