[Htmlparser-user] How to extract only the "viewable" text? (Not scripts and comments, etc.)
From: Peter A. D. <pet...@gm...> - 2009-01-22 02:41:15
I'm using htmlparser very successfully for specific tag extraction, but am having trouble implementing a plain-text export for a "word count" function. I have spent half of today in the JavaDoc and experimenting, trying to get only the "printable" words on a page.

I cannot keep the JavaScript from being included: I am able to exclude the script tags themselves using the NotFilter class combined with a ScriptTag filter, but the script body still prints.

Am I not going about this correctly? Maybe a better question is: how should I be going about this? I can think of complicated brute-force ways to make this work, but it seems as if there is a simple and elegant solution I am missing.

Thank you for any help,
-Pete
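
To make the goal concrete, here is a minimal sketch of the brute-force route mentioned above. It is not the htmlparser API and not a real HTML parser: it is a crude regex stand-in that uses only the Java standard library. The class name `VisibleText` and its methods `extract` and `countWords` are hypothetical names chosen for illustration. It assumes script and style elements are properly closed and that no script body contains a literal `</script>`, and it does not decode character entities such as `&amp;`.

```java
// Crude regex-based sketch (assumption: NOT htmlparser, NOT a full HTML parser).
// VisibleText, extract and countWords are hypothetical names for illustration.
final class VisibleText {

    /** Returns roughly the text a reader would see for the given HTML. */
    static String extract(String html) {
        return html
            // Drop comments first so commented-out markup cannot confuse later steps.
            .replaceAll("(?s)<!--.*?-->", " ")
            // Drop script and style elements together with their bodies.
            .replaceAll("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>", " ")
            // Drop every remaining tag, keeping the text between tags.
            .replaceAll("(?s)<[^>]*>", " ")
            // Collapse whitespace runs into single spaces.
            .replaceAll("\\s+", " ")
            .trim();
    }

    /** Counts whitespace-separated words in the extracted text. */
    static int countWords(String html) {
        String text = extract(html);
        return text.isEmpty() ? 0 : text.split(" ").length;
    }

    public static void main(String[] args) {
        String html = "<html><head><script>var x = 1;</script></head>"
            + "<body><!-- hidden --><p>Hello <b>world</b></p></body></html>";
        System.out.println(extract(html));
        System.out.println(countWords(html));
    }
}
```

The order of the steps matters: the script and style bodies are removed before the generic tag pass, because a tag-only pass drops `<script>` and `</script>` but leaves the code between them in the output, which is the symptom described above.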