Thread: [Htmlparser-user] How to extract only the "viewable" text? (Not scripts and comments, etc.)

I'm using htmlparser very successfully for specific tag extraction,
but am having trouble implementing a plain text export for a
"word count" function.

I have spent half of today reading the JavaDoc and experimenting,
trying to get only the "printable" words on a page.  I cannot keep the
JavaScript from being included: using the NotFilter class combined
with a ScriptTag filter, I am able to exclude the script tags
themselves, but the script body still prints.
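
To make the symptom concrete, here is a minimal sketch of one way it can
arise. It is an assumption, not something confirmed for htmlparser: if
a filter is tested against each node separately while the traversal
still descends into every node's children, then rejecting the script
tag does not reject the text inside it. The Node class and the method
names below are hypothetical and model a parsed page with plain Java
only; they are not the htmlparser API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model of a parsed page; NOT the htmlparser API.
class VisibleTextSketch {

    // A node is either a tag (name set, text null) or a text node (name null, text set).
    static final class Node {
        final String name;
        final String text;
        final List<Node> children = new ArrayList<>();

        Node(String name, String text) {
            this.name = name;
            this.text = text;
        }

        static Node tag(String name, Node... kids) {
            Node n = new Node(name, null);
            for (Node k : kids) {
                n.children.add(k);
            }
            return n;
        }

        static Node text(String text) {
            return new Node(null, text);
        }

        // False for text nodes, because their name is null.
        boolean isTag(String candidate) {
            return candidate.equalsIgnoreCase(name);
        }
    }

    // Per-node filtering: the <script> node itself is rejected, but the walk
    // still visits its children, so the script body text is collected.
    static void collectPerNode(Node node, List<String> out) {
        boolean accepted = !node.isTag("script");
        if (accepted && node.text != null) {
            out.add(node.text);
        }
        for (Node child : node.children) {
            collectPerNode(child, out);
        }
    }

    // Subtree skipping: stop descending at <script> and <style>, so nothing
    // inside them is ever visited.
    static void collectSkippingSubtrees(Node node, List<String> out) {
        if (node.isTag("script") || node.isTag("style")) {
            return;
        }
        if (node.text != null) {
            out.add(node.text);
        }
        for (Node child : node.children) {
            collectSkippingSubtrees(child, out);
        }
    }

    // Counts whitespace-separated words across all collected text fragments.
    static int wordCount(List<String> parts) {
        String joined = String.join(" ", parts).trim();
        return joined.isEmpty() ? 0 : joined.split("\\s+").length;
    }

    // Tiny sample page: one visible paragraph and one script block.
    static Node samplePage() {
        return Node.tag("html",
                Node.tag("body",
                        Node.tag("p", Node.text("Hello world")),
                        Node.tag("script", Node.text("var x = 1;"))));
    }

    public static void main(String[] args) {
        List<String> leaky = new ArrayList<>();
        collectPerNode(samplePage(), leaky);
        System.out.println("per-node filter: " + leaky);

        List<String> visible = new ArrayList<>();
        collectSkippingSubtrees(samplePage(), visible);
        System.out.println("subtree skip:    " + visible);
        System.out.println("word count:      " + wordCount(visible));
    }
}
```

In this toy model the per-node version still collects "var x = 1;",
while the subtree-skipping version keeps only "Hello world". Whether
the same thing is happening in my actual code I cannot say for sure.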

Am I not going about this correctly?  Maybe a better question is how I
should be going about trying to do this?  I can think of complicated
ways I could use brute force to make this work, but it seems as if
there is a simple and elegant solution I am missing.

Thank you for any help,

-Pete
