I'm using htmlparser very successfully for specific tag extraction,
but am having trouble implementing a plain-text export for a
"word count" function.
I have spent half of today reading the JavaDoc and experimenting,
trying to get only the "printable" words on a page. I cannot keep the
JavaScript out of the output: using the NotFilter class combined with
a ScriptTag filter, I am able to exclude the script tags themselves,
but the script body still prints.
Am I not going about this correctly? Maybe a better question is how I
should be going about trying to do this? I can think of complicated
ways I could use brute force to make this work, but it seems as if
there is a simple and elegant solution I am missing.
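To show what I mean by brute force, here is a minimal sketch that does
not use htmlparser at all. It removes script and style elements together
with their bodies using regular expressions, strips the remaining tags,
and counts whitespace-separated words. The class and method names
(BruteForceWordCount, visibleText, wordCount) are made up for
illustration, character entities are not decoded, and regex-based
stripping is fragile on real-world HTML, which is why I would prefer a
parser-based solution.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of htmlparser.
public class BruteForceWordCount {
    // (?is): case-insensitive, and '.' also matches line breaks,
    // so multi-line script/style bodies are removed with their tags.
    private static final Pattern SCRIPT_OR_STYLE =
        Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");
    private static final Pattern COMMENT = Pattern.compile("(?s)<!--.*?-->");
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]*>");

    // Returns the text left after removing scripts, styles, comments and tags.
    public static String visibleText(String html) {
        String text = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        text = COMMENT.matcher(text).replaceAll(" ");
        text = TAG.matcher(text).replaceAll(" ");
        return text.trim();
    }

    // Counts whitespace-separated tokens in the visible text.
    public static int wordCount(String html) {
        String text = visibleText(html);
        return text.isEmpty() ? 0 : text.split("\\s+").length;
    }

    public static void main(String[] args) {
        String html =
            "<html><head><script type=\"text/javascript\">var x = 1;</script></head>"
          + "<body><p>Hello printable world</p></body></html>";
        System.out.println(wordCount(html)); // prints 3
    }
}
```

This handles the simple case, but it will miscount on things like a
literal "</script>" inside a JavaScript string or a ">" inside an
attribute value.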
Thank you for any help,
-Pete