I have a class which extracts text from HTML by using HTMLParser to convert the document into a NodeList and then applying rules based on each element to insert appropriate spacing, so that the plaintext version is laid out similarly to the original.
My problem is memory usage. I mentioned this on another bug report: we have a 60 MB HTML document which exhausts the JVM's heap when loaded into a DOM-like structure such as the one HTMLParser builds. So it's impossible for us to use Parser at present without hitting this problem.
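For context, what we do today is essentially the stock DOM-style load (a rough sketch assuming the 1.6-style API; the method name is mine), which is what pulls the whole document into memory at once:

    import org.htmlparser.Parser;
    import org.htmlparser.util.NodeList;
    import org.htmlparser.util.ParserException;

    // Roughly what we do now: the whole document becomes a tree of
    // Nodes in memory, which is what overflows the heap on a 60 MB input.
    NodeList loadAll(String url) throws ParserException {
        Parser parser = new Parser(url);
        return parser.parse(null); // null filter: keep every node
    }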
My attention has now turned to Lexer, but my problem is that our existing code performs the spacing heuristics based on the positions of the *end* tags. To be more precise, it works like this:
    Reader createNextReader() {
        while (index < nodeList.size()) {
            Node node = nodeList.elementAt(index++);
            if (node instanceof Text) {
                // return the text content as a reader
            } else if (node instanceof Tag) {
                String tagName = ((Tag) node).getTagName();
                if (dropTags.contains(tagName)) {
                    continue;
                }
                // if the tag has children, create a reader
                // which reads the children
                Reader childrenReader = ...
                // and now the spacing logic for the tag
                Reader tagText = ...
                return // join the readers...
            }
        }
        return null;
    }
Conceptually, if we were to rewrite this to use the Lexer and get rid of the recursion we're using, we would need some way to determine where the end of a tag is. That's fine if the document specifies end tags explicitly, but with a document like this...
<p> Para
<p> Another para
...the Lexer doesn't emit any end tags (because there aren't any).
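To illustrate, here is a minimal Lexer loop (a sketch assuming the org.htmlparser.lexer API; the method name and charset are my own choices). On the fragment above it only ever reports the two <p> start tags, so there is no event to hang the end-of-paragraph spacing on:

    import java.io.FileInputStream;
    import org.htmlparser.Node;
    import org.htmlparser.Tag;
    import org.htmlparser.lexer.Lexer;
    import org.htmlparser.lexer.Page;

    void dumpTags(String file) throws Exception {
        Lexer lexer = new Lexer(new Page(new FileInputStream(file), "UTF-8"));
        for (Node node = lexer.nextNode(); node != null; node = lexer.nextNode()) {
            if (node instanceof Tag) {
                Tag tag = (Tag) node;
                // isEndTag() is only true for an explicit </...> in the source
                System.out.println((tag.isEndTag() ? "END " : "START ")
                        + tag.getTagName());
            }
        }
    }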
If my understanding is correct, the Scanner / TagScanner is what determines where the end of a tag is, but it isn't clear whether it could be used to solve this problem, and there are no examples of how to use it.
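One workaround I can imagine, short of figuring out Scanner, is to infer the implicit end tags myself while streaming from the Lexer: keep a stack of open tag names and, for tags like <p> whose start implicitly closes a previous instance, synthesize the missing end event. A rough sketch (the class, method names, and tag set are mine, not HTMLParser's):

    import java.util.ArrayDeque;
    import java.util.Arrays;
    import java.util.Deque;
    import java.util.HashSet;
    import java.util.Set;

    class ImpliedEndTags {
        // tags whose start tag implicitly closes a previous open instance
        private final Set<String> implied =
                new HashSet<String>(Arrays.asList("P", "LI", "TR", "TD"));
        private final Deque<String> open = new ArrayDeque<String>();

        void onStartTag(String name) {
            if (implied.contains(name) && name.equals(open.peek()))
                onEndTag(open.pop()); // synthesize the event the Lexer never emits
            open.push(name);
        }

        void onEndTag(String name) {
            // run the same spacing heuristics we currently run on real end tags
        }
    }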
You should look at the code for StringBean, which does a plaintext conversion.
It doesn't handle <p> specially, but you may be able to wedge some special code in there.
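For reference, the stock usage is just something like this (using the org.htmlparser.beans.StringBean properties; the class name and URL are placeholders):

    import org.htmlparser.beans.StringBean;

    public class Extract {
        public static void main(String[] args) {
            StringBean sb = new StringBean();
            sb.setLinks(false);   // drop link URLs from the output
            sb.setCollapse(true); // collapse runs of whitespace
            sb.setURL("http://example.com/big.html");
            // note: internally this still drives a full Parser over the page
            System.out.println(sb.getStrings());
        }
    }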
We're already doing it that way (using Parser). See the first couple of paragraphs in my post for why that isn't a solution.