I have a class which extracts text from HTML by using HTMLParser to convert the document into a NodeList and then applying rules based on each element to insert appropriate spacing, so that the plaintext version is laid out similarly to the original.
My problem is memory usage. I mentioned this on another bug report: we have a 60 MB HTML document which exhausts the JVM's heap when loaded into a DOM-like structure such as the one HTMLParser builds. So it's impossible for us to use Parser at present without hitting this problem.
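For context, what we do today is essentially the stock DOM-style load (a rough sketch assuming the 1.6-style API; the method name is mine), which is what pulls the whole document into memory at once:

    import org.htmlparser.Parser;
    import org.htmlparser.util.NodeList;
    import org.htmlparser.util.ParserException;

    // Roughly what we do now: the whole document becomes a tree of
    // Nodes in memory, which is what overflows the heap on a 60 MB input.
    NodeList loadAll(String url) throws ParserException {
        Parser parser = new Parser(url);
        return parser.parse(null); // null filter: keep every node
    }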
My attention has now turned to Lexer, but my problem is that our existing code performs the spacing heuristics based on the positions of the *end* tags. To be more precise, it works like this:
    Reader createNextReader() {
        while (index < nodeList.size()) {
            Node node = nodeList.elementAt(index++);
            if (node instanceof Text) {
                // return the text content as a reader
            } else if (node instanceof Tag) {
                String tagName = ((Tag) node).getTagName();
                if (dropTags.contains(tagName)) {
                    continue;
                }
                // if the tag has children, create a reader
                // which reads the children
                Reader childrenReader = ...
                // and now the spacing logic for the tag
                Reader tagText = ...
                return // join the readers...
            }
        }
        return null;
    }
Conceptually, if we were to rewrite this to use the Lexer and get rid of the recursion we're using, we would need some way to determine where the end of a tag is. That's fine if the document specifies end tags explicitly, but with a document like this...
<p> Para
<p> Another para
...the Lexer doesn't emit any end tags (because there aren't any).
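To illustrate, here is a minimal Lexer loop (a sketch assuming the org.htmlparser.lexer API; the method name and charset are my own choices). On the fragment above it only ever reports the two <p> start tags, so there is no event to hang the end-of-paragraph spacing on:

    import java.io.FileInputStream;
    import org.htmlparser.Node;
    import org.htmlparser.Tag;
    import org.htmlparser.lexer.Lexer;
    import org.htmlparser.lexer.Page;

    void dumpTags(String file) throws Exception {
        Lexer lexer = new Lexer(new Page(new FileInputStream(file), "UTF-8"));
        for (Node node = lexer.nextNode(); node != null; node = lexer.nextNode()) {
            if (node instanceof Tag) {
                Tag tag = (Tag) node;
                // isEndTag() is only true for an explicit </...> in the source
                System.out.println((tag.isEndTag() ? "END " : "START ")
                        + tag.getTagName());
            }
        }
    }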
If my understanding is correct, the Scanner / TagScanner is what determines where the end of a tag is, but it isn't clear whether it could be used to solve this problem, and there are no examples of how to use it.
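One workaround I can imagine, short of figuring out Scanner, is to infer the implicit end tags myself while streaming from the Lexer: keep a stack of open tag names and, for tags like <p> whose start implicitly closes a previous instance, synthesize the missing end event. A rough sketch (the class, method names, and tag set are mine, not HTMLParser's):

    import java.util.ArrayDeque;
    import java.util.Arrays;
    import java.util.Deque;
    import java.util.HashSet;
    import java.util.Set;

    class ImpliedEndTags {
        // tags whose start tag implicitly closes a previous open instance
        private final Set<String> implied =
                new HashSet<String>(Arrays.asList("P", "LI", "TR", "TD"));
        private final Deque<String> open = new ArrayDeque<String>();

        void onStartTag(String name) {
            if (implied.contains(name) && name.equals(open.peek()))
                onEndTag(open.pop()); // synthesize the event the Lexer never emits
            open.push(name);
        }

        void onEndTag(String name) {
            // run the same spacing heuristics we currently run on real end tags
        }
    }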
You should look at the code for StringBean, which does a plaintext conversion.
It doesn't handle <p> specially, but you may be able to wedge some special code in there.
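For reference, the stock usage is just something like this (using the org.htmlparser.beans.StringBean properties; the class name and URL are placeholders):

    import org.htmlparser.beans.StringBean;

    public class Extract {
        public static void main(String[] args) {
            StringBean sb = new StringBean();
            sb.setLinks(false);   // drop link URLs from the output
            sb.setCollapse(true); // collapse runs of whitespace
            sb.setURL("http://example.com/big.html");
            // note: internally this still drives a full Parser over the page
            System.out.println(sb.getStrings());
        }
    }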
We're already doing it that way (using Parser). See the first couple of paragraphs in my post for why that isn't a solution.