#66 Sample of using Lexer but finding the end of tags

open
5
2009-01-12
2009-01-12
Trejkaz
No

I have a class which extracts text from HTML by using HTMLParser to convert the document into a NodeList and then applying per-element rules to insert appropriate spacing, so that the plaintext version is laid out similarly to the original.

My problem is memory usage. I mentioned it on another bug report, but we have a 60MB HTML document which blows out the JVM's memory if it is loaded into a DOM-like structure such as the one HTMLParser builds. So at present we can't use Parser without hitting this problem.

My attention has now turned to Lexer, but my problem is that our existing code performs the spacing heuristics based on the position of the *end* tags. To be more precise, it works like this:

createNextReader() {
    while (index < nodeList.size()) {
        Node node = nodeList.elementAt(index++);
        if (node instanceof Text) {
            // return text as a reader
        } else if (node instanceof Tag) {
            String tagName = ((Tag) node).getTagName();
            if (dropTags.contains(tagName)) {
                continue;
            }

            // if the tag has children, create reader
            // which reads the children
            Reader childrenReader = ...

            // and now logic for the tag
            Reader tagText = ...

            return // join readers...
        }
    }
    return null;
}

Conceptually, if we were to rewrite this to use the Lexer and get rid of the recursion we're using, we would need some way to determine where each element ends. This is fine if the document specifies the end tags, but when a document is like this...

<p> Para
<p> Another para

Lexer doesn't emit any end tags (because there aren't any.)

If what I'm thinking is correct, the Scanner / TagScanner is what determines where the end of a tag is, but it isn't clear whether this could be used to solve this problem and there are no examples of how to use it.
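
To make the idea concrete, here is a minimal sketch of what I think the missing piece would look like: a small stack that tracks open tags over a flat token stream and synthesises the end tags the document left out. This is not HTMLParser code. The class name, the "OPEN"/"CLOSE"/"TEXT" event strings and the set of tags treated as implicitly closed are all made up for illustration; in real use the events would presumably be derived from the nodes the Lexer hands back one at a time. It only closes an element when the same tag is opened directly inside it, which covers the <p> case above but not every rule HTML has.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Set;

/** Hypothetical helper, not part of HTMLParser. */
class ImpliedEndTags {
    // Assumption: opening one of these while the same tag is the innermost
    // open element implies an end tag for the previous one.
    static final Set<String> CLOSED_BY_REPEAT = Set.of("P", "LI");

    /**
     * Takes events of the form "OPEN name", "CLOSE name" or "TEXT ..."
     * and returns the same stream with the missing CLOSE events added.
     */
    static List<String> insertImpliedEnds(List<String> events) {
        Deque<String> open = new ArrayDeque<>();
        List<String> out = new ArrayList<>();
        for (String event : events) {
            if (event.startsWith("OPEN ")) {
                String name = event.substring(5);
                // <p> directly inside an open <p>: end the previous one first.
                if (CLOSED_BY_REPEAT.contains(name) && name.equals(open.peek())) {
                    out.add("CLOSE " + open.pop());
                }
                open.push(name);
                out.add(event);
            } else if (event.startsWith("CLOSE ")) {
                String name = event.substring(6);
                // An explicit end tag also ends anything still open inside it.
                // A stray end tag with no matching open tag is dropped.
                if (open.contains(name)) {
                    String top;
                    do {
                        top = open.pop();
                        out.add("CLOSE " + top);
                    } while (!top.equals(name));
                }
            } else {
                out.add(event);
            }
        }
        // End of input ends whatever is still open.
        while (!open.isEmpty()) {
            out.add("CLOSE " + open.pop());
        }
        return out;
    }
}
```

With something like this in front of the spacing heuristics, the two-paragraph example above would produce a CLOSE event for the first <p> just before the second one opens, and another at end of input, and only the stack of open tag names stays in memory rather than the whole tree.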

Discussion

  • Derrick Oswald

    Derrick Oswald - 2009-01-12
    • labels: --> Programming Problem
    • assigned_to: nobody --> derrickoswald
    • status: open --> pending
     
  • Derrick Oswald

    Derrick Oswald - 2009-01-12

    You should look at the code for StringBean that does a plaintext conversion.
    It doesn't handle <p> specially, but you may be able to wedge some special code in there.

     
  • Trejkaz

    Trejkaz - 2009-01-12

    We're already doing it that way (using Parser.) See the first couple of paragraphs in my post as to why it isn't a solution.

     
  • Trejkaz

    Trejkaz - 2009-01-12
    • status: pending --> open
     
