
Event-Driven Lexing

2005-11-14
2013-04-27
  • Michael Rimov

    Michael Rimov - 2005-11-14

    Hi there.

    First off, thank you for the library.  I've successfully used the html-lexer module to perform some fancy page composition stuff and I've been quite happy with it.

    Now I'm interested in increasing performance. (Performance is already pretty good, BTW, for anyone watching who is concerned by my post.)

    What I'd like is to be able to send the partially completed document to the lexer every time I receive output. The lexer would, in turn, fire the "on tag encountered" event for every complete tag found in the buffer.  The remaining 'garbage' would be saved until the next chunk of data came in, at which point the events would continue firing.
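To make the idea concrete, here is a rough sketch of the behaviour described above. It is not htmlparser code: `ChunkedTagScanner`, `feed` and `leftover` are hypothetical names, and the scan is deliberately naive (it treats anything between `<` and `>` as a tag, so it would be fooled by a `>` inside an attribute value, a comment or a script).

```java
import java.util.function.Consumer;

/** Hypothetical helper (not part of htmlparser): fires callbacks as chunks arrive. */
class ChunkedTagScanner {
    private final StringBuilder pending = new StringBuilder(); // unconsumed input
    private final Consumer<String> onTag;
    private final Consumer<String> onText;

    ChunkedTagScanner(Consumer<String> onTag, Consumer<String> onText) {
        this.onTag = onTag;
        this.onText = onText;
    }

    /** Append a chunk, fire events for everything complete, keep the rest. */
    void feed(String chunk) {
        pending.append(chunk);
        int pos = 0;
        while (true) {
            int open = pending.indexOf("<", pos);
            if (open < 0) {
                // No tag start left: whatever remains is plain text.
                if (pos < pending.length()) {
                    onText.accept(pending.substring(pos));
                }
                pos = pending.length();
                break;
            }
            if (open > pos) {
                onText.accept(pending.substring(pos, open));
            }
            int close = pending.indexOf(">", open);
            if (close < 0) {
                // Incomplete tag: save it until the next chunk arrives.
                pos = open;
                break;
            }
            onTag.accept(pending.substring(open, close + 1));
            pos = close + 1;
        }
        pending.delete(0, pos);
    }

    /** The saved fragment that is still waiting for more data. */
    String leftover() {
        return pending.toString();
    }
}
```

Feeding `"<p>Hi <b"` and then `">x</b>"` would fire the tag event for `<p>` after the first chunk and hold `<b` back until the second chunk completes it. Only the unfinished fragment stays in memory, which is the property I'm after.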

    This is pretty much like what I saw in the code as the beginnings of NIO capabilities, but being a newbie in Java NIO, I couldn't really determine how to fit it all together. (Esp. since I'm basically parsing the data as it comes into a wrapped OutputStream.)

    I'd like this so I could potentially handle document rewriting without loading the document into memory.

    Any pointers on how it could all be fit together?

    Thanks!

    -Mike

     
    • Derrick Oswald

      Derrick Oswald - 2005-11-15

      Should be possible, unless you are outputting partial tags or text, in which case you may need to alter the lexer to handle a partially scanned lexeme without creating a TextNode by default for the fragment.

      I would try it using a background thread and a custom Source derived from org.htmlparser.lexer.Source. The current subclasses (InputStreamSource and StringSource) buffer the input as characters and a String respectively, and report end of stream when the end of those is reached. But you could have a 'pending' flag that the thread could check and 'wait' on if no more characters were available. The sending code could 'notify' the flag when more came in and the thread picks up where it left off. If the pending flag isn't set and there are no more characters, then that's the real indication you're at the end of the Source.
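A minimal sketch of the 'pending' flag idea, under some assumptions: to stay self-contained it extends plain `java.io.Reader` rather than `org.htmlparser.lexer.Source`, so a real custom Source would still need that class's other methods implemented. The names `PendingReader`, `append` and `finish` are made up for illustration.

```java
import java.io.IOException;
import java.io.Reader;

/** Hypothetical sketch: a Reader that blocks while more input is still expected. */
class PendingReader extends Reader {
    private final StringBuilder buffer = new StringBuilder();
    private int position = 0;       // index of the next character to hand out
    private boolean pending = true; // true while the sender may still add data

    /** Sending side: add a chunk and wake the waiting reader thread. */
    public synchronized void append(String chunk) {
        buffer.append(chunk);
        notifyAll();
    }

    /** Sending side: clear the pending flag; no more data will arrive. */
    public synchronized void finish() {
        pending = false;
        notifyAll();
    }

    @Override
    public synchronized int read(char[] cbuf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        // Out of characters but more are expected: wait for the sender.
        while (position == buffer.length() && pending) {
            try {
                wait();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IOException("interrupted while waiting for input", e);
            }
        }
        // Out of characters and nothing pending: this is the real end of stream.
        if (position == buffer.length()) {
            return -1;
        }
        int n = Math.min(len, buffer.length() - position);
        buffer.getChars(position, position + n, cbuf, off);
        position += n;
        return n;
    }

    @Override
    public synchronized void close() {
        finish();
    }
}
```

The parsing would run on the background thread, reading from this object, while the code receiving output calls `append` for each chunk and `finish` once the document is complete. Note that this sketch keeps every character it has received, so on its own it does not reduce memory use; discarding consumed characters would be a further change.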

       
