html cleanup?

  • mpease

    mpease - 2007-03-16

    Hi all --

       I'm trying to do simple HTML cleanup, converting tag-soup HTML to valid XHTML.

       I'm intrigued by this project because it seems more current than other Java libraries (TagSoup and JTidy). It also leverages Firefox's HTML parsing, which must be some of the most robust parsing code around.

       But... I need it to work consistently. Your release number is really low. What sort of performance can I expect here?

       What are the plans for this project going forward?

    Thank you-

    • Ohad Serfaty

      Ohad Serfaty - 2007-03-18

      Hi Matt
      I can say what I know from personal experience with the parser; perhaps other users can share their own experience.
      We developed this parser as part of the Dapper project, and we use it to parse HTML pages. Currently the parser handles more than 80,000 HTML pages a day, concurrently, without any memory leaks.
      The Mozilla parser is slower than other Java parsers like TagSoup or Neko (about 30% slower than TagSoup), but it is more accurate and creates a denser, more reliable DOM, and you have to take that into account as well. A more complex DOM produces more load, because even a simple element traversal takes longer, so the performance penalty grows with the size of your post-processing task.

      All in all, we are pretty satisfied with the Mozilla parser here at Dapper. We had some issues with it at first, mainly performance related, and it crashed our servers for some time. We switched to a brand-new 4GB server, and that more or less solved it. I wouldn't say it is totally stable or call it production-ready, but it is moving in the right direction...

      Let me know if you have any more questions. You can Skype me at skype:ohad.serfay if you want to discuss this further.



