
How to save memory during parsing?

jianjin
2013-12-25
2014-01-06
  • jianjin

    jianjin - 2013-12-25

    Our system needs to parse thousands of HTML documents per second. We found that GC is a big problem for our system's performance and latency.

    Is there any way to reduce memory usage during parsing? I took some memory dumps and noticed that startTag objects used a lot of memory. Could we ignore some tags during parsing? Would that help?

    For example, if I am only interested in the URLs of img tags, could we parse only the img tags and get just those results?

     
  • Martin Jericho

    Martin Jericho - 2013-12-26

    Hi Jianjin,

    Using StreamedSource instead of Source allows you to limit memory usage on large files. It doesn't have all of the functionality of the Source class but should be sufficient if you are only parsing links.
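
    For the img case, a minimal sketch of the StreamedSource approach might look like the following. The class and method names (ImgSrcExtractor, extractImgSrcs) are made up for illustration, and it assumes the Jericho jar is on the classpath. It iterates over the segments one at a time and keeps only the src values, rather than building the full Source representation of the document.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import net.htmlparser.jericho.HTMLElementName;
import net.htmlparser.jericho.Segment;
import net.htmlparser.jericho.StartTag;
import net.htmlparser.jericho.StreamedSource;

// Hypothetical helper class, not part of Jericho itself.
public class ImgSrcExtractor {

    // Collects the src attribute of every img start tag found in the input.
    static List<String> extractImgSrcs(Reader html) throws IOException {
        List<String> urls = new ArrayList<>();
        // StreamedSource is Closeable, so try-with-resources releases it.
        try (StreamedSource source = new StreamedSource(html)) {
            for (Segment segment : source) {
                // Skip text, end tags, and character references.
                if (!(segment instanceof StartTag)) continue;
                StartTag tag = (StartTag) segment;
                // Skip every start tag that is not an img.
                if (!HTMLElementName.IMG.equals(tag.getName())) continue;
                String src = tag.getAttributeValue("src");
                // Copy the value out now; do not hold on to the tag itself.
                if (src != null) urls.add(src);
            }
        }
        return urls;
    }

    public static void main(String[] args) throws IOException {
        String html = "<p>intro <img src=\"a.png\"> text <img alt=\"no src\"> <img src=\"b.jpg\"></p>";
        for (String url : extractImgSrcs(new StringReader(html))) {
            System.out.println(url);
        }
    }
}
```

    As far as I understand the StreamedSource contract, a segment should only be used until the iterator moves on, which is why the sketch stores the attribute string instead of the StartTag object.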

    Alternatively you could try the latest development release which has numerous memory optimisations.
    http://jericho.htmlparser.net/temp/jericho-html-3.4-dev.zip

    Cheers
    Martin

     
    • jianjin

      jianjin - 2014-01-06

      Thanks. I will try two things: 1. upgrade the parser, and 2. see whether StreamedSource helps.

       
