Jericho HTML Parser / Discussion / Open Discussion: How to save memory during parsing?

How to save memory during parsing?

Forum: Open Discussion

Creator: jianjin

Created: 2013-12-25

Updated: 2014-01-06

jianjin - 2013-12-25

Our system needs to parse thousands of HTML documents per second. We found that GC is a big problem for our system's performance and latency.

Is there any way to reduce memory usage during parsing? I took some heap dumps and noticed that StartTag objects use a lot of memory. Could we ignore some tags during parsing? Would that help?

For example, if I am only interested in the URLs of img tags, could we parse only the img tags and get just those results?

Martin Jericho - 2013-12-26

Hi Jianjin,

Using StreamedSource instead of Source allows you to limit memory usage on large files. It doesn't have all of the functionality of the Source class but should be sufficient if you are only parsing links.
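As a rough illustration of the StreamedSource approach, here is a minimal sketch that collects only the src values of img tags. The class name ImgSrcExtractor and the method extractImgSrcs are made up for this example and are not part of the library; the sketch assumes the Jericho jar is on the classpath and that the input is available as a Reader.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

import net.htmlparser.jericho.HTMLElementName;
import net.htmlparser.jericho.Segment;
import net.htmlparser.jericho.StartTag;
import net.htmlparser.jericho.StreamedSource;

// Hypothetical helper for illustration only; not part of Jericho.
class ImgSrcExtractor {

    // Streams through the document and returns the src value of every img start tag.
    static List<String> extractImgSrcs(Reader reader) throws IOException {
        List<String> urls = new ArrayList<>();
        StreamedSource streamedSource = new StreamedSource(reader);
        try {
            // Segments are produced one at a time, so the whole document
            // does not need to be held in memory as with Source.
            for (Segment segment : streamedSource) {
                if (!(segment instanceof StartTag)) {
                    continue; // skip text, end tags and other segments
                }
                StartTag startTag = (StartTag) segment;
                if (HTMLElementName.IMG.equals(startTag.getName())) {
                    String src = startTag.getAttributeValue("src");
                    if (src != null) {
                        urls.add(src); // keep only the string, not the segment
                    }
                }
            }
        } finally {
            streamedSource.close();
        }
        return urls;
    }
}
```

Only the attribute strings are kept here rather than the Segment objects themselves, since segments from a streamed parse are meant to be processed as they are encountered and not stored.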

Alternatively you could try the latest development release which has numerous memory optimisations.
http://jericho.htmlparser.net/temp/jericho-html-3.4-dev.zip

Cheers
Martin

jianjin - 2014-01-06

Thanks, I will try two things: 1. upgrade the parser, and 2. see whether StreamedSource helps.
  