Our system needs to parse thousands of HTML documents per second. We found that GC is a big problem for our system's performance and latency.
Is there any way to reduce memory usage during parsing? I took some heap dumps and noticed that StartTag objects use a lot of memory. Could we ignore some tags during parsing? Would that help?
For example, if I am only interested in the URLs of img tags, could we parse and return only the img tags?
Hi Jianjin,
Using StreamedSource instead of Source allows you to limit memory usage on large files. It doesn't have all of the functionality of the Source class but should be sufficient if you are only parsing links.
Alternatively, you could try the latest development release, which has numerous memory optimisations:
http://jericho.htmlparser.net/temp/jericho-html-3.4-dev.zip
Cheers
Martin
Thanks. I will (1) upgrade the parser and (2) see whether StreamedSource helps.
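For anyone following along, here is a minimal sketch of the StreamedSource approach suggested above, assuming the Jericho jar is on the classpath. The class name ImgSrcExtractor and the helper extractImgSrcs are made up for illustration and are not part of the library. Note that StreamedSource still hands back every tag it encounters; the saving comes from iterating over the document once instead of keeping the whole parsed structure in memory, so how much it helps with GC is something to measure on your own workload.

```java
import net.htmlparser.jericho.HTMLElementName;
import net.htmlparser.jericho.Segment;
import net.htmlparser.jericho.StartTag;
import net.htmlparser.jericho.StreamedSource;

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class ImgSrcExtractor {

    // Hypothetical helper: returns the src value of every <img> start tag,
    // in document order, skipping img tags that have no src attribute.
    static List<String> extractImgSrcs(Reader html) throws IOException {
        List<String> urls = new ArrayList<>();
        // StreamedSource is Closeable, so try-with-resources releases it.
        try (StreamedSource source = new StreamedSource(html)) {
            for (Segment segment : source) {
                // Ignore plain text, end tags, character references, etc.
                if (!(segment instanceof StartTag)) {
                    continue;
                }
                StartTag tag = (StartTag) segment;
                // getName() is lower case, so this also matches <IMG ...>.
                if (HTMLElementName.IMG.equals(tag.getName())) {
                    String src = tag.getAttributeValue("src");
                    if (src != null) {
                        urls.add(src);
                    }
                }
            }
        }
        return urls;
    }

    public static void main(String[] args) throws IOException {
        String html = "<p><img src=\"a.png\" alt=\"A\"><a href=\"x\">x</a><IMG SRC='b.jpg'></p>";
        System.out.println(extractImgSrcs(new StringReader(html)));
    }
}
```

Only the src strings are copied out of each segment and no reference to the segment itself is kept, which the StreamedSource documentation recommends because segments are only meant to be used during their own iteration step.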