Our system needs to parse thousands of HTML documents per second. We found that GC is a big problem for our system's performance and latency.
Is there any way to reduce memory usage during parsing? I took some heap dumps and noticed that StartTag objects use a lot of memory. Could we skip some tags during parsing? Would that help?
For example, if I am only interested in each img's URL, could we parse and return only the img tags?
Using StreamedSource instead of Source allows you to limit memory usage on large files. It doesn't have all of the functionality of the Source class but should be sufficient if you are only parsing links.
Alternatively you could try the latest development release which has numerous memory optimisations.
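A minimal sketch of the StreamedSource approach for the img-URL case above (assuming a recent Jericho release on the classpath; the iteration pattern streams segments one at a time instead of building a full document tree):

```java
import net.htmlparser.jericho.Segment;
import net.htmlparser.jericho.StartTag;
import net.htmlparser.jericho.StreamedSource;

import java.util.ArrayList;
import java.util.List;

public class ImgSrcExtractor {
    // Iterates over the document as a stream of segments; only the current
    // segment is held in memory, so large inputs don't allocate a full DOM.
    public static List<String> extractImgSrcs(String html) {
        List<String> srcs = new ArrayList<>();
        StreamedSource source = new StreamedSource(html);
        for (Segment segment : source) {
            if (segment instanceof StartTag) {
                StartTag tag = (StartTag) segment;
                // Only inspect img start tags; all other segments are skipped.
                if ("img".equals(tag.getName())) {
                    String src = tag.getAttributeValue("src");
                    if (src != null) srcs.add(src);
                }
            }
        }
        return srcs;
    }

    public static void main(String[] args) {
        String html = "<html><body><img src=\"a.png\"><p>x</p>"
                + "<img src=\"b.png\"></body></html>";
        System.out.println(extractImgSrcs(html));
    }
}
```

For very large files, pass a Reader or InputStream to the StreamedSource constructor instead of a String so the raw input is not held in memory either.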
Thanks, I will try to 1. upgrade the parser and 2. see if StreamedSource helps.