handle very large files
I got an OutOfMemoryError while parsing a 4.7 MB HTML file
using HTMLParser 1.4.
I tried different methods, with and without filters, like:
NodeList list = parser.extractAllNodesThatMatch(filter);
NodeIterator iter = parser.elements();
...
It's always the same OutOfMemoryError.
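(For context, the failing approach in full probably looked something
like the sketch below; the file name and the TagNameFilter are my
guesses. The point is that the Parser builds the complete node tree
in memory before any filtering happens.)

import org.htmlparser.Parser;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.util.NodeList;

public class BigFileOOM
{
    public static void main (String[] args) throws Exception
    {
        // the Parser reads the whole page and builds the complete
        // node tree in memory before the filter ever runs
        Parser parser = new Parser ("big.html"); // hypothetical file name
        NodeList list = parser.extractAllNodesThatMatch (new TagNameFilter ("TITLE"));
        System.out.println (list.toHtml ());
    }
}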
Peter MH
Logged In: YES
user_id=149976
Well, I checked with 3.1 MB and 2.5 MB files and I still get the error.
What is the best way to parse huge files when all I want are the
TITLE and META tags?
Peter MH
Logged In: YES
user_id=605407
Can you add a link to one of these large files for testing
purposes?
Or attach it to this bug, if it will let you attach such a
large file.
My initial advice would be to use the Lexer class in
htmllexer.jar rather than the Parser class in htmlparser.jar,
since the lexer doesn't read in the whole page at once (at least
for URLConnections, though it does for strings).
Check each returned node to see if it's a TagNode, then check for
a name equal to TITLE or META, until you encounter a tag with the
name BODY, and then stop.
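A minimal sketch of that loop, assuming the 1.6-era API
(Lexer.nextNode(), the Tag interface with getTagName() and
isEndTag()); in 1.4 the tag class may be named differently:

import java.net.URL;

import org.htmlparser.Node;
import org.htmlparser.Tag;
import org.htmlparser.lexer.Lexer;
import org.htmlparser.lexer.Page;

public class HeadScan
{
    public static void main (String[] args) throws Exception
    {
        // the Lexer streams nodes one at a time off the connection
        // instead of building the whole page in memory
        Lexer lexer = new Lexer (new Page (new URL (args[0]).openConnection ()));
        Node node;
        while (null != (node = lexer.nextNode ()))
            if (node instanceof Tag)
            {
                Tag tag = (Tag)node;
                if (!tag.isEndTag ())
                {
                    String name = tag.getTagName (); // upper case by convention
                    if ("BODY".equals (name))
                        break; // past the head; stop reading
                    if ("TITLE".equals (name) || "META".equals (name))
                        // note: for TITLE the title text itself arrives
                        // as the next Text node, not inside the tag
                        System.out.println (tag.toHtml ());
                }
            }
    }
}

Since only one node is alive at a time, memory use stays flat no
matter how large the page is.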
Logged In: YES
user_id=149976
Well, I haven't worked with the Lexer class yet, but I will give
it a try.
The HTML file belongs to a project, so I will have to make some
modifications before I can upload it.
Peter MH
Logged In: YES
user_id=149976
Here is the file
Logged In: YES
user_id=149976
Using the Lexer is really nice. I changed from the Parser class
to the Lexer class and the OutOfMemoryError is gone.
Logged In: YES
user_id=1062407
I have the same problem with 1.5. I was trying to parse the
following page:
http://www.zaik.uni-koeln.de/pipermail/dmanet.mbox/dmanet.mbox
I can't use the lexer, because I need to read the entire page as
a string.
Any ideas?
Logged In: YES
user_id=605407
Try giving it more memory:
java -Xmx512M -jar lib/htmlparser.jar
http://www.zaik.uni-koeln.de/pipermail/dmanet.mbox/dmanet.mbox
This increases the heap to 512 megabytes, which allows it to
handle that URL of over 9 MB. The same switch can be applied to
your own program, either on the command line or in a development
environment.
Logged In: YES
user_id=605407
See test case MemoryTest.testBigFile ().
Logged In: YES
user_id=626861
I am using 1.5 and get an OutOfMemoryError with 50-70 KB files
over 30 file iterations. I tried gc() but it doesn't help. Could
this be a Parser dispose problem?
Logged In: YES
user_id=605407
Francisco,
It's unlikely to be a 'dispose' problem; otherwise it certainly
would have been reported long ago.
Perhaps you are holding on to some nodes returned by the parser.
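For example (a hypothetical sketch; the per-file filter and the
loop shape are mine, not Francisco's code), keeping the NodeList
scoped to the loop body and copying out Strings lets each file's
node tree be collected before the next iteration:

import org.htmlparser.Parser;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.util.NodeList;

public class ManyFiles
{
    public static void process (String[] files) throws Exception
    {
        for (int i = 0; i < files.length; i++)
        {
            Parser parser = new Parser (files[i]); // fresh parser per file
            NodeList list = parser.extractAllNodesThatMatch (new TagNameFilter ("META"));
            for (int j = 0; j < list.size (); j++)
            {
                String html = list.elementAt (j).toHtml (); // copy out a String
                // ... use html; don't store the Node itself anywhere long-lived ...
            }
            // list and parser go out of scope here, so each file's
            // node tree is eligible for garbage collection
        }
    }
}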
Logged In: YES
user_id=605407
Changing this to an RFE, since it is likely to involve a
significant rewrite of some underlying classes...
was: Bug #922439 OutOfMemory on huge HTML files (4.7 MB)
Failing test case, uses http://htmlparser.sourceforge.net/test/A002.html
We're having the same issue, only the file is 60 MB and eats
1 GB+, and the remaining memory up to the 1.3 GB limit is used by
the rest of our app, so increasing the JVM's memory isn't
practical for us. :-)
Switching to the Lexer directly seems like it might be a lot of
work to get the line breaks correct in our conversion to plain
text.
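For what it's worth, here is a rough sketch of a streaming
plain-text conversion over the Lexer; the list of block-level tags
and the Translate.decode() entity handling are assumptions you
would have to tune, not a drop-in solution:

import java.net.URL;

import org.htmlparser.Node;
import org.htmlparser.Tag;
import org.htmlparser.Text;
import org.htmlparser.lexer.Lexer;
import org.htmlparser.lexer.Page;
import org.htmlparser.util.Translate;

public class StreamingPlainText
{
    public static void main (String[] args) throws Exception
    {
        Lexer lexer = new Lexer (new Page (new URL (args[0]).openConnection ()));
        Node node;
        while (null != (node = lexer.nextNode ()))
            if (node instanceof Text)
                // decode entities like &amp; as the text streams past
                System.out.print (Translate.decode (node.getText ()));
            else if (node instanceof Tag)
            {
                String name = ((Tag)node).getTagName ();
                // crude heuristic: break the line at block-level tags
                if ("BR".equals (name) || "P".equals (name)
                    || "DIV".equals (name) || "TR".equals (name))
                    System.out.println ();
            }
    }
}

Because nothing is retained between nodes, this keeps memory flat
even on a 60 MB input; the open question is only whether the
line-break heuristic is good enough for your output.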