
#18 handle very large files

Status: open-accepted
Owner: nobody
Labels: None
Priority: 5
Updated: 2006-03-29
Created: 2004-03-24
Creator: Peter MH
Private: No

I got an OutOfMemoryError while parsing a 4.7 MB HTML file
using HTMLParser 1.4.
I tried several methods, with and without filters, e.g.:
NodeList list = parser.extractAllNodesThatMatch(filter);
NodeIterator iter = parser.elements();
...
Every attempt ends with the same OutOfMemoryError.

Peter MH

Discussion

  • Peter MH

    Peter MH - 2004-03-24

    Logged In: YES
    user_id=149976

    Well, I checked with 3.1 MB and 2.5 MB files and still get
    the error. What is the best way to parse huge files when I
    only want the TITLE and META tags?
    Peter MH

     
  • Derrick Oswald

    Derrick Oswald - 2004-03-29
    • labels: --> 322296
    • status: open --> open-accepted
     
  • Derrick Oswald

    Derrick Oswald - 2004-03-29

    Logged In: YES
    user_id=605407

    Can you add a link to one of these large files for testing
    purposes? Or attach it to this bug, if the tracker will let
    you add such a large file.

    My initial advice would be to use the Lexer class in
    htmllexer.jar rather than the Parser class in
    htmlparser.jar, since the lexer doesn't read in the whole
    page at once (at least for URLConnections, though not for strings).
    Check each returned node to see if it's a TagNode, then
    check for a name equal to TITLE or META, until you encounter
    a tag with the name BODY, and then stop.
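Derrick's streaming recipe can be sketched without the library at all. The JDK-only illustration below is a stand-in for the real Lexer loop, not the htmlparser API itself (`HeadTagScanner` and `headTags` are hypothetical names): scan tags incrementally, collect TITLE and META, and stop as soon as BODY appears, so the bulk of the file is never examined.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HeadTagScanner {
    // Matches an opening tag; group 1 captures the tag name.
    private static final Pattern TAG = Pattern.compile("<\\s*([a-zA-Z]+)[^>]*>");

    /** Collects TITLE and META tags, reading no further than the BODY tag. */
    static List<String> headTags(Reader in) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        List<String> found = new ArrayList<>();
        String line;
        while ((line = reader.readLine()) != null) {
            Matcher m = TAG.matcher(line);
            while (m.find()) {
                String name = m.group(1).toUpperCase();
                if (name.equals("BODY"))
                    return found;               // the head is done: stop reading
                if (name.equals("TITLE") || name.equals("META"))
                    found.add(m.group());
            }
        }
        return found;
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><head><title>t</title>"
            + "<meta name=\"a\" content=\"b\"></head>"
            + "<body><title>ignored</title></body></html>";
        // prints: [<title>, <meta name="a" content="b">]
        System.out.println(headTags(html == null ? null : new StringReader(html)));
    }
}
```

This line-by-line regex scan misses tags that span a line break; a real lexer tokenizes across lines, which is one reason to prefer the library's Lexer over hand-rolled matching.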

     
  • Derrick Oswald

    Derrick Oswald - 2004-03-29
    • assigned_to: nobody --> derrickoswald
     
  • Peter MH

    Peter MH - 2004-03-29

    Logged In: YES
    user_id=149976

    Well, I haven't worked with the Lexer class yet, but I will
    give it a try.
    The HTML file belongs to a project, so I will have to make
    some modifications before I can upload it.
    Peter MH

     
  • Peter MH

    Peter MH - 2004-03-29

    Logged In: YES
    user_id=149976

    Here is the file

     
  • Peter MH

    Peter MH - 2004-03-30

    Logged In: YES
    user_id=149976

    Using the Lexer is really nice. I switched from the Parser
    class to the Lexer class, and the OutOfMemoryError is gone.

     
  • Ralf Holzer

    Ralf Holzer - 2004-06-12

    Logged In: YES
    user_id=1062407

    I have the same problem with 1.5. I was trying to parse the
    following page:

    http://www.zaik.uni-koeln.de/pipermail/dmanet.mbox/dmanet.mbox

    I can't use the lexer, because I need to read the entire page as
    a string.

    Any ideas?

     
  • Derrick Oswald

    Derrick Oswald - 2004-06-13

    Logged In: YES
    user_id=605407

    Try giving it more memory:

    java -Xmx512M -jar lib/htmlparser.jar
    http://www.zaik.uni-koeln.de/pipermail/dmanet.mbox/dmanet.mbox

    This increases the heap to 512 megabytes, which allows it
    to handle that URL of over 9 MB. The same switch can be applied
    to your own program either at the command line or in a
    development environment.
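A quick way to confirm that the -Xmx switch actually took effect is to ask the runtime for its configured maximum heap; this uses only the standard `Runtime` API (the class name `HeapCheck` is hypothetical):

```java
public class HeapCheck {
    public static void main(String[] args) {
        // maxMemory() reports the -Xmx ceiling the JVM will attempt to use.
        long maxBytes = Runtime.getRuntime().maxMemory();
        System.out.println("Max heap: " + (maxBytes / (1024 * 1024)) + " MB");
    }
}
```

Run with `java -Xmx512M HeapCheck` and the printed figure should be close to 512 MB (the exact value varies by JVM).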

     
  • Derrick Oswald

    Derrick Oswald - 2004-07-31

    Logged In: YES
    user_id=605407

    See test case MemoryTest.testBigFile ().

     
  • Derrick Oswald

    Derrick Oswald - 2004-07-31
    • assigned_to: derrickoswald --> nobody
     
  • Francisco Mesa

    Francisco Mesa - 2005-04-09

    Logged In: YES
    user_id=626861

    I use 1.5 and get an OutOfMemoryError with 50-70 KB files
    after iterating over about 30 files. I tried gc(), but it
    does not help. Could this be a Parser dispose problem?

     
  • Derrick Oswald

    Derrick Oswald - 2005-04-10

    Logged In: YES
    user_id=605407

    Francisco,

    It's unlikely to be a 'dispose' problem; otherwise it would
    certainly have been reported long ago.

    Perhaps you are holding on to some nodes returned by the parser.
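The "holding on to nodes" failure mode is easy to reproduce in isolation. This self-contained sketch (hypothetical names; byte arrays stand in for parse results) accumulates each iteration's output in a long-lived collection, so no amount of gc() calls can reclaim the memory until the references themselves are dropped:

```java
import java.util.ArrayList;
import java.util.List;

public class RetainedNodes {
    // Stand-in for a cache that keeps parse results across iterations.
    static final List<byte[]> retained = new ArrayList<>();

    static void parseMany(int files) {
        for (int i = 0; i < files; i++) {
            byte[] nodes = new byte[64 * 1024]; // stand-in for one file's node list
            retained.add(nodes);                // the leak: every result stays reachable
        }
    }

    public static void main(String[] args) {
        parseMany(30);
        System.out.println("retained: " + retained.size()); // prints: retained: 30
        retained.clear(); // drop the references so the collector can reclaim them
        System.out.println("retained: " + retained.size()); // prints: retained: 0
    }
}
```

If each file's nodes are processed and then discarded inside the loop, 30 iterations over 50-70 KB files should not exhaust the heap.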

     
  • Derrick Oswald

    Derrick Oswald - 2006-03-29

    Logged In: YES
    user_id=605407

    Changing this to an RFE, since it is likely to involve a
    significant rewrite of some underlying classes...
    was: Bug# 922439 OutOfMemory on huge HTML files (4,7MB)

     
  • Derrick Oswald

    Derrick Oswald - 2006-03-29
    • labels: 322296 -->
    • summary: OutOfMemory on huge HTML files (4,7MB) --> handle very large files
     
  • Trejkaz

    Trejkaz - 2009-01-11

    We're having the same issue, except the file is 60 MB and eats over 1 GB; the remaining memory up to our 1.3 GB limit is used by the rest of the app, so increasing the JVM's heap isn't practical for us. :-)

    Switching to using Lexer directly seems like it might be a lot of work to get the line breaks correct in our conversion to plain text.

     
