
#18 handle very large files

Status: open-accepted
Owner: nobody
Labels: None
Priority: 5
Updated: 2006-03-29
Created: 2004-03-24
Creator: Peter MH
Private: No

I got an OutOfMemoryError while parsing a 4.7 MB HTML file
using HTMLParser 1.4.
I tried several methods, with and without filters, e.g.:
NodeList list = parser.extractAllNodesThatMatch(filter);
NodeIterator iter = parser.elements();
...
Every attempt ends with the same OutOfMemoryError.

Peter MH

Discussion

  • Peter MH

    Peter MH - 2004-03-24

    Logged In: YES
    user_id=149976

    Well, I checked with 3.1 MB and 2.5 MB files and still get
    the error. What is the best way to parse huge files when I
    only want the TITLE and META tags?
    Peter MH

     
  • Derrick Oswald

    Derrick Oswald - 2004-03-29
    • labels: --> 322296
    • status: open --> open-accepted
     
  • Derrick Oswald

    Derrick Oswald - 2004-03-29

    Logged In: YES
    user_id=605407

    Can you add a link to one of these large files for testing
    purposes? Or attach it to this bug, if the tracker will let
    you add such a large file.

    My initial advice would be to use the Lexer class in
    htmllexer.jar rather than the Parser class in
    htmlparser.jar, since the lexer doesn't read in the whole
    page at once (at least for URLConnections, though not for strings).
    Check each returned node to see if it's a TagNode, then
    check for a name equal to TITLE or META, until you encounter
    a tag with the name BODY, and then stop.
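Derrick's streaming recipe can be sketched without the library at all. The JDK-only illustration below is a stand-in for the real Lexer loop, not the htmlparser API itself (`HeadTagScanner` and `headTags` are hypothetical names): scan tags incrementally, collect TITLE and META, and stop as soon as BODY appears, so the bulk of the file is never examined.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HeadTagScanner {
    // Matches an opening tag; group 1 captures the tag name.
    private static final Pattern TAG = Pattern.compile("<\\s*([a-zA-Z]+)[^>]*>");

    /** Collects TITLE and META tags, reading no further than the BODY tag. */
    static List<String> headTags(Reader in) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        List<String> found = new ArrayList<>();
        String line;
        while ((line = reader.readLine()) != null) {
            Matcher m = TAG.matcher(line);
            while (m.find()) {
                String name = m.group(1).toUpperCase();
                if (name.equals("BODY"))
                    return found;               // the head is done: stop reading
                if (name.equals("TITLE") || name.equals("META"))
                    found.add(m.group());
            }
        }
        return found;
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><head><title>t</title>"
            + "<meta name=\"a\" content=\"b\"></head>"
            + "<body><title>ignored</title></body></html>";
        // prints: [<title>, <meta name="a" content="b">]
        System.out.println(headTags(html == null ? null : new StringReader(html)));
    }
}
```

This line-by-line regex scan misses tags that span a line break; a real lexer tokenizes across lines, which is one reason to prefer the library's Lexer over hand-rolled matching.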

     
  • Derrick Oswald

    Derrick Oswald - 2004-03-29
    • assigned_to: nobody --> derrickoswald
     
  • Peter MH

    Peter MH - 2004-03-29

    Logged In: YES
    user_id=149976

    Well, I haven't worked with the Lexer class yet, but I will
    give it a try.
    The HTML file belongs to a project, so I will have to make
    some modifications before I can upload it.
    Peter MH

     
  • Peter MH

    Peter MH - 2004-03-29

    Logged In: YES
    user_id=149976

    Here is the file

     
  • Peter MH

    Peter MH - 2004-03-30

    Logged In: YES
    user_id=149976

    Using the Lexer is really nice. I switched from the Parser
    class to the Lexer class, and the OutOfMemoryError is gone.

     
  • Ralf Holzer

    Ralf Holzer - 2004-06-12

    Logged In: YES
    user_id=1062407

    I have the same problem with 1.5. I was trying to parse the
    following page:

    http://www.zaik.uni-koeln.de/pipermail/dmanet.mbox/dmanet.mbox

    I can't use the lexer, because I need to read the entire page as
    a string.

    Any ideas?

     
  • Derrick Oswald

    Derrick Oswald - 2004-06-13

    Logged In: YES
    user_id=605407

    Try giving it more memory:

    java -Xmx512M -jar lib/htmlparser.jar
    http://www.zaik.uni-koeln.de/pipermail/dmanet.mbox/dmanet.mbox

    This increases the heap to 512 megabytes, which allows it
    to handle that URL of over 9 MB. The same switch can be applied
    to your own program either at the command line or in a
    development environment.
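A quick way to confirm that the -Xmx switch actually took effect is to ask the runtime for its configured maximum heap; this uses only the standard `Runtime` API (the class name `HeapCheck` is hypothetical):

```java
public class HeapCheck {
    public static void main(String[] args) {
        // maxMemory() reports the -Xmx ceiling the JVM will attempt to use.
        long maxBytes = Runtime.getRuntime().maxMemory();
        System.out.println("Max heap: " + (maxBytes / (1024 * 1024)) + " MB");
    }
}
```

Run with `java -Xmx512M HeapCheck` and the printed figure should be close to 512 MB (the exact value varies by JVM).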

     
  • Derrick Oswald

    Derrick Oswald - 2004-07-31

    Logged In: YES
    user_id=605407

    See test case MemoryTest.testBigFile ().

     
  • Derrick Oswald

    Derrick Oswald - 2004-07-31
    • assigned_to: derrickoswald --> nobody
     
  • Francisco Mesa

    Francisco Mesa - 2005-04-09

    Logged In: YES
    user_id=626861

    I use 1.5 and get an OutOfMemoryError with 50-70 KB files
    after iterating over about 30 files. I tried gc(), but it
    does not help. Could this be a Parser dispose problem?

     
  • Derrick Oswald

    Derrick Oswald - 2005-04-10

    Logged In: YES
    user_id=605407

    Francisco,

    It's unlikely to be a 'dispose' problem; otherwise it would
    certainly have been reported long ago.

    Perhaps you are holding on to some nodes returned by the parser.
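The "holding on to nodes" failure mode is easy to reproduce in isolation. This self-contained sketch (hypothetical names; byte arrays stand in for parse results) accumulates each iteration's output in a long-lived collection, so no amount of gc() calls can reclaim the memory until the references themselves are dropped:

```java
import java.util.ArrayList;
import java.util.List;

public class RetainedNodes {
    // Stand-in for a cache that keeps parse results across iterations.
    static final List<byte[]> retained = new ArrayList<>();

    static void parseMany(int files) {
        for (int i = 0; i < files; i++) {
            byte[] nodes = new byte[64 * 1024]; // stand-in for one file's node list
            retained.add(nodes);                // the leak: every result stays reachable
        }
    }

    public static void main(String[] args) {
        parseMany(30);
        System.out.println("retained: " + retained.size()); // prints: retained: 30
        retained.clear(); // drop the references so the collector can reclaim them
        System.out.println("retained: " + retained.size()); // prints: retained: 0
    }
}
```

If each file's nodes are processed and then discarded inside the loop, 30 iterations over 50-70 KB files should not exhaust the heap.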

     
  • Derrick Oswald

    Derrick Oswald - 2006-03-29

    Logged In: YES
    user_id=605407

    Changing this to an RFE, since it is likely to involve a
    significant rewrite of some underlying classes...
    was: Bug# 922439 OutOfMemory on huge HTML files (4,7MB)

     
  • Derrick Oswald

    Derrick Oswald - 2006-03-29
    • labels: 322296 -->
    • summary: OutOfMemory on huge HTML files (4,7MB) --> handle very large files
     
  • Trejkaz

    Trejkaz - 2009-01-11

    We're having the same issue, except the file is 60 MB and eats over 1 GB; the remaining memory up to our 1.3 GB limit is used by the rest of the app, so increasing the JVM's heap isn't practical for us. :-)

    Switching to using Lexer directly seems like it might be a lot of work to get the line breaks correct in our conversion to plain text.

     
