Possible Bug with Lexer

Help
2004-06-15
2004-06-21
  • Rodney S. Foley

    Rodney S. Foley - 2004-06-15

    Well, it seems like there is a line length limitation, most likely in the Lexer.

    I pass HTML pages around in my application as a single-line CharSequence. Usually it is around 50,000 characters, and that is no problem. However, I am now getting a page that, on one line, is 111,894 characters.
    The parser seems to be parsing it selectively: it never returns any links, and when I step through with a debugger, the filter.accept method is passed sometimes only 10 nodes and sometimes around 100. The nodes passed to the accept method are only the 3 high-level kinds (remarks, strings, and tags), nothing more detailed than that.

    I can work around this by adding new lines to the html as I read the lines in to a StringBuffer, so that it is formatted as it is received from the web server.  So it is not a big issue with me, just thought I would point it out in case it is a bug, and not by design.
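The workaround described above could look roughly like the following. This is only a sketch under assumptions: the class and method names are made up for illustration, it uses StringBuilder where the post mentions StringBuffer, and it normalizes every line terminator to '\n' rather than reproducing the server's exact bytes.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

// Hypothetical helper, not part of HTML Parser.
final class LineBreakPreservingReader {
    private LineBreakPreservingReader() {}

    /** Reads all text from the source, putting a '\n' back after every line. */
    static String readKeepingLineBreaks(Reader source) {
        StringBuilder page = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // readLine() strips the terminator, so re-add one here;
                // without this append the page collapses into a single line.
                page.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return page.toString();
    }
}
```

The resulting String can then be handed to the parser in place of the single-line version.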

    However, since the Lexer takes a String, it should be able to handle any String that is legal HTML. HTML doesn't require line breaks, so a single-line HTML page is legal, and such a line can get very long.

    -Rodney

     
    • Derrick Oswald

      Derrick Oswald - 2004-06-16

      The only things I can think of are:

      1) Double slash comments in script code. Handling for these was added recently to avoid incorrect parsing of apostrophes in comments; see bug #919738 "Text has not been extracted correctly using StringBean" and bug #936392 "ScriptTag visitor fails for comments with '" (a duplicate of the above).
      It depends on how it got to be one long line: were you removing newline characters?

      2) Encoding problem. This would apply to non-ASCII character sets, where reading in a character using the wrong encoding might miss an angle bracket, causing a bad parse. See bug #973137 "Double-bytes characters are messed after parsing". It depends on how you get characters into the Parser.
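As a rough illustration of the encoding point, the sketch below decodes the same bytes with two different charsets using only the Java standard library. The class and method names are hypothetical, and the ASCII-versus-UTF-16 mismatch is chosen because it is easy to reproduce, not because it is known to be what happened in this case.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of HTML Parser.
final class EncodingMismatchDemo {
    private EncodingMismatchDemo() {}

    /** Decodes raw bytes with the given charset. */
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    /** True if the text still contains a literal '<' or '>' character. */
    static boolean containsAngleBracket(String text) {
        return text.indexOf('<') >= 0 || text.indexOf('>') >= 0;
    }

    /** Compares a correct decode with a deliberately wrong one. */
    static boolean bracketsSurviveWrongDecode() {
        byte[] raw = "<a href>".getBytes(StandardCharsets.US_ASCII);
        // Decoding as UTF-16BE pairs up the bytes, so '<' (0x3C) is merged
        // with the following byte into a single unrelated character.
        String wrong = decode(raw, StandardCharsets.UTF_16BE);
        return containsAngleBracket(wrong);
    }
}
```

Once the brackets are gone from the decoded text, a lexer has no way to recognize the tag, which is why the way characters reach the Parser matters.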

       
    • Rodney S. Foley

      Rodney S. Foley - 2004-06-16

      Hmm... maybe I didn't explain this right, or I don't understand your response.

      The same HTML page works fine if it is formatted with all the normal line termination expected when viewing HTML source.

      It only fails when the entire source is on a single line (without any line termination) and is over a certain length. I tested it at 65,554 just to see if it fits this magic number, but I tested it just before this number at 65,600 and it works fine... so it starts failing somewhere between 65,000 and 111,000 characters.
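One way to narrow that range is a binary search over page length. The sketch below is an assumption-laden outline: the class, the generated page layout, and the parsesCorrectly predicate are all hypothetical, the real predicate would have to run the parser on a generated page and check the links returned, and the search only makes sense if failures are monotonic in length.

```java
import java.util.function.IntPredicate;

// Hypothetical helper, not part of HTML Parser.
final class LengthThresholdFinder {
    private LengthThresholdFinder() {}

    private static final String FOOTER = "</body></html>";

    /** Builds a single-line page of links that is at least minLength characters long. */
    static String singleLinePage(int minLength) {
        StringBuilder html = new StringBuilder("<html><body>");
        int i = 0;
        while (html.length() + FOOTER.length() < minLength) {
            html.append("<a href=\"page").append(i).append(".html\">link ")
                .append(i).append("</a>");
            i++;
        }
        return html.append(FOOTER).toString();
    }

    /**
     * Returns the smallest length in (knownGood, knownBad] that fails,
     * assuming knownGood passes, knownBad fails, and failure is monotonic.
     */
    static int firstFailingLength(int knownGood, int knownBad, IntPredicate parsesCorrectly) {
        int good = knownGood;
        int bad = knownBad;
        while (bad - good > 1) {
            int mid = good + (bad - good) / 2;
            if (parsesCorrectly.test(mid)) {
                good = mid;   // still fine at this length, look higher
            } else {
                bad = mid;    // already broken at this length, look lower
            }
        }
        return bad;
    }
}
```

A caller would pass the known good and known bad lengths from the tests above, with a predicate that parses singleLinePage(length) and compares the number of links found against the number generated. Because generated pages are only approximately the requested length, the result is an estimate of the threshold rather than an exact character count.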

      I have changed the application to keep the line terminations the same as the original HTML received from a web server.

      However, there is nothing stopping a web server from sending the HTML as a single line, since it is still legal HTML and web browsers will display it fine.

      Anyway, I wouldn't think this would be a high priority issue, more of an FYI, since it doesn't seem to be causing problems for anyone other than me, because it is such an ODD way to handle HTML.

      Thanks...

      -Rodney

       
    • Rodney S. Foley

      Rodney S. Foley - 2004-06-16

      Derrick,

      Almost forgot... I wrote a simple test app for this problem that uses two HTML files that are the same, except that one has line breaks and the other is a single line. The one with line breaks works; the single-line one doesn't.

      If you want, I can upload this someplace for you if you think it will help. Let me know; I can do it from work tomorrow (Wednesday).
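A pair of inputs like the one described can be derived from a single source string. The following is a minimal sketch with hypothetical names, and it is not the test app from this post. Note that stripping line breaks can itself change how script code is read, as raised earlier in the thread for double slash comments, so a page containing such comments may not be a fair comparison.

```java
// Hypothetical helper, not part of HTML Parser.
final class SingleLineVariant {
    private SingleLineVariant() {}

    /** Returns the same markup with every CR and LF character removed. */
    static String withoutLineBreaks(String html) {
        return html.replace("\r", "").replace("\n", "");
    }

    /** Length of the longest line, useful for confirming what the parser will see. */
    static int longestLineLength(String html) {
        int longest = 0;
        // "\\R" matches any line terminator (\r\n, \n, \r, and others).
        for (String line : html.split("\\R")) {
            longest = Math.max(longest, line.length());
        }
        return longest;
    }
}
```

Feeding both the original string and withoutLineBreaks(original) through the same parsing code, then comparing the extracted links, gives a self-contained comparison.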

      Sorry -- I should have uploaded this and provided it with my original message, I just thought of it tonight.

      -Rodney

       
      • Derrick Oswald

        Derrick Oswald - 2004-06-16

        A testcase is the best way.
        Create a bug report and attach the files to it.

         
        • Rodney S. Foley

          Rodney S. Foley - 2004-06-21

          I haven't forgotten; I have just been very busy with a project. I plan to make a self-contained test case and submit it with a bug report as soon as I can. THX

           
