Re: [jflex-users] Possible bug?
From: Gerwin K. <Ger...@ni...> - 2016-05-12 13:24:36
Sorry, I did receive it but got bogged down in other work and haven’t had a chance to look at it yet. Should have at least let you know. I should be able to look at it this weekend.

Cheers,
Gerwin

On 12.05.2016, at 23:03, William Fenlason <bil...@gm...> wrote:

Gerwin,

Could you help me understand the status of this?

At the end of April I sent you a small test case (6 files, including grammar, test driver, etc.) which I think demonstrates this problem. Since I haven't heard back and because I sent it off list, I'm wondering whether you received it, or whether it somehow ended up in a spam folder. Or is the situation that you have not been able to devote any time to this?

I used a string reader to avoid any encoding issues, and added a test to ensure that the string reader was delivering the control characters as expected.

My initial conclusion is that the processing of jletterdigit possibly has a flaw in which a subset of the ASCII control characters are included. I haven't tried to confirm the situation in the JFlex source yet. No doubt you would be much more efficient than I in figuring this out, but I'll give it a try as time permits.

Best,
Bill Fenlason

On Thu, Apr 28, 2016 at 7:22 AM, Gerwin Klein <Ger...@ni...> wrote:

Hi William,

this does sound like it could be a bug, yes. Do you have a small test spec and input with expected output? I’d like to try to reproduce across different versions; maybe I can see what is going on.

A common pitfall with such characters is the encoding, both of the spec file for JFlex and the input file to the compiled scanner. If you’re using the Unicode escape sequences, the former shouldn’t matter, but the latter still might.
Cheers,
Gerwin

On 26 Apr 2016, at 13:29, William Fenlason <bil...@gm...> wrote:

RL1.1 Hex Notation

To meet this requirement, an implementation shall supply a mechanism for specifying any Unicode code point (from U+0000 to U+10FFFF), using the hexadecimal code point representation.

JFlex conforms. Syntax is provided to express values across the whole range, via \uXXXX, where XXXX is a 4-digit hex value; \Uyyyyyy, where yyyyyy is a 6-digit hex value; and \u{X+( X+)*}, where X+ is a 1-6 digit hex value.

-------------------------------------------------------------------------------------------------

If I understand it correctly, the above (taken from the JFlex User Manual) implies that all hex characters from \U0000 through \U10FFFF may be used in a lexical specification. I don't think that is the case, and this is why.

As we know, <<EOF>> cannot be used for lookahead processing. It has been suggested here that one way to simulate it is to append a unique character to the end of the file, use it for lookahead, and then discard it.

That approach was adopted. We developed an extension of java.io.Reader which allows any specified character to be transparently appended to the end of the file (Eclipse document, actually), and also a substitute character to be returned in case the specified character occurs in the file.

It seemed that a reasonable choice for an EOF character was one of the ASCII control characters from \x00 through \x1F, avoiding the commonly used ones like \x00 and \x07 through \x0D. Initially, ETX (\x03) and EOT (\x04) appeared to be good alternatives.

Initial testing did not bear this out: in a test case, two versions of JFlex (1.4.3 and 1.6.1) appended these characters to other tokens rather than recognizing them as separate tokens.

Additional testing convinced us that, of the reasonable control character choices, only File Separator (FS, \x1C) and Group Separator (GS, \x1D) work as expected.
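The Reader extension described above is not included in the thread. A minimal sketch of the idea follows, under stated assumptions: the class name SentinelReader, its constructor parameters, and the readAll helper are all hypothetical, and the real implementation wraps an Eclipse document rather than an arbitrary Reader.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

/**
 * Hypothetical sketch: delivers the wrapped reader's characters, replaces any
 * occurrence of the sentinel in the input with a substitute, and appends the
 * sentinel exactly once after the wrapped reader is exhausted.
 */
final class SentinelReader extends Reader {
    private final Reader in;
    private final char sentinel;
    private final char substitute;
    private boolean sentinelSent = false;

    SentinelReader(Reader in, char sentinel, char substitute) {
        this.in = in;
        this.sentinel = sentinel;
        this.substitute = substitute;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) return 0;
        if (sentinelSent) return -1;          // real end of stream
        int n = in.read(buf, off, len);
        if (n == -1) {                        // underlying input is done:
            buf[off] = sentinel;              // emit the sentinel once
            sentinelSent = true;
            return 1;
        }
        for (int i = off; i < off + n; i++) { // hide sentinels that occur in the input
            if (buf[i] == sentinel) buf[i] = substitute;
        }
        return n;
    }

    @Override
    public void close() throws IOException {
        in.close();
    }

    /** Helper for the demo: drains a reader into a String. */
    static String readAll(Reader r) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[16];
        int n;
        while ((n = r.read(buf, 0, buf.length)) != -1) sb.append(buf, 0, n);
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // FS (\x1C) as the sentinel, '?' as the substitute (both arbitrary choices).
        Reader r = new SentinelReader(new StringReader("ab\u001Cc"), '\u001C', '?');
        for (char c : readAll(r).toCharArray()) System.out.printf("%02X ", (int) c);
        // prints: 61 62 3F 63 1C
    }
}
```

A generated scanner would be constructed over such a reader in place of the original one, so that a rule can match the sentinel where <<EOF>> would otherwise be needed.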
Why should some control characters work, and others not work? My suspicion is that somewhere in the JFlex code there are specific character dependencies in the ASCII control character range.

I believe that this is a bug, either in the code or in the above documentation, and is contrary to the idea that any hex character may be used in a specification.

Am I misreading this documentation? Do others agree that this is a bug to be fixed?

I've downloaded the JFlex source and am willing to look for the cause, but I have no idea where to start exploring. Does anyone have suggestions?

Obviously \x1C as the EOF character is a pragmatic solution "because it works", but that seems a bit of a kludge.

Bill Fenlason
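The jletterdigit suspicion raised in the 12 May message can be checked against the JDK without running JFlex. The JFlex manual defines jletterdigit as the characters for which Character.isJavaIdentifierPart returns true, and the Javadoc for that method includes "identifier ignorable" characters, which cover the ISO control characters \u0000 through \u0008 and \u000E through \u001B. The sketch below lists which code points in \x00 through \x1F the running JDK treats as identifier parts. The class and method names are hypothetical, and whether the test spec in this thread actually uses jletterdigit in the affected rules is an assumption, since the spec is not shown.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: which ASCII control characters count as Java identifier parts? */
final class ControlCharCheck {

    /** Returns the code points in [0x00, 0x1F] for which Character.isJavaIdentifierPart is true. */
    static List<Integer> identifierPartControls() {
        List<Integer> hits = new ArrayList<>();
        for (int cp = 0x00; cp <= 0x1F; cp++) {
            if (Character.isJavaIdentifierPart(cp)) {
                hits.add(cp);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // One line per control character that a jletterdigit-style class would include.
        for (int cp : identifierPartControls()) {
            System.out.printf("\\x%02X%n", cp);
        }
    }
}
```

If ETX (\x03) and EOT (\x04) appear in the output while FS (\x1C) and GS (\x1D) do not, that would be consistent with the behaviour reported above: a rule built on jletterdigit would absorb the former into the preceding token and leave the latter as separate tokens. That would point to documented JDK behaviour rather than a defect in JFlex, though the thread itself does not confirm this.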