Re: [jflex-users] Possible bug?
From: William F. <bil...@gm...> - 2016-05-12 17:16:44
Hi Gerwin,

After looking at the very readable JFlex code, I could see that the problem was not with JFlex. The simple case below shows the root cause of the problem.

Is there any reasonable explanation for why many of the ASCII control characters are considered to be Java letters or digits? My gut tells me this is not what the Java designers had in mind. Maybe this is a question for Oracle?

Bill Fenlason

--------------------------------------------------------------------------------------------------

public class Main {

    // Print, for each of the first 64 characters, whether it may be part of a Java identifier.
    public static void main(String[] args) {
        char c;
        for (int i = 0; i < 64; i += 1) {
            c = (char) i;
            boolean b = Character.isJavaIdentifierPart(c);
            System.out.println("" + i + " (x" + x(i) + ") "
                    + (i > 31 ? c : " ") + (i < 10 ? " " : "")
                    + " is java identifier part: " + b);
        }
    }

    // Format i in hexadecimal, two digits per byte.
    static String x(int i) {
        String s = "0123456789ABCDEF";
        if (i < 256) return "" + s.charAt(i / 16) + s.charAt(i & 15);
        return x(i / 256) + x(i & 255);
    }
}

/* --- results ----
0 (x00)    is java identifier part: true
1 (x01)    is java identifier part: true
2 (x02)    is java identifier part: true
3 (x03)    is java identifier part: true
4 (x04)    is java identifier part: true
5 (x05)    is java identifier part: true
6 (x06)    is java identifier part: true
7 (x07)    is java identifier part: true
8 (x08)    is java identifier part: true
9 (x09)    is java identifier part: false
10 (x0A)   is java identifier part: false
11 (x0B)   is java identifier part: false
12 (x0C)   is java identifier part: false
13 (x0D)   is java identifier part: false
14 (x0E)   is java identifier part: true
15 (x0F)   is java identifier part: true
16 (x10)   is java identifier part: true
17 (x11)   is java identifier part: true
18 (x12)   is java identifier part: true
19 (x13)   is java identifier part: true
20 (x14)   is java identifier part: true
21 (x15)   is java identifier part: true
22 (x16)   is java identifier part: true
23 (x17)   is java identifier part: true
24 (x18)   is java identifier part: true
25 (x19)   is java identifier part: true
26 (x1A)   is java identifier part: true
27 (x1B)   is java identifier part: true
28 (x1C)   is java identifier part: false
29 (x1D)   is java identifier part: false
30 (x1E)   is java identifier part: false
31 (x1F)   is java identifier part: false
32 (x20)   is java identifier part: false
33 (x21) ! is java identifier part: false
34 (x22) " is java identifier part: false
35 (x23) # is java identifier part: false
36 (x24) $ is java identifier part: true
37 (x25) % is java identifier part: false
38 (x26) & is java identifier part: false
39 (x27) ' is java identifier part: false
40 (x28) ( is java identifier part: false
41 (x29) ) is java identifier part: false
42 (x2A) * is java identifier part: false
43 (x2B) + is java identifier part: false
44 (x2C) , is java identifier part: false
45 (x2D) - is java identifier part: false
46 (x2E) . is java identifier part: false
47 (x2F) / is java identifier part: false
48 (x30) 0 is java identifier part: true
49 (x31) 1 is java identifier part: true
50 (x32) 2 is java identifier part: true
51 (x33) 3 is java identifier part: true
52 (x34) 4 is java identifier part: true
53 (x35) 5 is java identifier part: true
54 (x36) 6 is java identifier part: true
55 (x37) 7 is java identifier part: true
56 (x38) 8 is java identifier part: true
57 (x39) 9 is java identifier part: true
58 (x3A) : is java identifier part: false
59 (x3B) ; is java identifier part: false
60 (x3C) < is java identifier part: false
61 (x3D) = is java identifier part: false
62 (x3E) > is java identifier part: false
63 (x3F) ? is java identifier part: false
*/

On Thu, May 12, 2016 at 9:24 AM, Gerwin Klein <Ger...@ni...> wrote:

> Sorry, I did receive it but got bogged down in other work and haven’t had
> a chance to look at it yet. Should have at least let you know.
>
> I should be able to look at it this weekend.
>
> Cheers,
> Gerwin
>
> On 12.05.2016, at 23:03, William Fenlason <bil...@gm...> wrote:
>
> Gerwin,
>
> Could you help me understand the status of this?
>
> At the end of April I sent you a small test case (6 files, including
> grammar, test driver, etc.) which I think demonstrates this problem. Since
> I haven't heard back and because I sent it off list, I'm wondering if you
> received it, or if it somehow ended up in a spam folder? Or is the
> situation that you have not been able to devote any time to this?
>
> I used a string reader to avoid any encoding issues, and added a test to
> ensure that the string reader was delivering the control characters as
> expected. My initial conclusion is that the processing of jletterdigit
> possibly has a flaw in which a subset of the ASCII control characters are
> included. I haven't tried to confirm the situation in the JFlex source
> yet. No doubt you would be much more efficient than I in figuring this
> out, but I'll give it a try as time permits.
>
> Best,
>
> Bill Fenlason
>
> On Thu, Apr 28, 2016 at 7:22 AM, Gerwin Klein <Ger...@ni...> wrote:
>
>> Hi William,
>>
>> this does sound like it could be a bug, yes.
>>
>> Do you have a small test spec and input with expected output? I’d like to
>> try to reproduce across different versions, maybe I can see what is going
>> on.
>>
>> A common pitfall with such characters is the encoding, both of the spec
>> file for JFlex and the input file to the compiled scanner. If you’re using
>> the unicode escape sequences, the former shouldn’t matter, but the latter
>> still might.
>>
>> Cheers,
>> Gerwin
>>
>> On 26 Apr 2016, at 13:29, William Fenlason <bil...@gm...> wrote:
>>
>> RL1.1 Hex Notation
>>
>> *To meet this requirement, an implementation shall supply a mechanism for
>> specifying any Unicode code point (from U+0000 to U+10FFFF), using the
>> hexadecimal code point representation.*
>>
>> JFlex conforms.
>> Syntax is provided to express values across the whole
>> range, via \uXXXX, where XXXX is a 4-digit hex value; \Uyyyyyy, where
>> yyyyyy is a 6-digit hex value; and \u{X+( X+)*}, where X+ is a 1-6 digit
>> hex value.
>>
>> -------------------------------------------------------------------------------------------------
>>
>> If I understand it correctly, the above (taken from the JFlex User
>> Manual) implies that all hex characters from \U0000 through \U10FFFF may be
>> used in a lexical specification. I don't think that is the case, and this
>> is why.
>>
>> As we know, <<EOF>> cannot be used for lookahead processing. It has
>> been suggested here that one way to simulate it is to append a unique
>> character to the end of the file, use it for lookahead, and then discard
>> it. That approach was adopted.
>>
>> We developed an extension of java.io.Reader which allows any specified
>> character to be transparently appended to the end of the file (Eclipse
>> document, actually), and also a substitute character to be returned in case
>> the specified character occurs in the file.
>>
>> It seemed that a reasonable choice for an EOF character was to use one of
>> the ASCII control characters from \x00 through \x1F, avoiding the commonly
>> used ones like \x00 and \x07 through \x0D. Initially, ETX (\x03) and EOT
>> (\x04) appeared to be good alternatives.
>>
>> Initial testing did not bear this out: in a test case, two versions of
>> JFlex (1.4.3 and 1.6.1) appended these characters to other tokens rather
>> than recognizing them as separate tokens. Additional testing convinced us
>> that of the reasonable control character choices, only File Separator (FS,
>> \x1C) and Group Separator (GS, \x1D) work as expected.
>>
>> Why should some control characters work, and others not work? My
>> suspicion is that somewhere in the JFlex code there are specific character
>> dependencies in the ASCII control character range.
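
Note on the question in the quoted paragraph above: the pattern described there (ETX and EOT glued onto other tokens, FS and GS recognized separately) is consistent with the JDK's documented definition of Character.isJavaIdentifierPart. That method also returns true for every character for which Character.isIdentifierIgnorable is true, and the ignorable set includes the non-whitespace ISO control characters \u0000 through \u0008 and \u000E through \u001B. The JFlex manual defines the predefined class jletterdigit in terms of isJavaIdentifierPart. The following is only a minimal sketch, assuming that a rule built on jletterdigit is what absorbs the sentinel; the class name SentinelCandidates and the helper safeControlChars are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; not part of JFlex or of the posted test case.
class SentinelCandidates {

    // Collect the control characters in \x00..\x1F that are NOT Java
    // identifier parts, i.e. that a jletterdigit-based rule would not absorb.
    static List<Integer> safeControlChars() {
        List<Integer> safe = new ArrayList<>();
        for (int cp = 0; cp <= 0x1F; cp++) {
            if (!Character.isJavaIdentifierPart(cp)) {
                safe.add(cp);
            }
        }
        return safe;
    }

    public static void main(String[] args) {
        // Show both properties side by side for each control character.
        for (int cp = 0; cp <= 0x1F; cp++) {
            System.out.printf("\\x%02X ignorable=%b identifierPart=%b%n", cp,
                    Character.isIdentifierIgnorable(cp),
                    Character.isJavaIdentifierPart(cp));
        }
        // Prints [9, 10, 11, 12, 13, 28, 29, 30, 31]
        System.out.println(safeControlChars());
    }
}
```

Of the characters this leaves, \x09 through \x0D are the common whitespace controls, so \x1C through \x1F are the ones that remain as sentinel candidates under this assumption.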
>>
>> I believe that this is a bug, either in the code or in the above
>> documentation, and is contrary to the idea that any hex character may be
>> used in a specification.
>>
>> Am I misreading this documentation? Do others agree that this is a bug
>> to be fixed?
>>
>> I've downloaded the JFlex source and am willing to look for the cause,
>> but I have no idea where to start exploring. Does anyone have suggestions?
>>
>> Obviously \x1C as the EOF character is a pragmatic solution "because it
>> works", but that seems a bit of a kludge.
>>
>> Bill Fenlason
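
The java.io.Reader extension mentioned in the original message was not posted to the list, so the following is only a rough sketch of the idea as described there: append one sentinel character after the last character of the input, and replace any occurrence of the sentinel inside the input with a substitute. The class name SentinelReader, its constructor parameters, and the readAll helper are all hypothetical.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical wrapper; a sketch of the approach, not the code used in the thread.
class SentinelReader extends Reader {
    private final Reader in;
    private final char sentinel;    // appended once at end of input
    private final char substitute;  // returned if the sentinel occurs in the input
    private boolean sentinelSent = false;

    SentinelReader(Reader in, char sentinel, char substitute) {
        this.in = in;
        this.sentinel = sentinel;
        this.substitute = substitute;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) return 0;
        if (sentinelSent) return -1;       // real end of stream
        int n = in.read(buf, off, len);
        if (n == -1) {                     // underlying input exhausted:
            buf[off] = sentinel;           // deliver the sentinel exactly once
            sentinelSent = true;
            return 1;
        }
        for (int i = off; i < off + n; i++) {
            if (buf[i] == sentinel) buf[i] = substitute;
        }
        return n;
    }

    @Override
    public void close() throws IOException {
        in.close();
    }

    // Small helper for demonstration: drain a reader into a String.
    static String readAll(Reader r) {
        try {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[16];
            int n;
            while ((n = r.read(buf, 0, buf.length)) != -1) sb.append(buf, 0, n);
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        char fs = (char) 0x1C; // File Separator, the sentinel chosen in the thread
        String out = readAll(new SentinelReader(new StringReader("ab" + fs + "cd"), fs, '?'));
        // The embedded FS is replaced and a single FS is appended at the end.
        System.out.println(out.equals("ab?cd" + fs)); // prints true
    }
}
```

In a scanner specification, a sentinel delivered this way can only be matched as its own token if no other rule's character class contains it, which is why the choice of \x1C matters here.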