Re: [jflex-users] Possible bug?

Hi Gerwin,

After looking at the very readable JFlex code, I could see that the problem
was not with JFlex.

The simple case below shows the root cause of the problem.

Is there any reasonable explanation for why many of the ASCII control
characters are considered to be Java letters or digits?

My gut tells me this is not what the Java designers had in mind.  Maybe
this is a question for Oracle?

Bill Fenlason

--------------------------------------------------------------------------------------------------

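public class Main {
   public static void main(String[] args) {
      // Check the first 64 ASCII code points (0x00 through 0x3F).
      for (int i = 0; i < 64; i += 1) {
         char c = (char) i;
         boolean b = Character.isJavaIdentifierPart(c);

         // Print control characters (below 0x20) as a blank, and pad
         // single-digit values so that the columns line up.
         System.out.println("" + i + " (x" + x(i) + ") " + (i > 31 ? c : " ")
               + (i < 10 ? " " : "")
               + " is java identifier part: " + b);
      }
   }

   // Format i as upper-case hex, two digits per byte.
   static String x(int i) {
      String s = "0123456789ABCDEF";
      if (i < 256)
         return "" + s.charAt(i / 16) + s.charAt(i & 15);
      return x(i / 256) + x(i & 255);
   }
}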

/* --- results ----

0 (x00)    is java identifier part: true
1 (x01)    is java identifier part: true
2 (x02)    is java identifier part: true
3 (x03)    is java identifier part: true
4 (x04)    is java identifier part: true
5 (x05)    is java identifier part: true
6 (x06)    is java identifier part: true
7 (x07)    is java identifier part: true
8 (x08)    is java identifier part: true
9 (x09)    is java identifier part: false
10 (x0A)   is java identifier part: false
11 (x0B)   is java identifier part: false
12 (x0C)   is java identifier part: false
13 (x0D)   is java identifier part: false
14 (x0E)   is java identifier part: true
15 (x0F)   is java identifier part: true
16 (x10)   is java identifier part: true
17 (x11)   is java identifier part: true
18 (x12)   is java identifier part: true
19 (x13)   is java identifier part: true
20 (x14)   is java identifier part: true
21 (x15)   is java identifier part: true
22 (x16)   is java identifier part: true
23 (x17)   is java identifier part: true
24 (x18)   is java identifier part: true
25 (x19)   is java identifier part: true
26 (x1A)   is java identifier part: true
27 (x1B)   is java identifier part: true
28 (x1C)   is java identifier part: false
29 (x1D)   is java identifier part: false
30 (x1E)   is java identifier part: false
31 (x1F)   is java identifier part: false
32 (x20)   is java identifier part: false
33 (x21) ! is java identifier part: false
34 (x22) " is java identifier part: false
35 (x23) # is java identifier part: false
36 (x24) $ is java identifier part: true
37 (x25) % is java identifier part: false
38 (x26) & is java identifier part: false
39 (x27) ' is java identifier part: false
40 (x28) ( is java identifier part: false
41 (x29) ) is java identifier part: false
42 (x2A) * is java identifier part: false
43 (x2B) + is java identifier part: false
44 (x2C) , is java identifier part: false
45 (x2D) - is java identifier part: false
46 (x2E) . is java identifier part: false
47 (x2F) / is java identifier part: false
48 (x30) 0 is java identifier part: true
49 (x31) 1 is java identifier part: true
50 (x32) 2 is java identifier part: true
51 (x33) 3 is java identifier part: true
52 (x34) 4 is java identifier part: true
53 (x35) 5 is java identifier part: true
54 (x36) 6 is java identifier part: true
55 (x37) 7 is java identifier part: true
56 (x38) 8 is java identifier part: true
57 (x39) 9 is java identifier part: true
58 (x3A) : is java identifier part: false
59 (x3B) ; is java identifier part: false
60 (x3C) < is java identifier part: false
61 (x3D) = is java identifier part: false
62 (x3E) > is java identifier part: false
63 (x3F) ? is java identifier part: false

*/
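
A possible explanation for the pattern above: the Javadoc for
Character.isJavaIdentifierPart lists "isIdentifierIgnorable returns true"
as one of its accepting conditions, and Character.isIdentifierIgnorable
covers the ISO control characters that are not whitespace (x00 through x08,
x0E through x1B, x7F through x9F). The sketch below compares the two methods
over the control range. The class name IgnorableCheck and the helper
isVisibleIdentifierPart are made up for illustration; they are not part of
the JDK or of JFlex.

```java
final class IgnorableCheck {

    // Hypothetical helper: an identifier-part test that rejects the
    // "ignorable" characters (non-whitespace ISO controls and FORMAT chars).
    static boolean isVisibleIdentifierPart(char c) {
        return Character.isJavaIdentifierPart(c)
                && !Character.isIdentifierIgnorable(c);
    }

    public static void main(String[] args) {
        // For each ASCII control character, show both JDK answers side by side.
        for (int i = 0; i < 32; i++) {
            char c = (char) i;
            System.out.printf("%2d (x%02X) part=%b ignorable=%b visible=%b%n",
                    i, i,
                    Character.isJavaIdentifierPart(c),
                    Character.isIdentifierIgnorable(c),
                    isVisibleIdentifierPart(c));
        }
    }
}
```

If that reading is right, the "part" and "ignorable" columns should agree for
every control character, and a scanner that wants ETX or EOT as a separate
token would need a stricter test along the lines of the helper above.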

On Thu, May 12, 2016 at 9:24 AM, Gerwin Klein <Ger...@ni...>
wrote:

> Sorry, I did receive it but got bogged down in other work and haven’t had
> a chance to look at it yet. Should have at least let you know.
>
> I should be able to look at it this weekend.
>
> Cheers,
> Gerwin
>
>
>
> On 12.05.2016, at 23:03, William Fenlason <bil...@gm...> wrote:
>
> Gerwin,
>
> Could you help me understand the status of this?
>
> At the end of April I sent you a small test case (6 files, including
> grammar, test driver, etc.) which I think demonstrates this problem.  Since
> I haven't heard back and because I sent it off list, I'm wondering if you
> received it, or if it somehow ended up in a spam folder?  Or is the
> situation that you have not been able to devote any time to this?
>
> I used a string reader to avoid any encoding issues, and added a test to
> ensure that the string reader was delivering the control characters as
> expected.  My initial conclusion is that the processing of jletterdigit
> possibly has a flaw in which a subset of the ASCII control characters are
> included.  I haven't tried to confirm the situation in the JFlex source
> yet.  No doubt you would be much more efficient than I in figuring this
> out, but I'll give it a try as time permits.
>
> Best,
>
> Bill Fenlason
>
> On Thu, Apr 28, 2016 at 7:22 AM, Gerwin Klein <Ger...@ni...>
> wrote:
>
>> Hi William,
>>
>> this does sound like it could be a bug, yes.
>>
>> Do you have a small test spec and input with expected output? I’d like to
>> try to reproduce across different versions; maybe I can see what is going
>> on.
>>
>> A common pitfall with such characters is the encoding, both of the spec
>> file for JFlex and the input file to the compiled scanner. If you’re using
>> the unicode escape sequences, the former shouldn’t matter, but the latter
>> still might.
>>
>> Cheers,
>> Gerwin
>>
>> On 26 Apr 2016, at 13:29, William Fenlason <bil...@gm...>
>> wrote:
>>
>> RL1.1 Hex Notation
>>
>> *To meet this requirement, an implementation shall supply a mechanism for
>> specifying any Unicode code point (from U+0000 to U+10FFFF), using the
>> hexadecimal code point representation.*
>>
>> JFlex conforms. Syntax is provided to express values across the whole
>> range, via \uXXXX, where XXXX is a 4-digit hex value; \Uyyyyyy, where
>> yyyyyy is a 6-digit hex value; and \u{X+( X+)*}, where X+ is a 1-6 digit
>> hex value.
>>
>>
>> -------------------------------------------------------------------------------------------------
>>
>> If I understand it correctly, the above (taken from the JFlex User
>> Manual) implies that all hex characters from \U0000 through \U10FFFF may be
>> used in a lexical specification.  I don't think that is the case, and this
>> is why.
>>
>> As we know, <<EOF>> cannot be used for look ahead processing.  It has
>> been suggested here that one way to simulate it is to append a unique
>> character to the end of the file, use it for look ahead, and then discard
>> it.  That approach was adopted.
>>
>> We developed an extension of java.io.Reader which allows any specified
>> character to be transparently appended to the end of the file (Eclipse
>> document, actually), and also a substitute character to be returned in case
>> the specified character occurs in the file.
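
A minimal sketch of a Reader wrapper of the kind described above, for
illustration only. This is not the implementation referred to in the message:
the names SentinelReader and readAll, and the use of U+FFFD as the substitute
in the demo, are assumptions.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

final class SentinelReader extends Reader {
    private final Reader in;
    private final char sentinel;    // appended once, after the real input ends
    private final char substitute;  // returned if the sentinel occurs in the input
    private boolean sentinelSent = false;

    SentinelReader(Reader in, char sentinel, char substitute) {
        this.in = in;
        this.sentinel = sentinel;
        this.substitute = substitute;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) return 0;
        if (sentinelSent) return -1;          // sentinel already delivered: real EOF
        int n = in.read(buf, off, len);
        if (n == -1) {                        // underlying input exhausted
            buf[off] = sentinel;
            sentinelSent = true;
            return 1;
        }
        for (int i = off; i < off + n; i++) { // hide sentinels that occur in the data
            if (buf[i] == sentinel) buf[i] = substitute;
        }
        return n;
    }

    @Override
    public void close() throws IOException {
        in.close();
    }

    // Demo helper: wrap a string, drain the reader, and return what it produced.
    static String readAll(String text, char sentinel, char substitute) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[16];
        try (Reader r = new SentinelReader(new StringReader(text), sentinel, substitute)) {
            int n;
            while ((n = r.read(buf, 0, buf.length)) != -1) sb.append(buf, 0, n);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String out = readAll("ab\u001Ccd", '\u001C', '\uFFFD');
        System.out.println(out.length() + " chars, last is x"
                + Integer.toHexString(out.charAt(out.length() - 1)));
    }
}
```

A scanner built on such a reader sees the sentinel exactly once, as the last
character, so a lexical rule can use it as lookahead in place of <<EOF>>.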
>>
>> It seemed that a reasonable choice for an EOF character was to use one of
>> the ASCII control characters from \x00 thru \x1F, avoiding the commonly
>> used ones like \x00 and \x07 thru \x0D.  Initially, ETX (\x03) and EOT
>> (\x04) appeared to be good alternatives.
>>
>> Initial testing did not bear this out - in a test case, two versions of
>> JFlex (1.4.3 and 1.6.1) appended these characters to other tokens rather
>> than recognizing them as separate tokens.  Additional testing convinced us
>> that of the reasonable control character choices, only File Separator (FS -
>> \x1C) and Group Separator (GS - \x1D) work as expected.
>>
>> Why should some control characters work, and others not work?  My
>> suspicion is that somewhere in the JFlex code there are specific character
>> dependencies in the ASCII control character range.
>>
>> I believe that this is a bug, either in the code or in the above
>> documentation, and is contrary to the idea that any hex character may be
>> used in a specification.
>>
>> Am I misreading this documentation?  Do others agree that this is a bug
>> to be fixed?
>>
>> I've downloaded the JFlex source and am willing to look for the cause,
>> but I have no idea where to start exploring.  Does anyone have suggestions?
>>
>> Obviously \x1C as the EOF character is a pragmatic solution "because it
>> works", but that seems a bit of a kludge.
>>
>> Bill Fenlason
>>