jflex-users Mailing List for JFlex
The fast lexer generator for Java
Brought to you by:
lsf37,
steve_rowe
From: Gerwin K. <ge...@do...> - 2019-12-09 10:25:53
|
Just wanted to report on the list that the issue Pascal found and reported has been fixed in the development version, with the fix to be included in the upcoming 1.8.0 release.

The defect turned out to be in the code for removal of dead states after the computation of the negated automaton. In the scanner generation process, after an NFA is negated, there can be states from which no final state is reachable any more. These have to be removed from the automaton for the scanning engine to work correctly, and under specific circumstances that removal went wrong.

This bug triggers very rarely. One can determine whether a lexer spec was affected by comparing the number of DFA states before minimisation in JFlex 1.7.0 and in (the upcoming) JFlex 1.8.0 or the current development snapshot. If the numbers of states differ, the spec may have been affected by the bug; if they are equal, it was not.

Thanks again to Pascal for reporting this one, it was one of the more interesting bugs in JFlex in the past few years.

Cheers,
Gerwin |
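For readers unfamiliar with the term, the dead-state removal described above can be illustrated with a small Java sketch. This is not JFlex's actual code; the class name, the method name and the transition-table layout (`next[s][c]`, with `-1` meaning "no transition") are assumptions made for the example. A state is kept only if some final state is reachable from it, which is computed here by walking the transitions backwards from the final states.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

public class DeadStates {
    /**
     * Returns an array where live[s] is true iff some final state is
     * reachable from state s (final states count as reaching themselves).
     * next[s][c] is the successor of s on input class c, or -1 if none.
     */
    static boolean[] liveStates(int[][] next, boolean[] isFinal) {
        int n = next.length;

        // Build the reverse transition relation: pred.get(t) lists all s with s -> t.
        List<List<Integer>> pred = new ArrayList<>();
        for (int s = 0; s < n; s++) pred.add(new ArrayList<>());
        for (int s = 0; s < n; s++) {
            for (int t : next[s]) {
                if (t >= 0) pred.get(t).add(s);
            }
        }

        // Backward search starting from all final states.
        boolean[] live = new boolean[n];
        Deque<Integer> work = new ArrayDeque<>();
        for (int s = 0; s < n; s++) {
            if (isFinal[s]) { live[s] = true; work.push(s); }
        }
        while (!work.isEmpty()) {
            int t = work.pop();
            for (int s : pred.get(t)) {
                if (!live[s]) { live[s] = true; work.push(s); }
            }
        }
        return live;
    }

    public static void main(String[] args) {
        // State 0 goes to 1 (final) or 2; state 2 only loops on itself, so it is dead.
        int[][] next = { {1, 2}, {-1, -1}, {2, 2} };
        boolean[] isFinal = { false, true, false };
        System.out.println(Arrays.toString(liveStates(next, isFinal)));
        // prints [true, true, false]
    }
}
```

States marked `false` are the ones a generator would have to drop (together with the transitions into them) before building the scanning tables.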
From: Gerwin K. <ge...@do...> - 2019-10-24 04:00:07
|
> On 22 Oct 2019, at 08:05, Alan Eliasen <el...@mi...> wrote:
> To begin with, I don't understand what [^] is supposed to match. It looks like a negating character class, but with nothing to negate. This makes no sense, so obviously something else was intended. What was it?

In JFlex, [] matches nothing, and [^] is the character class that negates that, i.e. it matches any single input character. It’s a generalisation of “.”. See also the section “Semantics” on character classes on https://www.jflex.de/manual.html .

The operator ! negates entire expressions. Since Pascal is matching something of the form "r | !r", this should match literally everything (either r matches or it doesn’t), and the second line in his spec should therefore never get a chance to run (but for some reason it does for the input he sent).

Cheers,
Gerwin |
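To make the expected behaviour concrete, here is a small Java sketch that checks Pascal's input against an approximation of EXP using java.util.regex rather than JFlex. The class and method names are made up for the example, and `[^]` is approximated by `.` in DOTALL mode; java.util.regex has no `!` operator, so the complement is written as a plain boolean negation.

```java
import java.util.regex.Pattern;

public class NegationCheck {
    // Approximation of EXP = [^a] [^]* [^a].
    // DOTALL lets "." match every character, including line terminators.
    static final Pattern EXP = Pattern.compile("[^a].*[^a]", Pattern.DOTALL);

    /** True if the whole string is matched by EXP. */
    static boolean matchesExp(String s) {
        return EXP.matcher(s).matches();
    }

    /** True if the whole string is in the complement !EXP. */
    static boolean matchesNotExp(String s) {
        return !matchesExp(s);
    }

    public static void main(String[] args) {
        for (String s : new String[] { "baba", "bab", "b", "" }) {
            System.out.println("\"" + s + "\": EXP=" + matchesExp(s)
                    + " !EXP=" + matchesNotExp(s));
        }
    }
}
```

"baba" ends in "a", so it cannot match EXP and therefore belongs to !EXP. That is why {ALL} is expected to consume it and the `baba` rule should never fire.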
From: Gerwin K. <ge...@do...> - 2019-10-24 03:55:41
|
Hi Pascal,

I haven’t really gotten to the bottom of it yet, but it is some interaction between the presence of a negated character class and the negation operator.

If you need a work-around, changing the spec to the equivalent

EXP = [\u{0}-`b-\u{10FFFF}] [^]* [\u{0}-`b-\u{10FFFF}]

should make it work as expected (you can tell when jflex warns that the second action can never be matched).

Cheers,
Gerwin |
From: Alan E. <el...@mi...> - 2019-10-21 21:05:52
|
To begin with, I don't understand what [^] is supposed to match. It looks like a negating character class, but with nothing to negate. This makes no sense, so obviously something else was intended. What was it? |
From: Pascal H. <pas...@te...> - 2019-10-21 15:50:14
|
hello,
I found an issue with the negation operator "!"
With the following specification, string "baba" is not matched by either EXP or !EXP .

Pascal Hennequin

-------------------------------
%%
%standalone
%{
void ECHO(String cat) { System.out.print("["+cat+":"+yytext()+"]"); }
%}

EXP = ( [^a] [^]* [^a] )
ALL = {EXP} | ! {EXP}

%%
{ALL} { ECHO("1"); }
baba { ECHO("2"); }
--------------------------------- |
From: davide s. <sfo...@st...> - 2017-09-09 13:18:46
|
Hi there,

I'm having some problem with the generation of the .java file. After the DFA minimization, during the code writing an exception is thrown and the class content truncated:

java.lang.IllegalArgumentException: character value expected
    at jflex.PackEmitter.emitUC(PackEmitter.java:108)
    at jflex.CountEmitter.emit(CountEmitter.java:102)
    at jflex.Emitter.emitDynamicInit(Emitter.java:530)
    at jflex.Emitter.emit(Emitter.java:1431)
    at jflex.Main.generate(Main.java:112)
    at jflex.Main.generate(Main.java:394)
    at jflex.Main.main(Main.java:411)

Thanks for your help

--
Davide Sforza |
From: Hanns H. R. <co...@sc...> - 2017-05-19 23:39:29
|
hi there, has anyone implemented a .flex definition for Markdown yet? best, .h.h. |
From: Scott W. <sco...@gm...> - 2017-02-15 14:00:31
|
The string literal token reg exp for my language is:

STRING_LITERAL='([^'\\\n]|\\.)*'

For the consumer of this lexer (IntelliJ IDEA custom language plugin), I also need to have a token that represents an unterminated string literal (technically they could be the same token). For the most part, the following works:

UNTERMINATED_STRING_LITERAL='([^'\\\n]|\\.)*['\n]

However, when the entire document is, for example, the following:

String str = 'foo.bar<eof>

it's not recognized. Is there some way to include the notion of end-of-file in the token like I'm able to include the notion of end-of-line? I've tried using the Java Pattern \Z and \z, but those apparently aren't valid for JFlex's regular expression syntax.

Oh, and because of how I'm using this, all line endings are already normalized to \n, so I don't need to consider \r or \r\n here.

Thanks much in advance!

Scott |
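No reply to this question appears in this part of the archive. One possible approach, offered here only as a sketch, is to drop the terminator from the unterminated pattern altogether and rely on longest-match rule selection, so that the terminated rule still wins whenever a closing quote is present. The Java program below imitates that selection with java.util.regex instead of JFlex; the `OPEN` pattern, the class name and the helper methods are assumptions for illustration, and details of JFlex's own matching may differ.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StringLiteralSketch {
    // Same shape as STRING_LITERAL in the message above.
    static final Pattern TERMINATED = Pattern.compile("'([^'\\\\\\n]|\\\\.)*'");
    // Hypothetical variant: identical body, but no closing quote or newline required.
    static final Pattern OPEN = Pattern.compile("'([^'\\\\\\n]|\\\\.)*");

    /** Length of the match anchored at the start of input, or -1 if none. */
    static int prefixLength(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.lookingAt() ? m.end() : -1;
    }

    /** Imitates longest-match selection; the terminated rule wins ties. */
    static String classify(String input) {
        int terminated = prefixLength(TERMINATED, input);
        int open = prefixLength(OPEN, input);
        if (terminated < 0 && open < 0) return "NONE";
        return terminated >= open ? "STRING_LITERAL" : "UNTERMINATED_STRING_LITERAL";
    }

    public static void main(String[] args) {
        System.out.println(classify("'foo.bar"));    // input ends before a closing quote
        System.out.println(classify("'foo' + x"));   // properly terminated
        System.out.println(classify("'foo\nbar'"));  // newline before the closing quote
    }
}
```

With this shape the unterminated token stops just before the newline or end of input instead of consuming it, which may or may not suit the IntelliJ plugin's expectations.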
From: Gerwin K. <Ger...@ni...> - 2016-05-14 03:55:29
|
Looks like this is pretty much sorted out: Yes, [:jletterdigit:] is intended to mean exactly isJavaIdentifierPart, with all its faults. The idea is to give access to the Java platform definitions, so it would not be a good idea to tweak it.

There’s nothing stopping you from defining your own character class macro, though. Maybe something along the lines of the following?

ignorable = [\u0000-\u0008,\u000E-\u001B,\u007F-\u009F]
letterdigit = [[:jletterdigit:] -- {ignorable}]

Cheers,
Gerwin

On 14.05.2016, at 06:01, William Fenlason <bil...@gm...> wrote:

Lee,

Yes, I agree. Certainly isIdentifierIgnorable() is preferable.

Allowing nonprinting characters or "ignorable" characters within identifiers makes no sense to me. If the characters are "ignorable", does that mean that equals() is affected? Are two identifiers, one with embedded control characters and one without (but otherwise the same) equal? If not, what does "ignorable" mean? If so, are the equals() overrides cost justified?

Currently JFlex defines [:jletterdigit:] to be identical with isJavaIdentifierPart. For my purposes it would be nice if JFlex specified that [:jletterdigit:] does NOT include ignorable characters, but I doubt that Gerwin feels the same, nor should he.

I don't know if there are potential problems in JFlex with regard to identifiers containing control characters, but obviously they should be avoided. Probably they only occur in special situations like mine, where a control character is artificially inserted into the input.

Bottom line - I think it can be argued that including "ignorable", nonprinting characters in isJavaIdentifierPart() was a design error, but obviously we have to live with it.

On Fri, May 13, 2016 at 1:31 PM, Lee Carver <le...@pn...> wrote:

This appears to be by (weird) design. My guess is that a call to isIdentifierIgnorable() would be a better test than > 31.

The Oracle JavaSE-7 documents this behavior for isJavaIdentifierPart() -

<>
A character may be part of a Java identifier if any of the following are true:
...
- isIdentifierIgnorable(codePoint) returns true for the character
</>

And under isIdentifierIgnorable(char ch) we have -

<>
Determines if the specified character should be regarded as an ignorable character in a Java identifier or a Unicode identifier.
The following Unicode characters are ignorable in a Java identifier or a Unicode identifier:

ISO control characters that are not whitespace
'\u0000' through '\u0008'
'\u000E' through '\u001B'
'\u007F' through '\u009F'
all characters that have the FORMAT general category value
</>

On Thu, May 12, 2016 at 11:25 AM, William Fenlason <bil...@gm...> wrote:

PS

Perhaps a possible thing to do in JFlex is to change line 90 of LexParse.cup to

return Character.isJavaIdentifierPart(c) && c > 31;

although having to code around what is (imho) a Java flaw is distasteful.

Bill

On Thu, May 12, 2016 at 9:24 AM, Gerwin Klein <Ger...@ni...> wrote:

Sorry, I did receive it but got bogged down in other work and haven’t had a chance to look at it yet. Should have at least let you know..

I should be able to look at it this weekend.

Cheers,
Gerwin

On 12.05.2016, at 23:03, William Fenlason <bil...@gm...> wrote:

Gerwin,

Could you help me understand the status of this?

At the end of April I sent you a small test case (6 files, including grammar, test driver, etc.) which I think demonstrates this problem. Since I haven't heard back and because I sent it off list, I'm wondering if you received it, or if it somehow ended up in a spam folder? Or is the situation that you have not been able to devote any time to this?

I used a string reader to avoid any encoding issues, and added a test to ensure that the string reader was delivering the control characters as expected. My initial conclusion is that the processing of jletterdigit possibly has a flaw in which a subset of the ASCII control characters are included. I haven't tried to confirm the situation in the JFlex source yet. No doubt you would be much more efficient than I in figuring this out, but I'll give it a try as time permits.

Best,

Bill Fenlason

On Thu, Apr 28, 2016 at 7:22 AM, Gerwin Klein <Ger...@ni...> wrote:

Hi William,

this does sound like it could be a bug, yes.

Do you have a small test spec and input with expected output? I’d like to try to reproduce across different versions, maybe I can see what is going on.

A common pitfall with such characters is the encoding, both of the spec file for JFlex and the input file to the compiled scanner. If you’re using the unicode escape sequences, the former shouldn’t matter, but the latter still might.

Cheers,
Gerwin

On 26 Apr 2016, at 13:29, William Fenlason <bil...@gm...> wrote:

RL1.1 Hex Notation

To meet this requirement, an implementation shall supply a mechanism for specifying any Unicode code point (from U+0000 to U+10FFFF), using the hexadecimal code point representation.

JFlex conforms. Syntax is provided to express values across the whole range, via \uXXXX, where XXXX is a 4-digit hex value; \Uyyyyyy, where yyyyyy is a 6-digit hex value; and \u{X+( X+)*}, where X+ is a 1-6 digit hex value.

-------------------------------------------------------------------------------------------------

If I understand it correctly, the above (taken from the JFlex User Manual) implies that all hex characters from \U0000 through \U10FFFF may be used in a lexical specification. I don't think that is the case, and this is why.

As we know, <<EOF>> cannot be used for look ahead processing. It has been suggested here that one way to simulate it is to append a unique character to the end of the file, use it for look ahead, and then discard it. That approach was adopted.

We developed an extension of java.io.Reader which allows any specified character to be transparently appended to the end of the file (Eclipse document, actually), and also a substitute character to be returned in case the specified character occurs in the file.

It seemed that a reasonable choice for an EOF character was to use one of the ASCII control characters from \x00 thru \x1F, avoiding the commonly used ones like \x00 and \x07 thru \x0D. Initially, ETX (\x03) and EOT (\x04) appeared to be good alternatives.

Initial testing did not bear this out - in a test case, two versions of JFlex (1.4.3 and 1.6.1) appended these characters to other tokens rather than recognizing them as separate tokens. Additional testing convinced us that of the reasonable control character choices, only File Separator (FS - \x1C) and Group Separator (GS - \x1D) work as expected.

Why should some control characters work, and others not work? My suspicion is that somewhere in the JFlex code there are specific character dependencies in the ASCII control character range.

I believe that this is a bug, either in the code or in the above documentation, and is contrary to the idea that any hex character may be used in a specification.

Am I mis-reading this documentation? Do others agree that this is a bug to be fixed?

I've downloaded the JFlex source and am willing to look for the cause, but I have no idea where to start exploring. Does anyone have suggestions?

Obviously \x1C as the EOF character is a pragmatic solution "because it works", but that seems a bit of a kludge.

Bill Fenlason |
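The set described by Gerwin's `letterdigit` macro can also be checked outside a lexer spec. The Java sketch below is an illustration under assumptions: the class and method names are invented, and it covers only the three control-character ranges listed in the `ignorable` macro, not the FORMAT characters that `Character.isIdentifierIgnorable` also reports.

```java
public class LetterDigitSketch {
    /** True for the three control-character ranges listed in the ignorable macro. */
    static boolean inIgnorableRanges(int cp) {
        return (cp >= 0x0000 && cp <= 0x0008)
            || (cp >= 0x000E && cp <= 0x001B)
            || (cp >= 0x007F && cp <= 0x009F);
    }

    /** Java identifier-part characters minus the ranges above. */
    static boolean isLetterDigit(int cp) {
        return Character.isJavaIdentifierPart(cp) && !inIgnorableRanges(cp);
    }

    public static void main(String[] args) {
        int[] samples = { 'a', '7', '_', 0x03, 0x04, 0x1C };
        for (int cp : samples) {
            System.out.printf("U+%04X jletterdigit=%b letterdigit=%b%n",
                    cp, Character.isJavaIdentifierPart(cp), isLetterDigit(cp));
        }
    }
}
```

ETX (U+0003) and EOT (U+0004) are Java identifier parts but fall in the first excluded range, so a token rule built on `letterdigit` would no longer absorb them.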
From: William F. <bil...@gm...> - 2016-05-13 20:01:41
|
Lee,

Yes, I agree. Certainly isIdentifierIgnorable() is preferable.

Allowing nonprinting characters or "ignorable" characters within identifiers makes no sense to me. If the characters are "ignorable", does that mean that equals() is affected? Are two identifiers, one with embedded control characters and one without (but otherwise the same) equal? If not, what does "ignorable" mean? If so, are the equals() overrides cost justified?

Currently JFlex defines [:jletterdigit:] to be identical with isJavaIdentifierPart. For my purposes it would be nice if JFlex specified that [:jletterdigit:] does NOT include ignorable characters, but I doubt that Gerwin feels the same, nor should he.

I don't know if there are potential problems in JFlex with regard to identifiers containing control characters, but obviously they should be avoided. Probably they only occur in special situations like mine, where a control character is artificially inserted into the input.

Bottom line - I think it can be argued that including "ignorable", nonprinting characters in isJavaIdentifierPart() was a design error, but obviously we have to live with it. |
From: Lee C. <le...@pn...> - 2016-05-13 17:55:16
|
This appears to be by (weird) design. My guess is that a call to isIdentifierIgnorable() would be a better test than > 31.

The Oracle JavaSE-7 documents this behavior for isJavaIdentifierPart() -

<>
A character may be part of a Java identifier if any of the following are true:
...
- isIdentifierIgnorable(codePoint) returns true for the character
</>

And under isIdentifierIgnorable(char ch) we have -

<>
Determines if the specified character should be regarded as an ignorable character in a Java identifier or a Unicode identifier.
The following Unicode characters are ignorable in a Java identifier or a Unicode identifier:

ISO control characters that are not whitespace
'\u0000' through '\u0008'
'\u000E' through '\u001B'
'\u007F' through '\u009F'
all characters that have the FORMAT general category value
</> |
From: William F. <bil...@gm...> - 2016-05-12 18:26:02
|
PS Perhaps a possible thing to do in JFlex is to change line 90 of LexParse.cup to return Character.isJavaIdentifierPart(c) && c > 31; although having to code around what is (imho) a Java flaw is distasteful. Bill |
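The check proposed in the PS can be tried out in isolation. The following is a minimal sketch only: the class name `LetterDigitCheck` and the method name `isLetterDigit` are hypothetical, and the code does not reproduce what actually sits at line 90 of LexParse.cup.

```java
final class LetterDigitCheck {

    /**
     * Variant suggested in the PS: accept Java identifier-part characters,
     * but reject everything in the ASCII control range 0..31.
     * (Hypothetical helper, not the JFlex source.)
     */
    static boolean isLetterDigit(char c) {
        return Character.isJavaIdentifierPart(c) && c > 31;
    }

    public static void main(String[] args) {
        // ETX (3) is rejected by the extra range test, 'a' is still accepted.
        System.out.println(isLetterDigit((char) 3));
        System.out.println(isLetterDigit('a'));
    }
}
```

One caveat with this form: the `c > 31` test only covers the lower control range. The ISO control characters U+007F through U+009F are also identifier-ignorable in Java, so they would still pass this check.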
From: William F. <bil...@gm...> - 2016-05-12 17:16:44
|
Hi Gerwin,

After looking at the very readable JFlex code, I could see that the problem was not with JFlex. The simple case below shows the root cause of the problem.

Is there any reasonable explanation for why many of the ASCII control characters are considered to be Java letters or digits? My gut tells me this is not what the Java designers had in mind. Maybe this is a question for Oracle?

Bill Fenlason

--------------------------------------------------------------------------------------------------
public class Main {

    public static void main(String[] args) {
        char c;
        // Report, for each of the first 64 code points, whether Java
        // accepts it as part of an identifier.
        for (int i = 0; i < 64; i += 1) {
            c = (char) i;
            boolean b = Character.isJavaIdentifierPart(c);
            System.out.println("" + i + " (x" + x(i) + ") " + (i > 31 ? c : " ")
                    + (i < 10 ? " " : "") + " is java identifier part: " + b);
        }
    }

    // Format i as upper-case hex, two digits per byte.
    static String x(int i) {
        String s = "0123456789ABCDEF";
        if (i < 256) return "" + s.charAt(i / 16) + s.charAt(i & 15);
        return x(i / 256) + x(i & 255);
    }
}

/* --- results ----
0 (x00)    is java identifier part: true
1 (x01)    is java identifier part: true
2 (x02)    is java identifier part: true
3 (x03)    is java identifier part: true
4 (x04)    is java identifier part: true
5 (x05)    is java identifier part: true
6 (x06)    is java identifier part: true
7 (x07)    is java identifier part: true
8 (x08)    is java identifier part: true
9 (x09)    is java identifier part: false
10 (x0A)   is java identifier part: false
11 (x0B)   is java identifier part: false
12 (x0C)   is java identifier part: false
13 (x0D)   is java identifier part: false
14 (x0E)   is java identifier part: true
15 (x0F)   is java identifier part: true
16 (x10)   is java identifier part: true
17 (x11)   is java identifier part: true
18 (x12)   is java identifier part: true
19 (x13)   is java identifier part: true
20 (x14)   is java identifier part: true
21 (x15)   is java identifier part: true
22 (x16)   is java identifier part: true
23 (x17)   is java identifier part: true
24 (x18)   is java identifier part: true
25 (x19)   is java identifier part: true
26 (x1A)   is java identifier part: true
27 (x1B)   is java identifier part: true
28 (x1C)   is java identifier part: false
29 (x1D)   is java identifier part: false
30 (x1E)   is java identifier part: false
31 (x1F)   is java identifier part: false
32 (x20)   is java identifier part: false
33 (x21) ! is java identifier part: false
34 (x22) " is java identifier part: false
35 (x23) # is java identifier part: false
36 (x24) $ is java identifier part: true
37 (x25) % is java identifier part: false
38 (x26) & is java identifier part: false
39 (x27) ' is java identifier part: false
40 (x28) ( is java identifier part: false
41 (x29) ) is java identifier part: false
42 (x2A) * is java identifier part: false
43 (x2B) + is java identifier part: false
44 (x2C) , is java identifier part: false
45 (x2D) - is java identifier part: false
46 (x2E) . is java identifier part: false
47 (x2F) / is java identifier part: false
48 (x30) 0 is java identifier part: true
49 (x31) 1 is java identifier part: true
50 (x32) 2 is java identifier part: true
51 (x33) 3 is java identifier part: true
52 (x34) 4 is java identifier part: true
53 (x35) 5 is java identifier part: true
54 (x36) 6 is java identifier part: true
55 (x37) 7 is java identifier part: true
56 (x38) 8 is java identifier part: true
57 (x39) 9 is java identifier part: true
58 (x3A) : is java identifier part: false
59 (x3B) ; is java identifier part: false
60 (x3C) < is java identifier part: false
61 (x3D) = is java identifier part: false
62 (x3E) > is java identifier part: false
63 (x3F) ? is java identifier part: false
*/
|
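The pattern in these results matches documented Java behaviour: `Character.isJavaIdentifierPart` also returns true for any character for which `Character.isIdentifierIgnorable` returns true, and the non-whitespace ISO control characters (U+0000 to U+0008 and U+000E to U+001B in this range) are identifier-ignorable. The sketch below checks that correspondence for the code points 0 to 31; the class and method names are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class IgnorableControls {

    /** Code points below 32 that Java reports as identifier-ignorable. */
    static List<Integer> ignorableControls() {
        List<Integer> result = new ArrayList<>();
        for (int cp = 0; cp < 32; cp++) {
            if (Character.isIdentifierIgnorable(cp)) {
                result.add(cp);
            }
        }
        return result;
    }

    /** True if, below 32, "identifier part" and "ignorable" always agree. */
    static boolean ignorableExplainsIdentifierPart() {
        for (int cp = 0; cp < 32; cp++) {
            if (Character.isJavaIdentifierPart(cp) != Character.isIdentifierIgnorable(cp)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(ignorableControls());
        System.out.println(ignorableExplainsIdentifierPart());
    }
}
```

Seen this way, FS (x1C) and GS (x1D) work as separate tokens because Java does not treat them as identifier-ignorable, while ETX (x03) and EOT (x04) are.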
From: Gerwin K. <Ger...@ni...> - 2016-05-12 13:24:36
|
Sorry, I did receive it but got bogged down in other work and haven’t had a chance to look at it yet. Should have at least let you know.. I should be able to look at it this weekend. Cheers, Gerwin ________________________________ The information in this e-mail may be confidential and subject to legal professional privilege and/or copyright. National ICT Australia Limited accepts no liability for any damage caused by this email or its attachments. |
From: William F. <bil...@gm...> - 2016-05-12 13:03:50
|
Gerwin,

Could you help me understand the status of this?

At the end of April I sent you a small test case (6 files, including grammar, test driver, etc.) which I think demonstrates this problem. Since I haven't heard back and because I sent it off list, I'm wondering if you received it, or if it somehow ended up in a spam folder? Or is the situation that you have not been able to devote any time to this?

I used a string reader to avoid any encoding issues, and added a test to ensure that the string reader was delivering the control characters as expected. My initial conclusion is that the processing of jletterdigit possibly has a flaw in which a subset of the ASCII control characters is included. I haven't tried to confirm the situation in the JFlex source yet. No doubt you would be much more efficient than I in figuring this out, but I'll give it a try as time permits.

Best,

Bill Fenlason
|
From: Gerwin K. <Ger...@ni...> - 2016-04-28 11:22:50
|
Hi William,

This does sound like it could be a bug, yes.

Do you have a small test spec and input with expected output? I’d like to try to reproduce across different versions; maybe I can see what is going on.

A common pitfall with such characters is the encoding, both of the spec file for JFlex and the input file to the compiled scanner. If you’re using the unicode escape sequences, the former shouldn’t matter, but the latter still might.

Cheers,
Gerwin
|
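To take the input encoding out of the picture when reproducing a problem like this, the scanner's input can be opened with an explicit charset instead of the platform default. The following is a rough sketch under that assumption; `ExplicitCharsetInput`, `utf8Reader` and `readAll` are hypothetical names, and UTF-8 is only an example choice for the input file's encoding.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class ExplicitCharsetInput {

    /** Wrap a byte stream in a Reader that always decodes as UTF-8. */
    static Reader utf8Reader(InputStream in) {
        return new InputStreamReader(in, StandardCharsets.UTF_8);
    }

    /** Read everything from the reader, e.g. to inspect what a scanner would see. */
    static String readAll(Reader reader) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        try {
            int n;
            while ((n = reader.read(buf, 0, buf.length)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }
}
```

A reader built this way can be handed to the generated scanner in place of one that relies on the default encoding. Dumping the decoded characters with `readAll` first is a quick way to confirm that the control characters arrive unchanged.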
From: William F. <bil...@gm...> - 2016-04-26 03:29:54
|
RL1.1 Hex Notation

*To meet this requirement, an implementation shall supply a mechanism for specifying any Unicode code point (from U+0000 to U+10FFFF), using the hexadecimal code point representation.*

JFlex conforms. Syntax is provided to express values across the whole range, via \uXXXX, where XXXX is a 4-digit hex value; \Uyyyyyy, where yyyyyy is a 6-digit hex value; and \u{X+( X+)*}, where X+ is a 1-6 digit hex value.

-------------------------------------------------------------------------------------------------

If I understand it correctly, the above (taken from the JFlex User Manual) implies that all hex characters from \U0000 through \U10FFFF may be used in a lexical specification. I don't think that is the case, and this is why.

As we know, <<EOF>> cannot be used for look ahead processing. It has been suggested here that one way to simulate it is to append a unique character to the end of the file, use it for look ahead, and then discard it.

That approach was adopted. We developed an extension of java.io.Reader which allows any specified character to be transparently appended to the end of the file (Eclipse document, actually), and also a substitute character to be returned in case the specified character occurs in the file.

It seemed that a reasonable choice for an EOF character was to use one of the ASCII control characters from \x00 thru \x1F, avoiding the commonly used ones like \x00 and \x07 thru \x0D. Initially, ETX (\x03) and EOT (\x04) appeared to be good alternatives.

Initial testing did not bear this out - in a test case, two versions of JFlex (1.4.3 and 1.6.1) appended these characters to other tokens rather than recognizing them as separate tokens. Additional testing convinced us that of the reasonable control character choices, only File Separator (FS - \x1C) and Group Separator (GS - \x1D) work as expected.

Why should some control characters work, and others not work?

My suspicion is that somewhere in the JFlex code there are specific character dependencies in the ASCII control character range. I believe that this is a bug, either in the code or in the above documentation, and is contrary to the idea that any hex character may be used in a specification.

Am I mis-reading this documentation? Do others agree that this is a bug to be fixed?

I've downloaded the JFlex source and am willing to look for the cause, but I have no idea where to start exploring. Does anyone have suggestions?

Obviously \x1C as the EOF character is a pragmatic solution "because it works", but that seems a bit of a kludge.

Bill Fenlason |
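For readers who want to try the approach described in this message, here is a minimal sketch of a Reader wrapper that appends one sentinel character at end of input and swaps in a substitute when the sentinel occurs in the data. It is not the poster's actual code: the class name `SentinelReader`, the helper `readAll`, and the substitute character used in the usage note are all hypothetical choices for illustration.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

// Hypothetical sketch: appends a sentinel char once at end of input and
// replaces any occurrence of that char inside the input with a substitute.
class SentinelReader extends FilterReader {
    private final char sentinel;
    private final char substitute;
    private boolean sentinelSent = false;

    SentinelReader(Reader in, char sentinel, char substitute) {
        super(in);
        this.sentinel = sentinel;
        this.substitute = substitute;
    }

    @Override
    public int read() throws IOException {
        int c = super.read();
        if (c == -1) {
            if (sentinelSent) return -1;   // real end of stream
            sentinelSent = true;
            return sentinel;               // emit the sentinel exactly once
        }
        return c == sentinel ? substitute : c;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) return 0;
        int n = super.read(buf, off, len);
        if (n == -1) {
            if (sentinelSent) return -1;
            sentinelSent = true;
            buf[off] = sentinel;
            return 1;
        }
        // Hide sentinel characters that occur in the underlying data.
        for (int i = off; i < off + n; i++) {
            if (buf[i] == sentinel) buf[i] = substitute;
        }
        return n;
    }

    // Test helper (an assumption of this sketch): drains a reader to a String.
    static String readAll(Reader r) {
        try {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[4];
            int n;
            while ((n = r.read(buf, 0, buf.length)) != -1) sb.append(buf, 0, n);
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A JFlex-generated scanner takes a java.io.Reader in its constructor, so a wrapper like this could be passed in place of the plain reader, for example with FS (0x1C) as the sentinel and SUB (0x1A) as an arbitrarily chosen substitute. The sketch does not override skip(), mark() or reset(), so it only suits strictly sequential reading.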
From: Gerwin K. <Ger...@ni...> - 2016-03-28 00:19:03
|
Thanks for reporting that. It should now be fixed.

Cheers,
Gerwin

On 28.03.2016, at 08:48, William Fenlason <bil...@gm...> wrote:

On the download page (http://jflex.de/download.html), the JFlex Maven plugin shows the name: jflex-maven-plugin-1.6.1.zip two times. Each of the two (identical) download buttons actually download the same file: jflex-maven-1.6.1.tar.gz

I would assume that the first button should read: jflex-maven-plugin-1.6.1.tar.gz, and the second button should download the file: jflex-maven-1.6.1.zip (assuming it exists). I noticed this because I was trying to download the zip file.

Bill Fenlason |
From: William F. <bil...@gm...> - 2016-03-27 21:48:18
|
On the download page (http://jflex.de/download.html), the JFlex Maven plugin shows the name jflex-maven-plugin-1.6.1.zip twice. Each of the two (identical) download buttons actually downloads the same file: jflex-maven-1.6.1.tar.gz

I would assume that the first button should read jflex-maven-plugin-1.6.1.tar.gz, and that the second button should download the file jflex-maven-1.6.1.zip (assuming it exists). I noticed this because I was trying to download the zip file.

Bill Fenlason |
From: Steve R. <sa...@gm...> - 2016-02-26 14:46:41
|
Hi Ralph,

I’m guessing that you have rules to match & ignore whitespace in the default state, but since you don’t have one of those for ISTATUS_STATE, the space after ISTATUS blocks recognition of “ACTIVE”.

Steve

> On Feb 25, 2016, at 8:08 AM, Ralph Stommel <r.s...@co...> wrote:
>
> Dear JFLEX-Users,
>
> I am using JFLEX together with BYACC. It has been working perfectly in all my projects so far.
> However, in order to prevent my JFLEX scanner from recognizing a generic quoted string after having recognized a token ISTATUS I have specified the following exclusive lexical start state scenario:
>
> %%
>
> %byaccj
> %ignorecase
> %xstate ISTATUS_STATE
>
> …
>
> ACTIVE = (active)|([\"](active)[\"])
> …
> QUOTED_STRING = ([\"][^\n\r]*(\"\")*[^\n\r]*[\"])
> %%
> …
> <ISTATUS_STATE>{ACTIVE} {yyparser.yylval = new ParserVal(yytext()); yybegin(YYINITIAL); return Parser.ACTIVE;}
> …
> {ISTATUS} {yyparser.yylval = new ParserVal(yytext()); yybegin(ISTATUS_STATE); return Parser.ISTATUS;}
> …
> {QUOTED_STRING} {yyparser.yylval = new ParserVal(yytext()); return Parser.QUOTED_STRING;}
> …
>
> The string that is parsed looks as follows:
> … ISTATUS "ACTIVE" …
> I.e. the quoted string "ACTIVE" is directly following the token ISTATUS.
> When debugging the lexer I can see that yybegin(ISTATUS_STATE) is set after recognizing the ISTATUS token.
> But then the "ACTIVE" string is not recognized and the lexer terminates with zzScanError(ZZ_NO_MATCH) instead;
> Without the lexical state spec the ACTIVE token is recognized by the lexer.
>
> Does anyone see where I am wrong in my usage scenario above or would anyone know how to make this work?
> Many thanks in advance for your help.
>
> Ralph |
From: Ralph S. <r.s...@co...> - 2016-02-25 13:22:02
|
Dear JFLEX-Users,

I am using JFLEX together with BYACC. It has been working perfectly in all my projects so far.
However, in order to prevent my JFLEX scanner from recognizing a generic quoted string after having recognized a token ISTATUS I have specified the following exclusive lexical start state scenario:

%%

%byaccj
%ignorecase
%xstate ISTATUS_STATE

...

ACTIVE = (active)|([\"](active)[\"])
...
QUOTED_STRING = ([\"][^\n\r]*(\"\")*[^\n\r]*[\"])
%%
...
<ISTATUS_STATE>{ACTIVE} {yyparser.yylval = new ParserVal(yytext()); yybegin(YYINITIAL); return Parser.ACTIVE;}
...
{ISTATUS} {yyparser.yylval = new ParserVal(yytext()); yybegin(ISTATUS_STATE); return Parser.ISTATUS;}
...
{QUOTED_STRING} {yyparser.yylval = new ParserVal(yytext()); return Parser.QUOTED_STRING;}
...

The string that is parsed looks as follows:
... ISTATUS "ACTIVE" ...
I.e. the quoted string "ACTIVE" is directly following the token ISTATUS.
When debugging the lexer I can see that yybegin(ISTATUS_STATE) is set after recognizing the ISTATUS token.
But then the "ACTIVE" string is not recognized and the lexer terminates with zzScanError(ZZ_NO_MATCH) instead;
Without the lexical state spec the ACTIVE token is recognized by the lexer.

Does anyone see where I am wrong in my usage scenario above or would anyone know how to make this work?
Many thanks in advance for your help.

Ralph |
From: <de....@io...> - 2015-11-14 22:12:51
|
Hello.

I'm trying to break up a file into words, "$$" and "\". Specifically, a "word" is any non-whitespace character. An input such as:

"a b c$$ $$ \ e \\\"

.. Would yield the tokens:

1. a
2. b
3. c
4. $$
5. $$
6. \
7. e
8. \
9. \
10. \

I'm having problems coming up with a pattern or set of patterns that will achieve this, however. The obvious definition, such as:

Word = \P{Whitespace}+
Space = \p{Whitespace}+
Command = \p{Alpha}+
Slash = \\
Dollars = "$$"

%%

<YYINITIAL> {
  { Space } { /* Ignore */ }
  { Slash } { throw new RuntimeException("Slash"); }
  { Dollars } { throw new RuntimeException("Dollars"); }
  { Word } {
    final TokenText.Builder b = TokenText.builder();
    b.position(this.position());
    b.name(this.yytext());
    return b.build();
  }
}

... Will obviously not work, because although " $$ " and " \ " will be matched by the Slash and Dollars patterns, an input such as "f$$" will be matched by the Word pattern, rather than yielding two tokens "f" and "$$".

What is the simplest way to achieve this with jflex?

M |
From: <de....@io...> - 2015-11-14 22:06:18
|
On 2015-11-14T21:36:03 +0000 <de....@io...> wrote:
> I'm trying to break up a file into words, "$$" and "\". Specifically, a
> "word" is any non-whitespace character. An input such as:

Sorry, that should have read: A "word" is any sequence of one or more non-whitespace characters.

M |
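The splitting asked for in this thread can be pinned down with a small hand-written scanner. The sketch below is plain Java, not a JFlex specification, and it is not an answer taken from the list. It only shows one reading of the requirement: a word stops at whitespace, at a backslash, or at the start of "$$". The class name `WordSplitter` is hypothetical, and treating a single "$" as an ordinary word character is an assumption the original post does not settle.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the desired tokenization, written by hand.
class WordSplitter {
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {          // whitespace is ignored
                i++;
            } else if (c == '\\') {                   // each backslash is its own token
                tokens.add("\\");
                i++;
            } else if (input.startsWith("$$", i)) {   // "$$" is its own token
                tokens.add("$$");
                i += 2;
            } else {
                // A word runs until whitespace, a backslash, or the start of "$$".
                // Assumption: a lone '$' stays inside the word.
                int start = i;
                while (i < input.length()
                        && !Character.isWhitespace(input.charAt(i))
                        && input.charAt(i) != '\\'
                        && !input.startsWith("$$", i)) {
                    i++;
                }
                tokens.add(input.substring(start, i));
            }
        }
        return tokens;
    }
}
```

With this rule the input from the question yields the ten tokens listed there, and "f$$" splits into "f" and "$$". In a lexer specification the same effect would come from narrowing the character set that a word may contain, rather than from rule ordering alone, since the longest match wins.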
From: Gerwin K. <Ger...@ni...> - 2015-03-01 22:17:11
|
That’s right. If it can be both at the beginning of the line or not, you could just define

comment = ;[a-zA-Z0-9]+

Cheers,
Gerwin

On 02.03.2015, at 02:39, master <ma...@lu...> wrote:

On 01.03.2015 at 15:22, master wrote:

Hi,
I'm new to JFlex and started with a simple scanner for assembler text files.
A comment line in this assembler starts with ';' and can appear at the beginning of a line, or somewhere after an expression.
In the macro section I defined a macro for comments as follows:

comment = ^;([a-zA-Z0-9]+) | (;[a-zA-Z0-9]+)

But running JFlex always stated:

Syntax error.
comment = ^;([a-zA-Z0-9]+) | (;[a-zA-Z0-9]+)
^

where it points to the '^' sign.

Why can I not use the legal '^' sign here for referencing the beginning of a line?

Best regards

Hi,
after reading the manual again, I found the reason in chapter 4.2.11 Macrodefinition: ... must not contain the ^, / or $ operators. ):

Best regards |
From: master <ma...@lu...> - 2015-03-01 15:41:13
|
On 01.03.2015 at 15:22, master wrote:
> Hi,
> I'm new to JFlex and started with a simple scanner for assembler text
> files.
> A comment line in this assembler starts with ';' and can appear at
> the beginning of a line, or somewhere after an expression.
> In the macro section I defined a macro for comments as follows:
>
> comment = ^;([a-zA-Z0-9]+) | (;[a-zA-Z0-9]+)
>
> But running JFlex always stated:
>
> Console output
> Syntax error.
> comment = ^;([a-zA-Z0-9]+) | (;[a-zA-Z0-9]+)
> ^
>
> where it points to the '^' sign.
>
> Why can I not use the legal '^' sign here for referencing the
> beginning of a line?
>
> Best regards

Hi,
after reading the manual again, I found the reason in chapter 4.2.11 Macrodefinition: ... must not contain the ^, / or $ operators. ):

Best regards |