[jflex-users] Possible bug?

RL1.1 Hex Notation

*To meet this requirement, an implementation shall supply a mechanism for
specifying any Unicode code point (from U+0000 to U+10FFFF), using the
hexadecimal code point representation.*

JFlex conforms. Syntax is provided to express values across the whole
range, via \uXXXX, where XXXX is a 4-digit hex value; \Uyyyyyy, where yyyyyy
is a 6-digit hex value; and \u{X+( X+)*}, where X+ is a 1-6 digit hex value.

-------------------------------------------------------------------------------------------------

If I understand it correctly, the above (taken from the JFlex User Manual)
implies that any code point from U+0000 through U+10FFFF may be used in a
lexical specification.  I don't think that is the case, and here is why.

As we know, <<EOF>> cannot be used in a lookahead expression.  It has been
suggested here that one way to simulate it is to append a unique character
to the end of the file, use that character for lookahead, and then discard
it.  That is the approach we adopted.

We developed an extension of java.io.Reader which allows any specified
character to be transparently appended to the end of the file (Eclipse
document, actually), and also a substitute character to be returned in case
the specified character occurs in the file.
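
For readers who want to reproduce the setup, here is a minimal sketch of
such a Reader.  It is not our actual code: the class name SentinelReader,
the helper readAll, and the choice of substitute character are all
hypothetical and are only meant to illustrate the idea.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

/**
 * Hypothetical sketch: wraps a Reader, appends one sentinel character after
 * the last real character, and replaces any sentinel found in the real
 * input with a substitute character.
 */
class SentinelReader extends Reader {
    private final Reader in;
    private final char sentinel;
    private final char substitute;
    private boolean sentinelSent = false;

    SentinelReader(Reader in, char sentinel, char substitute) {
        this.in = in;
        this.sentinel = sentinel;
        this.substitute = substitute;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        if (sentinelSent) {
            return -1; // sentinel already delivered: report the real end
        }
        int n = in.read(buf, off, len);
        if (n == -1) {
            // Underlying input is exhausted: deliver the sentinel once.
            buf[off] = sentinel;
            sentinelSent = true;
            return 1;
        }
        // Replace sentinel characters that occur in the real input.
        for (int i = off; i < off + n; i++) {
            if (buf[i] == sentinel) {
                buf[i] = substitute;
            }
        }
        return n;
    }

    @Override
    public void close() throws IOException {
        in.close();
    }

    /** Demonstration helper (hypothetical): drains a reader into a String. */
    static String readAll(Reader r) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[16];
        try {
            int n;
            while ((n = r.read(buf, 0, buf.length)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }
}
```

An instance wrapping the document's Reader would be handed to the
generated scanner in place of the original Reader, so the scanner sees the
sentinel as the last character before the real end of input.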

It seemed that a reasonable choice for an EOF character was one of the
ASCII control characters from \x00 through \x1F, avoiding the commonly
used ones like \x00 and \x07 through \x0D.  Initially, ETX (\x03) and EOT
(\x04) appeared to be good alternatives.

Initial testing did not bear this out: in a test case, two versions of
JFlex (1.4.3 and 1.6.1) appended these characters to other tokens rather
than recognizing them as separate tokens.  Additional testing convinced us
that, of the reasonable control character choices, only File Separator
(FS, \x1C) and Group Separator (GS, \x1D) work as expected.

Why should some control characters work, and others not work?  My suspicion
is that somewhere in the JFlex code there are specific character
dependencies in the ASCII control character range.

I believe that this is a bug, either in the code or in the above
documentation, and is contrary to the idea that any hex character may be
used in a specification.

Am I misreading this documentation?  Do others agree that this is a bug to
be fixed?

I've downloaded the JFlex source and am willing to look for the cause, but
I have no idea where to start exploring.  Does anyone have suggestions?

Obviously, using \x1C as the EOF character is a pragmatic solution "because
it works", but that seems a bit of a kludge.

Bill Fenlason
