From: William F. <bil...@gm...> - 2016-04-26 03:29:54
RL1.1 Hex Notation

*To meet this requirement, an implementation shall supply a mechanism for specifying any Unicode code point (from U+0000 to U+10FFFF), using the hexadecimal code point representation.*

JFlex conforms. Syntax is provided to express values across the whole range, via \uXXXX, where XXXX is a 4-digit hex value; \Uyyyyyy, where yyyyyy is a 6-digit hex value; and \u{X+( X+)*}, where X+ is a 1-6 digit hex value.

-------------------------------------------------------------------------------------------------

If I understand it correctly, the above (taken from the JFlex User Manual) implies that every code point from U+0000 through U+10FFFF may be used in a lexical specification. I don't think that is the case, and this is why.

As we know, <<EOF>> cannot be used for lookahead processing. It has been suggested here that one way to simulate it is to append a unique character to the end of the file, use it for lookahead, and then discard it.

That approach was adopted. We developed an extension of java.io.Reader which transparently appends any specified character to the end of the file (an Eclipse document, actually), and which returns a substitute character whenever the specified character occurs in the file itself.

A reasonable choice for the EOF character seemed to be one of the ASCII control characters from \x00 through \x1F, avoiding the commonly used ones such as \x00 and \x07 through \x0D. Initially, ETX (\x03) and EOT (\x04) appeared to be good candidates.

Initial testing did not bear this out. In a test case, two versions of JFlex (1.4.3 and 1.6.1) appended these characters to other tokens rather than recognizing them as separate tokens. Additional testing convinced us that, of the reasonable control character choices, only File Separator (FS - \x1C) and Group Separator (GS - \x1D) work as expected.

Why should some control characters work, and others not work?
My suspicion is that somewhere in the JFlex code there are specific character dependencies in the ASCII control character range. I believe that this is a bug, either in the code or in the above documentation, and is contrary to the idea that any hex character may be used in a specification.

Am I misreading this documentation? Do others agree that this is a bug to be fixed?

I've downloaded the JFlex source and am willing to look for the cause, but I have no idea where to start exploring. Does anyone have suggestions?

Obviously \x1C as the EOF character is a pragmatic solution "because it works", but that seems a bit of a kludge.

Bill Fenlason
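
For illustration, here is a minimal sketch of the kind of Reader wrapper described above. It is not the actual implementation mentioned in this post: the class name SentinelReader and its constructor parameters are hypothetical, and it assumes only the behavior stated above (append one chosen character at end of input, and replace any occurrence of that character in the input with a substitute).

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

/**
 * Hypothetical sketch: wraps a Reader, replaces any occurrence of the
 * sentinel character in the input with a substitute, and appends the
 * sentinel exactly once when the underlying input is exhausted.
 */
class SentinelReader extends FilterReader {
    private final char sentinel;
    private final char substitute;
    private boolean sentinelEmitted = false;

    SentinelReader(Reader in, char sentinel, char substitute) {
        super(in);
        this.sentinel = sentinel;
        this.substitute = substitute;
    }

    @Override
    public int read() throws IOException {
        int c = in.read();
        if (c == -1) {
            // Underlying input is done: emit the sentinel once, then EOF.
            if (sentinelEmitted) {
                return -1;
            }
            sentinelEmitted = true;
            return sentinel;
        }
        return (c == sentinel) ? substitute : c;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        int n = in.read(buf, off, len);
        if (n == -1) {
            // Underlying input is done: emit the sentinel once, then EOF.
            if (sentinelEmitted) {
                return -1;
            }
            sentinelEmitted = true;
            buf[off] = sentinel;
            return 1;
        }
        // Hide any sentinel characters that occur in the real input.
        for (int i = off; i < off + n; i++) {
            if (buf[i] == sentinel) {
                buf[i] = substitute;
            }
        }
        return n;
    }
}
```

A scanner generated by JFlex takes a java.io.Reader in its constructor, so a wrapper along these lines could be passed in place of the original reader, for example `new SentinelReader(new StringReader(text), '\u001C', '?')`. The sketch ignores mark/reset and skip, which a production version would need to handle.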