Q 7.0 character escape syntax (was: Re: [q-lang-users] ANN: Q 7.0 release candidate)

Hi John,

I'm taking this discussion back to the mailing list, as I feel that 
this issue should be discussed by everybody on the list who is 
interested in the upcoming Q 7.0 release.

(Just a quick update for everybody: Q 7.0 RC2 is almost done and I also 
have the native Windows port working. Moreover, thanks to John Cowan's 
tireless testing and bug reporting, Cygwin is now supported, too. 
However, there is still an issue related to numeric character escapes, 
as detailed below. Note that we now need an escape syntax which is able 
to support the entire Unicode range, not just ASCII.)

John Cowan wrote:
> I found a problem:  when you type
> 
> 	"\300" ++ "4"
> 
> to the interpreter, it replies
> 
> 	"\3004"

> There doesn't seem to be any way to defeat
> the greediness of the \N construct, and there needs to be.  It seems
> to me that the most Q-ish approach is to allow parentheses around N,
> and output "\(300)4".

Thanks for reporting this. I certainly want to fix this before releasing 
RC2. Your proposal makes sense to me, and would be fairly easy to 
implement, too.

(NB: The problem here is that an escape like "\1234" will always denote 
character #1234, and there's currently no way to escape, say, character 
#123, followed by a literal "4" character (other than escaping "4", too, 
which is silly). In fact, I think that this misfeature is present in 
*all* recent Q versions.)
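
To make the ambiguity concrete, here is a small Python model of a greedy decimal escape and of the proposed \(N) form. It is only a sketch and not the Q lexer: the names GREEDY, PAREN, decode and encode are made up for illustration, and it assumes decimal escapes only.

```python
import re

# Hypothetical model of the greedy escape: a backslash followed by as many
# decimal digits as possible. This is not the actual Q lexer.
GREEDY = re.compile(r"\\([0-9]+)")
# Same, but also accepting the parenthesized form \(N) proposed in the thread.
PAREN = re.compile(r"\\\(([0-9]+)\)|\\([0-9]+)")

def decode(src, pattern=GREEDY):
    """Replace each numeric escape in src by the character it denotes."""
    def repl(m):
        digits = next(g for g in m.groups() if g is not None)
        return chr(int(digits))
    return pattern.sub(repl, src)

def encode(s):
    """Print non-ASCII characters as escapes; parenthesize if a digit follows."""
    out = []
    for i, ch in enumerate(s):
        if ord(ch) < 128:
            out.append(ch)
        elif i + 1 < len(s) and s[i + 1] in "0123456789":
            out.append("\\(%d)" % ord(ch))   # unambiguous: \(300)4
        else:
            out.append("\\%d" % ord(ch))
    return "".join(out)
```

With these definitions, decode(r"\3004") yields the single character #3004, whereas encode(chr(300) + "4") prints \(300)4, which decodes back to two characters when the PAREN pattern is used.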

> In addition, Unicode folks really really really detest decimal numbers
> for Unicode characters.  While you're fixing the above, please allow a
> leading x (as in "\x0100") for hexadecimal character escapes.

Recent Q versions already allow either decimal, octal or hexadecimal 
notation in an escape, using the same syntax as in integer literals. 
Thus, e.g., \27, \033 and \0x1b all denote the ASCII escape character. 
With Q 7.0 I still use the same notation; only the range of character 
codes is bigger, covering all 0x110000 Unicode code points. I think 
that this notation is cleaner and simpler than having to remember all 
kinds of funny escape notations, like the \ooo, \xhh, \uhhhh and 
\Uhhhhhhhh escapes of C, Python et al. But the advantage of the latter 
is that apparently many other languages already use them, so they are a 
kind of de facto standard.
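
As an illustration of the integer-literal style of escape, the following Python sketch decodes \ddd, \0ooo and \0xhhh escapes. It is a model under stated assumptions, not the Q implementation: the literal rules (0x prefix for hexadecimal, leading 0 for octal, otherwise decimal) are inferred from the examples above, and ESCAPE, escape_value and decode are hypothetical names.

```python
import re

# Assumed literal syntax, following the examples in the message:
# 0x... is hexadecimal, a leading 0 means octal, anything else is decimal.
ESCAPE = re.compile(r"\\(0[xX][0-9a-fA-F]+|[0-9]+)")

def escape_value(literal):
    """Convert the digits of an escape to a character code."""
    if literal[:2].lower() == "0x":
        return int(literal[2:], 16)
    if len(literal) > 1 and literal[0] == "0":
        return int(literal, 8)      # raises ValueError on digits 8 and 9
    return int(literal, 10)

def decode(src):
    """Replace each numeric escape in src by the character it denotes."""
    def repl(m):
        code = escape_value(m.group(1))
        if code >= 0x110000:        # outside the Unicode code space
            raise ValueError("character code out of range: " + m.group(0))
        return chr(code)
    return ESCAPE.sub(repl, src)
```

In this model decode(r"\27"), decode(r"\033") and decode(r"\0x1b") all return the ASCII escape character.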

I'm not sure what The Right Thing is in this case. So what should we do: 
Keep the existing \ddd, \0ooo, \0xhhh notation and extend that with the 
\(<int>) notation? Or rather jump on the C/Python/... bandwagon and 
employ the widely used \ooo, \xhh, \uhhhh, \Uhhhhhhhh notation? (Note 
that then I'd also have to drop the decimal escape syntax of pre-7.0 
releases, potentially breaking existing scripts in places where it might 
not be easily noticed.) Any other proposals? Opinions?
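
For comparison, this is how the fixed-width escapes behave in Python, which uses the C-style notation. Because each form consumes a bounded number of digits, a following digit is never absorbed into the escape. The variable names are just for illustration.

```python
# \uhhhh takes exactly four hex digits: U+012C (decimal 300) followed by "4".
joined = "\u012c" + "4"
literal = "\u012c4"

# \xhh takes exactly two hex digits, \ooo at most three octal digits.
hex_pair = "\x1b4"
octal_pair = "\0334"

# \Uhhhhhhhh takes exactly eight hex digits and reaches the whole Unicode range.
astral_pair = "\U0010ffff4"
```

Each of these strings has length 2. The price is that the number of digits depends on the escape letter, which is the "funny notation" issue mentioned above.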

> Another minor point:  currently stray \ characters are basically ignored:
> "\z" is equivalent to "z".  IMHO this should be changed, making them
> syntax errors; that allows you to add a meaning for \z at some future
> date without worries that existing poorly-written scripts will break.

This convention was adopted from C (I think that the standard doesn't 
actually specify this, but IIRC all C compilers I've used did it that 
way). At least the newer versions of gcc generate a warning message in 
the case of an unrecognized escape, though. I could easily do this in 
the Q compiler as well, but unless you run the interpreter with the -w 
option you won't notice the difference. ;-) OTOH, spitting out a syntax 
error in this case seems a bit too harsh for my taste. What do others think?
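
The three options under discussion (silently ignore, warn, or reject) can be sketched in Python as follows. This is a toy unescaper, not the Q compiler: the KNOWN table is an illustrative subset, and the policy parameter and the function name unescape are invented for the example.

```python
import warnings

# Illustrative subset of recognized escapes; the real table would be larger.
KNOWN = {"n": "\n", "t": "\t", "\\": "\\", '"': '"'}

def unescape(src, policy="ignore"):
    """Process backslash escapes; policy is "ignore", "warn" or "error"."""
    out = []
    i = 0
    while i < len(src):
        ch = src[i]
        if ch != "\\" or i + 1 == len(src):
            out.append(ch)          # ordinary character (or trailing backslash)
            i += 1
            continue
        nxt = src[i + 1]
        if nxt in KNOWN:
            out.append(KNOWN[nxt])
        else:
            if policy == "error":
                raise SyntaxError("unknown escape \\" + nxt)
            if policy == "warn":
                warnings.warn("unknown escape \\" + nxt)
            out.append(nxt)         # stray backslash dropped: "\z" becomes "z"
        i += 2
    return "".join(out)
```

With the default policy, unescape(r"\z") returns "z" as the current behaviour does; policy="warn" returns the same result but reports the stray backslash, and policy="error" refuses the string.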

Cheers,
Albert

-- 
Dr. Albert Gr"af
Dept. of Music-Informatics, University of Mainz, Germany
Email:  Dr....@t-..., ag...@mu...
WWW:    http://www.musikinformatik.uni-mainz.de/ag