From: Benedikt F. <b.f...@gm...> - 2025-10-22 10:28:26
On 22.10.25 at 11:08, Philipp Klaus Krause wrote:
> On 22.10.25 at 09:36, "Janko Stamenović" via sdcc-devel wrote:
>> I understand it's only to have a checkmark "yes we have that."
>
> Currently, the C23 standard technically only requires us to check for
> valid unicode, since valid unicode that is not allowed in identifiers
> results in undefined behavior, but IMO, that was a mistake (a "shall"
> outside of a constraint section results in UB, while in a constraint
> section it requires a diagnostic), and will likely be fixed, with a
> recommendation to also apply the fix to implementations of earlier
> standards.
> Also, in my experience, people do use non-ASCII identifiers when
> anyone working on the code can reasonably be expected to be able to
> handle them, and the use of non-ASCII identifiers makes the intent
> clearer. Examples would be using greek letters in implementations of
> math, and the use of terms from local legislation/regulation in code
> that needs to comply with it.
> So IMO, we should have reasonable support for non-ASCII identifiers.
> Also, we do have some support in decode_UCNs_to_utf8, so we
> probably have some users depending on it already.
>
>> […]
>>
>> I suggest that SDCC does the same: that way many costs of using
>> these could be paid only by those who decide to need them
>> (effectively nobody) and the core tools remain decoupled.
>
> The cost would be the added dependency on a unicode library. For now,
> that unicode library would be used only to rewrite decode_UCNs_to_utf8
> in src/SDCC.lex, so we'd no longer implement the check for XID_Start
> and XID_Continue ourselves, and also do a normalization to
> normalization form C before the check.
>
> The C standard, annex D, explicitly mentions the possibility of the
> compiler not doing the normalization to normalization form C, and
> instead just requiring that the input already be normalized to
> normalization form C, but IMO that doesn't help us much, as we'd
> still need a check that the identifier is in normalization form C.
>
> Philipp
>
> P.S.: ICU4C could be an alternative to libunistring; apparently both
> are available on virtually all host platforms. But libunistring
> looks "lighter" to me, and should be good enough for our use-case,
> so I think I'd prefer that one.

I am still trying to figure out in which ways exactly SDCC is
currently not standard-compliant. Is it just missing diagnostics, or
are there other issues, too?

The monstrosity of a regular expression that parses UTF-8 identifiers
in our lexer currently accepts exactly the byte sequences that encode
valid UTF-8 code points within the character set that C11 allowed in
identifiers. That means that no normalization is happening in the
lexer, and that it will still happily accept e.g. a "pile of poo"
emoji thrown at it.

Does the standard explicitly require diagnostics in these cases?

Greetings
Benedikt