From: Philipp K. K. <pk...@sp...> - 2025-10-22 09:08:57

On 22.10.25 at 09:36, "Janko Stamenović" via sdcc-devel wrote:
> I understand it's only to have a checkmark "yes we have that."

Currently, the C23 standard technically only requires us to check for valid Unicode, since valid Unicode that is not allowed in identifiers results in undefined behavior. IMO, that was a mistake (a "shall" outside of a constraint section results in UB, while in a constraint section it requires a diagnostic), and it will likely be fixed, with a recommendation to also apply the fix to implementations of earlier standards.

Also, in my experience, people do use non-ASCII identifiers when anyone working on the code can reasonably be expected to be able to handle them, and the use of non-ASCII identifiers makes the intent clearer. Examples would be using Greek letters in implementations of math, and the use of terms from local legislation/regulation in code that needs to comply with it. So IMO, we should have reasonable support for non-ASCII identifiers.

Also, we do have some support in decode_UCNs_to_utf8, so we probably have some users depending on it already.

> […]
>
> I suggest that SDCC does the same: that way many costs of using
> these could be paid only by those who decide to need them
> (effectively nobody) and the core tools remain decoupled.

The cost would be the added dependency on a Unicode library. For now, that Unicode library would be used only to rewrite decode_UCNs_to_utf8 in src/SDCC.lex, so we'd no longer implement the check for XID_Start and XID_Continue ourselves, and we'd also normalize to normalization form C before the check.

The C standard, Annex D, explicitly mentions the possibility of the compiler not doing the normalization to normalization form C, and instead just requiring that the input already be normalized to normalization form C. But IMO that doesn't help us much, as we'd still need a check for the identifier being in normalization form C.

Philipp

P.S.: ICU4C could be an alternative to libunistring; apparently both are available on virtually all host platforms. But libunistring looks "lighter" to me, and should be good enough for our use-case, so I think I'd prefer that one.
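
For illustration only, the two options discussed above (normalize to normalization form C and then check, or merely require that the input already is in normalization form C) can be sketched with Python's standard library. This is not SDCC code and does not use libunistring or ICU4C. The function name check_identifier and its normalize parameter are hypothetical. str.isidentifier() applies Python's own identifier rule (XID_Start or underscore, followed by XID_Continue), which is assumed here to be a close stand-in for the check the C standard describes.

```python
import unicodedata


def check_identifier(name: str, normalize: bool = True) -> str:
    """Return the identifier in normalization form C, or raise ValueError.

    Hypothetical helper for illustration; not part of SDCC.
    """
    if normalize:
        # Option 1: the compiler normalizes the spelling to NFC itself.
        name = unicodedata.normalize("NFC", name)
    elif not unicodedata.is_normalized("NFC", name):
        # Option 2: the compiler only requires already-normalized input,
        # which still needs a check that the input really is in NFC.
        raise ValueError("identifier is not in normalization form C")

    # Python's rule: XID_Start (or '_') followed by XID_Continue characters.
    # Assumed to approximate the C identifier check; it is not identical.
    if not name.isidentifier():
        raise ValueError("character not allowed in an identifier")
    return name
```

With normalize=True, a decomposed spelling such as "e" followed by U+0301 is accepted and returned in its composed form. With normalize=False the same input is rejected, which shows why the second option still needs a normalization check.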