From: Philipp K. K. <pk...@sp...> - 2025-10-23 08:37:27
On 23.10.25 at 09:12, "Janko Stamenović" via sdcc-devel wrote:
>
> And I don't even know if Unicode considers that U+212B "should or
> shouldn't be used in an identifier". It doesn't help and doesn't
> change anything, also not "for security" because if any script
> which could "look too much the same" would be banned from
> identifiers, it would be a huge discrimination.
>
> In short, IMO, normalization should be an answer to a specific
> problem, not something done "just because". Specifically, if the
> problem is some security evaluation ("does something look the same
> as something else?"), such a tool should be used and its outputs
> analyzed independently of a compiler, because different scripts
> could look the same anyway, and, on another side, the security
> breaches could very well be implemented in pure ASCII too and can
> be missed in the visual control of the sources, so expecting a
> compiler that solves all the problems that could ever happen isn't
> realistic.
>
> My current impression is still that an existence of automatic
> normalization and automatic detection of non-normalized forms in a
> compiler won't change anything in practice, except that a checkbox
> could be ticked "yes we have that"?
>
* The standard says that two identifiers are the same if they are the
same after normalization to normalization form C.
* Someone could e.g. have a variable named Übernachtungspauschale
written with U (U+0055 LATIN CAPITAL LETTER U) + ̈ (U+0308 COMBINING
DIAERESIS), because their text editor does it like that. Then someone
else works on the source with another text editor that uses Ü (U+00DC
LATIN CAPITAL LETTER U WITH DIAERESIS), and it would be confusing to the
user if those were not treated as the same by SDCC due to lack of
normalization.
* Well, I don't know either whether UAX #31 says that U+212B can be used
in an identifier or not. So I'd want SDCC to warn me if I try to use it
when it is not allowed in identifiers. My code should be portable
unless I intentionally use a compiler-specific feature.
* Regarding the security implications of Unicode (e.g. homoglyph
attacks): having the normalization and the checks for valid identifiers
does help here. C23 is safer than C11 was (which AFAIK allowed more
Unicode in identifiers).
But for a full solution we'd have to do more (N2932, rejected for C2y,
but WG14 wanted it as a TS, which so far hasn't happened). AFAIK,
libraries that implement everything we'd want for security are
currently not that widespread (they exist, in particular libu8ident,
but I don't think many distros package them).
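
The two-editor scenario above (U+0055 + U+0308 versus U+00DC) can be
reproduced with a minimal sketch using Python's standard unicodedata
module. This is only an illustration of NFC comparison, not SDCC code;
the helper names same_identifier and is_nfc are made up for this example.

```python
import unicodedata

# Two spellings of the same identifier, written with escapes so the
# difference stays visible regardless of how this mail is rendered.
decomposed = "U\u0308bernachtungspauschale"   # U+0055 followed by U+0308
precomposed = "\u00DCbernachtungspauschale"   # U+00DC

def same_identifier(a: str, b: str) -> bool:
    """Hypothetical helper: compare two names after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

def is_nfc(name: str) -> bool:
    """True if the name is already in NFC (needs Python 3.8 or later)."""
    return unicodedata.is_normalized("NFC", name)

print(decomposed == precomposed)                 # False: different code points
print(same_identifier(decomposed, precomposed))  # True: equal after NFC
print(is_nfc(decomposed), is_nfc(precomposed))   # False True
print(is_nfc("\u212B"))                          # False: NFC maps U+212B to U+00C5
```

A check like is_nfc is roughly what a "this identifier is not in NFC"
warning would be based on, which is also why U+212B gets flagged: it is
never its own NFC form.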
Philipp