Re: [sdcc-devel] Using libunistring in SDCC?

L.S.,

1) re Unicode "similar characters" (jargon: homoglyphs): please do not try
to deal with this in the compiler, as you'd be begging for a maintenance
nightmare. As you (Mr. Krause) mentioned at the beginning, you wish to
stay away from adding Unicode-related upkeep to your SDCC maintenance
cost.  [my interpretation of your words]

1.a) Dealing with (and prohibiting) homoglyphs (and language/script
mixtures, b.t.w.) has a very strong "coding standards" smell with all its
inherent bike shedding and religion. While I do understand why folks want
it and write proposals, I do not like that stuff creeping into any
programming language specification, as homoglyphs are a volatile moving
target. These sorts of 'higher layer' checks are perfectly suited to being
handled by linters and other static code QA tools, such as PVS. And the
bike shedding can be offloaded onto them: @ y'all: your team can pick their
favorite linter/QA tool [chain] and case closed, as far as I'm concerned.

1.b) homoglyphs are a moving target in multiple aspects, f.e.
language/script: lowercase L isn't only quite similar to the numeral 1, but
also to the Turkish dotless i (ı). "Only one script per variable allowed,"
one then responds. Well, Turks also use 'l' (lowercase L), so that's tough
luck. Do you really want that hassle inside SDCC? Run a linter; if you're
worried about this sort of stuff, you're best served by running one (plus
PVS and similar static code QA tools) anyway.
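
To illustrate why the "one script per identifier" rule falls short, here is
a rough sketch of such a lint check in Python. It is not a proposal for
SDCC. The function names are hypothetical, and taking the first word of the
Unicode character name as "the script" is a crude assumption standing in
for proper script data.

```python
import unicodedata

def scripts_in(identifier: str) -> set:
    """Approximate the set of scripts used by the letters of an identifier.

    Assumption: the first word of the Unicode character name ("LATIN",
    "CYRILLIC", "GREEK", ...) is a good enough stand-in for the script.
    """
    found = set()
    for ch in identifier:
        if ch.isalpha():                      # skip digits, underscores, marks
            name = unicodedata.name(ch, "")
            if name:
                found.add(name.split()[0])
    return found

def is_mixed_script(identifier: str) -> bool:
    """True when the letters of the identifier come from more than one script."""
    return len(scripts_in(identifier)) > 1
```

A Cyrillic "а" hidden in an otherwise Latin identifier is flagged, but an
identifier mixing 'l' and the dotless 'ı' passes, because both are Latin
letters. That is the tough luck mentioned above.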

Next, there's the often neglected "font" aspect: programmers have a strong
preference about the editors they wish to use. And it's "styling templates"
galore for a reason. Those include using your favorite (monospaced?) font
for viewing/editing your source code. Homoglyphs are also highly dependent
on the precise font used: e.g.: dinosaurs like me have been around the
actual typewriters and teletypes where 1 was utterly indiscernible from
lowercase L: l. Argh! IBM didn't use a Courier-simile in some of their
products, where the font was designed to make the 1 and l pretty obviously
different. Ditto for modern age font design: some fonts have shapes that
act as homoglyphs where other fonts -- used for the same purpose -- do not
have that particular homoglyph!
What is a homoglyph for me is not necessarily a homoglyph for you. And vice
versa!

And what if the Unicode committee introduced a few more glyphs to the
current set? Unicode 19.0? 20.0? When you add "warns/errors when your code
contains homoglyphs" to your compiler, you will have a problem when one of
the new glyphs turns out to be usable as a homoglyph.  And: anyone around
well versed in Asian scripts, such as Mandarin or the Japanese ones? If you
want to protect/warn about homoglyphs, how about the confusion embedded in
those scripts? Shouldn't they be treated at the same quality/human
protection level?   Whoa. This calls for a dedicated tool, and a bespoke
one, if you ask me.
Which leads me to the biggest problem re homoglyphs in a
non-unicode-dedicated tool, like a C compiler such as SDCC: the human
element itself:

1.c) homoglyphs are, from my experience and perspective at least, not a
technical or technological issue, but rather a human factor. Think: just
another (unavoidable) flavour of PEBCAK. That's okay, but this doesn't
belong inside a compiler core. (Sure, some of the technology aspects don't
help, but let's take that as a given for now.)
When you scour the internet for reported homoglyph attacks and lists of the
suspects, you'll quickly find lists/tables of Unicode glyphs per character,
e.g. https://github.com/codebox/homoglyph/blob/master/raw_data/chars.txt if
you want a nice, maintained one.
Well, I don't know about y'all, but there's plenty in those lists out there
where I am amazed some of the listed entries actually have been confused,
ever! Am I some sort of *improved human*, then? Most certainly *not*! Have
I been caught by homoglyphs before today? Yes. Multiple times! And it will
happen again. Did it cost me? Yes, quite a bit of time and frustration, on
several occasions. Should my C/C++ compiler warn or error out on this sort
of thing? **absolutely not**!
-->
- every time I had a WTF due to a homoglyph, it was resolved by running a
linter (or hexdumper ('od' et al), which also is a kind of linter, AFAIAC)
and getting that A-HA! moment.  +1 for external, and thus very flexible
and bespoke, toolkits.
- There are (rare!) scenarios where I actively WANT homoglyphs in my C
code: when, f.e., I am writing a homoglyph-detecting library myself, where
the data is embedded in the sourcecode (through whatever preprocessing
means, hand or machine). Should SDCC be forced to become smart enough to
differentiate between homoglyphs in actual code, on the one hand, and
inside C Strings, on the other hand -- where homoglyphs SHOULD be allowed
in the latter case to allow one to write such an (obvious) source
file/library? This is quickly becoming a pain I wouldn't want near me as a
compiler engineer.
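
For the "hexdumper as linter" route, a minimal sketch of the kind of
external check meant here: list every non-ASCII character in a source file
with its position, code point and Unicode name, and let the human decide.
The function name and the output shape are made up for illustration.

```python
import unicodedata

def non_ascii_report(source: str) -> list:
    """Return (line, column, 'U+XXXX', name) for every non-ASCII character.

    Lines and columns are 1-based; columns count code points, not bytes.
    """
    hits = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 0x7F:
                hits.append((line_no, col, f"U+{ord(ch):04X}",
                             unicodedata.name(ch, "<unnamed>")))
    return hits
```

It makes no attempt to judge whether something is a homoglyph and treats
code and string literals alike, which sidesteps the distinction a compiler
would otherwise have to make.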

Ergo: push all those hassles to the outside, for all compilers (not just
SDCC).  At least offload into a dedicated, public, library if you can't
resist and have to (I have to, for my products, for very
different reasons).  (or better yet: look around and grab one you like from
someone else who enjoys this sort of Unicode + human factor thing.)
This is what CI toolchains are made for, AFAIAC. (At least one of the
reasons to have a CI process for your project.)

TL;DR: homoglyph support: please don't. Ever.

2) Unicode normal form C: NFC. ... before this thread started, I would have
been in favor of having that in the compiler. I waited a few days before
actually writing this email (other priorities and sometimes it's better to
wait a little longer, anyway). My original quick off-hand thought was:
warning about this is easy: you pull everything through the normalizer (at
the very start of the parse, for nobody wants to fall into the trap of
CVE-2023-52081, https://github.com/advisories/GHSA-wpmx-564x-h2mh ,
b.t.w.), then you
strcmp() the input vs. output and if there's *any* change, you barf a
hairball.
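
That off-hand idea, sketched in Python rather than C so that it can lean on
the standard library's normalizer (unicodedata.normalize). The function
name is hypothetical. Working line by line is my own choice here, so a
warning can name the offending line; it assumes a line break never takes
part in a composition.

```python
import unicodedata

def lines_not_in_nfc(source: str) -> list:
    """Return the 1-based numbers of lines that NFC normalization would change."""
    return [
        n for n, line in enumerate(source.split("\n"), start=1)
        if unicodedata.normalize("NFC", line) != line   # the "strcmp()" step
    ]
```

An empty list means input and normalized output are identical; anything
else is hairball material.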

However, myself having worked on homoglyph-filtering in non-C (web)
projects, and taking my argument above ("use external lint/QA tools") for
homoglyphs to its obvious conclusion, then (1.c) can be argued to also
apply to "single normal form not always desirable": there are circumstances
where you would want normal form D instead (e.g. when your code is messing
around with older Apple FS). Sure, sure, there's always the neater \u
escapes for all that, top marks right there, but again that's a coding QA
issue in the eye of the beholder, not a [programming] language thing, for
my money.
Hm, let the benchmark for this be: can SDCC be used to compile IOCCC
entries? I vote YES! ;-))
https://en.wikipedia.org/wiki/International_Obfuscated_C_Code_Contest --
b.t.w. they're called IOCCC, I for International, so my third eye sees some
homoglyphs on the horizon there. JJJoy! [Ren & Stimpy]

Ergo: if I don't like homoglyph detecting & coping logic inside SDCC, I
MUST also not like having enforced input file normalization (NFC).
The compiler MAY wish to barf a hairball (warning or error) if identifiers
are found not to be in NFC, that's fine, but C string content SHOULD
be allowed to be in decomposed or other non-NFC form.
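
A rough sketch of that split, again as an external Python check rather than
compiler code: blank out string and character literals first, then test
what remains for NFC. The names are hypothetical and the literal matcher is
deliberately crude (no raw strings, no line continuations, and comments are
still checked), so treat it as an illustration of the rule only.

```python
import re
import unicodedata

# Crude matchers for C string and character literals (assumption: no raw
# strings, no backslash-newline continuations inside a literal).
_STRING = r'"(?:\\.|[^"\\\n])*"'
_CHAR = r"'(?:\\.|[^'\\\n])*'"
_LITERAL = re.compile(_STRING + "|" + _CHAR)

def has_non_nfc_outside_literals(source: str) -> bool:
    """True if anything outside string/char literals is not in NFC."""
    code_only = _LITERAL.sub('""', source)   # drop literal contents
    return unicodedata.normalize("NFC", code_only) != code_only
```

A decomposed sequence inside a C string passes, while the same sequence in
an identifier is reported.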

A tough call, deciding which way to go with this. Warning about any
non-NFC-formedness would require SDCC to include some sort of Unicode
normalization code anyway and that then makes me lean towards using the ICU
C library directly as that is the reference implementation for this.

3) There was a mention of frighteningly large regular expressions in the
lexer sometime before in this thread (it was phrased differently though):
FYI: modern regex engines support \p and \P Unicode class escapes, which
act similarly to the older POSIX [[:alpha:]]: that way you only need a
simple regex set expression for any legal C identifier. See
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Regular_expressions/Unicode_character_class_escape
for a nice description of how these work, and regex101.com can be used to
test such animals, and more. If this is a problem for SDCC, then maybe
switching regex engines might be an option. re2 and libtre (tre) come to
mind: both have worked well for me.
- https://github.com/laurikari/tre
- https://github.com/google/re2
See also: https://javascript.info/regexp-unicode ,
https://github.com/google/re2/wiki/syntax

4) re the Unicode C library to use: my minor gripes against GNU libunistring
are its license and its build system and support, which are rather
Unix-centric and automake/autoconf based (a hassle on MSVC/Windows systems).
Of course, using the grand old ICU C library would be an option AFAIAC
(despite many complaints about its binary size and a few folks getting
upset about its API), or perhaps using 'git subtree' or alike on
libunistring and keeping the source files generated by the 'external tools'
(wget et al) in there alongside.  Alas, from my perspective that's
maintenance-functionally equivalent to a tracking fork, with the
maintenance cost of that. Given that hassle will be SDCC's anyway,
personally I'd choose ICU and maybe include a clean and simple wrapper that
serves my needs.

It's not all that important what is picked in the end, though if you wish
to remain a C-but-no-C++ codebase for SDCC, I don't know of any nice and
easy alternatives, except perhaps utf8proc, which is used by the Julia
folks: https://github.com/JuliaStrings/utf8proc
All the others I've run into, including some wrappers for ICU itself, are
C++ codebases.

I am a mere consumer in matters SDCC, but I hope this helps all moving
forward in the least (for you) agonizing direction re Unicode compiler
inputs.

While this message may have started out pretty depressing (1), I hope
there's something nice and positive to take away at the end.

Best regards,

Ger Hobbelt

P.S. Note that the latest MSVC2022 C/C++ compilers yak a warning about
illegal bytes (definitely not a legal byte) in UTF8 source code files,
where the warning lists a very helpful (not!!) byte offset into the source
file: if these warnings bother you, there's a little more work involved
than the usual "click-on-warning-and-jump-to-the-offending-spot". Anyway,
that's what MSVC2022 does when you feed it a CP850(?) file where it expects
UTF8. I haven't seen it barf on decomposed Unicode sequences, but then I
haven't actively looked for how the compiler might behave exactly vs.
standards/specifications, while I am near-certain one or two of my library
source files will carry denormalized Unicode codepoint sequences (in UTF8)
-- haven't looked further/deeper than: code looks good, no weird nor
incomprehensible nastiness, compiles, runs, passes tests (including the
nasty ones) == Good To Go(tm). Sorry.

P.P.S. when anyone worries about "loss of information" through the
normalization and/or homoglyph filtering processes, there's the QSN/J8
encoding (unofficial) spec to consider for when you need 100% guaranteed
back&forth zero-loss conversions:
https://oils.pub/release/latest/doc/qsn.html ,
https://oils.pub/release/latest/doc/j8-notation.html as those were designed
for making Unicode-oriented editors deal with arbitrary byte inputs -- I
believe we won't need those as 'original input data' can be produced in the
warnings/errors through other means (tracking original positions vs.
normalized/filtered output that triggered the warning/error in the
lexer/parser), but it is a viable alternative way of accomplishing this
subgoal, particularly if you choose to do input file NFC normalization
anyway. Still, both ways (tracking orig-pos or J8-enc) add way too much
hairiness for a compiler, to my taste.

--------------------------------------------------
web:    http://www.hobbelt.com/
        http://www.hebbut.net/
mail:   ge...@ho...
mobile: +31-6-11 120 978
--------------------------------------------------

On Fri, Oct 24, 2025 at 1:03 PM "Janko Stamenović" via sdcc-devel <
sdc...@li...> wrote:

> That's true. As even the normalized form doesn't tell about the
> "likeness appearance", there would have to be some function
> transforming the normalized forms to the "likeness representation"
> which could be then compared reasonably efficiently. Which is also
> something that should be implemented in some specialized
> libraries.
>
> I still hope these kinds of source-analysis features would remain
> optional, however.
>
> Regarding SDCC's current hashing, it can be the "good enough"
> solution even if the asymptotic behavior would degrade, if the
> degradation never becomes noticeable in practice. And if the fixed
> hash table size could result in observable slowdown in practice,
> that could also be improved.
>
>
>
> --- Original Message ---
> From: Philipp Klaus Krause <pk...@sp...>
> Date: 23.10.2025 18:12:39
> To: sdc...@li...
> Subject: Re: [sdcc-devel] Using libunistring in SDCC?
>
> On 23.10.25 at 17:54, "Janko Stamenović" via sdcc-devel wrote:
> > Hopefully such checks are done by external tools and not by a
> > compiler, as I can imagine a combinatorial explosion of testing
> > what is similar to what, especially in projects with huge headers?
>
> I don't think there'd be an explosion. Any identifier encountered needs
> to be tested against all previous ones anyway (to check if they are the
> same). Unless that is done via a hash map, binary tree or such, this is
> already a number of string comparisons quadratic in the number of strings
> (SDCC AFAIK does currently use a hash map, but into an array of fixed
> size, so we still have asymptotically quadratic effort).
>
> The change would be replacing the strcmp() comparison with a more
> complicated one considering homoglyphs, but in the end, most pairs of
> identifiers would still differ early. So still negligible effort
> compared to other compiler stages.
>
> Philipp
>
>
>
> _______________________________________________
> sdcc-devel mailing list
> sdc...@li...
> https://lists.sourceforge.net/lists/listinfo/sdcc-devel
