From: Ger H. <ge...@ho...> - 2025-10-24 15:07:49
L.S.,

1) re Unicode "similar characters" (jargon: homoglyphs): please do not try to deal with this in the compiler, as you'd be begging for a maintenance nightmare -- and you (mr. Krause) mentioned in the beginning that you wished to stay away from adding Unicode-related upkeep to your SDCC maintenance cost. [my interpretation of your words]

1.a) Dealing with (and prohibiting) homoglyphs (and language/script mixtures, b.t.w.) has a very strong "coding standards" smell with all its inherent bike shedding and religion. While I do understand why folks want it and write proposals, I do not like that stuff creeping into any programming language specification, as homoglyphs are a volatile moving target. These sorts of 'higher layer' checks are perfectly suited for solving by linters and other static code QA tools, such as PVS. And the bike shedding can be offloaded onto them: @ y'all: your team can pick their favorite linter/QA tool [chain] and case closed, as far as I'm concerned.

1.b) Homoglyphs are a moving target in multiple aspects. F.e. language/script: lowercase L isn't only quite similar to numeric 1, but also to Turkish dotless i. "Only one script per variable allowed", one then responds. Well, Turks also use 'l' (lowercase L), so that's tough luck. Do you really want that hassle inside SDCC? Run a linter; if you're worried about this sort of stuff, you're best served with running one (plus PVS and similar static code QA tools) anyway.

Next, there's the often neglected "font" aspect: programmers have strong preferences about the editors they wish to use. And it's "styling templates" galore for a reason. Those include using your favorite (monospaced?) font for viewing/editing your source code. Homoglyphs are also highly dependent on the precise font used: e.g. dinosaurs like me have been around the actual typewriters and teletypes where 1 was utterly indiscernible from lowercase L: l. Argh!
IBM used a non-Courier-like font in some of their products, one designed to make the 1 and l pretty obviously different. Ditto for modern age font design: some fonts have shapes that act as homoglyphs where other fonts -- used for the same purpose -- do not have that particular homoglyph! What is a homoglyph for me is not necessarily a homoglyph for you. And vice versa!

And what if the Unicode committee introduces a few more glyphs to the current set? Unicode 19.0? 20.0? When you add "warns/errors when your code contains homoglyphs" to your compiler, you will have a problem when one of the new glyphs turns out to be usable as a homoglyph.

And: anyone around well versed in Asian scripts, such as Mandarin or the Japanese ones? If you want to protect/warn about homoglyphs, how about the confusion embedded in those scripts? Shouldn't they be treated at the same quality/human protection level? Whoa. This calls for a dedicated tool, and a bespoke one, if you ask me.

Which leads me to the biggest problem re homoglyphs in a non-unicode-dedicated tool, like a C compiler such as SDCC: the human element itself:

1.c) Homoglyphs are, from my experience and perspective at least, not a technical or technological issue, but rather a human factor. Think: just another (unavoidable) flavour of PEBCAK. That's okay, but this doesn't belong inside a compiler core. (Sure, some of the technology aspects don't help, but let's take that as a given for now.)

When you scour the internet for reported homoglyph attacks and lists of the suspects, you'll quickly find lists/tables of Unicode glyphs per character, e.g. https://github.com/codebox/homoglyph/blob/master/raw_data/chars.txt if you want a nice, maintained one. Well, I don't know about y'all, but there's plenty in those lists out there where I am amazed some of the listed entries actually have been confused, ever! Am I some sort of *improved human*, then? Most certainly *not*! Have I been caught by homoglyphs before today?
Yes. Multiple times! And it will happen again. Did it cost me? Yes, quite a bit of time and frustration, on several occasions. Should my C/C++ compiler warn or error out on this sort of thing? **absolutely not**! -->

- Every time I had a WTF due to a homoglyph, it was resolved by running a linter (or hexdumper ('od' et al), which also is a kind of linter, AFAIAC) and getting that A-HA! moment. +1 for external, and thus very flexible and bespoke, toolkits.

- There are (rare!) scenarios where I actively WANT homoglyphs in my C code: when, f.e., I am writing a homoglyph-detecting library myself, where the data is embedded in the source code (through whatever preprocessing means, hand or machine). Should SDCC be forced to become smart enough to differentiate between homoglyphs in actual code, on the one hand, and inside C strings, on the other hand -- where homoglyphs SHOULD be allowed in the latter case to allow one to write such an (obvious) source file/library? This is quickly becoming a pain I wouldn't want near me as a compiler engineer.

Ergo: push all those hassles to the outside, for all compilers (not just SDCC). At least offload into a dedicated, public library if you can't resist and have to (I have to, for my products, for very different reasons). (Or better yet: look around and grab one you like from someone else who enjoys this sort of Unicode + human factor thing.) This is what CI toolchains are made for, AFAIAC. (At least one of the reasons to have a CI process for your project.)

TL;DR: homoglyph support: please don't. Ever.

2) Unicode normal form C: NFC.

... before this thread started, I would have been in favor of having that in the compiler. I waited a few days before actually writing this email (other priorities and sometimes it's better to wait a little longer, anyway).
My original quick off-hand thought was: warning about this is easy: you pull everything through the normalizer (at the very start of the parse, for nobody wants to fall in the trap of CVE-2023-52081, https://github.com/advisories/GHSA-wpmx-564x-h2mh , b.t.w.), then you strcmp() the input vs. output and if there's *any* change, you barf a hairball.

However, myself having worked on homoglyph-filtering in non-C (web) projects, and taking my argument above ("use external lint/QA tools") for homoglyphs to its obvious conclusion, (1.c) can be argued to also apply here, as a "single normal form is not always desirable": there are circumstances where you would want normal form D instead (e.g. when your code is messing around with older Apple FS). Sure, sure, there's always the neater \u escapes for all that, top marks right there, but again that's a coding QA issue in the eye of the beholder, not a [programming] language thing, for my money.

Hm, let the benchmark for this be: can SDCC be used to compile IOCCC entries? I vote YES! ;-)) https://en.wikipedia.org/wiki/International_Obfuscated_C_Code_Contest -- b.t.w. they're called IOCCC, I for International, so my third eye sees some homoglyphs on the horizon there. JJJoy! [Ren & Stimpy]

Ergo: if I don't like homoglyph detecting & coping logic inside SDCC, I also MUST not like having enforced input file normalization (NFC) either. The compiler MAY wish to barf a hairball (warning or error) if identifiers are found not to be in normal form, that's fine, but C string content SHOULD be allowed to be in decomposed or other non-NFC format.

A tough call, deciding which way to go with this. Warning about any non-NFC-formedness would require SDCC to include some sort of Unicode normalization code anyway, and that then makes me lean towards using the ICU C library directly, as that is the reference implementation for this.
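For what it's worth, the "normalize, then compare input vs. output" check can be sketched outside the compiler in a few lines. This is only an illustration of the external-lint approach, assuming Python and its standard `unicodedata` module rather than ICU; the function name `first_non_nfc_offset` is made up for this sketch, and it deliberately does not distinguish identifiers from C string content.

```python
import unicodedata
from typing import Optional


def first_non_nfc_offset(text: str) -> Optional[int]:
    """Return the code point index where `text` first differs from its
    NFC-normalized form, or None when the text is already in NFC.

    Hypothetical helper for an external lint step; not SDCC code.
    """
    normalized = unicodedata.normalize("NFC", text)
    if normalized == text:
        return None
    # Walk both strings in lockstep until they diverge.
    for index, (original_cp, normalized_cp) in enumerate(zip(text, normalized)):
        if original_cp != normalized_cp:
            return index
    # One string is a prefix of the other: they diverge at the shorter length.
    return min(len(text), len(normalized))
```

A CI job could run this over each source file and print the offset as a warning instead of failing the build, which keeps the policy decision with the project rather than with the compiler.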
3) There was a mention of frighteningly large regex expressions in the lexer sometime before in this thread (it was phrased differently though): FYI: modern regex engines support \p and \P Unicode class particles, which act similar to the older POSIX [[:alpha:]]: that way you only need a simple regex set expression for any legal C identifier. See https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Regular_expressions/Unicode_character_class_escape for a nice description of how these work; regex101.com can be used to test such animals, and more.

If this is a problem for SDCC, then maybe switching regex engine might be an option. re2 and libtre (tre) come to mind: both have worked well for me.

- https://github.com/laurikari/tre
- https://github.com/google/re2

See also: https://javascript.info/regexp-unicode , https://github.com/google/re2/wiki/syntax

4) re the Unicode C library to use: my minor gripes against GNU libunistring are that its build system and support are rather Unix-centric, automake/autoconf based (a hassle on MSVC/Windows systems), and its license.

Of course, using the grand old ICU C library would be an option AFAIAC (despite many complaints about its binary size and a few folks getting upset about its API), or perhaps using 'git subtree' or alike on libunistring and keeping the source files generated by the 'external tools' (wget et al) in there alongside. Alas, from my perspective that's maintenance-functionally equivalent to a tracking fork, with the maintenance cost of that. Given that hassle will be SDCC's anyway, personally I'd choose ICU and maybe include a clean and simple wrapper that serves my needs.
It's not all that important what is picked in the end, though if you wish to remain a C-but-no-C++ codebase for SDCC, I don't know of any nice and easy alternatives, except perhaps utf8proc, which is used by the Julia folks: https://github.com/JuliaStrings/utf8proc

All the others I've run into, including some wrappers for ICU itself, are C++ codebases.

I am a mere consumer in matters SDCC, but I hope this helps all moving forward in the least (for you) agonizing direction re Unicode compiler inputs. While this message may have started out pretty depressing (1), I hope there's something nice and positive to take away at the end.

Best regards,

Ger Hobbelt

P.S. Note that the latest MSVC2022 C/C++ compilers yak a warning about illegal bytes (definitely not a legal byte) in UTF8 source code files, where the warning lists a very helpful (not!!) byte offset into the source file: if these warnings bother you, there's a little more work involved than the usual "click-on-warning-and-jump-to-the-offending-spot". Anyway, that's what MSVC2022 does when you feed it a CP850(?) file where it expects UTF8. I haven't seen it barf on decomposed Unicode sequences, but then I haven't actively looked into how the compiler behaves vs. standards/specifications, while I am near-certain one or two of my library source files carry denormalized Unicode codepoint sequences (in UTF8) -- haven't looked further/deeper than: code looks good, no weird nor incomprehensible nastiness, compiles, runs, passes tests (including the nasty ones) == Good To Go(tm). Sorry.

P.P.S.
When anyone worries about "loss of information" through the normalization and/or homoglyph filtering processes, there's the QSN/J8 encoding (unofficial) spec to consider for when you need 100% guaranteed back&forth zero-loss conversions: https://oils.pub/release/latest/doc/qsn.html , https://oils.pub/release/latest/doc/j8-notation.html as those were designed for making Unicode-oriented editors deal with arbitrary byte inputs -- I believe we won't need those, as 'original input data' can be produced in the warnings/errors through other means (tracking original positions vs. normalized/filtered output that triggered the warning/error in the lexer/parser), but it is a viable alternative way of accomplishing this subgoal, particularly if you choose to do input file NFC normalization anyway. Still, both ways (tracking orig-pos or J8-enc) add way too much hairiness for a compiler, to my taste.

--------------------------------------------------
web: http://www.hobbelt.com/
     http://www.hebbut.net/
mail: ge...@ho...
mobile: +31-6-11 120 978
--------------------------------------------------

On Fri, Oct 24, 2025 at 1:03 PM "Janko Stamenović" via sdcc-devel <sdc...@li...> wrote:

> That's true. As even the normalized form doesn't tell about the
> "likeness appearance", there would have to be some function
> transforming the normalized forms to the "likeness representation"
> which could be then compared reasonably efficiently. Which is also
> something that should be implemented in some specialized
> libraries.
>
> I still hope these kind of source-analysis features would remain
> optional, however.
>
> Regarding SDCC current hashing, it can be the "good enough"
> solution even if the asymptotic behavior would degrade, if the
> degradation never becomes noticeable in practice. And if the fixed
> hash table size could result in observable slowdown in practice,
> that could also be improved.
>
> --- Original Message ---
> From: Philipp Klaus Krause <pk...@sp...>
> Date: 23.10.2025 18:12:39
> To: sdc...@li...
> Subject: Re: [sdcc-devel] Using libunistring in SDCC?
>
> On 23.10.25 at 17:54, "Janko Stamenović" via sdcc-devel wrote:
> > Hopefully such checks are done by external tools and not by a
> > compiler, as I can imagine a combinatorial explosion of testing
> > what is similar to what, especially in projects with huge headers?
>
> I don't think there'd be an explosion. Any identifier encountered needs
> to be tested against all previous ones anyway (to check if they are the
> same). Unless that is done via a hash map, binary tree or such, this is
> already a number of string comparisons quadratic in the number of strings
> (SDCC AFAIK does currently use a hash map, but into an array of fixed
> size, so we still have asymptotically quadratic effort).
>
> The change would be replacing the strcmp() comparison with a more
> complicated one considering homoglyphs, but in the end, most pairs of
> identifiers would still differ early. So still negligible effort
> compared to other compiler stages.
>
> Philipp
>
> _______________________________________________
> sdcc-devel mailing list
> sdc...@li...
> https://lists.sourceforge.net/lists/listinfo/sdcc-devel