From: Philipp K. K. <pk...@sp...> - 2025-10-22 15:24:57
Am 22.10.25 um 17:00 schrieb "Janko Stamenović" via sdcc-devel:
>
> At the end of the eighties or in the early nineties, there was a
> specification requiring compilers to support trigraphs.
>
> https://en.wikipedia.org/wiki/Digraphs_and_trigraphs_(programming)
>
> "Borland supplied a separate program, the trigraph preprocessor
> (TRIGRAPH.EXE), to be used only when trigraph processing is
> desired (the rationale was to maximise speed of compilation)."
>
> Apparently trigraphs "were not commonly encountered outside
> compiler test suites" and "trigraph support has been removed from
> C as of C23".
>
> In that light, maybe implementing a stand-alone "unicode
> normalization" preprocessor executable could still be the safest
> way to reduce the Unicode library dependencies of the other binaries?

I don't know of anyone ever using trigraphs intentionally. That is a
major difference from non-ASCII identifiers: I don't think non-ASCII
identifiers will go away the way trigraphs did.

Also, trigraphs could appear anywhere in source code, even in string
literals. Since trigraphs are processed early, we just leave them to the
preprocessor (for the standards that have them), and don't need to
handle them in the compiler.

> Re: Philipp: "In the preprocessor, we just have preprocessor
> tokens, we only know which ones are identifiers once we are in the
> lexer, which is part of the compiler"
>
> Wouldn't the "TRIGRAPH.EXE"-like "normalize", which would
> indiscriminately process and normalize the whole text before the
> preprocessor and compiler see it, make the whole of SDCC fully
> standard-compliant even if the wording of the standard changes? If
> somebody tried a test with identifiers that normalize to
> different material, wouldn't everything "just work"? Wouldn't the
> mismatch cause the same errors as if I typed once `int some`
> and later `sme++`?
>
> Wouldn't it then be possible to develop "normalize" fully
> independently of anything else we do?

We need to normalize identifiers.
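Janko's mismatch scenario can be made concrete. The following is a small
Python sketch of the Unicode mechanics (an illustration, not SDCC code):
the same identifier spelled with a precomposed U+00E9 versus with
`e` plus a combining accent looks identical on screen, but compares
unequal until NFC normalization is applied:

```python
import unicodedata

# Two renderings of the identifier "café": precomposed vs. combining.
precomposed = "caf\u00e9"   # 'é' as a single code point (U+00E9)
combining = "cafe\u0301"    # 'e' followed by COMBINING ACUTE ACCENT (U+0301)

# A compiler that does not normalize sees two distinct identifiers, so a
# definition with one spelling and a use with the other fails just like
# `int some;` followed by `sme++;`.
print(precomposed == combining)   # False

# After NFC normalization, both spellings collapse to one identifier.
def nfc(s):
    return unicodedata.normalize("NFC", s)

print(nfc(precomposed) == nfc(combining))   # True
```

So yes, without normalization the mismatch surfaces as an ordinary
"undeclared identifier" error; normalizing makes the two spellings
refer to the same name.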
We probably don't care whether we normalize comments. But IMO we really
shouldn't normalize string literals. Users might have chosen a
particular encoding in there intentionally, and rely on string literals
being unchanged (AFAIK we even support having non-UTF-8 in there).

Normalization by itself is also not enough. We need to check that the
input is valid (i.e. the XID_Start stuff, etc., for identifiers) and
that the input is normalized. So the external normalization doesn't
really help us, since we'd still need those checks in the compiler.

I guess maintaining that extra binary (and requiring users to use it)
would be more hassle than the extra dependency on a library that is
part of every distro anyway.

Philipp
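The two checks mentioned above can be sketched in Python (an
illustration of the Unicode-level rules, not the SDCC implementation;
Python's `str.isidentifier()` happens to test the XID_Start/XID_Continue
properties from UAX #31, which, to my understanding, are also the
properties C23 references for identifiers):

```python
import unicodedata

def check_identifier(ident):
    """Sketch of the two checks a compiler still needs even if the
    input was normalized by an external tool: character validity
    and NFC normalization form."""
    problems = []
    # str.isidentifier() tests the XID_Start/XID_Continue properties
    # from UAX #31 (plus a leading underscore, as in C).
    if not ident.isidentifier():
        problems.append("invalid identifier character (XID_Start/XID_Continue)")
    # C23 additionally requires identifiers to already be in NFC.
    if not unicodedata.is_normalized("NFC", ident):
        problems.append("not in Normalization Form C (NFC)")
    return problems

print(check_identifier("caf\u00e9"))    # []  -- valid and already NFC
print(check_identifier("cafe\u0301"))   # NFC problem: accent not composed
print(check_identifier("1abc"))         # invalid start character
```

This is exactly why the external "normalize" tool wouldn't remove the
library dependency: the validity and is-it-normalized checks still have
to run inside the compiler.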