From: Philipp K. K. <pk...@sp...> - 2025-10-22 15:24:57
Am 22.10.25 um 17:00 schrieb "Janko Stamenović" via sdcc-devel:
>
> At the end of the eighties or in the early nineties, there was a
> specification requiring compilers to support trigraphs.
>
> https://en.wikipedia.org/wiki/Digraphs_and_trigraphs_(programming)
>
> "Borland supplied a separate program, the trigraph preprocessor
> (TRIGRAPH.EXE), to be used only when trigraph processing is
> desired (the rationale was to maximise speed of compilation)."
>
> Apparently trigraphs "were not commonly encountered outside
> compiler test suites" and "trigraph support has been removed from
> C as of C23".
>
> In that light, maybe implementing a stand-alone "unicode
> normalization" preprocessor executable could still be the safest
> way to reduce the Unicode library dependencies of the other binaries?

I don't know of anyone ever using trigraphs intentionally. That is a
major difference from non-ASCII identifiers: I don't think non-ASCII
identifiers will go away the way trigraphs did.

Also, trigraphs could appear anywhere in source code, even in string
literals. Since trigraphs are processed early, we just leave them to the
preprocessor (for the standards that have them), and don't need to
handle them in the compiler.

> Re: Philipp: "In the preprocessor, we just have preprocessor
> tokens, we only know which ones are identifiers once we are in the
> lexer, which is part of the compiler"
>
> Wouldn't the "TRIGRAPH.EXE"-like "normalize", which would
> indiscriminately process and normalize the whole text before the
> preprocessor and compiler see it, make the whole of SDCC fully
> standard-compliant even if the wording of the standard changes? If
> somebody tried a test with identifiers that normalize to
> different material, wouldn't everything "just work"? Wouldn't the
> mismatch cause the same errors as if I typed once `int some`
> and later `sme++`?
>
> Wouldn't it then be possible to develop "normalize" fully
> independently of anything else we do?

We need to normalize identifiers.
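Janko's mismatch scenario can be made concrete. The following is a small
Python sketch of the Unicode mechanics (an illustration, not SDCC code):
the same identifier spelled with a precomposed U+00E9 versus with
`e` plus a combining accent looks identical on screen, but compares
unequal until NFC normalization is applied:

```python
import unicodedata

# Two renderings of the identifier "café": precomposed vs. combining.
precomposed = "caf\u00e9"   # 'é' as a single code point (U+00E9)
combining = "cafe\u0301"    # 'e' followed by COMBINING ACUTE ACCENT (U+0301)

# A compiler that does not normalize sees two distinct identifiers, so a
# definition with one spelling and a use with the other fails just like
# `int some;` followed by `sme++;`.
print(precomposed == combining)   # False

# After NFC normalization, both spellings collapse to one identifier.
def nfc(s):
    return unicodedata.normalize("NFC", s)

print(nfc(precomposed) == nfc(combining))   # True
```

So yes, without normalization the mismatch surfaces as an ordinary
"undeclared identifier" error; normalizing makes the two spellings
refer to the same name.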
We probably don't care whether we normalize comments. But IMO we really
shouldn't normalize string literals. Users might have chosen a
particular encoding in there intentionally, and rely on string literals
being unchanged (AFAIK we even support having non-UTF-8 in there).

Normalization by itself is also not enough. We need to check that the
input is valid (i.e. the XID_Start stuff, etc., for identifiers) and
that the input is normalized. So the external normalization doesn't
really help us, since we'd still need those checks in the compiler.

I guess maintaining that extra binary (and requiring users to use it)
would be more hassle than the extra dependency on a library that is
part of every distro anyway.

Philipp
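The two checks mentioned above can be sketched in Python (an
illustration of the Unicode-level rules, not the SDCC implementation;
Python's `str.isidentifier()` happens to test the XID_Start/XID_Continue
properties from UAX #31, which, to my understanding, are also the
properties C23 references for identifiers):

```python
import unicodedata

def check_identifier(ident):
    """Sketch of the two checks a compiler still needs even if the
    input was normalized by an external tool: character validity
    and NFC normalization form."""
    problems = []
    # str.isidentifier() tests the XID_Start/XID_Continue properties
    # from UAX #31 (plus a leading underscore, as in C).
    if not ident.isidentifier():
        problems.append("invalid identifier character (XID_Start/XID_Continue)")
    # C23 additionally requires identifiers to already be in NFC.
    if not unicodedata.is_normalized("NFC", ident):
        problems.append("not in Normalization Form C (NFC)")
    return problems

print(check_identifier("caf\u00e9"))    # []  -- valid and already NFC
print(check_identifier("cafe\u0301"))   # NFC problem: accent not composed
print(check_identifier("1abc"))         # invalid start character
```

This is exactly why the external "normalize" tool wouldn't remove the
library dependency: the validity and is-it-normalized checks still have
to run inside the compiler.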