From: Philipp K. K. <pk...@sp...> - 2025-10-21 20:03:19
IMO, dealing with Unicode is quite complicated, and not something I want to go into too deeply. After all, we are building a compiler, not a Unicode tool. But building a compiler these days requires some Unicode functionality: in particular, well-formedness checks (we have them), normalization (we don't have that), and checking of character properties (we don't have that, except for the trivial stuff).

In particular, an identifier in C23 starts with a character that has the XID_Start property or '_' (or maybe '$'), followed by any number of characters with the XID_Continue property (or maybe '$'). Two identifiers are equivalent (ignoring the details about significant characters) if they are equal in Unicode Normalization Form C (which is defined as canonical decomposition followed by canonical composition).

The details of all this keep changing with Unicode standard updates. I don't want to implement or maintain those utilities, so I suggest we use an existing library. Due to its wide availability (it is not just part of typical GNU/Linux distributions, but also available as an MSYS2 package for MinGW, packaged for OpenBSD, FreeBSD, etc.), I suggest using GNU libunistring.

Philipp
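
To make the identifier rule above concrete, here is a minimal sketch in plain C. It is not the compiler's existing code, and all names in it (utf8_decode, is_identifier, cp_pred, ascii_start, ascii_continue) are hypothetical. The property checks are passed in as function pointers; the ASCII-only stand-ins exist only so the sketch compiles without any library. With libunistring, the predicates could presumably be thin wrappers around uc_is_property_xid_start and uc_is_property_xid_continue from <unictype.h>. The sketch leaves out '$' and does not do NFC normalization, which is the part that really needs the library's tables.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical predicate type: nonzero if code point cp has the property. */
typedef int (*cp_pred)(uint32_t cp);

/* Decode one UTF-8 sequence from s[0..n). Returns its length in bytes (1-4)
   and stores the code point in *cp, or returns 0 if the input is ill-formed
   (truncated, overlong, a surrogate, or above U+10FFFF). */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    size_t len, i;
    uint32_t c, min;

    if (n == 0)
        return 0;
    if (s[0] < 0x80) {
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {
        len = 2; c = s[0] & 0x1F; min = 0x80;
    } else if ((s[0] & 0xF0) == 0xE0) {
        len = 3; c = s[0] & 0x0F; min = 0x800;
    } else if ((s[0] & 0xF8) == 0xF0) {
        len = 4; c = s[0] & 0x07; min = 0x10000;
    } else {
        return 0; /* stray continuation byte or invalid lead byte */
    }
    if (n < len)
        return 0;
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;
        c = (c << 6) | (uint32_t)(s[i] & 0x3F);
    }
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;
    *cp = c;
    return len;
}

/* Returns 1 if s[0..n) is well-formed UTF-8 and has the shape
   (start | '_') continue*, and 0 otherwise. */
int is_identifier(const unsigned char *s, size_t n,
                  cp_pred is_start, cp_pred is_continue)
{
    size_t pos = 0;
    int first = 1;

    while (pos < n) {
        uint32_t cp;
        size_t len = utf8_decode(s + pos, n - pos, &cp);

        if (len == 0)
            return 0;
        if (first) {
            if (cp != '_' && !is_start(cp))
                return 0;
        } else if (!is_continue(cp)) {
            return 0;
        }
        first = 0;
        pos += len;
    }
    return !first; /* the empty string is not an identifier */
}

/* ASCII-only stand-ins (assumption: for illustration only; the real
   XID_Start / XID_Continue sets are far larger and change between
   Unicode versions). */
int ascii_start(uint32_t cp)
{
    return (cp >= 'A' && cp <= 'Z') || (cp >= 'a' && cp <= 'z');
}

int ascii_continue(uint32_t cp)
{
    return ascii_start(cp) || cp == '_' || (cp >= '0' && cp <= '9');
}
```

For example, is_identifier((const unsigned char *)"_foo1", 5, ascii_start, ascii_continue) accepts the name, while "1foo" is rejected because a digit is only allowed in continue position. Swapping in library-backed predicates would not change the scanning loop, which is the point of keeping the Unicode tables out of the compiler.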