From: <apn...@ya...> - 2025-11-11 15:16:32
Not true any more, though it was true for Tcl 8. Java's MUTF-8 encodes a non-BMP code point as a pair of surrogates, each encoded as a 3-byte UTF-8 unit. Tcl 9 encodes it as a standard 4-byte UTF-8 sequence. Nor does it use CESU-8 for external data.

Someone should correct Wikipedia 😊

From: Andreas Kupries <aku...@su...>
Sent: Tuesday, November 11, 2025 6:36 PM
To: Pietro Cerutti <ga...@ga...>; apn...@ya...; Tcl Core List <tcl...@li...>
Subject: Re: [TCLCORE] Manpage updates for review

Note https://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8

> Java internally uses UTF-16 for the char data type and, consequentially,
> the Character, String, and the StringBuffer classes, but for I/O uses
> Modified UTF-8 (MUTF-8), in which the null character U+0000 uses the
> two-byte overlong encoding 0xC0, 0x80, instead of just 0x00.

And:

> Tcl also uses the same modified UTF-8 as Java for internal
> representation of Unicode data, but uses strict CESU-8 for external data.

(https://en.wikipedia.org/wiki/CESU-8)

On Tue, Nov 11, 2025 at 2:00 PM Pietro Cerutti via Tcl-Core <tcl...@li...> wrote:

> On Nov 11 2025, 05:33 +0000, apnmbx-public--- via Tcl-Core <tcl...@li...> wrote:
>
> > The branch apn-doc-update contains manpage updates addressing two areas:
> >
> > ● added a section in Tcl.n that defines a Tcl string value as a
> >   sequence of Unicode code points.
> > ● updates to various command and C API pages that wrongly identify
> >   Tcl's internal format as UTF-8. For this purpose the encoding name
> >   TUTF-8 has been introduced to reference Tcl's internal modified
> >   UTF-8 format.
> >
> > Reviews appreciated and improvements welcome. Both have been a pet
> > peeve with me for a long time (and probably no one else!) in that the
> > first is important missing information and the second is
> > misinformation.
>
> Would it make sense to describe TUTF-8 in its own dedicated man page and
> refer to it, instead of duplicating the description across different
> man pages?
>
> --
> Pietro Cerutti

--
Andreas Kupries - SUSE Software Solutions Germany GmbH, Frankenstr. 146, 90461 Nürnberg, Germany, www.suse.com
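
For readers following the thread, the difference between the two encodings of a non-BMP code point described in the reply at the top can be made concrete with a small sketch. This is an illustration only, written in Python rather than taken from the Tcl or Java sources; the function names `standard_utf8` and `surrogate_pair_utf8` are hypothetical, and the second function only mimics the surrogate-pair form (as in CESU-8 and Java's MUTF-8), without the 0xC0 0x80 handling of U+0000.

```python
def standard_utf8(cp: int) -> bytes:
    """Encode a code point as standard UTF-8 (4 bytes for non-BMP)."""
    return chr(cp).encode("utf-8")


def surrogate_pair_utf8(cp: int) -> bytes:
    """Encode a code point the surrogate-pair way.

    BMP code points are encoded as ordinary UTF-8. A non-BMP code point
    is first split into a UTF-16 surrogate pair, and each surrogate is
    then encoded as its own 3-byte unit, giving 6 bytes in total.
    Hypothetical helper for illustration; it does not special-case U+0000.
    """
    if cp < 0x10000:
        return chr(cp).encode("utf-8")
    v = cp - 0x10000
    high = 0xD800 | (v >> 10)    # high (lead) surrogate
    low = 0xDC00 | (v & 0x3FF)   # low (trail) surrogate
    # "surrogatepass" lets Python emit the 3-byte form of a lone surrogate,
    # which strict UTF-8 encoding would reject.
    return b"".join(
        chr(s).encode("utf-8", "surrogatepass") for s in (high, low)
    )


SMILEY = 0x1F60A  # the emoji used in the reply above

print(standard_utf8(SMILEY).hex())        # f09f988a
print(surrogate_pair_utf8(SMILEY).hex())  # eda0bdedb88a
```

The first output is the 4-byte form that the reply attributes to Tcl 9; the second is the 6-byte surrogate-pair form that the reply attributes to Java's MUTF-8.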