#5145 encoding and string bytelength man pages confusing

obsolete: 8.5.13
open
Jan Nijtmans
5
2012-12-18
2012-12-18
Kieran
No

I hope you don't mind me raising a fairly minor documentation
bug. I couldn't see a category for man pages.

While tracking down a character encoding issue with colleagues,
we encountered confusion about how strings are represented in
Tcl and we felt the following two statements from the man pages
weren't as helpful as they could be ...

From the INTRODUCTION section of the Tcl encoding man page:
( http://www.tcl.tk/man/tcl8.6/TclCmd/encoding.htm )
"Strings in Tcl are encoded using 16-bit Unicode characters"

From the "string bytelength" entry in the COMMAND section
of the Tcl string man page:
( http://www.tcl.tk/man/tcl8.6/TclCmd/string.htm )
string bytelength /string/
"Returns a decimal string giving the number of bytes used to
represent /string/ in memory. Because UTF-8 uses one to three
bytes to represent Unicode characters, the byte length will
not be the same as the character length in general."

These two statements on the face of it seem rather contradictory
(does Tcl use 16-bit Unicode or UTF-8?), unless you happen to
understand how values are really stored internally in Tcl
(in which case you probably don't need the man pages!).
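
To make the distinction concrete, here is a minimal sketch, assuming
a Tcl 8.5 or 8.6 interpreter (where string bytelength is still
available). The variable names are only for illustration. It compares
the character count of a string with the byte count that
string bytelength reports:

```tcl
# Illustrative only: assumes Tcl 8.5 or 8.6, where [string bytelength] exists.
set s "caf\u00e9"                  ;# 4 characters; the last one is U+00E9

set chars [string length $s]       ;# number of characters
set bytes [string bytelength $s]   ;# number of bytes in the internal UTF-8 form

# Converting explicitly to UTF-8 gives a byte array; its length is the
# number of UTF-8 bytes, which matches here since U+00E9 needs two bytes.
set utf8len [string length [encoding convertto utf-8 $s]]

puts "characters: $chars, bytelength: $bytes, utf-8 bytes: $utf8len"
# prints: characters: 4, bytelength: 5, utf-8 bytes: 5
```

So the script sees four characters, while string bytelength counts
the bytes of a UTF-8 form of the same value.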

My understanding is that a Tcl value can be stored internally
in one or two of the following ways:

- a non-string representation (e.g. an IEEE float)
- a UTF-8-like encoded string (not strictly UTF-8: NUL is stored
  as the two bytes 0xC0 0x80)
- a UCS-2 encoded string
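
The "UTF-8-like" point about NUL can be observed from a script. This
is a sketch under the assumption of a Tcl 8.5 or 8.6 interpreter, and
the variable names are hypothetical. It compares the internal byte
count of a single NUL character with the length of the same character
after an explicit conversion to the utf-8 encoding:

```tcl
# Illustrative only: assumes Tcl 8.5 or 8.6.
set nul "\u0000"                   ;# a string holding one NUL character

# Bytes in Tcl's internal UTF-8-like form of the value.
set internal [string bytelength $nul]

# Bytes after an explicit conversion to the external utf-8 encoding.
set external [string length [encoding convertto utf-8 $nul]]

puts "characters: [string length $nul]"
puts "internal bytes: $internal, converted utf-8 bytes: $external"
```

The internal count is 2 for a single NUL, which is why the internal
form is better described as UTF-8-like than as UTF-8.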

But how strings are /encoded/ isn't actually very relevant
to the encoding man page - more important is how they
appear to the Tcl programmer.

So perhaps it would be better if the Tcl encoding man page
rather than saying:
"Strings in Tcl are encoded using 16-bit Unicode characters"
instead said something like:
"Strings in Tcl appear to the programmer as a sequence of
16-bit Unicode code points"
or simply:
"Strings in Tcl consist of 16-bit Unicode characters"

And as for the string bytelength man page, instead of saying:
"[it] returns a decimal string giving the number of bytes used
to represent /string/ in memory",
it would seem to be more accurate to say:
"[it] returns a decimal string giving the number of bytes that
Tcl would use to represent /string/ in memory using Tcl's
internal UTF-8 encoding"

I realise that since Tcl 8.5 the string man page does advise
against the use of string bytelength, but that doesn't help
someone looking at existing code that uses it and trying to
figure out what it's doing!

Thanks.

Discussion