From: SourceForge.net <no...@so...> - 2003-07-06 01:15:28
Bugs item #735364, was opened at 2003-05-09 21:05
Message generated for change (Comment added) made by setok
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=110894&aid=735364&group_id=10894

Category: 12. ByteArray Object
Group: = 8.4.2
Status: Open
Resolution: None
Priority: 5
Submitted By: Kristoffer Lawson (setok)
Assigned to: Donal K. Fellows (dkf)
Summary: Imprecise description of binary scan char 'a'

Initial Comment:
The manpage states of the scan character 'a', "The data is a
character string of length _count_". However it does not specify
what this length is referring to. Does it mean number of characters
or number of bytes? As it talks of a character string, this would
lead one to believe that it means number of characters, yet the
implementation apparently is for number of bytes.

If it actually means bytes, this should be clearly mentioned and
instead of 'character string' the term 'byte array', or something
similar, might be more appropriate.

----------------------------------------------------------------------

>Comment By: Kristoffer Lawson (setok)
Date: 2003-07-06 04:15

Message:
Logged In: YES
user_id=137542

I'm getting confused with the discussion here. Isn't it just easiest
to document the 'a' specifier as taking a count of bytes? Assuming
the string just contains a byte array. Why does one need to bother
about encoding? Take whatever is there directly as a byte array.
That's at least exactly the behaviour I would want ... IMO strings
are always byte arrays! Just that one character might use several
bytes.

----------------------------------------------------------------------

Comment By: Donal K. Fellows (dkf)
Date: 2003-07-04 15:06

Message:
Logged In: YES
user_id=79902

Strictly, the behaviour of [binary scan] (or any other code that
converts strings to ByteArrayObjs) is only fully defined when the
input string only contains characters in the range \u0000-\u00FF.
Strings are not byte arrays, but byte arrays can be encoded in
strings. We do not define what encoding is used with the 'a'
[binary scan] specifier; perhaps we should (I think we use
ISO8859-1, though [encoding system] would also be reasonable.)

----------------------------------------------------------------------

Comment By: Pat Thoyts (patthoyts)
Date: 2003-05-13 12:32

Message:
Logged In: YES
user_id=202636

Let's illustrate this:

  set s "\u266b\u266a"   ;# two unicode characters.
  string length $s    -> 2
  string bytelen $s   -> 6 (ok counting nul terminator as well)
  binary scan $s c* r -> 1
  set r -> 107 106    - so just the low byte of each character
  binary scan $s a* r -> 1
  set r -> kj         - ascii representation of the low byte of
                        each char.

Maybe I'm missing something to do with encodings?

----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody)
Date: 2003-05-12 19:57

Message:
Logged In: NO

If bytes are characters then this too has to be defined and a new
name invented for what I would consider characters. Normally, with
wide characters and Unicode I would not consider characters to be
the same as bytes. One unicode character can use up more than one
byte.

----------------------------------------------------------------------

Comment By: Donal K. Fellows (dkf)
Date: 2003-05-12 11:42

Message:
Logged In: YES
user_id=79902

Bytes are characters. (I believe the conversion to byte array
truncates...)

----------------------------------------------------------------------
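
The character/byte distinction debated in this thread can be made concrete with a small sketch. It is only an illustration, not a statement of how [binary scan] is or should be documented: it assumes UTF-8 as the explicit encoding (the thread does not settle on one), and the variable names (nchars, bytes, nbytes, first, hex) are made up for the example. Converting explicitly with [encoding convertto] keeps every value handed to [binary scan] within the \u0000-\u00FF range, where its behaviour is said above to be fully defined.

```tcl
# The example string from this thread: two characters outside \u0000-\u00FF.
set s "\u266b\u266a"

# Number of characters in the string.
set nchars [string length $s]

# Choose an encoding explicitly (UTF-8 here is an assumption of this
# sketch) so that the result is a byte array, one character per byte.
set bytes [encoding convertto utf-8 $s]
set nbytes [string length $bytes]

# On a byte array the count given to 'a' is unambiguous: 3 means 3 bytes,
# which happens to be the whole UTF-8 encoding of the first character.
binary scan $bytes a3 first
binary scan $first H* hex

puts "characters:  $nchars"   ;# 2
puts "utf-8 bytes: $nbytes"   ;# 6
puts "first 3 bytes (hex): $hex"   ;# e299ab
```

With the conversion done up front, the count passed to 'a' and the number of bytes consumed always agree, whichever wording the manpage ends up using.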