#2328 Imprecise description of binary scan char 'a'

obsolete: 8.4.2
closed-fixed
5
2003-07-11
2003-05-09
No

The manpage states of the scan character 'a', "The data
is a character string of length _count_". However it
does not specify what this length is referring to. Does
it mean number of characters or number of bytes? As it
talks of a character string, this would lead one to
believe that it means number of characters, yet the
implementation apparently is for number of bytes.

If it actually means bytes, this should be clearly
mentioned and instead of 'character string' the term
'byte array', or something similar, might be more
appropriate.

Discussion

    • summary: Imprecise description of scan char 'a' --> Imprecise description of binary scan char 'a'
     
  • Logged In: YES
    user_id=79902

    Bytes are characters. (I believe the conversion to byte
    array truncates...)

     
  • Logged In: NO

    If bytes are characters then this too has to be defined and
    a new name invented for what I would consider characters.
    Normally, with wide characters and Unicode I would not
    consider characters to be the same as bytes. One Unicode
    character can use up more than one byte.

     
  • Pat Thoyts
    2003-05-13

    Logged In: YES
    user_id=202636

    Let's illustrate this:
    set s "\u266b\u266a" ;# two unicode characters.
    string length $s -> 2
    string bytelen $s -> 6 (ok counting nul terminator as well)
    binary scan $s c* r -> 1
    set r -> 107 106 - so just the low byte of each
    character
    binary scan $s a* r -> 1
    set r -> kj - ascii representation of the low byte
    of each char.

    Maybe I'm missing something to do with encodings?
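
The values 107 and 106 reported above are consistent with each code point being reduced to its low 8 bits. A minimal sketch of that arithmetic follows. It assumes nothing beyond [expr] and [format], and `lowByte` is a hypothetical helper name, not part of Tcl. It does not call [binary scan] itself, so it only mirrors the result shown in the illustration.

```tcl
# Hypothetical helper: keep only the low 8 bits of a code point.
proc lowByte {codePoint} {
    return [expr {$codePoint & 0xff}]
}

foreach cp {0x266b 0x266a} {
    set low [lowByte $cp]
    # Print the code point, its low byte, and the character for that byte.
    puts "[format %04x $cp] -> $low ([format %c $low])"
}
# prints:
#   266b -> 107 (k)
#   266a -> 106 (j)
```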

     
    • assigned_to: nijtmans --> dkf
     
  • Logged In: YES
    user_id=79902

    Strictly, the behaviour of [binary scan] (or any other code
    that converts strings to ByteArrayObjs) is only fully defined
    when the input string only contains characters in the range
    \u0000-\u00FF. Strings are not byte arrays, but byte arrays
    can be encoded in strings.

    We do not define what encoding is used with the 'a' [binary
    scan] specifier; perhaps we should (I think we use ISO8859-1
    though [encoding system] would also be reasonable.)
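
One way to sidestep the undefined case is to choose the encoding explicitly before scanning, so that every character handed to [binary scan] is already in the range \u0000-\u00FF. The sketch below assumes UTF-8 as the encoding, which is an illustrative choice and not something this report specifies. `scanBytes` is a hypothetical helper name.

```tcl
# Hypothetical helper: encode a string explicitly, then read `count`
# bytes from the front of the result with the 'a' specifier.
proc scanBytes {str count {enc utf-8}} {
    set bytes [encoding convertto $enc $str]
    # binary scan returns the number of conversions it performed.
    if {[binary scan $bytes a$count out] != 1} {
        error "fewer than $count bytes available"
    }
    return $out
}

set s "\u266b\u266a"
puts [string length $s]                              ;# 2 characters
puts [string length [encoding convertto utf-8 $s]]   ;# 6 bytes

# The first three bytes are the UTF-8 form of U+266B.
binary scan [scanBytes $s 3] H* hex
puts $hex                                            ;# e299ab
```

With the conversion made explicit, the count given to 'a' is unambiguously a count of bytes in the chosen encoding.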

     
  • Logged In: YES
    user_id=137542

    I'm getting confused with the discussion here. Isn't it just
    easiest to document the 'a' specifier as taking a count of
    bytes, assuming the string just contains a byte array? Why
    does one need to bother about encoding? Take whatever is
    there directly as a byte array. That's at least exactly the
    behaviour I would want ...

    IMO strings are always byte arrays! Just that one character
    might use several bytes.

     
  • Logged In: YES
    user_id=79902

    Documented/tested the current behaviour (in both HEAD and
    8.4 branch.) At least this way we don't need a TIP to
    "improve" things, though it is still an open question
    whether things ought to be the way they are...

     
    • status: open --> closed-fixed