#2 UTF-8 String Incorrectly Read

Status: open
Owner: nobody
Priority: 5
Updated: 2007-01-18
Created: 2007-01-18
Creator: Anonymous
Private: No

The following UTF-8 string gets truncated such that the last 'd' is omitted.

Åland

The exact Hessian protocol data can be found in the attached file.

Note that this is merely an example of one UTF-8 string that was not correctly interpreted.

Discussion

  • Nobody/Anonymous

    Hessian Protocol (sniffed packet data)

     
  • vatel

    vatel - 2007-09-28

    Logged In: YES
    user_id=1781039
    Originator: NO

    We have this problem with UTF-8 strings too.
    I investigated why it happens: when a String is read with ByteArray.readUTF(), Flex expects the length prefix to be a count of BYTES, but according to the Hessian serialization protocol the string length is a count of CHARACTERS. The two counts differ as soon as a string contains non-ASCII characters: "Åland" is 5 characters but 6 bytes in UTF-8, because "Å" takes 2 bytes, so reading 5 bytes drops the final 'd'.

    http://livedocs.adobe.com/flex/2/langref/flash/utils/ByteArray.html#readUTF()
    http://hessian.caucho.com/doc/hessian-serialization.html##string
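
The mismatch described above can be reproduced outside Flex. The following is a minimal Python sketch, not Flex code; the variable names are illustrative, and it only assumes that the length prefix holds a character count while the reader treats it as a byte count.

```python
# "Åland", with the first letter written as an escape so the source encoding does not matter.
text = "\u00c5land"
encoded = text.encode("utf-8")

char_count = len(text)     # what a character-based length prefix would carry
byte_count = len(encoded)  # what a byte-based reader would need

# Treat the character count as a byte count, as a byte-based reader would.
truncated = encoded[:char_count].decode("utf-8")

print(char_count, byte_count, truncated)
```

Every non-ASCII character widens the gap between the two counts, so longer strings with more accented letters lose more trailing characters.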

    Ideas how this can be fixed:

    1) Do not use Flex's readUTF(). Instead, read the data byte by byte with ByteArray.readByte() and run it through a UTF-8 decoder, counting characters as they are decoded until the declared length is reached.
    For example, you could port Java's UTF-8 decoder to Flex:
    https://openjdk.dev.java.net/source/browse/openjdk/jdk/trunk/j2se/src/share/classes/sun/nio/cs/UTF_8.java?rev=227&view=markup
    There is also another decoder in GNU libc (see iconv/gconv_simple.c in glibc).

    2) It seems that each String or string part (chunk) in Hessian is terminated by a zero byte (0x00), which could serve as an end-of-string mark.
    If so, you can count the bytes up to that mark and then call ByteArray.readUTFBytes() in Flex, which takes the number of bytes to read as its parameter.
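
Idea 1 above could look roughly like the following. This is a hedged Python sketch rather than a Flex implementation: `read_utf8_chars` and `_read_continuation` are hypothetical names, `io.BytesIO` stands in for a ByteArray, and the decoder is assumed to need only 1-, 2- and 3-byte sequences (no 4-byte sequences or surrogate-pair handling).

```python
import io


def _read_continuation(stream):
    """Read one UTF-8 continuation byte (10xxxxxx) and return its 6 payload bits."""
    raw = stream.read(1)
    if not raw:
        raise EOFError("stream ended inside a multi-byte sequence")
    if (raw[0] & 0xC0) != 0x80:
        raise ValueError("invalid UTF-8 continuation byte")
    return raw[0] & 0x3F


def read_utf8_chars(stream, char_count):
    """Decode exactly char_count characters, consuming however many bytes that takes."""
    chars = []
    for _ in range(char_count):
        raw = stream.read(1)
        if not raw:
            raise EOFError("stream ended before all characters were read")
        lead = raw[0]
        if lead < 0x80:                  # 0xxxxxxx: single-byte (ASCII)
            code = lead
        elif (lead & 0xE0) == 0xC0:      # 110xxxxx: two-byte sequence
            code = ((lead & 0x1F) << 6) | _read_continuation(stream)
        elif (lead & 0xF0) == 0xE0:      # 1110xxxx: three-byte sequence
            high = _read_continuation(stream)
            low = _read_continuation(stream)
            code = ((lead & 0x0F) << 12) | (high << 6) | low
        else:
            raise ValueError("unsupported UTF-8 lead byte")
        chars.append(chr(code))
    return "".join(chars)


# Usage: 5 characters are requested; 6 bytes are consumed, and the rest stays in the stream.
payload = io.BytesIO("\u00c5land".encode("utf-8") + b"next")
name = read_utf8_chars(payload, 5)
remainder = payload.read()
```

Because the loop counts decoded characters rather than bytes, the stream position ends up exactly at the start of whatever follows the string, which is what a character-based length prefix requires.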

     
