#2 UTF-16LE encoding corrupts binary data

Status: open-invalid
Owner: nobody
Labels: None
Priority: 5
Updated: 2007-10-18
Created: 2007-10-09
Creator: Anonymous
Private: No

Because Java Strings are backed by a char[] array, converting to and from a byte[] array corrupts the data whenever a value cannot be mapped to a valid Unicode sequence.
The correct way to handle binary-encoded strings is to access the char[] array directly.

Working example (JIUtil.java, deserializeData):

// BSTR decoding
if ((FLAG & JIFlags.FLAG_REPRESENTATION_STRING_BSTR) == JIFlags.FLAG_REPRESENTATION_STRING_BSTR) {
    // Read for user
    ndr.readUnsignedLong(); // eating max length
    ndr.readUnsignedLong(); // eating length in bytes
    int actuallength = ndr.readUnsignedLong();
    char[] buffer = new char[actuallength];
    int i = 0;
    while (i < actuallength) {
        // retVal and retString are declared earlier in deserializeData
        retVal = ndr.readUnsignedShort(); // one 16-bit code unit, copied as-is
        buffer[i] = (char) retVal;
        i++;
    }
    retString = new String(buffer); // wraps the char[] without any charset conversion
}
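
For contrast, a minimal self-contained sketch (not taken from the library; the class name is illustrative) of why a byte[] -> String -> byte[] round trip through UTF-16LE loses data: the decoder replaces malformed input, such as an unpaired surrogate code unit, with U+FFFD, so the original bytes cannot always be recovered.

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf16leRoundTrip {
    public static void main(String[] args) {
        // 0x00 0xD8 is the little-endian code unit 0xD800, an unpaired
        // surrogate; the String decoder replaces it with U+FFFD.
        byte[] original = {0x00, (byte) 0xD8, 0x41, 0x00};

        String decoded = new String(original, StandardCharsets.UTF_16LE);
        byte[] encoded = decoded.getBytes(StandardCharsets.UTF_16LE);

        System.out.println(Arrays.equals(original, encoded)); // prints false
    }
}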

Discussion

  • Logged In: YES
    user_id=214683
    Originator: NO

    Which place in this file are you looking at? All BSTR strings are converted using UTF-16LE encoding.

    Thanks,
    best regards,
    Vikram

     
    • status: open --> closed-invalid
     
    • status: closed-invalid --> open-invalid
     
  • Logged In: NO

    The code is in JIUtil.java, deserializeData.
    The conversion to UTF-16LE does NOT preserve the data.
    The code I posted works because it does not alter the data that was sent.
    I work with an application that sends data compressed with ZLib in a BSTR. After the conversion to UTF-16LE, the data could not be decompressed because it had been altered.
    Why should raw data be converted? Bytes sent on one side should be preserved on the other side.
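
    An illustrative sketch of that failure mode (mine, not the reporter's code), assuming a seeded random payload and java.util.zip as the compressor; for a payload this size the UTF-16LE round trip almost always damages the zlib stream:

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;
    import java.util.Random;
    import java.util.zip.DataFormatException;
    import java.util.zip.Deflater;
    import java.util.zip.Inflater;

    public class ZlibThroughBstr {
        public static void main(String[] args) {
            // Arbitrary binary payload (seeded so the run is repeatable).
            byte[] payload = new byte[4096];
            new Random(42).nextBytes(payload);

            // Compress with zlib; the buffer is large enough for this payload.
            Deflater deflater = new Deflater();
            deflater.setInput(payload);
            deflater.finish();
            byte[] compressed = new byte[8192];
            int compressedLength = deflater.deflate(compressed);
            deflater.end();

            // Simulate the BSTR decode/encode cycle: bytes -> String -> bytes.
            String bstr = new String(compressed, 0, compressedLength, StandardCharsets.UTF_16LE);
            byte[] received = bstr.getBytes(StandardCharsets.UTF_16LE);

            // Try to decompress what came out of the round trip.
            Inflater inflater = new Inflater();
            inflater.setInput(received);
            byte[] out = new byte[8192];
            try {
                int n = inflater.inflate(out);
                boolean intact = n == payload.length
                        && Arrays.equals(Arrays.copyOf(out, n), payload);
                System.out.println(intact ? "payload survived" : "payload was altered");
            } catch (DataFormatException e) {
                System.out.println("corrupted zlib stream: " + e);
            } finally {
                inflater.end();
            }
        }
    }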

     
  • Logged In: YES
    user_id=214683
    Originator: NO

    Two things here. First, what encoding would you use to correctly deserialize the data? It would fall back to the system encoding, and that would be completely wrong, since BSTR is encoded as Unicode (UTF-16LE) by Windows. Second, even though it says binary String, please don't use it to send binary data; use a Variant with a byte array, which is never subjected to encoding.

    best regards,
    Vikram

     
  • Logged In: NO

    I totally agree with you about not using BSTR to carry binary data. But DCOM is often layered onto existing C implementations or used simply to wrap DLL interfaces. I have also seen ActiveX components that handle a BSTR as a char*, expecting a trailing NUL character...
    Therefore, I think the best thing to do is to stay neutral regarding the encoding. This means that instead of using String(byte[]) and String.getBytes(), which perform a data transformation, you simply use String(char[]) and String.toCharArray(), which just do a plain data copy (arraycopy).
    Thus, if a UTF-16LE encoded BSTR is sent by Windows, it will automatically end up as a UTF-16LE encoded Java String. But if the original encoding is unknown, you do not break the data, and the developer remains free to handle the encoding himself. I think that this way you conform to the general contract of BSTR without enforcing a data transformation.
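
    A small sketch (standard Java, nothing library-specific) of the copy-only behaviour this argument relies on: String(char[]) and toCharArray() copy the 16-bit units verbatim, even values in the surrogate range that no charset round trip would preserve.

    import java.util.Arrays;

    public class CharCopyIsNeutral {
        public static void main(String[] args) {
            // 0xD800 is an unpaired surrogate: invalid as text, but just another
            // 16-bit value as far as char[] and String(char[]) are concerned.
            char[] raw = {0xD800, 0x0041, 0x20AC};

            String wrapped = new String(raw);    // plain array copy, no charset involved
            char[] back = wrapped.toCharArray(); // plain array copy, no charset involved

            System.out.println(Arrays.equals(raw, back)); // prints true
        }
    }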

     
  • Logged In: YES
    user_id=214683
    Originator: NO

    Yes, that is all understood, but I don't see how the String gets automatically wrapped into UTF-16LE when the character array is passed to it, and the String is then accessed by the developer. It will use the system encoding to interpret it, and we are back to square one.

    What I can do for you is change the logic so that you get direct access to the byte[], but this will take some time, maybe a month or so. Until then, please use your modified build.

    Thanks,
    best regards,
    Vikram
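
    A short sketch of that "square one" point (standard Java, not j-Interop code): once the developer calls getBytes() with no charset argument, the platform default encoding decides what comes out, not the bytes that were on the wire.

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    public class DefaultCharsetTrap {
        public static void main(String[] args) {
            // Suppose these UTF-16LE bytes arrived on the wire for a BSTR.
            byte[] wire = {(byte) 0xE9, 0x00, 0x41, 0x00}; // 'é', 'A'
            String s = new String(wire, StandardCharsets.UTF_16LE);

            // getBytes() with no argument uses the platform default charset,
            // so the result varies from system to system.
            System.out.println("default charset: " + Charset.defaultCharset());
            System.out.println(Arrays.toString(s.getBytes()));                          // platform dependent
            System.out.println(Arrays.toString(s.getBytes(StandardCharsets.UTF_16LE))); // [-23, 0, 65, 0]
        }
    }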