
#111 Unicode compression scheme is invalid (at least when using Access 2000 DB)

2.0.7
closed
None
1
2019-03-26
2014-10-27
No

The Unicode compression scheme currently used (based on the standard Unicode compression scheme, implemented in Expand.java) is not the right one. The current implementation gives wrong results if the string contains characters whose value is below 32 (space) or above 127 (i.e. outside the ASCII range). libmdb works correctly in our case.

At least Access 2000 and 2003 database files use Microsoft's own compression scheme, namely UTF16-LE, where a byte is not written if it is zero. http://office.microsoft.com/en-us/access-help/about-compressing-the-data-in-a-text-memo-or-hyperlink-field-mdb-HP005260416.aspx gives some hint about Microsoft's implementation. For reference, libmdb's version, residing in iconv.c, looks like this:

/* Uncompress 'Unicode Compressed' string into tmp */
if (!IS_JET3(mdb) && (slen>=2)
 && ((src[0]&0xff)==0xff) && ((src[1]&0xff)==0xfe)) {
    unsigned int compress=1;
    src += 2;
    slen -= 2;
    tmp = (char *)g_malloc(slen*2);
    while (slen) {
        if (*src == 0) {
            compress = (compress) ? 0 : 1;
            src++;
            slen--;
        } else if (compress) {
            tmp[tlen++] = *src++;
            tmp[tlen++] = 0;
            slen--;
        } else if (slen >= 2){
            tmp[tlen++] = *src++;
            tmp[tlen++] = *src++;
            slen-=2;
        }
    }
}
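
For comparison, here is a minimal Java sketch of the same decompression idea (the class and method names are mine, not Jackcess's actual implementation): it assumes the 0xFF 0xFE header has already been stripped, treats the remaining bytes as UTF-16 LE with the high byte omitted while in "compressed" mode, and flips modes on each 0x00 byte.

~~~~~~~
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class AccessCompressedTextSketch {

    /** Expands Access "Unicode compressed" bytes (0xFF 0xFE header already removed) into a String. */
    static String expand(byte[] src) {
        ByteArrayOutputStream utf16 = new ByteArrayOutputStream();
        boolean compressed = true;            // stream starts in compressed mode
        int pos = 0;
        while (pos < src.length) {
            if (src[pos] == 0x00) {
                compressed = !compressed;     // a 0x00 byte toggles the mode
                pos++;
            } else if (compressed) {
                utf16.write(src[pos++]);      // the stored low byte...
                utf16.write(0x00);            // ...plus the implied 0x00 high byte
            } else {
                utf16.write(src[pos++]);      // full code unit: low byte,
                utf16.write(pos < src.length ? src[pos++] : 0x00);  // then high byte
            }
        }
        return new String(utf16.toByteArray(), StandardCharsets.UTF_16LE);
    }
}
~~~~~~~

For example, feeding this sketch the bytes { 0x62, 0x6F, 0x62, 0x00, 0x42, 0x30 } yields "bob" followed by U+3042, since the 0x00 switches the remaining bytes into plain UTF-16 LE mode.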

Discussion

  • Markus Sunela - 2014-10-27

    Here is a patch that fixed our problem. It is not the most efficient implementation of the algorithm, but it should give the idea.

    ~~~~~~~
    --- ColumnImpl.java
    +++ ColumnImpl.java
    @@ -1110,8 +1110,6 @@
    String decodeTextValue(byte[] data)
    throws IOException
    {
    - try {
    -
    // see if data is compressed. the 0xFF, 0xFE sequence indicates that
    // compression is used (sort of, see algorithm below)
    boolean isCompressed = ((data.length > 1) &&
    @@ -1119,9 +1117,6 @@
    (data[1] == TEXT_COMPRESSION_HEADER[1]));

       if(isCompressed) {
    

    -
    - Expand expander = new Expand();
    -
    // this is a whacky compression combo that switches back and forth
    // between compressed/uncompressed using a 0x00 byte (starting in
    // compressed mode)
    @@ -1129,39 +1124,32 @@
    // start after two bytes indicating compression use
    int dataStart = TEXT_COMPRESSION_HEADER.length;
    int dataEnd = dataStart;
    + int dataPos = dataStart;
    boolean inCompressedMode = true;
    - while(dataEnd < data.length) {
    - if(data[dataEnd] == (byte)0x00) {
    -
    - // handle current segment
    - decodeTextSegment(data, dataStart, dataEnd, inCompressedMode,
    - expander, textBuf);
    + byte[] tmp = new byte[2];
    + ByteBuffer buffer = ByteBuffer.wrap(tmp);
    + while(dataPos < data.length) {
    + if(data[dataPos] == (byte)0x00) {
    + dataPos++;
    inCompressedMode = !inCompressedMode;
    - ++dataEnd;
    - dataStart = dataEnd;
    -
    + } else if (inCompressedMode) {
    + tmp[0] = data[dataPos++];
    + tmp[1] = 0;
    + textBuf.append(getCharset().decode(buffer));
    + buffer.rewind();
    } else {
    - ++dataEnd;
    + tmp[0] = data[dataPos++];
    + tmp[1] = data[dataPos++];
    + textBuf.append(getCharset().decode(buffer));
    + buffer.rewind();
    }
    }
    - // handle last segment
    - decodeTextSegment(data, dataStart, dataEnd, inCompressedMode,
    - expander, textBuf);
    -
    return textBuf.toString();

       }
    
       return decodeUncompressedText(data, getCharset());
    
    - } catch (IllegalInputException e) {
    - throw (IOException)
    - new IOException("Can't expand text column").initCause(e);
    - } catch (EndOfInputException e) {
    - throw (IOException)
    - new IOException("Can't expand text column").initCause(e);
    - }
      }

    /**
    * Decodes a segment of a text value into the given buffer according to the
    ~~~~~~~

     
  • James Ahlborn - 2014-10-27

    Can you please attach an example database which has problem data?

     
  • Markus Sunela - 2014-10-29

    I attached a small database containing Latin-1 strings, control characters, ancient Greek, Russian, and Japanese text. The other attachment contains a screenshot from Access showing the actual data.

    The test code used was

    Database db = DatabaseBuilder.open(new File(param[0]));
    Table table = db.getTable("Table");
    FileOutputStream fos = new FileOutputStream("output.txt");
    BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(fos, "UTF-8"));
    for (Row row : table)
        writer.append(row.getString("Unicode") + "\n");
    db.close();
    writer.close();
    

    Using Jackcess 2.0.6 the output is:

    only ascii chars
    ääkkösiä
    Мир
    ϰαὶ τότ' ἐγὼ Κύϰλωπα
    漢字仮名交じり文
    3L92ßAB(᪥‡!V
    ümlaut

    Using the patched version (which does not use the standard Unicode compression scheme, but the one that just skips the zero bytes) the output becomes:

    only ascii chars
    ääkkösiä
    Мир
    ϰαὶ τότ' ἐγὼ Κύϰλωπα
    漢字仮名交じり文
    3L952_AB(¥!!V
    ümlaut

    Basically everything works as it should, except the second-to-last string, which contains some control characters (in full: 3L9<gs>52<stx>_AB(¥<enq>!!V).

     

    Last edit: Markus Sunela 2014-10-29
  • James Ahlborn - 2014-11-15
    • status: open --> accepted
    • assigned_to: James Ahlborn
    • Group: Unassigned --> 2.0.7
     
  • James Ahlborn - 2014-11-15

    Sorry that it took me a while to fix this. The changes were pretty simple, but I needed to do some work to verify that they were correct (and just didn't have the time). Fixed in trunk; will be in the 2.0.7 release.

     

    Last edit: James Ahlborn 2019-03-19
  • James Ahlborn - 2014-11-15
    • status: accepted --> closed
     
  • Gord Thompson - 2014-11-17

    I just did a test with "jackcess-2.0.7-SNAPSHOT.jar" (SVN revision 893) because I wanted to check the Unicode compression behaviour when a Text field contained a mix of compressible and non-compressible characters. I created a new Access 2010 database with a table named "Table1":

    ID - AutoNumber, Primary Key  
    Source - Text(10), Unicode_compression=Yes  
    TextField - Text(255), Unicode_compression=Yes
    

    TableDesign.png

    I created a new row in Access Datasheet view with

    Source: Access
    TextField: Greek: γιορτή means "feast"

    I closed Access and ran the following using my snapshot build of 2.0.7

    Table table = db.getTable("Table1");
    table.addRow(Column.AUTO_NUMBER, "Jackcess", "Greek: γιορτή means \"feast\"");
    

    When I open the database in Access the new row looks the same,

    Datasheet.png

    but when I open it in HexEdit the "TextField" for the new row is not compressed (when the one created by Access is).

    HexEdit.png

    I also notice that the new TextField value does not begin with the FF FE prefix.

     
    • James Ahlborn - 2014-11-17

      Awesome, I'll take all the extra testing on this I can get!

      What you are seeing, however, is that Jackcess only does "simple" compression. The Unicode compression scheme allows for a mix of compressed and uncompressed sequences. Jackcess can read the complex format, but will only write the simple format (simple in this case is "only compress if all characters in the string are compressible"). Your test string contains a mix of compressible and non-compressible characters, so Jackcess is not compressing it. In fact, this behavior has not changed; you will see the same behavior with an older version of Jackcess (or at least, you should!).
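
      A minimal sketch of that all-or-nothing check, assuming the compressible range is U+0001 through U+00FF (illustrative only, not the actual Jackcess code):

      ~~~~~~~
      // Illustrative: the "simple" writer compresses only when this returns true.
      static boolean isFullyCompressible(CharSequence text) {
          for (int i = 0; i < text.length(); i++) {
              char c = text.charAt(i);
              if (c < 0x0001 || c > 0x00FF) {
                  return false;   // a single incompressible char disables compression
              }
          }
          return true;
      }
      ~~~~~~~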

       
      • Solomon Rutzky - 2019-03-18

        Hi there. I ran across this issue while doing research for a post of mine about an undocumented function in SQL Server that might have been related to this, "What Does the Undocumented UNCOMPRESS Function Do?" (but it's not).

        There are a couple of things I wanted to mention:

        1. @makusuko : As mentioned in the initial bug report at the top:

          Access 2000 and 2003 database files use Microsoft's own compression scheme, namely UTF16-LE

          UTF-16 LE has absolutely nothing to do with either Microsoft or compression. UTF-16 LE merely specifies the Little Endian byte ordering variation of the UTF-16 encoding scheme for Unicode data, as prescribed by the Unicode Consortium.

          The 0xFFFE byte sequence is the Byte Order Mark (BOM) that denotes UTF-16 LE. It appears that for Microsoft Access (2000 and newer), Microsoft decided to use the BOM itself as an indicator for compression being used, however:

          1. this is likely the only place you will ever see a connection between 0xFFFE and compression, and
          2. whether the BOM is present or not, the data itself is UTF-16 Little Endian. In fact, when this BOM is not present, the data is fully UTF-16 LE; but when the BOM is present, it is only partially UTF-16 LE, because the "compressed" portion isn't part of the UTF-16 LE specification. So, compression or not, the code points above U+00FF (i.e. those that cannot be compressed) are always stored as UTF-16 LE.
        2. @jahlborn : As noted in the comment directly above:

          jackcess only does "simple" compression ... (simple in this case is "only compress if all characters in the string are compressible").

          I'm not sure why Jackcess only compresses if all characters are compressible (i.e. code points U+0001 through U+00FF), but if it has anything at all to do with the fact that some code points will prevent any Unicode compression in MS Access, and with not knowing which code points prevent compression of the compressible ones, I think I can help with that.

          Based on the algorithm to uncompress using 0x00 as a toggle between compressed and non-compressed values, I was curious what would happen if code points that have a 0x00 byte in them were used. The code points that are compressed are only those that start with 0x00 (i.e. 0x0001 - 0x00FF). But even when those are "compressed", you can still have code points ending in 0x00, such as 0x0100, 0x0200, and so on. I tried a string with code point U+5000, which encodes as the bytes 0x00 0x50 in UTF-16 LE. It also contained a mix of code points below and above the U+00FF boundary. That string is: Bob***ŴĤŶ倀健伷倀Œ***. I found that this string does not compress at all, yet removing the two instances of U+5000 then allows it to compress.

          It appears that the "can we compress?" algorithm checks for code units that are multiples of 256 (i.e. two-byte sequences whose low byte is 0x00), and if any are found, then it won't compress. So, you can update your isUnicodeCompressible method to test for:

          if ((c < MIN_COMPRESS_CHAR) || (c > MAX_COMPRESS_CHAR)
                    || ((c % 256) == 0)) {
          

          or whatever the appropriate Java syntax would be. Then, for the compression algorithm, compress chars <= MAX_COMPRESS_CHAR, and inject 0x00 at the transitions between <= MAX_COMPRESS_CHAR and > MAX_COMPRESS_CHAR.
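
          A rough Java sketch of what that mixed-mode compression could look like, under the assumptions in this thread (0xFF 0xFE header, 0x00 as the mode toggle starting in compressed mode, and "don't compress at all if any code unit's low byte is 0x00"); all names here are illustrative and this has not been verified against Access:

          ~~~~~~~
          import java.io.ByteArrayOutputStream;
          import java.nio.charset.StandardCharsets;

          public class MixedModeCompressionSketch {

              /** Encodes text using the mixed compressed/uncompressed scheme described above (illustrative). */
              static byte[] encode(String text) {
                  // If any UTF-16 code unit ends in a 0x00 byte (c % 256 == 0, which also
                  // covers U+0000), that byte would be indistinguishable from the mode
                  // toggle, so fall back to plain UTF-16 LE with no 0xFF 0xFE header.
                  for (int i = 0; i < text.length(); i++) {
                      if (text.charAt(i) % 256 == 0) {
                          return text.getBytes(StandardCharsets.UTF_16LE);
                      }
                  }
                  ByteArrayOutputStream out = new ByteArrayOutputStream();
                  out.write(0xFF);                          // header marking a compressed value
                  out.write(0xFE);
                  boolean compressedMode = true;            // stream starts in compressed mode
                  for (int i = 0; i < text.length(); i++) {
                      char c = text.charAt(i);
                      boolean compressible = c <= 0x00FF;   // fits in a single low byte
                      if (compressible != compressedMode) {
                          out.write(0x00);                  // inject a toggle at each transition
                          compressedMode = compressible;
                      }
                      out.write(c & 0xFF);                  // low byte (little-endian order)
                      if (!compressible) {
                          out.write((c >>> 8) & 0xFF);      // high byte only in uncompressed mode
                      }
                  }
                  return out.toByteArray();
              }
          }
          ~~~~~~~

          Because the low-byte-0x00 case is excluded up front, no byte other than the injected toggles can be 0x00, so the toggle-based decompression shown at the top of the ticket round-trips the original string.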

        Take care,
        Solomon...

        P.S. For more info on working with collations / encodings in general, please visit: Collations Info

         
        • James Ahlborn - 2019-03-19

          First, I just wanted to say thanks for your interest and willingness to share your knowledge. I think that at this point, however, you've mostly concluded the same things that Jackcess already implements.

          1. I'm aware that UTF-16 LE is a character encoding, not a compression scheme.

          2. I'm not sure that I entirely followed what you were saying, but essentially, a sequence of 2-byte characters is "compressible" (using Access Unicode compression) if they all have the first byte as 0 (excluding the NUL character, i.e. 0x0000). This is what the "isUnicodeCompressible()" method already looks for, so I'm not entirely sure what your proposal was trying to add to this logic (the additional mod check)?

          As for why Jackcess only compresses if all characters are compressible: it simply comes down to the complexity of the algorithm. Mixing compressed and uncompressed character sequences means that you don't know how long the resulting sequence will be, and you need to make decisions on the fly as to whether or not you want to compress a given sequence of characters. I opted for a "simple" version which just does all or nothing.

           
          • Solomon Rutzky - 2019-03-26

            Hi James.

            • Regarding #1 (definition of UTF-16 LE): Sounds good. I was mainly mentioning it because there was no other clarification of the term on this page, leaving the only "definition" as the one the O.P. gave, and I wanted to make sure that readers who did not know otherwise did not leave this page with a false definition.
            • Regarding #2 (the "isUnicodeCompressible()" method): Yes, I understand that the current behavior is to check for all code points outside of U+0000 that start with a \x00 byte. What my proposed modification (i.e. incorporating the mod 256) does is check for code points that end with a \x00 byte. The reason I was mentioning this minor change is that it is what would need to be done if you wanted to do more than the simple compression, because it's what MS Access appears to be doing: compress if no code points in the string end with \x00 (this condition also accounts for U+0000).

             

            if all characters are compressible simply comes down to the complexity of the algorithm.

            Yes, the complexity of the algorithm does increase, but only slightly.

             

            mixing compressed and uncompressed character sequences means that you don't know how long the resulting sequence will be

            I could be wrong, but I don't believe that this statement is correct. If you are already walking the entire string, char by char, to determine if it's 100% < \xFF, then you merely need to keep track of a byte counter and an extra bool variable to denote the compression status, and depending on the status, either add 1 or 2 per character to that counter. Additionally, add 1 every time there's a transition to account for the \x00 byte. Finally, add 2 to the total to account for the Byte Order Mark that indicates that compression is being used (you are already doing this part when you find strings that are 100% < \xFF).
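
            As a sketch of that counting logic (same illustrative assumptions as the snippets above, and purely an illustration of the idea):

            ~~~~~~~
            // Illustrative: byte length of the mixed-mode encoding, computed in one pass.
            static int compressedLength(String text) {
                int len = 2;                              // 0xFF 0xFE header
                boolean compressedMode = true;            // starts in compressed mode
                for (int i = 0; i < text.length(); i++) {
                    boolean compressible = text.charAt(i) <= 0x00FF;
                    if (compressible != compressedMode) {
                        len++;                            // one 0x00 toggle byte per transition
                        compressedMode = compressible;
                    }
                    len += compressible ? 1 : 2;          // 1 byte compressed, 2 bytes uncompressed
                }
                return len;
            }
            ~~~~~~~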

            To be fair, whatever you decide does not affect me either way, since I don't use MS Access, or even Java. I was just trying to explain what MS Access was doing in case you were interested in implementing the non-simple compression, given that you seem to be 90% or more of the way there already. That's all :-).

            Take care, Solomon..

             
