The Unicode compression scheme currently used (based on the Unicode standard compression, in Expand.java) is not the right one. The current implementation gives wrong results if the string contains characters whose value is below 32 (space) or above 127 (i.e. outside the printable ASCII range). Libmdb handles our case correctly.
At least Access 2000 and 2003 database files use Microsoft's own compression scheme: UTF-16LE in which a byte is simply omitted when it is zero. http://office.microsoft.com/en-us/access-help/about-compressing-the-data-in-a-text-memo-or-hyperlink-field-mdb-HP005260416.aspx gives some hints about Microsoft's implementation. Libmdb's version, residing in iconv.c, looks like this:
~~~~~~~
/* Uncompress 'Unicode Compressed' string into tmp */
if (!IS_JET3(mdb) && (slen >= 2)
    && ((src[0] & 0xff) == 0xff) && ((src[1] & 0xff) == 0xfe)) {
    unsigned int compress = 1;
    src += 2;
    slen -= 2;
    tmp = (char *)g_malloc(slen * 2);
    while (slen) {
        if (*src == 0) {
            compress = (compress) ? 0 : 1;
            src++;
            slen--;
        } else if (compress) {
            tmp[tlen++] = *src++;
            tmp[tlen++] = 0;
            slen--;
        } else if (slen >= 2) {
            tmp[tlen++] = *src++;
            tmp[tlen++] = *src++;
            slen -= 2;
        }
    }
}
~~~~~~~
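For readers more comfortable with Java, the libmdb loop above can be sketched roughly as follows. This is my own standalone port (not Jackcess or libmdb code), assuming the two-byte 0xFF 0xFE header has already been checked and stripped:

```java
import java.nio.charset.StandardCharsets;

public class MdbDecompress {
    /**
     * Expands an Access "Unicode compressed" byte sequence (compression
     * header already stripped) into a string. A 0x00 byte toggles between
     * compressed mode (one byte per char, high byte implied zero) and
     * uncompressed mode (two bytes per char, UTF-16LE).
     */
    public static String decompress(byte[] src) {
        byte[] out = new byte[src.length * 2];
        int pos = 0, outLen = 0;
        boolean compressed = true; // the format starts in compressed mode
        while (pos < src.length) {
            if (src[pos] == 0x00) {
                compressed = !compressed; // toggle on the 0x00 marker
                pos++;
            } else if (compressed) {
                out[outLen++] = src[pos++];
                out[outLen++] = 0;        // re-insert the omitted high byte
            } else if (src.length - pos >= 2) {
                out[outLen++] = src[pos++];
                out[outLen++] = src[pos++];
            } else {
                break; // trailing odd byte in uncompressed mode; ignore, like libmdb
            }
        }
        return new String(out, 0, outLen, StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        // "A" compressed, then a 0x00 toggle, then Greek gamma (U+03B3) uncompressed
        byte[] data = { 0x41, 0x00, (byte) 0xB3, 0x03 };
        System.out.println(decompress(data)); // prints "Aγ"
    }
}
```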
Here is a patch that fixed our problem. It is not the most efficient implementation of the algorithm, but it should give the idea.
~~~~~~~
--- ColumnImpl.java
+++ ColumnImpl.java
@@ -1110,8 +1110,6 @@
String decodeTextValue(byte[] data)
throws IOException
{
- try {
-
// see if data is compressed. the 0xFF, 0xFE sequence indicates that
// compression is used (sort of, see algorithm below)
boolean isCompressed = ((data.length > 1) &&
@@ -1119,9 +1117,6 @@
(data[1] == TEXT_COMPRESSION_HEADER[1]));
-
- Expand expander = new Expand();
-
// this is a whacky compression combo that switches back and forth
// between compressed/uncompressed using a 0x00 byte (starting in
// compressed mode)
@@ -1129,39 +1124,32 @@
// start after two bytes indicating compression use
int dataStart = TEXT_COMPRESSION_HEADER.length;
int dataEnd = dataStart;
+ int dataPos = dataStart;
boolean inCompressedMode = true;
- while(dataEnd < data.length) {
- if(data[dataEnd] == (byte)0x00) {
-
- // handle current segment
- decodeTextSegment(data, dataStart, dataEnd, inCompressedMode,
- expander, textBuf);
+ byte[] tmp = new byte[2];
+ ByteBuffer buffer = ByteBuffer.wrap(tmp);
+ while(dataPos < data.length) {
+ if(data[dataPos] == (byte)0x00) {
+ dataPos++;
inCompressedMode = !inCompressedMode;
- ++dataEnd;
- dataStart = dataEnd;
-
+ } else if (inCompressedMode) {
+ tmp[0] = data[dataPos++];
+ tmp[1] = 0;
+ textBuf.append(getCharset().decode(buffer));
+ buffer.rewind();
} else {
- ++dataEnd;
+ tmp[0] = data[dataPos++];
+ tmp[1] = data[dataPos++];
+ textBuf.append(getCharset().decode(buffer));
+ buffer.rewind();
}
}
- // handle last segment
- decodeTextSegment(data, dataStart, dataEnd, inCompressedMode,
- expander, textBuf);
-
return textBuf.toString();
}
/**
* Decodes a segnment of a text value into the given buffer according to the
~~~~~~~
Can you please attach an example database which has problem data?
I attached a small database containing Latin-1 strings, control characters, ancient Greek, Russian and Japanese text. The other attachment contains a screenshot from Access showing the actual data.
The test code used was
Using Jackcess 2.0.6 the output is:
Using the patched version (which does not use the standard Unicode compression scheme, but the one that just skips the zero high bytes) the output becomes:
Basically everything works as it should, except the second-to-last string, which contains some control characters (3L9<GS>52<STX>_AB(¥<ENQ>!!V).
Last edit: Markus Sunela 2014-10-29
Sorry that it took me a while to fix this. The changes were pretty simple, but I needed to do some work to verify that they were correct (and just didn't have the time). Fixed in trunk; will be in the 2.0.7 release.
Last edit: James Ahlborn 2019-03-19
I just did a test with "jackcess-2.0.7-SNAPSHOT.jar" (SVN revision 893) because I wanted to check the Unicode compression behaviour when a Text field contained a mix of compressible and non-compressible characters. I created a new Access 2010 database with a table named "Table1":
I created a new row in Access Datasheet view with
Source: Access
TestField: Greek: γιορτή means "feast"
I closed Access and ran the following using my snapshot build of 2.0.7
When I open the database in Access the new row looks the same,
but when I open it in HexEdit the "TestField" for the new row is not compressed (while the one created by Access is).
I also notice that the new TestField value does not begin with the FF FE prefix.
Awesome, I'll take all the extra testing on this I can get!
What you are seeing, however, is that Jackcess only does "simple" compression. The Unicode compression scheme allows for a mix of compressed and uncompressed. Jackcess can read the complex format, but will only write the simple format (simple in this case means "only compress if all characters in the string are compressible"). Your test string contains a mix of compressible and non-compressible characters, so Jackcess is not compressing it. In fact, this behavior is not new; you will see the same behavior with an older version of Jackcess (or at least, you should!).
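The "all or nothing" rule described above could be sketched like this. This is my own illustration, not the actual Jackcess code (the thread mentions a method named isUnicodeCompressible, but its real body may differ):

```java
public class SimpleCompressCheck {
    /**
     * "Simple" compressibility check: a string can be written compressed
     * only if every char fits in one byte (code points U+0001 .. U+00FF).
     * Any char outside that range, or a NUL, forces the whole string
     * to be written uncompressed.
     */
    public static boolean isUnicodeCompressible(CharSequence text) {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == 0x0000 || c > 0x00FF) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isUnicodeCompressible("feast"));   // true: all chars <= 0xFF
        System.out.println(isUnicodeCompressible("γιορτή"));  // false: Greek chars > 0xFF
    }
}
```

This matches the observed behavior above: a mixed string like "Greek: γιορτή" fails the check, so Jackcess writes it uncompressed, without the FF FE prefix.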
Hi there. I ran across this thread while doing research for a post of mine about an undocumented function in SQL Server that I thought might have been related to this, What Does the Undocumented UNCOMPRESS Function Do? (it's not).
There are a couple of things I wanted to mention:
@makusuko : As mentioned in the initial bug report at the top:
UTF-16 LE has absolutely nothing to do with either Microsoft or compression. UTF-16 LE merely specifies the Little Endian byte ordering variation of the UTF-16 encoding scheme for Unicode data, as prescribed by the Unicode Consortium.
The `0xFF 0xFE` byte sequence is the Byte Order Mark (BOM) that denotes UTF-16 LE. It appears that for Microsoft Access (2000 and newer), Microsoft decided to use the BOM itself as the indicator that compression is being used; hence the association between `0xFFFE` and compression.
@jahlborn : As noted in the comment directly above:
I am not sure why Jackcess only compresses if all characters are compressible (i.e. code points U+0001 through U+00FF), but if it has anything at all to do with the fact that some code points will prevent any Unicode compression (in MS Access), and with not knowing which code points prevent compression of otherwise compressible code points, I think I can help with that.
Based on the algorithm that uncompresses using `0x00` as a toggle between compressed and non-compressed values, I was curious what would happen if code points that have a `0x00` byte in them were used. The code points that are compressed are only those that start with `0x00` (i.e. `0x0001` - `0x00FF`). But when those are "compressed", you can still have code points ending in `0x00`, such as `0x0100`, `0x0200`, and so on. I tried a string with code point U+5000, which encodes as `0x0050` in UTF-16 LE. It also contained a mix of code points below and above the U+00FF boundary. That string is: `Bob***ŴĤŶ倀健伷倀Œ***`. I found that this string does not compress at all, yet removing the two instances of U+5000 then allows it to compress.

It appears that the "can we compress?" algorithm checks for two-byte sequences that are multiples of 256, and if any are found, then it won't compress. So, you can update your isUnicodeCompressible method to test for:

or whatever the appropriate Java syntax would be. Then, for the compression algorithm, compress chars <= `MAX_COMPRESS_CHAR`, and inject `0x00` at the transitions between <= `MAX_COMPRESS_CHAR` and > `MAX_COMPRESS_CHAR`.

Take care,
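The exact expression proposed above did not survive in this thread, but based on the description (reject any string containing a code unit that is a multiple of 256, which also covers U+0000), a guess at the check might look like this. The class and method names here are my own, for illustration only:

```java
public class CompressibilityCheck {
    /**
     * Per the description above: a string can be (partially) compressed
     * unless it contains U+0000 or any code unit that is a multiple of
     * 256 (i.e. whose low byte is 0x00, such as U+0100 or U+5000),
     * because such a byte would be mistaken for the compress/uncompress
     * toggle in the encoded stream.
     */
    public static boolean isCompressible(CharSequence text) {
        for (int i = 0; i < text.length(); i++) {
            if ((text.charAt(i) % 256) == 0) { // catches U+0000, U+0100, ..., U+5000, ...
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The test string from above: contains two instances of U+5000 (倀)
        System.out.println(isCompressible("Bob***ŴĤŶ倀健伷倀Œ***")); // false
        // The same string with the two U+5000 chars removed
        System.out.println(isCompressible("Bob***ŴĤŶ健伷Œ***"));     // true
    }
}
```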
Solomon...
P.S. For more info on working with collations / encodings in general, please visit: Collations Info
First, I just wanted to say thanks for your interest and willingness to share your knowledge. I think that at this point, however, you've mostly concluded the same things that Jackcess already implements.
I'm aware that UTF16-LE is a character encoding, not a compression scheme.
I'm not sure that I entirely followed what you were saying, but essentially, a sequence of 2-byte characters is "compressible" (using Access unicode compression) if they all have the high byte as 0 (excluding the NUL character, i.e. 0x0000). This is what the isUnicodeCompressible() method already looks for, so I'm not entirely sure what your proposal (the additional mod check) adds to this logic?
As for why Jackcess only compresses if all characters are compressible: it simply comes down to the complexity of the algorithm. Mixing compressed and uncompressed character sequences means that you don't know how long the resulting sequence will be, and you need to make decisions on the fly as to whether or not you want to compress a given sequence of characters. I opted for a "simple" version which just does all or nothing.
Hi James.
Your check looks for code points that start with a `\x00` byte. What my proposed modification (i.e. incorporating the mod 256) does is check for code points that end with a `\x00` byte. The reason I was mentioning this minor change is that it is what would need to be done if you wanted to do more than the simple compression, because it appears to be what MS Access is doing: compress only if no code points in the string end with a `\x00` byte (this condition also accounts for U+0000).

Yes, the complexity of the algorithm does increase, but only slightly.
I could be wrong, but I don't believe that this statement is correct. If you are already walking the entire string, char by char, to determine whether it's 100% < `\xFF`, then you merely need to keep track of a byte counter and an extra `bool` variable to denote the compression status and, depending on the status, add either 1 or 2 per character to that counter. Additionally, add 1 every time there's a transition, to account for the `\x00` toggle byte. Finally, add 2 to the total to account for the Byte Order Mark that indicates that compression is being used (you are already doing this part when you find strings that are 100% < `\xFF`).

To be fair, whatever you decide does not affect me either way, since I don't use MS Access, or even Java. I was just trying to explain what MS Access was doing in case you were interested in implementing the non-simple compression, given that you seem to be 90% or more of the way there already. That's all :-).
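The counting pass described above could be sketched like this (my own illustration, not code from the thread): walk the string once, track compressed/uncompressed mode, and total up the bytes the mixed-mode encoding would need:

```java
public class CompressedLengthCalc {
    /**
     * Computes the byte length of the mixed-mode "Unicode compressed"
     * encoding for a string, per the counting scheme described above:
     * 1 byte per compressible char (<= 0xFF), 2 bytes per non-compressible
     * char, +1 per mode transition (the 0x00 toggle), +2 for the leading
     * 0xFF 0xFE compression marker.  Assumes the string contains no chars
     * whose code unit is a multiple of 256 (those prevent compression).
     */
    public static int compressedLength(CharSequence text) {
        int len = 2;                     // 0xFF 0xFE compression marker
        boolean inCompressedMode = true; // the format starts compressed
        for (int i = 0; i < text.length(); i++) {
            boolean compressible = text.charAt(i) <= 0xFF;
            if (compressible != inCompressedMode) {
                len++;                   // a 0x00 byte toggles the mode
                inCompressedMode = compressible;
            }
            len += compressible ? 1 : 2;
        }
        return len;
    }

    public static void main(String[] args) {
        // "Greek: " = 7 compressible chars, then 6 Greek chars (> 0xFF):
        // 2 (marker) + 7 + 1 (toggle) + 12 = 22 bytes
        System.out.println(compressedLength("Greek: γιορτή")); // prints 22
        // All-compressible string: 2 (marker) + 5 = 7 bytes, no toggle
        System.out.println(compressedLength("feast"));         // prints 7
    }
}
```

Knowing the output length up front removes the main objection to mixed-mode compression: the buffer can be sized exactly before the second, writing pass.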
Take care, Solomon..