I have a database file that was updated incorrectly via ODBC by a PHP script using single-byte characters from the Windows-1250 character set. So, for example, the character 'Ŕ' (U+0154) was inserted as the single byte 0xC0.

When I try to retrieve the text using Jackcess 2.1.6 with
    DatabaseBuilder dbb = new DatabaseBuilder(new File(dbPath));
    try (Database db = dbb.open()) {
        Table tbl = db.getTable("Table1");
        for (Row r : tbl) {
            String str = r.getString("TextField");
            System.out.println(str);
        }
    }
I get
Latin Capital Letter R With Acute: À
because 'À' is U+00C0.
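To make the mismatch concrete, here is a minimal standalone sketch (plain JDK, no Jackcess) showing that the same byte 0xC0 decodes to different code points depending on the charset. The class and method names are made up for illustration, and it assumes the JDK in use ships the windows-1250 charset.

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public class ByteC0Demo {
        // Decode one byte with the given charset and return the resulting code point.
        public static int decodeSingleByte(byte b, Charset cs) {
            return new String(new byte[] { b }, cs).codePointAt(0);
        }

        public static void main(String[] args) {
            byte b = (byte) 0xC0;
            // What the PHP script meant: 0xC0 in Windows-1250 is U+0154.
            System.out.printf("windows-1250: U+%04X%n",
                    decodeSingleByte(b, Charset.forName("windows-1250")));
            // What ended up in the file: 0xC0 taken as a Unicode value is U+00C0.
            System.out.printf("ISO-8859-1:   U+%04X%n",
                    decodeSingleByte(b, StandardCharsets.ISO_8859_1));
        }
    }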
I know that I can juggle the bytes to fix the problem, but I also thought that I should be able to simply add
    dbb.setCharset(Charset.forName("cp1250"));
before calling DatabaseBuilder#open. However, when I do that I get
java.io.IOException: Did not find required parent table id (Db=specials.mdb)
    at com.healthmarketscience.jackcess.impl.DatabaseImpl.readSystemCatalog(DatabaseImpl.java:882)
    at com.healthmarketscience.jackcess.impl.DatabaseImpl.<init>(DatabaseImpl.java:533)
    at com.healthmarketscience.jackcess.impl.DatabaseImpl.open(DatabaseImpl.java:400)
    at com.healthmarketscience.jackcess.DatabaseBuilder.open(DatabaseBuilder.java:252)
    at com.example.jackcessdemo.JackcessDemoMain.main(JackcessDemoMain.java:20)
The database file is attached for reference.
The problem is that everything in an Access db is in a table. In your database, however, you have mixed character sets. Presumably, the system catalog (a table) is using the "correct" character set. When you try to change the charset for the entire database, Jackcess can't look up the initial information in the system catalog. You really need to localize the usage of this charset to the specifically corrupted table data (Jackcess doesn't really have support for that today). This is really more of a feature request than a bug.
Well, I guess the Database Charset is mutable. If your intent is just to "fix" the data (and not general database interaction), I think you could selectively manipulate the database charset at certain points in time to recover the data. E.g., first open the db and get the table using the default charset. Then, change the database charset to the "wrong" charset while getting rows from the table. If you intend to write correct data back to the table (or do other interactions with the database), you would need to switch back to the default charset before proceeding. I "think" this could work (no promises).
Ahh, sneaky. When I did
immediately before retrieving the string I got
with a null character between each real one, but at least the 'Ŕ' character was correct. Squeezing out the nulls was simple enough with
This is just for reading, so we're good.
Thanks for the quick replies!
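For anyone reading along, here is a minimal standalone sketch of why the nulls show up and one way to squeeze them out. It is not necessarily the code used above; the class and method names are hypothetical, and it assumes the text bytes reach the decoder as UTF-16LE (so 'À' is the byte pair C0 00), which is then decoded one byte at a time as cp1250.

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public class StripNullsDemo {
        private static final Charset CP1250 = Charset.forName("windows-1250");

        // Simulate decoding UTF-16LE text bytes with a single-byte charset:
        // every low byte becomes a cp1250 character, every high byte (0x00) a null.
        public static String decodeStoredAsCp1250(String asReadWithDefaultCharset) {
            byte[] stored = asReadWithDefaultCharset.getBytes(StandardCharsets.UTF_16LE);
            return new String(stored, CP1250);
        }

        // Remove the interleaved null characters.
        public static String stripNulls(String s) {
            return s.replace("\0", "");
        }

        public static void main(String[] args) {
            String raw = decodeStoredAsCp1250("\u00C0"); // U+0154 followed by U+0000
            String fixed = stripNulls(raw);
            System.out.println(raw.length());   // 2
            System.out.println(fixed.length()); // 1
            System.out.printf("U+%04X%n", fixed.codePointAt(0)); // U+0154
        }
    }

Note that this only behaves well for values whose characters all fit in one byte. A character that was stored correctly as real Unicode with a non-zero high byte would come out as two unrelated cp1250 characters.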
Is the example database you attached actually created using the php script and odbc?
Access_ODBC can't create a new database file. It was created using MSACCESS.EXE, uploaded to the WAMP server, and then updated using Access_ODBC under PHP.
Yeah, that's what I meant: how the incorrect data was created.
I'm hesitant to make changes to Jackcess to handle this case because the data really is corrupted. The data is written using the "compressed unicode" format, but the data in question obviously isn't actually Unicode.
Basically, you have "cp1250" characters which have been written as UTF-16 chars. I think you should probably just fix the strings once you get them back out of the table.
Can you generate a similar database except where the text column has the "compressed unicode" option disabled? I'm curious to see what the resulting data looks like.
Here it is.
Right, so that's kind of what I expected. The "uncompressed" data comes out exactly the same as the compressed unicode version after "decompression" (as Jackcess currently functions). So, Jackcess is handling the character data consistently. And, in the case of uncompressed text data, changing the charset to cp1250 before reading doesn't give you the "correct" (i.e. original data) results. The patch you provided (https://github.com/jahlborn/jackcess/pull/1) would not give you the results that you want for uncompressed fields.
In other words, the decompression logic in Jackcess isn't the problem. The problem is that the original data was written as if the bytes provided were single bytes of Unicode characters, when in fact they were single bytes of cp1250 characters. Jackcess can't fix this problem for you. If you want to get consistent results for any character field in this situation, you really need to "post process" the data (strip the nulls and re-interpret the bytes as cp1250).
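A minimal sketch of that kind of post-processing, applied to a String read with the default charset (so there are no nulls to strip at this point). The class and method names are hypothetical. It assumes each original cp1250 byte was stored as the Unicode character with the same numeric value, which is how the corruption is described here; strings containing any character above U+00FF are returned unchanged, since they cannot have come from a single byte.

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public class Cp1250Repair {
        private static final Charset CP1250 = Charset.forName("windows-1250");

        // Re-interpret a string whose chars are really cp1250 byte values.
        public static String repair(String s) {
            if (s == null) {
                return null;
            }
            for (int i = 0; i < s.length(); i++) {
                if (s.charAt(i) > 0xFF) {
                    return s; // not a single-byte value, leave the string alone
                }
            }
            // ISO-8859-1 maps U+0000..U+00FF back to the same byte values.
            byte[] originalBytes = s.getBytes(StandardCharsets.ISO_8859_1);
            return new String(originalBytes, CP1250);
        }

        public static void main(String[] args) {
            String fixed = repair("\u00C0");
            System.out.printf("U+%04X%n", fixed.codePointAt(0)); // U+0154
        }
    }

Plain ASCII text passes through unchanged, so the method can be applied to every value of the affected column, but it should not be applied to columns that hold correctly written data.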
Last edit: James Ahlborn 2017-04-04
Actually, the bytes are interpreted correctly as cp1250, it's just those pesky null characters after each one. Still, I understand that
Thanks for the feedback.
Nothing to do here, as this isn't something that can be "fixed" within Jackcess.