"Did not find required parent table id" with DatabaseBuilder#setCharset

Brought to you by: aganim, jahlborn, javajedi, mdelaurentis

#139 "Did not find required parent table id" with DatabaseBuilder#setCharset

Milestone: Unassigned

Status: wont-fix

Owner: nobody

Labels: None

Priority: 1

Updated: 2017-04-05

Created: 2017-04-01

Creator: Gord Thompson

Private: No

I have a database file that was updated incorrectly via ODBC by a PHP script using single-byte characters from the Windows-1250 character set. So, for example, the character 'Ŕ' (U+0154) was inserted as the single byte 0xC0:

[Screenshot: HexEdit]

When I try to retrieve the text using Jackcess 2.1.6 with

DatabaseBuilder dbb = new DatabaseBuilder(new File(dbPath));
try (Database db = dbb.open()) {
    Table tbl = db.getTable("Table1");
    for (Row r : tbl) {
        String str = r.getString("TextField");
        System.out.println(str);
    }
}

I get

Latin Capital Letter R With Acute: À

because 'À' is U+00C0.
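
The two readings of that byte can be reproduced with the JDK alone. The following is only a sketch: the class and method names are made up for illustration, and it assumes the runtime provides the windows-1250 charset (standard JDK builds do).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ByteC0Demo {
    /** Decodes a single byte value using the given charset. */
    static String decode(int b, Charset cs) {
        return new String(new byte[] { (byte) b }, cs);
    }

    public static void main(String[] args) {
        Charset cp1250 = Charset.forName("windows-1250");
        // As a Windows-1250 byte, 0xC0 is 'Ŕ': prints U+0154
        System.out.printf("U+%04X%n", (int) decode(0xC0, cp1250).charAt(0));
        // Taken as a code point directly (same as ISO-8859-1), it is 'À': prints U+00C0
        System.out.printf("U+%04X%n", (int) decode(0xC0, StandardCharsets.ISO_8859_1).charAt(0));
    }
}
```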

I know that I can juggle the bytes to fix the problem, but I also thought that I should be able to simply add

dbb.setCharset(Charset.forName("cp1250"));

before calling DatabaseBuilder#open. However, when I do that I get

java.io.IOException: Did not find required parent table id (Db=specials.mdb)
    at com.healthmarketscience.jackcess.impl.DatabaseImpl.readSystemCatalog(DatabaseImpl.java:882)
    at com.healthmarketscience.jackcess.impl.DatabaseImpl.<init>(DatabaseImpl.java:533)
    at com.healthmarketscience.jackcess.impl.DatabaseImpl.open(DatabaseImpl.java:400)
    at com.healthmarketscience.jackcess.DatabaseBuilder.open(DatabaseBuilder.java:252)
    at com.example.jackcessdemo.JackcessDemoMain.main(JackcessDemoMain.java:20)

The database file is attached for reference.

1 Attachments

specials.mdb

Discussion

James Ahlborn - 2017-04-01

The problem is that everything in an Access db is in a table. In your database, however, you have mixed character sets. Presumably, the system catalog (a table) is using the "correct" character set. When you try to change the charset for the entire database, Jackcess can't look up the initial information in the system catalog. You really need to localize the usage of this charset to the specifically corrupted table data (Jackcess doesn't really have support for that today). This is really more of a feature request than a bug.

James Ahlborn - 2017-04-01

Well, I guess the Database Charset is mutable. If your intent is just to "fix" the data (and not general database interaction), I think you could selectively manipulate the database charset at certain points in time to recover the data. For example, first open the db and get the table using the default charset. Then, change the database charset to the "wrong" charset while getting rows from the table. If you intend to write correct data back to the table (or do other interactions with the database), you would need to switch back to the default charset before proceeding. I "think" this could work (no promises).

Gord Thompson - 2017-04-01

Ahh, sneaky. When I did

db.setCharset(Charset.forName("cp1250"));

immediately before retrieving the string I got

L a t i n C a p i t a l L e t t e r R W i t h A c u t e : Ŕ

with a null character between each real one, but at least the 'Ŕ' character was correct. Squeezing out the nulls was simple enough with

StringBuilder sb = new StringBuilder(str);
// Each real character is followed by a null; on pass i, index i holds the next null.
for (int i = 1; i <= str.length() / 2; i++) {
    sb.deleteCharAt(i);
}
str = sb.toString();

This is just for reading, so we're good.

Thanks for the quick replies!
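
The interleaved nulls are what one would expect if the stored text is expanded to two bytes per character (UTF-16LE) and then decoded with a single-byte charset: every zero high byte becomes U+0000. The sketch below reproduces that with the JDK only. It assumes, rather than confirms, that this mirrors what happens inside Jackcess, and the class and method names are illustrative. It also shows `String.replace` as a shorter way to drop the nulls.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class NullStripDemo {
    /** Encodes the text as UTF-16LE, then decodes those bytes as windows-1250. */
    static String decodeAsCp1250(String stored) {
        byte[] utf16le = stored.getBytes(StandardCharsets.UTF_16LE);
        return new String(utf16le, Charset.forName("windows-1250"));
    }

    /** Removes every U+0000 character from the string. */
    static String stripNulls(String s) {
        return s.replace("\u0000", "");
    }

    public static void main(String[] args) {
        // The literal ends with U+00C0, the code unit the file holds for byte 0xC0.
        String raw = decodeAsCp1250("R: \u00C0");
        System.out.println(raw.length());      // twice the original length
        System.out.println(stripNulls(raw));   // nulls removed, last char is now U+0154
    }
}
```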

James Ahlborn - 2017-04-04

Was the example database you attached actually created using the PHP script and ODBC?

Gord Thompson - 2017-04-04

Access_ODBC can't create a new database file. It was created using MSACCESS.EXE, uploaded to the WAMP server, and then updated using Access_ODBC under PHP.

James Ahlborn - 2017-04-04

Yeah, that's what I meant: how was the incorrect data created?

I'm hesitant to make changes to Jackcess to handle this case because the data really is corrupted. The data is written using the "compressed unicode" format, but the data in question obviously isn't actually unicode.

Basically, you have "cp1252" characters which have been written as utf16 chars. I think you should probably just fix the strings once you get them back out of the table.

James Ahlborn - 2017-04-04

Can you generate a similar database, except where the text column has the "compressed unicode" option disabled? I'm curious to see what the resulting data looks like.

Gord Thompson - 2017-04-04

Here it is.

cp1250uncompressed.mdb

James Ahlborn - 2017-04-04

Right, so that's kind of what I expected. The "uncompressed" data comes out exactly the same as the compressed unicode version after "decompression" (as Jackcess currently functions). So, Jackcess is handling the character data consistently. And, in the case of uncompressed text data, changing the charset to cp1250 before reading doesn't give you the "correct" (i.e. original data) results. The patch you provided (https://github.com/jahlborn/jackcess/pull/1) would not give you the results that you want for uncompressed fields.

In other words, the decompression logic in Jackcess isn't the problem. The problem is that the original data was written as if the bytes provided were single bytes of unicode characters, when in fact they were single bytes of cp1250 characters. Jackcess can't fix this problem for you. If you want to get consistent results for any character field in this situation, you really need to "post process" the data (strip the nulls and re-interpret the bytes as cp1250).
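
The post-processing described above might look like the following. This is a sketch, not part of Jackcess. It assumes the string was read with the default database charset and that each char carries exactly one original byte value (so ISO-8859-1 recovers the bytes). The helper name `reinterpretAsCp1250` is hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Cp1250Repair {
    private static final Charset CP1250 = Charset.forName("windows-1250");

    /**
     * Hypothetical helper: drops any U+0000 chars, treats each remaining
     * char as one raw byte, and decodes those bytes as windows-1250.
     */
    static String reinterpretAsCp1250(String s) {
        String noNulls = s.replace("\u0000", "");
        // ISO-8859-1 maps U+0000..U+00FF to identical byte values; chars
        // outside that range would become '?' (assumed not to occur here).
        byte[] raw = noNulls.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, CP1250);
    }

    public static void main(String[] args) {
        // Stand-in for a value read from the table; the last char is U+00C0.
        String fromDb = "Latin Capital Letter R With Acute: \u00C0";
        System.out.println(reinterpretAsCp1250(fromDb));
    }
}
```

In the loop from the original report, this would be applied to the value returned by `r.getString("TextField")`, leaving the database charset untouched.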

Last edit: James Ahlborn 2017-04-04

Gord Thompson - 2017-04-04

Actually, the bytes are interpreted correctly as cp1250; it's just those pesky null characters after each one. Still, I understand that consistent results are desirable, and that the programmer has to specify the foreign character set, so they must have some awareness that this is an unusual case.

Thanks for the feedback.

James Ahlborn - 2017-04-05

status: open --> wont-fix

James Ahlborn - 2017-04-05

Nothing to do here, as this isn't something that can be "fixed" within jackcess.
