From: Jakob E. <jab...@gm...> - 2011-09-23 06:54:00
In Jet4, all text is stored using the UCS-2 encoding. However, Access uses a special trick to reduce storage requirements: ASCII characters are stored as single-byte characters, and all others are stored as two-byte characters. A null byte is used to switch between the two encoding methods. In your field, such a null byte will appear between the ASCII text and the Hebrew characters. Apparently, this null byte causes mdb-export to believe it has reached the end of the string. This shouldn't happen.

Which version of mdb-export are you using? There are a lot of old versions "in the wild". It is best to compile the most current version from GitHub yourself, so you can be sure you are using a recent one. I'd be glad to try reading your file with the newest version; if you are interested, just send a copy of the database to my private email address.

Best regards,
Jakob

On 23.09.2011, at 08:03, אריאל קלגסבלד Ariel Klagsbald wrote:

> 2011/9/21 Nirgal <con...@ni...>:
> > Jet4 always uses Unicode (UCS-2) internally.
> >
> > Output should be UTF-8, unless you set the env var MDBICONV (there is no underscore).
>
> I tried again and again, and MDBICONV seems to have no effect (a strange fact by itself). Any more ideas, please? Maybe I'm wrong, and it isn't an encoding problem. Can something else cause mdb-export to ignore half of the field?
>
> [ak@ch ~/]$ setenv MDBICONV UTF-8
> [ak@ch ~/]$ mdb-export -QHd^ WebStructure.mdb FilePaths | grep '^130419\^'
> 130419^9817^0113-20000101-010645-45_^Hebrew|HWomen|HinuchYeladimShlomBayit|HinuchYeladim|R0113-5|R0113-2^01/01/00 00:00:00^84^20223203^0113^1^0^45 ��ז� �ט��י ��ט��, ��' ��ך, ךי'ב^0^0^0^0
> [ar@ch ~/]$ setenv MDBICONV iso-8859-1
> [ar@ch ~/]$ mdb-export -QHd^ WebStructure.mdb FilePaths | grep '^130419\^'
> 130419^9817^0113-20000101-010645-45_^Hebrew|HWomen|HinuchYeladimShlomBayit|HinuchYeladim|R0113-5|R0113-2^01/01/00 00:00:00^84^20223203^0113^1^0^45 ��ז� �ט��י ��ט��, ��' ��ך, ךי'ב^0^0^0^0
> [ar@ch ~/]$ setenv MDBICONV nothingatall
> [ar@ch ~/]$ mdb-export -QHd^ WebStructure.mdb FilePaths | grep '^130419\^'
> 130419^9817^0113-20000101-010645-45_^Hebrew|HWomen|HinuchYeladimShlomBayit|HinuchYeladim|R0113-5|R0113-2^01/01/00 00:00:00^84^20223203^0113^1^0^45 ��ז� �ט��י ��ט��, ��' ��ך, ךי'ב^0^0^0^0
> [ar@ch ~/]$
>
> See? MDBICONV has no effect. The 10th field (it's Hebrew) seems the same (even if your terminal doesn't show Hebrew, you can see there's no difference), and the 3rd field is still truncated. Only the numbers appear.
>
> Any help please?!?
>
> > On Wednesday 21 September 2011 09:55:12 אריאל קלגסבלד Ariel Klagsbald wrote:
> >> I hope this is the place to post such a problem. And I also hope my
> >> diagnosis is correct (that it really is an encoding problem. I'm not
> >> sure).
> >>
> >> Well, I have a large mdb file, in which one of the fields contains strings like
> >>
> >> 0007-20101223-214033-שמות-בגדר_שם.mp3
> >>
> >> or
> >>
> >> 0007-20110714-213442-יום_טוב_שני_של_גלויות.mp3
> >>
> >> That is, part English, part numbers and part Hebrew (yes, that's
> >> Hebrew, in case you can't see it in your browser).
> >>
> >> When I use mdb-export to extract data from this file, I get the
> >> numbers correctly, but only them. The Hebrew and English parts are
> >> simply missing (even the '3' in the 'mp3' suffix).
> >> That is, when I
> >> extract the latter example I get only
> >>
> >> 0007-20110714-213442
> >>
> >> I'll add that other fields contain only Hebrew (e.g.
> >> יום טוב שני של גלויות, יב' תמוז, תשע'א
> >> in the example above), and they seem to be extracted correctly. That
> >> is, I get some gibberish which I guess is the correct data, only my
> >> terminal can't present it.
> >>
> >> I thought it might be an encoding problem, so I've played a bit with
> >> MDB_ICONV, MDB_JET_CHARSET, MDB_JET3_CHARSET and MDB_JET4_CHARSET but
> >> it showed no difference.
> >> The file seems to be JET4 (so mdb-ver claims). I've no idea what
> >> encoding it uses (I don't know how to find out. Any ideas?), but I
> >> guess it's UTF-8 (only a guess).
> >>
> >> I'll be grateful for any help!
> >> Ariel.
> >>
> >> _______________________________________________
> >> mdbtools-dev mailing list
> >> mdb...@li...
> >> https://lists.sourceforge.net/lists/listinfo/mdbtools-dev
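
Jakob's reply at the top of this thread describes the Jet4 text compression in words: one byte per character where possible, two bytes otherwise, with a null byte switching between the two modes. A minimal Python sketch of that scheme might look like the following. It is an illustration under stated assumptions, not the mdbtools implementation. Assumptions not taken from the thread: a compressed field is taken to start with the two-byte marker 0xFF 0xFE (which, to the best of my knowledge, is what mdbtools checks for), and the names `decode_jet4_text`, `truncate_at_nul` and `encode_for_demo` are hypothetical helpers invented for this example.

```python
# Sketch only: models the compression scheme described in the reply above.
# ASSUMPTION: compressed fields begin with the marker bytes 0xFF 0xFE.
COMPRESSED_MARKER = b"\xff\xfe"


def decode_jet4_text(raw: bytes) -> str:
    """Decode a Jet4 text field that may use the one-byte/two-byte trick."""
    if not raw.startswith(COMPRESSED_MARKER):
        # No marker: treat the field as plain UCS-2 little-endian.
        return raw.decode("utf-16-le")

    out = bytearray()
    compressed = True  # decoding starts in one-byte mode
    i = len(COMPRESSED_MARKER)
    while i < len(raw):
        if raw[i] == 0:
            # A null byte toggles the mode; it is NOT the end of the string.
            compressed = not compressed
            i += 1
        elif compressed:
            # One stored byte stands for a character whose high byte is 0.
            out += bytes((raw[i], 0))
            i += 1
        elif i + 1 < len(raw):
            # Two stored bytes are one UCS-2 little-endian character.
            out += raw[i:i + 2]
            i += 2
        else:
            break  # odd trailing byte: stop rather than guess
    return out.decode("utf-16-le")


def truncate_at_nul(raw: bytes) -> str:
    """Hypothetical model of the bug: stop reading at the first null byte."""
    body = raw[len(COMPRESSED_MARKER):]
    return body.split(b"\x00", 1)[0].decode("latin-1")


def encode_for_demo(text: str) -> bytes:
    """Hypothetical helper that builds a compressed field for the demo.

    Handles only BMP characters whose low byte is non-zero, which is
    enough for ASCII and Hebrew letters.
    """
    out = bytearray(COMPRESSED_MARKER)
    compressed = True
    for ch in text:
        lo, hi = ch.encode("utf-16-le")
        wants_compressed = hi == 0
        if wants_compressed != compressed:
            out.append(0)  # mode switch
            compressed = wants_compressed
        out += bytes((lo,)) if compressed else bytes((lo, hi))
    return bytes(out)


if __name__ == "__main__":
    # A shortened file name of the same shape as the ones in the thread.
    name = "0007-20110714-213442-\u05d9\u05d5\u05dd_\u05d8\u05d5\u05d1.mp3"
    raw = encode_for_demo(name)
    print(decode_jet4_text(raw))  # full name, Hebrew included
    print(truncate_at_nul(raw))   # only the leading ASCII part
```

In this model, a reader that stops at the first null byte returns only the ASCII prefix before the first Hebrew character, which resembles the truncation reported in the thread. Whether that is the actual cause in any given mdb-export version would still need to be checked against its source.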