#165 Notes character encoding is not translated

Status: open
Owner: Mark Ellis
Labels: librtfcomp (1)
Priority: 5
Updated: 2012-09-30
Created: 2009-01-21
Creator: Vadim Belman
Private: No

Synchronizing contacts, tasks and events whose notes are written in non-Latin scripts (e.g. Cyrillic) loses the encoding information when transferring from a WM device to a PC. Transferring from PC to WM produces correctly encoded notes.

Discussion

  • Vadim Belman
    2009-01-21

    Example of a garbled note in Evolution Address Book.

     
    Attachments
  • David Eriksson
    2009-01-21

    Please specify your exact device model and Windows Mobile version, and what programs and their versions you use for synchronization.

    Thanks,

    David

     
  • Vadim Belman
    2009-01-22

    The same problem has been observed for both HTC Mogul (PPC 6800, WM6.1) and ASUS MyPal A696 (WM5).

    OS: openSUSE 11.0 and 11.1, respectively.

    SynCE versions: I followed the svn repository for 4-5 months until recently. Presently I use the packages from the openSUSE system:SynCE repository, based on the 0.13 branch.

    Client software used: Presently it's either multisync-gui or msynctool. I tried kitchensync some months ago with the same result.

     
  • ThoMaus
    2009-02-03

    I'm observing a very similar, probably related, but more extended problem.
    (I have evidence the root cause is in SynCE, see attachments.)

    == Environment ==
    * SynCE -- issue occurs with 0.12 and 0.13
    * OpenSync 0.22 (Plugins: kdepim synce-legacy)
    * Kontact 3.5.10
    * OpenSuSE 11.1
    * PDA with WinCE aka WM2003 (M$ PocketPC Version '4.20.0 Build 14053'; Processor 'ARM S3C2410'; Model+ROM-ID 'BM-6280 V1.00 GER B16')

    == Observations ==
    * Sync between KDE-PIM and PDA is running fine, as long as no characters outside the ASCII set are used (on the PDA).
    * As soon as the PDA data contains any non-ASCII characters, e.g. umlauts and the like, the conversion engine of OpenSync fails with 'invalid utf8 passed to VFormat. Limping along.' The individual resulting XML data field is cut off at the position of the non-ASCII character -- the XML structure is undamaged. (An annotated 'msynctool trace' is attached -- annotations are indicated by a hash mark '#' at line start.)
    * The entries ending up in Kontact's databases (std.ics and std.vcf) are encoded as UTF8 and the data is cut off where the intermediary XML data was cut off.
    * Trace output from msynctool is suggesting that the intermediary vcal or vcard data is in Windows code page 1252 encoding (which is very similar to ISO-latin-1) -- more complex detail below ...
    * A pcap traffic capture on the ppp0 interface shows that the data from the PDA is encoded as shown by the msynctool trace: all characters are encoded as 2-byte entities (UTF-16-LE), with the notable exception of the note or comment fields of contacts, todos or events, which are encoded as single-byte characters (typically in Windows Code Page 1252, but see below).
    * Non-ASCII characters traveling in the opposite direction, i.e. from KDE PIM to the PDA, are:
      o UTF-8-encoded in the PIM databases
      o UTF-8-encoded in the intermediary vcal (for VEVENT or VTODO) or vcard (contacts)
      o HTML-encoded Unicode in the intermediary XML (i.e. 'a diaeresis', ä, becomes an HTML character reference)
      o junk on the PDA (i.e. 'a diaeresis' is displayed as two characters: the representations of 'capital A with tilde' and 'currency sign')
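
The junk described in the last bullet is what you get when the UTF-8 bytes of 'a diaeresis' are read as single-byte Windows-1252. A minimal Python sketch of that effect follows. The helper name `misread` is made up for illustration, and where in the SynCE chain the misinterpretation actually happens is not established here:

```python
def misread(text: str, stored_as: str = "utf-8", read_as: str = "cp1252") -> str:
    """Encode text with one codec and decode the same bytes with another."""
    return text.encode(stored_as).decode(read_as)


a_diaeresis = "\u00e4"                      # LATIN SMALL LETTER A WITH DIAERESIS
wire_bytes = a_diaeresis.encode("utf-8")    # two bytes: 0xC3 0xA4
on_device = misread(a_diaeresis)            # U+00C3 (A with tilde) + U+00A4 (currency sign)
```

One character turning into exactly these two is a strong hint that UTF-8 is being passed through where a single-byte code page is expected.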

    == Investigations ==
    The encoding is somewhat more complex as experiments showed:

    Entries for todos (and events and contacts) are stored as wide chars (2-byte chars) in UTF-16-LE on the device -- at least the character codings of all non-ASCII chars used in the examples support this. The notable exception is the notes in these entries (what becomes the DESCRIPTION fields in VCALENDAR and VCARD): these are stored as blobs and have a different encoding depending on the content.

    If (and only if) all non-ASCII chars can be represented in Windows Code Page 1252, this encoding is used. It's the M$ mockup of ISO-latin-1: mostly identical to ISO-latin-1 but with a few additional chars, notably the EURO sign in a different position (which in turn is identical to cp 1250 -- thus I was misled ...).

    If there are any chars outside Windows Code Page 1252 a completely different encoding is used (even for the chars representable within cp 1252!). This encoding is very similar to UTF-8 but slightly different. To the best of my knowledge it is not CESU-8 either. I'm clueless ... Any hint is appreciated.
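
To narrow down how the unknown encoding differs from UTF-8, it may help to find the first byte at which a strict UTF-8 decoder gives up. The sketch below is only a diagnostic aid, not part of SynCE. `describe_blob` is a hypothetical helper, and it assumes the note blob has already been cut out of the raw record, whose layout is not described in this ticket:

```python
def describe_blob(blob: bytes) -> str:
    """Say whether a note blob is ASCII, strict UTF-8, or where UTF-8 decoding fails."""
    try:
        blob.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        blob.decode("utf-8")
        return "valid UTF-8"
    except UnicodeDecodeError as err:
        # err.start is the offset of the first byte the decoder rejected.
        return "not UTF-8: byte 0x%02x at offset %d (%s)" % (
            blob[err.start], err.start, err.reason)


latin_note = "M\u00fcller".encode("cp1252")   # single-byte note, as on a cp1252 device
report = describe_blob(latin_note)
```

Comparing the reported offsets with the characters visible in the PDA screenshots should show which code points the device encodes differently.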

    I attach a tarball with a few crafted examples with different non-ASCII char classes in different fields in raw format (as delivered by 'rra-get-data' -- you should be able to upload these to a PDA with 'rra-put-data' and make your own tests with them), their transformations into VCALENDAR and screenshots from the PDA showing their actual glyphs used on the PDA.

    You will find that the initial conversion from the raw PDA data into the VCALENDAR and VCARD formats, done with 'rra-task-to-vtodo' (based on the code in 'librra'), is already (and identically) broken, causing aftereffects in OpenSync -- so I believe OpenSync is exculpated ...

    PS: Besides the problem described, which would actually be a show-stopper for my main use-case, I consider synce quite impressive.

     
  • Mark Ellis
    2009-02-04

    These are similar-looking problems with completely different causes. Thomaus, with WM2003 and the 'legacy' plugin, the problem is likely in the plugin or librra. Vadim, your problem is most likely somewhere in sync-engine.

    Thomaus, excellent report! But I can't see most of the attachments you mentioned. I haven't looked much at the legacy plugin yet, but I have noticed some places where I would expect the conversion to be called with a UTF8 flag and it isn't. I've attached a patch to point out where these places are; it is, however, untested.

    Vadim, could you perform a sync that shows the encoding loss, capture the output of msynctool and sync-engine, and post it here?

     
  • Mark Ellis
    2009-02-04

    Thomaus, I see you've opened a new bug, I'll take it there, since it is a different problem.

     
  • Vadim Belman
    2009-02-04

    sync-engine & msynctool outputs

     
    Attachments
  • Vadim Belman
    2009-02-04

    Mark, I have attached full session output for sync-engine (-vDEBUG) and msynctool.

     
  • Mark Ellis
    2009-02-04

    So, just checking what I'm looking at: you added a contact on WM called "Test Someone" with non-Latin characters in the notes, and after the sync it is incorrect on the PC, is that right?

    In the sync-engine output you can see what sync-engine is sending to opensync; search for "Test, Someone". Can you confirm whether the contents of the notes are correct at this point?

     
  • Vadim Belman
    2009-02-05

    For the first part - you're right. Notes are filled with Cyrillic letters.

    For the second part - sync-engine is sending the text in incorrect encoding.

     
  • Mark Ellis
    2009-02-05

    I'm just double checking, because encodings are a pain. And I'm going to describe rather than copy/paste. And please forgive me, I'm terrible at languages :)

    In the sync-engine log, in the notes field that sync-engine is sending to opensync, I can see a single word five characters long. It would appear to be an uppercase E with an umlaut, something that looks like a lowercase b with the 'tail' of the upright extending further down, an a with an accent, an i capped with a ^, and an a with a small circle above it.

    If this is not what you were expecting, can you try and tell me what it should be.

     
  • Vadim Belman
    2009-02-05

    Heh, being terrible in languages - it's my job. Ok? ;)

    Unfortunately, I don't remember what's been put into notes in the previous test. So, I did another one. Here is what I get from sync-engine:

    Òåñò

    But in fact it should be:

    Тест

    I have put both variants into a single file like this:

    --- cut ---
    Тест
    Òåñò
    --- cut ---

    And fed it to hexdump:

    <pre>
    00000000 d0 a2 d0 b5 d1 81 d1 82 0a c3 92 c3 a5 c3 b1 c3 |................|
    00000010 b2 0a |..|
    </pre>

    I know nothing about UTF/Unicode, but here it is clearly visible that, instead of the proper d0/d1 prefix, the characters are encoded with c3, which represents Latin letters with umlauts/accents/etc.

    And what I am afraid of is that WM5/6 uses cp1251 (or some other non-Unicode encoding) internally for keeping notes. Then this is a BIG problem.
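
The hexdump is consistent with the cp1251 guess. The four Cyrillic letters are the single bytes d2 e5 f1 f2 in cp1251, and reading those same bytes as Latin-1 gives exactly the garbled word. A short Python check of that reasoning follows. `undo_mojibake` is a made-up helper name, and whether the wrong step assumed Latin-1 or cp1252 cannot be told from this sample, since the two agree on these four bytes:

```python
def undo_mojibake(garbled: str, misread_as: str = "latin-1",
                  actual: str = "cp1251") -> str:
    """Recover text whose bytes were decoded with the wrong single-byte codec."""
    return garbled.encode(misread_as).decode(actual)


garbled = "\u00d2\u00e5\u00f1\u00f2"     # the word received from sync-engine
expected = "\u0422\u0435\u0441\u0442"    # the Cyrillic word entered on the device

garbled_hex = garbled.encode("utf-8").hex()     # matches the second hexdump group
expected_hex = expected.encode("utf-8").hex()   # matches the first hexdump group
recovered = undo_mojibake(garbled)
```

Here `recovered == expected` holds, so in this sample nothing was lost: the bytes were only labelled with the wrong code page before being converted to UTF-8.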

     
  • Vadim Belman
    2009-02-05

    I have done one more test which seems to prove the non-UTF hypothesis. This one doesn't even require syncing. After inserting some Greek characters into notes and saving changes I've got nothing but several question marks in the field.

    So the only remaining question is how to find out which particular encoding is used. I'm sure that it does not change across records but remains the same system-wide.

     
  • Mark Ellis
    2009-02-05

    Sorry, I didn't quite understand that last one.

    You put some Greek characters in the notes on the device? And they don't show on the device afterwards?

    That would suggest there isn't much we can do about it.

    Can I just check: what version of opensync are you using?

     
  • Vadim Belman
    2009-02-05

    You've got me correctly. The point is that the device itself is localized to a Cyrillic language (Russian/Ukrainian in particular). Therefore it accepts Cyrillic letters as well as Latin ones. Any attempt to insert letters from other languages simply fails.

    My proposal for this situation would be an option to explicitly specify the mobile's codepage when a new partnership is being established.

    I'm using libopensync-0.22-150.30.

     
  • Mark Ellis
    2009-02-05

    I think the problem is specifically located in librtfcomp.

    The notes field is passed in airsync to sync-engine in compressed RTF, presumably in the device's local encoding. This is then passed to librtfcomp, where a comment above the function in question reads

    "Convert an RTF-encoded string to a UTF-8 encoded one. This is crude - we just assume that the RTF encoding is ANSI and that all characters are either raw, or escaped out with the backslash/apostrophe sequence"

    Which, with my limited understanding, is saying this won't work.
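
For illustration, here is a rough sketch of a code-page-aware conversion, written in Python rather than librtfcomp's own language. It is not the librtfcomp code. `decode_hex_escapes` is a hypothetical helper that handles only the backslash/apostrophe escapes, and it assumes the RTF coming from the device declares its code page with the standard `\ansicpgN` control word, which has not been verified for this device:

```python
import codecs
import re

_ANSICPG = re.compile(r"\\ansicpg(\d+)")
_HEX_RUN = re.compile(r"(?:\\'[0-9a-fA-F]{2})+")


def decode_hex_escapes(rtf: str, fallback: str = "cp1252") -> str:
    r"""Replace runs of \'hh escapes with text decoded via the \ansicpgN code page.

    Control words and raw 8-bit characters are left untouched.
    """
    declared = _ANSICPG.search(rtf)
    codec = "cp" + declared.group(1) if declared else fallback
    try:
        codecs.lookup(codec)
    except LookupError:
        codec = fallback  # code page not known to Python, use the fallback

    def decode_run(run: "re.Match") -> str:
        # Collect the hex digits of one run of escapes into a byte string.
        raw = bytes.fromhex(run.group(0).replace("\\'", ""))
        return raw.decode(codec, errors="replace")

    return _HEX_RUN.sub(decode_run, rtf)


sample = r"{\rtf1\ansi\ansicpg1251 \'d2\'e5\'f1\'f2}"
decoded = decode_hex_escapes(sample)
```

Decoding whole runs of escapes at once, rather than one byte at a time, also keeps multi-byte code pages intact. If the device's RTF carries no `\ansicpgN`, the code page would have to come from elsewhere, for example a per-partnership setting.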

    I'll pass this around to see if anyone can help.

     
  • Vadim Belman
    2009-02-05

    Finally, I'd like to explain why this problem is more than just a user inconvenience.

    As you may know, some PIM managers use notes for keeping out-of-band information which doesn't fit the standard contact/calendar entry structure. For example, it could be a second mobile phone number (which is not business, home or whatever else Microsoft allows us to enter), or references to associated records in a PIM database, etc. Overall this is not a problem with Latin-localized devices, but it may seriously impact people using different alphabets, because that information would be either corrupted or lost.

     
  • ThoMaus
    2009-02-05

    vrurg -- it is possible to insert chars from other than the device-native script into the notes on the PDA.
    See ticket ID: 2562067 -- my device is natively using Windows Code Page 1252, and I can insert Cyrillic script into the notes, but they are represented in a UTF-8-like encoding which is only similar but not identical to UTF-8.

    Perhaps you can make it clearer what you see and mean by taking the same approach I did for ticket 256206 -- use 'kcemirror' (which is based on the synce packages) to get screenshots from your PDA. That way we can see how the chars are displayed on your PDA. If you can then also provide the raw data from the device via rra-get-data, this should help a lot along the way.

     
  • Vadim Belman
    2009-02-05

    Thanks Mark! I'm looking forward to a fix.

     
  • Vadim Belman
    2009-02-05

    thomaus, I'll try to do it as soon as I get a bit more spare time.

     
  • Vadim Belman
    2009-02-05

    Can anyone point me to a good description/documentation of rra-get-data? kcemirror doesn't work for me either.

     
  • Mark Ellis
    2009-02-05

    Seconding David's comment: rra is for WM2003 and earlier; the only thing it does for WM5 and later is the file sync component.

    I'm not sure how to do the same thing for WM5 as rra-get-data would have done though.

    What's the problem with kcemirror?

     
  • Vadim Belman
    2009-02-05

    Aha, it probably explains why my Touch Pro's activesync is dying. :)

    kcemirror simply does nothing and exits with status 1. I tried connecting by device name (as listed by list_partnership) and by the device's IP address.

     