SynCE / Bugs / #167 Errors in translation of non-ASCII characters

ThoMaus - 2009-02-03

annotated msynctool log file for syncing todos with different kinds of non-ASCII content

msynctool_excerpt.log

ThoMaus - 2009-02-03

Raw and transformed tasks exercising varying charsets and triggering the defect, together with screenshots of how the data was originally displayed on the PDA

examples.tgz

ThoMaus - 2009-02-03

Excuse the mess -- actually I wanted to add files and comments to ID 2526454, because I consider it related, but I have to familiarize myself with the interface first.

I restate my defect description here:

I think the attachments give sufficient evidence the root cause is in SynCE.

== Environment ==
* SynCE -- issue occurs with 0.12 and 0.13
* OpenSync 0.22 (Plugins: kdepim synce-legacy)
* Kontact 3.5.10
* OpenSuSE 11.1
* PDA with WinCE aka WM2003 (M$ PocketPC Version '4.20.0 Build 14053';
Processor 'ARM S3C2410'; Model+ROM-ID 'BM-6280 V1.00 GER B16')

== Observations ==
* Sync between KDE-PIM and PDA is running fine, as long as no
characters outside the ASCII set are used (on the PDA).
* As soon as the PDA data contains any non-ASCII characters, e.g. umlauts and the
like, the conversion engine of OpenSync fails with 'invalid utf8 passed to
VFormat. Limping along.' The individual resulting XML data field is cut off
at the position of the non-ASCII character -- the XML structure is
undamaged. (An annotated 'msynctool trace' is attached -- annotations are
indicated by a hash mark '#' at line start.)
* The entries ending up in Kontact's databases (std.ics and std.vcf)
are encoded as UTF8 and the data is cut off where the intermediary XML data
was cut off.
* Trace output from msynctool suggests that the intermediary vcal
or vcard data is in Windows code page 1252 encoding (which is very similar
to ISO-latin-1) -- more complex detail below ...
* A pcap traffic capture on the ppp0 interface shows that the data
from the PDA is encoded as shown by the msynctool trace, with the notable
exception that on the wire all characters are encoded as 2-byte units
(UTF-16-LE), apart from the note or comment fields of contacts, todos or
events, which are encoded as single-byte characters (typically in Windows
Code Page 1252, but see below).
* Non-ASCII characters traveling in the opposite direction, i.e. from KDE PIM to the
PDA, are:
o UTF-8-encoded in the PIM databases
o UTF-8-encoded in the intermediary vcal (for VEVENT or VTODO)
or vcard (contacts)
o HTML-encoded Unicode in the intermediary XML (i.e. 'a
diaeresis' becomes a character entity, displayed here as ä)
o junk on the PDA (i.e. an 'a diaeresis' is displayed as two
characters: the representations of 'capital A with tilde' and 'currency
sign')
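
The junk described in the last item is consistent with UTF-8 bytes being displayed as if they were Windows Code Page 1252. A minimal Python sketch of that assumption (an illustration only, not SynCE code):

```python
# Illustration only: UTF-8 bytes misread as Windows code page 1252.
text = "\u00e4"                        # 'a diaeresis'
utf8_bytes = text.encode("utf-8")      # two bytes: 0xC3 0xA4
misread = utf8_bytes.decode("cp1252")  # each byte shown as its own character

print(utf8_bytes.hex())  # c3a4
print(misread)           # capital A with tilde, followed by the currency sign
```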

== Investigations ==
As experiments showed, the encoding is somewhat more complex:

Entries for todos (and events and contacts) are stored as wide chars
(2-byte chars) in UTF-16-LE on the device -- at least the character codings
of all non-ASCII chars used in the examples support this. The notable
exception is the notes in these entries (which become DESCRIPTION fields in
VCALENDAR and VCARD): these are stored as blobs and have a different
encoding depending on the content.

If (and only if) all non-ASCII chars can be represented in the Windows
Code Page 1252, this encoding is used. (It's the M$ mockup of ISO-latin-1 --
mostly identical to ISO-latin-1, but with a few additional chars,
notably the EURO sign in a different position; this in turn is identical to
cp 1250 -- thus I was misled ...)
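
The relation between these code pages can be checked with Python's built-in codecs. This is only an illustrative sketch; the codec names are Python's and say nothing about how SynCE names them.

```python
# Illustration only: compare how three single-byte encodings treat byte 0x80.
EURO_BYTE = b"\x80"

as_cp1252 = EURO_BYTE.decode("cp1252")   # EURO SIGN, U+20AC
as_cp1250 = EURO_BYTE.decode("cp1250")   # EURO SIGN as well
as_latin1 = EURO_BYTE.decode("latin-1")  # C1 control character U+0080

# In the range 0xA0-0xFF, CP1252 and ISO-8859-1 map every byte identically.
upper_range_identical = all(
    bytes([b]).decode("cp1252") == bytes([b]).decode("latin-1")
    for b in range(0xA0, 0x100)
)

print(repr(as_cp1252), repr(as_cp1250), repr(as_latin1), upper_range_identical)
```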

If there are any chars outside Windows Code Page 1252 a completely
different encoding is used (even for the chars representable within cp
1252!). This encoding is very similar to UTF-8 but slightly different. To
the best of my knowledge it is not CESU-8 either. I'm clueless ... Any hint
is appreciated.

I attach a tarball with a few crafted examples with different non-ASCII
char classes in different fields in raw format (as delivered by
'rra-get-data' -- you should be able to upload these to a PDA with
'rra-put-data' and make your own tests with them), their transformations
into VCALENDAR and screenshots from the PDA showing their actual glyphs
used on the PDA.

You will find that the initial conversion from the raw PDA data into the
VCALENDAR and VCARD formats, done with 'rra-task-to-vtodo' (based on the
code in 'librra'), is already (and identically) broken, causing aftereffects
in OpenSync -- so I believe OpenSync is exculpated ...

I consider the issue relatively urgent, as it is likely to prevent any user relying on non-ASCII characters in PIM data (and that should be a significant percentage!?) from syncing the PDA with a Linux environment. (Perhaps this is limited to WM2003 users?)

PS: Besides the problem described, which would actually be a show-stopper
for my main use-case, I actually consider 'synce' quite impressive.

Mark Ellis - 2009-02-04

The problem is likely in the plugin or librra.

ThoMaus, excellent report! I haven't looked much at the legacy plugin yet, but I have
noticed some places where I would think the conversion should be called with a UTF8 flag and it isn't. I've attached a patch that points out these places; it is, however, untested.

Mark Ellis - 2009-02-04

File Added: legacy-utf8.diff

Mark Ellis - 2009-02-04

legacy-utf8.diff

Mark Ellis - 2009-02-04

assigned_to: nobody --> mark_ellis

ThoMaus - 2009-02-05

Sorry for the delay: I'm stuck with testing your diff as the source package does not configure ...
It looks like outdated parameters are passed to gcc during the configure tests, but I have to be wide awake again before tackling this AND have a little bit of spare time -- it can take 1 or 2 days ...

PS:
Actually I wondered all the time what I was overlooking when I searched for what would trigger all those UTF conversions in librra. Now I know: your diff ;-)

ThoMaus - 2009-02-05

Extensive reports will follow, but we are definitely on a promising track!

I have to do some more testing, but it seems that umlauts and cyrillic chars will travel unharmed for all fields but notes (DESCRIPTION) from the PDA through OpenSync into e.g. KDEPIM.
(I have no idea what would happen if a UTF-16-LE surrogate representation is used -- probably we need someone accustomed to Chinese chars to test this reliably ...)

DESCRIPTIONs will travel in that direction unharmed if and only if they contain only chars representable as 1 byte in whatever the native encoding of the PDA is. (The discussion for ID 2526454 -- which I still, and now with more conviction, consider related to our issue here -- hints that the PDA there uses Windows Code Page 1251 (the cyrillic one!) as native encoding; mine uses 1252 (the nearly-latin1 one). As librra will interpret this as latin1 encoding, the cyrillic sample text 'Test' (transcribed to latin) that the user vrurg used is transformed into the latin1 interpretation of the codepoints -- notice that the upper and lower case cyrillic 'T' became upper and lower case 'O accent grave'.)
Obviously, when there is any char outside the native code page, this very curious encoding happens, which is similar to but different from UTF8 -- this leads to omission of the DESCRIPTION altogether.
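
The 'O accent grave' effect can be reproduced with a small Python sketch. It assumes, as suggested above, that the device sends CP1251 bytes and the receiving side decodes them as latin1; the sample word is written out from the transcription 'Test' and is not taken from vrurg's actual data.

```python
# Illustration only: CP1251 bytes misinterpreted as ISO-8859-1 (latin1).
sample = "\u0422\u0435\u0441\u0442"   # cyrillic word transcribed as 'Test'
raw = sample.encode("cp1251")         # bytes as a CP1251 device would store them
misread = raw.decode("latin-1")       # what a latin1 assumption makes of them
recovered = raw.decode("cp1251")      # decoding with the right code page

print(raw.hex())   # d2e5f1f2
print(misread)     # starts with capital O grave, ends with small o grave
print(recovered == sample)
```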

It'll probably take some days to prepare detailed reports of the observations and craft some test-cases ...

Mark Ellis - 2009-02-05

The cause of the notes field problem may be the same, or definitely related, but the solution will still be different.

Assuming I'm on the right track, vrurg's problem is in librtfcomp, which converts the notes field from compressed rtf to utf8 text, on, and this is the significant bit, WM5 and later.

You, on the other hand, are WM2003, and syncing uses rra, not airsync. I have as yet not worked out what format the 2003 notes field is in, but it is converted by librra, not rtfcomp.

Mark Ellis - 2009-02-05

Ok, in librra, the notes field from the device is basically assumed to be iso-8859-1, and conversion will give up if it finds an invalid character.

It would actually be not too hard to fix, if we knew what encoding each device was actually sending.

If you want to play around, the notes field is converted in the function rra_contact_to_vcard2() in contact.c, look for the section starting ID_NOTE. This calls convert_to_utf8() in common_handlers.c, which specifies a from code of CHARSET_ISO88591, defined at the top of the file as the encoding string for iconv. I would speculate if you changed this to an encoding suitable for your device, it may work. If it does, we may be able to set up something using environment variables perhaps ?

ThoMaus - 2009-02-05

Thank you for the clarification on the difference between WM5+ and WM2003!
It helps me significantly to understand and sort out components and their relationships.

The notes field is NOT in ISO88591 aka ISO latin 1 but in Windows Code Page 1252 (for my device, probably CP1251 for vrurg's device). Perhaps some configuration option would be helpful for this.
To complicate the issue, the notes field changes to a completely different encoding, which I cannot identify despite some hours of research, whenever there is any character outside the device's code page -- it seems quite similar to UTF8 but differs:
There the 'Euro Sign', which is 0x80 in CP1252, becomes 0xAC82CE (in UTF-8 it would be 0xE282AC, which is similar when you read the last 2 bytes right-to-left; and it's NOT CESU-8, another obscure candidate).
We could fix the transformation of the native code page, but I'm clueless what this pseudo-UTF-8 encoding is ...
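
The byte comparison above can be spelled out in a few lines of Python. This only restates the arithmetic and does not identify the encoding. Note that CESU-8 differs from UTF-8 only for characters outside the Basic Multilingual Plane, so for U+20AC it would yield the same three bytes as UTF-8.

```python
# Illustration only: compare the observed Euro bytes with the UTF-8 encoding.
observed = bytes.fromhex("AC82CE")         # bytes reported for the Euro sign
expected_utf8 = "\u20ac".encode("utf-8")   # e2 82 ac

reversed_observed = observed[::-1]         # ce 82 ac
shared_tail = reversed_observed[1:] == expected_utf8[1:]
first_bytes = (reversed_observed[0], expected_utf8[0])

print(expected_utf8.hex(), reversed_observed.hex(), shared_tail)
print([hex(b) for b in first_bytes])       # only the lead byte differs
```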

By the way, slightly off topic:
I've noticed that there is an 'airsync.dll' in WM2003, too.
And actually, if I start 'synce-engine' there is some initial exchange "wapi provisioning ..." with the WM2003 device, initiated by the linux system and answered by the PDA, but then not continued.
Is it known whether there is a chance to base an AIRSYNC exchange on this?
(It would allow unifying the code base again, which might be preferable?)

Mark Ellis - 2009-02-05

That's an insane way of going about things, I'd love to know who came up with that idea !

The euro thing looks familiar. In librra's convert_to_utf8(), the only special case code is to fix the euro. It replaces any occurrence of 0xC280 with 0xE282AC.
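
As a rough Python sketch of what that amounts to (the function name is hypothetical; the real code is C and uses iconv): decoding as ISO-8859-1 turns byte 0x80 into U+0080, whose UTF-8 form is 0xC280, and the special case then swaps in the UTF-8 Euro sign. Other CP1252-only characters, such as the ellipsis at 0x85, are not covered by this fix.

```python
# Illustration only: latin1-to-UTF-8 conversion with a Euro special case.
def latin1_to_utf8_with_euro_fix(raw: bytes) -> bytes:
    utf8 = raw.decode("latin-1").encode("utf-8")
    # U+0080 (c2 80 in UTF-8) is replaced by the Euro sign (e2 82 ac).
    return utf8.replace(b"\xc2\x80", b"\xe2\x82\xac")

euro_fixed = latin1_to_utf8_with_euro_fix(b"5 \x80")
ellipsis_result = latin1_to_utf8_with_euro_fix(b"\x85")
ellipsis_via_cp1252 = b"\x85".decode("cp1252").encode("utf-8")

print(euro_fixed.hex())            # 3520e282ac
print(ellipsis_result.hex())       # c285, still a control character
print(ellipsis_via_cp1252.hex())   # e280a6, the real ellipsis
```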

Don't know about airsync.dll on wm2003. Up until recently I've concentrated on the connection stuff rather than syncing, so I'm just getting into it. If you're interested I'd recommend joining the mailing lists.

ThoMaus - 2009-02-17

I did some further investigation into the encoding of the "notes" or "description" fields -- find attached a tarball with a report and a few sample files as well as a manifest file describing the content of the tgz. The report is best read with vim in outline mode ...
In this comment I mainly give conclusions, ask for your position on some questions and raise a few requests for testing by persons with different equipment to shed light on some open issues.
Further test results will follow in due time ...

== Summary and Conclusions ==

First of all: Actually things are even more insane than imagined so far -- lo and behold, I saw the "Gates to Hell":

* the notes field -- as opposed to all other fields -- is definitely coded in the "native" code page of the PDA when all chars can be encoded thus -- Windows CP 1252 for mine. A good indicator is the "ellipsis" character, which is absent from the ISO latin char-sets.

* Via hex-editing I crafted a task entry with a notes field of type LPWSTR, like all the other fields, and filled it with some UTF-16-LE text. According to 'rra-decode' it decodes correctly. When loading this entry on the PDA, the task displays but the note is empty. When entering text into the note, it displays correctly on the PDA, but when fetching the entry from the PDA it contains two notes with the same ID but different types. Syncing both versions fails more or less miserably ...

* When chars outside the native code page are used, notes are encoded in "Pocket Word Ink" format, which according to my tests is identical to "Pocket Word Document" and can be opened with PocketWord when extracted from the note.
The good news: OpenOffice is capable of creating "Pocket Word Document"s and these can be used with PocketWord.
The bad news: Unicode characters inserted in OpenOffice come out as "empty rectangles" in PocketWord -- this might be an issue of not matching the peculiar encoding correctly or of my PDA not having the appropriate glyphs; at the moment I can't decide this.
Worse news: Whenever OpenOffice successfully opens a "Pocket Word Document", it is displayed as empty, even when it works wonderfully with Pocket Word.

== Requests ==
* Can anybody with a WM2003/WinCE PDA residing in a different Code Page (e.g. cyrillic) PLEASE create a task with a note in non-Latin script and kindly provide a screenshot (for cyrillic it's sufficient to tell me the character names or use a Russian word from basic vocabulary) and the raw file via 'rra-get-data'? It would help very much in the search for a code page indicator! (My silent hope is that the field type might indicate the code page ...)

* Is there any known way to find the "native code page" of a device, similar to 'rra-get-timezone' or any knowledge if the PDA has any way to tell this to us automatically?

* Can anybody with a WM5 or WM6 device please try to use my prepared PocketWord files and report if they open correctly and what they look like (e.g. via screenshot)? It would help to see if there are different format versions, and how beneficial further work in this area could be (i.e. how universally could such a solution be applied).

* Concerning the handling of "(Pocket Word) Ink Notes" we can hope for progress on the side of OpenOffice, and even actively raise the issue there. For an interim solution, or if the proprietary format cannot be completely unraveled, we might have to discuss the following options:
** I'm confident that it is possible to rip the text out of such an Ink Note and convert it to Unicode -- but this would be a misrepresentation of the data in the sync partner, say KDEPIM. When the note is changed in KDEPIM, we have to decide if and how we propagate this change to the PDA, as we might have to substitute the original note, thus genuinely running the risk of destroying user-provided information like handwriting, drawings or audio recordings.
** We could also sync an Ink Note as an attachment of the task, but as long as OOo cannot open these, the data would be present but opaque in KDEPIM.
** Finally we could allow notes in KDEPIM which are either not transported to the PDA or not displayed there (see the dual note issue above), again being a misrepresentation of the data, this time on the PDA.
What is the general perception of DOs and DON'Ts in these scenarios?

ThoMaus - 2009-02-17

Investigation and sample files for clarifying the format options for notes

investigation.tgz

ThoMaus - 2009-02-17

Detailed description of the files provided in 'investigation.tgz'

Manifest

ThoMaus - 2009-02-17

The interoperability of M$ products still has surprises for me:
Of all the PocketWordDocuments in the tarball, ActiveSync manages to transfer only
* A_PocketWord_test_doc.psw -- generated on the PDA, containing only ASCII chars
* Note-1.1.pwi -- extracted "Ink Note" from a task entry with Unicode chars
It fails miserably in the conversion routines for all the others.

Oh my ...

ThoMaus - 2009-02-23

Test case investigating sync behavior when an "Ink Note" on the PDA and a note within KDEPIM collide, and how KDEPIM task extensions are handled

test-encoding+input-variants-7.tgz

ThoMaus - 2009-02-23

Test case investigating sync of tasks with non-ASCII-chars from PDA to KDEPIM and discriminators between "flat" and "Ink Notes"

test-encoding+input-variants-6.tgz

ThoMaus - 2009-02-23

A detailed investigation into a specific error case preventing syncs completely

test-encoding+input-variants-5.tgz

ThoMaus - 2009-02-23

Test case investigating sync of tasks containing non-ASCII chars from KDEPIM to PDA

test-encoding+input-variants-3.tgz

ThoMaus - 2009-02-23

Test case delineating failure conditions for sync of tasks containing non-ASCII chars from KDEPIM to PDA

test-encoding+input-variants-2.tgz

ThoMaus - 2009-02-23

As you can see from the file attachment section I did some further testing:

== Test Data provided ==

You'll find 5 directory tar-balls with test data in identical layout.
test-encoding+input-variants-2.tgz
test-encoding+input-variants-3.tgz
test-encoding+input-variants-5.tgz
test-encoding+input-variants-6.tgz
test-encoding+input-variants-7.tgz

As you see, the numbering of the dirs is not consecutive: a few test runs were left out, as their results were either redundant or not providing new insights (or both).

On the other hand you'll find some duplication, e.g. the identical raw PDA tasks in state_before_sync and state_after sync. Besides easing automated analysis and (regression) testing, I actually do not consider this redundant:
Of course it is our expectation that the tasks on the PDA are not changed by syncing when they are completely new for KDEPIM. But this seemingly redundant data provides CERTAINTY; it makes the difference between expecting and knowing ...

== Observations Summary ==

-- Data Format on PDA --
* Task title (VTODO SUMMARY) is represented as LPWSTR, i.e. UTF-16LE (or UCS-2LE, the discriminator being "no surrogate support"). Chars from the ASCII, Latin1 and e.g. cyrillic sets are represented and sync'ed correctly. Higher code-points like Japanese chars seem not to work, but the reason might lie within my input procedures.
* It is quite probable that this is true for all fields represented as LPWSTR, judging from a tentative sync of my productive use data (which is not provided).
* Task notes (VTODO DESCRIPTION) are natively represented in Windows Code Page 1252 on MY device -- the SynCE assumption of ISO latin1 plus work-around for the Euro sign is incorrect (but an approximation).
* Probably -- but this has to be verified! -- devices produced for other countries, e.g. eastern Europe, will have a different native Code Page in use.
* When a task note is forced to contain chars outside this code page (or any handwriting, drawing, or audio recording) it is stored in "Ink Note" format, which seems to be identical to PocketWord format.
* A task note which became "Ink Note" format by some action will remain in "Ink Note" format, even when the root cause for this, e.g. the special char, has been removed.

-- Data Sync from PDA to KDEPIM --
* If a task does not contain a note field of type "blob" -- this can either be a native code page string or "Ink Note" -- synchronization fails completely with an assertion error. (Example: the synthetic task with a synthetic note of LPWSTR type)
* While the task itself syncs, "Ink Note" formatted task notes are not transferred. If the task receives a note in KDEPIM and is sync'ed with the PDA again, the original "Ink Note" on the PDA is lost without collision warning or request to the user!

-- Data Sync from KDEPIM to PDA --
* If a task's note (DESCRIPTION) from KDEPIM contains chars not representable in the code page of the PDA,
the task is not synchronized with the PDA at all!
* If a task's title (and probably any other LPWSTR field) contains a char that is not representable (e.g. Japanese chars), the task is not synchronized with the PDA at all! The exact extent of what is representable has still to be verified.
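
Whether a given note would hit the first failure condition can be checked up front. The following Python sketch is only an illustration with a hypothetical helper name; it assumes the device code page is known and is not something SynCE does today.

```python
# Illustration only: test whether a text fits into a single-byte code page.
def representable(text: str, codepage: str = "cp1252") -> bool:
    try:
        text.encode(codepage)
    except UnicodeEncodeError:
        return False
    return True

print(representable("Gr\u00fc\u00dfe \u2026"))               # umlaut, sharp s, ellipsis
print(representable("\u0422\u0435\u0441\u0442"))             # cyrillic against CP1252
print(representable("\u0422\u0435\u0441\u0442", "cp1251"))   # cyrillic against CP1251
```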

== Open Questions and Support Requests ==

With my equipment (WM2003 aka WinCE, localized for Germany) I cannot clarify with sufficient certainty:
* Are the LPWSTR fields encoded in UTF-16LE or UCS-2LE? To answer this, a WM2003 device owner is needed who is capable of natively entering chars beyond the Unicode base plane. (Only very rarely used Asian scripts and traditional ideograms seem applicable to me -- so maybe this is a non-issue)
* Delineate the representation scope of the LPWSTR fields: I managed to insert cyrillic chars into title fields of tasks and these are not only stored correctly but also display correctly. Although I inserted Japanese chars by the same method, they neither display nor seem to be stored (judging from the raw binary data). Someone with a WM2003 device capable of entering Japanese or Chinese chars into the title of a task could shed light by providing the raw data via rra-get-data.
* My device stores notes to tasks in Windows Code Page 1252 -- but is this true for all WM2003 devices, or do they have other native code pages? Again, somebody with a WM2003 device natively using non-latin script could clarify by providing raw data via rra-get-data.
* Finally I'd consider it helpful to have the identical questions answered for WM5 and WM6 devices, to see if there are changes or if advances would apply to all these devices. So owners of such devices with non-latin script, please contribute too.
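
For the first question, raw field data could be scanned for surrogate code units: their presence would point to UTF-16LE rather than plain UCS-2LE. A minimal Python sketch, with a hypothetical helper name and assuming the LPWSTR payload has already been extracted from the raw record as little-endian 16-bit units:

```python
import struct

# Illustration only: look for UTF-16 surrogate code units in little-endian data.
def contains_surrogates(raw: bytes) -> bool:
    count = len(raw) // 2
    units = struct.unpack("<%dH" % count, raw[: count * 2])
    return any(0xD800 <= unit <= 0xDFFF for unit in units)

bmp_only = "\u0422\u0435\u0441\u0442".encode("utf-16-le")   # cyrillic, base plane
beyond_bmp = "\U00020000".encode("utf-16-le")                # needs a surrogate pair

print(contains_surrogates(bmp_only), contains_surrogates(beyond_bmp))
```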

== Development Issues ==

I could solve my specific problem with task note encoding by hard-coding CP1252 in the synce source (and using Mark's diffs -- merci!).

I would prefer to do it the right way and solve the general problem. For this I need support:
* Is there any way to find out what the native code page of a PDA is?
* Failing this, what would be the proper config file to configure this on a per-device basis?
* What is the intended API to use for accessing this config file?

Mark Ellis - 2009-02-23

I haven't had a chance to look at all of the data you've produced, but I'll throw in some points that may or may not be helpful.

I'm fairly confident everything in synchronized items comes from the device as WSTR, except obviously for notes/description. WSTR conversion in libsynce uses UCS-2LE; whether this is correct or not I can't say at this time.

I can quite believe your results from syncing various things in various directions. Without looking at the code, it most likely just decides it can't deal with it and drops it. Whether the drop occurs in synce, or at the device/client because we've sent it junk, will require more investigation.

synce and host client apps will have no knowledge of Ink Note/Pocket Word format, so we'd have to write a conversion routine accessible to librra. I can see headaches here, particularly when sending a notes field from e.g. evolution to the device, and determining whether we first need to somehow encode it to Ink Note.

Forget WM5 and WM6, notes are compressed RTF across the board.

I would agree a very important point at the moment would be to figure out the native code page of a device, at least then we can cover some situations. There is currently nothing in librra to set this per device, or at all. I may be able to knock something up with environment variables, I'll let you know.

Mark Ellis - 2009-02-23

Ok, that seemed straightforward enough. I've attached codepage-env.diff for librra. If you set the environment variable SYNCE_DEV_CP to the iconv name of your codepage (run iconv -l for a list, there's a CP1252 and WINDOWS-1252), it should use that instead of 8859-1. I hope.
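
In Python terms, the intended behaviour could be sketched as follows. Only the variable name SYNCE_DEV_CP and the 8859-1 fallback come from the description above; the function name and everything else are hypothetical, and this is not the content of codepage-env.diff.

```python
import os

DEFAULT_CODEPAGE = "ISO-8859-1"  # fallback when SYNCE_DEV_CP is not set

# Illustration only: pick the source encoding from the environment.
def notes_to_utf8(raw: bytes, environ=os.environ) -> bytes:
    codepage = environ.get("SYNCE_DEV_CP", DEFAULT_CODEPAGE)
    # Raises UnicodeDecodeError if a byte is invalid in that code page.
    return raw.decode(codepage).encode("utf-8")

with_cp1252 = notes_to_utf8(b"\x85", {"SYNCE_DEV_CP": "CP1252"})
with_default = notes_to_utf8(b"\xe4", {})

print(with_cp1252.hex())   # e280a6, the ellipsis
print(with_default.hex())  # c3a4, 'a diaeresis'
```

Passing a plain dict instead of os.environ, as in the two calls above, makes the lookup easy to try out without touching the real environment.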

I actually found a comment in the task_to_vtodo code about Ink format not yet being supported.
