Microsoft Works format import library / Bugs / #13 German umlauts are converted too much

alonso laurent - 2015-01-02

Hello,
normally, the first problem should be fixed in libwps-0.3.1 by https://sourceforge.net/p/libwps/code/ci/41dfedf3d25025fe1dc91e7a01d83a54d3259476 (see attachment tt.ods): this file uses the LICS encoding, which is now detected. However, the Windows DOS codepage does not appear in the file, so a default one is still used.

Note: this file does contain a chart named Wachstum, but at present we do not reconstruct charts that appear in the file :-~

tt.ods


Anonymous - 2016-12-11

Hi, looking at the source code of libwps_tools_win.cpp
0.4.x, the conversion of LICS characters to Unicode
appears to be partially broken:

In Font::LICSunicode() the code converts all codes >= 0x80
according to a 128 byte translation table named LICS[],
and the resulting byte value is then stuffed into another
conversion routine named unicode(). unicode(), however,
erroneously carries out an OEM codepage to Unicode conversion
(with the OEM codepage being a variable input). This does
not make sense, as the LICS->Unicode conversion is fixed
and does not depend on an OEM codepage.
Further, if I assume codepage 850 (as was still hardwired
in the older 0.3.1 version of the file), carrying out the
conversion through codepage 850 results in the correct
Unicode values for some of the LICS codes, but not for all.
For illustration, these are the errors in the LICS code
range 0xD0 to 0xFE:

0xD7: code results in U+00D7, but should result in U+0152.
0xDD: code results in U+00DD, but should result in U+0178.
0xDE: code results in U+00FE, but should result in U+00DE.
0xE4: code results in U+00F6, but should result in U+00E4.
0xF7: code results in U+00F7, but should result in U+0153.
0xFE: code results in U+00DE, but should result in U+00FE.

Please compare your conversion with the LICS translation
documented in the article "Lotus International Character Set"
in the English Wikipedia, which is based on a printed
LICS table in a HP 95LX manual. While some of the codes
in the range 0x80 to 0xBF are still unknown or are
ambiguous, the conversion of codes in the range 0xC0
to 0xFE to Unicode is clearly unambiguous, so the Wikipedia
article can be used as a reference for these codes.

Hope this helps to improve the Lotus 1-2-3 file importer.

Matthias
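
The point above, that LICS->Unicode is a fixed mapping with no OEM codepage involved, can be illustrated with a minimal sketch. The names licsToUnicodeSketch and checkReportedCodes are hypothetical and are not libwps functions. Only the six codes listed in the comment are filled in, and treating codes below 0x80 as plain ASCII is an assumption. A real implementation would need the complete table taken from a LICS reference.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not taken from libwps: maps one LICS byte straight
// to a Unicode code point, without going through any OEM codepage.
// Returns false for codes this sketch does not cover.
bool licsToUnicodeSketch(std::uint8_t code, char32_t &out)
{
    if (code < 0x80)
    {
        // Assumption: the lower half is plain ASCII and passes through.
        out = code;
        return true;
    }
    switch (code)
    {
    // Only the six values reported in the comment above.
    case 0xD7: out = 0x0152; return true;
    case 0xDD: out = 0x0178; return true;
    case 0xDE: out = 0x00DE; return true;
    case 0xE4: out = 0x00E4; return true;
    case 0xF7: out = 0x0153; return true;
    case 0xFE: out = 0x00FE; return true;
    default:
        return false; // not covered here; needs the full LICS table
    }
}

// Usage example: checks the expected values reported in the comment.
void checkReportedCodes()
{
    char32_t u = 0;
    assert(licsToUnicodeSketch(0xD7, u) && u == 0x0152);
    assert(licsToUnicodeSketch(0xE4, u) && u == 0x00E4);
    assert(licsToUnicodeSketch(0xFE, u) && u == 0x00FE);
}
```

Because the result depends only on the input byte, the same file decodes identically whatever codepage the user or the operating system has selected.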

- alonso laurent - 2016-12-12
  
  Hello,
  currently this code takes inspiration from the comments of https://bugs.documentfoundation.org/show_bug.cgi?id=87222 (a), so if you run
  
  wks2ods --encoding CP437 input.WKS result.ods
  
  or choose Western Europe/OS2-437/US in LibreOffice, the result should be better.
  
  Notes:
  
  (a) which seems to indicate that we must do two passes: LICS -> CP437/local encoding -> Unicode

  I will replace these two passes with a single one, probably this weekend.
  
  Last edit: alonso laurent 2016-12-12
  

Anonymous - 2016-12-19

Hi Alonso,
this won't work, as LICS has nothing to do with codepage 437. As far as I know, Lotus 1-2-3 Release 1.x supported "ASCII" files only (but didn't strip off bit 7, so in practice users relied on 8-bit OEM codepages, although this was undocumented). Lotus 1-2-3 Release 2.0 introduced LICS, and since 2.01 it has allowed the user to choose between "ASCII" and LICS in setup. I don't know whether there is a tag or attribute inside the file format to indicate which mode was chosen, or whether the user just had to set up the program accordingly. Perhaps this info is also associated with the various filename extensions; this would need some research. (BTW, Japanese versions of Lotus for the NEC PC-98 used WJ1/WJ2/WJ3/WJ4 file extensions instead of WKS/WK1/WP2/WK3/WK4: probably a very similar file format, but obviously with some differences in detail as well.) Apparently, various third-party spreadsheets did not support LICS but just continued to use "ASCII", which actually means that they used whatever codepage the underlying operating system had selected as its 8-bit OEM codepage (often 437 or 850, but a whole bunch of other OEM codepages exist). With Lotus 1-2-3 Release 3.x they switched to yet another character set and encoding, LMBCS, which is also discussed in the English Wikipedia now (although the discussion is not complete at present).

Hope it helps.

Greetings,

Matthias

- alonso laurent - 2016-12-22
  
  Hello, I have changed the code to do the LICS conversion in one pass:
  https://sourceforge.net/p/libwps/code/ci/830ac17bddd0c39faa8f51b951c6e112a08fd6ed/
  
  Note: do you have any WJ1-WJ4 files, or any files that use an LMBCS encoding?
  
  - alonso laurent - 2017-09-26
    
    Hello,
    concerning the chart, I added basic code to retrieve it in a separate sheet (it is not perfect, but it may be sufficient...)
    

alonso laurent - 2017-09-26

status: open --> closed-fixed