
#13 German umlauts are converted too much

untriaged
closed-fixed
nobody
None
5
2017-09-26
2015-01-01
No

German umlauts do not work in that WKS file. Moreover, I found out that this file has an additional sheet (?), or at least something which is not shown (the word "Wachstum" appears in my hex editor but is not visible anywhere).

For example: "Verkõufe Gesamt" (cell B5) should be "Verkäufe Gesamt"
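
The garbled spelling is consistent with the byte 0xE4 being decoded through DOS codepage 850 instead of as the intended "ä". A minimal Python sketch of that effect, assuming the cell stores "ä" as the single byte 0xE4 (an assumption; the raw bytes of the attachment are not quoted in this report):

```python
# Assumed raw bytes of cell B5, with "ä" stored as 0xE4
# (not confirmed from the attached file itself).
raw = b"Verk\xe4ufe Gesamt"

# Decoding through DOS codepage 850 maps 0xE4 to "õ",
# which reproduces the text seen in the report.
as_cp850 = raw.decode("cp850")

# Latin-1 is used here only as a stand-in: it maps 0xE4 to "ä" (U+00E4),
# the character the cell was meant to contain.
as_latin1 = raw.decode("latin-1")

print(as_cp850)
print(as_latin1)
```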

Tested with LibreOffice Version: 4.3.2.2
Build-ID: edfb5295ba211bd31ad47d0bad0118690f76407d

1 Attachments

Discussion

  • alonso laurent

    alonso laurent - 2015-01-02

    Hello,
    normally, the first problem should be fixed in libwps-0.3.1 by https://sourceforge.net/p/libwps/code/ci/41dfedf3d25025fe1dc91e7a01d83a54d3259476 (see attachment tt.ods): this file uses LICS encoding, which is now detected. However, the Windows/DOS codepage does not appear in the file, so a default one is still used.

    Note: this file does contain a chart named Wachstum, but currently we do not reconstruct charts which appear in the file :-~

     
  • Anonymous

    Anonymous - 2016-12-11

    Hi, looking at the source code of libwps_tools_win.cpp
    0.4.x, the conversion of LICS characters to Unicode
    appears to be partially broken:

    In Font::LICSunicode() the code converts all codes >= 0x80
    according to a 128 byte translation table named LICS[],
    and the resulting byte value is then stuffed into another
    conversion routine named unicode(). unicode(), however,
    erroneously carries out an OEM codepage to Unicode conversion
    (with the OEM codepage being a variable input). This does
    not make sense, as the conversion LICS->Unicode is fixed
    and does not depend on an OEM codepage.
    Further, if I assume codepage 850 (as it was still hardwired
    in the older 0.3.1 issue of the file), carrying out the
    conversion through codepage 850 results in the correct
    Unicode values for some of the LICS codes, but not for all.
    For illustration, these are the errors in the LICS code
    range 0xD0 to 0xFE:

    0xD7: code results in U+00D7, but should result in U+0152.
    0xDD: code results in U+00DD, but should result in U+0178.
    0xDE: code results in U+00FE, but should result in U+00DE.
    0xE4: code results in U+00F6, but should result in U+00E4.
    0xF7: code results in U+00F7, but should result in U+0153.
    0xFE: code results in U+00DE, but should result in U+00FE.

    Please compare your conversion with the LICS translation
    documented in the article "Lotus International Character Set"
    in the English Wikipedia, which is based on a printed
    LICS table in a HP 95LX manual. While some of the codes
    in the range 0x80 to 0xBF are still unknown or are
    ambiguous, the conversion of codes in the range 0xC0
    to 0xFE to Unicode is clearly unambiguous, so the Wikipedia
    article can be used as a reference for these codes.
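
A one-pass conversion of the kind argued for above can be sketched as a direct table lookup. This is a minimal Python sketch, not the libwps code: the names `LICS_TO_UNICODE` and `lics_to_unicode` are hypothetical, and the table is deliberately partial, holding only the six code points listed above. Every other code at or above 0x80 is mapped to U+FFFD as a placeholder.

```python
# Partial LICS -> Unicode table: only the six codes listed in this post.
# A real table would cover the whole 0x80..0xFF range.
LICS_TO_UNICODE = {
    0xD7: 0x0152,  # Œ
    0xDD: 0x0178,  # Ÿ
    0xDE: 0x00DE,  # Þ
    0xE4: 0x00E4,  # ä
    0xF7: 0x0153,  # œ
    0xFE: 0x00FE,  # þ
}


def lics_to_unicode(data: bytes) -> str:
    """Convert LICS bytes to text in a single pass, with no OEM codepage involved."""
    out = []
    for b in data:
        if b < 0x80:
            # Assumption: codes below 0x80 are plain ASCII.
            out.append(chr(b))
        else:
            # Codes missing from the partial table become U+FFFD.
            out.append(chr(LICS_TO_UNICODE.get(b, 0xFFFD)))
    return "".join(out)


print(lics_to_unicode(b"Verk\xe4ufe Gesamt"))
```

Because the lookup does not depend on any codepage setting, the result is the same whatever OEM codepage the user selects.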

    Hope this helps to improve the Lotus 1-2-3 file importer.

    Matthias

     
    • alonso laurent

      alonso laurent - 2016-12-12

      Hello,
      currently this code takes inspiration from the comments of https://bugs.documentfoundation.org/show_bug.cgi?id=87222 (a), so if you do

      wks2ods --encoding CP437 input.WKS result.ods

      or choose Western Europe/OS2-437/US in LibreOffice, the result should be better.

      Notes:

      • (a) which seems to indicate that we must do two passes: LICS -> CP437/local encoding -> Unicode
      • I will replace these two passes with a single one, probably this weekend.
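
One way to see what an intermediate codepage costs in the two-pass route of note (a) is to check which target characters that codepage can represent at all. The sketch below is illustrative Python, not libwps code; the helper name `representable` is hypothetical, and the sample characters are the six discussed earlier in this thread.

```python
def representable(ch: str, codepage: str) -> bool:
    """Return True if ch can be encoded in the given codepage."""
    try:
        ch.encode(codepage)
        return True
    except UnicodeEncodeError:
        return False


# The six characters from the list of discrepancies: Œ Ÿ Þ ä œ þ
samples = "\u0152\u0178\u00de\u00e4\u0153\u00fe"

for cp in ("cp437", "cp850"):
    missing = [f"U+{ord(c):04X}" for c in samples if not representable(c, cp)]
    print(cp, "cannot represent:", missing)
```

Any character missing from the intermediate codepage cannot survive a LICS -> codepage -> Unicode chain, whereas a single lookup table has no such limit.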
       

      Last edit: alonso laurent 2016-12-12
  • Anonymous

    Anonymous - 2016-12-19

    Hi Alonso,
    this won't work, as LICS has nothing to do with codepage 437. As far as I know, Lotus 1-2-3 Release 1.x supported "ASCII" files only (but didn't strip off bit 7, so in practice users relied on 8-bit OEM codepages, although this was undocumented). Lotus 1-2-3 Release 2.0 introduced LICS, and since 2.01 it allowed the user to choose between "ASCII" and LICS in setup.

    I don't know if there is a tag or attribute inside the file format to indicate which mode was chosen, or if the user just had to set up the program accordingly. Perhaps this info is also associated with the various filename extensions; this would need some research. (BTW, Japanese versions of Lotus for the NEC PC-98 used WJ1/WJ2/WJ3/WJ4 file extensions instead of WKS/WK1/WP2/WK3/WK4: probably a very similar file format, but obviously with some differences in detail as well.)

    Apparently, various third-party spreadsheets did not support LICS but just continued to use "ASCII", which actually means that they used whatever codepage was selected as the 8-bit OEM codepage by the underlying operating system (often 437 or 850, but a whole bunch of other OEM codepages exist). With Lotus 1-2-3 Release 3.x they switched to yet another character set and encoding, LMBCS, which is also discussed in the English Wikipedia now (although the discussion is not complete at present).

    Hope it helps.

    Greetings,

    Matthias

     
    • alonso laurent

      alonso laurent - 2016-12-22

      Hello, I have changed the code to do the LICS conversion in one pass:
      https://sourceforge.net/p/libwps/code/ci/830ac17bddd0c39faa8f51b951c6e112a08fd6ed/

      Note: do you have any WJ1-WJ4 files, or any files which use LMBCS encoding?

       
      • alonso laurent

        alonso laurent - 2017-09-26

        Hello,
        concerning the chart, I have added basic code to retrieve it in a separate sheet (it is not perfect, but it may be sufficient...)

         
  • alonso laurent

    alonso laurent - 2017-09-26
    • status: open --> closed-fixed
     
