multibyte encodings broken in docbook

Deplate
2009-01-27
2013-04-17
  • PILCH Hartmut

    PILCH Hartmut - 2009-01-27

    When I transform a Japanese UTF-8 file to DocBook, about 10% of the characters come out garbled, and the XML header says lang="en". Transforming the same file to XHTML and LaTeX works fine. I have tried the following, to no avail:

    commandline: -m utf8

    document header

      #lang: ja
      #LANG: ja
      #set id=lang: ja
      #VAR: lang=8
      #VAR: encoding=utf-8
      #VAR: encoding=UTF-8

    Also, it doesn't matter whether I specify dbk-book, dbk-ref or dbk-article on the commandline.

    Some multibyte chars, such as 称, are never displayed correctly and seem to be torn apart and treated as concatenations of single-byte characters.

    As said, formats like html and latex, which unlike docbook are not utf-8 by default, do not have the big utf-8 problems that docbook appears to have.
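The "torn apart" symptom described above can be made concrete with a small Ruby sketch. It is an illustration only, not deplate code: the helper names `utf8_hex_bytes` and `as_single_bytes` are made up for this example. It shows that 称 (U+79F0) is stored in UTF-8 as the three bytes E7 A7 B0, and that a tool which reads those bytes one at a time as ISO-8859-1 sees three separate characters, the middle one being the paragraph sign §.

```ruby
# 称 is U+79F0; written as an escape so the snippet does not depend on the
# encoding of the source file itself.
CHAR = "\u79F0"

# Returns the UTF-8 bytes of a string as uppercase hex pairs.
def utf8_hex_bytes(str)
  str.encode("UTF-8").bytes.map { |b| format("%02X", b) }
end

# Re-reads the same bytes as ISO-8859-1 (one character per byte), which is
# roughly what a byte-oriented tool sees, and converts the result back to
# UTF-8 so that it can be printed.
def as_single_bytes(str)
  str.encode("UTF-8").dup.force_encoding("ISO-8859-1").encode("UTF-8")
end

puts utf8_hex_bytes(CHAR).join(" ")   # the three bytes of the one character
puts as_single_bytes(CHAR)            # the same bytes seen as three characters
```

Any character whose UTF-8 encoding happens to contain the byte A7 would be exposed to the same problem, which may be why only a fraction of the characters are affected.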

     
    • PILCH Hartmut

      PILCH Hartmut - 2009-01-27

      PS. I have of course also tried cjk-encoding=utf-8. I have been trying these alone and in combination.

      And I am using version 0.8.4 from the gem.

       
      • Tom Link

        Tom Link - 2009-01-28

        > cjk-encoding=utf-8

        Where does this variable come from? I don't think deplate can parse variable names with hyphens.

        BTW, you should probably not use underscores in filenames, since deplate (0.8.4/5+) uses the underscore as an "escape" character to encode special characters.

         
    • PILCH Hartmut

      PILCH Hartmut - 2009-01-27

      More details, including sample files, can be found at

        http://a2e.de/adv/deplate/epctM1

       
    • Tom Link

      Tom Link - 2009-01-28

      When I view http://a2e.de/adv/deplate/epctM1/epctM1_pati.ja.xml in firefox, it seems the § character is wrongly converted. I cannot tell about the other characters.

      Ahm, well, the § is supposed to belong to the byte sequence of the character you mentioned (称). The utf-8 "support" is a quick hack. At the moment, I can only suggest filtering the output through sed or similar and replacing the wrongly escaped characters within utf-8 byte sequences. I don't know when I will have time to take a closer look at this.

      Instead of setting the lang variable, you could create a module lang-ja_CN.UTF-8.rb that loads localized messages and sets the language variable.

      p.s.
      I guess you meant to set the variable cjk_encoding that is defined in deplate/zh_CN.rb?

       
    • PILCH Hartmut

      PILCH Hartmut - 2009-01-28

      The quoted file contains far more than one misconverted character.  About 20% of all multibyte characters are misconverted.  What you are quoting here also seems to be misconverted.

      The fact remains that the latex/html/xhtml/php formatters produce perfect multibyte utf-8 (0% misconversion) while the dbk-* formatters do not (20% misconversion).

      I'll be happy to create a set of locale files for Deplate soon, but how will that help solve the dbk-* multibyte problem?

      I'm not actually relying on the cjk_encoding variable. It is mentioned in the deplate manual, and that is why, in my desperation, I tried it out briefly. It is probably useful only for some pre-Unicode CJK locale solutions. There is a Ruby package for improving the UTF-8 support of the Ruby language, with the promise from the chief developer Matz of having this or something like it integrated in Ruby itself soon. Maybe by including that in Deplate you can even eliminate the utf8 module. But anyway, no matter how hackish that module may be, I can't see why it should work in all formatters except docbook/dbk-*, especially when docbook (i.e. XML) files are encoded in utf-8 by default.

      I can't read the utf-8 characters which you are quoting here, nor do I have any idea how to repair, by means of a postprocessor (based on sed or perl or whatever), the stuff that the docbook formatter seems to be breaking. I'd rather suggest that you throw out all support for any charset except UTF-8 and have whoever wishes something else solve his problem with a postprocessor. After all, these encodings are roundtrip-convertible to/from Unicode, and Unicode was introduced in order to allow programmers and localisers to stop bothering with language-specific encodings. I for one will be more than happy to provide localisation for Chinese, Japanese and more languages to/from which my company translates, if I can do that in UTF-8. Also, Docbook is a very important format for our practical work, because it allows us to maintain our text programming freedom while satisfying the demands of large numbers of wordprocessor-captive clients. I could easily justify donating to the Deplate project in view of that.

       
      • Tom Link

        Tom Link - 2009-02-02

        This isn't really a solution but it seems to solve at least the problems with your test document: In the file ${RUBY_LIB}/deplate/xml.rb, comment out the line (around line 26):

        "§"  => Proc.new {|e| symbol_paragraph(nil)},

        This line doesn't really belong there anyway, since § is a non-ascii symbol.
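Why a single "§" entry can damage multibyte text can be sketched as follows. This is a guess at the mechanism, not deplate's actual code: it assumes the replacement table is applied to raw bytes, with "§" stored as the single byte 0xA7. The names `replace_bytewise` and `replace_charwise` and the replacement string `&sect;` are stand-ins invented for the example; the thread does not show what `symbol_paragraph` really emits.

```ruby
# Hypothetical stand-in for whatever the formatter emits for a paragraph sign.
REPLACEMENT = "&sect;"

# Byte-oriented replacement: treats the text as raw bytes and replaces every
# 0xA7 byte, including one that sits in the middle of a multibyte UTF-8
# sequence.
def replace_bytewise(text)
  raw = text.b                              # binary copy of the same bytes
  raw.gsub("\xA7".b, REPLACEMENT.b)
end

# Character-oriented replacement: only a real U+00A7 character is replaced.
def replace_charwise(text)
  text.encode("UTF-8").gsub("\u00A7", REPLACEMENT)
end

sample = "\u79F0"                           # 称, UTF-8 bytes E7 A7 B0
broken = replace_bytewise(sample).force_encoding("UTF-8")

puts broken.valid_encoding?                 # is the result still valid UTF-8?
puts replace_charwise(sample) == sample     # is the character left untouched?
```

Under that assumption, removing the "§" entry stops the byte-level match, which is consistent with the fix above. A character-aware replacement would be the alternative if the entry had to stay.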

         
    • PILCH Hartmut

      PILCH Hartmut - 2009-02-05

      It is surprising to see how commenting out such a line alone completely fixes multibyte output that was 20% broken before.

      The docbook output is now usable as a source format of OOffice and elsewhere.

      Thanks a lot for this fix!

       
    • PILCH Hartmut

      PILCH Hartmut - 2009-02-05

      Btw it is very difficult for me to avoid using underscores in filenames.

      I name files like variables, using only ascii7 alphanumeric characters and underscores, no hyphens, for the filename elements. Of course I also can't help using dots for concatenation, but only where established conventions require this; otherwise my concatenator is '_'.

      I have been using '@' as a prefix for some system files, but even there I am considering replacing that by '_' or '__', as done in programming.

      The underscore also matches \w in Perl regular expressions.  It is the only non-alphanumeric character that does so.  I guess many programmers will revolt against any attempt to remove the underscore from the innermost character set.
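The claim about \w above can be checked with a short sketch. It uses Ruby rather than Perl, to stay with the language deplate is written in, and it only covers the 7-bit ASCII range; Ruby's regular expression documentation defines \w as [a-zA-Z0-9_]. The helper name `non_alnum_word_chars` is made up for this example.

```ruby
# Collects every 7-bit ASCII character that matches \w but is neither a
# letter nor a digit.
def non_alnum_word_chars
  (0..127).map(&:chr).select do |c|
    c =~ /\A\w\z/ && c !~ /\A[A-Za-z0-9]\z/
  end
end

p non_alnum_word_chars
```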

       
