#26 Export does not write correct UTF-8 signature

Milestone: v1.0.3
Status: closed-fixed
Labels: Core
Version: v1.0.2
Priority: 5
Updated: 2015-01-26
Created: 2005-11-28
Creator: StephanH
Private: No

Some applications do not read the exported files
correctly. Others (notepad for example) do seem to
read them.

When exporting a category list, the file appears to
have UTF-8 contents, but does not have the "EF" (for
intel based processors) two byte beginning, nor the
next 4 byte encoding descriptor. (Perhaps that 4
bytes is optional -- not sure).

Can be duplicated by opening an exported file with
Notepad (on XP) and then rewriting it, explicitly
selecting UTF-8 encoding (or another UTF encoding, if
desired). Binary comparison of the two files will
show the difference.

Discussion

  • StephanH

    StephanH - 2005-11-28

    Beginning 1 word file, rewritten with Notepad.

     
  • StephanH

    StephanH - 2005-12-15

    I found the application that fails with the "original",
    but not with the rewritten version. Microsoft Wordpad 5.1
    (current XP Pro version) has this problem. If I understand
    the UTF-8 specification, then it's really Microsoft's bug,
    but having the prefix codes/characters is allowed
    (optional) in the specification, so I was hoping you'd be
    able to add them.
    If I have it straight, then there are only two
    possibilities: either you're on an Intel-type platform
    (reversed byte encoding), or a Motorola type (Mac?) where
    the bytes are in strict left-to-right ordering. HTH

     
  • bogleg

    bogleg - 2005-12-15

    Ok, I can reproduce the problem with Wordpad as well.
    Opening an exported file with Wordpad shows garbled
    characters unless I rewrite it in UTF-8 using Notepad like
    you said.

    So you mention it's a Microsoft bug. But is there anything
    you can do on the Java side to help mitigate the problem?
    Or should I just leave it as is?

     
  • Michael van den Berg

    • status: open --> closed-rejected
    • Version: --> v1.0.2
    • Milestone: --> any-next
     
  • Michael van den Berg

    Tried with ZDT v1.0.2 and Wordpad on Windows 7.

    Exported a random category, File format: zdt.
    Opened with Notepad: reads just fine (as reported above).
    Opened with Wordpad: is readable, but it shows the wrong Chinese characters. This is regardless of which font (e.g. Arial, Lucida Unicode, Times New Roman) is chosen.

    When opening with Word 2010, it opens a conversion window:
    "Text Encoding: Windows (Default)" and "Text Encoding: MS-DOS" give the same 'wrong' characters as Wordpad. In 'Other encoding' it shows "Chinese simplified (GB2312)" as being selected.
    "Text Encoding: Other encoding: UTF-8" is the only one giving the right result. (Others tried: Big-Endian, UTF-7, and Chinese in all flavors except 'Auto select', which jumps right to UTF-8.)

    It seems that ZDT works as expected. It has chosen to use UTF-8, and Word 2010 recognizes this as such, as does Notepad. Apparently, Wordpad is designed with GB2312 in mind, rather than UTF-8. We could consider letting the user choose which encoding to use in the export (in the import this is already possible), but since this bug is 'old', it will be closed-rejected, because the feature is rejected for now.

    Anyone is free to re-open it if they like to have support for other formats besides UTF-8. In that case, it will be treated as a feature request.

     
  • Michael van den Berg

    Stephan sent a message in:

    On Sat, Aug 30, 2014 at 11:53 PM, Stephan Hodges tabletguy@users.sourceforge.net wrote:
    Even though I'm the originator of the report, there's no possibility of adding comments after closing, and I didn't see how to reopen the bug report. Would have been nice to leave the testing comments for a day before closing.

    Based on your testing report, I think you missed the point of the bug. ZDT is not writing the Utf-8 header. The programs you used for testing don't require the header, but use other means to detect UTF-8 encoding.

    There are still many programs that do require it.

    If you try with Notepad++, you will see that there's an option to save with or without the "BOM" (UTF-8 header). The BOM is a 3-byte header at the start of the file.

    Writing the header is rather easy to do, actually.

     
  • Michael van den Berg

    • status: closed-rejected --> open
     
  • Michael van den Berg

    Stephan, you are right. I did not realize that users couldn't re-open trackers. Sorry for that.

    The UTF-8 standard (RFC 3629, http://tools.ietf.org/html/rfc3629; see also Wikipedia) does not require starting bytes, but a UTF-8 file may start with a BOM (http://en.wikipedia.org/wiki/Byte_order_mark).

    The byte sequence 0xEF BB BF is actually the UTF-8 encoding of U+FEFF, which is the 'BOM'. Note: it is perhaps confusing, at least it was to me, but U+FEFF is the Unicode code point, not to be confused with the hex value of the encoded bytes.

    The standard says that a protocol should not forbid the BOM "when it is expected that implementations of the protocol will not be in a position to always use the mechanisms properly."

    Since the zdt-export can be used by other software, I think we should add the 0xEF BB BF.

    Note that UTF-16 is the default for (most) Windows implementations. This explains why UTF-8 is not always working in applications like Wordpad.

    To avoid confusion: the original post said: ".... does not have the "EF" (for intel based processors) two byte beginning, nor the next 4 byte encoding descriptor. (Perhaps that 4 bytes is optional -- not sure)." Actually, the six hex digits are three bytes that belong together to form the BOM, so it is not a 2+4 split. The "E" in 0xEF BB BF indicates that this is a three-byte Unicode character. The remaining bits of ..F BB BF define which character.
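
    The relation between the code point and its bytes can be checked with a few lines of Java. This is only an illustrative sketch, not code from ZDT: the class name BomBytes and its helper methods are made up for this example. It encodes U+FEFF as UTF-8, prints the resulting bytes, and reads the sequence length from the high bits of the first byte.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only (not part of ZDT).
final class BomBytes {

    /** Encodes the code point U+FEFF as UTF-8. */
    static byte[] utf8Bom() {
        return "\uFEFF".getBytes(StandardCharsets.UTF_8);
    }

    /** Formats bytes as space-separated upper-case hex pairs. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Length of the UTF-8 sequence announced by a lead byte, or -1 if it is not a lead byte. */
    static int sequenceLength(byte lead) {
        int v = lead & 0xFF;
        if (v < 0x80) return 1;             // 0xxxxxxx
        if ((v & 0xE0) == 0xC0) return 2;   // 110xxxxx
        if ((v & 0xF0) == 0xE0) return 3;   // 1110xxxx, the "E" nibble
        if ((v & 0xF8) == 0xF0) return 4;   // 11110xxx
        return -1;                          // continuation byte or invalid
    }

    public static void main(String[] args) {
        byte[] bom = utf8Bom();
        System.out.println(toHex(bom));              // prints "EF BB BF"
        System.out.println(sequenceLength(bom[0]));  // prints 3
    }
}
```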

     
  • Michael van den Berg

    • assigned_to: bogleg --> Oliver Emery
     
  • Michael van den Berg

    Oliver, would you like to have a look at adding the BOM (0xEFBBBF) to the export (file format: zdt)? I am not sure if there is a Java function that supports prefixing a BOM. Probably there is...

     
  • Anonymous

    Anonymous - 2014-09-01

    I fixed this issue with writer.write('\ufeff');
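
    For readers who want to see that one-liner in context, here is a minimal sketch. It assumes the export goes through a Writer that encodes as UTF-8; the class BomExport and its methods are hypothetical and are not the actual ZDT export code.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical example class, not the real ZDT exporter.
final class BomExport {

    /** Writes content as UTF-8, preceded by the BOM character U+FEFF. */
    static void writeWithBom(OutputStream out, String content) throws IOException {
        Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8);
        writer.write('\ufeff'); // the UTF-8 writer encodes this as EF BB BF
        writer.write(content);
        writer.flush();
    }

    /** Convenience wrapper that returns the encoded bytes in memory. */
    static byte[] encodeWithBom(String content) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try {
            writeWithBom(buffer, content);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        byte[] bytes = encodeWithBom("abc");
        // prints "EF BB BF (6 bytes in total)"
        System.out.printf("%02X %02X %02X (%d bytes in total)%n",
                bytes[0] & 0xFF, bytes[1] & 0xFF, bytes[2] & 0xFF, bytes.length);
    }
}
```

    The BOM has to be the very first thing written, and the writer has to be a UTF-8 writer. With a different charset the same character would be encoded as different bytes.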

     
  • Anonymous

    Anonymous - 2014-09-05

    Fixed and uploaded.

     

    Last edit: Anonymous 2014-09-05
  • Michael van den Berg

    Retested with rebuild of zdt.jar and flashcard jar:
    Exported a category with three entries (each happens to be a word consisting of 2 Chinese characters: 愿意, 有机, 集体; each also has traditional, pinyin, definition and notes).
    Open in Notepad - characters and other info identical to Flashcard view (double click on the specific category).
    Open in Wordpad (Win7-64bit) - looks exactly the same as in Notepad and Flashcard view.

    Bug fixed. May be closed. To be released with v1.0.3.

    Tested with: rebuild (by kaya) "Flashcard Plug-in version 1.0.3" and "zdt.jar" - replaced these files in respectively 'zdt\plugins\net.sourceforge.zdt.core_1.0.2' and 'zdt\'. (Note: This is supposedly up to build 1241, however, checking in Help-> About-> Installation details does not show a difference in version. Is there a way to see the build number of an individual .JAR file or ZDT installation? Would be helpful for giving feedback on developer's builds.)

     
  • Michael van den Berg

    • status: open --> closed-fixed
    • Milestone: any-next --> v1.0.3
     
    • StephanH

      StephanH - 2015-01-26

      You cannot validate the fix by only "viewing" Notepad or Wordpad. As noted
      in the original report, they correctly open a "non-BOM" file. It will be
      visually identical. As the original report says, there are many programs
      which do not correctly interpret a file without a BOM. The use of Notepad
      (open and save-as with UTF8 encoding) was specifically to generate a file
      different from the original, to demonstrate that they were different.
      Perhaps you also did that, but the description of testing procedure doesn't
      seem to say this.

      I'm not saying it's not fixed, but the easiest way is to examine the file
      with a hex editor. The UTF-8 BOM signature is the first 3 bytes (EF BB BF).
      It's well documented on the internet.
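
      Without a hex editor, the same check can be scripted. The following is a small sketch under the assumption that the file is expected to be UTF-8, where the signature is the three bytes EF BB BF. The class name BomCheck is invented for this example, and the file path is passed on the command line.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical verification helper, not part of ZDT.
final class BomCheck {

    /** Returns true if the given bytes start with the UTF-8 BOM (EF BB BF). */
    static boolean hasUtf8Bom(byte[] head) {
        return head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
    }

    /** Reads at most the first three bytes of the file and checks them. */
    static boolean hasUtf8Bom(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] head = new byte[3];
            int read = 0;
            while (read < head.length) {
                int n = in.read(head, read, head.length - read);
                if (n < 0) {
                    break; // file is shorter than three bytes
                }
                read += n;
            }
            return read == head.length && hasUtf8Bom(head);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java BomCheck.java <exported-file>");
            return;
        }
        Path file = Paths.get(args[0]);
        System.out.println(file + (hasUtf8Bom(file) ? ": starts with" : ": does not start with")
                + " a UTF-8 BOM");
    }
}
```

      Running it against an export made before and after the fix should give different answers, which is the comparison that viewing the files in Notepad or Wordpad cannot provide.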

      On Mon, Jan 26, 2015 at 1:32 PM, Michael van den Berg mvdberg112@users.sf.net wrote:

      • status: open --> closed-fixed
      • Milestone: any-next --> v1.0.3

      Status: closed-fixed
      Milestone: v1.0.3
      Labels: Core
      Created: Mon Nov 28, 2005 08:01 PM UTC by StephanH
      Last Updated: Mon Jan 26, 2015 07:57 AM UTC
      Owner: Oliver Emery



