ZDT / Bugs / #26 Export does not write correct UTF-8 signature

StephanH - 2005-11-28

Beginning 1 word file, rewritten with Notepad.

beg_chinese_lesson1a.txt

StephanH - 2005-12-15

I found the application that fails with the "original",
but not with the rewritten version. Microsoft Wordpad 5.1
(current XP Pro version) has this problem. If I understand
the UTF-8 specification, then it's really Microsoft's bug,
but having the prefix codes/characters is allowed
(optional) in the specification, so I was hoping you'd be
able to add them.
If I have it straight, then there are only two
possibilities: either you're on an Intel type platform
(reversed byte encoding), or a Motorola type (Mac?) where
the bytes are in strict left-to-right ordering. HTH

bogleg - 2005-12-15

Ok, I can reproduce the problem with Wordpad as well.
Opening an exported file with Wordpad shows garbled
characters unless I rewrite it in UTF-8 using Notepad like
you said.

So you mention it's a Microsoft bug. But is there anything
you can do on the Java side to help mitigate the problem?
Or should I just leave it as is?

Michael van den Berg - 2014-08-30

status: open --> closed-rejected

Version: --> v1.0.2

Milestone: --> any-next

Michael van den Berg - 2014-08-30

Tried with ZDT v1.0.2 and Wordpad on Windows 7.

Exported a random category, File format: zdt.
Opened with Notepad: reads just fine (as reported above).
Opened with Wordpad: is readable, but it shows the wrong Chinese characters. This happens regardless of which font (e.g. Arial, Lucida Unicode, Times New Roman) I choose.

When opening with Word 2010, it opens a conversion window:
"Text Encoding: Windows (Default)" and "Text Encoding: MS-DOS" give the same 'wrong' characters as Wordpad. In 'Other encoding' it shows "Chinese simplified (GB2312)" as being selected.
"Text Encoding: Other encoding: UTF-8" is the only one giving the right result. (Others tried: Big-Endian, UTF-7, Chinese (all flavors), except 'Auto select', which jumps right to UTF-8.)

It seems that ZDT works as expected. It has chosen to use UTF-8, and Word 2010 recognizes this as such, as does Notepad. Apparently, Wordpad is designed with GB2312 in mind, rather than UTF-8. We could consider letting the user choose which encoding to use in the export (in the import this is already possible), but since this bug is 'old', it will be closed-rejected, because the feature is rejected for now.

Anyone is free to re-open it if they would like to have support for other formats besides UTF-8. In that case, it will be treated as a feature request.

Michael van den Berg - 2014-09-01

Stephan sent a message in:

On Sat, Aug 30, 2014 at 11:53 PM, Stephan Hodges tabletguy@users.sourceforge.net wrote:
Even though I'm the originator of the report, there's no possibility of adding comments after closing, and I didn't see how to reopen the bug report. Would have been nice to leave the testing comments for a day before closing.

Based on your testing report, I think you missed the point of the bug. ZDT is not writing the UTF-8 header. The programs you used for testing don't require the header, but use other means to detect UTF-8 encoding.

There are still many programs that do require it.

If you try with Notepad++, you will see that there's an option to save with or without the "BOM" (UTF-8 header). The BOM is a 3-byte header at the start of the file.

Writing the header is rather easy to do, actually.

Michael van den Berg - 2014-09-01

status: closed-rejected --> open

Michael van den Berg - 2014-09-01

Stephan, you are right. I did not realize users couldn't re-open tickets. Sorry for that.

The UTF-8 standard (RFC 3629, http://tools.ietf.org/html/rfc3629, or see Wikipedia) does not require any starting bytes, but a UTF-8 file may start with a BOM (http://en.wikipedia.org/wiki/Byte_order_mark).

The code 0xEF BB BF is actually the UTF-8 encoding of U+FEFF, which is the 'BOM'. Note: it is perhaps confusing, at least it was to me, but U+FEFF is the Unicode code point, not the byte values written to the file.
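
To make the relationship between the code point and the bytes concrete, here is a minimal Java sketch that encodes U+FEFF as UTF-8 and prints the result in hex. The class and method names (`BomBytes`, `bomAsHex`) are made up for illustration and are not part of ZDT; it assumes Java 7 or later for `StandardCharsets`.

```java
import java.nio.charset.StandardCharsets;

public class BomBytes {
    // Encode the single code point U+FEFF as UTF-8 and return the bytes as hex.
    static String bomAsHex() {
        byte[] bytes = "\uFEFF".getBytes(StandardCharsets.UTF_8);
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            if (hex.length() > 0) {
                hex.append(' ');
            }
            // Mask to 0..255 so negative byte values print as two hex digits.
            hex.append(String.format("%02X", b & 0xFF));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        System.out.println(bomAsHex()); // prints "EF BB BF"
    }
}
```

One code point goes in and three bytes come out, which is why the signature is three bytes long when viewed in a hex editor.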

The standard says that a protocol should not forbid a BOM "when it is expected that implementations of the protocol will not be in a position to always use the mechanisms properly."

Since the zdt-export can be used by other software, I think we should add the 0xEF BB BF.

Note that UTF-16 is the default for (most) Windows implementations. This explains why UTF-8 does not always work in Windows applications like Notepad.

To avoid confusion: In the original post it said: ".... does not have the "EF" (for intel based processors) two byte beginning, nor the next 4 byte encoding descriptor. (Perhaps that 4 bytes is optional -- not sure)." Actually, all six hex digits (three bytes) belong together and form the BOM, so it is not 2+4 bytes. The "E" in 0xEF BB BF indicates that this is a three-byte Unicode character. The ..F BB BF part defines which character.

Michael van den Berg - 2014-09-01

assigned_to: bogleg --> Oliver Emery

Michael van den Berg - 2014-09-01

Oliver, would you like to look into adding the BOM (0xEFBBBF) to the export (file format: zdt)? I am not sure if there is a Java function that supports prefixing a BOM. Probably there is...

Anonymous - 2014-09-01

I fixed this issue with writer.write('\ufeff');
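
For readers who want to see the one-liner in context: it only produces EF BB BF on disk if the writer itself encodes as UTF-8. The following is a minimal sketch under that assumption, not ZDT's actual export code; `BomExport`, `writeWithBom` and the temporary file are hypothetical names chosen for the example.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class BomExport {
    // Write text as UTF-8, preceded by the BOM (U+FEFF becomes EF BB BF on disk).
    static void writeWithBom(Path file, String text) throws IOException {
        try (OutputStream out = Files.newOutputStream(file);
             Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            writer.write('\uFEFF'); // must be the very first character written
            writer.write(text);
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("zdt-export", ".txt");
        writeWithBom(file, "sample entry\n");
        byte[] bytes = Files.readAllBytes(file);
        // Show the first three bytes of the file in hex.
        System.out.printf("%02X %02X %02X%n",
                bytes[0] & 0xFF, bytes[1] & 0xFF, bytes[2] & 0xFF); // prints "EF BB BF"
    }
}
```

If the writer were created with a different charset (for example the platform default on an older Windows JVM), the same `write('\uFEFF')` call would emit different bytes or a replacement character, so the charset should be set explicitly.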

Anonymous - 2014-09-05

Fixed and uploaded.

Last edit: Anonymous 2014-09-05

Michael van den Berg - 2015-01-26

Retested with rebuild of zdt.jar and flashcard jar:
Exported a category with three entries (each happens to be a word consisting of 2 Chinese characters: 愿意, 有机, 集体; each also has traditional, pinyin, definition and notes).
Open in Notepad - characters and other info identical to Flashcard view (double click on the specific category).
Open in Wordpad (Win7-64bit) - looks exactly the same as in Notepad and Flashcard view.

Bug fixed. May be closed. To be released with v1.0.3.

Tested with: rebuild (by kaya) "Flashcard Plug-in version 1.0.3" and "zdt.jar" - replaced these files in respectively 'zdt\plugins\net.sourceforge.zdt.core_1.0.2' and 'zdt\'. (Note: This is supposedly up to build 1241, however, checking in Help-> About-> Installation details does not show a difference in version. Is there a way to see the build number of an individual .JAR file or ZDT installation? Would be helpful for giving feedback on developer's builds.)

Michael van den Berg - 2015-01-26

status: open --> closed-fixed

Milestone: any-next --> v1.0.3
- StephanH - 2015-01-26
  
  You cannot validate the fix by only "viewing" Notepad or Wordpad. As noted
  in the original report, they correctly open a "non-BOM" file. It will be
  visually identical. As the original report says, there are many programs
  which do not correctly interpret a file without a BOM. The use of Notepad
  (open and save-as with UTF8 encoding) was specifically to generate a file
  different from the original, to demonstrate that they were different.
  Perhaps you also did that, but the description of testing procedure doesn't
  seem to say this.
  
  I'm not saying it's not fixed, but the easiest way is to examine the file
  with a hex editor. The BOM signature is the first 3 bytes (EF BB BF). It's
  well documented on the internet.
  
  On Mon, Jan 26, 2015 at 1:32 PM, Michael van den Berg mvdberg112@users.sf.net wrote:
  
  status: open --> closed-fixed
  
  Milestone: any-next --> v1.0.3
  
  [bugs:#26] http://sourceforge.net/p/zdt/bugs/26 Export does not write
  correct UTF-8 signature
  
  Status: closed-fixed
  Milestone: v1.0.3
  Labels: Core
  Created: Mon Nov 28, 2005 08:01 PM UTC by StephanH
  Last Updated: Mon Jan 26, 2015 07:57 AM UTC
  Owner: Oliver Emery
  
  Some applications do not read the exported files
  correctly. Others (notepad for example) do seem to
  read them.
  
  When exporting a category list, the file appears to
  have UTF-8 contents, but does not have the "EF" (for
  intel based processors) two byte beginning, nor the
  next 4 byte encoding descriptor. (Perhaps that 4
  bytes is optional -- not sure).
  
  Can be duplicated by opening an exported file with
  Notepad (on XP) and then rewritten, explicitly
  selecting UTF-8 encoding (or other UTF encoding, if
  desired). Binary comparison of the two files will
  show the difference.
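
The binary comparison described in the report, and the hex-editor check suggested in the reply above it, can also be scripted. Below is a minimal Java sketch that tests whether a file begins with the UTF-8 signature; `BomCheck` and `startsWithUtf8Bom` are hypothetical names, and the sketch assumes the export is small enough to read into memory at once.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class BomCheck {
    // True if the data begins with the UTF-8 signature EF BB BF.
    static boolean startsWithUtf8Bom(byte[] data) {
        return data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java BomCheck <exported-file>");
            return;
        }
        Path file = Paths.get(args[0]);
        boolean hasBom = startsWithUtf8Bom(Files.readAllBytes(file));
        System.out.println(file + (hasBom ? ": UTF-8 BOM present" : ": no UTF-8 BOM"));
    }
}
```

Running it against the original export and against the copy re-saved from Notepad as UTF-8 should report different results if the two files differ only in the signature.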
