Subtitle export UTF-8 missing BOM marker

akini
2008-10-05
2013-05-14
  • akini

    akini - 2008-10-05

    source file: package net.sourceforge.dvb.projectx.subtitle.UnicodeWriter

    Exporting subtitles with the "[x]UTF-8" option does not write a BOM (byte order mark) at the start of the text file, whereas the UTF-16 export does. Many people prefer UTF-8, at least in the Western world, because it is a compact storage format that still lets national special letters work properly.

    UTF-8 export: does not write a BOM
    UTF-16 export: writes a BOM

    ffdshow subtitling does not work properly without a UTF-8 BOM. I hacked my copy of the source to write one, and now .srt and .txt files display national letters properly.
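
    To illustrate why a BOM helps a player pick the right charset, here is a minimal sketch of BOM-based detection. It is not ffdshow's actual logic; the class name BomSniffer and the fallback parameter are hypothetical and only serve the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of ProjectX or ffdshow.
final class BomSniffer {
    /**
     * Guesses the charset of a text file from its first bytes.
     * Returns the fallback when no known BOM is present, which is
     * exactly the case where a reader has to guess.
     */
    static Charset detect(byte[] head, Charset fallback) {
        // UTF-8 BOM: EF BB BF
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        // UTF-16 big-endian BOM: FE FF
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        // UTF-16 little-endian BOM: FF FE
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE;
        }
        return fallback;
    }
}
```

    A reader that works roughly like this falls back to a legacy code page for a BOM-less UTF-8 file, which would explain national letters showing up garbled.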

    Miscellaneous notes on Java, UTF-8 and Unicode:
    http://koti.mbnet.fi/akini/java/java_utf8_xml/

    All I changed was adding the "mark file as UTF-8" lines, which write the bytes EF BB BF at the start of the file. I think it is always best to write a BOM so that any video player or text editor can recognize the charset of the text file without guessing.

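        /**
         * Writes str to the output: unchanged when Unicode is disabled, as UTF-8
         * (with BOM) when useUTF8 is set, otherwise as big-endian UTF-16 (with BOM).
         */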
        public void print(String str) throws IOException
        {
            if (!useUnicode)
            {
                out2.print(str);
                return;
            }

            // UTF8 with BOM marker (fixes ffdshow utf-8 charset problem)
            if (useUTF8)
            {
                // mark file as UTF-8
                if (out1.size() == 0)
                    out1.write( new byte[] { (byte)0xEF, (byte)0xBB, (byte)0xBF}, 0, 3);

                char[] chars = str.toCharArray();

                for (int i = 0, j = chars.length; i < j; i++)
                {
                    if ((mask_1 & chars[i]) == 0) //0xxxxxxx - 0000-007F
                        out1.writeByte(chars[i]);

                    else if ((mask_2 & chars[i]) == 0) //110xxxxx 10xxxxxx - 0080-07FF
                        out1.writeShort(0xC080 | (0x1F00 & chars[i]<<2) | (0x3F & chars[i]));

                    else //1110xxxx 10xxxxxx 10xxxxxx - 0800-FFFF
                    {
                        // lead byte: 1110 plus the top 4 bits of the char (writeByte keeps only the low 8 bits)
                        out1.writeByte(0xE0 | (0x0F & chars[i]>>12));
                        out1.writeShort(0x8080 | (0x3F00 & chars[i]<<2) | (0x3F & chars[i]));
                    }
                }

                return;
            }

            // UTF16 with BOM marker: mark file as big endian unicode
            if (out1.size() == 0)
                out1.writeChar(0xFEFF);

            out1.writeChars(str);
        }
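
    For comparison, the same idea (emit EF BB BF once, then the encoded text) can be sketched with the standard library encoder instead of manual bit masking. This is an illustration under assumptions, not the code integrated into ProjectX; the class Utf8BomWriter and its fields are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the UTF-8 branch of UnicodeWriter.print().
final class Utf8BomWriter {
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    private final OutputStream out;
    private boolean bomWritten = false;

    Utf8BomWriter(OutputStream out) {
        this.out = out;
    }

    /** Writes the BOM before the first string, then each string as UTF-8. */
    void print(String str) throws IOException {
        if (!bomWritten) {
            out.write(UTF8_BOM);
            bomWritten = true;
        }
        // Each call is encoded on its own, so a surrogate pair must not be
        // split across two print() calls.
        out.write(str.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        Utf8BomWriter writer = new Utf8BomWriter(buf);
        writer.print("1\n00:00:01,000 --> 00:00:02,000\n");
        writer.print("Hyv\u00E4\u00E4 p\u00E4iv\u00E4\u00E4\n");
        // Dump the bytes as hex to inspect the BOM and the encoded letters.
        for (byte b : buf.toByteArray()) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();
    }
}
```

    Unlike the char-by-char loop, String.getBytes encodes a surrogate pair as a single 4-byte sequence, so characters outside the Basic Multilingual Plane also come out as valid UTF-8.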

     
    • Matthias Mueller

      thx, it has been integrated in b26 (hopefully it doesn't bother anyone)

       
