Project X - DVB demux Tool / Discussion / Help: Subtitle export UTF-8 missing BOM marker

source file: package net.sourceforge.dvb.projectx.subtitle.UnicodeWriter

Subtitle export using the "[x]UTF-8" format does not write a BOM at the start of the text file, whereas the UTF-16 format does. Many people, at least in the Western world, prefer UTF-8 because it is a compact storage format that still lets national special characters work properly.

UTF-8 export: does not write a BOM
UTF-16 export: writes a BOM

ffdshow subtitling does not work properly without a UTF-8 BOM. I hacked my copy of the source to write one, and now .srt and .txt files display national letters properly.
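
To check whether an exported file actually starts with a BOM, something like the following can be used. This is only a sketch: the class and method names (BomSniffer, detectBom) are made up for illustration and are not part of Project X, and UTF-32 marks are deliberately not considered.

```java
/** Hypothetical helper for checking exported subtitle files; not part of Project X. */
final class BomSniffer {

    /**
     * Returns a label for the byte order mark found at the start of the data,
     * or "none" if the data does not begin with one of the marks checked here.
     * Assumption: only UTF-8 and UTF-16 marks matter; UTF-32 is ignored.
     */
    static String detectBom(byte[] head) {
        // UTF-8 BOM: EF BB BF (U+FEFF encoded as UTF-8)
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        // UTF-16 big endian BOM: FE FF
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        // UTF-16 little endian BOM: FF FE
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";
        }
        return "none";
    }
}
```

The bytes of a file can be obtained with java.nio.file.Files.readAllBytes(path) and passed to detectBom.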

Miscellaneous notes on Java and UTF-8/Unicode:
http://koti.mbnet.fi/akini/java/java_utf8_xml/

All I changed was adding the "mark file as UTF-8" lines, which write the bytes EF BB BF at the start of the file. I think it is always best to write a BOM so that any video player or text editor can recognize the charset of the text file without guessing.

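    /**
     * Writes a string in the selected charset. Unicode output (UTF-8 or
     * UTF-16) is preceded by a BOM if nothing has been written yet.
     */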
    public void print(String str) throws IOException
    {
        if (!useUnicode)
        {
            out2.print(str);
            return;
        }

        // UTF8 with BOM marker (fixes ffdshow utf-8 charset problem)
        if (useUTF8)
        {
            // mark file as UTF-8
            if (out1.size() == 0)
                out1.write( new byte[] { (byte)0xEF, (byte)0xBB, (byte)0xBF}, 0, 3);

            char[] chars = str.toCharArray();

            for (int i = 0, j = chars.length; i < j; i++)
            {
                if ((mask_1 & chars[i]) == 0) //0xxxxxxx - 0000-007F
                    out1.writeByte(chars[i]);

                else if ((mask_2 & chars[i]) == 0) //110xxxxx 10xxxxxx - 0080-07FF
                    out1.writeShort(0xC080 | (0x1F00 & chars[i]<<2) | (0x3F & chars[i]));

                else //1110xxxx 10xxxxxx 10xxxxxx - 0800-FFFF
                {
                    // lead byte 1110xxxx carries the top 4 bits; writeByte keeps only the low 8 bits
                    out1.writeByte(0xE0 | (0x0F & chars[i]>>12));
                    out1.writeShort(0x8080 | (0x3F00 & chars[i]<<2) | (0x3F & chars[i]));
                }
            }

            return;
        }

        // UTF16 with BOM marker
        /**
         * mark file as big endian unicode
         */
        if (out1.size() == 0)
            out1.writeChar(0xFEFF);

        out1.writeChars(str);
    }
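
For comparison, here is a minimal standalone sketch of the same idea that lets the JDK do the UTF-8 encoding instead of the hand-written bit shifting. It is not the Project X code: the class and method names (Utf8BomSketch, writeWithBom, encodeWithBom) are assumptions made for illustration. One difference worth noting is that the JDK encoder turns a surrogate pair into a single 4-byte sequence, while a per-char loop like the one above would emit each half as its own 3-byte sequence.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, not part of Project X. */
final class Utf8BomSketch {

    // EF BB BF is U+FEFF encoded as UTF-8.
    static final byte[] UTF8_BOM = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF };

    /** Writes the BOM, then the text encoded with the JDK's UTF-8 encoder. */
    static void writeWithBom(OutputStream out, String text) throws IOException {
        out.write(UTF8_BOM);
        Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8);
        writer.write(text);
        writer.flush(); // flush only; closing the stream is left to the caller
    }

    /** Convenience wrapper that returns the BOM plus encoded text as a byte array. */
    static byte[] encodeWithBom(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try {
            writeWithBom(buffer, text);
        } catch (IOException e) {
            // not expected for an in-memory buffer
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }
}
```

Unlike print() above, writeWithBom writes the BOM on every call, so it should be called once per file, for example with a FileOutputStream for the .srt file.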
