From: Marc-Antoine R. <ma...@un...> - 2005-10-20 16:46:27
Joachim Eibl wrote:
> On Monday, 17 October 2005 19:54, Matus Lipka wrote:
>
>>Joachim,
>>
>>Are you using MBCS? From my humble knowledge of multibyte character
>>encodings, I was under the impression that UNICODE is always 2 bytes per
>>character. So once a file is detected as being in UNICODE format, *all*
>>characters are interpreted as 2 bytes. If not, then all characters are 1
>>byte, and the ° and likewise characters are never interpreted as multibyte.
>>
>>These kind of encodings shouldn't be mixed together in a single file,
>>unless something weird like MBCS is used (which could be a non-default
>>option in KDiff).
>>
>>Does this make sense?
>>
>>Cheers,
>>
>>Matus
>
>
> Hi Matus,
>
> The term "Unicode" covers both. You might want to read
> http://en.wikipedia.org/wiki/Unicode
>
> Since the name "Unicode" doesn't stand for any specific encoding the names
> UTF-8 or UCS-2 are used to be more precise.
>
> In any case UTF-8 (which is an 8-bit, variable-width encoding) is becoming
> very popular and is often the default (especially on Linux-machines).
>
> But KDiff3 should try to honor the default setting for every individual
> machine.
>
> Cheers,
> Joachim
>

FYI, Microsoft has confused a lot of people with its nomenclature. Unicode
is NOT an encoding, as Joachim said above. I wouldn't blame Microsoft,
though, as the encodings have been relatively slow to be standardized.

Windows uses UTF-16 internally. It used UCS-2 in NT 3.x, but that is now
deprecated.

http://www.faqs.org/rfcs/rfc2279.html
http://www.faqs.org/rfcs/rfc2781.html

Windows is the only OS I know of that uses UTF-16; every other one uses
UTF-8 because it is simpler for backward compatibility.

MBCS is a generic term, because in the past there were other multibyte
encodings besides UTF-X, and those shouldn't be used anymore.

UTF-8/UTF-16 can be detected by looking for the BOM (byte order mark). But
a file without a BOM can still be UTF-8, since UTF-8 may simply be the
default encoding.

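To make the BOM check concrete, here is a minimal sketch in Python. It is
not how KDiff3 does it (KDiff3 is a C++/Qt program); the function name
sniff_bom and the "default" parameter are made up for illustration. It only
looks at the UTF-8 and UTF-16 BOMs mentioned above and falls back to a
caller-supplied default when no BOM is present.

```python
import codecs

# BOM byte sequences, taken from the standard library's codecs module.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
)


def sniff_bom(data: bytes, default: str = "utf-8"):
    """Return (encoding, bom_length) for the start of a file.

    Hypothetical helper: if no BOM is found, the file is assumed to be in
    `default`, which stands in for the machine's default encoding.
    Caveat: UTF-32 is not handled here, so a UTF-32 LE BOM (FF FE 00 00)
    would be reported as UTF-16 LE.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return default, 0
```

The caller would skip bom_length bytes and decode the rest with the
returned encoding.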
By the way, if an invalid character is read, it is probably normal that the
UTF-8 decoder removes it. But it would be better if kdiff3 detected this
and warned the user. Just a thought.

M-A
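
The "detect and warn" idea could look roughly like the Python sketch below.
Again, this is only an illustration under my own assumptions, not KDiff3
code: decode_with_warning is a hypothetical name, and a real tool would
decide for itself how to surface the warning. It tries a strict decode
first and, on failure, falls back to a lossy decode while reporting where
the first bad byte was.

```python
def decode_with_warning(data: bytes, encoding: str = "utf-8"):
    """Decode `data`; return (text, warning).

    `warning` is None when the input is valid. Otherwise invalid bytes are
    replaced by U+FFFD and `warning` names the offset of the first one.
    """
    try:
        return data.decode(encoding, errors="strict"), None
    except UnicodeDecodeError as exc:
        # Lossy fallback: keep going, but mark each invalid sequence.
        text = data.decode(encoding, errors="replace")
        warning = f"invalid {encoding} data at byte offset {exc.start}"
        return text, warning
```

For example, a Latin-1 encoded "20 °C" (bytes 32 30 20 B0 43) read as UTF-8
fails at the lone B0 byte, so the user gets a warning instead of a silently
missing degree sign.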