#18 Full unicode support

closed
nobody
None
5
2011-07-16
2011-02-28
klonos
No

...as per title.

I tried comparing an English file vs its Greek translation and the Greek text was all gibberish.

Discussion

  • Derrick Moser
    2011-02-28

    This should work. Perhaps the system was unable to detect the encoding properly. Could you provide an example file with Greek that does not display correctly in Diffuse? If you inspect the "Regional Settings" tab of the Preferences, what codecs are listed?

     
  • klonos
    klonos
    2011-03-02

     
  • klonos
    2011-03-02

    You are right, Derrick: adding "UTF-8" as the *first* encoding in the list of encodings in that field displayed the text just fine. The default for that setting after installing Diffuse is "mbcs latin_1", so I guess it should be changed to include UTF-8 as well. I need to stress that if UTF-8 is added but is not *first* in that list, the same issue occurs. It needs to be first in the list.

    One thing I need to note here, though, is that in other programs and various text editors I don't have to manually "mess" with settings for the text to be identified correctly.

     
  • Derrick Moser
    2011-03-04

    Python does not have any sophisticated encoding detection methods. Diffuse just tries the codecs listed in the preferences until one succeeds. Likely the default list of codecs just needs to be improved. "UTF-8 mbcs latin1" would mean UTF-8 has higher precedence than the platform's native format but I doubt it would cause any problems for people.
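The "try each codec until one succeeds" behaviour described above can be sketched roughly as follows. This is a minimal illustration, not Diffuse's actual code: the function name `decode_with_first_match` and the sample bytes are assumptions made for the example.

```python
def decode_with_first_match(data, codec_names):
    """Return (text, codec_name) for the first codec that decodes `data`.

    Hypothetical helper for illustration; returns (None, None) if no
    codec in the list can decode the bytes.
    """
    for name in codec_names:
        try:
            return data.decode(name), name
        except (UnicodeDecodeError, LookupError):
            # Invalid bytes for this codec, or codec unknown on this platform:
            # move on to the next candidate.
            continue
    return None, None


# Bytes that are not valid UTF-8 fall through to the next codec in the list.
text, used = decode_with_first_match(b"caf\xe9", ["utf_8", "latin_1"])
print(used, text)
```

Note that the first codec that decodes without error wins, whether or not it is the encoding the file was actually saved in, so the order of the list matters.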

     
  • klonos
    2011-03-08

    The platform's native format is Greek (Greek Win7 x64), but it is set to use en-US as the default input method and fall back to Greek when no Unicode support is available. Still, it fails to do that with Diffuse. When I suggested using UTF-8, I was thinking that this would succeed on most users' systems, not just Greek ones. In other words, I was thinking of a global solution rather than just solving my own problem.

    Finally, yes, after my tests I reached the conclusion that the first format in the list is the one favored over the others, even if it doesn't match the document's actual format. If the list is set to "UTF-8 mbcs latin1", the document is detected correctly. If it is "mbcs UTF-8 latin1", it gets detected as mbcs (I need to stress here that the document is actually saved in UTF-8 format and is detected as such by other applications without issues). If format detection went down the list until it found a proper/matching format, as you say, then it should work either way, but it only does when UTF-8 is placed *first* in the list of formats.

    Anyway, since UTF-8 will work with almost all documents (that is the purpose of Unicode, after all), I suggest that you put it first in the default list of formats.

     
  • Derrick Moser
    2011-03-09

    I don't think the input method rules are exposed to Python so Diffuse just sees Greek as the native format and knows nothing about your other en-US preference. Actually, on Microsoft Windows, Diffuse just sees "mbcs" but that is an alias to the system's native encoding.

    I suspect your system's native encoding is actually "cp1253". Although the example file attached to this issue is valid UTF-8 text, it also happens to be valid cp1253 text (although not particularly useful cp1253 text). So if the cp1253 codec is tested before the UTF-8 codec, the text will be identified as cp1253-encoded text.
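The ambiguity described above can be reproduced with a short sketch. The sample word is an assumption chosen for illustration (it is not taken from the attached file), and Python's strict `cp1253` codec stands in for the Windows `mbcs` alias, which may treat unassigned byte values differently.

```python
# Sample Greek word, chosen for illustration only.
text = "Ελληνικά"
utf8_bytes = text.encode("utf_8")

# Decoding the UTF-8 bytes as cp1253 raises no error: each byte maps to
# some cp1253 character, so a codec-order check would accept the result.
misread = utf8_bytes.decode("cp1253")

# Decoding with the encoding the text was saved in recovers the original.
correct = utf8_bytes.decode("utf_8")

print(misread)
print(correct)
```

Because the single-byte decode succeeds without complaint, only the order of the codec list decides which interpretation is used.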

    I have updated the default order for auto detection to first try "utf_8", then the platform's native format, and finally "latin_1". On Windows, it would show up as "utf_8 mbcs latin_1". If you have saved preferences, your saved preferences will override the default.
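One way the new default order and the saved-preferences override could be expressed is sketched below. The function names (`native_codec`, `default_codecs`, `effective_codecs`) are hypothetical and not taken from Diffuse; the sketch also assumes the saved preference is a space-separated string like the ones quoted in this thread.

```python
import locale
import sys


def native_codec():
    # Hypothetical helper: "mbcs" on Windows, otherwise the locale's
    # preferred encoding as reported by the standard library.
    if sys.platform == "win32":
        return "mbcs"
    return locale.getpreferredencoding(False)


def default_codecs(native=None):
    """Return the default order: utf_8, the native codec, then latin_1."""
    if native is None:
        native = native_codec()
    ordered, seen = [], set()
    for name in ("utf_8", native, "latin_1"):
        key = name.lower().replace("-", "_")  # treat "UTF-8" and "utf_8" alike
        if key not in seen:
            seen.add(key)
            ordered.append(name)
    return ordered


def effective_codecs(saved=None, native=None):
    # A saved preference string, if present, overrides the default list.
    if saved:
        return saved.split()
    return default_codecs(native)


print(effective_codecs(native="mbcs"))
print(effective_codecs(saved="mbcs latin_1", native="mbcs"))
```

The second call shows why an existing installation may keep the old behaviour until the saved list is edited by hand.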

     
  • Derrick Moser
    2011-07-16

    • status: open --> closed