encodings – the final frontier?

  • eNG1Ne

    eNG1Ne - 2014-01-21

    Not exclusively a Notepad++ question, I realise, but since it's just shown up while I've been using Notepad++ … also, I reckon the people in this forum will have the breadth of understanding I'm missing.

    I prepared a plain text file in InDesign (IDD) tagged text format: all this does is prefix each paragraph with a style-reference, like this:

    <ParaStyle:p>The first reason for controlling the wind pressure is to obtain flexible, and therefore expressive, dynamics. A second reason is that changes in wind pressure – short, sudden or both short and sudden – can provide various effects on the harmonium that are not possible on other keyboard instruments.

    I stuck with DOS\Windows line endings and ANSI encoding, because that's the combination that worked the first time I was successful with IDD tagged text. Satisfyingly enough, the import worked as required.

    Next step in the current project is to share the same tagged content with a Mac user. The import worked, the line ends caused no problems … but accented characters and fancy punctuation did not survive the journey. [this, I have checked, is not specific to IDD; just opening the file in a Mac text editor did the same]

    “toucher simultanément les notes sur lesquelles il se trouve placé de manière à obtenir un son bref et fort sur le Poïkilorgue”, for example, limps on to the Mac as something like ítoucher simultanÈment les notes sur lesquelles il se trouve placÈ de maniÓre É obtenir un son bref et fort sur le PoÛkilorgueï – I don't have a genuine example to hand.

    Can anyone advise me on which encoding to select in Notepad++ on Windows if I want these special characters to show up correctly when I open the same file on a Mac? Not just to save me five minutes' search/replace, but for my general education. Thanks in advance.

  • Neomi

    Neomi - 2014-01-24

    In my opinion the best choice is UTF-8. Not just for portability to the Mac, but for portability across different editors or even the same editor on differently configured Windows systems.
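
    As a rough illustration of what re-saving an ANSI file as UTF-8 amounts to, here is a minimal Python sketch. It assumes the original "ANSI" file is in the Windows-1252 code page (the thread does not say which code page was active), and the names ansi_to_utf8 and convert_file are made up for this example.

```python
def ansi_to_utf8(data: bytes, source_encoding: str = "cp1252") -> bytes:
    """Re-encode bytes from a legacy code page to UTF-8 (no BOM)."""
    # Assumption: "ANSI" here means Windows-1252; pass another codec
    # name if the file was saved under a different code page.
    return data.decode(source_encoding).encode("utf-8")


def convert_file(src: str, dst: str, source_encoding: str = "cp1252") -> None:
    """Read src as raw bytes and write its UTF-8 re-encoding to dst."""
    # Binary mode leaves the DOS/Windows line endings untouched.
    with open(src, "rb") as f:
        data = f.read()
    with open(dst, "wb") as f:
        f.write(ansi_to_utf8(data, source_encoding))
```

    The key point is that the conversion has to be told the source code page; it cannot be worked out reliably from the bytes themselves.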

    The problem with specific code pages like Western European, Central European, Cyrillic and all the others, each in different flavors, is this: they assign the same 8-bit codes to different characters, and a file cannot identify which code page it uses from its content alone.
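
    A small Python sketch of that ambiguity: one and the same byte read under three different 8-bit code pages. That the Mac editor in the question fell back to Mac OS Roman is only a guess; the thread does not confirm it.

```python
# One byte, as a Windows "ANSI" (Windows-1252) editor would store "é".
raw = "é".encode("cp1252")  # b'\xe9'

# The same byte interpreted under different 8-bit code pages.
# Nothing in the byte itself says which reading is the intended one.
readings = {codec: raw.decode(codec) for codec in ("cp1252", "mac_roman", "cp1251")}

for codec, char in readings.items():
    print(f"{codec:10} 0x{raw[0]:02X} -> {char}")
```

    Under Windows-1252 the byte is "é", under Mac OS Roman it is "È", and under the Cyrillic Windows-1251 page it is "й".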

    In short, the big advantage of UTF-8 is that it is not limited to certain character sets; it simply includes all of them. It cannot do so with 1 byte per character, of course, since there are too many characters, so they are encoded with a variable number of bytes.

    If a text file is encoded as UTF-8 without BOM (byte order mark; optional because endianness is irrelevant for this encoding, though it can be useful for automatic detection) and uses only the first half of the ANSI characters, i.e. the 7-bit ASCII range (true for most text and source code in English), it is byte-for-byte identical to an ANSI text file. Extended characters are encoded in a way that doesn't interfere with the others: every byte of a multi-byte sequence has its high bit set, so it can never be mistaken for an ASCII character. Even if you compile, for example, a C or C++ file encoded as UTF-8 (without BOM) with Japanese characters in strings and comments, using a compiler that handles only ANSI, it generally still works; the strings simply end up UTF-8 encoded in the binary.
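
    These properties can be checked with a short Python sketch. The sample characters are arbitrary, and Windows-1252 stands in for "ANSI" as an assumption.

```python
import codecs

# 1. Pure ASCII text has identical bytes in UTF-8 and in an ANSI code page.
english = "plain English text"
same_bytes = english.encode("utf-8") == english.encode("cp1252")

# 2. Other characters take a variable number of bytes.
byte_counts = {ch: len(ch.encode("utf-8")) for ch in ("e", "é", "“", "日")}

# 3. Every byte of a multi-byte sequence is >= 0x80, so it never
#    collides with an ASCII character such as a quote or a backslash.
high_bit_only = all(b >= 0x80 for b in "日".encode("utf-8"))

# 4. The optional BOM is just three fixed bytes at the start of the file.
with_bom = "é".encode("utf-8-sig")
has_bom = with_bom.startswith(codecs.BOM_UTF8)

print(same_bytes, byte_counts, high_bit_only, has_bom)
```

    The third property is why an ANSI-only compiler can usually pass UTF-8 string literals through untouched.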

