GetDocumentText -- encoding

2013-08-02
2013-08-06
  • WhyDoesOpenIDNotWork

    Hi,

    This page - http://sourceforge.net/p/notepad-plus/discussion/482781/thread/90ae2b1a/

    indicates a method like

    public static string GetDocumentText(IntPtr curScintilla)
    {
        int length = (int)Win32.SendMessage(curScintilla, SciMsg.SCI_GETLENGTH, 0, 0) + 1;
        StringBuilder sb = new StringBuilder(length);
        Win32.SendMessage(curScintilla, SciMsg.SCI_GETTEXT, length, sb);
        return sb.ToString();
    }

    but I'm having some character encoding issues with that. If I have a buffer (encoded as UTF-8) that contains

    TEST ß TEST — BLAH x

    then when I GetDocumentText and write that to a file (for debugging purposes)

    System.IO.File.WriteAllText(@"C:\temp\buffer.txt", sb.ToString());

    then I get a file containing:

    TEST ÃŸ TEST â€” BLAH x

    Can anyone provide any guidance with this?

    Thanks,

     
  • WhyDoesOpenIDNotWork

    I can do this:

            Win32.SendMessage(PluginBase.nppData._nppHandle, NppMsg.NPPM_MENUCOMMAND, 0, NppMenuCmd.IDM_EDIT_SELECTALL);
            Win32.SendMessage(PluginBase.nppData._nppHandle, NppMsg.NPPM_MENUCOMMAND, 0, NppMenuCmd.IDM_EDIT_COPY);
            string howzat = Clipboard.GetText(TextDataFormat.Text);
    

    but that's definitely naughty, using the clipboard for the application, rather than the user.

     
  • cchris

    cchris - 2013-08-03

    Your text-writing primitive has no reason to perform any encoding conversion. Since its target is on your system, it will happily assume your text is encoded the way your OS says it is by default.

    I don't code in C#, but, if it works like Java, you need a FileWriter constructed with the encoding it will be getting. Something like

    System.IO.File.WriteTextWithEncoding(filename, text, Charset.UTF8);

    with different names of course..

    CChris
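
For reference, the C# counterpart of the call sketched above is the File.WriteAllText overload that takes an Encoding. The following is a minimal sketch, not code from the thread: the class name, method name, and temp file name are hypothetical. It writes a string as UTF-8 without a byte order mark and then dumps the bytes on disk as hex, so the result can be checked without trusting any editor's guess at the encoding. As far as the .NET documentation describes it, the two-argument File.WriteAllText already writes UTF-8 without a BOM, so if the file still looks wrong, the string was probably damaged before it was written.

```csharp
using System;
using System.IO;
using System.Text;

public static class WriteUtf8Demo
{
    // Writes text as UTF-8 without a byte order mark, then returns the
    // bytes actually stored on disk as a hyphen-separated hex string.
    public static string WriteAndDump(string path, string text)
    {
        File.WriteAllText(path, text, new UTF8Encoding(false));
        return BitConverter.ToString(File.ReadAllBytes(path));
    }

    public static void Main()
    {
        // Hypothetical file name, used only for this demonstration.
        string path = Path.Combine(Path.GetTempPath(), "buffer-utf8-demo.txt");

        // U+00DF is the sharp s, U+2014 is the em dash.
        Console.WriteLine(WriteAndDump(path, "\u00DF \u2014")); // C3-9F-20-E2-80-94

        File.Delete(path);
    }
}
```

If the dump shows C3-83-C5-B8 where C3-9F was expected, the string itself already held the wrong characters and the writing step is not at fault.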

     
  • WhyDoesOpenIDNotWork

    Hi,

    Thanks for the response. I'm a Java programmer too -- this is my only foray into the world of C#. I think in my many (many) lines of different attempts at this I've already tried this, but I did it again to be sure:

    System.IO.File.WriteAllText(@"C:\Users\davet\hmhb.txt", sb.ToString(), Encoding.UTF8);

    Still looks like this, though:

    TEST ÃŸ TEST â€” BLAH x

    although the file is a UTF8 file:

    /cygdrive/c/Users/davet/hmhb.txt: text/plain; charset=utf-8

     
  • cchris

    cchris - 2013-08-05

    Is the display you are showing us what you get in some text editor (in which case the encoding it assumes could be the problem) or what you see in a hex editor? If the latter, then I'll let a C# guru sort it out. But I once lost a fair amount of time fighting the non-problem of an editor using Latin 1 instead of UTF8 as expected. This is why I'm chiming in.

    CChris

     
  • THEVENOT Guy

    THEVENOT Guy - 2013-08-05

    Hello,

    I'm not a C++ nor a Java programmer, and I just want to give you a hint about your troubles!

    In your text TEST ß TEST — BLAH x :

    • The Latin small letter SHARP S (ß), Unicode character \x00DF, is coded with the two bytes C3 and 9F in a UTF-8 file, with or without BOM.

    • The EM DASH symbol (—), Unicode character \x2014, is coded with the three bytes E2, 80 and 94 in a UTF-8 file, with or without BOM.

    If you copy your text TEST ß TEST — BLAH x into a new tab in Notepad++, open the Encoding menu and choose the first item, Encode in ANSI, you'll get the new text: TEST ÃŸ TEST â€” BLAH x.

    Why? Just because:

    • the two UTF-8 bytes C3 9F represent, after the ANSI encoding, the individual characters Ã and Ÿ, with hexadecimal codes C3 and 9F, in the Windows-1252 or Windows-1254 code pages.

    • the three UTF-8 bytes E2 80 94 represent, after the ANSI encoding, the individual characters â, € and ”, with hexadecimal codes E2, 80 and 94, in the Windows-1252 or Windows-1254 code pages.

    So it seems that, when you write this text to a file by your means, you simultaneously perform an ANSI encoding.

    See, also, my post, relative to Encodings, Conversions and Characters sets...., at the address below :

    https://sourceforge.net/p/notepad-plus/discussion/331753/thread/41da811c/#da8b

    Best regards,

    guy038
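
The byte-level explanation above can be reproduced in a few lines of C#. This is only a sketch under stated assumptions: the class and method names are hypothetical, and the decoder is a hand-written subset of the Windows-1252 table covering ASCII plus the five bytes discussed in this thread, not a library call. A hand-written table is used so the example does not depend on code page support being available in the runtime.

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    // Hand-written subset of the Windows-1252 table: ASCII plus the five
    // bytes discussed in the thread. Not a general-purpose decoder.
    public static string DecodeAsWindows1252Subset(byte[] bytes)
    {
        var sb = new StringBuilder(bytes.Length);
        foreach (byte b in bytes)
        {
            switch (b)
            {
                case 0x80: sb.Append('\u20AC'); break; // euro sign
                case 0x94: sb.Append('\u201D'); break; // right double quotation mark
                case 0x9F: sb.Append('\u0178'); break; // capital Y with diaeresis
                case 0xC3: sb.Append('\u00C3'); break; // capital A with tilde
                case 0xE2: sb.Append('\u00E2'); break; // small a with circumflex
                default:
                    if (b >= 0x80)
                        throw new ArgumentException("byte not covered by the demo table");
                    sb.Append((char)b); // ASCII maps to itself
                    break;
            }
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // UTF-8 encodings of the two non-ASCII characters in the test text.
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes("\u00DF"))); // C3-9F
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes("\u2014"))); // E2-80-94

        // Reading the UTF-8 bytes one byte per character yields the garbled text.
        byte[] utf8 = Encoding.UTF8.GetBytes("TEST \u00DF TEST \u2014 BLAH x");
        string garbled = DecodeAsWindows1252Subset(utf8);
        Console.WriteLine(garbled == "TEST \u00C3\u0178 TEST \u00E2\u20AC\u201D BLAH x"); // True
    }
}
```

The garbled string has 23 characters where the original has 20, because each byte of a multi-byte UTF-8 sequence becomes its own character.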

     
  • WhyDoesOpenIDNotWork

    Thanks guy, that is some useful information, and it certainly seems to make sense.

    Hopefully someone on these forums is au fait with C# and can help with why I seem to be unable to write out this text correctly, even when I'm trying to explicitly encode it as UTF8, rather than ANSI/ASCII.

     
  • WhyDoesOpenIDNotWork

    OK, I found a C# djinn away from here and have ended up with the following code. Basically, once I got rid of the StringBuilder, things started working as expected:

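        public static string GetDocumentTextBytes(IntPtr curScintilla) {

            // SCI_GETLENGTH returns the document length in bytes; add 1 for the terminating NUL.
            int length = (int) Win32.SendMessage(curScintilla, SciMsg.SCI_GETLENGTH, 0, 0) + 1;
            byte[] sb = new byte[length];

            unsafe {
                // Pin the array so Scintilla can write the raw bytes straight into it.
                fixed (byte* p = sb) {
                    IntPtr ptr = (IntPtr) p;
                    Win32.SendMessage(curScintilla, SciMsg.SCI_GETTEXT, length, ptr);
                }

                // Decode the bytes as UTF-8 ourselves, then drop the trailing NUL.
                return System.Text.Encoding.UTF8.GetString(sb).TrimEnd('\0');
            }
        }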
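
One plausible reason the StringBuilder version garbles the text, though the thread does not confirm it, is that the P/Invoke marshaller converts the native buffer to a string using the system ANSI code page, which matches the Windows-1252 pattern described earlier. Fetching raw bytes and decoding them explicitly avoids that conversion. The sketch below shows the same idea without an unsafe block, using Marshal to allocate and copy the native buffer. It is self-contained and does not talk to Scintilla: FakeGetText is a hypothetical stand-in for SCI_GETTEXT, and the class and method names are assumptions for illustration only.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

public static class Utf8BufferDemo
{
    // Hypothetical stand-in for SCI_GETTEXT: copies the document's UTF-8
    // bytes plus a terminating NUL into the caller's native buffer.
    static void FakeGetText(IntPtr dest, int capacity, string document)
    {
        byte[] utf8 = Encoding.UTF8.GetBytes(document);
        int count = Math.Min(utf8.Length, capacity - 1);
        Marshal.Copy(utf8, 0, dest, count);
        Marshal.WriteByte(dest, count, 0);
    }

    public static string ReadUtf8(string document)
    {
        // Stands in for SCI_GETLENGTH, which reports the length in bytes.
        int textLength = Encoding.UTF8.GetByteCount(document);
        int capacity = textLength + 1; // room for the terminating NUL

        IntPtr native = Marshal.AllocHGlobal(capacity);
        try
        {
            FakeGetText(native, capacity, document);

            // Copy only the text bytes, so no NUL needs trimming afterwards.
            byte[] managed = new byte[textLength];
            Marshal.Copy(native, managed, 0, textLength);
            return Encoding.UTF8.GetString(managed);
        }
        finally
        {
            Marshal.FreeHGlobal(native);
        }
    }

    public static void Main()
    {
        string original = "TEST \u00DF TEST \u2014 BLAH x";
        Console.WriteLine(ReadUtf8(original) == original); // True
    }
}
```

Copying exactly textLength bytes leaves the terminator behind, so the TrimEnd call is not needed and a document that legitimately ends in NUL characters would not be shortened.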
    
     
