Handling Unicode data

Help
Anonymous
2011-02-22
2013-01-25
  • Anonymous

    Anonymous - 2011-02-22

    Hi there, I've just started using Python Script 0.8.0.0 in Notepad++ 5.8.7 - great plugin, but I'm having difficulties properly handling Unicode-encoded data. I've simplified one of the more basic situations in the following example:

    def lowercase(contents, lineNumber, totalLines):
        amendedLine = contents.lower()
        editor.replaceWholeLine(lineNumber, amendedLine)

    editor.forEachLine(lowercase)

    When executed for a text file that has been encoded in utf-8, this script does not modify characters such as à and É to lower-case, simply leaving them as they are.

    Does anyone have any ideas why?

     
  • Dave Brotherstone

    Basically, you need to convert to and from unicode. Assuming you're working with a UTF-8 file:

    amendedLine = contents.decode('utf8').lower().encode('utf8')
    

    This says: take the contents and convert from a UTF-8 byte string to a unicode string (if you do an editor.getText() in the console, you'll see the output string has a "u" in it, identifying it as a unicode string instead of a normal "multibyte" string). Then lower-case the unicode string (this can obviously take advantage of knowledge of unicode code points and the various casing rules), then convert it back to a multibyte string (with encode('utf8')).

    A bit long-winded, but that's due to the way Python and Scintilla chose to handle text (both actually the same way, with multibyte strings as the "base").
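    A minimal sketch of that round trip, runnable outside Notepad++. The helper name lower_utf8 is hypothetical, and the sample line is an assumption standing in for what the editor hands to the callback in a UTF-8 file. The b'' and u'' prefixes are used so the snippet behaves the same under the plugin's Python 2 and under Python 3.

```python
def lower_utf8(contents):
    """Lower-case a UTF-8 encoded byte string and return UTF-8 bytes."""
    return contents.decode('utf8').lower().encode('utf8')


# u'\xc0' is the character A-grave and u'\xc9' is E-acute, both upper case.
# Encoding to UTF-8 mimics the byte string the editor would pass in (assumption).
line = u'\xc0 LA CARTE, \xc9T\xc9'.encode('utf8')

# Byte-string lower() only touches the ASCII letters.
ascii_only = line.lower()

# Going through a unicode string lower-cases the accented letters as well.
full = lower_utf8(line)

# Inside the plugin, the callback from the question would then do something like:
#     editor.replaceWholeLine(lineNumber, lower_utf8(contents))
```

    Here ascii_only still contains the upper-case accented letters, while full has them lower-cased.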

    Hope that helps,
    Dave.

     
  • Anonymous

    Anonymous - 2011-02-22

    Many thanks Dave - that works a dream.

    To try and understand what's happening here, are you saying that the way that the utf-8 encoded data file stores its data is different from how Python stores the data in its Unicode strings, although both are using Unicode encoding?

    Also, when I enter editor.getText() in the console, I don't get the "u" format, but instead the relevant Unicode chars as escaped sequences e.g. \xc3\xa3. If I enter print("é") in the console, I also get the escaped code printed out. Is there any way to force the console to consider its input and output as Unicode? I apologise for these naïve questions…

     
  • Dave Brotherstone

    To try and understand what's happening here, are you saying that the way that the utf-8 encoded data file stores its data is different from how Python stores the data in its Unicode strings, although both are using Unicode encoding?

    No, they're identical: if you save your script in UTF-8 and your target file is UTF-8, the bytes are the same. The problem is that Python treats a normal string as an 8-bit string, whatever the characters are (so \xb4, i.e. hex byte B4, is just a character that can't be displayed). Python handles unicode with a special datatype. Natively it's stored as (I believe) UTF-16, but you only need to know that with a unicode string you can do upper/lower and all the other operations across the entire unicode character set. However, to put that unicode data back into Scintilla, you need to convert it back to good ol' UTF-8.
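    To illustrate the difference between the two datatypes, here is a small sketch (the variable names are made up for the example). The same character is one element as a unicode string but two elements as a UTF-8 byte string, and only the unicode form knows how to change its case.

```python
as_unicode = u'\xe9'                    # e-acute as a single unicode character
as_bytes = as_unicode.encode('utf8')    # the same character as UTF-8 bytes

char_count = len(as_unicode)            # counts characters
byte_count = len(as_bytes)              # counts bytes

upper_unicode = as_unicode.upper()      # case mapping works on the unicode string
upper_bytes = as_bytes.upper()          # the byte string is left untouched

# Convert back to UTF-8 before handing the text to the editor again.
round_trip = upper_unicode.encode('utf8')
```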

    There's a great article on Joel on Software about character sets - http://www.joelonsoftware.com/articles/Unicode.html - it's the best introduction to character sets I've read, with Joel's added humour :)

    As for the console, that's a tricky one, as the console write() function expects a normal string and I'm not sure the console output is configured for UTF-8. I'm testing the new version at the moment, so I'll try to make sure that's correct in the next version.
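    In the meantime, the escaped sequences can be checked by hand. This is only a sketch using the standard unicodedata module, not a console fix: it decodes the two bytes quoted in the question (\xc3\xa3) and looks up which character they represent.

```python
import unicodedata

raw = b'\xc3\xa3'               # the escaped bytes shown by the console
text = raw.decode('utf8')       # two UTF-8 bytes become one unicode character

length = len(text)              # number of characters after decoding
name = unicodedata.name(text)   # official Unicode name of that character
```

    The two escaped bytes turn out to be a single character once decoded, which is why the console shows two escapes for one accented letter.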

    They're not naive questions, by the way. It's not the simplest thing to understand, and there are lots of ifs and buts!

    The Python documentation on unicode may help - http://docs.python.org/library/stdtypes.html#sequence-types-str-unicode-list-tuple-bytearray-buffer-xrange

    Good luck!

    Dave.

     
  • Anonymous

    Anonymous - 2011-02-22

    Thanks Dave - I guess I've got my reading cut out for me!

    I'm glad I posted a question here in the Forum, and that you kindly answered so quickly, as I was ready to give up on Python after struggling with this Unicode issue a few days ago. But the pure power and simplicity of the plugin drew me back to give it a second go and I'm now looking forward to learning more. Thanks for producing such a neat wee plugin and for being such a great help!

     
