Notepad++ Python Script / Discussion / Help: Handling Unicode data

Anonymous - 2011-02-22

Hi there, I've just started using Python Script 0.8.0.0 in Notepad++ 5.8.7 - great plugin, but I'm having difficulties properly handling Unicode-encoded data. I've simplified one of the more basic situations in the following example:

def lowercase(contents, lineNumber, totalLines):
    amendedLine = contents.lower()
    editor.replaceWholeLine(lineNumber, amendedLine)

editor.forEachLine(lowercase)

When run on a text file encoded in UTF-8, this script does not convert characters such as Ã and É to lower case; it simply leaves them as they are.

Does anyone have any ideas why?

Dave Brotherstone - 2011-02-22

Basically, you need to convert to, and from, unicode. Assuming you're working with a UTF-8 file:

amendedLine = contents.decode('utf8').lower().encode('utf8')

This says: take the contents and convert them from a UTF-8 byte string to a unicode string (if you do an editor.getText() in the console, you'll see the output string has a "u" in it, identifying it as a unicode string instead of a normal "multibyte" string). Then lower-case the unicode string (this can take advantage of knowledge of Unicode code points and the various casing rules), and finally convert it back to a multibyte string with encode('utf8').

A bit long-winded, but that's due to the way Python and Scintilla chose to handle text (both actually the same way, with multibyte strings as the "base").
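
As a minimal sketch of that round trip, the snippet below uses a hard-coded UTF-8 byte string in place of a line from the editor. The name lowercase_utf8 and the sample bytes are assumptions made for illustration, not part of the plugin; inside Notepad++ the returned value would be what gets passed to editor.replaceWholeLine.

```python
# Sketch only: lowercase_utf8 and the sample data are hypothetical,
# chosen for illustration; they are not part of Python Script.

def lowercase_utf8(contents):
    """Lower-case a UTF-8 encoded byte string, including non-ASCII letters."""
    text = contents.decode('utf8')       # UTF-8 bytes -> unicode string
    return text.lower().encode('utf8')   # lower-case, then back to UTF-8 bytes

# UTF-8 bytes for U+00C3 (A with tilde), U+00C9 (E with acute), then " ABC"
sample = b'\xc3\x83\xc3\x89 ABC'

# Calling lower() directly on the byte string only changes the ASCII letters
ascii_only = sample.lower()

# Decoding first lets lower() apply the Unicode casing rules
fixed = lowercase_utf8(sample)
```

Here ascii_only still contains the bytes for the two upper-case accented letters, while fixed contains the bytes for their lower-case forms.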

Hope that helps,
Dave.

Anonymous - 2011-02-22

Many thanks Dave - that works a dream.

To try and understand what's happening here, are you saying that the way that the utf-8 encoded data file stores its data is different from how Python stores the data in its Unicode strings, although both are using Unicode encoding?

Also, when I enter editor.getText() in the console, I don't get the "u" format, but instead the relevant Unicode chars as escaped sequences e.g. \xc3\xa3. If I enter print("é") in the console, I also get the escaped code printed out. Is there any way to force the console to consider its input and output as Unicode? I apologise for these naïve questions…

Dave Brotherstone - 2011-02-22

> To try and understand what's happening here, are you saying that the way that the utf-8 encoded data file stores its data is different from how Python stores the data in its Unicode strings, although both are using Unicode encoding?

No, they're identical: if you save your script in UTF-8 and your target file is UTF-8, then the two match. The problem is that Python treats a normal string as an 8-bit string, whatever the characters are (so \xb4, i.e. hex byte B4, is just a character that can't be displayed). Python handles Unicode with a special datatype. Natively it's stored as (I believe) UTF-16, but all you need to know is that with a unicode string you can do upper/lower and all the other operations across the entire Unicode character set. However, to put that unicode data back into Scintilla, you need to convert it back to good ol' UTF-8.
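
A small sketch of the difference between the two string types, assuming a single accented character as test data (the variable names are illustrative only): the 8-bit string sees two separate bytes, while the unicode string sees one character that it knows how to upper-case.

```python
# Illustrative data only: one character, U+00E9 (e with acute), as UTF-8 bytes
raw = b'\xc3\xa9'
text = raw.decode('utf8')    # the same character as a unicode string

byte_count = len(raw)        # the 8-bit string counts bytes
char_count = len(text)       # the unicode string counts characters

# Case conversion works on the unicode string; encode again before
# handing the result back to the editor
upper_bytes = text.upper().encode('utf8')
```

The byte count is 2 and the character count is 1, which is why operations on the 8-bit string cannot treat the pair of bytes as a single letter.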

There's a great article on Joel on Software about character sets - http://www.joelonsoftware.com/articles/Unicode.html - it's the best introduction to character sets I've read, with Joel's added humour :)

As for the console, that's a tricky one: the console write() function expects a normal string, and I'm not sure the console output is configured for UTF-8. I'm testing the new version at the moment, so I'll try to make sure that's correct in the next version.
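
On the escaped sequences seen in the console: a sketch, assuming the console echoes an expression's repr() the way the standard interactive interpreter does. The two escapes \xc3\xa3 are then just the two UTF-8 bytes of a single character, and decoding recovers it.

```python
# U+00E3 (a with tilde) encoded as UTF-8 gives two bytes
raw = u'\u00e3'.encode('utf8')

# repr() is what an interactive interpreter echoes for an expression;
# non-ASCII bytes appear as \x escapes
shown = repr(raw)

# Decoding turns the two bytes back into one unicode character
decoded = raw.decode('utf8')
```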

They're not naive questions, by the way - it's not the simplest thing to understand, and there are lots of ifs and buts!

The Python documentation on unicode may help - http://docs.python.org/library/stdtypes.html#sequence-types-str-unicode-list-tuple-bytearray-buffer-xrange

Good luck!

Dave.

Anonymous - 2011-02-22

Thanks Dave - I guess I've got my reading cut out for me!

I'm glad I posted a question here in the Forum, and that you kindly answered so quickly, as I was ready to give up on Python after struggling with this Unicode issue a few days ago. But the pure power and simplicity of the plugin drew me back to give it a second go and I'm now looking forward to learning more. Thanks for producing such a neat wee plugin and for being such a great help!
