[Pyobjc-dev] depythonify_c_value rejects non-ascii, non-unicode strings

Good day, all!

I am writing some Python code that has to output Latin-1 text.
Some of that output makes its way through other (python) code to a text 
widget through insertText_. The other code does not know about my 
encoding choice, as it is not my code, but Glenn Andreas' PyOxide IDE; 
it should not need to know about encoding. So it simply passes along my 
Latin-1 strings to the insertText_ method of a text widget, where the 
PyObjC bridge tries to turn them into an NSString.

In objc_support.c, the function
int depythonify_c_value(const char *type, PyObject *argument, void *datum)
contains the following code (currently around line 1300):
			as_unicode = PyUnicode_Decode(
				strval,
				len,
				PyUnicode_GetDefaultEncoding(),
				"strict");
			if (as_unicode == NULL) {
				PyErr_Format(PyExc_UnicodeError,
					"depythonifying 'id', got "
					"a string with a non-default "
					"encoding");
				return -1;
			}
Now, it turns out that the default encoding is ASCII unless changed 
with PyUnicode_SetDefaultEncoding (see 
/System/Library/Frameworks/Python.framework/Headers/unicodeobject.h).
That means that whenever my strings contain a non-ASCII byte, I get the 
error above and no output at all.
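
To make the failure concrete, here is a small sketch of what the strict 
decode does with a Latin-1 string. It is only an illustration: it uses 
Python 3 bytes as a stand-in for the byte strings the bridge receives, 
and decode_strict_ascii is a hypothetical helper, not part of PyObjC.

```python
# Latin-1 encoded bytes for "café" (0xE9 is e-acute in ISO-8859-1).
latin1_bytes = "caf\u00e9".encode("latin-1")


def decode_strict_ascii(data):
    """Hypothetical helper: strict ASCII decode, None on failure.

    This mirrors the shape of the bridge code, which returns -1 when
    PyUnicode_Decode(..., "strict") returns NULL.
    """
    try:
        return data.decode("ascii", "strict")
    except UnicodeDecodeError:
        return None


ascii_result = decode_strict_ascii(b"cafe")         # pure ASCII succeeds
latin1_result = decode_strict_ascii(latin1_bytes)   # fails on byte 0xE9

print(repr(ascii_result))
print(repr(latin1_result))
```

A single byte above 0x7F is enough to make the whole string fail, which 
matches the "no output at all" behaviour described above.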

It is fairly easy to set the default encoding at startup (thanks to 
Glenn for pointing this out to me) using 
sys.setdefaultencoding('iso-8859-1') in a sitecustomize.py.
However, this can only be done at Python startup, and I fear many users 
of the bridge may not know about this limitation.
I propose that the PyObjC bridge use a less restrictive encoding than 
the current (bizarre) platform default, so as to allow Python to output 
encoded text to Cocoa widgets.
(Maybe the bridge should have a hook to set the platform default when 
the Python subsystem is started?)
I suggest Latin-1, as it is the most common encoding and the one most 
likely to be used by most (Unix-written) Python code. Even if the 
Python code uses another encoding, Latin-1 lets every byte pass through 
unchanged to the widget, so if the user sees gibberish, it will at 
least be familiar gibberish. But I am sure a case could be made for 
mac-roman as well.
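
The pass-through property of Latin-1 can be checked directly. The 
sketch below again uses Python 3 bytes purely for illustration. It 
shows that every one of the 256 byte values decodes under Latin-1 to 
the code point with the same number, and that the same byte decodes to 
a different character under mac-roman.

```python
# Every possible byte value, 0x00 through 0xFF.
all_bytes = bytes(range(256))

# Latin-1 never fails: byte N becomes code point U+00NN.
as_latin1 = all_bytes.decode("latin-1")
code_points = [ord(ch) for ch in as_latin1]

# The mapping is lossless, so encoding again gives back the same bytes.
round_trip = as_latin1.encode("latin-1")

# The same byte means different things in different 8-bit encodings.
sample = b"\xe9"
latin1_char = sample.decode("latin-1")
macroman_char = sample.decode("mac-roman")

print(code_points == list(range(256)))
print(round_trip == all_bytes)
print(latin1_char != macroman_char)
```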
Another solution (Glenn's suggestion) is to at least not decode it 
'strict'ly, using 'ignore' or at worst 'replace' to allow some of the 
text at least to reach the user...
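
For comparison, this is roughly what the 'ignore' and 'replace' error 
handlers would do to a Latin-1 string decoded as ASCII. As before, 
this is a Python 3 sketch of the codec behaviour, not the bridge code 
itself.

```python
# "café au lait" encoded as Latin-1; only the 0xE9 byte is non-ASCII.
data = "caf\u00e9 au lait".encode("latin-1")

# 'ignore' silently drops the bytes that cannot be decoded.
ignored = data.decode("ascii", "ignore")

# 'replace' substitutes U+FFFD (the replacement character) for each one.
replaced = data.decode("ascii", "replace")

print(repr(ignored))
print(repr(replaced))
```

Either way the ASCII part of the text reaches the widget; 'replace' has 
the advantage of showing the user that something was lost.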

Whatever the correct solution, I feel that the current situation 
(rejecting any encoded non-ascii text) is overly restrictive.

Thank you for your attention,
Marc-Antoine Parent