[Pyobjc-dev] Bridging strings from Python to other languages
From: Bill B. <bb...@co...> - 2003-02-05 04:56:02
[This is a continuation of the thread that Just mentions below -- "NSString & mutability". I finally had a chance to write enough code to figure out where the walls were that I kept bloodying my head against. I believe I also came up with a more constructive way to think about this whole problem. More below -- the end of the 'bridging strings' section lists what I believe are issues with the current Python implementation that should/may/could be addressed in a future version.]

On Tuesday, Feb 4, 2003, at 22:13 US/Eastern, Just wrote on python-dev:

> (The use case is this. The PyObjC project marries Objective-C with
> Python. This is cool as it gives us direct access to almost all of
> Cocoa, the native OSX GUI interface. However, Cocoa defines its own
> string type and for reasons that are waaay beyond the scope of this
> post (check the archives of the pyobjc-dev list if you're really
> really interested; see a recent thread called "NSString &
> mutability") it appears a bad idea to _convert_ these strings to
> Python unicode strings.
> So we need to wrap them. Yet they should work as much like unicode
> strings as possible...)

Let me rephrase the problem in slightly different terms. This will be long-winded -- skip down to the 'bridging strings' section if you don't want to go through the initial discussion of the challenges of bridging two runtimes...

In creating a bridge between Python and other languages -- in this case, Objective-C -- the general goal is to provide seamless connectivity between the two runtime environments. That is, you want to have a proxy to objects or structures found in the 'alien' runtime available in the 'native' runtime in a fashion that makes the proxy convenient to use. Generally, this means that the proxy should act as much like an object of the 'native' runtime as possible, up to the point where it starts to obfuscate the behavior of the 'alien' runtime.

While a decent bridging and proxying mechanism can make "crossing the bridge" easy to do, one can never avoid the fact that there really is a bridge and on the other side there really is an 'alien' runtime.

Now -- there are a number of different ways to proxy objects/structures between the two runtimes:

- pure proxy: the object/structure to be bridged is represented by a proxy that handles all requests for information or invocation of functions/methods by converting the request/invocation into a form that can be understood on the other side of the bridge.

  Example -- the following creates a python native proxy to the alien Objective-C NSMutableArray instance (ignore that it really creates an NSCFArray -- that is an internal-to-Foundation implementation detail that is irrelevant). The expression 'a.count()' actually causes the 'count' Objective-C method to be invoked through the proxy a:

    >>> from Foundation import *
    >>> a = NSMutableArray.array()
    >>> type(a)
    <objective-c class NSCFArray at 0x466d0>
    >>> a.count()
    0
    >>> len(a)
    0

  The len(a) is just a demonstration of how far the proxying can go by defining the appropriate internal(?) attributes on the proxy.

- pure conversion: the object/structure to be bridged is converted to the native type as it crosses the bridge.

  Example -- NSString is currently bridged to the Python string class such that string instances are converted to their native types as they cross the bridge. [At least, this is the case in CVS -- I now have a proxy class that can wrap a Python PyString/PyUnicode instance and present it as a standard NSString instance on the ObjC side. It avoids lots of unnecessary data copying when going from Python->Objective-C, but it needs a bunch of cleanup before I commit.]

  In the following, I create a new Objective-C NSString instance and assign it to 's'. What results is a copy of the contents of the NSString instance shoved into a normal Python string.

    >>> s = NSString.stringWithString_("Foobar")
    >>> type(s)
    <type 'str'>

- mixed conversion/proxy: this is a suboptimal case. It generally converts to a native type in one direction, but potentially not in the other or not fully.

  Example -- NSNumber is currently in this category. It should change eventually, but there are issues to deal with:

    >>> a = NSArray.arrayWithObject_(1)
    >>> a[0]
    <NSCFNumber objective-c instance 0x6704b0>
    >>> a[0] + 1
    2

---

One of the key challenges is that proxying effectively causes two references to any given object to exist; the native object reference and the 'alien' reference through the proxy. Care must be taken to ensure that a single reference on either side of the bridge is enough to preserve both components of the hunk of data while also ensuring that the existence of a proxy without references does not prevent the item from being collected [may sound confusing: consider the situation where a Python class is subclassed in the alien environment or vice-versa -- you effectively end up with instances that have part of their implementation in one runtime and the other part in the other runtime. It can lead to issues.].

In general, these kinds of issues can be worked through by leveraging mechanisms such as weak references. By providing a callback on the finalization of an object, it is possible to ensure that the alien-to-python component of the instance is destroyed, as well.

A final challenge is that sometimes an object's type or contents are completely irrelevant to a piece of code. It is the object reference itself that is meaningful. In these situations, if an object is passed across the bridge and back, what should come back really should be what went across in the first place -- if not, the contents may have been preserved, but the object's original identity has been lost.

Sometimes an object is just an object.
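
To make the weak-reference and identity points above concrete, here is a minimal pure-Python sketch. It is not PyObjC code: `AlienObject`, `Proxy`, and `Bridge` are hypothetical names invented for illustration, and the "alien runtime" is simulated with an ordinary Python class. It only shows the general shape: a cache of proxies held by weak reference, so the cache never keeps a proxy alive by itself, and a round trip across the bridge hands back the very same object.

```python
import weakref


class AlienObject:
    """Hypothetical stand-in for an object living in the 'alien' runtime."""

    def __init__(self, name):
        self.name = name


class Proxy:
    """Hypothetical native-side proxy that forwards attribute access."""

    def __init__(self, alien):
        # Strong reference: a live proxy keeps the alien object alive.
        self._alien = alien

    def __getattr__(self, attr):
        # Only called for attributes not found on the proxy itself.
        return getattr(self._alien, attr)


class Bridge:
    """Hands out at most one live proxy per alien object."""

    def __init__(self):
        # Weak values: an entry disappears once its proxy is finalized.
        self._proxies = weakref.WeakValueDictionary()

    def to_native(self, alien):
        # id() is a safe key here only because a cached proxy keeps
        # its alien object alive for as long as the entry exists.
        key = id(alien)
        proxy = self._proxies.get(key)
        if proxy is None:
            proxy = Proxy(alien)
            self._proxies[key] = proxy
        return proxy

    def to_alien(self, proxy):
        # Crossing back returns the original object, not a copy.
        return proxy._alien

    def live_proxy_count(self):
        return len(self._proxies)
```

With this sketch, calling `bridge.to_native(alien)` twice yields the same proxy, `bridge.to_alien(proxy)` yields the original `alien`, and once the last reference to the proxy goes away the cache entry is dropped. A real bridge has to do the equivalent bookkeeping across two memory-management schemes, which is where the difficulties described above come from.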

---

Strings provide a particular set of challenges in that no two runtime environments present exactly the same set of features in their string handling API, yet every runtime has some kind of a string API and, invariably, that API is very much at the core of the runtime. The addition of Unicode to every string API over the last decade+ has not made things any simpler.

In python, strings are immutable and can encapsulate non-unicode data. A separate unicode type -- also immutable -- is provided to encapsulate unicode data, but the standard string type can also encapsulate unicode data in certain circumstances [at least, it appears that PyString will happily consume and represent UTF8].

In Objective-C [and other languages], there is a single String class that can encapsulate both ASCII and unicode data in many different encodings. Furthermore, there is a subclass of String that provides additional mutability API -- an instance of the mutable string class can have its contents changed by the developer while the identity of the object remains the same (unlike python where appending "b" to "a" results in a new string "ab").

To further complicate matters, most typed languages support the concept of 'upcasting' -- that is, of casting a particular instance to actually be an instance of a superclass. For Objective-C, it can mean that a method that is declared as returning an immutable string or array actually returns a mutable string or array instance -- as long as the developer pays attention to the compiler warnings and doesn't do any stupid casting of their own, everything is fine. Java offers similar casting "features".

- bridging strings -

So, how to bridge strings in such an environment?

In all cases, we can [fortunately] assume that strings pass across the bridge in one of a few choke points in the code -- that there is always a location to add a little bit of logic with which to help bridge the string [or any other random object].
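
The immutable-versus-mutable contrast described above can be seen from the Python side alone. The sketch below is only an analogy: `bytearray` stands in for a mutable string such as NSMutableString purely for illustration and is not how the bridge represents one, and `append_in_place` is a hypothetical helper name.

```python
def append_in_place(value, suffix):
    """Append suffix with += and report whether identity survived.

    Returns a (result, same_object) pair.
    """
    original = value
    value += suffix
    return value, value is original


# Immutable: appending "b" to "a" produces a new string object.
text, text_same = append_in_place("a", "b")

# Mutable: the contents change, the object stays the same.
buf, buf_same = append_in_place(bytearray(b"a"), b"b")
```

Here `text_same` is False while `buf_same` is True. Code on the alien side that holds on to a mutable string expects the second behaviour, which is why converting such a string to a Python string as it crosses the bridge loses something.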

The goal is to bridge strings in a fashion such that (not really in order of importance):

(1) only one hunk of memory is used to contain the data within the string

(2) conversion is kept to a minimum, if present at all, because strings will be passed back-and-forth across the bridge very frequently

(3) identity is maintained; pass a string with id() 7570720 from Python into the alien runtime and subsequently from the alien runtime back into python and the same string instance with id() 7570720 really should come back

(4) 'alien' string specific API can still be used; the Objective-C NSString provides a very rich API, including localization features that are not available in pure python.

For Python->Objective-C, bridging strings has proven to be fairly easy. (1), (2), and (4) are quite straightforward. (3) is not done yet.

For Objective-C->Python, bridging strings is not so easy. The difficulty is compounded by certain features of the Python string/unicode APIs.

(1) is pretty easy -- the challenge is to figure out which API to call on the Python side such that the resulting Python object does not copy and re-encode the data. If that is unavoidable, the cost of encoding or conversion (2) should be minimized [hopefully with a cache so that the cost of conversion is paid once, then never again for immutable string instances]. There is also the ongoing challenge of determining when to use the PyString vs. PyUnicode APIs; it seems that unicode objects are not welcome everywhere that string objects are?

(4) is actually quite easy and has been available for some time through the use of unbound methods. However, the current implementation in CVS will always cause the python string to be converted to an NSString, the method invoked, and then the result -- if any and if a string -- converted back to a python string.

(3) is not so easy -- at least, not from what I have determined so far.
Most of the issues seem to be due to limitations in Python (which is really just another way of saying "I don't know enough to approach this problem from the right direction"):

- can't use weakref because one can't have a weak reference to a string or unicode object. This means that a callback when a string ref is finalized on the python side is not possible. It also means that creating a hash between ObjC string instances and Python string instances can't be done without using strong references, thereby creating the potential for leaking memory.

- can't subclass string (but can unicode) to provide a class that acts exactly like a regular string while containing a reference to the foreign string object. There doesn't appear to be anywhere to hide a hunk of data in the string instance, either.

- can't use the character buffer APIs because a character buffer cannot be used consistently throughout the python APIs in the same places as a string. Using str() to turn a char buffer into a string violates (1) [and doesn't make much sense anyway].

End result -- it is very difficult to preserve the association between an alien string instance and a PyString instance consistently. Even if PyString instances provide very thorough and consistent hashing behavior where two strings with the same contents always hash the same, the same cannot be said of all alien environments. Even when it is true, there are cases where the developer may be relying on the identity of the object to not change outside of their control.

Mutable strings obviously present issues of their own, but they are not particularly relevant to discussion on python-dev outside of how future development might make the support of such common idiosyncrasies a bit more straightforward. Ideally, one could have an object on the python side that looks/feels/smells like a string instance, but whose contents may change. This creates any number of exciting problems.
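
The weakref limitation in the list above is easy to reproduce. The following is a minimal sketch, written for a present-day CPython rather than the 2003 interpreter this post discusses; `Wrapper` and `can_weakref` are hypothetical names used only for illustration. It checks whether a weak reference can be created to a plain string versus an ordinary class instance.

```python
import weakref


class Wrapper:
    """Hypothetical ordinary object that merely holds a string."""

    def __init__(self, text):
        self.text = text


def can_weakref(obj):
    """Return True if a weak reference to obj can be created."""
    try:
        weakref.ref(obj)
    except TypeError:
        # Raised for types that do not support weak references.
        return False
    return True


plain_ok = can_weakref("Foobar")             # a built-in string
wrapped_ok = can_weakref(Wrapper("Foobar"))  # an ordinary instance
```

`plain_ok` comes out False and `wrapped_ok` True: a table mapping alien strings to plain Python strings has to hold strong references, whereas a table of wrapper objects can be weak. The catch, as described above, is that a wrapper is not accepted everywhere a real string is.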

To further compound problems, anything that is declared as returning an NSString *may* return an NSMutableString at whim. It doesn't happen often, but when it does, if the handling of mutable vs. immutable strings is too radically different, it'll cause code to blow up in highly unexpected and very difficult to debug ways.

Rambling on....

b.bum