json-py / Bugs / #9 JsonWriter should encode Unicode characters

#9 JsonWriter should encode Unicode characters

Status: open

Owner: nobody

Labels: None

Priority: 5

Updated: 2005-08-07

Created: 2005-08-07

Creator: Koen van de Sande

Private: No

Right now, JsonWriter simply concatenates all
strings into the written-out string. However, it
should escape every character outside the ASCII
range (ord(character) >= 128).

An example of the problem (\u20ac is a single
character). The following reads properly (in the
input string, the Unicode character is encoded as
6 characters):

t = json.read(r'{"\u20ac_bef":27}')

Writing this out does not encode Unicode characters back:

json.write(t)

gives

>>> json.write(t)
u'{"\u20ac_bef":27}'

which is wrong. Check the 3rd character:

>>> json.write(t)[2]
u'\u20ac'

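So this should be encoded. Before line 289
(self._append(obj)), insert the following fix
(of course, the import and the fix and fixFunction
definitions only need to be executed once, somewhere else):
import re
# match any character from U+0080 up to U+FFFF
fix = re.compile(u"([\u0080-\uffff])")
# replace the matched character with its \uXXXX escape
fixFunction = lambda x: r"\u%04x" % ord(x.group(1))
# sub() returns a new string, so assign the result back
obj = fix.sub(fixFunction, obj)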

This fixes it and should have no side-effects (besides
the introduction of a Regular expression).
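
The idea behind the fix can be tried in isolation. The following is only a minimal sketch in current Python syntax, not the code from json.py: the helper name escape_non_ascii is made up for illustration, and, like the regular expression in the report, it only covers characters up to U+FFFF.

```python
import re

# Same character class as in the report: U+0080 through U+FFFF.
# Characters above U+FFFF are not matched by this pattern.
_NON_ASCII = re.compile("([\u0080-\uffff])")


def escape_non_ascii(text):
    # Hypothetical helper: replace each matched character with a
    # six-character escape (backslash, 'u', four lowercase hex digits).
    return _NON_ASCII.sub(lambda m: "\\u%04x" % ord(m.group(1)), text)


# The key from the report: a euro sign followed by "_bef".
escaped = escape_non_ascii('{"\u20ac_bef":27}')
print(escaped)
```

Plain ASCII input passes through unchanged, so applying the substitution to every string before it is appended should only affect strings that actually contain non-ASCII characters.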

Discussion

Koen van de Sande - 2005-08-07

A quick note: The import statement above should of course
read "import re".

Koen van de Sande - 2005-08-07

Another addendum: I had to add the Unicode type to the check on lines 216/217:
if type(key) is not types.StringType and \
   type(key) is not types.UnicodeType:
    raise ReadException, "Not a valid JSON object key (should be a string): %s" % key

Otherwise my testcase would not be read at all, even though
it is valid JSON (I think, anyway).

Related to this bug is the test case
"testWriteEscapedHexCharacter", whose result is wrong. The
test case should be:

def testWriteEscapedHexCharacter(self):
    s = json.write(u'\u1001')
    self.assertEqual(r'"\u1001"', _removeWhitespace(s))

(in the assertion it should be a raw string instead of a
Unicode string).
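
The difference between the two literals is easy to check. This is a small sketch in current Python syntax; it uses the standard library json module purely for comparison (that module is not json-py), and the variable names are illustrative.

```python
import json  # standard library module, used here only for comparison

# Raw literal: the backslash is kept, so this is the escaped form that a
# writer emitting ASCII-only output would produce.
raw_expected = r'"\u1001"'

# Non-raw literal: \u1001 is decoded into one character by the parser.
decoded = '"\u1001"'

print(len(raw_expected))  # 8: two quotes plus backslash, 'u', four digits
print(len(decoded))       # 3: two quotes plus one character

# With its default ensure_ascii=True, json.dumps escapes the character.
stdlib_output = json.dumps('\u1001')
print(stdlib_output == raw_expected)
```

So an assertion against the non-raw literal compares with the unescaped character, which is exactly the output this report considers wrong.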

That should make this report complete and accurate.

Koen van de Sande - 2005-08-07

My fixes applied to json.py

json.py

Koen van de Sande - 2005-08-07

Corrected the testWriteEscapedHexCharacter test case

jsontest.py

Koen van de Sande - 2005-08-08

All line numbers are against the 3.2 version.
