Character encoding trouble

Status: Beta

Brought to you by: exhuma

#3 Character encoding trouble

Status: open

Owner: Michel Albert

Labels: None

Priority: 5

Updated: 2007-11-21

Created: 2007-11-21

Creator: Michel Albert

Private: No

If a byte string containing multibyte characters is supplied as either "haystack" or "needle", the results are unpredictable:

>>> ngram.compare('dfsédfsdf', 'dfsédfsdf')
XXd: 'XXd'
Xdf: 'Xdf'
dfs: 'dfs'
fs�: 'fs\xc3'
sé: 's\xc3\xa9'
éd: '\xc3\xa9d'
�df: '\xa9df'
dfs: 'dfs'
fsd: 'fsd'
sdf: 'sdf'
dfX: 'dfX'
fXX: 'fXX'
1.0

Note that the trigrams in the middle are not really trigrams: because the multibyte character is treated as two separate characters (its two bytes), some of them contain only two real characters and others contain half a character. In this case, since both strings are treated the same way, the end result is still correct. However, when the strings are supplied as unicode objects, everything works as expected:

>>> ngram.compare(u'dfsédfsdf', u'dfsédfsdf')
XXd: u'XXd'
Xdf: u'Xdf'
dfs: u'dfs'
fsé: u'fs\xe9'
séd: u's\xe9d'
édf: u'\xe9df'
dfs: u'dfs'
fsd: u'fsd'
sdf: u'sdf'
dfX: u'dfX'
fXX: u'fXX'
1.0
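The difference between the two runs can be reproduced outside the module. The following is only a minimal sketch, not the module's code: it is written for Python 3, where `bytes` stands in for the Python 2 `str` type, and the `trigrams` helper, the 'X' padding character and the padding width of 2 are assumptions taken from the output above.

```python
def trigrams(s, pad_char, pad=2):
    """Return the padded 3-grams of a sequence (works for str and bytes)."""
    padded = pad_char * pad + s + pad_char * pad
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


text = "dfs\u00e9dfsdf"        # 9 characters, one of them non-ASCII
raw = text.encode("utf-8")     # 10 bytes: the accented letter becomes b'\xc3\xa9'

text_grams = trigrams(text, "X")
byte_grams = trigrams(raw, b"X")

# The byte version yields one extra n-gram and cuts the character in half.
print(len(text_grams), len(byte_grams))
print(repr(text_grams[3]), repr(byte_grams[3]))
```

The byte-based list has one more entry than the text-based one, which matches the twelve versus eleven lines in the two sessions above.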

For this reason, I will add a type check that only allows unicode objects to be passed into the module. This may also reveal encoding problems, because the unicode conversion will most likely fail in that case. That way the module will not accept data that is obviously wrong.
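A type check of this kind could look roughly like the sketch below. It is not the contents of the attached patch: `require_text` is a hypothetical name, and the code is written for Python 3, where `str` is the unicode type and `bytes` plays the role of the old byte string.

```python
def require_text(value, name="value"):
    """Hypothetical guard: accept only text (unicode) strings."""
    if not isinstance(value, str):
        raise TypeError(
            "%s must be a text string, got %s" % (name, type(value).__name__)
        )
    return value


# A caller who has to convert bytes first finds out early when the
# assumed encoding is wrong: Latin-1 bytes are not valid UTF-8 here.
latin1_bytes = "dfs\u00e9dfsdf".encode("latin-1")
try:
    latin1_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
```

The second half illustrates the point about revealing encoding trouble: the failure happens at the conversion step, before any n-grams are built.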

Discussion

Michel Albert - 2007-11-21

Fix

1835822.patch

Michel Albert - 2007-11-21

File Added: 1835822.patch

Graham Poulter - 2008-10-06

I'm tending towards letting ngram be string-agnostic about unicode vs string, since the distinction disappears in Py3K (everything's unicode).

Perhaps have the docstring recommend unicode for strings containing multibyte characters, rather than policing it in the code?

Michel Albert - 2008-10-06

Good point.

What about issuing a warning message? I am not (yet) as familiar with Python's logging module, but with Log4J it would be possible to emit warnings, which can be caught and logged in the application that is using the library.

From what I see in the Python logging module, this should be possible.

At least then, less experienced Python users will see that something's afoot. I know it bit me once or twice already ;) That was until I found the "repr" function and realized that even if "printing" two "string" variables shows the same text, you might actually be looking at one unicode object and one str object. Once you know the difference it's fine. But for new pythonistas it is not that obvious. With a warning one could give them a helpful hint :)

Or we could just let them stumble over weird results and tell them afterwards: "Read The Fine Manual"! ;)
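For what it's worth, a warning along these lines might be sketched as follows. This is only an illustration for Python 3 (`bytes` standing in for the old `str`); the logger name "ngram" and the `warn_if_bytes` helper are assumptions, not part of the module.

```python
import logging

# Library code only fetches a logger; the application decides what to do with it.
logger = logging.getLogger("ngram")  # logger name is an assumption


def warn_if_bytes(value):
    """Hypothetical helper: log a warning when a byte string is passed."""
    if isinstance(value, bytes):
        logger.warning(
            "byte string %r passed; multibyte characters will be split "
            "into separate bytes, pass a text string instead",
            value,
        )
        return True
    return False
```

An application using the library could then call `logging.basicConfig()` or attach its own handler to see or record the message, much like with Log4J.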

Graham Poulter - 2008-10-07

Hi, I've checked in the 3.0.0 beta code for the module.

It still has the problem, and I've added a unit test demonstrating the "1 byte == 1 character" assumption made when passing str type instead of unicode type.

One workaround might be to convert things to unicode via the padding, by padding with u'$' instead of '$'. Then catch the UnicodeDecodeError, which occurs when attempting to convert a str() instance that uses characters > 127, and re-raise it as an error which says "Non-ascii characters detected: such strings must be provided as unicode objects".
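A rough sketch of that workaround is shown below, adapted to Python 3. There, mixing `bytes` and `str` raises a TypeError instead of attempting an implicit ASCII conversion, so the sketch decodes explicitly. The names `pad_as_text` and `NonAsciiBytesError` and the padding width of 2 are assumptions made for illustration.

```python
class NonAsciiBytesError(ValueError):
    """Hypothetical error for byte strings containing non-ASCII data."""


def pad_as_text(value, pad_char="$", pad=2):
    """Pad a string with a text character, converting ASCII-only bytes."""
    if isinstance(value, bytes):
        try:
            # Python 2 did this implicitly when a str met a unicode object.
            value = value.decode("ascii")
        except UnicodeDecodeError:
            raise NonAsciiBytesError(
                "Non-ascii characters detected: such strings must be "
                "provided as unicode objects"
            ) from None
    return pad_char * pad + value + pad_char * pad
```

Plain ASCII byte strings keep working this way, while anything with bytes above 127 is rejected with a message that points the user at the fix.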

Graham Poulter - 2009-12-07

Currently, in NGram 3.1, the documentation warns that one should use unicode if there is any chance of the input string containing multi-byte encoded characters. The code only checks isinstance(x, basestring).

The transition to Python 3 might be a good time to make NGram force the use of unicode with an "isinstance(x, str)" check.
