|
From: James L. <bjl...@lo...> - 2007-01-28 05:26:41
|
Does the python interface to:
self.gaim.GaimSavedstatusSetMessage(self.status, message)
require a UTF8 string?
I am trying to modify the AmarokGaim plugin so that it works with
non-utf8 strings like "Björk".
It calls "message = message.decode('utf8')" where message might be the
string "Björk".
This works for utf8 strings but fails on strings that contain special
characters.
I tried "message = message.decode('utf16')" but that errors.
I tried to leave message in whatever Amarok sends but then the gaim
status is weird characters.
|
|
From: Sean E. <sea...@gm...> - 2007-01-28 05:53:13
|
On 1/27/07, James Lockie <bjl...@lo...> wrote:
> Does the python interface to:
> self.gaim.GaimSavedstatusSetMessage(self.status, message)
> require a UTF8 string?
The Gaim API requires that all strings be UTF-8. I'm not sure how
Python deals with strings, but if it has the concept of different
encodings (as you seem to imply), it would make sense that you would
need to use UTF-8 strings in Python.
> I am trying to modify the AmarokGaim plugin so that it works with
> non-utf8 strings like "Björk".
> It calls "message = message.decode('utf8')" where message might be the
> string "Björk".
> This works for utf8 strings but fails on strings that contain special
> characters.
"strings that contain special characters"? You mean "strings that
aren't in UTF-8." I'm sure Amarok provides a mechanism for knowing the
encoding of the strings it gives you, or there's a KDE convention that
"all strings use X encoding."
-s.
|
|
From: James L. <bjl...@lo...> - 2007-01-28 16:20:19
|
Thanks.
Sean Egan wrote:
> I'm sure Amarok provides a mechanism for knowing the
> encoding of the strings it gives you
I have asked on the Amarok mailing list what string format Amarok is
returning. :-)
Bartosz Oler wrote:
> On 01/28/2007 06:26 AM, James Lockie wrote:
>
> [...]
>
>> I am trying to modify the AmarokGaim plugin so that it works with
>> non-utf8 strings like "Björk".
>>
>> It calls "message = message.decode('utf8')" where message might be the
>> string "Björk".
>>
> [...]
>
> The decode() method converts a string into a Unicode string (which is just
> an internal representation and not the thing which you can pass anywhere
> further). The argument specifies *current* encoding of the string. Thus if
> you know it's not utf-8 then you should not be giving utf-8 as the argument.
Mmm.
I didn't write the original code and I don't know python, gaim or Amarok
so I am trying to figure it out.
The original code passes utf8 to decode and it works for strings with
non-special characters but fails on things like "Björk".
I can print out "Björk" to the console so the internal representation by
python can handle special characters.
|
|
From: Ethan B. <ebl...@cs...> - 2007-01-28 17:22:48
|
James Lockie spake unto us the following wisdom:
> James Lockie wrote:
> > Sean Egan wrote:
> >> I'm sure Amarok provides a mechanism for knowing the
> >> encoding of the strings it gives you
> >
> > I have asked on the Amarok mailing list what string format Amarok is
> > returning. :-)
> >
> I took out the decode and it works for strings without special
> characters but I get this error for "Björk":
> message.append(signature=introspect_sig, *args)
> UnicodeError: String parameters to be sent over D-Bus must be valid UTF-8
>
> My guess is Amarok is NOT sending utf8 but decode works for some of the
> strings but utf8 for other strings. :-(
> I'll figure it out. :-)
I'm guessing Amarok is not sending UTF-8 for *any* of its strings, but
those strings which happen to contain only characters in the ASCII
range are validating as UTF-8. (ASCII strings are valid UTF-8, and
will display correctly.) Any strings which contain extended ISO-Latin
characters (e.g., ö) are probably in some locale-specific charset
(most likely ISO-8859-1 or ISO-8859-15) and are thus failing
validation.
Try taking a byte dump of the string you're trying to use, and see if
it doesn't say {0x42, 0x6a, 0xf6, 0x72, 0x6b}. This is the
representation of "Björk" in ISO-8859-{1,15}.
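A byte dump like the one suggested above could be taken along these lines. This is only a minimal sketch: the helper name byte_dump is made up for illustration, and the sample value simply assumes Amarok handed over ISO-8859-1 bytes, which the thread has not confirmed at this point.

```python
def byte_dump(raw):
    """Return the bytes of raw formatted like {0x42, 0x6a, ...}."""
    # bytearray() yields integers for each byte of a byte string.
    return "{" + ", ".join("0x%02x" % b for b in bytearray(raw)) + "}"


# Hypothetical sample: "Björk" as it would look in ISO-8859-1.
sample = b"Bj\xf6rk"
print(byte_dump(sample))
```

If the dump shows a lone 0xf6 where the ö should be, the string is not UTF-8.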
> I was told "Björk" can be represented by utf8 but that doesn't seem to
> be the case.
It most certainly can. B, j, r, and k are represented by their ASCII
values, and ö can be represented in several ways; this email contains
ö in UTF-8 as U+00F6, which UTF-8 represents as 0xc3 0xb6.
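For anyone who wants to see the two representations side by side, a short Python sketch follows. The variable names are illustrative only; the codec names are the standard ones shipped with Python.

```python
# U+00F6 is the code point for ö; each encoding stores it differently.
o_umlaut = u"\u00f6"

latin1_bytes = o_umlaut.encode("iso-8859-1")  # a single byte, 0xf6
utf8_bytes = o_umlaut.encode("utf-8")         # two bytes, 0xc3 0xb6

print(repr(latin1_bytes))
print(repr(utf8_bytes))

# In the full name, only the ö differs between the two byte strings.
name = u"Bj\u00f6rk"
print(repr(name.encode("iso-8859-1")))
print(repr(name.encode("utf-8")))
```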
Ethan
-- 
The laws that forbid the carrying of arms are laws [that have no remedy
for evils]. They disarm only those who are neither inclined nor
determined to commit crimes.
-- Cesare Beccaria, "On Crimes and Punishments", 1764
|
|
From: Ethan B. <ebl...@cs...> - 2007-01-28 22:03:45
|
James Lockie spake unto us the following wisdom:
> Ethan Blanton wrote:
> > It most certainly can. B, j, r, and k are represented by their ASCII
> > values, and ö can be represented in several ways; this email contains
> > ö in UTF-8 as U+00F6, which UTF-8 represents as 0xc3 0xb6.
>
> Ah, the string is probably something other than UTF8.
> I get it thanks.
>
> I think I did this right :-)
> '426c656564696e6720576f726473206279204d6f62696c65' = 'Bleeding Words by
> Mobile'
Yes, this string is all ASCII -- I assume it works.
> '69742773206f6820736f20717569657420627920426af6726b' = 'it's oh so quiet
                                               ^^ ö in ISO-8859-{1,15}
> by Björk'
So, if Amarok isn't telling you what encoding these strings are (and I
suspect it is, when it can tell itself, but some annotations, such as
id3v1, do not have encoding tags), your best bet is probably to simply
try whatever encoding you expect to be most common, and fall back on
replacement of the invalid characters if that doesn't work. Something
like:
    if (g_utf8_validate passes)
        use the string as is
    else if (g_convert from ISO-8859-1 works)
        use the conversion
    else
        gaim_utf8_salvage it
We use this sort of tactic in several places in Gaim, where strings
come in that are of some unknown encoding.
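In Python, the same strategy might look roughly like the sketch below. It is not Gaim's code: to_utf8 is a hypothetical helper, a strict UTF-8 decode stands in for g_utf8_validate, and the 'replace' error handler is only a rough stand-in for gaim_utf8_salvage, whose exact behaviour is not described here. The fallback encoding is a parameter because ISO-8859-1 accepts any byte sequence, so the last branch is only reachable with a stricter fallback.

```python
def to_utf8(raw, fallback="iso-8859-1"):
    """Return raw (a byte string of unknown encoding) as UTF-8 bytes."""
    try:
        # Step 1: already valid UTF-8? Then use the string as is.
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        pass
    try:
        # Step 2: try the encoding expected to be most common.
        return raw.decode(fallback).encode("utf-8")
    except UnicodeDecodeError:
        # Step 3: salvage by substituting U+FFFD for undecodable bytes.
        return raw.decode("utf-8", "replace").encode("utf-8")


print(repr(to_utf8(b"Bleeding Words by Mobile")))
print(repr(to_utf8(b"it's oh so quiet by Bj\xf6rk")))
```

Checking UTF-8 first matters: UTF-8 text would also "decode" as ISO-8859-1 without error, but into the wrong characters.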
Ethan
|
|
From: Bartosz O. <li...@bz...> - 2007-01-28 12:12:57
|
On 01/28/2007 06:26 AM, James Lockie wrote:
[...]
> I am trying to modify the AmarokGaim plugin so that it works with
> non-utf8 strings like "Björk".
>
> It calls "message = message.decode('utf8')" where message might be the
> string "Björk".
[...]
The decode() method converts a string into a Unicode string (which is just
an internal representation and not the thing which you can pass anywhere
further). The argument specifies *current* encoding of the string. Thus if
you know it's not utf-8 then you should not be giving utf-8 as the argument.
A Unicode string is what you can later convert into other encodings, including
utf-8.
In general, if you want to convert something into UTF-8, you should do:
    message = message.decode(ENCODING_OF_MESSAGE).encode('utf-8')
The only problem is finding out what the proper value of ENCODING_OF_MESSAGE
is, but Sean has already mentioned how to do it.
take care,
Bartosz
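As a concrete illustration of the decode-then-encode recipe above, here is a small sketch. It assumes, purely for the example, that the incoming bytes are ISO-8859-1; the thread has not established what Amarok actually sends. The b"" literal keeps it valid on Python 3 too, where byte strings and Unicode strings are separate types.

```python
# Assumption for this example only: the source encoding is ISO-8859-1.
ENCODING_OF_MESSAGE = "iso-8859-1"

message = b"Bj\xf6rk"                       # raw bytes as received
text = message.decode(ENCODING_OF_MESSAGE)  # bytes -> Unicode string
utf8_message = text.encode("utf-8")         # Unicode string -> UTF-8 bytes

print(repr(utf8_message))

# Naming the wrong *current* encoding is what makes decode() fail:
try:
    message.decode("utf-8")
except UnicodeDecodeError as err:
    print("not valid UTF-8:", err)
```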
|
|
From: James L. <bjl...@lo...> - 2007-01-28 17:05:19
|
James Lockie wrote:
> Thanks.
>
>
> Sean Egan wrote:
>
>> I'm sure Amarok provides a mechanism for knowing the
>> encoding of the strings it gives you
>>
>
> I have asked on the Amarok mailing list what string format Amarok is
> returning. :-)
>
I took out the decode and it works for strings without special
characters but I get this error for "Björk":
message.append(signature=introspect_sig, *args)
UnicodeError: String parameters to be sent over D-Bus must be valid UTF-8
My guess is Amarok is NOT sending utf8 but decode works for some of the
strings but utf8 for other strings. :-(
I'll figure it out. :-)
I was told "Björk" can be represented by utf8 but that doesn't seem to
be the case.
>
>
> Bartosz Oler wrote:
>
>> On 01/28/2007 06:26 AM, James Lockie wrote:
>>
>> [...]
>>
>>> I am trying to modify the AmarokGaim plugin so that it works with
>>> non-utf8 strings like "Björk".
>>>
>>> It calls "message = message.decode('utf8')" where message might be the
>>> string "Björk".
>>>
>> [...]
>>
>> The decode() method converts a string into a Unicode string (which is just
>> an internal representation and not the thing which you can pass anywhere
>> further). The argument specifies *current* encoding of the string. Thus if
>> you know it's not utf-8 then you should not be giving utf-8 as the argument.
>>
> Mmm.
> I didn't write the original code and I don't know python, gaim or Amarok
> so I am trying to figure it out.
> The original code passes utf8 to decode and it works for strings with
> non-special characters but fails on things like "Björk".
>
> I can print out "Björk" to the console so the internal representation by
> python can handle special characters.
>
|
|
From: James L. <bjl...@lo...> - 2007-01-28 19:12:27
|
Ethan Blanton wrote:
> James Lockie spake unto us the following wisdom:
>
>> James Lockie wrote:
>>
>>> Sean Egan wrote:
>>>
>>>> I'm sure Amarok provides a mechanism for knowing the
>>>> encoding of the strings it gives you
>>>>
>>> I have asked on the Amarok mailing list what string format Amarok is
>>> returning. :-)
>>>
>> I took out the decode and it works for strings without special
>> characters but I get this error for "Björk":
>> message.append(signature=introspect_sig, *args)
>> UnicodeError: String parameters to be sent over D-Bus must be valid UTF-8
>>
>> My guess is Amarok is NOT sending utf8 but decode works for some of the
>> strings but utf8 for other strings. :-(
>> I'll figure it out. :-)
>>
>
> I'm guessing Amarok is not sending UTF-8 for *any* of its strings, but
> those strings which happen to contain only characters in the ASCII
> range are validating as UTF-8. (ASCII strings are valid UTF-8, and
> will display correctly.) Any strings which contain extended ISO-Latin
> characters (e.g., ö) are probably in some locale-specific charset
> (most likely ISO-8859-1 or ISO-8859-15) and are thus failing
> validation.
>
> Try taking a byte dump of the string you're trying to use, and see if
> it doesn't say {0x42, 0x6a, 0xf6, 0x72, 0x6b}. This is the
> representation of "Björk" in ISO-8859-{1,15}.
>
>> I was told "Björk" can be represented by utf8 but that doesn't seem to
>> be the case.
>>
>
> It most certainly can. B, j, r, and k are represented by their ASCII
> values, and ö can be represented in several ways; this email contains
> ö in UTF-8 as U+00F6, which UTF-8 represents as 0xc3 0xb6.
Ah, the string is probably something other than UTF8.
I get it thanks.
I think I did this right :-)
'426c656564696e6720576f726473206279204d6f62696c65' = 'Bleeding Words by
Mobile'
'69742773206f6820736f20717569657420627920426af6726b' = 'it's oh so quiet
by Björk'
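One way to double-check dumps like these is to turn the hex back into bytes and try the candidate encodings. A small sketch using the standard binascii module; the variable names are only for illustration, and ISO-8859-1 is a guess being tested here, not a confirmed fact about Amarok.

```python
import binascii

ascii_hex = "426c656564696e6720576f726473206279204d6f62696c65"
latin1_hex = "69742773206f6820736f20717569657420627920426af6726b"

ascii_raw = binascii.unhexlify(ascii_hex)
latin1_raw = binascii.unhexlify(latin1_hex)

# Pure ASCII decodes identically as UTF-8 or ISO-8859-1.
print(ascii_raw.decode("utf-8"))

# The second string contains a lone 0xf6 byte, so strict UTF-8 rejects it.
try:
    latin1_raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
print(valid_utf8)

# Decoded as ISO-8859-1, the same bytes give the expected title.
title = latin1_raw.decode("iso-8859-1")
print(title == u"it's oh so quiet by Bj\u00f6rk")
```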
|
|
From: James L. <bjl...@lo...> - 2007-01-29 06:07:01
|
Ethan Blanton wrote:
> James Lockie spake unto us the following wisdom:
>
>> Ethan Blanton wrote:
>>
>>> It most certainly can. B, j, r, and k are represented by their ASCII
>>> values, and ö can be represented in several ways; this email contains
>>> ö in UTF-8 as U+00F6, which UTF-8 represents as 0xc3 0xb6.
>>>
>> Ah, the string is probably something other than UTF8.
>> I get it thanks.
>>
>> I think I did this right :-)
>> '426c656564696e6720576f726473206279204d6f62696c65' = 'Bleeding Words by
>> Mobile'
>>
>
> Yes, this string is all ASCII -- I assume it works.
>
>> '69742773206f6820736f20717569657420627920426af6726b' = 'it's oh so quiet
>                                               ^^ ö in ISO-8859-{1,15}
>> by Björk'
>>
>
> So, if Amarok isn't telling you what encoding these strings are (and I
> suspect it is, when it can tell itself, but some annotations, such as
> id3v1, do not have encoding tags), your best bet is probably to simply
> try whatever encoding you expect to be most common, and fall back on
> replacement of the invalid characters if that doesn't work. Something
> like:
>
>     if (g_utf8_validate passes)
>         use the string as is
>     else if (g_convert from ISO-8859-1 works)
>         use the conversion
>     else
>         gaim_utf8_salvage it
>
> We use this sort of tactic in several places in Gaim, where strings
> come in that are of some unknown encoding.
>
> Ethan
Thank you so much.
I did
    # Try and decode message
    try:
        msg = message.decode('utf8')
    except:
        self.log("DecodeError: Could not decode utf8 '%s', trying ISO8$
        try:
            msg = message.decode('iso-8859-1')
        except:
            self.log("DecodeError: Could not decode '%s'" % message)
            return
which is what you suggested and it works now.
|
|
From: Ethan B. <ebl...@cs...> - 2007-01-29 20:06:50
|
James Lockie spake unto us the following wisdom:
>     # Try and decode message
>     try:
>         msg = message.decode('utf8')
>     except:
>         self.log("DecodeError: Could not decode utf8 '%s', trying ISO8$
>
>         try:
>             msg = message.decode('iso-8859-1')
Note that ISO-8859-1, specifically, should never fail to decode, as
every one of the 256 possible byte values is a valid code point. This is
not true of all encodings (particularly multibyte encodings).
Glad to hear it works for you.
Ethan
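The point about ISO-8859-1 never failing is easy to check in Python. The sketch below uses only the standard codecs: it decodes every possible byte value as ISO-8859-1, then counts how many single bytes are rejected by UTF-8 on their own.

```python
# All 256 single-byte values, 0x00 through 0xff.
all_bytes = bytes(bytearray(range(256)))

# ISO-8859-1 maps byte N to code point U+00NN, so this cannot raise.
decoded = all_bytes.decode("iso-8859-1")
print(len(decoded))

# Count the single bytes that are not valid UTF-8 by themselves.
failures = 0
for value in range(256):
    try:
        bytes(bytearray([value])).decode("utf-8")
    except UnicodeDecodeError:
        failures += 1
print(failures)
```

A consequence is that a successful ISO-8859-1 decode proves nothing about the real encoding, which is why it belongs last in a fallback chain rather than first.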
|