From: Alexey M. <gai...@ly...> - 2004-08-26 21:10:27
Hello. I'm using gaim 0.82 under Windows. I keep getting "(There was an error receiving this message)" when my ICQ friends send me Cyrillic. The debug window shows that those messages are flagged AIM_CHARSET_ASCII rather than CUSTOM, so gaim tries to decode them as UTF-8. Previous versions of gaim would show the text as cp1251 re-encoded in latin1, so I could at least decode it in the shell; now the message is lost entirely.

In gaim_plugin_oscar_parse_im_part I removed the AIM_CHARSET_ASCII else-if branch and ||ed its condition into the AIM_CHARSET_CUSTOM case, and now everything works. Is there any particular reason why we are trying to decode ASCII as UTF-8 right now? Any chance something like that would make it into the official release?

Thanks,

--Lyosha
From: Mark D. <ma...@ki...> - 2004-08-26 22:11:02
On Thu, 26 Aug 2004 14:10:23 -0700, Alexey Marinichev wrote
> I'm using gaim 0.82 under windows. Keep getting "(There was an error
> receiving this message)" when my ICQ friends send me cyrillic. Debug
> window shows that those messages are in AIM_CHARSET_ASCII rather than
> CUSTOM, so it is trying to treat it as UTF-8. Previous versions of gaim
> would show it as cp1251 reencoded in latin1, so I could at least decode
> it in the shell, now the message is lost.
>
> In gaim_plugin_oscar_parse_im_part removed AIM_CHARSET_ASCII else if
> part and ||ed it to AIM_CHARSET_CUSTOM and now everything works. Is
> there any particular reason why we are trying to decode ASCII as
> UTF-8 right now?
>
> Any chance something like that would make into the official release?..
>
> Thanks,
>
> --Lyosha

What ICQ client(s) are your friends using?

-- 
O O  Mark Doliner
\ |  ma...@ki...
\ |  www.kingant.net
"There needs to be a better word for weird."
From: Alexey M. <gai...@ly...> - 2004-08-26 23:30:47
Mark Doliner wrote:
> On Thu, 26 Aug 2004 14:10:23 -0700, Alexey Marinichev wrote
>> I'm using gaim 0.82 under windows. Keep getting "(There was an error
>> receiving this message)" when my ICQ friends send me cyrillic. Debug
>> window shows that those messages are in AIM_CHARSET_ASCII rather than
>> CUSTOM, so it is trying to treat it as UTF-8.
>> [...]
>
> What ICQ client(s) are your friends using?

Centericq and an older version of gaim with some kind of plugin to do character set conversion. Somebody might be using miranda.

Now that I'm looking into it, with some more hacking I finally managed to get offline messages to work.

For receiving offline messages, ISO-8859-1 is hardcoded in incomingim_chan4. It would be better if it could use the CUSTOM charset -- for some reason the server does not send stored messages in Unicode.

For sending messages to offline accounts, it seems the server doesn't like Unicode either. In gaim_plugin_oscar_convert_to_best_encoding, if I comment out the AIM_CAPS_ICQUTF8 branch, sending offline messages works, but sending normal messages no longer does. Adding a check that the recipient is not offline fixes both situations.

This was all done assuming that the custom charset is cp1251. I have no clue whether it applies to other charsets as well.

Thanks,

--Lyosha
From: Mark D. <ma...@ki...> - 2004-08-27 05:41:34
On Thu, 26 Aug 2004 16:30:34 -0700, Alexey Marinichev wrote
> Centericq and older version of gaim with some kind of plugin to do
> character set conversion. Somebody might be using miranda.
> [...]
> This was all done assuming that the custom charset is cp1251. I
> have no clue if it applies to other charsets as well.

I'm guessing centericq and miranda don't handle ICQ character encodings correctly. I'm really only interested in making non-ASCII characters work between Gaim and official ICQ clients. If it's flagged as ASCII it should never be anything other than ASCII--I'm pretty sure of that. If you're interested you can look into whether centericq and miranda handle ICQ character encodings correctly. If they don't, you might want to suggest they look at the libfaim source in Gaim or talk to me if they need help. It's all pretty fresh in my mind.

You should tell the guy using Gaim to upgrade to Gaim 0.82.1. It handles ICQ character encodings in what we believe to be the best way.

As for offline messages, you're right, Gaim incorrectly assumes those are ISO-8859-1. It should probably at least try to use the custom charset. I'll try to get that working for Gaim 0.83.

-Mark

-- 
O O  Mark Doliner
\ |  ma...@ki...
\ |  www.kingant.net
"There needs to be a better word for weird."
From: Alexey M. <gai...@ly...> - 2004-08-27 18:20:34
Mark Doliner wrote:
[...]
> You should tell the guy using Gaim to upgrade to Gaim 0.82.1. It
> handles ICQ character encodings in what we believe to be the best way.
>
> As for offline messages, you're right, Gaim incorrectly assumes those
> are ISO-8859-1. It should probably at least try to use the custom
> charset. I'll try to get that working for Gaim 0.83.

Centericq uses libicq2000. Ten minutes looking at the code didn't get me far enough to see what is going on.

As for text flagged as ASCII never being anything other than ASCII, why does oscar.c say this?

    else if (charset == AIM_CHARSET_ASCII)
        charsetstr = "UTF-8";

My argument is that if you are going to accept something other than ASCII anyway, you might as well do it in a way that supports other clients.

I do hope the offline message fixes make it into 0.83. It's a pain to recode received messages.

--Lyosha
From: Ka-Hing C. <ka...@ja...> - 2004-08-28 00:53:55
On Fri, 2004-08-27 at 11:20, Alexey Marinichev wrote:
> I do hope offline message fixes will make it into 0.83. It's a pain to
> recode received messages.

Have you tried this?

https://sourceforge.net/tracker/?func=detail&aid=988352&group_id=235&atid=390395

-khc
From: Mark D. <ma...@ki...> - 2004-08-28 14:15:46
On Fri, 27 Aug 2004 11:20:29 -0700, Alexey Marinichev wrote
> As for text being flagged as ASCII never being anything other than
> ASCII, why does it say this in oscar.c?
>
>     else if (charset == AIM_CHARSET_ASCII)
>         charsetstr = "UTF-8";
>
> My argument is that if you are trying to use something other than
> ASCII, you might as well do it in such a way that other clients are
> supported.

ASCII is a subset of UTF-8. So if something is valid ASCII, it WILL be valid UTF-8. Setting charsetstr to UTF-8 mostly just simplifies the code some.

And yes, it also has the effect of accepting characters that are UTF-8 but not ASCII, but are sent using the ASCII flag. I suppose that's equally as wrong as assuming the text sent over ASCII is ISO-8859-1, but it seems a lot cleaner to me this way. If we're going to assume something, I'd rather assume UTF-8 than ISO-8859-1.

-Mark

-- 
O O  Mark Doliner
\ |  ma...@ki...
\ |  www.kingant.net
"There needs to be a better word for weird."
From: Alexey M. <gai...@ly...> - 2004-08-28 17:10:45
Mark Doliner wrote:
> ASCII is a subset of UTF-8. So if something is valid ASCII, it WILL be
> valid UTF-8. Setting charsetstr to UTF-8 mostly just simplifies the
> code some.
> [...]
> If we're going to assume something, I'd rather assume UTF-8 than
> ISO-8859-1.

If something is valid ASCII, it is valid UTF-8, but it is equally valid ISO-8859-1 or ISO-8859-2 or KOI8-R or CP1251 or pretty much any other single-byte charset. The choice seems pretty arbitrary to me.

The choice is yours, but to me interoperability with other clients would be more important. Do you know of any client that sends UTF-8 and marks it as ASCII? I do know clients that send the custom charset and mark it as ASCII.

In any case, UTF-8 might seem cleaner to you, but to me the custom charset would be more useful. I think both are equally wrong. The most interoperable way, of course, would be to try to decode the text as UTF-8 and, if that fails, fall back to the custom character set. And the cleanest way is of course to just leave it as ASCII.

Thanks,

--Lyosha
From: Luke S. <lsc...@us...> - 2004-08-28 17:19:57
On Sat, Aug 28, 2004 at 10:10:32AM -0700, Alexey Marinichev wrote:
> The choice is yours, but to me interoperability with other clients would
> be more important. Do you know of any client that sends UTF-8 and marks
> it as ASCII? I do know clients that send custom charset and mark it as
> ASCII.

that's the wrong question. the only other clients that matter are the official ones. i could care less what other 3rd parties are doing.

> In any case, UTF-8 might seem cleaner to you, but to me custom charset
> would be more useful. I think both are equally wrong. The most
> interoperable way of course would be to try to decode it as UTF-8 and if
> that fails fall back to custom character set. And the cleanest way is
> of course to just leave it as ASCII.

except of course that the rest of gaim is using unicode, so we are converting everything to unicode anyway.

luke
From: Alexey M. <gai...@ly...> - 2004-08-28 17:58:57
Luke Schierer wrote:
> that's the wrong question. the only other clients that matter are the
> official ones. i could care less what other 3rd parties are doing.

That's great, I appreciate that.

> except of course that the rest of gaim is using unicode, so we are
> converting everything to unicode anyway.

That's great too; then we should change gaim_plugin_oscar_parse_im_part to this:

    else if (charset == AIM_CHARSET_ASCII)
        charsetstr = "ASCII";

Having "UTF-8" there is arbitrary; this:

    charsetstr = gaim_account_get_string(account, "encoding",
        OSCAR_DEFAULT_CUSTOM_ENCODING);

is just as arbitrary but better for interoperability.

--Lyosha
From: Mark D. <ma...@ki...> - 2004-08-28 18:36:10
On Sat, 28 Aug 2004 10:58:51 -0700, Alexey Marinichev wrote
> That's great too, we should change gaim_plugin_oscar_parse_im_part
> to this:
>
>     else if (charset == AIM_CHARSET_ASCII)
>         charsetstr = "ASCII";
>
> Having "UTF-8" there is arbitrary; this:
>
>     charsetstr = gaim_account_get_string(account, "encoding",
>         OSCAR_DEFAULT_CUSTOM_ENCODING);
>
> is just as arbitrary but better for interoperability.

Changing it from UTF-8 to ASCII is not going to happen. It doesn't help anything, and it makes things worse. Changing it to ISO-8859-1 is a possibility.

Changing it to gaim_account_get_string(...) is horrible, because then real ASCII messages might not be decoded correctly if the character set specified in the account editor is not a superset of ASCII.

Attempting to convert from UTF-8, and using gaim_account_get_string(...) as a fallback, is a possibility.

-Mark

-- 
O O  Mark Doliner
\ |  ma...@ki...
\ |  www.kingant.net
"There needs to be a better word for weird."
From: Alexey M. <gai...@ly...> - 2004-08-28 18:48:47
Mark Doliner wrote:
> Changing it from UTF-8 to ASCII is not going to happen. It doesn't help
> anything, and it makes things worse. Changing it to ISO-8859-1 is a
> possibility.

Agreed 100%. The only thing changing it to ASCII would achieve is agreeing with the standard while not tolerating clients that do not follow the standard. Assuming, of course, that AIM_CHARSET_ASCII really means ASCII.

> Changing it to gaim_account_get_string(...) is horrible, because then
> real ASCII messages might not be decoded correctly if the character set
> specified in the account editor is not a superset of ASCII.

We already do that for AIM_CHARSET_CUSTOM, though custom does not imply ASCII. I do not know much about charsets I do not use -- are there many commonly used charsets that do not agree with ASCII for characters < 128?

> Attempting to convert from UTF-8, and using gaim_account_get_string(...)
> as a fallback is a possibility.

That is the possibility I would be most happy with. Thank you.

--Lyosha
From: Tim R. <om...@ho...> - 2004-08-29 01:45:09
My thoughts on this issue... First, some background information.

ASCII is a 7-bit character set, with one character per byte. ASCII characters are often strung together to form strings, and usually the last character is set to the NUL character, which is 0. These are usually called C strings, and I've also heard the term ASCIIZ string. There are other character sets that use only 7 bits.

UTF-8 is a way of representing Unicode using sequences of 8-bit bytes (as opposed to UCS-2, which uses 16-bit units), and it was carefully designed for backward compatibility. A UTF-8 character without the 8th bit set is the same character in ASCII, so a UTF-8 string that doesn't use any 8-bit characters is also an ASCII string. Also, non-ASCII characters always use only bytes with the 8th bit set, so things like NULs or "/"s never end up in them. Because it uses sequences of bytes with the 8th bit set, it can represent all of Unicode, not just an extra 128 characters.

In GTK, all strings passed to its functions have to be in UTF-8. You can validate UTF-8 by using g_utf8_validate(). However, this only ensures that the string is valid UTF-8. The string might (in this case will) also be valid ISO-8859-1, or something else. So there's no way to tell for sure what character set a string is in, given only the string. Just because a string is valid ASCII doesn't mean it isn't really supposed to be some other 7-bit character set.

In order to pass a string from OSCAR to the rest of Gaim, we need to put it in UTF-8. We have to either convert it from something to UTF-8, or else validate it. If we ever pass on a string that would fail g_utf8_validate(), we risk crashing, and all sorts of nasty things. As I recall, we convert from UTF-8 to ASCII when the string is supposed to be ASCII. Or do I have that backwards? Either way, this sounds kind of expensive just to validate the string as ASCII. Why not write a gaim_ascii_validate()? Something that iterates through the string and returns FALSE if it runs into a 0 or a character with the 8th bit set.

I personally hate the error messages that complain about character set problems. Specifically I hate the one for IRC, but this ICQ one seems the same, except I've never run into it. What I hate about them is that I lose the message, and it's usually because someone's message has one ISO-8859-1 8-bit character and the rest of it was fine.

That being said, what I think makes the most sense for ICQ is to run gaim_ascii_validate() on messages labeled as ASCII, and if it returns FALSE, treat the message as CUSTOM, since obviously it isn't ASCII like it said it was.

Since I don't think we have a gaim_ascii_validate or anything like it, I'll write one here. Besides, I want to try to do something constructive with this email.

    gboolean
    gaim_ascii_validate(const gchar *str, gint len)
    {
        int i;

        if (len == -1)
            len = strlen(str);

        for (i = 0; i < len; i++, str++)
            if (!*str || (*str & 0x80))
                return FALSE;

        return TRUE;
    }

--Tim
From: Sean E. <sea...@gm...> - 2004-08-28 17:25:27
On Sat, 28 Aug 2004 10:10:32 -0700, Alexey Marinichev <gai...@ly...> wrote:
> If something is valid ASCII, it is valid UTF-8, but it is as much valid
> ISO-8859-1 or ISO-8859-2 or KOI8-R or CP1251 or pretty much anything
> single byte. The choice seems pretty arbitrary to me.

Gaim requires that all internal strings are UTF-8; it's not arbitrary. Converting to Latin-1 would only then have to convert to UTF-8 afterwards. Converting to UTF-8 off the bat requires half as much code.

-s.
From: Alexey M. <gai...@ly...> - 2004-08-28 18:06:00
Sean Egan wrote:
> Gaim requires that all internal strings are UTF-8; it's not arbitrary.
> Converting to Latin-1 would only then have to convert to UTF-8
> afterwards. Converting to UTF-8 off the bat requires half as much
> code.

The question is whether we convert *from* ASCII, UTF-8, or whatever is specified in the "encoding" setting. Right now, if a message comes in marked as ASCII, we convert it to gaim's internal format as if it were UTF-8; see gaim_plugin_oscar_parse_im_part. I would prefer the custom setting, that is, treating an ASCII-flagged message as if it were AIM_CHARSET_CUSTOM.

--Lyosha