unf-requirements Mailing List for Unified Internet Relay Chat
Brought to you by:
donwulff
2000 | Dec (3)
From: Jukka S. <don...@ni...> - 2000-12-12 23:04:39
----- Original Message -----
From: "Jukka Santala" <don...@ni...>
To: <unf...@li...>
Sent: Friday, December 08, 2000 7:43 PM
Subject: [Unf-requirements] Text encoding/UTF-8

> Well, for server-coding projects, wherever it matters at all, as long as
> they're coordinated by me, it is to be assumed that UTF-8 is the default
> encoding for all messages. (And for server, most of the time it just doesn't
> matter. UTF-8 is a format specifically designed to encode Unicode over
> normal 8-bit communications layers with full backwards-compatibility to
> 7-bit ASCII)

Careless, careless... I made the same basic mistake myself that I've dissed others for before; at least I didn't make it in code like some ;)

Of course, the IRC protocol character set is NOT ASCII; it is properly known as SF7. Since the designers of early computers were so inconsiderate as not to consider the needs of the Scandinavian countries in their 7-bit character sets (the 8th bit was used for parity checking), the Finns who designed IRC were inconsiderate enough to use their own variation of ASCII for IRC. One might assume that how the characters are interpreted is just another client-side issue, but there are some server-side repercussions...

Nowadays, as nobody uses SF7 anymore, this matters little except for the odd case equivalence specified in the IRC RFC: certain characters are treated as different-case versions of each other because, in SF7, they are the Scandinavian letters that do not appear in standard ASCII. As this is written in the RFC and has served well for over a decade, there's no reason to change it, unless somebody really wants to cause something equivalent to the confusion of Babel in IRC software. Many existing pieces of software correctly assume SF7, and for those that don't, it isn't a big problem unless they're Services or something.
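For readers who want the equivalence spelled out, here is a minimal sketch of the case folding described above. It assumes the three pairs named in RFC 1459 (`[`/`{`, `]`/`}`, `\`/`|`); the function names `irc_lower` and `same_nick` are made up for illustration and are not taken from any ircd source.

```python
import string

# RFC 1459 treats {, } and | as the lower-case forms of [, ] and \.
# In the Finnish 7-bit set those code points held the letters
# A-umlaut, A-ring and O-umlaut in upper and lower case.
_RFC1459_LOWER = str.maketrans(
    string.ascii_uppercase + "[]\\",
    string.ascii_lowercase + "{}|",
)


def irc_lower(text: str) -> str:
    """Fold a nickname or channel name the way RFC 1459 describes."""
    return text.translate(_RFC1459_LOWER)


def same_nick(a: str, b: str) -> bool:
    """Return True if two nicknames are equal under RFC 1459 folding."""
    return irc_lower(a) == irc_lower(b)


if __name__ == "__main__":
    print(irc_lower("Nick[away]"))
    print(same_nick("foo\\bar", "FOO|BAR"))
```

Software that folds with plain ASCII rules would treat `foo\bar` and `FOO|BAR` as two different nicknames, which is presumably the kind of mismatch that matters for Services.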
In fact, the only reason these characters are even allowed in nicknames is that they were letters in the Scandinavian character set, and if we were to "correct the mistake the designers made", they should properly be disallowed in nicks. You know how well THAT'D work.

For UTF-8, this raises some interesting issues, but nothing critical, since UTF-8 letters couldn't be used in nicknames, and otherwise it makes a difference only in channel names (maybe in other things, but only if ircd is extended a lot...). However, the point of _this_ long explanation is that the ircd character set can't be said to be strict UTF-8; because of this little quirk, it would properly be something that doesn't even have a name. UTSF-8, maybe?

Because some ircds (the Bahamut "strain", most notably) use different character sets, it might make sense to specify the character set in the capability header along with SAFELIST etc., for those clients which try to care what each server is doing. That would also allow UNF to be DALnet-compatible, as is the goal. ("Unified ircd" is supposed to eventually be good enough to use on every network, implementing the pertinent features, configurable if needed.)

ObsExploit: ircd uses the SF7 character table for everything, including bans and k-lines. Because of this, using these characters in any mask will match both cases of the "letter". Conceivably, ircd should not do that, but it makes no difference, because these letters should not be used in hostmasks and usernames aren't unique.

-Donwulff
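The ObsExploit remark can be illustrated with a small sketch of mask matching under that folding. This is not ircd's actual matcher; it assumes masks use only `*` (any run of characters) and `?` (any single character), and the names `irc_lower` and `mask_match` are hypothetical.

```python
import string

# Assumed folding table: ASCII letters plus the RFC 1459 pairs.
_FOLD = str.maketrans(
    string.ascii_uppercase + "[]\\",
    string.ascii_lowercase + "{}|",
)


def irc_lower(text: str) -> str:
    return text.translate(_FOLD)


def mask_match(mask: str, target: str) -> bool:
    """Wildcard match after folding both sides with the SF7-style table."""
    m, t = irc_lower(mask), irc_lower(target)
    mi = ti = 0
    star = -1  # index of the most recent '*' in the mask
    mark = 0   # target index at which that '*' started matching
    while ti < len(t):
        if mi < len(m) and m[mi] == "*":
            star, mark = mi, ti
            mi += 1
        elif mi < len(m) and (m[mi] == "?" or m[mi] == t[ti]):
            mi += 1
            ti += 1
        elif star != -1:
            # Let the last '*' absorb one more character and retry.
            mi = star + 1
            mark += 1
            ti = mark
        else:
            return False
    # Only trailing '*' may remain unmatched in the mask.
    while mi < len(m) and m[mi] == "*":
        mi += 1
    return mi == len(m)


if __name__ == "__main__":
    print(mask_match("*!*@host[1].example", "nick!user@HOST{1}.EXAMPLE"))
```

A mask written with `[` therefore also catches the `{` spelling, which is the "both cases" behaviour the note describes.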
From: Jukka S. <don...@ni...> - 2000-12-08 17:50:01
I consider the CTCP/2 protocol, worked on for the better part of half a decade, kinda broken in ways, although the final judge of its merits will be how widely it gets accepted and implemented... Considering that the majority of the non-popular/standard clients out there don't even implement CTCP(1), and those which do had great trouble finding common terms to do it, I'm not exactly holding my breath. (Although there never was an official CTCP specification; it was just a commonly agreed set.)

Anyway, I've ventured far off the main topic. I just wanted to note that I do not necessarily agree with this clipping from the CTCP/2 standard:

> 2.10 Encoding Specification: <^F> "E" [encoding] <^F>
>
> Because IRC was developed in Finland, it has historically used Latin-1 for
> encoding text. Latin-1 (ISO 8859-1) is ASCII, with the upper 128 characters
> making up national symbols from several European countries. Latin-1 is
> sufficient for most English speakers and those using a handful of Western
> European languages, but cannot be used to send characters from other
> languages. For this purpose, the "E" encoding attribute may be used to
> change the default "encoding" of text. The following encodings are valid:
>
> A number between 1 and 10 may be used to specify the ISO character sets
> ISO 8859-1 through ISO 8859-10. These character sets all use ASCII for the
> lower 128 characters. They span most European and Latin-based languages, as
> well as the Cyrillic, Greek, Arabic, and Hebrew character sets.
>
> An encoding of "U" indicates UTF-8 encoding. This is a method of encoding
> Unicode (16-bit or even 32-bit) characters in an 8-bit character set. The
> lower 128 characters remain ASCII.
>
> All other encodings are reserved for future use.
>
> Specifying no encoding must return the IRC client to the default encoding.
> If no user selected default is chosen, it should return to Latin-1 encoding.
>
> Please note that all encodings preserve the lower 128 characters for ASCII.
> Therefore all CTCP control characters remain identical.

What's this got to do with anything? Because the IRC character set has never been clearly defined beyond the lower 128 characters, many interpretations exist even now, and most clients can't talk with each other. mIRC's default of Latin-1 might seem reasonable, but as I've pointed out long ago, according to the IETF (Internet Engineering Task Force, the standards body for Internet protocols), NEW PROTOCOLS USING LATIN-1 MAY NOT BECOME STANDARDS. Sorry for the caps there, but I think it is a pretty critical issue.

Now, ircd is just a conduit for messages between clients, and thus in itself doesn't limit the character representation. However, because of the standards limitation, and because it simply makes sense, I've long crusaded for the official default IRC encoding to be UTF-8.

What's this got to do with anything? Well, for server-coding projects, wherever it matters at all, as long as they're coordinated by me, it is to be assumed that UTF-8 is the default encoding for all messages. (And for the server, most of the time it just doesn't matter. UTF-8 is a format specifically designed to encode Unicode over normal 8-bit communications layers with full backwards compatibility with 7-bit ASCII.)

-Donwulff
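As a rough illustration of the quoted encoding rules, the sketch below maps the argument of the "E" attribute to a Python codec name and checks that ASCII, including the ^F control character, comes out byte-identical under every listed encoding. The function name `codec_for` and the `user_default` parameter are assumptions for this example; only the mapping itself (1 to 10, "U", empty) comes from the clipping.

```python
from typing import Optional

DEFAULT_ENCODING = "iso-8859-1"  # Latin-1, the fallback named in the clipping
_ISO_NUMBERS = {str(n) for n in range(1, 11)}


def codec_for(attr: str, user_default: Optional[str] = None) -> str:
    """Map the argument of an "E" attribute to a Python codec name."""
    if attr == "":
        # No encoding given: return to the user's default, else Latin-1.
        return user_default or DEFAULT_ENCODING
    if attr == "U":
        return "utf-8"
    if attr in _ISO_NUMBERS:
        return "iso-8859-" + attr
    # Everything else is reserved for future use.
    raise ValueError("reserved encoding attribute: %r" % attr)


if __name__ == "__main__":
    ctrl_f = "\x06"  # ^F, the CTCP/2 format character
    for attr in ["", "U"] + sorted(_ISO_NUMBERS, key=int):
        codec = codec_for(attr)
        # The lower 128 characters encode to the same bytes everywhere.
        assert ("PRIVMSG" + ctrl_f).encode(codec) == b"PRIVMSG\x06"
    # Outside ASCII the encodings disagree, hence the need for a default.
    print("ä".encode(codec_for("")), "ä".encode(codec_for("U")))
```

The last line shows why the default matters: the same letter is one byte in Latin-1 and two bytes in UTF-8, and every byte of a multi-byte UTF-8 sequence is 0x80 or above, so it can never be mistaken for an ASCII control character.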