unf-requirements Mailing List for Unified Internet Relay Chat

----- Original Message -----
From: "Jukka Santala" <don...@ni...>
To: <unf...@li...>
Sent: Friday, December 08, 2000 7:43 PM
Subject: [Unf-requirements] Text encoding/UTF-8

> Well, for server-coding projects, wherever it matters at all, as long as
> they're coordinated by me, it is to be assumed that UTF-8 is the default
> encoding for all messages. (And for server, most of the time it just
> doesn't matter. UTF-8 is a format specifically designed to encode Unicode over
> normal 8-bit communications layers with full backwards-compatibility to
> 7-bit ASCII)

Careless, careless... I made the same basic mistake I've dissed others
for before; at least I didn't make it in code like some ;) Of course, the IRC
protocol character set is NOT ASCII; it is properly known as SF7. Since the
designers of early computers were so inconsiderate as not to consider the
needs of the Scandinavian countries in their 7-bit character sets (the 8th
bit was used for parity checking), the Finns who designed IRC were just as
inconsiderate and used their own variation of ASCII for IRC. One might
assume that how the characters are interpreted is just another client-side
issue, but there are some server-side repercussions...

Nowadays, as nobody uses SF7 anymore, it isn't that crucial, other than for
the odd equivalence of the characters specified in the IRC RFC, which are
different-case versions of the Scandinavian characters not appearing in
standard ASCII (RFC 1459 treats {}| as the lowercase forms of []\). As this
is written in the RFC, and has served well for over a decade, there's no
reason to change it unless somebody really wants to cause something
equivalent to the confusion of Babel among IRC software. Many existing
pieces of software correctly assume SF7, and for those that don't, it isn't
a big problem unless they're Services or something. In fact, the only reason
these characters are even allowed in nicknames is specifically because they
were letters in the Scandinavian character set, and if we were to "correct
the mistake the designers made", these characters should properly be
disallowed in nicks. You know how well THAT'D work.
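
To make the equivalence concrete, here is a minimal sketch of the case
folding RFC 1459 describes. The names irc_fold and same_nick are made up
for illustration and are not taken from any ircd; the sketch covers only
the three bracket pairs named in RFC 1459 and ignores extensions found in
later specs.

```python
# Hypothetical helpers, not taken from any ircd source.
# RFC 1459: the characters {}| are the lowercase equivalents of []\.
_FOLD = str.maketrans(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ[]\\",
    "abcdefghijklmnopqrstuvwxyz{}|",
)


def irc_fold(name: str) -> str:
    """Return the name folded to lowercase under the SF7-style rules."""
    return name.translate(_FOLD)


def same_nick(a: str, b: str) -> bool:
    """Two nicknames collide if they are equal after folding."""
    return irc_fold(a) == irc_fold(b)
```

Under these rules same_nick("Nick[away]", "nick{away}") is true, which is
exactly what software assuming plain ASCII would get wrong.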

For UTF-8, this raises some interesting issues, but nothing critical,
since UTF-8 letters couldn't be used in nicknames anyway, and otherwise it
makes a difference only in channel names (and maybe other things, if ircd is
extended much...). However, the point of _this_ long explanation is that the
ircd character set can't be said to be strict UTF-8; because of this little
quirk, it would properly be something that doesn't even have a name.
UTSF-8 maybe? Because some ircds (the Bahamut "strain", most notably) use
different character sets, it might make sense to specify the character set
in the capability header along with SAFELIST etc. for those clients which
try to care what each server is doing. That would also allow UNF to be
DALnet-compatible, as is the goal. ("Unified ircd" is supposed to eventually
be good enough to use on every network, implementing the pertinent
features, configurable if needed.)
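
As a rough illustration of what a client could do with such a header, the
sketch below splits a capability line into tokens. The token name CHARSET,
its value sf7, and the function name parse_capabilities are placeholders
assumed for this example, not something this proposal or any server
defines here; only SAFELIST comes from the text above.

```python
def parse_capabilities(params: str) -> dict:
    """Split the parameter part of a capability line into {token: value}.

    Tokens without '=' map to an empty string. Parsing stops at the
    trailing human-readable text, which starts with ':'.
    """
    caps = {}
    for token in params.split():
        if token.startswith(":"):
            break
        key, _, value = token.partition("=")
        caps[key] = value
    return caps


# CHARSET=sf7 is an assumed token, used only for illustration.
caps = parse_capabilities("SAFELIST CHARSET=sf7 :are supported by this server")
charset = caps.get("CHARSET", "unknown")
```

A client that finds no such token would have to fall back to guessing,
which is the situation today.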

ObsExploit: ircd uses the SF7 character table for everything, including bans
and k-lines. Because of this, use of these characters in any mask will match
both cases of the "letter". Conceivably, ircd should not do that, but it
makes no difference because these letters should not be used in hostmasks
and usernames aren't unique.
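
A small sketch of the effect, assuming masks use only the * and ?
wildcards and that both the mask and the target are folded with the
RFC 1459 table before comparison. The function name mask_matches is
hypothetical and this is not how any particular ircd implements it.

```python
import re

# RFC 1459 folding: {}| are the lowercase equivalents of []\.
_FOLD = str.maketrans(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ[]\\",
    "abcdefghijklmnopqrstuvwxyz{}|",
)


def mask_matches(mask: str, target: str) -> bool:
    """Match target against a ban-style mask after folding both sides."""
    folded_mask = mask.translate(_FOLD)
    # Translate the wildcards to a regex; every other character is literal.
    pattern = "".join(
        ".*" if ch == "*" else "." if ch == "?" else re.escape(ch)
        for ch in folded_mask
    )
    return re.fullmatch(pattern, target.translate(_FOLD)) is not None
```

With this folding, a mask written as *!*[x]@* also matches a user whose
username contains {x}, which is the behaviour described above.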

 -Donwulff

unf-requirements — Requirements for ircd in a changing world.