unicode support in char and string procedures

Brought to you by: mradestock, scgmille

#6 unicode support in char and string procedures

Status: open

Owner: nobody

Labels: None

Priority: 4

Updated: 2005-01-16

Created: 2001-12-16

Creator: Scott G. Miller

Private: No

Rather than numeric comparison, Character's methods should be
used for character comparison, and Collator's for string
comparison.
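
A minimal Java sketch of the difference between the two approaches. The class and method names here are hypothetical and are not taken from the SISC sources; only `String.compareTo` and `java.text.Collator` are real library APIs. It assumes the numeric comparison in question behaves like a code-unit-by-code-unit comparison.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical illustration, not SISC code.
final class CollationSketch {
    // Numeric comparison: compares UTF-16 code units one by one.
    static int compareNumeric(String a, String b) {
        return a.compareTo(b);
    }

    // Locale-sensitive comparison using the collation rules of the given locale.
    static int compareCollated(String a, String b, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        return collator.compare(a, b);
    }

    public static void main(String[] args) {
        // 'Z' (90) is numerically smaller than 'a' (97), so "Zebra" sorts first.
        System.out.println(compareNumeric("Zebra", "apple") < 0);
        // Under English collation rules "apple" sorts before "Zebra".
        System.out.println(compareCollated("Zebra", "apple", Locale.ENGLISH) > 0);
    }
}
```

The two methods disagree on mixed-case input, which is the kind of mismatch this report describes.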

Discussion

Matthias Radestock - 2001-12-16

Logged In: YES
user_id=110070

The same goes for upcase/downcase.

Also, I'm not sure whether char's comparison methods respect the locale. I suspect we might have to do a
conversion to string and use a Collator for that too :(


Matthias Radestock - 2001-12-16

Logged In: YES
user_id=110070

Also, let's not forget the various character class tests, i.e.:
char-alphabetic?
char-numeric?
char-whitespace?
char-upper-case?
char-lower-case?

Using collators for string comparison and conversion will be a significant performance hit in SISC, since strings
are represented as arrays and need to be converted to/from Strings to carry out these operations.

Perhaps SISC's string representation should be changed to String? This should speed up all string operations
except string-set!, which would need to convert the String to a different representation and back.
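
To illustrate the trade-off, here is a small sketch of a `string-set!`-style update against both representations. It assumes the array representation is a `char[]`; the class and method names are hypothetical and not taken from SISC.

```java
// Hypothetical illustration of the string-set! cost under two representations.
final class StringSetSketch {
    // Array-backed string: the update is a single in-place write.
    static void setInArray(char[] chars, int index, char c) {
        chars[index] = c;
    }

    // java.lang.String is immutable, so the update copies out and back.
    static String setInString(String s, int index, char c) {
        char[] chars = s.toCharArray(); // copy 1: String -> char[]
        chars[index] = c;
        return new String(chars);       // copy 2: char[] -> String
    }

    public static void main(String[] args) {
        char[] mutable = {'c', 'a', 't'};
        setInArray(mutable, 0, 'b');
        System.out.println(new String(mutable));
        System.out.println(setInString("cat", 0, 'b'));
    }
}
```

With a String representation, the caller also has to store the returned String back into the Scheme string object, since the original cannot be modified.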


Matthias Radestock - 2002-06-24

Logged In: YES
user_id=110070

The consensus seems to be that making the existing
string/char functions unicode-compliant is just not going to
work.

A separate set of functions or even a different data type
are better solutions.


Matthias Radestock - 2002-06-24

priority: 5 --> 4

summary: Comparison not locale sensitive --> unicode support

Matthias Radestock - 2005-01-16

summary: unicode support --> unicode support in char and string procedures

Denys Rtveliashvili - 2006-04-27

Logged In: YES
user_id=1416496

Hm. It seems to me that there will be a problem if a
separate datatype or a separate set of functions is used for
unicode-aware operations: it could seriously confuse those
who use the ordinary string and character functions and
expect them to work with unicode.
Also, converting from ordinary strings to unicode-aware
strings and back would be a performance hit too.

I propose that we discuss making ordinary strings
unicode-aware a little further, to make sure there is really
no way to do it.


Matthias Radestock - 2006-04-27

Logged In: YES
user_id=110070

We are not going to do anything until the R6RS committee have made up their minds on what to do about Unicode.


Nobody/Anonymous - 2006-04-27

Logged In: NO

That is reasonable.

I wonder how long it will take the committee to
finalize R6RS. Currently I see only a proposal on
Unicode support, which is almost 1 year old:

----------------------------------------------

Unicode support
---------------

We have written up a proposal for Unicode support that defines the
notion of "char" to be a Unicode scalar value---strings are simply
vectors of these scalar values. This allows Unicode support to be
largely a conservative extension of the character and string processing
in R5RS, and avoids the API problems inherent in using a UTF-16-based
representation. Moreover, this approach has already been successfully
implemented by several Scheme implementations.

Along with Unicode support, we are also considering extensions to the
character and string literal syntax. Details are still under
discussion.

----------------------------------------------
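
For readers less familiar with the distinction drawn in the quoted proposal, the following Java sketch contrasts a UTF-16 view of a string with a vector of scalar values. The class and method names are hypothetical. `String.codePoints()` is a real Java 8+ API, and for well-formed input its code points are exactly the Unicode scalar values.

```java
// Hypothetical illustration of UTF-16 code units versus Unicode scalar values.
final class ScalarValueSketch {
    // Number of UTF-16 code units, which is what String.length() reports.
    static int utf16Length(String s) {
        return s.length();
    }

    // The string as a vector of code points. For well-formed UTF-16 these
    // are scalar values; an unpaired surrogate would pass through unchanged.
    static int[] toScalarValues(String s) {
        return s.codePoints().toArray();
    }

    public static void main(String[] args) {
        // "a" followed by U+1D11E MUSICAL SYMBOL G CLEF (a surrogate pair in UTF-16).
        String s = "a\uD834\uDD1E";
        System.out.println(utf16Length(s));           // counts code units
        System.out.println(toScalarValues(s).length); // counts characters
    }
}
```

Indexing by code unit can land in the middle of a surrogate pair, which is one example of the API problems the proposal refers to.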
