UTF-8

Forum: Open Discussion

Creator: Stage Nex

Created: 2009-09-05

Updated: 2013-04-29

Stage Nex - 2009-09-05

From what I've seen, I don't understand why you're saying that Unicode isn't supported.

I can store UTF-8 in a bstring correctly (even though I can't expect a[i] to give me the i-th character, only the i-th byte of the string).
Converting between UCS-2/UCS-4 and UTF-8 is straightforward (about 10 lines of code), so it isn't hard to do when one needs to manipulate a native Unicode string.
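
As a rough illustration of that conversion, here is a minimal sketch in plain C. The names `utf8_decode` and `utf8_encode` are hypothetical and are not part of Bstrlib; the sketch works on raw byte buffers and adds validation (overlong forms, surrogates, truncation), which is why it runs longer than ten lines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of Bstrlib: decode one code point from
 * the UTF-8 bytes s[0..len). Returns the number of bytes consumed, or 0
 * if the sequence is truncated, overlong, a surrogate, or out of range. */
size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *cp)
{
    /* Smallest code point that legitimately needs n bytes. */
    static const uint32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t n, i;
    uint32_t c;

    assert(s != NULL && cp != NULL);
    if (len == 0) return 0;

    /* The high bits of the lead byte give the sequence length. */
    if (s[0] < 0x80)                { c = s[0];        n = 1; }
    else if ((s[0] & 0xE0) == 0xC0) { c = s[0] & 0x1F; n = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { c = s[0] & 0x0F; n = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { c = s[0] & 0x07; n = 4; }
    else return 0; /* stray continuation byte or invalid lead byte */

    if (n > len) return 0;
    for (i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0; /* not 10xxxxxx */
        c = (c << 6) | (uint32_t)(s[i] & 0x3F);
    }
    if (c < min_cp[n] || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;
    *cp = c;
    return n;
}

/* Hypothetical helper: encode cp as UTF-8 into out (room for 4 bytes).
 * Returns the number of bytes written, or 0 for an invalid code point. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0; /* surrogates are not encodable */
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Converting a whole string to UCS-4 would then be a loop that calls `utf8_decode` repeatedly and advances by the returned byte count.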

Actually, getting [i] to return the i-th character isn't that hard either, as it only requires decoding the high bits of each byte, but it's an O(N) operation instead of O(1).
Overall, using UTF-8 doesn't really increase the required string length (in a UCS-4 string most of the bytes are zero anyway, which is not the case in UTF-8), and character access in a string is quite rare anyway (so the O(N) penalty isn't that bad).
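
The O(N) lookup described above could be sketched like this, assuming the buffer already holds valid UTF-8. `utf8_offset` is a hypothetical name, not a Bstrlib function; with a bstring `b` one would presumably pass `b->data` and `b->slen`. Note that it counts code points, which are not always the same thing as user-perceived characters.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: byte offset of the i-th code point (0-based) in
 * s[0..len), or -1 if the buffer holds fewer than i+1 code points.
 * Assumes the input is valid UTF-8. Cost is O(N) in the byte length. */
long utf8_offset(const unsigned char *s, size_t len, size_t i)
{
    size_t pos, count = 0;

    assert(s != NULL);
    for (pos = 0; pos < len; pos++) {
        /* Continuation bytes look like 10xxxxxx; any other byte
         * starts a new code point. */
        if ((s[pos] & 0xC0) != 0x80) {
            if (count == i) return (long)pos;
            count++;
        }
    }
    return -1;
}
```

For the four bytes `61 C3 A0 62` ("a", "à", "b"), code point 2 starts at byte offset 3 rather than 2, which is exactly the difference between a[i] as a byte and a[i] as a character.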

Cyril

- Paul Hsieh - 2009-09-08
  
  > From what I've seen, I don't understand why you're saying that
  > Unicode isn't supported.
  >
  > I can store UTF-8 in a bstring correctly (even though I can't expect
  > a[i] to give me the i-th character, only the i-th byte of the string).
  > Converting between UCS-2/UCS-4 and UTF-8 is straightforward (about
  > 10 lines of code), so it isn't hard to do when one needs to
  > manipulate a native Unicode string.
  
  Well, of course Bstrlib can be used to store any serial data, including UTF-8. The point is to implement an equality function. If you read the Unicode documentation you will see that many code point sequences have equivalences. For example, an "a" with a grave accent on top of it exists both as a single character and as two separate combined characters, and both must be considered equivalent.
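
  To make the grave-accent example concrete, here is a minimal sketch in plain C. `bytes_equal` is a hypothetical helper, not a Bstrlib call; it stands in for any comparison that only looks at the stored bytes. The two arrays are the UTF-8 encodings of U+00E0 and of U+0061 followed by the combining grave accent U+0300.

  ```c
  #include <assert.h>
  #include <stddef.h>
  #include <string.h>

  /* Hypothetical helper: byte-wise equality, i.e. what a comparison
   * with no knowledge of Unicode amounts to. */
  int bytes_equal(const unsigned char *a, size_t alen,
                  const unsigned char *b, size_t blen)
  {
      assert(a != NULL && b != NULL);
      return alen == blen && memcmp(a, b, alen) == 0;
  }

  /* "a with grave" as one precomposed code point: U+00E0. */
  static const unsigned char precomposed[] = { 0xC3, 0xA0 };

  /* The same letter as "a" plus a combining grave: U+0061 U+0300. */
  static const unsigned char decomposed[] = { 0x61, 0xCC, 0x80 };
  ```

  `bytes_equal` reports these two as different even though they are canonically equivalent. A Unicode-aware equality would have to bring both sides to a common normalization form (such as NFC or NFD) before comparing, and that requires the Unicode data tables.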
  
  > Actually, getting [i] to return the i-th character isn't that hard
  > either, as it only requires decoding the high bits of each byte, but
  > it's an O(N) operation instead of O(1).
  > Overall, using UTF-8 doesn't really increase the required string
  > length (in a UCS-4 string most of the bytes are zero anyway, which
  > is not the case in UTF-8), and character access in a string is
  > quite rare anyway (so the O(N) penalty isn't that bad).
  
  Direct access to graphemes (the Unicode generalization of a character) is actually quite common in parsing. This can matter to some people, especially when taking input from web pages or implementing programming languages.
  
  --
  Paul
  