UTF-8

Stage Nex
2009-09-05
2013-04-29
  • Stage Nex

    Stage Nex - 2009-09-05

    From what I've seen, I don't understand why you're saying that unicode isn't supported.

    I can store UTF-8 in a bstring correctly (even though a[i] gives me the i-th byte of the string, not the i-th character).
    Converting between UCS-2/UCS-4 and UTF-8 is straightforward (about 10 lines of code), so it isn't hard to do when one needs to manipulate a native Unicode string.
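To make the "about 10 lines" claim concrete, here is a minimal sketch of the conversion in both directions. It is written in Python rather than against Bstrlib, the function names `utf8_to_code_points` and `code_points_to_utf8` are hypothetical, and it assumes input that is not overlong-encoded (it checks lengths and continuation bytes, but does not reject overlong forms or surrogates).

```python
def utf8_to_code_points(data: bytes) -> list:
    """Decode UTF-8 bytes into a list of integer code points (UCS-4 values)."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                 # 0xxxxxxx: single byte
            cp, n = b, 1
        elif (b >> 5) == 0b110:      # 110xxxxx: two bytes
            cp, n = b & 0x1F, 2
        elif (b >> 4) == 0b1110:     # 1110xxxx: three bytes
            cp, n = b & 0x0F, 3
        elif (b >> 3) == 0b11110:    # 11110xxx: four bytes
            cp, n = b & 0x07, 4
        else:
            raise ValueError("invalid lead byte at offset %d" % i)
        if i + n > len(data):
            raise ValueError("truncated sequence at offset %d" % i)
        for k in range(1, n):
            c = data[i + k]
            if (c >> 6) != 0b10:     # continuation bytes look like 10xxxxxx
                raise ValueError("invalid continuation byte at offset %d" % (i + k))
            cp = (cp << 6) | (c & 0x3F)
        out.append(cp)
        i += n
    return out


def code_points_to_utf8(code_points) -> bytes:
    """Encode integer code points as UTF-8 bytes."""
    out = bytearray()
    for cp in code_points:
        if cp < 0x80:
            out.append(cp)
        elif cp < 0x800:
            out += bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
        elif cp < 0x10000:
            out += bytes([0xE0 | (cp >> 12),
                          0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F)])
        elif cp <= 0x10FFFF:
            out += bytes([0xF0 | (cp >> 18),
                          0x80 | ((cp >> 12) & 0x3F),
                          0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F)])
        else:
            raise ValueError("code point out of range: %r" % cp)
    return bytes(out)
```

The same bit manipulation carries over to C almost line for line; only the buffer handling differs.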

    Actually, getting the i-th character with [i] isn't that hard either, since it only requires decoding the high bits of each byte, but it's an O(N) operation instead of O(1).
    Overall, using UTF-8 doesn't really increase the required string length (in a UCS-4 string most of the bytes are zero anyway, which is not the case in UTF-8), and character access in a string is quite rare anyway (so the O(N) penalty isn't that bad).
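As a rough illustration of the O(N) lookup described above, the sketch below scans for lead bytes to find where the i-th code point starts. The helper name `byte_offset_of_char` is hypothetical, "character" here means code point, and the input is assumed to be well-formed UTF-8.

```python
def byte_offset_of_char(data: bytes, index: int) -> int:
    """Return the byte offset at which the index-th code point starts.

    Scans from the beginning, so the cost is O(N) in the string length.
    """
    if index < 0:
        raise IndexError(index)
    seen = 0
    for offset, b in enumerate(data):
        # Continuation bytes look like 10xxxxxx; anything else starts a code point.
        if (b & 0xC0) != 0x80:
            if seen == index:
                return offset
            seen += 1
    raise IndexError(index)


text = "na\u00efve caf\u00e9"             # mostly ASCII, two accented letters
utf8_size = len(text.encode("utf-8"))      # one byte per ASCII letter, two per accented letter
ucs4_size = len(text.encode("utf-32-le"))  # always four bytes per code point
```

For mostly-ASCII text like this, `utf8_size` is far smaller than `ucs4_size`; the gap narrows for scripts whose characters need three or four bytes in UTF-8.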

    Cyril

     
    • Paul Hsieh

      Paul Hsieh - 2009-09-08

      > From what I've seen, I don't understand why you're saying that
      > unicode isn't supported.

      > I can store UTF-8 in bstring correctly, (even though I can't expect
      > a[i] giving me the i-th character, but the i-th byte of the string).
      > Converting to/from UCS2/UCS4 and UTF-8 is straightforward (about
      > 10 lines of code), so it isn't that hard to do, when one needs to
      > manipulate native unicode string.

      Well, of course Bstrlib can be used to store any serial data, including UTF-8.  The point is to implement an equality function.  If you read the Unicode documentation you will see that many code point sequences have equivalences.  For example, an a with a grave accent on top of it exists either as a single precomposed character or as two separate code points (the letter followed by a combining accent), and both must be considered equivalent.
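The grave-accent example can be checked directly. This sketch uses Python's standard `unicodedata` module rather than anything in Bstrlib, and the helper name `canonically_equal` is hypothetical; it shows that the two spellings differ byte for byte in UTF-8 yet compare equal once both are normalized.

```python
import unicodedata

precomposed = "\u00e0"    # LATIN SMALL LETTER A WITH GRAVE, one code point
decomposed = "a\u0300"    # 'a' followed by COMBINING GRAVE ACCENT, two code points


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after bringing both to the same normalization form (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# The raw UTF-8 byte sequences are different...
bytes_differ = precomposed.encode("utf-8") != decomposed.encode("utf-8")
# ...but the strings are canonically equivalent.
equivalent = canonically_equal(precomposed, decomposed)
```

A byte-wise comparison, which is all a storage-only string library can offer, reports these two strings as different; an equality function in the Unicode sense needs the normalization tables behind `unicodedata.normalize`.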

      > Actually getting [i] to get the i-th character is not that hard too, as
      > it only requires to decode the high bits of a char, but it's an O(N)
      > operation, instead of O(1).
      > Overall, using UTF-8 doesn't really increase the required string
      > length (because, if you use a UCS-4 string, most of bytes are zero
      > anyway, while in UTF-8 it's not the case), and char access in a
      > string is quite rare anyway (so the O(N) penalty isn't that bad)

      Direct grapheme (the Unicode generalization of a character) access is actually typically used in parsing.  This can matter to some people, especially when taking input from web pages or implementing programming languages.
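To show why grapheme access differs from code point access, here is a deliberately simplified sketch that attaches combining marks to the preceding base character. It is only an approximation: real grapheme cluster boundaries follow the Unicode text segmentation rules, which also cover cases such as Hangul syllables and emoji sequences that this ignores. The function name `combining_clusters` is hypothetical.

```python
import unicodedata


def combining_clusters(text: str) -> list:
    """Group each base character with the combining marks that follow it.

    Approximation only: a code point with a nonzero canonical combining
    class is appended to the previous cluster; everything else starts a
    new cluster.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


word = "a\u0300b"                     # 'a' + combining grave accent, then 'b'
code_point_count = len(word)          # counts code points
cluster_count = len(combining_clusters(word))  # counts user-visible units
```

A parser that indexes by code point would see three items in `word`, while a reader sees two characters, which is the distinction at stake when parsing input by grapheme.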

      --
      Paul

       
