
i18n

  • Gregg Reynolds

    Gregg Reynolds - 2006-04-30

    Hi,

    Just discovered bstring; looks very impressive.  I plan to use it for some personal projects I'm kicking around.

    Unfortunately, lack of i18n support is a huge problem.  I see from your faq that you're open to adding such support but don't have a lot of experience with unicode, et al.  Well, I'd like to help out if I can.  I can't promise much in the way of running code (no time, alas), but I'm pretty conversant with Unicode issues, so I can help answer questions and do research.  I could probably find time to read code, but writing and debugging is a different story.  Also, the Unicode list is quite active, and lots of programmers monitor it - I'm certain you could count on lots of useful advice from them.

    So I guess my question is, what needs to happen to get the ball rolling?  In the faq you solicit suggestions regarding scope, and "how such support might be added".

    Regarding scope:  I haven't done the research, so I might be wrong, but to me the C world in general looks unicode-poor.  I think Java more-or-less sets the standard for unicode support, so I think the goal, at least for a C string lib, should be to match or exceed Java (whatever that turns out to mean).  First C string lib to offer full, robust, easy-to-use Unicode support wins big, no?

    Rather than go on at length about scope in this email, I suggest we do the wiki thing, if you're interested.  In fact, I just decided I'll go ahead and create a wiki site at one of the freebie wiki hosts and then post the url here.

    Regarding the how:  I'd say the first thing to attack is to decide on internal representation.  The world seems to be converging on utf-8 for internal representation of unicode data, so my first suggestion would be to find out how painful it's going to be to convert bstring to use this.  For that I could help audit the code, at least.

    Whaddya think?

    -gregg

     
    • Paul Hsieh

      Paul Hsieh - 2006-05-01

      > Just discovered bstring; looks very impressive. I plan to use it for some
      > personal projects I'm kicking around.

      > Unfortunately, lack of i18n support is a huge problem. I see from your faq
      > that you're open to adding such support but don't have a lot of experience
      > with unicode, et al. Well, I'd like to help out if I can. I can't promise
      > much in the way of running code (no time, alas), but I'm pretty conversant
      > with Unicode issues, so I can help answer questions and do research.

      What do you know about fast normalization algorithms?  The Unicode/IBM people
      seem to suggest that the table->table method is the best way to encode the
      code point attribute mapping, but I am not entirely satisfied with this
      approach.
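
      For concreteness, what I understand by "table->table" is roughly the
      following (an illustrative sketch only -- the names and table shapes here
      are made up, not taken from any real implementation):

          /* Two-stage lookup: split a 21-bit code point into a high part
             and a low part.  stage1[] maps the high part to a block
             number; stage2[] holds one attribute byte per code point
             within each block.  Identical blocks are shared, which is
             what keeps the total table size manageable. */
          #define BLOCK_BITS 8
          #define BLOCK_SIZE (1 << BLOCK_BITS)

          extern const unsigned short stage1[0x110000 >> BLOCK_BITS];
          extern const unsigned char  stage2[][BLOCK_SIZE];

          static unsigned char cpAttr (unsigned long cp) {
              return stage2[stage1[cp >> BLOCK_BITS]][cp & (BLOCK_SIZE - 1)];
          }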

      > [...] I
      > could probably find time to read code, but writing and debugging is a
      > different story.

      That's ok.  Design, semantics description, properties, feature lists, API
      discussion are also valuable.

      > [...] Also, the Unicode list is quite active, and lots of
      > programmers monitor it - I'm certain you could count on lots of useful
      > advice from them.

      Perhaps that would be the best approach.

      > So I guess my question is, what needs to happen to get the ball rolling? In
      > the faq you solicit suggestions regarding scope, and "how such support
      > might be added".

      Well, Bstrlib includes comparison functions which perform both equality
      testing and lexical sorting.  In Unicode parlance this is supposedly called
      "collation".  Equality testing is well defined; however, actual ordering is
      apparently locale-specific and can't have a universal definition.  So one
      question I have is: is it ok to only include equality testing, or do we
      really want locale support for lexical sorting?

      The Unicode table also supplies attributes for changing case.  So that
      should make toupper()/tolower() like functions straightforward.  However,
      Unicode *also* includes capitalization attributes -- but these would require
      context/position characteristics to be taken into account.
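
      I.e., the simple 1:1 mappings reduce to a table lookup.  A minimal sketch,
      assuming a hypothetical delta table derived from the UnicodeData.txt case
      fields (the special casings, like the German sharp s uppercasing to "SS",
      change the string length and would need separate handling):

          /* Hypothetical: signed distance from a code point to its
             simple uppercase form; 0 if there is no mapping. */
          extern long cpUpperDelta (unsigned long cp);

          static unsigned long cpToUpper (unsigned long cp) {
              return (unsigned long) ((long) cp + cpUpperDelta (cp));
          }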

      > Regarding scope: I haven't done the research, so I might be wrong, but to
      > me the C world in general looks unicode-poor. I think Java more-or-less
      > sets the standard for unicode support, so I think the goal, at least for a
      > c string lib, should be to match or exceed Java (whatever that turns out to
      > mean). First C string lib to offer full, robust, easy-to-use Unicode
      > support wins big, no?

      Well, IBM has created an open source char * analogue library, haven't they?
      Though, yes, I do agree that a good Bstrlib-like implementation would
      distinguish itself.  But IBM's library also includes regexes, which would be
      hard to compete with unless I did a regex implementation as well.

      > Rather than go on at length about scope in this email, I suggest we do the
      > wiki thing, if you're interested. In fact, I just decided I'll go ahead and
      > create a wiki site at one of the freebie wiki hosts and then post the url
      > here.

      Pointer?

      > Regarding the how: I'd say the first thing to attack is to decide on
      > internal representation. The world seems to be converging on utf-8 for
      > internal representation of unicode data, so my first suggestion would be to
      > find out how painful it's going to be to convert bstring to use this.

      Wait, no.  The UTF-* formats are meant for platform independent uniform
      stored representation.  Internally, the representation need only be
      something sufficient to cover the Unicode code point ranges -- that means,
      basically, something that can hold 21 bits.  I.e., ucs-4, or utf-32, is
      actually the representation most in keeping with the Bstrlib design (i.e.,
      support for direct addressing of code points by their exact position in the
      code point sequence.)

      So my thinking was actually to create a parallel library built on a data
      type called, maybe, "bustring", which represented each character as a
      (typedef unsigned long) cpucs4_t, and to support conversion backward and
      forward to both utf-8 (in a bstring) and utf-16 (in just some container),
      and eventually the other marginal formats like utf-7 and utf-1.
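
      Concretely, I am imagining something that just mirrors the tagbstring
      header (a sketch only; none of these names are final):

          typedef unsigned long cpucs4_t;  /* holds any 21 bit code point */

          struct tagbustring {
              int mlen;         /* allocated length, in code points */
              int slen;         /* string length, in code points */
              cpucs4_t * data;  /* the code point sequence */
          };
          typedef struct tagbustring * bustring;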

      An important point is that I don't want to directly modify Bstrlib at all.
      Bstrlib's implementation is a full range byte oriented string implementation
      suitable for manipulating raw binary data just as readily as ASCII text.
      I.e., one should be able to reliably and portably perform an fread()
      straight into the contents of a bstring.  Having a separate data structure
      for Unicode strings would pretty much be mandatory.

      > [...] For that I could help audit the code, at least.
      > Whaddya think?

      I've been wanting to look at this for some time, and even did some
      experimental stuff already.  But I was worried that operating in a vacuum
      would easily lead me to make mistakes through ignorance.  I think the key
      to proceeding here is to enter into a discussion about it.

       
      • Mikko Rantalainen

        > So one question I have, is, is it ok to only
        > include equality testing, or do we really want
        > locale support for lexical sorting?

        The bare minimum would be equality testing with Unicode equivalence (http://en.wikipedia.org/wiki/Canonical_equivalence) in NFD normalization form (see below). It would be great to have locale support for lexical sorting, but I don't think that would be required for broad acceptance. In any case, it could be added later. A good API for lexical sorting would be along the lines of cmp(str1,str2,slocale), where slocale could be a string similar to the POSIX LC_COLLATE environment variable or a struct of some kind.
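
        Something along these lines, say (the names are only placeholders):

            /* Placeholder prototypes: equality under canonical
               equivalence, and locale-aware ordering.  slocale would be
               a POSIX-style collation name such as "fi_FI". */
            int buIsEqualCanonical (const bstring b0, const bstring b1);
            int buCollate (const bstring b0, const bstring b1,
                           const char * slocale);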

        > The Unicode table also supplies attributes
        > changing case. So that should make
        > toupper()/tolower() like functions straight
        > forward. However, Unicode *also* includes
        > capitalization attributes -- but this would
        > require context/position characteristics to
        > be taken into account.

        I wouldn't bother with capitalization at first. It can be added later if it's really considered worth it.

        >> The world seems to be converging on utf-8 for 
        >> internal representation of unicode data, so
        >> my first suggestion would be to find out how
        >> painful it's going to be to convert bstring
        >> to use this.

        > Wait no. The UTF-* formats are meant for platform
        > independent uniform stored representation. Internally,
        > the representation need only be something sufficient
        > to cover the Unicode code point ranges -- that means,
        > basically, something that can hold 21 bits. I.e.,
        > ucs-4, or utf-32 is actually the best representation
        > that is most in keeping with the Bstrlib design (i.e.,
        > support for direct addressing of code points by its
        > exact position in the code point sequence.)

        I disagree. See Mozilla et al for examples. UTF-8 is used despite its variable length encoding for two reasons: (1) it's compatible with C strings if the NUL byte (U+0000) doesn't need to be encoded, and (2) it requires storage equivalent to 8bit/Latin1 strings for ASCII (and it is literally ASCII for the lowest 7 bits' worth of Unicode code points).

        Even the programs that deal with multibyte UTF-8 strings deal with a great deal of ASCII characters too, so jumping to a 4 byte internal representation for every character could be a disaster for memory usage, and possibly for performance, too.

        I'd suggest keeping Bstrlib as is (and documenting that all existing functions deal with bytes, not characters) but adding functions to do things with the characters where the actual string really is UTF-8. Think of the current Bstrlib as a high performance byte array manipulation library.

        A function to concatenate two strings is the same for a byte array and a UTF-8 string, and the same applies to a function that copies a string. However, a UTF-8 string may yield multiple bytes when, say, the 5th character is extracted. So one would need, in addition to the currently available Bstrlib functions, functions that deal with character indexes instead of byte indexes. With Latin1 there's no difference; with UTF-8 there is.
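
        For example, a character-index lookup has to walk the bytes from the start. A rough sketch on top of the existing bstring type (skipping validation of malformed sequences):

            /* Length in bytes of the UTF-8 sequence starting with lead
               byte c; ASCII and stray continuation bytes count as 1. */
            static int utf8SeqLen (unsigned char c) {
                if (c < 0xC0) return 1;
                if (c < 0xE0) return 2;
                if (c < 0xF0) return 3;
                return 4;
            }

            /* Byte offset of the n'th (0-based) character of a UTF-8
               bstring: O(n), unlike today's O(1) byte indexing. */
            static int utf8CharPos (const bstring b, int n) {
                int pos = 0;
                while (n-- > 0 && pos < b->slen)
                    pos += utf8SeqLen (b->data[pos]);
                return pos;
            }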

        Also note that even when a variable length encoding is not used, the functions must distinguish between a byte, a Unicode code point, and a grapheme. A "character" could be a single Unicode code point or a grapheme. Also, it would be preferable to allow storage of UTF-8 strings that are not normalized, but there should be functions to convert from a non-normalized string to a normalized string (http://en.wikipedia.org/wiki/Unicode_normalization). Both the NFC (most often used) and NFD (used internally by Mac OS X) normalization forms should be supported. Personally, I'd prefer NFC normalization, with a character being a single Unicode code point in a UTF-8 string.

        In the end, UTF-8 strings have the disadvantage of poor random-access character performance unless extra storage is used to store character indexes alongside the byte array. On the other hand, UTF-8 strings allow storing ASCII strings mixed with characters outside ASCII without a huge storage space penalty. In addition, compatibility with C string functions is better.

         
        • Paul Hsieh

          Paul Hsieh - 2008-07-08

          > > So one question I have, is, is it ok to only
          > > include equality testing, or do we really want
          > > locale support for lexical sorting? 

          > The bare minimum would be equality testing with
          > unicode equivalence (http://en.wikipedia.org/wiki/Canonical_equivalence)
          > in NFD normalization form (see below).

          Yes, I understand this. But I am not trying to get a checkmark in some feature box. Just as Bstrlib is supposed to be a very powerful string library base (powerful enough to make the use of char *'s a total non-starter in nearly all cases), I would want whatever support Bstrlib has for Unicode to feel at the same level of power (including functional coverage).

          > [...] It would
          > be great to have locale support for lexical sorting
          > but I don't think that would be required for broad
          > acceptance. In any case, it could be added later. A
          > good API for lexical sorting would be along the line of
          > cmp(str1,str2,slocale) where slocale could be a string
          > similar to POSIX LC_COLLATE environment variable or a
          > struct of some kind.

          Well, obviously the problem is that it's nearly impossible to do this in any way that is sustainable, beyond falling back to the compiler's locale support for char *'s anyway.
           
          > > The Unicode table also supplies attributes
          > > changing case. So that should make
          > > toupper()/tolower() like functions straight
          > > forward. However, Unicode *also* includes
          > > capitalization attributes -- but this would
          > > require context/position characteristics to
          > > be taken into account. 

          > I wouldn't bother with capitalization at first. It can
          > be added later if it's really considered worth it.

          Well, my thinking, as above, is that it's an explicitly encoded feature of the Unicode attributes.  So to be sure that Bstrlib was *the* Unicode solution, it would seem that actually supporting such a feature would make sense.

          > >> The world seems to be converging on utf-8 for 
          > >> internal representation of unicode data, so
          > >> my first suggestion would be to find out how
          > >> painful it's going to be to convert bstring
          > >> to use this. 

          > > Wait no. The UTF-* formats are meant for platform
          > > independent uniform stored representation. Internally,
          > > the representation need only be something sufficient
          > > to cover the Unicode code point ranges -- that means,
          > > basically, something that can hold 21 bits. I.e.,
          > > ucs-4, or utf-32 is actually the best representation
          > > that is most in keeping with the Bstrlib design (i.e.,
          > > support for direct addressing of code points by its
          > > exact position in the code point sequence.) 

          > I disagree. See Mozilla et al for examples. UTF-8 is
          > used regardless of it's variable length encoding for two
          > reasons: (1) it's compatible with C strings if NULL byte
          > (U+0) doesn't need to be encoded and (2) it requires
          > equivalent storage to 8bit/Latin1 strings for ASCII (and
          > it's really ASCII for the lowest 7bit unicode points).

          Well, I could also cite examples from Java or most of Windows to support UTF-16 as the base type.  I understand the advantages and disadvantages of the various encoding types.

          My reasoning for preferring UTF-32 was that it trades away only one thing (memory footprint) in exchange for retaining every other basic advantage.  Besides direct addressing, algorithms such as "find string in string" are trivial to implement, and there are no "false encodings" (except for illegal code point values, but those are propagated in isolation).  It also maintains the basic philosophy of Bstrlib, which is to retain the length as a directly stored value, not something that needs to be computed.

          To implement utf-8 (or utf-16) as the basic encoding type, I would probably have to implement a whole iterator paradigm.

          But now that I think about it, the code point sequence length is not very useful as what the end user really thinks of as the string length (the number of graphemes), and "find string in string" is not trivial if by "string" we mean Unicode normalized strings.  And to do normalization I need to implement an iterator paradigm anyways (I have already written some demo code that does this.)

          So my thinking is that an iterator (through normalized graphemes) might look like:

              struct buIterator {
                  bstring str;    /* the source UTF-8 string */
                  int offset;     /* byte offset of the current grapheme */
                  int next;       /* byte offset where the next grapheme starts */

                  int cpMlen;     /* allocated entries; = 16 initially */
                  int cpSlen;     /* the number of code points in this grapheme */
                  bu_utf32_t codepoints[16]; /* actually growable */
              };

          With the idea being that it would intrinsically reallocate if a grapheme was more than cpMlen (initially set to 16) code points.

          This would allow implementation of all the search, scan, and compare functions, as well as a grapheme length calculator.
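
          A grapheme length calculator on top of it would then be nothing more than the following (buIterInit/buIterNext being placeholder names for whatever the iterator interface ends up as):

              extern void buIterInit (struct buIterator * it, bstring str);
              extern int  buIterNext (struct buIterator * it); /* 0 at end */

              /* Count graphemes by walking the iterator to the end of
                 the string. */
              int buGraphemeLength (bstring str) {
                  struct buIterator it;
                  int len = 0;
                  buIterInit (&it, str);
                  while (buIterNext (&it)) len++;
                  return len;
              }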
           
          > Even the programs that deal with multibyte UTF-8 strings
          > deal with a great deal of ASCII characters too so jumping
          > to 4 byte internal presentation for every character could
          > be a disaster for memory usage and possibly for the
          > performance, too.

          Well, you would have to have fairly large strings for this to be an issue. Remember that Bstrlib itself already costs about 50% more in memory overhead versus precisely sized string allocations, and yet it wins pretty much all benchmarks against the standard alternatives.  So I don't really think of this as a sufficient argument.
           
          > I'd suggest keeping the Bstrlib as is (and document that
          > all existing functions deal with bytes, not characters)
          > but add functions to do things with the characters where
          > the actual string is really UTF-8. Think current Bstrlib
          > as a high performance byte array manipulation library.

          This has a lot of appeal in the sense of being able to merely extend Bstrlib as is.  My original thinking was to create a completely separate library, probably called BuStrlib, that would encode a new type but still use Bstrlib for conversions back and forth from utf-8 or utf-16.  (Though you have made me reconsider this.)

          Bstrlib is a generalization of char * strings, so I think I would retain the language of "characters" for its strings.  Furthermore, Unicode itself shies away from using the word "character" anyways.  Grapheme and code point seem like the right terminology.

          I think the right way is to add a bustrlib *module* that simply has functions that treat bstrings as UTF-8 formatted sequences.

          > A function to catenate two strings is equal to byte array
          > and UTF-8 string, the same applies to function that copies
          > two strings.

          Well ... due to the potential for encoding errors, I think more would have to be done.

          > [...] However, a UTF-8 string may return multiple
          > bytes when 5th character is extracted. So one would need,
          > in addition to currently available Bstrlib strings,
          > functions that deal with characters instead of byte
          > indexes. With Latin1 there's no difference, with UTF-8
          > there's a difference.

          > Also note that even though a variable length encoding is
          > not used, the functions must make a difference between byte,
          > unicode point and a grapheme. A "character" could be a
          > single unicode point or a grapheme. Also, it would be
          > preferable to allow storage of UTF-8 strings that are not
          > normalized but there should be functions to convert from
          > non-normalized string to normalized string
          > (http://en.wikipedia.org/wiki/Unicode_normalization).

          The nature of Bstrlib is to allow for arbitrary encodings of input without demanding structure, but rather *interpreting* structure from whatever happens to be there.  Explicit normalization would have to be an extra function but it would serve little to no functional purpose (the iterators would have to always auto-normalize anyways).

          > [...] Both
          > NFC (most often used) and NFD (used internally by MacOS X)
          > normalization standards should be supported. Personally I'd
          > prefer NFC normalization and a character would be a single
          > unicode point in an UTF-8 string.

          Bstrlib is not about setting arbitrary policies for programmers.  I think I would be forced to implement all of them including the compatibility normalizations.
           
          > In the end, UTF-8 strings have disadvantage of poor random
          > access character performance unless extra storage is used
          > to store character indexes inside the byte array.

          I think more to the point, random access to code points is not useful to any string algorithm anyways.

          > [...] On the
          > other hand, UTF-8 strings allow storing ASCII strings
          > mixed with characters outside ASCII without huge storage
          > space penalty. In addition, compatibility with C string
          > functions is better.

          Ok, the real issues come when we think about practical source code generation environments.  One of the niceties that Bstrlib comes with is the bsStatic() macro and the blk functions in conjunction with bsStaticBlkParms().  These let you use Bstrlib with string literals in a fairly natural way.
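
          For example:

              /* bsStatic builds a tagbstring from a literal at compile
                 time; bsStaticBlkParms expands to the (pointer, length)
                 argument pair that the blk functions take. */
              struct tagbstring t = bsStatic ("Hello, world");
              bstring b = bfromcstr ("Hello");
              bcatblk (b, bsStaticBlkParms (", world"));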

          With a UTF-8 encoding, basically, whatever is legal to write in that raw form is ok.  But the modern wchar_t standard also has L"Literal" as a new kind of string literal, though many compilers do not support it.  Should I make similar macros/functions for them?  I don't really know the general state of compilers with respect to wchar_t support.

          I'd like to thank you for your feedback.  It's nice to know that Bstrlib is valuable enough that someone wants to see it developed to the obvious next step.

           
