#41 TCharSet and TFilterValidator are broken for Unicode

Milestone: unspecified
Status: closed
Owner: nobody
Priority: 5
Updated: 2012-09-27
Created: 2009-07-26
Private: No

The TCharSet class in "owl/bitset.h" is unable to represent a set of wide characters. It can only represent narrow (8-bit) characters. If the given set of wide characters (the constructor parameter) contains characters with codes above 255, those characters are silently truncated to codes below 256 before they are added to the set representation. This limits the usable characters to the lower 256 character codes, and a truncated character ends up in the set as a different, unrelated character.
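The effect of the truncation can be sketched as follows. This is a minimal illustration, not the OWLNext code: NarrowCharSet and AddTruncated are hypothetical names, and the only assumption is that the set stores one bit per 8-bit code, as described above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a narrow character set (not the OWLNext class):
// one bit per 8-bit character code, 256 bits in total.
struct NarrowCharSet {
    std::uint8_t bits[32] = {};

    void Add(std::uint8_t code)
    {
        bits[code / 8] |= static_cast<std::uint8_t>(1u << (code % 8));
    }

    bool Has(std::uint8_t code) const
    {
        return ((bits[code / 8] >> (code % 8)) & 1u) != 0;
    }
};

// Adds a wide character the lossy way: the cast keeps only the low 8 bits.
inline void AddTruncated(NarrowCharSet& set, wchar_t c)
{
    set.Add(static_cast<std::uint8_t>(c));
}

// U+0141 (code 0x0141) is stored as 0x41, so the set now claims to contain 'A'.
inline void DemoTruncation()
{
    NarrowCharSet set;
    AddTruncated(set, static_cast<wchar_t>(0x0141));
    assert(set.Has(0x41));
}
```

A filter built from such a set would accept 'A' even though the caller never asked for it, and would reject the character that was actually requested.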

TCharSet is used by TFilterValidator. This means that TFilterValidator is also broken. No other classes in OWLNext currently use TCharSet.

This bug is the cause of the following compiler warning when building the Unicode variant of OWLNext 6.21.9:

filtval.cpp(57) : warning C4244: 'argument' : conversion from 'TCHAR' to 'uint8', possible loss of data

I recommend that TCharSet be fixed so that it is able to represent a set of wide characters correctly. Alternatively, an exception should be thrown for unsupported character codes (> 255).
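The alternative (rejecting unsupported codes instead of truncating them) might look roughly like this. ToNarrowCodeChecked is a hypothetical helper, not an existing OWLNext function, and std::out_of_range is only one possible choice of exception type.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Hypothetical checked conversion (not part of OWLNext): returns the 8-bit
// code for a wide character, or throws if the code does not fit in 8 bits.
inline std::uint8_t ToNarrowCodeChecked(wchar_t c)
{
    // Convert through an unsigned type so that a negative wchar_t value
    // (wchar_t is signed on some platforms) is rejected as well.
    const unsigned long code = static_cast<unsigned long>(c);
    if (code > 255)
        throw std::out_of_range("character code above 255 is not representable");
    return static_cast<std::uint8_t>(code);
}

// Returns true if the conversion of c is rejected with an exception.
inline bool IsRejected(wchar_t c)
{
    try {
        ToNarrowCodeChecked(c);
    } catch (const std::out_of_range&) {
        return true;
    }
    return false;
}
```

With a check like this, a set containing a character above 255 fails loudly at construction time rather than quietly matching the wrong character later.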

Related

Bugs: #244

Discussion

  • Vidar Hasfjord

    Vidar Hasfjord - 2009-07-27

    Unified diff applicable to OWLNext 6.21.9

     
  • Vidar Hasfjord

    Vidar Hasfjord - 2009-07-27

    I've attached a patch with an untested fix for this issue.

    • Fixed: "TCharSet and TFilterValidator are broken for Unicode" (Tracker ID 2827517). Added support for wide characters in TBitSet (the base for TCharSet).

    The Unicode variant of 6.21.9 now builds cleanly without warnings with VC9.

    Note that the fix changes TBitSet substantially, from an ordinary class to a class template. Client code will have to be updated. I have only compiled it with VC9, and it is uncertain how it will work with other compilers. I have not tested the functionality of the fix, i.e. whether it actually works in practice.

    So please test this patch.

    Also note that the implementation is brute-force. I just extended the bit array to cope with a wide character set (16-bit). This means that TCharSet does not support UTF-16 with multi-code-unit Unicode characters (surrogate pairs). It is limited to the UCS-2 fixed-length subset.

    For wide characters the implementation requires an 8 KiB buffer (65,536 bits, one per 16-bit code) for each instance of TCharSet, which may be considered a large memory cost.
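The brute-force approach and its memory cost can be sketched like this. The sketch is not the attached patch: CodeUnitSet and its members are hypothetical names, and it assumes C++11 and uses char16_t for the 16-bit case so that the size does not depend on the platform's wchar_t.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Hypothetical brute-force set (not the patched TBitSet): one bit for every
// possible value of the code unit type T.
template <class T>
class CodeUnitSet {
    static_assert(sizeof(T) <= 2, "one bit per code is only practical for 8- or 16-bit code units");

public:
    static constexpr std::size_t kCodeCount = std::size_t(1) << (sizeof(T) * CHAR_BIT);
    static constexpr std::size_t kByteCount = kCodeCount / 8;

    void Add(T c)
    {
        const std::size_t i = Index(c);
        bits_[i / 8] |= static_cast<std::uint8_t>(1u << (i % 8));
    }

    bool Has(T c) const
    {
        const std::size_t i = Index(c);
        return ((bits_[i / 8] >> (i % 8)) & 1u) != 0;
    }

private:
    // Map the code unit to a non-negative index, even if T is a signed type.
    static std::size_t Index(T c)
    {
        return static_cast<typename std::make_unsigned<T>::type>(c);
    }

    std::uint8_t bits_[kByteCount] = {};
};

// 256 codes need 32 bytes; 65,536 codes need 8192 bytes (8 KiB) per instance.
static_assert(CodeUnitSet<char>::kByteCount == 32, "narrow set size");
static_assert(CodeUnitSet<char16_t>::kByteCount == 8192, "wide set size");
```

Each 16-bit code unit gets its own bit, so a character outside the UCS-2 range, which UTF-16 encodes as two code units, cannot be represented as a single member of such a set.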

    Better solutions are welcomed.

     
  • Ognyan Chernokozhev

    Fixed in 6.21.10

     