#387 special chars in utf-8 are replaced by a space

closed
Build (23)
5
2012-11-10
2011-04-13
No

Many special characters in UTF-8 are encoded as 2 bytes: a lead byte followed by a continuation byte. Some Danish characters are encoded this way.
The lead byte is #C3. Some, but not all, Danish characters are changed to a space (#20).
The behavior differs depending on whether the UTF-8 file has a BOM (#EFBBBF). As far as I know, the UTF-8 standard allows a document either to have a BOM or not. It is good practice to have a BOM, but some editors remove it, so both situations occur.
Example 1: if the file has no BOM, the 2-byte chars #C3A5, #C3A6, #C3B8 (å, æ, ø) are changed to #C320, #C320, #C3B8.
Example 2: if the file has a BOM, the 2-byte chars #C3A5, #C3A6, #C3B8 (å, æ, ø) are changed to #20, #20, #C3B8.
As I see it, Uncrustify could treat every byte after #C3 as non-space, regardless of whether the file has a BOM.
An example file using the special chars is attached.
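
For anyone who wants to reproduce the byte values quoted above, here is a minimal Python sketch. It only shows how the three Danish letters encode and how a BOM can be detected; it is not Uncrustify code, and the helper names `utf8_hex` and `has_bom` are made up for illustration.

```python
import codecs

# The three Danish letters from the report: å (U+00E5), æ (U+00E6), ø (U+00F8).
DANISH = "\u00e5\u00e6\u00f8"


def utf8_hex(ch: str) -> str:
    """Return the UTF-8 encoding of one character as uppercase hex."""
    return ch.encode("utf-8").hex().upper()


def has_bom(data: bytes) -> bool:
    """Return True if the byte string starts with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


# Hypothetical helper names; each letter encodes to lead byte C3 plus one more byte.
encoded = [utf8_hex(c) for c in DANISH]

with_bom = codecs.BOM_UTF8 + DANISH.encode("utf-8")
without_bom = DANISH.encode("utf-8")
```

Comparing such byte strings before and after a formatter run is one way to spot which bytes were replaced by #20.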

Discussion

  • Theo Thustrup

    Theo Thustrup - 2011-04-13
     
  • Ben Gardner

    Ben Gardner - 2011-05-05

    Fixed in commit 6ed81d3.
    However, there is currently no support for UTF-8 multi-byte characters, so using them may cause alignment problems.
    This is because Uncrustify assumes that the byte-length of a chunk is the same as the display-length.
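
The mismatch between byte length and display length can be illustrated with a small Python sketch. This is only an illustration of the general effect, not Uncrustify's implementation; the function names are hypothetical, and counting code points is used here as a rough stand-in for display width (it holds for these Danish letters but not for every Unicode character).

```python
def byte_length(text: str) -> int:
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def display_length(text: str) -> int:
    """Number of code points; an approximation of on-screen columns."""
    return len(text)


# "blåbærgrød" written with escapes so the snippet is encoding-independent.
word = "bl\u00e5b\u00e6rgr\u00f8d"

# Each of the three non-ASCII letters takes 2 bytes but 1 column,
# so a tool that aligns by byte count would be off by this many columns.
drift = byte_length(word) - display_length(word)
```

A tool that pads by `byte_length` rather than `display_length` would misalign any line containing such characters by `drift` columns.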

     