
#6 What is correct handling of 0xe2 0x80 0x94 character?

0.3.3
closed
None
2017-01-11
2016-12-07
smitchell
No

On 7/3/2015 7:05 PM, Kevin Routley wrote:

Hi, Stan -
I've got another bug for you. The next GED file I tried has various "extended" punctuation characters in some NOTE lines - probably pasted in from Doc, Wordpad, etc. The plugin highlights them as invalid.

The attached smallest.ged shows the problem. The "-" character in the NOTE line is not a standard minus (0x2D) but something else. DebugTrace tells me it is 0xE2, followed by 0x80.

At that character, the state machine switches from LS_VALUE to LS_ERROR because isInvalidControlChar() returns true for 0xE2.

I changed the declaration of isInvalidControlChar() from
bool isInvalidControlChar(char ch) to
bool isInvalidControlChar(unsigned char ch) and all is as I desire.
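
A likely explanation for why the signature change helps is sketched below. The body of isInvalidControlChar() is not shown in this ticket, so the range check here is an assumption for illustration only, and both function names are hypothetical. On compilers where plain char is signed, the byte 0xE2 is seen as a negative value, so any "less than 0x20" test treats it as a control character. The sketch uses signed char explicitly to model that case.

```cpp
// Hypothetical stand-in for the original check (assumed body, not the plugin's code).
// signed char models plain char on compilers where char is signed.
bool isInvalidControlCharSigned(signed char ch) {
    // 0xE2 arrives here as a negative value, so it passes the "< 0x20" test.
    return ch < 0x20 && ch != '\t';
}

// Same assumed check with the parameter type changed as described above.
bool isInvalidControlCharUnsigned(unsigned char ch) {
    // 0xE2 arrives here as 226, which is not below 0x20.
    return ch < 0x20 && ch != '\t';
}
```

With these definitions, the signed version reports 0xE2 as an invalid control character and the unsigned version does not, while a genuine control byte such as 0x01 is rejected by both.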

I stumbled across your blog entry about the Function List capability and that looks useful - something new to explore!

My thanks again for this useful tool!
Kevin

1 Attachments

Discussion

  • smitchell

    smitchell - 2016-12-07

    7/6/2015
    Hi Kevin,

    The example GEDCOM you provided has "CHAR ASCII" in the header, so its contents should only use 7-bit ASCII. In this case it is an error to embed a multi-byte UTF-8 character. However, if the header is changed to "CHAR UTF-8", the EM DASH (0xe2 0x80 0x94) character is still flagged as an error, and that is a bug.

    Currently, no attempt is made to check the buffer encoding when validating characters, nor is the CHAR tag setting taken into consideration. Doing so would require closer integration with Scintilla (the editor component), and I didn't want to make that extra effort.
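
As a rough illustration of what an encoding-aware check could look like, here is a minimal sketch that is independent of Scintilla and of the plugin's actual code. The function names utf8SequenceLength and isSevenBitAscii are hypothetical. The UTF-8 check is simplified: it validates only the lead byte and the continuation bytes, and does not reject overlong encodings or surrogate code points.

```cpp
#include <cstddef>

// Returns the length in bytes (1 to 4) of the UTF-8 sequence starting at p,
// or 0 if the sequence is malformed or truncated.
// Simplified: overlong forms and surrogates are not rejected.
std::size_t utf8SequenceLength(const unsigned char* p, std::size_t remaining) {
    if (remaining == 0) return 0;

    std::size_t len = 0;
    if (p[0] < 0x80)                len = 1;  // 0xxxxxxx: plain ASCII
    else if ((p[0] & 0xE0) == 0xC0) len = 2;  // 110xxxxx
    else if ((p[0] & 0xF0) == 0xE0) len = 3;  // 1110xxxx (EM DASH starts with 0xE2)
    else if ((p[0] & 0xF8) == 0xF0) len = 4;  // 11110xxx
    else return 0;                            // stray continuation or invalid lead byte

    if (len > remaining) return 0;            // sequence cut off by end of buffer

    // Every following byte must have the form 10xxxxxx.
    for (std::size_t i = 1; i < len; ++i) {
        if ((p[i] & 0xC0) != 0x80) return 0;
    }
    return len;
}

// True if every byte is 7-bit ASCII, as a "CHAR ASCII" header would require.
bool isSevenBitAscii(const unsigned char* p, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        if (p[i] >= 0x80) return false;
    }
    return true;
}
```

Under these assumptions, the bytes 0xe2 0x80 0x94 form one valid three-byte sequence when the file declares "CHAR UTF-8", and fail the 7-bit test when it declares "CHAR ASCII", which matches the two cases described above.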

    -Stan

     

    Last edit: smitchell 2016-12-07
  • smitchell

    smitchell - 2017-01-11
    • status: open --> closed
    • Milestone: 0.3.4 --> 0.3.3
     
