Searching for special characters

Will Haney
2013-12-20
2013-12-21
  • Will Haney

    Will Haney - 2013-12-20
     
    Last edit: Will Haney 2013-12-20
  • Will Haney

    Will Haney - 2013-12-20

    This file contains a special character indicated by the black highlighted text x92.
    When I search the x92 character as \x92 with Search Mode set to Extended I find no matches.

    Any assistance would be greatly appreciated.

    Notepad++ showing special character
    Search results

     
  • THEVENOT Guy

    THEVENOT Guy - 2013-12-21

    Hello Will,

    Will, I think that your original text was almost certainly : ST. LUKE’S CARE HOME

    Depending on your system language, you're using one of the Microsoft Windows-nnnn encodings ( in other words, what N++ calls the one-byte ANSI encoding ), listed at the address below :

    http://msdn.microsoft.com/en-us/goglobal/bb964654

    In all these Windows-nnnn encodings, you can see that the ANSI code-point \x92 ( decimal value 146 ) represents the Right Single Quotation Mark. This character is coded :

    • in the UNICODE Character Set, as the code-point U+2019, stored as the two bytes 20 19 ( UCS-2 Big Endian encoding of N++ )

    • in the UTF-8 encoding format, as the three-byte sequence E2 80 99 ( UTF-8 without BOM or UTF-8 encodings of N++ )

    For reference, see the Unicode General Punctuation chart at the address below :

    http://www.unicode.org/charts/PDF/Unicode-6.3/U63-2000.pdf
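
    As a quick cross-check of the three values above, here is a minimal Python sketch. It assumes Windows-1252 ( Python codec name "cp1252" ) as the ANSI code page, and the variable names are only illustrative :

```python
# Interpret the single byte 0x92 as Windows-1252 (assumed ANSI code page).
ch = bytes([0x92]).decode("cp1252")

# Unicode code point of the resulting character.
code_point = f"U+{ord(ch):04X}"

# The same character stored as UCS-2 / UTF-16 Big Endian and as UTF-8.
ucs2_be = ch.encode("utf-16-be").hex(" ").upper()
utf8 = ch.encode("utf-8").hex(" ").upper()

print(code_point, "|", ucs2_be, "|", utf8)
```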


    So, for a file containing only the string LUKE’S :

    • In a legal ANSI encoded file, the bytes of the file are : 4C 55 4B 45 92 53

    • In a legal UTF-8 without BOM encoded file, the bytes of the file are : 4C 55 4B 45 E2 80 99 53 ( ONE byte for each letter, as its code-point is ≤ \x7F, and the THREE bytes E2 80 99 for the Right Single Quotation Mark )

    • In a legal UTF-8 encoded file, the bytes of the file are : EF BB BF 4C 55 4B 45 E2 80 99 53 ( same encoding as above, preceded by three invisible bytes, the BOM [ Byte Order Mark ], which N++ writes at the start of a file encoded as UTF-8 )
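
    The three byte sequences listed above can be reproduced with a short Python sketch. The helper hex_bytes is a hypothetical name, "cp1252" stands in for the ANSI code page, and Python's "utf-8-sig" codec is used here as the equivalent of N++'s UTF-8 ( with BOM ) :

```python
text = "LUKE\u2019S"  # LUKE + Right Single Quotation Mark + S


def hex_bytes(s: str, encoding: str) -> str:
    """Return the bytes of s in the given encoding as spaced upper-case hex."""
    return s.encode(encoding).hex(" ").upper()


ansi = hex_bytes(text, "cp1252")         # one byte per character
utf8_no_bom = hex_bytes(text, "utf-8")   # three bytes for the quotation mark
utf8_bom = hex_bytes(text, "utf-8-sig")  # same, preceded by the BOM

for label, value in [("ANSI", ansi), ("UTF-8 without BOM", utf8_no_bom), ("UTF-8", utf8_bom)]:
    print(f"{label:18} {value}")
```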


    Now, let's imagine that your example text was initially part of a legal ANSI encoded file in N++, and that you used the command ( Menu Encoding -> Encode in UTF-8 without BOM ). You get exactly the text of your attached picture ! Why ? Because that command only re-interprets the current contents of the file as if it were a UTF-8 file : the bytes of the file are NOT modified. So the single byte \x92, which, in UTF-8, is only legal as a continuation byte inside a multi-byte sequence, is detected as an invalid byte and displayed as x92.

    On this matter, see the UTF-8 explanations at the address below :

    http://en.wikipedia.org/wiki/UTF-8#Codepage_layout

    Moreover, remember that the Search/Replacement engine searches for characters and NOT for bytes. So, as a lone byte \x92 is forbidden in a UTF-8 encoded file and doesn't represent any character, you can't find it with any kind of search :-(
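
    To illustrate why a lone \x92 is not a character in UTF-8, here is a small Python sketch. It shows how Python's own UTF-8 decoder reacts to the ANSI bytes of LUKE’S; it is only an analogy for what N++ does internally, not a description of its code :

```python
# ANSI (Windows-1252) bytes of LUKE'S, with 0x92 for the quotation mark.
raw = bytes.fromhex("4C 55 4B 45 92 53")

# A strict UTF-8 decoder rejects the sequence: 0x92 can only appear as a
# continuation byte after a lead byte, never on its own.
try:
    raw.decode("utf-8")
    strict_ok = True
    bad_offset = None
except UnicodeDecodeError as exc:
    strict_ok = False
    bad_offset = exc.start  # position of the offending byte

# A lenient decoder substitutes U+FFFD (replacement character), so the
# decoded text contains no character that a search for \x92 could match.
lenient = raw.decode("utf-8", errors="replace")

print(strict_ok, bad_offset, repr(lenient))
```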

    So, the solution should be :

    • First, RE-encode your file in ANSI ( Menu Encoding -> Encode in ANSI )

    • Secondly, CONVERT your file to UTF-8 without BOM ( Menu Encoding -> Convert to UTF-8 without BOM )

    This last operation will change the contents of the file from 4C 55 4B 45 92 53 to 4C 55 4B 45 E2 80 99 53. That way, the 3 bytes E2 80 99 will be correctly detected as the Right Single Quotation Mark ( = U+2019 )
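
    Outside N++, the same two-step repair ( read the bytes as ANSI, then write them back as UTF-8 ) can be sketched in Python. The function name ansi_to_utf8 is hypothetical, and Windows-1252 is assumed as the ANSI code page; use the code page that matches your system :

```python
def ansi_to_utf8(data: bytes, ansi_codec: str = "cp1252") -> bytes:
    """Re-read bytes as ANSI, then convert the resulting text to UTF-8 (no BOM)."""
    text = data.decode(ansi_codec)  # step 1: interpret the bytes as ANSI
    return text.encode("utf-8")     # step 2: convert the characters to UTF-8


before = bytes.fromhex("4C 55 4B 45 92 53")
after = ansi_to_utf8(before)
print(after.hex(" ").upper())
```

    To apply it to a real file, you would read the file in binary mode, pass its contents to the function and write the result back, ideally to a copy first.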


    As I am French and use the Western European Windows-1252 encoding, I made, some time ago, a complete list of the ANSI, UNICODE and UTF-8 values of the 256 characters of the Windows-1252 table !

    Just see the two attached pictures below, which are screenshots of a Word document !
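
    If the pictures are not available, a rough equivalent of such a list can be generated with Python. This is only a sketch : the function name cp1252_rows is hypothetical, it relies on Python's "cp1252" codec ( which leaves the bytes 81, 8D, 8F, 90 and 9D undefined ), and by default it covers only the range 80 to 9F, where Windows-1252 differs from Latin-1 :

```python
def cp1252_rows(start: int = 0x80, stop: int = 0xA0):
    """Return (ANSI byte, Unicode code point, UTF-8 bytes) for each value."""
    rows = []
    for value in range(start, stop):
        try:
            ch = bytes([value]).decode("cp1252")
        except UnicodeDecodeError:
            # Byte not assigned in Python's Windows-1252 mapping.
            rows.append((f"{value:02X}", "(undefined)", ""))
            continue
        rows.append((
            f"{value:02X}",
            f"U+{ord(ch):04X}",
            ch.encode("utf-8").hex(" ").upper(),
        ))
    return rows


rows = cp1252_rows()
for ansi_hex, code_point, utf8_hex in rows:
    print(f"{ansi_hex}  {code_point:12} {utf8_hex}")
```

    Calling cp1252_rows(0x00, 0x100) would list all 256 values.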

    Cheers,

    guy038

    P.S. :

    Remember that if you hold down the ALT key and type, on the numeric keypad, the four digits of the decimal value of a character of this table ( from 0001 to 0255 ), the corresponding character is inserted at the cursor position once the ALT key is released :-)

     
    Last edit: THEVENOT Guy 2013-12-22