#326 UTF-8 detection problem?

Milestone: Feature_Request
Status: closed-rejected
Owner: Neil Hodgson
Labels: SciTE (619)
Priority: 2
Updated: 2005-07-14
Created: 2005-07-11
Creator: Neo
Private: No

SciTE 1.64 is not able to detect the attached file as
UTF-8 encoded. However, if I select UTF-8 Cookie from the
File/Encoding menu, SciTE displays this file correctly.

Or I may simply be wrong, because SciTE is not designed
to detect this kind of file.

Discussion

1 2 > >> (Page 1 of 2)
  • Neo
    Neo
    2005-07-11

     
    Attachments
  • Neil Hodgson
    Neil Hodgson
    2005-07-12

    Logged In: YES
    user_id=12579

    I don't see how SciTE could tell that this file is UTF-8.
    There is no BOM or cookie.

     
  • Neil Hodgson
    Neil Hodgson
    2005-07-12

    • assigned_to: nobody --> nyamatongwe
    • priority: 5 --> 2
    • status: open --> closed-rejected
     
  • Neo
    Neo
    2005-07-12

    Logged In: YES
    user_id=644683

    I don't know what a 'cookie' is, but if I select UTF-8
    Cookie from the File/Encoding menu, SciTE correctly
    displays this file.

    So, SciTE CAN handle this file.

    BTW, Notepad and Firefox can handle this file correctly. I
    myself even have a piece of code that can detect whether a
    sequence of bytes is valid UTF-8. If you want, I can paste
    my code for you.
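
For readers following the thread, a check of the kind described above might look like the Python sketch below. This is not Neo's code; the function name `is_valid_utf8` is made up for illustration, and the sketch simply leans on Python's strict UTF-8 decoder.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes form a well-formed UTF-8 sequence."""
    try:
        # bytes.decode is strict by default: it rejects truncated
        # sequences, stray continuation bytes and overlong forms.
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


sample = "中文字".encode("utf-8")   # sample text taken from this thread
print(is_valid_utf8(sample))         # well-formed multi-byte sequences
print(is_valid_utf8(sample[:-1]))    # last character truncated
```

Passing this check only shows that the bytes could be UTF-8. It does not rule out other encodings in which the same bytes are also legal.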

     
  • Neil Hodgson
    Neil Hodgson
    2005-07-12

    Logged In: YES
    user_id=12579

    A cookie is an editor convention (defined by one of the
    old editors like emacs or vi) where a comment in the first
    two lines defines the encoding. It looks like (using whatever
    the language likes as a comment indicator)
    # coding: utf-8
    or an XML declaration
    <?xml version='1.0' encoding='utf-8'?>
    Firefox doesn't recognise the file as UTF-8 on my
    machine, probably because the set of characters used is not
    compatible with my locale. Every valid UTF-8 file is also
    valid in many other encodings such as ISO-8859-1 so you
    start relying on statistical properties and this is
    inaccurate. I won't be adding any code like this.
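
As a rough illustration of the cookie convention described above, the Python sketch below looks for a `coding:` comment or an XML declaration in the first two lines of a file. The patterns and the name `find_cookie` are assumptions made for this example; they are not SciTE's actual matching rules.

```python
import re

# Hypothetical patterns, loosely modeled on the two examples in this comment.
CODING_COMMENT = re.compile(rb"coding[:=]\s*([-\w.]+)")
XML_DECLARATION = re.compile(rb"""<\?xml[^>]*encoding=['"]([-\w.]+)['"]""")


def find_cookie(data: bytes):
    """Return the encoding named in the first two lines, or None."""
    head = b"\n".join(data.splitlines()[:2])
    for pattern in (XML_DECLARATION, CODING_COMMENT):
        match = pattern.search(head)
        if match:
            return match.group(1).decode("ascii").lower()
    return None


print(find_cookie(b"# coding: utf-8\nprint('hi')\n"))
print(find_cookie(b"<?xml version='1.0' encoding='utf-8'?>\n<doc/>\n"))
print(find_cookie("中文字".encode("utf-8")))  # no cookie present
```

A file with neither a BOM nor such a declaration gives this kind of check nothing to go on, which is the situation with the attached file.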

     
  • Neo
    Neo
    2005-07-13

    Logged In: YES
    user_id=644683

    What do you mean by saying Firefox is not recognising this file?
    I guess you won't see the 4 Chinese characters because you
    don't have necessary fonts installed.

    Since this file is UTF-8 encoded, there is no such thing
    as 'statistical'. A single UTF-8 file can contain any (?)
    characters from any language defined in UCS-2.

    BTW, I don't quite understand this: 'Every valid UTF-8
    file is also valid in many other encodings such as
    ISO-8859-1'. This is certainly true for normal ASCII files,
    but not necessarily for non-ASCII files.

     
  • Neil Hodgson
    Neil Hodgson
    2005-07-13

    Logged In: YES
    user_id=12579

    Firefox displays this file for me as ISO-8859-1, so the
    Chinese characters appear as "中文字". I do have the
    Chinese fonts installed and Notepad shows the file with
    Chinese text as does Firefox if UTF-8 is chosen from the
    encoding menu.
    A file is a sequence of 8 bit bytes. In most encodings such
    as ISO-8859-1 (Western Europe), KOI8-R (Russia) and
    Windows-1253 (Greek) every byte is valid and so are all
    sequences of bytes. This is not true for ASCII which is a 7
    bit encoding so ASCII files may not contain values larger
    than 127.
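
The point about byte sequences can be tried out with a short Python sketch. It assumes the three characters quoted above are representative of the attached file, which is not reproduced here, and it relies on Python's `iso-8859-1` and `koi8-r` codecs, both of which map all 256 byte values.

```python
data = "中文字".encode("utf-8")   # 9 bytes, every one of them above 127

# Both decodes succeed: each byte maps to some character in these
# single-byte encodings, so the file is "valid" in them too.
as_latin1 = data.decode("iso-8859-1")
as_koi8r = data.decode("koi8-r")


def decodes_as_ascii(raw: bytes) -> bool:
    """ASCII is 7-bit, so any byte above 127 makes decoding fail."""
    try:
        raw.decode("ascii")
    except UnicodeDecodeError:
        return False
    return True


print(len(data), len(as_latin1), len(as_koi8r))
print(decodes_as_ascii(data))
```

Each single-byte decode yields one character per byte, so the text comes out as accented letters and symbols rather than Chinese characters, yet no error is raised.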

     
  • Neo
    Neo
    2005-07-13

    • status: closed-rejected --> open-rejected
     
  • Neo
    Neo
    2005-07-13

    Logged In: YES
    user_id=644683

    Let's forget other things and focus on 'Can you make SciTE
    auto-detect this kind of file (UTF-8 without BOM)'.

     
  • Neil Hodgson
    Neil Hodgson
    2005-07-14

    • milestone: --> Feature_Request
    • labels: --> SciTE
    • status: open-rejected --> closed-rejected
     