
#2979 detect UTF16LE nBOM

Milestone: Trunk
Status: closed-rejected
Owner: nobody
Priority: 5
Updated: 2012-12-31
Created: 2010-09-13
Creator: Matthias
Private: No

WinCE handles UTF-16LE without a BOM.
With this patch, WinMerge can handle that as well.

Discussion

  • Matthias

    Matthias - 2010-09-13

    codepage_detect

     
  • Kimmo Varis

    Kimmo Varis - 2010-09-13

    This looks fragile! If you don't know the Unicode encoding, you assume UTF-16LE? What if there are UTF-16BE files without a BOM? If you open the door to UTF-16 without BOM bytes, you really need to handle all of it, not just part. And I'm a bit skeptical that it can be done... I think the solution is to let the user select the encoding for the file, so they can pick it when they know it...

     
  • Matthias

    Matthias - 2010-09-13

    Other editors do it this way too.
    Of course we can detect UTF-16BE as well; so far I have not found such a file.
    But the approach should be the same.

     
  • Matthias

    Matthias - 2010-09-14

    codepage_detect

     
  • Matthias

    Matthias - 2010-09-14

    foldercomp_7261.patch

     
  • Matthias

    Matthias - 2010-09-14

    I changed it, but I still could not find a real UTF-16BE file without BOM.
    So I created one myself using Notepad++ and Frhed.

    I found that files with no BOM are shown as binary files.
    So I added a second patch for the folder compare. There I take the encoding from m_FileLocation.encoding. Now the view is correct.
    Note that IsTextUnicode() could not detect UTF-16BE without BOM, so I started another kind of detection. I hope it works correctly.

     
  • Kimmo Varis

    Kimmo Varis - 2010-09-15

    I don't know how many times I have to repeat this, but I really don't see how it could work reliably.

    And you still haven't named any platform other than WinCE that uses such files, and WinCE is marginal. That case is better handled in the GUI than by automatic detection, which may fail. My worry is that we detect files as UTF-16 that aren't UTF-16.

     
  • Matthias

    Matthias - 2010-09-15

    That is the only platform I know of. We also use it at my company, so for me it is daily work.
    How many PDAs with WinCE are available?

    UTF-16LE can be detected by IsTextUnicode().
    But you wanted me to add UTF-16BE as well, and there IsTextUnicode() fails.
    Just unknown. Sorry, I can't say anything different.
    Now I check the first three characters for Unicode, looking for 0x00 at the proper position.
    That should be safe enough.
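
The prefix check described in this comment (a 0x00 byte at the expected position in each of the first three characters) could be sketched roughly as below. This is a Python illustration, not the code from the patch: the function name `guess_utf16_from_prefix` and the `units` parameter are hypothetical, and the check assumes the file begins with characters below U+0100, which is the assumption that makes the heuristic fragile.

```python
from typing import Optional


def guess_utf16_from_prefix(data: bytes, units: int = 3) -> Optional[str]:
    """Guess the UTF-16 byte order from the first `units` 16-bit code units.

    Hypothetical helper: returns "utf-16-le", "utf-16-be", or None.
    """
    if len(data) < 2 * units:
        return None  # not enough data to look at

    prefix = data[: 2 * units]
    even = prefix[0::2]  # bytes at offsets 0, 2, 4, ...
    odd = prefix[1::2]   # bytes at offsets 1, 3, 5, ...

    # Little endian: low byte first, so the zero high byte sits at odd offsets.
    if all(b == 0 for b in odd) and all(b != 0 for b in even):
        return "utf-16-le"
    # Big endian: zero high byte first, so it sits at even offsets.
    if all(b == 0 for b in even) and all(b != 0 for b in odd):
        return "utf-16-be"
    return None


print(guess_utf16_from_prefix("abc".encode("utf-16-le")))
print(guess_utf16_from_prefix("abc".encode("utf-16-be")))
print(guess_utf16_from_prefix(b"abcdef"))
```

Text that starts with characters outside that range (Cyrillic or CJK, for example) has no zero bytes in its first units, so this sketch returns None for it.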

     
  • Kimmo Varis

    Kimmo Varis - 2010-09-15
    • status: open --> open-rejected
     
  • Kimmo Varis

    Kimmo Varis - 2010-09-15

    What if you read the description of the function you say is doing the magic trick?

    From MSDN:
    http://msdn.microsoft.com/en-us/library/dd318672(VS.85).aspx
    IsTextUnicode Function:
    Determines if a buffer is likely to contain a form of Unicode text.

    Does it sound like a reliable detection? I don't want to see dozens of bug reports about WinMerge detecting binary files as UTF-16. And if you think about how UTF-16 works, it should be pretty clear there is no reliable way. Come back when you have something remotely reliable.

     
  • Matthias

    Matthias - 2010-09-16

    codepage_detect

     
  • Matthias

    Matthias - 2010-09-16

    I got it working by checking the last parameter
    for IS_TEXT_UNICODE_STATISTICS | IS_TEXT_UNICODE_REVERSE_STATISTICS.
    UTF-8 and ANSI files return IS_TEXT_UNICODE_ILLEGAL_CHARS only.
    As far as I can tell from the comments, the function seems to check only the first 256 characters.
    Note our editor can't edit files like UTF-16BE!
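
For readers without the Win32 API at hand, the statistical idea behind the two flags can be approximated portably. The sketch below is not the IsTextUnicode algorithm, whose internals are not described in this thread; it only counts zero bytes at even and odd offsets in a leading sample. The function name, the 512-byte sample (chosen to match the "first 256 characters" remark above) and the 0.7 threshold are all assumptions made for illustration.

```python
from typing import Optional


def guess_utf16_by_statistics(
    data: bytes, sample_size: int = 512, threshold: float = 0.7
) -> Optional[str]:
    """Guess UTF-16 byte order from zero-byte statistics (hypothetical helper)."""
    sample = data[:sample_size]
    sample = sample[: len(sample) - (len(sample) % 2)]  # whole 16-bit units only
    pairs = len(sample) // 2
    if pairs == 0:
        return None

    even_zeros = sample[0::2].count(0)  # zero bytes at offsets 0, 2, 4, ...
    odd_zeros = sample[1::2].count(0)   # zero bytes at offsets 1, 3, 5, ...

    # Mostly-Latin UTF-16LE text: high (odd) bytes are zero, low bytes are not.
    if odd_zeros >= threshold * pairs and even_zeros <= (1 - threshold) * pairs:
        return "utf-16-le"
    # The byte-reversed pattern suggests UTF-16BE.
    if even_zeros >= threshold * pairs and odd_zeros <= (1 - threshold) * pairs:
        return "utf-16-be"
    return None


print(guess_utf16_by_statistics("Hello, WinCE!".encode("utf-16-le")))
print(guess_utf16_by_statistics("Hello, WinCE!".encode("utf-16-be")))
print(guess_utf16_by_statistics(b"Hello, WinCE!"))
```

Like any statistical test, this is a guess: text dominated by non-Latin scripts has few zero bytes and is not recognized, and a binary file with the right zero-byte pattern would be.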

     
  • Kimmo Varis

    Kimmo Varis - 2010-09-17

    >I got it working by checking the last parameter
    Define "working"? According to the documentation it is still guesswork. And which part of my last comment about reliable detection did you not understand? Yes, the situation is a bit similar to UTF-8 detection without BOM, but there a wrong detection does not cause much harm. Detecting binary files as Unicode text files, however, probably causes not-so-nice side effects...

    > Note our editor can't edit files like UTF-16BE!
    I've told you that you are opening new doors...

     
  • Matthias

    Matthias - 2010-09-17

    >But detecting binary files as Unicode text files probably causes not-so-nice
    >side effects.
    We can live with that. The opposite would be worse.

    >I've told you that you are opening new doors...
    No, I'm not. We have already been detecting that for years.
    We can compare in the folder view without problems, of course.
    Also remember that you asked me to do it (UTF-16BE without BOM);
    see your comment from 2010-09-13.

     
  • Jochen Tucht

    Jochen Tucht - 2010-09-18

    WinMerge is primarily targeted at processing text files. The initial assumption about any file should therefore be that it is some kind of text. Only if it proves otherwise, or if the user explicitly instructs WinMerge otherwise, should WinMerge handle it as binary. So yes, trying to detect UCS-2 w/o BOM makes perfect sense.

    > Note our editor can't edit files like UTF-16BE!

    Surprisingly, support for UCS-2BE with BOM exists in WinMerge. It is true that WinMerge does not support UTF-16, regardless of endianness. UCS-2 is a subset of UTF-16 that covers only the code points below 0x10000.
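
The practical difference between the two is surrogate pairs: UTF-16 encodes code points above U+FFFF as two 16-bit units in the range 0xD800 to 0xDFFF, which UCS-2 does not have. A rough way to tell whether a buffer needs full UTF-16 handling is to look for such units. The sketch below is in Python rather than WinMerge's own code, and the function name `has_surrogates` and its `little_endian` parameter are hypothetical.

```python
import struct


def has_surrogates(data: bytes, little_endian: bool = True) -> bool:
    """Return True if any 16-bit unit lies in the surrogate range (hypothetical helper)."""
    count = len(data) // 2  # ignore a trailing odd byte
    byte_order = "<" if little_endian else ">"
    units = struct.unpack(f"{byte_order}{count}H", data[: count * 2])
    return any(0xD800 <= unit <= 0xDFFF for unit in units)


# Plain BMP text fits in UCS-2; a character above U+FFFF does not.
print(has_surrogates("abc\u20ac".encode("utf-16-le")))
print(has_surrogates("\U0001F600".encode("utf-16-le")))
```

A buffer with no surrogate units can be handled as UCS-2 without loss; one that contains them requires an editor that treats each pair as a single character.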

     
  • Kimmo Varis

    Kimmo Varis - 2010-09-18

    Is this really so hard to understand?

    I'm not against trying to detect whatever format.

    I'm against doing it in way that is not *RELIABLE*.

    > We can live with that. The opposite would be worse.
    And how exactly will a binary file break WinMerge? The binary file gets opened in the binary file editor. Sending random binary data to the text editor is not what we want.

     
  • Jochen Tucht

    Jochen Tucht - 2010-09-18

    You cannot reliably tell that a file is definitely all text, not even if you find a BOM, nor if you find no null bytes at all. What you can reliably tell is that a file is definitely not all text if you find content that does not represent text. So there is a bit of guesswork involved no matter how you put it, and we have been living with it. In that respect, trying to detect UCS-2 w/o BOM is no conceptual change.

    > Sending random binary data to the text editor is not what we want.

    It is unlikely that random binary data would pass the test for UCS-2 w/o BOM.
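
The asymmetry described in this comment (content can prove that a file is not text, but never that it is) can be illustrated with a small rejection test. This is a Python sketch under stated assumptions, not WinMerge's logic: the function name `could_be_utf16_text` is hypothetical, and treating every control character other than tab, CR and LF as "not text" is a choice made for the example.

```python
def could_be_utf16_text(data: bytes, encoding: str) -> bool:
    """Return False when the bytes definitely are not text in `encoding`.

    Hypothetical helper; `encoding` is expected to be "utf-16-le" or "utf-16-be".
    A True result only means that nothing ruled the data out.
    """
    if len(data) % 2:
        return False  # UTF-16 data is made of whole 16-bit units
    try:
        text = data.decode(encoding)  # strict mode rejects unpaired surrogates
    except UnicodeDecodeError:
        return False
    allowed_controls = {"\t", "\n", "\r"}
    # Any other C0 control character is taken as evidence of binary content.
    return all(ch in allowed_controls or ord(ch) >= 0x20 for ch in text)


print(could_be_utf16_text("line one\r\nline two".encode("utf-16-le"), "utf-16-le"))
print(could_be_utf16_text(b"\x00\xd8a\x00", "utf-16-le"))  # unpaired surrogate
print(could_be_utf16_text(b"\x01\x00\x02\x00", "utf-16-le"))  # control characters
```

Passing this test confirms nothing: many arbitrary byte pairs decode to valid, if meaningless, characters, so a rejection test like this is best combined with a positive indicator such as the zero-byte pattern discussed earlier in the thread.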

     
  • Kimmo Varis

    Kimmo Varis - 2010-09-18

    Jochen, I know damn well that we are not reliably detecting - I get e-mails from all those bugs. And I was the one who implemented detection of UTF-8 files without BOM (after years of discussion)... I took part in adding Unicode support to WinMerge years back.

    So yes, we have some uncertainty already. But this patch is adding MORE of it. I haven't seen any info about which kinds of files might not be detected correctly, what problems there might be in this proposed detection, etc. The normal critical look at an implementation. All I have heard claimed is that we now have some magic function doing the detection reliably. Which is not true. And which I've objected to.

    If this was presented honestly with facts I probably wouldn't have a problem with it. But trying to hide facts is not the right thing to do.

     
  • Jochen Tucht

    Jochen Tucht - 2011-11-04

    This will go into next WinMerge 2011 beta. Thanks for this submission.

     
  • Christian List

    Christian List - 2012-12-31
    • status: open-rejected --> closed-rejected
     
