WinCE handles UTF-16LE files without a BOM.
With this patch WinMerge can handle them as well.
This looks fragile! If you don't know the Unicode encoding, you assume UTF-16LE? What if there are UTF-16BE files without a BOM? If you open the door for UTF-16 without BOM bytes, you really need to handle all of it, not just part of it. And I'm a bit skeptical that it can be done... I think the solution is to let the user select the encoding for the file, so it can be chosen when it is known...
It is also done that way by other editors.
Of course we can detect UTF-16BE as well; so far I have not found such a file.
But the approach should be the same.
I changed it. I still could not find a real UTF-16BE file without BOM.
So I created one myself using Notepad++ and Frhed.
I found that files with no BOM are shown as binary files.
So I added a second patch, folder.patch. There I take the encoding from m_FileLocation.encoding. Now the view is correct.
Note that IsTextUnicode() could not detect UTF-16BE without BOM, so I started another kind of detection. I hope it works correctly.
I don't know how many times I have to repeat this, but I really don't see how it could work reliably.
And you still haven't named any platform other than WinCE that uses such files, and WinCE is marginal. That case is better handled in the GUI than by automatic detection, which may fail. My worry is that we detect files as UTF-16 that aren't UTF-16.
I only know that platform. We also use it in my company, so for me it's daily work.
How many PDAs are available with WinCE?
UTF-16LE can be detected by IsTextUnicode().
But you wanted me to add UTF-16BE as well. Here the function IsTextUnicode() fails.
That is simply unknown. Sorry, I can't say anything different.
Now I check the first three characters for Unicode, looking for 0x00 at the proper position.
That should be safe enough.
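For readers following the thread, the check described above can be sketched roughly as follows. This is not the code from the patch; `Utf16Guess` and `guessUtf16NoBom` are hypothetical names, and the sketch assumes the file starts with three characters below U+0100, so that each 16-bit code unit has exactly one zero byte.

```cpp
#include <cstddef>

// Hypothetical result type for this sketch; not a WinMerge type.
enum class Utf16Guess { None, LittleEndian, BigEndian };

// Guess the byte order of BOM-less UTF-16 from the first three code units.
// Assumption: the text starts with characters below U+0100, so the high
// byte of each code unit is 0x00 and the low byte is non-zero.
Utf16Guess guessUtf16NoBom(const unsigned char* buf, std::size_t len)
{
    const std::size_t kUnits = 3;
    if (buf == nullptr || len < kUnits * 2)
        return Utf16Guess::None;

    bool le = true;  // pattern "xx 00" for every unit
    bool be = true;  // pattern "00 xx" for every unit
    for (std::size_t i = 0; i < kUnits; ++i) {
        const unsigned char first = buf[2 * i];
        const unsigned char second = buf[2 * i + 1];
        if (!(first != 0 && second == 0)) le = false;
        if (!(first == 0 && second != 0)) be = false;
    }
    if (le) return Utf16Guess::LittleEndian;
    if (be) return Utf16Guess::BigEndian;
    return Utf16Guess::None;
}
```

Such a check is still a guess: a binary file that begins with three small 16-bit little-endian integers (bytes 01 00 02 00 03 00) matches the little-endian pattern as well, and text that starts with non-Latin characters matches neither pattern.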
What if you read the description of the function you say is doing the magic trick?
Determines if a buffer is likely to contain a form of Unicode text.
Does it sound like a reliable detection? I don't want to see tens of bugs of WinMerge detecting binary files as UTF-16. And if you think about how UTF-16 works it should be pretty clear there is no reliable way. Come back when you have something remotely reliable.
I got it working by checking the last parameter
for IS_TEXT_UNICODE_STATISTICS | IS_TEXT_UNICODE_REVERSE_STATISTICS.
UTF-8 and ANSI files return IS_TEXT_UNICODE_ILLEGAL_CHARS only.
As far as I can tell from the comments, the function seems to check only the first 256 characters.
Note that our editor can't edit files like UTF-16BE!
> I got it working by checking the last parameter
Define "working"? According to the documentation it is still guesswork. And which part of my last comment about reliable detection did you not understand? Yes, the situation is a bit similar to UTF-8 detection without BOM, but there a wrong detection does not cause much harm. But detecting binary files as Unicode text files probably causes not so nice side-effects...
> Note that our editor can't edit files like UTF-16BE!
I've told you that you are opening new doors...
>But detecting binary files as Unicode text files probably causes not so nice
We can live with that. The opposite would be worse.
>I've told you that you are opening new doors...
No, I'm not. We have already been detecting that for years.
Of course, we can compare in folders without problems.
Also remember that you asked me to do it (UTF-16BE without BOM).
See your comment of 2010-09-13.
WinMerge is primarily targeted at processing text files. The initial assumption on whatever file should therefore be that it is some kind of text. Only if it proves otherwise, or if the user explicitly instructs WinMerge otherwise, WinMerge should handle it as binary. So yes, trying to detect UCS-2 w/o BOM makes perfect sense.
> Note that our editor can't edit files like UTF-16BE!
Surprisingly, support for UCS-2BE with BOM exists in WinMerge. It is true that WinMerge does not support UTF-16, regardless of endianness. UCS-2 is a subset of UTF-16, which includes only the code points below 0x10000.
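To make the UCS-2 versus UTF-16 distinction concrete: UTF-16 encodes characters outside the Basic Multilingual Plane as pairs of surrogate code units in the range 0xD800 to 0xDFFF, which UCS-2 cannot represent. A minimal sketch, with hypothetical helper names that are not taken from WinMerge:

```cpp
#include <cstddef>
#include <cstdint>

// True if the 16-bit code unit lies in the UTF-16 surrogate range.
bool isSurrogate(std::uint16_t unit)
{
    return unit >= 0xD800 && unit <= 0xDFFF;
}

// True if the buffer contains no surrogates, i.e. every code unit is a
// single BMP character and the text can be handled as plain UCS-2.
bool fitsInUcs2(const std::uint16_t* units, std::size_t count)
{
    for (std::size_t i = 0; i < count; ++i) {
        if (isSurrogate(units[i]))
            return false;
    }
    return true;
}
```

Under this sketch, a buffer holding U+0041 and U+20AC fits in UCS-2, while a buffer holding the surrogate pair 0xD83D 0xDE00 does not.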
Is this really so hard to understand?
I'm not against trying to detect whatever format.
I'm against doing it in way that is not *RELIABLE*.
> We can live with that. The opposite would be worse.
And how exactly will binary file break WinMerge? The binary file gets opened in binary file editor. Sending random binary data to the text editor is not what we want.
You cannot reliably tell that a file is definitely all text, not even if you find a BOM, nor if you find no null bytes at all. What you can reliably tell is that a file is definitely not all text if you find content that does not represent text. So there is a bit of guesswork involved no matter how you put it, and we have been living with it. In that respect, trying to detect UCS-2 w/o BOM is no conceptual change.
> Sending random binary data to the text editor is not what we want.
It is unlikely that random binary data would pass the test for UCS-2 w/o BOM.
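The asymmetry described above (text can never be proven, but non-text sometimes can) could look like this for a UTF-16 candidate. This is only an illustrative sketch, not WinMerge code: `definitelyNotUtf16Text` is a hypothetical name, the buffer is assumed to hold the whole file, and treating U+0000 as non-text is a policy choice made for this example.

```cpp
#include <cstddef>
#include <cstdint>

// Returns true if the bytes cannot be well-formed UTF-16 text in the given
// byte order. Returning false proves nothing; it only means that no
// evidence against text was found.
bool definitelyNotUtf16Text(const unsigned char* buf, std::size_t len,
                            bool bigEndian)
{
    if (len % 2 != 0)
        return true;                 // UTF-16 needs whole 16-bit units

    bool expectLow = false;          // previous unit was a high surrogate
    for (std::size_t i = 0; i < len; i += 2) {
        const std::uint16_t unit = bigEndian
            ? static_cast<std::uint16_t>((buf[i] << 8) | buf[i + 1])
            : static_cast<std::uint16_t>((buf[i + 1] << 8) | buf[i]);
        const bool high = unit >= 0xD800 && unit <= 0xDBFF;
        const bool low = unit >= 0xDC00 && unit <= 0xDFFF;

        if (expectLow) {
            if (!low) return true;   // high surrogate without its partner
            expectLow = false;
        } else if (low) {
            return true;             // low surrogate with no preceding high
        } else if (high) {
            expectLow = true;
        } else if (unit == 0x0000 || unit == 0xFFFE || unit == 0xFFFF) {
            return true;             // NUL (policy choice) or noncharacter
        }
    }
    return expectLow;                // pair cut off at the end of the data
}
```

Note that the bytes of "Hi" in UTF-16LE pass this check in both byte orders, since read as big-endian they form two valid CJK code points. A check like this can rule files out, but it cannot settle which encoding a file that passes actually uses.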
Jochen, I know damn well that we are not reliably detecting - I get e-mails from all those bugs. And I was the one who implemented the detection of UTF-8 files without BOM (after years of discussion)... I took part in adding Unicode support to WinMerge years back.
So yes, we have some uncertainty already. But this patch is adding MORE of it. I haven't seen any info about which kinds of files might not be detected correctly, what problems there might be in the proposed detection, etc. That is, the normal critical view of an implementation. All I have heard claimed is that we now have some magic function doing the detection reliably. Which is not true. And which I've objected to.
If this had been presented honestly with facts, I probably wouldn't have a problem with it. But trying to hide facts is not the right thing to do.
This will go into next WinMerge 2011 beta. Thanks for this submission.