From: Angel T. <fn...@fm...> - 2011-02-15 13:29:36
|
Hello.

I'm using KDiff3 0.9.92 (with KDE 3.5.10 on Debian Lenny) with the encoding set to UTF-8 for everything (A, B, C, merge output and saving, etc.) and "Auto Detect Unicode" checked. With these settings I accidentally merged two CP-1251-encoded files and saved the result. Now I cannot figure out how to "decode" the result properly and, unfortunately, the original files are gone. Does anyone have an idea on decoding the result?

Thanks in advance,
Angel Tsankov |
From: Joachim E. <joa...@gm...> - 2011-02-15 20:41:12
|
Hi Angel,

I'm not quite sure that I understand the problem correctly.

When reading as UTF8 and writing the same data as UTF8 then I would not expect many changes, because except for a few places everything should stay the same, regardless of what codec is really used as input. Yet you wouldn't write if there were no problems.

Could you repeat this with a test file and send the original and modified versions?

Cheers,
Joachim |
From: Angel T. <fn...@fm...> - 2011-02-15 21:19:01
Attachments:
test.files.tar.gz
|
On 02/15/11 22:41, Joachim Eibl wrote:
> Hi Angel,
>
> I'm not quite sure that I understand the problem correctly.
>
> When reading as UTF8 and writing the same data as UTF8 then I would not expect
> many changes, because except for a few places everything should stay the same,
> regardless of what codec is really used as input.

I forgot to mention that the original file contains Cyrillic characters.

[...]

> Could you repeat this with a test file and send the original and modified
> versions?

See the attached archive. It contains two files. The first one lists the lowercase letters of the Bulgarian alphabet (30 in total) followed by an LF character. The second one was generated by merging the first one with a copy of itself (both opened as UTF8 files) and saving the output as UTF8. A binary editor shows that the output file contains the same character repeated 30 times, followed by an LF character.

Regards,
Angel Tsankov |
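The test described above can be approximated outside KDiff3. The sketch below is only an illustration, not what KDiff3 or Qt actually executes: it assumes that Python's UTF-8 decoder with `errors="replace"` behaves like the decoder used when the file was opened, and the names `BULGARIAN_LOWERCASE` and `misread_roundtrip` are made up for this example. In CP-1251 the 30 letters occupy single bytes in the range 0xE0 to 0xFF, none of which forms a valid UTF-8 sequence on its own, so each one is replaced by U+FFFD on reading.

```python
# Hypothetical reproduction; Python's decoder stands in for Qt's here.
BULGARIAN_LOWERCASE = "абвгдежзийклмнопрстуфхцчшщъьюя"  # 30 letters


def misread_roundtrip(text: str) -> bytes:
    """Encode as CP-1251, misread as UTF-8, then save as UTF-8."""
    original = (text + "\n").encode("cp1251")
    # Every invalid byte becomes U+FFFD (the replacement character).
    decoded = original.decode("utf-8", errors="replace")
    return decoded.encode("utf-8")


saved = misread_roundtrip(BULGARIAN_LOWERCASE)
print(len(saved), saved[:6].hex())  # 91 efbfbdefbfbd
```

Under that assumption the saved file holds 30 copies of the three bytes EF BF BD plus the LF, which matches the "same character repeated 30 times" seen in the binary editor.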
From: Joachim E. <joa...@gm...> - 2011-02-16 19:53:54
|
Hi Angel,

No good news: when looking at the output in hex (e.g. via od -t x cp1251.saved.as.utf8.txt) you see that there is no useful information left anymore.

    0000000 efbdbfef bfefbdbf bdbfefbd efbdbfef
    0000020 bfefbdbf bdbfefbd efbdbfef bfefbdbf
    0000040 bdbfefbd efbdbfef bfefbdbf bdbfefbd
    0000060 efbdbfef bfefbdbf bdbfefbd efbdbfef
    0000100 bfefbdbf bdbfefbd efbdbfef bfefbdbf
    0000120 bdbfefbd efbdbfef 000abdbf

So in your concrete situation I can't do much for you.

But I must admit that I was not aware of that problem. As I mentioned before, I expected no irreversible conversion loss, but now I think that Qt internally converts to 16 bit although UTF8 allows 32 bit characters. So most random combinations will result in an "invalid" character.

I will try to detect this and display a warning in KDiff3 for such situations.

Thanks for telling! I really do hope you find some backup.

Joachim |
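For readers who want to check the dump themselves, here is a minimal sketch that turns the `od -t x` words back into bytes. It assumes the dump was produced on a little-endian machine (so each 4-byte word is printed in reversed byte order) and that the trailing `00` in the last word is padding added by `od`; the helper name `od_words_to_bytes` is made up for this example.

```python
# The dump from the message above, copied verbatim.
OD_DUMP = """\
0000000 efbdbfef bfefbdbf bdbfefbd efbdbfef
0000020 bfefbdbf bdbfefbd efbdbfef bfefbdbf
0000040 bdbfefbd efbdbfef bfefbdbf bdbfefbd
0000060 efbdbfef bfefbdbf bdbfefbd efbdbfef
0000100 bfefbdbf bdbfefbd efbdbfef bfefbdbf
0000120 bdbfefbd efbdbfef 000abdbf
"""


def od_words_to_bytes(dump: str) -> bytes:
    """Rebuild the byte stream from 4-byte little-endian hex words."""
    out = bytearray()
    for line in dump.splitlines():
        fields = line.split()
        for word in fields[1:]:  # fields[0] is the octal offset
            out += bytes.fromhex(word)[::-1]  # undo the byte swap
    # Assumption: trailing NUL bytes are od padding, not file content.
    return bytes(out).rstrip(b"\x00")


data = od_words_to_bytes(OD_DUMP)
text = data.decode("utf-8")
print(len(data), text.count("\ufffd"), text.endswith("\n"))  # 91 30 True
```

The three bytes EF BF BD are the UTF-8 encoding of U+FFFD, the Unicode replacement character. Since all 30 letters were mapped to that one code point, nothing in the saved file distinguishes one original letter from another.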
From: Angel T. <fn...@fm...> - 2011-02-17 21:17:14
|
On 02/16/11 21:54, Joachim Eibl wrote:
> [...]
>
> I will try to detect this and display a warning in KDiff3 for such situations.
>
> Thanks for telling! I really do hope you find some backup.

Hello Joachim,

Don't worry. The file I lost in this incident was not too long and I was able to remember its contents. Despite this little problem, I am still a happy user of KDiff3. I also think that the suggested warning (or something similar) is highly desirable in this situation.

Best regards,
Angel Tsankov |
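One way such a warning could work is to test, before decoding, whether the input is valid in the selected codec at all. The sketch below is not KDiff3's implementation (which is C++ on top of Qt); it only illustrates the idea in Python, and the function names `decodes_cleanly` and `check_before_merge` are hypothetical.

```python
def decodes_cleanly(data: bytes, codec: str = "utf-8") -> bool:
    """Return True if every byte of data is valid in the given codec."""
    try:
        data.decode(codec, errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def check_before_merge(data: bytes, codec: str = "utf-8") -> str:
    """Produce a warning text instead of silently replacing characters."""
    if decodes_cleanly(data, codec):
        return "ok"
    return (f"warning: input is not valid {codec}; "
            "saving the result may destroy characters irreversibly")


cp1251_input = "абв\n".encode("cp1251")
print(check_before_merge(cp1251_input))
print(check_before_merge("абв\n".encode("utf-8")))
```

A strict check like this flags the CP-1251 test file immediately, because none of its letter bytes forms a valid UTF-8 sequence. It does not identify the real encoding, so the user would still have to pick the right codec.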