AlphaCocoa / Tickets / #240 Encoding

James Connolly - 2020-10-24

p.s. to be extra clear: I don't want to change the file open pref. to default to "ISO-8859" either, since this would instead introduce popups for the UTF-8 files I open.

Bernard Desgraupes - 2020-11-20

Hi James,
thank you for the suggestion. The problem with the file command is that it is not sufficiently accurate or reliable. For instance, I tested it (with option -I to get a MIME type) on a file encoded in MacRoman and got:
text/plain; charset=unknown-8bit
I tested it on a file containing Greek text written in ISO-8859-7 and got:
text/plain; charset=iso-8859-1 (sic, 8859-1, not 8859-7).
I got the same wrong result with a Russian text in ISO-8859-5.
I'm afraid this would lead to wrong decisions that would be quite frustrating for the user.
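
One reason guessing is hard can be sketched in a few lines of Python (used here only for illustration; the sample word is an assumption, not one of the files tested above). Bytes written in one single-byte encoding also decode without error in others, so a successful decode says nothing about which encoding was intended:

```python
# Greek sample text stored as ISO-8859-7 bytes (illustrative assumption).
original = "Ελληνικά"
data = original.encode("iso-8859-7")

# Each of these decodes "succeeds", so success alone proves nothing.
candidates = ["iso-8859-7", "iso-8859-1", "mac_roman"]
decoded = {name: data.decode(name) for name in candidates}

for name, text in decoded.items():
    print(f"{name:12} -> {text}")
```

Only the first line printed shows the original word; the other two are equally "valid" readings of the same bytes, just with the wrong letters.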

- Joachim Kock - 2020-11-20
  
  Hi,
  
  I would like to suggest a more pragmatic solution,
  namely a second option prefs setting.
  
  For example, I would set UTF-8 as my first option
  and MacRoman as my second option, and Alpha would
  try UTF-8 first and, if it doesn't work, try
  MacRoman (without presenting the dialogue to ask
  me to choose another encoding). It would work fine
  most of the time. Only very rarely would I have to
  open a Windows file (and I would be willing to
  deal with that case in a more manual way, if I
  could have a more automatic treatment of all my
  old files in MacRoman).
  
  James would select ISO-8859 as second option,
  and it would seem to do the job for him.
  
  People having many files of different encodings
  would simply leave the second option undefined,
  to avoid automatic mistakes.
  
  If it is deemed too unsafe to open files automatically
  without really being sure, another possibility
  would be to use the second-option pref simply
  to pre-set the default item in the pop-up of
  the dialogue. That would at least allow the user to
  proceed very quickly by hitting Return for OK,
  without having to dig into the pop-up with the mouse.
  
  Cheers,
  Joachim.
  
  Last edit: Bernard Desgraupes 2020-11-20
  
Bernard Desgraupes - 2020-11-20

Hi Joachim,
I agree, it is an elegant solution.
I'll try to implement it.

- Fischlin Andreas - 2020-11-20
  
  Dear all,
  
  Perhaps that’s a solution that means less effort when opening older files, indeed. I would basically welcome having less to do when opening files.

  However, I am not fully convinced that simply defining a default encoding is always what we want, for the following reasons, my default encoding being UTF-8:

  (i) I nevertheless like to be reminded that the file is not encoded in UTF-8 but in, e.g., MacRoman, since I prefer to “convert” all my files sooner or later to UTF-8 encoding whenever I modify them. When you implement this solution, perhaps a preference can be added that still warns me that the file was opened in a non-UTF-8 encoding. If someone does not want such reminders, the preference can be set to give no such warnings.

  (ii) Another option when implementing this solution might be a preference for automatic conversion, i.e. the file is opened without warning and silently converted to UTF-8, a choice you might have to confirm when you save it, where you could refuse to save it in UTF-8 if you have second thoughts about this auto conversion to the UTF-8 encoding.

  Perhaps the above arguments are not particularly valid, since I may not have understood well what is proposed. In any case I have to admit that it is not particularly clear what is meant by "Default encoding : Check file on open". Does this mean I can define what my default encoding is (my understanding)? And do I have a preference for whether the mismatch triggers the current dialog or not? The latter might be my point (i) above.
  
  Regards,
  Andreas
  
  ETH Zurich
  Prof. em. Dr. Andreas Fischlin
  IPCC Vice-Chair WGII
  Systems Ecology - Institute of Biogeochemistry and Pollutant Dynamics
  CHN E 24
  Universitaetstrasse 16
  8092 Zurich
  SWITZERLAND
  
  andreas.fischlin@env.ethz.ch
  http://www.sysecol.ethz.ch/people/andreas.fischlin.hml
  
  +41 44 633-6090 phone
  +41 44 633-1136 fax
  +41 79 595-4050 mobile
  
  Make it as simple as possible, but distrust it!
  
  Last edit: Bernard Desgraupes 2020-11-20
  

Bernard Desgraupes - 2020-11-20

If I understand Joachim's proposal correctly, this SecondaryEncoding would be empty by default, which would correspond to the current behavior (which is what Andreas prefers), but Joachim himself would set it to MacRoman, and James would set it to ISO-8859-1 (aka Latin1). So everybody would be happy.
But if Joachim thereafter tries to open a Latin1 file, this file will be silently opened in MacRoman, giving wrong characters for all the accented letters: he would have to use the Open File command and explicitly set the encoding to Latin1 in the dialog.
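
The behavior described above can be sketched as follows. This is only an illustration in Python, not Alpha's code: the function name `decode_with_fallback` and its parameters are hypothetical, and returning `(None, None)` stands for "show the encoding dialog".

```python
from typing import Optional, Tuple


def decode_with_fallback(
    data: bytes,
    primary: str = "utf-8",
    secondary: Optional[str] = None,
) -> Tuple[Optional[str], Optional[str]]:
    """Return (text, encoding_used), or (None, None) if the user must be asked."""
    try:
        return data.decode(primary), primary
    except UnicodeDecodeError:
        pass
    if secondary is None:
        # Empty secondary encoding: same as the current behavior (ask the user).
        return None, None
    try:
        return data.decode(secondary), secondary
    except UnicodeDecodeError:
        return None, None


latin1_file = "café".encode("latin-1")  # b'caf\xe9', not valid UTF-8

print(decode_with_fallback(latin1_file))                         # dialog needed
print(decode_with_fallback(latin1_file, secondary="latin-1"))    # James's setting
print(decode_with_fallback(latin1_file, secondary="mac_roman"))  # Joachim's setting
```

With Joachim's setting the Latin1 file opens without any error, but the accented letter comes out wrong, which is exactly the drawback mentioned above.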

- Fischlin Andreas - 2020-11-21
  
  Dear Bernard,
  
  I do not necessarily prefer the current behavior, since it is indeed a bit cumbersome. I would therefore also prefer a simpler, more automated approach. However, to really explain what I would prefer, I would need more clarity about what the proposal actually is, as I explained in my previous e-mail.

  The only thing I can already say at this point is: (i) yes, I welcome support for easier conversion to a desirable encoding such as UTF-8; (ii) but I wish to have sufficient control over the behavior, e.g. easy-to-dismiss warnings when a problematic conversion was made. That means no warning when I transform many files in a particular manner in one big step, but a warning when I occasionally deal with individual files, where I would appreciate learning that a particular conversion is about to take place, so I can decide whether I want to go with it or not.
  
  This seems to me to perhaps mean:
  
  A preference by which the user can define a target encoding, e.g. UTF-8

  Convenient facilities for conversion, e.g. opening a MacRoman file triggers no warning when the original encoding can be changed to the user-defined target encoding such as UTF-8

  A preference to suppress or enable warnings when a conversion as described above takes place; my preference for the default would be to show the warnings (similar to the current behavior)

  However, I consider a preference to suppress all warnings, even when the encoding cannot be converted without error, to be questionable (e.g. silently opening a Latin1 file and showing wrong characters), at least please not by default. Here I clearly prefer a warning and a dialog as currently offered.

  Perhaps the solution is yet another one:

  A preference by which the user can define the expected encoding, e.g. MacRoman, of any file that is opened

  A preference by which the user can define the target encoding, e.g. UTF-8, for any currently open file that is to be saved

  No preference to suppress any warnings, since they always show up when a conversion fails, e.g. when a file to be opened does not match the expected encoding, or when the user tries to save a file in an encoding that cannot represent all its characters, e.g. saving a UTF-8 encoded file that contains characters not representable in MacRoman to MacRoman. Of course you should also be able to override such a warning, e.g. in case of an intentional manual encoding.
  
  Regards,
  Andreas
  
  Last edit: Bernard Desgraupes 2020-11-21
  

Bernard Desgraupes - 2020-11-21

Dear Andreas,
thank you for your suggestions which go way beyond the initial request contained in this Ticket.

As far as this ticket is concerned, I'd like to clarify what happens when Alpha reads a file. The file contains bytes and Alpha must translate (not convert) these bytes to letters. To make this possible, the user must specify the encoding (which serves as a translation table). If your file contains the byte 0xE9, here is what happens depending on the encoding declared by the user:

if the user said to use the Latin1 encoding (ISO-8859-1), Alpha translates the byte to 'é'

if the user said to use the MacRoman encoding, Alpha translates the byte to 'È'

if the user said to use the UTF-8 encoding form, there is an error, because a byte 0xE9 on its own is not valid UTF-8 (it can only start a three-byte sequence).

So when I wrote that opening a Latin1 file in macRoman encoding yields "wrong" characters, what is wrong there is that the user misled Alpha. Alpha just does what it was told to do.
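
The three cases above can be reproduced in a few lines of Python (chosen only for illustration; `latin-1`, `mac_roman` and `utf-8` are Python's codec names, not Alpha's preference values):

```python
raw = b"\xe9"  # the single byte discussed above

as_latin1 = raw.decode("latin-1")      # 'é' (U+00E9)
as_macroman = raw.decode("mac_roman")  # 'È' (U+00C8)

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    # 0xE9 announces a three-byte sequence, but nothing follows it.
    utf8_ok = False

print(as_latin1, as_macroman, utf8_ok)  # prints: é È False
```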

If you ask to open any file in any 1-byte encoding (like MacRoman, Latin1, ISO-8859-7 for Greek, KOI8 for Russian, etc.), there will never be an error message. OTOH, you may see an error message when the input encoding is UTF-8, because some bytes are forbidden depending on their position in multi-byte sequences, so Alpha will tell you that your file cannot be UTF-8 because it contains invalid sequences.
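
A small sketch of this difference, again in Python and shown only for Latin1 and MacRoman as Python implements them. The helper name `is_valid_utf8` is hypothetical:

```python
def is_valid_utf8(data: bytes) -> bool:
    """True if the bytes form a well-formed UTF-8 stream."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


all_bytes = bytes(range(256))

# Single-byte encodings: every byte maps to some character, so no error.
single_byte_lengths = {
    name: len(all_bytes.decode(name)) for name in ("latin-1", "mac_roman")
}

# UTF-8: validity depends on where a byte sits in a sequence.
print(is_valid_utf8(b"\xc3\xa9"))  # lead byte + continuation byte
print(is_valid_utf8(b"\xa9"))      # continuation byte with no lead byte
print(is_valid_utf8(b"\xc3A"))     # lead byte followed by plain ASCII
```

Only the first of the three UTF-8 checks passes; the byte 0xA9 is fine after 0xC3 but invalid on its own.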

The only "conversion" which takes place occurs when Alpha builds its internal buffer, which contains only UTF-16 two-byte sequences. The user should not be concerned by this: it is only an internal representation. When you save your modified file, Alpha performs the necessary backward conversion to write out the proper bytes in the desired output encoding.
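
As a rough illustration of this round trip (a Python sketch under the assumption that a Python string can stand in for the internal buffer; it is not how Alpha is written):

```python
data_in = b"caf\xe9"               # file bytes, declared as Latin1
text = data_in.decode("latin-1")   # translated to characters: 'café'

# What the same text looks like as UTF-16: two bytes per character here.
internal = text.encode("utf-16-le")

# Saving writes the characters back out in the chosen output encoding.
saved_latin1 = text.encode("latin-1")  # same bytes as data_in
saved_utf8 = text.encode("utf-8")      # 'é' becomes the two bytes C3 A9

# Saving fails when the output encoding lacks a character (Latin1 has no '€').
try:
    (text + " 5€").encode("latin-1")
    can_save = True
except UnicodeEncodeError:
    can_save = False

print(len(internal), saved_utf8, can_save)
```

The last case is the save-time warning discussed earlier in the thread: the text itself is fine, but the requested output encoding cannot represent one of its characters.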

The original request made in this ticket means that there should be a fallback mechanism when Alpha detects that the file is not valid UTF-8: it suggests using some heuristics to guess what the encoding could be. Unfortunately there is no reliable method for detecting an encoding. James suggested using the file command-line tool, which tries to guess the encoding, but it can easily be fooled and give wrong answers. I would not rely on it.

Joachim suggested that the user define a sort of secondary encoding (or fallback encoding) that Alpha would use silently when UTF-8 fails: this could be useful, but it is still the user's responsibility to specify a secondary encoding that suits her needs. If almost all your non-UTF-8 files are in MacRoman, it makes sense to set this secondary encoding to MacRoman. But if the file was Latin1, well, too bad... you just misled Alpha.
- James Connolly - 2020-11-22
  
  Hello Bernard, thanks for the detailed summary and discussion, which explain the problems well!

  The solution this is converging towards (the secondary encoding, which is the user’s responsibility) makes perfect sense, and I’ll say no more.
  
  All the best, James
  
  [tickets:#240] Encoding
  
  Status: open
  Created: Sat Oct 24, 2020 07:43 AM UTC by James Connolly
  Last Updated: Fri Nov 20, 2020 05:28 PM UTC
  Owner: nobody
  
  Hello, first thanks for all the great work from a long term user (≈1997 on) !
  
  A suggestion: would it be possible to have a "Default encoding : Check file on open"?

  This "check file on open" would, upon opening a file, have Alpha check the file's encoding (e.g. with the BSD "file" command) and use that, and only open the "Encodings popup" if this fails.
  
  At present I have many files in ISO-8859 and the pop-up is working rather hard. And I'm not inclined to convert them all to UTF-8 in order to avoid further potential problems : if Alpha can handle ISO-8859, why not silently continue to do so where appropriate is my thinking.
  
  Cheers, James
  
  p.s. not filed as "bug" nor "task" but as "RFE", which I hope means "suggestion".
  

Bernard Desgraupes - 2021-03-07

This is now fixed by implementing Joachim's solution. I have defined a new preference called Secondary Input Encoding, which is the default choice offered if Alpha fails to open a file with the input encoding.
Changes committed to the repository (rev. 2003).

Bernard Desgraupes - 2021-03-07

status: open --> fixed

Version: 9.2 --> 9.2.2

Bernard Desgraupes - 2021-06-01

status: fixed --> closed

Version: 9.2.2 --> 9.2.3