
#1579 Detect TSV glossary encoding via magic comment

5.6
closed-fixed
None
5
2021-10-06
2021-07-01
No

This is a mitigation for [bugs:#1046].

Encoding detection algorithms are heuristic, so detecting the encoding of a TSV glossary file can give the wrong result.

To help avoid misdetections, OmegaT now attempts to determine the encoding of a TSV glossary file (except with extension .utf8) by inspecting the first line of the file for a "magic comment".

A magic comment is a comment line with content formatted like

-*- foo: bar; biz: baz -*-

which represents instructions to set foo to bar and biz to baz.

OmegaT recognizes only the setting coding: <charset> where <charset> is a charset recognized by Java's Charset#forName, such as utf-8.
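As a rough illustration of what "recognized by Charset#forName" means in practice, the sketch below resolves a charset name and treats malformed or unsupported names as not recognized. The class and method names (`CharsetLookupSketch`, `lookup`) are hypothetical and are not taken from OmegaT's code; how OmegaT itself handles an unrecognized name is not stated in this ticket.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.UnsupportedCharsetException;
import java.util.Optional;

// Hypothetical helper, for illustration only (not OmegaT's implementation).
class CharsetLookupSketch {

    // Returns the charset for the given name, or empty if Charset.forName
    // rejects the name as malformed or unsupported on this JVM.
    static Optional<Charset> lookup(String name) {
        try {
            return Optional.of(Charset.forName(name.trim()));
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(lookup("utf-8"));
        System.out.println(lookup("no-such-charset-xyz"));
    }
}
```

Charset names are matched case-insensitively by `Charset.forName`, so `utf-8` and `UTF-8` resolve to the same charset.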

A magic comment setting the coding to utf-8 will be automatically included as the first line of a writable glossary file created by OmegaT. Since the recognized comment marker for glossary files is #, the magic comment is:

# -*- coding: utf-8 -*-

Note that you can include arbitrary content between the # and the first -*-.
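To make the format concrete, here is a minimal sketch of how a first line could be checked for a magic comment, assuming the rules described above: the line starts with `#`, arbitrary content may precede the first `-*-`, and settings are `key: value` pairs separated by `;`. The class name, method name and regular expression are assumptions for illustration; the actual implementation in OmegaT may differ (for example in how it treats whitespace or a byte order mark).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser, for illustration only (not OmegaT's implementation).
class MagicCommentSketch {

    // '#' comment marker, then arbitrary content, then "-*- settings -*-".
    private static final Pattern MAGIC = Pattern.compile("^#.*?-\\*-(.*?)-\\*-");

    // Returns the settings found in the line, or an empty map if the line
    // is not a comment containing a magic comment.
    static Map<String, String> parse(String firstLine) {
        Map<String, String> settings = new LinkedHashMap<>();
        Matcher m = MAGIC.matcher(firstLine);
        if (!m.find()) {
            return settings;
        }
        for (String pair : m.group(1).split(";")) {
            int colon = pair.indexOf(':');
            if (colon < 0) {
                continue; // not a "key: value" pair
            }
            String key = pair.substring(0, colon).trim();
            String value = pair.substring(colon + 1).trim();
            if (!key.isEmpty() && !value.isEmpty()) {
                settings.put(key, value);
            }
        }
        return settings;
    }

    public static void main(String[] args) {
        System.out.println(parse("# -*- coding: utf-8 -*-"));
        System.out.println(parse("# My glossary -*- foo: bar; biz: baz -*-"));
        System.out.println(parse("source\ttarget\tcomment"));
    }
}
```

Of the parsed settings, only `coding` would be relevant to OmegaT; its value would then be passed to `Charset#forName`.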

OmegaT does not add a magic comment to existing glossary files, so they still suffer from [bugs:#1046]. Users should add an appropriate magic comment themselves if desired.

Related

Bugs: #1046

Discussion

  • Aaron Madlon-Kay

    • Description has changed:

    Diff:

    --- old
    +++ new
    @@ -4,16 +4,20 @@
    
     To help avoid misdetections, OmegaT now attempts to determine the encoding of a TSV glossary file (except with extension `.utf8`) by inspecting the first line of the file for a "magic comment".
    
    -A magic comment is a comment line with the format
    +A magic comment is a comment line with the content
    
     ```
    -# -*- foo: bar; biz: baz -*-
    +-*- foo: bar; biz: baz -*-
     ```
    
     which represents instructions to set `foo` to `bar` and `biz` to `baz`.
    
     OmegaT recognizes only the settings `coding: <charset>`  where `<charset>`  is a charset recognized by Java's [Charset#forName](https://docs.oracle.com/javase/8/docs/api/java/nio/charset/Charset.html#forName-java.lang.String-), such as `utf-8`.
    
    -A magic comment setting the coding to `utf-8` will be automatically included as the first line of a writable glossary file created by OmegaT.
    +A magic comment setting the coding to `utf-8` will be automatically included as the first line of a writable glossary file created by OmegaT. Since the recognized comment marker for glossary files is `#`, the magic comment is:
    +
    +```
    +# -*- coding: utf-8 -*-
    +```
    
     Existing glossary files are not modified with respect to the magic comment, and still suffer from [bugs:#1046]. Users should add an appropriate magic comment if desired.
    
     

    Related

    Bugs: #1046

  • Aaron Madlon-Kay

    • Description has changed:

    Diff:

    --- old
    +++ new
    @@ -4,7 +4,7 @@
    
     To help avoid misdetections, OmegaT now attempts to determine the encoding of a TSV glossary file (except with extension `.utf8`) by inspecting the first line of the file for a "magic comment".
    
    -A magic comment is a comment line with the content
    +A magic comment is a comment line with content formatted like
    
     ```
     -*- foo: bar; biz: baz -*-
    @@ -12,7 +12,7 @@
    
     which represents instructions to set `foo` to `bar` and `biz` to `baz`.
    
    -OmegaT recognizes only the settings `coding: <charset>`  where `<charset>`  is a charset recognized by Java's [Charset#forName](https://docs.oracle.com/javase/8/docs/api/java/nio/charset/Charset.html#forName-java.lang.String-), such as `utf-8`.
    +OmegaT recognizes only the setting `coding: <charset>`  where `<charset>`  is a charset recognized by Java's [Charset#forName](https://docs.oracle.com/javase/8/docs/api/java/nio/charset/Charset.html#forName-java.lang.String-), such as `utf-8`.
    
     A magic comment setting the coding to `utf-8` will be automatically included as the first line of a writable glossary file created by OmegaT. Since the recognized comment marker for glossary files is `#`, the magic comment is:
    
    @@ -20,4 +20,6 @@
     # -*- coding: utf-8 -*-
     ```
    
    +Note that you can include arbitrary content between the `#` and the first `-*-`.
    +
     Existing glossary files are not modified with respect to the magic comment, and still suffer from [bugs:#1046]. Users should add an appropriate magic comment if desired.
    
     

    Related

    Bugs: #1046

  • Aaron Madlon-Kay

    This is implemented in [85bd5c].

     

    Related

    Commit: [85bd5c]

  • Aaron Madlon-Kay

    • status: open-fixed --> closed-fixed
     
  • Aaron Madlon-Kay

    Released in OmegaT 5.6.0.

     
