#1485 Optionally suppress output of source text for untranslated segments

Milestone: 5.3
Status: closed-duplicate
Owner: nobody
Labels: None
Priority: 5
Updated: 2020-11-28
Created: 2020-03-19
Creator: msoutopico
Private: No

When a segment is untranslated, OmegaT maintains the source text when generating the target files. This produces bilingual documents, with parts in the target language and other parts in the source language.

This can be convenient for some users and purposes, but inconvenient for other users and purposes, for example, if a file needs a partial translation only. An incomplete target document is the expected result in those cases. Also, delivering a translated file to the client with parts in the source language can be a problem in certain cases.

A use case from an anonymous user:

I want to pretranslate a source file with all my offline memories and afterwards upload it to an online crowdsourcing platform. When I pretranslate these files with OmegaT, I get a mixed target text that includes both translated and untranslated text. Then I upload these files to the platform (which, by the way, does not allow uploading TMX files), but the whole content is then marked as 'translated'. This prevents other online users from filtering by untranslated segments.

On the other hand, since OmegaT does not have segment statuses, in some projects, entering some text (even if it's the source text) in the target segment is the only way for the user to "confirm" that segment and to show that they have not overlooked it. When the PM receives a project back from the translator containing untranslated segments, the PM should not have to guess whether it is because the translator forgot to touch them, because no translation is needed, or because the needed translation is identical to the source text.

Therefore, this ticket is to request an option by which the user can decide what OmegaT must do with untranslated segments when generating the target files. For example, it could be just a check box called "For untranslated segments, use the source text as the translation in the target documents" which is ticked by default.

Or it could be an option with two radio buttons:
For untranslated segments:
a) use the source text as the translation (default behaviour)
b) leave the translation empty

To avoid generating invalid target files because some necessary tags are missing, a safety measure for option (b) above should be the inclusion of the tags (either standard or custom) that the untranslated source segment contains. In other words, if the source text contains tags, they could be added (as an option) to the empty translation when creating the target document, to avoid a corrupt file or a file that cannot be opened. That should not affect the status of the untranslated segments for the purposes of statistics, for example.

Ideally, that should be a sub-option (under b):

b.1) Include tags at export for untranslated segments
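
The behaviour requested above can be sketched in a few lines. This is only an illustration in Python, not OmegaT code: the function name `export_text`, the `use_source` flag and the tag pattern are assumptions made for the example, and the pattern only covers OmegaT-style placeholder tags such as `<b0>`, `</b0>` or `<br1/>`, not custom tags.

```python
import re
from typing import Optional

# Assumed pattern for OmegaT-style placeholder tags such as <b0>, </b0>, <br1/>.
# Custom tags would need an additional, user-defined pattern.
TAG_PATTERN = re.compile(r"</?[a-zA-Z]+[0-9]+/?>")


def export_text(source: str, translation: Optional[str], use_source: bool = True) -> str:
    """Return the text to write into the target document for one segment.

    `translation` is None when the segment is untranslated.
    """
    if translation is not None:
        # The segment has a translation: always export it as is.
        return translation
    if use_source:
        # Option (a): current default, copy the source text.
        return source
    # Option (b): drop the text but keep the tags, in their original order,
    # so that the target file keeps a valid structure.
    return "".join(TAG_PATTERN.findall(source))
```

With this sketch, `export_text("Click <b0>here</b0>", None, use_source=False)` keeps only the two tags, while a translated segment is exported unchanged whatever the option says.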

Related

Feature Requests: #1527

Discussion

  • Didier Briel

    Didier Briel - 2020-03-19

    Note that this behaviour exists already for some filters (e.g., PO filter).

     
  • msoutopico

    msoutopico - 2020-10-09

    Another useful suboption of this feature would be:
    [x] Use the source text as the translation.
    |__ [x] Only when source and target languages are language variants.

    That would make the export use the source text as the translation when the language subtag of the source and the target language codes is the same, for example when the source language is en-US and the target language is en-GB.

    That setting would leave untranslated segments empty in the target document when the language combination is, say, en-US -> fr-CA.
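
The "language variants" check described above could look roughly like this. It is a minimal sketch in Python, assuming language codes of the form `en-US` or `en_US`; the helper names `primary_subtag` and `are_language_variants` are hypothetical and not part of OmegaT.

```python
def primary_subtag(code: str) -> str:
    """Return the language subtag of a code such as 'en-US' or 'en_US'."""
    return code.replace("_", "-").split("-")[0].lower()


def are_language_variants(source_lang: str, target_lang: str) -> bool:
    """True when both codes share the same language subtag (e.g. en-US and en-GB)."""
    return primary_subtag(source_lang) == primary_subtag(target_lang)
```

Under this sketch, en-US and en-GB count as variants, so the source text would be used for untranslated segments, whereas en-US and fr-CA do not, so those segments would be left empty.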

     

    Last edit: msoutopico 2020-10-09
  • Samuel Murray

    Samuel Murray - 2020-10-09

    I'd like to suggest added value to this suggested feature.

    I understand that features that are considered workarounds for the lack of segment status are typically not preferred at this stage (until the problem of segment status is resolved), but the fact is that some translators do currently use dummy text for segment status (e.g. adding ### if you're unsure about the translation and don't want to confirm it yet, or having OmegaT add [fuzzy] if a fuzzy match translation is not finalized).

    To make this requested feature useful for such people as well, I suggest that a section be added in the preferences with the following options:

    Exporting empty segments:

    x Export as empty segment if target text is empty.
    x Export as empty segment if source text contains: [text field]
    x Export as empty segment if target text contains: [text field]
    x Export as empty segment if source text does not contain: [text field]
    x Export as empty segment if target text does not contain: [text field]
    x Export tags for empty segments

    (The "x" represents a checkbox. The option "Export tags for empty segments" is greyed out if none of the other options are checked. The [text field] represents a box where the user can type a regular expression.)

    So, if the user selects the first option and third option with e.g. \[fuzzy\]|!!! in the text field, then all segments that are either empty or whose target text contains either "[fuzzy]" or "!!!" will be exported as nothing.

    Or, if the user selects only the second option with e.g. DoNotTranslate:: in the text field, then all segments that contain "DoNotTranslate::" in the source text will be exported as nothing, but other segments with an empty target field will get exported as usual since the first option was not selected.

    Added: I just realised that the option x Export as empty segment if target text is empty can be omitted, because the same effect can be achieved by adding "." (i.e. match any character) to the text field of the option x Export as empty segment if target text does not contain: [text field].
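
To make the matching rules above concrete, here is a rough sketch in Python of the two target-text options. The function `export_as_empty` and its parameters are hypothetical names for this example; the target text is modelled as a plain string that is `""` when nothing has been typed.

```python
import re
from typing import Optional


def export_as_empty(target: str,
                    contains: Optional[str] = None,
                    not_contains: Optional[str] = None) -> bool:
    """Decide whether a segment should be exported as nothing.

    `contains` and `not_contains` are the regular expressions typed into the
    "target text contains" and "target text does not contain" text fields;
    None means that the corresponding checkbox is not ticked.
    """
    if contains is not None and re.search(contains, target):
        return True
    if not_contains is not None and not re.search(not_contains, target):
        return True
    return False
```

In this sketch, a target of "[fuzzy] Bonjour" is caught by the expression `\[fuzzy\]|!!!`, and an empty target is caught by putting `.` in the "does not contain" field, which is why the separate "target text is empty" option becomes redundant.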

     

    Last edit: Samuel Murray 2020-10-09
    • Samuel Murray

      Samuel Murray - 2020-10-09

      I don't mean to hijack this RFE, but... if my above suggestion is viewed favourably, then I might as well suggest a further enhancement to it that will make the feature also useful for l10n debugging and CAT roundtripping purposes, which may be not much extra work to implement:

      Instead of the above mentioned options, rather have a section in the preferences with these options:

      Define special segments

      x Consider as special segment if source text contains: [text field]
      x Consider as special segment if target text contains: [text field]
      x Consider as special segment if source text does not contain: [text field]
      x Consider as special segment if target text does not contain: [text field]

      y Export special segments as: [text field]
      y Export special segments as empty
        x Export tags for empty segments

      (The "x" represents a checkbox, the "y" represents a radio button. The [text field] represents a box where the user can type a regular expression.)

      (The option x Export tags for empty segments is greyed out unless the option y Export special segments as empty is checked. The two "Export special segments" options are greyed out unless one or more of the first four options are checked.)

      The option y Export special segments as: [text field] has a special syntax that allows the user to include elements such as the source text ($sourcetext), target text ($targettext), date & time ($nowdatetime), an incremented number ($id), the segment number ($segid), etc. in the segment when it gets exported.

      For example, if the user puts \[$id\]{{$sourcetext}} in that option's text field, then segments that were identified in the first four options as "special segments" get exported with a number (starting from 1) in square brackets, plus the source text in double curly brackets. This is useful for software localization debugging (i.e. when viewing the strings file in the UI, to see where in the UI certain text appears).

      Or, if the user puts just {{$sourcetext}} in that option's text field, then segments that were identified in the first four options as "special segments" get exported as just the source text, in double curly brackets. While this is useful for CAT roundtripping, it is also useful if the user wants to flag such text in the target file visually by e.g. opening it in MS Word and highlighting all text between {{ and }}.
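
The two template examples above can be reproduced with a small sketch. It uses Python's `string.Template` purely as a stand-in for the proposed syntax; the function `render_special` is a hypothetical name, only `$sourcetext`, `$targettext` and `$id` are handled, and backslash escapes are assumed to mean "take the next character literally".

```python
import re
from string import Template


def render_special(template: str, source: str, target: str, counter: int) -> str:
    """Expand $sourcetext, $targettext and $id in an export template."""
    # Remove the backslash escapes first, so that "\[" becomes "[".
    unescaped = re.sub(r"\\(.)", r"\1", template)
    # safe_substitute leaves unknown placeholders (e.g. $segid) untouched.
    return Template(unescaped).safe_substitute(
        sourcetext=source, targettext=target, id=counter
    )
```

For instance, `render_special(r"\[$id\]{{$sourcetext}}", "Save file", "", 1)` gives `[1]{{Save file}}`, and the template `{{$sourcetext}}` gives `{{Save file}}`.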

       
      • Thomas CORDONNIER

        "x Consider as special segment if source text contains: [text field]"

        If you apply a filter to the source, do you mean something equivalent to Trados's "locked" segments, in which the user is not allowed to enter a translation at all? Then OmegaT should not export the translation of these segments, even if the user puts something in them. Is that what you mean here?
        If not, I doubt that we can use options related to the source, since it is not possible to change the source inside OmegaT.

         
        • Samuel Murray

          Samuel Murray - 2020-10-12

          @t_cordonnier Aaron has already ruled on my suggestion, but since you ask, allow me to respond: 1. No, this has nothing to do with locked segments. 2. The fact that the user can't edit the source text in OmegaT doesn't mean that the source text can't contain unique (definable) text that can be used to identify the segment as being of a specific type.

           
    • Aaron Madlon-Kay

      I understand that features that are considered workarounds for the lack of segment status are typically not preferred at this stage (until the problem of segment status is resolved)

      If I understand your suggestion to be primarily a vehicle for controlling what is or is not exported, then it represents enshrining a hack as a fully supported feature. Yes, this RFE is a workaround for the lack of proper statuses, but at least as originally formulated it has a natural evolution into a world with statuses. A string-based pseudo-status system that is smuggled into the content of the segment doesn't have a future; it can only be discarded, which would be a breaking change for anyone using it.

      If I understand your suggestion to be primarily about the latter use case you describe, namely to create custom behaviors based on user-defined patterns in the source text, then that is what I would call a Swiss Army knife feature: something with too broad a scope covering too many disparate use cases such that boiling it down to a reasonable implementation with a coherent UI (that anyone outside of yourself could actually use) is infeasible. This would be more suited to custom scripting or plugins.

      I would accept a patch for the RFE as originally described.

       
    • Thomas CORDONNIER

      "x Export as empty segment if target text is empty."

      Here I must alert you about an important point. There is a contextual menu in OmegaT which is probably so rarely used that you forgot it: "Set Empty Translation". That means that there is a difference between empty translation (translation = "", the empty string) and segment is not translated (which is internally represented by translation = null)

      At a minimum, the term you used in your checkbox is incorrect.

      Once you remember that, I understand even less why you link this RFE to the notion of status. The RFE is about segments where translation = null. But if you export them to, for example, XLIFF, then <target> will contain something, and even if you recently added the possibility to set a status in XLIFF, I am not sure all tools will look for it if <target> is not empty.

      Before implementing something directly in OmegaT core, maybe a temporary solution could be to write a script which sets translation = "" (empty string) for each segment where translation = null ?
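
The difference between "not translated" and "translated as empty", and the temporary workaround suggested above, can be modelled in a few lines. This is a Python illustration only, not an OmegaT script: the dictionary of source-to-translation entries and the name `blank_untranslated` are hypothetical, with `None` standing in for translation = null.

```python
from typing import Dict, Optional


def blank_untranslated(translations: Dict[str, Optional[str]]) -> Dict[str, str]:
    """Turn every untranslated entry (None) into an explicit empty translation ("")."""
    return {
        source: ("" if translation is None else translation)
        for source, translation in translations.items()
    }


project = {
    "Hello": "Bonjour",   # translated
    "Goodbye": None,      # untranslated (translation = null)
    "N/A": "",            # explicitly set to the empty translation
}
print(blank_untranslated(project))
```

After the conversion, the second and third entries can no longer be told apart, which is one reason to treat this only as a temporary measure.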

       

      Last edit: Thomas CORDONNIER 2020-10-12
      • Samuel Murray

        Samuel Murray - 2020-10-12

        @t_cordonnier You're right, when I said "empty" I did not mean the "Set translation as empty" type of emptiness (which I will be referring to as <empty> now, since that is how it appears in OmegaT), but rather the "there is nothing in the target field" type of emptiness. :-)

        We don't really know what kinds of segments the RFE is talking about. We can guess, based on the examples given. The anonymous user may be using XLIFF, but Manuel may be referring to OOXML. (Fortunately, Manuel is here, so he can tell us his side, at least.)

        I did a short test with a DOCX and an XLIFF file with one of the segments set to <empty>. In the XLIFF file (Okapi filter), the <empty> segment was exported with the source text in the target field, despite the fact that the XLIFF format is perfectly capable of having empty target fields.

         
        • Thomas CORDONNIER

          "I did a short test with a DOCX and an XLIFF file with one of the segments set to <empty>. In the XLIFF file (Okapi filter), the <empty> segment was exported with the source text in the target field,"

          That is a part of the problem: the "default" behaviour which seems actually implemented is, in reality, re-implemented in all filters individually! That is the reason why your tests in native filter (DOCX) and in Okapi filter (XLIFF, provided by another team, which may have different understanding of the API) give a different result!

          " despite the fact that the XLIFF format is perfectly capable of having empty target fields."

          I agree; for me, what the Okapi filter does here is incorrect behaviour: when you use "Set empty translation", you explicitly decide that the translation equals the empty string, so it should be treated exactly as if you had inserted an arbitrary text instead.
          In my own filter, when the translation is explicitly set to empty, I produce an empty target markup but I also set the status as translated. If the translation is not set, I leave the file as is and I do not set the status either (in the XLIFF), so it remains untranslated!
          Note that this also means that in such a case, tags should not automatically be added. In principle I agree with Aaron that we should always produce correct documents, but here we are talking about a case where the user explicitly decided that the translation is empty: if the source contains tags and the target (empty in this case) does not, this is an error from the user, not from the application, so it should be detected by the validation checker.

          "how Okapi XLIFF deals with this, it makes most sense to me that this should be an option that is set on a per-filter basis. "

          Your test only shows us that actually each filter has its own way to manage this. That does not mean that developing a global option is not possible.
          I agree that I would also prefer a per-filter option, but don't forget that this is much more work and that some filters (like Okapi or the filters I have written) are not a part of OmegaT, and will need to be updated by their respective teams.

          "On the other hand, exporting blanks isn't likely to be something that a user would want to become the default behaviour for a filter."

          I am not saying the contrary. And if we finally decide to have a per-filter option, then the default may differ from one filter to another (for example, it sounds logical to use the source for OOXML, but for XLIFF or PO defaulting to empty sounds more logical).

          "This means that ideally the user should be able to set this setting on a per-session or per-project basis."

          If we have one option per filter, then it is already possible: filters configuration can be global but exported/modified for each project.

           

          Last edit: Thomas CORDONNIER 2020-10-14
  • Aaron Madlon-Kay

    To avoid generating invalid target files because some necessary tags are missing, a safety measure for option (b) above should be the inclusion of the tags (either standard or custom) that the untranslated source segment contains. In other words, if the source text contains tags, they could be added (as an option) to the empty translation when creating the target document, to avoid a corrupt file or a file that cannot be opened. That should not affect the status of the untranslated segments for the purposes of statistics, for example.

    This part would be the primary difficulty of this RFE.

    Ideally, that should be a sub-option (under b):

    b.1) Include tags at export for untranslated segments

    I disagree. There should be no sub-option; (b) should simply do the correct thing in all cases. I think that should be feasible, but care needs to be taken in the implementation to make sure the changes are made above the individual filter layer; otherwise filter plugins will either not respect the (b) option, or they will do so partially or in a broken way.

     

    Last edit: Aaron Madlon-Kay 2020-10-09
    • msoutopico

      msoutopico - 2020-10-09

      I can only agree: (b) should simply do the correct thing in all cases. Fine with me, and I think that should be fine too for Adrià (the guy who originally proposed this feature).

       
    • adrm

      adrm - 2020-10-09

      I disagree. There should be no sub-option; (b) should simply do the correct thing in all cases.

      Could you please clarify what you mean by the correct thing? Would it imply exporting the tags?

       
      • Aaron Madlon-Kay

        Yes, the correct thing is to never create translated files with invalid structure; that means that even when suppressing output we must output at least the required tags. (I think it's infeasible to determine which tags are required and which are not, so in practice that probably means all tags should be output.)

         
  • Aaron Madlon-Kay

    • summary: RFF: Make an option to use the source text for untranslated segments --> Make an option to use the source text for untranslated segments
     
  • msoutopico

    msoutopico - 2020-10-09

    The feature requested was not meant as a workaround for segment statuses in any way. This ticket is about the export of the project, not whatever metadata each segment shows in OmegaT's UI. This RFF simply aims to meet a specific need that other CAT tools do meet (no translation => no text in the target document), and segment statuses per se wouldn't necessarily meet that need.

    I think Samuel's proposals are not an enhancement to this RFF, they are a different feature in themselves, therefore I think they deserve their own ticket (which you can link from here, of course, if you think the two things are related). In any case, they don't address the need that initially brought about this ticket, they actually seem a workaround for segment statuses.

     
    • Samuel Murray

      Samuel Murray - 2020-10-09

      Then I might have misunderstood the original RFE, sorry.

      1 While trying to make sense of what the "anonymous user" was trying to achieve, it occurred to me that perhaps the anonymous user was dealing with some kind of bilingual file. If it is XLIFF, then the RFE may be more problematic, because how OmegaT saves it depends on which XLIFF filter he's using (and potentially also the XLIFF file itself).

      2 As for whether to keep tags, I think it depends on what the particular user wants to do with the output file. I can think of two scenarios in which it would matter:

      2-1 Our dear anonymous user intends to upload the target file to another CAT tool/online system which detects whether a segment is "translated" based on whether the segment has content. If OmegaT retains tags, then untranslated segments will still contain content, and will thus be marked as translated even though they contain only tags and may not actually have any visible content.

      2-2 If it is an HTML file with tag soup, and his client is going to view the file in a browser, then keeping the tags should not be a problem, but if his client intends to view the file in a text editor, then it is conceivable that the client is expecting the untranslated portion of the file to be largely "empty" of code except for block-level elements.

      By the way, the RFE title is a bit confusing. It should be "Make an option to not use the source text for untranslated segments", or "Make it optional to use the source text for untranslated segments".

       

      Last edit: Samuel Murray 2020-10-09
    • Aaron Madlon-Kay

      The feature requested was not meant as a workaround for segment statuses in any way.

      I think the implication was that this RFE would not be necessary in a world where OmegaT had proper support for segment statuses, because a natural use case for statuses is to use an "incomplete" or "unverified" status to suppress output.

      This RFE treats the translated/untranslated state of the segment as a primitive binary segment status to achieve just that.

       
  • Aaron Madlon-Kay

    • summary: Make an option to use the source text for untranslated segments --> Optionally suppress output of source text for untranslated segments
     
  • Aaron Madlon-Kay

    I agree with Samuel that the title was confusing. Please correct it if I misunderstood the intent.

     
  • Samuel Murray

    Samuel Murray - 2020-10-10

    FWIW, let's not make assumptions about what is normal behaviour for other CAT tools. I created a test HTML file: <html><body>And <b>one two! Three four</b>! Five <i>six seven? Eight nine</i> ten!</body></html> and tried it in Trados 2019, MemoQ, Wordfast Pro 3 and Wordfast Pro 5, translating only segments 1 and 4. Since HTML is a "forgiving" format, I also tried it as DOCX. Here's what happens:

    • Trados: empty and unconfirmed = exports source text (including tags); empty but confirmed = exports tags only (and spaces, so that the resulting HTML file still has all the tags). DOCX: empty and unconfirmed = exports source text; empty but confirmed = exports as formatting (and spaces, with each space still formatted as the original).

    • MemoQ: if there are empty segments, MemoQ asks what you want: export as source text or export as blank. If I choose blank, then the resulting HTML file lacks the tags from segment 2 and 3, so that the resulting HTML file has one opening bold tag and one closing italics tag. DOCX: empty and either confirmed or unconfirmed = exports as formatting (and spaces, with each space still formatted as the original).

    • Wordfast Pro 3: Doesn't have segment status. Always exports the source text for empty segments for both HTML and DOCX.

    • Wordfast Pro 5: empty and unconfirmed = exports source text (including tags); empty but confirmed = exports a blank, and the resulting HTML file lacks the tags from segment 2 and 3, so that the HTML file has one opening bold tag and one closing italics tag. DOCX: empty and unconfirmed = exports source text; empty but confirmed = exports blank, but completely messes up the tags, as the entire segment 4 is now bold (instead of just the first two words of segment 4).

    So, only in MemoQ would our user be able to export unconfirmed empty segments as blanks (and in HTML the resulting file would look "wrong" due to missing tags). In the two other tools that have statuses, the user would only be able to export empty segments as blanks if he confirms the empty segments, which is the opposite of what the RFE author is aiming for.

    This RFE (i.e. to export blanks for empty [unconfirmed] segments) is not really a common functionality. It would be something unique to OmegaT.

    (And even if OmegaT had segment statuses, the normally expected behaviour would still be the opposite of what is requested here. This is why I don't see this feature as related to segment status... and also why I tried to approach it from an angle that is disconnected from segment status and instead relates to segment content.)

     
    • Thomas CORDONNIER

      "So, only in MemoQ would our user be able to export unconfirmed empty segments as blanks"

      What I understand in the RFE is that the user wants to decide in a checkbox what to do with untranslated (translation = null, not translation = "") segments. And as you said, MemoQ does not implement this behaviour as default, it asks you what to do with these segments.
      OmegaT does not have the possibility to set a translation with status = untranslated, draft or something like that, but it still makes a difference between untranslated (with no translation) and explicitly translated as empty. And now we are asking what to do with the first case.

      What you describe for MemoQ seems logical to me. The only question is whether users will accept a question box each time they build a document: this would be the first time OmegaT works like that. I suppose that people want instead an option which is set before any compilation and then always applies until it is changed again.

      Another question to be decided is whether we want a common option (and in that case, in which existing option pane to put it) or whether we have to implement it specifically for each filter. I had a look at the code, and I see solutions for both approaches. What do you think?

       
      • Samuel Murray

        Samuel Murray - 2020-10-12

        @t_cordonnier In the light of my "discovery" :-) of how Okapi XLIFF deals with this, it makes most sense to me that this should be an option that is set on a per-filter basis. The way to implement blankification properly would likely be different for different file types.

        On the other hand, exporting blanks isn't likely to be something that a user would want to become the default behaviour for a filter. This means that ideally the user should be able to set this setting on a per-session or per-project basis. And this means either overhauling the entire way OmegaT saves settings, or allowing for e.g. a button "Apply for this project only" in the various settings dialogs where global settings are generally set, so that the specific alteration becomes project-specific.

         
  • msoutopico

    msoutopico - 2020-11-27

    I agree; for me, what the Okapi filter does here is incorrect behaviour: when you use "Set empty translation", you explicitly decide that the translation equals the empty string, so it should be treated exactly as if you had inserted an arbitrary text instead.

    Reported: https://groups.google.com/g/okapi-users/c/cpXVBuw3Mno
    I had meant to report that for years...

     
  • msoutopico

    msoutopico - 2020-11-27

    It seems the original formulation and title were unclear and too wordy, sorry about that, and it seems this brought about a lot of confusion, whereas the thing should be pretty neat and simple (at least for the user). Sorry about that again!

    I wanted to reformulate the original post altogether, but I can't do this in this ticket (I can't edit the original text) and if I did, all comments that refer to it would become orphaned, so I think it's better to leave it as it is.

    Therefore I have created a new ticket, which I think should be clearer and a better instrument to get this thing implemented. I think this one can be closed now. Aaron, could you please close this one as duplicate? The new one is [#1527].

    As soon as you confirm that's fine, I'll forward Adrià's email to the omegat-development mailing list with reference to the new ticket that I've just created.

    If you have further comments or would like to suggest added value to the feature, please comment in the omegat-development mailing list (as soon as I forward Adrià's email there) and I will update the new ticket if clarification is needed. Thanks!

     

    Related

    Feature Requests: #1527


    Last edit: Aaron Madlon-Kay 2020-11-28