OmegaT - multiplatform CAT tool / Feature Requests / #360 Aligner

Marc Prior - 2007-08-18

Logged In: YES
user_id=722901
Originator: NO

I suggest that:

1. We don't re-invent the wheel.

2. We use open standards wherever possible.

In this case, the obvious solution is therefore for SRX to be implemented in both OmegaT and bitext2tmx.

Marc

Samuel Murray - 2007-08-18

Logged In: YES
user_id=168045
Originator: YES

>> 2. We use open standards wherever possible. <<

What OmegaT is experiencing right now is a feature freeze in the name of standards. New enhancements are frozen until the implementation of standards has been perfected. But standards are only relevant for activists. Users don't use standards -- they use features.

Marc Prior - 2007-08-18

Logged In: YES
user_id=722901
Originator: NO

Hardly, Samuel. Delay in the progress of OmegaT's development is due to one thing alone: a lack of time on the part of developers.

An aligner with a practical GUI already exists, bitext2tmx. What's more, the OmegaT and bitext2tmx projects are similar in a number of ways (open-source, Java-based) and there is already close contact between the two groups.

Given this situation, I see no need to duplicate the work of the bitext2tmx project. If bitext2tmx doesn't meet your needs, the obvious solution is further development of bitext2tmx, and that would be the appropriate place to submit your RFE.

As far as the standards are concerned, SRX already exists, and I doubt anyone would oppose its full implementation in OmegaT (the existing segmentation is based upon SRX, but it doesn't implement the standard in full, i.e. with the facility for portable segmentation rules). The only thing stopping us is developer time. If we don't have the resources to implement SRX in OmegaT (and in bitext2tmx - in fact, they could even share some code), I can't believe that an efficient solution would be to create a proprietary OmegaT aligner purely because it could use OmegaT's segmentation rules files. It would make far more sense to modify bitext2tmx so that it recognizes OmegaT's segmentation rules files.

More generally, I don't think segmentation rules need be an issue for alignment. It's much more practical to segment by the paragraph for alignment purposes.

Marc

Jean-Christophe Helary - 2007-08-18

Logged In: YES
user_id=915082
Originator: NO

I think that Samuel's proposal is very elegant and in fact corresponds to what I do manually, except that 1) I use a text editor so I don't have access to OmegaT's segmentation and 2) I use either TMXEdit or CSVConverter for the TMX finalization because I found bitext2tmx not easy to use.

SRX is something that describes segmentation rules to ensure that TMX data is equivalent, so to implement Marc's proposal we'd have to wait until OmegaT exports its rules as SRX. Then we'd have to wait until bitext2tmx imports SRX rules _and_ is easier to use to create segments, just to get _yet_ another external process.

Here, we have a process that can eventually be used for editing TMX files too (if used in "reverse" mode) and the idea of being able to have access to all the text with a fully editable pane is something that does not exist in any other tool.

Marc Prior - 2007-08-18

Logged In: YES
user_id=722901
Originator: NO

OK, JC. Should I assign this RFE to you, or to Samuel? :-)

I appreciate what Samuel is trying to achieve: an alignment tool that segments in the same way as OmegaT.

What I don't understand is all this "we'd have to wait" business. A developer is going to have to be found/persuaded to implement either proposal (Samuel's or mine). Why do we "have to wait" /either/ for SRX to be implemented in OmegaT and bitext2tmx /or/ for bitext2tmx to be adapted to segment according to OmegaT rules (which, although they are not in SRX, are "open" and readily accessible to developers in the form of an XML file), but we don't "have to wait" for a whole new aligner to be developed for OmegaT? Do you guys know a developer who is keen to create a new, GUI aligner in OmegaT but has an aversion to contributing to bitext2tmx? If so, that's sad, but I can live with it.

My point is that my solution is - I believe - far more efficient. I accept that bitext2tmx isn't perfect, and I also admit that I generally use TMXEdit myself for alignment, but I find it hard to accept that bitext2tmx's deficiencies warrant creating a whole new aligner in OmegaT.

Marc

Jean-Christophe Helary - 2007-08-18

Logged In: YES
user_id=915082
Originator: NO

The thing is that what Samuel is proposing is _not_ an aligner. It is a segmenter based on OmegaT rules and outputting text. That is a whole different thing, and much simpler too. Since you know TMXEdit, you know how cumbersome it is to use. Samuel's proposal is elegant because it removes all the cumbersome tasks from the segmenting: you just work in a text file on content that has been imported and split by OmegaT. When you are done, OmegaT takes the results from the 2 text files and adds the XML code to create a TM. And since all the code (filtering/parsing/segmenting/outputting) already exists, I argue that it is way easier to implement than to _wait_ for an SRX export in OmegaT and an SRX import in bitext2tmx (knowing that we'd still have to deal with the bitext2tmx interface, which is similar to TMXEdit's: cumbersome).

Samuel Murray - 2007-08-18

Logged In: YES
user_id=168045
Originator: YES

>> As far as the standards are concerned, SRX already exists, and I doubt anyone would oppose its full implementation in OmegaT... <<

I have no objection, but what is the benefit of full implementation of SRX over what OmegaT currently has for segmentation?

>> I see no need to duplicate the work of the bitext2tmx project. If bitext2tmx doesn't meet your needs, the obvious solution is further development of bitext2tmx. <<

Bitext2tmx has a certain design which is, as far as I can see, largely incompatible with what I'm proposing. My proposal would basically be a proposal to rewrite bitext2tmx. OmegaT, on the other hand, already has two of the most crucial features implemented -- dockable editing panes and good segmentation of complex documents.

>> What's more, the OmegaT and bitext2tmx projects are similar in a number of ways (open-source, Java-based)... <<

From a user's perspective, bitext2tmx and OmegaT have nothing in common. They even look different. An OmegaT aligner based on panes would give the user a familiar environment to work in.

>> More generally, I don't think segmentation rules need be an issue for alignment. It's much more practical to segment by the paragraph for alignment purposes. <<

You can't be serious, Marc -- a TM created from paragraphs will be of little use in a project that uses sentence segmentation. There will be hardly any fuzzy matches.
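
The effect described here can be illustrated with a rough sketch. It assumes a word-level similarity ratio from Python's `difflib` as a stand-in for a fuzzy-match score; OmegaT computes its own scores differently, and the sample sentences and the `similarity` helper are invented for the example.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Rough word-level similarity in [0, 1]; a stand-in for a fuzzy-match score."""
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()


# The same legacy sentence, stored once as its own TM entry and once
# buried inside a paragraph-level TM entry (invented sample text).
sentence_tm_entry = "Take one tablet twice daily with food."
paragraph_tm_entry = (
    "Store below 25 degrees. Take one tablet twice daily with food. "
    "Do not exceed the stated dose. Keep out of the reach of children."
)

# A new sentence-level segment from the current project.
new_segment = "Take two tablets twice daily with food."

score_vs_sentence = similarity(new_segment, sentence_tm_entry)
score_vs_paragraph = similarity(new_segment, paragraph_tm_entry)

print(f"sentence-level entry:  {score_vs_sentence:.2f}")
print(f"paragraph-level entry: {score_vs_paragraph:.2f}")
```

Under these assumptions the paragraph-level entry scores far lower than the sentence-level one, because the unrelated sentences around the match count against it.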

Samuel Murray - 2007-08-18

Logged In: YES
user_id=168045
Originator: YES

Bitext2tmx is cumbersome, and it assumes that the two files are mostly perfect matches for each other. But say you have a target text in which the author decided to swap around a few paragraphs -- a nightmare in bitext2tmx, because you can't "edit" the two files on the fly, and you can only work with one line at a time. With bitext2tmx, you might as well prepare the files first in two instances of Notepad (or your favourite text editor) before you can align them.

Actually, my initial proposal for an aligner didn't even mention OmegaT's segmenter, but then I realised that an aligner needs a segmenter, and what better segmenter than OmegaT's own segmenter?

Marc Prior - 2007-08-18

Logged In: YES
user_id=722901
Originator: NO

>> Bitext2tmx is cumbersome, and it assumes that the two files are mostly perfect matches for each other. But say you have a target text in which the author decided to swap around a few paragraphs -- a nightmare in bitext2tmx, because you can't "edit" the two files on the fly, and you can only work with one line at a time....

bitext2tmx has its limitations, the biggest probably being - if I'm not mistaken - that you can't save intermediate work. But it comes back to whether bitext2tmx should be disregarded altogether and a new aligner, or aligner function within OmegaT, designed from scratch. We can discuss this all we like, but which approach is more efficient would have to be decided by the developers.

>> ...but then I realised that an aligner needs a segmenter, and what better segmenter than OmegaT's own segmenter?

What counts is the results. It shouldn't matter what tool is used for segmenting, provided the results are transparent and controllable.

The logic "what better segmenter than OmegaT's own segmenter" is precisely the logic used by Trados: Trados segments in a certain way, and only if a translator uses Trados can a customer be sure that files are segmented in the same way as in the customer's own database. Result: only use translators with Trados. This is precisely what SRX was intended to overcome: segmentation by rules, not by product.

>> The thing is that what Samuel is proposing is _not_ an aligner. It is a segmenter based on OmegaT rules and outputting text. That is a whole different thing, and much simpler too.

OK, I'm starting to see this (see below), though Samuel certainly described it as an aligner; but that raises other questions (see below).

>> Since you know TMXEdit, you know how cumbersome it is to use.

I don't find TMXEdit cumbersome to use.

However, I think I'm beginning to see the attraction (to you) of Samuel's approach: it's the facility to edit the plain-text files directly. By analogy, you might say that an advantage of OmegaT's glossary files is that they are plain-text files and you can edit them directly.

I have a lot of sympathy with that approach - but in that case, I don't see the reason for creating dedicated UI functionality for it within OmegaT. The basic functionality is this:

* Plain-text source and target files are segmented (using OmegaT's rules) and, still in the original plain-text files, the segments numbered.

* The plain-text files can be edited by the user.

* After editing, the two plain-text files are merged into a single, TMX file.
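
The three steps above can be sketched in Python as follows. This is a minimal illustration, not how OmegaT or any existing tool works: the sentence-splitting regex is only a crude stand-in for OmegaT's segmentation rules, and the function names (`number_segments`, `read_numbered`, `merge_to_tmx`) and the tab-separated numbering format are assumptions made for the example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical stand-in for OmegaT's segmentation rules: break after
# sentence-final punctuation that is followed by whitespace.
SENTENCE_BREAK = re.compile(r"(?<=[.!?])\s+")


def number_segments(text):
    """Step 1: split text into segments and return numbered plain-text lines."""
    segments = [s.strip() for s in SENTENCE_BREAK.split(text) if s.strip()]
    return "\n".join(f"{i}\t{seg}" for i, seg in enumerate(segments, start=1))


def read_numbered(numbered_text):
    """Parse the (possibly hand-edited) numbered lines into {number: segment}."""
    result = {}
    for line in numbered_text.splitlines():
        if not line.strip():
            continue
        number, _, segment = line.partition("\t")
        result[int(number)] = segment.strip()
    return result


def merge_to_tmx(source_numbered, target_numbered, src_lang, tgt_lang):
    """Step 3: pair segments that share a number and return TMX as a string.

    Numbers present in only one of the two files are skipped.
    """
    source = read_numbered(source_numbered)
    target = read_numbered(target_numbered)
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "sketch", "creationtoolversion": "0",
        "segtype": "sentence", "o-tmf": "plain text",
        "adminlang": "en", "srclang": src_lang, "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for number in sorted(source.keys() & target.keys()):
        tu = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, source[number]), (tgt_lang, target[number])):
            tuv = ET.SubElement(tu, "tuv", {"xml:lang": lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")


# Step 2 (editing) happens in any text editor between these calls.
src = number_segments("Hello world. How are you?")
tgt = number_segments("Bonjour le monde. Comment allez-vous?")
print(merge_to_tmx(src, tgt, "en", "fr"))
```

In this sketch the user would fix misalignments by moving text between the numbered lines of either file before the merge step is run.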

From a coding point of view, the functionality as I've described it here is pretty simple. I could probably code it myself (in tcl/tk). There are two things I don't like about it though:

1. A dedicated UI, within OmegaT - that sounds like unnecessary overhead. Since there's no particular aligner-specific functionality, why not just open the two files in a text editor?

2. It's geeky. You may find TMXEdit's interface cumbersome, but all the aligners I've seen - TMXEdit, bitext2tmx, WF PlusTools, Cypersoft Aligner - have a similar, table-based UI. Without such an interface it's very easy for the inexperienced user to produce horribly misaligned files.

>> You can't be serious, Marc -- a TM created from paragraphs will be of little use in a project that uses sentence segmentation. There will be hardly any fuzzy matches.

I am perfectly serious. When it comes to aligning legacy texts from another source (not my own, which of course don't need any aligning because I do everything in OmegaT), I am hardly ever interested in fuzzy matches. I'm interested in the Text Search function. Sentence-level segments are slightly more convenient in this case, it's true, but not much more.

The real problem with alignment, from my experience, is having to intervene manually at all. I very rarely have a legacy text + translation (but no TM) that are closely related to my current job. I am much more likely to have a wealth of legacy material that *might* be useful. Aligning it all is a major effort and the benefits relatively small - to the extent that it is often easier not to align it and just to search through the source texts and find the corresponding position in the target text.

Paragraph-level alignment, though, is more reliable by an order of magnitude than sentence-level alignment, and makes alignment worthwhile in some cases where sentence-level alignment wouldn't be.

Where I would really like to see more work in the area of alignment is on techniques (e.g. the use of metatext) to make automated sentence-level alignment more reliable. But once the manual intervention stage is reached, I'm quite happy with the existing interfaces (bitext2tmx, TMXEdit, etc.). The problem as I see it is the task, i.e. the fact that it has to be performed at all, not the tools.

In the light of the above, I'd say yes to Samuel's approach. But not by modifying OmegaT. It would make more sense to modify aligner or bligner.

Samuel Murray - 2007-08-18

Logged In: YES
user_id=168045
Originator: YES

>> The logic "what better segmenter than OmegaT's own segmenter" is precisely the logic used by Trados... <<

Actually, the logic is simply: use what is simplest.

Besides, an SRX approach would yield the same segments as OmegaT would, if the rules are the same, right? If you want something to segment by SRX, you could either (a) write an SRX interpreter from scratch or (b) use an existing program that has SRX functionality.

If my aligner used OmegaT for segmentation, and OmegaT later adopted full SRX, the aligner would automatically be SRX capable.

>> OK, I'm starting to see this (see below), though Samuel certainly described it as an aligner... <<

Something tells me that what you call an aligner and what I call an aligner are two different things. To me, an aligner is a program or functionality that enables the translator to create translation memories from existing translations of source documents.

I suppose a program that simply aligns two identical (SL and TL) files can also be called an aligner, but that is being pedantic and the end-result is of no practical use for a translator who wants to create a TM from existing translations. An aligner that doesn't include a segmenter and a TM generator is worthless IMO.

>> 1. A dedicated UI, within OmegaT - that sounds like unnecessary overhead. Since there's no particular aligner-specific functionality, why not just open the two files in a text editor? <<

1. In your approach, the user has to segment the text beforehand.

2. In your approach, the user has to create a TM himself, by using find/replace or some other method to insert the complicated TMX tags, and hope he doesn't screw it up.

>> 2. It's geeky. ... all the aligners I've seen - TMXEdit, bitext2tmx, WF PlusTools, Cypersoft Aligner - have a similar, table-based UI.

Actually, what I've described is exactly the same as PlusTools' large document aligner.

Yes, the small document aligner in PlusTools is table based, but in previous versions of PlusTools the two-document approach was the only one. When Yves replaced the two-document aligner with a table based aligner, there was such an outcry that he had to put the two-document aligner back.

These table based aligners all have the problems that I have described, which would be solved by my aligner. Is there a market for a "different" aligner, for power users?

>> I am perfectly serious. When it comes to aligning legacy texts from another source ... I am hardly ever interested in fuzzy matches. <<

I believe you are part of a minority. Just last week I was able to cut the translation time of a 10 000-word medical document to that of a 1 000-word document because I was able to align (at sentence level) a number of documents in both languages (I was the editor of those documents, so I knew the quality was good).

>> The real problem with alignment, from my experience, is having to intervene manually at all. <<

For people who hate intervening during alignment, bitext2tmx would be perfect, because in it, intervening is impractical.

Marc Prior - 2007-08-18

Logged In: YES
user_id=722901
Originator: NO

>> Actually, the logic is simply: use what is simplest.

Why the dedicated UI, then? :-)

>> Besides, an SRX approach would yield the same segments as OmegaT would, if the rules are the same, right? If you want something to segment by SRX, you could either (a) write an SRX interpreter from scratch or (b) use an existing program that has SRX functionality.

SRX is not an end in itself. The point of SRX is that it makes the segmentation rules portable between different tools, different projects and different users. Nor does SRX define "segmentation rules". It defines the way the segmentation rules are presented. There is no "segmentation by SRX".
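
The distinction between the rules and the way they are presented can be sketched as follows. The sketch assumes the general shape of an SRX rule (a break/no-break flag plus a "before break" and an "after break" regular expression, with the first matching rule deciding); the `Rule` class, the `segment` function and the sample rules are invented for illustration and are not taken from OmegaT or from any real SRX file.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    """One SRX-style rule: a break/no-break flag plus two regex patterns."""
    is_break: bool
    before: str  # must match the text ending at the candidate position
    after: str   # must match the text starting at the candidate position


# Two interchangeable rule sets (illustrative only). The format is the same;
# only the rules differ. Order matters: the first matching rule wins.
WITH_EXCEPTIONS = [
    Rule(False, r"\b(?:Mr|Dr|etc)\.", r"\s"),  # no break after these abbreviations
    Rule(True, r"[.!?]", r"\s"),               # break after sentence-final punctuation
]
WITHOUT_EXCEPTIONS = [
    Rule(True, r"[.!?]", r"\s"),
]


def segment(text, rules):
    """Split text wherever the first rule matching at a position is a break rule."""
    segments, start = [], 0
    for pos in range(1, len(text)):
        for rule in rules:
            before_ok = re.search(f"(?:{rule.before})$", text[:pos])
            after_ok = re.match(rule.after, text[pos:])
            if before_ok and after_ok:
                if rule.is_break:
                    segments.append(text[start:pos].strip())
                    start = pos
                break  # the first matching rule decides; later rules are ignored
    segments.append(text[start:].strip())
    return [s for s in segments if s]


sample = "Dr. Smith arrived. He left."
print(segment(sample, WITH_EXCEPTIONS))
print(segment(sample, WITHOUT_EXCEPTIONS))
```

The same engine gives different segments for the two rule sets, which is the sense in which any tool that reads the same rules should segment the same way.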

>> If my aligner used OmegaT for segmentation, and OmegaT later adopted full SRX, the aligner would automatically be SRX capable.

That would depend upon the implementation, but presumably.

>> Something tells me that what you call an aligner and what I call an aligner are two different things. To me, an aligner is a program or functionality that enables the translator to create translation memories from existing translations of source documents.

To me too.

>> I suppose a program that simply aligns two identical (SL and TL) files can also be called an aligner, but that is being pedantic and the end-result is of no practical use for a translator who wants to create a TM from existing translations. An aligner that doesn't include a segmenter and a TM generator is worthless IMO.

Agreed. I don't know where you get the idea from that I'm suggesting that.

>> 1. In your approach, the user has to segment the text beforehand.

>> 2. In your approach, the user has to create a TM himself, by using find/replace or some other method to insert the complicated TMX tags, and hope he doesn't screw it up.

No and no. Re-read what I wrote:
____________________
* Plain-text source and target files are segmented (using OmegaT's rules) and, still in the original plain-text files, the segments numbered.

* The plain-text files can be edited by the user.

* After editing, the two plain-text files are merged into a single, TMX file.

From a coding point of view, the functionality as I've described it here is pretty simple. I could probably code it myself (in tcl/tk).
____________________

The tool would segment the text beforehand and create the TM afterwards, just as in your suggestion. What I'm questioning is a) the need for a dedicated GUI, and b) the need to integrate all this functionality into OmegaT.

>> Actually, what I've described is exactly the same as PlusTools' large document aligner.

>> Yes, the small document aligner in PlusTools is table based, but in previous versions of PlusTools the two-document approach was the only one. When Yves replaced the two-document aligner with a table based aligner, there was such an outcry that he had to put the two-document aligner back.

>> These table based aligners all have the problems that I have described, which would be solved by my aligner.

Fair enough - but I think your RFE then needs to be a little more comprehensive.

>> I believe you are part of a minority. Just last week I was able to cut the translation time of a 10 000-word medical document to that of a 1 000-word document because I was able to align (at sentence level) a number of documents in both languages (I was the editor of those documents, so I knew the quality was good).

Possibly. You asked me if I was serious about paragraph-level segmenting, and I said yes, and why. From what other colleagues tell me, alignment is considered to be seldom worthwhile at all.

>> For people who hate intervening during alignment, bitext2tmx would be perfect, because in it, intervening is impractical.

I don't see how anyone can *like* intervening in alignment. The more accurate the alignment and the less the scale of manual intervention, the better, surely? But whether one hates intervening or not is not the point. There will always be some need for manual adjustment, and personally I don't find bitext2tmx impractical.

Jean-Christophe Helary - 2009-05-22

milestone: --> future

Aaron Madlon-Kay - 2016-03-22

Description has changed:

Diff:

--- old
+++ new
@@ -1,4 +1,3 @@
- I really think OmegaT should include an aligner, even if only as an auxiliary tool distributed in the same package. The aligner that I have in mind, works like this: Process:

status: open --> closed-duplicate

Aaron Madlon-Kay - 2016-03-22

A GUI aligner has been implemented in [#1201]. Closing this as duplicate because the implementation was unrelated to, and differs significantly from, this RFE.

Related

Feature Requests: ~~#1201~~
