OmegaT+ CAT Tools / Feature Requests / #28 Remove extra spaces added around tags by Google Translate

Raymond Martin - 2011-03-08

Sorry for the late reply. Investigating the issue...

Raymond Martin - 2011-03-08

Sadi, can you send me a sample text that results in these spaces, along with any other details? Thanks.

Sadi Yumusak - 2011-03-08

Thanks, Raymond. The file "Test.odt" will show most of these.

Sadi Yumusak - 2011-03-08

Test.odt

Raymond Martin - 2011-03-08

Thanks for the file, Sadi.

It is relatively easy to remove these spaces once I can figure out what the exact programmatic representation of those symbols is. The main issue is which set of symbols needs to be handled. I imagine there must be other symbols across different languages, so I can hardly have a comprehensive set to work against. All I can do at this point is hard code the values, a less than optimal solution.
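As an illustration of the hard-coded approach described above, here is a minimal sketch in Python. The symbol set `ATTACH_LEFT` and the function name are assumptions made for this example, not the values or code used in OmegaT+.

```python
import re

# Assumed, deliberately small set of symbols that normally follow a word
# without a space (trademark and registered signs). Hypothetical values.
ATTACH_LEFT = "™®"

# Spaces or tabs that sit between a non-space character and one of the symbols.
_PATTERN = re.compile(r"(?<=\S)[ \t]+(?=[" + re.escape(ATTACH_LEFT) + r"])")


def strip_spaces_before_symbols(text: str) -> str:
    """Remove spaces or tabs that separate a word from a following symbol."""
    return _PATTERN.sub("", text)
```

The weakness is the one already noted: the set has to be extended by hand for every symbol and language convention that turns up.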

Sadi Yumusak - 2011-03-08

I think initially we can make do with the most frequently used (international) symbols like these.
The list at www.ascii.cl/htmlcodes.htm might be useful for this purpose.

Raymond Martin - 2011-03-09

Okay, that was helpful. I now have a general idea as to why Google Translate puts extra spaces in the translations returned.

Raymond Martin - 2011-03-26

After having looked at this issue for a while, I have come to the conclusion that it is impossible to come close to handling all cases where spaces occur. The only way around it would be to set up functionality with rules that users could edit or create, similar to segmentation rules. That would be quite a bit of work for a questionable return on the effort (GT translations do not have a very high confidence factor).
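To make the idea of user-editable rules concrete, here is a rough sketch in Python. The rule format (an ordered list of regular expression and replacement pairs) and the names `Rule`, `DEFAULT_RULES`, and `apply_rules` are assumptions for illustration only; nothing like this exists in OmegaT+ as far as this thread says.

```python
import re
from typing import List, Tuple

# Hypothetical rule format: (regex pattern, replacement), applied in order.
Rule = Tuple[str, str]

DEFAULT_RULES: List[Rule] = [
    (r"\s+([™®])", r"\1"),    # no space before trademark symbols
    (r"([(\[])\s+", r"\1"),   # no space after an opening bracket
    (r"\s+([)\]])", r"\1"),   # no space before a closing bracket
]


def apply_rules(text: str, rules: List[Rule] = DEFAULT_RULES) -> str:
    """Apply each cleanup rule in turn to a machine-translated string."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text
```

A user could then supply language-specific rules, for example `apply_rules("50 %", [(r"\s+%", "%")])`, without any change to the program itself.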

In addition, the best way for this to be fixed is either for Google to do it or for them to provide a way to return the preprocessed version of the original string that machine translation is applied to. The former makes sense because we have no understanding of the rules used to preprocess text before MT. With the latter, it might be possible to deduce various rules from the preprocessed text and work backwards to fix the returned translations. Neither of these is possible currently, but I have filed a bug report with Google about the spaces problem (ultimately they should be the ones to fix it).

In any event, I have managed to make a first effort that covers some cross section of these spaces. It is very limited, and I have not been able to ascertain whether it works across many languages. It is better than just removing spaces around tags, though. This will be included in the next release.
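For the simpler tag-only case mentioned above, one possible approach is to use the source segment as a guide: where the source has no space next to a tag, drop any space next to the same tag in the translation. The sketch below is in Python and is not the OmegaT+ implementation; the tag pattern (`<f0>`, `</f0>`, `<br1/>`) and the function name are assumptions.

```python
import re

# Assumed OmegaT-style placeholder tags such as <f0>, </f0>, <br1/>.
TAG = re.compile(r"</?[a-z]+\d+/?>")


def fix_tag_spacing(source: str, target: str) -> str:
    """Drop spaces next to a tag in target when the source had none there."""
    for m in TAG.finditer(source):
        tag = m.group(0)
        # A tag at the very start or end of the source gives no information
        # about spacing, so it is treated as if a space were present.
        space_before = m.start() == 0 or source[m.start() - 1].isspace()
        space_after = m.end() == len(source) or source[m.end()].isspace()
        esc = re.escape(tag)
        if not space_before:
            target = re.sub(r"[ \t]+" + esc, lambda _m: tag, target)
        if not space_after:
            target = re.sub(esc + r"[ \t]+", lambda _m: tag, target)
    return target
```

This only handles spaces directly adjacent to tags and does nothing for symbols or for text that the MT engine has moved.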

A particular caveat is that GT sometimes moves text around and dissociates it from connected symbols, such as a name and its trademark symbol. These may end up in different locations than expected. Such cases are hard to reproduce, as it is not possible to know the rules used to make the decisions. Unless there are highly repetitive cases of these "errors", the effort to program them out is somewhat wasteful compared to working on other issues. Let's just go with what is working thus far and see where fine tuning can be applied based on user experiences.
