Remove extra spaces added around tags by Google Translate
Google Translate is sometimes very helpful for quickly entering translations of simple texts in a project, but the extra spaces GT adds around text make this feature almost useless in many cases.
OmegaT seems to have solved this for tags, but two problems remain:
(1) There are still spaces around some symbols: / @ ™ ®
(2) Spaces in manual bullet formatting are also removed: "● Text" becomes "● Text"
Sorry for the late reply. Investigating the issue...
Sadi, can you send me a sample text that results in these spaces, along with any other details? Thanks.
Thanks, Raymond. The file "Test.odt" will show most of these.
Thanks for the file, Sadi.
It is relatively easy to remove these spaces once I can figure out what the exact representation of each symbol is programmatically. The main issue is which set of symbols needs to be handled. I imagine there must be other symbols across different languages, so I can hardly have a comprehensive set to work against. All I can do at this point is hard-code the values, a less than optimal solution.
I think initially we can perhaps make do with the most frequently used (international) symbols like these.
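To make the hard-coding approach concrete, here is a minimal sketch in Python. It is not OmegaT's actual code: the function name `strip_symbol_spaces` and the two symbol sets are assumptions for illustration, built only from the symbols reported above (/ @ ™ ®).

```python
import re

# Hypothetical hard-coded symbol sets (assumption, not taken from OmegaT).
ATTACH_BOTH_SIDES = "/@"    # e.g. "and / or" -> "and/or"
ATTACH_TO_PREVIOUS = "™®"   # e.g. "OmegaT ™" -> "OmegaT™"


def strip_symbol_spaces(text: str) -> str:
    """Remove spaces that MT output inserted around a fixed set of symbols."""
    both = "[" + re.escape(ATTACH_BOTH_SIDES) + "]"
    prev = "[" + re.escape(ATTACH_TO_PREVIOUS) + "]"
    # Drop spaces or tabs on either side of symbols such as "/" and "@".
    text = re.sub(r"[ \t]*(" + both + r")[ \t]*", r"\1", text)
    # Drop spaces or tabs only before symbols such as "™" and "®".
    text = re.sub(r"[ \t]+(" + prev + r")", r"\1", text)
    return text


print(strip_symbol_spaces("OmegaT ™ and / or user @ example.com"))
```

Note that this strips the spaces unconditionally, so it would also alter a translation whose source legitimately had a space before the symbol. A more careful version would probably compare against the source segment first.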
The list at www.ascii.cl/htmlcodes.htm might perhaps be useful for this purpose.
Okay, that was helpful. I now have a general idea as to why Google Translate puts extra spaces in the translations returned.
After having looked at this issue for a while, I have come to the conclusion that it is impossible to come close to handling all cases where spaces occur. The only way around it would be to set up functionality with rules that users could edit or create, similar to segmentation rules. That would be quite a bit of work for a questionable return on the effort (GT translations do not carry a very high confidence factor).
In addition, the best way for this to be fixed is either for Google to do it, or for them to provide functionality that returns their modified version of the original string, the one that machine translation is actually applied to. The former makes sense because we have no understanding of the rules used to preprocess text before MT. In the latter case, it might be possible to deduce various rules from seeing the preprocessed text and work backwards to fix the returned translations. Neither of these is possible currently, but I have filed a bug report with Google about the spaces problem (ultimately they should be the ones to fix it).
In any event, I have managed to make a first effort that covers some cross section of these spaces. It is very limited, and I have not been able to ascertain whether it works across many languages. It is better than just removing spaces around tags, though. This will be included in the next release.
A particular caveat that should be observed is that GT sometimes moves text around and dissociates it from connected symbols, such as a name and its trademark symbol. These may end up in different locations than expected. Such cases are hard to reproduce, as it is not possible to know the rules used to make the decisions. Unless there are highly repetitive cases of these "errors", the effort to program them out is somewhat wasteful compared to working on other issues. Let's just go with what is working thus far and see where fine-tuning can be applied based on user experience.
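For illustration only, the user-editable rules idea could look something like the sketch below. Everything in it is hypothetical: the JSON layout, the `pattern`/`replacement` keys, and the `load_rules`/`apply_rules` names are assumptions, not an existing OmegaT feature or file format.

```python
import json
import re

# Hypothetical user-editable rule set; each rule is a regex plus a replacement.
RULES_JSON = r"""
[
  {"pattern": "\\s+([™®])", "replacement": "\\1"},
  {"pattern": "\\s*([/@])\\s*", "replacement": "\\1"}
]
"""


def load_rules(raw: str):
    """Parse the rule list and precompile each pattern."""
    return [(re.compile(rule["pattern"]), rule["replacement"])
            for rule in json.loads(raw)]


def apply_rules(text: str, rules) -> str:
    """Apply every rule in order to a machine-translated string."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text


rules = load_rules(RULES_JSON)
print(apply_rules("Acme ® docs / help", rules))
```

Rules are applied in the order listed, so users could add language-specific rules without touching the code, much as segmentation rules are ordered and edited.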