Olifant/TMX: bad handling of reserved chars

See http://okapi.opentag.com/ for the latest version of the Okapi Framework

Brought to you by: dalcook, ysavourel

#108 Olifant/TMX: bad handling of reserved chars

Milestone: Olifant

Status: open

Owner: nobody

Labels: Functionality (100)

Priority: 5

Updated: 2011-12-15

Created: 2011-12-15

Creator: Eric Voisard

Private: No

First, thanks for the nice work. Olifant really helps us!
I would have liked to reopen bug #2971151 about "ascii character", but that is impossible.

When I create a new TMX file and add entries containing, for example, '<' and '>', Olifant agrees to save the file, but then it fails to reopen it and I get an XML parser exception.

As ysavourel explains in bug #2971151, XML specs forbid some special characters, in particular '&', '<' and '>'. These characters are reserved and must be escaped.
But I don't buy ysavourel's workaround (importing TMX files instead of opening them), because that's something you can't do when you're creating new TMX files.

These special characters should be escaped BEFORE saving the file.

Then (and here I'm thinking about translators, who are not all XML/HTML specialists), I think Olifant should be capable of replacing XML escape entities with human-readable characters before displaying them in the edit boxes and in the grid of the user interface. Olifant could then re-parse changed strings and escape any new reserved chars again before saving the strings in the TMX file (a back-and-forth regexp job).
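The back-and-forth described here could look roughly like the sketch below. It is only an illustration in Python, not Olifant's code: the function names `to_storage` and `to_display` are made up for the example, and it assumes the string holds plain text only (no TMX inline tags).

```python
from xml.sax.saxutils import escape, unescape

# escape() handles &, < and > by default; quotes are added here as extras.
_QUOTES = {'"': "&quot;", "'": "&apos;"}
_UNQUOTES = {entity: char for char, entity in _QUOTES.items()}


def to_storage(display_text: str) -> str:
    """Hypothetical helper: escape reserved chars just before saving."""
    return escape(display_text, _QUOTES)


def to_display(stored_text: str) -> str:
    """Hypothetical helper: turn entities back into readable chars for the GUI."""
    return unescape(stored_text, _UNQUOTES)


if __name__ == "__main__":
    typed = 'if a < b & c > d then say "hi"'
    stored = to_storage(typed)
    print(stored)
    print(to_display(stored) == typed)
```

The ampersand is escaped first and unescaped last, so a string that already contains something like `&lt;` as literal text still survives the round trip.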

The internal storage format of the data (TMX/XML) should not interfere with the VIEW for the users in the GUI.

For example, I don't care about MS Word's internal file format; what interests me is that it is capable of rendering a readable document.
I think XML formatting, including proper special character escaping, is Olifant's job, not the user's, and it should remain in the background. Users should never have to deal with &lt;, &gt; and the like.

Thanks again for your work, Eric

Discussion

Yves Savourel - 2011-12-15

Hi Eric,

#2971151 talks about control characters. Those are just not allowed (even escaped) in XML. The Import function works around this by doing a first pass that replaces such XML-forbidden characters with a text marker, after which the document can be loaded.
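As a rough idea of what such a first pass does, here is a small Python sketch. It is not the actual Import code: the marker format `[U+XXXX]` and the name `mark_forbidden` are assumptions made for the example. The character ranges are the ones the XML 1.0 specification allows.

```python
import re

# Anything outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_FORBIDDEN = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def mark_forbidden(text: str) -> str:
    """Replace each XML-forbidden character with a visible text marker.

    The marker format is an assumption for this sketch.
    """
    return _FORBIDDEN.sub(lambda m: "[U+%04X]" % ord(m.group()), text)


if __name__ == "__main__":
    raw = "first\x01second\ttabbed\x0cend"
    print(mark_forbidden(raw))
```

This only covers raw characters. Numeric character references to the same code points (for example `&#x1;`) are also rejected by XML 1.0 parsers and would need their own replacement rule.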

I think your concern is more about having to deal with meta characters such as < or & in the Olifant interface, and expecting them to be escaped automatically.

I agree that, ideally, Olifant should present the user with a much cleaner segment content where the inline codes are protected and any special character escaped/unescaped automatically.

It would require a much more complicated back-end that simply was not part of the initial requirements/design of Olifant.

We are currently working on the successor of the .NET version of Olifant, and we will try very hard to provide a cleaner edit interface to the user. But I have to say that such a goal is very hard to achieve. It also comes with a large computing cost when doing batch processing. We're testing different ways to implement a good compromise and hopefully will have a solution.

The early alpha version of this new Olifant is available in the latest snapshot of the Okapi tools here: http://www.opentag.com/okapi/wiki/index.php?title=Main_Page.

cheers,
-yves


Eric Voisard - 2011-12-23

Merci Yves.
I understand your points, and you're right: I'm talking about meta characters such as < and &.

I'm a programmer, but the rest of the team, which actually does the translation jobs with Olifant, is not. It happens that a colleague comes to me complaining that part of his translation has disappeared (only part of his TMX loads), so I have to open the TMX file and hunt down the special character that was not escaped when he saved his file and that causes the XML parsing error at read time.

Maybe a complete rework of this part of Olifant would be very hard, but how about a simple regexp to replace the &, <, >, \, ', " metacharacters with &amp;, &lt;, etc. entities just before saving a string into the TMX file? The reverse process could be used just after reading a string from the file.

There are also other (and more automated) ways to do that than using regular expressions in .NET: http://tinyurl.com/58r9az

But again, you're doing a great job and there probably are higher priority tasks waiting...
Thanks again, Eric


Yves Savourel - 2011-12-23

Hi Eric,

> ...but how about a simple Regexp to replace &, <, >, \, ', "
> metacharacters with &amp; and &lt; etc, entities just before
> to save a string into the TMX file. Reverse process could
> be used just after reading a string from the file.

The problem is that the TMX inline tags (<bpt>, etc.) would be in the way and get converted too. We would need to separate text and tags and represent the tags in some way.
That's what we are trying to do in the new Java version. Hopefully that will solve the issue.
cheers,
-yves
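
To illustrate the text/tag separation mentioned above, here is a minimal Python sketch. It is not how either Olifant version works, only one possible approach under stated assumptions: `escape_segment` is a made-up name, the segment is assumed to be in "display" form (inline tags written literally, all other text unescaped), and anything that looks like a TMX inline element (`bpt`, `ept`, `it`, `ph`, `hi`, `sub`, `ut`) is treated as a tag.

```python
import re
from xml.sax.saxutils import escape

# Matches opening, closing and empty TMX inline elements, e.g. <bpt i="1">, </ept>, <ph/>.
_INLINE_TAG = re.compile(r"(</?(?:bpt|ept|it|ph|hi|sub|ut)\b[^<>]*>)")


def escape_segment(segment: str) -> str:
    """Escape reserved chars in the text parts only, leaving inline tags as-is."""
    parts = _INLINE_TAG.split(segment)
    # With one capturing group, re.split() puts the matched tags at odd indexes.
    return "".join(
        part if index % 2 else escape(part) for index, part in enumerate(parts)
    )


if __name__ == "__main__":
    print(escape_segment('x < y <ph x="1"/> & z'))
    print(escape_segment('<bpt i="1"><b></bpt>bold<ept i="1"></b></ept>'))
```

The weak point is exactly the one raised in this thread: if a translator types a literal `<ph>` as text, the pattern cannot tell it apart from a real inline code, which is why a proper solution needs to represent the tags separately from the text.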
