OmegaT - multiplatform CAT tool / Bugs / #108 Structure of HTML not respected in output

Didier Briel - 2005-11-23

Source and target HTML file

HTML example.zip

Jean-Christophe Helary - 2005-11-23

Logged In: YES
user_id=915082

From what I have seen it looks like some extra spaces and line breaks are eaten
in the compile process.

Is there any other modified part ?

JC

Didier Briel - 2005-11-23

Logged In: YES
user_id=1343245

>Is there any other modified part ?
No, as far as I've seen, that's it.
But, in this particular case, that is a sufficient reason for me
not to use OmegaT.

And it's always annoying to have a regression.

Jean-Christophe Helary - 2005-11-23

Logged In: YES
user_id=915082

Didier,

I am not questioning your request :) I think you are totally right to expect
OmT not to modify the code.

I think the 1.4.6 beta I used in July did not have this behavior.

I just meant to have a confirmation of what I have only visually checked.

JC

Didier Briel - 2005-11-23

Logged In: YES
user_id=1343245

>I think the 1.4.6 beta I used in July did not have this
>behavior.
Yes, it was still OK the week-end between July and August.
I know, because I used it on a "last minute" job.
But I don't remember the exact version, which is why I
haven't mentioned it.

Maxym Mykhalchuk - 2005-11-30

assigned_to: nobody --> mihmax

status: open --> open-accepted

Maxym Mykhalchuk - 2005-11-30

Logged In: YES
user_id=488500

Let's consider this example:
== in HTML
<title>
this string will have
compressed space
</title>
== in OmegaT currently
this string will have compressed space
==

Is it wrong to:
1. remove newlines/other space before & after the meaningful
text?
2. compress space inside a segment?
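
To make the two questions concrete, here is a minimal Python sketch of both operations applied to the title text above. The function names are hypothetical and this is only an illustration, not OmegaT's actual filter code.

```python
import re

def trim_outer_whitespace(text: str) -> str:
    """Question 1: drop newlines and spaces before and after the meaningful text."""
    return text.strip()

def compress_inner_whitespace(text: str) -> str:
    """Question 2: replace every run of whitespace inside the text with one space."""
    return re.sub(r"\s+", " ", text)

# Content of the <title> element from the example, newlines included.
title_content = "\nthis string will have\ncompressed space\n"

segment = compress_inner_whitespace(trim_outer_whitespace(title_content))
print(segment)
```

Applying both steps turns the three-line title content into a single-line segment, which is the behaviour being questioned.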

Jean-Christophe Helary - 2005-11-30

Logged In: YES
user_id=915082

If I am not mistaken, the example you show is not exactly what Didier had in mind.
Still, my answer is:

1) white space within a translatable segment should be left as is and the
translator should decide whether it keeps it or not.

2) white space within the code should be kept as is since the author may have
tools that work on the html structure and depend on the html being formatted a
specific way.

Didier Briel - 2005-11-30

Logged In: YES
user_id=1343245

I agree with JC:
"white space within a translatable segment should be left
as is and the translator should decide whether it keeps it
or not."
Of course. Let's take the American habit, for instance, of
putting 2 spaces after a dot. E.g.,
First sentence.  Second sentence.

Will you remove the extra space?

"white space within the code should be kept as is since the
author may have tools that work on the html structure and
depend on the html being formatted a specific way."
The code can even be laid out in a specific way for the sake
of human maintenance.
If the client gives me code such as
<a href="xxxx"
alt="ddd"
title="fff">
The visible text</a>
I see no reason to give it back to him with everything
compressed on a single line.

The example you gave
<title>
The title here
</title>
is also a classic one.

Maxym Mykhalchuk - 2005-11-30

Logged In: YES
user_id=488500

JC,
2) is already there, HTML filter respects the non-text HTML
portions as much as it can...

1) it will be so ugly :-(

Jean-Christophe Helary - 2005-11-30

Logged In: YES
user_id=915082

:)

Well, about 2) the project Didier uploaded seems to have problems, I tested it
and would not have been satisfied with the output.

The reason is: if I produce HTML that is different from what I received, the
client's webmaster is going to ask me why I changed the code when I am
only supposed to translate. In some cases, the client would refuse to pay for
parts of my work that require corrections on his side.

As for 1), well people are weird sometimes, let them be weird :)

Maxym Mykhalchuk - 2005-11-30

Logged In: YES
user_id=488500

JC, about formatting outside meaningful text, what were the
problems? I just tested (created target documents without
translating anything), and I found no structural changes...

I'm convinced, and will change 1.6.RC4 filter accordingly,
look out in ~1 hour.

Nobody/Anonymous - 2005-12-03

Logged In: NO

Compressing whitespace in translatable text:

First, the HTML code structure in RC4 is perfectly
respected, it's really nice.

Secondly, I think I was wrong about not compressing white
space *in translatable text* (example: this string
will have
compressed space).

First, it goes against HTML specifications:
http://www.w3.org/TR/html401/struct/text.html#h-9.1
"In particular, user agents should collapse input white
space sequences when producing output inter-word space.
This can and should be done"

Which means my mentioning of two spaces after a dot in US
English (First sentence. Next sentence) was wrong, since
user agents will keep only one space.

Here, as far as translatable text is concerned, I think we
can consider OmegaT as a user agent.

Secondly, very often, newlines in translatable text are
not introduced by users, but by tools, for instance every
80 characters.

Third, since the spaces and newlines are part of the
segment, they are recorded as such in the TMX (expected
behaviour). As a consequence, two totally identical
segments (but with newlines in different positions, because
of random newlines introduced by the input tool) will not
be recognised as the same segment.
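
The matching problem can be sketched in a few lines of Python. This is only an illustration: the two dictionaries stand in for a translation memory, `normalize` and `store` are hypothetical names, and none of this reflects how OmegaT actually stores or looks up segments.

```python
import re

def normalize(segment: str) -> str:
    """Collapse every run of whitespace to one space and trim both ends."""
    return re.sub(r"\s+", " ", segment).strip()

# Two toy "translation memories": one keyed on the raw segment text,
# one keyed on the whitespace-normalized text.
raw_memory = {}
normalized_memory = {}

def store(source: str, target: str) -> None:
    raw_memory[source] = target
    normalized_memory[normalize(source)] = target

store("The quick brown fox\njumps over the lazy dog.",
      "Le renard brun rapide saute par-dessus le chien paresseux.")

# Same sentence, but the input tool wrapped the line at a different place.
same_sentence = "The quick brown\nfox jumps over the lazy dog."

raw_hit = raw_memory.get(same_sentence)                           # exact lookup fails
normalized_hit = normalized_memory.get(normalize(same_sentence))  # lookup succeeds
```

With raw keys the second occurrence is treated as a new segment; with normalized keys it is found as an exact match.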

My apologies for giving (what I think was) wrong
feedback.

Jean-Christophe Helary - 2005-12-03

Logged In: YES
user_id=915082

I don't think we should consider OmT a user agent. OmT is not here to
display HTML, but to parse its contents and give write access to them. That is not
what a user agent does. A user agent must make the contents readable, but
it does not modify the source; it just displays it in a readable
way. OmT cannot merely display the source without modifying it, so it is not a
UA.

JC

Didier Briel - 2005-12-05

Logged In: YES
user_id=1343245

JC,
you might be right about OmegaT not being a UA.

Anyway, the point was just that a source HTML file
containing random newlines in the translatable text every
30/40 characters is hard to work with on screen, and
the TM becomes much less useful, because identical
sentences will not have newlines in the same places.

Didier

Jean-Christophe Helary - 2005-12-05

Logged In: YES
user_id=915082

Well, one must assume that the html was not entirely generated by a
chimpanzee typing random keys on a computer ;) If the structure of a segment
is weird for the translator, it may not be weird for the writer, or the code
manager.

Maybe a way to solve the problem would be to have OmegaT add segments of
its own when dealing with extra white space. OOo does that for strings of more
than 1 space etc. I could imagine an html OmegaT tag like <s> for space and
<nl> for newline, <t> for tabs etc... but I suppose that would slightly add to
the parsing load.
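
A rough Python sketch of that idea follows. The placeholder names `<s>`, `<nl>` and `<t>` come from the suggestion above; the encoding rule (a single leading space in a run stays a plain space, everything else becomes a placeholder) and the function names are assumptions made for illustration. Note that `<s>` is also a real HTML element, so an actual implementation would need placeholders that cannot collide with the source.

```python
import re

def encode_whitespace(text: str) -> str:
    """Show extra whitespace as visible placeholders.

    Assumed rule: the first character of a whitespace run stays a plain
    space if it is a space; every other space becomes <s>, every newline
    <nl>, every tab <t>.
    """
    def replace_run(match):
        out = []
        for i, ch in enumerate(match.group(0)):
            if ch == " ":
                out.append(" " if i == 0 else "<s>")
            elif ch == "\n":
                out.append("<nl>")
            else:  # the pattern only matches space, newline and tab
                out.append("<t>")
        return "".join(out)
    return re.sub(r"[ \n\t]+", replace_run, text)

def decode_whitespace(text: str) -> str:
    """Turn the placeholders back into the original whitespace characters."""
    return text.replace("<nl>", "\n").replace("<t>", "\t").replace("<s>", " ")

original = "First sentence.  Second sentence.\n\tIndented line"
encoded = encode_whitespace(original)
restored = decode_whitespace(encoded)
```

The translator would see the encoded form and could keep or drop each placeholder, while an untouched segment would round-trip to the original whitespace.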

Maxym Mykhalchuk - 2005-12-07

Logged In: YES
user_id=488500

So what do you think, is it useful to compress spaces?

Jean-Christophe Helary - 2005-12-07

Logged In: YES
user_id=915082

If OmegaT were a user agent, compressing would be a _must_. Since it is not, OmegaT should really respect
whatever space is wherever it is. Otherwise we'll find exceptional cases where
the HTML is broken one way or another because of space compression.

Didier Briel - 2005-12-07

Logged In: YES
user_id=1343245

>So what do you think, is it useful to compress spaces?
Me, yes! (inside translatable text, of course).
I have given JC an example of an HTML file that is really
painful to translate without compressing spaces.
I can send it to you privately if needed.

Didier

Jean-Christophe Helary - 2005-12-07

Logged In: YES
user_id=915082

OK, after considering the file Didier sent me (sorry I did not do that earlier)
and the fact that the translation _will_ get rid of the useless white space _in_ the
translatable part (since it is not displayed anyway), I think we can go ahead.

Line breaks in the translatable text included of course.

All the stuff that is _not_ a translatable string should be left with the same
indentation (either tabs or spaces) and same vertical spacing (end of lines etc).
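
The rule agreed on here can be sketched in Python as follows. This is a deliberately simplified illustration, not OmegaT's HTML filter: the regex tokenizer ignores comments, scripts, `<pre>` blocks and attribute values containing `>`, the function name is hypothetical, and keeping the whitespace that surrounds each text run untouched is an assumption about how "same indentation and vertical spacing" would be honoured.

```python
import re

# Anything between "<" and ">" is treated as markup (simplifying assumption).
TAG_PATTERN = re.compile(r"(<[^>]*>)")

def compress_translatable_text(html: str) -> str:
    """Collapse whitespace inside text runs only; copy tags verbatim."""
    parts = TAG_PATTERN.split(html)  # odd indices hold the captured tags
    result = []
    for index, part in enumerate(parts):
        if index % 2 == 1 or not part.strip():
            # Tags and whitespace-only layout between tags stay as they are.
            result.append(part)
            continue
        leading = part[:len(part) - len(part.lstrip())]
        trailing = part[len(part.rstrip()):]
        inner = re.sub(r"\s+", " ", part.strip())
        result.append(leading + inner + trailing)
    return "".join(result)

source = '<a href="xxxx"\n   alt="ddd"\n   title="fff">\nThe visible\n   text</a>\n'
output = compress_translatable_text(source)
```

In this sketch the multi-line attribute layout of the `<a>` tag survives unchanged, while the line break and indentation inside the visible text are reduced to a single space.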

Maxym Mykhalchuk - 2005-12-13

status: open-accepted --> closed-fixed

Maxym Mykhalchuk - 2005-12-13

Logged In: YES
user_id=488500

OK, then I'm closing this as fixed in RC4

Didier Briel - 2005-12-13

Logged In: YES
user_id=1343245

>OK, then I'm closing this as fixed in RC4
There might be a misunderstanding with respect to the comments
below.

>So what do you think, is it useful to compress spaces?
I answered yes, and JC finally changed his mind and
answered yes, too.

So, since the 1.6 RC 4 is *not* compressing spaces inside
translatable text, I'm not sure whether this bug should be
closed.

Didier

Didier Briel - 2006-02-10

Logged In: YES
user_id=1343245

Maxym,

I'm reopening, since the closing was not clear to me (and
there was no answer to my last comment).

The "structure not respected" is solved.
But we concluded (JC and I) that whitespace should be
compressed (as you did initially) inside translatable
text.
The bug was closed and whitespace is not compressed.

If you think that's a different issue: should I file an
RFE?

Didier

Didier Briel - 2006-02-10

status: closed-fixed --> open-fixed