
#108 Structure of HTML not respected in output

1.6
closed-fixed
None
6
2006-02-22
2005-11-23
No

1.6 RC3
The structure of an HTML file (newlines) is not respected
in the output file. It was working in 145 04 and
later; I don't know when it started to be broken
again.

I have included an example of an initial HTML file,
and the result produced by "Create target documents".

Discussion

1 2 > >> (Page 1 of 2)
  • Didier Briel

    Didier Briel - 2005-11-23

    Source and target HTML file

     
  • Jean-Christophe Helary

    Logged In: YES
    user_id=915082

    From what I have seen, it looks like some extra spaces and line breaks are eaten
    in the compile process.

    Is there any other modified part?

    JC

     
  • Didier Briel

    Didier Briel - 2005-11-23

    Logged In: YES
    user_id=1343245

    >Is there any other modified part ?
    No, as far as I've seen, that's it.
    But, in this particular case, a sufficient reason for me
    to not use OmegaT.

    And it's always annoying to have a regression.

     
  • Jean-Christophe Helary

    Logged In: YES
    user_id=915082

    Didier,

    I am not questioning your request :) I think you are totally right to expect
    OmT not to modify the code.

    I think the 1.4.6 beta I used in July did not have this behavior.

    I just meant to have a confirmation of what I have only visually checked.

    JC

     
  • Didier Briel

    Didier Briel - 2005-11-23

    Logged In: YES
    user_id=1343245

    >I think the 1.4.6 beta I used in July did not have this
    >behavior.
    Yes, it was still OK the week-end between July and August.
    I know, because I used it on a "last minute" job.
    But I don't remember exactly the version, that's why I
    haven't mentioned it.

     
  • Maxym Mykhalchuk

    • assigned_to: nobody --> mihmax
    • status: open --> open-accepted
     
  • Maxym Mykhalchuk

    Logged In: YES
    user_id=488500

    Let's consider this example:
    == in HTML
    <title>
    this string will have
    compressed space
    </title>
    == in OmegaT currently
    this string will have compressed space
    ==

    Is it wrong to:
    1. remove newlines/other space before & after the meaningful
    text?
    2. compress space inside a segment?
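
The two operations in the question above can be told apart with a small sketch. This is only an illustration in Python, not OmegaT's filter code; the function names `trim_outer` and `collapse_inner` are hypothetical, and the sample string assumes the title contained a run of several spaces and line breaks.

```python
import re

# White space as HTML 4.01 treats it in text: space, tab, form feed, line breaks.
# (The zero-width space is left out of this sketch for simplicity.)
_WS_CHARS = " \t\r\n\f"
_WS_RUN = re.compile(r"[ \t\r\n\f]+")


def trim_outer(text: str) -> str:
    """Option 1: drop white space before and after the meaningful text."""
    return text.strip(_WS_CHARS)


def collapse_inner(text: str) -> str:
    """Option 2: replace every run of white space with a single space."""
    return _WS_RUN.sub(" ", text)


# Hypothetical content of a <title> element spread over several lines.
title = "\nthis string   will have\ncompressed space\n"

only_trimmed = trim_outer(title)      # inner newline and space run survive
only_collapsed = collapse_inner(title)  # outer white space becomes one space
both = trim_outer(collapse_inner(title))
```

Applying both steps gives the single-line segment shown under "in OmegaT currently"; applying neither would hand the translator the text exactly as it sits in the file.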

     
  • Jean-Christophe Helary

    Logged In: YES
    user_id=915082

    If I am not wrong, the example you show is not exactly what Didier had in mind,
    still, my answer is:

    1) white space within a translatable segment should be left as is and the
    translator should decide whether it keeps it or not.

    2) white space within the code should be kept as is since the author may have
    tools that work on the html structure and depend on the html being formatted a
    specific way.

     
  • Didier Briel

    Didier Briel - 2005-11-30

    Logged In: YES
    user_id=1343245

    I agree with JC:
    "white space within a translatable segment should be left
    as is and the translator should decide whether it keeps it
    or not."
    Of course. Let's take the American habit, for instance, of
    putting 2 spaces after a dot. E.g.,
    First sentence.  Second sentence.

    Will you remove the extra space?

    "white space within the code should be kept as is since the
    author may have tools that work on the html structure and
    depend on the html being formatted a specific way."
    The code can even be laid out in a specific way because
    of human maintenance.
    If the client gives me a code such as
    <a href="xxxx"
    alt="ddd"
    title="fff">
    The visible test</a>
    I see no reasons to give him back everything compressed on
    a single line.

    The example you gave
    <title>
    The title here
    </title>
    is also a classic one.

     
  • Maxym Mykhalchuk

    Logged In: YES
    user_id=488500

    JC,
    2) is already there, HTML filter respects the non-text HTML
    portions as much as it can...

    1) it will be so ugly :-(

     
  • Jean-Christophe Helary

    Logged In: YES
    user_id=915082

    :)

    Well, about 2) the project Didier uploaded seems to have problems, I tested it
    and would not have been satisfied with the output.

    The reason is: if I produce an html that is different from the one I got, the
    client's webmaster is going to ask me why I changed the code when I am
    only supposed to translate. In some cases, the client would refuse to pay for
    parts of my work that require corrections on his side.

    As for 1), well people are weird sometimes, let them be weird :)

     
  • Maxym Mykhalchuk

    Logged In: YES
    user_id=488500

    JC, about formatting outside meaningful text, what were the
    problems? I just tested (created target documents without
    translating anything), and I found no structural changes...

    I'm convinced, and will change 1.6.RC4 filter accordingly,
    look out in ~1 hour.

     
  • Nobody/Anonymous

    Logged In: NO

    Compressing whitespace in translatable text:

    First, the HTML code structure in RC4 is perfectly
    respected, it's really nice.

    Secondly, I think I was wrong about not compressing white
    space *in translatable text* (example: this string
    will have
    compressed space).

    First, it goes against HTML specifications:
    http://www.w3.org/TR/html401/struct/text.html#h-9.1
    "In particular, user agents should collapse input white
    space sequences when producing output inter-word space.
    This can and should be done"

    Which means my mentioning of two spaces after a dot in US
    English (First sentence.  Next sentence) was wrong, since
    user agents will keep only one space.

    Here, as far as translatable text is concerned, I think we
    can consider OmegaT as a user agent.

    Secondly, very often, newlines in translatable text are
    not introduced by users, but by tools, for instance every
    80 characters.

    Third, since the spaces and newlines are part of the
    segment, they are recorded as such in the TMX (expected
    behaviour). As a consequence, two totally identical
    segments (but with newlines in different positions, because
    of random newlines introduced by the input tool) will not
    be recognised as the same segment.
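
The third point can be illustrated with a toy lookup. This is a minimal sketch in Python that assumes an exact-match dictionary standing in for the translation memory; real TM matching in OmegaT is fuzzy and is not modelled here, and the sample sentences and the `normalize` helper are made up for the example.

```python
import re

_WS_RUN = re.compile(r"[ \t\r\n\f]+")


def normalize(segment: str) -> str:
    """Collapse white space runs to one space and trim both ends."""
    return _WS_RUN.sub(" ", segment).strip(" \t\r\n\f")


# The same sentence, wrapped at different columns by two hypothetical tools.
wrapped_a = "The quick brown fox\njumps over the lazy dog."
wrapped_b = "The quick brown\nfox jumps over the lazy dog."
translation = "Le rapide renard brun saute par-dessus le chien paresseux."

# Memory keyed on the raw segment: the newline position is part of the key.
raw_tm = {wrapped_a: translation}
raw_hit = wrapped_b in raw_tm

# Memory keyed on the normalized segment: the wrapping no longer matters.
norm_tm = {normalize(wrapped_a): translation}
norm_hit = normalize(wrapped_b) in norm_tm
```

With raw keys the second wrapping misses the stored translation; with normalized keys both wrappings map to the same entry.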

    My apologies for giving (what I think was) wrong
    feedback.

     
  • Jean-Christophe Helary

    Logged In: YES
    user_id=915082

    I don't think we should consider OmT as a user agent. OmT is not here to
    display html, but to parse its contents and give write access to it. This is not
    what a user agent does. A user agent must make the contents readable, but
    similarly, the UA does not modify the source, it just displays it in a readable
    way. OmT cannot simply display the source without modifying it. It is not a
    UA.

    JC

     
  • Didier Briel

    Didier Briel - 2005-12-05

    Logged In: YES
    user_id=1343245

    JC,
    you might be right about OmegaT not being a UA.

    Anyway, the point was just that, with a source HTML
    containing random newlines in the translatable text every
    30/40 characters, it is hard to use (on the screen), plus
    the usefulness of the TM is much less, because identical
    sentences will not have newlines at the same place.

    Didier

     
  • Jean-Christophe Helary

    Logged In: YES
    user_id=915082

    Well, one must assume that the html was not entirely generated by a
    chimpanzee typing random keys on a computer ;) If the structure of a segment
    is weird for the translator, it may not be weird for the writer, or the code
    manager.

    Maybe a way to solve the problem would be to have OmegaT add segments of
    its own when dealing with extra white space. OOo does that for strings of more
    than 1 space etc. I could imagine an html OmegaT tag like <s> for space and
    <nl> for newline, <t> for tabs etc... but I suppose that would slightly add to
    the parsing load.
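
The placeholder idea above could look roughly like the following. It is a sketch in Python under several assumptions: the tag names `<s>`, `<nl>` and `<t>` are taken from the post, the functions `encode_ws` and `decode_ws` are hypothetical, and the source text is assumed not to contain those tokens literally (`<s>` is also a real HTML element, so an actual filter would need names that cannot collide).

```python
import re


def encode_ws(text: str) -> str:
    """Make extra white space visible as placeholder tags."""
    text = text.replace("\n", "<nl>").replace("\t", "<t>")
    # Keep the first space of a run; mark each additional space with <s>.
    return re.sub(r"(?<= ) ", "<s>", text)


def decode_ws(text: str) -> str:
    """Turn the placeholder tags back into the original characters."""
    return text.replace("<s>", " ").replace("<nl>", "\n").replace("<t>", "\t")


sample = "First sentence.  Second\tsentence\nhere."
encoded = encode_ws(sample)
restored = decode_ws(encoded)
```

The translator would see the extra space, tab and newline as tags and could keep, move or drop them, while an untouched segment round-trips to the original text.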

     
  • Maxym Mykhalchuk

    Logged In: YES
    user_id=488500

    So what do you think, is it useful to compress spaces?

     
  • Jean-Christophe Helary

    Logged In: YES
    user_id=915082

    If OmegaT were a user agent, compressing would be a _must_. But OmegaT should really
    respect whatever space is wherever it is. Otherwise we'll find exceptional cases where
    the html is broken one way or another because of space compression.

     
  • Didier Briel

    Didier Briel - 2005-12-07

    Logged In: YES
    user_id=1343245

    >So what do you think, is it useful to compress spaces?
    Me, yes! (inside translatable text, of course).
    I have given JC an example of an HTML file that is really
    painful to translate without compressing spaces.
    I can send it to you privately if needed.

    Didier

     
  • Jean-Christophe Helary

    Logged In: YES
    user_id=915082

    ok, after considering the file Didier sent me (sorry I have not done that earlier)
    and the fact that the translation _will_ get rid of the useless white space _in_ the
    translatable part (since it is not displayed anyway), I think we can go ahead.

    Line breaks in the translatable text included of course.

    All the stuff that is _not_ a translatable string should be left with the same
    indentation (either tab or space) and same vertical spacing (end of lines etc).
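
The behaviour agreed on here can be sketched as follows. This is an illustration in Python, not the actual HTML filter; it assumes that the white space directly before and after a piece of text counts as layout and is kept untouched, and the function name `compress_translatable` is hypothetical.

```python
import re

_WS_CHARS = " \t\r\n\f"
_WS_RUN = re.compile(r"[ \t\r\n\f]+")


def compress_translatable(text: str) -> str:
    """Collapse white space inside the text, keep the surrounding layout."""
    core = text.strip(_WS_CHARS)
    if not core:
        # White space only: nothing to translate, leave it exactly as is.
        return text
    lead = text[: len(text) - len(text.lstrip(_WS_CHARS))]
    trail = text[len(text.rstrip(_WS_CHARS)):]
    return lead + _WS_RUN.sub(" ", core) + trail


# Text found between <title> and </title>, indented by the author.
node = "\n    The title\n    goes   here\n"
result = compress_translatable(node)
```

The indentation before the text and the line break after it survive, so the tags stay on their own lines in the target file, while the translator gets "The title goes here" as one clean segment.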

     
  • Maxym Mykhalchuk

    • status: open-accepted --> closed-fixed
     
  • Maxym Mykhalchuk

    Logged In: YES
    user_id=488500

    OK, then I'm closing this as fixed in RC4

     
  • Didier Briel

    Didier Briel - 2005-12-13

    Logged In: YES
    user_id=1343245

    >OK, then I'm closing this as fixed in RC4
    There might be a misunderstanding wrt the comments
    below.

    >So what do you think, is it useful to compress spaces?
    I answered yes, and JC finally changed his mind, and
    answered yes, too.

    So, since the 1.6 RC 4 is *not* compressing spaces inside
    translatable text, I'm not sure whether this bug should be
    closed.

    Didier

     
  • Didier Briel

    Didier Briel - 2006-02-10

    Logged In: YES
    user_id=1343245

    Maxym,

    I'm reopening, since the closing was not clear to me (and
    there was no answer to my last comment).

    The "structure not respected" is solved.
    But we concluded (JC and I) that whitespace should be
    compressed (as you did initially) inside translatable
    text.
    The bug was closed and whitespace is not compressed.

    If you think that's a different issue: should I file an
    RFE?

    Didier

     
  • Didier Briel

    Didier Briel - 2006-02-10
    • status: closed-fixed --> open-fixed
     