#879 tidy not nesting tags correctly

Status: closed-out-of-date
Owner: nobody
Labels: None
Priority: 5
Updated: 2016-03-06
Created: 2008-04-14
Creator: Anonymous
Private: No

When I try to tidy this:

<p><b><i><u><strike>Bold, italics, strike-through, underline</b></i></u></strike> back to normal</p>

I get this:

<p><b><i><u><strike>Bold, italics, strike-through,
underline</strike></u> back to normal</i></b></p>

I'm using HTML Tidy for Windows (vers 22 March 2008) which I compiled manually.

I tried this with an old version (HTML Tidy for Windows released on 12 April 2005) and it seems to work correctly.

Discussion

  • Geoff

    Geoff - 2008-04-15

    Logged In: YES
    user_id=1408861
    Originator: NO

    Tuesday, April 15, 2008.

    Hi Mr/Ms nobody,

    It is certainly agreed that the latest Tidy (circa 22 March 2008) deals VERY BADLY with this VERY ERRANT html, while browsers tend to render/display it correctly.

    Going back to a 2005 code version is not a solution, as many other bugs have been fixed in the meantime. But if this old version works for you, then good luck ;=))

    See http://tidy.sf.net/issue/1426419, and see the patches http://tidy.sf.net/issue/1426424 for some more discussion on this issue ... Some SMALL effort had been made to allow Tidy to 'fix' SOME SMALL errors of this type - namely OUT-OF-ORDER inline elements.

    And it is VERY MUCH agreed that the warning text issued is highly misleading, saying something like -
    1942407.htm:7:13: Warning: replacing unexpected b by </b>
    It should read more like -
    1942407.htm:7:13: Warning: replacing unexpected b by </strike>
    but that is another question.

    I am attempting to trace, and understand the process, and see if a more extensive patch can be found that fixes things further ...

    Please excuse some programming notes. I want to record this carefully, in case there are other programmers out there who would like to try to help with a solution ...

    The case of OUT-OF-ORDER inline elements, like :-
    <p>
    <b>
    <i>
    <u>
    <strike>
    Bold, italics, strike-through, underline.
    </b>
    </i>
    </u>
    </strike>
    back to normal
    </p>

    It starts in ParseInline( ... , node * element, ... ), where the 'element' is the last <strike> tag ... ParseInline does a GetToken(), where the 'plain text' is absorbed.

    On finding the '<' of the </b>, the lexer state changes to LEX_GT ... then get next char ... On finding the '/' of the </b>, in case LEX_GT, the lexer reads the next char ...

    If it is a letter, in this case a 'b', then the lexer is backed up 3 chars, the character is 'un-got', and the lexer state is set to LEX_ENDTAG.

    Since there is some 'text' before this, a lexer token is created, and returned to ParseInline(), which appends this text node to the tree, and continues ...

    The next GetToken() call gets the 'b' off the un-got stack, and is parsed in LEX_ENDTAG state ... the tag text is added to the lexer until a non-tag name character is encountered, a new EndTag node is created, with the tag added, the lexer state is set to LEX_CONTENT, and the node returned to ParseInline() ...
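    The two-step hand-off described above (first return the pending text, then return the end tag on the next call) can be sketched with a much smaller tokenizer. This is only an illustration under stated assumptions: MiniLexer, Token, TokKind and next_token are hypothetical names, not part of Tidy, and instead of Tidy's un-get stack and LEX_GT / LEX_ENDTAG states the sketch simply leaves the '<' unread when it returns a text run. Malformed input such as a stray '<' is not handled.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token kinds; Tidy's real lexer has many more states. */
typedef enum { TOK_EOF, TOK_TEXT, TOK_START_TAG, TOK_END_TAG } TokKind;

typedef struct {
    TokKind kind;
    char    text[64];   /* text run, or tag name without brackets */
} Token;

typedef struct {
    const char *src;    /* NUL-terminated input */
    size_t      pos;    /* index of the next unread character */
} MiniLexer;

/* Return the next token. A pending text run is returned first; the '<'
   that ended it stays unread, so the following call parses the tag. */
Token next_token(MiniLexer *lx)
{
    Token tok = { TOK_EOF, "" };
    size_t n = 0;
    const char *s;

    assert(lx != NULL && lx->src != NULL);
    s = lx->src;

    if (s[lx->pos] == '\0')
        return tok;

    if (s[lx->pos] != '<') {
        /* Text run: copy up to, but not including, the next '<'. */
        tok.kind = TOK_TEXT;
        while (s[lx->pos] != '\0' && s[lx->pos] != '<') {
            if (n < sizeof tok.text - 1)
                tok.text[n++] = s[lx->pos];
            lx->pos++;
        }
        tok.text[n] = '\0';
        return tok;
    }

    /* At '<': a following '/' marks an end tag. */
    lx->pos++;
    if (s[lx->pos] == '/') {
        tok.kind = TOK_END_TAG;
        lx->pos++;
    } else {
        tok.kind = TOK_START_TAG;
    }

    /* Collect the tag name until a non-name character is met. */
    while (isalnum((unsigned char)s[lx->pos])) {
        if (n < sizeof tok.text - 1)
            tok.text[n++] = s[lx->pos];
        lx->pos++;
    }
    tok.text[n] = '\0';

    /* Skip anything else up to and including the closing '>'. */
    while (s[lx->pos] != '\0' && s[lx->pos] != '>')
        lx->pos++;
    if (s[lx->pos] == '>')
        lx->pos++;
    return tok;
}
```

    Calling next_token() repeatedly on "underline.</b>" would first hand back the text run and only then the end tag for 'b', which is the order in which ParseInline() sees them in the walk-through above.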

    This node is the </b>, so it is marked as a type = EndTag, so the current 'element', <strike> is INLINE, and this next 'node', </b> is INLINE, and as the comment states -

    /* allow any inline end tag to end current element */

    If this was </strike> then there would be NO PROBLEM, but it is in fact the close of a stacked element, <b> ... and some effort has been made to allow Tidy to tolerate, to a degree, such out-of-order inline closing ... but presently only _VERY_ minimally ... see if I can carefully extend it ...

    So, the stacked inline tags are exchanged, the <strike> for the <b>, which gets past the first problem. But the warning message must also be addressed as mentioned above. The token </b> is put back, and a close of 'strike', this current element, is done - that is a return from this ParseInline() ...

    This returns to ParseTag(), which returns to ParseInline(), where the 'element' is the previous <u>, and we cycle to GetToken() ... which gets the put-back token </b> ... so again checks the stacked elements to see if a 'switch' can be made, but this time this FAILS - we are checking switch 'u' with 'b' ...

    But the present way the 'search' is coded, it first checks back in the stack for the element, and then further back in the stack for the node, but in this second case, the node is in front of the element (or vice versa ;=) ... so the search for the switch partners must be both forward and backwards ...

    So I change the code to do independent stack searches ... now in the <u> and </b> case the stack is again switched successfully, and we return from ParseInline(), back to the <b> element. Now (node->tag == element->tag && node->type == EndTag) is true, so the node is freed, and we again return from ParseInline().
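    The difference between the two search strategies can be shown on a toy stack of tag names. This is a sketch only, and it rests on assumptions: the array of strings stands in for Tidy's inline stack, index 0 is taken to be the oldest entry, and find_tag, swap_ordered and swap_independent are hypothetical names that do not appear in the Tidy source.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Search entries [0, depth) from the top down; return the index or -1. */
static int find_tag(const char *stack[], int depth, const char *tag)
{
    int i;

    assert(stack != NULL && tag != NULL);
    for (i = depth - 1; i >= 0; --i)
        if (strcmp(stack[i], tag) == 0)
            return i;
    return -1;
}

/* Ordered search: find the element, then look for the node only further
   back in the stack. Returns 1 if the two entries were swapped. */
int swap_ordered(const char *stack[], int depth,
                 const char *element, const char *node)
{
    int e = find_tag(stack, depth, element);
    int n;
    const char *tmp;

    if (e < 0)
        return 0;
    n = find_tag(stack, e, node);       /* entries below the element only */
    if (n < 0)
        return 0;
    tmp = stack[e]; stack[e] = stack[n]; stack[n] = tmp;
    return 1;
}

/* Independent search: look for each partner over the whole stack, so it
   no longer matters which of the two sits in front of the other. */
int swap_independent(const char *stack[], int depth,
                     const char *element, const char *node)
{
    int e = find_tag(stack, depth, element);
    int n = find_tag(stack, depth, node);
    const char *tmp;

    if (e < 0 || n < 0 || e == n)
        return 0;
    tmp = stack[e]; stack[e] = stack[n]; stack[n] = tmp;
    return 1;
}
```

    With a toy stack of b, i, u, strike, the ordered search handles the first exchange (element 'strike', node 'b'). For the second exchange (element 'u', node 'b') the 'b' now sits in front of the 'u', so the ordered search gives up while the independent search still finds both partners.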

    This of course is back to the <p> element ... but the problem now is that as part of the above, some nodes were duplicated, so they could be propagated, but in this case they have all been closed, so I guess the 'duplication' process should take into account nodes that have been explicitly closed ...

    That is remove implicit node when an explicit close is encountered ... hmmm ... have to think about that ... and of course as the lexer moves on, it now emits warnings about discarding unexpected </u> and </strike> ... another hmmm ...

    But in no time at all we fall back to ParseBody() ... and back through ParseHTML() ...

    *** AND IT IS DONE ;=)) ***

    The output is -
    <p><b><i><u><strike>Bold, italics, strike-through,
    underline.</strike></u></i></b> back to normal</p>

    which is CLOSE TO PERFECT !!!

    Of course, the warning messages could still leave you scratching your head -

    line 7 column 13 - Warning: replacing unexpected b by </b>
    line 7 column 10 - Warning: replacing unexpected b by </b>
    line 7 column 7 - Warning: replacing unexpected b by </b>
    line 7 column 66 - Warning: inserting implicit <i>
    line 7 column 70 - Warning: discarding unexpected </u>
    line 7 column 74 - Warning: discarding unexpected </strike>
    line 7 column 66 - Warning: trimming empty <i>

    You will note the 'implicit' <i> that I was concerned about was pruned as an empty element ... more hmmms ...

    Anyway, time to test this patch further, but since it actually only extended the stack searching to forward and back, it is really the same as before, but now handles this particular HTML mess ;=))

    Unfortunately, out of time today to prepare the diff, and post it to patches, but will get around to this soonest ... If you want to 'test' my tidydev.exe, then I have posted a zipped copy on my site at :-

    http://geoffair.net/tidy/zips/tidydeve03.zip
    MD5 (tidydeve03.zip) = 473db0c232c16244be0ab000c6b951cd

    Be warned this has a slightly changed command line interface. For example, it does NOT do stdin, but other than that most command parameters are exactly the same, but can be given in any order, and it will also accept an @input_file, with options line separated, and will abort on any command line error ...

    This is my 'development' version, and so there are also some Word html tidying enhancements, and it has better javascript parsing ... aside from these few things, it is the same as Tidy CVS ... but the help, -?, does not reflect the above ...

    Have FUN ;=))

    Regards,

    Geoff.

    PS: Sorry for the LONG post ...

    EOF - 1942407-01.doc

     
  • Geoff

    Geoff - 2008-04-16

    Logged In: YES
    user_id=1408861
    Originator: NO

    See the patches http://tidy.sf.net/issue/1426424 where I have added a new file tidycvs04.patch, and all this has been written up on my site -
    http://geoffair.net/tidy/tidy_10.htm
    where a WIN32 test application can be downloaded ...

    Geoff.

     
  • Geoff

    Geoff - 2016-03-06

    Thanks for the report... now long ago... sorry for the delay...

    Tidy source has moved on to https://github.com/htacg/tidy-html5, site to http://www.html-tidy.org/

    Wow, I worked hard to get some patches into CVS tidy way back in 2006, see my patches, but on re-testing the simple original out of order sample modern tidy still outputs the same mess! I guess it must all be done again! If I get the time I will try to re-open this in modern tidy.

    Meantime, if you want to pursue this, or find another tidy bug, or have a feature request, then please file an issue, together with the sample html and config used, and if you find, fix, and test the feature in a tidy fork then you can issue a Pull Request.

    Tidy needs your support...

    Meantime closing this here as out-of-date...

     
  • Geoff

    Geoff - 2016-03-06
    • status: open --> closed-out-of-date
    • Group: --> Current - all platforms
     
