#814 Tidy API function tidyNodeGetText escapes output

closed-fixed
nobody
6
2014-08-21
2007-01-15
ayermakov
No

We use Tidy mainly through its API: we walk the parse tree, extract information about each node, and transform it into our own tree structure. The issue occurs when we call tidyNodeGetText() on nodes of type TidyNode_Text, TidyNode_Comment, and TidyNode_CDATA.

Consider the input:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
<HEAD>
<TITLE> New Document </TITLE>
<SCRIPT type="text/javascript">document.write("<STYLE>");</SCRIPT>
</HEAD>

<BODY>
Hello!
</BODY>
</HTML>

I walk the Tidy parse tree and print the text of each node (for simplicity). Here is a small function that walks the tree:

static void processNode(TidyDoc tdoc, TidyNode tnod)
{
    TidyNode child;

    /* Visit every child of tnod; recurse into start tags. */
    for ( child = tidyGetChild(tnod); child; child = tidyGetNext(child) )
    {
        TidyNodeType type = tidyNodeGetType(child);
        switch ( type )
        {
        case TidyNode_Comment:
            {
                TidyBuffer text = {0};
                tidyBufInit(&text);
                if (tidyNodeGetText(tdoc, child, &text))
                {
                    printf("TidyNode_Comment: %s\n", text.bp);
                }
                /* Release the buffer whether or not the call succeeded. */
                tidyBufFree(&text);
            }
            break;
        case TidyNode_Text:
            {
                TidyBuffer text = {0};
                tidyBufInit(&text);
                if (tidyNodeGetText(tdoc, child, &text))
                {
                    printf("TidyNode_Text: %s\n", text.bp);
                }
                tidyBufFree(&text);
            }
            break;
        case TidyNode_CDATA:
            {
                TidyBuffer text = {0};
                tidyBufInit(&text);
                if (tidyNodeGetText(tdoc, child, &text))
                {
                    printf("TidyNode_CDATA: %s\n", text.bp);
                }
                tidyBufFree(&text);
            }
            break;
        case TidyNode_Start:
            {
                processNode( tdoc, child);
            }
            break;
        default:
            break;
        }
    }
}

I call the function as follows:

processNode(tdoc, tidyGetRoot(tdoc));

The output of the function is:

TidyNode_Text: New Document
TidyNode_Text: document.write("&lt;STYLE&gt;");
TidyNode_Text: Hello!

Note that the angle brackets were converted to SGML entities. It is not clear why; this looks like a bug to me.

Even if there is some reason behind this, we would like to get an undistorted original text without escaping. Is there any option to do that?
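
As a stopgap on the caller side, the escaping can be undone after the fact. The following is only a minimal sketch: unescape_basic_entities is a hypothetical helper, not part of TidyLib, and it assumes that only the four basic entities (&lt;, &gt;, &quot;, &amp;) need reversing. It cannot distinguish text that was escaped by Tidy from text that already contained an entity in the source, so it is not a true replacement for an unescaped accessor.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of TidyLib.
   Rewrites s in place, replacing the four basic entities with the
   characters they stand for. Returns the length of the result. */
static size_t unescape_basic_entities(char *s)
{
    static const struct { const char *name; char ch; } map[] = {
        { "&lt;", '<' }, { "&gt;", '>' }, { "&quot;", '"' }, { "&amp;", '&' }
    };
    const size_t n = sizeof map / sizeof map[0];
    char *src = s;
    char *dst = s;

    assert(s != NULL);
    while (*src) {
        size_t i;
        for (i = 0; i < n; ++i) {
            size_t len = strlen(map[i].name);
            if (strncmp(src, map[i].name, len) == 0) {
                *dst++ = map[i].ch;   /* emit the decoded character */
                src += len;           /* skip the whole entity */
                break;
            }
        }
        if (i == n)                   /* no entity here: copy the byte as-is */
            *dst++ = *src++;
    }
    *dst = '\0';
    return (size_t)(dst - s);
}
```

Applied to a copy of text.bp from the walker above, this would turn the second output line back into document.write("<STYLE>");. The pass is single-shot, so an input such as &amp;lt; decodes to &lt; rather than to <.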

Discussion

  • Arnaud Desitter

    Arnaud Desitter - 2007-01-16

    Logged In: YES
    user_id=566665
    Originator: NO

    See http://tidy.sf.net/issue/1166491 which contains a patch that may be correct.
    It would be nice if you could provide a patch with a rationale so this issue could be nailed down.

     
  • Arnaud Desitter

    Arnaud Desitter - 2007-01-16
    • labels: --> TidyLib APIs
    • priority: 5 --> 6
     
  • ayermakov

    ayermakov - 2007-01-16

    Logged In: YES
    user_id=1688233
    Originator: YES

    Well, it seems bug entry 1166491 describes the same issue. However, I have a slightly different opinion on how it should be resolved. I believe that tidyNodeGetText should output the text of any node as-is, without any processing (escaping). That holds not only for script and style nodes, but also for regular HTML text.
    Or at least it should be controlled by an option.
    File Added: 1636028.diff

     
  • ayermakov

    ayermakov - 2007-01-16

    patch for suggested bugfix

     
  • Arnaud Desitter

    Arnaud Desitter - 2007-01-16
    • assigned_to: nobody --> hoehrmann
     
  • Arnaud Desitter

    Arnaud Desitter - 2007-01-16

    Logged In: YES
    user_id=566665
    Originator: NO

    Bjoern,
    Could you comment on this patch ?
    Thanks,

     
  • Björn Höhrmann

    Logged In: YES
    user_id=188003
    Originator: NO

    tidyNodeGetText is supposed to partially serialize a node; it is not meant to return an element's text content. I think it would be incorrect for it not to escape special characters in text. We lack a function to get the content of text nodes; see the Jan 2003 thread "How to access lexbuf?" on tidy-develop. I think adding such a function would address the requestor's problem.

    If tidyNodeGetText is to be improved independently of the addition of such a function, it should continue to escape normal text nodes. Whether CDATA element content, such as that of <script> and <style>, is escaped should depend on whether XML/XHTML output is requested, and the content of comments and PIs should probably never be escaped as if it were text.
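
The rule proposed above can be sketched as a small decision function. This is an illustration only: NodeKind and should_escape are hypothetical names, not TidyLib types or functions, and the sketch assumes that script/style content is escaped exactly when XML/XHTML output is requested (the comment above says only that it should depend on that setting).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical node categories for illustration only;
   TidyLib's own TidyNodeType enum is not used here. */
typedef enum {
    KIND_TEXT,       /* ordinary text node */
    KIND_RAW_TEXT,   /* content of <script> or <style> */
    KIND_COMMENT,    /* comment content */
    KIND_PROC_INS    /* processing instruction content */
} NodeKind;

/* Returns true when the serializer should escape the node's content. */
static bool should_escape(NodeKind kind, bool xml_output)
{
    switch (kind) {
    case KIND_TEXT:
        return true;          /* normal text is always escaped */
    case KIND_RAW_TEXT:
        return xml_output;    /* assumption: escape only for XML/XHTML */
    case KIND_COMMENT:
    case KIND_PROC_INS:
        return false;         /* never escaped as if it were text */
    }
    assert(!"unknown NodeKind");
    return true;              /* conservative fallback */
}
```

Under these assumptions, the script content from the report would be left unescaped for HTML output (should_escape(KIND_RAW_TEXT, false) is false), while the "Hello!" text node would still go through escaping.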

     
  • Björn Höhrmann

    • assigned_to: hoehrmann --> nobody
     
  • Arnaud Desitter

    Arnaud Desitter - 2008-01-27
    • status: open --> pending-fixed
     
  • SourceForge Robot

    Logged In: YES
    user_id=1312539
    Originator: NO

    This Tracker item was closed automatically by the system. It was
    previously set to a Pending status, and the original submitter
    did not respond within 30 days (the time period specified by
    the administrator of this Tracker).

     
  • SourceForge Robot

    • status: pending-fixed --> closed-fixed
     
