#905 Null Char in XML result



<edef>A string of the form &#xdddd or &#dddd</edef>


<edef>A string of the form &amp;#xdddd or ~dddd</edef>

Where I use ~ (tilde) for the NULL char (x00)

Can you help ?


  • christophe  chenon

    Source file

  • Geoff

    Geoff - 2009-03-13

    Friday, March 13, 2009.

    RE: http://tidy.sf.net/issue/2683371

    Yes, Tidy does add a null character when it finds '&#d'... So, to find out why:

    Tidy starts, on seeing the '&', to build an 'entity'... and on the '#' switches entity mode.

    Then it aborts the 'entity' name building on seeing the 'd', and thinks it has the whole entity, and tries to look it up in EntityInfo()...

    If the second entity character, name[1], is '#' then it tries to decode the number that starts at the 3rd character, with 0 (zero) as the default, using:
    sscanf( name+2, "%u", &c );
    But in this case name+2 is an empty string (it points at the terminating NUL)... and WITHOUT checking the result of sscanf(), which in this case returns EOF (-1), it proceeds to set the 'code' to zero and return 'yes', entity 'found'...

    It should return 'no' - no such entity!
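
The failure mode described above can be reproduced outside Tidy. What follows is a minimal sketch, not Tidy's actual code: `decode_unchecked` is a hypothetical helper that mimics the unchecked call, and it hands back the sscanf() result only so that the problem becomes visible.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for the unchecked decode (not Tidy's real code).
 * 'name' is the entity text collected so far, e.g. "&#" or "&#65".
 * Stores sscanf()'s return value in *res and returns the decoded code. */
static unsigned int decode_unchecked(const char *name, int *res)
{
    unsigned int c = 0; /* zero on missing/bad number */

    assert(name != NULL && name[0] == '&' && name[1] == '#');
    *res = sscanf(name + 2, "%u", &c);
    return c; /* still 0 when nothing was converted */
}
```

Called with "&#", the helper returns 0 and sets the result to EOF; called with "&#65", it returns 65 and sets the result to 1. A caller that ignores the result cannot tell the first case from a genuine code of zero, which is how the null character ends up in the output.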

    So, it seems this could be fixed by the following patch :-
    --- tidycvs\src\entities.c Thu Sep 18 16:47:12 2008
    +++ tidydev\src\entities.c Fri Mar 13 12:22:57 2009
    @@ -366,16 +366,18 @@
         if ( name[1] == '#' )
         {
             uint c = 0;  /* zero on missing/bad number */
    +        int res;
             /* 'x' prefix denotes hexadecimal number format */
             if ( name[2] == 'x' || (!isXml && name[2] == 'X') )
    -            sscanf( name+3, "%x", &c );
    +            res = sscanf( name+3, "%x", &c );
             else
    -            sscanf( name+2, "%u", &c );
    -        *code = c;
    -        *versions = VERS_ALL;
    -        return yes;
    +            res = sscanf( name+2, "%u", &c );
    +        if ( res != -1 )
    +        {
    +            *code = c;
    +            *versions = VERS_ALL;
    +            return yes;
    +        }
         }

         /* Named entity: name ="&" followed by a name */

    That is, check that the result of sscanf() is not EOF (-1). This could be made even tighter by using if ( res == 1 ), since that is the result expected when exactly one number has been converted...

    Or the code could check the length of 'name' to see if a sscanf() is even possible, but this 'minimum length' would depend on whether the 'x'/'X' prefix is present...
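
The tighter `res == 1` variant can be sketched as a standalone function. This is an illustration under assumptions, not the patch that went into Tidy: `decode_numeric_entity` and its `is_xml` parameter are hypothetical names, and a plain int return replaces Tidy's yes/no.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical standalone version of the numeric-entity branch.
 * Returns 1 and stores the code in *code only when sscanf()
 * converted exactly one number; returns 0 ("no such entity") otherwise. */
static int decode_numeric_entity(const char *name, int is_xml, unsigned int *code)
{
    unsigned int c = 0;
    int res;

    assert(name != NULL && code != NULL);
    if (name[0] != '&' || name[1] != '#')
        return 0; /* not a numeric entity */

    /* 'x' prefix denotes hexadecimal; 'X' is accepted outside XML mode only */
    if (name[2] == 'x' || (!is_xml && name[2] == 'X'))
        res = sscanf(name + 3, "%x", &c);
    else
        res = sscanf(name + 2, "%u", &c);

    if (res != 1)
        return 0; /* EOF (-1) or 0 conversions: nothing to decode */

    *code = c;
    return 1;
}
```

Checking for exactly one conversion covers both the empty-input case (EOF) and the non-numeric case (0) without a separate length test, so the 'x'/'X' prefix does not need special handling.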

    Then it seems the correct xml is output...
    <edef>A string of the form &amp;#xdddd or &amp;#dddd</edef>

    Hope this helps...



    EOF - 2683371.doc

  • christophe  chenon

    Thanks a lot Geoff !!

    Any chance that a new version of Tidy correcting that bug will be available soon ?


  • Arnaud Desitter

    Arnaud Desitter - 2009-03-25
    • labels: 819775 --> XML Parser
  • Arnaud Desitter

    Arnaud Desitter - 2009-03-25
    • priority: 5 --> 7
  • Geoff

    Geoff - 2016-02-14

    Thanks for the report... now long ago... sorry for the delay...

    Tidy source has moved on to https://github.com/htacg/tidy-html5, site to http://www.html-tidy.org/

    Back then I was not a Tidy maintainer, and for whatever reason my patch never made it into the CVS source, so it is not in our current GitHub source either.

    I have now raised issue #373 to address this bug, and will add a new patch as soon as it has been tested.

    If you do find another tidy bug please file an issue, and if you find, fix, and test the feature in a tidy fork then you can issue a Pull Request together with sample html and config used.

    Tidy needs your support...

    Meantime closing this here as out-of-date...

  • Geoff

    Geoff - 2016-02-14
    • status: open --> closed-out-of-date
  • Geoff

    Geoff - 2016-02-15

    If you get a chance, check out the issue-373 branch for the fix... This will be merged to master after testing, as it hopefully closes this old bug!

