XML output not well-formed

  • Frederick Schulz

    I'm using pdftohtml 0.40 from DarwinPorts on Mac OS 10.5.2.

    The XML output option generates XML files that are not well-formed:

    - <span> and </span> are reversed: the output looks like </span> spanned text <span>
    - <A> is closed by </a>, which does not match because XML tag names are case-sensitive

    There's a simple workaround (at least for me): run the XML output through a filter (e.g. sed) to correct these.
    Still, is there any chance this will be fixed soon?
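
A minimal sketch of such a filter, written in C++ rather than sed. The function name `fixXmlLine` is hypothetical and not part of pdftohtml. It assumes that losing the per-span font information is acceptable, so it simply drops every span tag instead of trying to re-pair them, and it lowercases the opening `<A` so that it matches the closing `</a>`.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not part of pdftohtml): cleans one line of
// `pdftohtml -xml` output so that an XML parser can read it.
std::string fixXmlLine(const std::string &line) {
    // Matches any opening or closing span tag, with or without attributes.
    static const std::regex spanTag(R"(</?span\b[^>]*>)");
    // Matches the start of an upper-case opening anchor tag.
    static const std::regex upperAnchor(R"(<A\b)");

    // Drop all span tags, since they may be reversed or unbalanced.
    std::string out = std::regex_replace(line, spanTag, "");
    // Lowercase the opening anchor tag so it pairs with </a>.
    return std::regex_replace(out, upperAnchor, "<a");
}
```

Calling this on each line read from standard input and writing the result to standard output gives a drop-in replacement for the sed step. If the font classes carried by the span tags matter, the tags would have to be re-paired rather than removed, which this sketch does not attempt.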


    • spirov

      spirov - 2008-12-18

      I can confirm what Frederick is saying:

      I get reversed <span> tags; this was not the case in the previous version

      (also on Mac)

  • stfwi

    stfwi - 2010-01-13

    I found a similar bug in pdftohtml version 0.40 (http://pdftohtml.sourceforge.net/), based on Xpdf version 3.01, on Mac OS 10.6.2.

    The following output is a capitalised page number declaration, where the first <span> and the last </span> are missing:

    <text top="1166" left="106" width="83" height="22" font="21">S</span><span class="ft1">EITE </span><span class="ft0">1/41</text>

    It could, however, be the same bug. Thanks for coding this great tool!



  • deaddecoy

    deaddecoy - 2011-05-09

    The problem code is in "src/HtmlOutputDev.cc" lines 538 - 546:

          GString *fntFix;
          GString *iStr=GString::fromInt(str2->fontpos);     
          fntFix = new GString("</span><span class=\"ft");
          if (((hlink1 == NULL) && (hlink2 == NULL)) && (hfont1->isEqualIgnoreBold(*hfont2) == gFalse))

    It looks like the developer was trying to add support for subscripts and superscripts, but the feature either isn't fully implemented or is meant for HTML output only (I'm dumping to XML). I simply commented out these lines and recompiled.


