I'm using pdftohtml 0.40 from DarwinPorts on Mac OS X 10.5.2.
The XML output option generates invalid XML files:
- <span> and </span> are reversed; the output looks like </span> spanned text <span>
- <A> is closed by </a> (XML is case-sensitive, so these tags do not match)
There's a simple workaround (at least for me): run the XML output through a filter (e.g. sed) to correct these.
But is there any chance this will be fixed soon?
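For anyone who wants to try the filter workaround, here is a minimal sketch in Python rather than sed. It is not a proper fix: it assumes that losing the per-span font classes is acceptable, so it simply drops every span tag instead of trying to re-pair them, and it lower-cases upper-case anchor tags so that they match their closing </a>. The function name fix_xml is made up for this example.

```python
import re

# Matches any opening or closing span tag, e.g. <span class="ft1"> or </span>.
SPAN_TAG = re.compile(r"</?span\b[^>]*>")
# Matches the start of an upper-case anchor tag, opening (<A ...) or closing (</A>).
UPPER_ANCHOR = re.compile(r"<(/?)A(?=[\s>])")


def fix_xml(text):
    """Return text with all span tags removed and anchor tags lower-cased.

    Assumption: the font information carried by the spans is not needed,
    so removing them is an acceptable way to get matching tags again.
    """
    text = SPAN_TAG.sub("", text)
    return UPPER_ANCHOR.sub(r"<\1a", text)
```

To use it as a filter, something like sys.stdout.write(fix_xml(sys.stdin.read())) (after import sys) should be enough. If the font classes matter to you, the spans would have to be re-paired instead of removed, which this sketch does not attempt.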
I can confirm what Frederick is saying:
I get reversed <span> tags; this was not the case in the previous version
(also on Mac).
I found a similar bug in
pdftohtml version 0.40 (http://pdftohtml.sourceforge.net/), based on Xpdf version 3.01.
The following output is a capitalised page number declaration, where the first <span> and the last </span> are missing:
<text top="1166" left="106" width="83" height="22" font="21">S</span><span class="ft1">EITE </span><span class="ft0">1/41</text>
However, it could be the same bug. Thanks for coding this great tool!
The problem code is in "src/HtmlOutputDev.cc", lines 538-546:
fntFix = new GString("</span><span class=\"ft");
if (((hlink1 == NULL) && (hlink2 == NULL)) && (hfont1->isEqualIgnoreBold(*hfont2) == gFalse))
It looks like the developer was trying to add support for subscripts and superscripts, but the feature either isn't fully implemented or is meant for HTML output only (I'm dumping to XML). I simply commented out these lines and recompiled.