We have a catalog in PDF format that we're converting
to HTML. pdftohtml generally works well, except that the
whitespace between text items in quite a few price tables
is lost, so adjacent columns run together.
Attached is a PDF of one page where this occurs (using
pdftohtml-0.36 under Linux and Cygwin/Windows).
For example, when I run "./pdftohtml -c -noframes
A40.pdf" the resulting text 'W351 0.020"' is all scrunched
together instead of 'W351' and '0.020"' being in separate
table columns.
Here's a patch I made:
=====================================
--- pdftohtml-0.36/src/HtmlOutputDev.cc 2003-08-23 19:04:40.000000000 -0400
+++ pdftohtml/src/HtmlOutputDev.cc 2003-08-23 18:57:12.000000000 -0400
@@ -249,10 +249,15 @@
   int n, i;
   state->transform(x, y, &x1, &y1);
   n = curStr->len;
-
+
+  // dmanura--2003-08-23
+  // large whitespace
+  GBool bigWhitespace = isspace(*u) && dx > (curStr->yMax - curStr->yMin);
+
   // check that new character is in the same direction as current string
   // and is not too far away from it before adding
-  if ((UnicodeMap::getDirection(u[0]) != curStr->dir) ||
+  if (bigWhitespace || // dmanura--2003-08-23
+      (UnicodeMap::getDirection(u[0]) != curStr->dir) ||
      (n > 0 &&
       fabs(x1 - curStr->xRight[n-1]) > 0.1 * (curStr->yMax - curStr->yMin))) {
     endString();
@@ -267,8 +272,10 @@
     w1 /= uLen;
     h1 /= uLen;
   }
-  for (i = 0; i < uLen; ++i) {
-    curStr->addChar(state, x1 + i*w1, y1 + i*h1, w1, h1, u[i]);
+  if(!bigWhitespace) { // dmanura--2003-08-23
+    for (i = 0; i < uLen; ++i) {
+      curStr->addChar(state, x1 + i*w1, y1 + i*h1, w1, h1, u[i]);
+    }
   }
 }
====================================
This makes it work just fine, but I'm not sure it's the
best way of implementing it. What's happening is that the
PDF contains ASCII 0x20 (space) characters that are
apparently very wide (as determined by inserting debugging
statements above the patched code), but each of these is
rendered as a single normal-width space character.
Therefore, when a space is wider than the height of the
current string (yMax - yMin), the code now ends the current
string, drops the space, and starts a new string. This
criterion may not be appropriate in all situations, but I
believe it is reasonable.
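
As a rough illustration of the criterion described above, here is a minimal standalone sketch. It is not pdftohtml code: `LineBox` and `isBigWhitespace` are hypothetical names standing in for the `curStr->yMin`/`curStr->yMax` fields and the `bigWhitespace` flag in the patch, and `advance` plays the role of `dx`.

```cpp
#include <cwctype>

// Hypothetical stand-in for the vertical extent of the current string
// (curStr->yMin and curStr->yMax in the patch).
struct LineBox {
  double yMin;
  double yMax;
};

// Returns true when the character is whitespace and its horizontal
// advance exceeds the height of the current string, which is the
// condition the patch uses to end the string and start a new one.
bool isBigWhitespace(unsigned int codePoint, double advance,
                     const LineBox &box) {
  const double lineHeight = box.yMax - box.yMin;
  const bool isSpace =
      std::iswspace(static_cast<std::wint_t>(codePoint)) != 0;
  return isSpace && advance > lineHeight;
}
```

With a string 10 units tall, a space that advances 30 units would be treated as a column break, while a 3-unit space would stay inside the string. The threshold of one line height is the patch's own choice; other documents might need a different ratio.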
PDF test file.
Logged In: YES
user_id=173287
Please attach this patch or email it to me.
patch for handling large whitespace
Logged In: YES
user_id=850909
Yup, the inline patch didn't come through well. The file is now attached.
Logged In: NO
This needs to use iswspace (from <wctype.h>) to handle Unicode
characters properly, since *u is a Unicode value rather than a char:
GBool bigWhitespace = iswspace(*u) && dx > (curStr->yMax - curStr->yMin);
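
For background on this suggestion: `isspace` has undefined behavior when its argument is neither `EOF` nor representable as `unsigned char`, so passing an arbitrary Unicode value to it is unsafe. `iswspace` accepts wide values, but its result depends on the current locale. The sketch below is one possible way to make the check less locale-dependent; `isUnicodeSpace` is a hypothetical helper, not part of pdftohtml, and the list of code points is illustrative rather than complete.

```cpp
#include <cwctype>

// Hypothetical helper: decide whether a Unicode code point is a space.
// A few common space code points are listed explicitly because
// iswspace's answer for them depends on the active locale.
bool isUnicodeSpace(unsigned int c) {
  switch (c) {
    case 0x00A0:  // NO-BREAK SPACE
    case 0x2002:  // EN SPACE
    case 0x2003:  // EM SPACE
    case 0x2009:  // THIN SPACE
    case 0x3000:  // IDEOGRAPHIC SPACE
      return true;
    default:
      break;
  }
  // Fall back to the locale's classification for everything else,
  // which covers ASCII space, tab, and newline in the "C" locale.
  return std::iswspace(static_cast<std::wint_t>(c)) != 0;
}
```

Whether a no-break space should count as a column separator is a judgment call; it is included here only to show how the list could be extended.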