#46 wide whitespace not rendering well

open
nobody
None
5
2003-08-23
2003-08-23
No

We have a catalog in PDF format that we're converting
to HTML. pdftohtml generally works well except the
whitespace between text on quite a few price tables is
incorrect.

Attached is a PDF of one page where this occurs (using
pdftohtml-0.36 under Linux and Cygwin/Windows).

For example, when I run "./pdftohtml -c -noframes
A40.pdf" the resulting text 'W351 0.020"' is all scrunched
together instead of 'W351' and '0.020"' being in separate
table columns.

Here's a patch I made:

=====================================
--- pdftohtml-0.36/src/HtmlOutputDev.cc 2003-08-23 19:04:40.000000000 -0400
+++ pdftohtml/src/HtmlOutputDev.cc 2003-08-23 18:57:12.000000000 -0400
@@ -249,10 +249,15 @@
   int n, i;
   state->transform(x, y, &x1, &y1);
   n = curStr->len;
-
+
+  // dmanura--2003-08-23
+  // large whitespace
+  GBool bigWhitespace = isspace(*u) && dx > (curStr->yMax - curStr->yMin);
+
   // check that new character is in the same direction as current string
   // and is not too far away from it before adding
-  if ((UnicodeMap::getDirection(u[0]) != curStr->dir) ||
+  if (bigWhitespace || // dmanura--2003-08-23
+      (UnicodeMap::getDirection(u[0]) != curStr->dir) ||
       (n > 0 &&
        fabs(x1 - curStr->xRight[n-1]) > 0.1 * (curStr->yMax - curStr->yMin))) {
     endString();
@@ -267,8 +272,10 @@
     w1 /= uLen;
     h1 /= uLen;
   }
-  for (i = 0; i < uLen; ++i) {
-    curStr->addChar(state, x1 + i*w1, y1 + i*h1, w1, h1, u[i]);
+  if(!bigWhitespace) { // dmanura--2003-08-23
+    for (i = 0; i < uLen; ++i) {
+      curStr->addChar(state, x1 + i*w1, y1 + i*h1, w1, h1, u[i]);
+    }
   }
 }
====================================

This makes it work just fine, but I'm not sure it's the
best way of implementing it. What's happening is that the
PDF contains ASCII 0x20 (space) characters that are
apparently very wide (as determined by inserting debugging
statements above), but each of these is rendered as a
single normal-width space character. Therefore, when
this occurs, I now have the code break the string and
start a new one. The criterion used above may or may
not be applicable in all situations, but I believe it is
reasonable.
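
To illustrate the idea outside of pdftohtml, here is a minimal standalone sketch of the same heuristic. The `Glyph` struct and `splitAtWideSpaces` function are hypothetical names invented for this example and are not part of pdftohtml; `lineHeight` stands in for `curStr->yMax - curStr->yMin` in the patch. A space whose advance width exceeds the line height ends the current string and is itself dropped.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical glyph record: one character plus its advance width (dx).
struct Glyph {
  char ch;
  double dx;
};

// Split a run of glyphs into separate strings wherever a space is wider
// than the line height, mirroring the patch's bigWhitespace test.
std::vector<std::string> splitAtWideSpaces(const std::vector<Glyph>& glyphs,
                                           double lineHeight) {
  assert(lineHeight > 0);
  std::vector<std::string> out;
  std::string cur;
  for (const Glyph& g : glyphs) {
    bool bigWhitespace =
        std::isspace(static_cast<unsigned char>(g.ch)) && g.dx > lineHeight;
    if (bigWhitespace) {
      // End the current string and drop the wide space itself.
      if (!cur.empty()) {
        out.push_back(cur);
        cur.clear();
      }
    } else {
      cur.push_back(g.ch);
    }
  }
  if (!cur.empty()) out.push_back(cur);
  return out;
}
```

With a line height of 10, a glyph run spelling `W351`, then a space of width 40, then `0.020"` would come back as two strings, while a run with only normal-width spaces stays in one piece.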

Discussion

  • David Manura

    David Manura - 2003-08-23

    PDF test file.

     
  • Mikhail Kruk

    Mikhail Kruk - 2003-08-29

    Please attach this patch or email it to me.

     
  • David Manura

    David Manura - 2003-08-29

    patch for handling large whitespace

     
  • David Manura

    David Manura - 2003-08-29

    Yup, the patch didn't paste well inline. The file is now attached.

     
  • Nobody/Anonymous

    This needs to use iswspace to handle Unicode characters
    properly.

    GBool bigWhitespace = iswspace(*u) && dx > (curStr->yMax - curStr->yMin);
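
As background for the `iswspace` suggestion: `isspace` is only defined for values representable as `unsigned char` (and `EOF`), so passing an arbitrary Unicode code point to it is undefined behavior. A minimal sketch of a wide-character check follows. The `Unicode` alias is an assumption about how pdftohtml represents code points, and `isSpaceCodePoint` is a hypothetical helper name. Note that `iswspace` results for non-ASCII characters depend on the current C locale.

```cpp
#include <cassert>
#include <cwctype>

// Assumption: code points are stored as unsigned int.
using Unicode = unsigned int;

// Hypothetical helper: classify a code point as whitespace using the
// wide-character API instead of isspace().
bool isSpaceCodePoint(Unicode u) {
  // Precondition for this sketch: u is within the Unicode range.
  assert(u <= 0x10FFFF);
  return std::iswspace(static_cast<std::wint_t>(u)) != 0;
}
```

ASCII space and tab are classified as whitespace in any locale, while a letter such as `W` is not.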

     
