#46 wide whitespace not rendering well

Status: open

Owner: nobody

Labels: None

Priority: 5

Updated: 2003-08-23

Created: 2003-08-23

Creator: David Manura

Private: No

We have a catalog in PDF format that we're converting
to HTML. pdftohtml generally works well except the
whitespace between text on quite a few price tables is
incorrect.

Attached is a PDF of one page where this occurs (using
pdftohtml-0.36 under Linux and Cygwin/Windows).

For example, when I run "./pdftohtml -c -noframes
A40.pdf" the resulting text 'W351 0.020"' is all scrunched
together instead of 'W351' and '0.020"' being in separate
table columns.

Here's a patch I made:

=====================================
--- pdftohtml-0.36/src/HtmlOutputDev.cc 2003-08-23 19:04:40.000000000 -0400
+++ pdftohtml/src/HtmlOutputDev.cc 2003-08-23 18:57:12.000000000 -0400
@@ -249,10 +249,15 @@
   int n, i;
   state->transform(x, y, &x1, &y1);
   n = curStr->len;
-
+
+  // dmanura--2003-08-23
+  // large whitespace
+  GBool bigWhitespace = isspace(*u) && dx > (curStr->yMax - curStr->yMin);
+
   // check that new character is in the same direction as current string
   // and is not too far away from it before adding
-  if ((UnicodeMap::getDirection(u[0]) != curStr->dir) ||
+  if (bigWhitespace || // dmanura--2003-08-23
+      (UnicodeMap::getDirection(u[0]) != curStr->dir) ||
       (n > 0 &&
        fabs(x1 - curStr->xRight[n-1]) > 0.1 * (curStr->yMax - curStr->yMin))) {
     endString();
@@ -267,8 +272,10 @@
     w1 /= uLen;
     h1 /= uLen;
   }
-  for (i = 0; i < uLen; ++i) {
-    curStr->addChar(state, x1 + i*w1, y1 + i*h1, w1, h1, u[i]);
+  if(!bigWhitespace) { // dmanura--2003-08-23
+    for (i = 0; i < uLen; ++i) {
+      curStr->addChar(state, x1 + i*w1, y1 + i*h1, w1, h1, u[i]);
+    }
   }
 }
====================================

This makes it work just fine, but I'm not sure it's the
best way of implementing it. What's happening is that the
PDF contains ASCII 0x20 (space) characters that are
apparently very wide (as determined by inserting debugging
statements above), but each of these is rendered as a
single normal-width space character. Therefore, when
this occurs, I now have the code break the string and
start a new one. The criterion used above (a space whose
advance dx exceeds the string height, yMax - yMin) may
not be applicable in all situations, but I believe it is
reasonable.
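
To illustrate the idea outside of pdftohtml, here is a minimal standalone sketch of the same heuristic. It is not the pdftohtml code: the Glyph struct and the splitOnWideSpaces function are hypothetical names made up for this example, the input is assumed to be ASCII, and lineHeight stands in for curStr->yMax - curStr->yMin.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for one character handed to the output device:
// its code point and its horizontal advance (dx in the patch).
struct Glyph {
  unsigned int code;
  double dx;
};

// Split a run of glyphs into separate strings wherever a space is wider
// than the line height. The wide space itself is dropped, mirroring the
// patch, which ends the current string and skips addChar for that space.
std::vector<std::string> splitOnWideSpaces(const std::vector<Glyph>& glyphs,
                                           double lineHeight) {
  std::vector<std::string> cells;
  std::string current;
  for (const Glyph& g : glyphs) {
    const bool bigWhitespace = (g.code == 0x20) && (g.dx > lineHeight);
    if (bigWhitespace) {
      if (!current.empty()) {
        cells.push_back(current);  // analogous to endString()
        current.clear();
      }
      continue;                    // do not emit the wide space
    }
    // Assumption for this sketch: input is ASCII, so a plain cast is enough.
    current.push_back(static_cast<char>(g.code));
  }
  if (!current.empty()) {
    cells.push_back(current);
  }
  return cells;
}
```

With a line height of 12, a run such as 'W351', one space with an advance of 40, and '0.020"' comes back as two strings, while a space with an ordinary advance stays inside its string.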

Discussion

David Manura - 2003-08-23

PDF test file.

A40.pdf

Mikhail Kruk - 2003-08-29

Logged In: YES
user_id=173287

Please attach this patch or email it to me.

David Manura - 2003-08-29

patch for handling large whitespace

patch

David Manura - 2003-08-29

Logged In: YES
user_id=850909

Yup, the inline patch didn't come through well. The file is now attached.

Nobody/Anonymous - 2004-01-06

Logged In: NO

This needs to use iswspace to handle the Unicode characters
properly.

GBool bigWhitespace = iswspace(*u) && dx > (curStr->yMax - curStr->yMin);
