#42 Wrong Text Location

Milestone: 0.1.2.1

Status: closed-out-of-date

Owner: Stefano Chizzolini

Labels: text-extraction (3)

Priority: 3

Updated: 2015-04-28

Created: 2013-02-23

Creator: Mohsen Afshin

Private: No

Text extraction on the attached PDF works, but the character and sentence coordinates are wrong.
Also, on some PDFs the space character is not detected correctly.

I've applied a temporary fix in the Extract method of ContentScanner.cs:
(It works on some PDFs, but on others it adds excess spaces :-( )
(I estimate the width of a reference space character from the first character of each string: its box width in the embedded font is scaled by the space-to-character width ratio measured in the system default font.)

        // Requires: using System; using System.Drawing;
        private void Extract(ContentScanner level)
        {
            if (level == null)
                return;

            while (level.MoveNext())
            {
                ContentObject content = level.Current;

                if (content is ShowText)
                {
                    var currentWrapper = (TextStringWrapper)level.CurrentWrapper;

                    // Skip the heuristic for empty strings (TextChars[0] would throw).
                    if (currentWrapper.TextChars.Count > 0)
                    {
                        float charSystemSize, spaceSystemSize;
                        // Measure the first character and a space in the system default font,
                        // disposing the GDI+ objects as soon as the measurement is done.
                        using (var b = new Bitmap(1, 1))
                        using (Graphics g = Graphics.FromImage(b))
                        {
                            charSystemSize = g.MeasureString(currentWrapper.TextChars[0].Value.ToString(), SystemFonts.DefaultFont).Width;
                            spaceSystemSize = g.MeasureString(" ", SystemFonts.DefaultFont).Width;
                        }
                        float charFontSize = currentWrapper.TextChars[0].Box.Width;

                        if (charFontSize > 0.0f && charSystemSize > 0.0f)
                        {
                            // Scale the system-font space width to the extracted glyph size.
                            float spaceFontSize = (charFontSize * spaceSystemSize) / charSystemSize;

                            // Count is re-read on each pass, so inserted spaces are visited (and skipped).
                            for (int i = 0; i < currentWrapper.TextChars.Count - 1; i++)
                            {
                                if (currentWrapper.TextChars[i].Value == ' ')
                                    continue;

                                RectangleF box1 = currentWrapper.TextChars[i].Box;
                                float left1 = box1.Left + box1.Width;
                                float left2 = currentWrapper.TextChars[i + 1].Box.Left;
                                // Insert a virtual space when the gap is within 1.0 unit of the estimated space width.
                                if (Math.Abs(Math.Abs(left2 - left1) - spaceFontSize) < 1.0f)
                                {
                                    currentWrapper.TextChars.Insert(i + 1,
                                        new TextChar(' ', new RectangleF(left1 + 0.2f, box1.Top, 0.5f, 0.2f), null, true));
                                }
                            }
                        }
                    }

                    textStrings.Add(currentWrapper);
                }
                else if (content is ContainerObject)
                {
                    Extract(level.ChildLevel);
                }
            }
        }
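
The fix above inserts a space only when the gap is within an absolute 1.0 unit of the estimated space width, so its behaviour depends on the font size; that may be one reason it adds excess spaces on some PDFs. As a rough sketch only, the same idea can be expressed with a threshold relative to the space width. The `Glyph` type, the `SpaceHeuristic` class, the `JoinWithSpaces` method and the default `ratio` of 0.5 are all hypothetical illustrations, not part of the PDF Clown API, and the space width is assumed to be supplied by the caller.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for a positioned glyph; NOT a PDF Clown type.
public sealed class Glyph
{
    public Glyph(char value, float left, float width)
    {
        Value = value;
        Left = left;
        Width = width;
    }

    public char Value { get; }
    public float Left { get; }
    public float Width { get; }
    public float Right => Left + Width;
}

public static class SpaceHeuristic
{
    // Joins the glyphs of one text line (assumed sorted left to right), adding a
    // space wherever the horizontal gap is at least `ratio` times `spaceWidth`.
    // The 0.5 default is an assumption chosen for illustration, not a tuned value.
    public static string JoinWithSpaces(IReadOnlyList<Glyph> glyphs, float spaceWidth, float ratio = 0.5f)
    {
        var text = new StringBuilder();
        for (int i = 0; i < glyphs.Count; i++)
        {
            text.Append(glyphs[i].Value);
            if (i + 1 == glyphs.Count)
                break;

            // Do not double up when an explicit space is already present.
            if (glyphs[i].Value == ' ' || glyphs[i + 1].Value == ' ')
                continue;

            float gap = glyphs[i + 1].Left - glyphs[i].Right;
            if (spaceWidth > 0f && gap >= ratio * spaceWidth)
                text.Append(' ');
        }
        return text.ToString();
    }
}
```

For example, with glyphs 'a' (left 0, width 5), 'b' (left 5.2, width 5) and 'c' (left 13, width 5) and a space width of 2.5, only the gap between 'b' and 'c' exceeds the threshold, so the result is "ab c". Because the threshold scales with the space width, the same ratio applies at any font size, which the fixed 1.0 tolerance does not.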

1 Attachments

pdfsample.rar

Discussion

Stefano Chizzolini - 2015-03-12

assigned_to: Stefano Chizzolini

Group: --> 0.1.2.1

Priority: 5 --> 3

Stefano Chizzolini - 2015-04-28

The current version (PDF Clown 0.1.2.1) correctly detects the actual glyph positions in your sample (see my attachment, pdfsample-textInfo.zip).

pdfsample-textInfo.zip

Stefano Chizzolini - 2015-04-28

status: open --> closed-out-of-date