Extracts all available textual information from a PDF file. It is intended mainly for tabular data, where positional as well as textual information is required.

PDF places text strings mainly with two operators, Tj and TJ. Tj shows a string with uniform character spacing, while TJ allows the spacing between characters to vary; in both cases the string starts from an X, Y coordinate expressed in arbitrary units. A text fragment consists of the X and Y coordinates of a text string together with the string itself. A list of text fragments containing all the text on a page is extracted and then sorted to put the fragments into reading order. The list is then converted into strings of the form 'X value, Y value, text\n', which are concatenated and stored in a text file named 'Page n', where n is the page number.

Data extraction becomes much easier because parsing can be based on both text value and text position. This is very useful for data sources that are available only in PDF form and are updated regularly, such as insider trading data from SEDI.ca.
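The sort-and-serialize step described above can be sketched roughly as follows. This is only an illustration, not the project's actual code: the fragment shape `{ x, y, text }`, the function names, and the `yTolerance` parameter are all assumptions, and the sketch assumes the default PDF coordinate system in which Y increases upward, so reading order means descending Y followed by ascending X.

```javascript
// Hypothetical fragment shape: { x: number, y: number, text: string }.
// Assumes Y grows upward from the bottom of the page (PDF default user space).

// Put fragments into reading order: top to bottom, then left to right.
// Fragments whose Y values differ by at most yTolerance are treated as one row.
function sortIntoReadingOrder(fragments, yTolerance = 2) {
  // Sort a copy by descending Y so rows can be grouped in a single pass.
  const byY = [...fragments].sort((a, b) => b.y - a.y);
  const rows = [];
  for (const frag of byY) {
    const row = rows[rows.length - 1];
    if (row && Math.abs(row.y - frag.y) <= yTolerance) {
      row.items.push(frag);
    } else {
      rows.push({ y: frag.y, items: [frag] });
    }
  }
  // Within each row, order left to right (ascending X).
  return rows.flatMap((row) => row.items.sort((a, b) => a.x - b.x));
}

// Convert fragments into the 'X value, Y value, text\n' page format.
function serializeFragments(fragments) {
  return fragments.map((f) => `${f.x}, ${f.y}, ${f.text}\n`).join("");
}

// Read the page format back into fragments for position-based parsing.
function parseFragments(pageText) {
  return pageText
    .split("\n")
    .filter((line) => line.length > 0)
    .map((line) => {
      // Split on the first two commas only; the text itself may contain commas.
      const first = line.indexOf(",");
      const second = line.indexOf(",", first + 1);
      return {
        x: Number(line.slice(0, first)),
        y: Number(line.slice(first + 1, second)),
        text: line.slice(second + 1).replace(/^ /, ""),
      };
    });
}
```

With the page text parsed back into fragments, a table column can be selected by position alone, for example `parseFragments(pageText).filter((f) => Math.abs(f.x - 200) < 1)` for a column assumed to start at X = 200. The sketch also assumes that no text string contains a newline.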

Features

  • JavaScript-based PDF text extractor optimized for tabular data.
  • Runs in Firefox as a temporary extension and is intended as a helper application for PDF data extraction in extensions.
  • Completely standalone with no external libraries.
  • Compatible with all versions of PDF up to 1.7 and will decrypt files encrypted with default passwords.

Additional Project Details

Registered

2022-05-01