Extracts all available textual information from a PDF file. It is intended mainly for tabular data, where positional as well as textual information is required.

PDF has two main text-showing operators, Tj and TJ. Tj places a string with uniform character spacing, while TJ allows the spacing between characters to vary; in both cases the string starts at an X, Y coordinate expressed in arbitrary units. A text fragment consists of the X and Y coordinates of a text string together with the string itself. A list of text fragments containing all the text on a page is extracted and then sorted to put the fragments into reading order. The list is then converted into strings of the form 'X value, Y value, text\n', which are concatenated and stored in a text file named 'Page n', where n is the page number.

Data extraction becomes much easier because parsing can be based on both text value and text position. This is very useful for data sources that are available only in PDF form and are updated regularly, such as the insider trading data from SEDI.ca.
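The sort-and-serialize step described above can be sketched roughly as follows. This is only an illustration, not the project's actual code: the fragment shape `{ x, y, text }` and the function names `sortReadingOrder` and `serializeFragments` are hypothetical, and the sketch assumes the default PDF coordinate system, in which Y increases up the page, so reading order means larger Y first and then smaller X first.

```javascript
// Hypothetical fragment shape (an assumption): { x: number, y: number, text: string }.

// Sort fragments into reading order: top of page first (larger Y),
// then left to right (smaller X) within the same Y.
// Assumes default PDF user space, where Y grows upward.
function sortReadingOrder(fragments) {
  // Copy the array so the caller's list is not mutated.
  return [...fragments].sort((a, b) => (b.y - a.y) || (a.x - b.x));
}

// Convert the sorted fragments into 'X value, Y value, text\n' strings
// and concatenate them into the contents of one 'Page n' file.
function serializeFragments(fragments) {
  return sortReadingOrder(fragments)
    .map((f) => `${f.x}, ${f.y}, ${f.text}\n`)
    .join("");
}

// Made-up sample data for a two-row, two-column table.
const sample = [
  { x: 200, y: 700, text: "Price" },
  { x: 72, y: 680, text: "ACME" },
  { x: 72, y: 700, text: "Issuer" },
  { x: 200, y: 680, text: "12.50" },
];

const pageText = serializeFragments(sample);
console.log(pageText);
```

A real page may place fragments of the same visual line at slightly different Y values, so a practical version would probably need to bucket Y values before sorting; that refinement is left out here.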

Features

  • JavaScript-based PDF text extractor optimized for tabular data.
  • Runs in Firefox as a temporary extension and is intended as a helper application for PDF data extraction in other extensions.
  • Completely standalone, with no external libraries.
  • Compatible with all versions of PDF up to 1.7, and decrypts files that are encrypted with default passwords.
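
As an illustration of how a consumer might parse the extractor's output by position, the sketch below reads lines in the 'X value, Y value, text' format and groups them into table rows by Y coordinate. The function names `parseFragmentLine` and `groupIntoRows` and the sample data are hypothetical, and the sketch assumes that the fragments of a row share exactly the same Y value and that the text itself contains no line breaks.

```javascript
// Parse one 'X value, Y value, text' line into { x, y, text }.
// Only the first two commas are treated as separators, so commas
// inside the text itself are preserved.
function parseFragmentLine(line) {
  const first = line.indexOf(",");
  const second = line.indexOf(",", first + 1);
  if (first === -1 || second === -1) return null;
  const x = Number(line.slice(0, first));
  const y = Number(line.slice(first + 1, second));
  if (Number.isNaN(x) || Number.isNaN(y)) return null;
  // Drop the single separator space after the second comma, if present.
  const text = line.slice(second + 1).replace(/^ /, "");
  return { x, y, text };
}

// Group fragments that share a Y value into rows, ordered left to right.
// Rows come out in the order their Y value first appears in the input,
// which is reading order if the input was already sorted.
function groupIntoRows(pageText) {
  const rows = new Map(); // Map preserves insertion order
  for (const line of pageText.split("\n")) {
    if (line === "") continue;
    const frag = parseFragmentLine(line);
    if (frag === null) continue;
    if (!rows.has(frag.y)) rows.set(frag.y, []);
    rows.get(frag.y).push(frag);
  }
  return [...rows.values()].map((cells) =>
    cells.sort((a, b) => a.x - b.x).map((c) => c.text)
  );
}

// Made-up contents of a 'Page 1' file.
const page1 =
  "72, 700, Issuer\n200, 700, Price\n72, 680, Smith, John\n200, 680, 12.50\n";

const table = groupIntoRows(page1);
console.log(table);
```

Because each cell keeps its X coordinate until the final step, the same approach could be extended to assign cells to columns by X range rather than by order alone.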

pdf-to-text-fragments Web Site

Additional Project Details

Registered

2022-05-01