Tips for using ExtractText on Windows

  • zakwer

    Sorry if this subject is more MS-DOS oriented than anything else.

    I am trying to extract the text out of some pdf files on a Windows PC.

    Using the Command Prompt (the MS-DOS console), I have managed to "pipe" (à la the UNIX | command) the output of ExtractText into a text file with this command:

    java -classpath H:\TMP\PDF\Multivalent20040415.jar tool.doc.ExtractText -verbose SAMPLE.pdf | TEXTO.txt

    The extracted text shows up in a notepad file, and I can then copy and paste it into a text editor to remove the funky return characters. Fine. Thanks. Love it.

    I wonder if anyone knows how to automate it a little more in order to work with a folder full of PDFs, as the above command only "pipes" the first PDF file into the text file, seemingly ignoring the rest.

    Also, are there any other file formats available for output? I have tried .rtf, .doc, and .htm, to no avail.


    • Brian

      MS-DOS does have its own scripting tools, which you use to build batch files (.BAT). They are not particularly extensive or elegant compared to Unix shell scripts, but you should be able to do what you want here.

      The following URL has some brief guidelines on how to use the handful of DOS scripting commands. http://www.cs.ntu.edu.au/homepages/bea/home/subjects/ith305/description.html

      The FOR command will probably do what you want it to.
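
      To make the FOR suggestion concrete, here is a minimal sketch of a batch file, not a tested recipe. It assumes that ExtractText prints the extracted text to standard output and that the PDFs sit in the current folder. The java invocation is copied from the original post, and the file name extract_all.bat is only an illustrative choice. The parenthesized block and the %%~nF modifier need the NT-family cmd.exe (Windows 2000/XP and later); plain MS-DOS COMMAND.COM does not support them.

      ```bat
      @echo off
      rem extract_all.bat (hypothetical name): run it from the folder holding the PDFs.
      rem Assumes ExtractText prints the extracted text to standard output.
      for %%F in (*.pdf) do (
          rem %%F is the current PDF; %%~nF is its name without the extension.
          java -classpath H:\TMP\PDF\Multivalent20040415.jar tool.doc.ExtractText -verbose "%%F" > "%%~nF.txt"
      )
      ```

      Each PDF gets its own text file (SAMPLE.pdf becomes SAMPLE.txt). If you type the loop directly at the prompt instead of saving it in a .BAT file, use a single percent sign (%F and %~nF).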

    • Robert Eden

      Instead of sending to a pipe, redirect to a file.

      Use this:  > file.txt
      Not this:  | file.txt

      (I'm actually surprised the pipe worked)
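
      For reference, | hands a program's output to another program, while > writes it to a file. Applied to the command from the original post, and assuming ExtractText prints the extracted text to standard output, the redirect form would look like this sketch:

      ```bat
      rem ">" creates (or overwrites) TEXTO.txt with whatever the program prints.
      java -classpath H:\TMP\PDF\Multivalent20040415.jar tool.doc.ExtractText -verbose SAMPLE.pdf > TEXTO.txt
      ```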