
#6 Support for WordPerfect documents

open
None
4
2012-09-17
2008-11-10
No

Outline

It would be nice if WordPerfect documents could be added to the list of document formats supported by DocFetcher. However, to the best of my knowledge there's currently no Java library for WordPerfect files, only a C/C++ library named 'libwpd', which, AFAIK, is used in AbiWord and OpenOffice.org and which therefore can be considered mature and stable enough.

In my opinion the best way to add WordPerfect support for DocFetcher is to write a JNI bridge to libwpd. A good tutorial for JNI is here:
http://java.sun.com/docs/books/jni/html/jniTOC.html

The first steps of this undertaking do not require integration into the DocFetcher source code, just a simple non-GUI Java program that extracts text from a WordPerfect document and writes the output to the console using System.out.println(..).

What the JNI bridge should look like from the Java side:

// Basic text extraction, something like this:
InputStream in = new FileInputStream(new File("someWPFile.wpd"));
WordPerfectDocument wpDoc = new WordPerfectDocument(in);
String text = wpDoc.getText();

// Extraction of meta data:
String title = wpDoc.getTitle();
String author = wpDoc.getAuthor();
String keywords = wpDoc.getKeywords();

In other words, there should be a "WordPerfectDocument" Java class with the following members:

  • constructor accepts a java.io.InputStream
  • method getText() extracts the raw, unformatted text from the document and returns it as a String
  • methods getTitle(), getAuthor(), getKeywords(), etc. extract the available document meta data as Strings
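
A minimal sketch of how such a class might be laid out on the Java side. Everything beyond the constructor and the getter names listed above is an assumption made for illustration: the native library name "wpdjni", the native method names, the size() helper, and the choice to buffer the whole stream into a byte array before handing it to native code (native code cannot read from a java.io.InputStream without calling back into the JVM, so buffering is simply the easiest option, not necessarily the best one). The C/C++ side that would actually call libwpd is not shown.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Sketch only: in a real library this class would be public and live in its own file.
final class WordPerfectDocument {

    private static boolean libraryLoaded = false;

    // Complete contents of the document, buffered so it can be passed to native code.
    private final byte[] data;

    public WordPerfectDocument(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        data = buffer.toByteArray();
    }

    // Hypothetical helper: number of bytes read from the stream.
    int size() {
        return data.length;
    }

    public String getText() {
        ensureLibraryLoaded();
        return nativeGetText(data);
    }

    public String getTitle()    { return getMetaData("title"); }
    public String getAuthor()   { return getMetaData("author"); }
    public String getKeywords() { return getMetaData("keywords"); }

    private String getMetaData(String key) {
        ensureLibraryLoaded();
        return nativeGetMetaData(data, key);
    }

    // Load the bridge lazily, so constructing a document works even before
    // the native library has been built ("wpdjni" is an assumed name).
    private static synchronized void ensureLibraryLoaded() {
        if (!libraryLoaded) {
            System.loadLibrary("wpdjni");
            libraryLoaded = true;
        }
    }

    // To be implemented in C/C++ on top of libwpd (assumed signatures).
    private static native String nativeGetText(byte[] data);
    private static native String nativeGetMetaData(byte[] data, String key);
}
```

Until the native side exists, calling getText() or any of the meta data getters fails with an UnsatisfiedLinkError, while the constructor already works on any InputStream.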

Some additional requirements:

  • class must be usable from both Linux and Windows
  • It probably makes a lot of sense to create a separate SF.net project for this library, so projects other than DocFetcher can use it, too. In that case, the JNI bridge should be licensed under the LGPL.
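
On the Linux/Windows requirement: System.loadLibrary takes a platform-independent base name and the JVM maps it to the platform's file name, so the Java code can stay identical on both systems as long as a native build exists for each. A small sketch of that mapping, using the assumed library name "wpdjni" (not an existing library):

```java
// Sketch only: shows which native file name the JVM looks for on the current platform.
final class NativeLibraryName {

    // Returns the platform-specific file name for a library base name.
    static String fileNameFor(String baseName) {
        return System.mapLibraryName(baseName);
    }

    public static void main(String[] args) {
        // "wpdjni" is a hypothetical name for the JNI bridge library.
        System.out.println(fileNameFor("wpdjni"));
    }
}
```

On Linux this typically yields a "lib" prefix and ".so" suffix, on Windows a ".dll" suffix, which means two separate native builds would have to ship with the Java class.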

About the libwpd library:

Discussion

  • Nam-Quang Tran

    Nam-Quang Tran - 2009-01-12

    It turned out that all needed components for this feature can be found here:
    http://sourceforge.net/projects/libwpd

    More precisely, we need the modules "libwpd2" and "libwpd2-bindings" from the CVS repository of that project. I've managed to build them on Linux (it was a horrible nightmare...) and now I could use some help with the Windows builds.

     
  • Tonio Rush

    Tonio Rush - 2009-02-21

    I managed to compile libwpd2 with MSVC2008 (a horrible nightmare too).
    So now I have these libs:

    • libwpd-0.9.lib
    • libwpd-stream-0.9.lib

    and these tools:

    • wpd2html.exe
    • wpd2raw.exe
    • wpd2text.exe

    I have taken a look at libwpd2-bindings; it uses C# and SWIG to build. I'm afraid it's beyond my skills (and time).
    I saw you asked the libwpd team for a release, did you get it?

     
  • Nam-Quang Tran

    Nam-Quang Tran - 2009-02-21

    Sorry for the painful experience :( I was hoping the compilation would be far less troublesome for an experienced C++ coder. I'll spare you the upcoming nightmare with libwpd2-bindings ;)

    Please attach your files to this tracker if they aren't too big; otherwise commit them to the /lib folder in the DocFetcher SVN folder.

     
  • Nam-Quang Tran

    Nam-Quang Tran - 2009-02-21

    And there was no sign of a libwpd release in the foreseeable future, so there seems to be no way around the compilation.

     
  • Tonio Rush

    Tonio Rush - 2009-03-03

    The two lib files have been committed to the /lib folder:

    • libwpd-0.9.lib
    • libwpd-stream-0.9.lib
     
