#49 Externalise document format parsing


GATE currently has an internal mechanism for parsing document formats which converts the markup into annotations (at least for XML/HTML documents) and does some detection of MIME types.

The TIKA project (incubator.apache.org/tika/) does exactly that. It also generates some markup for PDF documents and is good at detecting MIME types and encodings. TIKA's API is simple and could be easily plugged into GATE.

Externalising the format parsing and using Tika would have the following advantages:
- simplify the code of GATE
- benefit from improvements in Tika
- make it easier to change behaviour (via Tika's XML config file)
- get markup for PDF documents
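
To illustrate what "externalising" could look like in code, here is a minimal sketch in which the host application talks to a small interface and the parsing backend sits behind it. All names (`FormatParser`, `SignatureParser`, `describe`) are hypothetical and are not part of GATE or Tika. The backend shown is only a stand-in that checks two well-known file signatures so that the sketch runs without extra dependencies; a real adapter would presumably delegate to Tika instead.

```java
import java.nio.charset.StandardCharsets;

public class ExternalFormatParsing {

    /** Hypothetical seam between the host application and whichever library parses documents. */
    public interface FormatParser {
        /** Returns a MIME type guessed from the leading bytes of the document. */
        String detectMimeType(byte[] content);
    }

    /**
     * Stand-in backend (an assumption for this sketch, not Tika): it only
     * recognises the PDF header and the XML declaration. A real adapter
     * would hand the bytes to the external library here.
     */
    public static class SignatureParser implements FormatParser {
        @Override
        public String detectMimeType(byte[] content) {
            // Decode at most the first 16 bytes; ISO-8859-1 maps every byte to one char.
            int length = Math.min(content.length, 16);
            String head = new String(content, 0, length, StandardCharsets.ISO_8859_1);
            if (head.startsWith("%PDF-")) {
                return "application/pdf";
            }
            if (head.startsWith("<?xml")) {
                return "application/xml";
            }
            return "application/octet-stream";
        }
    }

    /** Host-side code depends only on the interface, so the backend can be swapped. */
    public static String describe(FormatParser parser, byte[] content) {
        return "detected " + parser.detectMimeType(content);
    }

    public static void main(String[] args) {
        FormatParser parser = new SignatureParser();
        byte[] pdf = "%PDF-1.4 ...".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(describe(parser, pdf));
    }
}
```

With this shape, replacing the built-in parsing with an external library would mean writing one new `FormatParser` implementation, while callers such as `describe` stay unchanged.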


  • Mark Greenwood

    Mark Greenwood - 2014-04-11
    • status: open --> closed
    • Group: --> Next_Release_(example)
  • Mark Greenwood

    Mark Greenwood - 2014-04-11

    We use TIKA for most document parsing now, although there is currently a bug in it that is stopping us from upgrading to the most recent version.

