#49 Externalise document format parsing

Status: closed
Owner: nobody
Priority: 5
Updated: 2014-04-11
Created: 2008-02-06
Private: No

GATE currently has an internal mechanism for parsing document formats which converts the markup into annotations (at least for XML/HTML documents) and does some detection of MIME types.

The Tika project (incubator.apache.org/tika/) does exactly that. It also generates some markup for PDF documents and is good at detecting MIME types and encodings. Tika's API is simple and could easily be plugged into GATE.

Externalising the format parsing and using Tika would have the following advantages:
- simplify the code of GATE
- benefit from improvements in Tika
- easier to change behaviour (via Tika's XML config file)
- get markup for PDF documents
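
As a rough illustration of what "externalising" could mean in code, here is a minimal sketch of a pluggable parsing seam. Every type in it (MimeDetector, FormatParser, ExtensionDetector, FormatRegistry) is hypothetical and is neither a GATE class nor a Tika class; the extension-based detector and the regex tag stripper are simple stand-ins for the places where a Tika-backed implementation might be plugged in.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: none of these types exist in GATE or Tika.
class ExternalParsingSketch {

    /** Guesses a MIME type; a Tika-backed version would delegate to Tika here. */
    interface MimeDetector {
        String detect(String fileName);
    }

    /** Turns raw document content into plain text (annotation creation is out of scope). */
    interface FormatParser {
        String parse(String content);
    }

    /** Stand-in detector that looks only at the file extension. */
    static class ExtensionDetector implements MimeDetector {
        public String detect(String fileName) {
            String lower = fileName.toLowerCase(Locale.ROOT);
            if (lower.endsWith(".html") || lower.endsWith(".htm")) return "text/html";
            if (lower.endsWith(".xml")) return "application/xml";
            return "text/plain";
        }
    }

    /** Routes a document to the parser registered for its detected MIME type. */
    static class FormatRegistry {
        private final MimeDetector detector;
        private final Map<String, FormatParser> parsers = new HashMap<>();

        FormatRegistry(MimeDetector detector) {
            this.detector = detector;
        }

        void register(String mimeType, FormatParser parser) {
            parsers.put(mimeType, parser);
        }

        String parse(String fileName, String content) {
            String mime = detector.detect(fileName);
            FormatParser parser = parsers.get(mime);
            if (parser == null) {
                throw new IllegalArgumentException("No parser for " + mime);
            }
            return parser.parse(content);
        }
    }

    public static void main(String[] args) {
        FormatRegistry registry = new FormatRegistry(new ExtensionDetector());
        registry.register("text/plain", content -> content);
        // Crude tag stripper standing in for a real HTML parser.
        registry.register("text/html", content -> content.replaceAll("<[^>]*>", ""));
        System.out.println(registry.parse("page.html", "<p>Hello <b>GATE</b></p>"));
    }
}
```

With a seam like this, the core code would depend only on the two interfaces, so swapping the stand-ins for an external library should not require touching the callers.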

Discussion

  • Mark Greenwood - 2014-04-11
    • status: open --> closed
    • Group: --> Next_Release_(example)
     
  • Mark Greenwood - 2014-04-11

    We now use Tika for most document parsing, although a bug in it is currently stopping us from upgrading to the most recent version.

     
