GATE currently has an internal mechanism for parsing document formats which converts the markup into annotations (at least for XML/HTML documents) and does some detection of MIME types.
The Tika project (incubator.apache.org/tika/) does exactly that. It also generates some markup for PDF documents and is good at detecting MIME types and encodings. Tika's API is simple and could easily be plugged into GATE.
Externalising the format parsing and delegating it to Tika would have the following advantages:
- simplify the code of GATE
- benefit from improvements in Tika
- easier to change behaviour (via Tika's XML config file)
- get markup for PDF documents
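
To illustrate the idea, here is a minimal sketch of how markup could be turned into stand-off annotations behind a plain SAX handler. Tika parsers report document structure as SAX events to an org.xml.sax.ContentHandler, so a handler along these lines could in principle be handed to Tika instead of the JDK parser used here. The class name MarkupToAnnotations, the Span type and the parse helper are hypothetical names for this example only; they are not part of GATE or Tika, and a real integration would create GATE annotations rather than Span objects.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the plain text of a document and records
// one stand-off span (element name + character offsets) per element.
class MarkupToAnnotations extends DefaultHandler {

    // Hypothetical stand-in for a GATE annotation.
    static final class Span {
        final String type;
        final int start;
        final int end;

        Span(String type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    private final StringBuilder text = new StringBuilder();
    private final List<Span> spans = new ArrayList<>();
    // Start offsets of the elements that are currently open.
    private final Deque<Integer> openOffsets = new ArrayDeque<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Remember where in the plain text this element begins.
        openOffsets.push(text.length());
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // The element covers everything appended since its start offset.
        spans.add(new Span(qName, openOffsets.pop(), text.length()));
    }

    String getText() {
        return text.toString();
    }

    List<Span> getSpans() {
        return spans;
    }

    // Convenience helper for this sketch: parses an XML string with the JDK SAX parser.
    static MarkupToAnnotations parse(String xml) {
        MarkupToAnnotations handler = new MarkupToAnnotations();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse markup", e);
        }
        return handler;
    }
}
```

Calling MarkupToAnnotations.parse on a small XML string returns the handler, from which getText() gives the markup-free text and getSpans() gives the spans in the order their elements were closed. Because the handler only depends on SAX, the format-specific parsing stays outside of it, which is the separation proposed above.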