Extractors
Extractors extract the full-text and/or metadata of a particular document type (one or more MIME types). There are two kinds of extractors. The first operates on an InputStream and is simply called an Extractor. The second operates on a java.io.File instance and is called a FileExtractor. Both can optionally be given a MIME type and/or a Charset to tune the processing, and both produce a set of RDF statements describing the full-text and metadata.
The extractor classes are summarized in the following diagram:
FileExtractors are kept separate from normal Extractors because some binary data formats are built in a way that requires the entire file to be available. For example, ID3 version 1 tags are placed in the last 128 bytes of an MP3 file. It is trivial to read them from a file on a hard disk, but difficult to read them from a stream: the entire content would have to be read to the very end, and only then could the last 128 bytes be located by counting backwards. The entire file would need to be kept in memory or in a temporary file. It makes much more sense to create a temporary file by writing the content of the stream directly to disk and then passing it to a FileExtractor.
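The difference between the two access patterns can be sketched in plain Java. This is a minimal illustration and not Aperture code: the class `Id3v1TailSketch` and its method names are hypothetical, and only the title field of the tag is decoded (ID3v1 stores the marker `TAG` followed by a 30-byte title at the start of the final 128 bytes).

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of Aperture.
final class Id3v1TailSketch {

    static final int TAG_LENGTH = 128;

    /** Returns the ID3v1 title stored at the end of the file, or null if no tag is present. */
    static String readTitleFromFile(File file) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            if (raf.length() < TAG_LENGTH) {
                return null;
            }
            // with a file we can jump straight to the last 128 bytes
            byte[] tag = new byte[TAG_LENGTH];
            raf.seek(raf.length() - TAG_LENGTH);
            raf.readFully(tag);
            // an ID3v1 tag starts with the ASCII marker "TAG"
            if (tag[0] != 'T' || tag[1] != 'A' || tag[2] != 'G') {
                return null;
            }
            // the title occupies the 30 bytes after the marker, padded with zeros or spaces
            String title = new String(tag, 3, 30, StandardCharsets.ISO_8859_1);
            int nul = title.indexOf('\0');
            if (nul >= 0) {
                title = title.substring(0, nul);
            }
            return title.trim();
        }
    }

    /** Spools the stream to a temporary file first, then reuses the file-based code. */
    static String readTitleFromStream(InputStream stream) throws IOException {
        Path temp = Files.createTempFile("extractor", ".tmp");
        try {
            Files.copy(stream, temp, StandardCopyOption.REPLACE_EXISTING);
            return readTitleFromFile(temp.toFile());
        } finally {
            Files.deleteIfExists(temp);
        }
    }

    public static void main(String[] args) throws IOException {
        // usage: java Id3v1TailSketch somefile.mp3
        System.out.println(readTitleFromFile(new File(args[0])));
    }
}
```

The stream variant never holds the whole content in memory; it pays for one sequential write to disk and then gets random access for free, which is the trade-off described above.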
The current set of available Extractors focuses on typical document-like formats, such as word processor documents, spreadsheets and presentations. Future implementations are planned that also target images, videos and sound files; JpgExtractor and MP3FileExtractor are examples that are already present.
Available Extractors
Below is a list of available Extractors, their external dependencies and remarks. These implementations may vary in performance and extraction quality. Some use dedicated external libraries for handling a specific document format or family of document formats; others merely use a heuristic algorithm to extract readable text from a binary stream. Follow the link of an extractor to see the mappings it produces.
| Extractor | Dependencies | Remarks |
| --- | --- | --- |
| [ExcelExtractor] | Poi libraries | Tests indicate that Excel 97 and higher are supported. Both the full-text and metadata are retrieved. |
| [HtmlExtractor] | Htmlparser | -- |
| [JpgExtractor] | Metadata Extractor | Extracts EXIF annotations from JPG photos. It covers a limited subset of the EXIF specification, though this subset is bound to expand in future versions. |
| [MimeExtractor] | JavaMail | An Extractor for message/rfc822 and message/news documents. Both the most significant headers and the body are extracted. Any attachments are ignored. |
| MP3FileExtractor | JAudiotagger | Extracts ID3 metadata from MP3 files. Supports those properties from ID3v1, ID3v2.3 and ID3v2.4 specifications that are covered by the NID3 ontology. ID3 2.2 is not supported (yet). Support for Synchronized Text (SYLT frame) is also planned for the future. |
| [OfficeExtractor] | Poi libraries | An Extractor that can be used as a fall-back when the MIME type identifier was able to identify a document as an MS Office document but was not able to further classify it, e.g. as an MS Word file. Both text and metadata are extracted. |
| [OpenDocumentExtractor] | -- | Extracts full-text and metadata from OpenDocument files and is backwards compatible with older OpenOffice (1.x) and StarOffice (6.x and 7.x) documents. |
| OpenXML | -- | Extracts full-text and metadata from files in the Open XML format, generated by the newer versions of MS Office suite. |
| [PdfExtractor] | PDFBox | Extracts full-text and metadata from all PDF versions. |
| [PlainTextExtractor] | -- | A trivial extractor implementation for plain text files. |
| [PowerPointExtractor] | Poi libraries | Text and metadata are extracted. Text extraction is noisy but sufficient for text indexing, if you're willing to accept that your index will contain non-word symbols. |
| [PresentationsExtractor] | Poi libraries | Apparently Presentations files can have an OLE structure similar to MS Office files or use a document structure similar to WordPerfect. In both cases text can be extracted. In the first case metadata is also extracted. |
| [PublisherExtractor] | Poi libraries | Poi is only used for document metadata retrieval, text retrieval uses a heuristic string extraction algorithm. |
| [QuattroExtractor] | Poi libraries | Only recent Quattro Formats, as used by Quattro Pro 7 and Quattro Pro X3, are supported as these have a structure similar to MS Office documents. Older versions are not supported. Poi is only used for metadata retrieval, text retrieval uses a heuristic string extraction algorithm in both cases. |
| [RtfExtractor] | None (uses the JRE's RTFEditorKit) | Only document text is extracted. Otherwise no known issues. |
| [VisioExtractor] | Poi libraries | Poi is only used for document metadata retrieval, text retrieval uses a heuristic string extraction algorithm. |
| [WordExtractor] | Poi libraries | Tests indicate that Word 97 and higher are supported. Both text and metadata are extracted. |
| [WordPerfectExtractor] | -- | Implementation only extracts full-text. Text is extracted from WordPerfect documents from version 4.2 up to WordPerfect X3 (tested with 4.2, 5.0, 5.1/5.2 and X3, all created using WordPerfect X3), except for the 5.1/5.2 Far East format. Tests revealed that for WordPerfect 5.0 and 5.1 the document metadata also ends up at the start of the extracted full-text. |
| [WorksExtractor] | -- | Implementation only extracts text and apparently only works well on Works 3.0 and 4.0 documents and Works 4.0/2000 spreadsheets. Other versions typically produce garbage, if anything at all. |
| [XmlExtractor] | -- | Extracts all PCDATA and attribute values in the order in which they appear in the document. |
Note regarding the Poi libraries: occasionally Poi cannot process a document correctly and throws some sort of Exception. In that case the Extractor typically catches and discards the Exception and falls back to applying a heuristic string extraction algorithm to the binary stream, which very often works surprisingly well on MS Office formats.
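The catch-and-fall-back pattern described in this note can be sketched as follows. This is only an illustration under stated assumptions, not the algorithm Aperture actually uses: `HeuristicStringSketch`, `TextParser`, `extractStrings` and `extractWithFallback` are hypothetical names, and the heuristic shown only collects runs of printable ASCII characters. A real implementation would also have to deal with text stored in other encodings such as UTF-16.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not part of Aperture.
final class HeuristicStringSketch {

    /** Stand-in for a format-specific parser (for example one built on Poi). */
    interface TextParser {
        String parse(byte[] data) throws Exception;
    }

    /** Collects every run of at least minLength printable ASCII characters. */
    static List<String> extractStrings(byte[] data, int minLength) {
        List<String> result = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (byte b : data) {
            int c = b & 0xFF;
            if (c >= 0x20 && c < 0x7F) {
                run.append((char) c);
            } else {
                // a non-printable byte ends the current run
                if (run.length() >= minLength) {
                    result.add(run.toString());
                }
                run.setLength(0);
            }
        }
        if (run.length() >= minLength) {
            result.add(run.toString());
        }
        return result;
    }

    /** Tries the dedicated parser first and falls back to the heuristic when it fails. */
    static String extractWithFallback(byte[] data, TextParser parser) {
        try {
            return parser.parse(data);
        } catch (Exception e) {
            // discard the exception and salvage whatever readable text the bytes contain
            return String.join(" ", extractStrings(data, 4));
        }
    }

    public static void main(String[] args) {
        byte[] data = {0, 1, 'H', 'e', 'l', 'l', 'o', 0, 'a', 'b', (byte) 0xFF, 'W', 'o', 'r', 'l', 'd'};
        TextParser broken = d -> { throw new IllegalStateException("cannot parse"); };
        System.out.println(extractWithFallback(data, broken));
    }
}
```

The minimum run length is the main tuning knob: a small value lets more binary noise through, a large value drops short words.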
Example
The following code demonstrates how to apply an Extractor to a given file and dump the extraction results to System.out (using N-Triples encoding):

    package org.semanticdesktop.aperture.examples.tutorials;

    import java.io.BufferedInputStream;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.PrintWriter;
    import java.util.Set;

    import org.ontoware.rdf2go.RDF2Go;
    import org.ontoware.rdf2go.model.Model;
    import org.ontoware.rdf2go.model.Syntax;
    import org.ontoware.rdf2go.model.node.URI;
    import org.ontoware.rdf2go.model.node.impl.URIImpl;
    import org.semanticdesktop.aperture.extractor.Extractor;
    import org.semanticdesktop.aperture.extractor.ExtractorFactory;
    import org.semanticdesktop.aperture.extractor.ExtractorRegistry;
    import org.semanticdesktop.aperture.extractor.impl.DefaultExtractorRegistry;
    import org.semanticdesktop.aperture.mime.identifier.MimeTypeIdentifier;
    import org.semanticdesktop.aperture.mime.identifier.magic.MagicMimeTypeIdentifier;
    import org.semanticdesktop.aperture.rdf.RDFContainer;
    import org.semanticdesktop.aperture.rdf.impl.RDFContainerImpl;
    import org.semanticdesktop.aperture.util.IOUtil;
    import org.semanticdesktop.aperture.vocabulary.NIE;

    public class ExtractorExample {

        public static void main(String[] args) throws Exception {
            // create a MimeTypeIdentifier
            MimeTypeIdentifier identifier = new MagicMimeTypeIdentifier();

            // create an ExtractorRegistry containing all available
            // ExtractorFactories
            ExtractorRegistry extractorRegistry = new DefaultExtractorRegistry();

            // read as many bytes of the file as desired by the MIME type identifier
            File file = new File("somefile.someextension");
            FileInputStream stream = new FileInputStream(file);
            BufferedInputStream buffer = new BufferedInputStream(stream);
            byte[] bytes = IOUtil.readBytes(buffer, identifier.getMinArrayLength());
            stream.close();

            // let the MimeTypeIdentifier determine the MIME type of this file
            String mimeType = identifier.identify(bytes, file.getPath(), null);

            // skip when the MIME type could not be determined
            if (mimeType == null) {
                System.err.println("MIME type could not be established.");
                return;
            }

            // create the RDFContainer that will hold the RDF model
            URI uri = new URIImpl(file.toURI().toString());
            Model model = RDF2Go.getModelFactory().createModel();
            model.open();
            RDFContainer container = new RDFContainerImpl(model, uri);

            // determine and apply an Extractor that can handle this MIME type
            Set factories = extractorRegistry.get(mimeType);
            if (factories != null && !factories.isEmpty()) {
                // just fetch the first available Extractor
                ExtractorFactory factory = (ExtractorFactory) factories.iterator().next();
                Extractor extractor = factory.get();

                // apply the extractor on the specified file
                // (just open a new stream rather than buffer the previous stream)
                stream = new FileInputStream(file);
                buffer = new BufferedInputStream(stream, 8192);
                extractor.extract(uri, buffer, null, mimeType, container);
                stream.close();
            }

            // add the MIME type as an additional statement to the RDF model
            container.add(NIE.mimeType, mimeType);

            // report the output to System.out
            container.getModel().writeTo(new PrintWriter(System.out), Syntax.Ntriples);
        }
    }
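The example above first reads a few leading bytes of the file so that the MIME type identifier can inspect them. As a rough illustration of how such magic-number detection can work, here is a small self-contained sketch. It is not the Aperture implementation: the class `MagicBytesSketch`, its methods, and the label returned for OLE2 containers are assumptions made for this example, and only four well-known signatures are checked.

```java
// Hypothetical sketch, not part of Aperture.
final class MagicBytesSketch {

    private static final byte[] PDF = {'%', 'P', 'D', 'F'};
    private static final byte[] JPEG = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF};
    // ZIP local file header; OpenDocument and Open XML files are ZIP containers
    private static final byte[] ZIP = {'P', 'K', 3, 4};
    // OLE2 compound document header, used by the legacy MS Office formats
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    /** Number of leading bytes needed to test the longest signature. */
    static int minArrayLength() {
        return OLE2.length;
    }

    /** Returns a MIME type for the given leading bytes, or null if none matches. */
    static String identify(byte[] head) {
        if (startsWith(head, PDF)) {
            return "application/pdf";
        }
        if (startsWith(head, JPEG)) {
            return "image/jpeg";
        }
        if (startsWith(head, ZIP)) {
            return "application/zip";
        }
        if (startsWith(head, OLE2)) {
            // illustrative label: the container alone does not reveal Word, Excel, etc.
            return "application/vnd.ms-office";
        }
        return null;
    }

    private static boolean startsWith(byte[] data, byte[] magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] head = {'%', 'P', 'D', 'F', '-', '1', '.', '4'};
        System.out.println(identify(head));
    }
}
```

The OLE2 branch shows why a fall-back such as the OfficeExtractor is useful: the header identifies the container format, but further inspection is needed to tell which application produced the document.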