Apache Tika™ is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries. It provides command-line access to content parsers, developed for the Apache Lucene project, that handle various document formats, notably OOXML, ODF, RTF, PDF, ePub, HTML and XML. It can also parse metadata from several audio and image formats, as well as FLV video. It is available from http://tika.apache.org.
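As a rough illustration of the command-line access described above, the following Python sketch wraps the standalone tika-app jar. It assumes Java is on the PATH and that the jar has been downloaded as tika-app.jar in the working directory; that file name and the helper names tika_command and run_tika are hypothetical choices for this example, not part of Tika. The --text, --metadata and --detect options belong to the tika-app command line, but check them against the help output of your installed version.

```python
import subprocess
from pathlib import Path

# Assumption: a tika-app jar saved under this name; adjust to your install.
TIKA_JAR = Path("tika-app.jar")

# Output modes mapped to tika-app command-line options.
MODES = {
    "text": "--text",          # plain text content
    "metadata": "--metadata",  # metadata key/value pairs
    "detect": "--detect",      # detected media type only
}


def tika_command(document, mode="text", jar=TIKA_JAR):
    """Build the argument list for one tika-app invocation (hypothetical helper)."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    return ["java", "-jar", str(jar), MODES[mode], str(document)]


def run_tika(document, mode="text", jar=TIKA_JAR):
    """Run tika-app on a document and return its standard output as a string."""
    result = subprocess.run(
        tika_command(document, mode, jar),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if tika-app exits non-zero
    )
    return result.stdout
```

With the jar in place, run_tika("report.pdf", "metadata") would return whatever metadata Tika extracts from that file, and run_tika("report.pdf") its text content. Passing the arguments as a list rather than a shell string avoids quoting problems with file names that contain spaces.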