From: Joe W. <jo...@gm...> - 2011-08-06 04:16:16
Hi all,

Tika 0.8 suffers from a well-documented problem in which spaces are stripped from PDF content [1]. This problem was fixed in 0.9 [2], which was released back in February.

Since eXist 1.5dev trunk contains Tika 0.8, I ran into this problem: extracted PDF text had no spaces. I was using content:get-metadata-and-content() to extract text from a PDF file stored in the db, and I saw the problem with both PDFs I tested under 0.8. When I replaced the 0.8 jar with 0.9, the problem was gone.

I'd suggest that we update to 0.9. I'd be happy to update trunk if the core devs give me the go-ahead, if it would help [3], but perhaps there are other considerations here I'm not aware of.

Thoughts?

Cheers,
Joe

[1] https://issues.apache.org/jira/browse/TIKA-548
[2] http://www.apache.org/dist/tika/CHANGES-0.9.txt
[3] http://tika.apache.org/download.html
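For anyone who wants to check whether their own extracted text shows the missing-spaces symptom, here is a rough sketch of a heuristic check in Python. It is not part of eXist or Tika; the function name looks_space_stripped and all thresholds are assumptions picked for illustration. It works only on text that has already been extracted (for example, the string content returned by content:get-metadata-and-content()), and it simply flags text with almost no whitespace or with implausibly long tokens.

```python
def looks_space_stripped(text: str,
                         min_len: int = 50,
                         min_space_ratio: float = 0.05,
                         max_token_len: int = 40) -> bool:
    """Heuristically flag extracted text whose spaces appear to be missing.

    All thresholds are illustrative assumptions, not values taken from
    Tika or eXist. Tune them for your own documents.
    """
    stripped = text.strip()
    # Too little text to judge reliably, so do not flag it.
    if len(stripped) < min_len:
        return False

    # Fraction of characters that are whitespace (spaces, newlines, tabs).
    space_ratio = sum(ch.isspace() for ch in stripped) / len(stripped)

    # Length of the longest run of non-whitespace characters.
    longest_token = max(len(tok) for tok in stripped.split())

    return space_ratio < min_space_ratio or longest_token > max_token_len


if __name__ == "__main__":
    good = ("Tika 0.8 suffers from a well-documented problem, "
            "in which spaces are stripped from PDF content")
    bad = good.replace(" ", "")
    print(looks_space_stripped(good))
    print(looks_space_stripped(bad))
```

Text consisting mostly of long URLs or identifiers may trip the token-length check, so treat a positive result as a prompt to inspect the output by eye rather than as proof of the bug.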