From: Joe W. <jo...@gm...> - 2011-08-07 17:43:09
Hi all,

On 6 Aug 2011, at 10:34, Wolfgang Meier wrote:
> I noticed the same issue of whitespace being lost. An update would be great.

On Sat, Aug 6, 2011 at 5:12 AM, Dannes Wessels <da...@ex...> wrote:
> sure, let's do it. I'd like to propose a two step thing….
> (1) just upgrade to 0.9

Okay, I've updated trunk to 0.9 in rev. 15092. See
http://exist.svn.sourceforge.net/exist/?rev=15092&view=rev.

For those who want to test/try Tika 0.9: if you have a
local.build.properties, you'll need to update it to match the new URL
for 0.9 in the build.properties file I just committed.

> (2) I'll split into jars, so we avoid double class entries
> my question….. is there a (small) test case I can run, to show that the
> stuff is still working? remote parse of a document or so….

Sounds good! As to your question, I tried creating a small test along
these lines but encountered a problem -- apparently in the httpclient?
The script downloads a PDF and parses it, but returns no text for any of
the 35 pages: no error, but no text. If, instead, I read the same PDF
from the database, Tika returns all of the text. Strange!

Here is my test script:

===
xquery version "1.0";

import module namespace content="http://exist-db.org/xquery/contentextraction"
    at "java:org.exist.contentextraction.xquery.ContentExtractionModule";

declare namespace httpclient = "http://exist-db.org/xquery/httpclient";

let $uri := 'http://webcomposite.com/resource/pdf/x-advxquery-pdf.pdf'
let $response := httpclient:get(xs:anyURI($uri), false(), ())
let $pdf := util:string-to-binary(util:base64-decode($response/httpclient:body/string()))
let $content := content:get-metadata-and-content($pdf)
return $content
===

The content returned is as follows:

<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <meta name="xmpTPg:NPages" content="35"/>
        <meta name="Type" content="COSName{Info}"/>
        <meta name="producer" content="null"/>
        <meta name="Content-Type" content="application/pdf"/>
        <title/>
    </head>
    <body>
        <div class="page">
            <p/>
        </div>
        <div class="page">
            <p/>
        </div>

(and so on for 35 divs.)

But if I download the PDF, put it in the database, and read it from the
database, the query returns the expected results:

===
xquery version "1.0";

import module namespace content="http://exist-db.org/xquery/contentextraction"
    at "java:org.exist.contentextraction.xquery.ContentExtractionModule";

let $pdf := util:binary-doc('/db/x-advxquery-pdf.pdf')
let $content := content:get-metadata-and-content($pdf)
return $content
===

The returned content is as follows:

<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <meta name="xmpTPg:NPages" content="35"/>
        <meta name="Type" content="COSName{Info}"/>
        <meta name="producer" content="null"/>
        <meta name="Content-Type" content="application/pdf"/>
        <title/>
    </head>
    <body>
        <div class="page">
            <p>Advancing with XQuery: Develop application idioms Work with extension functions, unit tests and assertions, recursion and sorting, and higher-order functions Skill Level: Intermediate James R. Fuller ( jim...@we...) Technical Director FlameDigital Limited & Webcomposite s.r.o.
30 Sep 2008 The XQuery specification is well over a year old now. A surfeit of solid implementations combined with (if developer chatter is anything to go by) marked new interest, seems to indicate that XQuery is finally experiencing higher adoption rates. Possibly this is due to developers starting to figure out how to utilize XQuery within a rich mixture of XML technologies (such as XML databases. XSLT, XML Schema). Learn how to use XQuery beyond its original role as an XML query language and apply it toward the development of middleware and Web applications. Section 1. Before you start Before you examine XQuery code samples, here's how to get the most of this tutorial, and instructions on how to install and use the included source code (see Downloads). About this tutorial This tutorial is about using XQuery to develop applications and middleware. It outlines some of XQuery's limitations while you develop applications, gives you Advancing with XQuery: Develop application idioms © Copyright IBM Corporation 1994, 2008. All rights reserved. Page 1 of 35</p>
        </div>

(and so on for 35 pages.)

I'm not sure what is causing the httpclient version of this script to
fail. But if we can get it to work, it could be the basis of a test,
along the lines of what Dannes requested.

Cheers,
Joe
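
One possible explanation, not verified against this setup: util:base64-decode returns an xs:string, so in the httpclient script the PDF bytes pass through a character-decoding step before util:string-to-binary encodes them again. That round trip could mangle the compressed page streams while leaving the ASCII metadata readable, which would match the 35 empty pages. The variant below is a minimal sketch that casts the body text straight to xs:base64Binary instead. It assumes that httpclient:get returns a binary response body as Base64 text inside httpclient:body, and that the cast accepts that text as-is; both are assumptions, not confirmed behaviour.

```xquery
xquery version "1.0";

import module namespace content="http://exist-db.org/xquery/contentextraction"
    at "java:org.exist.contentextraction.xquery.ContentExtractionModule";

declare namespace httpclient = "http://exist-db.org/xquery/httpclient";

let $uri := 'http://webcomposite.com/resource/pdf/x-advxquery-pdf.pdf'
let $response := httpclient:get(xs:anyURI($uri), false(), ())
(: Assumption: the body of a binary response holds the Base64 text of the
   raw bytes. Casting that text to xs:base64Binary keeps the bytes as they
   are, with no intermediate xs:string decoding step. :)
let $pdf := xs:base64Binary($response/httpclient:body/string())
let $content := content:get-metadata-and-content($pdf)
return $content
```

If this variant returns the page text, the remote-parse script could serve as the small test Dannes asked for; if it still returns empty pages, the cause lies elsewhere.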