Re: [Exist-development] Suggestion to update to tika-app-0.9.jar

Hi all,

On 6 Aug 2011, at 10:34 , Wolfgang Meier wrote:
> I noticed the same issue of whitespace being lost. An update would be
> great.

On Sat, Aug 6, 2011 at 5:12 AM, Dannes Wessels <da...@ex...> wrote:
> sure, let's do it. I'd like to propose a two-step thing….
> (1) just upgrade to 0.9

Okay, I've updated trunk to 0.9 in rev. 15092.  See
http://exist.svn.sourceforge.net/exist/?rev=15092&view=rev.

For those who want to test/try tika 0.9: Note that if you have a
local.build.properties, you'll need to update yours to match the new URL for
0.9 in the build.properties file I just committed.

> (2) I'll split into jars, so we avoid double class entries
> my question….. is there a (small) test case I can run, to show that the
> stuff is still working? remote parse of a document or so….

Sounds good!

As to your question, I tried creating a small test along these lines but
encountered a problem -- apparently in the httpclient?  The script downloads
a PDF and parses it, but returns no text on each of the 35 pages - no error,
but no text.  If, instead, I read the same PDF from the database, tika
returns all of the text.  Strange!  Here is my test script:

===
xquery version "1.0";

import module namespace content="http://exist-db.org/xquery/contentextraction"
    at "java:org.exist.contentextraction.xquery.ContentExtractionModule";

declare namespace httpclient = "http://exist-db.org/xquery/httpclient";

let $uri := 'http://webcomposite.com/resource/pdf/x-advxquery-pdf.pdf'
let $response := httpclient:get(xs:anyURI($uri), false(), ())
let $pdf :=
    util:string-to-binary(util:base64-decode($response/httpclient:body/string()))
let $content := content:get-metadata-and-content($pdf)
return $content
===

The content returned is as follows:

    <html xmlns="http://www.w3.org/1999/xhtml">
        <head>
            <meta name="xmpTPg:NPages" content="35"/>
            <meta name="Type" content="COSName{Info}"/>
            <meta name="producer" content="null"/>
            <meta name="Content-Type" content="application/pdf"/>
            <title/>
        </head>
        <body>
            <div class="page">
                <p/>
            </div>
            <div class="page">
                <p/>
            </div>

(and so on for 35 divs.)  But if I download the PDF and put it in the
database and read from the database, the query returns with the expected
results:

===
xquery version "1.0";

import module namespace content="http://exist-db.org/xquery/contentextraction"
    at "java:org.exist.contentextraction.xquery.ContentExtractionModule";

let $pdf := util:binary-doc('/db/x-advxquery-pdf.pdf')
let $content := content:get-metadata-and-content($pdf)
return $content
===

The returned content is as follows:

    <html xmlns="http://www.w3.org/1999/xhtml">
        <head>
            <meta name="xmpTPg:NPages" content="35"/>
            <meta name="Type" content="COSName{Info}"/>
            <meta name="producer" content="null"/>
            <meta name="Content-Type" content="application/pdf"/>
            <title/>
        </head>
        <body>
            <div class="page">
                <p>Advancing with XQuery: Develop application idioms Work
with extension functions, unit tests and assertions, recursion and sorting,
and higher-order functions Skill Level: Intermediate James R. Fuller (
jim...@we...) Technical Director FlameDigital Limited &
Webcomposite s.r.o. 30 Sep 2008 The XQuery specification is well over a year
old now. A surfeit of solid implementations combined with (if developer
chatter is anything to go by) marked new interest, seems to indicate that
XQuery is finally experiencing higher adoption rates. Possibly this is due
to developers starting to figure out how to utilize XQuery within a rich
mixture of XML technologies (such as XML databases. XSLT, XML Schema). Learn
how to use XQuery beyond its original role as an XML query language and
apply it toward the development of middleware and Web applications. Section
1. Before you start Before you examine XQuery code samples, here's how to
get the most of this tutorial, and instructions on how to install and use
the included source code (see Downloads). About this tutorial This tutorial
is about using XQuery to develop applications and middleware. It outlines
some of XQuery's limitations while you develop applications, gives you
Advancing with XQuery: Develop application idioms © Copyright IBM
Corporation 1994, 2008. All rights reserved. Page 1 of 35</p>
            </div>

(and so on for 35 pages.)

I'm not sure what is causing the httpclient version of this script to fail.
But if we can get it to work, it could be the basis of a test, along the
lines of what Dannes requested.
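
One variation that might be worth trying (untested, and only a guess at the
cause): if the httpclient module hands back the PDF body as base64-encoded
text, then decoding it to a string and converting that string back to binary
could be mangling the bytes.  The sketch below is the first script with only
the $pdf line changed, so that the body is cast directly to xs:base64Binary.
It assumes the body really is base64 text and that the cast yields what
content:get-metadata-and-content expects:

===
xquery version "1.0";

import module namespace content="http://exist-db.org/xquery/contentextraction"
    at "java:org.exist.contentextraction.xquery.ContentExtractionModule";

declare namespace httpclient = "http://exist-db.org/xquery/httpclient";

let $uri := 'http://webcomposite.com/resource/pdf/x-advxquery-pdf.pdf'
let $response := httpclient:get(xs:anyURI($uri), false(), ())
(: Assumption: the response body is base64 text, so cast it straight to
   binary instead of decoding it to a string and re-encoding it. :)
let $pdf := xs:base64Binary($response/httpclient:body/string())
let $content := content:get-metadata-and-content($pdf)
return $content
===

If this returns the same text as the database version, the string round trip
would be the likely culprit rather than tika or the httpclient itself.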

Cheers,

Joe
