Thank you Martin. I'll try this approach.
Most likely I'll have more questions along the way.
Have a good weekend!
On Fri, 13 Jan 2006, Martin Haye wrote:
> I think there's a third approach which doesn't involve any external Java
> class and is still pretty simple:
> In the prefilter for the .pdf file, use the file ID or path to grab the mets
> data and *insert* the meta-data into the PDF document. XTF still sees it as
> one document, so it gets indexed with the mets meta-data and the PDF as the
> full text. You can pull in any XML file using the XSLT 'document' function,
> available to your stylesheets.
> This method is similar to the default XTF setup in the distribution: the
> prefilter for the main XML document pulls in meta-data from a paired Dublin
> Core file in the same directory.
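> As a rough illustration of the approach above, here is a minimal sketch of a
> PDF prefilter that inserts mets meta-data. It is not the stylesheet shipped
> with XTF. The 'metsPath' parameter, the 'metsFile' and 'metsLabel' element
> names, and the choice of the LABEL attribute are all assumptions; how the
> mets path is derived from the PDF's ID or path is left open.
>
> ```xml
> <xsl:stylesheet version="2.0"
>     xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
>     xmlns:xtf="http://cdlib.org/xtf">
>   <!-- Hypothetical parameter: location of the paired mets file. A relative
>        value is resolved against the location of this stylesheet. -->
>   <xsl:param name="metsPath" select="'mets/example.mets.xml'"/>
>   <!-- Root of the converted PDF: copy it, then add meta-data fields
>        ahead of the full text. -->
>   <xsl:template match="/*">
>     <xsl:copy>
>       <xsl:copy-of select="@*"/>
>       <!-- Assumed field names; xtf:meta marks them as meta-data. -->
>       <metsFile xtf:meta="true">
>         <xsl:value-of select="$metsPath"/>
>       </metsFile>
>       <metsLabel xtf:meta="true">
>         <xsl:value-of select="document($metsPath)/*/@LABEL"/>
>       </metsLabel>
>       <xsl:apply-templates/>
>     </xsl:copy>
>   </xsl:template>
>   <!-- Identity rule: pass the PDF text through unchanged. -->
>   <xsl:template match="@*|node()">
>     <xsl:copy>
>       <xsl:apply-templates select="@*|node()"/>
>     </xsl:copy>
>   </xsl:template>
> </xsl:stylesheet>
> ```
>
> Storing the mets file name as a meta-data field is one way a search result
> could later be linked to the mets file instead of the PDF.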
> Note: this depends on XTF's default PDF conversion tool, PDFBox. Since we
> created this functionality, PDFBox has come out with new versions, which
> handle some PDF files better. You should try the XTF version out (possibly
> without meta-data) to make sure you're happy with the conversions.
> If you're not happy with the way XTF converts PDF files to text, then your
> #2 solution or a variant thereof would give you the flexibility to convert
> them any way you like. Don't worry about the size of the returned document;
> as long as it fits in memory, you should be fine. RAM is cheap.
> On 1/13/06, Giulia Hill <ghill@...> wrote:
> > Here is a scenario, interesting to me, that I'm trying to solve.
> > Scenario:
> > I have two sets of files, mets and pdf, which are related in pairs. I need
> > to index both, but the pdf files only need full-text access for
> > searches. However, a hit on a term found in a pdf needs to link
> > to the related mets file rather than to the pdf itself.
> > My approach(es)
> > I thought of two possible solutions; the first, though, is what I think
> > might be the simpler.
> > 1) In the preFilter.xml for the pdf file, I change what is indexed as the
> > id of the file, replacing it with the related mets fileName, which I
> > calculate by calling, from within the xsl, an external java class
> > which analyzes the files in the mets directory. I like this approach
> > because it leaves XTF with the task of indexing and doing all of the
> > work, with minimal intervention on my side.
> > My question: how do I change the $id of the file?
> > 2) In the preFilter.xml for the mets file, I make a call to an external
> > java class which finds the right pdf and returns its content, which I would
> > put in an indexing snippet like this:
> > <myPdf xtf:meta="true">
> > <xsl:value-of select="$resultOfJavaCall"/>
> > </myPdf>
> > I would somehow have to convert the pdf into text beforehand. What
> > concerns me about this approach is the huge entries that I might get.
> > Do you have a suggestion for the easiest, cleanest way to solve this problem?
> > Thanks,
> > Giulia
> > ----------------------------
> > Giulia Hill
> > Programmer/Analyst
> > Library Systems Office
> > University of California at Berkeley
> > 386 Doe Annex
> > Berkeley, CA 94720
Library Systems Office
University of California at Berkeley
386 Doe Annex
Berkeley, CA 94720