Thread: [VuFind-General] Any best practices integrating catalogue enrichment with teasers?

[VuFind-General] Any best practices integrating catalogue enrichment with teasers?

From: Andreas K. <And...@gm...> - 2009-07-28 20:05:46

Attachments: signature.asc

Hello everyone,

does anyone have experience integrating scanned TOCs into a VuFind search
with teaser display? I would be interested to know whether any best
practices exist. In particular, I would be very glad to hear whether there
are recommendations on how to display teasers in VuFind's result list and
on the record display screen.

A short summary of my experiments so far:
I received the scanned TOCs as PDFs and extracted plain text from them. To
index the texts with their associated metadata documents, I use the
filenames in the URLs (MARC field 856a). I extended VuFindIndexer.java
in MarcImporter.jar and added a field 'cecontents' to biblio's
schema.xml, which gets filled with the plain text in a custom method.
Indexing and searching work fine, and I adjusted Solr.php to search the
additional field if a query is sent via Basic Search.
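
The filename-based matching described above could look roughly like the
following. This is only a minimal Python sketch, not the Java code in
VuFindIndexer.java. It assumes (hypothetically) that each extracted text is
stored under the PDF's base name with a .txt suffix; the function names and
the example URL are made up for illustration.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def toc_text_name(url):
    """Map a TOC URL from the MARC record to the name of its text file.

    Assumption: "12345.pdf" was extracted to "12345.txt".
    Returns None if the URL has no file name component.
    """
    pdf_name = PurePosixPath(urlparse(url).path).name
    if not pdf_name:
        return None
    return PurePosixPath(pdf_name).with_suffix(".txt").name


def cecontents_for_record(urls, texts_by_name):
    """Collect the plain text for all TOC URLs of one record.

    texts_by_name maps text file names to their contents; URLs without
    a matching text file are skipped.
    """
    texts = []
    for url in urls:
        name = toc_text_name(url)
        if name in texts_by_name:
            texts.append(texts_by_name[name])
    return "\n".join(texts)


# Example with made-up data: only the first URL has an extracted text.
record_urls = ["http://example.org/toc/12345.pdf", "http://example.org/"]
print(cecontents_for_record(record_urls, {"12345.txt": "1 Introduction"}))
```

The returned string is what would go into the cecontents field of the
record's Solr document.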

Finally, I tried to display teasers, which worked fine using Solr's admin
interface. Today, my first attempt in VuFind was to write a Smarty plugin
using VuFind's Solr.php class. Unfortunately, this class only writes
MARC fields into $result, but I need the highlight option and my custom
cecontents field (using plain XML output didn't help).
Tomorrow I plan to try an implementation using SolrJS, written
into the output HTML via another Smarty plugin.

Is that a good idea? Do you have other experiences/ideas?

Andreas

Re: [VuFind-General] Any best practices integrating catalogue enrichment with teasers?

From: Andrew N. <as...@gm...> - 2009-07-28 21:14:46

Andreas - the best approach for this would be to use the Solr
highlighting feature, which you can configure to pull a "snippet"
from a specified field.
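
For illustration, a highlighting request of this kind might be built as
follows. This is a hedged Python sketch rather than VuFind code: hl, hl.fl,
hl.snippets, hl.fragsize, hl.simple.pre and hl.simple.post are standard Solr
highlighting parameters, while the base URL, the default values and the
function name are assumptions.

```python
from urllib.parse import urlencode


def highlight_query_url(base_url, query, field="cecontents",
                        snippets=3, fragsize=100):
    """Build a Solr select URL that asks for highlighted snippets.

    base_url is assumed to point at the core, for example
    "http://localhost:8090/solr/biblio" (hypothetical).
    """
    params = {
        "q": query,
        "hl": "true",             # switch highlighting on
        "hl.fl": field,           # field to pull snippets from
        "hl.snippets": snippets,  # maximum snippets per field
        "hl.fragsize": fragsize,  # approximate snippet length in characters
        "hl.simple.pre": '<span class="highlighting">',
        "hl.simple.post": "</span>",
    }
    return f"{base_url.rstrip('/')}/select?{urlencode(params)}"


print(highlight_query_url("http://localhost:8090/solr/biblio",
                          "cecontents:beef"))
```

Solr then returns the snippets in a separate "highlighting" section of the
response, keyed by document id.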

Let me know if you have any more specific questions about doing this.

Andrew

Re: [VuFind-General] Any best practices integrating catalogue enrichment with teasers?

From: Andreas K. <And...@gm...> - 2009-07-29 05:47:40

Attachments: signature.asc

Hello Andrew,

thanks for your message. Using Solr's highlighting is exactly what I am
planning to do. My main problem at the moment is displaying the
highlighting in VuFind, and I am not sure whether I should use Solr.php or
an implementation of SolrJS.
In Solr.php I cannot find the code that strips additional fields from
the result sets (the only output seems to be the contents of the
fullrecord field), and there seems to be no option to activate the
highlighting feature in that class. (I am using 1.0 RC1.)

Andreas

Re: [VuFind-General] Any best practices integrating catalogue enrichment with teasers?

From: Demian K. <dem...@vi...> - 2009-07-30 13:21:19

> Secondly, @Tillk, my ideas about your page in the wiki:
> (Admittedly, this is perhaps somehow an outsider's view not taking into
> account things I still need to learn about. ) Why do we think about
> record-formats like MARC or something else when displaying data from an
> index? MARC and others are great to quickly get data into a basic
> index, but after that we could be free from that and use our own
> conventions according to local user needs.

I think there are two main reasons for retaining the original record format and using it for display:

1.) There are some fields that you want to display to the user that serve no indexing purpose.  I'm not a Solr expert, but I assume it takes less overhead to store a single "full record" field than to break it into every conceivable piece in separate indexes.

2.) Storing the full record makes it really easy to display the full record "Staff View" tab, which is a really useful debugging tool when records act unexpectedly.

That being said, I agree that we should rely on specific record formats as little as possible.  If we can get the data intact from the Solr index, we might as well use the index version.

I've been doing some brainstorming with Till about this subject (keeping it off the list until my thoughts are a little more organized), but I'll be writing up a more formal proposal in the next day or two.  Hopefully this will help us get the best of both worlds.

- Demian

Re: [VuFind-General] Any best practices integrating catalogue enrichment with teasers?

From: Till K. <kin...@gb...> - 2009-07-30 13:55:55

Demian Katz wrote:
>> Secondly, @Tillk, my ideas about your page in the wiki:
>> (Admittedly, this is perhaps somehow an outsider's view not taking into
>> account things I still need to learn about. ) Why do we think about
>> record-formats like MARC or something else when displaying data from an
>> index? MARC and others are great to quickly get data into a basic
>> index, but after that we could be free from that and use our own
>> conventions according to local user needs.
> 
> I think there are two main reasons

I agree with those reasons.
A third is: there may be formats that really don't match the "default
semantics" of Solr index fields. But those semantics will govern record
display (e.g. you'll put a label "Author", or maybe the more general
"Person", in front of the contents of the author_fullStr index field).
What to do if a record format has no authors but, for example, singers or
butchers (to catalog your collection of Argentine beef)? You may want to
push the butchers into the "author" index anyway to make them findable,
but define an individual view for them, so as not to pretend those
artists are writers...
That's how I understood the outcome of the discussion on this list some 
weeks ago about handling of different formats: 
http://www.nabble.com/other-data-sources-for-the-index-to23867424.html

Till (should do some bbq tonight? :-)

-- 
Till Kinstler
Verbundzentrale des Gemeinsamen Bibliotheksverbundes (VZG)
Platz der Göttinger Sieben 1, D 37073 Göttingen
kin...@gb..., +49 (0) 551 39-13431, http://www.gbv.de

Re: [VuFind-General] Any best practices integrating catalogue enrichment with teasers?

From: Till K. <kin...@gm...> - 2009-07-29 07:29:34

Andreas Kahl wrote:

> thanks for your message. Using Solr's highlighting is exactly what I am
> planning to do. My main problem at the moment is to display the
> highlighting in vufind. And I am not sure if I should use Solr.php or an
> implementation of SolrJS.

As far as I understand, using SolrJS requires opening access to your 
Solr server HTTP interface for everyone (which allows updating, deleting 
records, adding indexes/cores...). So I would highly recommend some kind 
of filter between users and the Solr server that allows only uncritical 
calls to be made.
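
Such a filter could be as simple as an allowlist check in front of Solr.
The following Python sketch shows the idea only; the path, the set of
permitted parameters and the function name are assumptions, and a real
deployment would more likely do this in a reverse proxy or in VuFind's own
PHP code.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical allowlist: only the search handler of the biblio core.
ALLOWED_PATHS = {"/solr/biblio/select"}
# Only read-only query parameters; anything else is rejected.
ALLOWED_PARAMS = {"q", "start", "rows", "fl", "wt",
                  "hl", "hl.fl", "hl.snippets", "hl.fragsize"}


def is_safe_solr_request(method, url):
    """Return True only for read-only search requests on the allowlist."""
    if method.upper() != "GET":
        return False
    parts = urlsplit(url)
    if parts.path not in ALLOWED_PATHS:
        return False  # blocks /update, core admin and everything else
    params = parse_qs(parts.query, keep_blank_values=True)
    return set(params) <= ALLOWED_PARAMS


print(is_safe_solr_request("GET", "/solr/biblio/select?q=beef&hl=true"))
print(is_safe_solr_request("GET", "/solr/biblio/update?commit=true"))
```

Rejecting unknown parameters by default, instead of blocking a list of known
bad ones, keeps the filter safe when new request options appear.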

Adding the highlighting parameters to Solr.php should be 
straightforward. Just pass the additional highlighting parameters to 
_call() as third argument in the $params array.
You may want to change getRecord() and search() to add these parameters 
to the options/params array before calling _call(). Maybe you even need 
to add an additional parameter to search() and getRecord().

> In Solr.php I cannot find the code stripping off additional fields from
> the result sets (the only output seems to be the contents of the
> fullrecord-field)

That's done by passing the Solr result through an XSLT sheet in
_process() (which is called by _call()).
Add the relevant parts of the Solr response to the XSLT sheet. Finally,
you will need to change the display controller in
web/services/Record/Record.php, which so far only parses the full MARC
record stored in the Solr field fullrecord (that is for full title views;
short title views should be straightforward in the Smarty template).
I compiled some ideas on how to make record display in VuFind
MARC-independent at http://www.vufind.org/wiki/other_than_marc. Maybe you
want to implement that? :-) No, seriously: maybe you have some comments
on that. I'd really like to implement that in August, if nobody else is
already working on it (or has done it already).

Till

-- 
http://twitter.com/tillk

Re: [VuFind-General] Any best practices integrating catalogue enrichment with teasers?

From: Andreas K. <And...@gm...> - 2009-07-30 13:03:12

Attachments: signature.asc Solr.php solr-convert.xsl view.tpl function.showTeaser.php modifier.highlight.php

Till, thank you very, very much for your mail. With that help I managed
to implement the highlighted snippets.
With your knowledge, this was really rather straightforward. Anyway, I
will send my code and some comments on how the implementation works, in
case anyone is interested. At the end I have some personal comments
about the other_than_marc page and the concept of indexing arbitrary data.

Attached you will find 5 files (I think the differences are best seen
with a diff tool):
- Solr.php got some new parameters to activate and configure highlighting
- solr-convert.xsl got a new rule just copying the highlighting-element
to the output
- /web/Record/view.tpl : here you can find the call for my new Smarty-Plugin
- function.showTeaser.php: The plugin itself.
Last but not least I modified styles.css:
span.highlighting{font-weight:bold;}

Besides that, I added (admittedly without much understanding) one line
of code to modifier.highlight.php to make it split multi-term queries
and highlight those in the metadata, too.
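
The idea of splitting a multi-term query and highlighting each term can be
sketched like this. Note that this is a Python illustration and not the
line added to modifier.highlight.php; the function name is hypothetical,
and the span class is the one from the styles.css rule above.

```python
import html
import re


def highlight_terms(text, query):
    """Wrap every occurrence of each query term in a highlighting span.

    The query is split on whitespace; matching is case-insensitive.
    The text is HTML-escaped piece by piece, so the markup stays valid.
    """
    terms = query.split()
    if not terms:
        return html.escape(text)
    # Longest terms first, so "beefsteak" wins over "beef".
    alternatives = "|".join(
        re.escape(term) for term in sorted(terms, key=len, reverse=True)
    )
    regex = re.compile(f"({alternatives})", re.IGNORECASE)
    # With one capturing group, split() alternates plain text and matches.
    pieces = regex.split(text)
    out = []
    for index, piece in enumerate(pieces):
        escaped = html.escape(piece)
        if index % 2 == 1:
            out.append(f'<span class="highlighting">{escaped}</span>')
        else:
            out.append(escaped)
    return "".join(out)


print(highlight_terms("Argentine beef & wine", "beef wine"))
```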

Back to function.showTeaser.php:
The implementation is meant for displaying teasers in the record view. If
you intend to display highlighting in hit lists, some work probably still
needs to be done to identify the snippets and attach them to the correct
hits.

First, I instantiate $solr = new Solr("http://localhost:8090") -
sorry, the URL is hardcoded for now. 'global $solr' as shown in
www.vufind.org/wiki/building_a_plugin did not work for me; the global
does not seem to be found inside the plugin. On the building_a_plugin
page, it would have been very helpful if the calling tag for the function
had also been shown (as a PHP newbie, it takes some time to
understand the syntax without an example).

After that, the user's query is extracted from the lookfor-parameter.
Finally I build up a query limited to my TOC-field cecontents AND
id:<docid> to obtain highlighting and only one result to parse. Output
is returned as raw XML and highlighting is extracted via xpath.
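
The two steps just described (restricting the query to one record and
pulling the snippets out of the raw XML) might look like this in outline.
It is a Python sketch, not the attached PHP plugin: the function names and
the sample response are made up, although the sample follows the layout of
Solr's XML response with its "highlighting" section.

```python
import xml.etree.ElementTree as ET


def teaser_query(user_query, doc_id, field="cecontents"):
    """Limit the user's query to the TOC field of a single record.

    The user query is passed through as typed (an assumption); only the
    id is quoted and escaped.
    """
    escaped_id = doc_id.replace("\\", "\\\\").replace('"', '\\"')
    return f'{field}:({user_query}) AND id:"{escaped_id}"'


def extract_snippets(solr_xml, field="cecontents"):
    """Return the highlighted snippets for the field from raw Solr XML."""
    root = ET.fromstring(solr_xml)
    path = f".//lst[@name='highlighting']/lst/arr[@name='{field}']/str"
    return [node.text or "" for node in root.findall(path)]


# Made-up response for one record with one snippet.
SAMPLE = """<response>
  <result name="response" numFound="1" start="0">
    <doc><str name="id">12345</str></doc>
  </result>
  <lst name="highlighting">
    <lst name="12345">
      <arr name="cecontents">
        <str>Chapter 3: &lt;em&gt;Beef&lt;/em&gt; cuts</str>
      </arr>
    </lst>
  </lst>
</response>"""

print(teaser_query("beef cuts", "12345"))
print(extract_snippets(SAMPLE))
```

Because the query matches at most one record, there is only one entry in
the highlighting section to parse.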

If you have any questions about the code, feel free to ask.

Secondly, @Tillk, my ideas about your page in the wiki:
(Admittedly, this is perhaps an outsider's view that does not take into
account things I still need to learn about.)
Why do we think about record formats like MARC or something else when
displaying data from an index? MARC and others are great for quickly
getting data into a basic index, but after that we could be free from
them and use our own conventions according to local user needs. In my
opinion, it would be easier to display all data directly from separate
index fields, without any fullrecord field in any format. E.g. titles
could be indexed in two fields: one for searching, possibly with some
fancy term expansion etc., and the other containing a human-readable
title for display (e.g. title-search, title-view; subject-search,
subject-view ...). With that there is no need for any special classes
reading special formats, and adding fields for fulltext or catalogue
enrichment is even a bit more straightforward than now.
I think search servers like Solr are also a great tool for integrating
several data sources in one search interface. For importing records, I
need special classes and mappings. But after that there is only my
single index definition and field set to handle, not MARC and other
formats like DC, METS etc. I see the index as a custom abstraction layer
that makes any data format searchable and viewable.

Thank you for your valuable help - a good community is sometimes worth
much more than expensive commercial support.

Andreas



