From: Andreas K. <And...@gm...> - 2009-07-28 20:05:46
Hello everyone,

does anyone have experience integrating scanned TOCs into a VuFind search with teaser display? I would be interested to know whether any best practices exist. In particular, I would be very glad to hear whether there are recommendations on how to display teasers in VuFind's result list and on the record display screen.

A short summary of my experiments so far:

I received the scanned TOCs as PDFs and extracted plain text from them. To index the texts with their associated metadata documents, I use the filenames in the URLs (MARC field 856a). I extended VuFindIndexer.java in MarcImporter.jar and added a field 'cecontents' to biblio's schema.xml, which gets filled with the plain text in a custom method. Indexing and searching work fine, and I adjusted Solr.php to search the additional field if a query is sent via Basic Search.

Finally, I tried to display teasers, which worked fine using Solr's admin interface. Today, my first attempt in VuFind was to write a Smarty plugin using VuFind's Solr.php class. Unfortunately, this class only writes MARC fields into $result, but I need the highlight option and my custom cecontents field (using plain XML output didn't help). Tomorrow I plan to try an implementation using SolrJS, written into the output HTML via another Smarty plugin.

Is that a good idea? Do you have other experiences or ideas?

Andreas

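The filename matching described in this message can be sketched roughly as follows. This is an illustration in Python, not the Java code from VuFindIndexer.java: the directory of extracted text files, the ".txt" naming convention, and the function names toc_text_path() and load_cecontents() are all assumptions made for the example.

```python
from pathlib import Path
from typing import Optional
from urllib.parse import urlparse


def toc_text_path(url: str, text_dir: Path) -> Path:
    """Map a TOC link such as .../toc/12345.pdf to text_dir/12345.txt.

    Assumption: the extracted plain text is stored under the same base
    name as the PDF referenced in the MARC record.
    """
    pdf_name = Path(urlparse(url).path).name
    return text_dir / (Path(pdf_name).stem + ".txt")


def load_cecontents(url: str, text_dir: Path) -> Optional[str]:
    """Return the text for the 'cecontents' field, or None if no file exists."""
    path = toc_text_path(url, text_dir)
    if not path.is_file():
        return None
    return path.read_text(encoding="utf-8")
```

Returning None for a missing file lets the importer skip the field for records that have a link but no extracted text yet.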
From: Andrew N. <as...@gm...> - 2009-07-28 21:14:46
Andreas - the best approach for this would be to use the Solr highlighting feature, which you can configure to pull a "snippet" from a specified field.

Let me know if you have any more specific questions about doing this.

Andrew

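For reference, a request that asks Solr for snippets from one field might look roughly like the sketch below (Python, for illustration only). The hl, hl.fl, hl.snippets, hl.fragsize, hl.simple.pre and hl.simple.post parameters are standard Solr highlighting parameters; the base URL, the default values, the CSS class, and the function name highlight_query_url() are assumptions, and the user query is passed through without the escaping a real implementation would need.

```python
from urllib.parse import urlencode


def highlight_query_url(solr_base: str, user_query: str,
                        field: str = "cecontents",
                        snippets: int = 2, fragsize: int = 150) -> str:
    """Build a /select URL that searches one field and requests snippets from it."""
    params = {
        "q": f"{field}:({user_query})",
        "hl": "true",                  # switch highlighting on
        "hl.fl": field,                # field to pull snippets from
        "hl.snippets": snippets,       # maximum snippets per document
        "hl.fragsize": fragsize,       # approximate snippet length in characters
        "hl.simple.pre": '<span class="highlighting">',
        "hl.simple.post": "</span>",
    }
    return f"{solr_base}/select?{urlencode(params)}"


# Hypothetical base URL; adjust to the local Solr core.
print(highlight_query_url("http://localhost:8090/solr/biblio", "cattle breeding"))
```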
From: Andreas K. <And...@gm...> - 2009-07-29 05:47:40
Hello Andrew,

thanks for your message. Using Solr's highlighting is exactly what I am planning to do. My main problem at the moment is displaying the highlighting in VuFind, and I am not sure whether I should use Solr.php or an implementation of SolrJS.

In Solr.php I cannot find the code that strips additional fields from the result sets (the only output seems to be the contents of the fullrecord field), and there seems to be no option to activate the highlighting feature in that class. (I am using 1.0RC 1.)

Andreas

From: Demian K. <dem...@vi...> - 2009-07-30 13:21:19
> Secondly, @Tillk, my ideas about your page in the wiki:
> (Admittedly, this is perhaps somehow an outsider's view not taking into
> account things I still need to learn about. ) Why do we think about
> record-formats like MARC or something else when displaying data from an
> index? MARC and others are great to quickly get data into a basic
> index, but after that we could be free from that and use our own
> conventions according to local user needs.

I think there are two main reasons for retaining the original record format and using it for display:

1.) There are some fields that you want to display to the user that serve no indexing purpose. I'm not a Solr expert, but I assume it takes less overhead to store a single "full record" field than to break it into every conceivable piece in separate indexes.

2.) Storing the full record makes it really easy to display the full record "Staff View" tab, which is a really useful debugging tool when records act unexpectedly.

That being said, I agree that we should rely on specific record formats as little as possible. If we can get the data intact from the Solr index, we might as well use the index version.

I've been doing some brainstorming with Till about this subject (keeping it off the list until my thoughts are a little more organized), but I'll be writing up a more formal proposal in the next day or two. Hopefully this will help us get the best of both worlds.

- Demian

From: Till K. <kin...@gb...> - 2009-07-30 13:55:55
Demian Katz wrote:
> I think there are two main reasons

I agree with those reasons. A third is: there may be formats that really don't match the "default semantics" of the Solr index fields. But those semantics will govern record display (e.g. you'll put a label "Author", or maybe the more general "Person", in front of the contents of the author_fullStr index field). What do you do if a record format has no authors but, for example, singers or butchers (to catalog your collection of Argentine beef)? You may want to push the butchers into the "author" index anyway to make them findable, but define an individual view for them, so as not to pretend those artists are writers...

That's how I understood the outcome of the discussion on this list some weeks ago about the handling of different formats:
http://www.nabble.com/other-data-sources-for-the-index-to23867424.html

Till (should do some BBQ tonight? :-)

--
Till Kinstler
Verbundzentrale des Gemeinsamen Bibliotheksverbundes (VZG)
Platz der Göttinger Sieben 1, D 37073 Göttingen
kin...@gb..., +49 (0) 551 39-13431, http://www.gbv.de

From: Till K. <kin...@gm...> - 2009-07-29 07:29:34
Andreas Kahl wrote:
> thanks for your message. Using Solr's highlighting is exactly what I am
> planning to do. My main problem at the moment is to display the
> highlighting in vufind. And I am not sure if I should use Solr.php or an
> implementation of SolrJS.

As far as I understand, using SolrJS requires opening access to your Solr server's HTTP interface to everyone (which allows updating and deleting records, adding indexes/cores...). So I would strongly recommend some kind of filter between users and the Solr server that allows only uncritical calls to be made.

Adding the highlighting parameters to Solr.php should be straightforward. Just pass the additional highlighting parameters to _call() as the third argument, in the $params array. You may want to change getRecord() and search() to add these parameters to the options/params array before calling _call(). Maybe you even need to add an additional parameter to search() and getRecord().

> In Solr.php I cannot find the code stripping off additional fields from
> the result sets (the only output seems to be the contents of the
> fullrecord-field)

That's done by passing the Solr result through an XSLT sheet in _process() (which is called by _call()). Add the relevant parts of the Solr response to the XSLT sheet. Finally, you will need to change the display controller in web/services/Record/Record.php, which so far only parses the full MARC record stored in the Solr field fullrecord. (That is for full title views; short title views should be straightforward in the Smarty template.)

I compiled some ideas on how to make record display in VuFind independent of MARC at http://www.vufind.org/wiki/other_than_marc. Maybe you want to implement that? :-) No, seriously: maybe you have some comments on it. I'd really like to implement that in August, if nobody else is already working on it (or has done it already).

Till
--
http://twitter.com/tillk

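The "filter between users and the Solr server" mentioned above could, in its simplest form, be an allow-list check like the following Python sketch. It is only an illustration of the idea: the function name is_safe_solr_request() is made up, the list of blocked parameters is an assumption and certainly not complete, and a production setup would more likely enforce this in a reverse proxy or servlet filter.

```python
from urllib.parse import parse_qs, urlparse

# Parameters that can redirect a request to another handler or make Solr
# fetch remote content. Illustrative only; review against your Solr version.
BLOCKED_PARAMS = {"qt", "shards", "stream.body", "stream.file", "stream.url"}


def is_safe_solr_request(method: str, url: str) -> bool:
    """Allow only read-only GET requests to a core's /select handler."""
    if method.upper() != "GET":
        return False
    parsed = urlparse(url)
    if not parsed.path.endswith("/select"):
        return False  # rejects /update, /admin/cores, and everything else
    params = parse_qs(parsed.query, keep_blank_values=True)
    return BLOCKED_PARAMS.isdisjoint(params)
```

An allow-list (permit /select, reject the rest) is used here rather than a block-list of dangerous paths, so that handlers added later are closed by default.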
From: Andreas K. <And...@gm...> - 2009-07-30 13:03:12
Till,

thank you very, very much for your mail. With that help I managed to implement the highlighted snippets; with your knowledge, this was really rather straightforward. I am sending my code and some comments on how the implementation works, in case anyone is interested. At the end I have some personal comments about the other_than_marc page and the concept of indexing arbitrary data.

Attached you will find 5 files (I think the differences are best seen with a diff tool):
- Solr.php got some new parameters to activate and configure highlighting
- solr-convert.xsl got a new rule that just copies the highlighting element to the output
- /web/Record/view.tpl: here you can find the call to my new Smarty plugin
- function.showTeaser.php: the plugin itself

Last but not least, I modified styles.css: span.highlighting{font-weight:bold;}

Besides that, I added (admittedly without much understanding) one line of code to modifier.highlight.php to make it split multi-term queries and highlight those in the metadata, too.

Back to function.showTeaser.php: the implementation is designed for displaying teasers in the record view. If you intend to display highlighting in hit lists, some work will probably be needed to identify the snippets and attach them to the correct hits.

First, I instantiate $solr = new Solr("http://localhost:8090") (sorry, the URL is hardcoded for now). 'global $solr' as shown in www.vufind.org/wiki/building_a_plugin did not work for me; the global does not seem to be found inside the plugin. On the building_a_plugin page, it would have been very helpful if the calling tag for the function had also been shown (as a PHP newbie, it takes some time to understand the syntax without an example).

After that, the user's query is extracted from the lookfor parameter. Finally, I build a query limited to my TOC field cecontents AND id:<docid>, to obtain highlighting and only one result to parse. The output is returned as raw XML, and the highlighting is extracted via XPath.

If you have any questions about the code, feel free to ask.

Secondly, @Tillk, my ideas about your page in the wiki. (Admittedly, this is perhaps something of an outsider's view that does not take into account things I still need to learn about.) Why do we think about record formats like MARC or something else when displaying data from an index? MARC and others are great for quickly getting data into a basic index, but after that we could be free of them and use our own conventions according to local user needs.

In my opinion, it would be easier to display all data directly from separate index fields, without any fullrecord field in any format. For example, titles could be indexed in two fields: one for searching, possibly with some fancy term expansion etc., and the other containing a human-readable title for display (e.g. title-search, title-view; subject-search, subject-view ...). With that, there is no need for any special classes reading special formats, and adding fields for full text or catalogue enrichment is even a bit more straightforward than now.

I think search servers like Solr are also a great tool for integrating several data sources in one search interface. For importing records, I need special classes and mappings. But after that there is only my single index definition and field set to handle, not MARC and other formats like DC, METS etc. I see the index as a custom abstraction layer that makes any data format searchable and viewable.

Thank you for your valuable help. A good community is sometimes worth much more than expensive commercial support.

Andreas

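The last two steps described above (one query restricted to cecontents and a single id, then pulling the snippets out of the raw XML) can be sketched like this. The sketch is in Python rather than PHP and is not the attached function.showTeaser.php; the function names, the sample document id, and the sample response are invented for illustration, and the XML layout follows Solr's standard XML response format, where highlighting markup arrives escaped inside the <str> elements.

```python
import xml.etree.ElementTree as ET

# Hand-written sample of a Solr XML response for one document (id 12345).
SAMPLE_RESPONSE = """<response>
  <lst name="responseHeader"><int name="status">0</int></lst>
  <result name="response" numFound="1" start="0"/>
  <lst name="highlighting">
    <lst name="12345">
      <arr name="cecontents">
        <str>Chapter 1: &lt;em&gt;Cattle&lt;/em&gt; breeding</str>
      </arr>
    </lst>
  </lst>
</response>"""


def teaser_query(user_query: str, doc_id: str, field: str = "cecontents") -> str:
    """Restrict the user's query to the TOC field and to a single record."""
    return f'{field}:({user_query}) AND id:"{doc_id}"'


def extract_snippets(solr_xml: str, doc_id: str, field: str = "cecontents") -> list:
    """Return the highlighted snippets for one document from a Solr XML response."""
    root = ET.fromstring(solr_xml)
    path = (f"./lst[@name='highlighting']/lst[@name='{doc_id}']"
            f"/arr[@name='{field}']/str")
    return [node.text or "" for node in root.findall(path)]


print(teaser_query("cattle", "12345"))
print(extract_snippets(SAMPLE_RESPONSE, "12345"))
```

Because the highlighting section is keyed by document id, the same lookup could be repeated per hit to attach snippets to the right entries in a result list.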