This is where the MARC record format starts to drive people crazy. Since so much of the record is human-readable free text rather than specifically machine-encoded data, you run into a lot of these kinds of inconsistencies. Unfortunately, there is no simple solution. In general, the best answer is to fix your cataloging to be consistent... but that often isn't practical. If you can't just update the records to match, the next option is to use BeanShell scripts to extend the indexer with special cases. See:
http://code.google.com/p/solrmarc/wiki/WritingCustomScripts
So for your first example, you could write a script that extracts the contents of the appropriate field, checks to see if it begins with "Rowohlt", returns "Rowohlt" if it does, and otherwise returns the full field. You might also be able to more elegantly solve this problem through use of regular expressions:
http://code.google.com/p/solrmarc/wiki/IndexProperties#Defining_a_Pattern-Based_Translation_Map
Obviously, the real problem here is that your indexing rules can get very complicated if you need to do this for a lot of different cases. On the other hand, if it's only one or two publishers that are problematic, it shouldn't be too bad.
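To make the publisher idea more concrete, here is a rough sketch of the logic only, written in Python rather than BeanShell. A real SolrMarc script would be BeanShell (Java-like syntax) and would pull the publisher out of the MARC record itself; the function name and the list of canonical names below are made up for illustration and are not part of VuFind or SolrMarc.

```python
# Hypothetical sketch: collapse publisher variants into one facet value.
# CANONICAL_PUBLISHERS is an illustrative list that you would maintain
# yourself; nothing here is a VuFind or SolrMarc API.
CANONICAL_PUBLISHERS = ["Rowohlt"]


def normalize_publisher(raw):
    """Return the canonical name if the value begins with one, else the value unchanged."""
    value = raw.strip()
    for name in CANONICAL_PUBLISHERS:
        if value.startswith(name):
            return name
    return value


for raw in ["Rowohlt", "Rowohlt-Taschenbuch-Verlag", "Rowohlt Berlin", "Ullstein"]:
    print(raw, "->", normalize_publisher(raw))
```

Keeping the special cases in one list means that adding another problematic publisher is a one-line change. Note that a plain prefix check would also catch an unrelated publisher whose name happens to start with the same letters.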
For the multiple editions problem, you could write a BeanShell script that extracts the edition field and then explodes it on semicolons to return multiple values... but if semicolons are used inconsistently between records, this could yield weird results.
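Again as a sketch of the logic only (Python, not the BeanShell you would actually need), splitting the edition statement on semicolons might look like the following. The function name is hypothetical, and the sample values are the three edition strings from your message.

```python
from collections import Counter


def split_editions(raw):
    """Split an edition statement on semicolons and drop empty pieces."""
    parts = [part.strip() for part in raw.split(";")]
    return [part for part in parts if part]


# The three edition values from the example catalogue.
records = [
    "1.st edition",
    "1.st edition ; 56. - 70. Tsd.",
    "56. - 70. Tsd.",
]

# Count each split value the way a multi-valued facet would.
counts = Counter(value for raw in records for value in split_editions(raw))
for value, count in counts.items():
    print(value, count)
```

With those three records, each of the two values is counted twice, which matches the display you asked for. It only works as long as the semicolon really is used as a separator in every record.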
I hope this is helpful -- let me know if you need more details.
- Demian
________________________________________
From: Stephanie Funk [stephaniefunk@...]
Sent: Wednesday, May 11, 2011 5:22 AM
To: vufind-general@...
Subject: Re: [VuFind-General] Content of facets with multiple words divided
Thanks a lot for the help, now it's nearly perfect!
But I still have one question. Is it possible to control the vocabulary of the
publisher facet?
For example I have one publisher called "Rowohlt". Rowohlt includes "Rowohlt
Berlin" and "Rowohlt-Taschenbuch-Verlag". After searching in VuFind it's
displayed like this:
Publisher
Ullstein (8)
Rowohlt (4)
Rowohlt-Taschenbuch-Verlag (2)
Rowohlt Berlin (1)
Is it possible to combine it like this?
Publisher
Ullstein (8)
Rowohlt (7)
I also have a related problem with the facet "Edition", because my data
contain more information like 1.st edition, 2.nd edition etc. Here is an
example from my catalogue:
Edition
1.st edition (1)
1.st edition ; 56. - 70. Tsd. (1)
56. - 70. Tsd. (1)
These are three different books, but it would be better if they were displayed
like this, so that the book from the second line is counted under both values:
Edition
1.st edition (2)
56. - 70. Tsd. (2)
Thanks again!
-Stephanie
Demian Katz wrote:
>
> Sorry for misspeaking about the string vs. textFacet field type for
> publisherStr. Both "string" and "textFacet" fields will work correctly as
> facet fields. The only difference is that "textFacet" fields strip off
> trailing punctuation (useful for removing periods from the ends of author
> names, for example) while "string" leaves the text completely unprocessed.
> I don't think it makes much difference which type you use for
> publisherStr, though depending on your data one option may offer a small
> advantage over the other.
>
> VuFind's topic indexing is a little confusing -- the way it works by
> default is that all topic terms are indexed into the "topic" field for
> keyword searching... but for faceting purposes, we split up different
> subject subfields into the topic_facet, genre_facet, era_facet and
> geographic_facet fields for more flexible faceting. This way, a subject
> heading like "History -- United States" gets faceted as both "History" and
> "United States" which is more flexible than "History United States". It
> may be that the best solution for your problem is to simply move the
> "topic_facet" setting from [ResultsTop] to [Results]. However, if you
> really need the full, non-fragmented subject heading as a facet, the
> solution you describe sounds like it should work (with one minor
> exception: rather than changing the existing <copyField> for topic, I
> would advise adding a second <copyField> -- you are allowed to have more
> than one for the same field). If you need it and it still isn't working,
> perhaps there is just a typo in your schema file. Feel free to send me a
> copy off-list and I'll take a look.
>
> - Demian
>
>> -----Original Message-----
>> From: Stephanie Funk [mailto:stephaniefunk@...]
>> Sent: Monday, May 09, 2011 10:55 AM
>> To: vufind-general@...
>> Subject: Re: [VuFind-General] Content of facets with multiple words
>> divided
>>
>>
>> No, I did not add custom fields to the Solr schema. Thanks for your
>> help, the first problem is solved.
>> I wrote publisherStr in the file facets.ini and now it works!
>>
>> One question regarding the file schema.xml. The publisherStr field was
>> defined as "string". Because you wrote that the field is defined as
>> "textFacet", I changed it from "string" to "textFacet" (it still works).
>> Was this right, or do I have to copy the line and change it like this?
>>
>> <field name="publisherStr" type="string" indexed="true" stored="false" multiValued="true"/>
>> <field name="publisherStr" type="textFacet" indexed="true" stored="false" multiValued="true"/>
>>
>> I also tried to apply this to my subject problem. In my facets.ini file
>> I changed the facet "topic = subject" into "topicStr = subject", but it
>> did not work. Then I looked at schema.xml and noticed that there is no
>> "topicStr" line, so I added it:
>>
>> <field name="topic" type="text" indexed="true" stored="true" multiValued="true"/>
>> <field name="topicStr" type="string" indexed="true" stored="true" multiValued="true"/>
>> <field name="topic_unstemmed" type="textProper" indexed="true" stored="false" multiValued="true"/>
>> <field name="topic_facet" type="textFacet" indexed="true" stored="true" multiValued="true"/>
>> <field name="topic_browse" type="string" indexed="true" stored="false" multiValued="true"/>
>>
>> I also changed the line <copyField source="topic" dest="topic_unstemmed"/>
>> into <copyField source="topic" dest="topicStr"/>
>>
>> It still does not work. Please help again.
>> Thanks a lot!
>> -Stephanie
>>
>>
>>
>> Demian Katz wrote:
>> >
>> > Did you add custom fields to the Solr schema
>> > (solr/biblio/conf/schema.xml)? Depending on the field type used in
>> > the schema, Solr processes different fields differently. What appears
>> > to be happening here is that your facet fields have been processed for
>> > full-text searching, which causes them to behave incorrectly. For the
>> > publisher, there is already a field that you can use for proper
>> > faceting -- publisherStr instead of publisher. If you look in the
>> > schema, you will see that the publisher field is defined as
>> > "textProper" (a type used for full text searching with no stemming)
>> > while publisherStr is defined as "textFacet" (a type used for
>> > faceting). There is a copyField directive near the bottom of the file
>> > which ensures that every value processed in the publisher field is
>> > also duplicated in the publisherStr field with different processing --
>> > this lets you index a string once but then have access to it for both
>> > searching and faceting purposes. You may need to mimic this
>> > configuration if you are using a local custom subject field.
>> >
>> > I hope this helps -- please let me know if you still have questions!
>> >
>> > - Demian
>> >
>> >> -----Original Message-----
>> >> From: Stephanie Funk [mailto:stephaniefunk@...]
>> >> Sent: Monday, May 09, 2011 7:07 AM
>> >> To: vufind-general@...
>> >> Subject: [VuFind-General] Content of facets with multiple words divided
>> >>
>> >>
>> >> The content of my facets is not displayed correctly. I don't know how
>> >> to adjust it in the right way.
>> >>
>> >> For example my facet "Subject". One subject is "Belletristische
>> >> Darstellung". After searching it is displayed in this way (everything
>> >> uncapitalized):
>> >>
>> >> belletristisch (16)
>> >> darstellung (16)
>> >>
>> >> Also the facet "Publisher". The publishers are for example
>> >> "Rowohlt-Taschenbuch-Verlag" and "Ullstein Verlag".
>> >> It is displayed like this:
>> >>
>> >> rowohlt (8)
>> >> ullstein (8)
>> >> rowohlttaschenbuchverlag (1)
>> >> taschenbuch (1)
>> >> verlag (1)
>> >>
>> >> What do I have to adjust so that VuFind does not divide the words and
>> >> shows them as they are?
>> >>
>> >> Many thanks for help!
>> >> -Stephanie
>> >> --
>> >> View this message in context: http://old.nabble.com/Content-of-facets-with-multiple-words-divided-tp31575814p31575814.html
>> >> Sent from the vufind-general mailing list archive at Nabble.com.
>> >>
>> >>
>> >> _______________________________________________
>> >> VuFind-General mailing list
>> >> VuFind-General@...
>> >> https://lists.sourceforge.net/lists/listinfo/vufind-general
>> >
>>
>