Naomi - this seems quite valuable.  Why the need for two ISBN fields and two ISSN fields?  Couldn't the isbn/issn fields be stripped down to just the number codes, removing any extraneous information, such as whether the book is a paperback?

 

Since the ISBN and ISSN numbers don't get displayed to the end user in the search results, I don't see the reason to have the "unmassaged" field in the index.

 

And yes - we'd be interested in including your code in the solrmarc project.

 

Andrew

 

From: vufind-tech-bounces@lists.sourceforge.net [mailto:vufind-tech-bounces@lists.sourceforge.net] On Behalf Of Naomi Dushay
Sent: Monday, July 28, 2008 12:22 PM
To: vufind-tech@lists.sourceforge.net
Subject: [VuFind-Tech] solrmarc - tweaks to standard number indexing and extraction

 

Folks,

 

Using code from the solrmarc project, I've done some (test-driven!) local coding for standard numbers: ISBN, ISSN, OCLC and LCCN.  I thought I would share the information, FWIW; I am happy to share the code as well if folks are interested.  I also have the algorithms and a lot of relevant additional information in a Stanford-only wiki, but I can presumably get a PDF version or something that I could pass around (or possibly cut and paste the wiki text into another wiki somewhere).

 

I am of the belief that the indexing should take care of the massaging of data as necessary, not the UI code.  So stripping following text, prefixes and the like is done in my indexing code.

 

For ISBN and ISSN, our cataloging expert pointed out that we want to be as *inclusive* as possible for our users: when they are looking in *our* index, we should allow a match in as many cases as possible (maximizing "recall"!).  On the other hand, when we are using these numbers for retrieving external resources (e.g. Google Book Search), we want the numbers that are most likely to get us a correct answer.  These are two different needs, and they require two different fields:

 

              <!-- isbn is for code to do external lookups by ISBN (e.g. Google Book Search) -->
             
<!-- TODO:  change isbn to isbn_store -->
             
<field name="isbn" type="string" indexed="false" stored="true" multiValued="true"/>
             
<!-- isbnUser_search is for end users to search our index via an ISBN -->
             
<field name="isbnUser_search" type="string" indexed="true" stored="false" multiValued="true"/> 
             
<!-- issn is for code to do external lookups by ISSN -->
             
<!-- TODO:  change issn to issn_store -->
             
<field name="issn" type="string" indexed="false" stored="true" multiValued="true"/>
             
<!-- issnUser_search is for end users to search our index via an ISSN -->
             
<field name="issnUser_search" type="string" indexed="true" stored="false" multiValued="true"/> 

 

ISBN

------

a. multiple ISBN in a single marc bib record are allowed.

b. 10 or 13 digit number (last digit may also be "X").

c. Strip any following text.

 

isbnUser_search field (for end users to search our index):

----

1.  all 020 subfields a starting with an ISBN string - strip following text

2.  AND  all 020 subfields z starting with an ISBN string - strip following text

 

isbn (for external lookups)

----

1.  all 020 subfields a starting with an ISBN string - strip following text

2.  if none,  all 020 subfields z starting with an ISBN string - strip following text
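
The two ISBN rule sets above can be sketched roughly as follows. This is a Python illustration only, not the solrmarc code: the function names, the regular expression, and the idea of passing the 020 subfield a and z values as plain lists of strings are all assumptions made for the example. It also assumes the ISBN is stored without hyphens.

```python
import re

# Assumed pattern: the value must START with 10 or 13 characters, all digits
# except that the last may be "X"; whatever follows is dropped.
ISBN_AT_START = re.compile(r"^(\d{12}[\dX]|\d{9}[\dX])(?!\d)")


def leading_isbn(value):
    """Return the ISBN at the start of a subfield value, or None."""
    match = ISBN_AT_START.match(value.strip())
    return match.group(1) if match else None


def _isbns(values):
    """Extract leading ISBNs from subfield values, skipping non-matches."""
    return [isbn for isbn in map(leading_isbn, values) if isbn is not None]


def isbn_user_search(subfields_a, subfields_z):
    """Recall-oriented: every ISBN from subfields a AND subfields z."""
    return _isbns(subfields_a) + _isbns(subfields_z)


def isbn_external(subfields_a, subfields_z):
    """Precision-oriented: subfields a; subfields z only if a yields none."""
    return _isbns(subfields_a) or _isbns(subfields_z)
```

With these hypothetical helpers, a record holding "9780123456789 (hardcover)" in subfield a and "012345678X" in subfield z would contribute both numbers to the user search field, but only the subfield a number to the external lookup field.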

 

ISSN

-----

a. multiple ISSN in a single marc bib record are allowed.

b. 4 digit number followed by hyphen followed by 4 digit number (last digit may also be "X").

 

issnUser_search field (for end users to search our index):

   I was able to implement this using a pattern map in our vufind.properties file.

----

1.  all 022 subfields a with ISSN

2.  AND  all 022 subfields "l" (letter "L") with ISSN

3.  AND  all 022 subfields m with ISSN

4.  AND  all 022 subfields y with ISSN

5.  AND  all 022 subfields z with ISSN

 

issn (for external lookups)

----

1.  all 022 subfields a with ISSN

2.  if none,  all 022 subfields z with ISSN
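
As a rough illustration of the two ISSN rule sets (again a Python sketch, not the pattern map from our properties file): the function names, the dict-of-lists representation of the ISSN field's subfields, and the choice to look for the ISSN anywhere in the value rather than only at its start are assumptions made for the example.

```python
import re

# Assumed pattern: four digits, a hyphen, three digits, then a digit or "X",
# not embedded in a longer run of digits.
ISSN = re.compile(r"(?<!\d)\d{4}-\d{3}[\dX](?!\d)")


def issns(values):
    """Return the first ISSN found in each value, skipping values without one."""
    found = []
    for value in values:
        match = ISSN.search(value)
        if match:
            found.append(match.group(0))
    return found


def issn_user_search(subfields):
    """Recall-oriented: ISSNs from subfields a, l, m, y and z.

    `subfields` maps a subfield code to the list of its values.
    """
    result = []
    for code in ("a", "l", "m", "y", "z"):
        result.extend(issns(subfields.get(code, [])))
    return result


def issn_external(subfields):
    """Precision-oriented: subfields a; subfields z only if a yields none."""
    return issns(subfields.get("a", [])) or issns(subfields.get("z", []))
```

The external-lookup function deliberately ignores subfields l, m and y, mirroring the shorter rule list above.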

 

 

OCLC and LCCN are not exposed to end users, so we want to use the data that is most likely to get us correct retrieval ("precision"!) in external resources, such as OCLC WorldCat or Google Book Search.  Moreover, since this data does not need to be searched in our catalog by our users, it is not imperative to index these fields, though we must store them.  Choosing to index these fields would enable staff searches on these numbers, if that is desired.

 

solr/conf/schema.xml:

              <!-- lccn number for code to do external lookups -->
             
<field name="lccn_store" type="string" indexed="false" stored="true"/>
             
<!-- oclc number for google book search links and for oclc worldcat links -->
             
<field name="oclc_store" type="string" indexed="false" stored="true" multiValued="true"/>

 

OCLC:

------

a. multiple OCLC numbers in a single marc bib record are allowed.

 

1.  all 035 subfields a with *our local prefix* "(OCoLC-M)"

2.  if none, all 079 subfields a prefixed "ocm" or "ocn"

3.  if none of the above, all 035 subfields a prefixed "(OCoLC)"
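
The OCLC fallback order above could be sketched like this in Python. The function names and the list-of-strings inputs are hypothetical, and removing the prefix from the stored value is an assumption of the sketch; the rules above only say which values are selected.

```python
def with_prefix(values, prefixes):
    """Return the values starting with one of the prefixes, prefix removed."""
    kept = []
    for value in values:
        for prefix in prefixes:
            if value.startswith(prefix):
                kept.append(value[len(prefix):].strip())
                break
    return kept


def oclc_numbers(subfields_035a, subfields_079a):
    """Apply the three rules in order; the first one that yields values wins."""
    return (
        with_prefix(subfields_035a, ["(OCoLC-M)"])
        or with_prefix(subfields_079a, ["ocm", "ocn"])
        or with_prefix(subfields_035a, ["(OCoLC)"])
    )
```

Because each step is tried only when the previous one returns nothing, a record with a local "(OCoLC-M)" number never falls through to the 079 or plain "(OCoLC)" values.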

 

LCCN :

-------    

a. at most one per marc bib record.

b. Strip following text, but not prefixes.  (Not sure this is correct, but that's what I did.)

c. I was able to implement this using a pattern map in our vufind.properties file.

 

1. 010 subfield a.

2.  if none, 010 subfield z.
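
For illustration, the LCCN handling might look something like the following Python sketch. It is not the pattern map itself: the function names and the regular expression are assumptions, in particular the guess that an LCCN is an optional prefix of up to three letters, optional padding spaces, and digits with at most one hyphen.

```python
import re

# Assumed shape: optional alphabetic prefix, optional padding spaces, then
# digits with at most one hyphen. The prefix is kept; trailing text is dropped.
LCCN_AT_START = re.compile(r"^[A-Za-z]{0,3} *\d+(?:-\d+)?")


def leading_lccn(value):
    """Return the LCCN (prefix included) at the start of a value, or None."""
    match = LCCN_AT_START.match(value.strip())
    return match.group(0) if match else None


def lccn(subfields_a, subfields_z):
    """Return a single LCCN: first from subfield a, else first from subfield z."""
    for values in (subfields_a, subfields_z):
        for value in values:
            found = leading_lccn(value)
            if found is not None:
                return found
    return None
```

Under these assumptions a value such as "   85153773 //r85" would be stored as "85153773", while "n  79021425 /AC" would keep its "n" prefix and internal padding.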

 

 

Naomi Dushay

ndushay@stanford.edu