Re: [VuFind-Tech] MarcImporter indexing update

I have a bunch of thoughts to share about the current indexing setup.

First of all:  THANK YOU for getting this code out there for us to use!

I think some of the issues raised are easily addressed with  
configuration files and documentation.  For example, default AND or OR  
is right there in the schema.xml file.  All that's needed is a little  
documentation to show how easy it is for institutions to select which  
they prefer.   I happen to agree with Jim that users expect AND ...  
but if the documentation told me VuFind's default was OR and how to  
change it, that's sufficient.  Alan's suggestions could also be  
approached as a documentation issue.  "Here is what one institution  
desired for MARC mappings to the SOLR fields/facets, and here is what  
they did to make it work."   Such a document would make it clear how  
other sites can configure VuFind to best suit their institution's  
desires.
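
To make the AND/OR difference concrete, here is a toy sketch in Python. It is not VuFind or Solr code; the records, the query and the function name `matches` are all made up for illustration.

```python
def matches(doc_terms, query_terms, default_operator="OR"):
    """Return True if a record satisfies the query under the given default operator."""
    doc = set(doc_terms)
    if default_operator == "AND":
        # AND: every query term must appear in the record
        return all(term in doc for term in query_terms)
    # OR: any single query term is enough
    return any(term in doc for term in query_terms)


# Hypothetical records, already split into lowercase terms
docs = {
    "rec1": ["history", "of", "jazz"],
    "rec2": ["jazz", "piano"],
    "rec3": ["history", "of", "rome"],
}
query = ["jazz", "history"]

or_hits = sorted(d for d, terms in docs.items() if matches(terms, query, "OR"))
and_hits = sorted(d for d, terms in docs.items() if matches(terms, query, "AND"))
print(or_hits)
print(and_hits)
```

With OR, a two-word query pulls in every record that mentions either word, so adding words widens the result set; with AND, adding words narrows it, which is what I believe most users expect.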

What follows are some of my other notes pertaining mostly to analysis  
- and they are just my $.02.  Again, I think it would be okay to leave  
some of these things alone if there is some documentation on how  
institutions can change the settings.  Some of the below might point  
to appropriate changes to the defaults.  My notes are based on poking  
around an index of some Stanford data out of Unicorn without changing  
either of the SOLR config files (schema.xml and solrconfig.xml).

Whitespace:  if you look at the underlying Lucene index with Luke (a  
tool available from the Lucene web site), there are field terms that  
are just whitespace.  This has no utility as a search term.

Case sensitivity:  I got different results searching for "book" and  
"Book".  I doubt this is desirable.

Punctuation sensitivity:  I got different results searching for "20th  
century" and "20th century."  I doubt this is desirable.  These also  
show up as separate facets.

Phrase searching:  out of the box, this doesn't appear to be a part of  
VuFind.  Quoting a phrase seems to be a de facto standard, thanks to  
Google.  Lucene supports this ... why not VuFind?

Single characters as terms:  again, using Luke, some of the most  
common terms in our index are "a" "1" and so on.  These single  
character terms aren't useful for searching ... they should probably  
be eliminated either through analysis or the stopwords list.   
(Stopwords:   easy to do, just add documentation)

Stopwords:  our data contains a great deal of foreign-language  
material.  Very common terms in our index, per Luke, are "de" "la"  
and so on.  This is simply a stopwords issue, but documentation might  
be useful to indicate how easy it is to address this.
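
As a rough sketch of what I mean for both the single-character terms and the stopwords (plain Python, not the actual Solr filter chain; the stopword list, the `min_length` parameter and the function name are my own assumptions):

```python
# Hypothetical stopword list mixing English and a few common foreign-language words
STOPWORDS = {"a", "an", "the", "of", "de", "la", "le", "el", "der", "das"}


def filter_terms(terms, stopwords=STOPWORDS, min_length=2):
    """Drop terms shorter than min_length and terms found in the stopword list."""
    return [t for t in terms if len(t) >= min_length and t not in stopwords]


terms = ["histoire", "de", "la", "musique", "a", "1", "paris"]
kept = filter_terms(terms)
print(kept)
```

One caution:  a word that is noise in one language can be meaningful in another, so each institution probably wants to tune the list to its own collection.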

---

The whitespace, case sensitivity and punctuation sensitivity are all  
analysis choices.  I think tokenizing most fields makes a great deal  
of sense.
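
Here is a minimal sketch, in Python, of the kind of analysis I have in mind.  It only illustrates the idea (split on whitespace, strip leading and trailing punctuation, lowercase); it is not what Solr does internally, and the function name `analyze` is my own.

```python
import string


def analyze(text):
    """Split on whitespace, strip surrounding punctuation, and lowercase each token."""
    tokens = []
    for raw in text.split():  # split() with no argument never yields empty tokens
        token = raw.strip(string.punctuation).lower()
        if token:  # skip tokens that were nothing but punctuation
            tokens.append(token)
    return tokens


print(analyze("20th century."))
print(analyze("Book"))
print(analyze("   "))
```

Under this analysis "Book" and "book" produce the same term, "20th century" and "20th century." produce the same two terms, and a whitespace-only value produces no terms at all.  Punctuation inside a token (an apostrophe, say) is left alone here.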

I don't know enough about MARC data off the top of my head to know if  
there are identifiers or other strings that shouldn't be tokenized  
(URIs, ISBNs, etc.).  If there are such values and they have special  
characters or case sensitivity or significant whitespace, things  
become tricky.  URLs aren't so bad because there are normalization  
rules that help -- see the URI spec, or talk to me about this offline.

This issue is especially difficult because in order to be able to  
search unanalyzed fields properly, the special characters and case  
must be left alone in the search query -- the query string MUST be  
analyzed exactly the same as the fields to be searched.  In a single  
search box, you can't analyze some fields and not others unless you  
allow for fielded searching ... which is very advanced for a single  
search box.
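
A toy example of the mismatch, in Python.  This is a sketch under my own assumptions, not VuFind or Solr code:  the field names (`title`, `call_number`), the record, and the helper functions are all hypothetical.

```python
def analyze(text):
    """Tokenized field: split on whitespace and lowercase."""
    return [t.lower() for t in text.split()]


def keyword(text):
    """Unanalyzed field: the whole value is one term, case and spaces intact."""
    return [text]


# Hypothetical per-field analysis choices
FIELD_ANALYZERS = {"title": analyze, "call_number": keyword}

records = {"rec1": {"title": "Moby Dick", "call_number": "PS2384 .M6"}}


def build_index(records):
    """Map (field, term) to the set of record ids containing that term."""
    index = {}
    for rec_id, fields in records.items():
        for field, value in fields.items():
            for term in FIELD_ANALYZERS[field](value):
                index.setdefault((field, term), set()).add(rec_id)
    return index


def search(index, field, query, analyzer):
    """Analyze the query with the given analyzer and require every term to match."""
    hits = [index.get((field, term), set()) for term in analyzer(query)]
    return set.intersection(*hits) if hits else set()


index = build_index(records)
print(search(index, "title", "MOBY dick", analyze))          # query analyzed like the field
print(search(index, "call_number", "PS2384 .M6", keyword))   # query left alone, like the field
print(search(index, "call_number", "PS2384 .M6", analyze))   # mismatch: query tokenized, field not
```

The third search comes back empty even though the user typed the value exactly as it is stored, because the query was tokenized and lowercased while the field was not.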

If you read all the way down to this, congratulations!  You made it.   
Thanks for listening.

- Naomi

On Apr 14, 2008, at 9:24 AM, Andrew Nagy wrote:

> Alan - you are making major fundamental changes to the way  
> vufind works, and we are currently evaluating some fundamental  
> changes ourselves:  we are thinking about changing the search term  
> parsing to tokenize on spaces and search each field for each word.  
> So I can't just implement these changes and be on our merry  
> way.  I think these changes require some serious discussion and  
> evaluation.
>
> For example:
>
> 1. You are changing the default operator from "OR" to "AND" which  
> will have severe effects on vufind.  What is your reasoning for  
> changing this?
>
> 2. You are changing auth_author to a multivalued field, why?  This  
> is a difficult matter since according to marc there is only 1 main  
> author.  With that in mind, this field was purposely defined to be a  
> single value field.
>
> 3. I like how you are adding other content fields into the title  
> field etc. but this should not be hard coded into the marcimporter -  
> this needs to be a part of the mapping file that Wayne is working on.
>
> Andrew
>
>
>> -----Original Message-----
>> From: vuf...@li... [mailto:vufind-tech-
>> bo...@li...] On Behalf Of Alan Rykhus
>> Sent: Friday, April 11, 2008 3:47 PM
>> To: vuf...@li...
>> Subject: [VuFind-Tech] MarcImporter indexing update
>>
>> Hello,
>>
>> I've started working on updating the MarcImporter.java code to
>> implement more in-depth indexing. I've completed the updates for
>> the author and title fields.
>>
>> I realize there are some other changes that need to be done to
>> actually use these changes. The sys/SOLR.php needs to be modified
>> to use the new fields defined in the schema.
>>
>> I plan to continue working on this. Does the group feel that this
>> is a step in the right direction?
>>
>> al
>> --
>> Alan Rykhus
>> PALS, A Program of the Minnesota State Colleges and Universities
>> (507)389-1975
>> ala...@mn...
>