From: Naomi D. <nd...@st...> - 2008-04-14 18:30:53
I have a bunch of thoughts to share about the current indexing setup. First of all: THANK YOU for getting this code out there for us to use!

I think some of the issues raised are easily addressed with configuration files and documentation. For example, default AND or OR is right there in the schema.xml file. All that's needed is a little documentation to show how easy it is for institutions to select which they prefer. I happen to agree with Jim that users expect AND ... but if the documentation told me VuFind's default was OR and how to change it, that would be sufficient.

Alan's suggestions could also be approached as a documentation issue: "Here is what one institution desired for MARC mappings to the SOLR fields/facets, and here is what they did to make it work." Such a document would make it clear how other sites can configure VuFind to best suit their institution's desires.

What follows are some of my other notes, pertaining mostly to analysis - and they are just my $.02. Again, I think it would be okay to leave some of these things alone if there is some documentation on how institutions can change the settings. Some of the notes below might point to appropriate changes to the defaults.

My notes are based on poking around an index of some Stanford data out of Unicorn without changing either of the SOLR config files (schema.xml and solrconfig.xml).

Whitespace: if you look at the underlying Lucene index with Luke (a tool available from the Lucene web site), there are field terms that are just whitespace. These have no utility as search terms.
Case sensitivity: I got different results searching for "book" and "Book". I doubt this is desirable.
Punctuation sensitivity: I got different results searching for "20th century" and "20th century." I doubt this is desirable. These also show up as separate facets.
Phrase searching: out of the box, this doesn't appear to be a part of VuFind. Quoting a phrase seems to be a de facto standard, thanks to Google. Lucene supports this ... why not VuFind?
Single characters as terms: again, using Luke, some of the most common terms in our index are "a" "1" and so on. These single character terms aren't useful for searching ... they should probably be eliminated either through analysis or the stopwords list. (Stopwords: easy to do, just add documentation)
Stopwords: our data contains a great deal of foreign language materials. Very common terms in our index, per Luke, are "de" "la" and so on. This is simply a stopwords issue, but documentation might be useful to indicate how easy it is to address this.

---

The whitespace, case sensitivity and punctuation sensitivity are all analysis choices. I think tokenizing most fields makes a great deal of sense. I don't know enough about MARC data off the top of my head to know if there are identifiers or other strings that shouldn't be tokenized (URIs, ISBNs, etc.). If there are such values and they have special characters or case sensitivity or significant whitespace, things become tricky. URLs aren't so bad because there are normalization rules that help -- see the URI spec, or talk to me about this offline.

This issue is especially difficult because in order to be able to search unanalyzed fields properly, the special characters and case must be left alone in the search query -- the query string MUST be analyzed exactly the same as the fields to be searched. In a single search box, you can't analyze some fields and not others unless you allow for fielded searching ... which is very advanced for a single search box.

If you read all the way down to this, congratulations! You made it. Thanks for listening.

- Naomi

On Apr 14, 2008, at 9:24 AM, Andrew Nagy wrote:

> Alan - since you are making major fundamental changes to the way
> vufind works - and we are currently evaluating making some
> fundamental changes ourselves. We are thinking about changing the
> search term parsing to tokenize on spaces and search each field for
> each word.
> I can't just implement these changes and be on our merry
> way. I think these changes require some serious discussion and
> evaluation.
>
> For example:
>
> 1. You are changing the default operator from "OR" to "AND" which
> will have severe effects on vufind. What is your reasoning for
> changing this?
>
> 2. You are creating auth_author to a multivalued field, why? This
> is a difficult matter since according to marc there is only 1 main
> author. With that - this field was purposely defined to be a single
> value field.
>
> 3. I like how you are adding other content fields into the title
> field etc. but this should not be hard coded into the marcimporter -
> this needs to be a part of the mapping file that Wayne is working on.
>
> Andrew
>
>> -----Original Message-----
>> From: vuf...@li... [mailto:vufind-tech-
>> bo...@li...] On Behalf Of Alan Rykhus
>> Sent: Friday, April 11, 2008 3:47 PM
>> To: vuf...@li...
>> Subject: [VuFind-Tech] MarcImporter indexing update
>>
>> Hello,
>>
>> I've started working on updating the MarcImporter.java code to
>> implement a more in-depth indexing. I've completed the updates for the
>> author and title fields.
>>
>> I realize there are some other changes that need to be done to actually
>> use these changes. The sys/SOLR.php needs to be modified to use the new
>> fields defined in the schema.
>>
>> I plan to continue working on this. Does the group feel that this is a
>> step in the right direction?
>>
>> al
>> --
>> Alan Rykhus
>> PALS, A Program of the Minnesota State Colleges and Universities
>> (507)389-1975
>> ala...@mn...
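P.S. To make the analysis notes above a bit more concrete, here is a minimal sketch in plain Python. It is not Solr's or VuFind's code, and the names (analyze, build_index, search, STOPWORDS) and the tiny stopword list are all made up for illustration. It only shows the general idea: run the same normalization (whitespace, case, punctuation, single characters, stopwords) over both the indexed text and the query, and combine query terms with AND.

```python
import string

# Hypothetical stopword list for illustration only; a real list would be
# chosen per collection (e.g. including common foreign-language terms).
STOPWORDS = {"a", "an", "the", "de", "la"}


def analyze(text, stopwords=STOPWORDS):
    """Turn text into normalized terms.

    Lowercases, strips leading/trailing punctuation, and drops empty,
    single-character, and stopword tokens.
    """
    terms = []
    for raw in text.split():  # split() never yields whitespace-only tokens
        token = raw.strip(string.punctuation).lower()
        if len(token) <= 1:  # drops "" as well as terms like "a" and "1"
            continue
        if token in stopwords:
            continue
        terms.append(token)
    return terms


def build_index(docs):
    """Build a toy inverted index: term -> set of document ids."""
    index = {}
    for doc_id, text in docs.items():
        for term in analyze(text):
            index.setdefault(term, set()).add(doc_id)
    return index


def search(index, query):
    """AND search: the query goes through the same analyze() as the documents."""
    terms = analyze(query)
    if not terms:
        return set()
    results = None
    for term in terms:
        postings = index.get(term, set())
        results = postings if results is None else results & postings
    return results
```

With this kind of setup, "Book" and "book" match the same records, and so do "20th century" and "20th century." because the documents and the query are normalized identically. Fields that must stay unanalyzed (identifiers and the like) would need separate handling, which this sketch does not attempt.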