From: Naomi D. <nd...@st...> - 2008-04-15 00:14:53
Andrew,

Thanks for your response; you are correct that I know Lucene reasonably well, but not Solr. I'll try to contribute to the documentation wiki as I can. A few more comments:

>> Phrase searching: out of the box, this doesn't appear to be a part of
>> VuFind. Quoting a phrase seems a de facto standard, due to Google.
>> Lucene supports this ... why not VuFind?
>
> I think you might be confused - this is a very essential part of
> VuFind. Phrase searching is defined by quotes. Can you elaborate
> as to what is not working the way you are expecting?

If I search our VuFind instance for "french horn" without quotes, I get results. If I search our VuFind instance for "french horn" with the quotes, I get:

An error has occured
Unable to process query
Solr Returned: org.apache.solr.core.SolrException: Query parsing error:
Cannot parse '(titleStr:""french horn""^15 OR (title:("french horn")^5
OR title2:("french horn")^2)^10 OR author:("french horn")^5 OR
format:("french horn") OR publishDate:("french horn") OR
physical:("french horn") OR contents:("french horn") OR
series:("french horn") OR topic:("french horn")^2 OR
geographic:("french horn")^2 OR genre:("french horn")^2 OR
subject:("french horn")^2 )': Lexical error at line 1, column 12.
Encountered: "\"" (34), after : "\""
Please contact the Library Reference Department for assistance nd...@st...

[576] /usr/share/pear/PEAR.php
[444] /usr/local/vufind/vufind-0.8/web/sys/SOLR.php
[430] /usr/local/vufind/vufind-0.8/web/sys/SOLR.php
[335] /usr/local/vufind/vufind-0.8/web/sys/SOLR.php
[488] /usr/local/vufind/vufind-0.8/web/services/Search/Home.php
[238] /usr/local/vufind/vufind-0.8/web/services/Search/Home.php
[50] /usr/local/vufind/vufind-0.8/web/services/Search/Home.php
[104] /usr/local/vufind/vufind-0.8/web/index.php

We installed version 0.8.1 on April 3.

>> Stopwords: our data contains a great deal of foreign language
>> materials. Very common terms in our index, per Luke, are "de", "la",
>> and so on. This is simply a stopwords issue, but documentation might
>> be useful to indicate how easy it is to address this.
>
> This is something that I hope to investigate soon. MARC has a field
> that tells the indexer what to ignore in the data string - such as
> "the", "el", "de", etc.

Yes, I know about MARC non-filing characters ... but that's only at the beginning of the field, right? I think the easier and better solution is for institutions that have multilingual data to adjust their stopwords accordingly. The Snowball project (http://snowball.tartarus.org/) has stopword lists for a number of languages, though I had to find them with a Google "search within site" search. Wikipedia also has a few links to foreign language stopword lists: http://en.wikipedia.org/wiki/Stop_words . I'm sure there are other sources.

>> The whitespace, case sensitivity and punctuation sensitivity are all
>> analysis choices. I think tokenizing most fields makes a great deal
>> of sense.
>
> And most are - however we have non-tokenized fields for faceting.

>> Single characters as terms: again, using Luke, some of the most
>> common terms in our index are "a", "1", and so on. These single
>> character terms aren't useful for searching ... they should probably
>> be eliminated either through analysis or the stopwords list.
>
> Single character fields are very helpful for faceting - that is why
> they exist. Not everything in the index is solely used for searching.

I will learn more about Solr and post any further questions/observations after I've done my homework.

Cheers,
Naomi
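[Editorial note on the stopword and single-character-term points: both can be handled in the analysis chain of the tokenized search fields in Solr's schema.xml, leaving the non-tokenized facet fields untouched. A hedged sketch, not VuFind's shipped configuration; `stopwords.txt` here stands for a locally merged file combining an English list with, e.g., Snowball's French and Spanish lists:]

```xml
<!-- Sketch of a Solr analysis chain for a tokenized search field.
     StopFilterFactory drops terms from the (locally assembled, multilingual)
     stopwords.txt; LengthFilterFactory drops single-character tokens such as
     "a" and "1" at index time. Facet fields, being non-tokenized strings,
     bypass this chain and keep their single-character values. -->
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LengthFilterFactory" min="2" max="100"/>
  </analyzer>
</fieldType>
```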
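[Editorial note on the error above: the `titleStr:""french horn""` fragment in the Solr exception suggests the query builder wraps the user's input in quotes without checking whether the user already typed them, and the doubled quotes then trip the Lucene query parser's lexer. A minimal sketch of the idea, in Python rather than VuFind's PHP; `quote_phrase` and `build_clause` are hypothetical names, not VuFind functions:]

```python
def quote_phrase(term: str) -> str:
    """Wrap a search term in quotes exactly once for a Lucene-style query."""
    term = term.strip()
    # Drop one layer of surrounding quotes if the user already supplied them,
    # so we never emit the doubled quotes ("" ... "") seen in the error above.
    if len(term) >= 2 and term[0] == '"' and term[-1] == '"':
        term = term[1:-1]
    return '"%s"' % term

def build_clause(field: str, term: str, boost: int) -> str:
    """Build one boosted field clause, e.g. titleStr:"french horn"^15."""
    return '%s:%s^%d' % (field, quote_phrase(term), boost)

# With this guard, quoted and unquoted input produce the same valid clause:
# build_clause('titleStr', '"french horn"', 15) == 'titleStr:"french horn"^15'
# build_clause('titleStr', 'french horn', 15)   == 'titleStr:"french horn"^15'
```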