From: Naomi D. <nd...@st...> - 2008-04-15 00:14:53
Andrew,

Thanks for your response; you are correct that I know Lucene reasonably well, but not Solr. I'll try to contribute to the documentation wiki as I can. A few more comments:

>> Phrase searching: out of the box, this doesn't appear to be a part of
>> VuFind. Quoting a phrase seems a de facto standard, due to Google.
>> Lucene supports this ... why not VuFind?
>
> I think you might be confused - this is a very essential part of
> VuFind. Phrase searching is defined by quotes. Can you elaborate
> as to what is not working the way you are expecting?

If I search our VuFind instance for "french horn" without quotes, I get results. If I search our VuFind instance for "french horn" with the quotes, I get:

An error has occured
Unable to process query
Solr Returned: org.apache.solr.core.SolrException: Query parsing error:
Cannot parse '(titleStr:""french horn""^15 OR (title:("french horn")^5
OR title2:("french horn")^2)^10 OR author:("french horn")^5 OR
format:("french horn") OR publishDate:("french horn") OR
physical:("french horn") OR contents:("french horn") OR
series:("french horn") OR topic:("french horn")^2 OR
geographic:("french horn")^2 OR genre:("french horn")^2 OR
subject:("french horn")^2 )': Lexical error at line 1, column 12.
Encountered: "\"" (34), after : "\""
Please contact the Library Reference Department for assistance nd...@st...

[576] /usr/share/pear/PEAR.php
[444] /usr/local/vufind/vufind-0.8/web/sys/SOLR.php
[430] /usr/local/vufind/vufind-0.8/web/sys/SOLR.php
[335] /usr/local/vufind/vufind-0.8/web/sys/SOLR.php
[488] /usr/local/vufind/vufind-0.8/web/services/Search/Home.php
[238] /usr/local/vufind/vufind-0.8/web/services/Search/Home.php
[50] /usr/local/vufind/vufind-0.8/web/services/Search/Home.php
[104] /usr/local/vufind/vufind-0.8/web/index.php

We installed version 0.8.1 on April 3.

>> Stopwords: our data contains a great deal of foreign language
>> materials. Very common terms in our index, per Luke, are "de", "la",
>> and so on. This is simply a stopwords issue, but documentation might
>> be useful to indicate how easy it is to address this.
>
> This is something that I hope to investigate soon. MARC has a field
> that tells the indexer what to ignore in the data string - such as
> "the", "el", "de", etc.

Yes, I know about MARC non-filing characters ... but that's only at the beginning of the field, right? I think the easier and better solution is for institutions that have multilingual data to adjust their stopwords accordingly. The Snowball project (http://snowball.tartarus.org/) has stopword lists for a number of languages, though I had to find them with a Google "search within site" search. Wikipedia also has a few links to foreign language stopword lists: http://en.wikipedia.org/wiki/Stop_words . I'm sure there are other sources.

>> The whitespace, case sensitivity and punctuation sensitivity are all
>> analysis choices. I think tokenizing most fields makes a great deal
>> of sense.
>
> And most are - however we have non-tokenized fields for faceting.

>> Single characters as terms: again, using Luke, some of the most
>> common terms in our index are "a", "1", and so on. These single
>> character terms aren't useful for searching ... they should probably
>> be eliminated either through analysis or the stopwords list.
>
> Single character fields are very helpful for faceting - that is why
> they exist. Not everything in the index is solely used for searching.

I will learn more about Solr and post any further questions/observations after I've done my homework.

Cheers,
Naomi
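[Editorial note on the stopword and single-character-term points: both can be handled in the analysis chain of the tokenized search fields in Solr's schema.xml, leaving the non-tokenized facet fields untouched. A hedged sketch, not VuFind's shipped configuration; `stopwords.txt` here stands for a locally merged file combining an English list with, e.g., Snowball's French and Spanish lists:]

```xml
<!-- Sketch of a Solr analysis chain for a tokenized search field.
     StopFilterFactory drops terms from the (locally assembled, multilingual)
     stopwords.txt; LengthFilterFactory drops single-character tokens such as
     "a" and "1" at index time. Facet fields, being non-tokenized strings,
     bypass this chain and keep their single-character values. -->
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LengthFilterFactory" min="2" max="100"/>
  </analyzer>
</fieldType>
```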
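[Editorial note on the error above: the `titleStr:""french horn""` fragment in the Solr exception suggests the query builder wraps the user's input in quotes without checking whether the user already typed them, and the doubled quotes then trip the Lucene query parser's lexer. A minimal sketch of the idea, in Python rather than VuFind's PHP; `quote_phrase` and `build_clause` are hypothetical names, not VuFind functions:]

```python
def quote_phrase(term: str) -> str:
    """Wrap a search term in quotes exactly once for a Lucene-style query."""
    term = term.strip()
    # Drop one layer of surrounding quotes if the user already supplied them,
    # so we never emit the doubled quotes ("" ... "") seen in the error above.
    if len(term) >= 2 and term[0] == '"' and term[-1] == '"':
        term = term[1:-1]
    return '"%s"' % term

def build_clause(field: str, term: str, boost: int) -> str:
    """Build one boosted field clause, e.g. titleStr:"french horn"^15."""
    return '%s:%s^%d' % (field, quote_phrase(term), boost)

# With this guard, quoted and unquoted input produce the same valid clause:
# build_clause('titleStr', '"french horn"', 15) == 'titleStr:"french horn"^15'
# build_clause('titleStr', 'french horn', 15)   == 'titleStr:"french horn"^15'
```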