PS: I tried to undo but it was too late. The solution I share above does not apply when ALL the terms being searched are stopwords... but VuFind (or any other OPAC-like search system) provides many other ways to reach the wanted record (by author, by browsing titles, etc.).


On Sat, Feb 23, 2013 at 3:41 PM, Filipe MS Bento (UA) <fsb@ua.pt> wrote:
Hello all!

To complement what I've written below in response to a similar question, and hoping that I am not repeating something already posted as comments (by Demian et al.) on http://vufind.org/jira/browse/VUFIND-417, a solution to this may be found here:


(<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>)

Have a great weekend,

Filipe


---------- Forwarded message ----------
From: Filipe MS Bento (UA) <fsb@ua.pt>
Date: Thu, Feb 21, 2013 at 3:05 PM
Subject: Re: [VuFind-Tech] Searching for terms with apostrophes
To: Demian Katz <demian.katz@villanova.edu>, Karla Smith <smith@winnefox.org>, "vufind-tech@lists.sourceforge.net" <vufind-tech@lists.sourceforge.net>


Hi!

 

I guess stopwords and Language Analysis (http://wiki.apache.org/solr/LanguageAnalysis up to SOLR v3.6 [we are using 3.5], now becoming obsolete in favor of Analyzers, Tokenizers, and Token Filters, http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters) are quite a sensitive trade-off, a luxury one might say: if you have the means to afford it, or if it is vital to have those terms searchable, empty the stopwords list.

 

Let me explain, from the little I know: if you don’t have stopwords and have a huge index, performance will suffer when running queries with terms that 99% of the time are not relevant to the search itself. If you use CommonGrams, for instance, fed with 1000common.txt, you may find yourself in lots of these situations.
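
For readers unfamiliar with CommonGrams, here is a minimal sketch of what such a setup could look like in schema.xml. It is not taken from the notes in this message: the field type name "text_cg" is hypothetical, and 1000common.txt is assumed to sit in the core's conf directory. The tokenizer and lowercase filter are only there to make the fragment complete.

```xml
<!-- Hypothetical field type; the name "text_cg" is an assumption for illustration -->
<fieldType name="text_cg" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Index time: emit the single terms plus bigrams that contain a common word -->
    <filter class="solr.CommonGramsFilterFactory" words="1000common.txt" ignoreCase="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Query time: keep the bigrams and drop the single terms they already cover -->
    <filter class="solr.CommonGramsQueryFilterFactory" words="1000common.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>
```

The longer the common-words file, the more bigrams get indexed, so index size grows with the list. That is the cost side of the trade-off.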

 

I built up a solr/biblio/conf/stopwords.txt list by taking the common-words lists for several languages and removing a lot of terms from them, namely the terms that might be more problematic and would result in excluding relevant resources. Please bear in mind that SOLR is, as they put it, an “Enterprise search platform”, where most of the exact searches we need in “our” world do not apply most of the time.

 

Below are my personal notes about it (I have to post all of this on one of my inactive blogs):

 

                A sample of solr/biblio/conf/stopwords.txt

 

a
about
above
across
after
afterwards
again
against
all
almost
alone
along
already
also
although

(…) > several languages

 

solr/conf/schema.xml

 

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">

<analyzer type="index">

 

(added)
[note: please do not forget to set
autoGeneratePhraseQueries="false" in the <fieldType> attributes]

 

<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="Portuguese" />
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="German" />
<filter class="solr.ElisionFilterFactory"/>
<!-- do word delimiter, etc here -->
<filter class="solr.SnowballPorterFilterFactory" language="French" />
<filter class="solr.SnowballPorterFilterFactory" language="Spanish" />

 

The same applies to

<analyzer type="query">

 

Version 3.6+: see entries in

http://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_3_6/solr/example/solr/conf/schema.xml

 

 

Example: Portuguese

 

<!-- Portuguese -->
<fieldType name="text_pt" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_pt.txt" format="snowball" enablePositionIncrements="true"/>
    <filter class="solr.PortugueseLightStemFilterFactory"/>
    <!-- less aggressive: <filter class="solr.PortugueseMinimalStemFilterFactory"/> -->
    <!-- more aggressive: <filter class="solr.SnowballPorterFilterFactory" language="Portuguese"/> -->
    <!-- most aggressive: <filter class="solr.PortugueseStemFilterFactory"/> -->
  </analyzer>
</fieldType>

 

 

Further reading:

 

Posts from HathiTrust (VuFind based), example:


Slow Queries and Common Words (Part 2)

http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2

 

All the best. This message may seem a little bit off topic, but then again it may be of some interest to those dealing with bigger indexes who find themselves with performance issues,

 

Filipe

--------------------------
Filipe Manuel S. Bento  |  http://about.filipebento.pt/
Computer Science Specialist * PhD Researcher (UAveiro/UPorto/CETAC.Media), grant by FCT - Portuguese Foundation for Science and Technology
President/Chair of USE.pt Steering Committee (Portuguese Ex Libris Users’ National Association, hosted by Portuguese Parliament's, Palácio de S. Bento, Lisbon, http://www.USEpt.org, Oct 2010 - )



On Sat, Feb 23, 2013 at 12:59 PM, Demian Katz <demian.katz@villanova.edu> wrote:
There is a JIRA ticket which discusses some of these issues:

http://vufind.org/jira/browse/VUFIND-417

Disabling stopwords is the easy answer, but there's also a link to this page that suggests some more sophisticated approaches:

http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2
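
As a rough illustration of the "easy answer" only: disabling stopwords for a field type amounts to leaving solr.StopFilterFactory out of its analyzer chain (or emptying stopwords.txt) and then reindexing. The fragment below is a sketch under assumptions, not VuFind's shipped schema; the field type name "text_nostop" is hypothetical.

```xml
<!-- Hypothetical field type for illustration; not VuFind's shipped schema -->
<fieldType name="text_nostop" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- StopFilterFactory intentionally omitted, so words like "no" and "there" stay searchable -->
  </analyzer>
</fieldType>
```

With no type attribute on the analyzer, the same chain is used at both index and query time, which keeps the two sides consistent.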

If you come up with something that works well for you, I'd love to hear about it -- I haven't had time to tackle this issue in detail, and it would be nice to recommend a best practice on the ticket.

- Demian

From: Weston, Paige [weston1@uillinois.edu]
Sent: Friday, February 22, 2013 5:04 PM
To: vufind-general@lists.sourceforge.net
Subject: [VuFind-General] when all words are stopwords

Hello, all. Our VuFind queries and indexes run through the StopFilterFactory, which strips out non-significant words. The result for a title like No There There (OCLC#ocm81453667) is that it's not retrievable. Can anyone suggest a workaround, short of disabling the filter and reindexing? Is there a way to say, for a particular would-be index entry, "If they're all stopwords then none of them is a stopword"? Thanks.

-- 
E. Paige Weston           email:  weston1@uillinois.edu
Library Systems Coordinator                  
Consortium of Academic & Research Libraries in Illinois (CARLI)
100 Trade Centre Drive, Suite 303
Champaign, IL 61820-7233
voice:  217-244-7593  toll-free:  866-904-5843  fax:  217-244-7596

_______________________________________________
VuFind-General mailing list
VuFind-General@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/vufind-general