Whoosh offers a number of different analyser components: RegexTokenizer, LowercaseFilter, StopFilter, StemFilter and BiWordFilter (see the whoosh.analysis module).
It would be good if we could switch these on and off in the params file when indexing.
This sounds like a good idea - is this the pipeline you're thinking about?
So in a config file, you could imagine this?
[analysers]
regextokenizer = false
stopfilter = true
stemfilter = true
So it would pass through stopfilter and stemfilter but not regextokenizer? Or would you want a number to provide some form of order to the components?
What do you think?
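To make the idea concrete, here is a rough sketch of how such a config section could be read and turned into a Whoosh pipeline. It is only one possible approach, not something that exists in the code base: the function names, the `PIPELINE_ORDER` list and the extra `lowercasefilter` / `biwordfilter` keys are all made up for illustration. It assumes a fixed component order rather than a number per component, and it assumes the tokenizer has to come first, since the filters work on the token stream a tokenizer produces.

```python
import configparser

# Assumed fixed pipeline order (hypothetical): tokenizer first, then filters.
PIPELINE_ORDER = [
    "regextokenizer",
    "lowercasefilter",
    "stopfilter",
    "stemfilter",
    "biwordfilter",
]


def enabled_components(config_text, section="analysers"):
    """Return the names of the components switched on, in pipeline order."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    # Components missing from the file are treated as switched off.
    return [
        name
        for name in PIPELINE_ORDER
        if parser.getboolean(section, name, fallback=False)
    ]


def build_analyzer(names):
    """Chain the named Whoosh components with the | operator."""
    # Imported here so the config parsing above works without Whoosh installed.
    from whoosh import analysis

    factories = {
        "regextokenizer": analysis.RegexTokenizer,
        "lowercasefilter": analysis.LowercaseFilter,
        "stopfilter": analysis.StopFilter,
        "stemfilter": analysis.StemFilter,
        "biwordfilter": analysis.BiWordFilter,
    }
    # Assumption: without a tokenizer there is nothing for the filters to consume.
    if not names or names[0] != "regextokenizer":
        raise ValueError("the pipeline must start with a tokenizer")
    analyzer = factories[names[0]]()
    for name in names[1:]:
        analyzer = analyzer | factories[name]()
    return analyzer


EXAMPLE_CONFIG = """
[analysers]
regextokenizer = false
stopfilter = true
stemfilter = true
"""

if __name__ == "__main__":
    print(enabled_components(EXAMPLE_CONFIG))
```

With the example config above, only stopfilter and stemfilter are selected, and `build_analyzer` would reject that selection because the tokenizer is switched off. That suggests the tokenizer might be better treated as always on (or as a choice between tokenizers) rather than as a true/false flag.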
Last edit: David Maxwell 2012-11-02