From: Greg P. <Gre...@us...> - 2009-10-02 05:14:01
|
I've been playing with spellcheck today on the laptop after finally getting it working from Till's contribution (I made some dumb mistakes :)). For our internal doco I had made a few notes which I've included below. I'm curious if anyone more familiar with Solr spellcheck can see a way around them:

* Solr does not return suggestions if your query got hits until you turn on 'onlyMorePopular'. The option seems to be confusingly named, and it also means you can't get spelling suggestions on queries that have smaller hit counts if your search terms were actually correct. An example from our catalogue: if you search for 'behavior' (US spelling) you get 6,841 hits and no suggestions, but if you search for 'behaviour' (the REAL spelling ;P) you get 1,838 hits and the suggestion to try 'behavior' and 'behavioral'.
* To get the number of hits each term would return you need to turn on 'extendedResults'; however, this causes parsing problems in json_decode() because of repeated use of the associative array index 'suggestion'. The repeated use means you only get the last suggestion from the list of returned values.
* Spelling suggestions will be context free with regard to hit counts anyway unless you build context-sensitive dictionaries, i.e. 'rowling' as a search in the author fields is currently compared against 'rowling' in allfields, because allfields is the origin of the dictionary. This point is somewhat moot given the json_decode() issue with extended results anyway (for now).

Ta,

Greg Pendlebury
Electronic Services Officer (Systems Team)
Division of Academic Information Services
University of Southern Queensland
Phone: +61 7 4631 1501
Fax: +61 7 4631 1841

This email (including any attached files) is confidential and is for the intended recipient(s) only. If you received this email by mistake, please, as a courtesy, tell the sender, then delete this email.

The views and opinions are the originator's and do not necessarily reflect those of the University of Southern Queensland. Although all reasonable precautions were taken to ensure that this email contained no viruses at the time it was sent we accept no liability for any losses arising from its receipt.

The University of Southern Queensland is a registered provider of education with the Australian Government (CRICOS Institution Code No's. QLD 00244B / NSW 02225M) |
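The settings named in these notes can be sketched as request parameters. This is a minimal Python sketch, not VuFind's actual code: `spellcheck`, `spellcheck.onlyMorePopular`, `spellcheck.extendedResults` and `spellcheck.dictionary` are Solr request parameter names, while the function name, the endpoint URL and the dictionary name `author` are hypothetical placeholders. The optional dictionary argument illustrates how a field-specific dictionary might be selected for the 'rowling' case, assuming such a dictionary has been built.

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint; substitute your own host and core.
SOLR_SELECT_URL = "http://localhost:8080/solr/biblio/select"


def build_spellcheck_url(query, only_more_popular=False,
                         extended_results=False, dictionary=None):
    """Build a Solr select URL with spellcheck enabled (illustrative only)."""
    params = [
        ("q", query),
        ("wt", "json"),
        ("spellcheck", "true"),
        # Solr expects lowercase "true"/"false" strings.
        ("spellcheck.onlyMorePopular", str(only_more_popular).lower()),
        ("spellcheck.extendedResults", str(extended_results).lower()),
    ]
    if dictionary is not None:
        # Assumed name of a field-specific dictionary, e.g. one built from author fields.
        params.append(("spellcheck.dictionary", dictionary))
    return SOLR_SELECT_URL + "?" + urlencode(params)


url = build_spellcheck_url("behaviour", only_more_popular=True,
                           extended_results=True, dictionary="author")
print(url)
```

Leaving `dictionary` unset falls back to whatever default dictionary the Solr configuration defines, which matches the context-free behaviour described in the third note.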
From: Demian K. <dem...@vi...> - 2009-10-02 12:41:31
|
Thanks for sharing these notes -- I've attached them to the spellchecker JIRA issue so they don't get forgotten:

http://www.vufind.org/jira/browse/VUFIND-13

Regarding the JSON problem with extendedResults, if I understand correctly, this sounds to me like a bug in Solr. Have you checked to see if the Solr project is aware of this issue?

- Demian |
From: Greg P. <Gre...@us...> - 2009-10-04 23:49:43
|
My gut reaction is it's a bug with php's json_decode(). You'd think it would turn the 'suggestion' index into an array of responses with appropriate testing. |
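The data loss described in this thread can be reproduced outside PHP. Below is a minimal sketch in Python, whose `json.loads` likewise keeps only the last value when a JSON object repeats a key. The payload is a hand-written illustration of a repeated 'suggestion' key, not actual Solr output, and `collect_duplicates` is a hypothetical helper name. Passing it as `object_pairs_hook` gathers the repeats into a list, which is roughly the behaviour wished for above.

```python
import json

# Hand-written payload with a repeated key; NOT real Solr output.
PAYLOAD = '{"suggestion": "behavior", "suggestion": "behavioral"}'


def collect_duplicates(pairs):
    """Group values of repeated keys into lists instead of overwriting them."""
    grouped = {}
    for key, value in pairs:
        grouped.setdefault(key, []).append(value)
    # Unwrap keys that occurred once so ordinary objects decode as usual.
    return {k: (v[0] if len(v) == 1 else v) for k, v in grouped.items()}


naive = json.loads(PAYLOAD)
merged = json.loads(PAYLOAD, object_pairs_hook=collect_duplicates)

print(naive)   # only the last 'suggestion' survives
print(merged)  # both suggestions are kept in a list
```

The default decoder is not misbehaving here: a generic decoder has to pick some policy for repeated names, and keeping the last one is a common choice, so the more robust fix is for the producer to emit an array.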
From: Greg P. <Gre...@us...> - 2009-10-04 23:53:00
|
Clearly I shouldn't listen to my gut:

https://issues.apache.org/jira/browse/SOLR-1071

I'll try one of the latest nightly builds. I think from looking at the structure it will require a couple of code tweaks. |
From: Demian K. <dem...@vi...> - 2009-10-05 13:38:51
|
Thanks for the update -- I've added another comment to JIRA with this link. Let us know how things work out for you!

- Demian |
From: Greg P. <Gre...@us...> - 2009-10-05 22:43:11
|
The nightly build (latest) fixed my second point. The first point is still a big hurdle though. I can't see the librarians being happy with our catalogue suggesting that users switch to US spelling. |
From: Eoghan Ó C. <eog...@gm...> - 2009-10-06 00:04:01
|
Hi Greg,

Have you thought about using Solr's synonym filter as an alternative solution to this problem? These spelling variations aren't really synonyms, but it might work as a hack. Depending on how you use the synonym filter <http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory> you could ensure that your index contains only the AU/UK version (i.e. replace the US version), or both the AU/UK and US versions, etc. In either case, I suppose the user wouldn't be presented with any "Did you mean?", but hopefully they'd find what they're looking for.

I found it surprisingly hard to get a comprehensive list of these spelling variations (misguided search strategy, perhaps), but these links provide a starting point:

- https://wiki.ubuntu.com/EnglishTranslation/WordSubstitution
- http://web.archive.org/web/20040611050511/www.peak.org/~jeremy/dictionary/tables/
- http://en.wikipedia.org/wiki/American_and_British_English_spelling_differences
- http://answers.google.com/answers/threadview/id/211486.html

I haven't had a chance to try this out myself, but some of the journal articles indexed in our sources site contain lots of medical terms where both the ae/e and oe/e spellings are in use, so I'm interested in finding a solution to this problem too.

Eoghan

PS - the following lists may also be of interest. They aren't spelling variations, more "American - British" synonym lists:

- http://www.bg-map.com/us-uk.html#1
- http://esl.about.com/library/vocabulary/blbritam.htm
> https://lists.sourceforge.net/lists/listinfo/vufind-tech > > |
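Editor's illustration: the repeated 'suggestion' key described in the quoted notes is not peculiar to PHP. Python's json module also keeps only the last value when a key is repeated within one object, so the same symptom can be reproduced there as a rough analogy. The payload below is made up for illustration (it is not actual Solr output, and the second frequency is invented), and collect_duplicates is a hypothetical helper name.

```python
import json

# Hypothetical payload: "suggestion" appears twice inside one object, which is
# the pattern the notes describe. The shape and the numbers are illustrative.
raw = """
{
  "behaviour": {
    "numFound": 2,
    "suggestion": {"word": "behavior", "freq": 6841},
    "suggestion": {"word": "behavioral", "freq": 1200}
  }
}
"""


def collect_duplicates(pairs):
    """Build a dict from (key, value) pairs, gathering repeated keys into lists."""
    result = {}
    repeated = set()
    for key, value in pairs:
        if key not in result:
            result[key] = value
        elif key in repeated:
            result[key].append(value)
        else:
            result[key] = [result[key], value]
            repeated.add(key)
    return result


# Default parsing: the later "suggestion" silently replaces the earlier one.
naive = json.loads(raw)
naive_word = naive["behaviour"]["suggestion"]["word"]

# Parsing with a hook that sees every key/value pair keeps both suggestions.
merged = json.loads(raw, object_pairs_hook=collect_duplicates)
merged_words = [s["word"] for s in merged["behaviour"]["suggestion"]]

print(naive_word)
print(merged_words)
```

The hook is only a workaround on the parsing side; the thread's eventual fix came from a newer Solr build rather than from the client.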
From: Greg P. <Gre...@us...> - 2009-10-28 03:01:03
|
I was reconsidering this today in light of wanting a reasonably functional dictionary in RC2. The rebuilt solr was to "...allow the shingle filter to return single word tokens at query time if no shingles could be made". It was my draft 1, but then I added the single word dictionary as a fallback... so perhaps the patch is unnecessary.

What's the feeling on 'value for money' in a two dictionary approach in RC2 if the patch is not needed?

Greg Pendlebury
Electronic Services Officer (Systems Team)
Division of Academic Information Services
University of Southern Queensland
Phone: +61 7 4631 1501
Fax: +61 7 4631 1841

From: Greg Pendlebury
Sent: Friday, 23 October 2009 3:54 PM
To: 'vuf...@li...'
Subject: RE: [VuFind-Tech] Idle thoughts on spellcheck

I'll try that again. Should have thought that posting a 4mb file to the list was bad :)

It's here : http://www.usq.edu.au/library/ereserve/solr.war

And the patch is attached.

Greg Pendlebury
Electronic Services Officer (Systems Team)
Division of Academic Information Services
University of Southern Queensland
Phone: +61 7 4631 1501
Fax: +61 7 4631 1841

From: Greg Pendlebury
Sent: Friday, 23 October 2009 3:39 PM
To: vuf...@li...
Subject: RE: [VuFind-Tech] Idle thoughts on spellcheck

Ok, this is experimental, but I like the behaviour. Drawback is solr can't currently handle it.

Attached is a rebuilt version of solr from the current trunk with SOLR-744 applied: http://issues.apache.org/jira/browse/SOLR-744

It was rebuilt against lucene v2.9 stable with LUCENE-1370 applied : http://issues.apache.org/jira/browse/LUCENE-1370.

These two patches allow the shingle filter to return single word tokens at query time if no shingles could be made. The vufind patch also attached makes use of this to build a shingle spelling dictionary.
Solr.php has been adjusted to allow the dictionary to be changed by SearchObject.php and the search object now looks for phrase matches (two word phrases only so far) in the shingle dictionary, before falling back to the basic dictionary (submits a second search). It generally improves the issue of individual word suggestions breaking a phrase (by giving priority to phrase suggestions), but still allows for the fact that individual words could be wrong (if they weren't part of a phrase).

Aside from feedback and experimentation (obviously why I'm posting) I'm quite happy with this method and think I'll shelve it as our (USQ) solution once those patches make it into release. I haven't done any performance benchmarking on the impacts of this technique, so they could of course be a killer to the idea. I'm not aware of a way to get suggestions from both dictionaries with one query... but I haven't looked into it either.

Greg Pendlebury
Electronic Services Officer (Systems Team)
Division of Academic Information Services
University of Southern Queensland
Phone: +61 7 4631 1501
Fax: +61 7 4631 1841

From: Greg Pendlebury [mailto:Gre...@us...]
Sent: Friday, 23 October 2009 9:13 AM
To: 'Osullivan L.'; Eoghan Ó Carragáin
Cc: vuf...@li...
Subject: Re: [VuFind-Tech] Idle thoughts on spellcheck

I like the idea. I think the current dictionary is not the way to go though. Eg. I do a search for 'harry poter' and it suggests 'barry potter' because it tries to fix both words. I have a proof-of-concept shingle based dictionary online now which is appropriately suggesting 'harry potter' because the whole phrase is more common but I need to play some more. I suspect both dictionaries have a use, particularly if you want to display the interface you've suggested.

Greg Pendlebury
Electronic Services Officer (Systems Team)
Division of Academic Information Services
University of Southern Queensland
Phone: +61 7 4631 1501
Fax: +61 7 4631 1841

From: Osullivan L. [mailto:L.O...@sw...]
Sent: Thursday, 22 October 2009 8:31 PM
To: Greg Pendlebury; Eoghan Ó Carragáin
Cc: vuf...@li...
Subject: RE: [VuFind-Tech] Idle thoughts on spellcheck

Greetings All,

Another thought: How about single word spelling variations, plus a "Did you mean" phrase option?

E.g. Iresh Landskape would return:

Did you mean Irish Landscape?

Perhaps you should try some spelling variations:

Iresh >> Irish, Fresh, Ires
Landskape >> Landscape, Landscapes, Landspae

Kind Regards,

Luke

PS: I have it working on my test install but the code is very messy and not suitable for addition

________________________________

From: Greg Pendlebury [mailto:Gre...@us...]
Sent: 22 October 2009 07:26
To: 'Eoghan Ó Carragáin'
Cc: vuf...@li...
Subject: Re: [VuFind-Tech] Idle thoughts on spellcheck

Hmmm, I was looking in the solr 1.4 book that was advertised on the list a while back and think it's worth revisiting phonetic filters. I was using PhoneticFilterFactory and changing encoding (as per : http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PhoneticFilterFactory)

The book mentions a specific DoubleMetaphoneFilterFactory that can be given a parameter for 'maxCodeLength'. It defaults to four and could possibly account for the appearance of inbuilt stemming in my original testing.

Greg Pendlebury
Electronic Services Officer (Systems Team)
Division of Academic Information Services
University of Southern Queensland
Phone: +61 7 4631 1501
Fax: +61 7 4631 1841

From: Eoghan Ó Carragáin [mailto:eog...@gm...]
Sent: Tuesday, 6 October 2009 6:35 PM
To: Greg Pendlebury
Cc: Demian Katz; vuf...@li...
Subject: Re: [VuFind-Tech] Idle thoughts on spellcheck

Hi,

If phonetic filtering is reliable and doesn't produce too many unexpected results (as stemming sometimes can), then it would certainly be easier to maintain than a synonym list. Thanks for the data.

Eoghan

2009/10/6 Greg Pendlebury <Gre...@us...>

We considered synonyms but discarded the idea some time ago.
It is awkward to create and maintain, and it doesn't cover other issues where a spelling suggestion on lower hit rate items would also help (rowling vs rolling vs rowing).

I think phonetic filtering would be the answer to international spelling variations, but our testing has indicated the four phonetic filters in solr currently can't handle what we need. I can't remember whether I'd attached this before but some basic test data is attached. Refined Soundex is the closest we got, but it has some issues with S and Z comparison. The others all have stemming built into them on top of phonetics it would seem... most annoying.

Greg Pendlebury
Electronic Services Officer (Systems Team)
Division of Academic Information Services
University of Southern Queensland
Phone: +61 7 4631 1501
Fax: +61 7 4631 1841

From: Eoghan Ó Carragáin [mailto:eog...@gm...]
Sent: Tuesday, 6 October 2009 10:04 AM
To: Greg Pendlebury
Cc: Demian Katz; vuf...@li...
Subject: Re: [VuFind-Tech] Idle thoughts on spellcheck

Hi Greg,

Have you thought about using Solr's synonym filter as an alternative solution to this problem? These spelling variations aren't really synonyms, but it might work as a hack. Depending on how you use the synonym filter (http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory) you could ensure that your index contains only the AU/UK version (i.e. replace the US version), or both the AU/UK and US version etc. In either case, I suppose the user wouldn't be presented with any "Did you mean?", but hopefully they'd find what they're looking for.
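Editor's illustration: the two indexing options mentioned above (keep only the AU/UK form, or keep both forms) can be sketched in a few lines of Python. This is not Solr's SynonymFilterFactory, only a toy token filter; the SPELLING_VARIANTS table and the normalise function are hypothetical, and a real variant list would be far longer.

```python
# Hypothetical US -> AU/UK spelling table, for illustration only.
SPELLING_VARIANTS = {
    "behavior": "behaviour",
    "color": "colour",
    "pediatric": "paediatric",
}


def normalise(tokens, replace=True):
    """Lower-case tokens and map US spellings to their AU/UK variants.

    replace=True keeps only the AU/UK form; replace=False keeps both forms,
    mirroring the two options described in the message above.
    """
    out = []
    for token in tokens:
        word = token.lower()
        variant = SPELLING_VARIANTS.get(word)
        if variant is None:
            out.append(word)
        elif replace:
            out.append(variant)
        else:
            out.extend([word, variant])
    return out


indexed = normalise("Child behavior and color".split())
expanded = normalise("Child behavior".split(), replace=False)
print(indexed)
print(expanded)
```

Applying the same function to both indexed text and queries is what makes 'behavior' and 'behaviour' find the same records; as noted above, the user never sees a "Did you mean?" in this scheme.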
I found it surprisingly hard to get a comprehensive list of these spelling variations (misguided search strategy, perhaps), but these links provide a starting point:

* https://wiki.ubuntu.com/EnglishTranslation/WordSubstitution
* http://web.archive.org/web/20040611050511/www.peak.org/~jeremy/dictionary/tables/
* http://en.wikipedia.org/wiki/American_and_British_English_spelling_differences
* http://answers.google.com/answers/threadview/id/211486.html

I haven't had a chance to try this out myself, but some of the journal articles indexed in our sources site contain lots of medical terms where ae/e and oe/e both appear in usage, so I'm interested in finding a solution to this problem too.

Eoghan

PS - the following lists may also be of interest. They aren't spelling variations, more "American - British" synonym lists:

* http://www.bg-map.com/us-uk.html#1
* http://esl.about.com/library/vocabulary/blbritam.htm

2009/10/5 Greg Pendlebury <Gre...@us...>

The nightly build (latest) fixed my second point. The first point is still a big hurdle though. Can't see the librarians being happy with our catalogue suggesting that users switch to US spelling.

Greg Pendlebury
Electronic Services Officer (Systems Team)
Division of Academic Information Services
University of Southern Queensland
Phone: +61 7 4631 1501
Fax: +61 7 4631 1841

From: Demian Katz [mailto:dem...@vi...]
Sent: Monday, 5 October 2009 11:39 PM
To: Greg Pendlebury; 'vuf...@li...'
Subject: RE: Idle thoughts on spellcheck

Thanks for the update -- I've added another comment to JIRA with this link. Let us know how things work out for you!

- Demian
|
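Editor's illustration: the phrase-first, word-fallback lookup described in this thread can be sketched roughly as below. This is a minimal sketch under stated assumptions, not the VuFind patch: the two dictionaries are tiny made-up lists, Python's difflib stands in for Solr's spellchecker, hit counts are ignored, and the names shingles, closest and suggest are hypothetical.

```python
import difflib

# Made-up dictionaries for illustration; real ones are built from the index.
PHRASE_DICTIONARY = ["harry potter", "irish landscape", "information services"]
WORD_DICTIONARY = ["harry", "barry", "potter", "irish", "landscape", "rowling"]


def shingles(tokens, size=2):
    """Return word n-grams; fall back to the single tokens when the input is
    too short to form any shingle."""
    if len(tokens) < size:
        return list(tokens)
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


def closest(term, dictionary, cutoff=0.8):
    """Return the most similar dictionary entry, or None if nothing is close."""
    matches = difflib.get_close_matches(term, dictionary, n=1, cutoff=cutoff)
    return matches[0] if matches else None


def suggest(query):
    """Map each misspelled phrase or word in the query to a suggestion."""
    tokens = query.lower().split()
    # First pass: two-word shingles against the phrase dictionary.
    if len(tokens) >= 2:
        found = {}
        for shingle in shingles(tokens):
            match = closest(shingle, PHRASE_DICTIONARY)
            if match is not None and match != shingle:
                found[shingle] = match
        if found:
            return found  # phrase suggestions take priority
    # Second pass (the "second search"): single words against the basic dictionary.
    found = {}
    for token in tokens:
        match = closest(token, WORD_DICTIONARY)
        if match is not None and match != token:
            found[token] = match
    return found


print(suggest("harry poter"))   # phrase dictionary answers
print(suggest("harry rowlin"))  # no phrase is close, so single words are checked
```

Because the whole phrase is matched first, 'harry poter' is corrected as a unit instead of each word being "fixed" separately, while a query that matches no known phrase still gets word-level suggestions from the second pass.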
From: Demian K. <dem...@vi...> - 2009-10-28 14:45:52
|
One possibility would be to set up the two-dictionary approach but prefix the relevant settings with a comment indicating that the configuration can be simplified by applying the appropriate patch.

- Demian
|
From: Greg P. <Gre...@us...> - 2009-10-29 00:42:09
|
Fair enough. I forgot to mention last time some basic stats I took on the size and build time: Normal dictionary against a 1.12gb index of almost 400k records was 30mb and took 30s to build. Shingle dictionary against a 1.23gb index (same records + shingles) was 600mb and took 15mins to build. The shingles weren't built against 'allfields' because of the issue of shingles incorrectly being created across two fields, just the major text fields like title, author, topic etc. Greg Pendlebury Electronic Services Officer (Systems Team) Division of Academic Information Services University of Southern Queensland Phone: +61 7 4631 1501 Fax: +61 7 4631 1841 From: Demian Katz [mailto:dem...@vi...] Sent: Thursday, 29 October 2009 12:46 AM To: Greg Pendlebury; 'vuf...@li...' Subject: RE: [VuFind-Tech] Idle thoughts on spellcheck One possibility would be to set up the two-dictionary approach but prefix the relevant settings with a comment indicating that the configuration can be simplified by applying the appropriate patch. - Demian From: Greg Pendlebury [mailto:Gre...@us...] Sent: Tuesday, October 27, 2009 11:01 PM To: 'vuf...@li...' Subject: Re: [VuFind-Tech] Idle thoughts on spellcheck I was reconsidering this today in light of wanting a reasonably functional dictionary in RC2. The rebuilt solr was to "...allow the shingle filter to return single word tokens at query time if no shingles could be made". It was my draft 1, but then I added the single word dictionary as a fallback... so perhaps the patch is unnecessary. What's the feeling on 'value for money' in a two dictionary approach in RC2 if the patch is not needed? Greg Pendlebury Electronic Services Officer (Systems Team) Division of Academic Information Services University of Southern Queensland Phone: +61 7 4631 1501 Fax: +61 7 4631 1841 From: Greg Pendlebury Sent: Friday, 23 October 2009 3:54 PM To: 'vuf...@li...' Subject: RE: [VuFind-Tech] Idle thoughts on spellcheck I'll try that again. 
Should have thought that posting a 4mb file to the list was bad :) It's here : http://www.usq.edu.au/library/ereserve/solr.war And the patch is attached. Greg Pendlebury Electronic Services Officer (Systems Team) Division of Academic Information Services University of Southern Queensland Phone: +61 7 4631 1501 Fax: +61 7 4631 1841 From: Greg Pendlebury Sent: Friday, 23 October 2009 3:39 PM To: vuf...@li... Subject: RE: [VuFind-Tech] Idle thoughts on spellcheck Ok, this is experimental, but I like the behaviour. Drawback is solr can't currently handle it. Attached is a rebuilt version of solr from the current trunk with SOLR-744 applied: http://issues.apache.org/jira/browse/SOLR-744 It was rebuilt against lucene v2.9 stable with LUCENE-1370 applied : http://issues.apache.org/jira/browse/LUCENE-1370. These two patches allow the shingle filter to return single word tokens at query time if no shingles could be made. The vufind patch also attached makes use of this to build a shingle spelling dictionary. Solr.php has been adjusted to allow the dictionary to be changed by SearchObject.php and the search object now looks for phrase matches (two word phrases only so far) in the shingle dictionary, before falling basic to the basic dictionary (submits a second search). It generally improves the issue of individual word suggestions breaking a phrase (by giving priority to phrase suggestions), but still allows for the fact that individual words could be wrong (if they weren't part of a phrase). Aside from feedback and experimentation (obviously why I'm posting) I'm quite happy with this method and think I'll shelve it as our (USQ) solution once those patches make it into release. I haven't done any performance benchmarking on the impacts of this technique, so they could of course be a killer to the idea. I'm not aware of a way to get suggestions from both dictionaries with one query... but I haven't looked into it either. 
Greg Pendlebury Electronic Services Officer (Systems Team) Division of Academic Information Services University of Southern Queensland Phone: +61 7 4631 1501 Fax: +61 7 4631 1841 From: Greg Pendlebury [mailto:Gre...@us...] Sent: Friday, 23 October 2009 9:13 AM To: 'Osullivan L.'; Eoghan Ó Carragáin Cc: vuf...@li... Subject: Re: [VuFind-Tech] Idle thoughts on spellcheck I like the idea. I think the current dictionary is not the way to go though. Eg. I do a search for 'harry poter' and it suggests 'barry potter' because it tries to fix both words. I have a proof-of-concept shingle based dictionary online now which is appropriately suggesting 'harry potter' because the whole phrase is more common but I need to play some more. I suspect both dictionaries have a use, particularly if you want to display the interface you've suggested. Greg Pendlebury Electronic Services Officer (Systems Team) Division of Academic Information Services University of Southern Queensland Phone: +61 7 4631 1501 Fax: +61 7 4631 1841 From: Osullivan L. [mailto:L.O...@sw...] Sent: Thursday, 22 October 2009 8:31 PM To: Greg Pendlebury; Eoghan Ó Carragáin Cc: vuf...@li... Subject: RE: [VuFind-Tech] Idle thoughts on spellcheck Greetings All, Another thought: How about single word spelling variations, plus a "Did you mean" phrase option? E.g. Iresh Landskape would return: Did you mean Irish Landscape? Perhaps you should try some spelling variations: Iresh >> Irish, Fresh, Ires Landskape >> Landscape, Landscapes, Landspae Kind Regards, Luke PS: I have it working on my test install but the code is very messy and not suitable for addition ________________________________ From: Greg Pendlebury [mailto:Gre...@us...] Sent: 22 October 2009 07:26 To: 'Eoghan Ó Carragáin' Cc: vuf...@li... Subject: Re: [VuFind-Tech] Idle thoughts on spellcheck Hmmm, I was looking in the solr 1.4 book that was advertised on the list a while back and think it's worth revisiting phonetic filters. 
I was using PhoneticFilterFactory and changing encoding (as per: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PhoneticFilterFactory).

The book mentions a specific DoubleMetaphoneFilterFactory that can be given a parameter for 'maxCodeLength'. It defaults to four and could possibly account for the appearance of inbuilt stemming in my original testing.

Greg Pendlebury

From: Eoghan Ó Carragáin [mailto:eog...@gm...]
Sent: Tuesday, 6 October 2009 6:35 PM
To: Greg Pendlebury
Cc: Demian Katz; vuf...@li...
Subject: Re: [VuFind-Tech] Idle thoughts on spellcheck

Hi,
If phonetic filtering is reliable and doesn't produce too many unexpected results (as stemming sometimes can), then it would certainly be easier to maintain than a synonym list. Thanks for the data.
Eoghan

2009/10/6 Greg Pendlebury <Gre...@us...>

We considered synonyms but discarded the idea some time ago. It is awkward to create and maintain, and it doesn't cover other issues where a spelling suggestion on lower hit rate items would also help (rowling vs rolling vs rowing).

I think phonetic filtering would be the answer to international spelling variations, but our testing has indicated the four phonetic filters in solr currently can't handle what we need. I can't remember whether I'd attached this before but some basic test data is attached. Refined Soundex is the closest we got, but it has some issues with S and Z comparison. The others all have stemming built into them on top of phonetics it would seem... most annoying.

Greg Pendlebury

From: Eoghan Ó Carragáin [mailto:eog...@gm...]
Sent: Tuesday, 6 October 2009 10:04 AM
To: Greg Pendlebury
Cc: Demian Katz; vuf...@li...
Subject: Re: [VuFind-Tech] Idle thoughts on spellcheck

Hi Greg,
Have you thought about using Solr's synonym filter as an alternative solution to this problem? These spelling variations aren't really synonyms, but it might work as a hack. Depending on how you use the synonym filter (http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory) you could ensure that your index contains only the AU/UK version (i.e. replace the US version), or both the AU/UK and US version etc. In either case, I suppose the user wouldn't be presented with any "Did you mean?", but hopefully they'd find what they're looking for.

I found it surprisingly hard to get a comprehensive list of these spelling variations (misguided search strategy, perhaps), but these links provide a starting point:

* https://wiki.ubuntu.com/EnglishTranslation/WordSubstitution
* http://web.archive.org/web/20040611050511/www.peak.org/~jeremy/dictionary/tables/
* http://en.wikipedia.org/wiki/American_and_British_English_spelling_differences
* http://answers.google.com/answers/threadview/id/211486.html

I haven't had a chance to try this out myself, but some of the journal articles indexed in our sources site contain lots of medical terms where both the ae/e and oe/e usages appear, so I'm interested in finding a solution to this problem too.

Eoghan

PS - the following lists may also be of interest. They aren't spelling variations, more "American - British" synonym lists:

* http://www.bg-map.com/us-uk.html#1
* http://esl.about.com/library/vocabulary/blbritam.htm

2009/10/5 Greg Pendlebury <Gre...@us...>

The nightly build (latest) fixed my second point. The first point is still a big hurdle though. Can't see the librarians being happy with our catalogue suggesting the users switching to US spelling.

Greg Pendlebury

From: Demian Katz [mailto:dem...@vi...]
Sent: Monday, 5 October 2009 11:39 PM
To: Greg Pendlebury; 'vuf...@li...'
Subject: RE: Idle thoughts on spellcheck

Thanks for the update -- I've added another comment to JIRA with this link. Let us know how things work out for you!

- Demian
|
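A note on the 'maxCodeLength' idea quoted in this message: the sketch below shows, under stated assumptions, how capping the length of a phonetic code can look like stemming. It uses plain American Soundex as a stand-in, not Double Metaphone, and `soundex` with its `max_code_length` parameter is a hypothetical helper written for this illustration rather than anything from Solr. Whether this is what happened in the original testing is not confirmed anywhere in the thread.

```python
# Illustration only: American Soundex as a stand-in for a length-capped
# phonetic encoder. This is NOT Solr's DoubleMetaphoneFilterFactory.
SOUNDEX_CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        SOUNDEX_CODES[ch] = digit


def soundex(word, max_code_length=4):
    """Encode a word, then cut or zero-pad the code to max_code_length."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    code = word[0].upper()
    prev = SOUNDEX_CODES.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":
            continue  # h and w do not separate two identical codes
        digit = SOUNDEX_CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        prev = digit  # a vowel resets prev, so a repeated code counts again
    return code[:max_code_length].ljust(max_code_length, "0")


if __name__ == "__main__":
    for cap in (4, 8):
        for term in ("catalog", "catalogue", "cataloguing"):
            print(cap, term, soundex(term, cap))
```

With a cap of four, 'catalog', 'catalogue' and 'cataloguing' all collapse to one code, which from the outside resembles stemming. With a longer cap the two spellings still match each other while the longer word gets its own code.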
From: Eoghan Ó C. <eog...@gm...> - 2009-10-29 01:28:59
|
Hi Greg,
Interesting figures. Is this a shingle dictionary with a maxShingleSize of 2? Looks like anything higher would produce very large dictionaries.

Do you have a sense of how well the spellcheck performs against the larger index, i.e. any perceptible delay in returning suggestions?

Thanks,
Eoghan

2009/10/29 Greg Pendlebury <Gre...@us...>

> Fair enough. I forgot to mention last time some basic stats I took on the size and build time:
>
> Normal dictionary against a 1.12gb index of almost 400k records was 30mb and took 30s to build.
>
> Shingle dictionary against a 1.23gb index (same records + shingles) was 600mb and took 15mins to build.
>
> The shingles weren’t built against ‘allfields’ because of the issue of shingles incorrectly being created across two fields, just the major text fields like title, author, topic etc.
>
> Greg Pendlebury
>
> From: Demian Katz [mailto:dem...@vi...]
> Sent: Thursday, 29 October 2009 12:46 AM
> To: Greg Pendlebury; 'vuf...@li...'
> Subject: RE: [VuFind-Tech] Idle thoughts on spellcheck
>
> One possibility would be to set up the two-dictionary approach but prefix the relevant settings with a comment indicating that the configuration can be simplified by applying the appropriate patch.
>
> - Demian
>
> From: Greg Pendlebury [mailto:Gre...@us...]
> Sent: Tuesday, October 27, 2009 11:01 PM
> To: 'vuf...@li...'
> Subject: Re: [VuFind-Tech] Idle thoughts on spellcheck
>
> I was reconsidering this today in light of wanting a reasonably functional dictionary in RC2. The rebuilt solr was to “…allow the shingle filter to return single word tokens at query time if no shingles could be made”. It was my draft 1, but then I added the single word dictionary as a fallback… so perhaps the patch is unnecessary.
>
> What’s the feeling on ‘value for money’ in a two dictionary approach in RC2 if the patch is not needed?
>
> Greg Pendlebury
>
> From: Greg Pendlebury
> Sent: Friday, 23 October 2009 3:54 PM
> To: 'vuf...@li...'
> Subject: RE: [VuFind-Tech] Idle thoughts on spellcheck
>
> I’ll try that again. Should have thought that posting a 4mb file to the list was bad :)
>
> It’s here: http://www.usq.edu.au/library/ereserve/solr.war
>
> And the patch is attached.
>
> Greg Pendlebury
>
> From: Greg Pendlebury
> Sent: Friday, 23 October 2009 3:39 PM
> To: vuf...@li...
> Subject: RE: [VuFind-Tech] Idle thoughts on spellcheck
>
> Ok, this is experimental, but I like the behaviour. Drawback is solr can’t currently handle it. Attached is a rebuilt version of solr from the current trunk with SOLR-744 applied: http://issues.apache.org/jira/browse/SOLR-744. It was rebuilt against lucene v2.9 stable with LUCENE-1370 applied: http://issues.apache.org/jira/browse/LUCENE-1370. These two patches allow the shingle filter to return single word tokens at query time if no shingles could be made.
>
> The vufind patch also attached makes use of this to build a shingle spelling dictionary. Solr.php has been adjusted to allow the dictionary to be changed by SearchObject.php and the search object now looks for phrase matches (two word phrases only so far) in the shingle dictionary, before falling back to the basic dictionary (submits a second search).
>
> It generally improves the issue of individual word suggestions breaking a phrase (by giving priority to phrase suggestions), but still allows for the fact that individual words could be wrong (if they weren’t part of a phrase). Aside from feedback and experimentation (obviously why I’m posting) I’m quite happy with this method and think I’ll shelve it as our (USQ) solution once those patches make it into release.
>
> I haven’t done any performance benchmarking on the impacts of this technique, so they could of course be a killer to the idea. I’m not aware of a way to get suggestions from both dictionaries with one query… but I haven’t looked into it either.
>
> Greg Pendlebury
>
> From: Greg Pendlebury [mailto:Gre...@us...]
> Sent: Friday, 23 October 2009 9:13 AM
> To: 'Osullivan L.'; Eoghan Ó Carragáin
> Cc: vuf...@li...
> Subject: Re: [VuFind-Tech] Idle thoughts on spellcheck
>
> I like the idea. I think the current dictionary is not the way to go though.
>
> Eg. I do a search for ‘harry poter’ and it suggests ‘barry potter’ because it tries to fix both words. I have a proof-of-concept shingle based dictionary online now which is appropriately suggesting ‘harry potter’ because the whole phrase is more common but I need to play some more. I suspect both dictionaries have a use, particularly if you want to display the interface you’ve suggested.
>
> Greg Pendlebury
>
> From: Osullivan L. [mailto:L.O...@sw...]
> Sent: Thursday, 22 October 2009 8:31 PM
> To: Greg Pendlebury; Eoghan Ó Carragáin
> Cc: vuf...@li...
> Subject: RE: [VuFind-Tech] Idle thoughts on spellcheck
>
> Greetings All,
>
> Another thought: How about single word spelling variations, plus a “Did you mean” phrase option?
>
> E.g. Iresh Landskape would return:
>
> Did you mean Irish Landscape?
>
> Perhaps you should try some spelling variations:
>
> Iresh >> Irish, Fresh, Ires
> Landskape >> Landscape, Landscapes, Landspae
>
> Kind Regards,
>
> Luke
>
> PS: I have it working on my test install but the code is very messy and not suitable for addition
|
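On the maxShingleSize question in the message above, here is a rough sketch of why the entry count climbs as the shingle size goes up, and why shingling a merged field such as 'allfields' creates phrases that never occur in any record. The `shingles` and `dictionary_entries` helpers and the two sample titles are made up for this illustration; this is not how Solr's shingle filter or spellchecker is implemented, and the toy counts say nothing about real index sizes.

```python
def shingles(tokens, max_shingle_size=2):
    """Yield every run of 2..max_shingle_size adjacent tokens as one phrase."""
    for size in range(2, max_shingle_size + 1):
        for start in range(len(tokens) - size + 1):
            yield " ".join(tokens[start:start + size])


def dictionary_entries(field_values, max_shingle_size=2):
    """Collect distinct single words and shingles, one field value at a time."""
    entries = set()
    for value in field_values:
        tokens = value.lower().split()
        entries.update(tokens)
        entries.update(shingles(tokens, max_shingle_size))
    return entries


if __name__ == "__main__":
    # Hypothetical sample data, standing in for title fields of two records.
    titles = [
        "harry potter and the goblet of fire",
        "harry potter and the chamber of secrets",
    ]
    for size in (1, 2, 3):
        print(size, len(dictionary_entries(titles, size)))

    # Shingling each field separately never pairs words from different fields;
    # shingling one merged string does.
    merged = [" ".join(titles)]
    print("fire harry" in dictionary_entries(titles, 2))
    print("fire harry" in dictionary_entries(merged, 2))
```

Even on two short titles the distinct entries roughly double from single words to two-word shingles and keep growing at three, which is consistent with the concern that anything above 2 would produce very large dictionaries.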
From: Greg P. <Gre...@us...> - 2009-10-29 02:05:24
|
Yes, that's with max 2. Unfortunately I don't have any benchmarking tools set up as yet, but I would also be interested in those results. I don't notice any degradation, but solr is so fast I don't think anything but proper benchmark testing would give an accurate indication. There must be some degradation, obviously, because of the extra query being forced. How index size would impact that I could only guess.

Greg Pendlebury
|
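To make the "extra query" in the message above concrete, here is a minimal sketch of a phrase-first lookup with a single-word fallback. It assumes two in-memory sets in place of the two Solr spelling dictionaries and uses Python's `difflib` for fuzzy matching, so it is not the VuFind patch; `closest`, `suggest`, the sample dictionaries and the 0.75 cutoff are all hypothetical. The second value returned counts how many dictionary lookups were needed.

```python
import difflib


def closest(term, dictionary, cutoff=0.75):
    """Return the dictionary entry most similar to term, or None."""
    matches = difflib.get_close_matches(term, sorted(dictionary), n=1, cutoff=cutoff)
    return matches[0] if matches else None


def suggest(query, shingle_dictionary, word_dictionary):
    """Return (suggestion, lookups); suggestion is None if nothing changes."""
    query = " ".join(query.lower().split())
    words = query.split()
    lookups = 0

    # First pass: two-word queries are checked as a whole phrase.
    if len(words) == 2:
        lookups += 1
        phrase = closest(query, shingle_dictionary)
        if phrase is not None:
            return (None if phrase == query else phrase), lookups

    # Fallback: a second lookup that corrects each unknown word on its own.
    lookups += 1
    fixed = [w if w in word_dictionary else (closest(w, word_dictionary) or w)
             for w in words]
    suggestion = " ".join(fixed)
    return (None if suggestion == query else suggestion), lookups


if __name__ == "__main__":
    # Hypothetical dictionaries for the illustration.
    phrases = {"harry potter", "irish landscape"}
    vocabulary = {"harry", "barry", "potter", "irish", "landscape",
                  "rowling", "biography"}
    print(suggest("harry poter", phrases, vocabulary))
    print(suggest("rowling biografy", phrases, vocabulary))
```

A phrase hit costs one lookup; a miss costs two, which is where any slowdown from this approach would come from. Only benchmarking against a real index would show whether that cost matters.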
From: Eoghan Ó C. <eog...@gm...> - 2009-10-29 02:22:48
|
Thanks. Can you say a little about how a shingle dictionary with max 2 works? E.g.:

- What happens when the user enters a phrase with more than two words? Is the shingle dictionary ignored, or is the query somehow tokenised into shingles?
- If it is not ignored when there are more than two words, what happens when a user enters three words with a correctly spelled middle word separating two misspelled words? Does this fail and default to the single word suggestion?

Sorry, I haven't got your latest spell check enhancements working, and I'm trying to get my head around where it's heading. It all sounds really promising. Do you think there is any point considering a max number higher than 2?

Cheers,
Eoghan

2009/10/29 Greg Pendlebury <Gre...@us...>

> Yes, that’s with max 2. Unfortunately don’t have any benchmarking tools setup as yet, but I would also be interested in those results. I don’t notice any degradations, but solr is so fast I don’t think anything but proper benchmark testing would give an accurate indication. There must be some degradations obviously because of the extra query being forced. How index size would impact there I could only guess.
>
> Greg Pendlebury
>
> From: Eoghan Ó Carragáin [mailto:eog...@gm...]
> Sent: Thursday, 29 October 2009 11:29 AM
> To: Greg Pendlebury
> Cc: Demian Katz; vuf...@li...
> Subject: Re: [VuFind-Tech] Idle thoughts on spellcheck
>
> Hi Greg,
> Interesting figures. Is this a shingle dictionary with a maxShingleSize of 2? Looks like anything higher would produce very large dictionaries.
>
> Do you have a sense of how well the spellcheck performs against the larger index, i.e. any perceptible delay in returning suggestions?
>
> Thanks,
> Eoghan
>
> 2009/10/29 Greg Pendlebury <Gre...@us...>
>
> Fair enough.
I forgot to mention last time some basic stats I took on the > size and build time: > > > > Normal dictionary against a 1.12gb index of almost 400k records was 30mb > and took 30s to build. > > Shingle dictionary against a 1.23gb index (same records + shingles) was > 600mb and took 15mins to build. > > > > The shingles weren’t built against ‘allfields’ because of the issue of > shingles incorrectly being created across two fields, just the major text > fields like title, author, topic etc. > > > > *Greg Pendlebury* > Electronic Services Officer (Systems Team) > Division of Academic Information Services > University of Southern Queensland > Phone: +61 7 4631 1501 > Fax: +61 7 4631 1841 > > *From:* Demian Katz [mailto:dem...@vi...] > *Sent:* Thursday, 29 October 2009 12:46 AM > > > *To:* Greg Pendlebury; 'vuf...@li...' > > *Subject:* RE: [VuFind-Tech] Idle thoughts on spellcheck > > > > One possibility would be to set up the two-dictionary approach but prefix > the relevant settings with a comment indicating that the configuration can > be simplified by applying the appropriate patch. > > > > - Demian > > > > *From:* Greg Pendlebury [mailto:Gre...@us...] > *Sent:* Tuesday, October 27, 2009 11:01 PM > *To:* 'vuf...@li...' > *Subject:* Re: [VuFind-Tech] Idle thoughts on spellcheck > > > > I was reconsidering this today in light of wanting a reasonably functional > dictionary in RC2. The rebuilt solr was to “…allow the shingle filter to > return single word tokens at query time if no shingles could be made”. It > was my draft 1, but then I added the single word dictionary as a fallback… > so perhaps the patch is unnecessary. > > > > What’s the feeling on ‘value for money’ in a two dictionary approach in RC2 > if the patch is not needed? 
> > > > *Greg Pendlebury* > Electronic Services Officer (Systems Team) > Division of Academic Information Services > University of Southern Queensland > Phone: +61 7 4631 1501 > Fax: +61 7 4631 1841 > > *From:* Greg Pendlebury > *Sent:* Friday, 23 October 2009 3:54 PM > *To:* 'vuf...@li...' > *Subject:* RE: [VuFind-Tech] Idle thoughts on spellcheck > > > > I’ll try that again. Should have thought that posting a 4mb file to the > list was bad :) > > > > It’s here : http://www.usq.edu.au/library/ereserve/solr.war > > > > And the patch is attached. > > > > *Greg Pendlebury* > Electronic Services Officer (Systems Team) > Division of Academic Information Services > University of Southern Queensland > Phone: +61 7 4631 1501 > Fax: +61 7 4631 1841 > > *From:* Greg Pendlebury > *Sent:* Friday, 23 October 2009 3:39 PM > *To:* vuf...@li... > *Subject:* RE: [VuFind-Tech] Idle thoughts on spellcheck > > > > Ok, this is experimental, but I like the behaviour. Drawback is solr can’t > currently handle it. Attached is a rebuilt version of solr from the current > trunk with SOLR-744 applied: http://issues.apache.org/jira/browse/SOLR-744 > It was rebuilt against lucene v2.9 stable with LUCENE-1370 applied : > http://issues.apache.org/jira/browse/LUCENE-1370. These two patches allow > the shingle filter to return single word tokens at query time if no shingles > could be made. > > > > The vufind patch also attached makes use of this to build a shingle > spelling dictionary. Solr.php has been adjusted to allow the dictionary to > be changed by SearchObject.php and the search object now looks for phrase > matches (two word phrases only so far) in the shingle dictionary, before > falling back to the basic dictionary (submits a second search). > > > > It generally improves the issue of individual word suggestions breaking a > phrase (by giving priority to phrase suggestions), but still allows for the > fact that individual words could be wrong (if they weren’t part of a > phrase).
Aside from feedback and experimentation (obviously why I’m posting) > I’m quite happy with this method and think I’ll shelve it as our (USQ) > solution once those patches make it into release. > > > > I haven’t done any performance benchmarking on the impacts of this > technique, so they could of course be a killer to the idea. I’m not aware of > a way to get suggestions from both dictionaries with one query… but I > haven’t looked into it either. > > > > *Greg Pendlebury* > Electronic Services Officer (Systems Team) > Division of Academic Information Services > University of Southern Queensland > Phone: +61 7 4631 1501 > Fax: +61 7 4631 1841 > > *From:* Greg Pendlebury [mailto:Gre...@us...] > *Sent:* Friday, 23 October 2009 9:13 AM > *To:* 'Osullivan L.'; Eoghan Ó Carragáin > *Cc:* vuf...@li... > *Subject:* Re: [VuFind-Tech] Idle thoughts on spellcheck > > > > I like the idea. I think the current dictionary is not the way to go > though. > > > > Eg. I do a search for ‘harry poter’ and it suggests ‘barry potter’ because > it tries to fix both words. I have a proof-of-concept shingle based > dictionary online now which is appropriately suggesting ‘harry potter’ > because the whole phrase is more common but I need to play some more. I > suspect both dictionaries have a use, particularly if you want to display > the interface you’ve suggested. > > > > *Greg Pendlebury* > Electronic Services Officer (Systems Team) > Division of Academic Information Services > University of Southern Queensland > Phone: +61 7 4631 1501 > Fax: +61 7 4631 1841 > > *From:* Osullivan L. [mailto:L.O...@sw...] > *Sent:* Thursday, 22 October 2009 8:31 PM > *To:* Greg Pendlebury; Eoghan Ó Carragáin > *Cc:* vuf...@li... > *Subject:* RE: [VuFind-Tech] Idle thoughts on spellcheck > > > > Greetings All, > > > > Another thought: How about single word spelling variations, plus a “Did you > mean” phrase option? > > > > E.g. Iresh Landskape would return: > > > > Did you mean Irish Landscape? 
> > > > Perhaps you should try some spelling variations: > > Iresh >> Irish, Fresh, Ires > > Landskape >> Landscape, Landscapes, Landspae > > > > Kind Regards, > > > > Luke > > > > PS: I have it working on my test install but the code is very messy and not > suitable for addition > > > ------------------------------ > > *From:* Greg Pendlebury [mailto:Gre...@us...] > *Sent:* 22 October 2009 07:26 > *To:* 'Eoghan Ó Carragáin' > *Cc:* vuf...@li... > *Subject:* Re: [VuFind-Tech] Idle thoughts on spellcheck > > > > Hmmm, I was looking in the solr 1.4 book that was advertised on the list a > while back and think it’s worth revisiting phonetic filters. > > > > I was using PhoneticFilterFactory and changing encoding (as per : > http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PhoneticFilterFactory > ) > > > > The book mentions a specific DoubleMetaphoneFilterFactory that can be given > a parameter for ‘maxCodeLength’. It defaults to four and could possibly > account for the appearance of inbuilt stemming in my original testing. > > > > *Greg Pendlebury* > Electronic Services Officer (Systems Team) > Division of Academic Information Services > University of Southern Queensland > Phone: +61 7 4631 1501 > Fax: +61 7 4631 1841 > > *From:* Eoghan Ó Carragáin [mailto:eog...@gm...] > *Sent:* Tuesday, 6 October 2009 6:35 PM > *To:* Greg Pendlebury > *Cc:* Demian Katz; vuf...@li... > *Subject:* Re: [VuFind-Tech] Idle thoughts on spellcheck > > > > Hi, > If phonetic filtering is reliable and doesn't produce too many unexpected > results (as stemming sometimes can), then it would certainly be easier to > maintain than a synonym list. Thanks for the data. > Eoghan > > 2009/10/6 Greg Pendlebury <Gre...@us...> > > We considered synonyms but discarded the idea some time ago. It is awkward > to create and maintain, and it doesn’t cover other issues where a spelling > suggestion on lower hit rate items would also help (rowling vs rolling vs > rowing). 
> > > > I think phonetic filtering would be the answer to international spelling > variations, but our testing has indicated the four phonetic filters in solr > currently can’t handle what we need. I can’t remember whether I’d attached > this before but some basic test data is attached. Refined Soundex is the > closest we got, but it has some issues with S and Z comparison. The others > all have stemming built into them on top of phonetics it would seem… most > annoying. > > > > > > *Greg Pendlebury* > Electronic Services Officer (Systems Team) > Division of Academic Information Services > University of Southern Queensland > Phone: +61 7 4631 1501 > Fax: +61 7 4631 1841 > > *From:* Eoghan Ó Carragáin [mailto:eog...@gm...] > *Sent:* Tuesday, 6 October 2009 10:04 AM > *To:* Greg Pendlebury > *Cc:* Demian Katz; vuf...@li... > > > *Subject:* Re: [VuFind-Tech] Idle thoughts on spellcheck > > > > Hi Greg, > Have you thought about using the Solr's synonym filter as an alternative > solution to this problem? These spelling variations aren't really synonyms, > but it might work as a hack. Depending on how you use the synonym filter<http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory>you could ensure that your index contains only the AU/UK version (i.e. > replace the US version), or both the AU/UK and US version etc. In either > case, I suppose the user wouldn't be presented with any "Did you mean?", but > hopefully they'd find what they're looking for. 
> > I found it surprisingly hard to get a comprehensive list of these spelling > variations (misguided search strategy, perhaps), but these links provide a > starting point: > > - https://wiki.ubuntu.com/EnglishTranslation/WordSubstitution > - > http://web.archive.org/web/20040611050511/www.peak.org/~jeremy/dictionary/tables/<http://web.archive.org/web/20040611050511/www.peak.org/%7Ejeremy/dictionary/tables/> > - > http://en.wikipedia.org/wiki/American_and_British_English_spelling_differences > - http://answers.google.com/answers/threadview/id/211486.html > > I haven't had a chance to try this out myself, but some of the journal > articles indexed in our sources site contain lots of medical terms where > ae/e and oe/e appear both in usages, so I'm interested in finding a solution > to this problem too. > > Eoghan > > PS - the following lists may also be of interest. They aren't spelling > variations, more "American - British" synonym lists: > > - http://www.bg-map.com/us-uk.html#1 > - http://esl.about.com/library/vocabulary/blbritam.htm > > > > 2009/10/5 Greg Pendlebury <Gre...@us...> > > The nightly build (latest) fixed my second point. The first point is still > a big hurdle though. Can’t see the librarians being happy with our catalogue > suggesting the users switching to US spelling. > > > > *Greg Pendlebury* > Electronic Services Officer (Systems Team) > Division of Academic Information Services > University of Southern Queensland > Phone: +61 7 4631 1501 > Fax: +61 7 4631 1841 > > *From:* Demian Katz [mailto:dem...@vi...] > *Sent:* Monday, 5 October 2009 11:39 PM > > > *To:* Greg Pendlebury; 'vuf...@li...' > *Subject:* RE: Idle thoughts on spellcheck > > > > Thanks for the update -- I've added another comment to JIRA with this > link. Let us know how things work out for you! > > > > - Demian > > > > > ------------------------------ > > This email (including any attached files) is confidential and is for the > intended recipient(s) only. 
|
From: Greg P. <Gre...@us...> - 2009-10-29 03:03:35
|
Some responses are based on my edumacated guesses only. I have NOT tested all of these:

* what happens when the user enters a phrase with more than two words (is the shingle dictionary ignored, or is the query somehow tokenised into shingles?)?

The phrase "term1 term2 term3" gets shingled into two tokens: "term1 term2" and "term2 term3". If we upped max shingle to 3 you'd get three tokens: the two I just mentioned, plus a shingle containing all three words. The filter would also return each individual word as a token, except that I disabled that in the schema (it defeats the purpose in the context of a dictionary).

Originally I thought I could do it using ONLY a shingle dictionary, so I patched solr to enable a new parameter which returned single words if it couldn't make a shingle. It just wasn't enough though, so I went to using the single word dictionary as well. I think the patch is unnecessary now.

* if it is not ignored when there is more than two words, what happens when a user enters three words with a correctly spelled middle word separating the other two misspelled words (does this fail, and default to the single word suggestion?)?

Assuming the three words are a phrase that validly appears in your data (or the corrected phrases are more popular in your data), you'd get two suggested shingle corrections because it found two better shingle matches, ie. one for "term1 term2" and one for "term2 term3". At this stage you'd need two clicks to correct your search though, one phrase at a time.

If the words aren't ALL a phrase, but some of them are, eg. "harry poter rowlling", you only get one suggestion from the shingle dictionary: "harry potter". You also get suggestions from the normal dictionary for "rowlling" (in my case "rolling" and "rowling"). Only "rowlling" gets normal suggestions because the search object filters out normal suggestions if it detects that they are already present inside one of the shingle suggestions.

Do you think there is any point considering a max number higher than 2?
I wouldn't rule it out without experimentation, but I suspect it would be horribly confusing, both to code and to represent in the interface. I think larger shingles work as a searchable hidden index because more data in there can lead to better hits, but once you need to present/filter that data there's a lot of replication to handle (and confuse your user), and UI space consumed. It WOULD resolve the two click process I mentioned above though, which is why I wouldn't discount it out of hand.
|
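The shingling and suggestion filtering described in the message above can be sketched roughly as follows. This is only an illustration in Python, not VuFind's PHP code or Solr's shingle filter: the function names `shingles` and `merge_suggestions` are hypothetical, the suggestion data is hard-coded from the "harry poter rowlling" example instead of coming from Solr, and the filtering rule (drop a single word suggestion when it already appears inside a suggested phrase) is an assumption about what the search object does.

```python
from typing import Dict, List


def shingles(query: str, max_size: int = 2) -> List[str]:
    """Return word n-grams of 2..max_size words.

    Single words are deliberately not emitted, mirroring the message's
    point that unigrams were disabled for the shingle dictionary.
    """
    words = query.split()
    out: List[str] = []
    for size in range(2, max_size + 1):
        for start in range(len(words) - size + 1):
            out.append(" ".join(words[start:start + size]))
    return out


def merge_suggestions(
    shingle_suggestions: Dict[str, List[str]],
    word_suggestions: Dict[str, List[str]],
) -> Dict[str, List[str]]:
    """Give phrase suggestions priority over single word suggestions.

    Assumption: a single word suggestion is dropped when the suggested
    word already appears inside one of the suggested phrases.
    """
    covered = set()
    for phrases in shingle_suggestions.values():
        for phrase in phrases:
            covered.update(phrase.split())

    merged = dict(shingle_suggestions)
    for word, candidates in word_suggestions.items():
        kept = [c for c in candidates if c not in covered]
        if kept:
            merged[word] = kept
    return merged


if __name__ == "__main__":
    print(shingles("term1 term2 term3"))
    print(shingles("term1 term2 term3", max_size=3))

    # Hard-coded stand-ins for the two dictionary lookups.
    from_shingle_dictionary = {"harry poter": ["harry potter"]}
    from_word_dictionary = {
        "poter": ["potter"],
        "rowlling": ["rolling", "rowling"],
    }
    print(merge_suggestions(from_shingle_dictionary, from_word_dictionary))
```

With max 2 the three word query yields two shingles, and with max 3 it yields a third shingle covering the whole query. In the merge step the "potter" suggestion is dropped because "harry potter" already covers it, while the "rowlling" suggestions survive.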
From: Eoghan Ó C. <eog...@gm...> - 2009-10-29 08:47:17
|
Hi Greg, Thanks for the examples and explanation. From the sounds of it the current combination of 2 word shingle and single word dictionary offers the user a lot (I like that the single word suggestions don't duplicate the shingle suggestions). As you say, a 3 word could be useful but perhaps not worth it... Thanks, Eoghan 2009/10/29 Greg Pendlebury <Gre...@us...> > Some responses are based on my edumacated guesses only. I have NOT tested > all of these: > > > > - what happens when the user enters a phrase with more than two words > (is the shingle dictionary ignored, or is the query somehow tokenised into > shingles?)? > > > > The phrase “term1 term2 term3” gets shingled into two tokens => “term1 > term2” and “term2 term3” If we up’d max shingle to 3 you’d get three tokens, > the two I just mentioned, and a shingle containing all three words. The > filter would also be returning each individual word as a shingle except I > disabled that in the schema (it defeats the purpose in the context of a > dictionary). > > > > Originally I thought I could do it using ONLY a shingle dictionary so I > patched solr to enable a new parameter which returned single words if it > couldn’t make a shingle. It just wasn’t enough though, so I went to using > the single dictionary as well. The patch is unnecessary now I think. > > > > - if it is not ignored when there is more than two words, what happens > when a user enters three words with a correctly spelled middle word > separating the other two misspelled words (does this fail, and default to > the single word suggestion?)? > > > > Assuming the three words are a phrase that validly appears in your data (or > are more popular in your data) you’d get two shingle suggested corrections > because it found two better shingle matches. Ie. “term1 term2” and “term2 > term3”. At this stage you’d need two clicks to correct your search though, > one phrase at a time. > > > > If the words aren’t ALL a phrase, but some of them are. Eg. 
“harry poter > rowlling” you only get one suggestion from the shingle dictionary: “harry > potter”, but you also get a suggestion from the normal dictionary for > “rowlling” (in my case “rolling” and “rowling”). This is because the search > object filters out normal suggestions if it detects that they are already > present inside one of the shingle suggestions. > > > > Do you think there is any point considering a max number higher than 2? > > I wouldn’t rule it out without experimentation, but I suspect it would be > horribly confusing, both to code and to represent in the interface. I think > larger shingles work as a searchable hidden index because more data in there > can lead to better hits, but once you need to present/filter that data > there’s a lot of replication to handle (and confuse your user), and UI space > consumed. It WOULD resolve the two click process I mentioned above though, > which is why I wouldn’t discount it out of hand. > > > > > > > > 2009/10/29 Greg Pendlebury <Gre...@us...> > > Yes, that’s with max 2. Unfortunately don’t have any benchmarking tools > setup as yet, but I would also be interested in those results. I don’t > notice any degradations, but solr is so fast I don’t think anything but > proper benchmark testing would give an accurate indication. There must be > some degradations obviously because of the extra query being forced. How > index size would impact there I could only guess. > > > > *Greg Pendlebury* > Electronic Services Officer (Systems Team) > Division of Academic Information Services > University of Southern Queensland > Phone: +61 7 4631 1501 > Fax: +61 7 4631 1841 > > *From:* Eoghan Ó Carragáin [mailto:eog...@gm...] > *Sent:* Thursday, 29 October 2009 11:29 AM > > > *To:* Greg Pendlebury > *Cc:* Demian Katz; vuf...@li... > *Subject:* Re: [VuFind-Tech] Idle thoughts on spellcheck > > > > Hi Greg, > Interesting figures. Is this a shingle dictionary with a maxShingleSize of > 2? 
Looks like anything higher would produce very large dictionaries. > > Do you have a sense of how well the spellcheck performs against the larger > index, i.e. any perceptible delay in returning suggestions? > > Thanks, > Eoghan > > 2009/10/29 Greg Pendlebury <Gre...@us...> > > Fair enough. I forgot to mention last time some basic stats I took on the > size and build time: > > > > Normal dictionary against a 1.12 GB index of almost 400k records was 30 MB > and took 30 seconds to build. > > Shingle dictionary against a 1.23 GB index (same records + shingles) was > 600 MB and took 15 minutes to build. > > > > The shingles weren’t built against ‘allfields’ because of the issue of > shingles incorrectly being created across two fields, just the major text > fields like title, author, topic etc. > > > > *Greg Pendlebury* > Electronic Services Officer (Systems Team) > Division of Academic Information Services > University of Southern Queensland > Phone: +61 7 4631 1501 > Fax: +61 7 4631 1841 > > *From:* Demian Katz [mailto:dem...@vi...] > *Sent:* Thursday, 29 October 2009 12:46 AM > > > *To:* Greg Pendlebury; 'vuf...@li...' > > *Subject:* RE: [VuFind-Tech] Idle thoughts on spellcheck > > > > One possibility would be to set up the two-dictionary approach but prefix > the relevant settings with a comment indicating that the configuration can > be simplified by applying the appropriate patch. > > > > - Demian > > > > *From:* Greg Pendlebury [mailto:Gre...@us...] > *Sent:* Tuesday, October 27, 2009 11:01 PM > *To:* 'vuf...@li...' > *Subject:* Re: [VuFind-Tech] Idle thoughts on spellcheck > > > > I was reconsidering this today in light of wanting a reasonably functional > dictionary in RC2. The rebuilt solr was to “…allow the shingle filter to > return single word tokens at query time if no shingles could be made”. It > was my draft 1, but then I added the single word dictionary as a fallback… > so perhaps the patch is unnecessary. 
> > > > What’s the feeling on ‘value for money’ in a two dictionary approach in RC2 > if the patch is not needed? > > > > *Greg Pendlebury* > Electronic Services Officer (Systems Team) > Division of Academic Information Services > University of Southern Queensland > Phone: +61 7 4631 1501 > Fax: +61 7 4631 1841 > > *From:* Greg Pendlebury > *Sent:* Friday, 23 October 2009 3:54 PM > *To:* 'vuf...@li...' > *Subject:* RE: [VuFind-Tech] Idle thoughts on spellcheck > > > > I’ll try that again. Should have thought that posting a 4 MB file to the > list was bad :) > > > > It’s here : http://www.usq.edu.au/library/ereserve/solr.war > > > > And the patch is attached. > > > > *Greg Pendlebury* > Electronic Services Officer (Systems Team) > Division of Academic Information Services > University of Southern Queensland > Phone: +61 7 4631 1501 > Fax: +61 7 4631 1841 > > *From:* Greg Pendlebury > *Sent:* Friday, 23 October 2009 3:39 PM > *To:* vuf...@li... > *Subject:* RE: [VuFind-Tech] Idle thoughts on spellcheck > > > > Ok, this is experimental, but I like the behaviour. Drawback is solr can’t > currently handle it. Attached is a rebuilt version of solr from the current > trunk with SOLR-744 applied: http://issues.apache.org/jira/browse/SOLR-744 > It was rebuilt against lucene v2.9 stable with LUCENE-1370 applied : > http://issues.apache.org/jira/browse/LUCENE-1370. These two patches allow > the shingle filter to return single word tokens at query time if no shingles > could be made. > > > > The vufind patch also attached makes use of this to build a shingle > spelling dictionary. Solr.php has been adjusted to allow the dictionary to > be changed by SearchObject.php and the search object now looks for phrase > matches (two word phrases only so far) in the shingle dictionary, before > falling back to the basic dictionary (submits a second search). 
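A rough sketch of the lookup order just described: two-word phrases are checked against the shingle dictionary first, then single words against the basic dictionary in a second pass. This is Python rather than the PHP of Solr.php and SearchObject.php, the two dictionaries are plain in-memory dicts standing in for Solr spellcheck requests, and the function name, variable names and sample entries are all hypothetical rather than taken from the attached patch. It also assumes the second lookup is always issued, which the message does not state outright.

```python
def suggest(query, shingle_dict, basic_dict):
    """Return (phrase_suggestions, word_suggestions) for a query.

    shingle_dict and basic_dict are hypothetical stand-ins for the two
    spellcheck dictionaries; each maps a query token to its suggestions.
    """
    words = query.lower().split()

    # First pass: adjacent word pairs against the shingle dictionary.
    phrase_suggestions = {}
    for first, second in zip(words, words[1:]):
        phrase = f"{first} {second}"
        if phrase in shingle_dict:
            phrase_suggestions[phrase] = shingle_dict[phrase]

    # Second pass (the "second search"): single words against the basic dictionary.
    word_suggestions = {w: basic_dict[w] for w in words if w in basic_dict}
    return phrase_suggestions, word_suggestions


# Illustrative data only, loosely based on the examples in this thread.
shingle_dict = {"harry poter": ["harry potter"]}
basic_dict = {
    "harry": ["barry"],
    "poter": ["potter"],
    "rowlling": ["rolling", "rowling"],
}
phrases, singles = suggest("harry poter rowlling", shingle_dict, basic_dict)
```

Returning the two result sets separately leaves it to the caller to decide how phrase suggestions take priority over single-word ones.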
> > > > It generally improves the issue of individual word suggestions breaking a > phrase (by giving priority to phrase suggestions), but still allows for the > fact that individual words could be wrong (if they weren’t part of a > phrase). Aside from feedback and experimentation (obviously why I’m posting) > I’m quite happy with this method and think I’ll shelve it as our (USQ) > solution once those patches make it into release. > > > > I haven’t done any performance benchmarking on the impacts of this > technique, so they could of course be a killer to the idea. I’m not aware of > a way to get suggestions from both dictionaries with one query… but I > haven’t looked into it either. > > > > *Greg Pendlebury* > Electronic Services Officer (Systems Team) > Division of Academic Information Services > University of Southern Queensland > Phone: +61 7 4631 1501 > Fax: +61 7 4631 1841 > > *From:* Greg Pendlebury [mailto:Gre...@us...] > *Sent:* Friday, 23 October 2009 9:13 AM > *To:* 'Osullivan L.'; Eoghan Ó Carragáin > *Cc:* vuf...@li... > *Subject:* Re: [VuFind-Tech] Idle thoughts on spellcheck > > > > I like the idea. I think the current dictionary is not the way to go > though. > > > > Eg. I do a search for ‘harry poter’ and it suggests ‘barry potter’ because > it tries to fix both words. I have a proof-of-concept shingle based > dictionary online now which is appropriately suggesting ‘harry potter’ > because the whole phrase is more common but I need to play some more. I > suspect both dictionaries have a use, particularly if you want to display > the interface you’ve suggested. > > > > *Greg Pendlebury* > Electronic Services Officer (Systems Team) > Division of Academic Information Services > University of Southern Queensland > Phone: +61 7 4631 1501 > Fax: +61 7 4631 1841 > > *From:* Osullivan L. [mailto:L.O...@sw...] > *Sent:* Thursday, 22 October 2009 8:31 PM > *To:* Greg Pendlebury; Eoghan Ó Carragáin > *Cc:* vuf...@li... 
> *Subject:* RE: [VuFind-Tech] Idle thoughts on spellcheck > > > > Greetings All, > > > > Another thought: How about single word spelling variations, plus a “Did you > mean” phrase option? > > > > E.g. Iresh Landskape would return: > > > > Did you mean Irish Landscape? > > > > Perhaps you should try some spelling variations: > > Iresh >> Irish, Fresh, Ires > > Landskape >> Landscape, Landscapes, Landspae > > > > Kind Regards, > > > > Luke > > > > PS: I have it working on my test install but the code is very messy and not > suitable for addition > > > ------------------------------ > > *From:* Greg Pendlebury [mailto:Gre...@us...] > *Sent:* 22 October 2009 07:26 > *To:* 'Eoghan Ó Carragáin' > *Cc:* vuf...@li... > *Subject:* Re: [VuFind-Tech] Idle thoughts on spellcheck > > > > Hmmm, I was looking in the solr 1.4 book that was advertised on the list a > while back and think it’s worth revisiting phonetic filters. > > > > I was using PhoneticFilterFactory and changing encoding (as per : > http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PhoneticFilterFactory > ) > > > > The book mentions a specific DoubleMetaphoneFilterFactory that can be given > a parameter for ‘maxCodeLength’. It defaults to four and could possibly > account for the appearance of inbuilt stemming in my original testing. > > > > *Greg Pendlebury* > Electronic Services Officer (Systems Team) > Division of Academic Information Services > University of Southern Queensland > Phone: +61 7 4631 1501 > Fax: +61 7 4631 1841 > > *From:* Eoghan Ó Carragáin [mailto:eog...@gm...] > *Sent:* Tuesday, 6 October 2009 6:35 PM > *To:* Greg Pendlebury > *Cc:* Demian Katz; vuf...@li... > *Subject:* Re: [VuFind-Tech] Idle thoughts on spellcheck > > > > Hi, > If phonetic filtering is reliable and doesn't produce too many unexpected > results (as stemming sometimes can), then it would certainly be easier to > maintain than a synonym list. Thanks for the data. 
> Eoghan > > 2009/10/6 Greg Pendlebury <Gre...@us...> > > We considered synonyms but discarded the idea some time ago. It is awkward > to create and maintain, and it doesn’t cover other issues where a spelling > suggestion on lower hit rate items would also help (rowling vs rolling vs > rowing). > > > > I think phonetic filtering would be the answer to international spelling > variations, but our testing has indicated the four phonetic filters in solr > currently can’t handle what we need. I can’t remember whether I’d attached > this before but some basic test data is attached. Refined Soundex is the > closest we got, but it has some issues with S and Z comparison. The others > all have stemming built into them on top of phonetics it would seem… most > annoying. > > > > > > *Greg Pendlebury* > Electronic Services Officer (Systems Team) > Division of Academic Information Services > University of Southern Queensland > Phone: +61 7 4631 1501 > Fax: +61 7 4631 1841 > > *From:* Eoghan Ó Carragáin [mailto:eog...@gm...] > *Sent:* Tuesday, 6 October 2009 10:04 AM > *To:* Greg Pendlebury > *Cc:* Demian Katz; vuf...@li... > > > *Subject:* Re: [VuFind-Tech] Idle thoughts on spellcheck > > > > Hi Greg, > Have you thought about using Solr's synonym filter as an alternative > solution to this problem? These spelling variations aren't really synonyms, > but it might work as a hack. Depending on how you use the synonym filter <http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory> you could ensure that your index contains only the AU/UK version (i.e. > replace the US version), or both the AU/UK and US version etc. In either > case, I suppose the user wouldn't be presented with any "Did you mean?", but > hopefully they'd find what they're looking for. 
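To make the two indexing options concrete, here is a toy Python model of them: either the US spelling is replaced by the AU/UK one, or both forms are kept. It is only an illustration of the idea, not Solr's SynonymFilterFactory or its configuration syntax; the mapping, the function name and the `keep_original` flag are all made up for this sketch.

```python
# Hypothetical US -> AU/UK spelling map; a real list would be far longer.
US_TO_UK = {"behavior": "behaviour", "color": "colour"}


def normalise(tokens, mapping, keep_original=False):
    """Rewrite tokens using a spelling map.

    keep_original=False replaces the US form, so only one variant is indexed.
    keep_original=True emits both forms, so either spelling will match.
    """
    out = []
    for token in tokens:
        variant = mapping.get(token)
        if variant is None:
            out.append(token)                # not a known variant, keep as-is
        elif keep_original:
            out.extend([token, variant])     # index both spellings
        else:
            out.append(variant)              # index the AU/UK spelling only
    return out


replaced = normalise(["human", "behavior"], US_TO_UK)
expanded = normalise(["human", "behavior"], US_TO_UK, keep_original=True)
```

With replacement the same mapping would also have to be applied to the user's query, otherwise a search for the US spelling finds nothing; keeping both forms avoids that at the cost of a larger index.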
> > I found it surprisingly hard to get a comprehensive list of these spelling > variations (misguided search strategy, perhaps), but these links provide a > starting point: > > - https://wiki.ubuntu.com/EnglishTranslation/WordSubstitution > - > http://web.archive.org/web/20040611050511/www.peak.org/~jeremy/dictionary/tables/ > - > http://en.wikipedia.org/wiki/American_and_British_English_spelling_differences > - http://answers.google.com/answers/threadview/id/211486.html > > I haven't had a chance to try this out myself, but some of the journal > articles indexed in our sources site contain lots of medical terms where > both the ae/e and oe/e variants appear, so I'm interested in finding a solution > to this problem too. > > Eoghan > > PS - the following lists may also be of interest. They aren't spelling > variations, more "American - British" synonym lists: > > - http://www.bg-map.com/us-uk.html#1 > - http://esl.about.com/library/vocabulary/blbritam.htm > > > > 2009/10/5 Greg Pendlebury <Gre...@us...> > > The nightly build (latest) fixed my second point. The first point is still > a big hurdle though. Can’t see the librarians being happy with our catalogue > suggesting that users switch to US spelling. > > > > *Greg Pendlebury* > Electronic Services Officer (Systems Team) > Division of Academic Information Services > University of Southern Queensland > Phone: +61 7 4631 1501 > Fax: +61 7 4631 1841 > > *From:* Demian Katz [mailto:dem...@vi...] > *Sent:* Monday, 5 October 2009 11:39 PM > > > *To:* Greg Pendlebury; 'vuf...@li...' > *Subject:* RE: Idle thoughts on spellcheck > > > > Thanks for the update -- I've added another comment to JIRA with this > link. Let us know how things work out for you! > > > > - Demian > > > > > ------------------------------ > > This email (including any attached files) is confidential and is for the > intended recipient(s) only. 
If you received this email by mistake, please, > as a courtesy, tell the sender, then delete this email. > > The views and opinions are the originator's and do not necessarily reflect > those of the University of Southern Queensland. Although all reasonable > precautions were taken to ensure that this email contained no viruses at the > time it was sent we accept no liability for any losses arising from its > receipt. > > The University of Southern Queensland is a registered provider of education > with the Australian Government (CRICOS Institution Code No's. QLD 00244B / > NSW 02225M) > |
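For readers skimming the thread, the shingling and suggestion filtering Greg describes can be sketched roughly as follows. This is a Python illustration, not the Solr shingle filter or the VuFind search object: `shingles` and `drop_covered` are hypothetical names, and the filter assumes that "already present inside one of the shingle suggestions" means the query word belongs to a shingle that already received a phrase suggestion. The sample suggestion data is illustrative only.

```python
def shingles(text, max_size=2):
    """Return word n-grams of 2..max_size words.

    Single words are deliberately not emitted, mirroring the schema
    setting described in the thread.
    """
    words = text.split()
    out = []
    for start in range(len(words)):
        for size in range(2, max_size + 1):
            if start + size <= len(words):
                out.append(" ".join(words[start:start + size]))
    return out


def drop_covered(single_suggestions, shingle_suggestions):
    """Drop single-word suggestions for words covered by a corrected shingle.

    Both arguments map a query token (word or shingle) to its suggestions.
    """
    covered = {word for shingle in shingle_suggestions for word in shingle.split()}
    return {
        term: options
        for term, options in single_suggestions.items()
        if term not in covered
    }


two_word = shingles("term1 term2 term3")
three_word = shingles("term1 term2 term3", max_size=3)

remaining = drop_covered(
    {"harry": ["barry"], "poter": ["potter"], "rowlling": ["rolling", "rowling"]},
    {"harry poter": ["harry potter"]},
)
```

With a maximum shingle size of 2 the three-word phrase yields two tokens, and raising it to 3 adds the full phrase as a third. In the "harry poter rowlling" case only the suggestions for "rowlling" survive the filter, which matches the behaviour reported above.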