From: Demian K. <dem...@vi...> - 2010-08-31 12:28:55
Glad to help. I'm always happy to receive enhancement patches if you find things you think are worth sharing.

Also, how are you managing your software versions? Upgrading a customized 1.0 to 1.0.1 should actually be pretty easy if you use something like Subversion vendor branching. I'll share details if you're interested.

- Demian

> -----Original Message-----
> From: sol...@go... [mailto:solrmarc-te...@go...] On Behalf Of mjk...@go...
> Sent: Tuesday, August 31, 2010 7:14 AM
> To: Demian Katz
> Cc: sol...@go...; vuf...@li...
> Subject: Re: [solrmarc-tech] Re: [VuFind-Tech] umlauts
>
> Dear Demian,
>
> thank you very much for pointing this out. I will switch to 1.0.1 soon,
> but I have made so many small changes to 1.0 that moving them all to a
> new source will be an expensive task.
>
> I may ask you to include one or another of them if I think it could
> make sense for others.
>
> Best,
> Leander
>
> On 30.08.2010 19:44, Demian Katz wrote:
> > Bob,
> >
> > I just ran your sample string ("Denkmäler Denkmäler Denkmaler")
> > through analysis on my VuFind test server and observed a very
> > interesting result. The middle Denkmäler (the non-combining
> > representation) does not get stemmed by the
> > SnowballPorterFilterFactory, but the other two representations do
> > get stemmed. The bottom of my analysis result looks like this:
> >
> > denkmal denkmaler denkmal
> >
> > I can see how this could result in failed searches -- if your search
> > query normalizes to "denkmaler" but the indexed term normalized to
> > "denkmal," there won't be a match!
> >
> > I wonder if this is a bug in the SnowballPorterFilterFactory -- do
> > you think we should report it to the solr-user list?
> >
> > Leander,
> >
> > We recently (post-1.0.1 release) made some changes to VuFind to
> > ensure that unstemmed matches rank higher than stemmed matches. I
> > believe that this adjustment actually introduces a workaround for
> > your problem... at least, the "Denkmäler Denkmäler Denkmaler" search
> > fails on the 1.0.1 demo server but succeeds on my post-1.0.1 test
> > server. If you want to try adjusting your configuration accordingly,
> > see this JIRA ticket:
> >
> > http://vufind.org/jira/browse/VUFIND-259
> >
> > - Demian
> >
> >> -----Original Message-----
> >> From: sol...@go... [mailto:solrmarc-te...@go...] On Behalf Of Robert Haschart
> >> Sent: Monday, August 30, 2010 12:42 PM
> >> To: sol...@go...
> >> Cc: MJKL Seige
> >> Subject: Re: [solrmarc-tech] Re: [VuFind-Tech] umlauts
> >>
> >> Jonathan,
> >>
> >> The ASCIIFoldingFilterFactory did exist at the time I started
> >> creating the UnicodeNormalizationFilterFactory; however, the version
> >> that existed then really didn't work very well. IIRC all it did was
> >> delete all combining accent marks, which would cause the exact
> >> problem MJKL Seige seems to be having.
> >>
> >> MJKL Seige,
> >>
> >> I tried using the Solr analysis page to see how the
> >> UnicodeNormalizationFilterFactory works when given the two different
> >> representations of the a with umlaut, and with the settings I am
> >> using, both ways of encoding the 'a with umlaut' are mapped to a
> >> bare 'a'.
> >>
> >> Go to the Solr admin page for your index, and click on the
> >> [ANALYSIS] link near the top of the page.
> >>
> >> Enter the string "text" in the Field box, enter "Musikalische
> >> Denkmäler Denkmäler Denkmaler" in both the Field value (Index) box
> >> and the Field value (Query) box, check both "verbose output"
> >> checkboxes, and then click the "Analyze" button.
> >>
> >> The results I see are shown below. They show that for both forms of
> >> "a with umlaut" the UnicodeNormalizationFilterFactory replaces the
> >> "a with umlaut" with an 'a'. It could be that either you are giving
> >> the UnicodeNormalizationFilterFactory different parameters or that
> >> it is not actually included in the Analyzer sequence you are using.
> >>
> >> Index Analyzer
> >>
> >> org.apache.solr.analysis.HTMLStripWhitespaceTokenizerFactory {}
> >> term position     1             2          3          4
> >> term text         Musikalische  Denkmäler  Denkmäler  Denkmaler
> >> term type         word          word       word       word
> >> source start,end  0,12          13,23      25,34      35,44
> >> payload
> >>
> >> schema.UnicodeNormalizationFilterFactory {composed=false,
> >> remove_modifiers=true, fold=true, version=icu4j,
> >> remove_diacritics=true}
> >> term position     1             2          3          4
> >> term text         Musikalische  Denkmaler  Denkmaler  Denkmaler
> >> term type         word          word       word       word
> >> source start,end  0,12          13,23      25,34      35,44
> >> payload
> >>
> >> schema.CJKFilterFactory {bigrams=false}
> >> term position     1             2          3          4
> >> term text         Musikalische  Denkmaler  Denkmaler  Denkmaler
> >> term type         word          word       word       word
> >> source start,end  0,12          13,23      25,34      35,44
> >> payload
> >>
> >> org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt,
> >> ignoreCase=true, enablePositionIncrements=true}
> >> term position     1             2          3          4
> >> term text         Musikalische  Denkmaler  Denkmaler  Denkmaler
> >> term type         word          word       word       word
> >> source start,end  0,12          13,23      25,34      35,44
> >> payload
> >>
> >> org.apache.solr.analysis.WordDelimiterFilterFactory
> >> {generateNumberParts=1, catenateWords=1, generateWordParts=1,
> >> catenateAll=0, catenateNumbers=1}
> >> term position     1             2          3          4
> >> term text         Musikalische  Denkmaler  Denkmaler  Denkmaler
> >> term type         word          word       word       word
> >> source start,end  0,12          13,23      25,34      35,44
> >> payload
> >>
> >> org.apache.solr.analysis.LowerCaseFilterFactory {}
> >> term position     1             2          3          4
> >> term text         musikalische  denkmaler  denkmaler  denkmaler
> >> term type         word          word       word       word
> >> source start,end  0,12          13,23      25,34      35,44
> >> payload
> >>
> >> org.apache.solr.analysis.EnglishPorterFilterFactory
> >> {protected=protwords.txt}
> >> term position     1             2          3          4
> >> term text         musikalisch   denkmal    denkmal    denkmal
> >> term type         word          word       word       word
> >> source start,end  0,12          13,23      25,34      35,44
> >> payload
> >>
> >> org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
> >> term position     1             2          3          4
> >> term text         musikalisch   denkmal    denkmal    denkmal
> >> term type         word          word       word       word
> >> source start,end  0,12          13,23      25,34      35,44
> >> payload
> >>
> >>
> >> Query Analyzer
> >>
> >> org.apache.solr.analysis.WhitespaceTokenizerFactory {}
> >> term position     1             2          3          4
> >> term text         Musikalische  Denkmäler  Denkmäler  Denkmaler
> >> term type         word          word       word       word
> >> source start,end  0,12          13,23      25,34      35,44
> >> payload
> >>
> >> schema.UnicodeNormalizationFilterFactory {composed=false,
> >> remove_modifiers=true, fold=true, version=icu4j,
> >> remove_diacritics=true}
> >> term position     1             2          3          4
> >> term text         Musikalische  Denkmaler  Denkmaler  Denkmaler
> >> term type         word          word       word       word
> >> source start,end  0,12          13,23      25,34      35,44
> >> payload
> >>
> >> schema.CJKFilterFactory {bigrams=false}
> >> term position     1             2          3          4
> >> term text         Musikalische  Denkmaler  Denkmaler  Denkmaler
> >> term type         word          word       word       word
> >> source start,end  0,12          13,23      25,34      35,44
> >> payload
> >>
> >> org.apache.solr.analysis.SynonymFilterFactory
> >> {synonyms=synonyms.txt, expand=true, ignoreCase=true}
> >> term position     1             2          3          4
> >> term text         Musikalische  Denkmaler  Denkmaler  Denkmaler
> >> term type         word          word       word       word
> >> source start,end  0,12          13,23      25,34      35,44
> >> payload
> >>
> >> org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt,
> >> ignoreCase=true, enablePositionIncrements=true}
> >> term position     1             2          3          4
> >> term text         Musikalische  Denkmaler  Denkmaler  Denkmaler
> >> term type         word          word       word       word
> >> source start,end  0,12          13,23      25,34      35,44
> >> payload
> >>
> >> org.apache.solr.analysis.WordDelimiterFilterFactory
> >> {generateNumberParts=1, catenateWords=0, generateWordParts=1,
> >> catenateAll=0, catenateNumbers=0}
> >> term position     1             2          3          4
> >> term text         Musikalische  Denkmaler  Denkmaler  Denkmaler
> >> term type         word          word       word       word
> >> source start,end  0,12          13,23      25,34      35,44
> >> payload
> >>
> >> org.apache.solr.analysis.LowerCaseFilterFactory {}
> >> term position     1             2          3          4
> >> term text         musikalische  denkmaler  denkmaler  denkmaler
> >> term type         word          word       word       word
> >> source start,end  0,12          13,23      25,34      35,44
> >> payload
> >>
> >> org.apache.solr.analysis.EnglishPorterFilterFactory
> >> {protected=protwords.txt}
> >> term position     1             2          3          4
> >> term text         musikalisch   denkmal    denkmal    denkmal
> >> term type         word          word       word       word
> >> source start,end  0,12          13,23      25,34      35,44
> >> payload
> >>
> >> org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
> >> term position     1             2          3          4
> >> term text         musikalisch   denkmal    denkmal    denkmal
> >> term type         word          word       word       word
> >> source start,end  0,12          13,23      25,34      35,44
> >> payload
> >>
> >>
> >> Jonathan Rochkind wrote:
> >>
> >>> Are you completely sure that both of those are actually valid
> >>> escaped UTF-8 encodings for ä?
> >>>
> >>> It looks like they are, and the %61%CC%88 is the combining
> >>> diaeresis version?
> >>>
> >>> If that's all true, then, sorry, you're right, that would appear
> >>> to be a bug in the unicode normalization filter, I guess? Either
> >>> wait for Bob to chime in, or dig into the source and try to fix it
> >>> yourself. I'm not sure where the source is, or if it's even public
> >>> though, so maybe we need Bob here.
> >>>
> >>> Alternately, you could abandon Bob's unicode normalization filter
> >>> entirely, and use the one that's actually part of lucene/solr now
> >>> (which didn't exist when Bob wrote his).
> >>>
> >>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ASCIIFoldingFilterFactory
> >>>
> >>> Not entirely sure what the differences are, but that one may work
> >>> better for you, and is standard solr instead of custom Bob, so you
> >>> might want to try that.
> >>>
> >>> The fact that Bob's filter temporarily converts the MARC8 to
> >>> non-combined UTF-8 doesn't matter if it ultimately converts it to
> >>> just plain ascii though, right? That's just an intermediate step
> >>> that never makes it to the index; what makes it to the index is
> >>> just ascii without any diaeresis at all, due to the filter. So the
> >>> first question is whether you want to fold everything to ASCII or
> >>> not -- it works well for many interfaces that will be used
> >>> primarily with English and by English speakers, but may not be what
> >>> you want in some circumstances. Depending on whether it's what you
> >>> want or not, you have different options available.
> >>>
> >>> Jonathan
> >>>
> >>> MJKL Seige wrote:
> >>>
> >>>> Dear Jonathan,
> >>>>
> >>>> thank you for your answer, but my question was a little bit
> >>>> different. There are obviously two ways to encode an ä in UTF-8:
> >>>> %C3%A4 and %61%CC%88. The Unicode Normalization Filter does
> >>>> normalize %61%CC%88 to %61, that is, ä to a. This is correct. But
> >>>> it obviously does not normalize %C3%A4 to %61. Unfortunately,
> >>>> browsers seem to send %C3%A4 in form data, and %C3%A4 doesn't
> >>>> match the normalized %61. I would expect the
> >>>> UnicodeNormalizationFilter to normalize %C3%A4 to %61. Hacking the
> >>>> "normalization" of %C3%A4 to %61 into VuFind's Solr.php does
> >>>> produce the expected results.
> >>>>
> >>>> My question was why the UnicodeNormalizationFilter doesn't do
> >>>> that, and where I can read about its functionality and
> >>>> configuration options.
> >>>>
> >>>> Another thing: during the import, the MARC data is converted from
> >>>> MARC8 to UTF8, and the ä from %E8%61 to %61%CC%88. I guess this is
> >>>> because %E8 is "combining diaeresis" in MARC8 and %CC%88 is
> >>>> "combining diaeresis" in UTF8, so this conversion basically makes
> >>>> sense. On the other hand, converting %E8%61 to %C3%A4 would be
> >>>> correct as well, in my opinion.
> >>>>
> >>>> Regards,
> >>>> mjkl
> >>>>
> >>>> On 28.08.10 00:51, Jonathan Rochkind wrote:
> >>>>
> >>>>> The unicode normalization filter indeed does change Denkmäler to
> >>>>> just Denkmaler on indexing. This is in fact the whole point of
> >>>>> the filter. It should also be applied in your solr schema on
> >>>>> query, so a query on Denkmäler is also changed to Denkmaler. The
> >>>>> point of this is that a query for "Denkmäler" should match
> >>>>> "Denkmäler" or "Denkmaler" in the source, and so should a query
> >>>>> for Denkmaler.
> >>>>>
> >>>>> If you don't want that normalization behavior, simply remove the
> >>>>> unicode normalization filter from your field definition, no
> >>>>> problem. That behavior is its purpose. If you remove it, then a
> >>>>> query on "Denkmaler" won't match "Denkmäler" anymore; you'll need
> >>>>> to query on the _exact_ string with diacritics and all to match.
> >>>>> You'd probably want to replace it with some filter (not sure what
> >>>>> is available) that makes sure all your unicode is in Unicode
> >>>>> Normalization Form KC though -- without that, even some queries
> >>>>> on "Denkmäler" won't match "Denkmäler", depending on exactly how
> >>>>> the unicode is formed. Unicode gets tricky in search matching.
> >>>>> (Bob could say if maybe there's an option to his unicode
> >>>>> normalization filter that just uses NFKC, and doesn't flatten to
> >>>>> ascii?).
> >>>>>
> >>>>> Now, as to why Denkmäler isn't matching for you -- somehow your
> >>>>> unicode is getting corrupted by VuFind/PHP in between the browser
> >>>>> and the Solr request. The behavior works as expected in my
> >>>>> Blacklight installation -- although it didn't when I had an
> >>>>> incorrect/missing configuration in my Tomcat installation, which
> >>>>> doesn't apply to you with VuFind so I won't get into details, but
> >>>>> it's an example of how it's definitely possible to have encodings
> >>>>> messed up somewhere in your toolchain before the query hits Solr.
> >>>>> Getting to the bottom of that is a VuFind issue, and not related
> >>>>> to the unicode normalization filter. Even if you remove the
> >>>>> unicode normalization filter you will probably still have this
> >>>>> problem, since it appears that the problem is your unicode
> >>>>> somehow getting corrupted before it gets to the solr request.
> >>>>>
> >>>>> Jonathan
> >>>>> ________________________________________
> >>>>> From: sol...@go... [sol...@go...] On Behalf Of Demian Katz [dem...@vi...]
> >>>>> Sent: Friday, August 27, 2010 3:41 PM
> >>>>> To: MJKL Seige; vuf...@li...
> >>>>> Cc: sol...@go...
> >>>>> Subject: [solrmarc-tech] RE: [VuFind-Tech] umlauts
> >>>>>
> >>>>> First of all, I am copying this message to the solrmarc-tech
> >>>>> list -- this way it should reach the attention of Bob Haschart,
> >>>>> who wrote the UnicodeNormalizationFilter and can say more about
> >>>>> its behavior than I could.
> >>>>>
> >>>>> Regarding your other point about breadcrumb generation in the
> >>>>> default theme, you are right -- it's currently done rather
> >>>>> sloppily. I haven't made a big effort to clean it up since I use
> >>>>> the classic theme myself... but if you have any ideas you want to
> >>>>> share, I'm definitely open to improving the trunk.
> >>>>> You might at least be interested in looking at some of the
> >>>>> multi-byte-capable ucwords() variations discussed here:
> >>>>>
> >>>>> http://php.net/manual/en/function.ucwords.php
> >>>>>
> >>>>> - Demian
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: MJKL Seige [mailto:mjk...@go...]
> >>>>>> Sent: Friday, August 27, 2010 3:33 PM
> >>>>>> To: vuf...@li...
> >>>>>> Subject: [VuFind-Tech] umlauts
> >>>>>>
> >>>>>> Dear all,
> >>>>>>
> >>>>>> I have a little issue with umlauts. I'd like to describe it here
> >>>>>> in the hope that someone can give me a hint on how to solve it:
> >>>>>>
> >>>>>> I import MARC records with MARC-8 encoding. Our example word is
> >>>>>> Denkmäler, and the "ä" in it is encoded as the hexadecimal
> >>>>>> sequence E8 61.
> >>>>>>
> >>>>>> After importing this successfully, I try to search for it by
> >>>>>> typing the word "Denkmäler" in the search form. Nothing is found.
> >>>>>> I check the HTML source and find that the "ä" is encoded as
> >>>>>> C3 A4; this is the UTF-8 "ä".
> >>>>>>
> >>>>>> Now I search again with "Denkmaler", and some "Denkmäler" are
> >>>>>> found. These "ä" are encoded as 61 CC 88; this is the UTF-8 "a"
> >>>>>> followed by a combining diaeresis.
> >>>>>>
> >>>>>> When I modify the URL to ?lookfor=denkm%61%CC%88ler, VuFind shows
> >>>>>> me the same results as for "Denkmaler". I think the
> >>>>>> UnicodeNormalizationFilter changed 61 CC 88 to just 61, that is:
> >>>>>> "Denkmaler".
> >>>>>>
> >>>>>> So, the first thing is that VuFind (solrmarc?) converts the
> >>>>>> MARC-8 "E8 61" to "61 CC 88", not "C3 A4". I don't want to decide
> >>>>>> which is the correct encoding for an ä: "C3 A4" or "61 CC 88".
> >>>>>> But I think "C3 A4" should be normalized to "a" as well?
> >>>>>> Unfortunately I couldn't find much documentation about the
> >>>>>> filters in schema.xml and their args. Where can I read about
> >>>>>> which filters are available and what these options (composed="",
> >>>>>> remove_diacritics="" ...) exactly do?
> >>>>>>
> >>>>>> Or shouldn't "61 CC 88" and "C3 A4" be normalized to "ae"?
> >>>>>>
> >>>>>> Another funny thing is that in VuFind's breadcrumbs the search
> >>>>>> term is written "DenkmäLer"; the L is upper-cased. Why? I guess
> >>>>>> it is because VuFind seems to detect a word boundary at the
> >>>>>> umlaut, which is wrong of course. (Btw, upper-casing the search
> >>>>>> term is wrong anyway, I think, but I wouldn't have found this
> >>>>>> funny thing otherwise ;-)
> >>>>>>
> >>>>>> Best wishes,
> >>>>>> mjkl
> >>>>>>
> >>>>>> _______________________________________________
> >>>>>> Vufind-tech mailing list
> >>>>>> Vuf...@li...
> >>>>>> https://lists.sourceforge.net/lists/listinfo/vufind-tech
> >>>>> --
> >>>>> You received this message because you are subscribed to the
> >>>>> Google Groups "solrmarc-tech" group.
> >>>>> To post to this group, send email to solrmarc-te...@go....
> >>>>> To unsubscribe from this group, send email to sol...@go....
> >>>>> For more options, visit this group at
> >>>>> http://groups.google.com/group/solrmarc-tech?hl=en.
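
For anyone following the thread who wants to check the two encodings of "ä" discussed above (C3 A4 versus 61 CC 88), here is a minimal sketch in Python using only the standard library. It is not the UnicodeNormalizationFilterFactory and makes no claim about how that filter works internally; the helper name fold_to_ascii is hypothetical and only illustrates the general idea of decomposing a string and dropping combining marks, so that both representations end up as a bare "a".

```python
import unicodedata
from urllib.parse import quote


def fold_to_ascii(text: str) -> str:
    """Hypothetical helper: decompose to NFD, then drop combining marks."""
    decomposed_text = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed_text if not unicodedata.combining(ch))


# The same visible word in the two UTF-8 representations from the thread.
precomposed = "Denkm\u00e4ler"   # "ä" as a single code point (bytes C3 A4)
decomposed = "Denkma\u0308ler"   # "a" + combining diaeresis (bytes 61 CC 88)

# Percent-encoded forms, as they would appear in a query string.
print(quote(precomposed))   # Denkm%C3%A4ler
print(quote(decomposed))    # Denkma%CC%88ler

# Without normalization the two strings are different.
print(precomposed == decomposed)   # False

# NFC composes the combining sequence into the single code point.
print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True

# Folding maps both representations to the same plain form.
print(fold_to_ascii(precomposed))   # Denkmaler
print(fold_to_ascii(decomposed))    # Denkmaler
```

If both the indexed term and the query term go through the same folding step, the two representations can no longer diverge before stemming, which is the behavior Bob's analysis output shows for his settings.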