Re: [VuFind-Tech] [solrmarc-tech] Re: umlauts

Glad to help.  I'm always happy to receive enhancement patches if you find things you think are worth sharing.

Also, how are you managing your software versions?  Upgrading a customized 1.0 to 1.0.1 should actually be pretty easy if you use something like Subversion vendor branching.  I'll share details if you're interested.

- Demian

> -----Original Message-----
> From: sol...@go... [mailto:solrmarc-
> te...@go...] On Behalf Of mjk...@go...
> Sent: Tuesday, August 31, 2010 7:14 AM
> To: Demian Katz
> Cc: sol...@go...; vuf...@li...
> Subject: Re: [solrmarc-tech] Re: [VuFind-Tech] umlauts
>
>   Dear Demian,
>
> thank you very much for pointing this out. I will switch to 1.0.1 soon,
> but made so many small changes to 1.0 that it will become an expensive
> task to move them all to a new source.
>
> Maybe I'll try to ask you to include one or another if I think it could
> make sense for others.
>
> Best,
> Leander
>
> On 30.08.2010 19:44, Demian Katz wrote:
> > Bob,
> >
> > I just ran your sample string ("Denkmäler Denkmäler Denkmaler")
> > through analysis on my VuFind test server and observed a very
> > interesting result.  The middle Denkmäler (the non-combining
> > representation) does not get stemmed by the
> > SnowballPorterFilterFactory, but the other two representations do get
> > stemmed.  The bottom of my analysis result looks like this:
> >
> > denkmal denkmaler       denkmal
> >
> > I can see how this could result in failed searches -- if your search
> > query normalizes to "denkmaler" but the indexed term normalized to
> > "denkmal," there won't be a match!
> >
> > I wonder if this is a bug in the SnowballPorterFilterFactory -- do
> > you think we should report it to the solr-user list?
> >
> > Leander,
> >
> > We recently (post-1.0.1 release) made some changes to VuFind to
> > ensure that unstemmed matches rank higher than stemmed matches.  I
> > believe that this adjustment actually introduces a workaround for your
> > problem...  at least, the "Denkmäler Denkmäler Denkmaler" search fails
> > on the 1.0.1 demo server but succeeds on my post-1.0.1 test server.  If
> > you want to try adjusting your configuration accordingly, see this JIRA
> > ticket:
> >
> > http://vufind.org/jira/browse/VUFIND-259
> >
> > - Demian
> >
> >> -----Original Message-----
> >> From: sol...@go... [mailto:solrmarc-
> >> te...@go...] On Behalf Of Robert Haschart
> >> Sent: Monday, August 30, 2010 12:42 PM
> >> To: sol...@go...
> >> Cc: MJKL Seige
> >> Subject: Re: [solrmarc-tech] Re: [VuFind-Tech] umlauts
> >>
> >> Jonathan,
> >>
> >> The ASCIIFoldingFilterFactory did exist at the time I started creating
> >> the UnicodeNormalizationFilterFactory; however, the version that existed
> >> then really didn't work very well. IIRC all it did was delete all
> >> combining accent marks, which would cause the exact problem MJKL Seige
> >> seems to be having.
> >>
> >> MJKL Seige,
> >>
> >> I tried using the solr analysis page to see how the
> >> UnicodeNormalizationFilterFactory works when given the two different
> >> representations of the a with umlaut, and with the settings I am using,
> >> both of the ways of encoding the 'a with umlaut' are mapped to a bare
> >> 'a'.
> >> Go to the Solr admin page for your index, and click on the [ANALYSIS]
> >> link near the top of the page.
> >>
> >> Enter the string "text" in the Field box, enter "Musikalische Denkmäler
> >> Denkmäler Denkmaler" in both the Field value (Index) box and the Field
> >> value (Query) box, check both "verbose output" checkboxes, and then
> >> click the "Analyze" button.
> >>
> >> The results I see are shown below. They show that for both forms of
> >> "a with umlaut", UnicodeNormalizationFilterFactory replaces the "a with
> >> umlaut" with an 'a'. It could be that either you are giving the
> >> UnicodeNormalizationFilterFactory different parameters or that it is
> >> not actually included in the Analyzer sequence you are using.
> >>
> >>
> >> Index Analyzer
> >> org.apache.solr.analysis.HTMLStripWhitespaceTokenizerFactory {}
> >> term position 1 2 3 4
> >> term text Musikalische Denkmäler Denkmäler Denkmaler
> >> term type word word word word
> >> source start,end 0,12 13,23 25,34 35,44
> >> payload
> >> schema.UnicodeNormalizationFilterFactory {composed=false,
> >> remove_modifiers=true, fold=true, version=icu4j,
> >> remove_diacritics=true}
> >> term position 1 2 3 4
> >> term text Musikalische Denkmaler Denkmaler Denkmaler
> >> term type word word word word
> >> source start,end 0,12 13,23 25,34 35,44
> >> payload
> >> schema.CJKFilterFactory {bigrams=false}
> >> term position 1 2 3 4
> >> term text Musikalische Denkmaler Denkmaler Denkmaler
> >> term type word word word word
> >> source start,end 0,12 13,23 25,34 35,44
> >> payload
> >> org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt,
> >> ignoreCase=true, enablePositionIncrements=true}
> >> term position 1 2 3 4
> >> term text Musikalische Denkmaler Denkmaler Denkmaler
> >> term type word word word word
> >> source start,end 0,12 13,23 25,34 35,44
> >> payload
> >> org.apache.solr.analysis.WordDelimiterFilterFactory
> >> {generateNumberParts=1, catenateWords=1, generateWordParts=1,
> >> catenateAll=0, catenateNumbers=1}
> >> term position 1 2 3 4
> >> term text Musikalische Denkmaler Denkmaler Denkmaler
> >> term type word word word word
> >> source start,end 0,12 13,23 25,34 35,44
> >> payload
> >> org.apache.solr.analysis.LowerCaseFilterFactory {}
> >> term position 1 2 3 4
> >> term text musikalische denkmaler denkmaler denkmaler
> >> term type word word word word
> >> source start,end 0,12 13,23 25,34 35,44
> >> payload
> >> org.apache.solr.analysis.EnglishPorterFilterFactory
> >> {protected=protwords.txt}
> >> term position 1 2 3 4
> >> term text musikalisch denkmal denkmal denkmal
> >> term type word word word word
> >> source start,end 0,12 13,23 25,34 35,44
> >> payload
> >> org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
> >> term position 1 2 3 4
> >> term text musikalisch denkmal denkmal denkmal
> >> term type word word word word
> >> source start,end 0,12 13,23 25,34 35,44
> >> payload
> >>
> >>
> >> Query Analyzer
> >> org.apache.solr.analysis.WhitespaceTokenizerFactory {}
> >> term position 1 2 3 4
> >> term text Musikalische Denkmäler Denkmäler Denkmaler
> >> term type word word word word
> >> source start,end 0,12 13,23 25,34 35,44
> >> payload
> >> schema.UnicodeNormalizationFilterFactory {composed=false,
> >> remove_modifiers=true, fold=true, version=icu4j,
> >> remove_diacritics=true}
> >> term position 1 2 3 4
> >> term text Musikalische Denkmaler Denkmaler Denkmaler
> >> term type word word word word
> >> source start,end 0,12 13,23 25,34 35,44
> >> payload
> >> schema.CJKFilterFactory {bigrams=false}
> >> term position 1 2 3 4
> >> term text Musikalische Denkmaler Denkmaler Denkmaler
> >> term type word word word word
> >> source start,end 0,12 13,23 25,34 35,44
> >> payload
> >> org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt,
> >> expand=true, ignoreCase=true}
> >> term position 1 2 3 4
> >> term text Musikalische Denkmaler Denkmaler Denkmaler
> >> term type word word word word
> >> source start,end 0,12 13,23 25,34 35,44
> >> payload
> >> org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt,
> >> ignoreCase=true, enablePositionIncrements=true}
> >> term position 1 2 3 4
> >> term text Musikalische Denkmaler Denkmaler Denkmaler
> >> term type word word word word
> >> source start,end 0,12 13,23 25,34 35,44
> >> payload
> >> org.apache.solr.analysis.WordDelimiterFilterFactory
> >> {generateNumberParts=1, catenateWords=0, generateWordParts=1,
> >> catenateAll=0, catenateNumbers=0}
> >> term position 1 2 3 4
> >> term text Musikalische Denkmaler Denkmaler Denkmaler
> >> term type word word word word
> >> source start,end 0,12 13,23 25,34 35,44
> >> payload
> >> org.apache.solr.analysis.LowerCaseFilterFactory {}
> >> term position 1 2 3 4
> >> term text musikalische denkmaler denkmaler denkmaler
> >> term type word word word word
> >> source start,end 0,12 13,23 25,34 35,44
> >> payload
> >> org.apache.solr.analysis.EnglishPorterFilterFactory
> >> {protected=protwords.txt}
> >> term position 1 2 3 4
> >> term text musikalisch denkmal denkmal denkmal
> >> term type word word word word
> >> source start,end 0,12 13,23 25,34 35,44
> >> payload
> >> org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
> >> term position 1 2 3 4
> >> term text musikalisch denkmal denkmal denkmal
> >> term type word word word word
> >> source start,end 0,12 13,23 25,34 35,44
> >> payload
> >>
> >>
> >>
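A minimal Python sketch of the folding that the analysis output above shows. This is not the source of UnicodeNormalizationFilterFactory, which does not appear in this thread; it only assumes that the filter's net effect with these settings is "decompose, then drop combining marks". The function name fold_diacritics is made up for illustration.

```python
import unicodedata


def fold_diacritics(text: str) -> str:
    """Decompose to NFD, then drop nonspacing combining marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


precomposed = "Denkm\u00e4ler"  # ä as the single code point U+00E4
combining = "Denkma\u0308ler"   # a followed by U+0308 COMBINING DIAERESIS

# Both spellings fold to the same plain-ASCII token.
print(fold_diacritics(precomposed), fold_diacritics(combining))
```

If a filter only deleted combining marks without decomposing first, the precomposed form would pass through unchanged, which is one possible way to end up with the mismatch discussed in this thread.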
> >> Jonathan Rochkind wrote:
> >>
> >>> Are you completely sure that both of those are actually valid escaped
> >>> UTF-8 encodings for ä?
> >>>
> >>> It looks like they are, and the %61%CC%88 is the combining diaeresis
> >>> version?
> >>>
> >>> If that's all true, then, sorry, you're right, that would appear to be
> >>> a bug in the unicode normalization filter, I guess? Either wait for
> >>> Bob to chime in, or dig in to the source and try to fix it yourself.
> >>> I'm not sure where the source is, or if it's even public though, so
> >>> maybe we need Bob here.
> >>>
> >>> Alternately, you could abandon Bob's unicode normalization filter
> >>> entirely, and use the one that's actually part of lucene/solr now
> >>> (which didn't exist when Bob wrote his).
> >>>
> >>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ASCIIFoldingFilterFactory
> >>>
> >>> Not entirely sure what the differences are, but that one may work
> >>> better for you, and is standard solr instead of custom Bob, might want
> >>> to try that.
> >>>
> >>> The fact that Bob's filter temporarily converts the MARC8 to
> >>> non-combined UTF-8 doesn't matter if it ultimately converts it to just
> >>> plain ascii though, right? That's just an intermediate step, never
> >>> makes it to the index, what makes it to the index is just ascii
> >>> without diaeresis at all, due to the filter. So first question is if
> >>> you want to fold everything to ASCII or not -- it works well for many
> >>> interfaces that will be used primarily with English and by English
> >>> speakers, but may not be what you want in some circumstances.
> >>> Depending on if it's what you want or not, you have different options
> >>> available.
> >>>
> >>> Jonathan
> >>>
> >>> MJKL Seige wrote:
> >>>
> >>>> Dear Jonathan,
> >>>>
> >>>> thank you for your answer but my question was a little bit different.
> >>>> There are obviously two ways to encode an ä in utf-8, %C3%A4 and
> >>>> %61%CC%88. The Unicode Normalization Filter does normalize %61%CC%88
> >>>> to %61, that is ä to a. This is correct. But it obviously does not
> >>>> normalize %C3%A4 to %61. Unfortunately browsers seem to send %C3%A4
> >>>> in formdata. Now, %C3%A4 doesn't match the normalized %61. I would
> >>>> expect the UnicodeNormalizationFilter to normalize %C3%A4 to %61.
> >>>> Hacking the "normalization" %C3%A4 to %61 into vufind's Solr.php does
> >>>> produce the expected results.
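The two percent-encoded forms mentioned above can be checked with a short Python sketch. It uses only the standard library and is independent of VuFind and Solr; it shows that both sequences are valid UTF-8, that they decode to different code point sequences, and that they become equal once both are brought into the same normalization form (NFC is used here as an example).

```python
import unicodedata
from urllib.parse import unquote

precomposed = unquote("%C3%A4")   # unquote() decodes as UTF-8 by default
combining = unquote("%61%CC%88")

for label, s in (("%C3%A4", precomposed), ("%61%CC%88", combining)):
    # List the code points and their Unicode names for each form.
    print(label, [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in s])

same_raw = precomposed == combining
same_nfc = (unicodedata.normalize("NFC", precomposed)
            == unicodedata.normalize("NFC", combining))
print(same_raw, same_nfc)
```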
> >>>>
> >>>> My question was why the UnicodeNormalizationFilter doesn't do that,
> >>>> and where I can read about its functionality and
> >>>> configuration options.
> >>>>
> >>>> Another thing: during the import of the MARC data these are converted
> >>>> from MARC8 to UTF8, the ä from %E8%61 to %61%CC%88. I guess this is
> >>>> because %E8 is "combining diaeresis" in MARC8 and %CC%88 is "combining
> >>>> diaeresis" in UTF8, so this conversion basically makes sense. On the
> >>>> other hand converting %E8%61 to %C3%A4 would be correct as well in my
> >>>> opinion.
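A rough Python sketch of the conversion described above, under stated assumptions. MARC-8 puts a combining mark before its base letter, while Unicode puts it after, so a converter has to reorder. The table below holds only the single E8 mapping mentioned in this thread, the function name marc8_to_unicode is hypothetical, and this is not how solrmarc itself performs the conversion.

```python
import unicodedata

# Only the one mapping discussed in this thread; a real MARC-8 table is much larger.
MARC8_COMBINING = {0xE8: "\u0308"}  # MARC-8 diaeresis -> U+0308 COMBINING DIAERESIS


def marc8_to_unicode(data: bytes) -> str:
    out = []
    pending = []  # combining marks waiting for their base character
    for byte in data:
        if byte in MARC8_COMBINING:
            pending.append(MARC8_COMBINING[byte])
        else:
            out.append(chr(byte))  # assumption: every other byte is plain ASCII
            out.extend(pending)    # Unicode wants the mark after the base letter
            pending.clear()
    return "".join(out)


decomposed = marc8_to_unicode(b"Denkm\xe8aler")
composed = unicodedata.normalize("NFC", decomposed)
print(decomposed.encode("utf-8"))  # contains 61 CC 88
print(composed.encode("utf-8"))    # contains C3 A4
```

Both outputs are canonically equivalent; applying NFC after the conversion is what would yield the C3 A4 form.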
> >>>>
> >>>> Regards,
> >>>> mjkl
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On 28.08.10 00:51, Jonathan Rochkind wrote:
> >>>>
> >>>>> The unicode normalization filter indeed does change Denkmäler to
> >>>>> just Denkmaler on indexing. This is in fact the whole point of the
> >>>>> filter. It should also be applied in your solr schema on query, so a
> >>>>> query on Denkmäler is also changed to Denkmaler. The point of this
> >>>>> is that a query for "Denkmäler" should match "Denkmäler" or
> >>>>> "Denkmaler" in the source, and so should a query for Denkmaler.
> >>>>>
> >>>>> If you don't want that normalization behavior, simply remove the
> >>>>> unicode normalization filter from your field definition, no problem.
> >>>>> That behavior is its purpose. If you remove it, then a query on
> >>>>> "Denkmaler" won't match "Denkmäler" anymore, you'll need to query on
> >>>>> the _exact_ string with diacritics and all to match. You'd probably
> >>>>> want to replace it with some filter (not sure what is available)
> >>>>> that makes sure all your unicode is in Unicode Normalization Form KC
> >>>>> though -- without that, even some queries on "Denkmäler" won't match
> >>>>> "Denkmäler", depending on exactly how the unicode is formed. Unicode
> >>>>> gets tricky in search matching. (Bob could say if maybe there's an
> >>>>> option to his unicode normalization filter that just uses NFKC, and
> >>>>> doesn't flatten to ascii?).
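To illustrate the "NFKC only, no flattening" option raised above, here is a small Python sketch. It is standard-library code, not a Solr filter, and the function name normalize_only is made up; it just shows that NFKC makes the two spellings of "Denkmäler" compare equal while keeping the diacritic.

```python
import unicodedata


def normalize_only(text: str) -> str:
    """Apply NFKC so equivalent spellings compare equal; diacritics are kept."""
    return unicodedata.normalize("NFKC", text)


precomposed = "Denkm\u00e4ler"  # ä as U+00E4
combining = "Denkma\u0308ler"   # a followed by U+0308

# The two spellings now match each other...
print(normalize_only(precomposed) == normalize_only(combining))
# ...but neither matches the unaccented word.
print(normalize_only(precomposed) == "Denkmaler")
```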
> >>>>>
> >>>>> Now, as to why Denkmäler isn't matching for you -- somehow your
> >>>>> unicode is getting corrupted by VuFind/PHP in between the browser
> >>>>> and the Solr request. The behavior works as expected in my
> >>>>> Blacklight installation -- although it didn't when I had an
> >>>>> incorrect/missing configuration in my Tomcat installation, which
> >>>>> doesn't apply to you with VuFind so I won't get into details, but
> >>>>> it's an example of how it's definitely possible to have encodings
> >>>>> messed up somewhere in your toolchain before the query hits Solr.
> >>>>> Getting to the bottom of that is a VuFind issue, and not related to
> >>>>> the unicode normalization filter. Even if you remove the unicode
> >>>>> normalization filter you will probably still have this problem,
> >>>>> since it appears that the problem is your unicode somehow getting
> >>>>> corrupted before it gets to the solr request.
> >>>>>
> >>>>> Jonathan
> >>>>> ________________________________________
> >>>>> From: sol...@go...
> >>>>> [sol...@go...] On Behalf Of Demian Katz
> >>>>> [dem...@vi...]
> >>>>> Sent: Friday, August 27, 2010 3:41 PM
> >>>>> To: MJKL Seige; vuf...@li...
> >>>>> Cc: sol...@go...
> >>>>> Subject: [solrmarc-tech] RE: [VuFind-Tech] umlauts
> >>>>>
> >>>>> First of all, I am copying this message to the solrmarc-tech list --
> >>>>> this way it should reach the attention of Bob Haschart, who wrote
> >>>>> the UnicodeNormalizationFilter and can say more about its behavior
> >>>>> than I could.
> >>>>>
> >>>>> Regarding your other point about breadcrumb generation in the
> >>>>> default theme, you are right -- it's currently done rather sloppily.
> >>>>> I haven't made a big effort to clean it up since I use the classic
> >>>>> theme myself... but if you have any ideas you want to share, I'm
> >>>>> definitely open to improving the trunk. You might at least be
> >>>>> interested in looking at some of the multi-byte-capable ucwords()
> >>>>> variations discussed here:
> >>>>>
> >>>>> http://php.net/manual/en/function.ucwords.php
> >>>>>
> >>>>> - Demian
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: MJKL Seige [mailto:mjk...@go...]
> >>>>>> Sent: Friday, August 27, 2010 3:33 PM
> >>>>>> To: vuf...@li...
> >>>>>> Subject: [VuFind-Tech] umlauts
> >>>>>>
> >>>>>> Dear all,
> >>>>>>
> >>>>>> I have a little issue with umlauts and I'd like to describe it here
> >>>>>> and hope someone can give me a hint on how to solve it:
> >>>>>>
> >>>>>> I import MARC records with MARC-8 encoding, our example word is
> >>>>>> Denkmäler and the "ä" in it is encoded as the hexadecimal sequence E8
> >>>>>> 61.
> >>>>>>
> >>>>>> After importing this successfully I try to search it by typing the
> >>>>>> word "Denkmäler" in the search form. Nothing is found.
> >>>>>> I check the html source and find out that the "ä" is encoded as C3 A4,
> >>>>>> this is the UTF-8 "ä".
> >>>>>>
> >>>>>> Now I search again with "Denkmaler", some "Denkmäler" are found. These
> >>>>>> "ä" are encoded as 61 CC 88, this is the UTF-8 "a" with combining
> >>>>>> diaeresis following.
> >>>>>>
> >>>>>> By modifying the url to ?lookfor=denkm%61%CC%88ler vufind shows me the
> >>>>>> same results as for "Denkmaler". I think the UnicodeNormalizationFilter
> >>>>>> changed 61 CC 88 to just 61, that is: "Denkmaler".
> >>>>>>
> >>>>>> So, the first thing is that vufind (solrmarc?) converts the MARC-8 "E8
> >>>>>> 61" to "61 CC 88", not "C3 A4". I don't want to decide which way is the
> >>>>>> correct encoding for an ä: "C3 A4" or "61 CC 88". But I think "C3 A4"
> >>>>>> should be normalized to "a" as well? Unfortunately I couldn't find much
> >>>>>> documentation about the filters in schema.xml and their args, where can
> >>>>>> I read about which filters are available and what these options
> >>>>>> (composed="", remove_diacritics="" ...) exactly do?
> >>>>>>
> >>>>>> Or shouldn't "61 CC 88" and "C3 A4" be normalized to "ae"?
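For comparison, a Python sketch of what the "ae" alternative asked about above could look like. It assumes the common German transliteration convention (ä to ae, ö to oe, ü to ue, ß to ss); the mapping table and the function name expand_umlauts are illustrative and are not options of any filter named in this thread.

```python
import unicodedata

# Assumption: the usual German transliteration convention, applied after lower-casing.
GERMAN_EXPANSIONS = {"\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue", "\u00df": "ss"}


def expand_umlauts(text: str) -> str:
    # Compose first so that "a" + U+0308 is seen as the single letter ä.
    composed = unicodedata.normalize("NFC", text.lower())
    return "".join(GERMAN_EXPANSIONS.get(ch, ch) for ch in composed)


print(expand_umlauts("Denkm\u00e4ler"))   # precomposed input
print(expand_umlauts("Denkma\u0308ler"))  # combining input
print(expand_umlauts("Denkmaler"))        # unaccented input stays different
```

The trade-off is that a query for "Denkmaler" would then no longer match "Denkmäler", whereas a query for "Denkmaeler" would.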
> >>>>>>
> >>>>>> Another funny thing is that in vufind's breadcrumbs the search term is
> >>>>>> written "DenkmäLer", the L is upper-cased. Why? I guess it is because
> >>>>>> vufind seems to detect a word boundary at the umlaut, this is wrong of
> >>>>>> course. (Btw upper-casing the search term is wrong anyway I think, but I
> >>>>>> wouldn't have found this funny thing otherwise ;-)
> >>>>>>
> >>>>>> Best wishes,
> >>>>>> mjkl
> >>>>>>
> >>>>>> _______________________________________________
> >>>>>> Vufind-tech mailing list
> >>>>>> Vuf...@li...
> >>>>>> https://lists.sourceforge.net/lists/listinfo/vufind-tech
> >>>>> --
> >>>>> You received this message because you are subscribed to the
> Google
> >>>>> Groups "solrmarc-tech" group.
> >>>>> To post to this group, send email to solrmarc-
> >> te...@go....
> >>>>> To unsubscribe from this group, send email to
> >>>>> sol...@go....
> >>>>> For more options, visit this group at
> >>>>> http://groups.google.com/group/solrmarc-tech?hl=en.
> >>>>>