From: Joe W. <jo...@gm...> - 2013-01-14 14:36:53
Hi Jonathan,

Thanks for your reply! I am, indeed, after a way to strip diacritical marks from text.

I don't think your characterization captures all that normalize-unicode() does (perhaps these are post-2.0 developments?). Based on http://unicode.org/reports/tr15/#Norm_Forms, I believe the NFKD normalization form *decomposes* characters into their constituent parts. So, for example, as http://unicode.org/reports/tr15/#Compatibility_Composite_Figure illustrates:

> In the NFKC and NFKD forms, many formatting distinctions are removed, as shown in Figure 6. The “fi” ligature changes into its components “f” and “i”, the superscript formatting is removed from the “5”, and the long “s” is changed into a normal “s”.

So I would expect normalize-unicode('ﬃ', 'NFKD') to return the three letters 'ffi', but it returns the original single character.

I hope, but am not sure, that I should be able to apply the same function to strings with diacritics, and then use other functions like replace() to strip away the diacritic characters, as suggested on xquery-talk (http://www.stylusstudio.com/xquerytalk/201106/003547.html).

Joe

On Mon, Jan 14, 2013 at 8:11 AM, Jonathan Rowell <big...@ho...> wrote:
> Hi,
>
> doesn't normalize-unicode ensure that the encodings for diacritic characters
> are in canonical form? That is, that diacritic characters consisting of a
> basic character and one or more non-spacing modifiers are mapped to either
> precomposed characters or are backwards modifying and in the correct order?*
>
> What you need in fact is a transliteration, such that diacritics are dropped
> and diphthongs separated into basic characters.
>
> Jonathan
>
> * Section 5.9 in Unicode Standard (my book version is 2.0)
>
>> From: jo...@gm...
>> Date: Mon, 14 Jan 2013 07:56:58 -0500
>> To: wol...@ex...
>> CC: exi...@li...
>> Subject: Re: [Exist-open] stripping diacritics with fn:normalize-unicode()
>>
>> Hi all,
>>
>> I'm not an expert in unicode normalization, but I think I have a
>> reproducible test showing normalize-unicode() isn't performing as
>> expected. It should apply the listed normalization forms to the
>> string, but in fact, the results are coming back identical to the
>> input.
>>
>> let $string := 'ﬃ' (: note that this is a single unicode character :)
>> let $normalization-forms := ('NFKC', 'NFKD')
>> for $normalization-form in $normalization-forms
>> return
>>     normalize-unicode($string, $normalization-form)
>>
>> ==> should return 3-letter form 'ffi' but returns single character as
>> in input string
>>
>> For a demonstration of how this string should be normalized, see:
>> - http://unicode.org/cldr/utility/transform.jsp?a=Any-NFKC&b=ffi
>> - http://unicode.org/cldr/utility/transform.jsp?a=Any-NFKD&b=ffi
>>
>> See also the icu4j documentation discussing this example:
>> - http://icu-project.org/apiref/icu4j/com/ibm/icu/text/Normalizer.html
>>
>> On a related note, the function docs describe a normalization form
>> 'FULLY-NORMALIZED', but using this results in an error:
>>
>> 2013-01-14 06:53:44,549 [eXistThread-48] ERROR
>> (FunNormalizeUnicode.java [eval]:153) - err:FOCH0003: unknown
>> normalization form
>> 2013-01-14 06:53:44,552 [eXistThread-48] ERROR
>> (FunNormalizeUnicode.java [eval]:174) - Can not find the ICU4J library
>> in the classpath err:FOCH0003 unknown normalization form [at line 7,
>> column 5, source: String]
>>
>> This 2nd line is strange because the icu4j jar is in
>> $EXIST_HOME/lib/user/icu4j-4_8_1_1.jar - where "build
>> download-additional-jars" put it.
>>
>> Joe
>>
>> On Sat, Jan 12, 2013 at 12:15 PM, Joe Wicentowski <jo...@gm...>
>> wrote:
>> > Hi all,
>> >
>> > I've been trying to strip diacritics from a string of text and found
>> > this code on xquery-talk [1]:
>> >
>> > replace(normalize-unicode('abcdëf', 'NFD'), '[\p{M}]', '')
>> >
>> > This is supposed to turn abcdëf into abcdef (no umlaut). But this
>> > isn't working for me in trunk. I've built with the required icu4j
>> > jars (build download-additional-jars).
>> >
>> > Has anyone used normalize-unicode() to strip diacritics?
>> >
>> > Thanks,
>> > Joe
>> >
>> > [1] http://www.stylusstudio.com/xquerytalk/201106/003547.html
>>
>> _______________________________________________
>> Exist-open mailing list
>> Exi...@li...
>> https://lists.sourceforge.net/lists/listinfo/exist-open
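
For anyone wanting to confirm the expected results independently of eXist and ICU4J, the behaviour described in this thread can be sketched with Python's standard unicodedata module. This is only an illustrative cross-check of the Unicode normalization forms, not eXist's implementation; the helper name strip_diacritics is made up for the example.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Hypothetical helper: decompose with NFD, then drop combining marks.

    Mirrors the idea behind replace(normalize-unicode($s, 'NFD'), '[\\p{M}]', ''):
    characters whose general category starts with "M" are marks.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")
    )


# U+FB03 LATIN SMALL LIGATURE FFI: one code point with a compatibility decomposition.
ligature = "\ufb03"

nfkd = unicodedata.normalize("NFKD", ligature)  # compatibility decomposition
nfkc = unicodedata.normalize("NFKC", ligature)  # compatibility decomposition, then composition
nfd = unicodedata.normalize("NFD", ligature)    # canonical only: ligature is left alone

stripped = strip_diacritics("abcd\u00ebf")      # "abcdëf" with a precomposed e-diaeresis

print(len(ligature), repr(nfkd), repr(nfkc), nfd == ligature, stripped)
```

Only the compatibility forms (NFKC, NFKD) split the ligature; the canonical forms (NFC, NFD) leave it unchanged, which is why the ligature test needs NFKC/NFKD while the diacritic-stripping recipe works with NFD.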