Re: [Regexkit-discussion] PCRE Support for International Text
From: John E. <joh...@gm...> - 2008-03-05 02:09:53
On Mar 4, 2008, at 5:54 PM, Jonathan Dann wrote:

> Hi All,
>
> I found RegexKit after being pointed to it by a reply on the cocoa-dev
> list, it's ace. I was then informed that PCRE may not yet support
> proper word breaks (sorry if that's not the correct terminology) in
> scripts like Kanji. Is this still the case? I'm expecting to support
> more languages other than English in my app and this may end up being a
> headache for me.

Unicode is pretty complex, and I'm an English-only speaker, so I'm at a disadvantage as to what a "word break" means in this precise context. I'm pretty sure it's not "space or tab" :).

PCRE has a build-time option for supporting UTF-8 Unicode and, optionally, Unicode properties (the \p{} and \P{} syntax). The PCRE built in to RegexKit includes both, so it's as Unicode enabled as PCRE can get.

Foundation uses UTF-16 as its abstract representation of text, even though internally it may keep the text in whatever format happens to be the most convenient for it. PCRE uses only UTF-8 to represent text, however, so things like NSRange values for a piece of text can have wildly different values depending on whether they are for the UTF-8 or the UTF-16 representation. RegexKit tries to hide these differences from you and "do the right thing", which generally translates into "If it has to do with Foundation (NSString, etc), then all values are in UTF-16, otherwise raw, low level byte buffer access is in UTF-8." More specific details are covered in the RKRegex class documentation.

I'd recommend looking at http://regexkit.sourceforge.net/Documentation/pcre/pcresyntax.html#SEC5 and http://regexkit.sourceforge.net/Documentation/pcre/pcrepattern.html#SEC3 (the section regarding Unicode). There is probably a Unicode property that covers "word break", or something like it. For example, \p{L} matches any Unicode letter, so something like \p{L}+ could be used to match all 'words', but again this is from an English-only speaker.
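To make the UTF-8 versus UTF-16 range difference concrete, here is a small sketch in Python rather than Objective-C. It is not RegexKit or PCRE code. Python's built-in re module has no \p{...} support, so the class [^\W\d_]+ is used as a rough stand-in for \p{L}+, and the helper names (utf16_units, utf8_bytes, word_ranges) are made up for this illustration. For each letter run it reports a (location, length) pair counted in UTF-16 code units and the same pair counted in UTF-8 bytes.

```python
import re

# Rough stand-in for PCRE's \p{L}+ (an assumption for this sketch):
# "word character that is neither a digit nor an underscore".
LETTER_RUN = re.compile(r"[^\W\d_]+")


def utf16_units(s):
    """Number of UTF-16 code units needed to represent s."""
    return len(s.encode("utf-16-le")) // 2


def utf8_bytes(s):
    """Number of UTF-8 bytes needed to represent s."""
    return len(s.encode("utf-8"))


def word_ranges(text):
    """Return (word, utf16_range, utf8_range) for each letter run.

    Each range is a (location, length) pair, in the spirit of NSRange.
    """
    results = []
    for m in LETTER_RUN.finditer(text):
        prefix = text[:m.start()]
        word = m.group()
        results.append((
            word,
            (utf16_units(prefix), utf16_units(word)),
            (utf8_bytes(prefix), utf8_bytes(word)),
        ))
    return results


# "naive" with a diaeresis, "cafe" with an acute accent, then three kanji.
text = "na\u00efve caf\u00e9 \u65e5\u672c\u8a9e"
for word, r16, r8 in word_ranges(text):
    print(word, "UTF-16:", r16, "UTF-8:", r8)
```

The two ranges agree only while everything before and inside the match is plain ASCII. As soon as an accented letter or a kanji appears, the UTF-8 location and length drift away from the UTF-16 ones.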
The other operators typically used for matching "word breaks" are \b (a zero-width word boundary assertion) and \B (its logical opposite). However, under PCRE, 'word character' means ASCII word characters only, though I would think that the equivalent could be fashioned out of the \p{} Unicode properties. The ICU documentation is unclear on what \b matches precisely, other than 'word to non-word character transition', and 'seems' to point to a different ICU API for doing 'better word boundary analysis'. See http://www.icu-project.org/userguide/regexp.html and http://www.icu-project.org/userguide/boundaryAnalysis.html .

A zero order approximation gives me the impression that PCRE and ICU are equally capable of finding 'simple' words (those composed of letters), but neither is capable of complex 'word breaking' by itself, such as word breaking something like "that's", in which the ' is clearly a part of the word.

I'd recommend the PCRE mailing list for a more informed opinion, which can be found at http://lists.exim.org/mailman/listinfo/pcre-dev . There you'll find the developers and likely someone who can give you a more authoritative answer than my speculation.

Towards the future, the latest release started a move to generalize which regex pattern matching library is used, with the first obvious candidate being the ICU library that ships with Mac OS X. I have ICU pattern matching working in extremely rough form now, but I'm not terribly happy with the way it's going. The ICU library presents a number of problems, ones that I didn't really take into consideration when I first started this project.

The first is that the ICU regex API was clearly not designed with multi-threading in mind. Using the C API, when a regex is compiled, a "regex matcher" is returned. You then "set" the string (which /must/ be UTF-16) that the regex matches against. This mixes the state of what's being matched, and how far into the string the current match is, with that compiled regex.
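The effect of keeping match state inside the compiled regex can be modeled with a short Python sketch. This is not the ICU API. StatefulMatcher, set_text and find_next are hypothetical names invented here, and the interleaving is done by hand in a single thread so the outcome is deterministic.

```python
import re


class StatefulMatcher:
    """Hypothetical model: the compiled pattern, the subject text and the
    current search position all live in one object."""

    def __init__(self, pattern):
        self._regex = re.compile(pattern)
        self._text = ""
        self._pos = 0

    def set_text(self, text):
        # Replaces the subject text and rewinds the search position.
        self._text = text
        self._pos = 0

    def find_next(self):
        # Searches from the stored position and advances it past the match.
        m = self._regex.search(self._text, self._pos)
        if m is None:
            return None
        self._pos = m.end()
        return m.group()


shared = StatefulMatcher(r"\d+")

# Caller A starts scanning its own string.
shared.set_text("10 20 30")
a_first = shared.find_next()

# Caller B reuses the same matcher before A has finished.
shared.set_text("7 8")
b_first = shared.find_next()

# A resumes, but the text and position now belong to B.
a_second = shared.find_next()

print(a_first, b_first, a_second)
```

Caller A expects its second match to be "20" but gets "8", because B's call to set_text replaced both the text and the position. With real threads the same interference can happen at any point, so each thread ends up needing its own matcher.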
This requires "compiling" the regex for each thread, and an awful lot of overhead and per-thread information. It's a massive inconvenience, and right now I'm not sure it's actually worth all the effort.

There's also the fact that every string needs to be converted into UTF-16 before it can be matched by ICU. While PCRE requires everything to be converted to UTF-8, I've found that in practice this isn't actually a problem, as most of the time strings seem to be kept in UTF-8 or a UTF-8 compatible encoding (i.e., ASCII). RegexKit expends a lot of effort to get access to the raw NSString buffer to avoid the constant allocation and destruction of temporary strings for a one time match, but that raw buffer obviously has to be in a UTF-8 compatible encoding for that to happen. It would seem that using ICU would require an almost constant conversion of strings to UTF-16 for a one time match, but this is very much application and usage sensitive.

On top of those issues, the ICU library that ships with OS X is 'technically' not supported for developer use; it's mostly there for Apple's internal usage. So by using it, you could essentially be considered to be using an 'unpublished, private API'.

The regex syntax of PCRE is also much richer than the regex syntax provided by ICU. Examples include named subpatterns (not just $1 numbers, but $name for a subpattern, very handy), conditional subpatterns, recursive subpatterns, subpattern subroutines, etc. From what I can tell, PCRE has every feature of the ICU regex pattern matcher, and a lot more. The only area I'm not entirely sure about is the particulars of the \p{} Unicode property support: the ICU documentation only says that it supports it, but gives no other details as to what exactly is supported. In general, the PCRE regex engine tends to be one of the fastest regex matchers to boot.

I've attached my consolidated ICU Regex C API header file in case the ICU library is a drop dead requirement for you.
This is for the C API of the ICU Regex matcher only, nothing else. It's an all in one file, and nothing else needs to be included (I think, at least not from the ICU headers). You'll need to link against /usr/lib/libicu.dylib (typically just a -licu to the compiler/linker). The ICU documentation can be found at http://www.icu-project.org/apiref/icu4c/uregex_8h.html

Hopefully it's enough to give you a head start if that's the route you need to go. Again, I'm not sure if the next version of RegexKit will include ICU support, even though I've got it limping along right now, due to the problems I outlined above. It probably will make it in eventually, but it's sort of iffy for the next release.