Re: [Regexkit-discussion] PCRE Support for International Text

On Mar 4, 2008, at 5:54 PM, Jonathan Dann wrote:
> Hi All,
>
> I found RegexKit after being pointed to it by a reply on the cocoa-dev
> list, it's ace.  I was then informed that PCRE may not yet support
> proper word breaks (sorry if that's not the correct terminology) in
> scripts like Kanji.  Is this still the case?  I'm expecting to support
> more languages other than English in my app and this may end up being a
> headache for me.

Unicode is pretty complex, and I'm an English-only speaker, so I'm at
a disadvantage as to what a "word break" means in this precise
context.  I'm pretty sure it's not just "space or tab" :).

PCRE has a build-time option for UTF-8 Unicode support and,
optionally, Unicode properties (the \p{} and \P{} syntax).  The PCRE
built into RegexKit includes both, so it's as Unicode-enabled as PCRE
can get.  Foundation uses UTF-16 as its abstract representation of
text, even though internally it may keep the text in whatever format
happens to be the most convenient for it.  PCRE uses only UTF-8 to
represent text, however, so things like NSRange values for a piece of
text can have wildly different values depending on whether they are
for the UTF-8 or the UTF-16 representation.  RegexKit tries to hide
these differences from you and "do the right thing", which generally
translates into "If it has to do with Foundation (NSString, etc.),
then all values are in UTF-16; otherwise, raw, low-level byte buffer
access is in UTF-8."  More specific details are covered in the
RKRegex class documentation.
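
To make the UTF-8 versus UTF-16 range difference concrete, here is a
minimal sketch in Python.  It is not RegexKit or Foundation code; the
helper name utf16_range_to_utf8 is hypothetical, and the
(location, length) pair only mimics the shape of an NSRange.

```python
def utf16_range_to_utf8(text, location, length):
    """Map a (location, length) range counted in UTF-16 code units to the
    equivalent (location, length) range counted in UTF-8 bytes.

    Assumes the range does not split a surrogate pair; if it does, the
    decode step raises UnicodeDecodeError.
    """
    utf16 = text.encode("utf-16-le")  # 2 bytes per UTF-16 code unit
    start, end = 2 * location, 2 * (location + length)
    prefix = utf16[:start].decode("utf-16-le")
    body = utf16[start:end].decode("utf-16-le")
    return len(prefix.encode("utf-8")), len(body.encode("utf-8"))


text = "naïve 日本語 text"
# "日本語" starts at UTF-16 offset 6 and is 3 code units long, but in
# UTF-8 it starts at byte 7 and is 9 bytes long.
print(utf16_range_to_utf8(text, 6, 3))  # (7, 9)
```

For pure ASCII text the two ranges coincide, which is why the
difference is easy to miss when testing with English-only input.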

I'd recommend looking at
http://regexkit.sourceforge.net/Documentation/pcre/pcresyntax.html#SEC5
and
http://regexkit.sourceforge.net/Documentation/pcre/pcrepattern.html#SEC3
(the section regarding Unicode).  There is probably a Unicode
property that covers "word break", or something like it.  For example,
\p{L} matches any Unicode letter, so something like \p{L}+ could be
used to match all 'words', but again this is from an English-only
speaker.
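
As a rough illustration of the "runs of letters" idea, here is a
sketch using Python's re module as a stand-in.  Python's re does not
support \p{L}, so the class [^\W\d_] (a word character that is not a
digit or underscore) is used as an approximation; it is an assumption
of this sketch, not the PCRE pattern itself, and it also admits a few
non-letter numeric characters.

```python
import re

# Approximation of PCRE's \p{L}+ : one or more "word characters" that are
# neither digits nor underscores.
LETTER_RUN = re.compile(r"[^\W\d_]+")


def letter_runs(text):
    """Return every maximal run of letters found in text."""
    return LETTER_RUN.findall(text)


print(letter_runs("Größe über 42 日本語 テスト"))
# ['Größe', 'über', '日本語', 'テスト']

# Text written without spaces comes back as a single run, so a run of
# letters is not the same thing as a word in such scripts.
print(letter_runs("日本語のテキスト"))
# ['日本語のテキスト']
```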

The other operators typically used for matching "word breaks" are \b
(a zero-width assertion that matches at a word boundary) and \B (its
logical opposite).  However, under PCRE, 'word character' means ASCII
word characters only, though I would think that an equivalent could
be fashioned out of the \p{} Unicode properties.  The ICU
documentation is unclear on what \b matches precisely, other than
'word to non-word character transition', and 'seems' to point to a
different ICU API for doing 'better word boundary analysis'.  See
http://www.icu-project.org/userguide/regexp.html and
http://www.icu-project.org/userguide/boundaryAnalysis.html .
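
The effect of an ASCII-only \b, and one way a boundary might be
fashioned from a letter class, can be sketched in Python.  The
re.ASCII flag is used here only to imitate the ASCII-only behaviour
described above, and [^\W\d_] again stands in for \p{L}; in PCRE the
analogous lookaround would presumably be written with \p{L}, which
this sketch does not test.

```python
import re

LETTER = r"[^\W\d_]"  # stand-in for \p{L}

# A boundary is any position where exactly one side is a letter.
LETTER_BOUNDARY = rf"(?:(?<!{LETTER})(?={LETTER})|(?<={LETTER})(?!{LETTER}))"


def ascii_words(text):
    """Words as seen by \\b when only ASCII counts as a word character."""
    return re.findall(r"\b\w+\b", text, flags=re.ASCII)


def letter_words(text):
    """Words delimited by the hand-built, letter-based boundary."""
    return re.findall(rf"{LETTER_BOUNDARY}{LETTER}+{LETTER_BOUNDARY}", text)


print(ascii_words("naïve café"))   # ['na', 've', 'caf']
print(letter_words("naïve café"))  # ['naïve', 'café']
```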

A zero-order approximation gives me the impression that PCRE and ICU
are equally capable of finding 'simple' words (those composed of
letters), but neither is capable of complex 'word breaking' by
itself, such as handling something like "that's", in which the ' is
clearly a part of the word.
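
For the "that's" case specifically, a regex can be nudged a little
further by allowing an apostrophe only when it sits between letters.
The sketch below is a heuristic under that assumption, written with
Python's re as a stand-in; it is not a full word-breaking algorithm
and says nothing about scripts that do not separate words with spaces.

```python
import re

LETTER = r"[^\W\d_]"  # stand-in for \p{L}

# Letters, optionally followed by (apostrophe + letters) any number of times.
# Both the ASCII apostrophe and the typographic one (U+2019) are accepted.
WORD = re.compile(rf"{LETTER}+(?:['\u2019]{LETTER}+)*")


def words_with_apostrophes(text):
    """Return words, keeping apostrophes that occur inside a word."""
    return WORD.findall(text)


print(words_with_apostrophes("that's what the dog's owner said"))
# ["that's", 'what', 'the', "dog's", 'owner', 'said']

# Quotation marks around a word are not absorbed into it.
print(words_with_apostrophes("'quoted' isn't"))
# ['quoted', "isn't"]
```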

I'd recommend the PCRE mailing list for a more informed opinion, which
can be found at http://lists.exim.org/mailman/listinfo/pcre-dev .  There
you'll find the developers and likely someone who can give you a more
authoritative answer than my speculation.

Looking to the future, the latest release started a move to
generalize which regex pattern matching library is used, with the
first obvious candidate being the ICU library that ships with Mac OS
X.  I have ICU pattern matching working in extremely rough form now,
but I'm not terribly happy with the way it's going.  The ICU library
presents a number of problems, ones that I didn't really take into
consideration when I first started this project.  The first is that
the ICU regex API was clearly not designed with multi-threading in
mind.  Using the C API, when a regex is compiled, a "regex matcher"
is returned.  You then "set" the string (which /must/ be UTF-16) that
the regex is matched against.  This mixes the state of what's being
matched, and how far into the string the current match is, with the
compiled regex.  This requires "compiling" the regex for each thread,
and an awful lot of overhead and per-thread information.  It's a
massive inconvenience, and right now I'm not sure it's actually worth
all the effort.

There's also the fact that every string needs to be converted into
UTF-16 before it can be matched by ICU.  While PCRE requires
everything to be converted to UTF-8, I've found that in practice this
isn't actually a problem, as most of the time strings seem to be kept
in UTF-8 or a UTF-8 compatible encoding (i.e., ASCII).  RegexKit
expends a lot of effort to get access to the raw NSString buffer to
avoid the constant allocation and destruction of temporary strings
for a one-time match, but that raw buffer obviously has to be in a
UTF-8 compatible encoding for that to happen.  It would seem that
using ICU would require an almost constant conversion of strings to
UTF-16 for a one-time match, but this is very much application and
usage sensitive.
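
The threading concern above (match state stored alongside the compiled
pattern, hence one compiled matcher per thread) can be sketched as
follows.  This is Python, not the ICU C API: StatefulMatcher,
matcher_for_thread and find_all are hypothetical names that only mimic
the design as described, with a thread-local cache standing in for the
per-thread bookkeeping.

```python
import re
import threading


class StatefulMatcher:
    """Hypothetical matcher that keeps the subject text and the current
    match position together with the compiled pattern."""

    def __init__(self, pattern):
        self._regex = re.compile(pattern)
        self._text = ""
        self._pos = 0

    def set_text(self, text):
        # Replaces the subject and resets the position: shared, mutable state.
        self._text = text
        self._pos = 0

    def find_next(self):
        if self._pos > len(self._text):
            return None
        m = self._regex.search(self._text, self._pos)
        if m is None:
            return None
        # Step past the match; move at least one position on an empty match.
        self._pos = m.end() if m.end() > m.start() else m.end() + 1
        return m.group(0)


_local = threading.local()


def matcher_for_thread(pattern):
    """Return this thread's own matcher, compiling it on first use."""
    cache = getattr(_local, "cache", None)
    if cache is None:
        cache = _local.cache = {}
    if pattern not in cache:
        cache[pattern] = StatefulMatcher(pattern)
    return cache[pattern]


def find_all(pattern, text):
    matcher = matcher_for_thread(pattern)
    matcher.set_text(text)
    hits = []
    while True:
        hit = matcher.find_next()
        if hit is None:
            return hits
        hits.append(hit)
```

Sharing one StatefulMatcher between threads would let one thread's
set_text reset another thread's position mid-scan, which is why the
sketch pays the cost of one compile per thread per pattern.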

On top of those issues, the ICU library that ships with OS X is
'technically' not supported for developer use; it's mostly there for
Apple's internal use.  So by using it, you could essentially be
considered to be using an 'unpublished, private API'.  The regex
syntax of PCRE is also much richer than the regex syntax provided by
ICU.  Examples include named subcaptures (not just $1 numbers, but
$name for a subpattern, very handy), conditional subpatterns,
recursive subpatterns, subpattern subroutines, etc.  From what I can
tell, PCRE has every feature of the ICU regex pattern matcher, and a
lot more.  The only area I'm not entirely sure about is the
particulars of the \p{} Unicode property support: the ICU
documentation only says that it supports it, but gives no other
details as to what exactly is supported.  In general, the PCRE regex
engine tends to be one of the fastest regex matchers to boot.
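
To show what named subcaptures and conditional subpatterns buy you,
here is a small sketch in Python's re module, used as a stand-in.
The (?P<name>...) spelling is also accepted by PCRE; the exact form of
the conditional in PCRE should be checked against the pcrepattern
documentation, and recursive subpatterns are not shown because
Python's re does not have them.  The address pattern itself is a toy,
not a real e-mail validator.

```python
import re

ADDRESS = re.compile(
    r"^(?P<open><)?"                 # optional "<", captured by name
    r"(?P<addr>\w+@\w+(?:\.\w+)+)"   # the address, captured by name
    r"(?(open)>)$"                   # require ">" only if "<" was matched
)


def extract_address(text):
    """Return the address if text is 'a@b.c' or '<a@b.c>', else None."""
    m = ADDRESS.match(text)
    return m.group("addr") if m else None


print(extract_address("<user@example.com>"))  # user@example.com
print(extract_address("user@example.com"))    # user@example.com
print(extract_address("<user@example.com"))   # None
```

Referring to the capture as "addr" rather than by number keeps the
calling code stable if groups are later added to the pattern.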

I've attached my consolidated ICU Regex C API header file in case the
ICU library is a drop-dead requirement for you.  It covers the C API
of the ICU regex matcher only, nothing else.  It's an all-in-one
file; nothing else needs to be included (I think, at least not from
the ICU headers).  You'll need to link against
/usr/lib/libicu.dylib (typically just a -licu to the compiler/linker).
The ICU documentation can be found at
http://www.icu-project.org/apiref/icu4c/uregex_8h.html .
Hopefully it's enough to give you a head start if that's the route
you need to go.  Again, I'm not sure if the next version of RegexKit
will include ICU support, even though I've got it limping along right
now, because of the problems I outlined above.  It probably will
make it in eventually, but it's sort of iffy for the next release.