Re: [Regexkit-discussion] PCRE Support for International Text
From: John E. <joh...@gm...> - 2008-03-05 02:09:53
On Mar 4, 2008, at 5:54 PM, Jonathan Dann wrote:
> Hi All,
>
> I found RegexKit after being pointed to it by a reply on the cocoa-dev
> list, it's ace. I was then informed that PCRE may not yet support
> proper word breaks (sorry if that's not the correct terminology) in
> scripts like Kanji. Is this still the case? I'm expecting to support
> more languages other than English in my app and this may end up being a
> headache for me.
Unicode is pretty complex, and I'm an English-only speaker, so I'm at
a disadvantage as to what a "word break" means in this precise
context. I'm pretty sure it's not "space or tab" :).
PCRE has a build-time option for supporting UTF-8 Unicode and,
optionally, Unicode properties (the \p{} & \P{} syntax). The PCRE
built into RegexKit includes both, so it's as Unicode enabled as PCRE
can get. Foundation uses UTF-16 as its abstract representation of
text, even though internally it may keep the text in any format that
happens to be the most convenient for it. PCRE uses only UTF-8 to
represent text, however, so things like NSRange values for a piece of
text can have two wildly different values depending on whether
they are for the UTF-8 or UTF-16 representation. RegexKit tries to
hide these differences from you and "do the right thing", which
generally translates into "If it has to do with Foundation (NSString,
etc), then all values are in UTF-16, otherwise raw, low level byte
buffer access is in UTF-8." More specific details are covered in the
RKRegex class documentation.
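To make the UTF-8 versus UTF-16 difference concrete, here is a minimal
Python sketch. It is not RegexKit code, and the helper names
utf16_range and utf8_range are made up for illustration. It computes
the (location, length) pair for the same substring in each encoding,
which is roughly how an NSRange-style value would differ:

```python
def utf16_range(text, start, end):
    """(location, length) of text[start:end] counted in UTF-16 code units."""
    location = len(text[:start].encode("utf-16-le")) // 2
    length = len(text[start:end].encode("utf-16-le")) // 2
    return location, length


def utf8_range(text, start, end):
    """(location, length) of text[start:end] counted in UTF-8 bytes."""
    location = len(text[:start].encode("utf-8"))
    length = len(text[start:end].encode("utf-8"))
    return location, length


# "naive" with a diaeresis, an emoji outside the BMP, then three Kanji.
text = "na\u00efve \U0001F600 \u65e5\u672c\u8a9e"
kanji = "\u65e5\u672c\u8a9e"
start = text.index(kanji)
end = start + len(kanji)

print(utf16_range(text, start, end))  # (9, 3)
print(utf8_range(text, start, end))   # (12, 9)
```

The same three characters sit at offset 9 with length 3 in UTF-16, but
at offset 12 with length 9 in UTF-8.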
I'd recommend looking at http://regexkit.sourceforge.net/Documentation/pcre/pcresyntax.html#SEC5
and http://regexkit.sourceforge.net/Documentation/pcre/pcrepattern.html#SEC3
(the section regarding Unicode). There is probably a Unicode
property that covers "word break", or something like it. For example,
\p{L} matches any Unicode letter, so something like \p{L}+ could be
used to match all 'words', but again this is from an English-only
speaker.
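As a rough illustration of what \p{L}+ would pick out, here is a
Python sketch. Python's built-in re module does not support \p{},
so this swaps in a different technique: it groups characters whose
Unicode general category starts with "L". The function names are
hypothetical, and this only approximates the PCRE behaviour:

```python
import unicodedata
from itertools import groupby


def is_letter(ch):
    """True when the character's general category is L* (Lu, Ll, Lo, ...)."""
    return unicodedata.category(ch).startswith("L")


def letter_runs(text):
    """Return maximal runs of letters, similar in spirit to \\p{L}+."""
    return ["".join(group) for letter, group in groupby(text, key=is_letter) if letter]


# "cafe" with an acute accent, a short Japanese phrase, and an English word.
sample = "caf\u00e9 \u65e5\u672c\u8a9e\u306e\u30c6\u30ad\u30b9\u30c8 that's"
for run in letter_runs(sample):
    print(run)
```

Note that the Japanese phrase comes back as a single run even though
it contains more than one word, and "that's" is split at the
apostrophe, so a letter run is only a first approximation of a word.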
The other operators typically used for matching "word breaks" are
\b (a zero-width word boundary) and \B (its logical opposite).
However, under PCRE, 'word character' means ASCII word characters only,
but I would think that the equivalent could be fashioned out of the
\p{} Unicode properties. The ICU documentation is unclear on what \b
matches precisely, other than 'word to non-word character transition'
and 'seems' to point to a different ICU API for doing 'better word
boundary analysis'. http://www.icu-project.org/userguide/regexp.html
and http://www.icu-project.org/userguide/boundaryAnalysis.html .
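To show why an ASCII-only 'word character' matters for \b, here is a
small Python sketch. Python's re module is not PCRE, but its re.ASCII
flag gives a similar ASCII-only definition of \w and \b, so it serves
as an analogy rather than a demonstration of PCRE itself:

```python
import re

WORD = r"\b\w+\b"
text = "caf\u00e9 na\u00efve"  # two words containing accented letters

# ASCII-only word characters: the accented letters act as word breaks.
ascii_words = re.findall(WORD, text, flags=re.ASCII)

# Unicode-aware word characters: the accented letters are part of the word.
unicode_words = re.findall(WORD, text)

print(ascii_words)
print(unicode_words)
```

With the ASCII-only definition the two words are chopped into three
fragments at the accented letters; with the Unicode-aware definition
they come back whole.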
A zero-order approximation gives me the impression that PCRE and
ICU are equally capable of finding 'simple' words (those composed
of letters), but neither is capable of complex 'word breaking' by
itself, such as word breaking something like "that's", in which
the ' is clearly a part of the word.
I'd recommend the PCRE mailing list for a more informed opinion, which
can be found at http://lists.exim.org/mailman/listinfo/pcre-dev . There
you'll find the developers and likely someone who can give you a more
authoritative answer than my speculation.
Looking to the future, the latest release started a move to generalize
which regex pattern matching library is used, with the first obvious
candidate being the ICU library that ships with Mac OS X. I have ICU
pattern matching working in extremely rough form now, but I'm not
terribly happy with the way it's going. The ICU library presents a
number of problems, ones that I didn't really take into consideration
when I first started this project. The first is that the ICU regex API
was clearly not designed with multi-threading in mind. Using the C API,
when a regex is compiled, a "regex matcher" is returned. You then
"set" the string (which /must/ be UTF-16) that the regex is to match.
This mixes the state of what's being matched, and how far into the
string the current match is, with the compiled regex itself. This
requires "compiling" the regex for each thread, and an awful lot of
overhead and per-thread information. It's a massive inconvenience, and right
now I'm not sure it's actually worth all the effort. There's also the
fact that every string needs to be converted into UTF-16 before it
can be matched by ICU. While PCRE requires everything to be converted
to UTF-8, I've found that in practice this isn't actually a problem as
most of the time strings seem to be kept in UTF-8 or a UTF-8
compatible encoding (i.e., ASCII). RegexKit expends a lot of effort to
get access to the raw NSString buffer to avoid the constant allocation
and destruction of temporary strings for a one time match, but that
raw buffer obviously has to be in a UTF-8 compatible encoding for that
to happen. It would seem that using ICU would require an almost
constant conversion of strings to UTF-16 for a one time match, but
this is very much application and usage sensitive.
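The threading problem can be sketched with a toy model in Python.
This is not the ICU API: StatefulMatcher, matcher_for_thread and
find_all are hypothetical names, and the class only imitates the idea
of a compiled regex that also holds the text and the current match
position. Because that state is mutable, the sketch keeps one matcher
per thread rather than sharing a single one:

```python
import re
import threading


class StatefulMatcher:
    """Toy model: a compiled pattern bundled with the text and position."""

    def __init__(self, pattern):
        self._regex = re.compile(pattern)
        self._text = ""
        self._pos = 0

    def set_text(self, text):
        # Setting the text resets the match position, so two threads
        # sharing one matcher would trample each other's progress.
        self._text = text
        self._pos = 0

    def find_next(self):
        if self._pos > len(self._text):
            return None
        match = self._regex.search(self._text, self._pos)
        if match is None:
            return None
        # Step past empty matches so the loop always makes progress.
        self._pos = match.end() if match.end() > match.start() else match.end() + 1
        return match.group(0)


_local = threading.local()


def matcher_for_thread(pattern):
    """Return this thread's own matcher, "compiling" it on first use."""
    cache = getattr(_local, "cache", None)
    if cache is None:
        cache = _local.cache = {}
    if pattern not in cache:
        cache[pattern] = StatefulMatcher(pattern)
    return cache[pattern]


def find_all(pattern, text):
    matcher = matcher_for_thread(pattern)
    matcher.set_text(text)
    hits = []
    while True:
        hit = matcher.find_next()
        if hit is None:
            return hits
        hits.append(hit)
```

The per-thread cache is the overhead described above: every thread
that uses a pattern pays for its own compile and its own bookkeeping.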
On top of those issues, the ICU library that ships with OS X is
'technically' not supported for developer use; it's mostly there for
Apple's internal use. So by using it, you could essentially be
considered to be using an 'unpublished, private API'. The regex
syntax of PCRE is also much richer than the regex syntax provided by
ICU. Examples include Named Subcapture (not just $1 numbers, but
$name for a subpattern, very handy), conditional subpatterns,
recursive subpatterns, subpattern subroutines, etc. From what I can
tell, PCRE has every feature of the ICU regex pattern matcher, and a
lot more. The only area I'm not entirely sure about is the particulars
regarding the \p{} Unicode property support; the ICU documentation
only says that it supports it, but gives no other details as to what
exactly is supported. In general, the PCRE regex engine tends to be
one of the fastest regex matchers to boot.
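For anyone who hasn't used named subcaptures, here is a short sketch
of the idea in Python rather than RegexKit. It uses the
(?P<name>...) group syntax, which PCRE also accepts; the pattern and
group names are just examples:

```python
import re

# Three named groups instead of anonymous $1, $2, $3 captures.
DATE = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")

match = DATE.search("posted 2008-03-05 02:09:53")
parts = match.groupdict()

print(parts["year"], parts["month"], parts["day"])
```

Referring to a capture by name keeps working if another group is
later added in front of it, which is where numbered captures tend to
break.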
I've attached my consolidated ICU Regex C API header file in case the
ICU library is a drop-dead requirement for you. This is for the C API
of the ICU Regex matcher only, nothing else. It's an all-in-one
file; nothing else needs to be included (I think, at least not from
the ICU headers). You'll need to link against
/usr/lib/libicucore.dylib (typically just -licucore to the
compiler/linker). The ICU
documentation can be found at http://www.icu-project.org/apiref/icu4c/uregex_8h.html
Hopefully it's enough to give you a head start if that's the route
you need to go. Again, because of the problems I outlined above, I'm
not sure the next version of RegexKit will include ICU support, even
though I've got it limping along right now. It probably will
make it in eventually, but it's sort of iffy for the next release.