Re: [icu-design] Proposals for a UText-based regular expression engine

On Mon, Dec 7, 2009 at 8:38 PM, Peter Edberg <pe...@ap...> wrote:

> OK, I have had 3 responses on the proposal copied below since I sent it out
> Oct 26:
>
> 1. Mark responded (Oct 26): "Adding UText support would be a good addition.
> I have no objection to extending the regex to handle UText by changing the
> common engine, if we have a good set of performance tests that show that
> before and after are comparable in performance. By comparable, I don't mean
> exactly the same: if new = old + 2%, no problem; if new = old + 25%, there's
> a problem. If the performance difference is too great, then we'd be better
> off having two different sets of APIs and code, so that the direct version
> could keep the same performance." => performance concern
>
> 2. Andy responded (Oct 26): "Another consideration is UTF-8 support.  UText
> can handle it, but I suspect that the performance hit would be substantially
> greater than with UText wrapping a UTF-16 based string.  While we are
> opening up regexp, we should at least think about how a UTF-8 specific API
> would fit in, even if we aren't doing it right now." => concern about adding
> parallel UTF8 API
>
> 3. Markus responded (Nov 16): "In addition, I want a UText implementation
> to be able to handle UTF-8 text, even if that's not the fastest way to
> support UTF-8. UText is designed to do that and more, but it takes a little
> extra code and testing, and wasn't done for DBBI+UText." => concern about
> making sure UText works for UTF8.
>
> Our current plan is to roll this into our Apple sources first, do some
> performance testing and tweaking, then roll it into a branch off of ICU
> trunk. We will address Mark's performance concerns during our
> testing/tweaking phase, and report some numbers allowing the ICU team to
> then make a decision about whether to use a common core path with the
> existing regex code or to use a separate path. Also, although the code does
> now appear to work with UTF8 passed in a UText, we will verify this (to
> address Markus's concern). Regarding Andy's point: The current UText regex
> proposal basically adds a complete parallel set of C++ methods that use
> UText instead of UnicodeString, and C functions that use UText* instead of
> UChar*. Are you suggesting doing something other than adding another
> parallel set that use char*?
>

For plain C, a parallel API would probably be the only choice.  For C++ it
might be possible to do something clever with templates.  Maybe with a
default being to wrap a user's incoming type in a UText, with
specializations for string types that the implementation knows about.  At
the engine level, I can envision a template based scheme that would avoid
the need to replicate source code while producing core engine code that is
nearly optimum for different string types.  We would end up with multiple
compiled copies of some code, but the core of the engine is pretty small.
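A minimal sketch of the template idea above, assuming a hypothetical
accessor trait named TextAccess and a stand-in "engine" loop named
countCodePoint. Neither is ICU API. The primary template plays the role of
the generic fallback (where a real implementation might wrap the type in a
UText), and the std::u16string specialization plays the role of a string
type the implementation knows about.

```cpp
#include <cstddef>
#include <string>

// All names here are hypothetical illustrations, not ICU API.

// Generic fallback accessor. It assumes each element of Text is already a
// whole code point (true for std::u32string).
template <typename Text>
struct TextAccess {
    static std::size_t length(const Text& t) { return t.size(); }
    // Return the code point at index i and advance i past it.
    static char32_t next(const Text& t, std::size_t& i) {
        return static_cast<char32_t>(t[i++]);
    }
};

// Specialization for UTF-16 storage: decodes surrogate pairs inline, so the
// compiler can generate a tight loop for this string type.
template <>
struct TextAccess<std::u16string> {
    static std::size_t length(const std::u16string& t) { return t.size(); }
    static char32_t next(const std::u16string& t, std::size_t& i) {
        char16_t lead = t[i++];
        if (lead >= 0xD800 && lead <= 0xDBFF && i < t.size()) {
            char16_t trail = t[i];
            if (trail >= 0xDC00 && trail <= 0xDFFF) {
                ++i;
                return 0x10000 + ((char32_t(lead) - 0xD800) << 10)
                               + (char32_t(trail) - 0xDC00);
            }
        }
        return lead;  // BMP code point, or an unpaired surrogate passed through
    }
};

// One copy of the source; the compiler emits one compiled copy per Text type.
template <typename Text>
std::size_t countCodePoint(const Text& text, char32_t target) {
    std::size_t count = 0;
    std::size_t i = 0;
    const std::size_t len = TextAccess<Text>::length(text);
    while (i < len) {
        if (TextAccess<Text>::next(text, i) == target) {
            ++count;
        }
    }
    return count;
}
```

Calling countCodePoint on a std::u32string uses the generic accessor, while
calling it on a std::u16string picks the specialization. The loop is written
once and compiled twice, which is the trade-off described above.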

For really optimized UTF-8, we would want to change the compiled patterns
to keep string and multi-byte character literals directly in UTF-8.  How
this would all hang together needs more thought.

There are some existing optimizations that are almost certainly broken for
UTF-8, having to do with the shortest and longest possible matches for a
pattern, or a part of a pattern.
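To illustrate why length-based optimizations are sensitive to the encoding,
here is a small sketch, not taken from the ICU sources, that measures the
same literal in UTF-16 code units and in UTF-8 bytes. The helper names
(utf16Units, utf8Units, literalLength) are hypothetical. A minimum or
maximum match length computed in one unit cannot be compared against
offsets expressed in the other.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helpers for illustration only; not ICU API.

// Number of UTF-16 code units needed to store one code point.
inline std::size_t utf16Units(char32_t cp) {
    return cp >= 0x10000 ? 2 : 1;
}

// Number of UTF-8 bytes needed to store one code point
// (assumes cp is a valid Unicode scalar value).
inline std::size_t utf8Units(char32_t cp) {
    if (cp < 0x80) return 1;
    if (cp < 0x800) return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

// Storage length of a literal in each encoding form.
struct LiteralLength {
    std::size_t utf16;
    std::size_t utf8;
};

inline LiteralLength literalLength(const std::u32string& literal) {
    LiteralLength len{0, 0};
    for (char32_t cp : literal) {
        len.utf16 += utf16Units(cp);
        len.utf8 += utf8Units(cp);
    }
    return len;
}
```

For example, the four-character literal "caf" followed by U+00E9 occupies 4
UTF-16 code units but 5 UTF-8 bytes, and a single "any character" match can
span 1 to 2 units in UTF-16 but 1 to 4 bytes in UTF-8.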

(There's a certain temptation to reenter the performance race with PCRE
here.  ICU has been sitting still for several years now.)

It's also tempting to try to do something to detect and un-stick
exponential-time patterns.  Perl has some ideas here.
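One simple way to un-stick a runaway match is a step budget: count the
work done while backtracking and abort once a limit is reached. The sketch
below is an assumption-laden illustration, not Perl's approach and not ICU
code. It uses a toy matcher in the style of Kernighan and Pike (literals,
'.', and '*' only), and the names Outcome, Budget, and matchPrefix are
hypothetical. This toy has no groups, so its worst case is polynomial
rather than exponential, but the abort mechanism is the same.

```cpp
#include <cstddef>

// Hypothetical names for illustration; not ICU or Perl API.
enum class Outcome { Match, NoMatch, Aborted };

// Counts remaining steps; spend() returns false once the budget is used up.
struct Budget {
    std::size_t remaining;
    bool spend() {
        if (remaining == 0) return false;
        --remaining;
        return true;
    }
};

static Outcome matchHere(const char* re, const char* text, Budget& b);

// Match c* followed by the rest of the pattern, trying the shortest run
// of c first and extending it one character at a time.
static Outcome matchStar(char c, const char* re, const char* text, Budget& b) {
    for (;;) {
        Outcome r = matchHere(re, text, b);
        if (r != Outcome::NoMatch) return r;  // propagate Match or Aborted
        if (*text == '\0' || (c != '.' && *text != c)) return Outcome::NoMatch;
        ++text;
    }
}

// Match the pattern at the start of text. Each call costs one step.
static Outcome matchHere(const char* re, const char* text, Budget& b) {
    if (!b.spend()) return Outcome::Aborted;
    if (re[0] == '\0') return Outcome::Match;
    if (re[1] == '*') return matchStar(re[0], re + 2, text, b);
    if (*text != '\0' && (re[0] == '.' || re[0] == *text)) {
        return matchHere(re + 1, text + 1, b);
    }
    return Outcome::NoMatch;
}

// Anchored prefix match that gives up after maxSteps calls to matchHere.
Outcome matchPrefix(const char* re, const char* text, std::size_t maxSteps) {
    Budget b{maxSteps};
    return matchHere(re, text, b);
}
```

With a pattern such as "a*a*a*a*b" and an input of thirty 'a' characters,
a small budget reports Aborted instead of grinding through every way of
splitting the input among the four stars, while a large budget eventually
reports NoMatch.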

  - Andy

> Regarding the actual API proposal, it sounds like there is no objection:
> • add methods on RegexPattern & RegexMatcher that parallel existing
> UnicodeString methods but take/return UText
> • add C functions that parallel existing UChar* functions but take/return
> UText*
> • add comparison functions (and UTEXT_CURRENT32 macro) for UText
>
> Correct? Since this has been proposed for more than a week, I suppose I
> can assume so. (We'd like to confirm this so we can proceed with our
> roll-in and testing.)
> Thanks,
> -Peter E
>
> On Oct 26, 2009, at 11:25 AM, Peter Edberg wrote:
>
> > Last summer, Apple intern Jordan Rose sent the proposal copied below
> > regarding changing the ICU4C regular expression engine to operate on
> > UTexts (this proposal reflected some initial feedback from Andy). The
> > document links in the proposal no longer work, but I have moved their
> > content into the following design doc:
> > <http://sites.google.com/site/icusite/design/utext-for-regex>.
> >
> > This work also includes changing the internal indices in the regex
> > engine to be 64-bit instead of 32-bit (external indices & lengths remain
> > 32-bit). Performance via the UText interface, although faster in a few
> > specific cases, is generally somewhat slower than via UnicodeString or
> > UChar*, up to about 1.5x in the worst case.
> >
> > We would like to move ahead with rolling this in to ICU, and hence I am
> > reposting this to collect further feedback, concerns, etc. The work will
> > probably be done by longtime Apple employee Michael Grady, who is joining
> > the Apple ICU team and will soon be participating in some of the
> > Wednesday meetings.
> >
> > Thanks,
> > -Peter E
> >
> >> Date: Fri, 1 Aug 2008 20:08:30 -0700
> >> From: Jordan Rose <jor...@ap...>
> >> Subject: [icu-design] Proposals for a UText-based regular expression
> >>      engine
> >> To: icu...@li...
> >> Message-ID: <037...@ap...>
> >> Content-Type: text/plain; charset="us-ascii"
> >>
> >> Hello, all. I'm a summer intern working at Apple, and for the past few
> >> weeks I've been working on converting the ICU regular expression
> >> engine to use UText as per
> >> http://bugs.icu-project.org/trac/ticket/4521. I'd like to put
> >> forward this group of proposals aimed at getting
> >> these changes integrated. All existing UnicodeString-based
> >> functionality would remain...functional...and the benchmarks I've run
> >> indicate performance is not affected overmuch (certainly outweighed by
> >> the time and memory costs of converting to a UnicodeString from an
> >> external format). There are four separate proposals here to deal with
> >> four areas of API change, in UText and in the regular expression
> >> engine itself.
> >>
> >> I have been working with Deborah Goldsmith, Andy Heninger, and Peter
> >> Edberg during the past few weeks to get this working. The
> >> specification is fully implemented on my local copy of ICU, and
> >> unit-tested on i386, x86_64, ppc, and ppc64.
> >>
> >>
> >>
> >> In case anyone has trouble viewing the attachments, the files are also
> >> available online:
> >>
> >>
> >> http://www.fileden.com/files/2006/10/28/326783/UText-UTEXT_CURRENT32.txt
> >> http://www.fileden.com/files/2006/10/28/326783/UText-comparison.txt
> >> http://www.fileden.com/files/2006/10/28/326783/Regex.txt
> >> http://www.fileden.com/files/2006/10/28/326783/URegex.txt
> >>
> >> Given that Andy hasn't yet had time to review all of the internal
> >> changes, some of the API may need to be modified.
> >>
> >> Unfortunately, I am not going to have Internet access during the next
> >> week, but I thought it would be best to send this out early to
> >> generate discussion. As such, questions and comments are definitely
> >> welcome, but I won't be able to respond to them immediately. This is
> >> definitely a major change and care should be taken in integrating it
> >> into ICU.
> >>
> >> Thank you,
> >> Jordan Rose
> >> Apple Inc.
> >
>
>
>
