Re: [icu-design] Proposals for a UText-based regular expression engine

OK, I have had 3 responses on the proposal copied below since I sent it out Oct 26:

1. Mark responded (Oct 26): "Adding UText support would be a good addition. I have no objection to extending the regex to handle UText by changing the common engine, if we have a good set of performance tests that show that before and after are comparable in performance. By comparable, I don't mean exactly the same: if new = old + 2%, no problem; if new = old + 25%, there's a problem. If the performance difference is too great, then we'd be better off having two different sets of APIs and code, so that the direct version could keep the same performance." => performance concern

2. Andy responded (Oct 26): "Another consideration is UTF-8 support.  UText can handle it, but I suspect that the performance hit would be substantially greater than with UText wrapping a UTF-16 based string.  While we are opening up regexp, we should at least think about how a UTF-8 specific API would fit in, even if we aren't doing it right now." => concern about adding a parallel UTF-8 API

3. Markus responded (Nov 16): "In addition, I want a UText implementation to be able to handle UTF-8 text, even if that's not the fastest way to support UTF-8. UText is designed to do that and more, but it takes a little extra code and testing, and wasn't done for DBBI+UText." => concern about making sure UText works for UTF-8

Our current plan is to roll this into our Apple sources first, do some performance testing and tweaking, then roll it into a branch off of ICU trunk. We will address Mark's performance concerns during our testing/tweaking phase and report numbers so that the ICU team can decide whether to use a common core path with the existing regex code or a separate path. Also, although the code does now appear to work with UTF-8 passed in a UText, we will verify this (to address Markus's concern).

Regarding Andy's point: the current UText regex proposal basically adds a complete parallel set of C++ methods that use UText instead of UnicodeString, and C functions that use UText* instead of UChar*. Are you suggesting doing something other than adding another parallel set that uses char*?

Regarding the actual API proposal, it sounds like there is no objection:
• add methods on RegexPattern & RegexMatcher that parallel existing UnicodeString methods but take/return UText
• add C functions that parallel existing UChar* functions but take/return UText*
• add comparison functions (and UTEXT_CURRENT32 macro) for UText

Correct? Since this has been proposed for more than a week, I suppose I can assume so, but we'd like to confirm this so we can proceed with our roll-in and testing.
Thanks,
-Peter E
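
For readers following the thread, here is a rough sketch of why encoding-independent comparison functions are useful once the same engine may be handed UTF-16 or UTF-8 text. It does not use ICU and is not the proposed API: CodePointSource, Utf16Source, Utf8Source, and compareCodePoints are hypothetical names invented for illustration, and the UTF-8 decoder assumes well-formed input. The idea is only that a provider yields code points one at a time, so two texts can be compared without first converting either one.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>

// Hypothetical stand-in for a UText-like provider (not part of ICU):
// yields code points one at a time, whatever the underlying encoding.
struct CodePointSource {
    virtual ~CodePointSource() = default;
    // Returns the next code point, or -1 at end of text.
    virtual int32_t next32() = 0;
};

// UTF-16 backing store, comparable to text wrapped from a UChar* buffer.
class Utf16Source : public CodePointSource {
public:
    explicit Utf16Source(std::u16string s) : text_(std::move(s)) {}
    int32_t next32() override {
        if (pos_ >= text_.size()) return -1;
        char32_t c = text_[pos_++];
        // Combine a lead surrogate with a following trail surrogate.
        if (c >= 0xD800 && c <= 0xDBFF && pos_ < text_.size()) {
            char32_t t = text_[pos_];
            if (t >= 0xDC00 && t <= 0xDFFF) {
                ++pos_;
                c = 0x10000 + ((c - 0xD800) << 10) + (t - 0xDC00);
            }
        }
        return static_cast<int32_t>(c);
    }
private:
    std::u16string text_;
    std::size_t pos_ = 0;
};

// UTF-8 backing store. Assumption: the input is well-formed UTF-8.
class Utf8Source : public CodePointSource {
public:
    explicit Utf8Source(std::string s) : text_(std::move(s)) {}
    int32_t next32() override {
        if (pos_ >= text_.size()) return -1;
        unsigned char b = static_cast<unsigned char>(text_[pos_++]);
        int32_t c;
        int trail;  // number of continuation bytes that follow the lead byte
        if (b < 0x80)      { c = b;        trail = 0; }
        else if (b < 0xE0) { c = b & 0x1F; trail = 1; }
        else if (b < 0xF0) { c = b & 0x0F; trail = 2; }
        else               { c = b & 0x07; trail = 3; }
        while (trail-- > 0 && pos_ < text_.size()) {
            c = (c << 6) | (static_cast<unsigned char>(text_[pos_++]) & 0x3F);
        }
        return c;
    }
private:
    std::string text_;
    std::size_t pos_ = 0;
};

// Compares two texts in code point order.
// Returns -1, 0, or 1; a text that is a proper prefix of the other sorts first.
int compareCodePoints(CodePointSource &a, CodePointSource &b) {
    for (;;) {
        int32_t ca = a.next32();
        int32_t cb = b.next32();
        if (ca != cb) return ca < cb ? -1 : 1;
        if (ca == -1) return 0;  // both texts ended at the same point
    }
}
```

Comparing by code point rather than by code unit gives the same ordering no matter which encoding backs each text. Whether the proposed UText comparison functions use code point order is not stated in this thread, so treat that as an assumption of the sketch.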

On Oct 26, 2009, at 11:25 AM, Peter Edberg wrote:

> Last summer, Apple intern Jordan Rose sent the proposal copied below regarding changing the ICU4C regular expression engine to operate on UTexts (this proposal reflected some initial feedback from Andy). The document links in the proposal no longer work, but I have moved their content into the following design doc: <http://sites.google.com/site/icusite/design/utext-for-regex>.
> 
> This work also includes changing the internal indices in the regex engine to be 64-bit instead of 32-bit (external indices & lengths remain 32-bit). Performance via the UText interface, although faster in a few specific cases, is generally somewhat slower than via UnicodeString or UChar*, up to about 1.5x in the worst case.
> 
> We would like to move ahead with rolling this into ICU, and hence I am reposting this to collect further feedback, concerns, etc. The work will probably be done by longtime Apple employee Michael Grady, who is joining the Apple ICU team and will soon be participating in some of the Wednesday meetings.
> 
> Thanks,
> -Peter E
> 
>> Date: Fri, 1 Aug 2008 20:08:30 -0700
>> From: Jordan Rose <jor...@ap...>
>> Subject: [icu-design] Proposals for a UText-based regular expression
>> 	engine
>> To: icu...@li...
>> Message-ID: <037...@ap...>
>> Content-Type: text/plain; charset="us-ascii"
>> 
>> Hello, all. I'm a summer intern working at Apple, and for the past few
>> weeks I've been working on converting the ICU regular expression
>> engine to use UText as per http://bugs.icu-project.org/trac/ticket/4521.
>> I'd like to put forward this group of proposals aimed at getting
>> these changes integrated. All existing UnicodeString-based
>> functionality would remain...functional...and the benchmarks I've run
>> indicate performance is not affected overmuch (certainly outweighed by
>> the time and memory costs of converting to a UnicodeString from an
>> external format). There are four separate proposals here to deal with
>> four areas of API change, in UText and in the regular expression
>> engine itself.
>> 
>> I have been working with Deborah Goldsmith, Andy Heninger, and Peter
>> Edberg during the past few weeks to get this working. The
>> specification is fully implemented on my local copy of ICU, and unit-
>> tested on i386, x86_64, ppc, and ppc64.
>> 
>> 
>> 
>> In case anyone has trouble viewing the attachments, the files are also
>> available online:
>> 
>> http://www.fileden.com/files/2006/10/28/326783/UText-UTEXT_CURRENT32.txt
>> http://www.fileden.com/files/2006/10/28/326783/UText-comparison.txt
>> http://www.fileden.com/files/2006/10/28/326783/Regex.txt
>> http://www.fileden.com/files/2006/10/28/326783/URegex.txt
>> 
>> Given that Andy hasn't yet had time to review all of the internal
>> changes, some of the API may need to be modified.
>> 
>> Unfortunately, I am not going to have Internet access during the next
>> week, but I thought it would be best to send this out early to
>> generate discussion. As such, questions and comments are definitely
>> welcome, but I won't be able to respond to them immediately. This is
>> definitely a major change and care should be taken in integrating it
>> into ICU.
>> 
>> Thank you,
>> Jordan Rose
>> Apple Inc.
> 
