On Tue, Jul 27, 2010 at 3:36 PM, Michael Grady <mishonok@apple.com> wrote:
Thanks Andy,
(comments inline)

On Jul 27, 2010, at 1:36 PM, Andy Heninger wrote:
On Tue, Jul 27, 2010 at 10:40 AM, Michael Grady <mishonok@apple.com> wrote:
Hi all,
This regex proposal is to change the signature of a regex group() method that was added in ICU 4.4 (still draft) as part of a batch of UText regex work.
Please also see the question at the bottom on uregex.

Looks good to me, with a couple of comment further down about whether we need to keep the older form of the UText based group() functions.

  - Andy

Feedback by: Wed Aug 4
Designated API reviewer: Andy


In order to disambiguate method signatures between
UText *group(UText *dest, UErrorCode &status)
UnicodeString group(int32_t groupNum, UErrorCode &status)
I had used the following technique:
enum MatcherDestIsUTextFlag { MATCHER_DEST_IS_UTEXT };
UText *group(UText *dest, MatcherDestIsUTextFlag flag, UErrorCode &status)

This proposal is to change the disambiguation of the method with something more useful.  Andy gets credit for the idea!
-Return capture group length - takes care of the signature
-Modify the behavior to return a UText that is an immutable shallow clone of the live input string in the matcher
-Point the returned UText to the start of a capture group

Benefits:  Performance advantage from no string copying.  Sill provides a path to get 64bit indexes of a match.

/** * Returns a shallow clone of the entire live input string with the UText current native index * set to the beginning of the requested group. * Note that copying the entire input string may cause significant performance and memory issues. * @param dest The UText into which the input should be copied, or NULL to create a new UText * @param group_len A reference to receive the length of the desired capture group * @param status A reference to a UErrorCode to receive any errors. * Possible errors are U_REGEX_INVALID_STATE if no match * has been attempted or the last match failed and * U_INDEX_OUTOFBOUNDS_ERROR for a bad capture group number. * @return dest if non-NULL, a shallow clone of the input text otherwise * * @draft ICU 4.6 */ virtual UText *group(UText *dest, int64_t &group_len, UErrorCode &status) const; virtual UText *group(int32_t groupNum, UText *dest, int64_t &group_len, UErrorCode &status) const;

This would not affect the existing method:
UText *group(int32_t groupNum, UText *dest, UErrorCode &status) const;
which will continue to return mutable UText/deep clones.

Is there any particular reason to keep this variant of group(), or can/should we dump it too?

To be honest, I was hoping you could clue me in on the usage patterns you had in mind that would make shallow clones more appealing than deep clones :)
I can see performance, but then at some point you still want to extract strings from the UText.  The deep clone UText API provide much more parity with the UnicodeString-based regex API (and UChar API in uregex), so if someone is moving to 64bit indexes, there's much less code to munge.
The shallow clone API are a new model and from my modifications to the test code so far, it's a bit more cumbersome to do your own extraction, but there may be real world usage scenarios where deferring extraction is useful, or perhaps minimal inspection of the UText (i.e. utext_current32) without extracting the whole group.

My suspicion is that very few people will use the UText APIs directly, that UText will generally appear in wrappers that are providing an interface to some other string type.  The creators of the wrappers will probably have a good understanding of what they are doing, and the wider group of developers won't normally see UText at all.  But maybe I'm wrong.  I do know that I would be annoyed if I had to explicitly create UTexts in order to use regular expressions in an application.

  - Andy

Question: At Apple, we don't seem to have a need for uregex_ API wrapped around this stuff yet. If there is interest, I can include it in the implementation.


/** Returns a shallow immutable clone of the entire input string.  The returned UText current native index
  *   is set to the beginning of the requested capture group.  The capture group length is also
  *   returned via groupLength.
  * Group #0 is the complete string of matched text.
  * Group #1 is the text matched by the first set of capturing parentheses.
  *   @param   regexp       The compiled regular expression.
  *   @param   groupNum     The capture group to extract.  Group 0 is the complete
  *                         match.  The value of this parameter must be
  *                         less than or equal to the number of capture groups in
  *                         the pattern.
  *   @param   dest         A mutable UText in which to store the current input.
  *                         If NULL, a new UText will be created as an immutable shallow clone
  *                         of the entire input string.
  *   @param   groupLength  The group length of the desired capture group.
  *   @param   status       A reference to a UErrorCode to receive any errors.
  *   @return               The subject text currently associated with this regular expression.
  *                         If a pre-allocated UText was provided, it will always be used and returned.

  *   @draft ICU 4.6
uregex_groupUTextShallow(URegularExpression *regexp,
                  int32_t             groupNum,
                  UText              *dest,
                  int64_t            *groupLength,
                  UErrorCode         *status);

I think that I would favor replacing the existing uregex_groupUText() with this new one.  I don't see any need to have both, and having both could be confusing - it's not obvious which would be preferred, or why.

See comment above, but the "Shallow" suffix is intended to make it more appealing to some performance oriented clients, but I'm not sure what those would be.

The Palm PDK Hot Apps Program offers developers who use the
Plug-In Development Kit to bring their C/C++ apps to Palm for a share
of $1 Million in cash or HP Products. Visit us here for more details:
icu-design mailing list
To Un/Subscribe: https://lists.sourceforge.net/lists/listinfo/icu-design