From: Andy H. <and...@gm...> - 2010-07-27 20:36:19
|
On Tue, Jul 27, 2010 at 10:40 AM, Michael Grady <mis...@ap...> wrote: > Hi all, > This regex proposal is to change the signature of a regex group() method > that was added in ICU 4.4 (still draft) as part of a batch of UText regex > work. > Please also see the question at the bottom on uregex. > Looks good to me, with a couple of comment further down about whether we need to keep the older form of the UText based group() functions. - Andy > > http://bugs.icu-project.org/trac/ticket/<http://bugs.icu-project.org/trac/ticket/N> > 7855 > > Feedback by: Wed Aug 4 > Designated API reviewer: Andy > > regex.h > > In order to disambiguate method signatures between > UText *group(UText *dest, UErrorCode &status) > and > UnicodeString group(int32_t groupNum, UErrorCode &status) > I had used the following technique: > enum MatcherDestIsUTextFlag { MATCHER_DEST_IS_UTEXT }; > UText *group(UText *dest, MatcherDestIsUTextFlag flag, UErrorCode &status) > > This proposal is to change the disambiguation of the method with something > more useful. Andy gets credit for the idea! > -Return capture group length - takes care of the signature > -Modify the behavior to return a UText that is an immutable shallow clone > of the live input string in the matcher > -Point the returned UText to the start of a capture group > > Benefits: Performance advantage from no string copying. Sill provides a > path to get 64bit indexes of a match. > > /** * Returns a shallow clone of the entire live input string with the > UText current native index * set to the beginning of the requested group. * > Note that copying the entire input string may cause significant performance > and memory issues. * @param dest The UText into which the input should be > copied, or NULL to create a new UText * @param group_len A reference to > receive the length of the desired capture group * @param status A reference > to a UErrorCode to receive any errors. * Possible errors are > U_REGEX_INVALID_STATE if no match * has been attempted or the last match > failed and * U_INDEX_OUTOFBOUNDS_ERROR for a bad capture group number. * > @return dest if non-NULL, a shallow clone of the input text otherwise * * > @draft ICU 4.6 */ virtual UText *group(UText *dest, int64_t &group_len, > UErrorCode &status) const; virtual UText *group(int32_t groupNum, UText > *dest, int64_t &group_len, UErrorCode &status) const; > > This would not affect the existing method: > UText *group(int32_t groupNum, UText *dest, UErrorCode &status) const; > which will continue to return mutable UText/deep clones. > Is there any particular reason to keep this variant of group(), or can/should we dump it too? > Question: At Apple, we don't seem to have a need for uregex_ API wrapped > around this stuff yet. If there is interest, I can include it in the > implementation. > > uregex.h > > /** Returns a shallow immutable clone of the entire input string. The > returned UText current native index > * is set to the beginning of the requested capture group. The capture > group length is also > * returned via groupLength. > * Group #0 is the complete string of matched text. > * Group #1 is the text matched by the first set of capturing parentheses. > * > * @param regexp The compiled regular expression. > * @param groupNum The capture group to extract. Group 0 is the > complete > * match. The value of this parameter must be > * less than or equal to the number of capture > groups in > * the pattern. > * @param dest A mutable UText in which to store the current > input. > * If NULL, a new UText will be created as an > immutable shallow clone > * of the entire input string. > * @param groupLength The group length of the desired capture group. > * @param status A reference to a UErrorCode to receive any > errors. > * @return The subject text currently associated with this > regular expression. > * If a pre-allocated UText was provided, it will > always be used and returned. > > * > * @draft ICU 4.6 > */ > U_DRAFT UText * U_EXPORT2 > uregex_groupUTextShallow(URegularExpression *regexp, > int32_t groupNum, > UText *dest, > int64_t *groupLength, > UErrorCode *status); > > I think that I would favor replacing the existing uregex_groupUText() with this new one. I don't see any need to have both, and having both could be confusing - it's not obvious which would be preferred, or why. > > |