Re: [icu-design] API Proposal: Break Iteration and UText for ICU4C


Those seem reasonable.

(I am generally in favor of using separate setters instead of complicated
constructors; especially when combined with chaining I think it is
clearer -- although chaining is not applicable here.)
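
For illustration, the create-then-set pattern can be sketched with a toy class. `ToyBreaker`, its whitespace-only boundary rule, and the `boundaries` helper are hypothetical stand-ins chosen for brevity; none of them are ICU code.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a break iterator; NOT the ICU class.
// It is constructed with no text, and text is supplied later via setText().
class ToyBreaker {
public:
    static constexpr int DONE = -1;

    // Point the iterator at new text and reset the position to the start.
    // The text is borrowed, not copied: the caller must keep it alive.
    void setText(const std::string *text) {
        text_ = text;
        pos_ = 0;
    }

    // Advance to the next boundary: the offset just past the next space,
    // or the end of the text. Returns DONE once the end has been reported.
    int next() {
        if (text_ == nullptr || pos_ >= text_->size()) return DONE;
        std::size_t sp = text_->find(' ', pos_);
        pos_ = (sp == std::string::npos) ? text_->size() : sp + 1;
        return static_cast<int>(pos_);
    }

private:
    const std::string *text_ = nullptr;
    std::size_t pos_ = 0;
};

// Collect every boundary after the start for one piece of text,
// reusing the same iterator object each time.
std::vector<int> boundaries(ToyBreaker &bi, const std::string &s) {
    bi.setText(&s);
    std::vector<int> out;
    for (int b = bi.next(); b != ToyBreaker::DONE; b = bi.next()) {
        out.push_back(b);
    }
    return out;
}
```

One iterator object can then be passed to `boundaries` for any number of strings, with a trivial constructor and a single setter doing all the configuration.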

One question: can you clarify why you want a shallow clone? It would seem
like the only reason for cloning is to be thread-safe, but a shallow clone
doesn't guarantee that as I understand it (and I think thread safety should
be at a higher level in this case anyway).
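
To make the concern concrete, here is a minimal sketch. `ToyText`, `shallowClone`, and `deepClone` are hypothetical stand-ins for a UText-like handle, not ICU code. The assumption being modeled is that a shallow clone gets its own iteration state but shares the underlying storage.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>

// Hypothetical stand-in for a UText-like handle; NOT the ICU struct.
struct ToyText {
    std::shared_ptr<std::string> storage;  // underlying text, shared by shallow clones
    std::size_t pos = 0;                   // per-handle iteration state

    // Return the character at the current position and advance,
    // or '\0' once the end of the text is reached.
    char next() {
        return pos < storage->size() ? (*storage)[pos++] : '\0';
    }
};

// Shallow clone: copies the iteration state, shares the storage.
ToyText shallowClone(const ToyText &src) {
    return src;
}

// Deep clone: also copies the storage, so the clone is fully independent.
ToyText deepClone(const ToyText &src) {
    ToyText out;
    out.storage = std::make_shared<std::string>(*src.storage);
    out.pos = src.pos;
    return out;
}
```

In this model a shallow clone can be iterated independently of the original, but any change to the shared storage shows up through both handles, so cloning alone would not make concurrent use safe if the text can be modified.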

Mark

----- Original Message -----
From: "Andy Heninger" <an...@jt...>
To: <icu...@li...>
Sent: Tuesday, June 21, 2005 15:32
Subject: Re: [icu-design] API Proposal: Break Iteration and UText for ICU4C

> Here are some minor tweaks to the proposal, below, for extending Break
> Iteration to work with UText.
>
> 1.  (Suggested by Markus)
>       > void BreakIterator::setText(UText &text);
>            becomes
>       void BreakIterator::setText(UText *text);
>
>       This is consistent with all other UText APIs, which are
>       uniformly passed around by pointer.
>
>       Make it clear in the description that the function is doing
>       a shallow clone of the supplied UText.
>
> 2.  (Also suggested by Markus)
>      > void BreakIterator::getUText(UText &fillIn)
>           becomes
>       UText *BreakIterator::getUText(
>                UText *fillIn,  UErrorCode &errorCode)
>
>       Again, make it clear that the function is shallow-cloning the
>       internal UText to produce the result.  By the usual UText
>       scheme for clone and open, a new UText will be allocated
>       if NULL is passed in.
>
>
> 3.   In the C API
>       >  UBreakIterator *ubrk_openUText(type, locale, UText,
>                             status)
>
>       I propose dropping this function, and relying on
>       doing a ubrk_open() with no input text specified, followed
>       by a ubrk_setUText().
>
>       The problem is that to keep things symmetric, we would really
>       want to have two flavors of ubrk_openUText, one from rules
>       and one with a break iterator type, and this starts to blow
>       up the number of API functions more than I like.
>
>       Also, in practice while actually writing code using break
>       iterators, I have found that I always create the break iterator
>       with no text, and then set the text later.  Break iterators
>       are intended to be reused, and doing so naturally tends to
>       separate creation from setting the text.
>
>
>    -- Andy Heninger
>       hen...@us...
>
>
> Andy Heninger wrote:
>
> > ICU4C API Proposal for extending Break Iteration to work with UText.
> > Expires 6/23/05
> >
> > The general idea is to add the necessary functions to allow break
> > iteration to work with text input in the form of UText, in addition to
> > the existing text forms (CharacterIterator, UChar *, UnicodeString).
> >
> >
> > //
> > //  Additions and changes to the C++ API:
> > //
> >
> > //
> > //  Reset the break iterator to operate over the text represented by
> > //  the UText.  The text boundary is reset to the start.
> > //
> > //  Ownership of the UText remains with the caller.  The UText need
> > //  not be preserved after calling this method, but the underlying
> > //  text itself should not be altered while invoking other break
> > //  iteration functions over it.
> > //
> > void
> > BreakIterator::setText(UText &text);
> >
> >
> > //
> > //  getText() is an existing function of BreakIterator.
> > //  When the original input is supplied as a UText,
> > //   this function will fail.  Because there is no
> > //   error status available, return a CharacterIterator
> > //   over an empty string in this case.
> > //
> > //   A possible alternative: do a CharacterIterator implementation
> > //   that wraps up a UText.
> > //
> > CharacterIterator& BreakIterator::getText()
> >
> > //
> > // Get the UText for this break iterator.
> > //   The caller-supplied UText will be filled in with
> > //   the requested data.
> > //
> > //     It would be very dangerous to return a reference to the
> > //     internal live UText because that one is reused forever,
> > //     across all setText() operations.  UTexts are designed to
> > //     copy efficiently with a shallow UText::clone().
> > //
> > void BreakIterator::getUText(UText &ut);
> >
> > //
> > //  first() is an existing method of BreakIterator.
> > //  CharacterIterators, on which the existing BreakIterator
> > //  implementation is based, can have a non-zero starting
> > //  index.
> > //
> > //  UText does not have this capability.
> > //
> > //  When switching the internal implementation from
> > //  CharacterIterator to UText, we may want to think about
> > //  losing the ability for first() to be non-zero.
> > //
> > int32_t BreakIterator::first()
> >
> >
> > //
> > // Additions to the C API
> > //
> >
> > //
> > //  Open a break iterator to operate over a UText.
> > //
> > //  Identical to the existing function ubrk_open(), except that
> > //  the text is supplied as a UText instead of a UChar* and length.
> > //
> > UBreakIterator *ubrk_openUText(
> >            UBreakIteratorType  type,
> >            const char         *locale,
> >            UText              *text,
> >            UErrorCode         *status);
> >
> >
> >
> > //
> > //  Reset the break iterator to work with new text.
> > //
> > //  Ownership of the UText remains with the caller.  The UText need
> > //  not be preserved after calling this method, but the underlying
> > //  text itself should not be altered while invoking other break
> > //  iteration functions over it.
> > //
> > void ubrk_setUText(UBreakIterator *bi,
> >                    UText          *text,
> >                    UErrorCode     *status);
> >
> >
> >
> >
> > Implementation Considerations:
> >
> > The RBBI implementation needs to be switched from being based on
> > CharacterIterator to being based on UText.  It's not conceptually hard
> > or tricky, but the changes are extensive.  It's a little worrisome to
> > change out the underpinnings of something as heavily used as RBBI,
> > replacing it with something brand new and not yet proven, UText.
> >
> > There are two alternatives that I can think of:
> >
> > o  Do the UText based RBBI implementation, but don't roll it in as
> >    the main RBBI implementation for ICU 3.4.  The new implementation
> >    would be used only when text was supplied as a UText.
> >
> >    This would probably also include some temporary restrictions on
> >    the C++ API related to how the C++ class hierarchy would need
> >    to be arranged.  The plain C API could be made to work cleanly.
> >
> >    It would also involve some code bloat from having two copies
> >    of the RBBI engine.  The size isn't huge, though.
> >
> > o  Write a CharacterIterator implementation that wraps up a
> >    UText.  Leave the RBBI engine as-is, working with
> >    CharacterIterator.
> >
> >    This would be safe, and provide the full RBBI API for UText.
> >    It would not run as efficiently as a native UText based rbbi
> >    engine.
> >
> >
