Re: [icu-design] Proposed API change to Unicode Spoof Detection


Brodie Thiesfield <bro...@je...> wrote

> > On Thu, Apr 2, 2009 at 7:20 AM, Erik van der Poel <er...@go...> wrote:
> >> You spoof are confusable UTF-8?
> >>
> >> I can has cheezburger?
> >>
> >> The English just sounds really funny.
> >
> > I agree that the function name reads oddly, but changing to "is" doesn't
> > fix it.
> >
> > areConfusableUTF8Identifiers(s1, s2, ...)  might be clearer.
>
> Changing the word order and adding "identifiers" makes it clearer:
>        uspoof_areIdentifiersConfusingUTF8
>
> Or leaving behind the verb-object-adj order, use a statement.
>        uspoof_identifiersAreConfusingUTF8
>
> It would usually be used in a test, right?
>
> if (uspoof_identifiersAreConfusingUTF8(s1, s2)) {
>  // stop
> }

These are good suggestions.

At this point, we are too late in the ICU 4.2 cycle to make any API changes.
There's a possibility that we may want to revisit the spoof API later,
though.  I'm finding the correct use of it, and especially all the
flags, to be confusing.  If it's confusing to me I worry about what it will
be like for others.  I'm not happy with it.

  -- Andy

On Thu, Apr 2, 2009 at 11:03 AM, Erik van der Poel <er...@go...> wrote:

> That's cool. We should try this thing out on some real-world URLs, and
> see which ones are flagged.
>
> Erik
>
> On Thu, Apr 2, 2009 at 10:35 AM, Andy Heninger <and...@gm...> wrote:
> > On Thu, Apr 2, 2009 at 10:17 AM, Erik van der Poel <er...@go...> wrote:
> >> I see. Yeah, it makes sense to use "are" when there are two parameters
> >> s1 and s2. It just looked funny, that's all. I don't feel strongly
> >> about this.
> >>
> >> By the way, if we have a single domain name that we want to check for
> >> confusability, what do we check it against? I.e. you have s1, but
> >> where do you get s2 from?
> >>
> >> I suppose you have another API that checks a single string for mixed
> >> scripts and other issues?
> >
> > Yes.  It's called uspoof_check(), again with variants for strings of
> > different types.  The set of checks to be performed is a property of
> > the spoof checker object.  For this function, the "position" output
> > parameter does provide useful information.
> >
> >  -- Andy
> >>
> >> Erik
> >>
> >> On Thu, Apr 2, 2009 at 9:56 AM, Andy Heninger <and...@gm...> wrote:
> >>> On Thu, Apr 2, 2009 at 7:20 AM, Erik van der Poel <er...@go...> wrote:
> >>>> You spoof are confusable UTF-8?
> >>>>
> >>>> I can has cheezburger?
> >>>>
> >>>> The English just sounds really funny.
> >>>
> >>> Well, it's only sort-of English.
> >>> The "are confusable" refers to the two string parameters, s1 and s2,
> >>> not to the  "UTF-8" which is an adjective, the type of the string
> >>> parameters.  "UTF8" is in the function name only because this is a
> >>> plain C API which can't do overloaded functions.
> >>>
> >>> "isXXX" is common in API function names, but it is generally referring
> >>> to a property of a single item, not of a pair of items.
> >>>
> >>> I agree that the function name reads oddly, but changing to "is"
> >>> doesn't fix it.
> >>>
> >>> areConfusableUTF8Identifiers(s1, s2, ...)  might be clearer.
> >>>
> >>> But overall, I favor leaving it as it is.
> >>>
> >>>  -- Andy
> >>>
> >>>
> >>>>
> >>>> For the UTF-8 it might be debatable, but
> >>>> uspoof_areConfusableUnicodeString() should probably have the "are"
> >>>> changed to "is". I'd change it to "is" for the UTF-8 APIs too. It's
> >>>> very common to have the word "is" in APIs, but not so common to use
> >>>> the word "are", I believe.
> >>>>
> >>>> Thanks for working on this,
> >>>>
> >>>> Erik
> >>>>
> >>>> On Wed, Apr 1, 2009 at 10:35 PM, Andy Heninger <and...@gm...> wrote:
> >>>>> I am proposing a small change to the USpoofChecker API.
> >>>>>
> >>>>> In the function
> >>>>>
> >>>>> U_DRAFT int32_t U_EXPORT2
> >>>>> uspoof_areConfusableUTF8(const USpoofChecker *sc,
> >>>>>                         const char *s1, int32_t length1,
> >>>>>                         const char *s2, int32_t length2,
> >>>>>                         int32_t *position,
> >>>>>                         UErrorCode *status);
> >>>>>
> >>>>>
> >>>>> I propose eliminating the "position" parameter.
> >>>>>
> >>>>> This parameter turns out to serve no useful purpose.  In other spoof
> >>>>> checking functions, the corresponding "position" parameter returns the
> >>>>> position of a detected problem with the identifier being checked.  For
> >>>>> this function, we are testing whether two complete identifiers are
> >>>>> potentially visually confusable.  There is no specific position in
> >>>>> them that causes them to be confusable - they must be confusable at
> >>>>> all positions for there to be a problem.
> >>>>>
> >>>>> Since this is a new API, there are no compatibility issues with
> >>>>> removing the parameter.
> >>>>>
> >>>>> The same change is needed in uspoof_areConfusable() and
> >>>>> uspoof_areConfusableUnicodeString()
> >>>>>
> >>>>>
>
>
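
To illustrate the point made in the thread about the "position" parameter, here is a minimal sketch in plain C. It does not use ICU and is not how ICU implements spoof detection. The names toy_fold, toy_are_confusable and toy_check are hypothetical, and the tiny look-alike table is an assumption that stands in for the real Unicode confusables data. The sketch only shows why a two-string comparison has no meaningful position to report, while a single-string check does.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for confusables data: fold a few look-alike
 * ASCII characters onto one representative. Real data covers far more. */
static char toy_fold(char c) {
    switch (c) {
    case '1': case 'I': case '|': return 'l';
    case '0': return 'O';
    default:  return c;
    }
}

/* Two-string test: returns 1 if s1 and s2 fold to the same sequence,
 * 0 otherwise. There is no position output, because a match is a
 * property of the two whole strings: every position has to agree.
 * The length comparison is valid only because this toy fold maps one
 * character to one character. */
int toy_are_confusable(const char *s1, const char *s2) {
    size_t n = strlen(s1);
    if (n != strlen(s2)) {
        return 0;
    }
    for (size_t i = 0; i < n; i++) {
        if (toy_fold(s1[i]) != toy_fold(s2[i])) {
            return 0;
        }
    }
    return 1;
}

/* Single-string test: returns 1 and stores the index of the first
 * non-ASCII byte in *position, or returns 0 if there is none. Here a
 * position is meaningful, since the problem sits at one specific place. */
int toy_check(const char *s, int *position) {
    for (size_t i = 0; s[i] != '\0'; i++) {
        if ((unsigned char)s[i] > 0x7F) {
            if (position != NULL) {
                *position = (int)i;
            }
            return 1;
        }
    }
    return 0;
}
```

With these toy rules, toy_are_confusable("paypal", "paypa1") reports a match because the whole strings fold to the same sequence, whereas toy_check() on a string containing a non-ASCII byte reports where that byte is.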
