[Httplib2-discuss] Bug #1455955: Followup – matching "tokens" in regular expressions

Here's a copy of my followup to bug #1455955 (Support for HMACDigest
authentication) [1], about the use of "[a-zA-Z0-9_-]" vs. "\w" to
match "tokens" as defined by HTTP [2], so that it can be discussed (if
anybody is subscribed to this list of course! :-P )

Moreover, matching tokens and quoted-strings can be done within a
single regex, using (?<=…), (?=…), (?<!…) and (?!…) constructs. Here's
a regex matching both tokens and quoted-strings, at the expense of
being a bit harder to read:
WWW_AUTH = re.compile(r"^(?:\s*(?:,\s*)?([a-zA-Z0-9_-]+)\s*=\s*\"?((?<=\")(?:[^\\\"]|\\.)*?(?=\")|(?<!\")[a-zA-Z0-9_-]+(?!\"))\"?)(.*)$")
You then just have to remove any reference to WWW_AUTH2 and match2
from _parse_www_authenticate.
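
To see how the combined pattern behaves, here is a minimal sketch in Python 3 syntax. The parse_params helper and the sample header value are made up for illustration; this is not the actual _parse_www_authenticate code, only a loop that peels one name=value pair at a time off the front of the string.

```python
import re

# The combined pattern from the message, split over two lines for readability.
WWW_AUTH = re.compile(
    r"^(?:\s*(?:,\s*)?([a-zA-Z0-9_-]+)\s*=\s*\"?"
    r"((?<=\")(?:[^\\\"]|\\.)*?(?=\")|(?<!\")[a-zA-Z0-9_-]+(?!\"))\"?)(.*)$"
)


def parse_params(rest):
    """Hypothetical helper: collect name=value pairs from an auth-param list."""
    params = {}
    match = WWW_AUTH.search(rest)
    while match:
        # group 1 = name, group 2 = value without quotes, group 3 = remainder
        key, value, rest = match.groups()
        params[key.lower()] = value
        match = WWW_AUTH.search(rest)
    return params


# Quoted-string, bare token, and a comma preceded by a space, all in one pass.
print(parse_params('realm="my realm", qop=auth , nonce="abc"'))
# {'realm': 'my realm', 'qop': 'auth', 'nonce': 'abc'}
```

Note that group 2 keeps any backslash escapes as they appear in the header; unescaping would be a separate step.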

This regex also fixes a small bug preventing commas from being
prefixed with spaces (which is explicitly allowed by the definition
of #-lists in HTTP).
I just replaced "^,?\s*" with "^\s*(?:,\s*)?", i.e., match any
whitespace, optionally followed by a comma and more whitespace. I
could have written "^\s*,?\s*" but I guess the non-capturing group
construct is a bit more efficient as it doesn't try to match "\s*"
twice.
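
A small sketch of the difference between the two prefixes, assuming a remainder string that starts with " ," (the sample value is invented for illustration):

```python
import re

OLD_PREFIX = re.compile(r"^,?\s*")         # comma, if any, must come first
NEW_PREFIX = re.compile(r"^\s*(?:,\s*)?")  # whitespace, then optional comma + whitespace

# What is left of the header once a parameter followed by " ," has been consumed.
remainder = ' , nonce="abc"'

old_end = OLD_PREFIX.match(remainder).end()
new_end = NEW_PREFIX.match(remainder).end()

print(repr(remainder[old_end:]))  # ', nonce="abc"'  (comma still in the way)
print(repr(remainder[new_end:]))  # 'nonce="abc"'    (ready for the name pattern)
```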

Back to the [a-zA-Z0-9_-] vs. \w problem, I've done some more research
and actually, the exact regex for a quoted string (without the <">s)
is
    (?:[^\0-\x1f\x7f-\xff\\\"]|\\[\0-\x7f]|\r\n[ \t]+)*?
but given that LWS has already been replaced with a single space, it
can be simplified as:
    (?:[^\0-\x1f\x7f-\xff\\"]|\\[\0-\x7f])*?
The only difference with what's currently in Httplib2 ([^\\\"]|\\.) is
that the regex above excludes CTLs from the first part and any octet
with value >= 128 (\x80) from both parts.
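
To make that difference concrete, here is a rough comparison of the two quoted-string bodies. It assumes Python 3 and uses bytes patterns, since the ranges above are about octets; the names LAX and STRICT and the sample values are only for this sketch.

```python
import re

# Current pattern vs. the stricter one, both without the surrounding quotes.
LAX = re.compile(rb'(?:[^\\\"]|\\.)*?')
STRICT = re.compile(rb'(?:[^\0-\x1f\x7f-\xff\\"]|\\[\0-\x7f])*?')

samples = [
    b'my realm',       # plain text: accepted by both
    b'say \\"hi\\"',   # backslash-escaped quotes: accepted by both
    b'bell\x07',       # contains a CTL: rejected by STRICT only
    b'caf\xe9',        # contains an octet >= 128: rejected by STRICT only
]

for sample in samples:
    print(sample, bool(LAX.fullmatch(sample)), bool(STRICT.fullmatch(sample)))
```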
The exact regex for a token is:
    [^\0-\x1f\x7f-\xff()<>@,;:\\\"/[\]?={} \t]+
Such an (admittedly unreadable) regex, compared to \w or
[a-zA-Z0-9_-]+, would match tokens such as to#en, to%en, to*en, to!en,
etc. which are valid tokens, even if probably never used.
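
A quick check of that claim, as a sketch under the assumption of Python 3 with bytes patterns (TOKEN and SIMPLE are names chosen for this example only):

```python
import re

# Exact token pattern from above vs. the simpler character class.
TOKEN = re.compile(rb'[^\0-\x1f\x7f-\xff()<>@,;:\\\"/[\]?={} \t]+')
SIMPLE = re.compile(rb'[a-zA-Z0-9_-]+')

for candidate in (b'f-o-o', b'to#en', b'to%en', b'to*en', b'to!en'):
    print(candidate, bool(TOKEN.fullmatch(candidate)), bool(SIMPLE.fullmatch(candidate)))
```

Only the first candidate is accepted by both patterns; the others are valid tokens that the simpler class rejects.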

[1] https://sourceforge.net/tracker/index.php?func=detail&aid=1455955&group_id=161082&atid=818437
[2] http://www.w3.org/Protocols/rfc2616/rfc2616-sec2.html#sec2.2

Er, Python doc for "re" says:
\w
    When the LOCALE and UNICODE flags are not specified, matches any
alphanumeric character and the underscore; this is equivalent to the
set [a-zA-Z0-9_].

So \w is equivalent here to [a-zA-Z0-9_] (because httplib2 uses
neither the LOCALE nor the UNICODE flags), which is even more
restrictive than my proposed [a-zA-Z0-9_-].

Compare re.match(r"^\w+$", "f-o-o") and re.match(r"^\w+$", "foo").
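
Spelled out as a runnable sketch: the re.ASCII flag is an assumption added here so that \w keeps the [a-zA-Z0-9_] meaning quoted above when run on Python 3 str patterns, where \w is Unicode-aware by default.

```python
import re

WORD = re.compile(r"^\w+$", re.ASCII)     # re.ASCII: assumption for Python 3
DASHED = re.compile(r"^[a-zA-Z0-9_-]+$")

for candidate in ("foo", "f-o-o"):
    print(candidate, bool(WORD.match(candidate)), bool(DASHED.match(candidate)))
# foo True True
# f-o-o False True
```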

HTTP defines "token" as "1*<any CHAR except CTLs or separators>", and
both "/" and ":", which are present in almost every absoluteURI, are
"separators", so an absoluteURI is not a token (and as such must be
quoted).
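
The same definition can be checked without a regex. This is only a sketch: is_token is a hypothetical helper, the separator set is transcribed from the HTTP basic rules [2], and the URI is an illustrative value.

```python
# Separators as listed in the HTTP basic rules, including SP and HT.
SEPARATORS = set('()<>@,;:\\"/[]?={} \t')


def is_token(value):
    """Hypothetical helper: non-empty, only CHARs that are neither CTLs nor separators."""
    return bool(value) and all(
        32 < ord(ch) < 127 and ch not in SEPARATORS for ch in value
    )


uri = "http://example.org/"
print(is_token("auth"))                 # True
print(is_token(uri))                    # False
print(sorted(set(uri) & SEPARATORS))    # ['/', ':']
```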

[a-zA-Z0-9_-] is far from perfect, but at least a bit better than \w.

--
Thomas Broyer
