[Httplib2-discuss] Bug #1455955: Followup – matching "tokens"
From: Thomas B. <t.b...@gm...> - 2006-03-31 08:01:52
Here's a copy of my followup to bug #1455955 (Support for HMACDigest authentication) [1], about the use of "[a-zA-Z0-9_-]" vs. "\w" to match "tokens" as defined by HTTP [2], so that it can be discussed (if anybody is subscribed to this list, of course! :-P )

Moreover, matching tokens and quoted-strings can be done within a single regex, using the (?<=…), (?=…), (?<!…) and (?!…) constructs. Here's a regex matching both tokens and quoted-strings, at the expense of being a bit harder to read:

    WWW_AUTH = re.compile(r"^(?:\s*(?:,\s*)?([a-zA-Z0-9_-]+)\s*=\s*\"?((?<=\")(?:[^\\\"]|\\.)*?(?=\")|(?<!\")[a-zA-Z0-9_-]+(?!\"))\"?)(.*)$")

You then just have to remove any reference to WWW_AUTH2 and match2 from _parse_www_authenticate.

This regex also fixes a small bug preventing commas from being preceded by spaces (which is explicitly allowed by the definition of #-lists in HTTP). I just replaced "^,?\s*" with "^\s*(?:,\s*)?", i.e., match any leading whitespace, optionally followed by a comma and then more optional whitespace. I could have written "^\s*,?\s*", but I guess the non-capturing group construct is a bit more efficient as it doesn't try to match "\s*" twice when there is no comma.

Back to the [a-zA-Z0-9_-] vs. \w problem: I've done some more research and, actually, the exact regex for a quoted-string (without the <">s) is

    (?:[^\0-\x1f\x7f-\xff\\\"]|\\[\0-\x7f]|\r\n[ \t]+)*?

but given that LWS has already been replaced with a single space, it can be simplified to:

    (?:[^\0-\x1f\x7f-\xff\\\"]|\\[\0-\x7f])*?

The only difference from what's currently in Httplib2, ([^\\\"]|\\.), is that the regex above excludes CTLs from the first part and any octet with value >= 128 (\x80) from both parts.

The exact regex for a token is:

    [^\0-\x1f\x7f-\xff()<>@,;:\\\"/[\]?={} \t]+

Such an (unreadable, I admit) regex, compared to \w+ or [a-zA-Z0-9_-]+, would match tokens such as to#en, to%en, to*en, to!en, etc., which are valid tokens, even if probably never used.
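To illustrate how the combined regex behaves, here is a minimal sketch for Python 3. The helper name parse_auth_params is hypothetical and is not httplib2's _parse_www_authenticate; it assumes the auth scheme has already been split off and only the parameter list remains. The pattern itself is the WWW_AUTH regex from this message, split over two lines.

```python
import re

# The single regex proposed in this message (tokens and quoted-strings).
WWW_AUTH = re.compile(
    r"^(?:\s*(?:,\s*)?([a-zA-Z0-9_-]+)\s*=\s*\"?"
    r"((?<=\")(?:[^\\\"]|\\.)*?(?=\")|(?<!\")[a-zA-Z0-9_-]+(?!\"))\"?)(.*)$"
)


def parse_auth_params(params):
    """Hypothetical helper: turn 'k=v, k2="v 2"' into a dict.

    Each match yields (key, value, rest); the loop re-applies the
    regex to the remainder until nothing more matches.
    """
    result = {}
    match = WWW_AUTH.search(params)
    while match:
        key, value, rest = match.groups()
        # Undo backslash escapes (quoted-pairs) inside quoted-strings.
        result[key.lower()] = re.sub(r"\\(.)", r"\1", value)
        match = WWW_AUTH.search(rest)
    return result


# Note the space before the first comma, which "^,?\s*" would reject.
header = r'realm="a, b" , nonce=abc, title="say \"hi\""'
print(parse_auth_params(header))
# {'realm': 'a, b', 'nonce': 'abc', 'title': 'say "hi"'}
```

The lookbehind/lookahead pair makes the first alternative apply only when the value is actually enclosed in quotes, so the comma inside "a, b" is not mistaken for a list separator.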
[1] https://sourceforge.net/tracker/index.php?func=detail&aid=1455955&group_id=161082&atid=818437
[2] http://www.w3.org/Protocols/rfc2616/rfc2616-sec2.html#sec2.2

Er, the Python doc for "re" says:

    \w
    When the LOCALE and UNICODE flags are not specified, matches any alphanumeric character and the underscore; this is equivalent to the set [a-zA-Z0-9_].

So \w is equivalent here to [a-zA-Z0-9_] (because httplib2 uses neither the LOCALE nor the UNICODE flag), which is even more restrictive than my proposed [a-zA-Z0-9_-]. Compare re.match(r"^\w+$", "f-o-o") and re.match(r"^\w+$", "foo"): only the second one matches.

HTTP defines "token" as "1*<any CHAR except CTLs or separators>", and both "/" and ":", which are present in almost every absoluteURI, are "separators", so an absoluteURI is not a token (and as such must be quoted).

[a-zA-Z0-9_-] is far from perfect, but at least a bit better than \w.

-- 
Thomas Broyer
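To make the comparison concrete, the following sketch checks a few candidate strings against the three token patterns discussed in this thread. It assumes Python 3, where re.ASCII is needed to get the [a-zA-Z0-9_] meaning of \w quoted from the docs. The names WORD, SIMPLE, TOKEN and which_match are illustrative only; the "[" in the exact class is additionally escaped, which does not change what it matches.

```python
import re

# \w restricted to ASCII, i.e. [a-zA-Z0-9_] as in the quoted docs.
WORD = re.compile(r"^\w+$", re.ASCII)
# The character class proposed in the bug followup.
SIMPLE = re.compile(r"^[a-zA-Z0-9_-]+$")
# The exact class from this message: any CHAR except CTLs and separators.
# (With str input, characters above \xff are not excluded by this class.)
TOKEN = re.compile(r"^[^\0-\x1f\x7f-\xff()<>@,;:\\\"/\[\]?={} \t]+$")


def which_match(candidate):
    """Return (WORD, SIMPLE, TOKEN) booleans for the candidate string."""
    return tuple(bool(p.match(candidate)) for p in (WORD, SIMPLE, TOKEN))


for candidate in ["foo", "f-o-o", "to#en", "http://example.org/"]:
    print(candidate, which_match(candidate))
# foo (True, True, True)
# f-o-o (False, True, True)
# to#en (False, False, True)
# http://example.org/ (False, False, False)
```

The last line shows why an absoluteURI has to be sent as a quoted-string: none of the three patterns, including the exact one, accepts ":" or "/".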