Thread: [Httplib2-discuss] Bug #1455955: Followup – matching "tokens"
From: Thomas B. <t.b...@gm...> - 2006-03-31 08:01:52
Here's a copy of my followup to bug #1455955 (Support for HMACDigest authentication) [1], about the use of "[a-zA-Z0-9_-]" vs. "\w" to match "tokens" as defined by HTTP [2], so that it can be discussed (if anybody is subscribed to this list, of course! :-P )

Moreover, matching tokens and quoted-strings can be done within a single regex, using the (?<=…), (?=…), (?<!…) and (?!…) constructs. Here's a regex matching both tokens and quoted-strings, at the expense of being a bit harder to read:

    WWW_AUTH = re.compile(r"^(?:\s*(?:,\s*)?([a-zA-Z0-9_-]+)\s*=\s*\"?((?<=\")(?:[^\\\"]|\\.)*?(?=\")|(?<!\")[a-zA-Z0-9_-]+(?!\"))\"?)(.*)$")

You then just have to remove any reference to WWW_AUTH2 and match2 from _parse_www_authenticate.

This regex also fixes a small bug preventing commas from being preceded by spaces (which is explicitly allowed by the definition of #-lists in HTTP). I just replaced "^,?\s*" with "^\s*(?:,\s*)?", i.e., match any spaces, optionally followed by a comma and more spaces. I could have written "^\s*,?\s*" but I guess the non-capturing group construct is a bit more efficient as it doesn't try to match "\s*" twice.

Back to the [a-zA-Z0-9_-] vs. \w problem: I've done some more research and actually, the exact regex for a quoted string (without the <">s) is

    (?:[^\0-\x1f\x7f-\xff\\\"]|\\[\0-\x7f]|\r\n[ \t]+)*?

but given that LWS has already been replaced with a single space, it can be simplified to:

    (?:[^\0-\x1f\x7f-\xff\\"]|\\[\0-\x7f])*?

The only difference from what's currently in Httplib2 ([^\\\"]|\\.) is that the regex above excludes CTLs from the first part and any octet with value >= 128 (\x80) from both parts.

The exact regex for a token is:

    [^\0-\x1f\x7f-\xff()<>@,;:\\\"/[\]?={} \t]+

Such an (unreadable, I admit) regex, compared to \w or [a-zA-Z0-9_-]+, would match tokens such as to#en, to%en, to*en, to!en, etc., which are valid tokens, even if probably never used.
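The single-regex approach described above can be tried out in isolation. The following is a minimal sketch, assuming Python 3 and using a single-quoted raw string so the double quotes need no escaping; `parse_params` is a hypothetical helper written for illustration and is not httplib2's `_parse_www_authenticate`.

```python
import re

# Same pattern as WWW_AUTH in the message above, split over two lines.
# Group 1 is the parameter name, group 2 the token or the quoted-string
# contents (without the surrounding quotes), group 3 the unparsed rest.
WWW_AUTH = re.compile(
    r'^(?:\s*(?:,\s*)?([a-zA-Z0-9_-]+)\s*=\s*"?'
    r'((?<=")(?:[^\\"]|\\.)*?(?=")|(?<!")[a-zA-Z0-9_-]+(?!"))"?)(.*)$'
)


def parse_params(text):
    """Hypothetical helper: peel name=value pairs off the front of text."""
    params = {}
    match = WWW_AUTH.match(text)
    while match:
        name, value, text = match.groups()
        params[name.lower()] = value
        match = WWW_AUTH.match(text)
    return params


# Note the space before the first comma, which "^,?\s*" would not accept.
print(parse_params('realm="a b" , qop=auth,nonce="x\\"y"'))
```

The lookbehind and lookahead assertions are what let one alternation serve both cases: the first branch only applies when an opening quote was just consumed, the second only when it was not.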
[1] https://sourceforge.net/tracker/index.php?func=detail&aid=1455955&group_id=161082&atid=818437
[2] http://www.w3.org/Protocols/rfc2616/rfc2616-sec2.html#sec2.2

Er, the Python doc for "re" says:

    \w
        When the LOCALE and UNICODE flags are not specified, matches any alphanumeric character and the underscore; this is equivalent to the set [a-zA-Z0-9_].

So \w is equivalent here to [a-zA-Z0-9_] (because httplib2 uses neither the LOCALE nor the UNICODE flag), which is even more restrictive than my proposed [a-zA-Z0-9_-]. Compare re.match(r"^\w+$", "f-o-o") and re.match(r"^\w+$", "foo").

HTTP defines "token" as "1*<any CHAR except CTLs or separators>", and both "/" and ":", which are present in almost every absoluteURI, are separators, so an absoluteURI is not a token (and as such must be quoted).

[a-zA-Z0-9_-] is far from perfect, but at least a bit better than \w.

--
Thomas Broyer
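To make the comparison between \w, [a-zA-Z0-9_-] and the strict token class concrete, here is a small sketch, assuming Python 3. Two things in it are assumptions rather than part of the thread: `re.ASCII` is used to restore the Python 2 meaning of \w quoted above, and the input is assumed to be Latin-1-decoded header text (no character above U+00FF). `accepted_by` is a hypothetical helper.

```python
import re

# re.ASCII limits \w to [a-zA-Z0-9_], as in the Python 2 doc quoted above.
WORD = re.compile(r'\w+', re.ASCII)
LAX_TOKEN = re.compile(r'[a-zA-Z0-9_-]+')
# Strict token class from the thread: any CHAR except CTLs and separators.
STRICT_TOKEN = re.compile(r'[^\0-\x1f\x7f-\xff()<>@,;:\\"/\[\]?={} \t]+')


def accepted_by(candidate):
    """Hypothetical helper: names of the patterns matching the whole string."""
    patterns = {'word': WORD, 'lax': LAX_TOKEN, 'strict': STRICT_TOKEN}
    return [name for name, rx in patterns.items() if rx.fullmatch(candidate)]


for candidate in ('foo', 'f-o-o', 'to#en', 'http://example.org/'):
    print(candidate, accepted_by(candidate))
```

The URI is rejected by all three patterns because ":" and "/" are separators, which is why an absoluteURI has to be sent as a quoted-string.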
From: Thomas B. <t.b...@gm...> - 2006-03-31 08:29:28
2006/3/31, Thomas Broyer <t.b...@gm...>:
> Moreover, matching tokens and quoted-strings can be done within a
> single regex, using (?<=…), (?=…), (?<!…) and (?!…) constructs.
[…]
> […] a small bug preventing commas from being prefixed with spaces
> (which is explicitly allowed by the definition of #-lists in HTTP).
[…]
> Back to the [a-zA-Z0-9_-] vs. \w problem, I've done some more research
> and actually, the exact regex for a quoted string (without the <">s)
> is […]
> The exact regex for a token is […]

I've created a bug report [1] (1461941 – Bugs in _parse_www_authenticate's regex + use a single regex) with an attached patch and an alternative "lax" regex (far more readable).

[1] http://sourceforge.net/tracker/index.php?func=detail&aid=1461941&group_id=161082&atid=818434

--
Thomas Broyer
From: Joe G. <joe...@gm...> - 2006-03-31 15:22:08
Thomas,

That's really great work on the regex. Can you also add some unit tests that exercise the regex?

Thanks,
-joe

On 3/31/06, Thomas Broyer <t.b...@gm...> wrote:
> 2006/3/31, Thomas Broyer <t.b...@gm...>:
> > Moreover, matching tokens and quoted-strings can be done within a
> > single regex, using (?<=…), (?=…), (?<!…) and (?!…) constructs.
> […]
> > […] a small bug preventing commas from being prefixed with spaces
> > (which is explicitly allowed by the definition of #-lists in HTTP).
> […]
> > Back to the [a-zA-Z0-9_-] vs. \w problem, I've done some more research
> > and actually, the exact regex for a quoted string (without the <">s)
> > is […]
> > The exact regex for a token is […]
>
> I've created a bug report [1] (1461941 – Bugs in
> _parse_www_authenticate's regex + use a single regex) with attached
> patch, and an alternative "lax" regex (far more readable)
>
> [1] http://sourceforge.net/tracker/index.php?func=detail&aid=1461941&group_id=161082&atid=818434
>
> --
> Thomas Broyer

--
Joe Gregorio http://bitworking.org
From: Thomas B. <t.b...@gm...> - 2006-04-03 13:25:10
2006/3/31, Joe Gregorio <joe...@gm...>:
> Thomas,
> That's really great work on the regex, can you also add some
> unit tests that exercise the regex?

When running the existing unit tests with my modified regex, the test "HttpPrivateTest.testParseWWWAuthenticateMultiple4" fails: the value of the "qop" auth-param contains a tab (\t). Until then, I had assumed the header values were normalized, whereas _normalize_headers actually only normalizes field names. More precisely, HTTP/1.1 says that "A recipient MAY replace any linear white space with a single SP before interpreting the field value or forwarding the message downstream." I assumed this was being done, but it isn't.

There are two options here:
- unfold and normalize white space in _normalize_headers
- unfold and normalize white space only in _parse_www_authenticate, only for www-authenticate or authentication-info headers, before processing

I reject the option of modifying the regex to accommodate folded field values: it's much easier to normalize the values before processing, and it's totally HTTP/1.1-compliant.

And as I was investigating _parse_www_authenticate, I also noticed that quoted pairs (in quoted strings) are never "unquoted", so the following (new) unit test fails:

    res = httplib2._parse_www_authenticate({ 'www-authenticate': 'Test realm="a \\"test\\" realm"'})
    self.assertEqual(res['test']['realm'], 'a "test" realm')

as res['test']['realm'] contains 'a \\"test\\" realm'. This unquoting can be done using either a regex and the "sub" method, or by splitting and joining the string. I personally have no preference.

Also (and finally), as strict WWW-Authenticate "parsing" might cause unrecoverable errors (I mean a parameter treated as an auth-scheme, or consuming the following challenge, rather than raising exceptions), I tend to go for the simpler regex ("strict send/lax receive") I provided in my previous mail (it still needs some testing though).
Or how about putting both regexes in the code and providing a switch for the one to use (e.g. "httplib2.USE_STRICT_WWW_AUTHENTICATE_PARSING = True", defaulting to False)?

--
Thomas Broyer
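The "unfold and normalize white space" step discussed in this message can be sketched as follows. This is an illustration under stated assumptions, not the code from the patch: `normalize_lws` is a hypothetical name, and a bare LF is tolerated in addition to the CRLF that RFC 2616 specifies for LWS.

```python
import re

# RFC 2616: LWS = [CRLF] 1*( SP | HT ). Accepting a bare LF as well is an
# assumption made here for leniency.
LWS = re.compile(r'(?:\r?\n)?[ \t]+')


def normalize_lws(value):
    """Hypothetical helper: unfold a header value and collapse LWS to one SP."""
    return LWS.sub(' ', value).strip()


folded = 'Digest realm="x",\r\n\tqop="auth,\tauth-int"'
print(normalize_lws(folded))
```

Doing this once before parsing keeps the parameter regex free of any CRLF handling. Note that it also rewrites white space inside quoted strings, which the sentence quoted from HTTP/1.1 permits.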
From: Thomas B. <t.b...@gm...> - 2006-04-07 10:11:28
Attachments:
_parse_www_authenticate + tests.patch
2006/4/3, Thomas Broyer <t.b...@gm...>:
> There are two options here:
> - unfold and normalize white space in _normalize_headers
> - unfold and normalize white space only in _parse_www_authenticate,
> only for www-authenticate or authentication-info headers before
> processing
[...]
> And as I was investigating in _parse_www_authenticate, I also
> noticed quoted pairs (in quoted strings) are never "unquoted", so the
> following (new) unit test fails:
> res = httplib2._parse_www_authenticate({ 'www-authenticate':
> 'Test realm="a \\"test\\" realm"'})
> self.assertEqual(res['test']['realm'], 'a "test" realm')
> as res['test']['realm'] contains 'a \\"test\\" realm'.
> This (unquoting) can be done using either a regex and the "sub"
> method, or splitting and joining the string. I personally have no
> preference.
>
> Also (and finally), as strict WWW-Authenticate "parsing" might cause
> unrecoverable errors (I mean, a parameter treated as an auth-scheme,
> or consuming the following challenge, instead of exceptions), I tend
> to go for the simpler regex ("strict send/lax receive") I provided in
> my previous mail (it still needs some testing though). Or how about
> putting both regexes in the code and providing a switch for the one to
> use (e.g. "httplib2.USE_STRICT_WWW_AUTHENTICATE_PARSING = True",
> defaulting to False)?

The attached patch fixes bug #1461941 and:
- unfolds and normalizes spaces in _normalize_headers (_parse_www_authenticate assumes unfolded header values; the regex was fixed to accept \t as well, just in case, so it doesn't assume fully normalized spaces, only an unfolded header value)
- unquotes "quoted-pairs" (done with a regex; a split/join version is available in comments)
- uses relaxed parsing by default, but can be switched to strict parsing using a global variable
- adds unit tests exercising both regexes

--
Thomas Broyer
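For reference, the two ways of unquoting quoted-pairs mentioned above (regex "sub" versus split/join) might look like the sketch below. Both functions are hypothetical illustrations written for Python 3, not the code in the attached patch, and they are only meant to agree on well-formed input where every backslash is followed by a character.

```python
import re

# A quoted-pair is a backslash followed by any single character.
QUOTED_PAIR = re.compile(r'\\(.)', re.DOTALL)


def unquote_re(value):
    """Replace each quoted-pair with the character it escapes."""
    return QUOTED_PAIR.sub(r'\1', value)


def unquote_split(value):
    """Same result via split/join; an empty piece marks an escaped backslash."""
    parts = value.split('\\')
    out = [parts[0]]
    i = 1
    while i < len(parts):
        if parts[i] == '' and i + 1 < len(parts):
            # Two adjacent backslashes: keep one, then the following piece.
            out.append('\\' + parts[i + 1])
            i += 2
        else:
            out.append(parts[i])
            i += 1
    return ''.join(out)


raw = 'a \\"test\\" realm'  # the value from the failing unit test
print(unquote_re(raw), '|', unquote_split(raw))
```

The regex version is shorter; the split/join version needs the extra branch because a naive join would drop escaped backslashes entirely.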
Sorry, that took waaay too long for me to get back to. I have applied the patch, thanks!

-joe

On 4/7/06, Thomas Broyer <t.b...@gm...> wrote:
> 2006/4/3, Thomas Broyer <t.b...@gm...>:
> > There are two options here:
> > - unfold and normalize white space in _normalize_headers
> > - unfold and normalize white space only in _parse_www_authenticate,
> > only for www-authenticate or authentication-info headers before
> > processing
> [...]
> > And as I was investigating in _parse_www_authenticate, I also
> > noticed quoted pairs (in quoted strings) are never "unquoted", so the
> > following (new) unit test fails:
> > res = httplib2._parse_www_authenticate({ 'www-authenticate':
> > 'Test realm="a \\"test\\" realm"'})
> > self.assertEqual(res['test']['realm'], 'a "test" realm')
> > as res['test']['realm'] contains 'a \\"test\\" realm'.
> > This (unquoting) can be done using either a regex and the "sub"
> > method, or splitting and joining the string. I personally have no
> > preference.
> >
> > Also (and finally), as strict WWW-Authenticate "parsing" might cause
> > unrecoverable errors (I mean, a parameter treated as an auth-scheme,
> > or consuming the following challenge, instead of exceptions), I tend
> > to go for the simpler regex ("strict send/lax receive") I provided in
> > my previous mail (it still needs some testing though). Or how about
> > putting both regexes in the code and providing a switch for the one to
> > use (e.g. "httplib2.USE_STRICT_WWW_AUTHENTICATE_PARSING = True",
> > defaulting to False)?
>
> The attached patch fixes bug #1461941 and:
> - unfold and normalize spaces in _normalize_headers
> (_parse_www_authenticate assumes unfolded header values; regex fixed
> to accept \t as well, just in case, so doesn't assume fully-normalized
> spaces, only unfolded header value)
> - unquote "quoted-pairs" (done with a regex, split/join version
> available in comments)
> - use relaxed parsing by default but can be switched to strict
> parsing using a global variable
> - adds unit tests exercising both regex's
>
> --
> Thomas Broyer

--
Joe Gregorio http://bitworking.org