From: A.M. K. <aku...@us...> - 2003-04-10 14:18:40
Update of /cvsroot/py-howto/pyhowto
In directory sc8-pr-cvs1:/tmp/cvs-serv11351

Modified Files:
	regex.tex
Log Message:
[Patch #718809 from Jarno Virtanen] Various minor corrections to regex.tex; I've also made a few more minor rewrites.

Index: regex.tex
===================================================================
RCS file: /cvsroot/py-howto/pyhowto/regex.tex,v
retrieving revision 1.17
retrieving revision 1.18
diff -C2 -r1.17 -r1.18
*** regex.tex 7 Apr 2003 19:51:23 -0000 1.17
--- regex.tex 10 Apr 2003 14:18:34 -0000 1.18
*************** *** 35,40 **** Perl-style regular expression patterns. Earlier versions of Python came with the \module{regex} module, which provides Emacs-style ! patterns. Emacs-style patterns are slightly less readable, and ! doesn't provide as many features, so there's not much reason to use the \module{regex} module when writing new code, though you might encounter old code that uses it. --- 35,40 ---- Perl-style regular expression patterns. Earlier versions of Python came with the \module{regex} module, which provides Emacs-style ! patterns. Emacs-style patterns are slightly less readable and ! don't provide as many features, so there's not much reason to use the \module{regex} module when writing new code, though you might encounter old code that uses it. *************** *** 215,219 **** it can at first, and if no match is found it will then progressively back up and retry the rest of the RE again and again. It will back up ! until it's tried zero matches for \regexp{[bcd]*}, and if that subsequently fails, the engine will conclude that the string doesn't match the RE at all. --- 215,219 ---- it can at first, and if no match is found it will then progressively back up and retry the rest of the RE again and again. It will back up ! until it has tried zero matches for \regexp{[bcd]*}, and if that subsequently fails, the engine will conclude that the string doesn't match the RE at all.
*************** *** 245,249 **** earlier, but that might as well be infinity. ! Readers of a reductionist bent may notice that the 3 other qualifiers can all be expressed using this notation. \regexp{\{0,\}} is the same as \regexp{*}, \regexp{\{1,\}} is equivalent to \regexp{+}, and --- 245,249 ---- earlier, but that might as well be infinity. ! Readers of a reductionist bent may notice that the three other qualifiers can all be expressed using this notation. \regexp{\{0,\}} is the same as \regexp{*}, \regexp{\{1,\}} is equivalent to \regexp{+}, and *************** *** 348,354 **** \begin{tableii}{c|l}{code}{Method/Attribute}{Purpose} ! \lineii{match}{Determine if the RE matches at the beginning of the string.} ! \lineii{search}{Scan through a string, looking for any location where this RE matches.} \lineii{findall()}{Find all substrings where the RE matches, --- 348,354 ---- \begin{tableii}{c|l}{code}{Method/Attribute}{Purpose} ! \lineii{match()}{Determine if the RE matches at the beginning of the string.} ! \lineii{search()}{Scan through a string, looking for any location where this RE matches.} \lineii{findall()}{Find all substrings where the RE matches, *************** *** 997,1001 **** \end{verbatim} ! \subsection{Other Assertions} Another zero-width assertion is the lookahead assertion. Lookahead --- 997,1001 ---- \end{verbatim} ! \subsection{Lookahead Assertions} Another zero-width assertion is the lookahead assertion. Lookahead *************** *** 1016,1020 **** \end{itemize} ! An example will help make this concrete and will demonstrate a case where a lookahead is useful. Consider a simple pattern to match a filename and split it apart into a base name and an extension, --- 1016,1020 ---- \end{itemize} ! An example will help make this concrete by demonstrating a case where a lookahead is useful. 
Consider a simple pattern to match a filename and split it apart into a base name and an extension, *************** *** 1022,1030 **** is the base name, and \samp{rc} is the filename's extension. ! The pattern to match this is quite simple: \regexp{.*[.].*\$}. ! (Notice that the \samp{.} needs to be treated specially because it's a metacharacter; I've put it inside a character class. Also notice the trailing \regexp{\$}; this is added to ensure that all the rest of the ! string must be included in the extension.) This regular expression matches \samp{foo.bar} and \samp{autoexec.bat} and \samp{sendmail.cf} and \samp{printers.conf}. --- 1022,1033 ---- is the base name, and \samp{rc} is the filename's extension. ! The pattern to match this is quite simple: ! ! \regexp{.*[.].*\$} ! ! Notice that the \samp{.} needs to be treated specially because it's a metacharacter; I've put it inside a character class. Also notice the trailing \regexp{\$}; this is added to ensure that all the rest of the ! string must be included in the extension. This regular expression matches \samp{foo.bar} and \samp{autoexec.bat} and \samp{sendmail.cf} and \samp{printers.conf}. *************** *** 1037,1045 **** % $ ! First attempt: Exclude \samp{bat} by requiring that the first ! character of the extension is not a \samp{b}. This is wrong, because it ! also doesn't match \samp{foo.bar}. ! \regexp{.*[.]([\^b]..|.[\^a].|..[\^t])\$} The expression gets messier when you try to patch up the first --- 1040,1049 ---- % $ ! The first attempt above tries to exclude \samp{bat} by requiring that ! the first character of the extension is not a \samp{b}. This is ! wrong, because the pattern also doesn't match \samp{foo.bar}. ! % Messes up the HTML without the curly braces around \^ ! \regexp{.*[.]([{\^}b]..|.[{\^}a].|..[{\^}t])\$} The expression gets messier when you try to patch up the first *************** *** 1048,1056 **** \samp{a}; or the third character isn't \samp{t}. 
This accepts \samp{foo.bar} and rejects \samp{autoexec.bat}, but it requires a ! three-letter extension, and doesn't accept \samp{sendmail.cf}. ! Another bug, so we'll complicate the pattern again in an effort to fix ! it. ! \regexp{.*[.]([\^b].?.?|.[\^a]?.?|..?[\^t]?)\$} In the third attempt, the second and third letters are all made --- 1052,1060 ---- \samp{a}; or the third character isn't \samp{t}. This accepts \samp{foo.bar} and rejects \samp{autoexec.bat}, but it requires a ! three-letter extension and won't accept a filename with a two-letter ! extension such as \samp{sendmail.cf}. We'll complicate the pattern ! again in an effort to fix it. ! \regexp{.*[.]([{\^}b].?.?|.[{\^}a]?.?|..?[{\^}t]?)\$} In the third attempt, the second and third letters are all made *************** *** 1059,1081 **** The pattern's getting really complicated now, which makes it hard to ! read and understand. Worse, this solution doesn't scale well; if the ! problem changes and you want to exclude both \samp{bat} and \samp{exe} ! as extensions, the pattern would get still more complicated and ! confusing. ! ! A negative lookahead cuts through all this. Go back to the original ! pattern, and, before the \regexp{.*} which matches the extension, ! insert \regexp{(?!bat\$)}. This means: if the expression \regexp{bat} ! doesn't match at this point, try the rest of the pattern; if ! \regexp{bat\$} does match, the whole pattern will fail. The trailing ! \regexp{\$} is required to ensure that something like ! \samp{sample.batch}, where the extension only starts with \samp{bat}, ! will be allowed. ! ! After this modification, the whole pattern is ! \regexp{.*[.](?!bat\$).*\$}. Excluding another filename extension is ! now easy; simply add it as an alternative inside the assertion. \regexp{.*[.](?!bat\$|exe\$).*\$} ! excludes both \samp{bat} and \samp{exe}. --- 1063,1087 ---- The pattern's getting really complicated now, which makes it hard to ! read and understand. 
Worse, if the problem changes and you want to ! exclude both \samp{bat} and \samp{exe} as extensions, the pattern ! would get even more complicated and confusing. ! ! A negative lookahead cuts through all this: ! ! \regexp{.*[.](?!bat\$).*\$} ! % $ ! ! The lookahead means: if the expression \regexp{bat} doesn't match at ! this point, try the rest of the pattern; if \regexp{bat\$} does match, ! the whole pattern will fail. The trailing \regexp{\$} is required to ! ensure that something like \samp{sample.batch}, where the extension ! only starts with \samp{bat}, will be allowed. ! ! Excluding another filename extension is now easy; simply add it as an ! alternative inside the assertion. The following pattern excludes ! filenames that end in either \samp{bat} or \samp{exe}: ! \regexp{.*[.](?!bat\$|exe\$).*\$} ! % $ *************** *** 1087,1093 **** \begin{tableii}{c|l}{code}{Method/Attribute}{Purpose} ! \lineii{split}{Split the string into a list, splitting it wherever the RE matches} ! \lineii{sub}{Find all substrings where the RE matches, and replace them with a different string} ! \lineii{subn}{Does the same thing as \method{sub()}, but returns the new string and the number of replacements} \end{tableii} --- 1093,1099 ---- \begin{tableii}{c|l}{code}{Method/Attribute}{Purpose} ! \lineii{split()}{Split the string into a list, splitting it wherever the RE matches} ! \lineii{sub()}{Find all substrings where the RE matches, and replace them with a different string} ! \lineii{subn()}{Does the same thing as \method{sub()}, but returns the new string and the number of replacements} \end{tableii} *************** *** 1193,1197 **** \end{verbatim} ! Empty matches are replaced only when not they're not adjacent to a previous match. --- 1199,1203 ---- \end{verbatim} ! Empty matches are replaced only when they're not adjacent to a previous match. 
*************** *** 1223,1229 **** There's also a syntax for referring to named groups as defined by the \regexp{(?P<name>...)} syntax. \samp{\e g<name>} will use the ! substring matched by the group named \samp{name}, and \samp{\e ! g<\var{number}>} uses the corresponding group number. \samp{\e g<2>} ! is therefore equivalent to \samp{\e 2}, but isn't ambiguous in a replacement string such as \samp{\e g<2>0}. (\samp{\e 20} would be interpreted as a reference to group 20, not a reference to group 2 --- 1229,1237 ---- There's also a syntax for referring to named groups as defined by the \regexp{(?P<name>...)} syntax. \samp{\e g<name>} will use the ! substring matched by the group named \samp{name}, and ! \samp{\e g<\var{number}>} ! uses the corresponding group number. ! \samp{\e g<2>} is therefore equivalent to \samp{\e 2}, ! but isn't ambiguous in a replacement string such as \samp{\e g<2>0}. (\samp{\e 20} would be interpreted as a reference to group 20, not a reference to group 2 *************** *** 1303,1309 **** from a string or replacing it with another single character. You might do this with something like \code{re.sub('\e n', ' ', S)}, but ! \method{translate()} is capable of doing both these tasks, ! and will be much faster that any regular expression operation can ever ! be. In short, before turning to the \module{re} module, consider whether --- 1311,1316 ---- from a string or replacing it with another single character. You might do this with something like \code{re.sub('\e n', ' ', S)}, but ! \method{translate()} is capable of doing both tasks ! and will be faster that any regular expression operation can be. In short, before turning to the \module{re} module, consider whether *************** *** 1347,1351 **** starting character, only trying the full match if a \character{C} is found. ! Adding \regexp{.*} defeats this optimization, and requires scanning to the end of the string and then backtracking to find a match for the rest of the RE. 
Use \function{re.search()} instead. --- 1354,1358 ---- starting character, only trying the full match if a \character{C} is found. ! Adding \regexp{.*} defeats this optimization, requiring scanning to the end of the string and then backtracking to find a match for the rest of the RE. Use \function{re.search()} instead.
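
Reviewer's note, not part of the patch: the negative-lookahead patterns in the rewritten "Lookahead Assertions" section can be checked with a short Python sketch like the one below. It assumes the patterns are used with the LaTeX escaping removed (so \$ becomes $). The helper name accepted and the sample filename setup.exe are illustrative choices of mine; the other filenames come from the HOWTO text.

```python
import re

# Patterns from the revised HOWTO text, with the LaTeX escaping removed.
NO_BAT = re.compile(r".*[.](?!bat$).*$")
NO_BAT_OR_EXE = re.compile(r".*[.](?!bat$|exe$).*$")


def accepted(pattern, filenames):
    """Return the filenames that the compiled pattern matches (hypothetical helper)."""
    return [name for name in filenames if pattern.match(name)]


# "setup.exe" is an illustrative addition; the rest appear in the HOWTO.
names = ["foo.bar", "autoexec.bat", "sendmail.cf", "sample.batch", "setup.exe"]

# Only an extension that is exactly "bat" is rejected; "batch" still passes
# because the lookahead requires "bat" to be followed by the end of the string.
print(accepted(NO_BAT, names))

# Adding "exe$" as an alternative inside the assertion rejects that extension too.
print(accepted(NO_BAT_OR_EXE, names))
```

With these inputs, the first call should keep everything except autoexec.bat, and the second should additionally drop setup.exe, which matches the behaviour the revised text describes.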