From: Fred L. D. <fd...@us...> - 2001-04-23 16:54:53
Update of /cvsroot/py-howto/pyhowto
In directory usw-pr-cvs1:/tmp/cvs-serv29333

Modified Files:
regex.tex
Log Message:
Fix re.VERBOSE-modified RE; "#" as part of the pattern was not escaped. Closes SF bug #416374.
Wrap some wide paragraphs.
Remove extraneous "%" characters from otherwise blank lines after verbatim environments, except in a couple of places where we needed to bow to font-lock. ;-(

Index: regex.tex
===================================================================
RCS file: /cvsroot/py-howto/pyhowto/regex.tex,v
retrieving revision 1.8
retrieving revision 1.9
diff -C2 -r1.8 -r1.9
*** regex.tex 2000/07/28 02:06:27 1.8
--- regex.tex 2001/04/23 16:54:49 1.9
*************** *** 70,74 **** We'll start by learning about the simplest possible regular ! expressions. Since regular expressions are used to operate on strings, we'll start with the most common task: matching characters. For a detailed explanation of the computer science underlying regular --- 70,75 ---- We'll start by learning about the simplest possible regular ! expressions. Since regular expressions are used to operate on ! strings, we'll start with the most common task: matching characters. For a detailed explanation of the computer science underlying regular *************** *** 90,100 **** devoted to discussing various metacharacters and what they do. ! Here's a complete list of the metacharacters; their meanings will be discussed ! in the rest of this HOWTO. \begin{verbatim} . ^ $ * + ? { [ \ | ( ) \end{verbatim} ! % The first metacharacter we'll look at is \samp{[}; it's used for specifying a character class, which is a set of characters that you --- 91,102 ---- devoted to discussing various metacharacters and what they do. ! Here's a complete list of the metacharacters; their meanings will be ! discussed in the rest of this HOWTO. \begin{verbatim} . ^ $ * + ? { [ \ | ( ) \end{verbatim} ! % $ !
The first metacharacter we'll look at is \samp{[}; it's used for specifying a character class, which is a set of characters that you *************** *** 107,114 **** RE would be \regexp{[a-z]}. ! Metacharacters are not active inside classes. For example, \regexp{[akm\$]} ! will match any of the characters \character{a}, \character{k}, ! \character{m}, or \character{\$}; \character{\$} is usually a metacharacter, but inside a character class it's stripped ! of its special nature. You can match the characters not within a range by \dfn{complementing} --- 109,117 ---- RE would be \regexp{[a-z]}. ! Metacharacters are not active inside classes. For example, ! \regexp{[akm\$]} will match any of the characters \character{a}, ! \character{k}, \character{m}, or \character{\$}; \character{\$} is ! usually a metacharacter, but inside a character class it's stripped of ! its special nature. You can match the characters not within a range by \dfn{complementing} *************** *** 134,150 **** \item[\code{\e d}]Matches any decimal digit; this is equivalent to the class \regexp{[0-9]}. ! % \item[\code{\e D}]Matches any non-digit character; this is equivalent to the class \verb|[^0-9]|. ! % \item[\code{\e s}]Matches any whitespace character; this is equivalent to the class \regexp{[ \e t\e n\e r\e f\e v]}. ! % \item[\code{\e S}]Matches any non-whitespace character; this is equivalent to the class \verb|[^ \t\n\r\f\v]|. ! % \item[\code{\e w}]Matches any alphanumeric character; this is equivalent to the class \regexp{[a-zA-Z0-9_]}. ! % \item[\code{\e W}]Matches any non-alphanumeric character; this is equivalent to the class \verb|[^a-zA-Z0-9_]|. --- 137,153 ---- \item[\code{\e d}]Matches any decimal digit; this is equivalent to the class \regexp{[0-9]}. ! \item[\code{\e D}]Matches any non-digit character; this is equivalent to the class \verb|[^0-9]|. ! \item[\code{\e s}]Matches any whitespace character; this is equivalent to the class \regexp{[ \e t\e n\e r\e f\e v]}. ! 
\item[\code{\e S}]Matches any non-whitespace character; this is equivalent to the class \verb|[^ \t\n\r\f\v]|. ! \item[\code{\e w}]Matches any alphanumeric character; this is equivalent to the class \regexp{[a-zA-Z0-9_]}. ! \item[\code{\e W}]Matches any non-alphanumeric character; this is equivalent to the class \verb|[^a-zA-Z0-9_]|. *************** *** 272,276 **** <re.RegexObject instance at 80b4150> \end{verbatim} ! % \function{re.compile()} also accepts an optional \var{flags} argument, used to enable various special features and syntax --- 275,279 ---- <re.RegexObject instance at 80b4150> \end{verbatim} ! \function{re.compile()} also accepts an optional \var{flags} argument, used to enable various special features and syntax *************** *** 281,285 **** >>> p = re.compile('ab*', re.IGNORECASE) \end{verbatim} ! % The RE is passed to \function{re.compile()} as a string. REs are handled as strings because regular expressions aren't --- 284,288 ---- >>> p = re.compile('ab*', re.IGNORECASE) \end{verbatim} ! The RE is passed to \function{re.compile()} as a string. REs are handled as strings because regular expressions aren't *************** *** 379,383 **** <re.RegexObject instance at 80c3c28> \end{verbatim} ! % Now, you can try matching various strings against the RE \regexp{[a-z]+}. An empty string shouldn't match at all, since --- 382,386 ---- <re.RegexObject instance at 80c3c28> \end{verbatim} ! Now, you can try matching various strings against the RE \regexp{[a-z]+}. An empty string shouldn't match at all, since *************** *** 392,396 **** None \end{verbatim} ! % Now, let's try it on a string that it should match, such as \samp{tempo}. In this case, \method{match()} will return a --- 395,399 ---- None \end{verbatim} ! Now, let's try it on a string that it should match, such as \samp{tempo}. In this case, \method{match()} will return a *************** *** 403,407 **** <re.MatchObject instance at 80c4f68> \end{verbatim} ! 
% Now you can query the \class{MatchObject} for information about the matching string. \class{MatchObject} instances also have several --- 406,410 ---- <re.MatchObject instance at 80c4f68> \end{verbatim} ! Now you can query the \class{MatchObject} for information about the matching string. \class{MatchObject} instances also have several *************** *** 425,429 **** (0, 5) \end{verbatim} ! % \method{group()} returns the substring that was matched by the RE. \method{start()} and \method{end()} return the starting and --- 428,432 ---- (0, 5) \end{verbatim} ! \method{group()} returns the substring that was matched by the RE. \method{start()} and \method{end()} return the starting and *************** *** 445,449 **** (4, 11) \end{verbatim} ! % In actual programs, the most common style is to store the \class{MatchObject} in a variable, and then check if it was --- 448,452 ---- (4, 11) \end{verbatim} ! In actual programs, the most common style is to store the \class{MatchObject} in a variable, and then check if it was *************** *** 458,462 **** print 'No match' \end{verbatim} ! % \subsection{Module-Level Functions} --- 461,465 ---- print 'No match' \end{verbatim} ! \subsection{Module-Level Functions} *************** *** 475,479 **** <re.MatchObject instance at 80c5978> \end{verbatim} ! % Under the hood, these functions simply produce a \class{RegexObject} for you and call the appropriate method on it. They also store the --- 478,482 ---- <re.MatchObject instance at 80c5978> \end{verbatim} ! Under the hood, these functions simply produce a \class{RegexObject} for you and call the appropriate method on it. They also store the *************** *** 498,502 **** starttagopen = re.compile( ... ) \end{verbatim} ! % (I generally prefer to work with the compiled object, even for one-time uses, but few people will be as much of a purist about this --- 501,505 ---- starttagopen = re.compile( ... ) \end{verbatim} ! 
(I generally prefer to work with the compiled object, even for one-time uses, but few people will be as much of a purist about this *************** *** 594,598 **** \begin{verbatim} charref = re.compile(r""" ! &# # Start of a numeric entity reference (?P<char> [0-9]+[^0-9] # Decimal form --- 597,601 ---- \begin{verbatim} charref = re.compile(r""" ! &\# # Start of a numeric entity reference (?P<char> [0-9]+[^0-9] # Decimal form *************** *** 602,606 **** """, re.VERBOSE) \end{verbatim} ! % Without the verbose setting, the RE would look like this: \begin{verbatim} --- 605,609 ---- """, re.VERBOSE) \end{verbatim} ! Without the verbose setting, the RE would look like this: \begin{verbatim} *************** *** 609,613 **** "|x[0-9a-fA-F]+[^0-9a-fA-F])") \end{verbatim} ! % In the above example, Python's automatic concatenation of string literals has been used to break up the RE into smaller pieces, but it's still more difficult to --- 612,616 ---- "|x[0-9a-fA-F]+[^0-9a-fA-F])") \end{verbatim} ! In the above example, Python's automatic concatenation of string literals has been used to break up the RE into smaller pieces, but it's still more difficult to *************** *** 639,643 **** \begin{list}{}{} ! % \item[\regexp{|}] Alternation, or the ``or'' operator. --- 642,646 ---- \begin{list}{}{} ! \item[\regexp{|}] Alternation, or the ``or'' operator. *************** *** 651,655 **** To match a literal \character{|}, use \regexp{\e|}, or enclose it inside a character class, as in \regexp{[|]}. ! % \item[\regexp{\^}] Matches at the beginning of lines. Unless the \constant{MULTILINE} flag has been set, this will only match at the --- 654,658 ---- To match a literal \character{|}, use \regexp{\e|}, or enclose it inside a character class, as in \regexp{[|]}. ! \item[\regexp{\^}] Matches at the beginning of lines. 
Unless the \constant{MULTILINE} flag has been set, this will only match at the *************** *** 670,674 **** use \regexp{\e\^}, or enclose it inside a character class, as in \regexp{[{\e}\^]}. ! % \item[\regexp{\$}] Matches at the end of lines, which is defined as either the end of the string, or any location followed by a newline --- 673,677 ---- use \regexp{\e\^}, or enclose it inside a character class, as in \regexp{[{\e}\^]}. ! \item[\regexp{\$}] Matches at the end of lines, which is defined as either the end of the string, or any location followed by a newline *************** *** 683,690 **** <re.MatchObject instance at 80adfa8> \end{verbatim} ! % ! To match a literal \character{\$}, ! use \regexp{\e\$}, or enclose it inside a character class, as in \regexp{[\$]}. ! % \item[\regexp{\e A}] Matches only at the start of the string. When not in \constant{MULTILINE} mode, \regexp{\e A} and \regexp{\^} are effectively --- 686,694 ---- <re.MatchObject instance at 80adfa8> \end{verbatim} ! % $ ! ! To match a literal \character{\$}, use \regexp{\e\$}, or enclose it ! inside a character class, as in \regexp{[\$]}. ! \item[\regexp{\e A}] Matches only at the start of the string. When not in \constant{MULTILINE} mode, \regexp{\e A} and \regexp{\^} are effectively *************** *** 693,699 **** \regexp{\^} may match at several locations inside the string (anywhere following a newline character). ! % \item[\regexp{\e Z}]Matches only at the end of the string. ! % \item[\regexp{\e b}] Word boundary. This is a zero-width assertion that matches only at the --- 697,703 ---- \regexp{\^} may match at several locations inside the string (anywhere following a newline character). ! \item[\regexp{\e Z}]Matches only at the end of the string. ! \item[\regexp{\e b}] Word boundary. This is a zero-width assertion that matches only at the *************** *** 714,718 **** None \end{verbatim} ! % There are two subtleties you should remember when using this special sequence. 
First, this is the worst collision between Python's string --- 718,722 ---- None \end{verbatim} ! There are two subtleties you should remember when using this special sequence. First, this is the worst collision between Python's string *************** *** 731,743 **** <re.MatchObject instance at 80c3ee0> \end{verbatim} ! % Second, inside a character class, where there's no use for this assertion, \regexp{\e b} represents the backspace character, for compatibility with Python's string literals. ! % \item[\regexp{\e B}] Another zero-width assertion, this is the opposite of \regexp{\e b}, only matching when the current position is not at a word boundary. ! % \end{list} --- 735,747 ---- <re.MatchObject instance at 80c3ee0> \end{verbatim} ! Second, inside a character class, where there's no use for this assertion, \regexp{\e b} represents the backspace character, for compatibility with Python's string literals. ! \item[\regexp{\e B}] Another zero-width assertion, this is the opposite of \regexp{\e b}, only matching when the current position is not at a word boundary. ! \end{list} *************** *** 927,931 **** 'Lots' \end{verbatim} ! % Named groups are handy because they let you use easily-remembered names, instead of having to remember numbers. Here's an example RE --- 931,935 ---- 'Lots' \end{verbatim} ! Named groups are handy because they let you use easily-remembered names, instead of having to remember numbers. Here's an example RE *************** *** 940,944 **** r'"') \end{verbatim} ! % It's obviously much easier to retrieve \code{m.group('zonem')}, instead of having to remember to retrieve group 9. --- 944,948 ---- r'"') \end{verbatim} ! It's obviously much easier to retrieve \code{m.group('zonem')}, instead of having to remember to retrieve group 9. 
*************** *** 997,1000 **** --- 1001,1005 ---- \verb|.*[.][^b].*$| + % $ First attempt: Exclude \samp{bat} by requiring that the first *************** *** 1007,1014 **** The expression gets messier when you try to patch up the first solution by requiring one of the following cases to match: the first ! character of the extension isn't ! \samp{b}; the second character isn't \samp{a}; or the third ! character isn't \samp{t}. This accepts \samp{foo.bar} and rejects ! \samp{autoexec.bat}, but it requires a three-letter extension, and doesn't accept \samp{sendmail.cf}. Another bug, so we'll complicate the pattern again in an effort to fix it. \regexp{.*[.]([\^b].?.?|.[\^a]?.?|..?[\^t]?)\$} --- 1012,1021 ---- The expression gets messier when you try to patch up the first solution by requiring one of the following cases to match: the first ! character of the extension isn't \samp{b}; the second character isn't ! \samp{a}; or the third character isn't \samp{t}. This accepts ! \samp{foo.bar} and rejects \samp{autoexec.bat}, but it requires a ! three-letter extension, and doesn't accept \samp{sendmail.cf}. ! Another bug, so we'll complicate the pattern again in an effort to fix ! it. \regexp{.*[.]([\^b].?.?|.[\^a]?.?|..?[\^t]?)\$} *************** *** 1068,1072 **** returned as the final element of the list. In the following example, the delimiter will be any sequence of non-alphanumeric characters. ! % \begin{verbatim} >>> p = re.compile(r'\W+') --- 1075,1079 ---- returned as the final element of the list. In the following example, the delimiter will be any sequence of non-alphanumeric characters. ! \begin{verbatim} >>> p = re.compile(r'\W+') *************** *** 1076,1080 **** ['This', 'is', 'a', 'test, short and sweet, of split().'] \end{verbatim} ! % Sometimes you're not only interested in what the text between delimiters is, but also need to know what the delimiter was. If --- 1083,1087 ---- ['This', 'is', 'a', 'test, short and sweet, of split().'] \end{verbatim} ! 
Sometimes you're not only interested in what the text between delimiters is, but also need to know what the delimiter was. If *************** *** 1090,1094 **** ['This', '... ', 'is', ' ', 'a', ' ', 'test', '.', ''] \end{verbatim} ! % The module-level function \function{re.split()} adds the RE to be used as the first argument, but is otherwise the same. --- 1097,1101 ---- ['This', '... ', 'is', ' ', 'a', ' ', 'test', '.', ''] \end{verbatim} ! The module-level function \function{re.split()} adds the RE to be used as the first argument, but is otherwise the same. *************** *** 1131,1135 **** 'colour socks and red shoes' \end{verbatim} ! % Empty matches are replaced only when not they're not adjacent to a previous match. --- 1138,1142 ---- 'colour socks and red shoes' \end{verbatim} ! Empty matches are replaced only when not they're not adjacent to a previous match. *************** *** 1140,1144 **** '-a-b-d-' \end{verbatim} ! % If \var{replacement} is a string, any backslash escapes in it are processed. That is, \samp{\e n} is converted to a single newline --- 1147,1151 ---- '-a-b-d-' \end{verbatim} ! If \var{replacement} is a string, any backslash escapes in it are processed. That is, \samp{\e n} is converted to a single newline *************** *** 1155,1159 **** 'subsection{First} subsection{second}' \end{verbatim} ! % In addition to character escapes and backreferences as described above, \samp{\e g<name>} will use the substring matched by the group --- 1162,1166 ---- 'subsection{First} subsection{second}' \end{verbatim} ! In addition to character escapes and backreferences as described above, \samp{\e g<name>} will use the substring matched by the group *************** *** 1176,1180 **** 'subsection{First}' \end{verbatim} ! % \var{replacement} can also be a function, which gives you even more powerful control. If \var{replacement} is a function, the function is --- 1183,1187 ---- 'subsection{First}' \end{verbatim} ! 
\var{replacement} can also be a function, which gives you even more powerful control. If \var{replacement} is a function, the function is *************** *** 1183,1187 **** information to compute the desired replacement string and return it. For example: ! % \begin{verbatim} >>> def hexrepl( match ): --- 1190,1194 ---- information to compute the desired replacement string and return it. For example: ! \begin{verbatim} >>> def hexrepl( match ): *************** *** 1194,1198 **** 'Call 0xffd2 for printing, 0xc000 for user code.' \end{verbatim} ! % When using the module-level \function{re.sub()} function, the pattern is passed as the first argument. The pattern may be a string or a --- 1201,1205 ---- 'Call 0xffd2 for printing, 0xc000 for user code.' \end{verbatim} ! When using the module-level \function{re.sub()} function, the pattern is passed as the first argument. The pattern may be a string or a *************** *** 1260,1264 **** None \end{verbatim} ! % On the other hand, \module{search()} will scan forward through the string, reporting the first match it finds. --- 1267,1271 ---- None \end{verbatim} ! On the other hand, \module{search()} will scan forward through the string, reporting the first match it finds. *************** *** 1270,1274 **** (2, 7) \end{verbatim} ! % Sometimes you'll be tempted to keep using \function{re.match()}, and just add \regexp{.*} to the front of your RE. Resist this tempation, --- 1277,1281 ---- (2, 7) \end{verbatim} ! Sometimes you'll be tempted to keep using \function{re.match()}, and just add \regexp{.*} to the front of your RE. Resist this tempation, *************** *** 1303,1307 **** <html><head><title>Title</title> \end{verbatim} ! % The RE matches the \character{<} in \samp{<html>}, and the \regexp{.*} consumes the rest of the string. There's still more left --- 1310,1314 ---- <html><head><title>Title</title> \end{verbatim} ! The RE matches the \character{<} in \samp{<html>}, and the \regexp{.*} consumes the rest of the string. 
There's still more left *************** *** 1324,1328 **** <html> \end{verbatim} ! % \subsection{Not using re.VERBOSE} --- 1331,1335 ---- <html> \end{verbatim} ! \subsection{Not using re.VERBOSE} *************** *** 1356,1360 **** """, re.VERBOSE) \end{verbatim} ! % This is far more readable than: --- 1363,1368 ---- """, re.VERBOSE) \end{verbatim} ! % $ ! This is far more readable than: *************** *** 1362,1366 **** pat = re.compile(r"\s*(?P<header>[^:]+)\s*:(?P<value>.*?)\s*$") \end{verbatim} ! % \section{Feedback} --- 1370,1375 ---- pat = re.compile(r"\s*(?P<header>[^:]+)\s*:(?P<value>.*?)\s*$") \end{verbatim} ! % $ ! \section{Feedback} *************** *** 1383,1387 **** substring matched by the group \emph{cannot} be retrieved after performing a match or referenced later in the pattern. ! % \item[\code{(?P<\var{name}>...)}] Similar to regular parentheses, but the substring matched by the group is accessible via the symbolic group --- 1392,1396 ---- substring matched by the group \emph{cannot} be retrieved after performing a match or referenced later in the pattern. ! \item[\code{(?P<\var{name}>...)}] Similar to regular parentheses, but the substring matched by the group is accessible via the symbolic group *************** *** 1396,1403 **** or \code{m.end('id')}, and also by name in pattern text (e.g. \regexp{(?P=id)}) and replacement text (e.g. \code{\e g<id>}). ! % \item[\code{(?P=\var{name})}] Matches whatever text was matched by the earlier group named \var{name}. - % \item[\code{(?=...)}] Matches if \regexp{...} matches next, but doesn't --- 1405,1411 ---- or \code{m.end('id')}, and also by name in pattern text (e.g. \regexp{(?P=id)}) and replacement text (e.g. \code{\e g<id>}). ! \item[\code{(?P=\var{name})}] Matches whatever text was matched by the earlier group named \var{name}. 
\item[\code{(?=...)}] Matches if \regexp{...} matches next, but doesn't *************** *** 1405,1409 **** example, \regexp{Isaac (?=Asimov)} will match \code{'Isaac~'} only if it's followed by \code{'Asimov'}. ! % \item[\code{(?!...)}] Matches if \regexp{...} doesn't match next. This is a negative lookahead assertion. For example, --- 1413,1417 ---- example, \regexp{Isaac (?=Asimov)} will match \code{'Isaac~'} only if it's followed by \code{'Asimov'}. ! \item[\code{(?!...)}] Matches if \regexp{...} doesn't match next. This is a negative lookahead assertion. For example,