From: Fred L. D. <fd...@us...> - 2001-04-23 16:54:53
Update of /cvsroot/py-howto/pyhowto
In directory usw-pr-cvs1:/tmp/cvs-serv29333
Modified Files:
regex.tex
Log Message:
Fix re.VERBOSE-modified RE; "#" as part of the pattern was not escaped.
Closes SF bug #416374.
Wrap some wide paragraphs.
Remove extraneous "%" characters from otherwise blank lines after verbatim
environments, except in a couple of places where we needed to bow to
font-lock. ;-(
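
For readers unfamiliar with the first fix, the following is a minimal sketch
(not part of the commit) of why the "#" has to be escaped. Under re.VERBOSE an
unescaped "#" outside a character class starts a comment, so the rest of the
line is dropped from the pattern. The names broken and fixed and the shortened
pattern are illustrative assumptions, not the HOWTO's full charref example.

```python
import re

# Unescaped "#": in VERBOSE mode it starts a comment, so the first
# line contributes only "&" and the pattern becomes "&[0-9]+".
broken = re.compile(r"""
 &#        # Start of a numeric entity reference
 [0-9]+    # Decimal digits
""", re.VERBOSE)

# Escaped "#": the pattern is "&\#[0-9]+", as intended.
fixed = re.compile(r"""
 &\#       # Start of a numeric entity reference
 [0-9]+    # Decimal digits
""", re.VERBOSE)

text = "&#169;"
broken_match = broken.match(text)  # None: "#" is not a digit
fixed_match = fixed.match(text)    # matches "&#169"
```

Here print(fixed_match.group()) shows '&#169', while broken_match is None. The
broken pattern would instead match a string such as "&169", which is not an
entity reference at all.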
Index: regex.tex
===================================================================
RCS file: /cvsroot/py-howto/pyhowto/regex.tex,v
retrieving revision 1.8
retrieving revision 1.9
diff -C2 -r1.8 -r1.9
*** regex.tex 2000/07/28 02:06:27 1.8
--- regex.tex 2001/04/23 16:54:49 1.9
***************
*** 70,74 ****
We'll start by learning about the simplest possible regular
! expressions. Since regular expressions are used to operate on strings, we'll start with the most common task: matching characters.
For a detailed explanation of the computer science underlying regular
--- 70,75 ----
We'll start by learning about the simplest possible regular
! expressions. Since regular expressions are used to operate on
! strings, we'll start with the most common task: matching characters.
For a detailed explanation of the computer science underlying regular
***************
*** 90,100 ****
devoted to discussing various metacharacters and what they do.
! Here's a complete list of the metacharacters; their meanings will be discussed
! in the rest of this HOWTO.
\begin{verbatim}
. ^ $ * + ? { [ \ | ( )
\end{verbatim}
! %
The first metacharacter we'll look at is \samp{[}; it's used for
specifying a character class, which is a set of characters that you
--- 91,102 ----
devoted to discussing various metacharacters and what they do.
! Here's a complete list of the metacharacters; their meanings will be
! discussed in the rest of this HOWTO.
\begin{verbatim}
. ^ $ * + ? { [ \ | ( )
\end{verbatim}
! % $
!
The first metacharacter we'll look at is \samp{[}; it's used for
specifying a character class, which is a set of characters that you
***************
*** 107,114 ****
RE would be \regexp{[a-z]}.
! Metacharacters are not active inside classes. For example, \regexp{[akm\$]}
! will match any of the characters \character{a}, \character{k},
! \character{m}, or \character{\$}; \character{\$} is usually a metacharacter, but inside a character class it's stripped
! of its special nature.
You can match the characters not within a range by \dfn{complementing}
--- 109,117 ----
RE would be \regexp{[a-z]}.
! Metacharacters are not active inside classes. For example,
! \regexp{[akm\$]} will match any of the characters \character{a},
! \character{k}, \character{m}, or \character{\$}; \character{\$} is
! usually a metacharacter, but inside a character class it's stripped of
! its special nature.
You can match the characters not within a range by \dfn{complementing}
***************
*** 134,150 ****
\item[\code{\e d}]Matches any decimal digit; this is
equivalent to the class \regexp{[0-9]}.
! %
\item[\code{\e D}]Matches any non-digit character; this is
equivalent to the class \verb|[^0-9]|.
! %
\item[\code{\e s}]Matches any whitespace character; this is
equivalent to the class \regexp{[ \e t\e n\e r\e f\e v]}.
! %
\item[\code{\e S}]Matches any non-whitespace character; this is
equivalent to the class \verb|[^ \t\n\r\f\v]|.
! %
\item[\code{\e w}]Matches any alphanumeric character; this is equivalent to the class
\regexp{[a-zA-Z0-9_]}.
! %
\item[\code{\e W}]Matches any non-alphanumeric character; this is equivalent to the class
\verb|[^a-zA-Z0-9_]|.
--- 137,153 ----
\item[\code{\e d}]Matches any decimal digit; this is
equivalent to the class \regexp{[0-9]}.
!
\item[\code{\e D}]Matches any non-digit character; this is
equivalent to the class \verb|[^0-9]|.
!
\item[\code{\e s}]Matches any whitespace character; this is
equivalent to the class \regexp{[ \e t\e n\e r\e f\e v]}.
!
\item[\code{\e S}]Matches any non-whitespace character; this is
equivalent to the class \verb|[^ \t\n\r\f\v]|.
!
\item[\code{\e w}]Matches any alphanumeric character; this is equivalent to the class
\regexp{[a-zA-Z0-9_]}.
!
\item[\code{\e W}]Matches any non-alphanumeric character; this is equivalent to the class
\verb|[^a-zA-Z0-9_]|.
***************
*** 272,276 ****
<re.RegexObject instance at 80b4150>
\end{verbatim}
! %
\function{re.compile()} also accepts an optional \var{flags}
argument, used to enable various special features and syntax
--- 275,279 ----
<re.RegexObject instance at 80b4150>
\end{verbatim}
!
\function{re.compile()} also accepts an optional \var{flags}
argument, used to enable various special features and syntax
***************
*** 281,285 ****
>>> p = re.compile('ab*', re.IGNORECASE)
\end{verbatim}
! %
The RE is passed to \function{re.compile()} as a string.
REs are handled as strings because regular expressions aren't
--- 284,288 ----
>>> p = re.compile('ab*', re.IGNORECASE)
\end{verbatim}
!
The RE is passed to \function{re.compile()} as a string.
REs are handled as strings because regular expressions aren't
***************
*** 379,383 ****
<re.RegexObject instance at 80c3c28>
\end{verbatim}
! %
Now, you can try matching various strings against the RE
\regexp{[a-z]+}. An empty string shouldn't match at all, since
--- 382,386 ----
<re.RegexObject instance at 80c3c28>
\end{verbatim}
!
Now, you can try matching various strings against the RE
\regexp{[a-z]+}. An empty string shouldn't match at all, since
***************
*** 392,396 ****
None
\end{verbatim}
! %
Now, let's try it on a string that it should match, such as
\samp{tempo}. In this case, \method{match()} will return a
--- 395,399 ----
None
\end{verbatim}
!
Now, let's try it on a string that it should match, such as
\samp{tempo}. In this case, \method{match()} will return a
***************
*** 403,407 ****
<re.MatchObject instance at 80c4f68>
\end{verbatim}
! %
Now you can query the \class{MatchObject} for information about the
matching string. \class{MatchObject} instances also have several
--- 406,410 ----
<re.MatchObject instance at 80c4f68>
\end{verbatim}
!
Now you can query the \class{MatchObject} for information about the
matching string. \class{MatchObject} instances also have several
***************
*** 425,429 ****
(0, 5)
\end{verbatim}
! %
\method{group()} returns the substring that was matched by the
RE. \method{start()} and \method{end()} return the starting and
--- 428,432 ----
(0, 5)
\end{verbatim}
!
\method{group()} returns the substring that was matched by the
RE. \method{start()} and \method{end()} return the starting and
***************
*** 445,449 ****
(4, 11)
\end{verbatim}
! %
In actual programs, the most common style is to store the
\class{MatchObject} in a variable, and then check if it was
--- 448,452 ----
(4, 11)
\end{verbatim}
!
In actual programs, the most common style is to store the
\class{MatchObject} in a variable, and then check if it was
***************
*** 458,462 ****
print 'No match'
\end{verbatim}
! %
\subsection{Module-Level Functions}
--- 461,465 ----
print 'No match'
\end{verbatim}
!
\subsection{Module-Level Functions}
***************
*** 475,479 ****
<re.MatchObject instance at 80c5978>
\end{verbatim}
! %
Under the hood, these functions simply produce a \class{RegexObject}
for you and call the appropriate method on it. They also store the
--- 478,482 ----
<re.MatchObject instance at 80c5978>
\end{verbatim}
!
Under the hood, these functions simply produce a \class{RegexObject}
for you and call the appropriate method on it. They also store the
***************
*** 498,502 ****
starttagopen = re.compile( ... )
\end{verbatim}
! %
(I generally prefer to work with the compiled object, even for
one-time uses, but few people will be as much of a purist about this
--- 501,505 ----
starttagopen = re.compile( ... )
\end{verbatim}
!
(I generally prefer to work with the compiled object, even for
one-time uses, but few people will be as much of a purist about this
***************
*** 594,598 ****
\begin{verbatim}
charref = re.compile(r"""
! &# # Start of a numeric entity reference
(?P<char>
[0-9]+[^0-9] # Decimal form
--- 597,601 ----
\begin{verbatim}
charref = re.compile(r"""
! &\# # Start of a numeric entity reference
(?P<char>
[0-9]+[^0-9] # Decimal form
***************
*** 602,606 ****
""", re.VERBOSE)
\end{verbatim}
! %
Without the verbose setting, the RE would look like this:
\begin{verbatim}
--- 605,609 ----
""", re.VERBOSE)
\end{verbatim}
!
Without the verbose setting, the RE would look like this:
\begin{verbatim}
***************
*** 609,613 ****
"|x[0-9a-fA-F]+[^0-9a-fA-F])")
\end{verbatim}
! %
In the above example, Python's automatic concatenation of string literals has been used to
break up the RE into smaller pieces, but it's still more difficult to
--- 612,616 ----
"|x[0-9a-fA-F]+[^0-9a-fA-F])")
\end{verbatim}
!
In the above example, Python's automatic concatenation of string literals has been used to
break up the RE into smaller pieces, but it's still more difficult to
***************
*** 639,643 ****
\begin{list}{}{}
! %
\item[\regexp{|}]
Alternation, or the ``or'' operator.
--- 642,646 ----
\begin{list}{}{}
!
\item[\regexp{|}]
Alternation, or the ``or'' operator.
***************
*** 651,655 ****
To match a literal \character{|},
use \regexp{\e|}, or enclose it inside a character class, as in \regexp{[|]}.
! %
\item[\regexp{\^}] Matches at the beginning of lines. Unless the
\constant{MULTILINE} flag has been set, this will only match at the
--- 654,658 ----
To match a literal \character{|},
use \regexp{\e|}, or enclose it inside a character class, as in \regexp{[|]}.
!
\item[\regexp{\^}] Matches at the beginning of lines. Unless the
\constant{MULTILINE} flag has been set, this will only match at the
***************
*** 670,674 ****
use \regexp{\e\^}, or enclose it inside a character class, as in
\regexp{[{\e}\^]}.
! %
\item[\regexp{\$}] Matches at the end of lines, which is defined as
either the end of the string, or any location followed by a newline
--- 673,677 ----
use \regexp{\e\^}, or enclose it inside a character class, as in
\regexp{[{\e}\^]}.
!
\item[\regexp{\$}] Matches at the end of lines, which is defined as
either the end of the string, or any location followed by a newline
***************
*** 683,690 ****
<re.MatchObject instance at 80adfa8>
\end{verbatim}
! %
! To match a literal \character{\$},
! use \regexp{\e\$}, or enclose it inside a character class, as in \regexp{[\$]}.
! %
\item[\regexp{\e A}] Matches only at the start of the string. When not
in \constant{MULTILINE} mode, \regexp{\e A} and \regexp{\^} are effectively
--- 686,694 ----
<re.MatchObject instance at 80adfa8>
\end{verbatim}
! % $
!
! To match a literal \character{\$}, use \regexp{\e\$}, or enclose it
! inside a character class, as in \regexp{[\$]}.
!
\item[\regexp{\e A}] Matches only at the start of the string. When not
in \constant{MULTILINE} mode, \regexp{\e A} and \regexp{\^} are effectively
***************
*** 693,699 ****
\regexp{\^} may match at several locations inside the string (anywhere
following a newline character).
! %
\item[\regexp{\e Z}]Matches only at the end of the string.
! %
\item[\regexp{\e b}] Word boundary.
This is a zero-width assertion that matches only at the
--- 697,703 ----
\regexp{\^} may match at several locations inside the string (anywhere
following a newline character).
!
\item[\regexp{\e Z}]Matches only at the end of the string.
!
\item[\regexp{\e b}] Word boundary.
This is a zero-width assertion that matches only at the
***************
*** 714,718 ****
None
\end{verbatim}
! %
There are two subtleties you should remember when using this special
sequence. First, this is the worst collision between Python's string
--- 718,722 ----
None
\end{verbatim}
!
There are two subtleties you should remember when using this special
sequence. First, this is the worst collision between Python's string
***************
*** 731,743 ****
<re.MatchObject instance at 80c3ee0>
\end{verbatim}
! %
Second, inside a character class, where there's no use for this
assertion, \regexp{\e b} represents the backspace character, for
compatibility with Python's string literals.
! %
\item[\regexp{\e B}] Another zero-width assertion, this is the
opposite of \regexp{\e b}, only matching when the current
position is not at a word boundary.
! %
\end{list}
--- 735,747 ----
<re.MatchObject instance at 80c3ee0>
\end{verbatim}
!
Second, inside a character class, where there's no use for this
assertion, \regexp{\e b} represents the backspace character, for
compatibility with Python's string literals.
!
\item[\regexp{\e B}] Another zero-width assertion, this is the
opposite of \regexp{\e b}, only matching when the current
position is not at a word boundary.
!
\end{list}
***************
*** 927,931 ****
'Lots'
\end{verbatim}
! %
Named groups are handy because they let you use easily-remembered
names, instead of having to remember numbers. Here's an example RE
--- 931,935 ----
'Lots'
\end{verbatim}
!
Named groups are handy because they let you use easily-remembered
names, instead of having to remember numbers. Here's an example RE
***************
*** 940,944 ****
r'"')
\end{verbatim}
! %
It's obviously much easier to retrieve \code{m.group('zonem')},
instead of having to remember to retrieve group 9.
--- 944,948 ----
r'"')
\end{verbatim}
!
It's obviously much easier to retrieve \code{m.group('zonem')},
instead of having to remember to retrieve group 9.
***************
*** 997,1000 ****
--- 1001,1005 ----
\verb|.*[.][^b].*$|
+ % $
First attempt: Exclude \samp{bat} by requiring that the first
***************
*** 1007,1014 ****
The expression gets messier when you try to patch up the first
solution by requiring one of the following cases to match: the first
! character of the extension isn't
! \samp{b}; the second character isn't \samp{a}; or the third
! character isn't \samp{t}. This accepts \samp{foo.bar} and rejects
! \samp{autoexec.bat}, but it requires a three-letter extension, and doesn't accept \samp{sendmail.cf}. Another bug, so we'll complicate the pattern again in an effort to fix it.
\regexp{.*[.]([\^b].?.?|.[\^a]?.?|..?[\^t]?)\$}
--- 1012,1021 ----
The expression gets messier when you try to patch up the first
solution by requiring one of the following cases to match: the first
! character of the extension isn't \samp{b}; the second character isn't
! \samp{a}; or the third character isn't \samp{t}. This accepts
! \samp{foo.bar} and rejects \samp{autoexec.bat}, but it requires a
! three-letter extension, and doesn't accept \samp{sendmail.cf}.
! Another bug, so we'll complicate the pattern again in an effort to fix
! it.
\regexp{.*[.]([\^b].?.?|.[\^a]?.?|..?[\^t]?)\$}
***************
*** 1068,1072 ****
returned as the final element of the list. In the following example,
the delimiter will be any sequence of non-alphanumeric characters.
! %
\begin{verbatim}
>>> p = re.compile(r'\W+')
--- 1075,1079 ----
returned as the final element of the list. In the following example,
the delimiter will be any sequence of non-alphanumeric characters.
!
\begin{verbatim}
>>> p = re.compile(r'\W+')
***************
*** 1076,1080 ****
['This', 'is', 'a', 'test, short and sweet, of split().']
\end{verbatim}
! %
Sometimes you're not only interested in what the text between
delimiters is, but also need to know what the delimiter was. If
--- 1083,1087 ----
['This', 'is', 'a', 'test, short and sweet, of split().']
\end{verbatim}
!
Sometimes you're not only interested in what the text between
delimiters is, but also need to know what the delimiter was. If
***************
*** 1090,1094 ****
['This', '... ', 'is', ' ', 'a', ' ', 'test', '.', '']
\end{verbatim}
! %
The module-level function \function{re.split()} adds the RE to be
used as the first argument, but is otherwise the same.
--- 1097,1101 ----
['This', '... ', 'is', ' ', 'a', ' ', 'test', '.', '']
\end{verbatim}
!
The module-level function \function{re.split()} adds the RE to be
used as the first argument, but is otherwise the same.
***************
*** 1131,1135 ****
'colour socks and red shoes'
\end{verbatim}
! %
Empty matches are replaced only when they're not
adjacent to a previous match.
--- 1138,1142 ----
'colour socks and red shoes'
\end{verbatim}
!
Empty matches are replaced only when they're not
adjacent to a previous match.
***************
*** 1140,1144 ****
'-a-b-d-'
\end{verbatim}
! %
If \var{replacement} is a string, any backslash escapes in it are
processed. That is, \samp{\e n} is converted to a single newline
--- 1147,1151 ----
'-a-b-d-'
\end{verbatim}
!
If \var{replacement} is a string, any backslash escapes in it are
processed. That is, \samp{\e n} is converted to a single newline
***************
*** 1155,1159 ****
'subsection{First} subsection{second}'
\end{verbatim}
! %
In addition to character escapes and backreferences as described
above, \samp{\e g<name>} will use the substring matched by the group
--- 1162,1166 ----
'subsection{First} subsection{second}'
\end{verbatim}
!
In addition to character escapes and backreferences as described
above, \samp{\e g<name>} will use the substring matched by the group
***************
*** 1176,1180 ****
'subsection{First}'
\end{verbatim}
! %
\var{replacement} can also be a function, which gives you even more
powerful control. If \var{replacement} is a function, the function is
--- 1183,1187 ----
'subsection{First}'
\end{verbatim}
!
\var{replacement} can also be a function, which gives you even more
powerful control. If \var{replacement} is a function, the function is
***************
*** 1183,1187 ****
information to compute the desired replacement string and return it.
For example:
! %
\begin{verbatim}
>>> def hexrepl( match ):
--- 1190,1194 ----
information to compute the desired replacement string and return it.
For example:
!
\begin{verbatim}
>>> def hexrepl( match ):
***************
*** 1194,1198 ****
'Call 0xffd2 for printing, 0xc000 for user code.'
\end{verbatim}
! %
When using the module-level \function{re.sub()} function, the pattern
is passed as the first argument. The pattern may be a string or a
--- 1201,1205 ----
'Call 0xffd2 for printing, 0xc000 for user code.'
\end{verbatim}
!
When using the module-level \function{re.sub()} function, the pattern
is passed as the first argument. The pattern may be a string or a
***************
*** 1260,1264 ****
None
\end{verbatim}
! %
On the other hand, \module{search()} will scan forward through the
string, reporting the first match it finds.
--- 1267,1271 ----
None
\end{verbatim}
!
On the other hand, \module{search()} will scan forward through the
string, reporting the first match it finds.
***************
*** 1270,1274 ****
(2, 7)
\end{verbatim}
! %
Sometimes you'll be tempted to keep using \function{re.match()}, and
just add \regexp{.*} to the front of your RE. Resist this temptation,
--- 1277,1281 ----
(2, 7)
\end{verbatim}
!
Sometimes you'll be tempted to keep using \function{re.match()}, and
just add \regexp{.*} to the front of your RE. Resist this temptation,
***************
*** 1303,1307 ****
<html><head><title>Title</title>
\end{verbatim}
! %
The RE matches the \character{<} in \samp{<html>}, and the
\regexp{.*} consumes the rest of the string. There's still more left
--- 1310,1314 ----
<html><head><title>Title</title>
\end{verbatim}
!
The RE matches the \character{<} in \samp{<html>}, and the
\regexp{.*} consumes the rest of the string. There's still more left
***************
*** 1324,1328 ****
<html>
\end{verbatim}
! %
\subsection{Not using re.VERBOSE}
--- 1331,1335 ----
<html>
\end{verbatim}
!
\subsection{Not using re.VERBOSE}
***************
*** 1356,1360 ****
""", re.VERBOSE)
\end{verbatim}
! %
This is far more readable than:
--- 1363,1368 ----
""", re.VERBOSE)
\end{verbatim}
! % $
!
This is far more readable than:
***************
*** 1362,1366 ****
pat = re.compile(r"\s*(?P<header>[^:]+)\s*:(?P<value>.*?)\s*$")
\end{verbatim}
! %
\section{Feedback}
--- 1370,1375 ----
pat = re.compile(r"\s*(?P<header>[^:]+)\s*:(?P<value>.*?)\s*$")
\end{verbatim}
! % $
!
\section{Feedback}
***************
*** 1383,1387 ****
substring matched by the group \emph{cannot} be retrieved after
performing a match or referenced later in the pattern.
! %
\item[\code{(?P<\var{name}>...)}] Similar to regular parentheses, but
the substring matched by the group is accessible via the symbolic group
--- 1392,1396 ----
substring matched by the group \emph{cannot} be retrieved after
performing a match or referenced later in the pattern.
!
\item[\code{(?P<\var{name}>...)}] Similar to regular parentheses, but
the substring matched by the group is accessible via the symbolic group
***************
*** 1396,1403 ****
or \code{m.end('id')}, and also by name in pattern text
(e.g. \regexp{(?P=id)}) and replacement text (e.g. \code{\e g<id>}).
! %
\item[\code{(?P=\var{name})}] Matches whatever text was matched by the
earlier group named \var{name}.
- %
\item[\code{(?=...)}] Matches if \regexp{...} matches next, but doesn't
--- 1405,1411 ----
or \code{m.end('id')}, and also by name in pattern text
(e.g. \regexp{(?P=id)}) and replacement text (e.g. \code{\e g<id>}).
!
\item[\code{(?P=\var{name})}] Matches whatever text was matched by the
earlier group named \var{name}.
\item[\code{(?=...)}] Matches if \regexp{...} matches next, but doesn't
***************
*** 1405,1409 ****
example, \regexp{Isaac (?=Asimov)} will match \code{'Isaac~'} only if it's
followed by \code{'Asimov'}.
! %
\item[\code{(?!...)}] Matches if \regexp{...} doesn't match next. This
is a negative lookahead assertion. For example,
--- 1413,1417 ----
example, \regexp{Isaac (?=Asimov)} will match \code{'Isaac~'} only if it's
followed by \code{'Asimov'}.
!
\item[\code{(?!...)}] Matches if \regexp{...} doesn't match next. This
is a negative lookahead assertion. For example,
|