[Py-howto-checkins] CVS: pyhowto regex.tex,1.17,1.18

Update of /cvsroot/py-howto/pyhowto
In directory sc8-pr-cvs1:/tmp/cvs-serv11351

Modified Files:
	regex.tex 
Log Message:
[Patch #718809 from Jarno Virtanen] Various minor corrections to regex.tex; I've also made a few more minor rewrites.

Index: regex.tex
===================================================================
RCS file: /cvsroot/py-howto/pyhowto/regex.tex,v
retrieving revision 1.17
retrieving revision 1.18
diff -C2 -r1.17 -r1.18
*** regex.tex	7 Apr 2003 19:51:23 -0000	1.17
--- regex.tex	10 Apr 2003 14:18:34 -0000	1.18
***************
*** 35,40 ****
  Perl-style regular expression patterns.  Earlier versions of Python
  came with the \module{regex} module, which provides Emacs-style
! patterns.  Emacs-style patterns are slightly less readable, and
! doesn't provide as many features, so there's not much reason to use
  the \module{regex} module when writing new code, though you might
  encounter old code that uses it.
--- 35,40 ----
  Perl-style regular expression patterns.  Earlier versions of Python
  came with the \module{regex} module, which provides Emacs-style
! patterns.  Emacs-style patterns are slightly less readable and
! don't provide as many features, so there's not much reason to use
  the \module{regex} module when writing new code, though you might
  encounter old code that uses it.
***************
*** 215,219 ****
  it can at first, and if no match is found it will then progressively
  back up and retry the rest of the RE again and again.  It will back up
! until it's tried zero matches for \regexp{[bcd]*}, and if that
  subsequently fails, the engine will conclude that the string doesn't
  match the RE at all.
--- 215,219 ----
  it can at first, and if no match is found it will then progressively
  back up and retry the rest of the RE again and again.  It will back up
! until it has tried zero matches for \regexp{[bcd]*}, and if that
  subsequently fails, the engine will conclude that the string doesn't
  match the RE at all.
***************
*** 245,249 ****
  earlier, but that might as well be infinity.  

! Readers of a reductionist bent may notice that the 3 other qualifiers
  can all be expressed using this notation.  \regexp{\{0,\}} is the same
  as \regexp{*}, \regexp{\{1,\}} is equivalent to \regexp{+}, and
--- 245,249 ----
  earlier, but that might as well be infinity.  

! Readers of a reductionist bent may notice that the three other qualifiers
  can all be expressed using this notation.  \regexp{\{0,\}} is the same
  as \regexp{*}, \regexp{\{1,\}} is equivalent to \regexp{+}, and
***************
*** 348,354 ****

  \begin{tableii}{c|l}{code}{Method/Attribute}{Purpose}
!   \lineii{match}{Determine if the RE matches at the beginning of
    the string.}
!   \lineii{search}{Scan through a string, looking for any location
    where this RE matches.}
    \lineii{findall()}{Find all substrings where the RE matches,
--- 348,354 ----

  \begin{tableii}{c|l}{code}{Method/Attribute}{Purpose}
!   \lineii{match()}{Determine if the RE matches at the beginning of
    the string.}
!   \lineii{search()}{Scan through a string, looking for any location
    where this RE matches.}
    \lineii{findall()}{Find all substrings where the RE matches,
***************
*** 997,1001 ****
  \end{verbatim}

! \subsection{Other Assertions}

  Another zero-width assertion is the lookahead assertion.  Lookahead
--- 997,1001 ----
  \end{verbatim}

! \subsection{Lookahead Assertions}

  Another zero-width assertion is the lookahead assertion.  Lookahead
***************
*** 1016,1020 ****
  \end{itemize}

! An example will help make this concrete and will demonstrate a case
  where a lookahead is useful.  Consider a simple pattern to match a
  filename and split it apart into a base name and an extension,
--- 1016,1020 ----
  \end{itemize}

! An example will help make this concrete by demonstrating a case
  where a lookahead is useful.  Consider a simple pattern to match a
  filename and split it apart into a base name and an extension,
***************
*** 1022,1030 ****
  is the base name, and \samp{rc} is the filename's extension.  

! The pattern to match this is quite simple: \regexp{.*[.].*\$}.
! (Notice that the \samp{.} needs to be treated specially because it's a
  metacharacter; I've put it inside a character class.  Also notice the
  trailing \regexp{\$}; this is added to ensure that all the rest of the
! string must be included in the extension.)  This regular expression
  matches \samp{foo.bar} and \samp{autoexec.bat} and \samp{sendmail.cf} and
  \samp{printers.conf}.
--- 1022,1033 ----
  is the base name, and \samp{rc} is the filename's extension.  

! The pattern to match this is quite simple: 
! 
! \regexp{.*[.].*\$}
! 
! Notice that the \samp{.} needs to be treated specially because it's a
  metacharacter; I've put it inside a character class.  Also notice the
  trailing \regexp{\$}; this is added to ensure that all the rest of the
! string must be included in the extension.  This regular expression
  matches \samp{foo.bar} and \samp{autoexec.bat} and \samp{sendmail.cf} and
  \samp{printers.conf}.
***************
*** 1037,1045 ****
  % $

! First attempt: Exclude \samp{bat} by requiring that the first
! character of the extension is not a \samp{b}.  This is wrong, because it 
! also doesn't match \samp{foo.bar}.

! \regexp{.*[.]([\^b]..|.[\^a].|..[\^t])\$}

  The expression gets messier when you try to patch up the first
--- 1040,1049 ----
  % $

! The first attempt above tries to exclude \samp{bat} by requiring that
! the first character of the extension is not a \samp{b}.  This is
! wrong, because the pattern also doesn't match \samp{foo.bar}.

! % Messes up the HTML without the curly braces around \^
! \regexp{.*[.]([{\^}b]..|.[{\^}a].|..[{\^}t])\$}

  The expression gets messier when you try to patch up the first
***************
*** 1048,1056 ****
  \samp{a}; or the third character isn't \samp{t}.  This accepts
  \samp{foo.bar} and rejects \samp{autoexec.bat}, but it requires a
! three-letter extension, and doesn't accept \samp{sendmail.cf}.
! Another bug, so we'll complicate the pattern again in an effort to fix
! it.

! \regexp{.*[.]([\^b].?.?|.[\^a]?.?|..?[\^t]?)\$}

  In the third attempt, the second and third letters are all made
--- 1052,1060 ----
  \samp{a}; or the third character isn't \samp{t}.  This accepts
  \samp{foo.bar} and rejects \samp{autoexec.bat}, but it requires a
! three-letter extension and won't accept a filename with a two-letter
! extension such as \samp{sendmail.cf}.  We'll complicate the pattern
! again in an effort to fix it.

! \regexp{.*[.]([{\^}b].?.?|.[{\^}a]?.?|..?[{\^}t]?)\$}

  In the third attempt, the second and third letters are all made
***************
*** 1059,1081 ****

  The pattern's getting really complicated now, which makes it hard to
! read and understand.  Worse, this solution doesn't scale well; if the
! problem changes and you want to exclude both \samp{bat} and \samp{exe}
! as extensions, the pattern would get still more complicated and
! confusing.
! 
! A negative lookahead cuts through all this.  Go back to the original
! pattern, and, before the \regexp{.*} which matches the extension,
! insert \regexp{(?!bat\$)}.  This means: if the expression \regexp{bat}
! doesn't match at this point, try the rest of the pattern; if
! \regexp{bat\$} does match, the whole pattern will fail.  The trailing
! \regexp{\$} is required to ensure that something like
! \samp{sample.batch}, where the extension only starts with \samp{bat},
! will be allowed.
! 
! After this modification, the whole pattern is
! \regexp{.*[.](?!bat\$).*\$}.  Excluding another filename extension is
! now easy; simply add it as an alternative inside the assertion.
  \regexp{.*[.](?!bat\$|exe\$).*\$}
! excludes both \samp{bat} and \samp{exe}.

--- 1063,1087 ----

  The pattern's getting really complicated now, which makes it hard to
! read and understand.  Worse, if the problem changes and you want to
! exclude both \samp{bat} and \samp{exe} as extensions, the pattern
! would get even more complicated and confusing.
! 
! A negative lookahead cuts through all this:
! 
! \regexp{.*[.](?!bat\$).*\$}
! % $
! 
! The lookahead means: if the expression \regexp{bat} doesn't match at
! this point, try the rest of the pattern; if \regexp{bat\$} does match,
! the whole pattern will fail.  The trailing \regexp{\$} is required to
! ensure that something like \samp{sample.batch}, where the extension
! only starts with \samp{bat}, will be allowed.
! 
! Excluding another filename extension is now easy; simply add it as an
! alternative inside the assertion.  The following pattern excludes
! filenames that end in either \samp{bat} or \samp{exe}:
! 
  \regexp{.*[.](?!bat\$|exe\$).*\$}
! % $

***************
*** 1087,1093 ****

  \begin{tableii}{c|l}{code}{Method/Attribute}{Purpose}
!   \lineii{split}{Split the string into a list, splitting it wherever the RE matches}
!   \lineii{sub}{Find all substrings where the RE matches, and replace them with a different string}
!   \lineii{subn}{Does the same thing as \method{sub()}, 
     but returns the new string and the number of replacements}
  \end{tableii}
--- 1093,1099 ----

  \begin{tableii}{c|l}{code}{Method/Attribute}{Purpose}
!   \lineii{split()}{Split the string into a list, splitting it wherever the RE matches}
!   \lineii{sub()}{Find all substrings where the RE matches, and replace them with a different string}
!   \lineii{subn()}{Does the same thing as \method{sub()}, 
     but returns the new string and the number of replacements}
  \end{tableii}
***************
*** 1193,1197 ****
  \end{verbatim}

! Empty matches are replaced only when not they're not
  adjacent to a previous match.  

--- 1199,1203 ----
  \end{verbatim}

! Empty matches are replaced only when they're not
  adjacent to a previous match.  

***************
*** 1223,1229 ****
  There's also a syntax for referring to named groups as defined by the
  \regexp{(?P<name>...)} syntax.  \samp{\e g<name>} will use the
! substring matched by the group named \samp{name}, and \samp{\e
! g<\var{number}>} uses the corresponding group number.  \samp{\e g<2>}
! is therefore equivalent to \samp{\e 2}, but isn't ambiguous in a
  replacement string such as \samp{\e g<2>0}.  (\samp{\e 20} would be
  interpreted as a reference to group 20, not a reference to group 2
--- 1229,1237 ----
  There's also a syntax for referring to named groups as defined by the
  \regexp{(?P<name>...)} syntax.  \samp{\e g<name>} will use the
! substring matched by the group named \samp{name}, and 
! \samp{\e g<\var{number}>} 
! uses the corresponding group number.  
! \samp{\e g<2>} is therefore equivalent to \samp{\e 2}, 
! but isn't ambiguous in a
  replacement string such as \samp{\e g<2>0}.  (\samp{\e 20} would be
  interpreted as a reference to group 20, not a reference to group 2
***************
*** 1303,1309 ****
  from a string or replacing it with another single character.  You
  might do this with something like \code{re.sub('\e n', ' ', S)}, but
! \method{translate()} is capable of doing both these tasks,
! and will be much faster that any regular expression operation can ever
! be.

  In short, before turning to the \module{re} module, consider whether
--- 1311,1316 ----
  from a string or replacing it with another single character.  You
  might do this with something like \code{re.sub('\e n', ' ', S)}, but
! \method{translate()} is capable of doing both tasks
! and will be faster than any regular expression operation can be.

  In short, before turning to the \module{re} module, consider whether
***************
*** 1347,1351 ****
  starting character, only trying the full match if a \character{C} is found.

! Adding \regexp{.*} defeats this optimization, and requires scanning to
  the end of the string and then backtracking to find a match for the
  rest of the RE.  Use \function{re.search()} instead.
--- 1354,1358 ----
  starting character, only trying the full match if a \character{C} is found.

! Adding \regexp{.*} defeats this optimization, requiring scanning to
  the end of the string and then backtracking to find a match for the
  rest of the RE.  Use \function{re.search()} instead.
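
For readers who want to try the negative lookahead patterns from the revised "Lookahead Assertions" text, here is a minimal Python sketch. The two patterns are copied from the diff above; the helper name accepted() and the sample filenames beyond those mentioned in the text are illustrative assumptions, not part of regex.tex.

```python
import re

# Patterns taken from the revised section: reject a "bat" extension,
# then reject either "bat" or "exe".
no_bat = re.compile(r".*[.](?!bat$).*$")
no_bat_or_exe = re.compile(r".*[.](?!bat$|exe$).*$")


def accepted(pattern, filename):
    """Hypothetical helper: True if the pattern matches the filename."""
    return pattern.match(filename) is not None


if __name__ == "__main__":
    for name in ["foo.bar", "autoexec.bat", "sendmail.cf",
                 "printers.conf", "sample.batch", "setup.exe"]:
        print(name, accepted(no_bat, name), accepted(no_bat_or_exe, name))
```

The trailing $ inside the assertion is what lets sample.batch through: the lookahead only fails when "bat" is followed immediately by the end of the string.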