|
From: Brenda L. <asp...@us...> - 2003-05-12 05:00:11
|
Update of /cvsroot/squeak/squeak/platforms/Cross/plugins/RePlugin In directory sc8-pr-cvs1:/tmp/cvs-serv13419 Added Files: RePlugin3-Fixes.1.cs RePlugin3-Fixes2.1.cs RePlugin3.3.1.cs Log Message: Ian Piumarta's 3.4.1 release --- NEW FILE: RePlugin3-Fixes.1.cs --- 'From Squeak 3.2 of 11 July 2002 [latest update: #4917] on 17 August 2002 at 6:10:11 pm'! "Change Set: RePlugin3-Fixes Date: 17 August 2002 Author: ti...@su... Some small fixes to RePlugin3 to allow compiling on Acorn"! !RePlugin methodsFor: 're primitives' stamp: 'tpr 8/17/2002 18:00'! primPCREExec "<rcvr primPCREExec: searchObject>, where rcvr is an object with instance variables: 'patternStr compileFlags pcrePtr extraPtr errorStr errorOffset matchFlags' Apply the regular expression (stored in <pcrePtr> and <extratr>, generated from calls to primPCRECompile), to smalltalk String searchObject using <matchOptions>. If there is no match, answer nil. Otherwise answer a ByteArray of offsets representing the results of the match." | searchObject searchBuffer length result matchSpacePtr matchSpaceSize | self export: true. self var:#searchBuffer declareC: 'char *searchBuffer'. self var:#matchSpacePtr declareC: 'int *matchSpacePtr'. self var:#result declareC: 'int result'. "Load Parameters" searchObject _ interpreterProxy stackObjectValue: 0. searchBuffer _ interpreterProxy arrayValueOf: searchObject. length _ interpreterProxy byteSizeOf: searchObject. self loadRcvrFromStackAt: 1. "Load Instance Variables" pcrePtr _ self rcvrPCREBufferPtr. extraPtr _ self rcvrExtraPtr. matchFlags _ self rcvrMatchFlags. matchSpacePtr _ self rcvrMatchSpacePtr. matchSpaceSize _ self rcvrMatchSpaceSize. interpreterProxy failed ifTrue:[^ nil]. result _ self cCode: 'pcre_exec((pcre *)pcrePtr, (pcre_extra *)extraPtr, searchBuffer, length, 0, matchFlags, matchSpacePtr, matchSpaceSize)'. interpreterProxy pop: 2; pushInteger: result. "empty call so compiler doesn't bug me about variables not used" self touch: searchBuffer; touch: matchSpacePtr; touch: matchSpaceSize; touch: length ! ! !RePlugin methodsFor: 're primitives' stamp: 'tpr 8/17/2002 18:00'! primPCREExecfromto "<rcvr primPCREExec: searchObject> from: fromInteger to: toInteger>, where rcvr is an object with instance variables: 'patternStr compileFlags pcrePtr extraPtr errorStr errorOffset matchFlags' Apply the regular expression (stored in <pcrePtr> and <extratr>, generated from calls to primPCRECompile), to smalltalk String searchObject using <matchOptions>, beginning at offset <fromInteger> and continuing until offset <toInteger>. If there is no match, answer nil. Otherwise answer a ByteArray of offsets representing the results of the match." | searchObject searchBuffer length result matchSpacePtr matchSpaceSize fromInteger toInteger | self export: true. self var:#searchBuffer declareC: 'char *searchBuffer'. self var:#fromInteger declareC: 'int fromInteger'. self var:#toInteger declareC: 'int toInteger'. self var:#matchSpacePtr declareC: 'int *matchSpacePtr'. self var:#result declareC: 'int result'. "Load Parameters" toInteger _ interpreterProxy stackIntegerValue: 0. fromInteger _ interpreterProxy stackIntegerValue: 1. searchObject _ interpreterProxy stackObjectValue: 2. searchBuffer _ interpreterProxy arrayValueOf: searchObject. length _ interpreterProxy byteSizeOf: searchObject. self loadRcvrFromStackAt: 3. "Validate parameters" interpreterProxy success: (1 <= fromInteger). interpreterProxy success: (toInteger<=length). fromInteger _ fromInteger - 1. "Smalltalk offsets are 1-based" interpreterProxy success: (fromInteger<=toInteger). "adjust length, searchBuffer" length _ toInteger - fromInteger. searchBuffer _ searchBuffer + fromInteger. "Load Instance Variables" pcrePtr _ self rcvrPCREBufferPtr. extraPtr _ self rcvrExtraPtr. matchFlags _ self rcvrMatchFlags. matchSpacePtr _ self rcvrMatchSpacePtr. matchSpaceSize _ self rcvrMatchSpaceSize. interpreterProxy failed ifTrue:[^ nil]. result _ self cCode: 'pcre_exec((pcre *)pcrePtr, (pcre_extra *)extraPtr, searchBuffer, length, 0, matchFlags, matchSpacePtr, matchSpaceSize)'. interpreterProxy pop: 2; pushInteger: result. "empty call so compiler doesn't bug me about variables not used" self touch: searchBuffer; touch: matchSpacePtr; touch: matchSpaceSize; touch: length ! ! !RePlugin methodsFor: 'rcvr linkage' stamp: 'tpr 8/17/2002 18:01'! allocateStringAndSetRcvrErrorStrFromCStr: aCStrBuffer |length errorStrObj errorStrObjPtr | self var: #aCStrBuffer declareC: 'const char *aCStrBuffer'. self var: #errorStrObjPtr declareC: 'void *errorStrObjPtr'. "Allocate errorStrObj" length _ self cCode: 'strlen(aCStrBuffer)'. errorStrObj _ interpreterProxy instantiateClass: (interpreterProxy classString) indexableSize: length. self loadRcvrFromStackAt: 0. "Assume garbage collection after instantiation" "Copy aCStrBuffer to errorStrObj's buffer" errorStrObjPtr _ interpreterProxy arrayValueOf: errorStrObj. self cCode:'memcpy(errorStrObjPtr,aCStrBuffer,length)'. self touch: errorStrObjPtr; touch: errorStrObj. "Set rcvrErrorStr from errorStrObj and Return" self rcvrErrorStrFrom: errorStrObj. ^errorStrObj.! ! !RePlugin class methodsFor: 'plugin code generation' stamp: 'tpr 8/17/2002 18:02'! declareCVarsIn: cg cg addHeaderFile:'"rePlugin.h"'. "Memory Managament Error Checking" cg var: 'netMemory' declareC: 'int netMemory = 0'. cg var: 'numAllocs' declareC: 'int numAllocs = 0'. cg var: 'numFrees' declareC: 'int numFrees = 0'. cg var: 'lastAlloc' declareC: 'int lastAlloc = 0'. "The receiver Object Pointer" cg var: 'rcvr' declareC: 'int rcvr'. "Instance Variables of Receiver Object" cg var: 'patternStr' declareC: 'int patternStr'. cg var: 'compileFlags' declareC: 'int compileFlags'. cg var: 'pcrePtr' declareC: 'int pcrePtr'. cg var: 'extraPtr' declareC: 'int extraPtr'. cg var: 'errorStr' declareC: 'int errorStr'. cg var: 'errorOffset' declareC: 'int errorOffset'. cg var: 'matchFlags' declareC: 'int matchFlags'. "Support Variables for Access to Receiver Instance Variables" cg var: 'patternStrPtr' declareC: 'const char * patternStrPtr'. cg var: 'errorStrBuffer' declareC: 'const char * errorStrBuffer'.! ! --- NEW FILE: RePlugin3-Fixes2.1.cs --- 'From Squeak3.2gamma of 15 January 2002 [latest update: #4881] on 21 August 2002 at 10:52:51 pm'! "Change Set: RePlugin3-Fixes2 Date: 21 August 2002 Author: ian...@in... Adds a cast to one method in RePlugin to avoid a compiler warning."! !RePlugin methodsFor: 'rcvr linkage' stamp: 'ikp 8/21/2002 22:40'! rcvrExtraPtr |extraObj| self inline: true. extraObj _ interpreterProxy fetchPointer: 3 ofObject: rcvr. (extraObj = (interpreterProxy nilObject)) ifTrue: [^ self cCode: '(int) NULL']. ^self cCoerce:(interpreterProxy arrayValueOf: extraObj) to: 'int'.! ! --- NEW FILE: RePlugin3.3.1.cs --- 'From Squeak3.3alpha of 12 January 2002 [latest update: #4934] on 16 August 2002 at 10:56:49 pm'! "Change Set: RePlugin3.3 Date: 16 August 2002 Author: acg Perl-Style Regular Expressions in Smalltalk by Andrew C. Greenberg Version 3.3beta I. Regular Expressions in General Regular expressions are a language for specifying text to ease the searching and manipulation of text. A complete discussion of regular expressions is beyond the scope of this document. See Jeffrey Friedl, Mastering Regular Expressions, by O'Reilly for a relatively complete. The regular expressions supported by this package are similar to those presently used in Perl 5.05 and Python, and are based upon Philip Hazel's excellent PCRE libraries (incorporated almost without change, subject to a free license described in Re aLicenseComment. Thanks are due to Markus Kohler and Stephen Pair for their assistance in the initial ports of early versions of the Plugin. An explanation of the expressions available in this package are summarized in Re aRegexComment, Re anOptionsComment and Re aGlobalSearchComment. A more detailed description of RePlugin is available downloading the file 'RePluginDoco,' which can be obtained from http://www.gate.net/~werdna/RePlugin.html, into your default directory, and then executing Utilities reconstructTextWindowsFromFileNamed: 'RePluginDoco' II. Overview of the 'Package.' The following new classes are provided: Class Description of Instances ---------------------- ------------------------------------------------------------------- Re A regular expression matching engine ReMatch Result of a search using Re RePattern Deprecated engine class from earlier plugin versions RePlugin The Plugin 'Glue' to the PCRE Library. String Various new messages were added to String, which are the principal means for users to access the package. PluginCodeGenerator has been deleted from the packgage. III. Some Examples. A. Simple Matching and Querying of Matches To search a string for matches in a regular expression, use String reMatch: 'just trying to catch some zzz''s before noon' matchRe: 'z+' which returns true if matched, and false otherwise. If more information from a positive search result is desired, the method reMatch will return a ReMatch object corresponding to the result. 'just trying to catch some zzz''s before noon' reMatch: 'z+' The match object can be queried in various ways. For example, to obtain details when parenthetical phrases of a regular expression are captured: |m| m _ 'Andy was born on 10/02/1957, and not soon enough!!' reMatch: '(\d\d)/(\d\d)/((19|20)?\d\d)'. m matches answers with: ('10' '02' '1957' '19' ) The first message answers a ReMatch m representing the result of a search of the string for matches of re (nil would be returned if no match was found). The third message answered a collection of the parenthetical subgroups matched, each showing the day, month and year as extracted from the string. B. Global Matching and String Substitutions You can perform global searches to repeatedly search a string for non-overlapping occurrences of a pattern by using reMatch:collect: For example, 'this is a test' collectRe: '\w+' can be used to gather a collection of all words in the search string, answering: OrderedCollection ('this' 'is' 'a' 'test' ) For slightly more complex collections, you can use #reMatch:andCollect: Additionally, you can perform global searches with text substitutions using reMatch:sub: For example, 'this is a test' reMatch: '\w+' andReplace: [:m | '<', (m match), '>'] can be used to replace every word in the search string with the word enclosed by matching brackets, answering: '<this> <is> <a> <test>' Further examples and documentation can be found in the references above, and in the comments and definitions set forth in ReMatch, RePattern and String. "! Object subclass: #Re instanceVariableNames: 'pattern compiledPattern isAnchored isCaseSensitive isDollarEndOnly isDotIncludesNewline isExtended isExtra isMultiline isBeginningOfLine isEndOfLine isGreedy ' classVariableNames: '' module: #(Werdna Re)! !Re commentStamp: '<historical>' prior: 0! Perl-Style Regular Expressions in Smalltalk Documentation The documentation category of this method contains substantial documentation on the operation of this Class. Re aGeneralComment Re aGlobalSearchComment Re aRegexComment Re aRegexGoryDetailsComment Re aVersionsComment Re anReComment Re anReOverviewComment à Re aLicenseComment Examples: (Re on: 'a.*y') search: 'Candy is dandy.' 'a.*y' asRe search: 'Candy is dandy.' 'Candy is dandy' reMatch: 'a.*y' (Re on: '\w+') searchAndCollect: 'Candy is dandy.' '\w+' asRe searchAndCollect: 'Candy is dandy.' 'Candy is dandy.' reMatch: '\w+' andCollect: [:m | m match] Structure: pattern String -- the string with the regular expression source code compiledPattern RePlugin representing a compiled pattern isAnchored Boolean -- representing an option setting is ... Booleans -- for the other options below List ofcommon public methods: #opt: sets options using Perl-style string #beAnchored #beNotAnchored #isAnchored #isAnchored: #beBeginningOfLine #beNotBeginningOfLine #isBeginningOfLine #isBeginningOfLine: #beCaseSensitive #beNotCaseSensitive #isCaseSensitive #isCaseSensitive: #beDollarEndOnly #beNotDollarEndOnly #isDollarEndOnly #isDollarEndOnly: #beDotIncludesNewline #beNotDotIncludesNewline #isDotIncludesNewLine #isDotIncludesNewline: #beEndOfLine #beNotEndOfLine #isEndOfLine #isEndOfLine: #beExtended #beNotExtended #isExtended #isExtended: #beExtra #beNotExtra #isExtra #isNotExtra: #beGreedy #beNotGreedy #isGreedy #isGreedy: #beMultiline #beNotMultiline #isMultiline #isMultiline: Getters and setters for options in traditional Smalltalk style search: aTargetString search aTargetString from: startInteger to: endInteger Compiling the pattern, if necessary, search a string (or substring) using the pattern. Answers nil if no match. searchAndCollect: aTargetString search: aTargetString andCollect: aBlock search: aTargetString andCollect: aBlock matchCount: anInteger Compiling the pattern, if necessary, gather all (or, if specified, the first anInteger) non-overlapping matches to me in aTargetString. Answer a collection of the results of applying aBlock to each ReMatch result. search: aTargetString andReplace: aBlock search: aTargetString andReplace: aBlock matchCount: anInteger Compiling the pattern, if necessary, find all (or, if specified, the first anInteger) non-overlapping matches to me in aTargetString. Answer a new string, created by substituting the results of applying aBlock to each ReMatch result for the matched substring. ! ]style[(44 16 109 1 1 18 2 23 2 16 2 27 2 19 2 14 2 22 3 1 19 1 12 280 11 236 30 1 6 40 687 66 8 13 8 13 7 12 5 10 1 118 18 13 9 13 13 6 9 13 13 6 13 9 1 217 8 13 13 6 9 13 13 6 13 9 1 266)bf3,bf1,f1,bf1,f1,f1LRe aGeneralComment;,f1,f1LRe aGlobalSearchComment;,f1,f1LRe aRegexComment;,f1,f1LRe aRegexGoryDetailsComment;,f1,f1LRe aVersionsComment;,f1,f1LRe anReComment;,f1,f1LRe anReOverviewComment;,bf1,f1,f1LRe aLicenseComment;,f1,bf1,f1,f1b,f1,f1b,f1,f1b,f1,f1b,f1,f1b,f1,f1b,f1,f1b,f1,f1b,f1,f1b,f1,f1b,f1,f1b,f1,f1b,f1,f1b,f1,f1b,f1,f1b,f1,f1b,f1,f1b,f1,f1b,f1,f1b,f1,f1b,f1,f1b,f1,f1b,f1! Re class instanceVariableNames: ''! Object subclass: #ReMatch instanceVariableNames: 'matchArray re searchString pos endpos ' classVariableNames: '' module: #(Werdna Re)! !ReMatch commentStamp: '<historical>' prior: 0! ReMatch: Perl-Style Regular Expression Search Results I. Introduction This Class is part of a package of classes providing a Smalltalk wrapper to Philip Hazel's excellent PCRE library. The Plugin interface and Smalltalk wrapper was written by Andrew C. Greenberg. As discussed in RePattern aGeneralComment, the functionality is essentially embodied in this class, Class RePattern and certain new messages in Class String. A summary of the regular expression syntax c an be found in RePattern aRegexComment and a summary of the compile option codes available can be found in RePattern anOptionsComment. A more detailed description of RePlugin is available downloading the file 'RePluginDoco,' which can be obtained from http://www.gate.net/~werdna/RePlugin.html, into your default directory, and then executing Utilities reconstructTextWindowsFromFileNamed: 'RePluginDoco' II. Principal Match Results The substring of searchString matched by re is given by: m match which can be derived from searchString as follows: m searchString copyFrom: (m from) to: (m to) III. Captured Groups (and Collections of Captured Groups) The number of substrings capturable by a parenthetical grouping in an re (regardless of the number actually matched to create m) is given by: m numGroups The string captured by parenthetical grouping i, where 1<=i<=(m numGroups) is given by m matchAt: i and this can be generated as follows: m searchString copyFrom: (m fromAt: i) to: (m toAt: i) And an array of size (m numGroups) can be generated from strings and indices accordingly: m matches m froms m tos ! ]style[(53 2 15 214 25 65 9 103 23 69 26 120 41 53 61 2 27 177 57 488)bf3,f1,bf2,f1,f1LRePattern aGeneralComment;,f1,f1LRePattern Comment;,f1,f1LRePattern aRegexComment;,f1,f1LRePattern anOptionsComment;,f1,f1Rhttp://www.gate.net/~werdna/RePlugin.html;,f1,f1dUtilities reconstructTextWindowsFromFileNamed: 'RePluginDoco';;,f1,bf2,f1,bf2,f1! ReMatch class instanceVariableNames: ''! Object subclass: #RePattern instanceVariableNames: 'pattern compileOptions pcrePointer extraPointer errorString offset matchOptions matchSpace lastMatchResult ' classVariableNames: '' module: #(Werdna Re)! !RePattern commentStamp: '<historical>' prior: 0! RePattern: Compiled Perl-Style Regular Expressions I. Introduction. This Smalltalk implementation of modern Perl-Style regular expressions was compiled by Andrew Greenberg <we...@ga...> and contributors, based upon the excellent PCRE library by Philip Hazel. As discussed in RePattern aGeneralComment, the functionality is essentially embodied in this class, Class ReMatch and certain new messages in Class String. A summary of the regular expression syntax can be found in RePattern aRegexComment and a summary of the compile option codes available can be found in RePattern anOptionsComment. A substantially more detailed description of RePlugin is available downloading the file "RePluginDoco," which can be obtained from http://www.gate.net/~werdna/RePlugin.html, into your default directory, and then executing Utilities reconstructTextWindowsFromFileNamed: 'RePluginDoco' II. To Search a String or Substring For Pattern Matches (Once Only): Examples: 'Squeak or Squawk!!' reMatch: '^Squ(ea|aw)k' 'Squeak or Squawk!!' reMatch: '^Squ(ea|aw)k' opt: 'imsxABEXZ' 'Squeak or Squawk!!' reMatch: '^Squ(ea|aw)k!!' from: 11 more generally, srchStr reMatch: patStr [opt: oStr] [from: start] [to: stop] For a one-time search of a string (or substring) for occurences of a match pattern. The message will be answered with nil (if there is no match) or an instance of ReMatch, which can then be queried for further details about the match. III. Global Searching and Replacing The re package provides rudimentary facilities for global searches and replacements on a string. The following expressions '\w+' reMatch: 'this is a test' collect: [:m | m] (RePattern on: '\w+') search: 'this is a test' collect: [:m | m] return an ordered collection of the results of repeated non-overlapping applications of the pattern to the string, or nil if there are no matches in the string. To produce a list of matched strings, you can for example execute the following: '\w+' reMatch: 'this is a test' collect: [:m| m match] (RePattern on: '\w+') search: 'this is a test' collect: [:m | m match] You can also perform global search and string replacements, where the answer is a string with unmatched text left alone, and matched text replaced by the result of a call to a Block passed the ReMatch object as a single parameter. For example, ('\w+' reMatch: 'this is a test' sub: [:m| '<', (m match), '>'] and (RePattern on: '\w+') search: 'this is a test' sub: [:m| '<', (m match), '>'] return a string with each nonblank word surrounded by angle brackets. For more details, see RePattern aGlobalSearchComment. IV. To Create Compiled Regular Expression Objects (For Repeated Matching): '^Squ(ea|aw)k!!$' asRePattern '^Squ(ea|aw)k!!$' asRePatternOpt: 'imsxAEX' '^Squ(ea|aw)k!!$' asRePatternOpt: 'imsxAEX' onErrorRun: aBlock RePattern on: '^Squ(ea|aw)k!!$' RePattern on: '^Squ(ea|aw)k!!$' opt: 'imsxAEX' RePattern on: '^Squ(ea|aw)k!!$' opt: 'imsxAEX' onErrorRun: [:pat :offset :message | "your code here" ] Each of the preceding expressions returns an instance of RePattern, compiled for efficient matching when the pattern is repeatedly searched against different strings. RePattern ordinarily caches a dozen or so of the most recently compiled patterns, but nevertheless invokes a cost for the table lookup. To avoid compile and lookup costs, use the above messages. To perform a one-time search, see above. V. To Search a Compiled Regexp Against A String or Substring for Matches: searchString reMatch: re [from: from] [to: to] [opt: optStr] or re search: searchString [from: from] [to: to] [opt: optStr] Examples: 'Squeak or Squawk' reMatch: re. re search: 'Squeak or Squawk!!'. re search: 'Squeak or Squawk!!' opt: 'ABZ'. If no match is found, these messages answer nil. Otherwise, they answer with a corresponding instance of ReMatch.! ]style[(50 1 1 17 2 211 25 65 7 103 23 69 26 1 2 131 41 53 61 2 69 499 36 2 1110 30 3 75 749 73 362 7 1)bf3,f3,f1b,bf2,f1b,f1,f1LRePattern aGeneralComment;,f1,f1LReMatch Comment;,f1,f1LRePattern aRegexComment;,f1,f1LRePattern anOptionsComment;,f1,f1b,f1,f1Rhttp://www.gate.net/~werdna/RePlugin.html;,f1,f1dUtilities reconstructTextWindowsFromFileNamed: 'RePluginDoco';;,f1,bf2,f1,bf2,bf1,f1,f1LRePattern aGlobalSearchComment;,f1,f1b,f1,f1b,f1,f1LReMatch Comment;,f1! RePattern class instanceVariableNames: 'Patterns Options CompileObjects Front '! TestInterpreterPlugin subclass: #RePlugin instanceVariableNames: 'netMemory numAllocs numFrees lastAlloc patternStr rcvr compileFlags pcrePtr extraPtr errorStr errorOffset matchFlags patternStrPtr errorStrBuffer ' classVariableNames: '' module: #(Werdna Re)! !RePlugin commentStamp: '<historical>' prior: 0! /* Regular Expression Plugin (This class comment becomes part of rePlugin.c) RePlugin translate: 'RePlugin.c' doInlining: true. See documentation and source code for the PCRE C Library Code. This plugin is designed to serve an object such as RePattern: patternStr A 0-terminated string comprising the pattern to be compiled. compileFlags An Integer representing re compiler options PCREBuffer A ByteArray of regular expression bytecodes extraPtr A ByteArray of match optimization data (or nil) errorString A String Object For Holding an Error Message (when compile failed) errorOffset The index in patternStr (0-based) where the error ocurred (when compile failed) matchFlags An Integer representing re matcher options matchSpaceObj An Integer array for match results and workspace during matching. The instance variables must appear in the preceding order. MatchSpaceObj must be allocated by the calling routine and contain at least 6*(numGroups+1) bytes. */ #include "pcre.h" #include "internal.h" /* Slight machine-specific hack for MacOS Memory Management */ #ifdef TARGET_OS_MAC #define malloc(ptr) NewPtr(ptr) #define free(ptr) DisposePtr(aPointer) #endif /* Adjust malloc and free routines as used by PCRE */ void rePluginFree(void * aPointer); void * rePluginMalloc(size_t anInteger); void *(*pcre_malloc)(size_t) = rePluginMalloc; void (*pcre_free)(void *) = rePluginFree; ! RePlugin class instanceVariableNames: ''! TestCase subclass: #ReTest instanceVariableNames: '' classVariableNames: '' module: #(Werdna Re)! ReTest class instanceVariableNames: ''! !Re met hodsFor: 'documentation' stamp: 'acg 8/3/2002 22:57'! aGeneralComment " Perl-Style Regular Expressions in Smalltalk by Andrew C. Greenberg I. Regular Expressions in General Regular expressions are a language for specifying text to ease the searching and manipulation of text. A complete discussion of regular expressions is beyond the scope of this document. See Jeffrey Friedl, Mastering Regular Expressions, by O'Reilly for a relatively complete. The regular expressions supported by this package are similar to those presently used in Perl 5.05 and Python, and are based upon Philip Hazel's excellent PCRE libraries (incorporated almost without change, subject to a free license described in Re aLicenseComment. Thanks are due to Markus Kohler and Stephen Pair for their assistance in the initial ports of early versions of the Plugin. An explanation of the expressions available in this package are summarized in Re aRegexComment, Re anOptionsComment and Re aGlobalSearchComment. A more detailed description of RePlugin is available downloading the file 'RePluginDoco,' which can be obtained from http://www.gate.net/~werdna/RePlugin.html, into your default directory, and then executing Utilities reconstructTextWindowsFromFileNamed: 'RePluginDoco' II. Overview of the 'Package.' The following new classes are provided: Class Description of Instances ---------------------- ------------------------------------------------------------------- Re A regular expression matching engine ReMatch Result of a search using Re RePattern Deprecated engine class from earlier plugin versions RePlugin The Plugin 'Glue' to the PCRE Library. String Various new messages were added to String, which are the principal means for users to access the package. PluginCodeGenerator has been deleted from the packgage. III. Some Examples. A. Simple Matching and Querying of Matches To search a string for matches in a regular expression, use String reMatch: 'just trying to catch some zzz''s before noon' matchRe: 'z+' which returns true if matched, and false otherwise. If more information from a positive search result is desired, the method reMatch will return a ReMatch object corresponding to the result. 'just trying to catch some zzz''s before noon' reMatch: 'z+' The match object can be queried in various ways. For example, to obtain details when parenthetical phrases of a regular expression are captured: |m| m _ 'Andy was born on 10/02/1957, and not soon enough!!' reMatch: '(\d\d)/(\d\d)/((19|20)?\d\d)'. m matches answers with: ('10' '02' '1957' '19' ) The first message answers a ReMatch m representing the result of a search of the string for matches of re (nil would be returned if no match was found). The third message answered a collection of the parenthetical subgroups matched, each showing the day, month and year as extracted from the string. B. Global Matching and String Substitutions You can perform global searches to repeatedly search a string for non-overlapping occurrences of a pattern by using reMatch:collect: For example, 'this is a test' collectRe: '\w+' can be used to gather a collection of all words in the search string, answering: OrderedCollection ('this' 'is' 'a' 'test' ) For slightly more complex collections, you can use #reMatch:andCollect: Additionally, you can perform global searches with text substitutions using reMatch:sub: For example, 'this is a test' reMatch: '\w+' andReplace: [:m | '<', (m match), '>'] can be used to replace every word in the search string with the word enclosed by matching brackets, answering: '<this> <is> <a> <test>' Further examples and documentation can be found in the references above, and in the comments and definitions set forth in ReMatch, RePattern and String. "! ]style[(19 44 24 34 211 29 289 18 206 16 2 19 5 24 119 41 53 61 2 30 224 7 34 9 59 8 46 6 179 65 291 7 442 7 268 43 832 7 2 9 5 7 2)f1b,bf3,bf2,bf1,f1,f1i,f1,f1LRe aLicenseComment;,f1,f1LRe aRegexComment;,f1,f1LRe anOptionsComment;,f1,f1LRe aGlobalSearchComment;,f1,f1Rhttp://www.gate.net/~werdna/RePlugin.html;,f1,f1dUtilities reconstructTextWindowsFromFileNamed: 'RePluginDoco';;,f1,f1b,f1,f1LReMatch Comment;,f1,f1LRePattern Comment;,f1,f1LRePlugin Comment;,f1,f1LString Comment;,f1,f1b,f1,f1LReMatch Comment;,f1,f1LReMatch Comment;,f1,f1b,f1,f1LReMatch Comment;,f1,f1LRePattern Comment;,f1,f1LString Comment;,f1! ! !Re methodsFor: 'documentation' stamp: 'acg 8/3/2002 23:06'! aGlobalSearchComment " Global Searching Introduction RePattern provides facilities to support global searching and global searching and replacement of search strings with semantics quite similar to that of Perl 5.004. Global searching means that the search string is repeatedly searched for matches, beginning at the beginning of the string, and subsequently beginning the next match immediately after the preceding match terminated. For example, if we wanted to find all words in the subject string, we could execute: 'this is a test' reMatch: '\w+' collect: [:m | m match] which returns OrderedCollection ('this' 'is' 'a' 'test' ). The collect: keyword directs PCRE to repeat the search for '\w+', and to return a collection of the result of applying each ReMatch to the block. (In this case, the block simply returns the string that was matched.) To do global searching and string substitution, we could execute: 'this is a test' reMatch: '\w+' sub: [:m | '<', m match, '>'] which return '<this> <is> <a> <test>' The sub: keyword directs PCRE to repeat the search, and to return the original string, but with each matched substring replaced by the result of applying the block to the corresponding ReMatch object. Global Matching Functions RePattern convenience functions provide the following general global functions: searchString reMatch: pattern [opt: oStr] collect: aBlock [num: anInteger] searchString reMatch: pattern [opt: oStr] sub: aBlock [num: anInteger] Optionally, you may specify search and compile options with oStr, and you may specify a maximum number of searches performed in the global search with anInteger. If anInteger is less than 0, then as many searches as can be performed, will be performed. Special Case of the Empty Match Finally, the definition given above would infinite loop if the pattern matches an empty string. For example: 'abcdef' reMatch: 'x*' sub: [:m| '<', m match, '>'] will actually match the empty string just before and after each letter of the string, even though there is no x there. ('x+' would return nil). Since the string ends where it begins, at the beginning of the string, repeating the search from that point would simply infinite loop. Accordingly, RePattern gives the pattern a one-character 'bump' after matching an empty string, at which point the block is applied. For example, the preceding would answer '<>a<>b<>c<>d<>e<>f<>' In the case of global searching (but not replacement), an empty string will not result in the ReMatch being applied to the block if the empty match immediately follows a match that has already been made. Accordingly, '123' reMatch: '\d*' collect: [:m| m match] answers OrderedCollection ('123' ) and not OrderedCollection ('123' '') However, this last caveat does not apply to substitutions, so '123' reMatch: '\d*' sub: [:m| '<', m match, '>'] answers '<123><>' "! ]style[(21 3 18 12 1 2 9 460 1 243 7 154 1 290 7 10 25 487 31 1138 1 1)f1b,f1,bf3,bf2,bf3,bf2,f1LRePattern Comment;,f1,f1u,f1,f1LReMatch Comment;,f1,f1u,f1,f1LReMatch Comment;,f1,bf2,f1,bf2,f1,bf2,f1! ! !Re methodsFor: 'documentation' stamp: 'acg 8/3/2002 23:01'! aLicenseComment " RePlugin is Open Source Software As noted earlier, the non-Smalltalk code on which these classes are based is Philip Hazel's excellent PCRE Package, which is distributed subject to the following license. The Smalltalk wrapper and plugin interface is written by Andrew C. Greenberg <we...@ga...> and other contributors, and is distributed subject to the same terms. PCRE LICENCE ------------ PCRE is a library of functions to support regular expressions whose syntax and semantics are as close as possible to those of the Perl 5 language. Written by: Philip Hazel <ph...@ca...> University of Cambridge Computing Service, Cambridge, England. Phone: +44 1223 334714. Copyright (c) 1997-1999 University of Cambridge Permission is granted to anyone to use this software for any purpose on any computer system, and to redistribute it freely, subject to the following restrictions: 1. This software is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 2. The origin of this software must not be misrepresented, either by explicit claim or by omission. 3. Altered versions must be plainly marked as such, and must not be misrepresented as being the original software. 4. If PCRE is embedded in any software that is released under the GNU General Purpose Licence (GPL), then the terms of that licence shall supersede any condition above with which it is incompatible. "! ]style[(16 3 32 340 12 1127)f1b,f1,bf3,f1,bf2,f1! ! !Re methodsFor: 'documentation' stamp: 'acg 8/3/2002 23:01'! aRegexComment " Regular Expressions A more detailed description of RePlugin is available downloading the file 'RePluginDoco,' which can be obtained from http://www.gate.net/~werdna/RePlugin.html, into your default directory, and then executing Utilities reconstructTextWindowsFromFileNamed: 'RePluginDoco' Basic MetaCharacters The regular expressions recognized in this package generally track those of Perl 5.05, and are set forth in greater detail in the PCRE documentation accompanying the package. A summary follows: \ General escape character with several uses ^ Assert start of subjct (or line, in multiline mode) Also used to negate class definitions $ Assert end of subject (or line, in multiline mode) . match any character but newline (by default) [] class definitions | start of alternative branch () subpattern ? extends the meaning of '(' (see below) quantifies previous extension (1 or 0 occurrences) (e.g. a?) minimizes previous quantifier (e.g. a*?) * 0 or more quantifier + 1 or more quantifier {} Min/Max Quantifier {3} {3,} {3,5} Inside Character Classes \ general escape character ^ negates the class, if the first character - indicates character range, if not escaped or the last character Special Escape Sequences \a alarm (hex 7) \cx control-x, where x is any character \e escape (hex 1b) \f formfeed (hex 0c) \n newline \r carriage return \t tab \xhh Character with hexcode hh \ddd Character with octal code ddd, or a backreference \d matches decimal digit \D non-decimal digit \s whitespace \S non-whitespace \w any 'word' character \W any non-word character \b asserts a word boundary \B asserts not a non-word boundary \A asserts start of subject (regardless of mode) \Z asserts end of subject (regardless of mode) Internal Option Setting Letters enclosed within a pattern and appearing between '(?' and ')' can be used to change the imsx options. For example. (?im-sx) sets caseless and multiline modes, and unsets dotall and extended modes. See the PCRE documentation for further details. Non-Grouping Subpatterns Groupings can be enclosed by parentheses without text being captured by following the leading parenthesis with a question mark and colon. for example: 'abc(?:def)*' repeats the 'def', but does not capture matches in a grouping. Assertions An assertion is a test on characters that does not actually consume any characters. There are two kinds, those that look ahead of the current position, and those that look behind. Consider the following example: \w+(?=;) which matches a word followed by a semicolon, but doesn't include the semicolon in the match. Another example: (?<!!foo)bar finds occurences of bar not preceded by foo. All lookbehind assertions must be of fixed length, but not all alternatives in such an assertion need be of the same length. Once-Only Subpatterns (?>\d+)bar Once only subpatterns 'lock up' after finding a match, to prevent backtracking in various cases. Essentially, a subpattern ofthis type matches the string that an identical standalone pattern would match if anchored at the current point in the subject string first encountering the expression. Conditional Subpatterns (?(condition)yes-pattern) (?(condition)yes-pattern|no-pattern) These permit one of two subpatterns to be matched, depending upon a preceding condition. There are two kinds of conditions: (1) a sequence of digits, specifying that a numbered subpattern has been matched; and (2) an assertion, either positive, negative, lookahead or lookbehind. Comments (?# This is a comment) Also, in extended mode, comments may be inserted between a '#' and a newline. "! ]style[(17 21 117 41 53 61 2 1 20 1 1 194 2 547 28 139 29 528 26 255 27 230 13 353 1 170 38 293 94 280 36 78 3)f1b,bf3,f1,f1Rhttp://www.gate.net/~werdna/RePlugin.html;,f1,f1dUtilities reconstructTextWindowsFromFileNamed: 'RePluginDoco';;,f1,bf3,bf2,bf3,f1b,f1,f1b,f1,f1b,f1,f1b,f1,f1b,f1,f1b,f1,f1b,f1,f1b,f1,f1b,f1,f1b,f1,f1b,f1,f1b! ! !Re methodsFor: 'documentation' stamp: 'acg 8/4/2002 11:13'! aRegexGoryDetailsComment " Regular Expression Syntax -- the Gory Details Introduction RePlugin is a Squeak Plugin providing modern regular expression matching operations similar to those found in Perl. It was written by Andrew C. Greenberg (we...@ga...), with contributions by Markus Kohler, Stephen Pair and others. RePlugin 2.3b (and 'the Gory Details' portion of this document) is directly taken from Version 2.04 of the excellent PCRE library by Philip Hazel with only minor modifications. The syntax and semantics of the regular expressions supported by PCRE are described below. Regular expressions are also described in the Perl documentation and in a number of other books, some of which have copious examples. Jeffrey Friedl's 'Mastering Regular Expressions', published by O'Reilly (ISBN 1-56592-257-3), covers them in great detail. The description here is intended as reference documentation. Regular Expressions A regular expression is a pattern that is matched against a subject string from left to right. Most characters stand for themselves in a pattern, and match the corresponding characters in the subject. As a trivial example, the pattern The quick brown fox matches a portion of a subject string that is identical to itself. The power of regular expressions comes from the ability to include alternatives and repetitions in the pattern. These are encoded in the pattern by the use of meta-characters, which do not stand for themselves but instead are interpreted in some special way. There are two different sets of meta-characters: those that are recognized anywhere in the pattern except within square brackets, and those that are recognized in square brackets. Outside square brackets, the meta-characters are as follows: \ general escape character with several uses ^ assert start of subject (or line, in multiline mode) $ assert end of subject (or line, in multiline mode) . match any character except newline (by default) [ start character class definition | start of alternative branch ( start subpattern ) end subpattern ? extends the meaning of ( also 0 or 1 quantifier also quantifier minimizer * 0 or more quantifier + 1 or more quantifier { start min/max quantifier Part of a pattern that is in square brackets is called a 'character class'. In a character class the only meta-characters are: \ general escape character ^ negate the class, but only if the first character - indicates character range ] terminates the character class The following sections describe the use of each of the meta-characters. BACKSLASH The backslash character has several uses. Firstly, if it is followed by a non-alphameric character, it takes away any special meaning that character may have. This use of backslash as an escape char acter applies both inside and outside character classes. For example, if you want to match a '*' character, you write '\*' in the pattern. This applies whether or not the following character would otherwise be interpreted as a meta-character, so it is always safe to precede a non-alphameric with '\' to specify that it stands for itself. In particular, if you want to match a backslash, you write '\\'. If a pattern is compiled with the 'x' (beExtended) option, whitespace in the pattern (other than in a character class) and characters between a '#' outside a character class and the next newline character are ignored. An escaping backslash can be used to include a whitespace or '#' character as part of the pattern. A second use of backslash provides a way of encoding non-printing characters in patterns in a visible manner. There is no restriction on the appearance of non-printing characters, apart from the binary zero that terminates a pattern, but when a pattern is being prepared by text editing, it is usually easier to use one of the following escape sequences than the binary character it represents: \a alarm, that is, the BEL character (hex 07) \cx 'control-x', where x is any character \e escape (hex 1B) \f formfeed (hex 0C) \n newline (hex 0A) \r carriage return (hex 0D) \t tab (hex 09) \xhh character with hex code hh \ddd character with octal code ddd, or backreference The precise effect of '\cx' is as follows: if 'x' is a lower case letter, it is converted to upper case. Then bit 6 of the character (hex 40) is inverted. Thus '\cz' becomes hex 1A, but '\c{' becomes hex 3B, while '\c;' becomes hex 7B. After '\x', up to two hexadecimal digits are read (letters can be in upper or lower case). After '\0' up to two further octal digits are read. In both cases, if there are fewer than two digits, just those that are present are used. Thus the sequence '\0\x\07' specifies two binary zeros followed by a BEL character. Make sure you supply two digits after the initial zero if the character that follows is itself an octal digit. The handling of a backslash followed by a digit other than 0 is complicated. Outside a character class, PCRE reads it and any following digits as a decimal number. If the number is less than 10, or if there have been at least that many previous capturing left parentheses in the expression, the entire sequence is taken as a back reference. A description of how this works is given later, following the discussion of parenthesized subpatterns. Inside a character class, or if the decimal number is greater than 9 and there have not been that many capturing subpatterns, PCRE re-reads up to three octal digits following the backslash, and generates a single byte from the least significant 8 bits of the value. Any subsequent digits stand for themselves. For example: \040 is another way of writing a space \40 is the same, provided there are fewer than 40 previous capturing subpatterns \7 is always a back reference \11 might be a back reference, or another way of writing a tab \011 is always a tab \0113 is a tab followed by the character '3' \113 is the character with octal code 113 (since there can be no more than 99 back references) \377 is a byte consisting entirely of 1 bits \81 is either a back reference, or a binary zero followed by the two characters '8' and '1' Note that octal values of 100 or greater must not be introduced by a leading zero, because no more than three octal digits are ever read. All the sequences that define a single byte value can be used both inside and outside character classes. In addition, inside a character class, the sequence '\b' is interpreted as the backspace character (hex 08). Outside a character class it has a different meaning (see below). The third use of backslash is for specifying generic character types: \d any decimal digit \D any character that is not a decimal digit \s any whitespace character \S any character that is not a whitespace character \w any 'word' character \W any 'non-word' character Each pair of escape sequences partitions the complete set of characters into two disjoint sets. Any given character matches one, and only one, of each pair. A 'word' character is any letter or digit or the underscore character, that is, any character which can be part of a Perl 'word'. The definition of letters and digits is controlled by PCRE's character tables, and may vary if locale- specific matching is taking place (see 'Locale support' above). For example, in the 'fr' (French) locale, some character codes greater than 128 are used for accented letters, and these are matched by \w. These character type sequences can appear both inside and outside character classes. They each match one character of the appropriate type. If the current matching point is at the end of the subject string, all of them fail, since there is no character to match. The fourth use of backslash is for certain simple assertions. An assertion specifies a condition that has to be met at a particular point in a match, without consuming any characters from the subject string. The use of subpatterns for more complicated assertions is described below. The backslashed assertions are \b word boundary \B not a word boundary \A start of subject (independent of multiline mode) \Z end of subject or newline at end (independent of multiline mode) \z end of subject (independent of multiline mode) These assertions may not appear in character classes (but note that '\b' has a different meaning, namely the backspace character, inside a character class). A word boundary is a position in the subject string where the current character and the previous character do not both match \w or \W (i.e. one matches \w and the other matches \W), or the start or end of the string if the first or last character matches \w, respectively. The \A, \Z, and \z assertions differ from the traditional circumflex and dollar (described below) in that they only ever match at the very start and end of the subject string, whatever options are set. They are not affected by the 'B' (beNotBeginningOfLine) or 'Z' (beNotEndOfLine) options. The difference between \Z and \z is that \Z matches before a newline that is the last character of the string as well as at the end of the string, whereas \z matches only at the end. CIRCUMFLEX AND DOLLAR Outside a character class, in the default matching mode, the circumflex character is an assertion which is true only if the current matching point is at the start of the subject string. Inside a character class, circumflex has an entirely different meaning (see below). Circumflex need not be the first character of the pattern if a number of alternatives are involved, but it should be the first thing in each alternative in which it appears if the pattern is ever to match that branch. If all possible alternatives start with a circumflex, that is, if the pattern is constrained to match only at the start of the subject, it is said to be an 'anchored' pattern. (There are also other constructs that can cause a pattern to be anchored.) A dollar character is an assertion which is true only if the current matching point is at the end of the subject string, or immediately before a newline character that is the last character in the string (by default). Dollar need not be the last character of the pattern if a number of alternatives are involved, but it should be the last item in any branch in which it appears. Dollar has no special meaning in a character class. The meaning of dollar can be changed so that it matches only at the very end of the string, by setting the 'E' ('beDollarEndOnly') option at compile or matching time. This does not affect the \Z assertion. The meanings of the circumflex and dollar characters are changed if the 'm' (beMultiline) option is set. When this is the case, they match immediately after and immediately before an internal '\n' character, respectively, in addition to matching at the start and end of the subject string. For example, the pattern /^abc$/ matches the subject string 'def\nabc' in multiline mode, but not otherwise. Consequently, patterns that are anchored in single line mode because all branches start with '^' are not anchored in multiline mode. The 'E' (beExtended) option is ignored if 's' is set. Note that the sequences \A, \Z, and \z can be used to match the start and end of the subject in both modes, and if all branches of a pattern start with \A is it always anchored, whether 's' (beDotIncludesNewline) is set or not. PERIOD, DOT Outside a character class, a dot in the pattern matches any one character in the subject, including a non-printing character, but not (by default) newline. If the 's' (beDotIncludesNewline) option is set, then dots match newlines as well. The handling of dot is entirely independent of the handling of circumflex and dollar, the only relationship being that they both involve newline characters. Dot has no special meaning in a character class. SQUARE BRACKETS An opening square bracket introduces a character class, terminated by a closing square bracket. A closing square bracket on its own is not special. If a closing square bracket is required as a member of the class, it should be the first data character in the class (after an initial circumflex, if present) or escaped with a backslash. A character class matches a single character in the subject; the character must be in the set of characters defined by the class, unless the first character in the class is a circumflex, in which case the subject character must not be in the set defined by the class. If a circumflex is actually required as a member of the class, ensure it is not the first character, or escape it with a backslash. For example, the character class [aeiou] matches any lower case vowel, while [^aeiou] matches any character that is not a lower case vowel. Note that a circumflex is just a convenient notation for specifying the characters which are in the class by enumerating those that are not. It is not an assertion: it still consumes a character from the subject string, and fails if the current pointer is at the end of the string. When caseless matching is set, any letters in a class represent both their upper case and lower case versions, so for example, a caseless [aeiou] matches 'A' as well as 'a', and a caseless [^aeiou] does not match 'A', whereas a caseful version would. The newline character is never treated in any special way in character classes, whatever the setting of the 's' (beDotIncludesNewline) or 'm' (beMultiline) options is. A class such as [^a] will always match a newline. The minus (hyphen) character can be used to specify a range of characters in a character class. For example, [d-m] matches any letter between d and m, inclusive. If a minus character is required in a class, it must be escaped with a backslash or appear in a position where it cannot be interpreted as indicating a range, typically as the first or last character in the class. It is not possible to have the literal character ']' as the end character of a range. A pattern such as [W-]46] is interpreted as a class of two characters ('W' and '-') followed by a literal string '46]', so it would match 'W46]' or '-46]'. However, if the ']' is escaped with a backslash it is interpreted as the end of range, so [W-\]46] is interpreted as a single class containing a range followed by two separate characters. The octal or hexadecimal representation of ']' can also be used to end a range. Ranges operate in ASCII collating sequence. They can also be used for characters specified numerically, for example [\000-\037]. If a range that includes letters is used when caseless matching is set, it matches the letters in either case. For example, [W-c] is equivalent to [][\^_`wxyzabc], matched caselessly, and if character tables for the 'fr' locale are in use, [\xc8-\xcb] matches accented E characters in both cases. The character types \d, \D, \s, \S, \w, and \W may also appear in a character class, and add the characters that they match to the class. For example, [\dABCDEF] matches any hexadecimal digit. A circumflex can conveniently be used with the upper case character types to specify a more restricted set of characters than the matching lower case type. For example, the class [^\W_] matches any letter or digit, but not underscore. All non-alphameric characters other than \, -, ^ (at the start) and the terminating ] are non-special in character classes, but it does no harm if they are escaped. VERTICAL BAR Vertical bar characters are used to separate alternative patterns. For example, the pattern gilbert|sullivan matches either 'gilbert' or 'sullivan'. Any number of alternatives may appear, and an empty alternative is permitted (matching the empty string). The matching process tries each alternative in turn, from left to right, and the first one that succeeds is used. If the alternatives are within a subpattern (defined below), 'succeeds' means matching the rest of the main pattern as well as the alternative in the subpattern. INTERNAL OPTION SETTING The settings of caseless, multiline, dotall and extended options can be changed from within the pattern by a sequence of Perl option letters enclosed between '(?' and ')'. The option letters are i for Caseless Matching Mode m for Multiline Mode s for Dotall Mode (Dot matches newlines) x for Extended Mode (whitespace not meaningful, comments permitted) For example, (?im) sets caseless, multiline matching. It is also possible to unset these options by preceding the letter with a hyphen, and a combined setting and unsetting such as (?im-sx), which sets caseless and multiline while unsetting dotall and extended, is also permitted. If a letter appears both before and after the hyphen, the option is unset. The scope of these option changes depends on where in the pattern the setting occurs. For settings that are outside any subpattern (defined below), the effect is the same as if the options were set or unset at the start of matching. The following patterns all behave in exactly the same way: (?i)abc a(?i)bc ab(?i)c abc(?i) which in turn is the same as compiling the pattern abc with 'i' set. In other words, such 'top level' settings apply to the whole pattern (unless there are other changes inside subpatterns). If there is more than one setting of the same option at top level, the rightmost setting is used. If an option change occurs inside a subpattern, the effect is different. This is a change of behaviour in Perl 5.005. An option change inside a subpattern affects only that part of the subpattern that follows it, so (a(?i)b)c matches abc and aBc and no other strings (assuming 'i' is not used). By this means, options can be made to have different settings in different parts of the pattern. Any changes made in one alternative do carry on into subsequent branches within the same subpattern. For example, (a(?i)b|c) matches 'ab', 'aB', 'c', and 'C', even though when matching 'C' the first branch is abandoned before the option setting. This is because the effects of option settings happen at compile time. There would be some very weird behaviour otherwise. The PCRE-specific options 'U' and 'X' can be changed in the same way as the Perl-compatible options. The (?X) flag setting is special in that it must always occur earlier in the pattern than any of the additional features it turns on, even when it is at top level. It is best put at the start. SUBPATTERNS Subpatterns are delimited by parentheses (round brackets), which can be nested. Marking part of a pattern as a subpattern does two things: 1. It localizes a set of alternatives. For example, the pattern cat(aract|erpillar|) matches one of the words 'cat', 'cataract', or 'caterpillar'. Without the parentheses, it would match 'cataract', 'erpillar' or the empty string. 2. It sets up the subpattern as a capturing subpattern (as defined above). When the whole pattern matches, that portion of the subject string that matchedOpening parentheses are cou nted from left to right (starting from 1) to obtain the numbers of the capturing subpatterns. For example, if the string 'the red king' is matched against the pattern the ((red|white) (king|queen)) the captured substrings are 'red king', 'red', and 'king', and are numbered 1, 2, and 3. The fact that plain parentheses fulfil two functions is not always helpful. There are often times when a grouping subpattern is required without a capturing requirement. If an opening parenthesis is followed by '?:', the subpattern does not do any capturing, and is not counted when computing the number of any subsequent capturing subpatterns. For example, if the string 'the white queen' is matched against the pattern the ((?:red|white) (king|queen)) the captured substrings are 'white queen' and 'queen', and are numbered 1 and 2. The maximum number of captured substrings is 99, and the maximum number of all subpatterns, both capturing and non-capturing, is 200. As a convenient shorthand, if any option settings are required at the start of a non-capturing subpattern, the option letters may appear between the '?' and the ':'. Thus the two patterns (?i:saturday|sunday) (?:(?i)saturday|sunday) match exactly the same set of strings. Because alternative branches are tried from left to right, and options are not reset until the end of the subpattern is reached, an option setting in one branch does affect subsequent branches, so the above patterns match 'SUNDAY' as well as 'Saturday'. REPETITION Repetition is specified by quantifiers, which can follow any of the following items: a single character, possibly escaped the . metacharacter a character class a back reference (see next section) a parenthesized subpattern (unless it is an assertion - see below) The general repetition quantifier specifies a minimum and maximum number of permitted matches, by giving the two numbers in curly brackets (braces), separated by a comma. The numbers must be less than 65536, and the first must be less than or equal to the second. For example: z{2,4} matches 'zz', 'zzz', or 'zzzz'. A closing brace on its own is not a special character. If the second number is omitted, but the comma is present, there is no upper limit; if the second number and the comma are both omitted, the quantifier specifies an exact number of required matches. Thus [aeiou]{3,} matches at least 3 successive vowels, but may match many more, while \d{8} matches exactly 8 digits. An opening curly bracket that appears in a position where a quantifier is not allowed, or one that does not match the syntax of a quantifier, is taken as a literal character. For example, {,6} is not a quantifier, but a literal string of four characters. The quantifier {0} is permitted, causing the expression to behave as if the previous item and the quantifier were not present. For convenience (and historical compatibility) the three most common quantifiers have single-character abbreviations: * is equivalent to {0,} + is equivalent to {1,} ? is equivalent to {0,1} It is possible to construct infinite loops by following a subpattern that can match no characters with a quantifier that has no upper limit, for example: (a?)* Earlier versions of Perl and PCRE used to give an error at compile time for such patterns. However, because there are cases where this can be useful, such patterns are now accepted, but if any repetition of the subpattern does in fact match no characters, the loop is forcibly broken. By default, the quantifiers are 'greedy', that is, they match as much as possible (up to the maximum number of permitted times), without causing the rest of the pattern to fail. The classic example of where this gives problems is in trying to match comments in C programs. These appear between the sequences /* and */ and within the sequence, individual * and / characters may appear. An attempt to match C comments by applying the pattern /\*.*\*/ to the string /* first command */ not comment /* second comment */ fails, because it matches the entire string due to the greediness of the .* item. However, if a quantifier is followed by a question mark, then it ceases to be greedy, and instead matches the minimum number of times possible, so the pattern /\*.*?\*/ does the right thing with the C comments. The meaning of the various quantifiers is not otherwise changed, just the preferred number of matches. Do not confuse this use of question mark with its use as a quantifier in its own right. Because it has... [truncated message content] |