From: Peter L. <pet...@te...> - 2011-05-26 18:13:55
On Thursday 26 May 2011 19.24.18, John Ralls wrote:
> On May 26, 2011, at 8:44 AM, John Ralls wrote:
> > On May 26, 2011, at 3:37 AM, Rob Healey wrote:
> >> Greetings:
> >>
> >> I did not even know about the [] and (), so I am grateful that someone
> >> asked the question...
> >>
> >> Sincerely yours,
> >> Rob G. Healey
> >>
> >>
> >> On Thu, May 26, 2011 at 3:27 AM, doug <do...@o2...> wrote:
> >>
> >> On 25/05/11 21:08, Serge Noiraud wrote:
> >> > On 25/05/2011 20:36, doug wrote:
> >> >> On 25/05/11 18:44, Peter Landgren wrote:
> >> >>> Hi,
> >> >>>
> >> >>> I'm definitely not an expert on regular expressions, so I
> >> >>> need some help:
> >> >>> I would like to easily find people with names spelled
> >> >>> with one or two "s":
> >> >>> Like Nilson and Nilsson in the same person filter search.
> >> >>>
> >> >>> /Peter
> >> >>
> >> >> Does this work?
> >> >>
> >> >> \s*[a-rt-zA-Z]*[s|ss]\w*
> >> >
> >> > I don't really know how it works in gramps, but the solution
> >> > should be :
> >> > (s|ss)
> >> >
> >> > The [] means only one character : from a to z and from A to Z
> >> > the () means several characters : in our case s or ss
> >> >
> >> >> Doug
> >>
> >> Ah! thanks for that. I hadn't appreciated the difference
> >> between [] and ()
> >
> > Better and easier to use a lazy quantifier: \b[a-zA-Z]+?(s|ss)[a-z]*\b.
> > Note that \w adds [0-9_], and you probably don't want that when you're
> > matching names. I trust that the code behind this has re.M set so that
> > [a-z] will be interpreted correctly (i.e., not literally, but as any
> > unicode alphabetic character). "\b" means word boundary, and is better
> > than \s (whitespace) for isolating words... especially "zero or more"
> > whitespace (\s*).
>
> Oops, that's wrong. There isn't any unicode magic in [a-z] with re.M, so
> the only way to make it work with non-ascii characters is
> \b\w+?(s|ss)\w*\b . Python 3 is supposed to support POSIX character
> classes, so eventually you'll be able to use
> \b[[:alpha:]]+?(s|ss)[[:alpha:]]*\b, which will avoid matching numbers and
> underscores.
>
> Regards,
> John Ralls

Thanks for all the input. But I needed a very simple regular expression.
I wanted to find persons whose surnames are spelled slightly differently.
There are four versions of "Eriksson":
Erikson
Eriksson
Ericson
Ericsson

Which I get with:
eri[ck](s|ss)on

Regards,
Peter
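
For anyone who wants to try the patterns from this thread outside Gramps, here is a minimal Python sketch. It assumes the filter does a case-insensitive search (hence re.IGNORECASE; without it the lowercase pattern would not match the capitalised surnames), and the sample list, including the non-matching "Eriksen" and "Nilsson", is made up for illustration. It also checks that ss? behaves like (s|ss) here, and that [s|ss] is a character class matching a single character, either "s" or "|".

```python
import re

# Peter's pattern. IGNORECASE is an assumption about how the filter applies it.
pattern = re.compile(r"eri[ck](s|ss)on", re.IGNORECASE)

# Hypothetical sample data: the four spellings plus two names that should not match.
surnames = ["Erikson", "Eriksson", "Ericson", "Ericsson", "Eriksen", "Nilsson"]

matches = [name for name in surnames if pattern.search(name)]
print(matches)  # ['Erikson', 'Eriksson', 'Ericson', 'Ericsson']

# "ss?" (an "s" followed by an optional second "s") is a shorter equivalent of (s|ss).
shorter = re.compile(r"eri[ck]ss?on", re.IGNORECASE)
same = [name for name in surnames if shorter.search(name)]

# [s|ss] is a character class: it matches ONE character, either "s" or "|".
bracket = re.compile(r"[s|ss]")
print(bracket.fullmatch("ss"))  # None: two characters never fit a single class
```

The grouping with () is what makes the alternation work on whole strings such as "ss"; inside [] the "|" is just another literal character.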