From: Chris H. <ha...@ve...> - 2004-05-26 14:58:39
|
Now that I have successfully built my clisp 2.29 w/regexp and tried a few out, I must admit to a little bafflement.

Based on the information I've been able to gather, the clisp regexp package uses the host OS's regexp engine, and on FreeBSD and Debian, from what I can find, that is also what grep uses - on both Debian and FreeBSD, grep is GNU grep.

So if I can match a pattern using grep -E (aka egrep?) on a file, it seems reasonable to expect that the same pattern should work similarly when moved into a regexp-quoted regex-compiled with :EXTENDED and run against that same file. Except it doesn't seem to.

Using the pattern '<[/]?issuer(.*)?>' grep will pick out each appropriate line in the following data on both platforms:

<notSubjectToSection16>0</notSubjectToSection16>
<issuer>
<issuerCik>0000351077</issuerCik>
<issuerName>CITIZENS BANKING CORP</issuerName>
<issuerTradingSymbol>CBCF</issuerTradingSymbol>
</issuer>
<reportingOwner>

while clisp returns NIL.

It is my understanding that clisp should return the 'co-ordinates' of at least the leading tags, and for now that is OK - I am trying to learn the clisp regex 'flavor'.

I've tried quoting by hand as well, since regexp-quote produces this:

"<\\[/\\]?issuer(\\.\\*)+>",

i.e., ?()+ don't seem to be considered special. Hand quoting didn't seem to make any difference in this case, though sometimes grouping w/parens does seem to work in clisp.

I already have some useful regexes to do what I want on this data that run in production under awk and/or guile - I can pretty much just cut and paste expressions from one to the other.

I realize that regexes don't always 'port' well (heh!), but I *would* like to use clisp for this, so any insight anybody has to offer would be much appreciated.

I've already read everything I could find in the clisp Extensions pages, so any pointers to further information would also be warmly received.

Aloha,
+Chris

-- 
Good judgment comes from experience. Experience comes from bad judgment.
- Jim Horning |
From: John K. H. <hi...@al...> - 2004-05-26 16:02:22
|
> Now that I have successfully built my clisp 2.29 w/regexp and tried a

Before I answer your regex question, I should second Sam's point that 2.29 is not a great CLISP to be running. If you can get the latest one you will be much happier. On the other hand, in this case I believe the regex question is applicable going back to 2.29, so I will try to answer it.

> Based on the information I've been able to gather, the clisp regexp
> package uses the host OS's regexp engine, and on FreeBSD and Debian,

It's noted in the CLISP docs at

http://clisp.cons.org/impnotes/modules.html#regexp

that "The REGEXP module implements the POSIX regular expressions" w/ a link to the CLISP page w/ the POSIX syntax:

http://clisp.cons.org/impnotes/regexp.html

You should check out this page in detail if you have not already.

> from what I can find, that is also what grep uses - on both Debian and
> FreeBSD, grep is GNU grep.

No, I don't believe GNU grep uses the POSIX syntax.

> Using the pattern '<[/]?issuer(.*)?>' grep

POSIX requires that the "?" question-mark quantifier as well as the group parentheses be preceded by a backslash. See the syntax page at

http://clisp.cons.org/impnotes/regexp.html

> I am trying to learn the clisp regex 'flavor'.

I think it is POSIX, which is different from GNU grep.

> I've tried quoting by hand as well, since regexp-quote produces this:
> "<\\[/\\]?issuer(\\.\\*)+>", i.e., ?()+ don't seem to be considered

I think this will do what you want:

(setf patt "<[/]\\?issuer\\(.*\\)\\?>")
(regexp:match patt "<issuerName>CITIZENS BANKING CORP</issuerName>")

Note that the backslashes are doubled so that the literal string assigned to "patt" is:

<[/]\?issuer\(.*\)\?>

and the ? quantifier and grouping parens are preceded by a backslash.

> I already have some useful regexes to do what I want on this data that
> run in production under awk and/or guile - I can pretty much just cut
> and paste expressions from one to the other.

There is a module called "pcre" (Perl-Compatible Regular Expressions) that will use regex syntax a la perl. I prefer this actually, mainly because perl's syntax avoids what Larry Wall called "backslashitis", an overabundance of backslashes which under POSIX appear in the very commonly used operators. Add to this the fact that to get a backslash INTO a literal string (in perl, C, lisp) you have to double it as "\\", and you go crazy. Perl's syntax considers the special operators like ? to be special by DEFAULT, and if you want them literally THEN you must quote them.

I think the POSIX designers did this because it is conceptually more elegant and consistent to have all regex operators be escaped w/ backslash, but then it looks like hell when you actually use it.

Sam: the syntax page at:

http://clisp.cons.org/impnotes/regexp.html

should probably say it's POSIX, and draw a distinction b/w other syntaxes (GNU grep, perl, etc.)

I agree all these different syntaxes are a real pain.

---
John Hinsdale, Alma Mater Software, Inc., Tarrytown, NY 10591-3710 USA
hi...@al... | http://www.alma.com/staff/hin | +1 914 631 4690 |
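The backslash doubling described above can be checked without the regexp module at all. What follows is only a minimal sketch in portable Common Lisp, using the pattern string from the message; the variable name *patt* is chosen here for illustration. It shows that the Lisp reader collapses each "\\" in the literal to a single backslash, so the regex engine receives \? and \( rather than doubled backslashes.

```lisp
;; The pattern exactly as typed in the message above.
(defparameter *patt* "<[/]\\?issuer\\(.*\\)\\?>")

;; ~A prints the string's real contents (what the regex engine receives):
;;   <[/]\?issuer\(.*\)\?>
(format t "engine sees: ~A~%" *patt*)

;; ~S prints it the way the reader needs it typed, backslashes doubled again:
;;   "<[/]\\?issuer\\(.*\\)\\?>"
(format t "typed as:    ~S~%" *patt*)

;; 21 characters in total, of which 4 are backslashes.
(format t "length ~D, backslashes ~D~%"
        (length *patt*)
        (count #\\ *patt*))
```

Printing a pattern with ~A before handing it to the matcher is a cheap way to confirm how many backslashes actually survived.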
From: Chris H. <ha...@ve...> - 2004-05-27 01:45:25
|
On Wed, May 26, 2004 at 12:02:15PM -0400, John K. Hinsdale wrote:
>
> I think this will do what you want:
>
> (setf patt "<[/]\\?issuer\\(.*\\)\\?>")
> (regexp:match patt "<issuerName>CITIZENS BANKING CORP</issuerName>")
>
> Note that the backslashes are doubled so that the literal string
> assigned to "patt" is:
>
> <[/]\?issuer\(.*\)\?>")
>
> and the ? quantifier and grouping parens are preceded by backslash
>
> > I already have some useful regexes to do what I want on this data that
> > run in production under awk and/or guile - I can pretty much just cut
> > and paste expressions from one to the other.
>
> There is a module called "pcre" (Perl-Compatible Regular Expressions)
> that will use regex syntax ala perl. I prefer this actually, mainly
> because perl's syntax avoids what Larry Wall called "backslashitis",

Like in Emacs regexes? ;-)

Module "pcre" doesn't seem to be available for 2.29, I'm still working on getting current clisp to build on my box. At any rate, I'd prefer to not have to add yet another piece to the list of required supporting software for my app, and my needs, I am fairly certain, can be met with the basic package, if I can just get my mind around it.

>
> I agree all this different syntaxes are reall pain
>

Naaaw, they just make life that much more 'interesting'. ;-D

I *am* trying to upgrade to the current clisp - I noticed that I have gcc 3.0 installed already, so I think I will try that for building clisp.

FWIW, I keep the http://clisp.cons.org/impnotes/modules.html#regexp page open in my browser when I am working with regexp in clisp.

According to the man _and_ info pages grep/egrep/fgrep, if the env variable POSIXLY_CORRECT is set, they will behave appropriately. More specifically on the quoting, they inform us that 'basic' expressions require quoting by backslash, 'grep -E'/egrep do not, but from my testing will accept them.

And thanks for the proper quoted version - as it turns out I *had* tried it but not on a free-standing string, as in your example. Please see the following as to what I mean by this, and why I am still a bit confused.

I find the following to be some 'interesting' ;-) behavior.

[125]> test-rx
"<[/]\\?issuer\\(.*\\)\\?>"
[126]> test-str
"
<notSubjectToSection16>0</notSubjectToSection16>
<issuer>
<issuerCik>0000351077</issuerCik>
<issuerName>CITIZENS BANKING CORP</issuerName>
<issuerTradingSymbol>CBCF</issuerTradingSymbol>
</issuer>
<reportingOwner>
"
[127]> (match test-rx test-str)
#S(REGEXP::REGMATCH_T :RM_SO 58 :RM_EO 255) ;
#S(REGEXP::REGMATCH_T :RM_SO 65 :RM_EO 254)
[128]>

So far so good. 'K. (And thanks again.)

(BTW, how do I access the second match? Lisp newbie here, I'm afraid. I've tried all sort of things, but can't seem to figure it out. Use 'multiple-value-something' perhaps?)

Now, I'm using this (probably very naive and newbie-like) function to read the file containing the data to be matched against.

(defun get-file (fname)
  (let ((in-data "")
        (in-name fname)
        (new-ln (coerce (list #\Newline) 'string)))
    (with-open-file (in-file in-name :direction :input)
      (do ((cur-line
            (read-line in-file nil "<<<EOF")
            (read-line in-file nil "<<<EOF")))
          ((equal "<<<EOF" cur-line))
;       (setf in-data (concatenate 'string in-data cur-line new-ln))))
        (setf in-data (concatenate 'string in-data cur-line))))
    in-data))

(defvar xml (get-file "srchsec-xmple.xml"))

So then I try:

[128]> (match test-rx xml)
NIL
[129]>

As shown, 'get-file', *doesn't* include newlines, since read-line doesn't pass them on, but I get the same result if I *do* include the newlines. It is my understanding that the interaction of regexes and newlines can sometimes be a bit subtle, so I am unsure of the consequences.

I've also tried:

[129]> (setf xml (coerce (get-file "srchsec-xmple.xml") 'string))
[130]> (match test-rx xml)
NIL
[131]>

I noticed the following at the very start of the contents of var 'xml':

"<?xml version=\\\"1.0\\\"?>"

so I also tried

(setf xml (subseq xml 24 (length xml)))

to get rid of the offending data, thinking that the quoted/escapes characters might confuse the regexp engine somehow, but still get NIL on the match.

Any idea if those characters would adversely affect a match in some way?

I really appreciate your patience and time in helping a clueless newbie luser like me!

Aloha,
+Chris

-- 
Good judgment comes from experience. Experience comes from bad judgment.
- Jim Horning |
From: Sam S. <sd...@gn...> - 2004-05-27 02:51:31
|
> * Chris Hall <un...@ir...g> [2004-05-26 15:45:17 -1000]:
>
> #S(REGEXP::REGMATCH_T :RM_SO 58 :RM_EO 255) ;
> #S(REGEXP::REGMATCH_T :RM_SO 65 :RM_EO 254)
>
> (BTW, how do I access the second match? Lisp newbie here, I'm afraid.
> I've tried all sort of things, but can't seem to figure it out. Use
> 'multiple-value-something' perhaps?)

yes, MULTIPLE-VALUE-BIND or MULTIPLE-VALUE-LIST

> Now, I'm using this (probably very naive and newbie-like) function to
> read the file containing the data to be matched against.
>
> (defun get-file (fname)
>   (let ((in-data "")
>         (in-name fname)
>         (new-ln (coerce (list #\Newline) 'string)))
>     (with-open-file (in-file in-name :direction :input)
>       (do ((cur-line
>             (read-line in-file nil "<<<EOF")
>             (read-line in-file nil "<<<EOF")))
>           ((equal "<<<EOF" cur-line))
> ;       (setf in-data (concatenate 'string in-data cur-line new-ln))))
>         (setf in-data (concatenate 'string in-data cur-line))))
>     in-data))

1. you want to use EQ to check for EOF (this is actually a bug in your code!)

2. also, you want to pass the stream itself as the 3rd argument to READ-LINE (this is a standard idiom).

3. if you want to read the whole file into one string, you should do it like this:

(defun read-whole-file (name)
  (with-open-file (s name)
    (let ((ret (make-string (file-length s))))
      (read-sequence ret s)
      ret)))

your method is extremely inefficient.

-- 
Sam Steingold (http://www.podval.org/~sds) running w2k
<http://www.camera.org> <http://www.iris.org.il> <http://www.memri.org/>
<http://www.mideasttruth.com/> <http://www.honestreporting.com>
Do not tell me what to do and I will not tell you where to go. |
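Points 1 and 2 above, and the MULTIPLE-VALUE-BIND answer, can be sketched in portable Common Lisp. This is only an illustration, not code from the thread: the names slurp-stream, slurp-file and two-values are hypothetical, and two-values merely stands in for a call such as MATCH that returns more than one value.

```lisp
(defun slurp-stream (in)
  "Return everything readable from IN as one string, ending each line with a newline."
  (with-output-to-string (out)
    ;; The stream itself is the EOF marker: READ-LINE can never return it
    ;; as a line of text, so the EQ test cannot be fooled by file contents.
    (loop for line = (read-line in nil in)
          until (eq line in)
          do (write-line line out))))

(defun slurp-file (path)
  "Read the file named by PATH into a single string."
  (with-open-file (in path :direction :input)
    (slurp-stream in)))

(defun two-values ()
  "Hypothetical stand-in for a function that returns two values."
  (values :whole-match :first-group))

;; Bind both values to names:
(multiple-value-bind (whole group) (two-values)
  (format t "whole=~S group=~S~%" whole group))

;; Or collect them in a list, or pick one by position (zero-based):
(format t "~S~%" (multiple-value-list (two-values)))
(format t "~S~%" (nth-value 1 (two-values)))
```

Collecting the lines through a string output stream also avoids re-copying the accumulated text on every line, which is presumably the inefficiency point 3 refers to. If read-whole-file is used instead, note that READ-SEQUENCE returns the index of the first element it did not fill, which can be used to trim the buffer when the character count is smaller than FILE-LENGTH.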
From: Sam S. <sd...@gn...> - 2004-05-27 01:52:56
|
> * John K. Hinsdale <uva@nyzn.pbz> [2004-05-26 12:02:15 -0400]: > > Sam: the syntax page at: > > http://clisp.cons.org/impnotes/regexp.html > > should probably say it's POSIX, and draw a distinction b/w other > syntaxes (GNU grep, perl, etc.) that page is generated from texinfo source lifted from ed man page (at least that's what the source says) I am inclined to _remove_ this page at all because we already link to the canonical regexp reference <http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html> and because lack of DocBook/XML source makes it a distribution nightmare. -- Sam Steingold (http://www.podval.org/~sds) running w2k <http://www.camera.org> <http://www.iris.org.il> <http://www.memri.org/> <http://www.mideasttruth.com/> <http://www.honestreporting.com> Only the mediocre are always at their best. |
From: Chris H. <ha...@ve...> - 2004-05-27 17:06:12
|
On Wed, May 26, 2004 at 12:02:15PM -0400, John K. Hinsdale wrote:
>
> I think this will do what you want:
>
> (setf patt "<[/]\\?issuer\\(.*\\)\\?>")
> (regexp:match patt "<issuerName>CITIZENS BANKING CORP</issuerName>")
>
> Note that the backslashes are doubled so that the literal string
> assigned to "patt" is:
>
> <[/]\?issuer\(.*\)\?>")
>
> and the ? quantifier and grouping parens are preceded by backslash
>
> > I already have some useful regexes to do what I want on this data that
> > run in production under awk and/or guile - I can pretty much just cut
> > and paste expressions from one to the other.
>
> There is a module called "pcre" (Perl-Compatible Regular Expressions)
> that will use regex syntax ala perl. I prefer this actually, mainly
> because perl's syntax avoids what Larry Wall called "backslashitis",
> an overabundance of backslashes which under POSIX appear in the very
> commonly used operators. Adding to this the fact that to get a
> backslash INTO a literal string (in perl, C, lisp) you have to double
> it as "\\" then you go crazy. Perl's syntax considers the special
> operators like ? to be special by DEFAULT and if you want them
> literally THEN you must quote them.
>
> I think the POSIX designers did this because it is conceptually more
> elegant and consistent to have all regex operators be escaped w/
> backslash, but then it looks like hell when you actually use it.
>
> Sam: the syntax page at:
>
> http://clisp.cons.org/impnotes/regexp.html
>
> should probably say it's POSIX, and draw a distinction b/w
> other syntaxes (GNU grep, perl, etc.)
>
> I agree all this different syntaxes are reall pain
>

No pain, no gain! ;-)

(And as we used to say in the marines: "Pain is just weakness leaving the body!")

The reference material that for some reason made things click for me on this was the regex.h link from the clisp extensions page for regexp - if I am not mistaken, that is *the* description for how the engine works. It might be 'just' a man page, but it is *much* better than the one for the same subject on my Debian box.

http://www.opengroup.org/onlinepubs/007904975/basedefs/regex.h.html

Anyway, I think I've found the answers to about 90% of my questions in regard to the clisp regexp package, and for the last one I've got a fairly decent workaround. I'm beginning to get comfortable with this package - so far, it seems to meet my needs quite well.

If you are at all interested, starting with the pattern you so kindly suggested to me, here is what I've come up with so far.

(defvar xml-tree "
<notSubjectToSection16>0</notSubjectToSection16>
<issuer>
<issuerCik>0000351077</issuerCik>
<issuerName>CITIZENS BANKING CORP</issuerName>
<issuerTradingSymbol>CBCF</issuerTradingSymbol>
</issuer>
<reportingOwner>
")

(defvar issuer-leaf
  (match-string xml-tree
                (match "<issuer>.*?</issuer>" xml-tree :extended t)))

This sets 'issuer-leaf' to just the content between and inclusive of the <issuer/> tags, and illustrates my last remaining question about this.

If I read the entire original file into xml-tree (about 13100 bytes), and use "<issuer>(.*)?</issuer>", the match fails. Remove the grouping parentheses, and it works as above.

*However*, if we now use the pattern *with* the parentheses on 'issuer-leaf' as in:

(match-string issuer-leaf
              (cadr (multiple-value-list
                     (match "<issuer>(.*)?</issuer>" issuer-leaf :extended t))))

it returns:

"
<issuerCik>0000351077</issuerCik>
<issuerName>CITIZENS BANKING CORP</issuerName>
<issuerTradingSymbol>CBCF</issuerTradingSymbol>
"

The ':extended t' argument removes the need for most, if not all, of the painful backslash escaping as well, as per the referenced man page - _that_ makes things a bit easier, IMO.

The 'match' form in the example returns two match structures: first, the 'overall match', then next, any groups specified in the pattern - in this case, the stuff *between* the <issuer/> tags.

Woot! *Now* we are getting somewhere! Using this strategy it is very straightforward to get to the juicy bits.

But why, oh why don't parentheses/grouping work against the whole xml tree, I wonder? Hmmm. I'm not going to spend much more time on it, since I've now got a working strategy, but I *am* curious about it.

I'd thought at first that it was due to '.* greediness', so I put in the '?' to make them lazy instead - nope, no difference. As a nice benefit though, the '?' seems to provide a consistent 0.01 second speed up on my ooold, slooow box against this very tiny sample.

(BTW, I only used 'cadr' in the example because I know from experimentation that there are two matches, and the second is the group that I am interested in - I doubt I would use it in a real program.)

Aloha,
+Chris

__
No single drop of water thinks it is responsible for the flood.
-- Old adage |
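For comparison, the same extraction (the text strictly between the <issuer> tags) can be done with no regex engine at all. The sketch below is a deliberately different technique, plain substring search with the standard SEARCH and SUBSEQ functions, and not the regexp approach used in the message; the function name text-between is hypothetical. It does not explain why the grouped pattern fails on the full file, but it gives a regex-free cross-check for the result.

```lisp
(defun text-between (open close text)
  "Return the part of TEXT after the first OPEN and before the next CLOSE,
or NIL if either delimiter is missing."
  (let* ((start (search open text))
         (from  (and start (+ start (length open))))
         (end   (and from (search close text :start2 from))))
    (when end
      (subseq text from end))))

;; Usage on a shortened copy of the sample data from the message:
(defparameter *sample*
  "<x>0</x><issuer><issuerCik>0000351077</issuerCik></issuer><reportingOwner>")

(format t "~S~%" (text-between "<issuer>" "</issuer>" *sample*))
```

Because SEARCH matches literal characters only, newlines and regex metacharacters in the data need no special handling here.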
From: Sam S. <sd...@gn...> - 2004-05-27 18:01:22
|
Chris, it just occurred to me that you are parsing an XML file: maybe you would prefer to use a full parser instead of searching for a specific element? CLOCC/CLLIB/xml.lisp will parse the whole XML file for you. There are other CL XML parsers available, check out cliki. -- Sam Steingold (http://www.podval.org/~sds) running w2k <http://www.camera.org> <http://www.iris.org.il> <http://www.memri.org/> <http://www.mideasttruth.com/> <http://www.honestreporting.com> Programming is like sex: one mistake and you have to support it for a lifetime. |
From: Chris H. <ha...@ve...> - 2004-06-02 07:34:38
|
On Thu, May 27, 2004 at 02:01:17PM -0400, Sam Steingold wrote: > Chris, > it just occurred to me that you are parsing an XML file: maybe you > would prefer to use a full parser instead of searching for a specific > element? > CLOCC/CLLIB/xml.lisp will parse the whole XML file for you. > There are other CL XML parsers available, check out cliki. Thanks, and wow! that parser looks pretty complete! Even has namespace support, if I read it correctly. I think we might need to use it soon, but for this particular task, the regexp is probably much the better solution. I don't know how many people use regexps for 'consuming' XML, but for many purposes I've found them to be much easier and light-weight than using a SAX parser. Sorry I took so long to respond, but I'm getting ready to move living quarters and I've been spending all my spare time hand-crafting a Debian install on a very old (6 years or so) Toshiba laptop w/32MB RAM and 1GB disk as a fairly complete development and 'Emacs support' environment. ;-D The more I use Debian, the more impressed I am with it, and clisp fits right in, since it requires only about ~2-3MB ram to load. Clisp 2.33 with regexp, postgresql and bindings/glibc has built successfully; 'make check' is running as I write this. FWIW, I'm using the Debian kernel 2.4bf (or is it bf2.4?) this time - it has the PCMCIA-support in the kernel, I think. Thanks again, Sam, +Chris P.S. Debian note: If one uses the text-mode browser w3m to browse http://packages.debian.org and download a package, w3m will display the package info and ask if you'd like to install the package. Even if one is not running as 'root', it gives one a chance to 'sudo' or equivalent. And it works! Woot! I imagine other browsers might support this as well, but this is certainly the first I'd heard of it. I imagine that some sort of 'MIME magic' is in play here. __ No single drop of water thinks it is responsible for the flood. -- Old adage |
From: Bruno H. <br...@cl...> - 2004-05-26 18:26:29
|
Sam wrote: > I am inclined to _remove_ this page at all because we already link to > the canonical regexp reference > <http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html> > and because lack of DocBook/XML source makes it a distribution nightmare. Please don't remove these two syntax descriptions. A manual page is something that a user can understand; the POSIX regexp reference is not readable like this. Bruno |
From: Sam S. <sd...@gn...> - 2004-05-27 04:28:03
|
> * Bruno Haible <oe...@py...t> [2004-05-26 20:11:36 +0200]: > > Sam wrote: >> I am inclined to _remove_ this page at all because we already link to >> the canonical regexp reference >> <http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html> >> and because lack of DocBook/XML source makes it a distribution nightmare. > > Please don't remove these two syntax descriptions. A manual page is > something that a user can understand; the POSIX regexp reference is > not readable like this. I am not sure what you mean here. At any rate, the regexp module comes with a 10 y.o. implementation whose origin is unclear. (One big problem is unicode). The man page has, apparently, the same origin. Note that we use the OS regexp when it is available, so this page is irrelevant most of the time. I would prefer either finding a unicode regexp and ignoring the OS, or dumping the bundled regexp and requiring OS to offer one. OTOH, win32 loses then. forget it, it's a can of worms. -- Sam Steingold (http://www.podval.org/~sds) running w2k <http://www.camera.org> <http://www.iris.org.il> <http://www.memri.org/> <http://www.mideasttruth.com/> <http://www.honestreporting.com> main(a){a="main(a){a=%c%s%c;printf(a,34,a,34);}";printf(a,34,a,34);} |
From: Bruno H. <br...@cl...> - 2004-05-27 12:37:59
|
Edi Weitz wrote: > Last time I checked it was comparable in speed with CLISP's regex > engine. Do you mean the speed when run in a Lisp implementation that compiles to native code, or when run in clisp? Doing character-at-a-time processing in byte-compiled clisp is quite slow; that's why we are looking for an implementation written in C/C++. Bruno |
From: Edi W. <ed...@ag...> - 2004-05-27 23:38:22
|
On Thu, 27 May 2004 14:29:17 +0200, Bruno Haible <br...@cl...> wrote:

> Edi Weitz wrote:
>> Last time I checked it was comparable in speed with CLISP's regex
>> engine.
>
> Do you mean the speed when run in a Lisp implementation that
> compiles to native code, or when run in clisp?

I meant CL-PPCRE when run in CLISP. Of course it's much faster when compiled to native code.

> Doing character-at-a-time processing in byte-compiled clisp is quite
> slow; that's why we are looking for an implementation written in
> C/C++.

While this is generally true the situation is not as black-and-white as you seem to imply. First, regular expressions are not only about character-at-a-time processing. There are several cases where CL-PPCRE is actually (a lot) faster than CLISP's regex engine. (See end of mail for some benchmarks.) But apart from that I'd say that it is usually just fast enough[TM] (unless you're throwing certain regular expressions at enormous strings several times).

Moreover:

1. It has more features (look-aheads, look-behinds, stand-alone expressions, ...) than the current engine.

2. It has a syntax (the one from Perl) most users will be familiar with.

3. It has an alternative S-expression syntax for regular expressions. These are of course much easier to manipulate programmatically from Lisp.

4. Because it's written in Lisp it'll use the same character encoding that CLISP uses independently of external settings like your locale.

5. If you get an error you get an error Lisp can handle. Disasters like this one can't happen:

   [1]> (regexp:regexp-compile "(a|(bc)){0,0}?xyz" :extended t)

   *** - handle_fault error2 ! address = 0x14 not in [0x20248000,0x203d5a30) !
   SIGSEGV cannot be cured. Fault address = 0x14.
   Segmentation fault

6. It's written in Lisp (did I mention that already?) - for marketing reasons it might be not too bad an idea if the regex engine used by a Lisp implementation was also written in Lisp... :)

Anyway, you decide...

Cheers,
Edi.
Regarding the speed of CL-PPCRE I did some simple benchmarks based on <http://weitz.de/cl-ppcre/#bench>. The code I wrote can be found at <http://miles.agharta.de/bench.lisp>. This is on a Debian sid system with CLISP (2.33) from Debian. It should be noted that CL-PPCRE has been profiled with and optimized for CMUCL only. (With the help of Duane Rettig from Franz I've done some preliminary work to optimize it for AllegroCL as well but that's not part of the official distribution yet.) I'm pretty sure there are ways to tweak it for CLISP if someone who knows CLISP better than I does it. I'll gladly accept patches. edi@bird:~$ uname -a Linux bird 2.6.6 #1 Wed May 26 11:10:22 CEST 2004 i686 GNU/Linux edi@bird:~$ echo $LC_CTYPE en_US.UTF-8 edi@bird:~$ clisp WARNING: *FOREIGN-ENCODING*: reset to ASCII i i i i i i i ooooo o ooooooo ooooo ooooo I I I I I I I 8 8 8 8 8 o 8 8 I \ `+' / I 8 8 8 8 8 8 \ `-+-' / 8 8 8 ooooo 8oooo `-__|__-' 8 8 8 8 8 | 8 o 8 8 o 8 8 ------+------ ooooo 8oooooo ooo8ooo ooooo 8 Copyright (c) Bruno Haible, Michael Stoll 1992, 1993 Copyright (c) Bruno Haible, Marcus Daniels 1994-1997 Copyright (c) Bruno Haible, Pierpaolo Bernardi, Sam Steingold 1998 Copyright (c) Bruno Haible, Sam Steingold 1999-2000 Copyright (c) Sam Steingold, Bruno Haible 2001-2004 ;; Loading file /home/edi/.clisprc ... ;; Loaded file /home/edi/.clisprc [1]> (lisp-implementation-version) "2.33 (2004-03-17) (built 3289141980) (memory 3294470374)" [2]> (require :cl-ppcre) ;; Loading file /usr/share/common-lisp/systems/cl-ppcre.asd ... ;; Loaded file /usr/share/common-lisp/systems/cl-ppcre.asd ;; Loading file /usr/lib/common-lisp/clisp/cl-ppcre/packages.fas ... ;; Loaded file /usr/lib/common-lisp/clisp/cl-ppcre/packages.fas ;; Loading file /usr/lib/common-lisp/clisp/cl-ppcre/specials.fas ... ;; Loaded file /usr/lib/common-lisp/clisp/cl-ppcre/specials.fas ;; Loading file /usr/lib/common-lisp/clisp/cl-ppcre/util.fas ... 
;; Loaded file /usr/lib/common-lisp/clisp/cl-ppcre/util.fas ;; Loading file /usr/lib/common-lisp/clisp/cl-ppcre/errors.fas ... ;; Loaded file /usr/lib/common-lisp/clisp/cl-ppcre/errors.fas ;; Loading file /usr/lib/common-lisp/clisp/cl-ppcre/lexer.fas ... ;; Loaded file /usr/lib/common-lisp/clisp/cl-ppcre/lexer.fas ;; Loading file /usr/lib/common-lisp/clisp/cl-ppcre/parser.fas ... ;; Loaded file /usr/lib/common-lisp/clisp/cl-ppcre/parser.fas ;; Loading file /usr/lib/common-lisp/clisp/cl-ppcre/regex-class.fas ... ;; Loaded file /usr/lib/common-lisp/clisp/cl-ppcre/regex-class.fas ;; Loading file /usr/lib/common-lisp/clisp/cl-ppcre/convert.fas ... ;; Loaded file /usr/lib/common-lisp/clisp/cl-ppcre/convert.fas ;; Loading file /usr/lib/common-lisp/clisp/cl-ppcre/optimize.fas ... ;; Loaded file /usr/lib/common-lisp/clisp/cl-ppcre/optimize.fas ;; Loading file /usr/lib/common-lisp/clisp/cl-ppcre/closures.fas ... ;; Loaded file /usr/lib/common-lisp/clisp/cl-ppcre/closures.fas ;; Loading file /usr/lib/common-lisp/clisp/cl-ppcre/repetition-closures.fas ... ;; Loaded file /usr/lib/common-lisp/clisp/cl-ppcre/repetition-closures.fas ;; Loading file /usr/lib/common-lisp/clisp/cl-ppcre/scanner.fas ... ;; Loaded file /usr/lib/common-lisp/clisp/cl-ppcre/scanner.fas ;; Loading file /usr/lib/common-lisp/clisp/cl-ppcre/api.fas ... ;; Loaded file /usr/lib/common-lisp/clisp/cl-ppcre/api.fas 0 errors, 0 warnings T [3]> (load (compile-file "/tmp/bench.lisp")) Compiling file /tmp/bench.lisp ... Wrote file /tmp/bench.fas 0 errors, 0 warnings ;; Loading file /tmp/bench.fas ... 
;; Loaded file /tmp/bench.fas
T
[4]> (test)
PPCRE wins by a factor of 31.2
PPCRE wins by a factor of 257.9
PPCRE wins by a factor of 2534.2
PPCRE wins by a factor of 20842.4
PPCRE wins by a factor of 3.6
PPCRE wins by a factor of 3.8
PPCRE wins by a factor of 3.9
PPCRE wins by a factor of 4.0
PPCRE wins by a factor of 9.5
PPCRE wins by a factor of 84.5
PPCRE wins by a factor of 846.4
PPCRE wins by a factor of 6337.0
CLISP wins by a factor of 1.3
CLISP wins by a factor of 1.3
CLISP wins by a factor of 1.3
CLISP wins by a factor of 1.2
CLISP wins by a factor of 2.1
CLISP wins by a factor of 2.3
CLISP wins by a factor of 2.3
CLISP wins by a factor of 2.0
PPCRE wins by a factor of 3.0
PPCRE wins by a factor of 3.2
PPCRE wins by a factor of 3.3
PPCRE wins by a factor of 3.3
PPCRE wins by a factor of 1.2
PPCRE wins by a factor of 1.2
PPCRE wins by a factor of 1.2
PPCRE wins by a factor of 1.2
CLISP wins by a factor of 2.3
CLISP wins by a factor of 2.6
CLISP wins by a factor of 2.7
CLISP wins by a factor of 2.4
PPCRE wins by a factor of 3.0
PPCRE wins by a factor of 3.1
PPCRE wins by a factor of 3.3
PPCRE wins by a factor of 3.3
PPCRE wins by a factor of 1.3
PPCRE wins by a factor of 1.5
PPCRE wins by a factor of 1.5
PPCRE wins by a factor of 1.5
CLISP wins by a factor of 1.9
CLISP wins by a factor of 1.9
CLISP wins by a factor of 1.8
PPCRE wins by a factor of 48.6
PPCRE wins by a factor of 617.9
PPCRE wins by a factor of 6168.3
CLISP wins by a factor of 14.6
CLISP wins by a factor of 20.1
CLISP wins by a factor of 21.3
PPCRE wins by a factor of 1.2
PPCRE wins by a factor of 1.8
PPCRE wins by a factor of 1.8
CLISP wins by a factor of 14.6
CLISP wins by a factor of 21.3
CLISP wins by a factor of 21.9
PPCRE wins by a factor of 1.3
PPCRE wins by a factor of 1.8
PPCRE wins by a factor of 1.9
PPCRE wins by a factor of 1.6
PPCRE wins by a factor of 2.0
PPCRE wins by a factor of 2.1
NIL
[5]> |
From: Sam S. <sd...@gn...> - 2004-05-27 23:53:02
|
> * Edi Weitz <rqv@ntunegn.qr> [2004-05-28 01:38:15 +0200]: > > 2. It has a syntax (the one from Perl) most users will be familiar > with. could you please benchmark it against the CLISP PCRE module? -- Sam Steingold (http://www.podval.org/~sds) running w2k <http://www.camera.org> <http://www.iris.org.il> <http://www.memri.org/> <http://www.mideasttruth.com/> <http://www.honestreporting.com> Our business is run on trust. We trust you will pay in advance. |
From: Edi W. <ed...@ag...> - 2004-05-28 01:16:59
|
On Thu, 27 May 2004 19:52:59 -0400, Sam Steingold <sd...@gn...> wrote: > could you please benchmark it against the CLISP PCRE module? See below. I had to test on Cygwin because the Debian CLISP doesn't seem to include PCRE. To be fair to CL-PPCRE I'd like to note that the PCRE module seems pretty useless to me. You can completely kill CLISP (CLISP silently dies) with various small, legitimate regular expressions. Here's one example: edi@bird:~/lisp/cl-ppcre$ clisp -Kfull i i i i i i i ooooo o ooooooo ooooo ooooo I I I I I I I 8 8 8 8 8 o 8 8 I \ `+' / I 8 8 8 8 8 8 \ `-+-' / 8 8 8 ooooo 8oooo `-__|__-' 8 8 8 8 8 | 8 o 8 8 o 8 8 ------+------ ooooo 8oooooo ooo8ooo ooooo 8 Copyright (c) Bruno Haible, Michael Stoll 1992, 1993 Copyright (c) Bruno Haible, Marcus Daniels 1994-1997 Copyright (c) Bruno Haible, Pierpaolo Bernardi, Sam Steingold 1998 Copyright (c) Bruno Haible, Sam Steingold 1999-2000 Copyright (c) Sam Steingold, Bruno Haible 2001-2004 ;; Loading file /home/edi/.clisprc ... ;; Loaded file /home/edi/.clisprc [1]> (pcre:pcre-exec (pcre:pcre-compile "(aa)(.*)") "aaxx") edi@bird:~/lisp/cl-ppcre$ You can also kill it with medium-sized target strings: edi@bird:~/lisp/cl-ppcre$ clisp -Kfull i i i i i i i ooooo o ooooooo ooooo ooooo I I I I I I I 8 8 8 8 8 o 8 8 I \ `+' / I 8 8 8 8 8 8 \ `-+-' / 8 8 8 ooooo 8oooo `-__|__-' 8 8 8 8 8 | 8 o 8 8 o 8 8 ------+------ ooooo 8oooooo ooo8ooo ooooo 8 Copyright (c) Bruno Haible, Michael Stoll 1992, 1993 Copyright (c) Bruno Haible, Marcus Daniels 1994-1997 Copyright (c) Bruno Haible, Pierpaolo Bernardi, Sam Steingold 1998 Copyright (c) Bruno Haible, Sam Steingold 1999-2000 Copyright (c) Sam Steingold, Bruno Haible 2001-2004 ;; Loading file /home/edi/.clisprc ... ;; Loaded file /home/edi/.clisprc [1]> (defparameter *xxxx* (make-string 10000 :initial-element #\x)) *XXXX* [2]> (pcre:pcre-exec (pcre:pcre-compile "(.)*" :dotall t) *xxxx*) edi@bird:~/lisp/cl-ppcre$ Hmmm, I'm not impressed... 
I had to modify the benchmark and remove a couple of tests to make it
work at all. The source code is at <http://miles.agharta.de/bench2.lisp>.
Here are the results:

  edi@bird:~/lisp/cl-ppcre$ uname -a
  CYGWIN_NT-5.1 bird 1.5.10(0.116/4/2) 2004-05-25 22:07 i686 unknown unknown Cygwin
  edi@bird:~/lisp/cl-ppcre$ clisp -Kfull
  [CLISP startup banner and copyright notice snipped]
  ;; Loading file /home/edi/.clisprc ...
  ;; Loaded file /home/edi/.clisprc
  [1]> (lisp-implementation-version)
  "2.33 (2004-03-17) (built on winsteingoldlap [10.0.19.22])"
  [2]> (load "load.lisp")
  ;; Loading file load.lisp ...
  ;; Loading file /home/edi/lisp/cl-ppcre/packages.fas ...
  ;; Loaded file /home/edi/lisp/cl-ppcre/packages.fas
  ;; Loading file /home/edi/lisp/cl-ppcre/specials.fas ...
  ;; Loaded file /home/edi/lisp/cl-ppcre/specials.fas
  ;; Loading file /home/edi/lisp/cl-ppcre/util.fas ...
  ;; Loaded file /home/edi/lisp/cl-ppcre/util.fas
  ;; Loading file /home/edi/lisp/cl-ppcre/errors.fas ...
  ;; Loaded file /home/edi/lisp/cl-ppcre/errors.fas
  ;; Loading file /home/edi/lisp/cl-ppcre/lexer.fas ...
  ;; Loaded file /home/edi/lisp/cl-ppcre/lexer.fas
  ;; Loading file /home/edi/lisp/cl-ppcre/parser.fas ...
  ;; Loaded file /home/edi/lisp/cl-ppcre/parser.fas
  ;; Loading file /home/edi/lisp/cl-ppcre/regex-class.fas ...
  ;; Loaded file /home/edi/lisp/cl-ppcre/regex-class.fas
  ;; Loading file /home/edi/lisp/cl-ppcre/convert.fas ...
  ;; Loaded file /home/edi/lisp/cl-ppcre/convert.fas
  ;; Loading file /home/edi/lisp/cl-ppcre/optimize.fas ...
  ;; Loaded file /home/edi/lisp/cl-ppcre/optimize.fas
  ;; Loading file /home/edi/lisp/cl-ppcre/closures.fas ...
  ;; Loaded file /home/edi/lisp/cl-ppcre/closures.fas
  ;; Loading file /home/edi/lisp/cl-ppcre/repetition-closures.fas ...
  ;; Loaded file /home/edi/lisp/cl-ppcre/repetition-closures.fas
  ;; Loading file /home/edi/lisp/cl-ppcre/scanner.fas ...
  ;; Loaded file /home/edi/lisp/cl-ppcre/scanner.fas
  ;; Loading file /home/edi/lisp/cl-ppcre/api.fas ...
  ;; Loaded file /home/edi/lisp/cl-ppcre/api.fas
  ;; Loading file /home/edi/lisp/cl-ppcre/ppcre-tests.fas ...
  ;; Loaded file /home/edi/lisp/cl-ppcre/ppcre-tests.fas
  ;; Loaded file load.lisp
  T
  [3]> (load (compile-file "bench2.lisp"))
  Compiling file /home/edi/lisp/cl-ppcre/bench2.lisp ...
  Wrote file /home/edi/lisp/cl-ppcre/bench2.fas
  0 errors, 0 warnings
  ;; Loading file /home/edi/lisp/cl-ppcre/bench2.fas ...
  ;; Loaded file /home/edi/lisp/cl-ppcre/bench2.fas
  T
  [4]> (test)
  CL-PPCRE wins by a factor of 2.8
  CL-PPCRE wins by a factor of 44.0
  PCRE wins by a factor of 2.4
  PCRE wins by a factor of 1.3
  PCRE wins by a factor of 1.3
  CL-PPCRE wins by a factor of 3.0
  PCRE wins by a factor of 12.0
  PCRE wins by a factor of 22.3
  PCRE wins by a factor of 14.2
  PCRE wins by a factor of 23.8
  PCRE wins by a factor of 2.5
  PCRE wins by a factor of 1.8
  PCRE wins by a factor of 7.2
  PCRE wins by a factor of 5.4
  PCRE wins by a factor of 12.3
  PCRE wins by a factor of 18.9
  PCRE wins by a factor of 2.3
  PCRE wins by a factor of 1.2
  PCRE wins by a factor of 15.8
  PCRE wins by a factor of 24.9
  PCRE wins by a factor of 13.5
  CL-PPCRE wins by a factor of 1.6
  PCRE wins by a factor of 12.0
  PCRE wins by a factor of 8.9
  PCRE wins by a factor of 12.5
  PCRE wins by a factor of 11.0
  PCRE wins by a factor of 1.4
  CL-PPCRE wins by a factor of 1.1
  NIL

HTH,
Edi.
|
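For readers wondering how a "wins by a factor of" line might be produced, here is a minimal sketch in portable Common Lisp. It is not taken from bench2.lisp (whose contents are not shown in this thread); the names RUN-TIME-OF and REPORT-WINNER are made up for illustration, and the only assumption is that each engine's scan can be wrapped in a function of no arguments.

```lisp
;; Hypothetical helpers, not part of CL-PPCRE or the CLISP PCRE module.

(defun run-time-of (thunk &optional (repetitions 1000))
  "Return the internal run time consumed by calling THUNK REPETITIONS times."
  (let ((start (get-internal-run-time)))
    (dotimes (i repetitions)
      (funcall thunk))
    (- (get-internal-run-time) start)))

(defun report-winner (name-a time-a name-b time-b)
  "Return a result line for two measured times (smaller time wins)."
  (multiple-value-bind (winner fast slow)
      (if (<= time-a time-b)
          (values name-a time-a time-b)
          (values name-b time-b time-a))
    (if (zerop fast)
        ;; Avoid dividing by zero when the clock resolution is too coarse.
        (format nil "~A wins (time too small to measure)" winner)
        (format nil "~A wins by a factor of ~,1F"
                winner (float (/ slow fast))))))
```

With measured times of 10 and 28 ticks, (report-winner "CL-PPCRE" 10 "PCRE" 28) yields "CL-PPCRE wins by a factor of 2.8". Very short runs are dominated by clock resolution, so a real benchmark would want a large repetition count.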
From: Sam S. <sd...@gn...> - 2004-06-10 15:13:36
|
> * Edi Weitz <rqv@ntunegn.qr> [2004-05-28 03:16:58 +0200]: > > [1]> (pcre:pcre-exec (pcre:pcre-compile "(aa)(.*)") "aaxx") this is a bug in the PCRE library. I reported it to the implementor. A workaround is to use malloc() instead of alloca(), patch appended. > [1]> (defparameter *xxxx* (make-string 10000 :initial-element #\x)) > *XXXX* > [2]> (pcre:pcre-exec (pcre:pcre-compile "(.)*" :dotall t) *xxxx*) this is a bug in the PCRE library. I reported it to the implementor. I do not know of a workaround. > [4]> (test) > CL-PPCRE wins by a factor of 2.8 > CL-PPCRE wins by a factor of 44.0 > PCRE wins by a factor of 2.4 > PCRE wins by a factor of 1.3 > PCRE wins by a factor of 1.3 > CL-PPCRE wins by a factor of 3.0 > PCRE wins by a factor of 12.0 > PCRE wins by a factor of 22.3 > PCRE wins by a factor of 14.2 > PCRE wins by a factor of 23.8 > PCRE wins by a factor of 2.5 > PCRE wins by a factor of 1.8 > PCRE wins by a factor of 7.2 > PCRE wins by a factor of 5.4 > PCRE wins by a factor of 12.3 > PCRE wins by a factor of 18.9 > PCRE wins by a factor of 2.3 > PCRE wins by a factor of 1.2 > PCRE wins by a factor of 15.8 > PCRE wins by a factor of 24.9 > PCRE wins by a factor of 13.5 > CL-PPCRE wins by a factor of 1.6 > PCRE wins by a factor of 12.0 > PCRE wins by a factor of 8.9 > PCRE wins by a factor of 12.5 > PCRE wins by a factor of 11.0 > PCRE wins by a factor of 1.4 > CL-PPCRE wins by a factor of 1.1 did you try passing ":study t" to pcre-compile? -- Sam Steingold (http://www.podval.org/~sds) running w2k <http://www.camera.org> <http://www.iris.org.il> <http://www.memri.org/> <http://www.mideasttruth.com/> <http://www.honestreporting.com> My other CAR is a CDR. 
--- cpcre.c	13 May 2004 22:41:11 -0400	1.16
+++ cpcre.c	10 Jun 2004 11:00:55 -0400
@@ -339,7 +339,7 @@
   end_system_call();
   if (ret < 0) pcre_error(ret);
   ovector_size = 3 * (capture_count + 1);
-  ovector = alloca(ovector_size);
+  ovector = malloc(ovector_size);
   with_string_0(check_string(STACK_0),Symbol_value(S(utf_8)),subject, {
       begin_system_call();
       /* subject_bytelen is the length of subject in bytes,
@@ -364,6 +364,7 @@
       VALUES1(popSTACK());
     }
   } else pcre_error(ret);
+  free(ovector);
   skipSTACK(2); /* drop pattern & subject */
 }
|
From: Edi W. <ed...@ag...> - 2004-06-10 15:23:04
|
On Thu, 10 Jun 2004 11:13:14 -0400, Sam Steingold <sd...@gn...> wrote: >> * Edi Weitz <rqv@ntunegn.qr> [2004-05-28 03:16:58 +0200]: ^--------- That's not my email address... :) >> [1]> (pcre:pcre-exec (pcre:pcre-compile "(aa)(.*)") "aaxx") > > this is a bug in the PCRE library. I reported it to the > implementor. A workaround is to use malloc() instead of alloca(), > patch appended. > >> [1]> (defparameter *xxxx* (make-string 10000 :initial-element #\x)) >> *XXXX* >> [2]> (pcre:pcre-exec (pcre:pcre-compile "(.)*" :dotall t) *xxxx*) > > this is a bug in the PCRE library. I reported it to the > implementor. I do not know of a workaround. OK, good. I seem to be rather efficient at finding bugs in regex implementations... :) >> [4]> (test) >> CL-PPCRE wins by a factor of 2.8 >> CL-PPCRE wins by a factor of 44.0 >> PCRE wins by a factor of 2.4 >> PCRE wins by a factor of 1.3 >> PCRE wins by a factor of 1.3 >> CL-PPCRE wins by a factor of 3.0 >> PCRE wins by a factor of 12.0 >> PCRE wins by a factor of 22.3 >> PCRE wins by a factor of 14.2 >> PCRE wins by a factor of 23.8 >> PCRE wins by a factor of 2.5 >> PCRE wins by a factor of 1.8 >> PCRE wins by a factor of 7.2 >> PCRE wins by a factor of 5.4 >> PCRE wins by a factor of 12.3 >> PCRE wins by a factor of 18.9 >> PCRE wins by a factor of 2.3 >> PCRE wins by a factor of 1.2 >> PCRE wins by a factor of 15.8 >> PCRE wins by a factor of 24.9 >> PCRE wins by a factor of 13.5 >> CL-PPCRE wins by a factor of 1.6 >> PCRE wins by a factor of 12.0 >> PCRE wins by a factor of 8.9 >> PCRE wins by a factor of 12.5 >> PCRE wins by a factor of 11.0 >> PCRE wins by a factor of 1.4 >> CL-PPCRE wins by a factor of 1.1 > > did you try passing ":study t" to pcre-compile? Nope, I didn't know about that one. Just checked the PCRE man page and there it says: "At present, studying a pattern is useful only for non-anchored patterns that do not have a single fixed starting character. A bitmap of possible starting characters is created." 
So, yes, this might help PCRE in a few cases but that was not my point, the results above already show that PCRE is usually faster than CL-PPCRE. My point was that a regex engine written in Lisp will be fast enough in almost all cases you can imagine and it's portable between different target platforms by definition. IIRC this thread started with a discussion of problems of the current approach (using a C library). These issues would be moot with a Lisp library. And, as I said, it'd be good for Lisp marketing-wise if it didn't have to rely on other programming languages to do the "real work." Just my EUR 0.02, Edi. |
From: Douglas P. <dg...@ma...> - 2004-06-10 15:32:43
|
Edi Weitz indited:
> So, yes, this might help PCRE in a few cases but that was not my
> point, the results above already show that PCRE is usually faster than
> CL-PPCRE. My point was that a regex engine written in Lisp will be
> fast enough in almost all cases you can imagine and it's portable
> between different target platforms by definition.
>
> IIRC this thread started with a discussion of problems of the current
> approach (using a C library). These issues would be moot with a Lisp
> library. And, as I said, it'd be good for Lisp marketing-wise if it
> didn't have to rely on other programming languages to do the "real
> work."

As a way of working out the bugs of C-library<->CLISP interactions, it
seems to have hit on some. As for the issues of regex, I fully agree
with Edi: unless there is some major performance gain (which there
isn't here), or some really obscure compatibility issues, I don't see
the point of any further effort in a non-Lisp regex.

Just my buck-two-fitty,
<D\'gou
|
From: Chris H. <ha...@ve...> - 2004-05-27 07:21:29
|
On Wed, May 26, 2004 at 10:51:26PM -0400, Sam Steingold wrote:
> > (BTW, how do I access the second match? Lisp newbie here, I'm afraid.
> > I've tried all sort of things, but can't seem to figure it out. Use
> > 'multiple-value-something' perhaps?)
>
> yes, MULTIPLE-VALUE-BIND or MULTIPLE-VALUE-LIST
>

Woot! MULTIPLE-VALUE-LIST works for me. I had looked at
MULTIPLE-VALUE-BIND, but according to CLHS if one doesn't know the
proper number of values ahead of time, extra ones get discarded. I
couldn't figure out how to get the count, so I stopped looking there.

I really must get into the habit of poking around in CLHS a bit more -
esp. since I have a local copy - I might have found this for myself.

> 1. you want to use EQ to check for EOF (this is actually a bug in your code!)
>
> 2. also, you want to pass the stream itself as the 3rd argument to
>    READ-LINE (this is a standard idiom).
>
> 3. if you want to read the whole file into one string, you should do it
>    like this:
>
> (defun read-whole-file (name)
>   (with-open-file (s name)
>     (let ((ret (make-string (file-length s))))
>       (read-sequence ret s)
>       ret)))
>
> your method is extremely inefficient.
>

With the emphasis on *extremely* - yikes! (Experienced lispers might
want to avoid looking at this - too painful! I sheepishly wonder if I
could possibly have made it even slower. :-})

My naive newbie version:

[133]> (time (setf xml (get-file "srchsec-xmple.xml")))
Real time: 0.644871 sec.
Run time: 0.42 sec.
Space: 4916144 Bytes
GC: 10, GC time: 0.32 sec.

Sam's *much, much, much* better version:

[134]> (time (setf xml (read-whole-file "srchsec-xmple.xml")))
Real time: 0.015718 sec.
Run time: 0.02 sec.
Space: 48392 Bytes

A million thanks, Sam - very gracious and kind of you to share your
experience and to set me straight. I'm sure that I have a *lot* of lisp
idioms such as this to learn - I guess the best place to get started is
looking at other people's code and experimenting with my own. And CLHS,
I suspect.

> --
> Sam Steingold (http://www.podval.org/~sds) running w2k
> <http://www.camera.org> <http://www.iris.org.il> <http://www.memri.org/>
> <http://www.mideasttruth.com/> <http://www.honestreporting.com>
> Do not tell me what to do and I will not tell you where to go.

(Another great sig, Sam)

Aloha,
+Chris
__
No single drop of water thinks it is responsible for the flood.
-- Old adage
|
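The two idioms discussed in this message can be sketched in portable Common Lisp as follows. This is only an illustration: FLOOR stands in for any function that returns several values (such as a regexp matcher), and the function names ALL-VALUES-DEMO, BOUND-VALUES-DEMO and COUNT-LINES are made up for the example.

```lisp
;; FLOOR returns two values (quotient and remainder); it is used here only
;; as a convenient standard function with multiple return values.

(defun all-values-demo ()
  "Collect every returned value into a list, however many there are."
  (multiple-value-list (floor 7 2)))

(defun bound-values-demo ()
  "Bind a fixed set of variables: missing values become NIL,
surplus values are silently dropped."
  (multiple-value-bind (q r extra) (floor 7 2)
    (list q r extra)))

(defun count-lines (stream)
  "Count the lines in STREAM.
The stream itself is passed as the EOF value, so no line read from the
stream can ever be EQ to it."
  (loop for line = (read-line stream nil stream)
        until (eq line stream)
        count t))
```

Here (all-values-demo) returns (3 1) and (bound-values-demo) returns (3 1 NIL). Taking the LENGTH of the list built by MULTIPLE-VALUE-LIST is one way to find out how many values came back.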
From: Arseny S. <am...@ic...> - 2004-05-27 07:40:26
|
> I really must get into the habit of poking around in CLHS a bit more -
> esp. since I have a local copy - I might have found this for myself.

And don't forget about the comp.lang.lisp newsgroup!

--
Best regards,
Arseny mailto:am...@ic...
|
From: Bruno H. <br...@cl...> - 2004-05-27 12:29:59
|
Sam wrote:
> I am not sure what you mean here.
> At any rate, the regexp module comes with a 10 y.o. implementation
> whose origin is unclear.

The origin is simple: The code is the GNU regex 0.12 that was the common
regular expression implementation in GNU programs for years. It's only
8-bit, though. The documentation has a part copied from GNU ed, a part
from GNU emacs with modifications, and the Lisp interface description
that I wrote.

> (One big problem is unicode).

Yes, this is the big problem.

> I would prefer either finding a unicode regexp and ignoring the OS

Yes. Did you try http://sourceforge.net/projects/ustring/ ?

> or dumping the bundled regexp and requiring OS to offer one.

This doesn't help, because the glibc regex is Unicode capable only
within a UTF-8 locale.

Bruno
|
From: Sam S. <sd...@gn...> - 2004-05-27 15:17:32
|
> * Bruno Haible <oe...@py...t> [2004-05-27 14:21:08 +0200]: > >> I would prefer either finding a unicode regexp and ignoring the OS > Yes. Did you try http://sourceforge.net/projects/ustring/ ? project is dead - nothing has been done in 4 years. >> or dumping the bundled regexp and requiring OS to offer one. > > This doesn't help, because the glibc regex is Unicode capable only > within an UTF-8 locale. indeed. -- Sam Steingold (http://www.podval.org/~sds) running w2k <http://www.camera.org> <http://www.iris.org.il> <http://www.memri.org/> <http://www.mideasttruth.com/> <http://www.honestreporting.com> Beauty is only a light switch away. |
From: Sam S. <sd...@gn...> - 2004-05-28 01:36:31
|
Edi,

This is very suspicious.
The fact that you outperform them by a factor of 10-100 means that
either they do something very stupid or you do something real smart
or something very wrong (they cannot do something wrong because what
they do is right by definition :-).

I think you need to investigate what is going on further.
Can you describe what regexes you handle so exceptionally well?

--
Sam Steingold (http://www.podval.org/~sds) running w2k
<http://www.camera.org> <http://www.iris.org.il> <http://www.memri.org/>
<http://www.mideasttruth.com/> <http://www.honestreporting.com>
If you want it done right, you have to do it yourself
|
From: Edi W. <ed...@ag...> - 2004-05-28 02:02:22
|
On Thu, 27 May 2004 21:36:27 -0400, Sam Steingold <sd...@gn...> wrote:

> This is very suspicious.

I think it's more suspicious that both PCRE and REGEXP are able to
kill my Lisp...

> The fact that you outperform them by a factor of 10-100 means that
> either they do something very stupid or you do something real smart
> or something very wrong

I'm pretty sure that CL-PPCRE basically works right (although there
certainly are bugs like in almost any program). CL-PPCRE is used in
"The Regex Coach"[1] which has been downloaded more than 80,000 times
in the last year. That's a rather large number of testers, isn't it?

> (they cannot do something wrong because what they do is right by
> definition :-).

Who is they and why are they right by definition? Which definition?

CL-PPCRE purports to be compatible with Perl, so the defining instance
for my library is Perl. Unfortunately, Perl itself has a couple of
bugs: <http://weitz.de/cl-ppcre/#bug>. Or try this:

  edi@bird:~$ perl -v

  This is perl, v5.8.3 built for i386-linux-thread-multi

  Copyright 1987-2003, Larry Wall

  Perl may be copied only under the terms of either the Artistic License or the
  GNU General Public License, which may be found in the Perl 5 source kit.

  Complete documentation for Perl, including FAQ lists, should be found on
  this system using `man perl' or `perldoc perl'. If you have access to the
  Internet, point your browser at http://www.perl.com/, the Perl Home Page.

  edi@bird:~$ perl -lwe '$_="a" x 3000 . "b"; /(a|)*/; print $1'

  edi@bird:~$ perl -lwe '$_="a" x 30000 . "b"; /(a|)*/; print $1'
  Segmentation fault

> I think you need to investigate what is going on further.

I don't think so.

> Can you describe what regexes you handle so exceptionally well?

CL-PPCRE makes a couple of semi-smart decisions:

1. It can optimize the scan progress for anchored regular expressions
   and for regular expressions which basically say "don't test, just
   skip", like /.*/s. I think Perl and PCRE do similar things. Their C
   code is too complicated for me to understand it... :)

2. It employs Boyer-Moore-Horspool searching for constant parts of
   the regex. Again, I think Perl does the same. Don't know about
   PCRE.

3. It rewrites certain classes of regular expressions to avoid
   unnecessary capturing of register groups, something like this:

     (a)* -> (?:a*(a))?

   I think this is the biggest win compared to "naïve" engines like
   REGEXP. (Don't know if Perl does that. From the speed they're able
   to achieve I /guess/ they do something similar.)

Cheers,
Edi.

[1] <http://weitz.de/regex-coach/>
|
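As an illustration of point 2 above, here is a minimal sketch of Boyer-Moore-Horspool substring search in portable Common Lisp. It is not CL-PPCRE's actual code; the names MAKE-SKIP-TABLE and BMH-SEARCH are made up for this example, and a hash table is used for the skip table purely for brevity.

```lisp
;; Hypothetical helper names; this is a textbook Horspool search, not the
;; implementation used inside CL-PPCRE.

(defun make-skip-table (pattern)
  "Map each character of PATTERN except the last to its distance from
the end of PATTERN (later occurrences overwrite earlier ones)."
  (let ((table (make-hash-table))
        (last-index (1- (length pattern))))
    (dotimes (i last-index table)
      (setf (gethash (char pattern i) table) (- last-index i)))))

(defun bmh-search (pattern text)
  "Return the index of the first occurrence of PATTERN in TEXT, or NIL."
  (let ((m (length pattern))
        (n (length text)))
    (if (zerop m)
        0
        (let ((skip (make-skip-table pattern)))
          (loop with pos = 0
                while (<= (+ pos m) n)
                do (if (string= pattern text :start2 pos :end2 (+ pos m))
                       (return pos)
                       ;; Shift by the table entry for the text character
                       ;; aligned with the last pattern position; characters
                       ;; not in the table allow a shift of the full length.
                       (incf pos (gethash (char text (+ pos m -1)) skip m)))
                finally (return nil))))))
```

For example, (bmh-search "issuer" "<issuerName>") returns 1. The benefit over a plain left-to-right scan is that a mismatch on a character absent from the pattern lets the search jump ahead by the whole pattern length.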
From: Sam S. <sd...@gn...> - 2004-06-10 15:22:14
|
> * Edi Weitz <rqv@ntunegn.qr> [2004-05-28 01:38:15 +0200]:
>
> 5. If you get an error you get an error Lisp can handle. Disasters
>    like this one can't happen:
>
>    [1]> (regexp:regexp-compile "(a|(bc)){0,0}?xyz" :extended t)
>
>    *** - handle_fault error2 ! address = 0x14 not in [0x20248000,0x203d5a30) !
>    SIGSEGV cannot be cured. Fault address = 0x14.
>    Segmentation fault

[1]> (regexp:regexp-compile "(a|(bc)){0,0}?xyz" :extended t)

*** - REGEXP:REGEXP-COMPILE ("(a|(bc)){0,0}?xyz"): "repetition-operator operand invalid"
The following restarts are available:
USE-VALUE      :R1      You may input a value to be used instead.
ABORT          :R2      ABORT

Break 1 [2]>

--
Sam Steingold (http://www.podval.org/~sds) running w2k
<http://www.camera.org> <http://www.iris.org.il> <http://www.memri.org/>
<http://www.mideasttruth.com/> <http://www.honestreporting.com>
Incorrect time syncronization.
|
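The practical difference between the two outcomes quoted above is that a Lisp error can be caught and handled by the caller, while a segmentation fault cannot. A minimal sketch of that, in portable Common Lisp, might look like the following. TOY-COMPILE is a hypothetical stand-in that only checks parenthesis balance; it is not the REGEXP or CL-PPCRE API, and all names here are invented for illustration.

```lisp
;; Hypothetical condition type and functions, for illustration only.

(define-condition toy-regex-error (error)
  ((message :initarg :message :reader toy-regex-error-message))
  (:report (lambda (c stream)
             (write-string (toy-regex-error-message c) stream))))

(defun toy-compile (pattern)
  "Stand-in for a regex compiler: signal an error on unbalanced parens,
otherwise return PATTERN unchanged."
  (let ((depth 0))
    (loop for ch across pattern
          do (case ch
               (#\( (incf depth))
               (#\) (decf depth)))
             (when (minusp depth)
               (error 'toy-regex-error :message "unbalanced )")))
    (unless (zerop depth)
      (error 'toy-regex-error :message "unbalanced ("))
    pattern))

(defun try-compile (pattern)
  "Return the compiled pattern, or NIL plus the error text as a second value."
  (handler-case (toy-compile pattern)
    (toy-regex-error (c)
      (values nil (princ-to-string c)))))
```

Calling (try-compile "(a") returns NIL and "unbalanced (" instead of dropping into the debugger, and the Lisp image keeps running either way.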
From: Edi W. <ed...@ag...> - 2004-06-10 15:24:41
|
On Thu, 10 Jun 2004 11:21:58 -0400, Sam Steingold <sd...@gn...> wrote: >> * Edi Weitz <rqv@ntunegn.qr> [2004-05-28 01:38:15 +0200]: >> >> 5. If you get an error you get an error Lisp can handle. Disasters >> like this one can't happen: >> >> [1]> (regexp:regexp-compile "(a|(bc)){0,0}?xyz" :extended t) >> >> *** - handle_fault error2 ! address = 0x14 not in [0x20248000,0x203d5a30) ! >> SIGSEGV cannot be cured. Fault address = 0x14. >> Segmentation fault > > [1]> (regexp:regexp-compile "(a|(bc)){0,0}?xyz" :extended t) > > *** - REGEXP:REGEXP-COMPILE ("(a|(bc)){0,0}?xyz"): "repetition-operator > operand invalid" > The following restarts are available: > USE-VALUE :R1 You may input a value to be used instead. > ABORT :R2 ABORT > > Break 1 [2]> With the same CLISP version that I used? |