From: skaller <sk...@us...> - 2004-09-09 20:29:39
|
I'm playing with my regexp package and I'd like to discuss architecture. (1) ENGINE The engine accepts a regular expression of the following type: type 'a regexp_t = [ | `REGEXP_char of int | `REGEXP_seq of 'a regexp_t * 'a regexp_t (** concatenation *) | `REGEXP_alt of 'a regexp_t * 'a regexp_t (** alternation *) | `REGEXP_aster of 'a regexp_t (** Kleene closure *) | `REGEXP_epsilon (** epsilon: null string *) | `REGEXP_code of 'a (** associated code *) ] You will note a 'character' is a fixed type, namely int, and that there is a user determined 'code' constructor whose purpose will be explained below. It returns (a) The alphabet, being a Set of int denoting each character which is recognized. (b) A count n of the number of states, which are numbered 0 to n-1. (c) A partial mapping from states to codes, represented by a hashtable, explained below. (d) A partial mapping from character * state pairs to states, represented by a hashtable, called the transition matrix. Here is the interface: val compile_regexp : 'a regexp_t -> CharSet.t * (* alphabet *) int * (* state count *) (int, 'a) Hashtbl.t * (* terminal codes *) (CharSet.elt * int, int) Hashtbl.t (* transition matrix *) If the code mapping fails, there is no code for that state. If the matrix lookup fails, the transition is to the 'error' state. Now for the fun part :) Most regexp systems include some weird notion of 'groups'. The engine above is not only much simpler, it is also better. Basically, a 'code' is a marked epsilon transition. By tacking codes after a regular expression, the terminal state for that expression is marked with an 'accepting' code. Whilst driving the matrix with a matcher, lexer, or other driver loop, you can check whether you have encountered a marked state, and if so, which one. Consider: \(1*\)-\(2*\) We will convert the groups to plain marks: {`m 1}1*{`m 2}-{`m 3}2{`m 4} and create an int option array of length 4 initialised to None. Whenever we get a marked state m1-m4 we store the string index into the array in the right slot. When we're finished .. hey presto, we have our groups. To implement this we could also use the code type: type code_t = [`Left of int | `Right of int] Another use of marks is for classification: 1*{`Ones}|2*{`Twos} When we get a match, we also know which alternative matched. Of course you can do this with groups too. The big difference between marks and groups is that the marks are an arbitrary (comparable) user defined type, and, what you do when you encounter one is up to you. Note groups and marks are equivalent -- regexp engines implement groups with marks, and marks can always be replaced by groups. However groups use a positional notation (counting brackets), whereas marks actually return an arbitrary code. Using a positional group encoding like Cameleon: string option array option isn't nearly as good as marks: in a production lexer with 50 tokens you'd have to search for Some string, and then translate the array index into a token (if you can remember that position 45 is the keyword 'match' for example .. :) In particular this regexp engine *automagically* supports 'in Caml' tokenisers which are are actually more expressive than Ocamllex. In my test engine I have written this: let digit = regexp_of_charset "0123456789" let lower = regexp_of_charset "abcdefghijklmnopqrstuvwxyz" let upper = regexp_of_charset "ABCDEFGHIJKLMNOPQRSTUVWXYZ" let letter = alt upper lower let underscore = regexp_of_char '_' let idstart = alt letter underscore let idrest = alt idstart digit let space = regexp_of_char ' ' let white = many space let ident = seq idstart (aster idrest) let num = many digit let code x = `REGEXP_code x let tk_ident = seq ident (code `Ident) let tk_num = seq num (code `Num) let tk_white = seq white (code `White) let re = alt tk_ident (alt tk_num tk_white) let res = many re I'm using obvious named functions for combinators seq a b = `REGEXP_seq (a,b) etc, you could also define infix operators for alt and seq. Of course you can also do: fold_left seq epsilon [a;b;c;d] etc. Note this isn't Ocamllex code -- its pure Ocaml! Although the engine creates a Hashtbl, for which lookup isn't efficient, you can always transform it into an array and pay the extra space cost. Also, if you want to avoid the cost of compiling a regexp, you can write a pure ocaml program which does the conversion, and either prints out the matrix as Ocaml code, or simply Marshall it to disk. (2) DRIVERS To actually use the engine, we need some drivers such as 'match' and 'lex', and some convenient mark processing utilities. (3) PARSERS We actually use Ocaml as the parser. We can 'sugar' the syntax with any kind of function. For example we can create a type of regexp with groups instead of marks, and just translate the groups into appropriate marks. In order to provide Posix and Emacs looking regexps, we actually write a parser function that produces a regexp. There is a lot more work to produce usable sugar for this engine -- eg actually implement groups. I don't yet have everything because I'm not yet sure exactly what is needed. The goal is a simple, powerful, in Caml regular matching facility, written entirely in Ocaml, and obsoleting Str, PCRE, and Ocamllex. Any comments appreciated. -- John Skaller, mailto:sk...@us... voice: 061-2-9660-0850, snail: PO BOX 401 Glebe NSW 2037 Australia Checkout the Felix programming language http://felix.sf.net |
From: Jesse G. <je...@wi...> - 2004-09-10 15:03:11
|
skaller wrote: > I'm playing with my regexp package and I'd like > to discuss architecture. [...] > The goal is a simple, powerful, in Caml regular matching > facility, written entirely in Ocaml, and obsoleting Str, PCRE, > and Ocamllex. > > Any comments appreciated. 1.) I DEFINATELY think O'Caml needs a 100% pure O'Caml regex module. I've wished for this a number of times myself, and if we got it then we could remove the dependency on PCRE for things like Dbi's postgresql driver. 2.) I know next to nothing about regex implementations. I am simply a user. 3.) Do you have some code we can play with? I'd be able to comment on the interface if I could compile something... -- Jesse Guardiani, Systems Administrator WingNET Internet Services, P.O. Box 2605 // Cleveland, TN 37320-2605 423-559-LINK (v) 423-559-5145 (f) http://www.wingnet.net |
From: skaller <sk...@us...> - 2004-09-10 16:42:13
|
On Sat, 2004-09-11 at 01:02, Jesse Guardiani wrote: > skaller wrote: > > > I'm playing with my regexp package and I'd like > > to discuss architecture. > > [...] > > > The goal is a simple, powerful, in Caml regular matching > > facility, written entirely in Ocaml, and obsoleting Str, PCRE, > > and Ocamllex. > > > 3.) Do you have some code we can play with? I'd be able to comment > on the interface if I could compile something... Attached. dfa.ml and dfa.mli contain the basic engine. test_dfa.ml is a sample test routine, unfinished. The engine isn't necessarily the most efficient code, but it seems to work. No attempt is made to minimise the number of states. This engine only accepts one type of character: int. There are no other intrinsic limitations, however some of the data structures might get big. For large character sets, like Unicode, the engine as it stands will probably use all your memory and take too long to be practical. To fix this problem, we need character classes: typically, many characters are equivalent. This is done by a function classify : int -> int which maps a character to a class index which is then fed to the driver. The regexps also need to be built in terms of class indices. One way may be to add a regexp term `REGEXP_in_range (a,b) however, this does require some fiddling. EG if you have a range A-Z and a keyword 'IS' then actually you have 5 classes. It may be useful to change the character type from int to int * 'b where 'b is the actual character, and int is its class. The automaton only uses the class. However if you're processing a stream, and want to extract groups, you want an array of 'b, not an array of class indicies. Similarly condider a regexp for a decimal integer: {`Start} (9 {`Digit}) * where `Start sets a variable to 0, and `Digit multiples by 10 and adds the digit minus '0', thus computing the numeric value as an attribute for a some token. Clearly a class of 0-9 won't do even if 0-9 are equivalent, we need the actual character for the computation. -- John Skaller, mailto:sk...@us... voice: 061-2-9660-0850, snail: PO BOX 401 Glebe NSW 2037 Australia Checkout the Felix programming language http://felix.sf.net |
From: Richard J. <ri...@an...> - 2004-09-10 18:13:27
|
Not really sure I understand it from a user's point of view. For me what really matters is that I could write something like: if string =3D~ /(a+)(b*)/ then ... The regular expression (Perl-compatible, please!) ought to be compiled and checked at compile time, and the compiler should handle caching compiled expressions containing variables. Actually, the example syntax above isn't really great. I'd need some way to bind captures. The Pcre module has good and bad features. Good: It uses Perl- compatible regexps with fairly minimal fuss. Bad: All checking/ compilation is done at runtime; the syntax for accessing captures is very clunky. The camlp4 regexp module [1] looks much better, but the regexps aren't fully Perl compatible, and I'm not sure if compilation happens at compile time or runtime and how it handles caching. How will your proposal actually work (as in, how will it actually be used: can you give some example code)? Rich. [1] http://web.yl.is.s.u-tokyo.ac.jp/~oiwa/pub/caml/regexp-pp-0.9.3/README.= match-regexp --=20 Richard Jones. http://www.annexia.org/ http://www.j-london.com/ Merjis Ltd. http://www.merjis.com/ - improving website return on investment PTHRLIB is a library for writing small, efficient and fast servers in C. HTTP, CGI, DBI, lightweight threads: http://www.annexia.org/freeware/pthrli= b/ |
From: Martin J. <mar...@em...> - 2004-09-12 21:34:22
|
On Fri, 10 Sep 2004, Richard Jones wrote: > Not really sure I understand it from a user's point of view. > > For me what really matters is that I could write something like: > > if string =~ /(a+)(b*)/ then ... > > The regular expression (Perl-compatible, please!) ought to be compiled > and checked at compile time, and the compiler should handle caching > compiled expressions containing variables. > > Actually, the example syntax above isn't really great. I'd need some > way to bind captures. You can have a look at the solution I have implemented recently: http://martin.jambon.free.fr/micmatch.html The syntax is based on the syntax which is already in use in ocamllex. This is just a syntax: it currently produces bindings to the Str library, but it could use OCaml-PCRE or any other library. It is currently useful only for matching (from the first character), but it is integrated in the regular match ... with construct. I have projects (but not too much time these days) for the following: - bindings to OCaml-PCRE (including lazy quantifiers and possessive groups, and possibly assertions - all of this is not available in Str) - additional macros (REPLACE, SEARCH, and more) Good points: - early error detection (compilation-time checks) - syntax highlighting (since it uses an OCaml friendly syntax) - group binding and referencing is readable and safe - macros such as RE digit = ['0'-'9'] and RE num = digit+ can be defined and used Bad points: - complex compilation commands - it does not follow the Perl syntax (is it a real problem?) Cheers, Martin |
From: Nicolas C. <war...@fr...> - 2004-09-11 04:36:25
|
> I'm playing with my regexp package and I'd like > to discuss architecture. [...] > The goal is a simple, powerful, in Caml regular matching > facility, written entirely in Ocaml, and obsoleting Str, PCRE, > and Ocamllex. That's interesting, and I might help here. I'm currently working at designing a language that will be bootstrapped by OCaml, and I need to write a lexer and and LL(1) parser engine in OCaml for that . I do no care about groups since the regexp I'm needing in the language lexer won't be using them, but I'm interested in work about producing and optimizing the tables in realtime. The engine I'm thinking of will be available for the language programmer also so I might need groups after all :) What I like about your design is the possibility of getting compile time regexp ast checking - better than ocaml Str - as well as runtime table production - better than ocamllex . However having a too much abstract implementation for groups and marks might not be a good thing for the end user. You might provide also a string parser that converts/prints good-old regexp strings to/from ast. Nicolas |
From: skaller <sk...@us...> - 2004-09-11 01:18:17
|
On Sat, 2004-09-11 at 07:29, Nicolas Cannasse wrote: > > I'm playing with my regexp package and I'd like > > to discuss architecture. > [...] > > The goal is a simple, powerful, in Caml regular matching > > facility, written entirely in Ocaml, and obsoleting Str, PCRE, > > and Ocamllex. > > That's interesting, and I might help here. > I'm currently working at designing a language that will be bootstrapped by > OCaml, and I need to write a lexer and and LL(1) parser engine in OCaml for > that . Me too. In fact the code I posted is taken from the Felix compiler where I'm following the same development program. If we do some work generalising the engine, I'll probably paste the new code back into it. See below some comments on recursive transition machines (RTN). > The engine I'm thinking of will be available for the language programmer > also so I might need groups after all :) Groups are like tuples -- they're OK for small regexps. For more complex analysis, marks are more general. Marks allow you to match a string and return a trailing mark which indicates the class -- that is, a token. > What I like about your design is the possibility of getting compile time > regexp ast checking - better than ocaml Str - as well as runtime table > production - better than ocamllex . However having a too much abstract > implementation for groups and marks might not be a good thing for the end > user. You might provide also a string parser that converts/prints good-old > regexp strings to/from ast. It depends on the end users needs whether the combinator design for regexps is convenient. If you're an end user writing a programming language lexer it is nicer than string regexps. If you're doing a quick and dirty phone number extraction as in Bagley Shootout, a string encoding is easier to write -- hehe but not much. EVERY language that passes that test is WRONG! Even the specified correct answer is wrong. Clearly it is possible*** to write a function: perl_re_to_regexp: string -> regexp_t and of course also posix_re_to_regexp emacs_re_to_regexp ... and of course also unicode_re_to_regexp: ustring -> regexp etc etc and this *should* be done. I just haven't done it yet because there are some intermediate steps. For example the engine I showed uses arbitrary marks: groups are just a special case of that, but there needs to be an implementation supporting them. [Polymorphic variants are invaluable here!!] Another example: context. It makes no sense to check context when matching a string -- context is only meaningful in searching and lexing. You can search for the re R with H in front and T after it using a match like this: .* H\(R\)T.* which shows searching is just a special case of matching. *** There are a couple of caveats. Most string regexps can do a couple of things that aren't regular. For example match corresponding pairs. (I forget the syntax). This ability is strictly non-regular, and must NOT be supported. A convenient LL(1) parser is the correct way to do this. In fact, 'convenient' has a strong meaning. LL(1) parser engines are weak compared to a pushdown stack of regexps. In fact, a stack of regexps can parse any context free language. Its called a RTN -- recursive transition network. Python's parser is one. Basically RTN is specified using a meta-grammar such as EBNF : the RHS productions can use regular combinators (i.e you can have ? * and + as well as | and concatenation) However as well as matching tokens on the RHS, you can match a non-terminal, which recursively invokes another finite state automaton. RTN's mean the regular engine must be able to accept 'tokens' from the engine it controls as inputs, not 'characters' and of course 'tokens' must also be allowed to have attributes. The engine I posted only accepts integers as inputs -- it needs to be made polymorphic in the character type, without loss of efficiency.. -- John Skaller, mailto:sk...@us... voice: 061-2-9660-0850, snail: PO BOX 401 Glebe NSW 2037 Australia Checkout the Felix programming language http://felix.sf.net |