From: skaller <sk...@us...> - 2004-09-10 16:42:13
|
On Sat, 2004-09-11 at 01:02, Jesse Guardiani wrote: > skaller wrote: > > > I'm playing with my regexp package and I'd like > > to discuss architecture. > > [...] > > > The goal is a simple, powerful, in Caml regular matching > > facility, written entirely in Ocaml, and obsoleting Str, PCRE, > > and Ocamllex. > > > 3.) Do you have some code we can play with? I'd be able to comment > on the interface if I could compile something... Attached. dfa.ml and dfa.mli contain the basic engine. test_dfa.ml is a sample test routine, unfinished. The engine isn't necessarily the most efficient code, but it seems to work. No attempt is made to minimise the number of states. This engine only accepts one type of character: int. There are no other intrinsic limitations, however some of the data structures might get big. For large character sets, like Unicode, the engine as it stands will probably use all your memory and take too long to be practical. To fix this problem, we need character classes: typically, many characters are equivalent. This is done by a function classify : int -> int which maps a character to a class index which is then fed to the driver. The regexps also need to be built in terms of class indices. One way may be to add a regexp term `REGEXP_in_range (a,b) however, this does require some fiddling. EG if you have a range A-Z and a keyword 'IS' then actually you have 5 classes. It may be useful to change the character type from int to int * 'b where 'b is the actual character, and int is its class. The automaton only uses the class. However if you're processing a stream, and want to extract groups, you want an array of 'b, not an array of class indicies. Similarly condider a regexp for a decimal integer: {`Start} (9 {`Digit}) * where `Start sets a variable to 0, and `Digit multiples by 10 and adds the digit minus '0', thus computing the numeric value as an attribute for a some token. Clearly a class of 0-9 won't do even if 0-9 are equivalent, we need the actual character for the computation. -- John Skaller, mailto:sk...@us... voice: 061-2-9660-0850, snail: PO BOX 401 Glebe NSW 2037 Australia Checkout the Felix programming language http://felix.sf.net |