Thread: [Ocaml-lib-devel] regeps

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 454-5900

I'm playing with my regexp package and I'd like
to discuss architecture.

(1) ENGINE

The engine accepts a regular expression of the following type:

type 'a regexp_t =
  [
  | `REGEXP_char of int
  | `REGEXP_seq of 'a regexp_t * 'a regexp_t (** concatenation *)
  | `REGEXP_alt of 'a regexp_t * 'a regexp_t (** alternation *)
  | `REGEXP_aster of 'a regexp_t (** Kleene closure *)
  | `REGEXP_epsilon (** epsilon: null string *)
  | `REGEXP_code of 'a (** associated code *)
  ]

You will note a 'character' is a fixed type, namely int,
and that there is a user determined 'code' constructor
whose purpose will be explained below.

It returns

(a) The alphabet, being a Set of int denoting
each character which is recognized.

(b) A count n of the number of states, which are
numbered 0 to n-1.

(c) A partial mapping from states to codes,
represented by a hashtable, explained below.

(d) A partial mapping from character * state
pairs to states, represented by a hashtable,
called the transition matrix.

Here is the interface:

val compile_regexp : 
  'a regexp_t ->
  CharSet.t * (* alphabet *)
  int *  (* state count *)
  (int, 'a) Hashtbl.t *  (* terminal codes *)
  (CharSet.elt * int, int) Hashtbl.t (* transition matrix *)

If the code mapping fails, there is no code for that state.
If the matrix lookup fails, the transition is to 
the 'error' state.

Now for the fun part :)

Most regexp systems include some weird
notion of 'groups'.

The engine above is not only much simpler,
it is also better.

Basically, a 'code' is a marked epsilon transition.
By tacking codes after a regular expression,
the terminal state for that expression is marked
with an 'accepting' code.

Whilst driving the matrix with a matcher, lexer,
or other driver loop, you can check whether you have
encountered a marked state, and if so, which one.

Consider:

\(1*\)-\(2*\)

We will convert the groups to plain marks:

{`m 1}1*{`m 2}-{`m 3}2{`m 4}

and create an int option array of length 4 initialised
to None. Whenever we get a marked state m1-m4 we
store the string index into the array in the right
slot.

When we're finished .. hey presto, we have our groups.
To implement this we could also use the code type:

type code_t = [`Left of int | `Right of int]

Another use of marks is for classification:

1*{`Ones}|2*{`Twos}

When we get a match, we also know which alternative
matched. Of course you can do this with groups too.

The big difference between marks and groups is that
the marks are an arbitrary (comparable) user defined type, and,
what you do when you encounter one is up to you.

Note groups and marks are equivalent -- regexp engines implement
groups with marks, and marks can always be replaced by groups.
However groups use a positional notation (counting brackets),
whereas marks actually return an arbitrary code.

Using a positional group encoding like Cameleon:

	string option array option

isn't nearly as good as marks: in a production lexer
with 50 tokens you'd have to search for Some string,
and then translate the array index into a token (if you can
remember that position 45 is the keyword 'match' for
example .. :)

In particular this regexp engine *automagically* supports
'in Caml' tokenisers which are are actually more expressive
than Ocamllex. In my test engine I have written this:

let digit = regexp_of_charset "0123456789"
let lower = regexp_of_charset "abcdefghijklmnopqrstuvwxyz"
let upper = regexp_of_charset "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
let letter = alt upper lower
let underscore = regexp_of_char '_'
let idstart = alt letter underscore
let idrest = alt idstart digit
let space = regexp_of_char ' '
let white = many space
let ident = seq idstart (aster idrest)
let num = many digit
let code x = `REGEXP_code x

let tk_ident = seq ident (code `Ident)
let tk_num = seq num (code `Num)
let tk_white = seq white (code `White)

let re = alt tk_ident (alt tk_num tk_white)
let res = many re

I'm using obvious named functions for combinators 

  seq a b = `REGEXP_seq (a,b)

etc, you could also define infix operators for alt and seq.
Of course you can also do: fold_left seq epsilon [a;b;c;d] etc.
Note this isn't Ocamllex code -- its pure Ocaml!

Although the engine creates a Hashtbl, for which lookup
isn't efficient, you can always transform it into
an array and pay the extra space cost.

Also, if you want to avoid the cost of compiling a regexp,
you can write a pure ocaml program which does the conversion,
and either prints out the matrix as Ocaml code, or simply
Marshall it to disk.

(2) DRIVERS

To actually use the engine, we need some drivers such as
'match' and 'lex', and some convenient mark processing
utilities.

(3) PARSERS

We actually use Ocaml as the parser. We can 'sugar' the syntax
with any kind of function. For example we can create a 
type of regexp with groups instead of marks,
and just translate the groups into appropriate marks.

In order to provide Posix and Emacs looking regexps, we actually
write a parser function that produces a regexp.

There is a lot more work to produce usable sugar for
this engine -- eg actually implement groups. I don't yet
have everything because I'm not yet sure exactly what
is needed. 

The goal is a simple, powerful, in Caml regular matching
facility, written entirely in Ocaml, and obsoleting Str, PCRE,
and Ocamllex.

Any comments appreciated.

-- 
John Skaller, mailto:sk...@us...
voice: 061-2-9660-0850, 
snail: PO BOX 401 Glebe NSW 2037 Australia
Checkout the Felix programming language http://felix.sf.net

Thread: [Ocaml-lib-devel] regeps

ocaml-lib-devel