Thread: [Ocaml-lib-devel] Some (simple) functions I'd like to see in ExtLib ...

ocaml-lib-devel

[Ocaml-lib-devel] Some (simple) functions I'd like to see in ExtLib ...

From: Richard J. <ri...@an...> - 2004-05-27 13:16:52

Is it helpful if I produce a patch to implement these?  Or would you
prefer to write the worthy ones yourself?

** Std:

string_of_char c	Equivalent to String.make 1 c

identity x		Just returns x!

unique ()		Returns incrementing integers (has internal state)

input_file filename	Returns the contents of filename as a string

output_file filename str  Writes string str to new/replaced file filename

output_tempfile str	Creates a temporary file, containing str, and returns
			the filename

unlink_noerr filename	Removes a file, ignoring errors

** ExtChar (or perhaps better in UChar):

is_space, is_alnum, is_digit, is_xdigit, etc.  It's inexplicable why
these were left out of the standard OCaml library.

** ExtList:

List.take n xs		Return first n elements of xs

List.drop n xs		Drop first n elements of xs

List.range a b		Returns the list of integers [a, a+1, ..., b]

List.frequency ?cmp xs	Returns list [f1, x1; f2, x2; ...] where each
			f1, f2, ... is an integer recording the number of
			instances of x1, x2, ... in the original list.  My
			implementation of this assumes xs is sorted already.

List.group_by : ('a * 'b) list -> ('a * 'b list) list

(* This implementation by Isaac Trotts: *)
let group_by ls =
  let ls' =
    List.fold_left
      (fun acc (day1, x1) ->
         match acc with
             [] -> [day1, [x1]]
           | (day2, ls2) :: acctl ->
               if day1 = day2
               then (day1, x1 :: ls2) :: acctl
               else (day1, [x1]) :: acc)
      []
      ls
  in
  let ls' = List.rev ls' in
  List.map (fun (x, xs) -> x, List.rev xs) ls'


** ExtString:

String.contains_string	Returns true if a string contains a substring

A way to replace characters with strings.  I'm going to call this
String.replace, although another name might be preferable.  It should
allow me to do:

let escape_html = String.replace (function '>' -> "&gt;"
					 | '<' -> "&lt;"
					 | '&' -> "&amp;"
					 | c -> string_of_char c)

** Shared

(* A very simple implementation of weak shared strings, by Remi Vanicat *)

module StArg =
struct
  type t = string
  let equal = ( = )
  let hash = Hashtbl.hash
end
module StWeakHash = Weak.Make (StArg)

let table = StWeakHash.create 1024

let shared s =
  try
    StWeakHash.find table s
  with
      Not_found ->
        StWeakHash.add table s;
        s

** CGI functions

I have some common CGI functions which I could donate to ExtLib if
people feel that this is the right place for them.  In particular,
functions for:

* escaping HTML, URLs (see mod_caml's Cgi_escape module, documented
here: http://www.merjis.com/developers/mod_caml/html/Cgi_escape.html )

* parsing query strings

** Date and time handling functions

I have a smallish library of date and time handling functions which
I'm willing to donate if people think that would be useful.  My
library is particularly concerned with the complexity of ISO weeks
[http://en.wikipedia.org/wiki/ISO_8601#Week_-_day_format], but should
probably be extended to do other useful stuff.

** Postscript generation

Another library which I wrote and could donate creates EPS files from
simple drawing commands.

Rich.

-- 
Richard Jones. http://www.annexia.org/ http://www.j-london.com/
Merjis Ltd. http://www.merjis.com/ - improving website return on investment
MOD_CAML lets you run type-safe Objective CAML programs inside the Apache
webserver. http://www.merjis.com/developers/mod_caml/

Re: [Ocaml-lib-devel] Some (simple) functions I'd like to see in ExtLib ...

From: Nicolas C. <war...@fr...> - 2004-05-27 14:30:49

> string_of_char c Equivalent to String.make 1 c

we have String.of_char but I agree it's nice to have another function doing
that (because of others string_of_*)

> identity x Just returns x!

I'm really not convinced about this one but it already have been asked
several times, so I'll add it.

> unique () Returns incrementing integers (has internal state)

that's nice !
I will add to Std

> input_file filename Returns the contents of filename as a string

ok

> output_file filename str  Writes string str to new/replaced file filename

ok

> output_tempfile str Creates a temporary file, containing str, and returns
> the filename

I'm not sure about this one, since a "temporary file" is somehow OS-specific

> unlink_noerr filename Removes a file, ignoring errors

something like : try Sys.remove fname with _ -> () ?

> ** ExtChar (or perhaps better in UChar):
>
> is_space, is_alnum, is_digit, is_xdigit, etc.  It's inexplicable why
> these were left out of the standard OCaml library.

you're welcome to send a full featured ExtChar module.

> ** ExtList:
>
> List.take n xs Return first n elements of xs
>
> List.drop n xs Drop first n elements of xs

Ok for theses two

> List.range a b Returns the list of integers [a, a+1, ..., b]

List.init (b-a) ((+) a)
List.init is more general usage

> List.frequency ?cmp xs Returns list [f1, x1; f2, x2; ...] where each
> f1, f2, ... is an integer recording the number of
> instances of x1, x2, ... in the original list.  My
> implementation of this assumes xs is sorted already.
>
> List.group_by : ('a * 'b) list -> ('a * 'b list) list
>
> (* This implementation by Isaac Trotts: *)
> let group_by ls =
>   let ls' =
>     List.fold_left
>       (fun acc (day1, x1) ->
>          match acc with
>              [] -> [day1, [x1]]
>            | (day2, ls2) :: acctl ->
>                if day1 = day2
>                then (day1, x1 :: ls2) :: acctl
>                else (day1, [x1]) :: acc)
>       []
>       ls
>   in
>   let ls' = List.rev ls' in
>   List.map (fun (x, xs) -> x, List.rev xs) ls'

This needs of course an optional comparison function, but it's difficult to
implement efficiently (tail recursive and removing the List.rev calls).

> ** ExtString:
>
> String.contains_string Returns true if a string contains a substring

now ExtString.exists

> A way to replace characters with strings.  I'm going to call this
> String.replace, although another name might be preferable.  It should
> allow me to do:
>
> let escape_html = String.replace (function '>' -> "&gt;"
> | '<' -> "&lt;"
> | '&' -> "&amp;"
> | c -> string_of_char c)

now ExtString.replace (intesting implementation)

I'll answer others in a separate mail.
Regards,

Nicolas Cannasse

Re: [Ocaml-lib-devel] Some (simple) functions I'd like to see in ExtLib ...

From: Richard J. <ri...@an...> - 2004-05-27 14:59:51

On Thu, May 27, 2004 at 03:53:41PM +0100, Richard Jones wrote:
> > > ** ExtChar (or perhaps better in UChar):
> > >
> > > is_space, is_alnum, is_digit, is_xdigit, etc.  It's inexplicable why
> > > these were left out of the standard OCaml library.
> > 
> > you're welcome to send a full featured ExtChar module.
> 
> OK, will look at this.  Do you think it should be ExtChar or UChar
> though?  Since so much of the code I now write uses UTF-8 exclusively
> I'm loathe to contribute any more 8-bit-char-specific code to the
> world ...

Actually I can answer my own question here.  We could define the
ExtChar.is_* functions to only work correctly on 7-bit ASCII.  They
would return false on any character codes >= 128.  This way they
should do the Right Thing when presented with UTF-8 strings too.

(CC-ing this to ocaml-i18n so that people who know what they're
talking about can comment).

Rich.

-- 
Richard Jones. http://www.annexia.org/ http://www.j-london.com/
Merjis Ltd. http://www.merjis.com/ - improving website return on investment
MOD_CAML lets you run type-safe Objective CAML programs inside the Apache
webserver. http://www.merjis.com/developers/mod_caml/

Re: [Ocaml-i18n] Re: [Ocaml-lib-devel] Some (simple) functions I'd like to see in ExtLib ...

From: Yamagata Y. <yor...@mb...> - 2004-06-03 11:37:13

From: Richard Jones <ri...@an...>
Subject: [Ocaml-i18n] Re: [Ocaml-lib-devel] Some (simple) functions I'd like to see in ExtLib ...
Date: Thu, 27 May 2004 15:59:44 +0100

> On Thu, May 27, 2004 at 03:53:41PM +0100, Richard Jones wrote:
> > > > ** ExtChar (or perhaps better in UChar):
> > > >
> > > > is_space, is_alnum, is_digit, is_xdigit, etc.  It's inexplicable why
> > > > these were left out of the standard OCaml library.
> > > 
> > > you're welcome to send a full featured ExtChar module.
> > 
> > OK, will look at this.  Do you think it should be ExtChar or UChar
> > though?  Since so much of the code I now write uses UTF-8 exclusively
> > I'm loathe to contribute any more 8-bit-char-specific code to the
> > world ...
> 
> Actually I can answer my own question here.  We could define the
> ExtChar.is_* functions to only work correctly on 7-bit ASCII.  They
> would return false on any character codes >= 128.  This way they
> should do the Right Thing when presented with UTF-8 strings too.

I think the general consensus (of I18N experts) is that ISO-C
char. classes are not enough.  Unicode standard defines elaborate
character properties
(http://camomile.sourceforge.net/dochtml/UCharInfo.html).  You can
define ISO-C char. classes from these properties, though.  (glibc
actually does this).

--
Yamagata Yoriyuki

Re: [Ocaml-lib-devel] Some (simple) functions I'd like to see in ExtLib ...

From: <Wol...@un...> - 2004-05-27 15:01:50

Nicolas Cannasse wrote:

As an almost silent semi-outsider:

Wow this looks cool (my EUR0.02)

(snip)

>>List.range a b Returns the list of integers [a, a+1, ..., b]
> 
> 
> List.init (b-a) ((+) a)
> List.init is more general usage

I think (List.range a b) and/or (List.range a b increment) would be 
good, as pythonish people like me would look for a range function as the 
python idiom goes

for i in range(1,10):
	do_something(i)

Cheers,
Wolfgang

Re: [Ocaml-lib-devel] Some (simple) functions I'd like to see in ExtLib ...

From: Nicolas C. <war...@fr...> - 2004-05-27 15:06:59

> I think (List.range a b) and/or (List.range a b increment) would be 
> good, as pythonish people like me would look for a range function as the 
> python idiom goes
> 
> for i in range(1,10):
> do_something(i)

<adv>
then what you need is  Enum.range 1 10 <=> Enum.init 1 ((+)9)
since you want to iter...
</adv>

NC

Re: [Ocaml-lib-devel] Some (simple) functions I'd like to see in ExtLib ...

From: skaller <sk...@us...> - 2004-05-27 15:32:10

On Fri, 2004-05-28 at 00:29, Nicolas Cannasse wrote:

> > unique () Returns incrementing integers (has internal state)
> 
> that's nice !
> I will add to Std

I don't agree. I'm against functions with intrinsic internal state.
If I found this function reviewing the lib I'd delete it without
even bothering to ask permission.

If you want such a function, provide a constructor for it:

val make_int_generator: unit -> (unit ->int)

but leave it up to the client to instantiate.

[Think "GLOBAL STATE IS BAD BAD BAD" and repeat it in your
sleep until you can't ever forget what a sin it is in a library :]

> > output_tempfile str Creates a temporary file, containing str, and returns
> > the filename
> 
> I'm not sure about this one, since a "temporary file" is somehow OS-specific

Yes, but that's the point of abstracting it.
Ocaml already has a way to construct temporaries
portably, so this is just a lightweight convenience
wrapper around it.

> you're welcome to send a full featured ExtChar module.

Please keep all character set depend stuff OUT of Extlib.
Please leave it to an expert package, such as Camomile.

-- 
John Skaller, mailto:sk...@us...
voice: 061-2-9660-0850, 
snail: PO BOX 401 Glebe NSW 2037 Australia
Checkout the Felix programming language http://felix.sf.net

Re: [Ocaml-lib-devel] Some (simple) functions I'd like to see inExtLib ...

From: Nicolas C. <war...@fr...> - 2004-05-27 15:40:42

> > > unique () Returns incrementing integers (has internal state)
> >
> > that's nice !
> > I will add to Std
>
> I don't agree. I'm against functions with intrinsic internal state.
> If I found this function reviewing the lib I'd delete it without
> even bothering to ask permission.
>
> If you want such a function, provide a constructor for it:
>
> val make_int_generator: unit -> (unit ->int)
>
> but leave it up to the client to instantiate.

Sorry but as a careful programmer, I'm against the fact of "protecting the
programmer from bad usages".
This function is useful if used carefuly (that means, not in a multithreaded
program). The fact that all identifiers are shared is not a problem as long
as you don't override the 2^31 ocaml limit.

Regards,
Nicolas Cannasse

Re: [Ocaml-lib-devel] Some (simple) functions I'd like to see inExtLib ...

From: skaller <sk...@us...> - 2004-05-27 16:07:42

On Fri, 2004-05-28 at 01:39, Nicolas Cannasse wrote:
>  If you want such a function, provide a constructor for it:
> >
> > val make_int_generator: unit -> (unit ->int)
> >
> > but leave it up to the client to instantiate.
> 
> Sorry but as a careful programmer, I'm against the fact of "protecting the
> programmer from bad usages".

Then go back to programming in C.

The function above is in the spirit of Ocaml.
It is completely re-entrant and purely functional
and requires the programmer to arrange sharing as they see fit.

If the programmer wants to use a global variable they can 
easily write:

let unique = make_int_generator ();;

Leave this decision to the programmer to shoot themselves
in the head if they want. Don't encourage it or provide
a function that a careful programmer would in fact find
useless .. perhaps belatedly.

Its bad enough the standard distro has crap like Str
module in it. Lets not make the same mistake.

-- 
John Skaller, mailto:sk...@us...
voice: 061-2-9660-0850, 
snail: PO BOX 401 Glebe NSW 2037 Australia
Checkout the Felix programming language http://felix.sf.net

Re: [Ocaml-lib-devel] Some (simple) functions I'd like to see in ExtLib ...

From: skaller <sk...@us...> - 2004-05-27 16:14:49

On Fri, 2004-05-28 at 01:42, William D. Neumann wrote:

> I agree with this.  Though I'd prefer to see it with an optional argument
> to provide the starting point, i.e.
> 
> val make_int_generator: ?base:int -> unit -> (unit ->int)

Yeah that's a good idea IMHO.

BTW: of course I use fresh variable generators in 
my Felix compiler. At the moment I have a unique
one because it makes searching through generated
text easier. However the numbers start to get
a bit long quickly when you're alpha converting
types all the time, so being able to have several
generator around, and even create new ones,
would actually be quite useful.

Something else to add to the spec would be 
that the generator throws an exception (Invalid argument?)
when it runs out of ints. We don't want it to silently
wrap around .. which mine does :(

-- 
John Skaller, mailto:sk...@us...
voice: 061-2-9660-0850, 
snail: PO BOX 401 Glebe NSW 2037 Australia
Checkout the Felix programming language http://felix.sf.net

Re: [Ocaml-lib-devel] Some (simple) functions I'd like to see in ExtLib ...

From: William D. N. <wne...@cs...> - 2004-05-27 16:21:41

On Thu, 28 May 2004, skaller wrote:

> Something else to add to the spec would be
> that the generator throws an exception (Invalid argument?)
> when it runs out of ints. We don't want it to silently
> wrap around .. which mine does :(

Very good point as well, though Failure might be a better exception to
throw... Neither is quite right, but both are better than coming up with a
new exception...

William D. Neumann

---

"Well I could be a genius, if I just put my mind to it.
And I...I could do anything, if only I could get 'round to it.
Oh we were brought up on the space-race, now they expect you to clean toilets.
When you've seen how big the world is, how can you make do with this?
If you want me, I'll be sleeping in - sleeping in throughout these glory days."

	-- Jarvis Cocker

Think of XML as Lisp for COBOL programmers.

	-- Tony-A (some guy on /.)

Re: [Ocaml-lib-devel] Some (simple) functions I'd like to see in ExtLib ...

From: Nicolas C. <war...@fr...> - 2004-05-27 14:43:56

> ** Shared
>
> (* A very simple implementation of weak shared strings, by Remi Vanicat *)

I'm not sure this is wide usage enough.

> ** CGI functions
>
> I have some common CGI functions which I could donate to ExtLib if
> people feel that this is the right place for them.  In particular,
> functions for:
>
> * escaping HTML, URLs (see mod_caml's Cgi_escape module, documented
> here: http://www.merjis.com/developers/mod_caml/html/Cgi_escape.html )
>
> * parsing query strings

This is interesting if you can build a whole module helping with filenames,
url, and escaping in general.

> ** Date and time handling functions
>
> I have a smallish library of date and time handling functions which
> I'm willing to donate if people think that would be useful.  My
> library is particularly concerned with the complexity of ISO weeks
> [http://en.wikipedia.org/wiki/ISO_8601#Week_-_day_format], but should
> probably be extended to do other useful stuff.

That's interesting , but I'm not very qualified to answer about the quality
of your implementation.
Peer review , anyone ?

> ** Postscript generation
>
> Another library which I wrote and could donate creates EPS files from
> simple drawing commands.

This one is definitly not in ExtLib :)

Regards,
Nicolas Cannasse

Re: [Ocaml-lib-devel] Some (simple) functions I'd like to see in ExtLib ...

From: skaller <sk...@us...> - 2004-05-27 15:38:55

On Fri, 2004-05-28 at 00:42, Nicolas Cannasse wrote:

> > ** CGI functions

No. Extlib should try to be general purpose.
Remember it is supposed to be a replacement for
the core library.

There is a place for Internet/Web software, but not here.
I think the topic is far too heavy to venture into.

We should concentrate on algorithms and data structures,
plus some essential support for abstracted IO.

It just may be that parsing URL could be argued does
belong here since URL is a kind of modern equivalent
of 'filename'. The problem is once you start into
this domain, you end up needing a full scale XML
parser ... amoung 10,000 other functions to support
the huge number of standards out there for Internetting.

-- 
John Skaller, mailto:sk...@us...
voice: 061-2-9660-0850, 
snail: PO BOX 401 Glebe NSW 2037 Australia
Checkout the Felix programming language http://felix.sf.net

Re: [Ocaml-lib-devel] Some (simple) functions I'd like to see in ExtLib ...

From: Richard J. <ri...@an...> - 2004-05-27 17:02:14

On Fri, May 28, 2004 at 01:38:37AM +1000, skaller wrote:
[...]

I'll settle for a String.replace function (see original posting) which
would allow me to write the required escape_* functions quickly and
simply.

Rich.

-- 
Richard Jones. http://www.annexia.org/ http://www.j-london.com/
Merjis Ltd. http://www.merjis.com/ - improving website return on investment
If I have not seen as far as others, it is because I have been
standing in the footprints of giants.  -- from Usenet

Re: [Ocaml-lib-devel] Some (simple) functions I'd like to see in ExtLib ...

From: skaller <sk...@us...> - 2004-05-27 19:46:43

On Fri, 2004-05-28 at 03:02, Richard Jones wrote:
> On Fri, May 28, 2004 at 01:38:37AM +1000, skaller wrote:
> [...]
> 
> I'll settle for a String.replace function (see original posting) which
> would allow me to write the required escape_* functions quickly and
> simply.

I have no problem with String.replace!

Also, I have no problem with a charset type,
and a function:

isa chset ch

which tests if ch is in chset.

I even have an implementation of charset...
it works with my regexp package so I don't need
any 'isa' function, I can use regexps instead.

--------------------------------------

type charset_t
val charset_of_string: string -> charset_t
val charset_of_int_range: int -> int -> charset_t
val charset_of_range: string -> string -> charset_t
val charset_union: charset_t -> charset_t -> charset_t
val charset_inv: charset_t -> charset_t
val regexp_of_charset: charset_t -> regexp_t
val regexp_underscore: regexp_t
val eol: int
val regexp_dot: regexp_t

-------------- implementation -----------------------
type charset_t = bool array

let charset_of_string s = 
  let x = Array.make 256 false in
  for i  = 0 to String.length s - 1 do
    x.(Char.code s.[i]) <- true
  done;
  x


let charset_of_int_range x1 x2 =
  let x = Array.make 256 false in
  for i = x1 to x2 do
    x.(i) <- true 
  done
  ;
  x

let charset_of_range s1 s2 =
  if String.length s1 <> 1 
  then
    failwith "Charset range(first) requires string length 1"
  ;
  if String.length s2 <> 1 
  then
    failwith "Charset range(last) requires string length 1"
  ;
  let x1 = Char.code (s1.[0])
  and x2 = Char.code (s2.[0])
  in 
    charset_of_int_range x1 x2

let charset_union x1 x2 = 
  let x = Array.make 256 false in
  for i = 0 to 255 do
    x.(i) <- x1.(i) || x2.(i)
  done;
  x

let charset_inv y =
  let x = Array.make 256 false in
  for i = 0 to 255 do
    x.(i) <- not y.(i)
  done;
  x

let regexp_of_charset y =
  let res = ref REGEXP_epsilon in
  for i = 0 to 255 do
    if y.(i) then res := 
      let r = REGEXP_string (String.make 1 (Char.chr i)) in
      if !res = REGEXP_epsilon 
      then r 
      else REGEXP_alt ( !res, r)
  done
  ;
  !res

let regexp_underscore = 
  regexp_of_charset (charset_of_int_range 0 255)

let eol = Char.code '\n'

let regexp_dot = 
  regexp_of_charset 
  (
    charset_union 
      (charset_of_int_range 0 (eol - 1))
      (charset_of_int_range (eol + 1) 255)
  )


-- 
John Skaller, mailto:sk...@us...
voice: 061-2-9660-0850, 
snail: PO BOX 401 Glebe NSW 2037 Australia
Checkout the Felix programming language http://felix.sf.net

Re: [Ocaml-lib-devel] Some (simple) functions I'd like to see in ExtLib ...

From: Richard J. <ri...@an...> - 2004-05-28 08:28:31

On Fri, May 28, 2004 at 05:46:33AM +1000, skaller wrote:
> On Fri, 2004-05-28 at 03:02, Richard Jones wrote:
> > On Fri, May 28, 2004 at 01:38:37AM +1000, skaller wrote:
> > [...]
> > 
> > I'll settle for a String.replace function (see original posting) which
> > would allow me to write the required escape_* functions quickly and
> > simply.
> 
> I have no problem with String.replace!
> 
> Also, I have no problem with a charset type,
> and a function:

It looks very much like your implementation assumes 8-bit characters.
Unless it was used carefully, it wouldn't work with UTF-8 strings, for
example.

Rich.

-- 
Richard Jones. http://www.annexia.org/ http://www.j-london.com/
Merjis Ltd. http://www.merjis.com/ - improving website return on investment
"One serious obstacle to the adoption of good programming languages is
the notion that everything has to be sacrificed for speed. In computer
languages as in life, speed kills." -- Mike Vanier

Re: [Ocaml-lib-devel] Some (simple) functions I'd like to see in ExtLib ...

From: skaller <sk...@us...> - 2004-05-28 10:04:26

On Fri, 2004-05-28 at 18:28, Richard Jones wrote:

> It looks very much like your implementation assumes 8-bit characters.

Yes it does. you wouldn't use that implementation
for 32 bits. You'd need a trie structure I think.

My regexp implementation uses ints, however,
it will die if the character set is too large:
it also needs a faster implementation for
32 bit chars.

I suspect:

(a) specifying an interface
(b) making a useful instance for 8 bits

is a good start though. 32 bit transition matrix
isn't so easy. you can't just use arrays of arrays,
but O(n) is out of the question: O(log n) is
has to be the worst case and it had better
be O(1) amortised over real data.

B-tree is O(1) so that's a possibility.
Trie might be faster though ..

-- 
John Skaller, mailto:sk...@us...
voice: 061-2-9660-0850, 
snail: PO BOX 401 Glebe NSW 2037 Australia
Checkout the Felix programming language http://felix.sf.net

Re: [Ocaml-lib-devel] Some (simple) functions I'd like to see in ExtLib ...

From: Richard J. <ri...@an...> - 2004-05-28 10:12:48

I was thinking more of multibyte characters.  It seems to me that
UTF-8 is going to be the only thing that matters in the future (aside
from the Microsoft world, of course, where they have as usual gone off
in their own strange direction).

Rich.

-- 
Richard Jones. http://www.annexia.org/ http://www.j-london.com/
Merjis Ltd. http://www.merjis.com/ - improving website return on investment
MOD_CAML lets you run type-safe Objective CAML programs inside the Apache
webserver. http://www.merjis.com/developers/mod_caml/

Re: [Ocaml-lib-devel] Some (simple) functions I'd like to see in ExtLib ...

From: skaller <sk...@us...> - 2004-05-28 14:30:21

On Fri, 2004-05-28 at 20:12, Richard Jones wrote:
> I was thinking more of multibyte characters. 

A finite state automaton processes an alphabet 
consisting a fixed set of symbols.

There is no such thing as a multibyte character.

You can convert the UTF-8 to UCS-4 on the fly
if you drive a 32 bit automaton with an C++ 
style iterator, or use some kind of lazy function,
such as an Enum, or you can convert to whole string first.

Either way you need a 32 bit DFA not an 8 bit one.

Alternatively you can specify the UTF-8 encoding
within an 8 bit DFA. I've done this. There are
a lot of states .. ocamllex can't handle
this at all (I've tried it).

But the bottom line, IMHO, is in practice you need
two kinds: 8 bit and 32 bit, and the implementations
of charsets, DFA, etc, are likely to be completely
different for the two kinds.

BTW: Felix 8 bit engine is blindingly fast:

  while(state && lexeme_end != buffer_end)
    state = matrix[*lexeme_end++][state];

-- 
John Skaller, mailto:sk...@us...
voice: 061-2-9660-0850, 
snail: PO BOX 401 Glebe NSW 2037 Australia
Checkout the Felix programming language http://felix.sf.net

Re: [Ocaml-lib-devel] Some (simple) functions I'd like to see in ExtLib ...

From: Brian H. <bh...@sp...> - 2004-05-28 17:15:19

On 29 May 2004, skaller wrote:

> Either way you need a 32 bit DFA not an 8 bit one.

This is exactly the problem.  For a k-bit input alphabet, you need at 
least one table of 2^k bytes, even if you're doing input compression.  For 
k=8 this is no problem- a table of 256 entries still qualifies as small.  
For k=16, this is doable, we're looking at a 64K table.  But for k=32 
we're looking at a table size of 4G- unmanagable on current systems.

I think the solution is UTF-16.  Almost all the characters of interest are 
single characters now, limiting the explosion of states.  But the table 
sizes are still manageable.  Oh, the iterim tables are going to be large- 
uncompressed tables of 100M or more won't be unusual for large DFAs.  
Although I'd be tempted to look at sparse table structures to be brutally 
honest.  Accessing the high characters will simply take multiple states.

> BTW: Felix 8 bit engine is blindingly fast:
> 
>   while(state && lexeme_end != buffer_end)
>     state = matrix[*lexeme_end++][state];
> 

To do this with 32-bit lexemes takes (2^32)*(numstates).  A better, if 
slightly slower, implementation is:

let rec loop state =
	let state' = state_compress.(state)
	and lexeme' = lexeme_compress.(next_lexeme ())
        in
        loop matrix.(lexeme').(state')

Yeah, you have two extra table lookups, but in general much smaller 
tables.

-- 
"Usenet is like a herd of performing elephants with diarrhea -- massive,
difficult to redirect, awe-inspiring, entertaining, and a source of
mind-boggling amounts of excrement when you least expect it."
                                - Gene Spafford 
Brian

Re: [Ocaml-lib-devel] Some (simple) functions I'd like to see in ExtLib ...

From: skaller <sk...@us...> - 2004-05-28 19:17:41

On Sat, 2004-05-29 at 03:22, Brian Hurt wrote:
> On 29 May 2004, skaller wrote:
> 
> > Either way you need a 32 bit DFA not an 8 bit one.
> 
> This is exactly the problem.   But for k=32 
> we're looking at a table size of 4G- unmanagable on current systems.

Ah no. We're not. :)))

The reason is : no one is going to write
regular expressions which distinguish all 2^32 code points.

I mean you'd be typing away for a long time
just to type in 4G characters .. it wouldn't fit
in your editor .. :)

So actually, there is no size problem at all with
practical 32 bit state machines, provided an 
efficient mapping from code point to equivalence
class can be constructed, and constructed efficiently.

Tries handle subranges efficiently, and so they're the obvious
choice for classification.

> I think the solution is UTF-16. 

No and yes. Forget UTF-16 per se. I'm going to take
your proposition to be that a 16 bit engine is enough.

You may be right. I think i might try it!

> Oh, the iterim tables are going to be large- 
> uncompressed tables of 100M or more won't be unusual for large DFAs.  

Yeah, that's a problem generating the DFA.

> Although I'd be tempted to look at sparse table structures to be brutally 
> honest.  

There's no choice is there? We already agree that 4G arrays aren't on :)
Tries manage sparse linear sequences
efficiently when they consist of subranges, which I think
is what you'll get with Unicode input.

> Accessing the high characters will simply take multiple states.
> 
> > BTW: Felix 8 bit engine is blindingly fast:
> > 
> >   while(state && lexeme_end != buffer_end)
> >     state = matrix[*lexeme_end++][state];
> > 
> 
> To do this with 32-bit lexemes takes (2^32)*(numstates).

No, only in worst case. It's slightly tricky to see but
this representation is actually classifying characters
automatically. Matrix here is NOT a matrix!!

Matrix is an array of pointers to arrays.
There are 256 pointers to arrays of n-states,
but that is NOT 256 arrays of n-states.

The actual cost is:

256 + nclasses * nstates

where nclasses is typically smaller than 256.

>   A better, if 
> slightly slower, implementation is:
> 
> let rec loop state =
> 	let state' = state_compress.(state)
> 	and lexeme' = lexeme_compress.(next_lexeme ())
>         in
>         loop matrix.(lexeme').(state')
> 
> Yeah, you have two extra table lookups, but in general much smaller 
> tables.

I'm already doing your 'lexeme compress'.
I don't actually know how to compress the state vectors.

-- 
John Skaller, mailto:sk...@us...
voice: 061-2-9660-0850, 
snail: PO BOX 401 Glebe NSW 2037 Australia
Checkout the Felix programming language http://felix.sf.net

Re: [Ocaml-lib-devel] Some (simple) functions I'd like to see in ExtLib ...

From: Brian H. <bh...@sp...> - 2004-05-28 21:45:15

On 29 May 2004, skaller wrote:

> On Sat, 2004-05-29 at 03:22, Brian Hurt wrote:
> > On 29 May 2004, skaller wrote:
> > 
> > > Either way you need a 32 bit DFA not an 8 bit one.
> > 
> > This is exactly the problem.   But for k=32 
> > we're looking at a table size of 4G- unmanagable on current systems.
> 
> Ah no. We're not. :)))

I think I see how we're talking past each other, and the problem's at my 
end.  I keep thinking in terms of classic lex/yacc implementations, which 
use no data structure other than arrays.

In any DFA designed by a human, you're not going to have a large number of 
character classes- a few hundred at most, even for the most extreme cases 
(most of which arise from using the lexer to parse out keywords).  The 
average is more likely to be only a few dozen.

So what you need to be able to do is to determine the class a character is 
in upon input.  Now, the lex scheme to do this is a simple array look-up.  
Which is great if the number of possible input characters is small, but 
blows complete chow if you have a large number of possible inputs.  The 
problem is I hadn't realized I was assuming this was the only way to do 
things.  Obviously, once I state it baldly like that, this is wrong.

By far the most common case is going to be that all characters above some 
reasonably small limit are all going to be of the same class.  "Reasonably 
small" meaning I don't mind having an array of that size- meaning 99.9% of 
the time you can get away with something as simple as:
	let c = input () in
	let c' = if c < limit then class.(c) else bigclass in
        ...

A limit of 65536 wouldn't be bad- indeed, I might be tempted to do a limit
of 256.  Even if you can't do this, this is a good way to do the "easy"  
transisitions- high valued characters can be looked up with a more
complicated algorithm (a trie, for example).

So now I can handle 4G character points without using 4G of data.

The same thing applies to the internal tables- use a sparse matrix 
representation.

> > I think the solution is UTF-16. 
> 
> No and yes. Forget UTF-16 per se. I'm going to take
> your proposition to be that a 16 bit engine is enough.
> 
> You may be right. I think i might try it!

No, I was thinking of UTF-16.  Which represents code points less than 
32768 as single 2-byte characters, and code points higher than that as 
multiple 2-byte characters, like UTF-8 does.  This means that recognizing 
characters higher than 32767 takes multiple states.

Now I've swung back around- I don't think this is worth it.  The extra 
states are too much overhead, it's cheaper to just deal with 32-bit code 
points.

> 
> > Oh, the iterim tables are going to be large- 
> > uncompressed tables of 100M or more won't be unusual for large DFAs.  
> 
> Yeah, that's a problem generating the DFA.

Use sparse matricies.  Especially if they're extended sparse matricies 
that can handle character classes, like [:digit:], [:space:], etc.  The 
default transisition, of course, would be to the error state.  I'd bet you 
wouldn't have more than a dozen or so transitions in any given state, 
meaning your internal table isn't more than a few 10's of K in size.

Note that once you're dealing with the sparse matrix representation, the 
size of the matrix is a function primarily of the number of states and the 
average number of transitions per state- not of the number of different 
possible code points.  So there's no advantage to only dealing with 16-bit 
characters and not 32-bit characters.

> I don't actually know how to compress the state vectors.

Same way you compress state vectors.  If you have two states which have 
exactly the same transitions (duplicate columns), you collapse them into 
one state class.

-- 
"Usenet is like a herd of performing elephants with diarrhea -- massive,
difficult to redirect, awe-inspiring, entertaining, and a source of
mind-boggling amounts of excrement when you least expect it."
                                - Gene Spafford 
Brian

Re: [Ocaml-lib-devel] Some (simple) functions I'd like to see in ExtLib ...

From: skaller <sk...@us...> - 2004-05-29 06:55:56

On Sat, 2004-05-29 at 07:52, Brian Hurt wrote:
> > > This is exactly the problem.   But for k=32 
> > > we're looking at a table size of 4G- unmanagable on current systems.
> > 
> > Ah no. We're not. :)))
> 
> I think I see how we're talking past each other, and the problem's at my 
> end.  I keep thinking in terms of classic lex/yacc implementations, which 
> use no data structure other than arrays.

Yes, but do not abandon this thought.
In the worst case, any data structure OTHER than an array
will be more expensive than an array. Arrays are good!

> In any DFA designed by a human, you're not going to have a large number of 
> character classes- a few hundred at most, 

Yes, that is the idea. If I can comment on an apparently
unrelated issue: people often say it isn't possible to
determine mechanically if a program will terminate.

They're WRONG!

This is a really serious mistake made by theoreticians.

There is a difference between *arbitrary program* and
*program I wrote which I know should terminate*.

I have a reason for believing my program should terminate:
I wrote it to do so. If a smart compiler can't prove it,
my code is too complex.

Ok, so this actually applies to the regexp thing.
However it isn't clear you are right that a small
(a few Meg) regexp specification necessarily
leads to a small ( a few Meg) number of character classes.

I think this is probably the case, but if the
number of classes is *exponential* in the size of
a regexp it won't be the case. If it is linear
it will be the case. 

The keywords example shows that the number of equivalence
classes can grow very fast..

An important thing here is: my system and ocamllex both
use regular *definitions* not merely regular expressions.
Regular expressions seem to grow fast when you have
regular definitions: the tree is expanded recursively
by substituting in the regular definitions:

regedef a = ..
regdef b = a a a ...
regdef c = a a a b b b ...
regdef d = c c c b b b ...
....

so we need to take this into account comparing the size of input
to the size of the output.

> So now I can handle 4G character points without using 4G of data.

I don't think it is enough to just use an array plus a list.
That's why I suggested a trie: it handles singleton classes
plus subranges efficiently.

The lookup is log n I think.

> No, I was thinking of UTF-16.  

I know, but UTF-16 assumes you're talking about Unicode
inputs. Regexps are useful for many other things too.
For example a Yacc style parser is driven by a DFA
generated by taking *tokens* as inputs.

> > I don't actually know how to compress the state vectors.
> 
> Same way you compress state vectors.  If you have two states which have 
> exactly the same transitions (duplicate columns), you collapse them into 
> one state class.

In practice, minimising a DFA rarely seems to save anything.
However it may be possible to compress individual transition
vectors, since 90% of the states you're going to are likely
to be 'error'. But I don't know how effective this would be.

-- 
John Skaller, mailto:sk...@us...
voice: 061-2-9660-0850, 
snail: PO BOX 401 Glebe NSW 2037 Australia
Checkout the Felix programming language http://felix.sf.net

Re: [Ocaml-lib-devel] Some (simple) functions I'd like to see in ExtLib ...

From: Alain F. <Ala...@en...> - 2004-05-29 07:56:53

On 29 May 2004, skaller wrote:

> Tries handle subranges efficiently, and so they're the obvious
> choice for classification.

Trie usually denotes a tree-shaped datastructure that hold strings by
sharing prefixes. I guess this is not what you are talking about.

FWIW, my lexer generator ulex implements regular expression over Unicode
(parsing the encoding is done independently). The transition relation is
implemented as a mix of a dispatch table (for small code points), and a
tree of if-then-else branches to implement dichotomy (I guess this is what
you refer to as trie). Lookup has then O(log n) complexity, where n is the
number of different ranges for the current state (log n <= 21 for
Unicode), but if the input is only, say, in the Latin 1 character set
(initial 256 codes long prefix of Unicode), this is O(1).

My old lexer generator wlex used a different technique: it was a simple
patch to ocamllex to introduce an intermediate "classification" layer
between the lexbuf and the automaton; the layer would transform the
byte stream into a class stream, where classes are user-declared.
All the state of the ocamllex automaton would use the same classification
layer.

-- Alain

Re: [Ocaml-lib-devel] Some (simple) functions I'd like to see in ExtLib ...

From: skaller <sk...@us...> - 2004-05-29 09:18:25

On Sat, 2004-05-29 at 17:56, Alain Frisch wrote:
> On 29 May 2004, skaller wrote:
> 
> > Tries handle subranges efficiently, and so they're the obvious
> > choice for classification.
> 
> Trie usually denotes a tree-shaped datastructure that hold strings by
> sharing prefixes. I guess this is not what you are talking about.

Actually I think it is! What we're talking about is a small set
of subranges of the Unicode code point set, and how to
quickly determine if a 32 bit code point is in a particular
such subset of Unicode, or, more generally, which class
of a partition it is in.

The idea is that a subrange of characters like 'a-z' tend
to share a common prefix in terms of the bitstring
representing the binary code.

You could use a binary tree and index it one bit
at a time. A trie is just the same idea, except it
is smarter about organising the nodes to have more
than 2 branches, so the branch selection is slower,
but the depth of the tree is less so the overall
performance is much better.

-- 
John Skaller, mailto:sk...@us...
voice: 061-2-9660-0850, 
snail: PO BOX 401 Glebe NSW 2037 Australia
Checkout the Felix programming language http://felix.sf.net

1 2 3 > >> (Page 1 of 3)