From: Richard J. <ri...@an...> - 2004-05-27 13:16:52
|
Is it helpful if I produce a patch to implement these? Or would you prefer to write the worthy ones yourself? ** Std: string_of_char c Equivalent to String.make 1 c identity x Just returns x! unique () Returns incrementing integers (has internal state) input_file filename Returns the contents of filename as a string output_file filename str Writes string str to new/replaced file filename output_tempfile str Creates a temporary file, containing str, and returns the filename unlink_noerr filename Removes a file, ignoring errors ** ExtChar (or perhaps better in UChar): is_space, is_alnum, is_digit, is_xdigit, etc. It's inexplicable why these were left out of the standard OCaml library. ** ExtList: List.take n xs Return first n elements of xs List.drop n xs Drop first n elements of xs List.range a b Returns the list of integers [a, a+1, ..., b] List.frequency ?cmp xs Returns list [f1, x1; f2, x2; ...] where each f1, f2, ... is an integer recording the number of instances of x1, x2, ... in the original list. My implementation of this assumes xs is sorted already. List.group_by : ('a * 'b) list -> ('a * 'b list) list (* This implementation by Isaac Trotts: *) let group_by ls = let ls' = List.fold_left (fun acc (day1, x1) -> match acc with [] -> [day1, [x1]] | (day2, ls2) :: acctl -> if day1 = day2 then (day1, x1 :: ls2) :: acctl else (day1, [x1]) :: acc) [] ls in let ls' = List.rev ls' in List.map (fun (x, xs) -> x, List.rev xs) ls' ** ExtString: String.contains_string Returns true if a string contains a substring A way to replace characters with strings. I'm going to call this String.replace, although another name might be preferable. It should allow me to do: let escape_html = String.replace (function '>' -> ">" | '<' -> "<" | '&' -> "&" | c -> string_of_char c) ** Shared (* A very simple implementation of weak shared strings, by Remi Vanicat *) module StArg = struct type t = string let equal = ( = ) let hash = Hashtbl.hash end module StWeakHash = Weak.Make (StArg) let table = StWeakHash.create 1024 let shared s = try StWeakHash.find table s with Not_found -> StWeakHash.add table s; s ** CGI functions I have some common CGI functions which I could donate to ExtLib if people feel that this is the right place for them. In particular, functions for: * escaping HTML, URLs (see mod_caml's Cgi_escape module, documented here: http://www.merjis.com/developers/mod_caml/html/Cgi_escape.html ) * parsing query strings ** Date and time handling functions I have a smallish library of date and time handling functions which I'm willing to donate if people think that would be useful. My library is particularly concerned with the complexity of ISO weeks [http://en.wikipedia.org/wiki/ISO_8601#Week_-_day_format], but should probably be extended to do other useful stuff. ** Postscript generation Another library which I wrote and could donate creates EPS files from simple drawing commands. Rich. -- Richard Jones. http://www.annexia.org/ http://www.j-london.com/ Merjis Ltd. http://www.merjis.com/ - improving website return on investment MOD_CAML lets you run type-safe Objective CAML programs inside the Apache webserver. http://www.merjis.com/developers/mod_caml/ |
From: Nicolas C. <war...@fr...> - 2004-05-27 14:30:49
|
> string_of_char c Equivalent to String.make 1 c we have String.of_char but I agree it's nice to have another function doing that (because of others string_of_*) > identity x Just returns x! I'm really not convinced about this one but it already have been asked several times, so I'll add it. > unique () Returns incrementing integers (has internal state) that's nice ! I will add to Std > input_file filename Returns the contents of filename as a string ok > output_file filename str Writes string str to new/replaced file filename ok > output_tempfile str Creates a temporary file, containing str, and returns > the filename I'm not sure about this one, since a "temporary file" is somehow OS-specific > unlink_noerr filename Removes a file, ignoring errors something like : try Sys.remove fname with _ -> () ? > ** ExtChar (or perhaps better in UChar): > > is_space, is_alnum, is_digit, is_xdigit, etc. It's inexplicable why > these were left out of the standard OCaml library. you're welcome to send a full featured ExtChar module. > ** ExtList: > > List.take n xs Return first n elements of xs > > List.drop n xs Drop first n elements of xs Ok for theses two > List.range a b Returns the list of integers [a, a+1, ..., b] List.init (b-a) ((+) a) List.init is more general usage > List.frequency ?cmp xs Returns list [f1, x1; f2, x2; ...] where each > f1, f2, ... is an integer recording the number of > instances of x1, x2, ... in the original list. My > implementation of this assumes xs is sorted already. > > List.group_by : ('a * 'b) list -> ('a * 'b list) list > > (* This implementation by Isaac Trotts: *) > let group_by ls = > let ls' = > List.fold_left > (fun acc (day1, x1) -> > match acc with > [] -> [day1, [x1]] > | (day2, ls2) :: acctl -> > if day1 = day2 > then (day1, x1 :: ls2) :: acctl > else (day1, [x1]) :: acc) > [] > ls > in > let ls' = List.rev ls' in > List.map (fun (x, xs) -> x, List.rev xs) ls' This needs of course an optional comparison function, but it's difficult to implement efficiently (tail recursive and removing the List.rev calls). > ** ExtString: > > String.contains_string Returns true if a string contains a substring now ExtString.exists > A way to replace characters with strings. I'm going to call this > String.replace, although another name might be preferable. It should > allow me to do: > > let escape_html = String.replace (function '>' -> ">" > | '<' -> "<" > | '&' -> "&" > | c -> string_of_char c) now ExtString.replace (intesting implementation) I'll answer others in a separate mail. Regards, Nicolas Cannasse |
From: Richard J. <ri...@an...> - 2004-05-27 14:59:51
|
On Thu, May 27, 2004 at 03:53:41PM +0100, Richard Jones wrote: > > > ** ExtChar (or perhaps better in UChar): > > > > > > is_space, is_alnum, is_digit, is_xdigit, etc. It's inexplicable why > > > these were left out of the standard OCaml library. > > > > you're welcome to send a full featured ExtChar module. > > OK, will look at this. Do you think it should be ExtChar or UChar > though? Since so much of the code I now write uses UTF-8 exclusively > I'm loathe to contribute any more 8-bit-char-specific code to the > world ... Actually I can answer my own question here. We could define the ExtChar.is_* functions to only work correctly on 7-bit ASCII. They would return false on any character codes >= 128. This way they should do the Right Thing when presented with UTF-8 strings too. (CC-ing this to ocaml-i18n so that people who know what they're talking about can comment). Rich. -- Richard Jones. http://www.annexia.org/ http://www.j-london.com/ Merjis Ltd. http://www.merjis.com/ - improving website return on investment MOD_CAML lets you run type-safe Objective CAML programs inside the Apache webserver. http://www.merjis.com/developers/mod_caml/ |
From: Yamagata Y. <yor...@mb...> - 2004-06-03 11:37:13
|
From: Richard Jones <ri...@an...> Subject: [Ocaml-i18n] Re: [Ocaml-lib-devel] Some (simple) functions I'd like to see in ExtLib ... Date: Thu, 27 May 2004 15:59:44 +0100 > On Thu, May 27, 2004 at 03:53:41PM +0100, Richard Jones wrote: > > > > ** ExtChar (or perhaps better in UChar): > > > > > > > > is_space, is_alnum, is_digit, is_xdigit, etc. It's inexplicable why > > > > these were left out of the standard OCaml library. > > > > > > you're welcome to send a full featured ExtChar module. > > > > OK, will look at this. Do you think it should be ExtChar or UChar > > though? Since so much of the code I now write uses UTF-8 exclusively > > I'm loathe to contribute any more 8-bit-char-specific code to the > > world ... > > Actually I can answer my own question here. We could define the > ExtChar.is_* functions to only work correctly on 7-bit ASCII. They > would return false on any character codes >= 128. This way they > should do the Right Thing when presented with UTF-8 strings too. I think the general consensus (of I18N experts) is that ISO-C char. classes are not enough. Unicode standard defines elaborate character properties (http://camomile.sourceforge.net/dochtml/UCharInfo.html). You can define ISO-C char. classes from these properties, though. (glibc actually does this). -- Yamagata Yoriyuki |
From:
<Wol...@un...> - 2004-05-27 15:01:50
|
Nicolas Cannasse wrote: As an almost silent semi-outsider: Wow this looks cool (my EUR0.02) (snip) >>List.range a b Returns the list of integers [a, a+1, ..., b] > > > List.init (b-a) ((+) a) > List.init is more general usage I think (List.range a b) and/or (List.range a b increment) would be good, as pythonish people like me would look for a range function as the python idiom goes for i in range(1,10): do_something(i) Cheers, Wolfgang |
From: Nicolas C. <war...@fr...> - 2004-05-27 15:06:59
|
> I think (List.range a b) and/or (List.range a b increment) would be > good, as pythonish people like me would look for a range function as the > python idiom goes > > for i in range(1,10): > do_something(i) <adv> then what you need is Enum.range 1 10 <=> Enum.init 1 ((+)9) since you want to iter... </adv> NC |
From: skaller <sk...@us...> - 2004-05-27 15:32:10
|
On Fri, 2004-05-28 at 00:29, Nicolas Cannasse wrote: > > unique () Returns incrementing integers (has internal state) > > that's nice ! > I will add to Std I don't agree. I'm against functions with intrinsic internal state. If I found this function reviewing the lib I'd delete it without even bothering to ask permission. If you want such a function, provide a constructor for it: val make_int_generator: unit -> (unit ->int) but leave it up to the client to instantiate. [Think "GLOBAL STATE IS BAD BAD BAD" and repeat it in your sleep until you can't ever forget what a sin it is in a library :] > > output_tempfile str Creates a temporary file, containing str, and returns > > the filename > > I'm not sure about this one, since a "temporary file" is somehow OS-specific Yes, but that's the point of abstracting it. Ocaml already has a way to construct temporaries portably, so this is just a lightweight convenience wrapper around it. > you're welcome to send a full featured ExtChar module. Please keep all character set depend stuff OUT of Extlib. Please leave it to an expert package, such as Camomile. -- John Skaller, mailto:sk...@us... voice: 061-2-9660-0850, snail: PO BOX 401 Glebe NSW 2037 Australia Checkout the Felix programming language http://felix.sf.net |
From: Nicolas C. <war...@fr...> - 2004-05-27 15:40:42
|
> > > unique () Returns incrementing integers (has internal state) > > > > that's nice ! > > I will add to Std > > I don't agree. I'm against functions with intrinsic internal state. > If I found this function reviewing the lib I'd delete it without > even bothering to ask permission. > > If you want such a function, provide a constructor for it: > > val make_int_generator: unit -> (unit ->int) > > but leave it up to the client to instantiate. Sorry but as a careful programmer, I'm against the fact of "protecting the programmer from bad usages". This function is useful if used carefuly (that means, not in a multithreaded program). The fact that all identifiers are shared is not a problem as long as you don't override the 2^31 ocaml limit. Regards, Nicolas Cannasse |
From: skaller <sk...@us...> - 2004-05-27 16:07:42
|
On Fri, 2004-05-28 at 01:39, Nicolas Cannasse wrote: > If you want such a function, provide a constructor for it: > > > > val make_int_generator: unit -> (unit ->int) > > > > but leave it up to the client to instantiate. > > Sorry but as a careful programmer, I'm against the fact of "protecting the > programmer from bad usages". Then go back to programming in C. The function above is in the spirit of Ocaml. It is completely re-entrant and purely functional and requires the programmer to arrange sharing as they see fit. If the programmer wants to use a global variable they can easily write: let unique = make_int_generator ();; Leave this decision to the programmer to shoot themselves in the head if they want. Don't encourage it or provide a function that a careful programmer would in fact find useless .. perhaps belatedly. Its bad enough the standard distro has crap like Str module in it. Lets not make the same mistake. -- John Skaller, mailto:sk...@us... voice: 061-2-9660-0850, snail: PO BOX 401 Glebe NSW 2037 Australia Checkout the Felix programming language http://felix.sf.net |
From: skaller <sk...@us...> - 2004-05-27 16:14:49
|
On Fri, 2004-05-28 at 01:42, William D. Neumann wrote: > I agree with this. Though I'd prefer to see it with an optional argument > to provide the starting point, i.e. > > val make_int_generator: ?base:int -> unit -> (unit ->int) Yeah that's a good idea IMHO. BTW: of course I use fresh variable generators in my Felix compiler. At the moment I have a unique one because it makes searching through generated text easier. However the numbers start to get a bit long quickly when you're alpha converting types all the time, so being able to have several generator around, and even create new ones, would actually be quite useful. Something else to add to the spec would be that the generator throws an exception (Invalid argument?) when it runs out of ints. We don't want it to silently wrap around .. which mine does :( -- John Skaller, mailto:sk...@us... voice: 061-2-9660-0850, snail: PO BOX 401 Glebe NSW 2037 Australia Checkout the Felix programming language http://felix.sf.net |
From: William D. N. <wne...@cs...> - 2004-05-27 16:21:41
|
On Thu, 28 May 2004, skaller wrote: > Something else to add to the spec would be > that the generator throws an exception (Invalid argument?) > when it runs out of ints. We don't want it to silently > wrap around .. which mine does :( Very good point as well, though Failure might be a better exception to throw... Neither is quite right, but both are better than coming up with a new exception... William D. Neumann --- "Well I could be a genius, if I just put my mind to it. And I...I could do anything, if only I could get 'round to it. Oh we were brought up on the space-race, now they expect you to clean toilets. When you've seen how big the world is, how can you make do with this? If you want me, I'll be sleeping in - sleeping in throughout these glory days." -- Jarvis Cocker Think of XML as Lisp for COBOL programmers. -- Tony-A (some guy on /.) |
From: Nicolas C. <war...@fr...> - 2004-05-27 14:43:56
|
> ** Shared > > (* A very simple implementation of weak shared strings, by Remi Vanicat *) I'm not sure this is wide usage enough. > ** CGI functions > > I have some common CGI functions which I could donate to ExtLib if > people feel that this is the right place for them. In particular, > functions for: > > * escaping HTML, URLs (see mod_caml's Cgi_escape module, documented > here: http://www.merjis.com/developers/mod_caml/html/Cgi_escape.html ) > > * parsing query strings This is interesting if you can build a whole module helping with filenames, url, and escaping in general. > ** Date and time handling functions > > I have a smallish library of date and time handling functions which > I'm willing to donate if people think that would be useful. My > library is particularly concerned with the complexity of ISO weeks > [http://en.wikipedia.org/wiki/ISO_8601#Week_-_day_format], but should > probably be extended to do other useful stuff. That's interesting , but I'm not very qualified to answer about the quality of your implementation. Peer review , anyone ? > ** Postscript generation > > Another library which I wrote and could donate creates EPS files from > simple drawing commands. This one is definitly not in ExtLib :) Regards, Nicolas Cannasse |
From: skaller <sk...@us...> - 2004-05-27 15:38:55
|
On Fri, 2004-05-28 at 00:42, Nicolas Cannasse wrote: > > ** CGI functions No. Extlib should try to be general purpose. Remember it is supposed to be a replacement for the core library. There is a place for Internet/Web software, but not here. I think the topic is far too heavy to venture into. We should concentrate on algorithms and data structures, plus some essential support for abstracted IO. It just may be that parsing URL could be argued does belong here since URL is a kind of modern equivalent of 'filename'. The problem is once you start into this domain, you end up needing a full scale XML parser ... amoung 10,000 other functions to support the huge number of standards out there for Internetting. -- John Skaller, mailto:sk...@us... voice: 061-2-9660-0850, snail: PO BOX 401 Glebe NSW 2037 Australia Checkout the Felix programming language http://felix.sf.net |
From: Richard J. <ri...@an...> - 2004-05-27 17:02:14
|
On Fri, May 28, 2004 at 01:38:37AM +1000, skaller wrote: [...] I'll settle for a String.replace function (see original posting) which would allow me to write the required escape_* functions quickly and simply. Rich. -- Richard Jones. http://www.annexia.org/ http://www.j-london.com/ Merjis Ltd. http://www.merjis.com/ - improving website return on investment If I have not seen as far as others, it is because I have been standing in the footprints of giants. -- from Usenet |
From: skaller <sk...@us...> - 2004-05-27 19:46:43
|
On Fri, 2004-05-28 at 03:02, Richard Jones wrote: > On Fri, May 28, 2004 at 01:38:37AM +1000, skaller wrote: > [...] > > I'll settle for a String.replace function (see original posting) which > would allow me to write the required escape_* functions quickly and > simply. I have no problem with String.replace! Also, I have no problem with a charset type, and a function: isa chset ch which tests if ch is in chset. I even have an implementation of charset... it works with my regexp package so I don't need any 'isa' function, I can use regexps instead. -------------------------------------- type charset_t val charset_of_string: string -> charset_t val charset_of_int_range: int -> int -> charset_t val charset_of_range: string -> string -> charset_t val charset_union: charset_t -> charset_t -> charset_t val charset_inv: charset_t -> charset_t val regexp_of_charset: charset_t -> regexp_t val regexp_underscore: regexp_t val eol: int val regexp_dot: regexp_t -------------- implementation ----------------------- type charset_t = bool array let charset_of_string s = let x = Array.make 256 false in for i = 0 to String.length s - 1 do x.(Char.code s.[i]) <- true done; x let charset_of_int_range x1 x2 = let x = Array.make 256 false in for i = x1 to x2 do x.(i) <- true done ; x let charset_of_range s1 s2 = if String.length s1 <> 1 then failwith "Charset range(first) requires string length 1" ; if String.length s2 <> 1 then failwith "Charset range(last) requires string length 1" ; let x1 = Char.code (s1.[0]) and x2 = Char.code (s2.[0]) in charset_of_int_range x1 x2 let charset_union x1 x2 = let x = Array.make 256 false in for i = 0 to 255 do x.(i) <- x1.(i) || x2.(i) done; x let charset_inv y = let x = Array.make 256 false in for i = 0 to 255 do x.(i) <- not y.(i) done; x let regexp_of_charset y = let res = ref REGEXP_epsilon in for i = 0 to 255 do if y.(i) then res := let r = REGEXP_string (String.make 1 (Char.chr i)) in if !res = REGEXP_epsilon then r else REGEXP_alt ( !res, r) done ; !res let regexp_underscore = regexp_of_charset (charset_of_int_range 0 255) let eol = Char.code '\n' let regexp_dot = regexp_of_charset ( charset_union (charset_of_int_range 0 (eol - 1)) (charset_of_int_range (eol + 1) 255) ) -- John Skaller, mailto:sk...@us... voice: 061-2-9660-0850, snail: PO BOX 401 Glebe NSW 2037 Australia Checkout the Felix programming language http://felix.sf.net |
From: Richard J. <ri...@an...> - 2004-05-28 08:28:31
|
On Fri, May 28, 2004 at 05:46:33AM +1000, skaller wrote: > On Fri, 2004-05-28 at 03:02, Richard Jones wrote: > > On Fri, May 28, 2004 at 01:38:37AM +1000, skaller wrote: > > [...] > > > > I'll settle for a String.replace function (see original posting) which > > would allow me to write the required escape_* functions quickly and > > simply. > > I have no problem with String.replace! > > Also, I have no problem with a charset type, > and a function: It looks very much like your implementation assumes 8-bit characters. Unless it was used carefully, it wouldn't work with UTF-8 strings, for example. Rich. -- Richard Jones. http://www.annexia.org/ http://www.j-london.com/ Merjis Ltd. http://www.merjis.com/ - improving website return on investment "One serious obstacle to the adoption of good programming languages is the notion that everything has to be sacrificed for speed. In computer languages as in life, speed kills." -- Mike Vanier |
From: skaller <sk...@us...> - 2004-05-28 10:04:26
|
On Fri, 2004-05-28 at 18:28, Richard Jones wrote: > It looks very much like your implementation assumes 8-bit characters. Yes it does. you wouldn't use that implementation for 32 bits. You'd need a trie structure I think. My regexp implementation uses ints, however, it will die if the character set is too large: it also needs a faster implementation for 32 bit chars. I suspect: (a) specifying an interface (b) making a useful instance for 8 bits is a good start though. 32 bit transition matrix isn't so easy. you can't just use arrays of arrays, but O(n) is out of the question: O(log n) is has to be the worst case and it had better be O(1) amortised over real data. B-tree is O(1) so that's a possibility. Trie might be faster though .. -- John Skaller, mailto:sk...@us... voice: 061-2-9660-0850, snail: PO BOX 401 Glebe NSW 2037 Australia Checkout the Felix programming language http://felix.sf.net |
From: Richard J. <ri...@an...> - 2004-05-28 10:12:48
|
I was thinking more of multibyte characters. It seems to me that UTF-8 is going to be the only thing that matters in the future (aside from the Microsoft world, of course, where they have as usual gone off in their own strange direction). Rich. -- Richard Jones. http://www.annexia.org/ http://www.j-london.com/ Merjis Ltd. http://www.merjis.com/ - improving website return on investment MOD_CAML lets you run type-safe Objective CAML programs inside the Apache webserver. http://www.merjis.com/developers/mod_caml/ |
From: skaller <sk...@us...> - 2004-05-28 14:30:21
|
On Fri, 2004-05-28 at 20:12, Richard Jones wrote: > I was thinking more of multibyte characters. A finite state automaton processes an alphabet consisting a fixed set of symbols. There is no such thing as a multibyte character. You can convert the UTF-8 to UCS-4 on the fly if you drive a 32 bit automaton with an C++ style iterator, or use some kind of lazy function, such as an Enum, or you can convert to whole string first. Either way you need a 32 bit DFA not an 8 bit one. Alternatively you can specify the UTF-8 encoding within an 8 bit DFA. I've done this. There are a lot of states .. ocamllex can't handle this at all (I've tried it). But the bottom line, IMHO, is in practice you need two kinds: 8 bit and 32 bit, and the implementations of charsets, DFA, etc, are likely to be completely different for the two kinds. BTW: Felix 8 bit engine is blindingly fast: while(state && lexeme_end != buffer_end) state = matrix[*lexeme_end++][state]; -- John Skaller, mailto:sk...@us... voice: 061-2-9660-0850, snail: PO BOX 401 Glebe NSW 2037 Australia Checkout the Felix programming language http://felix.sf.net |
From: Brian H. <bh...@sp...> - 2004-05-28 17:15:19
|
On 29 May 2004, skaller wrote: > Either way you need a 32 bit DFA not an 8 bit one. This is exactly the problem. For a k-bit input alphabet, you need at least one table of 2^k bytes, even if you're doing input compression. For k=8 this is no problem- a table of 256 entries still qualifies as small. For k=16, this is doable, we're looking at a 64K table. But for k=32 we're looking at a table size of 4G- unmanagable on current systems. I think the solution is UTF-16. Almost all the characters of interest are single characters now, limiting the explosion of states. But the table sizes are still manageable. Oh, the iterim tables are going to be large- uncompressed tables of 100M or more won't be unusual for large DFAs. Although I'd be tempted to look at sparse table structures to be brutally honest. Accessing the high characters will simply take multiple states. > BTW: Felix 8 bit engine is blindingly fast: > > while(state && lexeme_end != buffer_end) > state = matrix[*lexeme_end++][state]; > To do this with 32-bit lexemes takes (2^32)*(numstates). A better, if slightly slower, implementation is: let rec loop state = let state' = state_compress.(state) and lexeme' = lexeme_compress.(next_lexeme ()) in loop matrix.(lexeme').(state') Yeah, you have two extra table lookups, but in general much smaller tables. -- "Usenet is like a herd of performing elephants with diarrhea -- massive, difficult to redirect, awe-inspiring, entertaining, and a source of mind-boggling amounts of excrement when you least expect it." - Gene Spafford Brian |
From: skaller <sk...@us...> - 2004-05-28 19:17:41
|
On Sat, 2004-05-29 at 03:22, Brian Hurt wrote: > On 29 May 2004, skaller wrote: > > > Either way you need a 32 bit DFA not an 8 bit one. > > This is exactly the problem. But for k=32 > we're looking at a table size of 4G- unmanagable on current systems. Ah no. We're not. :))) The reason is : no one is going to write regular expressions which distinguish all 2^32 code points. I mean you'd be typing away for a long time just to type in 4G characters .. it wouldn't fit in your editor .. :) So actually, there is no size problem at all with practical 32 bit state machines, provided an efficient mapping from code point to equivalence class can be constructed, and constructed efficiently. Tries handle subranges efficiently, and so they're the obvious choice for classification. > I think the solution is UTF-16. No and yes. Forget UTF-16 per se. I'm going to take your proposition to be that a 16 bit engine is enough. You may be right. I think i might try it! > Oh, the iterim tables are going to be large- > uncompressed tables of 100M or more won't be unusual for large DFAs. Yeah, that's a problem generating the DFA. > Although I'd be tempted to look at sparse table structures to be brutally > honest. There's no choice is there? We already agree that 4G arrays aren't on :) Tries manage sparse linear sequences efficiently when they consist of subranges, which I think is what you'll get with Unicode input. > Accessing the high characters will simply take multiple states. > > > BTW: Felix 8 bit engine is blindingly fast: > > > > while(state && lexeme_end != buffer_end) > > state = matrix[*lexeme_end++][state]; > > > > To do this with 32-bit lexemes takes (2^32)*(numstates). No, only in worst case. It's slightly tricky to see but this representation is actually classifying characters automatically. Matrix here is NOT a matrix!! Matrix is an array of pointers to arrays. There are 256 pointers to arrays of n-states, but that is NOT 256 arrays of n-states. The actual cost is: 256 + nclasses * nstates where nclasses is typically smaller than 256. > A better, if > slightly slower, implementation is: > > let rec loop state = > let state' = state_compress.(state) > and lexeme' = lexeme_compress.(next_lexeme ()) > in > loop matrix.(lexeme').(state') > > Yeah, you have two extra table lookups, but in general much smaller > tables. I'm already doing your 'lexeme compress'. I don't actually know how to compress the state vectors. -- John Skaller, mailto:sk...@us... voice: 061-2-9660-0850, snail: PO BOX 401 Glebe NSW 2037 Australia Checkout the Felix programming language http://felix.sf.net |
From: Brian H. <bh...@sp...> - 2004-05-28 21:45:15
|
On 29 May 2004, skaller wrote: > On Sat, 2004-05-29 at 03:22, Brian Hurt wrote: > > On 29 May 2004, skaller wrote: > > > > > Either way you need a 32 bit DFA not an 8 bit one. > > > > This is exactly the problem. But for k=32 > > we're looking at a table size of 4G- unmanagable on current systems. > > Ah no. We're not. :))) I think I see how we're talking past each other, and the problem's at my end. I keep thinking in terms of classic lex/yacc implementations, which use no data structure other than arrays. In any DFA designed by a human, you're not going to have a large number of character classes- a few hundred at most, even for the most extreme cases (most of which arise from using the lexer to parse out keywords). The average is more likely to be only a few dozen. So what you need to be able to do is to determine the class a character is in upon input. Now, the lex scheme to do this is a simple array look-up. Which is great if the number of possible input characters is small, but blows complete chow if you have a large number of possible inputs. The problem is I hadn't realized I was assuming this was the only way to do things. Obviously, once I state it baldly like that, this is wrong. By far the most common case is going to be that all characters above some reasonably small limit are all going to be of the same class. "Reasonably small" meaning I don't mind having an array of that size- meaning 99.9% of the time you can get away with something as simple as: let c = input () in let c' = if c < limit then class.(c) else bigclass in ... A limit of 65536 wouldn't be bad- indeed, I might be tempted to do a limit of 256. Even if you can't do this, this is a good way to do the "easy" transisitions- high valued characters can be looked up with a more complicated algorithm (a trie, for example). So now I can handle 4G character points without using 4G of data. The same thing applies to the internal tables- use a sparse matrix representation. > > I think the solution is UTF-16. > > No and yes. Forget UTF-16 per se. I'm going to take > your proposition to be that a 16 bit engine is enough. > > You may be right. I think i might try it! No, I was thinking of UTF-16. Which represents code points less than 32768 as single 2-byte characters, and code points higher than that as multiple 2-byte characters, like UTF-8 does. This means that recognizing characters higher than 32767 takes multiple states. Now I've swung back around- I don't think this is worth it. The extra states are too much overhead, it's cheaper to just deal with 32-bit code points. > > > Oh, the iterim tables are going to be large- > > uncompressed tables of 100M or more won't be unusual for large DFAs. > > Yeah, that's a problem generating the DFA. Use sparse matricies. Especially if they're extended sparse matricies that can handle character classes, like [:digit:], [:space:], etc. The default transisition, of course, would be to the error state. I'd bet you wouldn't have more than a dozen or so transitions in any given state, meaning your internal table isn't more than a few 10's of K in size. Note that once you're dealing with the sparse matrix representation, the size of the matrix is a function primarily of the number of states and the average number of transitions per state- not of the number of different possible code points. So there's no advantage to only dealing with 16-bit characters and not 32-bit characters. > I don't actually know how to compress the state vectors. Same way you compress state vectors. If you have two states which have exactly the same transitions (duplicate columns), you collapse them into one state class. -- "Usenet is like a herd of performing elephants with diarrhea -- massive, difficult to redirect, awe-inspiring, entertaining, and a source of mind-boggling amounts of excrement when you least expect it." - Gene Spafford Brian |
From: skaller <sk...@us...> - 2004-05-29 06:55:56
|
On Sat, 2004-05-29 at 07:52, Brian Hurt wrote: > > > This is exactly the problem. But for k=32 > > > we're looking at a table size of 4G- unmanagable on current systems. > > > > Ah no. We're not. :))) > > I think I see how we're talking past each other, and the problem's at my > end. I keep thinking in terms of classic lex/yacc implementations, which > use no data structure other than arrays. Yes, but do not abandon this thought. In the worst case, any data structure OTHER than an array will be more expensive than an array. Arrays are good! > In any DFA designed by a human, you're not going to have a large number of > character classes- a few hundred at most, Yes, that is the idea. If I can comment on an apparently unrelated issue: people often say it isn't possible to determine mechanically if a program will terminate. They're WRONG! This is a really serious mistake made by theoreticians. There is a difference between *arbitrary program* and *program I wrote which I know should terminate*. I have a reason for believing my program should terminate: I wrote it to do so. If a smart compiler can't prove it, my code is too complex. Ok, so this actually applies to the regexp thing. However it isn't clear you are right that a small (a few Meg) regexp specification necessarily leads to a small ( a few Meg) number of character classes. I think this is probably the case, but if the number of classes is *exponential* in the size of a regexp it won't be the case. If it is linear it will be the case. The keywords example shows that the number of equivalence classes can grow very fast.. An important thing here is: my system and ocamllex both use regular *definitions* not merely regular expressions. Regular expressions seem to grow fast when you have regular definitions: the tree is expanded recursively by substituting in the regular definitions: regedef a = .. regdef b = a a a ... regdef c = a a a b b b ... regdef d = c c c b b b ... .... so we need to take this into account comparing the size of input to the size of the output. > So now I can handle 4G character points without using 4G of data. I don't think it is enough to just use an array plus a list. That's why I suggested a trie: it handles singleton classes plus subranges efficiently. The lookup is log n I think. > No, I was thinking of UTF-16. I know, but UTF-16 assumes you're talking about Unicode inputs. Regexps are useful for many other things too. For example a Yacc style parser is driven by a DFA generated by taking *tokens* as inputs. > > I don't actually know how to compress the state vectors. > > Same way you compress state vectors. If you have two states which have > exactly the same transitions (duplicate columns), you collapse them into > one state class. In practice, minimising a DFA rarely seems to save anything. However it may be possible to compress individual transition vectors, since 90% of the states you're going to are likely to be 'error'. But I don't know how effective this would be. -- John Skaller, mailto:sk...@us... voice: 061-2-9660-0850, snail: PO BOX 401 Glebe NSW 2037 Australia Checkout the Felix programming language http://felix.sf.net |
From: Alain F. <Ala...@en...> - 2004-05-29 07:56:53
|
On 29 May 2004, skaller wrote: > Tries handle subranges efficiently, and so they're the obvious > choice for classification. Trie usually denotes a tree-shaped datastructure that hold strings by sharing prefixes. I guess this is not what you are talking about. FWIW, my lexer generator ulex implements regular expression over Unicode (parsing the encoding is done independently). The transition relation is implemented as a mix of a dispatch table (for small code points), and a tree of if-then-else branches to implement dichotomy (I guess this is what you refer to as trie). Lookup has then O(log n) complexity, where n is the number of different ranges for the current state (log n <= 21 for Unicode), but if the input is only, say, in the Latin 1 character set (initial 256 codes long prefix of Unicode), this is O(1). My old lexer generator wlex used a different technique: it was a simple patch to ocamllex to introduce an intermediate "classification" layer between the lexbuf and the automaton; the layer would transform the byte stream into a class stream, where classes are user-declared. All the state of the ocamllex automaton would use the same classification layer. -- Alain |
From: skaller <sk...@us...> - 2004-05-29 09:18:25
|
On Sat, 2004-05-29 at 17:56, Alain Frisch wrote: > On 29 May 2004, skaller wrote: > > > Tries handle subranges efficiently, and so they're the obvious > > choice for classification. > > Trie usually denotes a tree-shaped datastructure that hold strings by > sharing prefixes. I guess this is not what you are talking about. Actually I think it is! What we're talking about is a small set of subranges of the Unicode code point set, and how to quickly determine if a 32 bit code point is in a particular such subset of Unicode, or, more generally, which class of a partition it is in. The idea is that a subrange of characters like 'a-z' tend to share a common prefix in terms of the bitstring representing the binary code. You could use a binary tree and index it one bit at a time. A trie is just the same idea, except it is smarter about organising the nodes to have more than 2 branches, so the branch selection is slower, but the depth of the tree is less so the overall performance is much better. -- John Skaller, mailto:sk...@us... voice: 061-2-9660-0850, snail: PO BOX 401 Glebe NSW 2037 Australia Checkout the Felix programming language http://felix.sf.net |