From: Richard J. <ri...@an...> - 2006-03-08 15:33:26
|
This is a problem I often have - count "things" in an imperative way. As an example, read a document and count the frequency of each word in the document. The way I normally solve it is something like this: let results = Hashtbl.create 31 in List.iter ( fun word -> try let count = Hashtbl.find results word in Hashtbl.replace results word (count+1) with Not_found -> Hashtbl.add results word 1 ) words; let results = Hashtbl.fold (fun word count xs -> (count, word) :: xs) results [] in (* ... *) It's not particularly elegant ... Is there a better structure that I should be using, or should we add one to Extlib? Rich. PS. Note that "words" is only an example. In real life I'm processing gigabytes of "things", and they don't live in a convenient list in memory either -- hence the imperative approach. -- Richard Jones, CTO Merjis Ltd. Merjis - web marketing and technology - http://merjis.com Team Notepad - intranets and extranets for business - http://team-notepad.com |
From: Brian H. <bh...@sp...> - 2006-03-08 16:54:57
|
On Wed, 8 Mar 2006, Richard Jones wrote: > > This is a problem I often have - count "things" in an imperative way. > As an example, read a document and count the frequency of each word in > the document. That's not a bad way. If you want to do it functionally, you could use a Map instead of a hash table: module StringMap = Map.Make(String); let count_words (f: unit -> string option) = let rec loop cnt = match (f ()) with | None -> List.rev (StringMap.fold (fun k c l -> (k, c) :: l) cnt []) | Some(word) -> let cnt = try let c = StringMap.find word cnt in StringMap.add word (c+1) cnt with | Not_found -> StringMap.add word 1 cnt in loop cnt in loop StringMap.empty ;; Now, the disadvantage of doing it this way is that it'll probably be somewhat slower than using a hash table- doing the find and add are O(log N) operations vr.s O(1) operations for hash tables, however the Map based implementation has lower constant factors (comparing two strings is generally cheap compared to the cost of calculating a hash), so the performance loss isn't as large as you might think. The advantage of doing things this way is that the strings are returned in sorted order. I don't know if this is an advantage or not. I'd also be tempted to return the StringMap instead of the equivelent list- walking a string map is about as expensive as walking a list (OK, a little more expensive, but not much). And keeping it in the map allows for O(log N) accesses to any given element. Wether I'd do this or not depends upon what uses the calling code has for the table. I don't think this is a common enough problem to be worth putting into a standard library, however. Brian |
From: Martin J. <mar...@em...> - 2006-03-08 19:13:45
|
On Wed, 8 Mar 2006, Richard Jones wrote: > > This is a problem I often have - count "things" in an imperative way. > As an example, read a document and count the frequency of each word in > the document. > > The way I normally solve it is something like this: > > let results = Hashtbl.create 31 in > List.iter ( > fun word -> > try > let count = Hashtbl.find results word in > Hashtbl.replace results word (count+1) > with Not_found -> > Hashtbl.add results word 1 > ) words; > let results = > Hashtbl.fold (fun word count xs -> (count, word) :: xs) results [] in > (* ... *) > > It's not particularly elegant ... > > Is there a better structure that I should be using, or should we add > one to Extlib? Given the implementation of Hashtbl (buckets = immutable lists), it should be faster to not replace the table entry, but simply to increment an int ref. It also saves one hash operation instead of two, when the counter already exists. let r = try Hashtbl.find tbl key with Not_found -> let r = ref 0 in Hashtbl.add key r; r in incr r Martin > > Rich. > > PS. Note that "words" is only an example. In real life I'm processing > gigabytes of "things", and they don't live in a convenient list in > memory either -- hence the imperative approach. > > -- > Richard Jones, CTO Merjis Ltd. > Merjis - web marketing and technology - http://merjis.com > Team Notepad - intranets and extranets for business - http://team-notepad.com > > > ------------------------------------------------------- > This SF.Net email is sponsored by xPML, a groundbreaking scripting language > that extends applications into web and mobile media. Attend the live webcast > and join the prime developer group breaking into this new coding territory! > http://sel.as-us.falkag.net/sel?cmd=lnk&kid=110944&bid=241720&dat=121642 > _______________________________________________ > ocaml-lib-devel mailing list > oca...@li... > https://lists.sourceforge.net/lists/listinfo/ocaml-lib-devel > -- Martin Jambon, PhD http://martin.jambon.free.fr Edit http://wikiomics.org, bioinformatics wiki |
From: Bruno De F. <br...@de...> - 2006-03-09 09:19:05
|
Hello Richard, On 08 Mar 2006, at 16:33, Richard Jones wrote: > It's not particularly elegant ... > > Is there a better structure that I should be using, or should we add > one to Extlib? There is definitely a way to do it in a more elegant functional style, provided you have a general group_by function (which I think Extlib still lacks, and which I therefore try to plug here): (*s [group_by f l] creates an associative list that groups the elements of l according to their image under f. *) val group_by : ('a -> 'b) -> 'a list -> ('b * 'a list) list For example: # group_by String.length ["aa";"bbb";"abc";"bb"] ;; - : (int * string list) list = [(3, ["abc"; "bbb"]); (2, ["bb"; "aa"])] Now, with two more auxiliary functions: let identity x = x ;; (* Already present in Std *) let map_snd f (a,b) = (a, f b) ;; A concise solution to your problem can be given as: let results = List.map (map_snd List.length) (group_by identity words) ;; While this is a nice prototype, I obviously doubt this is what you want when counting "gigabytes of things". But then again, it's not entirely clear what you're asking for... For reference, this is an implementation of group_by: let group_by f list = List.fold_left (fun accu el -> let img = f el and found = ref false in let new_accu = List.rev_map (fun grp -> if !found || (fst grp) <> img then grp else begin found := true; (img,el::(snd grp)) end ) accu in if !found then new_accu else (img,[el]) :: accu ) [] list ;; Perhaps a more general solution would have a signature like: val group_by : ('a -> 'b) -> 'a Enum.t -> ('b, 'a Enum.t) Hashtbl.t Bye, Bruno |
From: Richard J. <ri...@an...> - 2006-03-09 09:49:16
|
On Thu, Mar 09, 2006 at 10:18:55AM +0100, Bruno De Fraine wrote: > While this is a nice prototype, I obviously doubt this is what you > want when counting "gigabytes of things". But then again, it's not > entirely clear what you're asking for... Specifically we're parsing Apache logfiles from [a very large company which you will have heard of]. They produce about 1 GB of raw logfiles / day, which we read in, line at a time, and attempt to deduce interesting things. There's no possibility of fitting the logfiles into memory. Much of the problem involves counting how many times certain events happen. Rich. -- Richard Jones, CTO Merjis Ltd. Merjis - web marketing and technology - http://merjis.com Team Notepad - intranets and extranets for business - http://team-notepad.com |
From: Bruno De F. <br...@de...> - 2006-03-09 11:33:57
|
Hello, On 09 Mar 2006, at 10:48, Richard Jones wrote: > Specifically we're parsing Apache logfiles from [a very large company > which you will have heard of]. They produce about 1 GB of raw > logfiles / day, which we read in, line at a time, and attempt to > deduce interesting things. There's no possibility of fitting the > logfiles into memory. Much of the problem involves counting how many > times certain events happen. OK, I see. But perhaps I should ask you then why exactly you think the current solution(s) are not elegant? By the way, notice that you can build something very similar to Brian Hurt's solution basically by using Enum.fold and Enum.from: module StringMap = Map.Make(String); let make_histogram : string Enum.t -> int StringMap.t = Enum.fold (fun word cnt -> try let c = StringMap.find word cnt in StringMap.add word (c+1) cnt with Not_found -> StringMap.add word 1 cnt ) StringMap.empty ;; let map_to_assoc_list m = StringMap.fold (fun k c l -> (k, c) :: l) m [] ;; let count_words (f: unit -> string) = map_to_assoc_list (make_histogram (Enum.from f)) ;; The argument to count_words should raise an exception when it is exhausted: # count_words (fun () -> try read_line () with End_of_file -> raise Enum.No_more_elements ) ;; aa bb abc bb abc <CTRL+D> - : (StringMap.key * int) list = [("aa", 1); ("abc", 2); ("bb", 2)] Bye, Bruno |
From: Richard J. <ri...@an...> - 2006-06-09 10:27:42
|
FWIW I wrote a simple Counter module, based on Hashtbl, which I find useful. I wonder if this would be a good addition to Extlib? Rich. ---------------------------------------------------------------------- (** Basic counting module. * $Id: counter.mli,v 1.2 2006/06/08 15:24:47 rich Exp $ *) type 'a t (** Count items of type ['a]. *) val create : unit -> 'a t (** Create a new counter. *) val incr : 'a t -> 'a -> unit (** [incr counter thing] adds one to the count of [thing]s in [counter]. *) val decr : 'a t -> 'a -> unit (** [decr counter thing] subtracts one from the count of [thing]s in [counter]. *) val add : 'a t -> 'a -> int -> unit (** [add counter thing n] adds [n] to the count of [thing]s in [counter]. *) val sub : 'a t -> 'a -> int -> unit (** [sub counter thing n] subtracts [n] from the count of [thing]s in [counter]. *) val set : 'a t -> 'a -> int -> unit (** [set counter thing n] sets the count of [thing]s to [n]. *) val get : 'a t -> 'a -> int (** [get counter thing] returns the count of [thing]s. *) val read : 'a t -> (int * 'a) list (** [read counter] reads the frequency of each thing. They are sorted * with the thing appearing most frequently first. *) val length : 'a t -> int (** Return the number of distinct things. See also {!Counter.total} *) val total : 'a t -> int (** Return the number of things counted (the total number of counts). * See also {!Counter.length} *) val clear : 'a t -> unit (** [clear counter] clears the counter. *) ---------------------------------------------------------------------- (* Basic counting module. * $Id: counter.ml,v 1.2 2006/06/08 15:24:47 rich Exp $ *) type 'a t = ('a, int ref) Hashtbl.t let create () = Hashtbl.create 13 let get_ref counter thing = try Hashtbl.find counter thing with Not_found -> let r = ref 0 in Hashtbl.add counter thing r; r let incr counter thing = let r = get_ref counter thing in incr r let decr counter thing = let r = get_ref counter thing in decr r let add counter thing n = let r = get_ref counter thing in r := !r + n let sub counter thing n = let r = get_ref counter thing in r := !r - n let set counter thing n = let r = get_ref counter thing in r := n let get counter thing = try !(Hashtbl.find counter thing) with Not_found -> 0 let read counter = let counts = Hashtbl.fold (fun thing r xs -> (!r, thing) :: xs) counter [] in List.sort (fun (a, _) (b, _) -> compare (b : int) (a : int)) counts let length = Hashtbl.length let total counter = let total = ref 0 in Hashtbl.iter (fun _ r -> total := !total + !r) counter; !total let clear counter = Hashtbl.clear counter ---------------------------------------------------------------------- -- Richard Jones, CTO Merjis Ltd. Merjis - web marketing and technology - http://merjis.com Team Notepad - intranets and extranets for business - http://team-notepad.com |
From: Amit D. <ami...@gm...> - 2006-06-15 15:46:09
|
I actually have similar code, but I tend to use it over floats more than ints. Maybe this calls for a functor-based design? module type NUMBER = sig type t val zero : t val unit : t val add : t -> t -> t val negate : t -> t end module Counter( N : NUMBER ) = struct let get_ref counter thing = try Hashtbl.find counter thing with Not_found -> let r = ref N.zero in Hashtbl.add counter thing r; r let incr counter thing = let r = get_ref counter thing in r := N.add !r N.one (...etc...) end On 6/9/06, Richard Jones < ri...@an...> wrote: > > > FWIW I wrote a simple Counter module, based on Hashtbl, which I find > useful. I wonder if this would be a good addition to Extlib? > > Rich. > > ---------------------------------------------------------------------- > (** Basic counting module. > * $Id: counter.mli,v 1.2 2006/06/08 15:24:47 rich Exp $ > *) > > type 'a t > (** Count items of type ['a]. *) > > val create : unit -> 'a t > (** Create a new counter. *) > > val incr : 'a t -> 'a -> unit > (** [incr counter thing] adds one to the count of [thing]s in [counter]. > *) > > val decr : 'a t -> 'a -> unit > (** [decr counter thing] subtracts one from the count of [thing]s in > [counter]. *) > > val add : 'a t -> 'a -> int -> unit > (** [add counter thing n] adds [n] to the count of [thing]s in [counter]. > *) > > val sub : 'a t -> 'a -> int -> unit > (** [sub counter thing n] subtracts [n] from the count of [thing]s in > [counter]. *) > > val set : 'a t -> 'a -> int -> unit > (** [set counter thing n] sets the count of [thing]s to [n]. *) > > val get : 'a t -> 'a -> int > (** [get counter thing] returns the count of [thing]s. *) > > val read : 'a t -> (int * 'a) list > (** [read counter] reads the frequency of each thing. They are sorted > * with the thing appearing most frequently first. > *) > > val length : 'a t -> int > (** Return the number of distinct things. See also {!Counter.total} *) > > val total : 'a t -> int > (** Return the number of things counted (the total number of counts). > * See also {!Counter.length} > *) > > val clear : 'a t -> unit > (** [clear counter] clears the counter. *) > ---------------------------------------------------------------------- > (* Basic counting module. > * $Id: counter.ml,v 1.2 2006/06/08 15:24:47 rich Exp $ > *) > > type 'a t = ('a, int ref) Hashtbl.t > > let create () = > Hashtbl.create 13 > > let get_ref counter thing = > try > Hashtbl.find counter thing > with > Not_found -> > let r = ref 0 in > Hashtbl.add counter thing r; > r > > let incr counter thing = > let r = get_ref counter thing in > incr r > > let decr counter thing = > let r = get_ref counter thing in > decr r > > let add counter thing n = > let r = get_ref counter thing in > r := !r + n > > let sub counter thing n = > let r = get_ref counter thing in > r := !r - n > > let set counter thing n = > let r = get_ref counter thing in > r := n > > let get counter thing = > try > !(Hashtbl.find counter thing) > with > Not_found -> 0 > > let read counter = > let counts = > Hashtbl.fold (fun thing r xs -> (!r, thing) :: xs) counter [] in > List.sort (fun (a, _) (b, _) -> compare (b : int) (a : int)) counts > > let length = Hashtbl.length > > let total counter = > let total = ref 0 in > Hashtbl.iter (fun _ r -> total := !total + !r) counter; > !total > > let clear counter = > Hashtbl.clear counter > ---------------------------------------------------------------------- > > -- > Richard Jones, CTO Merjis Ltd. > Merjis - web marketing and technology - http://merjis.com > Team Notepad - intranets and extranets for business - > http://team-notepad.com > > > _______________________________________________ > ocaml-lib-devel mailing list > oca...@li... > https://lists.sourceforge.net/lists/listinfo/ocaml-lib-devel > |
From: Richard J. <ri...@an...> - 2006-06-16 12:09:12
|
On Thu, Jun 15, 2006 at 04:46:03PM +0100, Amit Dubey wrote: > I actually have similar code, but I tend to use it over floats more than > ints. Maybe this calls for a functor-based design? I think it's a fine idea. Are there performance implications? For reasons which I can't really fathom, functors don't seem to be fully "inlined" when they are created, presumably resulting in extra overhead. What do other people think about the general concept? Rich. -- Richard Jones, CTO Merjis Ltd. Merjis - web marketing and technology - http://merjis.com Team Notepad - intranets and extranets for business - http://team-notepad.com |
From: Bardur A. <sp...@sc...> - 2006-06-16 14:30:29
|
Richard Jones wrote: > On Thu, Jun 15, 2006 at 04:46:03PM +0100, Amit Dubey wrote: >> I actually have similar code, but I tend to use it over floats more than >> ints. Maybe this calls for a functor-based design? > > I think it's a fine idea. Are there performance implications? For > reasons which I can't really fathom, functors don't seem to be fully > "inlined" when they are created, presumably resulting in extra > overhead. > Yup, I believe OCaml currently uses dynamic dispatch for functors. I don't think there are any theoretical reasons that this necessarily has to be the case. > What do other people think about the general concept? +1, I like it and I think it should be added to ExtLib. I've used similar code a number of times, but have used the code reuse technique known as Copy and Paste, just because I couldn't be bothered to package something up. Even though it may be a tad slower (no inlining), it does allow you to create Counters which count objects of an identical type *without* being "compatible". This allows a bit more type safety if you want it. Finally, I don't like the name Counter; it's a bit too generic for my tastes. Maybe call it Histogram instead? Cheers, -- Bardur Arantsson <bardurREMOVE@THISscientician.net> "Mr. T to pity fool." http://www.theonion.com |
From: Bardur A. <sp...@sc...> - 2006-06-16 14:49:14
|
Bardur Arantsson wrote: > Richard Jones wrote: >> On Thu, Jun 15, 2006 at 04:46:03PM +0100, Amit Dubey wrote: >>> I actually have similar code, but I tend to use it over floats more than >>> ints. Maybe this calls for a functor-based design? >> I think it's a fine idea. Are there performance implications? For >> reasons which I can't really fathom, functors don't seem to be fully >> "inlined" when they are created, presumably resulting in extra >> overhead. >> > [--snip--] I said > Even though it may be a tad slower (no inlining), it does allow you to > create Counters which count objects of an identical type *without* being > "compatible". This allows a bit more type safety if you want it. I meant to say: > I would use the functor approach. Even though it may be [...] Cheers, -- Bardur Arantsson <bardurREMOVE@THISscientician.net> Sticks and stones may break my bones, but hollow-points expand on impact. |