From: Bruno De F. <br...@de...> - 2006-03-09 11:33:57
Hello,

On 09 Mar 2006, at 10:48, Richard Jones wrote:

> Specifically we're parsing Apache logfiles from [a very large company
> which you will have heard of]. They produce about 1 GB of raw
> logfiles / day, which we read in, line at a time, and attempt to
> deduce interesting things. There's no possibility of fitting the
> logfiles into memory. Much of the problem involves counting how many
> times certain events happen.

OK, I see. But perhaps I should then ask why exactly you think the
current solution(s) are not elegant?

By the way, notice that you can build something very similar to Brian
Hurt's solution using little more than Enum.fold and Enum.from:

    module StringMap = Map.Make(String);;

    (* Fold over the enumeration, bumping the count of each word. *)
    let make_histogram : string Enum.t -> int StringMap.t =
      Enum.fold
        (fun word cnt ->
           try
             let c = StringMap.find word cnt in
             StringMap.add word (c+1) cnt
           with Not_found -> StringMap.add word 1 cnt
        )
        StringMap.empty
    ;;

    let map_to_assoc_list m =
      StringMap.fold (fun k c l -> (k, c) :: l) m []
    ;;

    let count_words (f: unit -> string) =
      map_to_assoc_list (make_histogram (Enum.from f))
    ;;

The argument to count_words should raise Enum.No_more_elements when it
is exhausted:

    # count_words
        (fun () ->
           try read_line ()
           with End_of_file -> raise Enum.No_more_elements
        )
      ;;
    aa
    bb
    abc
    bb
    abc
    <CTRL+D>
    - : (StringMap.key * int) list = [("aa", 1); ("abc", 2); ("bb", 2)]

Bye,

Bruno
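
For readers without ExtLib, the same streaming count can be sketched with the standard library alone. This is only an illustration under stated assumptions, not the code posted above: `Exhausted`, `fold_gen`, `bump` and `gen_of_list` are hypothetical names introduced here, with `Exhausted` standing in for `Enum.No_more_elements` and `fold_gen` standing in for `Enum.fold` applied to `Enum.from f`. It also assumes OCaml 4.02 or later for the `match ... with exception` form.

```ocaml
module StringMap = Map.Make (String)

(* Hypothetical stand-in for Enum.No_more_elements. *)
exception Exhausted

(* Fold [f] over the values produced by [next] until it raises Exhausted.
   Only the accumulator is kept, so the input is never held in memory. *)
let fold_gen f init next =
  let rec loop acc =
    match next () with
    | x -> loop (f x acc)
    | exception Exhausted -> acc
  in
  loop init

(* Increment the count stored for [word], starting from 0 if absent. *)
let bump word cnt =
  let c = try StringMap.find word cnt with Not_found -> 0 in
  StringMap.add word (c + 1) cnt

(* StringMap.bindings returns the pairs sorted by increasing key. *)
let count_words next =
  StringMap.bindings (fold_gen bump StringMap.empty next)

(* Turn a list into a generator function, for demonstration only. *)
let gen_of_list l =
  let rest = ref l in
  fun () ->
    match !rest with
    | [] -> raise Exhausted
    | x :: tl -> rest := tl; x

let () =
  count_words (gen_of_list ["aa"; "bb"; "abc"; "bb"; "abc"])
  |> List.iter (fun (w, n) -> Printf.printf "%s %d\n" w n)
```

To count words read from standard input instead, the generator could be `fun () -> try read_line () with End_of_file -> raise Exhausted`. Memory use then grows with the number of distinct words rather than with the size of the input.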