From: Nicolas C. <war...@fr...> - 2003-05-22 08:42:14
|
Added : val ends_with : string -> string -> bool val find : string -> string -> int (* find a substring *) and implemented all ExtString functions. Nicolas Cannasse |
From: Nicolas C. <war...@fr...> - 2003-05-22 09:54:49
|
> > val find : string -> string -> int (* find a substring *) > > Using KMP or some other technique? (i.e. if it takes O(m * n), where m and > n are lengths of strings respectively? or O(log m * n + m)) Right now only a naive O(m*n) technique. Please check the CVS, and submit me some code if you have a better implementation. Nicolas Cannasse |
From: Michal M. <mal...@pl...> - 2003-05-22 12:10:32
|
On Thu, May 22, 2003 at 06:51:04PM +0900, Nicolas Cannasse wrote: > > > val find : string -> string -> int (* find a substring *) > > > > Using KMP or some other technique? (i.e. if it takes O(m * n), where m and > > n are lengths of strings respectively? or O(log m * n + m)) > > Right now only a naive O(m*n) technique. Please check the CVS, and submit me > some code if you have a better implementation. You can have a look at http://caml.inria.fr/Examples/oc/basics/kmp.ml I tried rewrite of while loop to tail recursion, but it looks ugly. -- : Michal Moskal :: http://www.kernel.pl/~malekith : GCS {C,UL}++++$ a? !tv : PLD Linux ::::::::: Wroclaw University, CS Dept : {E-,w}-- {b++,e}>+++ h |
From: Brian H. <bri...@ql...> - 2003-05-22 15:37:40
|
On Thu, 22 May 2003, Michal Moskal wrote: > On Thu, May 22, 2003 at 06:51:04PM +0900, Nicolas Cannasse wrote: > > > > val find : string -> string -> int (* find a substring *) > > > > > > Using KMP or some other technique? (i.e. if it takes O(m * n), where m and > > > n are lengths of strings respectively? or O(log m * n + m)) > > > > Right now only a naive O(m*n) technique. Please check the CVS, and submit me > > some code if you have a better implementation. > > You can have a look at http://caml.inria.fr/Examples/oc/basics/kmp.ml > > I tried rewrite of while loop to tail recursion, but it looks ugly. > > Sanity check this code, if you will: let kmp p = let m = String.length p in let init_next p = let next = Array.create m 0 in let rec loop i j = if i >= (m - 1) then next else if p.(i) = p.(j) then begin next.(i+1) <- j+1; loop (i+1) (j+1) end else if j = 0 then begin next.(i+1) <- 0; loop (i+1) j end else loop i next.(j) in loop 1 0 in let next = init_next p in let n = String.length s in let rec loop2 i j = if (j >= m) || (i >= n) then if j >= m then i - m else raise Not_found else if s.(i) = p.(j) then loop2 (i+1) (j+1) else if j = 0 then loop2 (i+1) j else loop2 i next.(j) in loop2 0 0 ;; I don't think it's that ugly. And it gets rid of all the references- which probably means a speed up (haven't tested it yet though). Although one thing we might want to consider: actually exporting the kmp_init function and kmp function seperately. This way if I wanted to find the same substring in a lot of different strings, I don't have to constantly recreate the next table. Maybe something like: let _kmp_init_next p m = let next = Array.create m 0 in let rec loop i j = if i >= (m - 1) then next else if p.(i) = p.(j) then begin next.(i+1) <- j+1; loop (i+1) (j+1) end else if j = 0 then begin next.(i+1) <- 0; loop (i+1) j end else loop i next.(j) in loop 1 0 let kmp_init_next p = _kmp_init_next p (String.length p) let _kmp next p s m = let n = String.length s in let rec loop2 i j = if (j >= m) || (i >= n) then if j >= m then i - m else raise Not_found else if s.(i) = p.(j) then loop2 (i+1) (j+1) else if j = 0 then loop2 (i+1) j else loop2 i next.(j) in loop2 0 0 let kmp next p s = _kmp next p s (String.length p) let find p s = let m = String.length p in let next = _kmp_init_next p m in _kmp next p s m Where kmp_init_next and kmp would be exported in the .mli file. You could then write code like: let find_all_substring substr strlist = let next = kmp_init_next substr in List.find_all (fun s -> try let _ = kmp next substr s in true with Not_found -> false) strlist ;; Brian |
From: Remi V. <van...@la...> - 2003-05-22 16:57:09
|
Brian Hurt <bri...@ql...> writes: > On Thu, 22 May 2003, Michal Moskal wrote: > >> On Thu, May 22, 2003 at 06:51:04PM +0900, Nicolas Cannasse wrote: >> > > > val find : string -> string -> int (* find a substring *) >> > > >> > > Using KMP or some other technique? (i.e. if it takes O(m * n), where m and >> > > n are lengths of strings respectively? or O(log m * n + m)) >> > >> > Right now only a naive O(m*n) technique. Please check the CVS, and submit me >> > some code if you have a better implementation. >> >> You can have a look at http://caml.inria.fr/Examples/oc/basics/kmp.ml >> >> I tried rewrite of while loop to tail recursion, but it looks ugly. >> >> > > Sanity check this code, if you will: > > let kmp p = > let m = String.length p in > let init_next p = > let next = Array.create m 0 in > let rec loop i j = > if i >= (m - 1) then next > else if p.(i) = p.(j) then > begin > next.(i+1) <- j+1; > loop (i+1) (j+1) > end > else if j = 0 then > begin > next.(i+1) <- 0; > loop (i+1) j > end > else > loop i next.(j) > in > loop 1 0 > in > let next = init_next p in > let n = String.length s in > let rec loop2 i j = > if (j >= m) || (i >= n) then > if j >= m then i - m > else raise Not_found > else if s.(i) = p.(j) then > loop2 (i+1) (j+1) > else if j = 0 then > loop2 (i+1) j > else > loop2 i next.(j) > in > loop2 0 0 > ;; > > I don't think it's that ugly. And it gets rid of all the references- > which probably means a speed up (haven't tested it yet though). Although > one thing we might want to consider: actually exporting the kmp_init > function and kmp function seperately. This way if I wanted to find the > same substring in a lot of different strings, I don't have to constantly > recreate the next table. Maybe something like: well, in the original code, one have : let kmp p = let next = init_next p and m = String.length p in function s -> So Partial application of the first argument already compute the next table. and so you can call the computed function on several different string. [...] -- Rémi Vanicat va...@la... http://dept-info.labri.u-bordeaux.fr/~vanicat |
From: Michal M. <mal...@pl...> - 2003-05-22 18:04:54
|
On Thu, May 22, 2003 at 06:57:01PM +0200, Remi Vanicat wrote: > well, in the original code, one have : > > let kmp p = > let next = init_next p > and m = String.length p in > function s -> > > So Partial application of the first argument already compute the next > table. and so you can call the computed function on several different > string. With the minor performance hit of additional closure allocation. -- : Michal Moskal :: http://www.kernel.pl/~malekith : GCS {C,UL}++++$ a? !tv : PLD Linux ::::::::: Wroclaw University, CS Dept : {E-,w}-- {b++,e}>+++ h |
From: Brian H. <bri...@ql...> - 2003-05-22 18:05:34
|
On Thu, 22 May 2003, Remi Vanicat wrote: > well, in the original code, one have : > > let kmp p = > let next = init_next p > and m = String.length p in > function s -> > > So Partial application of the first argument already compute the next > table. and so you can call the computed function on several different > string. Check. I'd missed that. On the other hand, you compute the length of p twice- once in kmp and once in init_next. Which is what I was trying to eliminate. Maybe I was being stupid- how expensive is a String.length? I was assuming it was O(n), linear with the length of the string. Could we still preserve this, and only do a single string match, with code something like: let find p = let m = String.length p in let next = String.KMP.compile' p m in fun s -> String.KMP.match' next m s Where compile' and match' are versions of compile and match that take pre-computed string lengths- let compile p = compile' p (String.length p) and match pat s = match' pat (Array.length pat) s ? Brian |
From: Michal M. <mal...@pl...> - 2003-05-22 18:53:24
|
On Thu, May 22, 2003 at 01:19:21PM -0500, Brian Hurt wrote: > On Thu, 22 May 2003, Remi Vanicat wrote: > > > well, in the original code, one have : > > > > let kmp p = > > let next = init_next p > > and m = String.length p in > > function s -> > > > > So Partial application of the first argument already compute the next > > table. and so you can call the computed function on several different > > string. > > Check. I'd missed that. On the other hand, you compute the length of p > twice- once in kmp and once in init_next. Which is what I was trying to > eliminate. Maybe I was being stupid- how expensive is a String.length? I > was assuming it was O(n), linear with the length of the string. String.length is O(1) (with very small ,,1'' :-) -- : Michal Moskal :: http://www.kernel.pl/~malekith : GCS {C,UL}++++$ a? !tv : PLD Linux ::::::::: Wroclaw University, CS Dept : {E-,w}-- {b++,e}>+++ h |
From: Brian H. <bri...@ql...> - 2003-05-22 18:48:16
|
On Thu, 22 May 2003, Michal Moskal wrote: > String.length is O(1) (with very small ,,1'' :-) You learn something new everyday. In this case, my optimizations to not call String.length are false optimizations and should be removed. Especially as they open the door for "ill-defined behavior" (aka bugs). Brian |
From: Michal M. <mal...@pl...> - 2003-05-22 17:01:56
|
On Thu, May 22, 2003 at 10:51:25AM -0500, Brian Hurt wrote: > On Thu, 22 May 2003, Michal Moskal wrote: > > > On Thu, May 22, 2003 at 06:51:04PM +0900, Nicolas Cannasse wrote: > > > > > val find : string -> string -> int (* find a substring *) > > > > > > > > Using KMP or some other technique? (i.e. if it takes O(m * n), where m and > > > > n are lengths of strings respectively? or O(log m * n + m)) > > > > > > Right now only a naive O(m*n) technique. Please check the CVS, and submit me > > > some code if you have a better implementation. > > > > You can have a look at http://caml.inria.fr/Examples/oc/basics/kmp.ml > > > > I tried rewrite of while loop to tail recursion, but it looks ugly. > > > > > > Sanity check this code, if you will: > > let kmp p = > let m = String.length p in > let init_next p = > let next = Array.create m 0 in > let rec loop i j = > if i >= (m - 1) then next ^^^^^^^ I'm not quite sure how good is ocamlc wrt moving constants out of loop, but probably moving this substraction out of the loop would be The Right Thing. > I don't think it's that ugly. Matter of taste I believe. > And it gets rid of all the references- > which probably means a speed up (haven't tested it yet though). Maybe refs are optimized away by the compiler? But I don't know. > Although > one thing we might want to consider: actually exporting the kmp_init > function and kmp function seperately. This way if I wanted to find the > same substring in a lot of different strings, I don't have to constantly > recreate the next table. Maybe something like: [snip] I think better as submodule: extString.mli (or whatever): module KMP = sig type pattern val compile : string -> pattern val match : pattern -> string -> int end val find : string -> string -> int -- : Michal Moskal :: http://www.kernel.pl/~malekith : GCS {C,UL}++++$ a? !tv : PLD Linux ::::::::: Wroclaw University, CS Dept : {E-,w}-- {b++,e}>+++ h |
From: Brian H. <bri...@ql...> - 2003-05-22 17:58:16
|
On Thu, 22 May 2003, Michal Moskal wrote: > > Sanity check this code, if you will: > > > > if i >= (m - 1) then next > ^^^^^^^ > > I'm not quite sure how good is ocamlc wrt moving constants out of loop, > but probably moving this substraction out of the loop would be The Right > Thing. For a simple decrement I wouldn't worry too much about strength reduction. At worst it's a single clock cycle cost- i.e. too low to notice generally. > > And it gets rid of all the references- > > which probably means a speed up (haven't tested it yet though). > > Maybe refs are optimized away by the compiler? But I don't know. I don't know either. > > > Although > > one thing we might want to consider: actually exporting the kmp_init > > function and kmp function seperately. This way if I wanted to find the > > same substring in a lot of different strings, I don't have to constantly > > recreate the next table. Maybe something like: > [snip] > > I think better as submodule: Yes. Can we do modules in modules? I.e.: String.KMP.compile and String.KMP.match, like: let pat = String.KMP.compile p in String.find_all (fun s -> try let _ = String.KMP.match pat s in true with Not_found -> false) strlist Brian |
From: Michal M. <mal...@pl...> - 2003-05-22 18:04:45
|
On Thu, May 22, 2003 at 01:12:02PM -0500, Brian Hurt wrote: > > I think better as submodule: > > Yes. Can we do modules in modules? Sure. -- : Michal Moskal :: http://www.kernel.pl/~malekith : GCS {C,UL}++++$ a? !tv : PLD Linux ::::::::: Wroclaw University, CS Dept : {E-,w}-- {b++,e}>+++ h |
From: Hideo B. <ba...@im...> - 2003-05-22 18:17:21
|
Hello, A minor comment. At Thu, 22 May 2003 10:51:25 -0500 (CDT), Brian Hurt wrote: > Sanity check this code, if you will: > > let kmp p = (snip) The algorithm looks to me like "Morris-Pratt" (MP), and not "Knuth-Morris-Pratt" (KMP). KMP tries to exploit a bit more information in the pattern in the preprocessing stage to reduce redundant comparisons during the matching. This may be a different issue but, I have heard (and read) that Boyer-Moore style algorithms are much faster in practice, often having a sub linear runtime (but I have never done any benchmarks myself). There are so many zillions of different algorithms! e.g.: http://www-igm.univ-mlv.fr/~lecroq/string/index.html but I guess almost everyone will only need one reasonable implementation. Just something to think about. -- Hideo Bannai <ba...@im...> Human Genome Center, Institute of Medical Science, University of Tokyo 4-6-1 Shirokanedai, Minato-ku, Tokyo 108-8639, Japan. Phone: +81-3-5449-5615 Fax : +81-3-5449-5442 |