From: Jesse G. <je...@wi...> - 2004-06-30 01:10:34
|
Hello, I recently had need of a list intersection function. I asked the beginners list how best to approach the problem, then I came up with one myself, and two other people submitted functions. Here is my function: let intersect il1 il2 = let f i = begin let ret = List.exists (fun x -> x=i ) il2 in match ret with false -> None | true -> Some i end in ExtList.List.filter_map f il1 ;; I'm currently asking permission to post the tail recursive version someone else came up with. Is anyone interested in adding this to extlib? Looks fairly useful to me, and the type is polymorphic, so it looks general purpose enough: val intersect : 'a list -> 'a list -> 'a list = <fun> -- Jesse Guardiani, Systems Administrator WingNET Internet Services, P.O. Box 2605 // Cleveland, TN 37320-2605 423-559-LINK (v) 423-559-5145 (f) http://www.wingnet.net |
From: Nicolas C. <war...@fr...> - 2004-06-30 07:05:35
|
> Hello, > > I recently had need of a list intersection function. > I asked the beginners list how best to approach the > problem, then I came up with one myself, and two > other people submitted functions. Here is my function: [...] I would prefer not to include such a function in ExtLib. The main reason is about complexity : intersect on lists is O(n2) , so very costly. If you need such operation, you should really use another data structure, such as Set or Map, or ExtLib PMap if you don't want to deal with functors. Regards, Nicolas Cannasse |
From: Jesse G. <je...@wi...> - 2004-06-30 13:54:55
|
Nicolas Cannasse wrote: >> Hello, >> >> I recently had need of a list intersection function. >> I asked the beginners list how best to approach the >> problem, then I came up with one myself, and two >> other people submitted functions. Here is my function: > [...] > > I would prefer not to include such a function in ExtLib. The main reason > is about complexity : intersect on lists is O(n2) , so very costly. If you > need such operation, you should really use another data structure, such as > Set or Map, or ExtLib PMap if you don't want to deal with functors. Are you saying that the following code (for int lists) is faster than the generic list intersect function I posted for large lists? module IntSet = Set.Make ( struct type t = int let compare = Pervasives.compare end );; let list_to_set = List.fold_left (fun x y -> IntSet.add y x) IntSet.empty ;; let inter_int_list il1 il2 = let t1 = list_to_set il1 in let t2 = list_to_set il2 in IntSet.elements (IntSet.inter t1 t2); ;; And if so, why? -- Jesse Guardiani, Systems Administrator WingNET Internet Services, P.O. Box 2605 // Cleveland, TN 37320-2605 423-559-LINK (v) 423-559-5145 (f) http://www.wingnet.net |
From: Bardur A. <oca...@sc...> - 2004-06-30 14:03:17
|
On Wed, Jun 30, 2004 at 09:54:49AM -0400, Jesse Guardiani wrote: > Nicolas Cannasse wrote: > > >> Hello, > >> > >> I recently had need of a list intersection function. > >> I asked the beginners list how best to approach the > >> problem, then I came up with one myself, and two > >> other people submitted functions. Here is my function: > > [...] > > > > I would prefer not to include such a function in ExtLib. The main reason > > is about complexity : intersect on lists is O(n2) , so very costly. If you > > need such operation, you should really use another data structure, such as > > Set or Map, or ExtLib PMap if you don't want to deal with functors. > > Are you saying that the following code (for int lists) is faster than the > generic list intersect function I posted for large lists? > > module IntSet = Set.Make ( > struct > type t = int > let compare = Pervasives.compare > end > );; > > > let list_to_set = List.fold_left > (fun x y -> IntSet.add y x) > IntSet.empty > ;; > > > let inter_int_list il1 il2 = > let t1 = list_to_set il1 in > let t2 = list_to_set il2 in > IntSet.elements (IntSet.inter t1 t2); > ;; > > And if so, why? > Undoubtedly (for large enough n): The set-using method is O(n*log n): It performs O(n) set inserts which are O(log n) each. The list-only method is O(n^2): It basically goes over the other list for each element in the first list (that's what List.exists does). However, you might actually do slightly better (in practise, not theoretically) than the Set-based approach by just sorting the lists and doing pairwise comparisons on the resulting lists. Of course there are ways of sorting which can be faster than O(n log n) -- radix sort comes to mind -- but they are subject to various constraints. -- Bardur Arantsson <ba...@im...> <ba...@sc...> - My cat's breath smells like cat food. Ralph Wiggum | The Simpsons |
From: Brian H. <bh...@sp...> - 2004-06-30 14:51:36
|
On Wed, 30 Jun 2004, Jesse Guardiani wrote: > Are you saying that the following code (for int lists) is faster than the > generic list intersect function I posted for large lists? Yes. Even faster would be to use a specialized compare function, like: module IntSet = Set.Make ( struct type t = int let compare (x: int) y = if (x = y) then 0 else if (x < y) then -1 else 1 end );; > let list_to_set = List.fold_left > (fun x y -> IntSet.add y x) > IntSet.empty > ;; This is O(N log N). > > > let inter_int_list il1 il2 = > let t1 = list_to_set il1 in > let t2 = list_to_set il2 in > IntSet.elements (IntSet.inter t1 t2); > ;; This is O(N log N) as well. Maybe O(N), I haven't looked. Converting the set back to a list is O(N) as well. How much do you know about big-O notation? Sounds like little or nothing. Here's the core of the idea. Measure how long the algorithms take for various lengths of time, and you'll come up with some formula- given n, the number of elements in the set, computing the intersection will take a*n^2 + b*n + c seconds, for some constants a, b, and c, using your list method. Using a set, the formula will instead look like x*n*log(n) + y*n + z, for some constants x, y, and z. No matter what the constants are, there will be an n such that a*n^2 + b*n + c > x*n*log(n) + y*n + z. At that point or larger, the set based implementation starts being faster. This becomes more obvious if you look at how the two functions scale. If n elements takes t time, then how long does 2*n elements take? With the list-based algorithm, it'll take 4*t time, more or less. With the set based implementation, it'll take 2*t + c time, for some small c. The time the list based implementation takes grows faster than the set based implementation, and sooner or later it'll overtake and pass it. Worse yet, this generally happens sooner than you think. I'd predict that the list based implementation and the set based implementation will take the same amount of time for n somewhere between 3 and 100. Let's be generous and assume the balance point is n-64 (note: I haven't measured this, so don't depend upon this number! I'm just picking a number to make my math easier). At n=64 elements, both algorithms take the same time- 100 microseconds. How much time do they take at n=1024 elements? The list based algorithm will take about 25,600 microseconds. The Set based implementation, on the other hand, will take only about 2,700 microseconds (I get this number the following way- I know that k*64*log2(64) = 100 usec, solving for k = 0.26, and then I plug that into k*1024*log2(1024)) . At 64 elements they're the same speed, at 1024 elements, the set based implementation is 10 times faster. At 2^20th elements, the set based implementation is almost 5,000 times faster. This is why experienced programmers are much more interested in optimizing algorithsm- it's where the big payoffs are. Basically, you're not going to see 10-fold improvements in speed on the same machine. Maybe, just maybe, badly interepreted code vr.s super optimized compiled code has a 10-fold speed difference. Not much more than that. And 10-fold increases in computer performance take the better part of a decade- basically, 350MHz PIIIs vr.s 3.5GHz P4s. How long ago were 350MHz PIIIs hot? Six years? Seven years? At n=1024, the set based intersection on a 350MHz PIII is about as fast as the list based intersection on a 3.5GHz P4. -- "Usenet is like a herd of performing elephants with diarrhea -- massive, difficult to redirect, awe-inspiring, entertaining, and a source of mind-boggling amounts of excrement when you least expect it." - Gene Spafford Brian |
From: Jesse G. <je...@wi...> - 2004-06-30 15:19:16
|
Brian Hurt wrote: > On Wed, 30 Jun 2004, Jesse Guardiani wrote: > >> Are you saying that the following code (for int lists) is faster than the >> generic list intersect function I posted for large lists? > > Yes. Even faster would be to use a specialized compare function, like: > > module IntSet = Set.Make ( > struct > type t = int > let compare (x: int) y = > if (x = y) then 0 > else if (x < y) then -1 > else 1 > end > );; > >> let list_to_set = List.fold_left >> (fun x y -> IntSet.add y x) >> IntSet.empty >> ;; > > This is O(N log N). > >> >> >> let inter_int_list il1 il2 = >> let t1 = list_to_set il1 in >> let t2 = list_to_set il2 in >> IntSet.elements (IntSet.inter t1 t2); >> ;; > > This is O(N log N) as well. Maybe O(N), I haven't looked. > > Converting the set back to a list is O(N) as well. > > How much do you know about big-O notation? Sounds like little or nothing. > Here's the core of the idea. Measure how long the algorithms take for > various lengths of time, and you'll come up with some formula- given n, > the number of elements in the set, computing the intersection will take > a*n^2 + b*n + c seconds, for some constants a, b, and c, using your list > method. Using a set, the formula will instead look like x*n*log(n) + y*n > + z, for some constants x, y, and z. > > No matter what the constants are, there will be an n such that > a*n^2 + b*n + c > x*n*log(n) + y*n + z. At that point or larger, the set > based implementation starts being faster. This becomes more obvious if > you look at how the two functions scale. If n elements takes t time, then > how long does 2*n elements take? With the list-based algorithm, it'll > take 4*t time, more or less. With the set based implementation, it'll > take 2*t + c time, for some small c. The time the list based > implementation takes grows faster than the set based implementation, and > sooner or later it'll overtake and pass it. > > Worse yet, this generally happens sooner than you think. I'd predict that > the list based implementation and the set based implementation will take > the same amount of time for n somewhere between 3 and 100. Let's be > generous and assume the balance point is n-64 (note: I haven't measured > this, so don't depend upon this number! I'm just picking a number to make > my math easier). At n=64 elements, both algorithms take the same time- > 100 microseconds. How much time do they take at n=1024 elements? The > list based algorithm will take about 25,600 microseconds. The Set based > implementation, on the other hand, will take only about 2,700 microseconds > (I get this number the following way- I know that k*64*log2(64) = 100 > usec, solving for k = 0.26, and then I plug that into k*1024*log2(1024)) . > At 64 elements they're the same speed, at 1024 elements, the set based > implementation is 10 times faster. At 2^20th elements, the set based > implementation is almost 5,000 times faster. > > This is why experienced programmers are much more interested in optimizing > algorithsm- it's where the big payoffs are. Basically, you're not going > to see 10-fold improvements in speed on the same machine. Maybe, just > maybe, badly interepreted code vr.s super optimized compiled code has a > 10-fold speed difference. Not much more than that. And 10-fold increases > in computer performance take the better part of a decade- basically, > 350MHz PIIIs vr.s 3.5GHz P4s. How long ago were 350MHz PIIIs hot? Six > years? Seven years? At n=1024, the set based intersection on a 350MHz > PIII is about as fast as the list based intersection on a 3.5GHz P4. That's all incredibly interesting. I plan to print it out and study it, but do you "experienced programmers" ever actually write code that DOES something? Or do you just fiddle with algorithms all day? :) I'm going to use my simplified list intersect function for now (my lists are very small, usually less than 10 elements), and later when I have some time I'll benchmark the two algorithms and see where they intersect. If the intersection point is high enough, then it may well be worth it to provide two versions of the function designed for different sized lists. I just don't know how much time I can spare to benchmark algorithms when I should be writing code that does actual work. Sometimes you have to leave theory behind and get your hands dirty... -- Jesse Guardiani, Systems Administrator WingNET Internet Services, P.O. Box 2605 // Cleveland, TN 37320-2605 423-559-LINK (v) 423-559-5145 (f) http://www.wingnet.net |
From: Bardur A. <oca...@sc...> - 2004-06-30 15:55:38
|
On Wed, Jun 30, 2004 at 11:19:07AM -0400, Jesse Guardiani wrote: > Brian Hurt wrote: > > > On Wed, 30 Jun 2004, Jesse Guardiani wrote: > > [--snip--] > That's all incredibly interesting. I plan to print it out and study it, > but do you "experienced programmers" ever actually write code that > DOES something? Or do you just fiddle with algorithms all day? :) Hehe, maybe it's just that some of us work on problems of a scale where you _have_ to think about these things. ;) > > I'm going to use my simplified list intersect function for now (my lists > are very small, usually less than 10 elements), and later when I have some > time I'll benchmark the two algorithms and see where they intersect. If the > intersection point is high enough, then it may well be worth it to provide > two versions of the function designed for different sized lists. My advice would be this: Just use your list implementation, and if it actually turns out to be performance bottleneck, just pre-sort the lists and use a recursive "pairwise" traversal to extract equal elements (if you don't know what I mean -- it's very similar to the merging step in mergesort). That's a trivial amount of code, easy to get right, and will probably run just as fast (in practice, at least) as a Set-based implementation. -- Bardur Arantsson <ba...@im...> <ba...@sc...> My nose hair is accelerating for reasons best known to itself. I used to cut it once every 30 years, now it's like once a month. I presume the body knows what it's doing, I'm very baffled. I wonder what's going to happen to it if it's going to need long nasal hair to deal with it. Billy Connoly |
From: Dustin S. <du...@sp...> - 2004-06-30 18:36:18
|
On Jun 30, 2004, at 8:19, Jesse Guardiani wrote: > I just don't know how much time I can spare to benchmark algorithms > when > I should be writing code that does actual work. Sometimes you have to > leave theory behind and get your hands dirty... It doesn't really work that way. A lot of stuff is very intuitive, and becomes more intuitive with experience. You know that a hash table lookup is roughly O(1) and traversing a linked list looking for a match is O(n). You also know that for small values of n, it doesn't really matter which you use (and the list may be faster). The trick, though, is that it does matter when you're creating a library for other people to use because you can't generally make the same assumptions about how the data structures will be used. I don't personally feel that every function ever added to a library has to be as efficient as possible right away, but if it's O(n) (or worse) when an O(log n) (or better) is possible, then things go bad when data sets start to grow. With experience, you keep this in mind any time you're writing code. I don't formally benchmark until stuff is pretty broken. In my own projects, when things are fast enough, they're fast enough. I did a graph utility that I use at work that should've been something near enough to O(n) and worked perfectly for a very long time. One day, instead of being a tiny fraction of a second, it was taking several minutes to run. It turned out that I had a small bug that was causing it to be more like O(n!) (although it's still unclear to me how it seemed to work so well before). I never formally analyzed the algorithm, but I knew what to expect. When things went bad, I knew it wasn't the algorithm that was the problem, so I just started looking to see where the application wasn't doing what I intended. -- Dustin Sallings |