From: <donsourceforgexx@is...>  2004-10-29 21:55:05

Bruno Haible writes:
> Hashtables are certainly good for big lists. However, small lists (of size
> < 10) are the most frequent ones. Are you sure that your change didn't slow
> down the frequent case?
>
> I would recommend to find out the threshold from which the hash table solution
> is more efficient, and continue to use the old list-based (and non-consing!)
> approach for the "small lists" case.

Even for the simple cases (eq/eql) there are interesting issues. One is how to measure; another is when/where to measure. I suggest not measuring at compile time, since that would give different algorithms for different builds, and that's a problem for such things as debugging. I'm not even sure I'd want different algorithms on different architectures unless they're radically different from any I know.

I suggest a simple model like this:

    time to create hash table        = m1 + m2 * size
    time to add to hash table        = m3
    time to test membership in table = m4
    time to test membership in list  = m5 * size

where all of the m's are to be measured relatively infrequently (less than once per clisp version), stored in source files, and treated as constants.

The list version's cost of set-difference is then estimated as

    (* (length x) (* m5 (length y)))

while the table version's cost is estimated as

    (+ m1 (* (+ m2 m3) (length y)) (* (length x) m4))

You might want to compute a few more constants that allow you to avoid the computation of expected costs in at least the smallest cases. First, of course, if x or y is actually empty then you know the answer. Next, if

    m4 > (* m5 (length y))  ;; list vs. last term of table

then clearly the list is better.
That can be tested by

    (< (length y) #.(floor m4 m5))  ;; floor so as to compare ints

Similarly, the list wins if

    (+ m2 m3) > (* m5 (length x))  ;; list vs. 2nd term of table

which can be tested by

    (< (length x) #.(floor (+ m2 m3) m5))

I figure in the cases where both x and y are nonempty the cost is already at least (length x) + (length y), so you haven't lost much by testing those two things. In the case where neither is true, I think it's worth your time to evaluate the estimated cost formulae.

====

Things are much more complicated for equal and other more complex tests, since the costs of the test and the hash depend on the arguments, and the two algorithms do different numbers of those operations! I can imagine all sorts of approaches, such as:

 - assume the costs of test and hash (m3, m4, m5) are very high
 - assume the costs of test and hash are very low
 - examine the data at run time to decide
 - execute both in parallel (so you're at least within a factor of 2 of the better one)

Note that it's even possible for the cost of hashing to be very high while the cost of testing is very low (if all of the objects to be compared are large but differ in the first place you look).

I suggest in any case that you implement the two algorithms as separate functions and make them both available in the ext package, so the user who knows which one he wants can just call it by name. Offhand I think I'd prefer the approach of assuming test and hash are both expensive, but I expect there are lots of arguments that I've not yet considered.

I hadn't actually intended for this to turn into a research project, but I'm afraid it's already too late.
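To make the proposal concrete, here is a sketch of the dispatch I have in mind, as plain Common Lisp rather than clisp internals. The constants +m1+ through +m5+ are hypothetical placeholder values standing in for the measured timings described above, and the function names are mine, not clisp's:

    ;; Sketch only -- not clisp's actual implementation.
    ;; +m1+ .. +m5+ are placeholder values for the measured constants.
    (defconstant +m1+ 10.0 "time to create a hash table (fixed part)")
    (defconstant +m2+ 1.0  "time to create a hash table, per element")
    (defconstant +m3+ 1.5  "time to add one element to a hash table")
    (defconstant +m4+ 1.2  "time to test membership in a hash table")
    (defconstant +m5+ 0.3  "time to test membership in a list, per element")

    (defun set-difference-list (x y)
      "The classic non-consing-style O(|x|*|y|) algorithm."
      (remove-if (lambda (e) (member e y)) x))

    (defun set-difference-table (x y)
      "The hash-table algorithm: O(|y|) to build the table, O(|x|) to probe."
      (let ((table (make-hash-table :test #'eql :size (length y))))
        (dolist (e y) (setf (gethash e table) t))
        (remove-if (lambda (e) (gethash e table)) x)))

    (defun set-difference-dispatch (x y)
      (let ((nx (length x)) (ny (length y)))
        (cond ((or (null x) (null y)) x)  ; empty cases: the answer is x
              ;; cheap threshold tests before the full cost formulae
              ((< ny (floor +m4+ +m5+)) (set-difference-list x y))
              ((< nx (floor (+ +m2+ +m3+) +m5+)) (set-difference-list x y))
              ;; full comparison of the two estimated costs
              ((< (* nx +m5+ ny)
                  (+ +m1+ (* (+ +m2+ +m3+) ny) (* nx +m4+)))
               (set-difference-list x y))
              (t (set-difference-table x y)))))

In a real build the thresholds would be computed once (e.g. with #. at read time, as in the tests above) rather than recomputed on every call.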