Bruno Haible writes:
> Hashtables are certainly good for big lists. However, small lists (of size
> < 10) are the most frequent ones. Are you sure that your change didn't slow
> down the frequent case?
>
> I would recommend to find out the threshold from which the hash table solution
> is more efficient, and continue to use the old listbased (and nonconsing!)
> approach for the "small lists" case.
Even for the simple cases (eq/eql) there are interesting issues.
One is how to measure, another is when/where to measure.
I suggest not measuring at compile time, since this would
give different algorithms for different builds, and that's
a problem for such things as debugging. I'm not even sure
I'd want different algorithms on different architectures
unless they're radically different from any I know.
I suggest a simple model like this:
time to create hash table = m1 + m2 x size
time to add to hash table = m3
time to test membership in table = m4
time to test membership in list = m5 x size
where all of the m's are to be measured relatively
infrequently (less than once per clisp version), stored
in source files, and treated as constants.
The list version's cost for set-difference is then estimated
as (* (length x) (* m5 (length y)))
while the table version cost is estimated as
(+ m1 (* (+ m2 m3) (length y)) (* (length x) m4))
You might want to compute a few more constants that allow
you to avoid the computation of expected costs in at least
the smallest cases:
First, of course, if x or y is actually empty then you
know the answer.
Next, if m4 > (* m5 (length y)) ;; list vs last term of table
then clearly the list is better. That can be tested by
(< (length y) #.(floor m4 m5)) ;; floor so as to compare ints
Similarly,
if (+ m2 m3) > (* m5 (length x)) ;; list vs 2nd term of table
which can be tested by
(< (length x) #.(floor (+ m2 m3) m5))
I figure that in the cases where both x and y are nonempty, the cost
is already at least (length x) + (length y), so you haven't lost
much by testing those two things.
In the case where neither is true I think it's worth your time
to evaluate the estimated cost formulae.
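Putting the model together, a sketch in Lisp of what the dispatch might look like. The constant values below are made-up placeholders (the real ones would come from the infrequent measurements described above), and all of the names (+m1+ through +m5+, use-hash-table-p) are hypothetical:

```lisp
;; Placeholder values -- in practice these would be measured at most
;; once per release and stored in a source file, as suggested above.
(defconstant +m1+ 10.0)  ; create hash table: fixed part
(defconstant +m2+ 0.5)   ; create hash table: per-entry part
(defconstant +m3+ 1.0)   ; add one entry to the table
(defconstant +m4+ 1.0)   ; test membership in the table
(defconstant +m5+ 0.2)   ; test membership in a list, per element

(defun use-hash-table-p (x y)
  "Estimate whether the hash-table algorithm beats the list algorithm
for (set-difference x y).  NIL means use the list algorithm."
  (let ((nx (length x))
        (ny (length y)))
    (cond ;; an empty input: the answer is immediate either way
          ((or (zerop nx) (zerop ny)) nil)
          ;; m4 > m5*ny: the table's lookup term alone already loses
          ;; (the thresholds could be folded at read time with #. as above)
          ((< ny (floor +m4+ +m5+)) nil)
          ;; m2+m3 > m5*nx: the table's build term alone already loses
          ((< nx (floor (+ +m2+ +m3+) +m5+)) nil)
          ;; otherwise compare the two full cost estimates
          (t (< (+ +m1+ (* (+ +m2+ +m3+) ny) (* nx +m4+))
                (* nx +m5+ ny))))))
```

The two quick tests only cost a couple of fixnum comparisons, which is why they're cheap relative to the (length x) + (length y) work already committed.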
====
Things are much more complicated for equal and other more complex
tests, since the costs of the test and hash depend on the arguments
and the two algorithms do different numbers of those operations!
I can imagine all sorts of approaches, such as
 assume the costs of test and hash (m3, m4, m5) are very high
 assume the costs of test and hash are very low
 examine the data at run time to decide
 execute both in parallel
(so you're at least within a factor of 2 of the better one)
Note that it's even possible for the cost of hashing
to be very high while the cost of testing is very low
(if all of the objects to be compared are large but differ
in the first place you look).
I suggest in any case that you implement the two algorithms
as separate functions and make them both available in the
ext package so the user who knows which one he wants can
just call it by name.
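The two exported entry points might look roughly like this (the names are hypothetical; the real ones would be exported from the EXT package):

```lisp
(defun set-difference-via-list (x y &key (test #'eql))
  "The classical algorithm: O(|x| * |y|) tests, no auxiliary consing."
  (remove-if (lambda (e) (member e y :test test)) x))

(defun set-difference-via-hash (x y &key (test #'eql))
  "The hash-table algorithm: O(|x| + |y|) operations plus hashing.
TEST must be one of the standard hash-table tests (eq, eql, equal, equalp)."
  (let ((table (make-hash-table :test test :size (length y))))
    (dolist (e y)
      (setf (gethash e table) t))
    (remove-if (lambda (e) (gethash e table)) x)))
```

Note the asymmetry: the list version accepts any test function, while the hash version is restricted to tests make-hash-table understands -- another reason to keep both available by name.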
Offhand, I think I'd prefer the approach of assuming test and hash
are both expensive but I expect there are lots of arguments that
I've not yet considered.
I hadn't actually intended for this to turn into a research project
but I'm afraid it's already too late.
