From: Bill H. <goo...@go...> - 2009-02-05 02:02:33
Hi all,

Michael Abshoff suggested I post something here about what I've been working on, as there may be some people with expertise who can help me with a problem I have encountered to do with heaps. I've been implementing Pearce and Monagan's heap algorithm for multiplying multivariate polynomials in FLINT. See their paper here: http://www.cecm.sfu.ca/~monaganm/papers/PseudoHeap.pdf Roman Pearce has also been quite helpful in giving me clues. However, one aspect of the algorithm can still be improved, and there may be people on this list who have some ideas.

First I'll give the timings. I downloaded the latest sdmp and, with Roman's help, did some timings of sdmp on sage.math (note sdmp is not open source and runs in Maple). The two benchmarks I have been focusing on are:

1) Fateman's benchmark: multiply f * g where f = (1+x+y+z+t)^20 and g = f + 1.

2) Very sparse 5 variable benchmark: f * g where f = (1+x+y^2+z^3+t^5+u^7)^12 and g = (1+u+t^2+z^3+y^5+x^7)^12.

Currently sdmp takes 2.7s for Fateman's benchmark and 2.8s for the very sparse 5 variable benchmark on sage.math (a 2.66 GHz Xeon - Dunnington core). FLINT takes 2.8s for Fateman's benchmark and 4.8s for the very sparse 5 variable benchmark. Coefficient arithmetic is not relevant here: without doing any coefficient arithmetic at all, and just computing the monomials of the output, the 5 var benchmark still takes 4.2s in FLINT.

The algorithm for computing the monomials uses a heap. Basically the heap stores pairs of pointers, one to a monomial of f and one to a monomial of g. At each iteration the root of the heap is removed and the corresponding monomials are multiplied and written out. After terms f_i and g_j are multiplied, the pair f_i, g_{j+1} is added to the heap to replace the pair f_i, g_j. The heap is implemented as an array indexed from m = 1; the children of element m are at array indices 2*m and 2*m + 1.
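For concreteness, the basic (unchained) version of the algorithm just described can be sketched in Python, with monomials as exponent tuples and Python's heapq standing in for the hand-rolled array heap. The names and data representation here are mine, not FLINT's:

```python
import heapq

def mul_heap(f, g):
    # f and g are sparse polynomials: lists of (coefficient, exponent-tuple)
    # pairs sorted in increasing monomial order (smallest entries at the top
    # of the heap, matching the choice described in the text).
    if not f or not g:
        return []
    # One heap entry (product monomial, i, j) for each starting pair f_i, g_0.
    heap = [(tuple(a + b for a, b in zip(f[i][1], g[0][1])), i, 0)
            for i in range(len(f))]
    heapq.heapify(heap)
    out = []
    while heap:
        mon, i, j = heapq.heappop(heap)    # extract the root
        c = f[i][0] * g[j][0]
        if out and out[-1][1] == mon:      # accumulate equal output monomials
            out[-1] = (out[-1][0] + c, mon)
        else:
            out.append((c, mon))
        if j + 1 < len(g):                 # replace pair f_i, g_j by f_i, g_{j+1}
            nxt = tuple(a + b for a, b in zip(f[i][1], g[j + 1][1]))
            heapq.heappush(heap, (nxt, i, j + 1))
    return [t for t in out if t[0] != 0]
```

This is only the baseline; the chaining described below is what makes the real implementation fast.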
But to get really good timings one allows multiple pairs to "chain" at the same location in the heap. This happens when the pair f_i, g_j being added to the heap has a product monomial equal to that of an existing pair in the heap. Chaining pairs which yield equal product monomials in this way means that large numbers of equal monomials can be computed for the cost of a single heap extraction, which is a dramatic time saving. It also makes the heap much smaller, as many entries can now be chained at a single heap location, and that makes searching through the heap much less time consuming. Coefficient arithmetic can also be made faster by accumulating the sum of products yielding the same monomial in a single temporary.

In our case, chaining is important for making the heap smaller, but not for speeding up coefficient arithmetic (as mentioned, coefficient arithmetic is clearly not the cause of the inefficiency in the current FLINT code). The lower number of heap extractions probably also helps, but in the sparse case very little chaining actually occurs, so it isn't clear whether this is even relevant. Pearce and Monagan mention numerous improvements in their paper designed to improve the efficiency of heap chaining, and I believe I have implemented all of them correctly.

The difference between benchmarks 1 and 2 is sparsity. Fateman's benchmark is quite dense: each output term is the sum of about 831 products of pairs of input monomials f_i * g_j on average, whereas for the very sparse 5 var benchmark only 3 pairs contribute to each output term on average. Now, chaining is not supposed to be perfect. One doesn't get all 831 pairs chained at the root of the heap when it comes time to compute that output entry. However, all the contributing pairs will be chained in a small number of heap locations at the top of the heap (not necessarily right at the root).
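To illustrate what a chained extraction buys: if each heap slot holds a monomial together with a chain of (i, j) pairs, one extraction delivers every pair in the chain, and all the coefficient products can be summed into a single temporary. A minimal sketch, assuming that node layout (the layout is my own illustration, not sdmp's or FLINT's):

```python
def extract_chained(node, f, g):
    # node is (monomial, [(i, j), ...]); f and g are lists of
    # (coefficient, monomial) pairs.  Every product f_i * g_j in the
    # chain shares this monomial, so their coefficients are accumulated
    # into one temporary for the cost of a single heap extraction.
    mon, chain = node
    acc = 0
    for i, j in chain:
        acc += f[i][0] * g[j][0]
    return (acc, mon)
```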
As the heap is designed to deliver pairs in order, all pairs for a given output term will come out before pairs for the next term; they just may not all be chained at exactly the same (root) entry. My hypothesis is that in the very sparse case, the generic case is that each output term is built from a small number (2-3) of **unchained** individual entries from the top of the heap. This means that in order to improve the performance, one must have really fast heap handling code. That is where someone else may be able to help me. It *is* possible that I am not getting sufficient heap chaining, but then the question would be, "why not?" How would one improve things so that more heap chaining occurs on average for the sparse benchmark? In the following I am assuming that this is not the problem.

Here is what I currently do when a new entry is added to the heap.

1) As per the recommendation in Pearce and Monagan, after the previous term has been fully computed there will be a single gap in the heap at the root. Instead of reheapifying, we insert a new term in this spot and filter it down through the heap. First we need to find where in the heap this term should go. We swap it down through the heap until it reaches a spot where it is no larger than either of its children (unlike Pearce and Monagan, I made the arbitrary choice to have the smallest entries at the top of the heap). A number of comparisons are required at each level. We need to compare with both children of the current node: we want to know if the new term is equal to one of the children, or perhaps less than both. Essentially I determine the smallest child first, with a single comparison. If the new term is smaller than the smallest child, it is already in the right position. If not, I do a comparison for equality with the smallest child. If the new term is bigger, I swap it down. Checking equality with the largest child slows things down, so I don't do that.
2) Now the new node is in the right place, except that something can go wrong. If the new term actually equals an existing term in the heap, it gets chained with that term. That leaves a gap in the heap. It is from this point on that I am unsure how to proceed. At present I fill this gap with the entry at the bottom of the heap. This then gets filtered up first, then back down, until it settles into place. Filtering up only costs one comparison per level, as one only needs to compare with the parent. Equality is checked on the way up, but not on the way back down, as that is apparently slower. If an equal entry is found we chain, leaving another gap, and so the same procedure has to be iterated, placing the heap bottom in the gap, etc., until no gap remains. This all seems very inefficient, but I don't see a better way of filling gaps. If anyone has any ideas, please let me know.

3) If more than one entry is added to the heap, pretty much the same procedure is applied, except this time we don't start with an empty spot at the top of the heap. This time if equality occurs then we are happy; however, if no equal terms are encountered on the way down, the new entry eventually pushes a term out the bottom, dislodging it. We extend the heap by one place, put the dislodged entry at the new heap bottom, then sift it up first and back down again into place. In this case it is faster not to do any tests for equality, so no chaining occurs at this step and we avoid having to fill gaps.

I am convinced the problem with the sparse benchmark lies with step 2: the filling of empty spots caused by chaining, and having to iterate this procedure until the empty spot is vanquished. All swaps in the above are done virtually (we just move the heap entry into the gap and keep the entry we are trying to place in a temporary), so I believe the swaps have all been done efficiently.
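Steps 1 and 2 above might be sketched as follows, on a 1-indexed array heap whose slots are (monomial, chain-of-pairs) entries with the smallest monomial at the root. This is my own reconstruction from the description, not the FLINT code: sift_down_from_gap filters a new entry down from a gap, chaining on equality with the smaller child, and fill_gap repeatedly moves the bottom entry into a gap, sifting up (with equality checks) then down (without):

```python
def sift_down_from_gap(heap, gap, mon, chain):
    # Step 1: filter the entry (mon, chain) down from the gap at index
    # `gap`.  heap[0] is unused; node m has children at 2*m and 2*m + 1.
    # Returns the index of a new gap if the entry chained into an
    # existing equal node, else None.
    n = len(heap) - 1
    while True:
        c = 2 * gap
        if c > n:                          # fell off the bottom: place here
            heap[gap] = (mon, chain)
            return None
        if c + 1 <= n and heap[c + 1][0] < heap[c][0]:
            c += 1                         # one comparison finds the smaller child
        if mon < heap[c][0]:               # smaller than both children: done
            heap[gap] = (mon, chain)
            return None
        if mon == heap[c][0]:              # equal: chain, leaving a gap behind
            heap[c][1].extend(chain)
            return gap
        heap[gap] = heap[c]                # bigger: move the child up, descend
        gap = c

def fill_gap(heap, gap):
    # Step 2: fill a gap left by chaining with the bottom heap entry,
    # sifting it up first (checking equality on the way), then back down
    # (no equality checks); iterate while chaining keeps creating gaps.
    while True:
        mon, chain = heap.pop()            # remove the bottom entry
        if gap == len(heap):               # the gap *was* the bottom slot
            return
        chained = False
        while gap > 1:                     # sift up: one comparison per level
            p = gap // 2
            if mon > heap[p][0]:
                break                      # parent is smaller: stop rising
            if mon == heap[p][0]:          # equal: chain with the parent...
                heap[p][1].extend(chain)
                chained = True             # ...leaving a new gap at `gap`
                break
            heap[gap] = heap[p]            # smaller: move the parent down
            gap = p
        if chained:
            continue                       # fill the new gap with a new bottom
        n = len(heap) - 1                  # sift back down into place
        while True:
            c = 2 * gap
            if c > n:
                break
            if c + 1 <= n and heap[c + 1][0] < heap[c][0]:
                c += 1
            if mon <= heap[c][0]:
                break
            heap[gap] = heap[c]
            gap = c
        heap[gap] = (mon, chain)
        return
```

The swaps are "virtual" in the sense described above: entries move into the gap while the entry being placed stays in a temporary.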
Roman mentioned having to use particular compiler options to get his heap code to compile optimally, and he mentioned that gotos made some parts of the code faster. I've experimented with these, but that only gives 1% or so, not a factor of nearly 2. Whatever is wrong is not minor, but a major pain in the proverbial. I'm pretty convinced it is number 2 above.

Did anyone pay enough attention in CS classes to know how to do this more efficiently?

Bill.