From: Bill H. <goo...@go...> - 2009-02-05 02:02:33
Hi all,

Michael Abshoff suggested I post something here about what I've been working on, as there may be some people with expertise who can help me with a problem I have encountered to do with heaps. I've been implementing Pearce and Monagan's heap algorithm for multiplying multivariate polynomials in FLINT. See their paper here: http://www.cecm.sfu.ca/~monaganm/papers/PseudoHeap.pdf Roman Pearce has also been quite helpful in giving me clues. However, one aspect of the algorithm can still be improved, and there may be people on this list who have some ideas.

First I'll give the timings. I downloaded the latest sdmp and, with Roman's help, did some timings of sdmp on sage.math (note sdmp is not open source and runs in Maple). The two benchmarks I have been focusing on are:

1) Fateman's benchmark: multiply f * g where f = (1+x+y+z+t)^20 and g = f + 1.

2) Very sparse 5 variable benchmark: f * g where f = (1+x+y^2+z^3+t^5+u^7)^12 and g = (1+u+t^2+z^3+y^5+x^7)^12.

Currently sdmp takes 2.7s for Fateman's benchmark and 2.8s for the very sparse 5 variable benchmark on sage.math (a 2.66 GHz Xeon - Dunnington core). FLINT takes 2.8s for Fateman's benchmark and 4.8s for the very sparse 5 variable benchmark. Coefficient arithmetic is not relevant here: without doing any coefficient arithmetic at all, and just computing the monomials of the output, the 5 var benchmark still takes 4.2s in FLINT.

The algorithm for computing the monomials uses a heap. Basically the heap stores pairs of pointers, one to a monomial of f and one to a monomial of g. At each iteration the root of the heap is removed and the corresponding monomials are multiplied and written out. After terms f_i and g_j are multiplied, the pair f_i, g_{j+1} is added to the heap to replace the pair f_i, g_j. The heap is implemented as an array indexed from m = 1; the children of element m are at array indices 2*m and 2*m + 1.
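For concreteness, the basic (unchained) version of the algorithm just described can be sketched in Python, with monomials as exponent tuples and Python's heapq standing in for the hand-rolled array heap. The names and data representation here are mine, not FLINT's:

```python
import heapq

def mul_heap(f, g):
    # f and g are sparse polynomials: lists of (coefficient, exponent-tuple)
    # pairs sorted in increasing monomial order (smallest entries at the top
    # of the heap, matching the choice described in the text).
    if not f or not g:
        return []
    # One heap entry (product monomial, i, j) for each starting pair f_i, g_0.
    heap = [(tuple(a + b for a, b in zip(f[i][1], g[0][1])), i, 0)
            for i in range(len(f))]
    heapq.heapify(heap)
    out = []
    while heap:
        mon, i, j = heapq.heappop(heap)    # extract the root
        c = f[i][0] * g[j][0]
        if out and out[-1][1] == mon:      # accumulate equal output monomials
            out[-1] = (out[-1][0] + c, mon)
        else:
            out.append((c, mon))
        if j + 1 < len(g):                 # replace pair f_i, g_j by f_i, g_{j+1}
            nxt = tuple(a + b for a, b in zip(f[i][1], g[j + 1][1]))
            heapq.heappush(heap, (nxt, i, j + 1))
    return [t for t in out if t[0] != 0]
```

This is only the baseline; the chaining described below is what makes the real implementation fast.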
But to get really good timings one allows multiple pairs to "chain" at the same location in the heap. This happens when the pair f_i, g_j being added to the heap has a product monomial equal to that of an existing pair in the heap. Chaining pairs which yield equal product monomials in this way means that large numbers of equal monomials can be computed for the cost of a single heap extraction, which is a dramatic time saving. It also makes the heap much smaller, as many entries can now be chained at a single heap location, and that makes searching through the heap much less time consuming. Coefficient arithmetic can also be made faster by accumulating the sum of products yielding the same monomial in a single temporary.

In our case, chaining is important for making the heap smaller, but not for speeding up coefficient arithmetic (as mentioned, coefficient arithmetic is clearly not the cause of the inefficiency in the current FLINT code). The lower number of heap extractions probably also helps, but in the sparse case very little chaining actually occurs, so it isn't clear whether this is even relevant. Pearce and Monagan mention numerous improvements in their paper designed to improve the efficiency of heap chaining, and I believe I have implemented all of them correctly.

The difference between benchmarks 1 and 2 is sparsity. Fateman's benchmark is quite dense: each output term is the sum of about 831 products of pairs of input monomials f_i * g_j on average, whereas for the very sparse 5 var benchmark only 3 pairs contribute to each output term on average. Now, chaining is not supposed to be perfect. One doesn't get all 831 pairs chained at the root of the heap when it comes time to compute that output entry. However, all the contributing pairs will be chained in a small number of heap locations at the top of the heap (not necessarily right at the root).
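To illustrate what a chained extraction buys: if each heap slot holds a monomial together with a chain of (i, j) pairs, one extraction delivers every pair in the chain, and all the coefficient products can be summed into a single temporary. A minimal sketch, assuming that node layout (the layout is my own illustration, not sdmp's or FLINT's):

```python
def extract_chained(node, f, g):
    # node is (monomial, [(i, j), ...]); f and g are lists of
    # (coefficient, monomial) pairs.  Every product f_i * g_j in the
    # chain shares this monomial, so their coefficients are accumulated
    # into one temporary for the cost of a single heap extraction.
    mon, chain = node
    acc = 0
    for i, j in chain:
        acc += f[i][0] * g[j][0]
    return (acc, mon)
```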
As the heap is designed to deliver pairs in order, all pairs for a given output term will come out before pairs for the next term; they just may not all be chained at exactly the same (root) entry. My hypothesis is that in the very sparse case, the generic case is that each output term is built from a small number (2-3) of **unchained** individual entries from the top of the heap. This means that in order to improve the performance, one must have really fast heap handling code. That is where someone else may be able to help me. It *is* possible that I am not getting sufficient heap chaining, but then the question would be, "why not?" How would one improve things so that more heap chaining occurs on average for the sparse benchmark? In the following I am assuming that this is not the problem.

Here is what I currently do when a new entry is added to the heap.

1) As per the recommendation in Pearce and Monagan, after the previous term has been fully computed there will be a single gap in the heap at the root. Instead of reheapifying, we insert a new term in this spot and filter it down through the heap. First we need to find where in the heap this term should go. We swap it down through the heap until it reaches a spot where it is no larger than either of its children (unlike Pearce and Monagan, I made the arbitrary choice to have the smallest entries at the top of the heap). A number of comparisons are required at each level. We need to compare with both children of the current node: we want to know if the new term is equal to one of the children, or perhaps less than both. Essentially I determine the smallest child first, with a single comparison. If the new term is smaller than the smallest child, it is already in the right position. If not, I do a comparison for equality with the smallest child. If the new term is bigger, I swap it down. Checking equality with the largest child slows things down, so I don't do that.
2) Now the new node is in the right place, except that something can go wrong. If the new term actually equals an existing term in the heap, it gets chained with that term. That leaves a gap in the heap. It is from this point on that I am unsure how to proceed. At present I fill this gap with the entry at the bottom of the heap. This then gets filtered up first, then back down, until it settles into place. Filtering up only costs one comparison per level, as one only needs to compare with the parent. Equality is checked on the way up, but not on the way back down, as that is apparently slower. If an equal entry is found we chain, leaving another gap, and so the same procedure has to be iterated, placing the heap bottom in the gap, etc., until no gap remains. This all seems very inefficient, but I don't see a better way of filling gaps. If anyone has any ideas, please let me know.

3) If more than one entry is added to the heap, pretty much the same procedure is applied, except this time we don't start with an empty spot at the top of the heap. This time if equality occurs then we are happy; however, if no equal terms are encountered on the way down, the new entry eventually pushes a term out the bottom, dislodging it. We extend the heap by one place, put the dislodged entry at the new heap bottom, then sift it up first and back down again into place. In this case it is faster not to do any tests for equality, so no chaining occurs at this step and we avoid having to fill gaps.

I am convinced the problem with the sparse benchmark lies with step 2: the filling of empty spots caused by chaining, and having to iterate this procedure until the empty spot is vanquished. All swaps in the above are done virtually (we just move the heap entry into the gap and keep the entry we are trying to place in a temporary), so I believe the swaps have all been done efficiently.
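Steps 1 and 2 above might be sketched as follows, on a 1-indexed array heap whose slots are (monomial, chain-of-pairs) entries with the smallest monomial at the root. This is my own reconstruction from the description, not the FLINT code: sift_down_from_gap filters a new entry down from a gap, chaining on equality with the smaller child, and fill_gap repeatedly moves the bottom entry into a gap, sifting up (with equality checks) then down (without):

```python
def sift_down_from_gap(heap, gap, mon, chain):
    # Step 1: filter the entry (mon, chain) down from the gap at index
    # `gap`.  heap[0] is unused; node m has children at 2*m and 2*m + 1.
    # Returns the index of a new gap if the entry chained into an
    # existing equal node, else None.
    n = len(heap) - 1
    while True:
        c = 2 * gap
        if c > n:                          # fell off the bottom: place here
            heap[gap] = (mon, chain)
            return None
        if c + 1 <= n and heap[c + 1][0] < heap[c][0]:
            c += 1                         # one comparison finds the smaller child
        if mon < heap[c][0]:               # smaller than both children: done
            heap[gap] = (mon, chain)
            return None
        if mon == heap[c][0]:              # equal: chain, leaving a gap behind
            heap[c][1].extend(chain)
            return gap
        heap[gap] = heap[c]                # bigger: move the child up, descend
        gap = c

def fill_gap(heap, gap):
    # Step 2: fill a gap left by chaining with the bottom heap entry,
    # sifting it up first (checking equality on the way), then back down
    # (no equality checks); iterate while chaining keeps creating gaps.
    while True:
        mon, chain = heap.pop()            # remove the bottom entry
        if gap == len(heap):               # the gap *was* the bottom slot
            return
        chained = False
        while gap > 1:                     # sift up: one comparison per level
            p = gap // 2
            if mon > heap[p][0]:
                break                      # parent is smaller: stop rising
            if mon == heap[p][0]:          # equal: chain with the parent...
                heap[p][1].extend(chain)
                chained = True             # ...leaving a new gap at `gap`
                break
            heap[gap] = heap[p]            # smaller: move the parent down
            gap = p
        if chained:
            continue                       # fill the new gap with a new bottom
        n = len(heap) - 1                  # sift back down into place
        while True:
            c = 2 * gap
            if c > n:
                break
            if c + 1 <= n and heap[c + 1][0] < heap[c][0]:
                c += 1
            if mon <= heap[c][0]:
                break
            heap[gap] = heap[c]
            gap = c
        heap[gap] = (mon, chain)
        return
```

The swaps are "virtual" in the sense described above: entries move into the gap while the entry being placed stays in a temporary.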
Roman mentioned having to use particular compiler options to get his heap code to compile optimally, and he mentioned that gotos made some parts of the code faster. I've experimented with these, but that only gives 1% or so, not a factor of nearly 2. Whatever is wrong is not minor, but a major pain in the proverbial. I'm pretty convinced it is number 2 above.

Did anyone pay enough attention in CS classes to know how to do this more efficiently?

Bill.