Hi there,
attached you'll find an example where the polynomial factorization fails. Apparently this bug depends on some non-determinism, as it only fails occasionally when calling reduce on the given example several times in a row. The output it produces in this case is as follows:
Reduce (Free CSL version, revision 3753), 03-Aug-16 ...
* Factorizer error: (UNEXPECTED DIVISION FAILURE (((x41 . 1) ((x85 . 7) .
2506534) ((x85 . 6) . -502670) ((x85 . 5) . -2748781) ((x85 . 4) . 1476202) ((
x85 . 3) . 2812397) ((x85 . 2) . -1068956) ((x85 . 1) . -1208963) . 1992504) ((
x85 . 9) . -2506534) ((x85 . 8) . 8022272) ((x85 . 7) . -3772297) ((x85 . 6) .
-8717205) ((x85 . 5) . 7113771) ((x85 . 4) . 6553743) ((x85 . 3) . -7622699) ((
x85 . 2) . -3481481) ((x85 . 1) . 8395438) . -3985008) 5764801)
*** End-of-file read
Thanks a lot in advance!
Best,
Florian
Thank you for the report.
The factorizer works internally by making an image of the multivariate input by substituting integer values for all bar one variable and then reducing coefficients modulo some prime. It can then factorize the univariate polynomial over the finite field modulo that prime, and reconstruct full factors. Indeed it makes its choices of substitutions and primes (partly) at random, so it is deliberately non-deterministic. It has a LOT of messy code that tries to spot special cases and preserve sparseness during the reconstruction, and it was a horrible thing to debug all those years ago when it was new - but the non-determinism was useful for soak testing, because a single top-level example tested different paths on different runs. It could well have a residual bug (or several!). And your report with a nice example and a setting of the random seed is just what I would like...
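To make the imaging step concrete, here is a minimal Python sketch (not the actual RLISP code): substitute integers for all variables but one and reduce the coefficients modulo a small prime. The dict-of-exponent-tuples representation and the function name are illustrative assumptions, not Reduce's internals.

```python
def univariate_image(poly, keep_index, substitutions, p):
    """poly maps exponent tuples to integer coefficients.
    Keep variable keep_index, substitute substitutions[i] for every
    other variable i, and reduce the resulting coefficients mod p."""
    image = {}
    for exponents, coeff in poly.items():
        value = coeff
        for i, e in enumerate(exponents):
            if i != keep_index:
                value *= substitutions[i] ** e
        deg = exponents[keep_index]
        image[deg] = (image.get(deg, 0) + value) % p
    # drop terms whose coefficient vanished mod p
    return {d: c for d, c in image.items() if c != 0}

# Example: f(x, y) = x^2*y + 3*x + y + 5; keep x, substitute y = 2, p = 7.
f = {(2, 1): 1, (1, 0): 3, (0, 1): 1, (0, 0): 5}
print(univariate_image(f, 0, [None, 2], 7))  # {2: 2, 1: 3}
```

Note how the constant term (2 + 5 = 7) vanishes modulo 7 here: even this toy image shows how a particular choice of substitution and prime can lose information, which is part of why the real code retries with different random choices.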
I have tested it a bit and it is really odd! The example fails under both CSL and PSL (on some but not all builds), but at present on 64-bit Windows it looks as if merely going "tr next!-random!-number;" changes the behaviour and it works. Right now that muddles me, in that merely tracing something ought not to have an effect on outcomes, and if it does that would normally signal a bug at the Lisp (CSL) level - but here the PSL code also fails (with or without tracing).
I will be taking my summer trip soonish and so may well not have time to sort this out before I go away, but it is now on my pile of things to be resolved.
Arthur
I am attaching a proper divisor of Florian's example, which fails for random seed 3. Maybe that is easier to track.
The problem is not new. I would like to know: is this a mathematical problem (like a choice of "unlucky primes" with Berlekamp), or is it a software bug? In the former case, we could run the relevant part under errorset and restart.
Thank you for the updated example.
It does not change the fact that I am about to go away on a summer trip,
and am really not likely to have time to fix this before I go!
You say "the problem is not new" and I wish to refute that statement, because previously reported issues were investigated and fixed. So while this may be something that leads to the same error message as cases that have arisen in the past, that does not mean it is a report of a bug that was previously ignored - it is a fresh one.
I had tried to give some background in my previous response, but I clearly
need to repeat it.
The factorizer works by starting with what is in general a multivariate polynomial over Z (the integers). It substitutes integer values for all bar one of the variables and reduces modulo a prime p, so that it gets a univariate polynomial over Z_p. As you indicate, Berlekamp's algorithm can find the factors of such a polynomial if p is reasonably small. If for some reason you needed to use large primes you would need to look into alternatives.
Some substitutions and some primes are degenerate, and you can tell that their use is silly from the very start. E.g. if you try to use a prime that divides the leading coefficient of your polynomial, that is a mess, but it is easy to avoid.
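That "easy to avoid" check can be sketched as follows: reject any candidate prime that divides the leading coefficient, since the image would otherwise drop degree. The helper below is hypothetical, for illustration only, and is not Reduce's actual prime-selection code.

```python
def is_prime(n):
    """Trial-division primality test (fine for the small primes used here)."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

def pick_prime(leading_coeff, start):
    """Return the first prime >= start that does not divide leading_coeff."""
    p = start
    while not is_prime(p) or leading_coeff % p == 0:
        p += 1
    return p

# Leading coefficient 35 = 5 * 7: candidates 5 and 7 are skipped.
print(pick_prime(35, 5))  # 11
```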
If you are "lucky" the modular factors are reasonable models of a true factorization. Sometimes you will not be! The Hensel Construction can be used to lift the initial factorization to be modulo p^k, for k sufficiently large that the true factors would have integer coefficients less than p^k. It can also be used to reinstate the dependency on all the variables that were initially substituted away. If you are "unlucky" then the lifted factorization ends up not looking like a factorization over the integers, but a combinatorial search (which is the nasty bit from an abstract complexity perspective) can be used to sort things out.
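The lifting idea can be shown in miniature. Reduce lifts whole factors (and reinstates substituted variables), but the same Newton-style step applies to the simpler task of lifting a single root of f mod p up to a root mod p^k, which is enough to illustrate "keep going until p^k exceeds the true values". This is a toy sketch, not the factorizer's code.

```python
def hensel_lift_root(f, df, r, p, k):
    """Lift r, a root of f mod p with df(r) invertible mod p,
    to a root of f mod p^k by repeated Newton steps."""
    modulus = p
    for _ in range(k - 1):
        modulus *= p
        inv = pow(df(r), -1, modulus)   # df(r) is coprime to p, so this exists
        r = (r - f(r) * inv) % modulus  # Newton step: r <- r - f(r)/f'(r)
    return r

# Lift the square root of 2 mod 7 (namely 3) up to mod 7^3 = 343.
root = hensel_lift_root(lambda x: x * x - 2, lambda x: 2 * x, 3, 7, 3)
print(root, pow(root, 2, 343))  # 108 2
```

The precondition that df(r) be invertible mod p is the univariate analogue of the "lucky image" requirement: when it fails, a different prime or substitution has to be tried, which is exactly the kind of back-off the message describes.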
The algorithms there are "not too bad". But then things get horrid. If you apply the Hensel construction in a direct way, especially to a case where the univariate image splits into more factors than there are over the integers, then the reconstruction often leads to dense multivariate polynomials, and the cost consequences of that are really severe. But by inspecting the sparse structure of the things you are working with it is sometimes/often possible to predict or guess the sparse structure of the desired result, and end up finding the final factorization without the exponential blow-up that degenerating from sparse to dense polynomials involves. This involves a further level of "being lucky", and as before, if you are not, that fact will emerge later and it is then necessary to change tack. The code (obviously) does that.
There are multiple places where it is possible to (try to) optimise things by discarding bad choices early or picking what might be hoped to be good paths. The stages at which the bad consequences of these choices emerge would all be straightforward and tidy if one were prepared to expand up to degree and coefficient bounds that are often larger than necessary. As I had indicated in my previous message, the code was not nice to debug! If you look at it you will find it was mostly written over the period 1978-1981, and so when it is necessary to look at it afresh there is an issue of over 35 years of other activity to think back past, and also a significant issue that the style of coding that seemed a good idea 35+ years ago sometimes now seems archaic and awkward.
So yes, as will be obvious, the division failure report is in the style of an "assert" in the code, triggered when the consequence of a bug somewhere in the code emerges. The packages/factor directory has 8000 lines of code (the factorizer and ezgcd code share a lot of technology). I will get to it when I can. On average I am remarkably pleased by how FEW bugs have been reported in this code over the years, since I believe that factorization is used reasonably often by all sorts of calculations, both large and small.
Arthur
On Thu, 4 Aug 2016, Thomas Sturm wrote:
The polynomial from Thomas seems to me to merely discard the trivial
factors that are manifest in the originally given case. And when I try it
on my Windows machine it does not even report an error to me!
However further testing shows greater pain, in that sometimes the code manages to find just trivial factors such as x42, x53, x85-1 and x85, but SOMETIMES it also manages to extract x41*x85^2 - 3*x41*x85 + 2*x41 - 1 as a nontrivial factor.
In many respects reporting something as irreducible when it is not is more
of a genuine problem than visibly crashing, but both can relate to some
coding glitch in the application of heuristics that exist to optimise
performance.
Yes, we have
and my g is the product of the first two factors. On OSX it throws the UNEXPECTED DIVISION FAILURE with both CSL and PSL.
There is an issue where both CSL and PSL have preferred optimisation over checking utterly everything as it goes, and so a glitch gets shown up in an odd manner. I will be away on my summer break before I can chase this down further, but hope that a progress report will encourage....
Via a function called diff!-over!-k!-mod!-p the code ends up computing quotient!-mod!-p(4480, 7) when the modulus is set to 5764801. The misfortune here is that the modulus is a power of 7, so this will fail. The code (in modpoly.red) is half careful and uses safe!-modular!-reciprocal to compute the reciprocal of 7, and when that fails a flag called exact!-quotient!-flag is cleared and the value nil is returned rather than a numeric result.
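The failure mode can be reproduced in miniature: a modular reciprocal exists only when the number is coprime to the modulus, so asking for the reciprocal of 7 modulo 7^8 = 5764801 must fail. Below is a hypothetical Python analogue of that "safe" reciprocal, returning None in place of nil; it is an illustration, not the modpoly.red code.

```python
def safe_modular_reciprocal(a, m):
    """Return a**-1 mod m, or None when gcd(a, m) != 1 (no inverse exists)."""
    try:
        return pow(a, -1, m)  # Python 3.8+: three-argument pow computes modular inverses
    except ValueError:
        return None

m = 7 ** 8                             # 5764801, the modulus from the trace above
print(safe_modular_reciprocal(7, m))   # None: 7 divides 7^8
print(safe_modular_reciprocal(3, m))   # a genuine inverse: 3 is coprime to 7^8
```

The trouble described in the rest of this message starts when a caller fails to check for that None/nil and carries on as if it had received a number.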
The sad thing is that I believe this nil is then being passed on to other bits of code that expect a number mod 5764801 as argument and do not check - the effect is (I currently believe) that the bit-pattern used to represent NIL is getting treated as an integer. The fact that this address may not be utterly consistent between machines can help explain why my tests on Windows were not showing the crashes reported on other platforms, and why multiple runs of the same test were not consistent - but multiple tries of the factorization example within one running Reduce did all behave the same, and in particular trying the factorization many thousands of times did not provoke failure.
As one can then imagine, continuing with a garbage value could lead either to something that ought to divide exactly not doing so, or to the system failing to find factors that it ought to... or potentially almost any bad behaviour you can think of!
When it comes to it, it is basically instances where some intermediate value divides exactly into something "by accident" that underpin most of the cases where it is necessary to take recovery action. I have never before seen a case where that happenstance has led to escaping from the Lisp world abstraction to the underlying bit-pattern world, and I view it as somewhat entertaining that neither CSL nor PSL gave a direct moan at a non-numeric argument somewhere soon after this glitch.
This report tells you at a microscopic level where the issue arose... some way back from the visible division failure that you reported.
At one level, quotient!-mod!-p is called from somewhere in multihen.red and it will not be too hard to sort out where (when I have time). Setting "on trfac" and "on trallfac" gives voluminous trace output that is typically possible to check by hand, but painfully tedious to work through.
I observe a place where there is a potential attempt to divide by a factor
of the modulus. It is then easy to check for that and back off declaring
the case "unlucky", and that avoids the visible crash. However it can
leave me with a case where a polynomial that is composite does not get its
factors found. I need to delve deeper and I have just 100% run out of time
for that right now. If anybody else out there understands the Hensel
Construction and the sort of optimisations Paul Wang proposed to help cope
with sparse reconstruction I will point them at say around line 250 of
multihen.red where the call to diff!-over!-k!-mod!-p can cause trouble if
k divides into the current modulus. I would be grateful to anybody who can
track through just what is needed there and propose a fix - which once
found is liable to be quite short.
[citation of original mail removed by TS]
Last edit: Thomas Sturm 2016-08-06
I have just checked in some changes to the factorizer. Let me first just note that the visible message "division failure" is a rather generic trap for when a division that was expected to be exact is not. In this case it arose as a somewhat indirect consequence of a case where an attempt to preserve sparseness during the reconstruction "guessed wrong", leading to the Hensel Construction being used to generate higher degree approximations to what it was hoped would be factors than the value of the prime originally used in the Berlekamp code. I spent some time seeing if I could find a nice way to distinguish between this situation and one where the image factors were not images of true factors at all, and perhaps partly because of how long ago all that was done I did not manage. If the Hensel Construction fails when you have two putative factors f1 and f2, the degree overshoot normally means that your image factorization has found factors that do not lift to Z[x,y,...], so you can deduce that the input polynomial had fewer factors (and in this case that would mean it was irreducible). The "guessed" stuff to do with sparseness preserving needs a different recovery, where you unwind the guess and try again.
My resolution is to force the primes used in the univariate modular factorization step (Berlekamp) to be larger than the degree of your input in variables other than its main one. That ensures that a calculation that goes
(dF/dx)/k
for k being the step number in the reconstruction should never reach a stage where k is the same as your prime (which was leading to a division failure there, and disaster).
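A quick way to see why the fix is sound: the reconstruction divides by the step number k modulo a power of the prime p, and that division is possible exactly when p does not divide k. Forcing p to exceed the relevant degree bound keeps every step number strictly below p. A small check of that reasoning (illustrative only, not the Reduce code):

```python
from math import gcd

def all_steps_invertible(p, max_step):
    """True when 1/k mod p^n exists for every step number k <= max_step,
    i.e. when no step number shares a factor with the prime p."""
    return all(gcd(k, p) == 1 for k in range(1, max_step + 1))

print(all_steps_invertible(7, 10))   # False: step k = 7 hits the prime itself
print(all_steps_invertible(11, 10))  # True: the prime exceeds every step number
```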
The new version runs the existing Reduce regression tests OK and also (I believe) the example that triggered the bug report. But all that code is quite delicate, so I HOPE there are no adverse consequences. I guess I could imagine some if people have rather high degree multivariate polynomials to try to factorize and that forced the choice of prime to run into trouble.
Fingers crossed!
I'm not sure this is an issue of optimization.
I agree that the problem comes from dividing by a factor (7) of the current modulus (5764801 = 7^8), and diff!-over!-k!-mod!-p subsequently returning nil.
However, since the return value of diff!-over!-k!-mod!-p is taken as the residue in the current Hensel step, it looks to me like the nil is interpreted as zero residue instead of a failure to compute it correctly.
Rainer
Bug has been fixed.