Hi there,
attached you'll find an example where the polynomial factorization fails. Apparently this bug depends on some non-determinism, as it only fails occasionally when calling reduce on the given example several times in a row. The output it produces in this case is as follows:
Reduce (Free CSL version, revision 3753), 03-Aug-16 ...
* Factorizer error: (UNEXPECTED DIVISION FAILURE (((x41 . 1) ((x85 . 7) .
2506534) ((x85 . 6) . -502670) ((x85 . 5) . -2748781) ((x85 . 4) . 1476202) ((
x85 . 3) . 2812397) ((x85 . 2) . -1068956) ((x85 . 1) . -1208963) . 1992504) ((
x85 . 9) . -2506534) ((x85 . 8) . 8022272) ((x85 . 7) . -3772297) ((x85 . 6) .
-8717205) ((x85 . 5) . 7113771) ((x85 . 4) . 6553743) ((x85 . 3) . -7622699) ((
x85 . 2) . -3481481) ((x85 . 1) . 8395438) . -3985008) 5764801)
*** End-of-file read
Thanks a lot in advance!
Best,
Florian
Thank you for the report.
The factorizer works internally by making an image of the multivariate input by substituting integer values for all bar one variable and then reducing coefficients modulo some prime. It can then factorize the univariate polynomial over the finite field modulo that prime, and reconstruct full factors. Indeed it makes its choices of substitutions and primes (partly) at random, so it is deliberately non-deterministic. It has a LOT of messy code that tries to spot special cases and preserve sparseness during the reconstruction, and it was a horrible thing to debug all those years ago when it was new - but the non-determinism was useful for soak testing, because a single top-level example tested different paths on different runs. It could well have a residual bug (or several!). And your report with a nice example and a setting of the random seed is just what I would like...
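To make the imaging step concrete, here is a minimal Python sketch (not the actual RLISP code): substitute integers for all variables but one and reduce the coefficients modulo a small prime. The dict-of-exponent-tuples representation and the function name are illustrative assumptions, not Reduce's internals.

```python
def univariate_image(poly, keep_index, substitutions, p):
    """poly maps exponent tuples to integer coefficients.
    Keep variable keep_index, substitute substitutions[i] for every
    other variable i, and reduce the resulting coefficients mod p."""
    image = {}
    for exponents, coeff in poly.items():
        value = coeff
        for i, e in enumerate(exponents):
            if i != keep_index:
                value *= substitutions[i] ** e
        deg = exponents[keep_index]
        image[deg] = (image.get(deg, 0) + value) % p
    # drop terms whose coefficient vanished mod p
    return {d: c for d, c in image.items() if c != 0}

# Example: f(x, y) = x^2*y + 3*x + y + 5; keep x, substitute y = 2, p = 7.
f = {(2, 1): 1, (1, 0): 3, (0, 1): 1, (0, 0): 5}
print(univariate_image(f, 0, [None, 2], 7))  # {2: 2, 1: 3}
```

Note how the constant term (2 + 5 = 7) vanishes modulo 7 here: even this toy image shows how a particular choice of substitution and prime can lose information, which is part of why the real code retries with different random choices.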
I have tested it a bit and it is really odd! The example fails under both CSL and PSL (on some but not all builds), but at present on 64-bit Windows it looks as if merely going "tr next!-random!-number;" changes the behaviour and it works. Right now that muddles me, in that merely tracing something ought not to have an effect on outcomes, and if it does that would normally signal a bug at the Lisp (CSL) level - but here the PSL code also fails (with or without tracing).
I will be taking my summer trip soonish and so may well not have time to sort this out before I go away, but it is now on my pile of things to be resolved.
Arthur
I am attaching a proper divisor of Florian's example, which fails for random seed 3. Maybe that is easier to track.
The problem is not new. I would like to know: is this a mathematical problem (like a choice of "unlucky primes" with Berlekamp), or is it a software bug? In the former case, we could run the relevant part under errorset and restart.
Thank you for the updated example.
It does not change the fact that I am about to go away on a summer trip,
and am really not likely to have time to fix this before I go!
You say "the problem is not new" and I wish to refute that statement, because previously reported issues were investigated and fixed. So while this may be something that leads to the same error message as cases that have arisen in the past, that does not mean it is a report of a bug that was previously ignored - it is a fresh one.
I had tried to give some background in my previous response, but I clearly
need to repeat it.
The factorizer works by starting with what is in general a multivariate polynomial over Z (the integers). It substitutes integer values for all bar one of the variables and reduces modulo a prime p, so that it gets a univariate polynomial over Z_p. As you indicate, Berlekamp's algorithm can find the factors of such a polynomial if p is reasonably small. If for some reason you needed to use large primes you would need to look into alternatives.
Some substitutions and some primes are degenerate, and you can tell that their use is silly from the very start. E.g. if you try to use a prime that divides the leading coefficient of your polynomial, that is a mess, but it is easy to avoid.
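That "easy to avoid" check can be sketched as follows: reject any candidate prime that divides the leading coefficient, since the image would otherwise drop degree. The helper below is hypothetical, for illustration only, and is not Reduce's actual prime-selection code.

```python
def is_prime(n):
    """Trial-division primality test (fine for the small primes used here)."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

def pick_prime(leading_coeff, start):
    """Return the first prime >= start that does not divide leading_coeff."""
    p = start
    while not is_prime(p) or leading_coeff % p == 0:
        p += 1
    return p

# Leading coefficient 35 = 5 * 7: candidates 5 and 7 are skipped.
print(pick_prime(35, 5))  # 11
```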
If you are "lucky" the modular factors are reasonable models of a true factorization. Sometimes you will not be! The Hensel Construction can be used to lift the initial factorization to be modulo p^k, for k sufficiently large that the true factors would have integer coefficients less than p^k. It can also be used to reinstate the dependency on all the variables that were initially substituted away. If you are "unlucky" then the lifted factorization ends up not looking like a factorization over the integers, but a combinatorial search (which is the nasty bit from an abstract complexity perspective) can be used to sort things out.
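The lifting idea can be shown in miniature. Reduce lifts whole factors (and reinstates substituted variables), but the same Newton-style step applies to the simpler task of lifting a single root of f mod p up to a root mod p^k, which is enough to illustrate "keep going until p^k exceeds the true values". This is a toy sketch, not the factorizer's code.

```python
def hensel_lift_root(f, df, r, p, k):
    """Lift r, a root of f mod p with df(r) invertible mod p,
    to a root of f mod p^k by repeated Newton steps."""
    modulus = p
    for _ in range(k - 1):
        modulus *= p
        inv = pow(df(r), -1, modulus)   # df(r) is coprime to p, so this exists
        r = (r - f(r) * inv) % modulus  # Newton step: r <- r - f(r)/f'(r)
    return r

# Lift the square root of 2 mod 7 (namely 3) up to mod 7^3 = 343.
root = hensel_lift_root(lambda x: x * x - 2, lambda x: 2 * x, 3, 7, 3)
print(root, pow(root, 2, 343))  # 108 2
```

The precondition that df(r) be invertible mod p is the univariate analogue of the "lucky image" requirement: when it fails, a different prime or substitution has to be tried, which is exactly the kind of back-off the message describes.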
The algorithms there are "not too bad". But then things get horrid. If you apply the Hensel construction in a direct way, especially to a case where the univariate image splits into more factors than there are over the integers, then the reconstruction often leads to dense multivariate polynomials, and the cost consequences of that are really severe. But by inspecting the sparse structure of the things you are working with it is sometimes/often possible to predict or guess the sparse structure of the desired result, and end up finding the final factorization without the exponential blow-up that degenerating from sparse to dense polynomials involves. This involves a further level of "being lucky", and as before, if you are not, that fact will emerge later and it is then necessary to change tack. The code (obviously) does that.
There are multiple places where it is possible to (try to) optimise things by discarding bad choices early or picking what might be hoped to be good paths. The stages at which the bad consequences of these choices emerge would all be straightforward and tidy if one were prepared to expand up to degree and coefficient bounds that are often larger than necessary. As I had indicated in my previous message, the code was not nice to debug! If you look at it you will find it was mostly written over the period 1978-1981, and so when it is necessary to look at it afresh there is an issue of over 35 years of other activity to think back past, and also a significant issue that the style of coding that seemed a good idea 35+ years ago sometimes now seems archaic and awkward.
So yes, as will be obvious, the division failure report is in the style of an "assert" in the code, triggered when the consequence of a bug somewhere in the code emerges. The packages/factor directory has 8000 lines of code (the factorizer and ezgcd code share a lot of technology). I will get to it when I can. On average I am remarkably pleased by how FEW bugs have been reported in this code over the years, since I believe that factorization is used reasonably often by all sorts of calculations, both large and small.
Arthur
On Thu, 4 Aug 2016, Thomas Sturm wrote:
The polynomial from Thomas seems to me to merely discard the trivial
factors that are manifest in the originally given case. And when I try it
on my Windows machine it does not even report an error to me!
However further testing shows greater pain, in that sometimes the code manages to find just trivial factors such as x42, x53, x85-1 and x85, but SOMETIMES it also manages to extract x41*x85^2 - 3*x41*x85 + 2*x41 - 1 as a nontrivial factor.
In many respects reporting something as irreducible when it is not is more
of a genuine problem than visibly crashing, but both can relate to some
coding glitch in the application of heuristics that exist to optimise
performance.
Yes, we have
and my g is the product of the first two factors. On OSX it throws the UNEXPECTED DIVISION FAILURE with both CSL and PSL.
There is an issue where both CSL and PSL have preferred optimisation over checking utterly everything as it goes, and so a glitch gets shown up in an odd manner. I will be away on my summer break before I can chase this down further, but hope that a progress report will encourage....
Via a function called diff!-over!-k!-mod!-p the code ends up computing quotient!-mod!-p(4480, 7) when the modulus is set to 5764801. The misfortune here is that the modulus is a power of 7, so this will fail. The code (in modpoly.red) is half careful and uses safe!-modular!-reciprocal to compute the reciprocal of 7, and when that fails a flag called exact!-quotient!-flag is cleared and the value nil is returned rather than a numeric result.
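The failure mode can be reproduced in miniature: a modular reciprocal exists only when the number is coprime to the modulus, so asking for the reciprocal of 7 modulo 7^8 = 5764801 must fail. Below is a hypothetical Python analogue of that "safe" reciprocal, returning None in place of nil; it is an illustration, not the modpoly.red code.

```python
def safe_modular_reciprocal(a, m):
    """Return a**-1 mod m, or None when gcd(a, m) != 1 (no inverse exists)."""
    try:
        return pow(a, -1, m)  # Python 3.8+: three-argument pow computes modular inverses
    except ValueError:
        return None

m = 7 ** 8                             # 5764801, the modulus from the trace above
print(safe_modular_reciprocal(7, m))   # None: 7 divides 7^8
print(safe_modular_reciprocal(3, m))   # a genuine inverse: 3 is coprime to 7^8
```

The trouble described in the rest of this message starts when a caller fails to check for that None/nil and carries on as if it had received a number.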
The sad thing is that I believe this nil is then being passed on to other bits of code that expect a number mod 5764801 as argument and do not check - the effect is (I currently believe) that the bit-pattern used to represent NIL is getting treated as an integer. The fact that this address may not be utterly consistent between machines can help explain why my tests on Windows were not showing the crashes reported on other platforms, and why multiple runs of the same test were not consistent - but multiple tries of the factorization example within one running Reduce did all behave the same, and in particular trying the factorization many thousands of times did not provoke failure.
As one can then imagine, continuing with a garbage value could lead either to something that ought to divide exactly not doing so, or to the system failing to find factors that it ought to... or potentially almost any bad behaviour you can think of!
When it comes to it, it is basically instances where some intermediate value divides exactly into something "by accident" that underpin most of the cases where it is necessary to take recovery action. I have never before seen a case where that happenstance has led to escaping from the Lisp world abstraction to the underlying bit-pattern world, and I view it as somewhat entertaining that neither CSL nor PSL gave a direct moan at a non-numeric argument somewhere soon after this glitch.
This report tells you at a microscopic level where the issue arose... some way back from the visible division failure that you reported.
At one level, quotient!-mod!-p is called from somewhere in multihen.red and it will not be too hard to sort out where (when I have time). Setting "on trfac" and "on trallfac" gives voluminous trace output that is typically possible to check by hand, but painfully tedious to work through.
I observe a place where there is a potential attempt to divide by a factor
of the modulus. It is then easy to check for that and back off declaring
the case "unlucky", and that avoids the visible crash. However it can
leave me with a case where a polynomial that is composite does not get its
factors found. I need to delve deeper and I have just 100% run out of time
for that right now. If anybody else out there understands the Hensel
Construction and the sort of optimisations Paul Wang proposed to help cope
with sparse reconstruction I will point them at say around line 250 of
multihen.red where the call to diff!-over!-k!-mod!-p can cause trouble if
k divides into the current modulus. I would be grateful to anybody who can
track through just what is needed there and propose a fix - which once
found is liable to be quite short.
[citation of original mail removed by TS]
Last edit: Thomas Sturm 2016-08-06
I have just checked in some changes to the factorizer. Let me first just note that the visible message "division failure" is a rather generic trap for when a division that was expected to be exact is not. In this case it arose as a somewhat indirect consequence of a case where an attempt to preserve sparseness during the reconstruction "guessed wrong", leading to the Hensel Construction being used to generate higher degree approximations to what it was hoped would be factors than the value of the prime originally used in the Berlekamp code. I spent some time seeing if I could find a nice way to distinguish between this situation and one where the image factors were not images of true factors at all, and perhaps partly because of how long ago all that was done I did not manage. If the Hensel Construction fails when you have two putative factors f1 and f2, the degree overshoot normally means that your image factorization has found factors that do not lift to Z[x,y,...], so you can deduce that the input polynomial had fewer factors (and in this case that would mean it was irreducible). The "guessed" stuff to do with sparseness preserving needs a different recovery, where you unwind the guess and try again.
My resolution is to force the primes used in the univariate modular factorization step (Berlekamp) to be larger than the degree of your input in variables other than its main one. That ensures that a calculation that goes
(dF/dx)/k
for k being the step number in the reconstruction should never reach a stage where k is the same as your prime (which was leading to a division failure there, and disaster).
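A quick way to see why the fix is sound: the reconstruction divides by the step number k modulo a power of the prime p, and that division is possible exactly when p does not divide k. Forcing p to exceed the relevant degree bound keeps every step number strictly below p. A small check of that reasoning (illustrative only, not the Reduce code):

```python
from math import gcd

def all_steps_invertible(p, max_step):
    """True when 1/k mod p^n exists for every step number k <= max_step,
    i.e. when no step number shares a factor with the prime p."""
    return all(gcd(k, p) == 1 for k in range(1, max_step + 1))

print(all_steps_invertible(7, 10))   # False: step k = 7 hits the prime itself
print(all_steps_invertible(11, 10))  # True: the prime exceeds every step number
```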
The new version runs the existing Reduce regression tests OK and also (I believe) the example that triggered the bug report. But all that code is quite delicate, so I HOPE there are no adverse consequences. I guess I could imagine some if people have rather high degree multivariate polynomials to try to factorize and that forced the choice of prime to run into trouble.
Fingers crossed!
I'm not sure this is an issue of optimization.
I agree that the problem comes from dividing by a factor (7) of the current modulus (5764801 = 7^8), and diff!-over!-k!-mod!-p subsequently returning nil.
However, since the return value of diff!-over!-k!-mod!-p is taken as the residue in the current Hensel step, it looks to me like the nil is interpreted as zero residue instead of a failure to compute it correctly.
Rainer
Bug has been fixed.