
#72 Unexpected division failure in polynomial factorization

Status: closed
Labels: factorizer (1)
Updated: 2016-11-06
Created: 2016-08-03

Hi there,

Attached you'll find an example where the polynomial factorization fails. Apparently this bug depends on some non-determinism, as it only fails occasionally when calling reduce on the given example several times in a row. The output it produces in this case is as follows:

Reduce (Free CSL version, revision 3753), 03-Aug-16 ...

* Factorizer error: (UNEXPECTED DIVISION FAILURE (((x41 . 1) ((x85 . 7) .
2506534) ((x85 . 6) . -502670) ((x85 . 5) . -2748781) ((x85 . 4) . 1476202) ((
x85 . 3) . 2812397) ((x85 . 2) . -1068956) ((x85 . 1) . -1208963) . 1992504) ((
x85 . 9) . -2506534) ((x85 . 8) . 8022272) ((x85 . 7) . -3772297) ((x85 . 6) .
-8717205) ((x85 . 5) . 7113771) ((x85 . 4) . 6553743) ((x85 . 3) . -7622699) ((
x85 . 2) . -3481481) ((x85 . 1) . 8395438) . -3985008) 5764801)

*** End-of-file read

Thanks a lot in advance!

Best,
Florian

1 Attachment

Discussion

  • Arthur Norman

    Arthur Norman - 2016-08-03

    Thank you for the report.

    The factorizer works internally by making an image of the multivariate input: it substitutes integer values for all but one variable and then reduces the coefficients modulo some prime. It can then factorize the univariate polynomial over the finite field modulo that prime, and reconstruct full factors. Indeed it makes its choices of substitutions and primes (partly) at random, so it is deliberately non-deterministic. It has a LOT of messy code that tries to spot special cases and preserve sparseness during the reconstruction, and it was a horrible thing to debug all those years ago when it was new - but the non-determinism was useful for soak testing, because a single top-level example tested different paths on different runs. It could well have a residual bug (or several!). And your report, with a nice example and a setting of the random seed, is just what I would like...
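    The image-forming step described above can be sketched as follows (a minimal illustration in Python, not the REDUCE code; the dict-based polynomial representation and helper name are invented for the example):

```python
# Minimal sketch of forming a univariate image: substitute integers for
# all but one variable of a polynomial over Z, then reduce the resulting
# coefficients modulo a small prime p.  (Representation invented for the
# example: exponent tuples (e_main, e_1, ...) mapped to coefficients.)

def univariate_image(poly, subs, p):
    image = {}
    for exps, coeff in poly.items():
        main_deg, rest = exps[0], exps[1:]
        value = coeff
        for v, e in zip(subs, rest):
            value *= v ** e            # substitute the chosen integer
        image[main_deg] = (image.get(main_deg, 0) + value) % p
    return {d: c for d, c in image.items() if c}

# x^2*y + 3x - y^2 with y := 2, coefficients reduced mod 5
f = {(2, 1): 1, (1, 0): 3, (0, 2): -1}
print(univariate_image(f, [2], 5))  # {2: 2, 1: 3, 0: 1}
```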

    I have tested it a bit and it is really odd! The example fails under both CSL and PSL (on some but not all builds), but at present on 64-bit Windows it looks as if merely going "tr next!-random!-number;" changes the behaviour and it works. Right now that puzzles me, in that merely tracing something ought not to have an effect on outcomes, and if it does that would normally signal a bug at the Lisp (CSL) level - but here the PSL code also fails (with or without tracing).

    I will be taking my summer trip soonish and so may well not have time to sort this out before I go away, but it is now on my pile of things to be resolved.

    Arthur

     
  • Thomas Sturm

    Thomas Sturm - 2016-08-04

    I am attaching a proper divisor of Florian's example, which fails for random seed 3. Maybe that is easier to track.

    The problem is not new. I would like to know: Is this a mathematical problem (like a choice of "unlucky primes" with Berlekamp), or is it a software bug? In the former case, we could run the relevant part under errorset and restart.

     
    • Arthur Norman

      Arthur Norman - 2016-08-04

      Thank you for the updated example.
      It does not change the fact that I am about to go away on a summer trip,
      and am really not likely to have time to fix this before I go!

      You say "the problem is not new" and I wish to refute that statement,
      because previously reported issues were investigated and fixed. So while
      this may be something that leads to the same error message as cases that
      have arisen in the past, that does not mean that it is a fresh report of
      a bug that was previously ignored.

      I had tried to give some background in my previous response, but I clearly
      need to repeat it.

      The factorizer works by starting with what is in general a multivariate
      polynomial over Z (the integers). It substitutes integer values for all
      but one of the variables and reduces modulo a prime p so that it gets a
      univariate polynomial over Z_p. As you indicate, Berlekamp's algorithm
      can find the factors of such a polynomial if p is reasonably small. If
      for some reason you needed to use large primes you would need to look
      into alternatives.

      Some substitutions and some primes are degenerate, and you can tell that
      their use is silly from the very start. E.g. if you try to use a prime
      that divides the leading coefficient of your polynomial, that is a mess,
      but it is easy to avoid.
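      The cheap degeneracy check just mentioned can be sketched like this (an illustrative Python fragment; the helper name and the example coefficient 12 are invented for the illustration):

```python
# Sketch of the cheap degeneracy check: a prime that divides the leading
# coefficient of the polynomial would drop its degree mod p, so it is
# rejected before any factorization is attempted.  (Helper name and the
# example coefficient 12 are invented for illustration.)

def usable_prime(p, leading_coeff):
    return leading_coeff % p != 0

print(usable_prime(2, 12))  # False: 2 divides 12, degree would drop
print(usable_prime(5, 12))  # True: 5 is safe for this polynomial
```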

      If you are "lucky" the modular factors are reasonable models of a true
      factorization. Sometimes you will not be! The Hensel Construction can be
      used both to lift the initial factorization to be modulo p^k, for k
      sufficiently large that the true factors would have integer coefficients
      less than p^k, and to reinstate the dependency on all the variables that
      were initially substituted away. If you are "unlucky" then the lifted
      factorization ends up not looking like a factorization over the
      integers, but a combinatorial search (which is the nasty bit from an
      abstract complexity perspective) can be used to sort things out.
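      The full factor-lifting code is long, but the mechanism of the Hensel Construction can be sketched in its simplest special case: lifting a simple root of a univariate polynomial from mod p to mod p^k (an illustrative Python sketch, not the REDUCE code):

```python
# Simplest special case of the Hensel Construction: lift a simple root of
# a univariate polynomial from mod p to mod p^k by Newton iteration.
# (Illustrative only; the REDUCE code lifts whole factorizations, and in
# several variables.)

def hensel_lift_root(coeffs, root, p, k):
    """coeffs = [c0, c1, ...] over Z; root is a simple root mod p."""
    def ev(cs, x, m):          # evaluate the polynomial mod m
        return sum(c * pow(x, i, m) for i, c in enumerate(cs)) % m
    def ev_deriv(cs, x, m):    # evaluate its derivative mod m
        return sum(i * c * pow(x, i - 1, m)
                   for i, c in enumerate(cs) if i) % m
    target, modulus, r = p ** k, p, root % p
    while modulus < target:
        modulus = min(modulus * modulus, target)
        # Newton step r <- r - f(r)/f'(r); f'(r) is a unit because the
        # root is simple mod p.
        r = (r - ev(coeffs, r, modulus)
               * pow(ev_deriv(coeffs, r, modulus), -1, modulus)) % modulus
    return r

# x^2 + 1 has the simple root 2 mod 5; lift it to a root mod 5^4 = 625
print(hensel_lift_root([1, 0, 1], 2, 5, 4))  # 182, and 182^2 + 1 = 53 * 625
```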

      The algorithms there are "not too bad". But then things get horrid. If
      you apply the Hensel construction in a direct way, especially in a case
      where the univariate image split into more factors than there are over
      the integers, then the reconstruction often leads to dense multivariate
      polynomials, and the cost consequences of that are really severe. But by
      inspecting the sparse structure of the things you are working with it is
      sometimes/often possible to predict or guess the sparse structure of the
      desired result, and end up finding the final factorization without the
      exponential blow-up that degenerating from sparse to dense polynomials
      involves. This involves a further level of "being lucky", and as before,
      if you are not, that fact will emerge later and it is then necessary to
      change tack. The code (obviously) does that.

      There are multiple places where it is possible to (try to) optimise
      things by discarding bad choices early or by picking what might be hoped
      to be good paths. The stages at which the bad consequences of these
      emerge would all be straightforward and tidy if one were prepared to
      expand up to degree and coefficient bounds that are often larger than
      necessary. As I had indicated in my previous message, the code was not
      nice to debug! If you look at it you will find it was mostly written
      over the period 1978-1981, and so when it is necessary to look at it
      afresh there is an issue of over 35 years of other activity to think
      back past, and also a significant issue that the style of coding that
      seemed a good idea 35+ years ago sometimes now seems archaic and
      awkward.

      So yes, as will be obvious, the division failure report is in the style
      of an "assert" in the code, triggered when the consequence of a bug
      somewhere in the code emerges. The packages/factor directory has 8000
      lines of code (the factorizer and ezgcd code share a lot of technology).
      I will get to it when I can. On average I am remarkably pleased by how
      FEW bugs have been reported in this code over the years, since I believe
      that factorization is used reasonably often by all sorts of both large
      and small calculations.

      Arthur


      Attachments:

      • fctrf-bug.red (506 Bytes; application/octet-stream)
       
      • Arthur Norman

        Arthur Norman - 2016-08-04

        The polynomial from Thomas seems to me to merely discard the trivial
        factors that are manifest in the originally given case. And when I try it
        on my Windows machine it does not even report an error to me!

        However, further testing shows greater pain, in that sometimes the
        code manages to find just the trivial factors such as x41, x53,
        x85 - 1 and x85, but SOMETIMES it also manages to extract
        x41*x85^2 - 3*x41*x85 + 2*x41 - 1 as a nontrivial factor.

        In many respects reporting something as irreducible when it is not is more
        of a genuine problem than visibly crashing, but both can relate to some
        coding glitch in the application of heuristics that exist to optimise
        performance.

             Arthur
        
         
  • Thomas Sturm

    Thomas Sturm - 2016-08-05

    Yes, we have

    4: factorize f;
    
    {{3*x41**4*x85**3 - 3*x41**4*x85**2 - 3*x41**4*x85 + 3*x41**4 + 2*x41**3*x85**4
    
     - 11*x41**3*x85**3 + 18*x41**3*x85**2 - x41**3*x85 - 8*x41**3 + 5*x41**2*x85**3
     - 15*x41**2*x85**2 + 6*x41**2*x85 + 8*x41**2 + 2*x41*x85**2 - 2*x41*x85 - 4*x41
     + 1,
    1},
    {x41*x85**2 - 3*x41*x85 + 2*x41 - 1,1},
    {x41,1},
    {x53,1},
    {x85 - 1,1},
    {x85,1}}$
    

    and my g is the product of the first two factors. On OSX

    random_new_seed 3; factorize g;
    

    throws the UNEXPECTED DIVISION FAILURE with both CSL and PSL.

     
  • Arthur Norman

    Arthur Norman - 2016-08-05

    There is an issue where both CSL and PSL have preferred optimisation over
    checking utterly everything as they go, and so a glitch gets shown up in
    an odd manner. I will be away on my summer break before I can chase this
    down further, but hope that a progress report will encourage....

    Via a function called diff!-over!-k!-mod!-p the code ends up computing
    quotient!-mod!-p(4480, 7) when the modulus is set to 5764801. The
    misfortune here is that the modulus is a power of 7, so this will
    fail. The code (in modpoly.red) is half careful and uses
    safe!-modular!-reciprocal to compute the reciprocal of 7, and when that
    fails a flag called exact!-quotient!-flag is cleared and the value nil is
    returned rather than a numeric result.
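    This failure mode can be sketched as follows (illustrative Python with hypothetical names echoing the REDUCE routines; a reciprocal mod m exists only when gcd(k, m) = 1, so dividing by 7 when the modulus is 7^8 = 5764801 cannot succeed):

```python
# Sketch of the reported failure (hypothetical names echoing the REDUCE
# routines safe!-modular!-reciprocal and quotient!-mod!-p): a reciprocal
# mod m only exists when gcd(k, m) = 1, so dividing by 7 with the modulus
# 5764801 = 7^8 cannot succeed and yields None instead of a residue.

from math import gcd

def safe_modular_reciprocal(k, m):
    return None if gcd(k, m) != 1 else pow(k, -1, m)

def quotient_mod_p(a, k, m):
    r = safe_modular_reciprocal(k, m)
    return None if r is None else (a * r) % m

print(quotient_mod_p(4480, 7, 5764801))  # None: 7 divides 7^8
print(quotient_mod_p(4480, 3, 5764801) is None)  # False: 3 is invertible
```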

    The sad thing is that I believe that this nil is then being passed on to
    other bits of code that expect a number mod 5764801 as argument and do
    not check - the effect is (I currently believe) that the bit-pattern used
    to represent NIL is getting treated as an integer. Because this address
    may not be utterly consistent between machines, that can help explain why
    my tests on Windows were not showing the crashes reported on other
    platforms, and why multiple runs of the same test were not consistent -
    but multiple tries of the factorization example within one running Reduce
    did all behave the same, and in particular trying the factorization many
    thousands of times did not provoke failure.

    As one can then imagine, continuing with a garbage value could lead
    either to something that ought to divide exactly not doing so, or to the
    system failing to find factors that it ought to... or potentially almost
    any bad behaviour you can think of!

    When it comes to it, it is basically instances where some intermediate
    value divides exactly into something "by accident" that underpin most of
    the cases where it is necessary to take recovery action. I have never
    before seen a case where that happenstance has led to escaping from the
    Lisp world abstraction to the underlying bit-pattern world, and I view it
    as somewhat entertaining that neither CSL nor PSL gave a direct moan at a
    non-numeric argument somewhere soon after this glitch.

    This report tells you at a microscopic level where the issue arose...
    some way back from the visible division failure that you reported.
    At one level, quotient!-mod!-p is called from somewhere in multihen.red
    and it will not be too hard to sort out where (when I have time). Setting
    "on trfac" and "on trallfac" gives voluminous trace output that is
    typically possible to check by hand but painfully tedious to work
    through.

    I observe a place where there is a potential attempt to divide by a
    factor of the modulus. It is then easy to check for that and back off,
    declaring the case "unlucky", and that avoids the visible crash. However,
    it can leave me with a case where a polynomial that is composite does not
    get its factors found. I need to delve deeper, and I have just 100% run
    out of time for that right now. If anybody else out there understands the
    Hensel Construction and the sort of optimisations Paul Wang proposed to
    help cope with sparse reconstruction, I will point them at, say, around
    line 250 of multihen.red, where the call to diff!-over!-k!-mod!-p can
    cause trouble if k divides into the current modulus. I would be grateful
    to anybody who can track through just what is needed there and propose a
    fix - which, once found, is liable to be quite short.

           Arthur
    

    [citation of original mail removed by TS]

     

    Last edit: Thomas Sturm 2016-08-06
    • Arthur Norman

      Arthur Norman - 2016-09-01

      I have just checked in some changes to the factorizer. Let me first
      just note that the visible message "division failure" is a rather
      generic trap for when a division that was expected to be exact is not.
      In this case it arose as a somewhat indirect consequence of a case
      where an attempt to preserve sparseness during the reconstruction
      "guessed wrong", leading to the Hensel Construction being used to
      generate approximations to what it was hoped would be factors of higher
      degree than the value of the prime originally used in the Berlekamp
      code. I spent some time seeing if I could find a nice way to
      distinguish between this situation and one where the image factors were
      not images of true factors at all, and perhaps partly because of how
      long ago all that was done I did not manage. If the Hensel Construction
      fails when you have two putative factors f1 and f2, the degree
      overshoot normally means that your image factorization has found
      factors that do not lift to Z[x,y,...], so you can deduce that the
      input polynomial had fewer factors (and in this case that would mean it
      was irreducible). The "guessed" stuff to do with sparseness
      preservation needs a different recovery, where you unwind the guess and
      try again.

      My resolution is to force the primes used in the univariate modular
      factorization step (Berlekamp) to be larger than the degree of your
      input in variables other than its main one. That ensures that a
      calculation that goes
      (dF/dx)/k
      for k being the step number in the reconstruction should never reach a
      stage where k is the same as your prime (which was leading to a
      division failure there, and disaster).
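      The invariant behind this fix can be sketched as follows (an illustrative Python check, not the REDUCE code): if the prime p exceeds every step number k used in the reconstruction, then gcd(k, p^n) = 1 for all such k, so the division by k is always possible mod p^n.

```python
# Sketch of the invariant behind the fix: with the prime p chosen larger
# than every reconstruction step number k, gcd(k, p^n) = 1 for all such
# k, so (dF/dx)/k can always be computed mod p^n.

from math import gcd

def all_steps_invertible(p, n, max_step):
    return all(gcd(k, p ** n) == 1 for k in range(1, max_step + 1))

print(all_steps_invertible(7, 8, 6))  # True: p = 7 exceeds every step
print(all_steps_invertible(7, 8, 7))  # False: step k = 7 recreates the bug
```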

      The new version runs the existing Reduce regression tests OK and also
      (I believe) the example that triggered the bug report. But all that
      code is quite delicate, so I HOPE there are no adverse consequences. I
      guess I could imagine some if people have rather high degree
      multivariate polynomials to try to factorize and that forced the choice
      of prime to run into trouble.

      Fingers crossed!

            Arthur
      
       
  • Rainer Schöpf

    Rainer Schöpf - 2016-08-06

    I'm not sure this is an issue of optimization.

    I agree that the problem comes from dividing by a factor (7) of the current modulus (5764801 = 7^8), and diff!-over!-k!-mod!-p subsequently returning nil.

    However, since the return value of diff!-over!-k!-mod!-p is taken as the residue in the current Hensel step, it looks to me like the nil is interpreted as zero residue instead of a failure to compute it correctly.

    Rainer

     
  • Rainer Schöpf

    Rainer Schöpf - 2016-09-06
    • labels: --> factorizer
    • status: open --> accepted
    • assigned_to: Arthur Norman
    • Group: -->
     
  • Rainer Schöpf

    Rainer Schöpf - 2016-11-06
    • status: accepted --> closed
     
  • Rainer Schöpf

    Rainer Schöpf - 2016-11-06

    Bug has been fixed.

     

