Thread: [TCLCORE] TIP #237: Arbitrary-Precision Integers for Tcl

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 422-6466

 TIP #237: ARBITRARY-PRECISION INTEGERS FOR TCL 
================================================
 Version:      $Revision: 1.1 $
 Author:       Kevin B. Kenny <kennykb_at_acm.org>
 State:        Draft
 Type:         Project
 Tcl-Version:  8.5
 Vote:         Pending
 Created:      Wednesday, 14 January 2004
 URL:          http://purl.org/tcl/tip/237.html
 WebEdit:      http://purl.org/tcl/tip/edit/237
 Post-History: 

-------------------------------------------------------------------------

 ABSTRACT 
==========

 This TIP adds the capability to perform computations on integer values 
 of arbitrary precision. 

 RATIONALE 
===========

 There have been a number of issues in Tcl 8.4 dealing with the limited 
 range of native integers and wide values. The original ideas of [TIP 
 #72], while they have given at least the basics of 64-bit integer 
 support, have also introduced some subtle violations of the doctrine 
 that "everything is a string." Some of these have been insidious bugs - 
 [<URL:http://sf.net/tracker/?func=detail&aid=868489&group_id=10894&atid=110894>] 
 and 
 [<URL:http://sf.net/tracker/?func=detail&aid=1006626&group_id=10894&atid=110894>] 
 are illustrative - while others are perhaps not "bugs" in the strictest 
 sense, but are still surprising behaviour. 

 For instance, it is possible for a script to tell integers from wide 
 integers even when their string representations are equal, by 
 performing any arithmetic operation that overflows: 

       % set x 2147483647    % set x [expr { wide(2146483647) }]
       2147483647            2147483647
       % incr x              % incr x
       -2147483648           2147483648

 With things as they stand, 
 [<URL:http://sf.net/tracker/?func=detail&aid=1006626&group_id=10894&atid=110894>] 
 is nearly unfixable. It causes misconversion of large floating point 
 numbers that look like integers: 

       % set x -9223372036854775809
       -9223372036854775809
       % expr {double($x)}
       9.22337203685e+018
       % scan $x %g
       -9.22337203685e+018

 The reason here is that the string of digits is first converted to a 
 64-bit unsigned integer (*Tcl_WideUInt*). The '-' sign causes the 
 unsigned integer to be interpreted as a signed integer, and the sign 
 reversed. Since interpreting the unsigned integer as signed yields an 
 overflow, the result is positive rather than negative and gives the odd 
 behaviour shown above. 

 Of course, even if the implementation of 64-bit arithmetic were bug 
 free, there would be uses for arithmetic of higher precision. One 
 example is that several Tcl users have attempted pure-Tcl 
 implementations of RSA and Diffie-Hellman cryptographic algorithms. 
 These algorithms depend on arithmetic of high precision; implementing 
 them efficiently in today's Tcl requires a C extension because the 
 high-precision algorithms implemented in Tcl are simply too slow to be 
 acceptable. 

 Finally, studying the references for accurate conversion of floating 
 point numbers [TIP #132] reveals that input and output conversions both 
 require arbitrary-precision arithmetic in their implementation. The 
 reference implementation supplied with that TIP, alas, is code that is 
 poorly commented and difficult to integrate into Tcl's build system. 
 Reimplementing it according to Tcl's engineering practices means 
 implementing a large part of an arbitrary-precision arithmetic library. 

 PROPOSAL 
==========

 This TIP proposes augmenting Tcl with code for processing integers of 
 arbitrary precision. Specifically: 

    1. The /libtommath/ library [<URL:http://math.libtomcrypt.org/>] 
       shall be added in a subdirectory of the Tcl source tree, parallel 
       to /generic//, /compat//, etc. This library implements arithmetic 
       on integers of arbitrary precision. For the rationale behind this 
       library, and some of the precise integration details, see "Choice 
       of Libary" below. 

    2. New functions, /Tcl_NewBignumObj/, /Tcl_SetBignumObj/, 
       /Tcl_GetBignumFromObj/ and /Tcl_GetBignumAndClearObj/ shall be 
       added to the Tcl library. They shall be specified as follows: 

       Tcl_NewBignumObj:
              Creates an object containing an integer of arbitrary 
              precision. The /value/ argument gives the integer in the 
              format native to /libtommath/. /Upon return, the value 
              argument is cleared, because its digit array has had its 
              ownership transferred to the Tcl library./ 

              Tcl_Obj **Tcl_NewBignumObj*(mp_int */value/) 

       Tcl_SetBignumObj:
              Changes the value of an object to an integer of arbitrary 
              precision. The /value/ argument gives the integer in the 
              format native to /libtommath/. As with *Tcl_NewBignumObj*, 
              the /value/ argument is cleared on return and the digit 
              list is owned by Tcl thereafter. 

              void *Tcl_SetBignumObj*(Tcl_Obj */objPtr/, mp_int 
              */value/) 

       Tcl_GetBignumFromObj:
              Interprets an object as a large integer, and constructs a 
              copy of that large integer in the /value/ argument. 
              Returns /TCL_OK/ if successful. On failure, stores an 
              error message in the result of /interp/ and returns 
              /TCL_ERROR/. 

              int *Tcl_GetBignumFromObj*(Tcl_Interp */interp/, 
              Tcl_Obj*/objPtr/, mp_int */value/) 

       Tcl_GetBignumAndClearObj:
              Interprets an object as a large integer, and stores the 
              large integer in the /value/ argument. Returns /TCL_OK/ if 
              successful. On failure, stores an error message in the 
              result of /interp/ and returns /TCL_ERROR/. Calls 
              *Tcl_Panic* if the object is shared. The object is reset 
              to the empty state prior to return from this call. 

              int *Tcl_GetBignumAndClearObj*(Tcl_Interp */interp/, 
              Tcl_Obj*/objPtr/, mp_int */value/) 

       The memory management of these routines deserves some 
       explanation. For performance, it is desirable that copying 
       bignums be avoided as far as possible. For this reason, the digit 
       array stored in the /mp_int/ object will be stored by pointer in 
       the Tcl internal representation. This will result in memory 
       problems unless the /mp_int/ appears to be destroyed after it has 
       been placed in a Tcl object, since further calls to the 
       /libtommath/ functions may reallocate the array. 

       Similarly, when retrieving a large integer from an object, if the 
       object retains an internal representation with the /mp_int/, the 
       /mp_int/ must be copied. Code that intends to overwrite an 
       unshared object immediately can avoid the copy by using the call 
       that clears the object; this call returns the original /mp_int/ 
       without needing to do any memory management. 

    3. The /internalRep/ union in the /Tcl_Obj/ structure shall be 
       augmented with a /bignumRep/ field. This field shall be a 
       structure comprising a pointer and a long integer. The pointer 
       will designate an array of digits (the /dp/ member of an /mp_int/ 
       structure in /libtommath/). The integer will be an encoding 
       comprising: 

                   * one bit representing the sign of the number 

                   * fifteen bits giving the size of allocated memory in 
                     the digit array. > > * fifteen bits giving the size 
                     of used memory in the digit array. 

       The reason for this tight encoding is that the size of a 
       /Tcl_Obj/ shall not change, and yet an /mp_int/ structure can be 
       stored in the internal representation. This encoding will allow 
       packing and unpacking the object with a few Boolean operations 
       and shifts rather than needing to use a pointer internal 
       representation and dynamically allocated memory to store the 
       /mp_int/. The drawback is that it is limited to integers of 2**15 
       digits (or 917,504 bits). Since cryptographic algorithms, 
       floating point conversions, and most number theoretic work will 
       not exceed this length, it is thought to be adequate in practice. 

    4. A new /tclBignumType/ shall be added to the set of registered 
       object types; it will have the /bignumRep/ internal 
       representation, and the usual four conversion routines. 

    5. The /wideIntRep/ field in the /internalRep/ union shall remain 
       (lest there be extensions that use it), but all code in the Tcl 
       library that deals with it shall be removed. The routines, 
       /Tcl_GetWideIntFromObj/, /Tcl_SetWideIntObj/, and 
       /Tcl_NewWideIntObj/ shall remain as part of the API, but will 
       create objects with either /integer/ or /bignum/ internal 
       representations. 

    6. The *expr* command shall be reworked so that all integer 
       arithmetic is performed /as if/ all integers are of arbitrary 
       precision. In practice, numbers between INT_MIN and INT_MAX shall 
       be stored in native integers. The operators shall perform type 
       promotions as needed. The command shall avoid (as far as 
       practicable) performing arbitrary-precision arithmetic when 
       native ints are presented. Specifically: 

            * Mixed mode arithmetic with floating point numbers shall 
              work as it does today; the argument that is not a floating 
              point number shall be converted. Note that it will be 
              possible for conversion to overflow the floating-point 
              arithmetic range; in this case, the value shall be 
              replaced with /Inf/ (assuming the approval of [TIP #132]) 
              or yield the error, "floating-point value too large to 
              represent". 

       For arithmetic involving only integers: 

            * The unary '-' operator shall promote INT_MIN to a mp_int; 
              this is the only input that can cause it to overflow. 

            * The unary '+' and '!' operators require no promotion; '+' 
              does nothing but verify that its argument is numeric, and 
              '!' simply tests whether its argument is zero. 

            * The binary '**' operator (and the /pow/ function) shall 
              promote to an arbitrary precision integer conservatively: 
              when computing 'a**b', it will precompute 
              'ceil(log2(a)*b)', and promote if this logarithm exceeds 
              INT_BITS-1. The result shall be demoted to an integer if 
              it fits. 

            * The binary '*' operator, if either argument is an 
              arbitary-precision integer, shall promote the other to an 
              arbitrary-precision integer. If both are native integers, 
              and /Tcl_WideInt/ is at least twice the width of a native 
              /long/, then the product will be computed in a 
              /Tcl_WideInt/, and the result will be either demoted to a 
              /long/ (if it fits) or promoted to an /mp_int/. 

              In the case where /Tcl_WideInt/ is not at least twice the 
              width of /long/, the product will be computed to arbitrary 
              precision and then demoted if it fits in a /long/. /This 
              case is the only identified place where 
              arbitrary-precision arithmetic will be used on native 
              integers./ 

            * The binary '/' operator, if either argument is an 
              arbitrary-precision integer, shall promote the other. If 
              the quotient fits in a /long/, it shall be demoted. 

            * The binary '%' operator, in computing /a%b/, shall do the 
              following: 

                   * If /a/ and /b/ are both native integers, the result 
                     is also a native integer. 

                   * If /a/ is a native integer but /b/ is an 
                     arbitrary-precision integer, then /a<b/, and /a%b/ 
                     can be computed without division. 

                   * If /b/ is a native /long/, the division will be 
                     carried out using the arbitrary precision library, 
                     but the result will always be a native /long./ 

                   * If /a/ and /b/ are both arbitrary-precision 
                     integers, the result will be computed to arbitrary 
                     precision, but demoted if it fits in a /long/. 

            * The binary /+/ and /-/ operators, if either operand is an 
              arbitrary-precision integer, shall promote the other 
              operand to an arbitrary-precision integer, compute the 
              result to arbitrary precision, and demote the result to 
              /long/ if it fits. If both operands are native /long/, and 
              /Tcl_WideInt/ is larger than a native /long/, then the 
              result will be computed to /Tcl_WideInt/ precision, and 
              demoted to /long/ if it fits. 

              In the case where /Tcl_WideInt/ is only as wide as /long/, 
              the operators shall test for overflow when adding numbers 
              of like sign or subtracting numbers of opposite sign. If 
              the sign of the result of one of these operations differs 
              from the sign of the first operand, overflow has occurred; 
              the result is promoted to an arbitrary precision integer 
              and the sign is restored. 

            * The /<</ operator shall fail if its second operand is an 
              arbitrary-precision integer and the first is nonzero 
              (because this case must exceed the allowable number of 
              digits). It returns an arbitrary-precision integer if its 
              first argument is an arbitrary-precision integer, or if 
              the shift will overflow. The overflow check for /long/ 
              values (/a<<b/) is 

                   * if /b/>LONG_BITS-1, overflow. 

                   * if /a/>0, and (a & -(1<<(LONG_BITS-1-b))), 
                     overflow. 

                   * if /a/<0, and (~a & -(1<<LONG_BITS-1-b))), 
                     overflow. 

            * The '>>' operator /a>>b/, shall return 0 (/a>=0/) or -1 
              (/a<0/) if /b/ is an arbitrary-precision integer (it would 
              have shifted away all significant bits). Otherwise, the 
              shift shall be performed to the precision of /a/, and if 
              /a/ was an arbitrary-precision integer, the result shall 
              be demoted to a /long/ if it fits. 

            * The six comparison operators <, <=, ==, !=, >=, and >, can 
              work knowing only the signs of the operands wben native 
              /long/ values are compared with arbitrary-precision 
              integers. Arbitrary-precision comparison is needed only 
              when comparing arbitrary-precision integers of like sign. 
              In any case, the result is a native /long/. 

            * The /eq/, /ne/, /in/, and /ni/ operators work only on 
              string representations and will not change. 

            * The /&&/ and /||/ and /?:/ operators only test their 
              operands for zero and will not change. 

            * The ~ operator shall follow the algebraic identity: 

                     ~a == -a - 1 

              This identity holds if /a/ is represented in any word size 
              large enough to hold it without overflowing. It therefore 
              generalizes to integers of arbitary precision; 
              essentially, negative numbers are thought of as 
              "twos-complement numbers with an infinite number of 1 bits 
              at the most significant end." 

            * The base case of the /&/ (/a&b/) operator shall be defined 
              in the obvious way if /a/ and /b/ are both positive; 
              corresponding bits of their binary representation will be 
              ANDed together. For negative operands, the algebraic 
              identity above, together with De Morgan's laws, can reduce 
              the operation to the base case: 

                   * a>=0, b<0: 

                     a&b == a & ~( ~b ) == a & ( - b - 1 ) 

                   * a<0, b>=0: symmetric with the above. 

                   * a<0, b<0:  

                     a & b = ~( ~a | ~b ) = -( ( -a - 1 ) | ( -b - 1 ) ) 
                     - 1 

            * The base case of the /|/ (/a|b/) operator shall be defined 
              in the obvious way if /a/ and /b/ are both positive: 
              corresponding bits of their binary representation will be 
              ORed together. For negative operands, the algebraic 
              identity above, together with De Morgan's laws, can reduce 
              the operation to the base case: 

                   * a>=0, b<0: 

                     a|b == ~( ~a & ~b ) == -( ~a & ( -b - 1 )) - 1 

                   * a<0, b>=0: symmetric with the above. 

                   * a<0, b<0: 

                     a|b == ~( ~a & ~b ) == -( ( -a - 1 ) & ( -b - 1 ) ) 
                     - 1 

            * The base case of the /^/ (/a^b/) operator shall be defined 
              in the obvious way if /a/ and /b/ are both positive: 
              corresponding bits of their binary representation will be 
              EXCLUSIVE ORed together. For negative operands, the 
              algebraic identity above, together with the contrapositive 
              law, can reduce the operation to the base case: 

            * a>=0, b<0: 

                     a^b == ~( a ^ ~b ) == -( a ^ ( -b - 1 ) ) - 1 

            * a<0, b>=0: symmetric with the above. 

            * a<0, b<0: 

                     a^b == ~a ^ ~b == ( -a - 1 ) ^ ( -b - 1 ) 

            * The abs(), ceil(), double(), floor(), int(), round(), 
              sqrt(), and wide() math functions shall all be modified to 
              accept arbitrary-precision integers as parameters. The 
              sqrt() function will return the greatest integer less than 
              the square root when its result is greater than the 
              largest odd integer exactly represented in a 
              floating-point number. 

    7. The *incr* and *dict incr* commands shall work on 
       arbitary-precision values. Specifically, [incr a $n] will behave 
       like [set a [expr { $a + $n }]] with the constraint that /$a/ and 
       /$b/ must both be integers. 

    8. The *format* and *scan* commands will acquire /D/, /O/ and /X/ 
       specifiers that format their arguments as arbitrary-precision 
       decimal, octal, and hexadecimal integers, respectively. The /O/ 
       and /X/ specifiers will both format the /absolute value/ of an 
       integer; another specifier, /S/, will format its signum. (This is 
       done so that hexadecimal numbers can be rendered with the sign 
       preceding a leading literal 'Ox'.) 

    9. User defined math functions will be able to gain access to a 
       /bignumValue/ in the /Tcl_Value/ structure, and a TCL_BIGNUM 
       value in the /Tcl_ValueType/ enumeration. User defined functions 
       that accept arguments of type TCL_WIDE_INT will have precision 
       reduced appropriately (leading to possible "integer value too 
       large to represent" conversion errors). User defined functions 
       that accept arguments of type TCL_EITHER will no longer ever get 
       a TCL_WIDE_INT argument, but may get a TCL_BIGNUM. (Note that if 
       [TIP #232] becomes final, then /Tcl_Value/ becomes something of 
       an historical aberration.) 

 INTEGRATION DETAILS 
=====================

 The /libtommath/ source code shall be extracted from the distributions 
 available at [<URL:http://math.libtomcrypt.org/>] and brought into the 
 CVS repository as a /vendor branch/ (see 
 [<URL:https://www.cvshome.org/docs/manual/cvs-1.11.18/cvs_13.html#SEC103>] 
 for a discussion of managing vendor branches. It appears that all the 
 necessary modifications to integrate this source code with Tcl can be 
 made in the single file, /tommath.h/; it is likely that the /tools// 
 directory in the Tcl source tree will contain a Tcl script to make the 
 modifications automatically when importing a new version. CVS can 
 maintain local modifications effectively, should we find it necessary 
 to patch the other sources. 

 The chief modification is that all the external symbols in the library 
 will have TclBN prepended to their names, so that they will not give 
 linking conflicts if an extension uses a 'tommath' library not supplied 
 with Tcl - or uses any other library compatible with the Berkeley /mp/ 
 API. 

 CHOICE OF LIBRARY 
===================

 The /libtommath/ code is chosen from among a fairly large set of 
 possible third-party bignum codes. Among the ones considered were: 

 GMP:  The Gnu GMP library is probably the fastest and most complete of 
       the codes available. Alas, it is licensed under LGPL, which would 
       require all distributors who prepare binary Tcl distributions to 
       include the GMP source code. I chose to avoid any legal 
       entanglements, and avoided GMP for this reason. 

 The original Berkeley mp library:
       This library is atrociously slow, and was avoided for that 
       reason. 

 Gnu Calc: GPL licensed. A non-starter for that reason. 

 mpexpr: This Tcl extension has been available for many years and is 
       released under the BSD license. (It was the basis for Gnu Calc 
       but predates the fork, and hence is not GPL-contaminated.) It 
       would certainly be a possibility, but the code is not terribly 
       well documented, is slow by today's standards, and still uses 
       string-based Tcl API's. It would be a considerable amount of work 
       to bring it into conformance with today's Tcl core engineering 
       practices. 

 OpenSSL: The OpenSSL library includes a fast and well-documented bignum 
       library (developed so that OpenSSL can do RSA cryptography). 
       Alas, the code is licensed under the original BSD license 
       agreement, and has several parties added to the dreaded 
       Advertising Clause. The Advertising Clause would present serious 
       difficulties for our distributors, and so OpenSSL is not 
       suitable. (The OpenSSL developers are not amenable to removing 
       the Advertising Clause from the license.) 

 The /libtommath/ code is released to the Public Domain. Its author, Tom 
 St. Denis, explicitly and irrevocably authorizes its use for any 
 purpose, without fee, and without attribution. So license 
 incompatiilities aren't going to be an issue. The documentation is 
 wonderful (Tom has written a book of several hundred pages 
 [<URL:http://book.libtomcrypt.org/draft/tommath.pdf>] describing the 
 algorithms used), and the code is lucid. 

 The chief disadvantage to /libtommath/ is that it is a one-person 
 effort, and hence has the risk of becoming orphaned. I consider this 
 risk to be of acceptable severity, because of the quality of the code. 
 The Tcl maintainers could take it on quite readily. 

 ADDITIONAL POSSIBILITIES 
==========================

 The Tcl library will actually use only about one-third of /libtommath/ 
 to implement the /expr/ command. In particular, the library functions 
 for squaring, fast modular reduction, modular exponentiation, greatest 
 common divisor, least common multiple, Jacobi symbol, modular inverse, 
 primality testing and the solution of linear Diophantine equations are 
 not needed, and shall not be include in the Tcl library to save memory 
 footprint. 

 Nevertheless, these operations would be extremely useful in an 
 extension, so that programs that don't live with tight memory 
 constraints can do things like fast Diffie-Hellman or RSA cryptography. 
 We should probably consider a 'bignum' package to be bundled with the 
 core distribution that would add appropriate math functions wrapping 
 these operations. 

 REFERENCE IMPLEMENTATION 
==========================

 No reference implementation is present as yet; the plan is to develop 
 it on the CVS branch called 'tcl-numerics-branch'. Vote will not be 
 called until the implementation is far enough along to convince the Tcl 
 Core Team that it is practicable in the 8.5 release schedule. (The 
 author hopes to enlist other volunteers to assist with the effort.) 

 COPYRIGHT 
===========

 This document may be distributed subject to the terms and conditions 
 set forth in the Open Publication License, version 1.0 
 [<URL:http://www.opencontent.org/openpub/>]. 

-------------------------------------------------------------------------

 TIP AutoGenerator - written by Donal K. Fellows 

Thread: [TCLCORE] TIP #237: Arbitrary-Precision Integers for Tcl

The Tool Command Language implementation

tcl-core