* The organization of the BigNum Library

As mentioned in bn.doc, the library should compile on anything with an
ANSI C compiler and 16- and 32-bit data types. (Non-power-of-2 word
lengths probably wouldn't be *too* hard, but the matter is likely to
remain academic.) However, assembly subroutines can be added in a great
variety of ways to speed up computations. It's even possible to vary
the word length dynamically at run time.

Currently, 80x86 and 680x0 assembly primitives have been written in
16- and 32-bit forms, since not all members of these families support a
32x32 -> 64-bit multiply. In the future, 32/64-bit routines may be nice
for the MIPS and PowerPC processors. (The SPARC has a 64-bit extension,
but it still only produces a maximum 64-bit multiply result. The MIPS,
PowerPC and Alpha give access to 128 bits of product.)

The way that this works is that the file bn.c declares a big pile of
function pointers, and the first bnInit() call figures out which set of
functions to point these to. The functions are named so that it is
possible to link several sets into the same executable without
collisions.

The library can store numbers in big-endian or little-endian word
order, although the order of bytes within a word is always the
platform's native order. As long as you're using the pure C version,
you can compile independent of the native byte ordering, but the
flexibility is available in case assembly primitives are easier to
write one way or the other. (In the absence of other considerations,
little-endian is somewhat more efficient, and is the default. This is
controlled by BN_BIG_ENDIAN and BN_LITTLE_ENDIAN.) In fact, it would be
possible to change the word order at run time, except that there is no
naming convention to support linking in functions that differ only in
endianness. (Which is because the point of doing so is unclear.)

The core of the library is in the files lbn??.c and bn??.c, where "??"
is 16, 32, or 64. The 16-bit files are generally considered the master
source; the 32- and 64-bit versions are generated from them by a simple
textual substitution with sed. Usually, only one set of these files is
used on any given platform, but if you want multiple word sizes, you
include one set for each supported word size.

The files bninit??.c define a bnInit function for a given word size,
which calls bnInit_??() internally. Only one of these may be included
at a time; multiple word sizes are handled by a more complex bnInit
function such as the ones in bn8086.c and bn68000.c, which determine
the word size of the processor they're running on and call the
appropriate bnInit_??() function.

The file lbn.h uses <limits.h> to find the platform's available data
types. The types are defined both as macros (BNWORD32) and as typedefs
(bnword32); the typedefs aren't used anywhere, but can come in very
handy when using a debugger (which doesn't know about macros). Any of
these may be overridden on the compiler command line
(cc -DBN_BIG_ENDIAN -DBNWORD32="unsigned long"), or from an extra
include file named by BNINCLUDE on the command line
(cc -DBNINCLUDE=lbnmagic.h). The latter is the preferred way to specify
assembly primitives. So, for example, to build a 68020 version of the
library, compile the 32-bit library with -DBNINCLUDE=lbn68020.h, and
compile and link in lbn68020.c (which is actually an assembly source
file, if you look). Both 16- and 32-bit 80x86 code is included in
lbn8086.h and .asm; that code uses 16-bit large-model addressing.
lbn80386.h and .asm use 32-bit flat-model addressing.
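As a concrete illustration, a BNINCLUDE file along the lines of the
lbnmagic.h example above might look like the following. This is only a
sketch: the file name comes from the example above, the override macros
follow the conventions described here, and the lbnMulAdd1_32 prototype
and suppression trick follow the conventions described in the next
section, but the exact prototype is an assumption rather than something
taken from the library's headers.

    /* lbnmagic.h -- hypothetical BNINCLUDE file; contents are purely
     * illustrative.  Real files of this kind are lbn68020.h, lbn8086.h
     * and lbn80386.h. */

    #define BN_LITTLE_ENDIAN 1       /* word order the assembly code expects */
    #define BNWORD32 unsigned long   /* override lbn.h's 32-bit word type */

    /* Declare an assembly-language primitive; the matching self-#define
     * (see the "#ifndef" convention described in the next section) keeps
     * the C source from defining or declaring its own version.  The
     * prototype shown is an assumption about the calling convention. */
    BNWORD32 lbnMulAdd1_32(BNWORD32 *out, BNWORD32 const *in,
                           unsigned len, BNWORD32 k);
    #define lbnMulAdd1_32 lbnMulAdd1_32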
Three particularly heavily used macros defined by lbn.h are BIG(x),
LITTLE(y) and BIGLITTLE(x,y). These expand to x (or nothing) on a
big-endian system, and y (or nothing) on a little-endian system. They
are used to conditionalize the rest of the code without taking up
entire lines to say "#ifdef BN_BIG_ENDIAN", "#else" and "#endif".

* The lbn??.c files

The lbn??.c files contain the low-level bignum functions. These
universally expect their numbers to be passed to them in (buffer,
length) form and do not attempt to extend the buffers. (In some cases,
they do allocate temporary buffers.) The buffer pointer points to the
least-significant end of the buffer. If the machine uses big-endian
word ordering, that is a pointer to the end of the buffer. This is
motivated by considering pointers to point to the boundaries between
words (or bytes). If you consider a pointer to point to a word rather
than between words, the pointer in the big-endian case points to the
first word past the end of the buffer.

All of the primitives have names of the form lbnAddN_16, where the _16
is the word size. All are surrounded by "#ifndef lbnAddN_16". If you
#define lbnAddN_16 beforehand (either on the command line or in the
BNINCLUDE file), the C code will neither define *nor declare* the
corresponding function. The declaration must be suppressed in case you
declare it in a magic way with special calling attributes or define it
as a macro.

If you wish to write an assembly primitive, lbnMulAdd1_??, which
multiplies N words by 1 word and adds the result to N words, returning
the carry word, is by FAR the most important function - almost all of
the time spent performing a modular exponentiation is spent in this
function. lbnMulSub1_??, which does the same but subtracts the product
and returns a word of borrow, is used heavily in the division routine
and thus by GCD and modular inverse computation. These two functions
are the only ones which *require* some sort of double-word data type,
so if you define them in assembly language, the ?? may be the widest
word your C compiler supports; otherwise, you must limit your
implementation to half of the maximum word size. Other functions will,
however, use a double-word data type if available.

Actually, there are some even simpler primitives which you can provide
to allow double-width multiplication: mul??_ppmm, mul??_ppmma and
mul??_ppmmaa. These are expected to be defined as macros (all arguments
are always side-effect-free lvalues), and must return the two-word
result of the computation m1*m2 + a1 + a2. It is best to define all
three, although any that are not defined will be generated from the
others in the obvious way. GCC's inline assembler can be used to define
these. (The names are borrowed from the GNU MP package.) A portable C
sketch of these macros, and of lbnMulAdd1_?? built on them, is given
below.

There is also lbnMulN1_??, which stores the result rather than adding
or subtracting it, but it is less critical. If it is not provided, but
lbnMulAdd1_?? is, it will be implemented in terms of lbnMulAdd1_?? in
the obvious way.

lbnDiv21_??, which divides two words by one word and returns a quotient
and remainder, is greatly sped up by a double-word data type, macro
definition, or assembly implementation, but has a version which will
run without one. If your platform has a double/single divide with
remainder, it would help to define this, and it's quite simple.

lbnModQ_??, which reduces a multi-precision number modulo a "quick"
(< 65536) modulus, is used heavily by prime generation for trial
division, but is otherwise little used.
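To make the above concrete, here is a sketch in portable C of the
mul32_ppmm family and of a reference lbnMulAdd1_32 built on top of it.
It assumes a compiler with an unsigned long long wider than the word
type, and little-endian word order within the array; the BNWORD32
typedef is a stand-in for lbn.h's macro, and the lbnMulAdd1_32
prototype is an assumption about the library's conventions. Treat it as
an illustration of the required semantics, not as the library's actual
code.

    #include <stdint.h>

    typedef uint32_t BNWORD32;   /* stand-in for lbn.h's BNWORD32 macro */

    /* Double-width multiply macros: (ph,pl) = m1*m2 + a1 + a2.
     * Sketch only; assumes unsigned long long is wider than BNWORD32. */
    #define mul32_ppmmaa(ph, pl, m1, m2, a1, a2) \
        do { \
            unsigned long long _t = (unsigned long long)(m1) * (m2) \
                                  + (a1) + (a2); \
            (ph) = (BNWORD32)(_t >> 32); \
            (pl) = (BNWORD32)_t; \
        } while (0)
    #define mul32_ppmma(ph, pl, m1, m2, a1) mul32_ppmmaa(ph, pl, m1, m2, a1, 0)
    #define mul32_ppmm(ph, pl, m1, m2)      mul32_ppmmaa(ph, pl, m1, m2, 0, 0)

    /* Reference semantics of lbnMulAdd1_32:
     *   out[0..len-1] += in[0..len-1] * k, returning the final carry word.
     * Little-endian word order and this particular prototype are
     * assumptions made for the illustration. */
    BNWORD32 lbnMulAdd1_32(BNWORD32 *out, BNWORD32 const *in,
                           unsigned len, BNWORD32 k)
    {
        BNWORD32 carry = 0;
        unsigned i;

        for (i = 0; i < len; i++) {
            BNWORD32 hi, lo;
            /* The two-word result cannot overflow:
             * (2^32-1)^2 + 2*(2^32-1) == 2^64 - 1. */
            mul32_ppmmaa(hi, lo, in[i], k, out[i], carry);
            out[i] = lo;
            carry = hi;
        }
        return carry;
    }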
Other primitives may be implemented depending on the expected usage
mix. It is generally not worth implementing lbnAddN_?? and lbnSubN_??
unless you want to start learning to write assembly primitives on
something simple; they just aren't used very much. (Of course, if you
do, you'll probably get some improvement in both speed and object code
size, so it's worth keeping them in once written.)

* The bn??.c files

While the lbn??.c files deal in words, the bn??.c files provide the
public interface to the library and deal in bignum structures. These
contain a buffer pointer, an allocated length, and a used length. The
lengths are specified in words, but as long as the user doesn't go
prying into such innards, all of the different word-size libraries
provide the same interface; they may be exchanged at link time, or even
at run time.

The bn.c file defines a large collection of function pointers and one
function, bnInit. bnInit is responsible for setting the function
pointers to point to the appropriate bn??.c functions. Each bn??.c file
provides a bnInit_?? function which sets itself up; it is the job of
bnInit to figure out which word size to use and call the appropriate
bnInit_?? function. If only one word size is in use, you may link in
the file bninit??.c, which provides a trivial bnInit function. If
multiple word sizes are in use, you must provide the appropriate bnInit
function; see bn8086.c for an example.

For maximum portability, you may just compile and link in the files
lbn00.c, bn00.c and bninit00.c, which determine the best word size to
use with the preprocessor at compile time. (The logic is actually
located in the file bnsize00.h, so that the three .c files cannot get
out of sync.)

The bignum buffers are allocated using the memory management routines
in lbnmem.c. These are word-size independent; they expect byte counts
and expect the system malloc() to return suitably aligned buffers. The
main reason for this wrapper layer is to support any customized
allocators that the user might want to provide.

* Other bn*.c files

bnprint.c is a simple routine for printing a bignum in hex. It is
provided in a separate file so that its calls to stdio can be
eliminated from the link process if the capability is not needed.

bntest??.c is a very useful regression test if you're implementing
assembly primitives. If it doesn't complain, you've probably got it
right. It also does timing tests so you can see the effects of any
changes.

* Other files

sieve.c contains some primitives which use the bignum library to
perform sieving (trial division) of ranges of numbers looking for
candidate primes. This involves two steps: using a sieve of
Eratosthenes to generate the primes up to 65536, and using those primes
to do trial division on a range of numbers following a larger input
number. Note that this is designed for large numbers, greater than
65536: there is no check to see whether the input is itself one of the
small primes, so if it is divisible by a small prime, it is assumed
composite.

prime.c uses sieve.c to generate primes. It uses sieve.c to eliminate
numbers with trivial divisors, then does strong pseudoprimality tests
with some small bases. (Actually, the first test, to the base 2, is
optimized a bit to be faster when it fails, which is the common case,
but 1/8 of the time it's not a strong pseudoprimality test, so an
extra, strong, test is done in that case.) It prints progress
indicators as it searches. A sketch of the strong pseudoprimality test
itself is given below.
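As background for the description above, a strong pseudoprimality test
to a single base works as follows. This sketch operates on ordinary
32-bit integers rather than the library's bignums, so it illustrates
only the algorithm, not prime.c's actual code; the function name is
made up for the example.

    #include <stdint.h>

    /* Strong pseudoprimality test of an odd n > 2 to a single base
     * (illustrative only; uses 32-bit integers, not bignums).
     * Returns 1 if n is a strong probable prime to that base, and 0 if
     * the base proves n composite. */
    static int strongPseudoprime(uint32_t n, uint32_t base)
    {
        uint32_t d = n - 1;
        int s = 0;
        uint64_t x = 1, b = base % n, e;

        while ((d & 1) == 0) {          /* write n-1 = d * 2^s, d odd */
            d >>= 1;
            s++;
        }
        for (e = d; e != 0; e >>= 1) {  /* x = base^d mod n */
            if (e & 1)
                x = x * b % n;
            b = b * b % n;
        }
        if (x == 1 || x == n - 1)
            return 1;
        while (--s > 0) {               /* square at most s-1 more times */
            x = x * x % n;
            if (x == n - 1)
                return 1;
        }
        return 0;
    }

prime.c runs this kind of test first to the base 2 (with the
optimization noted above) and then to a few other small bases.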
The search examines a range of numbers starting at a given value, but
it does so in a "shuffled" order, inspired by algorithm M from Knuth.
(The random number generator to use for this is passed in; if no
function is given, the numbers are searched in sequential order and the
return value will be the next prime >= the input value.)

germain.c operates similarly, but generates Sophie Germain primes; that
is, primes p such that (p-1)/2 is also prime. It lacks the shuffling
feature - searching is always sequential.

jacobi.c computes the Jacobi symbol between a small integer and a
BigNum. It's currently only ever used in germain.c.

* Sources

Obviously, a key source of information was Knuth, Volume 2,
particularly on division algorithms. The greatest inspiration, however,
was Arjen Lenstra's LIP (Large Integer Package), distributed with the
RSA-129 effort. While very difficult to read (there is no internal
documentation on sometimes very subtle algorithms), it showed me many
useful tricks, notably the windowed exponentiation algorithm that saves
so many multiplies. If you need a more general-purpose large-integer
package, with only a minor speed penalty, the LIP package is almost
certainly the best available. It implements a great range of efficient
algorithms.

The second most important source was Torbjörn Granlund's gmp (GNU
multi-precision) library. A number of C coding tricks were adapted from
there. I'd like to thank Torbjörn for some useful discussions and for
letting me see his development work on GMP 2.0.

Antoon Bosselaers, René Govaerts and Joos Vandewalle, in their CRYPTO
'93 paper, "Comparison of three modular reduction functions", brought
Montgomery reduction to my attention, for which I am grateful.

Burt Kaliski's article in the September 1993 Dr. Dobb's Journal, "The
Z80180 and Big-number Arithmetic", pointed out the advantages (and
terminology) of product scanning to me, although the limited
experiments I've done have shown no improvement from trying it in C.

Hans Riesel's book, "Prime Numbers and Computer Methods for
Factorization", was of great help in designing the prime testing,
although some of the code in the book, notably the Jacobi function in
Appendix 3, is an impressive example of why GOTO should be considered
harmful. Papers by R. G. E. Pinch and others in Mathematics of
Computation were also very useful.

Keith Geddes, Stephen Czapor and George Labahn's book "Algorithms for
Computer Algebra", although it's mostly about polynomials, has some
useful multi-precision math examples.

Philip Zimmermann's mpi (multi-precision integer) library suggested
storing the numbers in native byte order to facilitate assembly
subroutines, although the core modular multiplication algorithms are so
confusing that I still don't understand them. His boasting about the
speed of his library (albeit in 1986, before any of the above were
available for study) also inspired me to make a particular effort to
soundly beat it. It also provoked a strong reaction from me against
fixed buffer sizes, and complaints about its implementation from Paul
Leyland (interface) and Robert Silverman (prime searching) contributed
usefully to the design of this current library.

I'd like to credit all of the above, plus the Berkeley MP package, with
giving me difficulty finding a short, unique distinguishing prefix for
my library's functions. (I have just, sigh, discovered that Eric Young
is using the same prefix for *his* library, although with the
bn_function_name convention as opposed to the bnFunctionName one.)
I'd like to thank the original implementors of Unix "dc" and "factor"
for providing useful tools for verifying the correct operation of my
library.

* Future

- Obviously, assembly-language subroutines for more platforms would
  always be nice.
- There's a special case in the division for a two-word denominator
  which should be completed.
- When the quotient of a division is big enough, compute an inverse of
  the high word of the denominator and use multiplication by that to do
  the divide.
- A more efficient GCD algorithm would be nice to have.
- More efficient modular inversion is possible. Do it.
- Extend modular inversion to deal with non-relatively-prime inputs.
  Produce y = inv(x,m) with y * x == gcd(x,m) mod m.
- Try some product scanning in assembly.
- Karatsuba's multiplication and squaring speedups would be nice; the
  basic identity is sketched after this list.
- I *don't* think that FFT-based algorithms are worth implementing yet,
  but it's worth a little bit of study to make sure.
- More general support for numbers in Montgomery form, so they can be
  used by more than the bowels of lbnExpMod.
- Provide an lbnExpMod optimized for small arguments > 2, using
  conventional (or even Barrett) reduction of the multiplies, and
  Montgomery reduction of the squarings.
- Adding a Lucas-based prime test would be a real coup, although it's
  hard to give rational reasons why it's necessary. I have a number of
  ideas on this already. Find out if norm-1 (which is faster to
  compute) suffices.
- Split up the source code more to support linking with smaller subsets
  of the library.
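For reference, the identity behind the Karatsuba speedup mentioned in
the list above, shown on a toy 32x32 -> 64-bit multiply built from
16-bit halves. This is purely illustrative (the function name is made
up) and is not part of the library, which would apply the same idea
recursively to multi-word numbers.

    #include <stdint.h>

    /* Karatsuba's identity, illustrated on 16-bit halves:
     *   (a1*2^16 + a0) * (b1*2^16 + b0)
     *     = hi*2^32 + ((a1+a0)*(b1+b0) - hi - lo)*2^16 + lo
     * where hi = a1*b1 and lo = a0*b0, so only three multiplications
     * are needed instead of four. */
    static uint64_t karatsuba32(uint32_t a, uint32_t b)
    {
        uint32_t a1 = a >> 16, a0 = a & 0xFFFF;
        uint32_t b1 = b >> 16, b0 = b & 0xFFFF;
        uint64_t hi  = (uint64_t)a1 * b1;
        uint64_t lo  = (uint64_t)a0 * b0;
        uint64_t mid = (uint64_t)(a1 + a0) * (b1 + b0) - hi - lo;

        return (hi << 32) + (mid << 16) + lo;
    }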