MISCELLANEOUS ALGORITHMS

7.0 Testing For Prime Numbers

There are a number of algorithms where randomness plays an important part in
solving a perfectly precise problem. Perhaps the most interesting is the prime
number problem.

Suppose we want to tell whether a positive integer p is prime.  Eratosthenes'
sieve is essentially a dynamic programming technique in which we remember the
numbers divisible by small primes while working on the next. It requires one
bit for each number up to p and is suitable for numbers up to several million
(say). What if p has several hundred digits? The sieve clearly takes at least
O(p) time, and probably O(p log p).  The size of this problem is the amount of
information needed to specify p, i.e. the number of digits of p, so n = log p.
We want a fast algorithm, i.e. something polynomial in n = log p.  There is
practical interest in this problem in the area of encryption.

One of the fastest known methods is a probabilistic one based on Fermat's
little theorem. See Lehmann's paper in SIAM J. Comput. vol. 11.  If p is prime
and 1 <= a <= p-1 then

      a^(p-1)  =  1 (mod p)

   =>   a^((p-1)/2)  =  +1 or -1 (mod p)

(because a^((p-1)/2) squares to 1, and modulo a prime the only square roots
of 1 are +1 and -1)

For example if p = 11 then (p-1)/2 = 5 and:

      a:   1  2  3  4  5  6  7  8  9  10
     a^5:  1 -1  1  1  1 -1 -1 -1  1  -1

If p = 15 then (p-1)/2 = 7 and:

      a:   1  2  3  4  5  6  7  8  9  10  11  12  13  14
     a^7:  1 -7 -3  4  5  6 -2  2 -6  -5  -4   3   7  -1
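These tables can be checked mechanically. The following Python sketch (the
helper name half_power_table is mine, not from the notes; Python's built-in
pow(a, m, p) computes the modular power directly) maps each residue to the
signed representative used in the tables above:

```python
def half_power_table(p):
    """a^((p-1)/2) mod p for a = 1..p-1, using signed residues."""
    half = (p - 1) // 2
    def sym(r):                   # represent r in the range -(p-1)/2 .. (p-1)/2
        return r - p if r > p // 2 else r
    return [sym(pow(a, half, p)) for a in range(1, p)]

print(half_power_table(11))   # [1, -1, 1, 1, 1, -1, -1, -1, 1, -1]
print(half_power_table(15))   # [1, -7, -3, 4, 5, 6, -2, 2, -6, -5, -4, 3, 7, -1]
```

For the prime 11 every entry is +1 or -1; for the composite 15 almost none are.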

Some nasty number theory ensures that if p is an odd composite, then for at
least half the values of a the power a^((p-1)/2) is not +1 or -1 (mod p).
This justifies the following probabilistic algorithm:

      To test if p is prime
      If p = 2 give answer PRIME
      If p is even give answer COMPOSITE
      Choose 20 numbers (say) at random from 1 to p-1
      For each number a calculate a^((p-1)/2) (mod p)
      If any are not +1 or -1, give answer COMPOSITE
      If all are +1, give answer COMPOSITE (prob. of being wrong 2^-20)
      Otherwise give answer PRIME (prob. of being wrong 2^-20)
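The algorithm above might be rendered in Python as follows (a sketch, not from
the notes: the names modpow and probably_prime are mine, and modpow computes
the modular power by repeated squaring):

```python
import random

def modpow(a, m, p):
    """a^m mod p by repeated squaring: O(log m) multiplications."""
    result = 1
    square = a % p
    while m > 0:
        if m & 1:                       # this power of 2 occurs in m
            result = result * square % p
        square = square * square % p
        m >>= 1
    return result

def probably_prime(p, k=20):
    """Lehmann's test: a wrong answer has probability at most 2^-k."""
    if p == 2:
        return True
    if p < 2 or p % 2 == 0:
        return False
    saw_minus_one = False
    for _ in range(k):
        a = random.randrange(1, p)      # random a in 1..p-1
        r = modpow(a, (p - 1) // 2, p)
        if r != 1 and r != p - 1:
            return False                # certainly COMPOSITE
        if r == p - 1:
            saw_minus_one = True
    return saw_minus_one                # all +1 => probably COMPOSITE
```

Note that p - 1 plays the role of -1 (mod p), and that either answer may be
wrong with probability at most 2^-k.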

If k numbers are used instead of 20, the probability of a wrong answer is
2^-k. If a theoretically sound random number generator is used, and the
probability of a wrong answer is made less than the probability of a bug in
the program, the algorithm is as good as a non-probabilistic one.

All numbers, including intermediate results, can be kept modulo p, so they
all have log p digits. The question remaining is how to calculate a^m
efficiently. Using m multiplications is O(m) - too slow. Divide and conquer
suggests calculating a^(m/2) and squaring. Indeed a^(2^k) takes k squarings,
i.e. log m squarings. If m is not a power of 2, it is a sum of powers of 2:

      m  =  42  =  32 + 8 + 2  =  2^5 + 2^3 + 2^1

In 5 squarings, we can calculate a, a^2, a^4, a^8, a^16, a^32 and with 2
mults we have a^42 = a^(32+8+2) = a^32 * a^8 * a^2. In general we can use
O(log m) squarings and O(log m) multiplications. A long multiplication can
be done in O((log p)^2) steps, making O((log p)^3) overall.

Note that the algorithm does not factorise composite numbers. There is no
known polynomial algorithm for that, and indeed some modern public key
cryptography techniques depend on the fact that you cannot factorise a
number of several hundred digits.

Finally, note that there is an arbitrary precision number package on unix
which can be used in programs, or interactively through the "bc" or "dc"
commands.

7.1 String Searching
--------------------

Suppose we are searching for a word of w characters in a text of n
characters held in memory (e.g. in an editor). The obvious algorithm is
(essentially):

      To find word[1..w] in text[1..n]
      --------------------------------
      for i := 0 to n-w do
         for j := 1 to w do
            compare word[j] with text[i+j]

No better method was found until 1970, and no algorithm published until
1977. The idea is to scan the text faster, using knowledge from the
mismatches found. The easiest method to understand is a variant of the
Boyer-Moore algorithm.
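Before looking at the faster method in detail, here is a runnable Python
sketch of the obvious quadratic algorithm above (the function name
naive_search is mine; it returns 1-based positions, as in the pseudocode):

```python
def naive_search(word, text):
    """Return the 1-based position of word in text, or 0 if absent. O(wn)."""
    w, n = len(word), len(text)
    for i in range(n - w + 1):
        j = 0
        while j < w and word[j] == text[i + j]:
            j += 1
        if j == w:                # all w characters matched
            return i + 1
    return 0
```

In the worst case every alignment compares almost all of the word, hence
the O(wn) bound.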
The idea is to compare the last letter of the word first. Suppose we are
looking for STING in a text:

      A STRING SEARCH
      STING
          ^

The R mismatches. As there is no R in STING, STING cannot possibly match
" STRI", "STRIN", "TRING", or "RING ", so we can skip 5 places forward for
the next comparison:

      A STRING SEARCH
           STING
               ^
The S mismatches, but is contained in STING, so we can only skip 4 places:

      A STRING SEARCH
               STING
                   ^

This leads to a program:

      word: packed array [1..w] of char;
      text: packed array [1..n] of char;
      skip: array [char] of integer;

      procedure initialise;
      var j: integer; c: char;
      begin
         for c := chr(0) to chr(255) do skip[c] := w;
         { stop at w-1 so that no character gets skip 0, otherwise the
           search could fail to advance }
         for j := 1 to w-1 do skip[word[j]] := w-j;
      end;

      function search: integer;
      var i,j: integer;
      begin
         i := w; j := w;
         repeat
            if text[i] = word[j] then
               begin i := i - 1; j := j - 1; end
            else
               begin
                  i := i + (w-j);           { back to the window's last character }
                  i := i + skip[text[i]];   { shift on that character }
                  j := w;
               end;
         until (j<1) or (i>n);
         search := i + 1
      end;

These methods are typically O(w+n) rather than O(wn). Note that the
preprocessing is worth it. If the text is fixed, and many searches are to
be done in it, it may be worth preprocessing the text using some hash-table
or binary tree method. Note that there are variants with less overhead, and
there are variants which never "back up" in the text, suitable for file
based searching or for parallel processors.
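The same skip-table idea can be sketched in Python (a Horspool-style
rendering rather than a literal transcription of the Pascal; the function
name search and the use of a dictionary for the skip table are mine):

```python
def search(word, text):
    """Return the 1-based position of word in text, or 0 if absent."""
    w, n = len(word), len(text)
    skip = {}
    for j in range(w - 1):            # last occurrence among word[0..w-2]
        skip[word[j]] = w - 1 - j     # characters not in word get skip w
    i = w - 1                         # 0-based index of the window's last char
    while i < n:
        j, k = w - 1, i
        while j >= 0 and text[k] == word[j]:
            j -= 1                    # compare right to left
            k -= 1
        if j < 0:
            return k + 2              # k+1 is the 0-based start; +1 for 1-based
        i += skip.get(text[i], w)     # shift on the window's last character
    return 0
```

On the STING example the window shifts by 5 and then by 4, matching the
diagrams above, before falling off the end of the text without a match.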