A QUARTERLY PUBLICATION OF ACCS

Residue Number Systems

P. V. Ananda Mohan, Technology Advisor, Centre for Advanced Computing (CDAC), Bangalore

Residue Number systems have been extensively studied in past four decades in view of their advantages in some applications in Digital Signal Processing and Cryptography. In this tutorial paper, we introduce the basic concepts highlighting the advantages and

disadvantages over other number systems.

Digital computation and Digital signal processing need arithmetic processors of moderate word lengths 16 to 32 bits.Fixed point and floating point arithmetic units which can perform addition, subtraction, multiplication are extensively employed. On the other hand, several cryptographic algorithms used for authentication such as RSA (Rivest,Shamir and Adleman) cryptosystems [1] need to perform operations on large word length operands of lengths ranging from 1024 bits to 4096 bits. Other authentication systems use Elliptic Curve Cryptography [2] where word lengths range from 160 bits to 256 bits. These can be realized in embedded systems using resident processors such ARM or using DSPs using word based implementations. In some applications such as Homomorphic encryption [3] used for computing on encrypted data in cloud computing, processors need to handle words of the size 768, 000 bits. Residue Number Systems facilitate handling such big number operations using smaller word length processors.

In the last decade, considerable research has been carried out in this area in order to design faster, low area and power efficient processors. Just as in the case of other Arithmetic and logic units, a RNS based Arithmetic unit needs to perform basic operations such as addition, subtraction, multiplication, scaling or division, comparison and sign detection. We will consider these in the next few sections in some detail. We will present several example to facilitate easy understanding and avoid too much of theory.

II. ARCHTHCURE OF A RESIDUE NUMBER SYSTEM

Let us consider a conventional 32 bit processor for illustration. It needs to perform several 32 bit operations.Consider simple addition of two binary numbers. The time needed using carry propagate adders or Ripple carry adders is 31 full-adder delays. Most text books [4], [5] give improved designs using carry-look-ahead techniques involving two or more levels so as to reduce the delay for computing the carrybit. Today the benchmark is roughly log2n full adder delay for a n-bit two operand addition. Thus for a given technology (i.e. specified gate delay), this is the minimum time needed. This can be reduced only if we use smaller width operands. Thus, one solution is to represent the 32 bit word using smaller numbers and then being able to perform operations faster. Earlier logarithmic systems have been tried which can convert the time consuming multiplication or division operations to addition and subtraction. But we need log tables to convert from normal form to logarithms and after multiplication or division conversion back to normal form. Unfortunately, addition and subtractions cannot be done easily in logarithmic number system. RNS was introduced by Garner [6]. Szabo and Tanaka [7] and Watson and Hastings [8] have documented the research results in 1967 and have addressed the various issues and design of digital computers using RNS using electronic components available at that time including magnetic memories etc was described.

The concept of RNS is attributed to Chinese who have discovered the so called Chinese Remainder Theorem. The problem given by Sun Tzu in 3rd Century is as follows:

- We have things of which we do not know the number
- If we count them by threes, wehave two left over
- If we count them by fives, we have three left over
- If we count them by sevens, we have two left over

Stated in a simple way, the poser is “what number when divided by 3, 5 and 7 gives the remainders 2, 3, 2?”.

In short, for constructing a RNS, we should choose few mutually prime numbers of similar size whose product is equal to the dynamic range we need. As an illustration, for achieving 32 bit dynamic range, we can choose 2047, 2048, 2049 as the three moduli which are mutually prime. The product of these three primes is 8589932544 <233. A typical RNS based system architecture is presented in Fig.1.

We need to have first a front-end converter to find the residues of the input numbers mod these three moduli. By residue, we mean the remainder obtained by dividing the given number by the modulus. As an example, 8000 mod 2047 =1859. We have thrown the quotient 3 and taken only the remainder. Thus for the three moduli system, we are considering, we represent it in RNS by the three residues (1859, 1856, 1853). Note that this representation is unique.

We therefore need to process such smaller 11 bit numbers in several processors in parallel in a Single instruction multiple data (SIMD) architecture and after our computation using several moduli processors, we need to convert it back to normal number. Thus problem given in the beginning of this section is reverse conversion problem. The number of moduli can be many so

that word length of the individual processors can be reduced. For example using eight four-bit or five-bit mutually prime moduli, we can construct a 32 bit system. Example {9, 11, 13, 17, 19, 23, 25, 29} can realize a 32 bit RNS. We consider next, various operations in RNS.

**III. ARITHMETIC OPERATIONS IN RNS**

We consider modulo addition, modulo subtraction and modulo multiplication, scaling, comparison and sign detection in some detail.

Consider finding (A+B) mod M = (5 + 7) mod 11. We add first 5 and 7 to get 12.Since 12 is greater than 11, we subtract 11 from 12 to get 1. Thus one addition and one conditional subtraction are needed. Similarly, consider subtracting 7 from 5. We get -2 and since the answer is negative, we need to add 11 to obtain 9. Thus, two instructions or two operations are needed. In hardware, we can speed up by computing both (A+B) and (A+B-M) using two adders in parallel and selecting the correct result using a 2:1 multiplexer by looking at thesign of the second adder computing(A+B-M).

Next let us consider multiplication operation (5×7) mod 11. We get 35 by simple multiplication and we divide 35 by 11 to obtain the remainder as 2. We observe that the product is double the length in bits of the given residues.Then, we need to perform division. Division in conventional arithmetic is quite complicated compared to multiplication. One line of thinking is as follows: When we do not need the quotient, why should we follow this method? Moreover, some designers do not like the data path width to become double. Thus lot of effort has gone into the problem of modulo multiplication in the past thirty years.

We will demonstrate this technique using an example given in Fig. 2. We start the computation with MSB of the multiplier. In every step we multiply the 8 previous result by 2 and add A if bi is 1.If bi is zero, no addition of A is needed. We then reduce the result mod M. Note that the term in the brackets S, the intermediate sum, can be at most 3(M-1) and we reduce it mod M by subtracting M or 2M and depending on the sign of the results, we select either S or S-M or S-2M. Thus, in each step, we need a conditional addition, one or two subtractions and one selection. This operation needs to be done as many times as the number of bits in B. We can use Fast adders to perform the addition/subtraction in the loop.

Consider finding A×B mod M = 1001×1011mod 1101(in decimal (9×11) mod 13 = 8).We initialize the accumulator S to zero.Denoting B in binary as b3b2b1b0, we start with MSB b3 of B. We add A since it is 1. S = 0+1001 = 9. We find 9 mod 13 = 9. Next we double 9 and since b2 = 0, we do not add A. Tus, S = 18 mod 13 = 5. We double 5 and add to A since b1 = 1 to get S = 10+9 = 19 mod 13 = 6. We double 6 and add A since b0 is 1 to get S = 21 mod 13 = 8. This is the result.

The technique can be extended by considering more bits of operand B at a time which is known as high radix modulo multiplication. Note that each time we need to multiply the previous output by 2r where r is the number of bits of B we are taking at a time.We consider the same example given above taking two bits of B at a time. We need only two steps.

Consider finding A×B mod M = 1001×1011 mod 1101 We Initialize the accumulator S to zero. Denote B in binary as b3b2b1b0. We start with Most significant digit (b3b2)2 = (2)10 of B. (The subscript 2 indicates binary and 10 indicates decimal). We add 2A since this digit is 2. Hence, S = 0+2×1001 = 18.We find 18 mod 13 = 5. Next we multiply by 4 to get 20 and since (b1b0)2= (3)10, we add 3×1001 = 27 to get S = 20+27 = 47 mod 13 = 8.

The price paid is that size of S has increased and 0 < S < 7(M-1) and hence mod M reduction is slightly involved compared to the earlier case of taking one bit of B at a time in which case 0 < S < 3(M-1). Note that, we perform hand calculation by taking one digit at a time (known as radix 10). Today, hardware and software designs consider 4 bits (or even 32 bits) of B at a time thus needing 256 (or 32) steps to perform one modulo multiplication of 1024 bits (A×B) mod M with A, B and M of 1024 bits) for example for use in RSA algorithm.

We consider another two techniques for modulo multiplication. In one method known as Barrett reduction [9], the product A×B is first found which obviously has word length twice that of A and B assuming both A and B are of the same word length. Here, we find the approximate quotient q first and then subtract q×M from X = A×B. The remainder is our answer with some

correction (one subtraction of M) may be required. We need to pre-compute first. Then, we effectively compute the quotient as . Note that many operations such as division by 2n+1 can be carried out simple shifts (omitting bits) in this technique. We will illustrate with

an example.

As an illustration, consider finding (121) mod 13. Evidently n = 4. We obtain and Thus, remainder r1 = 121-13×8 = 17. Since r1 > 13, we have r1 = 17 mod 13 = 4.

Another popular technique is Montgomery modular multiplication [10]. The troublesome reduction of S mod M we have seen in the first method is removed here. However, this comes at a small price. We actually compute (A×B×2-n) mod M instead of (A×B) modM. We pre-scale (i.e multiply) the inputs A and B first with 2n. Then we do the Montgomery step. We get ((A×2n)×(B×2n) ×2-n) = (A×B×2n) mod M. At the end, we post-multiply by 2-n mod M to remove the unwanted 2n to get (A×B) mod M. The computation of (A×B×2-n) mod M is simple which does not need modulo reduction of S as we have seen earlier. We only need to add B to S and if LSB of the Sum (S+B) is 1 add M or else do not add. The result will have LSB zero so that we score it off meaning dividing by 2 to get the new S. Thus n steps are required to divide by 2n. We illustrate Montgomery algorithm considering the same previous example.

Consider finding A×B mod M = 1001×1011 mod 1101. We multiply first both A and B by 16 since n = 4 and find mod 13. We get Aʹ = 1, Bʹ = 7. We need to find (0001×0111×2-4) mod 13 =(00000111×2-4) mod 13 using Montgomery algorithm. Since LSB is 1, add 13 to 7 to get 20 which when divided by 2 (right shifting by one bit) yields (00001010)2. Since LSB is zero, we need not add 13 but we shift by one bit to get (0000101)2. Since LSB is 1, we haveto add 13 to get (0010010)2 when right shifted by one bit gives 001001. Since LSB is 1, we add 13 and shift right by 1 bit to obtain 1011. Note that we have obtained now (0001×0111×2-4) mod 13 = 11.

At this stage, we have (A×B×2n) mod M. We can continue four more times to get the actual result by getting rid of the 2n factor. The answer is 8.

So far we have seen only multiplication modulo M. In cryptography, we need exponentiation mod M. We need to find (XY) mod M. Here we use the square and multiply technique. Consider Y a binary number say 13. We can get X13 mod M = {[(X8)mod M] × [(X4) mod M] ×[(X)mod M]} mod M. Thus, we need to have all powers of X reduced mod M which are squares of the previous one. Then we choose to multiply some of them only depending on the exponent. For example we have computed X mod M, X2 mod M, X4modM and X8 mod M. But since exponent 13 does not have 2 in the binary expansion, we ignore the X2 mod M value for multiplication. We use it only for getting X4 mod M by squaring. In general, for 1024 bit exponents as are needed today in RSA cryptography, we need to perform 1023 modulo squarings and in the worst case 1023 modulo multiplications all mod M. Today, smartcards, RFIDs or mobile handsets have got powerful cryptoengines optimized to realize these operations or use Java software routines using Barrett reduction or Montgomery algorithm. We have limited ourselves to basics here. Several techniques for speeding up the modulo multiplication and exponentiation do exist which consider more bits at a time or even words at a time see for example [11].

We need to find residue mod mi for the input big binary number. One simple and obvious method is to divide the given number X by mi to get the remainder. This needs division operation which is cumbersome compared to multiplication or addition or subtraction. We choose moduli of special type so that this can be done easily. This stems from the basic fact we know for many years. Suppose we need 4567 mod 9. We suggest adding all digits till you get one digit. We have (4+5+6+7)9 = (22)9 = 4. We have used the property that 10 mod 9 = 1, 100 mod 9 = 1 and 1000 mod 9 = 1. Hence place value mod 9 of all digits is 1. Hence the residue mod 9 is obtained by adding all the digits. We repeat till we get a single digit answer less than 9. If it is 9, the answer is zero. We use the same idea for any modulus, but we need to deal with binary numbers.

Given a modulus mi, we find its order meaning p such that 2p = 1 mod mi. Example consider mi = 7. We have 23 = 1mod7, 24 = 2mod7, 25 = 4mod7, 26 = 1mod7 etc. The residues of powers of 2 (i.e. 2x mod m) have a periodic behaviour. They keep repeating after the period. Thus, if we have a binary number (0110100010011001100)2 = (214,220)10 = 344CC (Hex) and want to find residue mod 7, divide the word into 3 bit fields add all of them. If the sum is word of length more than 3 bits repeat the procedure. Following the above, we have 0 110 100 010 011 001 100 mod 7 = 10 100 mod 7 = 6. If you want residue mod 9, we have another interesting behavior: 20mod9 = 1. 21 mod 9 = 2, 22 mod 9 = 4, 23 mod 9 = 8, 24mod9 = 7, 25mod9 = 5, 26 mod9 =1, 27 mod 9 = 2,28 mod 9 = 4, 29mod9 = 8, 210mod9 = 7,211mod9 = 5, 212mod9 = 1. Here again the order or period is 6 since 26mod9 = 1. We can observe one more characteristic. There is a half-period also since residues repeat after half-period but with negative sign. The residues are 1,2, 4, -1,-2,-4, 1,2,4, -1,-2,-4, 1,2…. . We can take advantage of this. We can add all even 3-bit fields and all odd 3-bit fields and subtract the former from the latter to get the residue mod9. For the example above, 0 110 100 010 011 001 100, sum of odd fields is 1011 an sum of even fields is 1001 and Sum of odd fields – sum of even fields = 2. The reader may verify that 214220 mod 9 =2. This simplicity of forward conversion is the reason why many multi-moduli RNS today use powers of two related moduli sets. Some examples are {31, 32,33}, {31, 32, 33, 65}, {15, 16, 17, 31} etc.

This is a difficult problem. In RNS, by looking at residues, we cannot compare two numbers or know the sign of a given number easily unless we convert the RNS number to binary number system.

Hence it is important to perform fast reverse conversion i.e. residue to binary conversion. Several tens of papers have been written on the subject and several techniques have been developed. We consider two most important techniques (a) Chinese Remainder theorem (CRT) and (b) Mixed Radix conversion (MRC). Using these we can solve the problem we started with in Section 2.

For our problem we have the RNS {3, 5,7} and the given residues are (2, 3, 2).We illustrate CRT for a three moduli set {m1, m2, m3} with given residues (r1, r2,r3). CRT states that the decoded number is X = (r1W1+r2W2+r3W3) mod M where M = m1m2m3.(1)

The values Wi are called weights. These weights are defined as . Note that and the symbol stands for multiplicative inverse of Mi with respect to modulus mi. This means that αi×Mi when divided by mi yields a remainder 1.

We will now use CRT for our problem. The various Mi values are M1 = 5×7 = 35, M2 = 3×7 = 21, M3 = 3×5 = 15. Now we have . Next, we find the weights as W1 = 70, W2 = 21 and W3 = 15. From(1), next we obtain X = (2×70+3×21+2×15) mod (105) = (233) mod (105) = 23. Thereader can verify that residues of 23 for moduli 3, 5, 7 are as given.

The problem with CRT is that the sum in (1) before mod M reduction can be as many times M as the number of moduli and we need to reduce it mod M which is cumbersome. But the advantage of CRT it is very fast since weighing the inputs with weights Wi is done in parallel whereas only addition of these weighted products and modulo M reduction needs time because M is a large number.

We next consider MRC technique. This is sequential and takes (n-1) steps for a n-moduli RNS. The binary number X corresponding to given residues is written in Mixed Radix form as

X = d 0 + d 1 m 1 + d 2 m 1 m 2 + d 3 m 1 m 2 m 3 + . + d n m 1 m 2 m 3 . m n − 1(2)

Note that d0 is given residue r1. We next compute d1 by subtracting r1 from all other residues ri mod mi and divide the result by m1. Since we have subtracted the residue corresponding to m1 from X, evidently the remainder (X-r1) is divisible exactly by m1. This is the basic principle of MRC. The division is performed by multiplying the residues of all other channels with the multiplicative inverse The residues at this stage correspond to . We next subtract the residue corresponding to modulus m2 in the new set of residues and continue in the same way to obtain the other Mixed Radix digits. This process continues till the last residue channel is reached. An example of four moduli RNS is shown in Fig. 2 (a) together with a numerical example in Fig. 2 (b) . After Mixed Radix digits are found, we can compute the binary number following (2). This involves big number multiplications but the advantage is that no cumbersome modulo M reduction is needed. we can represent all positive numbers between 0 to 104. Alternatively, we can represent numbers between -52 to +52. The positive numbers are between 0 and 52. Negative numbers are between 53 and 104. For example, we represent -3 as 105-3 = 102. Its residues are complements of residues of 3 with respect to the moduli:** **There are several other techniques for RNS to nary conversion introduced more recently such as New Chinese Remainder Theorem., Mixed-Radix CRT, core function etc [12][13][14] each having its own advantages and disadvantages regarding area and time needed in hardware implementation.

It is interesting to note that scaling by one modulus or product of two or more moduli is possible using MRC easily. In the example of Fig.2 (b), the residues corresponding to scaling by 3 are available in first step as in the reduced RNS {5, 7, 11}. In the second step, we have the result of scaling by 15 as in the reduced RNS {7, 11} and in the last step we have corresponding to scaling by 105 as d4 in the reduced RNS {11}. Note, however, that these results do not give residues corresponding to the “scaling” moduli (moduli to the left of the scaled result). We need to have these residues also for the complete answer to be available in RNS. This needs a step known as base extension. This needs application of Mixed Radix Conversion in the reverse order of moduli.

Scaling by a power of 2 also is possible as is needed in some DSP applications. Scaling by arbitrary number also is possible with some difficulty.

Sign detection is possible after performing reverse conversion using CRT or MRC. For a given RNS e.g. {3, 5, 7}, we can represent all positive numbers between 0 to 104. Alternatively, we can represent numbers between -52 to +52. The positive numbers are between 0 and 52. Negative numbers are between 53 and 104. For example, we represent -3 as 105-3 = 102. Its residues are complements of residues of 3 with respect to the moduli:3 = (0, 3, 3) and -3 = (3-0, 5-3, 7-3) = (0, 2, 4). By looking at the residues, we cannot find the sign of the corresponding number. Hence, after decoding, we need to compare with 52 to know whether the number is positive or negative. For some special moduli sets, however, some simpler methods have been recently presented. Comparison also can be done after reverse conversion using CRT or MRC. However in MRC case, the comparison can be done using Mixed radix digits di of given numbers A and B without needing full conversion using (2). We can start comparison with highest Mixed Radix digit dn-1 and if they are equal , then we compare the next Mixed radix digit dn-2. Thus n sequential steps will be needed in the worst case in case of a n-moduli RNS.

In the past fifteen years numerous moduli sets have been introduced which enable fast reverse conversion and forward conversion and fast multiplication operations without using Read Only Memory (ROM) based Lookup tables (LUTs). Some moduli sets are for example {2n-1, 2n, 2n+1}, {2n-1,2n, 2n-1-1}, {2n-1, 2n, 2n+1-1},{22n-1, 2n, 22n+1}, {2n-1, 2n, 2n+1, 2n+1 -1}, {2n-1, 2n, 2n+1, 2n+1 +1}, {2n-1, 2n, 2n+1, 2n-1-1}, {2n-1, 2n, 2n+1, 2n-1 -1}, {2n-1, 2n, 2n+1, 22n -1} etc.

In this tutorial paper, we have introduced the concepts of Residue NumberSystems together with the description of procedures for performing various arithmetic operations. In current state of the art, RNS processors have been built for high Dynamic ranges up to 1024 bits using eight to ten moduli taking advantage of special powers of two related mutually prime numbers and massive mathematical operations as needed in some cryptographic pairing protocols [15] have been shown to be benefited by the use of Residue Number system. The parallelism possible with RNS based processors has been exploited to design cryptographic systems to prevent side-cannel attacks. RNS is not only useful for implementing RSA algorithm and Elliptic curve cryptography using Montgomery multiplication and exponentiation, but alsois useful for applications in post quantum cryptographic implementations.

Notable among these are Lattice-based cryptosystems which use thousands of bits of key. While multiprecision arithmetic can be used to implement such big word length operations, RNS is naturally suited to these problems since operation are inherently mapped to several small word length operands.The reader is referred to [16] for a comprehensive tutorial on Application of RNS to Cryptographic system design.

Several RNS based Digital Signal processing applications have been explored for building FIR, IIR filters, FFT engines and implementations on ASIC as well as FPGAs have been continuously being investigated for low power and low area and speed efficient designs. RNS is especially attractive for FIR filters of large taps since all weighting by coefficients and accumulation can be done very efficiently in residue form. Hence, the time taken for input to residue conversion and residue to binary conversion is not significant compared to the advantage gained in fast weighting and accumulation.

More over scaling and stability issues do not arise. RNS is especially attractive for providing tolerance against soft errors. In addition RNS leads to power efficiency due to small word length computations being done in parallel instead of large word length operations. Using special techniques like Quadratic Residue number systems, all transforms such as DCT, DWT (Discrete Wavelet Transform), NTT, FFT have been be efficiently implemented using RNS. RNS has also been used for designing Gaussian smoothing and Sobel’s edge detector in image processing application. RNS has also been used for packet forwarding in Wireless sensor Networks (WSN). Secure file storage in cloud has been accomplished using RNS by dividing the file into residue segments, encrypting and storing on different cloud severs. Other applications include use of RNS for realizing fault tolerant hybrid memories. RNS has been employed in communication systems for code design comprising of information and redundant residues to achieve good error correction capability. The reader is referred to references [17] and [18] for exhaustive treatment of these applications.

[1] R.L. Rivest, A. Shamir, and L.M. Adleman, “A Method for Obtaining Digital Signatures and

Public–Key Cryptosystems,” Communications of the ACM, Vol.. 21, pp. 120–126, Feb 1978.

[2] A. Menezes, Elliptic Curve Public Key Cryptosystems, Kluwer Academic Publishers, 1993.

[3] W. Wang, X. Huang, N. Emmart, and C.Weems, VLSI Design of a Large-Number Multiplier for Fully Homomorphic Encryption, Vol.22, pp 1879-1887, 2014.

[4] K. Hwang, Computer Arithmetic: Principles,architecture and Design, Wiley, 1979.

[5] I. Koren, Computer Arithmetic Algorithms, Brookside Court Publishers, Amherst, Massa-

chusetts, 1998.

[6] H. Garner, “The Residue Number system”, IRE Transactions on Electronic Computers, EC-

8, pp 140-147, 1959.

[7] N.Szabo and R.Tanaka, Residue Arithmetic and Its applications in Computer Technology,

New York: McGraw Hill, 1967.

[8] R. Watson and C. Hastings, Residue Arithmetic and Reliable Computer design, Washing-

ton DC, Spartan, 1967.

[9] A. Menezes, P. van Oorschot, and S. Vanstone, Handbook of Applied Cryptography, by

CRC Press, 1996.

[10] P.L. Montgomery, “Modular multiplication without trial division”, Math. Comput., Vol. 44,

pp 519-521, 1985.

[11]C.K. Koc, T. Acar and B.S. Kaliski.Jr, “Analyzing and comparing Montgomery Multiplica-

tion Algorithms”, IEEE Micro, pp 26-33, 1996.

[12]. M.A. Soderstrand, G.A. Jullien, W.K. Jenkins and F. Taylor (Eds), Residue Number Sys-

tem Arithmetic: Modern Applications in Digital signal Processing, IEEE Press, 1986.

[13] P.V. Ananda Mohan, Residue Number Systems: Algorithms and Architetectures, Kluwer, 2002.

[14] A. R. Omondi and B. Premkumar, Residue mNumber Systems: Theory and implementation,

Imperial College Press, 2007.

[15] J. Fan, F. Vercauteren and I. Verbau- whede, Efficient hardware implementation

of Fp-Arithmetic for Pairing-friendly curves, IEEE Transactions on Computers, Vol. 61, pp

676-685, 2012.

[16] L.Sousa, S.Antao and P.Martins, Combining residue arithmetic to design efficient cryp-

tographic circuits and Systems, IEEE Circuits and Systems Magazine, Vol. 16, Number 4, pp

6-32, 2015.

[17] C.H. Chang, A.S. Molahosseini, A.A.E.Zarandi and T.F. Tay, Residue Number systems:

A new paradigm to datapath optimization for low-power and high-performance Digital signal Processing applications, IEEE Circuits and Systems Magazine, Vol. 15, Number 4, pp 26-44.

[18] P.V Anand Mohan, Residue Number Systems: Theory and Applications, Birkhauser Basel, 2016.