Customized Processor Extension Speeds Network Cryptology

As public data networks, online commerce, and smart cards become more popular, the need for secure data transmission grows. But the complex computations required to encrypt and decrypt data can soon become a performance bottleneck. When servers spend more time handling the overhead of secure connections, users often complain about network delays causing lower productivity and dissatisfied customers. Therefore, fast encryption and decryption provides a distinct competitive advantage.

The U.S. government's Data Encryption Standard (DES) specifies the Federal Institute of Standard Publications' (FIPS) approved cryptographic algorithm FIPS-46-3 as issued by the National Institute of Standards and Technology (NIST). The most widely used encryption algorithm in the world, it's utilized by banks for electronic fund transactions and by government agencies for communication systems.

The primary goal of the project described in this article was to reduce the CPU's workload for Internet Protocol Security (IPSec) encryption and decryption. This frees up CPU resources for other tasks without increasing the CPU's clock frequency. Alternatively, it could enable the CPU to run at a lower clock frequency to reduce power consumption, simplify system design, and cut costs without sacrificing performance.

To do this, we developed a DES extension by employing the user-customizable ARC processor, together with IPSec protocol software and software-development and analysis tools. To analyze and profile the compute-intensive software routines in the DES algorithm, public-domain C source code was implemented.

The result was an overall performance increase of 47× with single DES, and 92— with triple DES (3DES). These extraordinary gains substantially reduce the CPU's cycle count when executing these tasks. Another benefit is a 63% decrease in code size.

Data-Encryption Standard: The DES algorithm works on 64-bit data blocks in 16 repeated cycles, or rounds, under the control of a 56-bit encryption key. Each round uses sub-keys generated from the original key, making it well suited to hardware acceleration. The data supplied to the algorithm is called plaintext, and the resulting data is known as ciphertext. There are essentially three components to the DES algorithm. To encipher a 64-bit data block, the DES algorithm performs the following functions (Fig. 1):

Initial permutation (IP)
Complex key-dependent computations
Final or inverse initial permutation (IP¹)

To decipher the data, the program uses the same algorithm and key that enciphered the data. However, it alters the addressing of the key bits so the deciphering process is the reverse of the enciphering process.

The key-dependent computation can be defined in terms of the cipher function f and the key schedule, KS. The key schedule produces 16 sub-keys based on the 56-bit secret key.

Initial Investigation: The starting point for the acceleration project was to analyze some DES software that's optimized for general-purpose 32-bit microprocessors. The original idea was to create a new instruction that would perform an entire DES round in one clock cycle.

Sixteen iterations of this instruction would almost complete a 64-bit DES encryption routine. Additional instructions would perform the initial and final permutations on the data block and execute the initial contraction and permutation on the key data. These instructions would employ four custom extension registers (named C, D, L, and R) to hold the key and data block values during processing.

Further investigation showed that the ARC processor could perform the data permutations as part of the rounding operation. Also, the key permutation could happen automatically while loading the extension registers. Therefore, the final implementation consists of one new extension instruction operating on four custom 32-bit extension registers.

Methodology: Many projects of this nature require substantial effort to identify the software "hot spots" that consume disproportionate execution time. The kernel of the DES algorithm is a bit-level permutation function often referred to as the switch-box or S-box function. By sequentially executing 16 iterations of this function, the algorithm encrypts a 64-bit block of data based on a 56-bit private key. It's difficult to implement the S-box permutation function with the logical operators typically found in general-purpose processors. The CPU must execute each permutation with a loop of individual bit-rotations and merges.

After analyzing the DES software, two counts of the processor cycles required for each stage of encryption were tabulated. The first (smaller) cycle-count number is for a processor with a barrel shifter that can perform up to four bit-shifts per clock cycle. The second (larger) cycle-count number is for a processor without a barrel shifter that needs a full clock cycle to shift each bit position. (Although most configurations of the ARC processor incorporate the optional barrel shifter for the obvious performance improvement, some small RISC processors may not have that option.)

The DES software requires 10 32-bit reads from memory for every DES round—one for each of the eight S-box substitutions, and two to read the successively rotated 56-bit key from the 16-entry schedule. Because the data stream is effectively little-endian and the encryption algorithm is big-endian, the processor must swap the bytes while reading from and writing to memory. The DES software shifts them one byte at a time (Table 1).

This analysis provided clear targets for optimization, such as eliminating the software for generating the key tables and replacing it with dedicated hardware in the processor (e.g., a custom instruction). Because the key values change slowly relative to the data, the effect on performance might seem to be small. But providing custom extension registers for the successively rotated key values would eliminate all memory accesses during the execution of these rounds. The initial design envisioned four custom extension registers and four custom instructions:

Registers L and R would hold the 64-bit data block.Registers C and D would hold the 56-bit (2 × 28) key.Instruction desPC1 would execute the permuted choice permutation on the values in registers C and D (including the endian conversions).Instruction desIP would perform the initial permutation on the values in registers L and R (including the endian conversions).Instruction Round would round the values in registers L and R and rotate the values in registers C and D.Instruction desIIP would perform the inverse initial (final) permutation on the values in registers L and R (including the endian conversions).A timing analysis revealed that the processor might have enough time in one clock cycle to include the initial or final permutations in the Round instruction. It might even be possible to incorporate the endian byte swaps in those permutations by simply redefining the permutation's bit mapping. Because the key-rotation sequence is easy and fixed, it could be added to the Round instruction, too, by using the shimm (short immediate) field to indicate the rotation's bit count and direction.Further analysis revealed the possibility of eliminating the explicit permutation on the key block. This works by mapping the C and D registers so that the permutation occurs automatically as a result of writing data to those registers. Thus, the designers could remove the desPC1 instruction.The final design thus consists of a single extension instruction, desrnd, that performs all steps of a DES round in one clock cycle. Plus, this instruction performs the initial permutation and endian conversion the first time it executes, the key rotation every time it executes, and the final permutation and endian conversion the last time it executes. The value in the instruction's shimm field controls these features. There are no other arguments for this instruction because it performs all operations on the contents of the custom L and R (data) registers and the C and D (key) registers.Figure 2 illustrates the programmer's model of the desrnd instruction, and Table 2 shows the steps required to encrypt a 64-bit DES block using this instruction. Note that the aggregate key rotations of the 16 encryption rounds leave the 28-bit values in registers C and D in their original states—ready for use on the next data block. Although decryption is very similar (with different rotations), it takes an additional 1-bit rotation after the 16 rounds. This is easy. Just execute a 17th desrnd instruction after extracting the data from the L and R registers.Implementation: The desrnd in-struction was added to the processor using the standard extension-instruction architecture. The core pipeline could modify the values in the extension registers (one at a time). But instead, the DES module simultaneously updates them as it executes the desrnd instruction. The DES module can perform a single DES round or two successive DES rounds per cycle with one desrnd instruction.These extensions also support the ARC processor's scoreboarding capability. Before executing a desrnd instruction, the program can load the C, D, and L registers with LD instructions, load the value of register R into a core register, then move that value from the core register into R. This is necessary because the desrnd instruction writes results into all four extension registers simultaneously.

The IPSec software was modified to use the new instruction if the programmer specifies a conditional compiler directive. There are two levels of software integration:

C module: A simple modular integration replaces the key-generation and encryption modules with functional (nearly empty) duplicates. These duplicate modules employ code macros to perform the key generation and encryption/decryption. (Although they retain the 16-element key table, they use only the first element because the desrnd instruction rotates the keys.)In-line assembly: A cleaner, higher-performance in-line assembly implementation inserts the desrnd instruction (also via code macros) directly into the DES- and 3DES-encryption/decryption modules. The revised modules were incorporated into a test program, and the FPGA-based ARCangel prototyping system exercised them against the baseline code for an additional 500 million data/key pairs.Verification: To verify the modified DES implementation, the DES algorithm (as defined in FIPS 46-3) was first modeled in software, simulating the operation of the new desrnd instruction and registers via function calls. The developers extracted actual modules from the IPSec stack and added them to the shell code driving the simulated instruction. Executing these modules in parallel with the simulated instruction would verify the correctness of the new implementation.

After the developers tested and verified a few hand-built data/key combinations, they modified the shell to continuously generate random data and keys. They compared more than 5 million data/key combinations over a period of several days. The developers used the same software as a guide while designing the hardware implementation. This gave step-by-step intermediate values to verify correctness during early simulation.

When they had verified that hardware simulation matched the results of the software, the developers applied individual test vectors supplied in NIST Special Publication 800-17 (MOVS Requirements and Procedures) to the hardware simulation. This provided additional confirmation of the implementation's correctness.

Next, the developers synthesized and loaded an image of the customized processor into the Xilinx FPGA of an ARCangel development system. They then modified the DES simulation to run on this system and tested over 500 million random key/data pairs. The results of the DES simulation and the customized ARC processor were in complete agreement.

Results—Performance Increase:Figure 3 illustrates the performance increase achieved (relative to the baseline for single 3DES encryption) by using the desrnd instruction with simulated TCP/IP data. It also compares results from the desrnd instruction in a program in two different ways: with a C module and with an in-line assembly-language routine.

Results—Data Rates: Tables 3 and 4 show the 3DES data rates calculated from CPU cycles, assuming a CPU speed of 200 MHz and a DES block size of 64 bits. We used the following formula:

data rate = CPU speed × block size/cycle count

The data rates for 3DES were calculated via two different software implementations. One employed the desrnd instruction in a C module, while the other implemented it with an in-line assembly-language routine. The data rates also were calculated for one DES round per cycle (Table 3, again) and two DES rounds per cycle (Table 4, again).

Results—Code Size: Although the primary goal of the DES extension was to accelerate performance, a fringe benefit of collapsing several conventional instructions into a single custom instruction is smaller code size. The actual reduction is about 4 kbytes, or approximately 63% of the low-level DES-CBC code, important for applications that have limited memory resources (Table 5).

To sum everything up, extending a user-customizable processor improves the performance of compute-intensive algorithms, such as those required in the IPSec protocol. In this instance, an increase of 92&215; was achieved with 3DES. Any application that needs high-performance DES processing could potentially benefit from the custom DES extension to ARC's processor, like Secure Socket Layer (SSL), Transport Layer Security (TLS), and Encrypted PPP.