Network routers are increasingly performance hungry because local-area networks (LANs) and wide-area networks (WANs) now operate faster than ever. Ten-Mbit/s Ethernet has been the mainstay for the past two decades, but the industry is moving en masse to Fast Ethernet (100 Mbits/s), and even Gigabit Ethernet (1000 Mbits/s).
As a result, the number and throughput of datapaths coming in and out of a router are increasing, and the system engineer must factor greater processing performance into a design for the system to perform routing functions efficiently. Take, for example, a 25-MHz system moving data between a 10-Mbit/s Ethernet LAN and a WAN. This level of processing is adequate for up to 6000 packets per second (pps), which, assuming packets of 200 or more bytes, works out to roughly 10 Mbits/s, the limit of 10-Mbit/s Ethernet.
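The arithmetic behind that estimate is easy to check. A minimal sketch, using the 200-byte packet size assumed above:

```c
/* Bits per second implied by a packet rate and packet size. */
long link_bits_per_sec(long pps, long bytes_per_packet) {
    return pps * bytes_per_packet * 8;
}
```

At 6000 pps and 200-byte packets, this yields 9,600,000 bits/s, right at the 10-Mbit/s Ethernet ceiling.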
However, if this LAN is migrated to 100-Mbit/s technology, with a corresponding WAN capability, there is considerably more throughput, and a major requirement for a faster processor with performance add-ons. To date, conventional CISC processors have been utilized as router CPUs. However, more embedded RISC processing is making inroads into this CISC-dominated design arena. There are several design considerations and trade-offs for utilizing embedded RISC over CISC in next-generation routers, which today splinter into various levels of design requirements from the high-end, high-performance models to newer, low-end access routers.
There's been an ongoing popular belief among CISC devotees that CISC is indeed faster than RISC processing. The truth of the matter is that a typical CISC processor can perform a task in one instruction where the same task would require two to three RISC instructions. However, a CISC processor requires multiple clock cycles per instruction: typically at least three clock cycles of throughput execution time for the simplest instructions, and on the order of 12 to 24 clock cycles for more complex instructions. Conversely, a RISC processor takes a single clock cycle for each instruction.
Therefore, consider that two to three RISC instructions, executing in two to three clock cycles, are much more attractive from a performance point of view than one CISC instruction taking at least three clock cycles, and frequently many more. Also, consider that RISC instructions are simpler and consequently operate faster, so you can achieve substantially higher clock rates. By combining single-cycle execution with high clock rates, the RISC processor can provide more than three times the processing power of a CISC processor in a typical application.
Comparing processor performance directly is more difficult, however. There is a wide variety of CISC processors with an assortment of instructions and instruction timings. Furthermore, different applications will use the various instructions in different ways. If your application seldom uses the "XYZ" instruction, it may not be worth paying extra for a processor that executes "XYZ" in a few clock cycles. Different applications, processors, and programming styles will always generate different results.
Let's compare the implementation of a simple ring-buffer put routine on a CISC processor and a RISC processor (Table 1). The put routine is a simplified version of a common routine used to implement a ring buffer. Here, a value in a data register is stored in the next location of a ring buffer in memory, and the next location pointer is incremented and wrapped back to the start of the buffer, if necessary.
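For readers without Table 1 at hand, the put routine itself is tiny. A C sketch follows; the buffer size and names are illustrative, not taken from the table:

```c
#define RING_SIZE 256               /* illustrative size; any fixed size works */

typedef struct {
    unsigned int buf[RING_SIZE];
    unsigned int next;              /* index of the next free slot */
} ring_t;

/* Store a value at the next location, then advance the pointer,
 * wrapping back to the start of the buffer when it runs off the end. */
void ring_put(ring_t *r, unsigned int value) {
    r->buf[r->next] = value;
    r->next = (r->next + 1) % RING_SIZE;
}
```

The increment-and-wrap at the end is the step that costs the extra instructions on either architecture.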
No special instructions were used for this comparison. The code looks at general instructions used to implement the put routine on a 68000-style processor, and on a MIPS RISC-style processor. Estimated clock execution times (throughput for pipelined processors) are provided for each implementation. For the CISC implementation, the estimated execution clocks for both a low-end processor and a high-end processor (in parentheses) are given. Both the RISC and the higher-end CISC timings assume full caching of data and instructions.
From this example, it can be seen that the high-end CISC processor executes in nearly the same number of clock cycles as the RISC processor. This is a result of using the simplest instructions with fast execution times to implement a very simple routine. Still, it can be seen in lines four and five that the CISC processor implements the required operation with two instructions requiring five to six clock cycles, where the RISC processor requires three instructions and three clock cycles. More-complex CISC instructions show a wider difference, both in the number of RISC instructions needed to be equivalent and in the number of clock cycles the CISC instructions take to execute.
The other number, not shown in Table 1, is the difference in maximum clock frequency of these processors. The high-end CISC processor here typically tops out around 40 MHz, while RISC processor equivalents exceeding 100 MHz are readily available. When the system designer evaluates processors from a performance standpoint, it's difficult to make exact comparisons without implementing the entire design in each processor. There are, however, some general guidelines. The RISC processor will typically execute in fewer clock cycles, require more instructions for the equivalent application, and will be available at much higher clock frequencies.
In a router design, data packets are handled by software with a minimum of hardware intervention. Legacy software-based high-end routers can process about 500,000 to one million 64-byte packets per second, yet a single Gigabit Ethernet interface can pump over 1.4 million such packets in each direction. This tells the system designer that future-generation Layer 3 routers and switches will not be able to depend solely on software for packet filtering and forwarding. Put another way, conventional CISC processing has little likelihood of closing this gap.
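The 1.4-million figure follows from minimum-size Ethernet frames. A quick check, using the Ethernet spec's 8-byte preamble and 12-byte interframe gap (overhead figures not stated in the article):

```c
/* Maximum rate of minimum-size (64-byte) frames on a link.
 * Each frame occupies 64 + 8 (preamble) + 12 (interframe gap)
 * bytes of wire time, or 672 bits. */
long max_min_frame_pps(long link_bps) {
    return link_bps / ((64 + 8 + 12) * 8);
}
```

At 1 Gbit/s this gives 1,488,095 frames/s, matching the "over 1.4 million" figure.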
In addition to higher clock-rate execution, the embedded systems designer can customize the embedded RISC processor for the router application to enhance performance, thus closing the gap between required and available processor cycles. As an example of such an enhancement, consider the problem of accessing the header information of a packet as it comes into the router.
In a typical router design, packets captured from the network or from a backplane are stored in packet memory (Fig. 1). Then, the packet header is examined and modified. A problem with this design, however, is that using the router's CPU to get to the header increases the system's CPU usage by the number of cycles required to access header information (generally 48 to 64 bytes). Because the software accesses the header in random order, it forces the CPU to wait, or stall, until the on-chip cache fills a line before the header can be examined. Most often, packet header fields are utilized once, which makes inefficient use of the processor cache. In a system without a cache, the CPU utilization overhead climbs even higher.
One solution is to extend the RISC processor's capabilities by implementing a packet header cache (PHC) in the embedded RISC processor. While this PHC function is primarily targeted at networking applications, it can be applied to other embedded applications in which large amounts of data need to be processed in real time. The idea behind a PHC is to capture the packet header in an ASIC's on-chip high-speed memory as it's being stored or forwarded, without the CPU's intervention. The CPU is notified by an interrupt as soon as the packet header is captured inside the memory. Then, the CPU examines the header, alters it, and the header is written back into packet memory.
In general, there are two mechanisms for adding functionality to a RISC core—attaching custom hardware to the RISC processor through a direct processor bus, or adding to the processor instruction set. As an example, the CW4011 MIPS RISC core from LSI Logic contains an on-chip-access (OCA) interface that reduces the number of CPU cycles required to access devices or memory attached to this bus. Additional hardware, such as a PHC unit, can be attached to the OCA bus to enhance the RISC capabilities without modifying the instruction set, and while still having direct, fast access to the hardware. Modification of the instruction set consists of the ability to decode additional instructions and execute them through an added coprocessor interface.
Examples of the latter approach are the FlexLink interfaces in the TinyRISC and CW4003 families from LSI Logic. While either type of interface could be used to add PHC functionality to a RISC processor, the following example looks at adding the PHC as a coprocessor through the FlexLink interface on a TinyRISC processor. In other implementations, the PHC could be added through the OCA bus as a memory-mapped device.
Designers can also utilize the FlexLink or OCA bus interface, for example, to connect the embedded MIPS RISC core to a high-performance, multiply-divide unit (MDU); a Fast Fourier Transform (FFT) engine; or a leading-one detector, to accelerate certain computational routines for DSP applications.
Computation units (CU) like those described above, and the router PHC define and decode their own instructions (Tables 2 and 3). A CU obtains its source operands from either its own register files or the instruction's immediate field. At the end of the operation, a CU, like the PHC, writes the result back to the embedded CPU's register file. It can also write the result back to its own register file. This is particularly valuable in multi-cycle operations because the MIPS RISC processor doesn't need to be stalled to wait for a result.
Table 2 shows the FlexLink signals that interface with the embedded CPU, while Table 3 shows additional signals that interface with a CU. For example, the active-low CRUN_INN signal is used by the PHC module to determine when the system is stalling. If the PHC needs to differentiate between pipeline stall cycles and bus stall cycles, the active-low CPIPE_RUNN signal can be substituted.
The PHC asserts the CU select (ASELP) signal high to inform the embedded MIPS processor core that the current instruction is a user-defined CU instruction, and stalls the pipeline by asserting the CU stall request (ASTALLP) signal high. CIR_BOTP[5:0] contains the bottom six bits of the instruction register, allowing the PHC to decode its own instructions.
The PHC uses these FlexLink interface signals and the others in its operations with the MIPS RISC processor (Fig. 2a). The MIPS processor sets up the PHC controller, which includes a packet pointer and byte counter. The PHC controller is interrupted either by an external signal, or internally by the CPU as soon as the packet header information is stored inside the packet memory. It then starts a transfer from the packet memory to the PHC. The MIPS processor is interrupted once the complete header is available in the PHC. Software examines and alters the required fields within the header. After the header operation, the PHC controller is commanded to move the header back to the packet memory or any other desired location.
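In software terms, the sequence above looks like the following sketch. It is a host-side model only: the real transfers are DMA operations in hardware, and all names here are hypothetical.

```c
#include <string.h>
#include <stdint.h>

#define PHC_BANK_SIZE 128           /* one bank of the on-chip PHC RAM */

typedef struct {
    uint8_t bank[PHC_BANK_SIZE];
} phc_t;

/* Capture the header into the PHC bank; in hardware this is a DMA
 * transfer that completes before the CPU is interrupted. */
void phc_capture(phc_t *phc, const uint8_t *packet_mem, size_t hdr_len) {
    memcpy(phc->bank, packet_mem, hdr_len);
}

/* After software has examined and altered the header in the bank,
 * move it back out to packet memory. */
void phc_writeback(const phc_t *phc, uint8_t *packet_mem, size_t hdr_len) {
    memcpy(packet_mem, phc->bank, hdr_len);
}
```

The point of the model is the division of labor: the CPU touches only the small on-chip bank, never the slower packet memory.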
Second Memory Bank Helps
As an example, a design can use a 256-byte PHC memory block with a dual DMA controller (Fig. 2b). The 256 bytes are organized as two 128-byte banks operated by the CPU. With a second bank, the CPU can examine and modify one header while the next header is moved between packet memory and the PHC.
Two instruction types that can be used with the FlexLink interface to implement the PHC are the immediate (I-type) and register (R-type) instructions (Fig. 3). In the immediate addressing mode, the embedded MIPS processor accesses the PHC RAM directly by an offset built into the immediate instruction. The lower eight bits of the sign-extended, 16-bit offset are used to address the PHC. This offset is present on the CW400x source-register (RT) bus signal, CRTP[31:0], during the ASELP cycle. The immediate addressing mode is highly useful when the fields needed by the software are fixed within the PHC.
In the register indirect addressing mode, the embedded MIPS processor software accesses the PHC using one of the processor's internal registers (R-type). The eight least-significant bits (LSBs) of the RT bus are used to address the PHC; this offset is also present on CRTP[31:0] during the ASELP cycle. Register addressing mode is effective when the field required by the software is not fixed within the PHC memory block.
Aside from the important performance boost a CU like the PHC provides, a system designer should also consider the special instructions an embedded MIPS RISC processor offers to further propel that performance. One has to remember that a router moves lots of data very fast. Once a data packet is captured by Layer 2, it's then scanned to determine its protocol and destination. This task involves pulling in multiple bytes of data and performing some form of lookup. Wide datapaths, specialized lookup routines, and comparison instructions are particularly valuable for these router functions.
For instance, the ADD with Circular Mask Immediate (ADDCIU) instruction in LSI Logic's CW4011 allows the designer to access and index into a table. With this one instruction, the designer can index past the end of a table and wrap around back to the start without performing a separate mask or modulo operation. This feature is ideal for implementing hash functions, circular ring buffers, and various other types of table accesses, such as the ring buffer in a router design. The ADDCIU instruction is a major performance enhancement here because it replaces the five or six conventional instructions, including branches, that would normally perform this function.
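What ADDCIU fuses is, in conventional code, an add followed by a wrap. For a power-of-two table the wrap reduces to a mask, as in this sketch (table size is illustrative):

```c
#define TABLE_SIZE 64u                  /* power of two, so a mask can wrap */
#define TABLE_MASK (TABLE_SIZE - 1u)

/* Advance an index into a circular table, wrapping past the end.
 * ADDCIU performs the add and the circular mask in one instruction;
 * conventional code needs the add plus a mask, or, for non-power-of-two
 * sizes, a compare and branch. */
unsigned int circ_advance(unsigned int index, unsigned int step) {
    return (index + step) & TABLE_MASK;
}
```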
An embedded MIPS processor like this also hands the router designer two bit-search instructions, which are highly useful for performing network address lookups: find-first-clear-bit (FFC) and find-first-set-bit (FFS). Both examine the contents of general register RS, starting with the most significant bit, and return a bit number in general register RD. FFC returns the bit number of the first clear bit, or all ones if no bit is clear; FFS returns the bit number of the first set bit, or all ones if no bit is set.
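In portable C, the two instructions behave like this sketch (the exact bit-numbering convention is an assumption based on the MSB-first description above):

```c
#include <stdint.h>

/* Find-first-set: bit number (31 = MSB) of the first set bit, scanning
 * from the most significant bit down; all ones if no bit is set. */
uint32_t ffs_msb(uint32_t rs) {
    for (int bit = 31; bit >= 0; bit--) {
        if (rs & (1u << (unsigned)bit))
            return (uint32_t)bit;
    }
    return 0xFFFFFFFFu;
}

/* Find-first-clear: the same scan, for the first clear bit;
 * all ones if no bit is clear. */
uint32_t ffc_msb(uint32_t rs) {
    return ffs_msb(~rs);
}
```

The hardware versions complete in a single cycle; the loop above is what they replace.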
These instructions can be used in a number of ways. For example, in a switching router, a packet of data might be indexed for its destination based on an IP address. The local subnets served by the router could have their address ranges organized such that finding the first set (or cleared) bit in the subnet portion of the IP address tells the router software which port to direct the packet toward, or which table to access for further lookup.
Bit Encoding Also
The system designer can also use these instructions for bit encoding of a state machine. The state machine could be an operating-system-level encoding of a task being processed, in which case, the designer can encode the state in fewer bits, saving on memory requirements and cost without degrading performance.
Another example is to use the FFC and FFS instructions to examine a data stream. In this application type, a data stream is examined to find patterns or repetition, such as HDLC flags or repeating patterns for compression/decompression. Instead of encoding a task-processing state, the FFC and FFS instructions in this case are used to examine the state of the data stream.
Minimum (min) and maximum (max) instructions are a third grouping of instructions unique to the CW4011 that assist in router designs. The min and max instructions are primarily useful in rate calculations as implemented in routers and other communications devices. Typically, a rate calculation has terms in it that require comparing two values, and using either the minimum or the maximum of the two in the calculation. For example, an allocation of bandwidth for a channel on a link may require allocating the maximum of either the remaining bandwidth available on the link, or the "minimum committed rate" for the link. Using the min instruction can save two instructions in this calculation.
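The saving is easy to see in C: each min or max is a compare, a branch, and a move, which the dedicated instruction collapses into one operation. A sketch of the bandwidth example above, simplified:

```c
/* Without a dedicated max instruction, this ternary compiles to a
 * compare, a branch, and a move; the CW4011 max does it in one. */
static long max_l(long a, long b) { return a > b ? a : b; }

/* Grant a channel the larger of the link's remaining bandwidth and
 * the channel's minimum committed rate, per the example above. */
long grant_bandwidth(long remaining_bps, long committed_bps) {
    return max_l(remaining_bps, committed_bps);
}
```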
Select and shift left (SELSL) and select and shift right (SELSR), used to build a word of data from multiple sources of data, are another pair of CW4011 instructions with strong value for router applications. An example application for the SELSR and SELSL instructions would be the protocol conversion by a router of an ATM header to a frame relay header.
Another example is modifying the IP header or tag field in a routing application. In the ATM-to-frame-relay conversion, much of the work can be accomplished by using fields of the ATM header to index entries in a series of tables. These fields are isolated with the SELSR/SELSL instructions, combined with the base address of the table, to access the translation value stored in the table. The translation values from multiple tables are then combined using the SELSR and SELSL instructions to build up the new header.
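The isolate-and-combine steps are ordinary shift-and-mask operations; SELSR and SELSL each fuse one such pair. A sketch on the first 32 bits of an ATM UNI header, with field widths per the ATM spec and the surrounding table lookups omitted:

```c
#include <stdint.h>

/* Isolate a 'width'-bit field sitting 'shift' bits above the LSB;
 * this align-and-mask step is what SELSR performs in one instruction. */
static uint32_t get_field(uint32_t word, unsigned shift, unsigned width) {
    return (word >> shift) & ((1u << width) - 1u);
}

/* Insert a value into a word at the given position, the SELSL analogue,
 * used when building up the new header from translated pieces. */
static uint32_t put_field(uint32_t word, uint32_t value,
                          unsigned shift, unsigned width) {
    uint32_t mask = ((1u << width) - 1u) << shift;
    return (word & ~mask) | ((value << shift) & mask);
}

/* ATM UNI header, first 32 bits: GFC:4, VPI:8, VCI:16, PT:3, CLP:1. */
uint32_t atm_vpi(uint32_t hdr) { return get_field(hdr, 20, 8); }
uint32_t atm_vci(uint32_t hdr) { return get_field(hdr, 4, 16); }
```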
This example displays not only the use of those instructions for building a data word out of multiple components, but also deconstructing fields in a header or data word to quickly index to the action or value required by that field. A simple operation without these special instructions could require two or three additional instructions to implement.
The need for extra features and faster data rates is dramatically increasing the performance demands on the processors in router architectures. Embedded RISC processors can help close this gap in three ways: the high and rising instruction-execution speeds that are the trademark of RISC architectures; semi-custom instructions for typical router tasks, including table accesses and buffer management; and fully customized add-on hardware, such as the packet header cache example discussed in this article. RISC technology can implement these high-throughput designs while maintaining the flexibility of a processor-based implementation.