When Shopping For Network Processors, One Size Does Not Fit All

A fundamental shift is occurring in communications equipment as demands for very high-speed, service-enabled products eclipse demands for more-traditional routing-and-switching offerings. As with any product, the needs of an OEM, in the context of a network-processing solution, are as unique as the markets served. Successfully meeting market-driven feature/function challenges through standard product integration is the key to maintaining a competitive position.

Network processors (NPs) come in many sizes and flavors, especially if you subscribe to "marketectures" and the related hype. Loosely defined, an NP is a programmable or configurable device that's been designed and highly optimized to perform networking-specific functions. Unfortunately, the term has been applied to an assortment of products (including ASICs and, to a lesser degree, FPGAs) designed to deliver some form of classification, rudimentary quality of service (QoS), or packet forwarding in a network environment.

Definitions aside, NP integration solves many of the same problems as ASICs. In this case, specialized data-movement/handling tasks are off-loaded from general-purpose processors, thereby greatly accelerating packet-handling or communications functions. In contrast to fixed-function devices, NP units (NPUs) offer programmable or configurable solutions that can adapt as standards are adopted or evolve. Custom ASIC technology has a long life ahead of it. However, ASIC replacement/augmentation is attractive to hardware vendors, not because of performance limitations, but rather due to ASICs' high development costs, time-to-market constraints, and product lifecycle shortcomings.

Through advances made in semiconductor technology, the philosophy of network system design has migrated (Fig. 1). In 1995, networks employed traditional routing/switching devices, such as a general-purpose CPU, a packet-processing engine, and a forwarding engine. By 2000, hybridized architectures were common. Such systems consisted of a general-purpose CPU and a custom ASIC or an off-the-shelf application-specific standard product (ASSP), or a combination of these devices. Now in 2001, technology-driven systems consist of a dedicated control CPU and a full-fledged application-specific NP.

This migration of system architecture has also shifted NP functionality from hardwired solutions to programmable solutions (Fig. 2). Note the change from fixed to programmable media, as well as the change from ASIC/ASSP technology to a single-chip (NP) solution.

Regardless of specific function, most devices that fall into the category of "network processors" are based on either a multiprocessor (highly parallel) or a multistage (highly pipelined) architecture. A simple network-processing engine based on a pipelined architecture is depicted in Figure 3, while one based on a parallel/multiprocessor architecture is shown in Figure 4. In either case, some type of hardware functional unit (state machine, programmable microprocessor, etc.) is combined with specialized software to support packet-oriented functions.

There also are two hybridized architectures for NPs. One approach combines the concepts of a very large (up to 64 stages) super-pipelined structure (highly parallel processes on a per-stage basis using a traditional pipeline technology) with that of super-scalar processing technology (multiple parallel pipelines). The other approach is based on supercomputing concepts of processor arrays (multiple processors organized in a nodal mesh that can be configured to form parallel, multistage, multiprocessor execution pipelines). Both approaches, while highly interesting, aren't examined in detail here, because they're not currently employed in a commercially available NP solution.

All architectural approaches implement basic packet-handling functions: classification/parsing, lookup/forwarding, payload management/editing, and queuing/scheduling. Architectural differences revolve around the diversity of approaches to the problem of moving packets with some level of "meaningful processing" at wire speeds.

For example, wide-area-network (WAN) edge devices (ones closest to the user that apply services to a given bit stream) may need to terminate time-division-multiplexing (TDM) streams, or aggregate many physical links or protocols that are asynchronous to each other. At this level, bit-oriented parallel processing is desired to address the specific function implemented. Think in terms of high-level data-link control (HDLC) or bit-level framing operations.

Similarly, termination protocols and transport media tend to be very mixed at this point in the network. Conversely, devices in the core usually have only a few high-speed connections. Data transported here tends to be very serial in nature and carried over a uniform transport medium. These streams lend themselves well to serial (multistaged) architectures.

A particular NP architecture also is directly linked to the relative complexity of the services needed at a given point in the network. For end-to-end QoS, intelligence must migrate from the transport core to the individual devices that make up the network (the edge). This shift in intelligence brings new opportunities for complex service delivery throughout the network. Programmability of an NP offers service delivery in ways that aren't possible for highly generalized CPUs or fixed-function ASICs. Likewise, high-speed, low-service applications might not require the same degree of programmability as lower-speed, higher-service functions. Making an informed decision about the services offered in a particular system eliminates nonconforming NPs from consideration.

Some key characteristics of the pipelined-processing architecture to keep in mind when making architectural comparisons include:

Processing: Pipelined NP architectures are best compared to assembly lines, where specific operations take place at a given point (stage) in the process. The architecture is characterized by the presence of multiple simple processing "engines" that perform singular "subtasks" decomposed from more-complex functions. Each stage of the pipeline must execute its particular task in the same amount of time as every other stage in the pipeline to prevent pipeline stall conditions.

To this end, NPs based on pipeline architectures typically provide less processing flexibility within a given stage and operate at lower clock frequencies than their parallel/multiprocessor counterparts. Although this approach seems limiting, it's ideally suited for data streams requiring well understood functions to be performed in a serial fashion. Functional scaling is achieved by enabling programmability where specifically needed in the pipeline, while performance scaling is enabled through increases in clock frequency.
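The assembly-line constraint can be put in rough numbers. The sketch below is a simplified model, not a description of any particular NP: the stage names, cycle counts, and 200-MHz clock are all hypothetical values chosen for illustration. It assumes the pipeline can accept a new packet only as often as its slowest stage finishes, so every faster stage sits idle for part of each packet time.

```python
def pipeline_rate(stage_cycles, clock_hz):
    """Packets per second for a pipeline paced by its slowest stage.

    stage_cycles maps a (hypothetical) stage name to the cycles it needs per packet.
    """
    bottleneck = max(stage_cycles.values())
    return clock_hz / bottleneck


def stage_idle_fraction(stage_cycles):
    """Fraction of each packet time that a stage spends waiting on the slowest stage."""
    bottleneck = max(stage_cycles.values())
    return {name: 1 - cycles / bottleneck for name, cycles in stage_cycles.items()}


# Illustrative numbers only (assumptions, not vendor data).
CLOCK_HZ = 200_000_000
balanced = {"classify": 10, "lookup": 10, "edit": 10, "queue": 10}
unbalanced = {"classify": 10, "lookup": 25, "edit": 10, "queue": 10}

if __name__ == "__main__":
    for label, stages in (("balanced", balanced), ("unbalanced", unbalanced)):
        print(label, pipeline_rate(stages, CLOCK_HZ), stage_idle_fraction(stages))
```

With these assumed figures, the balanced pipeline handles 20 million packets per second, while the single 25-cycle lookup stage drops the unbalanced one to 8 million and leaves the other three stages idle 60% of the time.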

Memory system: Memory-system structure also sets pipelined machines apart from parallel/multiprocessor implementations. Typically, each "engine" in the pipeline has its own local memory and doesn't depend on pointers or other information passed from its upstream counterpart. The local memory can be used to store both control instructions and any data used in processing (tables, structures, and so on). Alternative approaches separate microcode buffers from data storage areas. Regardless of the approach, available memory in each stage of the pipeline is limited in accordance with the processing tasks of the given engine.

If required, interprocessor communication is accomplished by prepending control information onto the protocol data unit (PDU) being passed downstream or via an external channel. With prepending, the control information is attached to the front of the PDU rather than the back, so the next processing stage can read it first and understand what function must be performed on the PDU data.
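As a rough illustration of prepending, the sketch below tags a PDU with a small control header that a downstream stage strips off and interprets. The 4-byte layout (opcode, flags, output port) and the function names are hypothetical; real devices define their own internal formats.

```python
import struct

# Hypothetical control header: opcode (1 byte), flags (1 byte), output port (2 bytes),
# packed in network byte order. This layout is an assumption for illustration.
CTRL_FORMAT = "!BBH"
CTRL_SIZE = struct.calcsize(CTRL_FORMAT)


def prepend_control(pdu: bytes, opcode: int, flags: int, port: int) -> bytes:
    """Upstream stage: place the control header in front of the PDU."""
    return struct.pack(CTRL_FORMAT, opcode, flags, port) + pdu


def strip_control(tagged: bytes):
    """Downstream stage: split the control header from the untouched PDU."""
    opcode, flags, port = struct.unpack(CTRL_FORMAT, tagged[:CTRL_SIZE])
    return (opcode, flags, port), tagged[CTRL_SIZE:]


if __name__ == "__main__":
    tagged = prepend_control(b"payload", opcode=2, flags=0, port=7)
    control, pdu = strip_control(tagged)
    print(control, pdu)
```

Because the header sits at the front, the receiving stage can decide what to do after reading only the first few bytes, without parsing the PDU itself.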

Overall: Processing and memory differences aside, pipelined machines tend to be state oriented, configurable (through "parameterization"), or limited in programmability, while a high level of flexibility is the key to the parallel/multiprocessor architecture.

On the other hand, parallel or multiprocessor architectures can be compared to multiple independent assembly lines. Each unit can handle its own complex task and produce its own unique product without depending on other processors in the embedded complex. Such designs are well suited to many networking or communications functions.

Parallelism in NPU architectures can be achieved in a variety of ways. Popular mechanisms include multiples of the processor functional unit, overlap of CPU and I/O operations, use of a hierarchical memory system, balancing of subsystem bandwidths, and implementation of a functional multiprocessor development model. The most common approaches include:

CPU replication: This is by far the most popular technique to increase the processing performance of an NPU design. CPU replication offers its own tradeoffs.

Because a packet from a given input source may be processed as an independent unit, logic suggests that more processors make for faster and more efficient packet handling. A common approach taken by many vendors is embedding multiple RISC processors into their network-processing solution, enabling parallel-processing functionality. Each processing unit acts as an independent machine that can process its own packet completely without other processor intervention or interprocessor communication. Still, each unit shares the workload of servicing a given data stream.

Amdahl's Law offers guidance on the scalability of highly parallel systems. In essence, the law says that the speedup gained by adding processors is limited by the portion of the work that can't be parallelized, so scaling the number of processors in an embedded complex yields diminishing returns. In practice, the overhead associated with operating the processing engines quickly exceeds the incremental performance improvement of the additional processor(s). Designers are therefore cautious in touting the ability to scale parallel architectures to tens or hundreds of programmable units. Attention focuses instead on faster processing speeds, multiple contexts (CPU and I/O overlap), and better memory systems. Packet-order management is another unique challenge in multiprocessor architectures.
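Amdahl's Law is usually written as speedup = 1 / ((1 - p) + p / n), where p is the fraction of the work that can run in parallel and n is the number of processors. The short sketch below evaluates that formula; the 95% parallel fraction is an assumed figure for illustration, not a measurement of any NP workload.

```python
def amdahl_speedup(parallel_fraction, processors):
    """Ideal speedup from Amdahl's Law, ignoring any coordination overhead."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


if __name__ == "__main__":
    # Assumed workload: 95% of per-packet work can proceed in parallel.
    for n in (1, 4, 16, 64, 256):
        print(n, round(amdahl_speedup(0.95, n), 2))
```

Under that assumption, 16 processors give a speedup of roughly 9 and 256 processors only about 18.6, with a ceiling of 20 no matter how many are added. The formula leaves out scheduling and communication overhead, which can only make the real curve flatten sooner.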

High-level protocols, like the transmission control protocol (TCP), provide for packet reordering at destination systems. Unfortunately, the processing/reordering is time consuming and highly inefficient. To further complicate matters, TCP can't always distinguish between out-of-sequence packets and those that have been dropped, causing needless retransmit requests.

Hardware approaches are complex but generally perform the best. To implement an "order-management" function in a parallel architecture, a coprocessor should ensure that packets leave the embedded processor complex in the same order as they arrived, no matter which CPU processed the individual packet. Similarly, a coprocessor can aid in maximizing processor utilization by allocating packets across the individual processing units based on processor availability. Pipelined architectures address this challenge through the serial nature of their organization and are immune to this problem set.
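The order-management idea can be sketched in software, with the caveat that a real coprocessor would do this in hardware and that the class and method names here are hypothetical. Each packet receives a sequence number on arrival; completed packets are held until every earlier packet has also finished, then released in arrival order.

```python
class ReorderBuffer:
    """Releases packets in arrival order regardless of which CPU finishes first."""

    def __init__(self):
        self.next_seq_in = 0    # sequence number for the next arriving packet
        self.next_seq_out = 0   # sequence number that must leave next
        self.done = {}          # finished packets waiting on earlier ones

    def admit(self):
        """Tag an arriving packet with its arrival sequence number."""
        seq = self.next_seq_in
        self.next_seq_in += 1
        return seq

    def complete(self, seq, packet):
        """Record a finished packet and return any packets now safe to transmit."""
        self.done[seq] = packet
        released = []
        while self.next_seq_out in self.done:
            released.append(self.done.pop(self.next_seq_out))
            self.next_seq_out += 1
        return released


if __name__ == "__main__":
    rb = ReorderBuffer()
    a, b, c = rb.admit(), rb.admit(), rb.admit()
    print(rb.complete(c, "C"))  # finished early, must wait for A and B
    print(rb.complete(a, "A"))
    print(rb.complete(b, "B"))
```

The cost of this approach is buffering: a packet that finishes early occupies storage until the slowest packet ahead of it completes.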

Overlap of CPU and I/O: Maximizing processor utilization and eliminating I/O wait (stall and yield) conditions is important to parallel-processor architects. To keep processors busy during periods of latency caused by resource accesses, they often overlap CPU and I/O operations in the form of a "context switch" or multithreading operation. This overlap technique lets multiple contexts (threads) share a single processing unit. While one context waits for an I/O operation to complete (e.g., a table lookup or exception processing), another context is free to process its packet.

Although powerful in terms of delivering higher processor utilization, context switching doesn't come without a price. The user typically pays some penalty in cycles for context transitions to occur. Some NPUs even employ a separate program counter for each context to minimize the impact of context switches. For any architecture, it's important to understand the eccentricities of individual devices. This way, performance impact is minimized while processor utilization is maximized.
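The tradeoff between hiding latency and paying for context switches can be estimated with a simple textbook-style model. The sketch below is an approximation under stated assumptions: every context computes for a fixed number of cycles, then waits a fixed number of cycles on I/O, and each switch costs a fixed penalty. The cycle counts are hypothetical.

```python
def useful_utilization(contexts, compute_cycles, io_latency_cycles, switch_cycles):
    """Fraction of processor cycles spent on useful packet work.

    Below saturation, each added context fills more of the I/O wait.
    At saturation, the processor is always busy, but part of that time
    goes to switch overhead rather than packet work.
    """
    period = compute_cycles + switch_cycles + io_latency_cycles
    unsaturated = contexts * compute_cycles / period
    saturated = compute_cycles / (compute_cycles + switch_cycles)
    return min(unsaturated, saturated)


if __name__ == "__main__":
    # Assumed figures: 20 cycles of work, 100 cycles of lookup latency, 2-cycle switch.
    for n in (1, 2, 4, 6, 8):
        print(n, round(useful_utilization(n, 20, 100, 2), 3))
```

With these assumed numbers, one context keeps the processor usefully busy about 16% of the time and four contexts about 66%. Beyond six contexts the figure levels off near 91%, because the remaining cycles are consumed by the switches themselves.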

Hierarchical memory: Regardless of architecture, memory bandwidth is one of the most challenging problems to solve. Effective use of new memory techniques is critical to the success of highly parallel NP architectures. Memory bandwidth heavily influences the number of packets per second that can be processed by a given device. Similarly, different memory architectures yield different levels of utilization.
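A back-of-the-envelope calculation shows why memory bandwidth bounds the packet rate. The sketch below assumes Ethernet-style framing (a 64-byte minimum frame plus 20 bytes of preamble and interframe gap on the wire), a 10-Gbit/s link, and a memory that sustains 100 million random accesses per second. All three figures are illustrative assumptions, not characteristics of any specific NP or memory part.

```python
def packets_per_second(line_rate_bps, frame_bytes, overhead_bytes=20):
    """Worst-case packet arrival rate on a fully loaded link."""
    bits_on_wire = (frame_bytes + overhead_bytes) * 8
    return line_rate_bps / bits_on_wire


def accesses_per_packet(line_rate_bps, frame_bytes, accesses_per_second):
    """Memory accesses that can be spent on each packet while keeping up with the link."""
    return accesses_per_second / packets_per_second(line_rate_bps, frame_bytes)


if __name__ == "__main__":
    LINE_RATE = 10_000_000_000      # assumed 10-Gbit/s link
    MEMORY_RATE = 100_000_000       # assumed 100 million accesses per second
    print(packets_per_second(LINE_RATE, 64))
    print(accesses_per_packet(LINE_RATE, 64, MEMORY_RATE))
```

Under these assumptions, the link delivers about 14.9 million minimum-size packets per second, which leaves fewer than seven memory accesses per packet for lookups, counters, and buffer management combined.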

In contrast to the pipelined-architecture approach of individual memories for each "processing engine," parallel/multiprocessor systems normally use a shared memory for data and control (instruction) storage. Each processor has access to both memory systems and can "move" data through the functional stages of processing by passing pointers from place to place. Control storage is implemented as an "on-chip" memory, while data is stored off-chip in most implementations. Local memory is maintained for scratchpad and PDU buffering. The use or availability of high-performance memories should weigh heavily in any NP decision.

Overall: Parallel/multiprocessor architectures differ most from their pipelined counterparts in their programming flexibility. These architectures can deliver whatever their users can imagine, provided the users are willing to write the code to do it. Functional scaling is achieved by enabling programming flexibility and processing "headroom," while performance is scaled through increases in clock frequency, better utilization of memory systems, and code improvements.

NP decisions are really about the tradeoffs between performance and flexibility. The right balance can carry a design through many generations with little architectural change. With so many choices on the market today, making an informed decision in terms of cost, performance, flexibility, and functionality is intimidating. Simply counting the number of internal processing engines (in the case of a parallel- or multiprocessor-based NP) or determining the raw number of instructions that execute per stage per second (in the case of a pipelined NPU) is a terrible measure of suitability or performance. The guiding light should be how well the architecture fits the particular application. Designers should answer a few questions before making a decision:

In general:

Will I need to add intelligence to my network device?

Where is my current or proposed device targeted in the network?

What are my trunk and tributary speeds and feeds?

What are my predominant traffic types?

What are my media interfaces?

What features or functions do I plan to support now and in the future?

About hardware:

Is the NP single-chip or multichip in nature?

How many channels/ports can the processor support?

Are the interfaces "standard" or just "open"?

How does the processor interface to adjunct devices, such as content-addressable memories, policy engines, security engines, and control plane CPUs?

How do the processors address network-specific functions? Or, are they highly generalized?

At what packet size and arrival rate does the processor saturate?

Does the processor use DRAM or SRAM? Do I have a choice?

What is the total hardware cost of a workable solution (CPUs, NPUs, and memory)?

What is the power cost of a workable solution?

What is the real-estate cost of a workable solution?

How does the architecture scale?

What are the bottlenecks in the part?

About software:

Is there a development environment? (Simulator? Tools? Evaluation Boards? Examples?)

Can I afford it?

Are the tools proprietary or industry standard?

Are the tools mature or first release?

What is the instruction set?

Is there a compiler?

What's the learning curve associated with the tools?

Is training available?

Is source code available?

Is the example code usable or just for show and tell?

Many opportunities exist for innovation in deep packet-processing systems. In fact, engineers are still trying to figure them all out! Even so, it will take years to fully understand the service needs of our evolving network infrastructure.

When exploring NP integration into existing systems or new products, it's important to balance performance and flexibility while finding the best fit for the application. Keep in mind that not all architectures are created equal, and none is suited to every application. Each architecture is as different as the markets it addresses. No one size fits all.

Recommended reading:

Croll, A., et al., Managing Bandwidth, Prentice Hall, 1999.
Ferguson, P., et al., Quality of Service, John Wiley and Sons, 1998.
Gupta, P., et al., "Routing Lookups in Hardware at Memory Access Speeds," Proceedings of the IEEE INFOCOM 1998, March 1998.
Kavi, M.K., "Multithreading: Languages, Systems, and Architectures," Computer Systems Architecture Monologue, University of Alabama at Huntsville, January 2000.
Keshav, S., et al., "Issues and Trends in Router Design," IEEE Communications Magazine, May 1998, pp. 144-151.
Sheafor, S.J., "Packet Processing Power: The Tradeoffs," Electronic Engineering Times, May 24, 1999.