The online economy, particularly e-business, entertainment, and collaboration, continues to dramatically and rapidly increase the amount of Internet traffic to and from enterprise servers. Most of this data is going through the transmission control protocol/Internet protocol (TCP/IP) stack and Ethernet controllers.
As a result, Ethernet controllers are experiencing heavy network traffic, which requires more system resources to process network packets. The CPU load increases linearly as a function of network packets processed, diminishing the CPU’s availability for other applications.
Because TCP/IP processing consumes a significant share of the host CPU's cycles, a heavy TCP/IP load may leave few system resources available for other applications. Techniques for reducing the demand on the CPU and easing this system bottleneck, though, are available.
Many mechanisms have been proposed for reducing this networking overhead, but only a few have resulted in working products. One of the latest to enter the scene is the Internet Wide Area RDMA Protocol, or iWARP, developed by the RDMA Consortium. (It shares its name with, but should not be confused with, the earlier experimental iWarp parallel system from Carnegie Mellon University and Intel Corp., which integrated a very long instruction word (VLIW) processor and a fine-grained communication system on a single chip.)
iWARP is a suite of wire protocols comprising the RDMA Protocol (RDMAP) and Direct Data Placement (DDP). The iWARP protocol suite may be layered above marker PDU aligned framing (MPA) and TCP, or over the Stream Control Transmission Protocol (SCTP) or other transport protocols (Fig. 1).
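The layering can be sketched in a few lines of Python. This is a deliberately simplified model, not the actual wire format: the field sizes, the absence of MPA markers, and the header layout are all illustrative assumptions. It shows only the idea that a DDP message carries placement information and that MPA framing preserves message boundaries across TCP's byte stream.

```python
import struct
import zlib

def ddp_encapsulate(tag: int, offset: int, payload: bytes) -> bytes:
    """Illustrative DDP 'tagged buffer' message: a steering tag and a
    target offset let the receiver place the payload directly."""
    return struct.pack("!IQ", tag, offset) + payload

def mpa_frame(ddp_message: bytes) -> bytes:
    """Illustrative MPA framing: a 2-byte length prefix and a CRC32
    trailer so DDP message boundaries survive TCP's byte stream."""
    return (struct.pack("!H", len(ddp_message)) + ddp_message
            + struct.pack("!I", zlib.crc32(ddp_message)))

def mpa_deframe(stream: bytes) -> bytes:
    """Recover one DDP message from the framed byte stream."""
    length, = struct.unpack_from("!H", stream, 0)
    message = stream[2:2 + length]
    crc, = struct.unpack_from("!I", stream, 2 + length)
    assert crc == zlib.crc32(message), "corrupted frame"
    return message

frame = mpa_frame(ddp_encapsulate(tag=7, offset=4096, payload=b"hello"))
tag, offset = struct.unpack_from("!IQ", mpa_deframe(frame), 0)
```

The real MPA layer (with its periodic markers and negotiated CRC) is considerably more involved; the point here is only the nesting of RDMAP/DDP inside a framed TCP stream.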
The RDMA Consortium released the iWARP extensions to TCP/IP in October 2002, defining a standard for zero-copy transmission over legacy TCP/IP. Together, these extensions eliminate the three major sources of networking overhead—transport (TCP/IP) processing, intermediate buffer copies, and application context switches—which collectively account for nearly 100% of the CPU utilization attributed to networking (see the table).
A kernel implementation of the TCP stack has several bottlenecks, so a few vendors now implement TCP in hardware. Because simple data losses are rare in tightly coupled network environments, TCP's error-correction mechanisms can remain in software, while logic embedded on the network interface card (NIC) handles the more frequently exercised communication paths. This additional hardware is known as a TCP offload engine (TOE).
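The fast-path/slow-path split can be illustrated with a toy Python model (a sketch of the concept, not any vendor's TOE design): the common in-order case is handled on the "hardware" fast path, while rare out-of-order arrivals fall back to a "software" slow path that buffers until the gap is filled.

```python
class ToeSketch:
    """Toy model of the TOE split between a hardware fast path
    and a software slow path for exceptional cases."""
    def __init__(self):
        self.expected_seq = 0
        self.delivered = bytearray()
        self.out_of_order = {}     # seq -> segment ("software" slow path)
        self.fast_path_hits = 0

    def receive(self, seq: int, segment: bytes) -> None:
        if seq == self.expected_seq:          # fast path ("hardware")
            self.fast_path_hits += 1
            self.delivered += segment
            self.expected_seq += len(segment)
            # drain any slow-path segments that are now in order
            while self.expected_seq in self.out_of_order:
                seg = self.out_of_order.pop(self.expected_seq)
                self.delivered += seg
                self.expected_seq += len(seg)
        else:                                 # slow path ("software")
            self.out_of_order[seq] = segment

toe = ToeSketch()
toe.receive(0, b"abc")
toe.receive(6, b"ghi")   # gap: buffered by the slow path
toe.receive(3, b"def")   # fills the gap; buffered data drains in order
```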
The iWARP extensions utilize advanced techniques to reduce CPU overhead, memory bandwidth utilization, and latency. This is accomplished through a combination of offloading TCP/IP processing from the CPU, eliminating unnecessary buffering, and dramatically reducing expensive operating-system (OS) calls and context switches. Thus, the data management and network protocol processing is offloaded to an accelerated Ethernet adapter instead of the kernel’s TCP/IP stack.
Offloading TCP/IP (transport) processing: In conventional Ethernet, the TCP/IP stack is a software implementation, putting a tremendous load on the host server’s CPU. Transport processing includes tasks such as updating TCP context, implementing required TCP timers, segmenting and reassembling the payload, buffer management, resource-intensive buffer copies, and interrupt processing.
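Segmentation and reassembly, two of the transport tasks listed above, reduce to a simple pattern (sketched here in Python with an assumed 1460-byte maximum segment size, the typical value for a 1500-byte Ethernet MTU):

```python
MSS = 1460  # typical TCP maximum segment size with a 1500-byte Ethernet MTU

def segment(payload: bytes, mss: int = MSS):
    """Segmentation: split an application payload into MSS-sized
    pieces, each tagged with its sequence offset."""
    return [(off, payload[off:off + mss]) for off in range(0, len(payload), mss)]

def reassemble(segments) -> bytes:
    """Reassembly: order pieces by sequence offset and concatenate."""
    return b"".join(piece for _, piece in sorted(segments))

segs = segment(bytes(4000))   # three segments of 1460, 1460, and 1080 bytes
```

When done in software, every byte of every segment crosses the CPU and memory bus; a TOE performs the same bookkeeping in adapter hardware.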
The CPU load increases linearly as a function of the network packets processed. With the tenfold jump from 1-Gigabit Ethernet to 10-Gigabit Ethernet, packet rates—and the CPU overhead related to transport processing—increase up to tenfold as well. Left unchecked, network processing will saturate the CPU well before the link reaches Ethernet's maximum throughput.
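A back-of-the-envelope calculation shows the scale of the problem for full-sized frames (assuming a 1500-byte MTU plus the standard 38 bytes of per-frame Ethernet overhead: preamble, header, FCS, and inter-frame gap):

```python
# On-the-wire cost of one full-sized Ethernet frame, in bytes.
WIRE_BYTES = 1500 + 38   # 1500 B MTU + preamble, header, FCS, inter-frame gap

def packets_per_second(link_bps: float) -> float:
    """Maximum full-sized frame rate a link can sustain."""
    return link_bps / (WIRE_BYTES * 8)

pps_1g = packets_per_second(1e9)     # roughly 81,000 frames/s
pps_10g = packets_per_second(10e9)   # roughly 812,000 frames/s
```

Every one of those frames incurs per-packet transport processing, so the CPU cost scales with the same factor of ten—and smaller frames make the packet rate, and the overhead, far worse.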
The iWARP extensions enable the Ethernet to offload transport processing from the CPU to specialized hardware, eliminating 40% of the CPU overhead attributed to networking (Fig. 2). The transport offload can be implemented by a standalone TOE or be embedded in an accelerated Ethernet adapter that supports other iWARP accelerations.
Moving transport processing to an adapter also eliminates a second source of overhead—intermediate TCP/IP protocol-stack buffer copies. Shifting these copies from system memory to adapter memory saves system-memory bandwidth and lowers latency.
RDMA techniques eliminate buffer copy: Repurposed for Internet protocols by the RDMA Consortium, Remote DMA (RDMA) and Direct Data Placement (DDP) techniques were formalized as part of the iWARP extensions. RDMA embeds information into each packet that describes the application memory buffer with which the packet is associated. This enables the payload to be placed directly in the destination application’s buffer, even when packets arrive out of order.
Data can now move from one server to another without the unnecessary buffer copies traditionally required to “gather” a complete buffer (Fig. 3b). This is sometimes called the “zero copy” model. Together, RDMA and DDP enable an accelerated Ethernet adapter to read from and write to application memory directly, eliminating buffer copies to intermediate layers. They eliminate 20% of the CPU overhead related to networking and free the memory bandwidth attributed to intermediate application buffer copies.
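The tagged-placement mechanism described above can be sketched in Python (a conceptual model only—the names `TaggedBuffer`, `register`, and `place` are illustrative, not iWARP API calls): the receiver pre-registers a destination buffer under a steering tag, and each arriving packet carries a (tag, offset) pair that steers its payload to its final location.

```python
class TaggedBuffer:
    """Sketch of direct data placement: payload lands in its final
    position in application memory, even when packets arrive out
    of order, with no intermediate staging copy."""
    def __init__(self):
        self.buffers = {}

    def register(self, tag: int, length: int) -> None:
        """Pre-register a destination buffer under a steering tag."""
        self.buffers[tag] = bytearray(length)

    def place(self, tag: int, offset: int, payload: bytes) -> None:
        """One write straight into the registered buffer."""
        self.buffers[tag][offset:offset + len(payload)] = payload

mem = TaggedBuffer()
mem.register(tag=7, length=10)
mem.place(7, 5, b"world")   # arrives first; still placed at its final offset
mem.place(7, 0, b"hello")
```

Because every packet is self-describing, no reassembly buffer is needed; out-of-order arrival costs nothing extra.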
Avoiding application context switching/OS bypass: The third and somewhat less familiar source of overhead, context switching, contributes significantly to overhead and latency in applications. Traditionally, when an application issues commands to the I/O adapter, the commands are transmitted through most layers of the application/OS stack.
Passing a command from the application to the OS requires a CPU-intensive context switch. When executing a context switch, the CPU must save the application context in system memory—all of the CPU's general-purpose registers, floating-point registers, stack pointers, the instruction pointer, and the memory-management-unit state associated with the application's memory-access rights. The OS context is then restored by loading a similar set of items for the OS from system memory.
The iWARP extensions implement OS bypass (user-level direct access), enabling an application executing in user space to post commands directly to the network adapter (Fig. 4). This eliminates expensive calls to the OS, dramatically reducing application context switches. An accelerated Ethernet adapter handles tasks typically performed by the OS. Such adapters are more complex than traditional non-accelerated NICs, but can eliminate the final 40% of CPU overhead related to networking.
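The OS-bypass pattern resembles the queue-pair model used by RDMA adapters, sketched below in Python. This is a software stand-in for what is really a memory-mapped hardware interface; the class and method names are illustrative, and the "adapter" here is simulated by a polling loop.

```python
from collections import deque

class UserSpaceQueuePair:
    """Sketch of OS bypass: the application posts work requests to a
    send queue mapped into user space, and the adapter consumes them
    without any system call or context switch on the fast path."""
    def __init__(self):
        self.send_queue = deque()
        self.completion_queue = deque()

    def post_send(self, local_buffer: bytes, remote_addr: int) -> None:
        # a plain memory write from user space -- no kernel transition
        self.send_queue.append((local_buffer, remote_addr))

    def adapter_poll(self) -> None:
        # the adapter (modeled in software here) drains the send queue
        # and reports completions back through shared memory
        while self.send_queue:
            buf, addr = self.send_queue.popleft()
            self.completion_queue.append(("done", addr, len(buf)))

qp = UserSpaceQueuePair()
qp.post_send(b"payload", remote_addr=0x1000)
qp.adapter_poll()
```

The key property is that `post_send` touches only user-space memory: the expensive save/restore of registers and MMU state described above never happens on the data path.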
Clustered servers/blades depend on high-bandwidth, low-latency interconnects to aggregate the enormous processing power of dozens of CPUs and keep I/O needs serviced. For these applications, iWARP is advantageous because it delivers:
• Predictable, sustained, scalable performance, even in multicore, multi-processor clusters
• A single, self-provisioning adapter that supports the major clustering MPI implementations: HP MPI, Intel MPI, Scali MPI, and MVAPICH2
• Modern low-latency interconnect technology in an ultra-low-power, high-bandwidth Ethernet package, with flexibility, cost, and industry-standard management benefits, and exceptional processing power per square foot for clustered applications
For data-networking applications, iWARP-based accelerated Ethernet adapters offer full-function data-center NIC capabilities that boost performance, improve power efficiency, and more fully utilize data-center assets. The accelerated Ethernet adapters achieve the highest throughput, lowest latency, and lowest power consumption in the industry.
In particular, by offloading network-overhead processing, these adapters leave up to 95% of the CPU available to the application and operating system. In addition, hardware-class performance for virtualized applications ensures that CPU cycles offloaded from network processing are not lost to software-virtualization overhead.
For storage vendors, iWARP will become a standard feature of network-attached storage (NAS) and Internet Small Computer Systems Interface (iSCSI) storage-area networks (SANs). NAS and iSER-based (iSCSI Extensions for RDMA) storage networks utilizing accelerated Ethernet adapters deliver the highest-performance, lowest-overhead storage solutions at any given wire rate.
At 10 Gbits/s, Ethernet-based storage networks offer a viable alternative to a Fibre Channel SAN: a single network interface serves both block and file protocols, delivering high-throughput block- and file-level storage access that exceeds 8-Gbit Fibre Channel.
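A rough usable-bandwidth comparison backs this up, considering line-coding overhead alone (assuming 8-Gbit Fibre Channel's 8.5-Gbaud line rate with 8b/10b coding and 10-Gigabit Ethernet's 10.3125-Gbaud line rate with 64b/66b coding; real-world throughput also depends on protocol overhead):

```python
# Usable payload bandwidth after line-coding overhead only.
fc8_payload_bps = 8.5e9 * 8 / 10        # 8GFC, 8b/10b  -> ~6.8 Gbit/s
ge10_payload_bps = 10.3125e9 * 64 / 66  # 10GbE, 64b/66b -> ~10 Gbit/s
```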
In addition to iSCSI, iWARP supports a wide range of other storage protocols such as network file system (NFS) and common Internet file system (CIFS). This gives data centers the opportunity to reap the increased productivity and lowered total cost of ownership benefits of a ubiquitous, standards-based technology.
Like InfiniBand, iWARP does not have a standard programming interface, only a set of verbs. Unlike the InfiniBand architecture (IBA), iWARP only has reliable connected communication as this is the only service that TCP and SCTP provide. The iWARP specification also omits many of the special features of IBA, such as atomic remote operations. In all, iWARP offers the basics of InfiniBand applied to Ethernet. This suits it well for both legacy software and next-generation applications.