Fabric Acceptance

Looking into the future, it's clear which application segments will adopt which fabrics more readily.

Telecom equipment makers buy commodity technologies from commodity board makers. This is where PCI Express and ASI will see acceptance in low-level edge equipment. As 10-Gigabit Ethernet develops, it will become a commodity, and many other telecom applications will adopt it.

RapidIO will find acceptance in high-level, deterministic, critical applications across many application segments. Fire-control systems in the military, deterministic industrial controls, and telecom billing systems are the most prominent examples. The only critical application in telecom is the billing system: your calls can be dropped, and you can't get any "bars" on your cell phone. But be assured that you won't get an extra minute on your call plan, and there's no way you'll make a free long-distance phone call.

InfiniBand, which will be heavily accepted in clustered Linux servers, is an excellent technology for "streaming I/O" interfaces. Military radar and sonar are perfect examples of critical streaming I/O applications. Clustered Linux servers aren't deterministic or used in critical applications, but the message-passing mechanisms in InfiniBand make it a very clean and efficient method of hooking up large multiprocessing systems.

As the fabric technologies mature, each one will move across the end-market segments, depending on the requirements of each application. But each fabric will enjoy adoption in specific applications where the other fabrics fall short.

The Problems With Fabrics

Ever since two people hooked up a couple of tomato cans with a string and talked to each other, we've been enchanted with serial connections in computers. Serial connections have some major advantages. They require a minimum of connector pins, they use less power than parallel buses, and they're simple to use and design. But serial has always lagged behind parallel interconnects in performance.

We used regular single-ended logic to implement serial interconnects many years ago. We moved to emitter-coupled logic to get more speed, and we adopted differential connections when the single-ended logic ran out of signal-to-noise margin.

We were told CMOS would never reach gigabit frequencies, so we continued to make parallel buses run faster (with things like incident-wave switching). We also made them wider (i.e., 16 bits, 32 bits, 64 bits, etc.) to increase their bandwidth.

But CMOS advanced to gigabit speeds. Low-voltage differential-signaling logic was developed and refined. Differential pairs made serial links noise-resilient. And today, we have a host of high-speed serial differential technologies called fabrics. These "fabrics" promise to revolutionize computer architecture over the next few years. But a few problems, mostly in software, must be solved concurrently.

In the past, serial connections were primarily used as "I/O channels." A communications processor aggregated the I/O devices. That processor took the I/O traffic in and translated it to parallel data for the main CPU to handle. The same happened on the outgoing I/O traffic, but in reverse.

Creating I/O channels is pretty easy. PCI Express is all about changing the PCI-based parallel I/O to serial streams. Serial interconnects also can enable many different and efficient multiprocessing architectures, and that's where the trouble starts.

We're familiar with the linear models for buses like VME and PCI. You can calculate exactly what is going to happen in specific periods of time on every transaction with these linear-model buses. But fabrics are statistical models. What happens with one transaction depends on what else is occurring at the switches or on the nodes. The statistical nature of fabrics stems from the fact that all transactions on a fabric are split transactions, not continuous transactions like we find on buses.

Two topologies exist for fabric architectures: switched and mesh. In switched architectures, a central switch controls and routes traffic to and from the other nodes. This is where things get complicated. You can have single stars, double stars, Clos switches, and many other topologies. A mesh is an all-to-all topology. Every node in the network has dedicated channels to and from every other node in the architecture. This is very simple from a hardware and software standpoint, as there are few computer-science problems to deal with.
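To put a number on "dedicated channels to and from every other node," here is a minimal sketch that counts one-way channels in an all-to-all mesh and compares that with a single star. The function names are hypothetical, and the star count assumes each node has exactly one channel in each direction to one central switch.

```python
def mesh_channels(nodes: int) -> int:
    """One-way channels in an all-to-all mesh: each node has a dedicated
    channel to every other node, and one back from it."""
    if nodes < 0:
        raise ValueError("node count must be non-negative")
    return nodes * (nodes - 1)


def single_star_channels(nodes: int) -> int:
    """One-way channels in a single star, assuming one channel in each
    direction between every node and the central switch."""
    if nodes < 0:
        raise ValueError("node count must be non-negative")
    return 2 * nodes


if __name__ == "__main__":
    for n in (4, 8, 16):
        print(n, mesh_channels(n), single_star_channels(n))
```

The mesh count grows with the square of the node count, while the star count grows linearly. The mesh stays simple in software, but the wiring is what you pay for that simplicity.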

You can create three multiprocessing architectures with the high-speed serial fabrics. This is where we find the problems:

Tightly coupled/shared-everything (TCSE)
Snuggly coupled/shared-something (SCSS)
Loosely coupled/shared-nothing (LCSN)

When you look at the top fabric architectures in the market today (i.e., RapidIO, InfiniBand, PCI Express, and Ethernet), each fits into one of the above architectures, and each architecture has its own aberrant behavior.

TCSE

The only fabric that behaves like a tightly coupled/shared-everything architecture is Serial RapidIO. The RapidIO protocols have peer-to-peer mechanisms that enable every processor in the network to directly access every I/O device in the network, regardless of where it resides. TCSE architectures are somewhat deterministic and reasonably predictable. The peer-to-peer protocols tightly couple all devices, and every processor in the network can access every device in the network directly. RapidIO is the only fabric that can behave in this manner.

SCSS

In this architecture, the fabric offers no peer-to-peer mechanisms in its protocol stack. To access any of the I/O devices in the network, one processor communicates with the processor controlling that device locally (interprocessor communications). This is accomplished by sending "messages" to each other.

These messages go into a shared element, either memory or disk (hence, the shared "something"). The receiving processor reads the messages, accomplishes the requests, and sends back the answers. SCSS architectures are essentially message-passing systems with shared memory between the processors. This diminishes the determinism of the architecture. It's not even close to real time, unlike TCSE architectures.

This message-passing structure causes some interesting computer-science problems for the software folks. InfiniBand and PCI Express (Advanced Switching) both behave like SCSS architectures. Recall that InfiniBand has a remote DMA (RDMA) mechanism that sends "messages" to the other node's memory, and the "shared-something" aspect of this architecture becomes clear.

All transactions on an SCSS architecture are split transactions. When one processor sends data or a request to another processor, you have no guaranteed-delivery mechanism to ensure that the data ever arrives. If you write data to another CPU's memory, that processor must send back an "ACK," alerting the sender that it arrived.

If you're reading some memory location in another processor's space, the "ACK" is the transaction that returns the data. So in this type of architecture, some application code must reconcile all outstanding transactions, just as you reconcile your checkbook at the end of the month. Buses accomplish this with their handshake lines on writes.

On reads, you accomplish the transaction in one bus session, or there can only be one split transaction outstanding at any time. This is how PCI bridges handle the problem.
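The checkbook analogy can be sketched in a few lines. This is only an illustration under stated assumptions: the `TransactionLedger` class, its method names, and the node names are hypothetical and do not come from any fabric's software stack. Every write or read is recorded when it is sent, crossed off when its ACK arrives, and whatever remains is what the application still has to reconcile.

```python
from dataclasses import dataclass, field


@dataclass
class TransactionLedger:
    """Hypothetical per-node ledger of split transactions awaiting an ACK."""
    outstanding: dict = field(default_factory=dict)  # txn_id -> (kind, target)
    next_id: int = 0

    def send(self, kind: str, target: str) -> int:
        """Record an outgoing write or read and return its transaction ID."""
        txn_id = self.next_id
        self.next_id += 1
        self.outstanding[txn_id] = (kind, target)
        return txn_id

    def acknowledge(self, txn_id: int):
        """Cross off a transaction when its ACK (or read data) arrives.
        Returns the recorded entry, or None for an unknown ID."""
        return self.outstanding.pop(txn_id, None)

    def reconcile(self) -> list:
        """IDs that were sent but never acknowledged, in send order."""
        return sorted(self.outstanding)


if __name__ == "__main__":
    ledger = TransactionLedger()
    write_id = ledger.send("write", "node2")
    read_id = ledger.send("read", "node3")
    ledger.acknowledge(write_id)   # the write's ACK came back
    print(ledger.reconcile())      # the read is still outstanding
```

A real implementation would also need a policy for what to do with unreconciled entries (retry, report, or abandon), which this sketch leaves to the application.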

When two nodes in a fabric share common variables, and they're constantly swapping those variables with each other, you create "hotspots." Hotspots occur when two nodes consume the bandwidth of the switch, and other processors can't get their data in or out of those two nodes.

The only way to detect hotspots is to have a statistical application that analyzes traffic patterns within the entire fabric. That's a large chunk of code from a software standpoint. Because fabrics are distributed architectures, no single node knows what's going on in the rest of the architecture, so no node can detect hotspots on its own.
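As a rough illustration of what such a statistical application does at its core, the sketch below takes a fabric-wide log of (source, destination) packets and flags any node pair that accounts for more than a chosen share of the traffic. The function name, the log format, and the 50 percent threshold are all assumptions made for the example; a real monitor would also have to gather the log from every switch and node, which is where most of the code would go.

```python
from collections import Counter


def find_hotspots(packets, share_threshold=0.5):
    """Return node pairs whose combined two-way traffic is at least
    `share_threshold` of all packets seen in the fabric.

    `packets` is an iterable of (source, destination) tuples collected
    fabric-wide (a hypothetical log format)."""
    pair_counts = Counter(tuple(sorted(p)) for p in packets)
    total = sum(pair_counts.values())
    if total == 0:
        return []
    return sorted(
        pair for pair, count in pair_counts.items()
        if count / total >= share_threshold
    )


if __name__ == "__main__":
    # Nodes A and B swap shared variables constantly; others barely talk.
    log = [("A", "B")] * 6 + [("B", "A")] * 2 + [("C", "D"), ("A", "C")]
    print(find_hotspots(log))
```

Counting A-to-B and B-to-A traffic together reflects the description above, where the two nodes swapping variables jointly consume the switch bandwidth.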

Another interesting problem that pops up is called "loose sequential consistency." It's also called the "A-B/B-A" problem. Let's assume a processor wants to know the temperature of an oven. It wants to know if the fan is on too. It sends a request packet to the node controlling those I/O devices that says, "What is the temp?" Then, that processor sends another request that asks if the fan is on.

The receiving processor reads the first request, but it now must configure the analog-to-digital chip and wait for the conversion to get the temp. Meanwhile, it reads the second request, goes out and reads the "fan-on" bit, and sends that back to the original requester. Later, the CPU sends the answer to the "temp" request. Now, the requester sees the answer to its second query first and out of order (B-A, not A-B).

Each CPU must create a transaction-numbering and time-stamping system to match up its received messages with its sent requests in a predetermined period of time. This application must run on every single CPU in the network. You can't take the data from the "fan-on" request and use it to make decisions about the "temp" request.
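A minimal sketch of such a numbering and time-stamping scheme, using the oven example, might look like the following. The `RequestTracker` class, its method names, and the one-second window are hypothetical choices for illustration; the clock is passed in so the behavior can be shown without real delays.

```python
import itertools
import time


class RequestTracker:
    """Hypothetical matcher: numbers each request, time-stamps it, and pairs
    replies with requests by ID rather than by arrival order."""

    def __init__(self, timeout_s=1.0, clock=time.monotonic):
        self._timeout_s = timeout_s
        self._clock = clock
        self._ids = itertools.count(1)
        self._pending = {}  # txn_id -> (question, time sent)

    def send(self, question):
        """Assign the next transaction number and record the send time."""
        txn_id = next(self._ids)
        self._pending[txn_id] = (question, self._clock())
        return txn_id

    def receive(self, txn_id, answer):
        """Match a reply to its request. Returns (question, answer), or None
        if the ID is unknown or the reply missed the time window."""
        entry = self._pending.pop(txn_id, None)
        if entry is None:
            return None
        question, sent_at = entry
        if self._clock() - sent_at > self._timeout_s:
            return None
        return (question, answer)

    def expire(self):
        """Drop and return the IDs of requests that have waited too long."""
        now = self._clock()
        late = [t for t, (_, sent) in self._pending.items()
                if now - sent > self._timeout_s]
        for txn_id in late:
            del self._pending[txn_id]
        return late


if __name__ == "__main__":
    tracker = RequestTracker(timeout_s=1.0)
    temp_id = tracker.send("temp")
    fan_id = tracker.send("fan-on")
    # Replies arrive B-A, but each is matched to the right question by ID.
    print(tracker.receive(fan_id, True))
    print(tracker.receive(temp_id, 180))
```

Because each reply carries its transaction number, the "fan-on" answer can never be mistaken for the "temp" answer, whatever order they arrive in.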

LCSN

Ethernet is a loosely coupled/shared-nothing architecture. Each node has its own resources and its own copy of the operating system. Also, each node is autonomous from the network. Determinism in such an architecture is virtually non-existent. No single node has any insight about the traffic patterns in the entire network.

The Ethernet 802 committees are adopting RDMA mechanisms (similar to those found in InfiniBand) in their new 10G specification. As a result, Ethernet fabric networks will move up and become SCSS architectures at the 10-Gbit/s level. Too many problems with LCSN architectures, such as the lack of determinism, cannot be solved.

While we have many software and computer-science problems to solve with these different architectures, they do offer a step-function improvement in I/O and multiprocessing performance over buses. But none of the fabrics is going to be as deterministic as buses like VME.

Consequently, soft-real-time I/O applications like PCs and servers will accept fabrics first. Then, soft-real-time multiprocessing applications like clustered Linux servers will accept them.

Yet a bus such as VME, with its hard-real-time determinism, will continue as the only architecture that can meet the determinism requirements of many embedded applications. Fabrics will never replace certain buses because of the computer-science problems and the software burdens. But they do offer some tremendous benefits in certain applications that can effectively use TCSE or SCSS architectures.