It all began in 1952, when the ILLIAC I (Illinois Automatic Computer) graced the stage at the University of Illinois. By 1956, this machine had more compute power than all of Bell Labs, which wasn't bad for a 4.5-ton, 10- by 2- by 8.5-ft box filled with more than 2800 vacuum tubes and 64 kwords of drum storage. Eventually, the infamous ILLIAC IV array processor incorporated 256 processing elements in its design.
The latest single-chip, multicore processors run rings around these dinosaurs. But the search for faster, better solutions continues unabated. The industry has made great progress in larger symmetrical multiprocessing (SMP) systems, and typical high-end servers host over two dozen processors. Multiple cores per processor effectively increase the number of processors. Yet moving into the hundreds to thousands range requires a change of architecture.
NUMA, or non-uniform memory access, retains a common memory space, but access time depends on locality: a node's own memory is fastest, while latency climbs as memory farther from the node is accessed. NUMA's big problem, though, is programming.
The NUMA architecture has worked well in AMD's Opteron. Each chip has its own memory interface, but its HyperTransport links can be used to access memory attached to other chips. This works well if the memory being accessed is in an adjacent chip because of the speed of the link. But the approach reverts to a typical NUMA system when hundreds of nodes are used.
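The placement problem can be seen in a toy cost model. All latencies, the ring topology, and the function names below are invented for illustration, not measured from Opteron or any other real system; the point is only that an access pattern that looks uniform to the programmer can have very non-uniform cost.

```python
# Toy NUMA cost model: illustrative numbers only, not real hardware data.
LOCAL_NS = 60   # assumed latency to a node's own memory
HOP_NS = 40     # assumed extra latency per interconnect hop

def access_latency(cpu_node: int, mem_node: int, hops: dict) -> int:
    """Latency for cpu_node to touch memory homed on mem_node."""
    return LOCAL_NS + HOP_NS * hops[(cpu_node, mem_node)]

# A hypothetical 4-node system wired in a ring: node 0 is 0 hops from
# itself, 1 hop from nodes 1 and 3, and 2 hops from node 2.
hops = {(0, 0): 0, (0, 1): 1, (0, 2): 2, (0, 3): 1}

for mem in range(4):
    print(f"node 0 -> memory on node {mem}: "
          f"{access_latency(0, mem, hops)} ns")
```

The programming problem NUMA poses is exactly this: keeping data on the node that uses it, which is why operating systems grew first-touch allocation policies and placement libraries such as libnuma.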
A mixture of different application requirements has yielded a plethora of designs. These range from massive supercomputer complexes that are tied together by high-speed fabrics to clusters of blade servers connected by Ethernet.
Compute engines these days are based on standard platforms such as Intel's EM64T and IA64 processors, AMD's Athlon 64 and Opteron, and Sun's UltraSparc processors. Similarly, standards like Ethernet, Serial RapidIO (sRIO), and InfiniBand provide the interconnect fabric. And in software, standards are slowly improving the developer's ability to exploit these hardware features.
High-performance computing (HPC) tends to cover everything these days, from supercomputing applications like weather prediction and earthquake modeling to clusters of Web servers. Systems like the Cray XT3 tie dual-core AMD Opteron processors into a direct network (Fig. 1 and Fig. 2). Direct topologies such as the hypercube offer scalability, but dataflow and routing become issues that programmers must address.
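The routing issue a hypercube raises has a classic answer: dimension-order (e-cube) routing, in which a message corrects one differing bit of the destination address per hop. A minimal sketch (the node numbering and function name are mine, not from any particular machine):

```python
def ecube_route(src: int, dst: int) -> list[int]:
    """Dimension-order route through a hypercube: flip the differing
    address bits of src, lowest dimension first, until dst is reached."""
    path = [src]
    node = src
    diff = src ^ dst   # set bits mark the dimensions that must be crossed
    dim = 0
    while diff:
        if diff & 1:               # this dimension differs: cross its link
            node ^= 1 << dim
            path.append(node)
        diff >>= 1
        dim += 1
    return path

# In a 3-cube (8 nodes), routing from node 0 to node 5 (binary 101)
# crosses dimensions 0 and 2:
print(ecube_route(0, 5))  # [0, 1, 5]
```

The path length equals the Hamming distance between the two addresses, so any message in an N-node hypercube needs at most log2(N) hops, which is the scalability the architecture is prized for.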
Different connection architectures like the hypercube have been giving way to fabric interconnects like Ethernet, sRIO, InfiniBand, and ASI (Advanced Switching Interconnect). These standards-based solutions cost less. Also, their performance has improved steadily as products mature.
InfiniBand, one of the most mature of these technologies, has found a niche in HPC. Some of the largest and fastest supercomputers are based on an InfiniBand interconnect. Mellanox's 480-Gbit/s InfiniScale III switch chip can be found at the center of many of these fabrics (see "Switch-Chip Fuels Third-Generation InfiniBand" at www.electronicdesign.com, ED Online 5999). It can be configured as an eight-port, 30-Gbit/s, 12x InfiniBand switch or as a 24-port, 10-Gbit/s, 4x switch. Its low 200-ns latency is critical to efficient HPC applications.
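The port arithmetic behind those figures is straightforward, assuming (as such aggregates are usually quoted) that the 480-Gbit/s figure counts both directions of every link: each first-generation InfiniBand lane signals at 2.5 Gbit/s, so a 4x port carries 10 Gbit/s and a 12x port 30 Gbit/s.

```python
LANE_GBPS = 2.5   # InfiniBand SDR signalling rate per lane

def port_gbps(lanes: int) -> float:
    """Raw one-way rate of an InfiniBand port with the given lane count."""
    return lanes * LANE_GBPS

# 8 ports x 12x, counted in both directions, gives the 480-Gbit/s aggregate
aggregate = 8 * port_gbps(12) * 2
print(port_gbps(4), port_gbps(12), aggregate)  # 10.0 30.0 480.0
```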
Of course, even InfiniBand designs can go one better with devices like PathScale's 10X-MR PCI Express adapter (see "InfiniBand Hits 10M Messages/s" at ED Online 12359). Its connectionless architecture avoids the queue pairs used by the usual OpenIB stack, allowing a node to handle up to 10 million messages/s.
Blade architectures lend themselves to large systems. Supercomputers, such as those built with InfiniBand, often split the I/O support into different blades. That way, the processing node only has memory, processing units, and a fabric interconnect. Likewise, IBM's Cell Blade houses IBM's Cell BE processor with nine cores (Fig. 3). This architecture will power Sony's forthcoming PlayStation 3 (see "CELL Processor Gets Ready To Entertain The Masses," ED Online 9748).
On the other hand, many applications may be large in number but small in terms of their individual requirements. A typical rack-mount solution like Sun's Grid Rack System fits these situations (Fig. 4). Even here, multiple cores per processor chip are common. Instead of a dual-core processor, there's an eight-core UltraSparc T1 chip that can run up to 32 processing threads at once, four per core.
Even with all that horsepower, Sun's T2000 resembles the typical rack-mount PC server, albeit with high-end components. The system can handle up to four 73-Gbyte, 10,000-rpm SAS disks and a slimline DVD-R/CD-RW drive. It also can be connected to external disk and tape storage. Four 1-Gbit/s Ethernet links provide the main contact with the outside world.
Unfortunately, each blade system is based on unique specifications. IBM, Hewlett-Packard, Sun, and other companies each have their own blade designs and backplanes. Even the rack-mounted systems only have the width of the rack in common.
Designers who want to base large systems on standard boards are turning to AdvancedTCA and VME. These standards offer common board form factors while accommodating a variety of other standards. For example, AdvancedTCA uses the same board form factor with a number of different backplanes and backplane fabrics, such as Gigabit Ethernet, sRIO, InfiniBand, and ASI. Most such systems are likely to be customized because they target communications and military applications.
Artesyn Technologies' KAT4000 exemplifies this customization (Fig. 5). This blade can house a PowerPC processor. But its compute or peripheral processing power comes from the boards that can fit into the four Advanced Mezzanine Card (AMC) card slots. AMC slots also are found in the MicroTCA standard from PICMG. VME and AdvancedTCA racks tend to hold about a dozen cards, including one or two switch-fabric cards. With fabrics, multiple racks can be combined into a larger system.
Switch fabrics also make it easier to link different kinds of compute engines. Mercury Computer's PowerStream 7000 can host hundreds of compute nodes, including FPGA-based devices like the three Xilinx Virtex-II chips on the PowerStream 7000 FCN module (Fig. 6). While it isn't a rugged supercomputer, the PowerStream 7000 can be used in demanding environments.
Switch fabrics are great, but PCI Express virtualization shines at keeping peripheral interfacing simple, especially since many clustered systems don't use the network fabric for peripheral support. In this case, the network fabric generally is Gigabit Ethernet, and the host processors and memory sit on different blades from the peripherals. Both a PCI Express fabric and an Ethernet fabric run across the backplane.
The PCI Express fabric switch chips are configured, usually out-of-band, so the root host of each system is connected to the appropriate devices that also are attached to the fabric. Each host uses its peripherals while it's running, but the overall system management can start and stop the hosts as necessary. Reconfiguration can be completed when a host is stopped.
This approach is much less flexible and requires more hardware configuration than using a newer switch fabric like ASI or sRIO. But it also means that the hosts appear as stock hardware to the operating system and applications.
This is important for running clusters of applications using existing software. It's also the kind of environment that most clustered software runs in today. The virtualization simply permits more flexible use of the hardware, including manual recovery from failed units.
PCI Express virtualization is just starting to show up. It requires management tools and PCI Express switch chips that are just becoming available.