Getting data to and from a processor quickly is key to high-performance network processing. Broadcom's new BCM1400 multiprocessor tackles this problem with a trio of flexible advanced HyperTransport/SPI-4 Phase 2 links. Of course, packing four 64-bit MIPS processors into the same package didn't hurt either. The result is a chip that provides multiprocessing support alone or in an array of HyperTransport-linked chips.
The BCM1400 targets communication-oriented applications that need significant computational support, like Internet service routers and switches with deep content switching and differentiated services such as quality-of-service (QoS) and virtual private networks (VPNs). In addition, the BCM1400 addresses Internet-Protocol (IP) servers and subscriber-management platforms, servers supporting high computational requirements for scientific or Enterprise Java environments, and wireless infrastructure equipment. The multiprocessing architecture also makes it suitable for scientific and embedded applications requiring significant computational capabilities.
The chip contains a number of peripherals along with its sophisticated memory and communication support (see the table). Up to eight chips can be connected via the HyperTransport links, for a 32-processor symmetrical multiprocessing (SMP) system (see "Multifunctional HyperTransport," p. 48).
What differentiates the BCM1400's SMP support from most small-scale SMP systems with two to eight processors is its use of a nonuniform memory access (NUMA) architecture, similar to the NUMA used with AMD's new Opteron 64-bit CPU. The NUMA architecture is more often found in medium-scale microprocessor systems with eight to 32 processors. Broadcom's solution is unusual because of its high integration, low power consumption, and multiplexing of memory and I/O traffic on the same link.
In a conventional SMP system, all processors have the same memory access time. A bus or switch acts as an interface between processors and the memory subsystem. Cache coherence is maintained by monitoring the bus or the switch traffic.
With NUMA, the memory address space is made up of the combined local memory from each node in the system. A processor can access its local memory faster than nonlocal memory. NUMA systems have the advantage of being easily expanded, while adding a processor to a conventional SMP shared memory architecture is more difficult because an additional port is needed.
Broadcom uses a cache-coherent form of NUMA, or ccNUMA. This allows on-chip caches to remain up to date even while data moves through the processor/memory interconnect. The BCM1400's on-chip double-data-rate (DDR) memory controller supports the chip's local, off-chip memory. Its HyperTransport links provide ccNUMA support.
Three-Way HyperTransport/SPI-4 Links: The BCM1400's triple HyperTransport link architecture is critical to its use in communication and multichip multiprocessing support (see the figure). Each link can be configured as an 8- or 16-bit HyperTransport connection, or as a streaming SPI-4 interface. The SPI-4 support includes hardware hash and route acceleration functions.
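The per-link mode choice can be pictured as a small configuration table. The mode names and helper below are illustrative assumptions, not the BCM1400's actual programming interface:

```c
/* Hypothetical sketch of per-link configuration; the enum names and
 * struct are assumptions for illustration, not Broadcom's API. */
typedef enum { HT_8BIT, HT_16BIT, SPI4_PHASE2 } link_mode_t;

struct bcm1400_links {
    link_mode_t port[3];   /* three configurable HyperTransport/SPI-4 links */
};

/* Only links left in HyperTransport mode carry ccNUMA traffic and
 * encapsulated packets; an SPI-4 link streams packet data only. */
static int ht_link_count(const struct bcm1400_links *l)
{
    int n = 0;
    for (int i = 0; i < 3; i++)
        if (l->port[i] != SPI4_PHASE2)
            n++;
    return n;
}
```

A daisy-chained node that dedicates one port to an Ethernet MAC over SPI-4, for instance, would still have two HyperTransport links for coherent memory traffic.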
In addition, the HyperTransport links work with a mix of HyperTransport transactions, including encapsulated SPI-4 packets and nonlocal NUMA memory access.
The key is that hardware handles the movement of information. For example, the memory-mapping hardware detects nonlocal memory accesses and generates a HyperTransport request for each read or write. These packets are automatically routed to the node that serves the request from its local memory. Operating systems simply set up the memory maps and HyperTransport links.
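The decode step the hardware performs can be sketched in a few lines. The node count matches the eight-chip maximum, but the 4-Gbyte-per-node span and the simple range decode are illustrative assumptions, not the BCM1400's real memory map:

```c
#include <stdint.h>

#define NODES     8                /* up to eight chips on HyperTransport */
#define NODE_SPAN (1ULL << 32)     /* assumed 4 Gbytes of local DRAM per node */

/* Which node's local DDR controller owns this physical address? */
static int owner_node(uint64_t paddr)
{
    return (int)((paddr / NODE_SPAN) % NODES);
}

/* A local access goes straight to the on-chip DDR controller; a nonlocal
 * one is wrapped in a HyperTransport read/write request and routed to
 * the owning node. */
static int is_remote(uint64_t paddr, int my_node)
{
    return owner_node(paddr) != my_node;
}
```

The operating system only programs the map once; thereafter, the hardware performs this decode on every access with no software involvement.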
Although ccNUMA incurs an access-time penalty, on-chip caches and high-speed HyperTransport transfers mitigate the cost of using nonlocal memory. There's an initial delay while a cache entry fills, but subsequent accesses by a processor run at cache speed, faster than even local memory accesses.
Code prefetching effectively masks the system's latency. With a large 1-Mbyte level 2 cache per BCM1400, only small, random, nonlocal memory accesses cause any significant slowdown. Moving large blocks of sequential data via nonlocal memory isn't a problem either, because only the transfer initiation incurs a latency penalty, a small fraction of the time needed to send the block of data. Each processor's 64-kbyte level 1 cache is split into a 32-kbyte instruction cache and a 32-kbyte data cache.
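A back-of-the-envelope calculation shows why the initiation penalty washes out for sequential transfers. The latency figures here are illustrative assumptions, not measured BCM1400 numbers:

```c
/* Amortize one transfer-initiation penalty across a sequential block.
 * init_ns and word_ns are assumed figures for illustration only. */
static double ns_per_word(double init_ns, double word_ns, unsigned words)
{
    return init_ns / words + word_ns;
}
```

Assuming, say, a 200-ns initiation and 2 ns per 64-bit word, a single-word nonlocal access costs 202 ns, but a 512-word (4-Kbyte) block costs under 2.4 ns per word; the startup penalty becomes a small fraction of the total.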
Large amounts of streaming data can also be handled when a port is set up as an SPI-4 link, which is ideal for high-speed communication environments. The link can connect to external devices with a native SPI-4 interface, such as Ethernet MACs or switch-fabric interface chips.
The 256-Gbit/s switch connects the on-chip memory and processors to the three HyperTransport/SPI-4 links. It offers transparent forwarding of network, ccNUMA access, and HyperTransport packets when necessary.
Three HyperTransport links enable an expandable system. Two are needed for a pass-through architecture in which multiple units are daisy-chained together, ideal for processing data as it moves along the chain. Unfortunately, using the same links for NUMA transfers reduces the bandwidth available for other traffic. If the daisy-chained link bandwidth is needed for I/O or network packets, a pair of chips can instead be linked through the third link for NUMA transfers.
The third link also lets the nodes in an array extend in another direction, and this can work in two ways. It can supply another path for a daisy-chain architecture, or it can add processors to work on data forwarded from the daisy-chain data stream. The latter is great for applications such as VPN processing that's handed off to additional processing nodes.
On-Chip Components: The four MIPS processors are joined to the internal ZBbus. The 128-Gbit/s ZBbus is a high-speed, split-transaction multiprocessor bus. It runs in big-endian and little-endian modes. The bus implements the standard MESI protocol to ensure coherency between the four CPUs, their level 1 caches, and the shared level 2 cache.
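The MESI states the ZBbus maintains can be sketched as a small state machine for a single cache line. This is an illustrative model of the standard protocol, not Broadcom's controller logic:

```c
/* Minimal MESI sketch for one cache line: Modified, Exclusive,
 * Shared, Invalid, and a few representative transitions. */
typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;

/* A local CPU write leaves the line Modified; a Shared line first
 * invalidates the other CPUs' copies over the bus. */
static mesi_t on_local_write(mesi_t s)
{
    (void)s;
    return MODIFIED;
}

/* A local read miss fills the line Exclusive if no other cache holds
 * it, Shared otherwise. */
static mesi_t on_local_read_miss(int others_have_copy)
{
    return others_have_copy ? SHARED : EXCLUSIVE;
}

/* Snooping another CPU's read: a Modified line is written back, and
 * any valid copy downgrades to Shared. */
static mesi_t on_snoop_read(mesi_t s)
{
    return (s == INVALID) ? INVALID : SHARED;
}
```

In the BCM1400, these transitions keep the four level 1 caches and the shared level 2 cache consistent without software intervention.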
The CPUs are standard MIPS64 cores with floating-point support. They operate independently, so applications can be migrated from one CPU to another if necessary.
The ZBbus supports all on-chip peripherals, including a 66-MHz PCI interface, a 1-Gbit Ethernet interface, four UARTs, and a generic bus that can handle ROMs or flash memory in addition to simple I/O devices. A JTAG interface supports debugging, and a bus trace unit provides additional data for system analysis.
The BCM1400 supports VxWorks, Linux, NetBSD, and QNX. These operating systems are great for embedded applications. Linux, NetBSD, and QNX can be used for more general multiprocessing applications as well. The transparent ccNUMA memory architecture and standard operating systems simplify development, allowing the creation of applications that differentiate the hardware even in similar environments.
Moreover, the standard operating systems make it easy to develop Internet-based applications. Standard server applications can be used on the system, so the BCM1400 is ideal for a wide range of applications, such as intelligent network-based storage and Web-based services.
The BCM1400 is compatible with the dual-processor BCM1250 and single-processor BCM112x chips. These have a single HyperTransport link. A remarkable achievement, the BCM1400 lets designers bring high-performance multiprocessing to bear using an easily expandable architecture. It's also surprisingly affordable.
Price & Availability
The BCM1400 costs under $1000 in OEM quantities. It uses a 0.1-µm process and comes in a 40- by 40-mm, 1517-pin BGA package. Samples available Q2 2003. Production quantities available H2 2003.
Broadcom Corp., 16215 Alton Parkway, Irvine, CA 92619-7013; (949) 450-8700; www.broadcom.com.