Massively Parallel CPU Array Targets High-End Communications

These days, network processing is all about high-speed interconnects and the number of cores designers can bring to bear on the packet-processing problem. Companies like Tilera try to pack everything into a single chip (see “72-Core Platform Targets Networking Chores”), with multiple versions providing different numbers of cores.

Broadcom’s XLP900 series processors incorporate a lot of cores into a single chip but also provide a mechanism to build larger arrays to handle more demanding environments. The series packs 80 cores per chip, and up to eight chips can be combined in an array for a total of 640 cores (Fig. 1). This array of 64-bit MIPS-based nxCPU cores can deliver 1.28 Tbits/s of memory-coherent performance. Each chip delivers 160 Gbits/s while executing more than 1 trillion operations/s (TOPS).

Figure 1. Broadcom’s XLP900 has up to 80 cores linked to a fast messaging network and distributed interconnect that supports hardware virtualization across an array with up to 640 cores. Each chip brings four DDR3 controllers and a collection of network accelerators in addition to network and peripheral interfaces.
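The array totals quoted above follow directly from the per-chip figures. The short Python sketch below reproduces that arithmetic. It assumes throughput scales linearly with chip count, which is what the quoted numbers imply but is not something the article states outright, and the function name `array_totals` is purely illustrative.

```python
# Per-chip figures quoted in the article.
CORES_PER_CHIP = 80
GBITS_PER_CHIP = 160
MAX_CHIPS = 8


def array_totals(chips: int) -> tuple[int, float]:
    """Return (total cores, aggregate throughput in Tbits/s) for an array.

    Assumes linear scaling with chip count (an illustrative simplification).
    """
    if not 1 <= chips <= MAX_CHIPS:
        raise ValueError(f"chips must be between 1 and {MAX_CHIPS}")
    cores = chips * CORES_PER_CHIP
    tbits = chips * GBITS_PER_CHIP / 1000  # decimal units: 1000 Gbits = 1 Tbit
    return cores, tbits


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        cores, tbits = array_totals(n)
        print(f"{n} chip(s): {cores} cores, {tbits:.2f} Tbits/s")
```

For the full eight-chip array, this gives the 640 cores and 1.28 Tbits/s cited in the text.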

The XLP900 is built around a three-level coherent cache system that is linked by a 2D distributed interconnect. The quad-issue, quad-threaded cores use a superscalar, out-of-order design with virtualization support that is compatible with the Linux Kernel-based Virtual Machine (KVM) and Quick EMUlator (QEMU). The system also supports I/O virtualization, including PCI Express Gen 3 single-root I/O virtualization (SR-IOV) with 255 virtual functions/port.

Each nxCPU cluster has four cores with L1 and L2 caches for the cluster. Power gating operates on a per-core basis. There is a single L3 cache for the chip, but the cache coherency system spans multiple chips. The four DDR3 controllers on each chip deliver a bandwidth of 68.25 Gbytes/s. The memory system is linked to all processors via an inter-chip interface.
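A few per-chip numbers can be derived from the figures given so far: 80 cores, four cores per cluster, four threads per core, and four DDR3 controllers sharing 68.25 Gbytes/s. The sketch below works them out. The per-controller and per-core bandwidth values are simple averages that assume an even split, which is an illustrative assumption and not a Broadcom specification, and the function name `chip_summary` is hypothetical.

```python
# Figures quoted in the article.
CORES_PER_CHIP = 80
CORES_PER_CLUSTER = 4
THREADS_PER_CORE = 4       # "quad-threading" cores
DDR3_CONTROLLERS = 4
MEM_BW_GBYTES = 68.25      # total per chip, Gbytes/s


def chip_summary() -> dict[str, float]:
    """Derive cluster, thread, and average memory-bandwidth figures per chip."""
    return {
        "clusters": CORES_PER_CHIP // CORES_PER_CLUSTER,
        "hw_threads": CORES_PER_CHIP * THREADS_PER_CORE,
        # Averages below assume bandwidth is shared evenly (an assumption).
        "gbytes_per_controller": MEM_BW_GBYTES / DDR3_CONTROLLERS,
        "gbytes_per_core": MEM_BW_GBYTES / CORES_PER_CHIP,
    }


if __name__ == "__main__":
    for name, value in chip_summary().items():
        print(f"{name}: {value:g}")
```

Under these assumptions, each chip has 20 clusters and 320 hardware threads, with roughly 17 Gbytes/s per DDR3 controller.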

This approach is similar to AMD’s HyperTransport on its Opteron server chips and what Intel incorporates in its Xeon processors. Each chip has three inter-chip interconnects, which is enough to combine up to eight chips into a single array without additional hardware (Fig. 2).

Figure 2. Three bidirectional, high-speed inter-chip interfaces can link up to eight chips into an array that contains 640 cores.
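To see why three links per chip can be enough for eight chips, consider one arrangement that fits those numbers: a three-dimensional cube, where each chip sits at a corner and connects to the three corners that differ from it in one bit of a 3-bit ID. The article does not say how Broadcom actually wires the array, so the sketch below is only an assumed topology, and the function names are illustrative.

```python
from collections import deque

CHIPS = 8
LINKS_PER_CHIP = 3


def cube_neighbors(chip: int) -> list[int]:
    """Neighbors of a chip in an assumed 3-cube: flip one bit of its 3-bit ID."""
    return [chip ^ (1 << bit) for bit in range(LINKS_PER_CHIP)]


def hops_from(start: int) -> dict[int, int]:
    """Breadth-first search giving the hop count from start to every chip."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in cube_neighbors(node):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


if __name__ == "__main__":
    for chip in range(CHIPS):
        print(f"chip {chip} links to {cube_neighbors(chip)}")
    print("worst-case hops:", max(hops_from(0).values()))
```

In this assumed layout, every chip uses exactly three links, all eight chips are reachable, and no chip is more than three hops from any other.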

The inter-chip links handle a range of traffic including caching information. They also support the fast messaging network (FMN), which links peripherals to cores throughout the interconnect. The packet-processing engine handles network peripherals with interfaces that include XLAUI, KR4, XFI, XAUI, RXAUI, and HiGig2. There also is IEEE 1588-compatible hardware time stamping and support for SyncE, MACSec, and PFC.

The hardware accelerators are specialized processors designed to operate with minimal CPU intervention. They target packet-processing chores and include compression/decompression engines and regular-expression processors, enabling full-speed deep packet inspection (DPI). The security and encryption engines support RSA1K, RSA2K, RSA4K, RSA8K, and ECC. The RAID engine targets storage applications, which is where the SATA and PCI Express interfaces come into play. Other peripheral interfaces such as USB 3.0 can be used as well.

These network communication processors will likely handle packet-processing chores with data moving on to another host, but the architecture allows this type of processing to stay within the node. Linux virtualization can handle hundreds of virtual machines that are close to the network traffic and protected from each other.
