Core Counts Keep Climbing

To meet the needs of high-end networking equipment, vendors such as Cavium, Freescale, NetLogic, and Tilera continue to shovel more CPU cores into their processors. NetLogic’s new XLP II scales up to 20 cores (80 threads), and Tilera’s Tile-Gx offers up to 100 cores on a single chip. This trend could soon grind to a halt, however, if processor designers can’t get around the looming memory wall.

In high-end networking equipment, support for 10G Ethernet connections is rapidly becoming standard. OEMs are deploying line cards for 10x10GbE configurations, requiring 100 Gbits/s of throughput. Standards for 40G Ethernet are already established, with 4x40G configurations in development. After that, the push is underway toward 100G Ethernet, which will further increase line-card requirements. These throughput increases demand new high-performance multicore processors.

Next-Generation Multicores

The next battlefield for multicore processor vendors is the 28-nm process node. At this node, processor designers can play with twice as many transistors as current 40-nm technology provides. Having already validated the multicore approach, they can use these transistors to add more CPU cores to their designs.

The 28-nm XLP II provides more than twice as many cores as the 40-nm XLP. Each core is four-way multithreaded, providing greater parallelism and CPU efficiency. Freescale is introducing multithreading in its 28-nm QorIQ T-series, which scales to 12 cores and 24 threads. Cavium offers 32 single-thread CPUs in its current Octeon II processors and will have more cores in its 28-nm Octeon III products.

One challenge with increasing core counts is efficiently connecting so many cores. Simple buses and crossbars do not perform well with more than eight cores. Longer bus wires cannot clock as fast, causing bandwidth to sag just when it is needed the most.

One alternative is a ring bus, which uses point-to-point connections between cores. NetLogic’s processors use a ring, and Intel adopted this approach in Sandy Bridge. A ring maintains high clock speed regardless of the number of cores, but transactions require multiple hops to get to their destination, adding latency. For more than 16 cores, this latency can be a problem, although making the ring bidirectional helps.
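The hop-count argument can be checked with a short calculation. The following is a minimal sketch, assuming uniform traffic between all pairs of cores and one hop per link; the function name `ring_avg_hops` is made up for illustration and does not describe any vendor's actual interconnect.

```python
def ring_avg_hops(n: int, bidirectional: bool = False) -> float:
    """Average hops between two distinct nodes on an n-node ring.

    Assumes uniform traffic: every other node is an equally likely destination.
    """
    if n < 2:
        return 0.0
    total = 0
    for d in range(1, n):  # d = clockwise distance from the source node
        # A bidirectional ring can take the shorter of the two directions.
        total += min(d, n - d) if bidirectional else d
    return total / (n - 1)


if __name__ == "__main__":
    for cores in (8, 16, 32):
        one_way = ring_avg_hops(cores)
        two_way = ring_avg_hops(cores, bidirectional=True)
        print(f"{cores:3d} cores: {one_way:5.2f} hops one-way, {two_way:5.2f} hops bidirectional")
```

Under these assumptions, the average hop count grows linearly with the number of cores, and making the ring bidirectional cuts it roughly in half.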

For its many-core designs, Tilera has pioneered a mesh interconnect, which connects cores in a grid pattern. This layout reduces the average number of hops. Whereas a ring's bisection bandwidth stays constant as the number of nodes increases, a mesh's bisection bandwidth scales with the square root of the number of nodes. The mesh enables Tilera to deliver processors with 100 CPU cores.
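To illustrate the scaling, here is a rough sketch that models the mesh as a square k x k grid without wraparound links, assumes uniform traffic, and counts bisection width in links of equal bandwidth. The function names are hypothetical, and the model is not a description of Tilera's actual design.

```python
import math
from itertools import product


def mesh_avg_hops(k: int) -> float:
    """Average Manhattan distance between distinct nodes of a k x k mesh."""
    nodes = list(product(range(k), repeat=2))
    total = 0
    pairs = 0
    for (x1, y1), (x2, y2) in product(nodes, repeat=2):
        if (x1, y1) != (x2, y2):
            total += abs(x1 - x2) + abs(y1 - y2)
            pairs += 1
    return total / pairs if pairs else 0.0


def ring_bisection_links(n_nodes: int) -> int:
    """Cutting a ring in half always severs exactly two links."""
    return 2


def mesh_bisection_links(n_nodes: int) -> int:
    """Cutting a square mesh in half severs one link per row: sqrt(n) links."""
    k = math.isqrt(n_nodes)
    if k * k != n_nodes:
        raise ValueError("this sketch only handles square meshes")
    return k


if __name__ == "__main__":
    for n in (16, 64, 100):
        k = math.isqrt(n)
        print(f"{n:3d} nodes: mesh avg hops {mesh_avg_hops(k):.2f}, "
              f"bisection links ring={ring_bisection_links(n)} mesh={mesh_bisection_links(n)}")
```

In this model, a 100-node mesh averages fewer than seven hops between cores and has ten links across its bisection, while a ring of any size has only two.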

Facing The Memory Wall

Keeping these cores fed is another matter. Amdahl’s Rule requires a proportional increase of memory bandwidth along with CPU performance. Networking in particular requires moving massive amounts of data into and out of the processor, both the packet data itself and the associated tables and data structures required to properly process it.

Taking further advantage of rising transistor budgets at 28 nm, NetLogic includes 32 Mbytes of L3 cache in its 20-core monster, far more than in any other embedded processor and more even than the 30 Mbytes Intel includes in its 10-core Westmere-EX server processor. This cache will help reduce the traffic to external memory, particularly for data structures that are repetitively accessed.

Caches just don’t cut it for large data structures and streaming packet data. This is why chips need more bandwidth to DRAM. DRAM has always been a choke point but is becoming more of a wall blocking performance gains. Over the years, processor designers have coped with the slow improvement in DRAM speeds by widening and adding memory channels. Today’s high-end processors support four 64-bit memory channels operating at DDR3-1600 speeds for a total bandwidth of 51.2 Gbytes/s.
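The 51.2-Gbyte/s figure follows directly from the channel count, channel width, and transfer rate. A minimal sketch of the arithmetic, using an illustrative function name and taking the DDR3 speed grade as millions of transfers per second, with 1 Gbyte counted as 10^9 bytes:

```python
def peak_dram_bandwidth_gbytes(channels: int, bus_width_bits: int,
                               mega_transfers_per_s: int) -> float:
    """Theoretical peak DRAM bandwidth in Gbytes/s (10^9 bytes per second).

    Ignores refresh, bank conflicts, and bus turnaround, so sustained
    bandwidth in a real system will be lower.
    """
    bytes_per_transfer = bus_width_bits // 8
    return channels * bytes_per_transfer * mega_transfers_per_s / 1000


if __name__ == "__main__":
    # Four 64-bit channels at DDR3-1600, as described in the text.
    print(peak_dram_bandwidth_gbytes(4, 64, 1600))
```

Each 64-bit channel moves 8 bytes per transfer, so one DDR3-1600 channel peaks at 12.8 Gbytes/s and four channels at 51.2 Gbytes/s.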

With each channel consuming more than 200 pins, implementing more than four channels is impractical. Next-generation processors support DDR3-2133, but whether system layouts can handle such speeds remains in question. Thus, processors that are doubling their core count are facing a memory-bandwidth increase of 33% at best and 0% at worst.
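The squeeze on each core can be put in numbers. The sketch below assumes the channel count and width stay fixed, so total bandwidth scales only with the DDR3 transfer rate; the function name is illustrative.

```python
def per_core_bandwidth_ratio(core_scale: float, old_mts: int, new_mts: int) -> float:
    """Ratio of new to old memory bandwidth available per core.

    core_scale is the factor by which the core count grows (2.0 = doubling).
    Assumes the same number and width of memory channels in both generations.
    """
    bandwidth_scale = new_mts / old_mts
    return bandwidth_scale / core_scale


if __name__ == "__main__":
    best = per_core_bandwidth_ratio(2.0, 1600, 2133)   # DDR3-2133 works in the system
    worst = per_core_bandwidth_ratio(2.0, 1600, 1600)  # system stays at DDR3-1600
    print(f"best case:  {best:.2f}x bandwidth per core")
    print(f"worst case: {worst:.2f}x bandwidth per core")
```

In other words, doubling the core count leaves each core with about two-thirds of its previous memory bandwidth in the best case and half in the worst case.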

A new approach is needed. For some of its Xeon processors, Intel uses a buffer-on-board (BoB) design. These processors have high-speed serial memory interfaces that connect to external buffer chips, which convert the serial interface to a standard parallel DRAM connection. BoB enables processors with six or more DRAM channels, but the external buffers add cost and board area. A better approach would implement a standard high-speed serial interface, such as the GigaChip Interface from MoSys, directly to the DRAM chips.

Micron’s prototype Hybrid Memory Cube (HMC) is a radical approach that completely re-engineers the memory subsystem. The HMC includes its own memory controller, which connects directly to a stack of memory chips using 3D vias. This design optimizes the bandwidth between the controller and the DRAM chips while providing an efficient, high-speed interface to the processor. This technology, however, will not enter production until 2014 at the earliest.

Meeting the demands of high-end networking equipment will not be easy. Simply shoveling more cores into the hopper won’t get the job done. Processor designers must efficiently connect the cores, provide adequate cache memory, and improve DRAM bandwidth. Facing the limits of current DRAM technology, these designs may need to adopt a BoB approach for the near future while waiting for a more radical solution such as Micron’s HMC.