Typical general-purpose symmetric multiprocessing (SMP) multicore designs contain about eight cores. Specialized architectures, on the other hand, push the number of cores into the hundreds. Tilera ups the ante for SMP with its 64-core/tile Tile64 chip (see the figure). Its iMesh interconnect incorporates five separate packet networks with five switches per tile (see the table). Chips with 36 and 120 tiles are on the horizon.
Go With the Flow
The SMP nonuniform memory access (NUMA) architecture is similar to the HyperTransport approach AMD uses for its Opteron series. As with AMD's approach, the locations of peripherals and memory are invisible to the application and matter only at a low level of the operating system.
The big difference is that AMD carries all traffic over the same HyperTransport interface, while Tilera splits the traffic across different networks. This lets memory transfers proceed in parallel with other transfers, such as peripheral data. Data moves through non-blocking switches at one cycle per hop.
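The one-cycle-per-hop figure makes tile-to-tile latency easy to estimate. A minimal sketch, assuming the Tile64's 8-by-8 grid and simple dimension-ordered routing (both assumptions for illustration), reduces the hop count to the Manhattan distance between tiles:

```c
#include <assert.h>
#include <stdlib.h>

/* Estimated switch-traversal cycles between two tiles on an 8x8 mesh,
 * at one cycle per hop. Dimension-ordered routing is assumed, so the
 * hop count is simply the Manhattan distance. */
static int hop_cycles(int src_x, int src_y, int dst_x, int dst_y)
{
    return abs(dst_x - src_x) + abs(dst_y - src_y);
}
```

Under these assumptions, a corner-to-corner transfer crosses 14 switches, so even the worst-case on-chip path costs only 14 cycles of switching latency.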
Splitting the traffic lets each type of transfer be optimized. For example, memory and stream transfers tend to be large, while interrupts and User Datagram Protocol (UDP)-style transfers are usually small. High-level language support permits socket-style communication between nodes.
Communication can occur between any pair of nodes, each of which has a matrix address. Some nodes, such as the memory controllers, have more than one address to provide higher throughput. The source node determines which address to use. Typically, the system that initializes the operating system on each core distributes the addresses to keep any one of them from becoming a bottleneck.
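One simple way to spread requests across a controller's multiple addresses is to key the choice off the requesting tile. The policy below is purely an illustrative assumption (the article says only that the source node chooses and that the OS-initialization step distributes addresses), but it shows the load-spreading idea:

```c
#include <assert.h>

/* Hypothetical address-selection sketch: a memory controller exposes
 * CTRL_ADDRS matrix addresses, and each source tile deterministically
 * maps its own index to one of them, so no single address becomes a
 * hot spot. The round-robin-by-tile policy is an assumption. */
#define CTRL_ADDRS 4

static int pick_ctrl_addr(int tile_id)
{
    return tile_id % CTRL_ADDRS;  /* 64 tiles -> 16 tiles per address */
}
```

With 64 tiles and four addresses, each address serves exactly 16 tiles, so traffic to the controller is balanced by construction.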
My Cache, Your Cache
Each tile incorporates an L1 cache and a larger L2 cache. A core's "L3" cache is effectively the sum of the other cores' L2 caches. The memory controllers keep track of where information resides in the L2 caches. An access from a different node is handed that location, so subsequent accesses can go directly to the remote L2 cache.
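The controller-side bookkeeping can be pictured as a small directory that maps a cache line to the tile whose L2 currently holds it. This sketch is an illustrative model of that idea, not Tilera's actual coherence protocol; the names and structure are assumptions:

```c
#include <assert.h>

/* Toy directory model of the L2-as-distributed-L3 scheme: the memory
 * controller records which tile's L2 holds each line, so later
 * requests from any tile can be steered to that remote L2 instead of
 * main memory. */
#define LINES 256
static int owner[LINES];          /* -1 = line not cached on chip */

static void dir_init(void)
{
    for (int i = 0; i < LINES; i++)
        owner[i] = -1;
}

/* Returns the tile whose L2 now holds the line. The first requester
 * fills its own L2 and becomes the owner; later requesters are
 * redirected to it. */
static int dir_lookup(int line, int requester)
{
    if (owner[line] < 0)
        owner[line] = requester;
    return owner[line];
}
```

In this model, only the first access to a line involves main memory; every later access from any tile resolves to an on-chip L2.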
The response characteristics of this approach differ from a conventional SMP L3 cache, but its efficiency is much better than accessing main memory in terms of both speed and power. Off-chip accesses require hundreds of cycles and about 500 pJ, while an L3 access takes 20 to 30 cycles and consumes only about 3 pJ. Hardware handles cache operation and virtual-memory support, so its operation is transparent to applications.
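The quoted figures imply roughly a two-orders-of-magnitude energy advantage and an order-of-magnitude latency advantage for the on-chip "L3." A back-of-envelope check, using assumed midpoints for the ranges ("hundreds" taken as 300 cycles, 20 to 30 taken as 25):

```c
#include <assert.h>

/* Figures from the text: off-chip access ~500 pJ and hundreds of
 * cycles; remote-L2 ("L3") access ~3 pJ and 20-30 cycles. The
 * midpoint values below are assumptions for the arithmetic. */
enum {
    OFFCHIP_PJ  = 500,
    L3_PJ       = 3,
    OFFCHIP_CYC = 300,   /* assumed midpoint of "hundreds" */
    L3_CYC      = 25     /* midpoint of 20 to 30 */
};

static int energy_advantage(void)  { return OFFCHIP_PJ / L3_PJ; }
static int latency_advantage(void) { return OFFCHIP_CYC / L3_CYC; }
```

That works out to roughly a 166x energy saving and a 12x latency saving per access kept on chip.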
A bank of 64 cores can be handy, but applications often use multiple subsets instead. Tilera's Hardwall technology logically partitions the system into sets of tiles. Traffic can flow through any region to reach memory controllers and peripherals, but Hardwall prevents communication between cores in different regions. L3 caching is likewise confined within a region. Only rectangular regions are currently supported.
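Since regions are rectangles of tiles, the partitioning rule can be sketched as a simple containment check: two tiles may exchange messages only if both fall inside the same region. The struct and function names here are illustrative assumptions, not Tilera's interface:

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of Hardwall-style partitioning: a region is a rectangle of
 * tiles anchored at (x, y) with width w and height h. */
struct region { int x, y, w, h; };

static bool in_region(const struct region *r, int tx, int ty)
{
    return tx >= r->x && tx < r->x + r->w &&
           ty >= r->y && ty < r->y + r->h;
}

/* Two tiles can communicate only if both lie in the same region. */
static bool can_talk(const struct region *r,
                     int ax, int ay, int bx, int by)
{
    return in_region(r, ax, ay) && in_region(r, bx, by);
}
```

For example, with a 4-by-4 region at the origin, tile (3,3) is reachable from tile (0,0), but tile (4,4) in a neighboring region is not.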
A hypervisor runs on each core, providing virtual-machine support. Access to peripherals is still controlled in software, but this is relatively easy to handle at the hypervisor level. Moreover, the hypervisor controls a tile's switches. The Tile64 can support a range of operating systems, though its initial flavor is Linux. Support also includes the Eclipse-based Multicore Development Environment (MDE) with the GDB debugger. The current software mix includes open-source tools as well as some proprietary software, such as the C/C++ compiler.
Many Cores, Fewer Watts
Power management can be a significant advantage in multicore environments. In this case, it's possible to power down individual cores while the switches continue to operate. The design also makes extensive use of clock gating, minimizing power requirements for sections of the system that are inactive.
Software support includes tools specific to the Tile64, such as a high-level, cycle-accurate simulator. A whole-application model for collective debugging can single-step multiple cores. Also, a runtime library for socket-style streams provides access to the tile-to-tile hardware support mentioned earlier.
The architecture has had time to mature. A similar system was developed in 1994 at the Massachusetts Institute of Technology, but it required a rack of hardware. External links between Tile64 chips can be established over the Ethernet or PCI Express interfaces; for now, iMesh operates only within the chip.
The Tile64 should deliver 40 times the performance of dual-core DSPs and 10 times the performance of dual-core Xeon processors while using less power, though these are 32-bit cores, not 64-bit cores. Even so, applications that run on an SMP platform should work well on the Tile64 without modification.
New designs can take advantage of more intimate hardware support, but simply gaining access to such a large number of cores opens new possibilities for parallel programming. And while the Tile64 targets network and video applications, it should suit any other application amenable to parallel programming equally well.