Electronic Design



[Leapfrog: First Look]
64-Core Chip Spins SMP Design To Higher Performance Levels

William Wong  |   ED Online ID #16930  |   October 11, 2007


Typical general-purpose symmetric multiprocessing (SMP) multicore designs contain about eight cores. Specialized architectures, on the other hand, push the number of cores into the hundreds. Tilera ups the ante for SMP with its 64-tile Tile64 chip, in which each tile holds one core (see the figure). Its iMesh interconnect incorporates five separate packet networks, with five switches per tile (see the table). Chips with 35 and 120 tiles are on the horizon.

Go With the Flow
The Tile64's SMP nonuniform memory access (NUMA) architecture is similar to the HyperTransport-based system AMD uses for its Opteron series. As with AMD's approach, the location of peripherals and memory doesn't matter to applications; it matters only at the lowest levels of the operating system.

The big difference is that AMD uses the same HyperTransport interface for all traffic, while Tilera splits the traffic into different networks. This enables memory transfers to occur in parallel with other transfers, such as peripheral data. Data moves through non-blocking switches at one cycle per hop.
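
At one cycle per hop, uncontended switch latency can be estimated directly from tile coordinates. The sketch below assumes an 8x8 grid with (x, y) coordinates and dimension-order (X-then-Y) routing; neither detail is stated in the article, and the model ignores contention and endpoint overheads.

```python
# A minimal sketch, assuming an 8x8 mesh with (x, y) tile coordinates and
# dimension-order (X-then-Y) routing; neither detail comes from the article.
from typing import List, Tuple

Tile = Tuple[int, int]
CYCLES_PER_HOP = 1  # article: data moves at one cycle per hop


def xy_route(src: Tile, dst: Tile) -> List[Tile]:
    """Return the tiles visited from src to dst, moving along X first, then Y."""
    x, y = src
    path = [src]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


def hop_latency(src: Tile, dst: Tile) -> int:
    """Uncontended switch latency: one cycle per hop between adjacent tiles."""
    return (len(xy_route(src, dst)) - 1) * CYCLES_PER_HOP


print(xy_route((0, 0), (2, 1)))     # [(0, 0), (1, 0), (2, 0), (2, 1)]
print(hop_latency((0, 0), (7, 7)))  # 14 (corner to corner)
```

Under these assumptions the worst case on an 8x8 grid is 14 hops, so even corner-to-corner transfers stay in the tens of cycles before contention is considered.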

Splitting the traffic also lets each network be optimized for a particular type of transfer. For example, memory and stream transfers tend to be large, while interrupts and UDP-style (User Datagram Protocol) messages are usually small. High-level language support permits socket-style communication between nodes.
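
Socket-style communication boils down to an ordered, flow-controlled stream between two endpoints. The sketch below models two tiles as Python threads joined by a bounded FIFO; `TileStream` and its `send`/`recv`/`close` methods are hypothetical names for illustration, not Tilera's runtime library API.

```python
# A minimal sketch of socket-style, in-order streaming between two "tiles",
# modeled as Python threads joined by a bounded FIFO. TileStream and its
# methods are hypothetical, not Tilera's actual runtime API.
import queue
import threading
from typing import List, Optional


class TileStream:
    """One-directional, in-order stream of byte chunks."""

    def __init__(self, depth: int = 4) -> None:
        # A bounded queue gives back-pressure, like a finite hardware buffer.
        self._fifo: "queue.Queue[Optional[bytes]]" = queue.Queue(maxsize=depth)

    def send(self, chunk: bytes) -> None:
        self._fifo.put(chunk)  # blocks while the buffer is full

    def close(self) -> None:
        self._fifo.put(None)  # end-of-stream marker

    def recv(self) -> Optional[bytes]:
        return self._fifo.get()  # blocks until data arrives; None means closed


def producer(stream: TileStream, chunks: List[bytes]) -> None:
    for chunk in chunks:
        stream.send(chunk)
    stream.close()


def consumer(stream: TileStream, out: List[bytes]) -> None:
    while (chunk := stream.recv()) is not None:
        out.append(chunk)


stream = TileStream(depth=2)
received: List[bytes] = []
sender = threading.Thread(target=producer, args=(stream, [b"hdr", b"payload", b"crc"]))
receiver = threading.Thread(target=consumer, args=(stream, received))
sender.start()
receiver.start()
sender.join()
receiver.join()
print(received)  # [b'hdr', b'payload', b'crc']
```

The small buffer depth forces the sender to wait whenever the receiver falls behind, which is the behavior a hardware stream between tiles would also need.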

Any node can communicate with any other node. Each node has a matrix address, and some nodes, such as the memory controllers, have more than one address to provide higher throughput. The source node determines which address to use. Typically, the software that boots the operating system on each core distributes these addresses so that no single one becomes a bottleneck.
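
One simple way boot software could spread cores across a controller's multiple addresses is a round-robin assignment. The sketch assumes four memory controllers with two addresses each; the port labels such as `mc0.a` are hypothetical, and the article does not say which policy Tilera's software actually uses.

```python
# A minimal sketch, assuming 4 memory controllers that each expose 2 mesh
# addresses. The port labels are hypothetical. Ports are handed out
# round-robin so every port serves the same number of cores.
from collections import Counter
from itertools import product
from typing import Dict, List, Tuple

Core = Tuple[int, int]  # (x, y) position in an assumed 8x8 grid

CONTROLLER_PORTS: List[str] = [
    "mc0.a", "mc0.b", "mc1.a", "mc1.b",
    "mc2.a", "mc2.b", "mc3.a", "mc3.b",
]


def assign_ports(cores: List[Core], ports: List[str]) -> Dict[Core, str]:
    """Give each core one controller address to use, cycling through the ports."""
    return {core: ports[i % len(ports)] for i, core in enumerate(cores)}


cores = list(product(range(8), range(8)))  # all 64 tiles
mapping = assign_ports(cores, CONTROLLER_PORTS)
load = Counter(mapping.values())
print(mapping[(0, 0)], mapping[(0, 1)])        # mc0.a mc0.b
print(min(load.values()), max(load.values()))  # 8 8
```

Round-robin is only the simplest balanced choice; a real assignment would likely also weigh each core's hop distance to the controller.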

My Cache, Your Cache
Each tile incorporates an L1 cache and a larger L2 cache. A core's L3 cache is the sum of the other cores' L2 caches. The memory controllers keep track of which L2 cache holds a given piece of data. A node requesting data that another tile has cached is given its location, so subsequent accesses can go directly to that remote L2 cache.
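
The lookup flow can be sketched as a directory kept at the memory controller. This is a simplified model under stated assumptions: 64-byte lines, a dictionary for the directory, and no coherence or eviction handling. The class and method names are hypothetical.

```python
# A minimal sketch of the lookup flow, assuming 64-byte lines and a dict-based
# directory at the memory controller. Names are hypothetical; coherence and
# eviction, which real hardware must handle, are omitted.
from typing import Dict, Optional, Set

LINE_BYTES = 64  # assumed cache-line size


class MemoryController:
    def __init__(self) -> None:
        self.directory: Dict[int, int] = {}  # line number -> tile whose L2 holds it

    def lookup(self, addr: int, requester: int) -> Optional[int]:
        """Return the tile caching addr's line; if none, record requester as holder."""
        line = addr // LINE_BYTES
        holder = self.directory.get(line)
        if holder is None:
            self.directory[line] = requester  # line is fetched from DRAM into requester's L2
        return holder


class Tile:
    def __init__(self, tile_id: int, mc: MemoryController) -> None:
        self.tile_id = tile_id
        self.mc = mc
        self.l2: Set[int] = set()
        self.known_remote: Dict[int, int] = {}  # line -> remote tile acting as L3

    def read(self, addr: int) -> str:
        line = addr // LINE_BYTES
        if line in self.l2:
            return "L2 hit"
        if line in self.known_remote:
            return f"L3 hit in tile {self.known_remote[line]}"
        holder = self.mc.lookup(addr, self.tile_id)
        if holder is None:
            self.l2.add(line)
            return "off-chip (DRAM)"
        self.known_remote[line] = holder  # later reads go straight to the remote L2
        return f"L3 via controller -> tile {holder}"


mc = MemoryController()
a, b = Tile(0, mc), Tile(1, mc)
print(a.read(0x1000))  # off-chip (DRAM)
print(b.read(0x1000))  # L3 via controller -> tile 0
print(b.read(0x1008))  # L3 hit in tile 0 (same 64-byte line)
print(a.read(0x1000))  # L2 hit
```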

Its response characteristics differ from those of a conventional SMP L3 cache, but it is far more efficient than accessing main memory, in terms of both speed and power. Off-chip accesses require hundreds of cycles and 500 pJ, whereas an L3 access takes 20 to 30 cycles and consumes only about 3 pJ. Hardware handles cache operation and virtual memory support, so their operation is transparent to applications.
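
A quick calculation shows what these figures mean for an L2 miss. An off-chip access costs roughly 167 times the energy of an L3 access (500 / 3). The sketch uses the article's 3 pJ and 500 pJ values; the 25-cycle L3 latency (midpoint of 20 to 30), the 200-cycle off-chip latency (standing in for "hundreds"), and the hit rates are assumptions chosen for illustration.

```python
# A back-of-the-envelope sketch using the article's energy figures (3 pJ per
# L3 access, 500 pJ per off-chip access). The cycle counts and hit rates are
# illustrative assumptions, not Tilera data.
from typing import Tuple

L3_PJ, DRAM_PJ = 3.0, 500.0
L3_CYCLES, DRAM_CYCLES = 25, 200  # assumed midpoint; assumed "hundreds"


def miss_cost(l3_hit_rate: float) -> Tuple[float, float]:
    """Average energy (pJ) and latency (cycles) of an L2 miss."""
    energy = l3_hit_rate * L3_PJ + (1 - l3_hit_rate) * DRAM_PJ
    cycles = l3_hit_rate * L3_CYCLES + (1 - l3_hit_rate) * DRAM_CYCLES
    return energy, cycles


for rate in (0.0, 0.5, 0.9):
    energy, cycles = miss_cost(rate)
    print(f"L3 hit rate {rate:.0%}: {energy:.1f} pJ, {cycles:.1f} cycles")
# L3 hit rate 0%: 500.0 pJ, 200.0 cycles
# L3 hit rate 50%: 251.5 pJ, 112.5 cycles
# L3 hit rate 90%: 52.7 pJ, 42.5 cycles
```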

Virtual Partitions
A bank of 64 cores can be handy, but the chip is often divided into several subsets instead. Tilera's Hardwall technology logically partitions the system into sets of tiles. Traffic can still pass through any region on its way to the memory controllers and peripherals, but Hardwall blocks communication between cores in different regions. L3 caching is confined to a region as well. Currently, only rectangular regions are supported.
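
The partitioning rule for core-to-core traffic reduces to a simple membership check over rectangles. The sketch assumes regions given as inclusive corner coordinates on an 8x8 grid; `Region` and `may_communicate()` are hypothetical names, since on the Tile64 the enforcement happens in hardware under system-software control. It models only core-to-core messages, not traffic passing through a region to memory or I/O.

```python
# A minimal sketch of a Hardwall-style policy check, assuming rectangular
# regions given as inclusive (x0, y0, x1, y1) corners. Region and
# may_communicate() are hypothetical; only core-to-core traffic is modeled.
from dataclasses import dataclass
from typing import List, Optional, Tuple

Tile = Tuple[int, int]


@dataclass(frozen=True)
class Region:
    name: str
    x0: int
    y0: int
    x1: int
    y1: int

    def contains(self, t: Tile) -> bool:
        return self.x0 <= t[0] <= self.x1 and self.y0 <= t[1] <= self.y1


def region_of(t: Tile, regions: List[Region]) -> Optional[Region]:
    return next((r for r in regions if r.contains(t)), None)


def may_communicate(a: Tile, b: Tile, regions: List[Region]) -> bool:
    """Core-to-core traffic is allowed only between tiles in the same region."""
    ra = region_of(a, regions)
    return ra is not None and ra == region_of(b, regions)


# Split an assumed 8x8 chip into left and right 4x8 halves.
regions = [Region("left", 0, 0, 3, 7), Region("right", 4, 0, 7, 7)]
print(may_communicate((0, 0), (3, 7), regions))  # True
print(may_communicate((3, 0), (4, 0), regions))  # False
```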

A hypervisor runs on each core, providing virtual-machine support. Access to peripherals is still controlled in software, but this is relatively easy to handle in the hypervisor, which also controls each tile's switches. The Tile64 can support a range of operating systems, though the first one offered is Linux. Support also includes the Eclipse-based Multicore Development Environment (MDE) with the GDB debugger. The current software mix includes open-source tools as well as some proprietary software, such as the C/C++ compiler.

Many Cores, Fewer Watts
Power management can be a significant advantage in multicore environments. On the Tile64, individual cores can be powered down while their switches continue to operate. The design also makes extensive use of clock gating to minimize power consumption in inactive sections of the chip.
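
Keeping the switches alive matters because packets may need to cross tiles whose cores are asleep. The sketch assumes an 8x8 grid and X-then-Y routing (the article does not state the Tile64's routing rule); `TilePower` and `deliverable()` are hypothetical names.

```python
# A minimal sketch, assuming an 8x8 mesh with X-then-Y routing (not stated in
# the article). A packet can cross tiles whose cores are powered down, as long
# as every switch on its path is still on. Names are hypothetical.
from dataclasses import dataclass
from typing import Dict, List, Tuple

Tile = Tuple[int, int]


@dataclass
class TilePower:
    core_on: bool = True
    switch_on: bool = True


def xy_path(src: Tile, dst: Tile) -> List[Tile]:
    (x, y), path = src, [src]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


def deliverable(src: Tile, dst: Tile, power: Dict[Tile, TilePower]) -> bool:
    """Only the endpoint cores must be awake; every switch on the path must be on."""
    return (power[src].core_on and power[dst].core_on
            and all(power[t].switch_on for t in xy_path(src, dst)))


power = {(x, y): TilePower() for x in range(8) for y in range(8)}
for t in [(1, 0), (2, 0)]:
    power[t].core_on = False  # cores gated off, switches left running
print(deliverable((0, 0), (3, 0), power))  # True
power[(2, 0)].switch_on = False
print(deliverable((0, 0), (3, 0), power))  # False
```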

Soft Tiles
Software support includes tools specific to the Tile64, such as a high-level and cycle-accurate simulator. A whole-application model for collective debugging can single-step multiple cores. In addition, a runtime library for socket-style streams provides access to the tile-to-tile hardware support mentioned earlier.

The architecture has had time to mature. A similar system was developed at the Massachusetts Institute of Technology in 1994, but it required a rack of hardware. For now, iMesh operates only within a chip; multiple Tile64 chips can be linked through their Ethernet or PCI Express interfaces.

The Tile64 should provide 40 times the performance of dual-core DSPs and 10 times the performance of dual-core Xeon processors while using less power. Keep in mind, though, that these are 32-bit cores, not 64-bit cores. In addition, applications that run on an SMP platform should work well on the Tile64 without modification.

New designs can take advantage of the more intimate hardware support, such as the tile-to-tile networks, and access to such a large number of cores opens new possibilities for parallel programming. While the Tile64 targets network and video applications, it should be equally suited to other applications amenable to parallel programming.
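
For network applications, one common way to put many cores to work is to hash each packet's flow identifier to a core, so every packet of a flow is handled by the same core and stays in order. This is a general technique rather than anything Tilera-specific; the 5-tuple flow key and the `core_for()`/`dispatch()` helpers are hypothetical, and `NUM_CORES = 64` simply mirrors the Tile64.

```python
# A minimal sketch of flow-hash dispatch: all packets of a flow (identified
# here by an assumed 5-tuple) go to one core, preserving per-flow order.
# A general technique, not a Tilera API; NUM_CORES mirrors the Tile64.
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple

NUM_CORES = 64
Flow = Tuple[str, str, int, int, str]  # src IP, dst IP, src port, dst port, protocol


def core_for(flow: Flow) -> int:
    key = "|".join(map(str, flow)).encode()
    return zlib.crc32(key) % NUM_CORES  # deterministic across runs, unlike hash()


def dispatch(packets: List[Tuple[Flow, bytes]]) -> Dict[int, List[bytes]]:
    queues: Dict[int, List[bytes]] = defaultdict(list)
    for flow, payload in packets:
        queues[core_for(flow)].append(payload)  # arrival order kept within a flow
    return queues


a: Flow = ("10.0.0.1", "10.0.0.2", 5000, 80, "tcp")
b: Flow = ("10.0.0.3", "10.0.0.2", 5001, 80, "tcp")
queues = dispatch([(a, b"a1"), (b, b"b1"), (a, b"a2")])
print(core_for(a), core_for(b))
print(queues[core_for(a)])
```

Because each core sees whole flows, per-flow state needs no locking; the trade-off is that one very busy flow cannot be spread across several cores.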

Tilera
www.tilera.com

