Intel’s quad-core Core i7-965 Extreme Edition
Intel’s Nehalem architecture actually debuted in the Core i7-965 Extreme Edition (see photo), released at the end of 2008, but the bulk shipments were in 2009, hence our choice of it as the best architecture of 2009. It is the basis for the latest incarnation of Xeon, i5 and i7 processors in 2010, bringing many performance enhancements such as Turbo Boost and QuickPath. This is an overview of the architecture and new features in Nehalem.
QuickPath Interconnect (QPI) is one of the more significant improvements for servers and high-end workstations. It is comparable to AMD’s use of HyperTransport in its Athlon and Opteron processors. QuickPath is a high-speed link between processors that now incorporate memory controllers. It allows multichip systems to share memory using NUMA (non-uniform memory access). It was first delivered as part of the 2008 announcement with the X58 chipset found in motherboards like Gigabyte’s GA-X58A-UD7.
QPI is a point-to-point system similar to PCI Express. A QPI link has 20 full-duplex lanes, which deliver a peak bandwidth of 25.6 Gbytes/s. QPI is only found on high-end Core i7 and Xeon processors. It is not needed on lower-end Core i3, i5 and i7-8xx processors, which do not require this type of chip-to-chip interconnect.
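The commonly quoted peak figure of 25.6 Gbytes/s for a QPI link can be reproduced with simple arithmetic. The sketch below assumes the 6.4 GT/s transfer rate of the fastest parts and that 16 of the 20 lanes in each direction carry data payload, with the rest used for protocol and error checking. The function and parameter names are illustrative only.

```python
def qpi_bandwidth_gbytes(gigatransfers_per_sec=6.4,
                         payload_bits_per_transfer=16,
                         directions=2):
    """Return the peak aggregate bandwidth of one QPI link in Gbytes/s.

    Assumptions (not taken from the article): 6.4 GT/s signalling and
    16 payload bits per transfer in each direction.
    """
    bytes_per_transfer = payload_bits_per_transfer / 8
    per_direction = gigatransfers_per_sec * bytes_per_transfer
    return per_direction * directions


if __name__ == "__main__":
    print(f"One direction:   {qpi_bandwidth_gbytes(directions=1):.1f} Gbytes/s")
    print(f"Both directions: {qpi_bandwidth_gbytes():.1f} Gbytes/s")
```

Under the same assumptions, a part running its link at a lower rate such as 4.8 GT/s would come out at 19.2 Gbytes/s.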
Turbo Boost is one feature that the entire Nehalem family has. It is essentially overclocking that allows cores to run faster than their normal speed, providing more performance. This can be done for all cores in a multicore chip, providing an overall performance boost at the cost of higher power utilization. Essentially, Turbo Boost allows the system to specify limits on the number of active cores, current, power consumption and temperature.
These days power can equate to runtime, or the lack thereof in a battery-driven environment, so pushing the limits is not always desirable. Turbo Boost provides these features within the power and thermal limits of the chip. It can also apply to a subset of the cores: the other cores are powered down and the remaining cores are clocked at an even higher speed than is possible when all the cores use Turbo Boost. Powered-down cores use almost no power, allowing idle power to slip under 10W.
In theory, the overall performance efficiency is improved using Turbo Boost. This feature tends to be handled by the operating system, making it transparent to the user and applications. A number of new core states have been created to handle this feature.
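The idea that fewer active cores can be clocked higher, provided the chip stays within its limits, can be illustrated with a toy model. This is only a sketch: the 133 MHz step size, the table of extra steps per active core count and the function name are assumptions made for illustration, not figures from Intel documentation.

```python
BIN_MHZ = 133  # assumed size of one turbo step, in MHz

# Hypothetical table: number of active cores -> extra turbo steps allowed.
EXAMPLE_BINS = {1: 2, 2: 1, 3: 1, 4: 1}


def turbo_mhz(base_mhz, active_cores, extra_bins, within_limits=True):
    """Return the clock speed a core may run at in this toy model.

    within_limits stands in for the power, current and temperature
    checks; if any limit is exceeded, the core stays at its base clock.
    """
    if not within_limits:
        return base_mhz
    return base_mhz + BIN_MHZ * extra_bins.get(active_cores, 0)


if __name__ == "__main__":
    print(turbo_mhz(3200, 4, EXAMPLE_BINS))         # all four cores active
    print(turbo_mhz(3200, 1, EXAMPLE_BINS))         # three cores powered down
    print(turbo_mhz(3200, 1, EXAMPLE_BINS, False))  # limit hit, no boost
```

With this hypothetical table, a 3200 MHz part would run all four cores at 3333 MHz or a single core at 3466 MHz, and would fall back to 3200 MHz whenever a limit is reached.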
Large L3 caches are especially important on server platforms and very useful on other platforms as well. Servers often have 8 Mbytes of L3 cache. Intel’s SmartCache technology reduces memory traffic between cores. This is very useful in multichip environments where QPI is used. With SmartCache, a miss in the shared L3 cache guarantees that the data is not held locally, which eliminates unnecessary snooping to locate the data and thereby improves performance. Nehalem also adds a second-level translation lookaside buffer (TLB) to boost performance.
Many of the other new features in Nehalem are not highlighted as much, but they remain very useful. Hyper-Threading was available before Nehalem, but it has been enhanced with an improved ability to run out of order through a larger out-of-order window. Intel has also reduced the number of stalls in the execution queue. Synchronization primitives, critical to multithreaded scheduling, have been improved as well. Likewise, branch prediction performance has been improved by lowering the penalty for a wrong prediction.
Intel’s Virtualization Architecture (VT-x) has been expanded to allow a guest OS more direct access to the hardware and to reduce virtual machine monitor (VMM) overhead. Intel VT-x includes Intel Virtualization Technology FlexMigration and Intel Virtualization Technology FlexPriority, in addition to Virtualization Technology for Directed I/O (VT-d) and Virtualization Technology for Connectivity (VT-c).
VT-d allows guests direct access to I/O and works with all the Nehalem chips. VT-c works in conjunction with PCI Express adapters that know about virtualization. This allows a single physical adapter to appear as multiple logical adapters, with one allocated to each guest. Each guest in turn has direct access to its logical device without needing to interact with the VMM for transfers.
Nehalem has proven to be an excellent base for 2009 and it is what the new 2010 Core i3, i5, i7 and Xeons are being built on.