64-Bit ARM Incarnation Employs Dynamic Code Optimization
Nvidia often takes an interesting approach with its technology, and the firm's upcoming dual-core Tegra K1-64 is no different. It employs the Denver CPU architecture, which implements ARM's 64-bit ARMv8 architecture (see "ARM Joins The 64-bit Club") but does so quite differently from most other ARMv8 platforms.
Nvidia holds an ARMv8 ISA license rather than a Cortex-A57 core license, an approach other vendors have taken as well. This means the resulting processor must implement the ARMv8 ISA exactly, but its microarchitecture can differ greatly from ARM's Cortex-A57 design.
Denver's code morphing approach, also known as dynamic code optimization, is similar to that found in Transmeta processors available almost a decade ago (see "Low-Power VLIW CPU Delivers Speedy x86 Upgrade"). The technique resembles Java's just-in-time (JIT) compilers: in both cases, code is converted into a cached form (native code for a JIT, microcode sequences here), so the conversion takes place once and the result can be executed repeatedly without additional overhead.
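To make the translate-once-and-cache idea concrete, here is a minimal Python sketch of a translation cache. The block granularity, function names, and the string standing in for microcode are illustrative assumptions, not Denver's actual mechanism.

```python
# Minimal sketch of the translate-once-and-cache idea behind code morphing.
# Everything here is a stand-in for illustration; Denver's real microcode
# format and translation pipeline are not public.

translation_cache = {}   # guest block address -> translated sequence

def translate(block_addr):
    """Stand-in for the (expensive) conversion of an ARMv8 block to microcode."""
    return f"microcode-for-{block_addr:#x}"

def run_block(block_addr):
    ucode = translation_cache.get(block_addr)
    if ucode is None:
        # First encounter: pay the translation cost once.
        ucode = translate(block_addr)
        translation_cache[block_addr] = ucode
    # Every later execution of this block reuses the cached microcode
    # with no further conversion overhead.
    return ucode

# Repeated executions of the same block hit the cache after the first pass.
for _ in range(3):
    run_block(0x40001000)
```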
The Denver cores run their own microcode directly. ARMv8 instructions are converted to this microcode before being executed (see the figure). An on-chip, 1K look-up table checks whether a microcode sequence has already been generated by the Optimizer program and placed into the optimization cache.
The 128-Mbyte optimization cache resides in main memory and is accessible only to the hardware and the Optimizer, which is itself hidden from the operating system. The Optimizer is written in microcode, so it does not have to go through the conversion process that ARMv8 instructions require. It can run on either core when that core is idle, it can be interrupted, and it manages the cache in addition to performing the optimized code conversion.
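The arrangement amounts to a small, fast lookup structure sitting in front of a much larger cache in main memory. The Python sketch below models that two-level lookup; the entry count, eviction policy, and interfaces are assumptions made for illustration only.

```python
# Hedged sketch of the two-level lookup described above: a small on-chip
# look-up table in front of the 128-Mbyte optimization cache in main memory.
# Sizes, eviction policy, and naming are assumptions for illustration.

from collections import OrderedDict

LOOKUP_ENTRIES = 1024          # assumed entry count for the "1K" on-chip table

lookup_table = OrderedDict()   # small and fast: block address -> cache slot
optimization_cache = {}        # large, in main memory: slot -> microcode

def find_optimized(block_addr):
    """Return cached optimized microcode for a block, or None on a miss."""
    slot = lookup_table.get(block_addr)
    if slot is None:
        return None                       # miss: fall back to the ARMv8 decoder
    lookup_table.move_to_end(block_addr)  # keep recently used entries resident
    return optimization_cache[slot]

def install_optimized(block_addr, microcode):
    """Called after the Optimizer has produced an optimized block."""
    slot = len(optimization_cache)
    optimization_cache[slot] = microcode
    lookup_table[block_addr] = slot
    if len(lookup_table) > LOOKUP_ENTRIES:
        lookup_table.popitem(last=False)  # evict the least recently used entry
```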
The code sequences in the cache are matched to the jump transitions (branch targets) of the original ARMv8 code, so when execution reaches one of those points the corresponding optimized block can be selected and run straight through.
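Conceptually, execution can then be pictured as a dispatch loop keyed by branch targets, as in the hypothetical sketch below; the block boundaries and the hand-off convention are assumptions, not a description of the actual hardware.

```python
# Sketch of selecting optimized blocks by branch target. Each cached block
# corresponds to the straight-line code between jump transitions in the ARMv8
# program, so it can run end to end before the next block is looked up.

def run_optimized_program(entry_addr, blocks, max_blocks=100):
    """blocks maps a branch-target address to a callable that runs one
    optimized block and returns the address of the next block (or None)."""
    addr = entry_addr
    for _ in range(max_blocks):
        block = blocks.get(addr)
        if block is None:
            break           # no optimized code here: fall back to the decoder
        addr = block()      # run the block sequentially, get next branch target
        if addr is None:
            break
    return addr
```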
The reason for these gyrations is that the code generated by the hardware decoder may be less efficient than what the Optimizer produces. The Optimizer can take into account all the available execution units as well as timing and power details.
The Denver core can execute more than seven ARMv8 instructions per cycle when running optimized code. Each cycle, the cache delivers a 32-byte "parcel" to the instruction scheduler; a parcel is actually a series of variable-length microcode instructions that are passed on to the execution units.
The profiler provides the Optimizer with information about which blocks to optimize and how often the code is being executed. Frequently used blocks get more optimization effort.
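One way to picture this profile-driven policy is a per-block execution counter that hands hot blocks to the Optimizer for a higher-effort pass. The threshold and interfaces in the sketch below are illustrative assumptions.

```python
# Sketch of profile-guided reoptimization: execution counts feed the
# Optimizer, and blocks that run often earn more optimization effort.
# The threshold and queue interface are arbitrary, for illustration only.

from collections import Counter

exec_counts = Counter()
HOT_THRESHOLD = 1000    # assumed: a block past this count is considered "hot"

def profile(block_addr):
    """Count one execution of a block; return True the moment it becomes hot."""
    exec_counts[block_addr] += 1
    return exec_counts[block_addr] == HOT_THRESHOLD   # trigger exactly once

def on_block_executed(block_addr, optimizer_queue):
    if profile(block_addr):
        # Hand the hot block to the Optimizer for a more aggressive pass.
        optimizer_queue.append(block_addr)
```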
The platform targets mobile devices that need to be very power-efficient. It also addresses applications that need the performance usually found in a 64-bit x86 platform.