Performance and cost are intimately linked in DSP applications for cell phones, music players, and other consumer systems. The DSP must offer enough performance for the current requirement at the right price point as well as have enough headroom and scalability to let designers add new features or enhance system performance without major hardware changes.
The CEVA-X DSP core architecture meets these needs while delivering performance levels far beyond those of competing DSP engines. CEVA Inc., formerly Parthus-Ceva, combined the best elements of very-long-instruction-word (VLIW) approaches with single-instruction/multiple-data (SIMD) schemes. The VLIW aspects enable the core to issue many instructions concurrently, while the SIMD facets let a single instruction operate on multiple data elements, performing more work per instruction.
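To make the SIMD idea concrete, here is a minimal C sketch of one operation acting on two 16-bit values packed into a single 32-bit word. The function names (pack16x2, lane_hi, lane_lo, simd_add16x2) are hypothetical and written for illustration only. They are not CEVA-X instructions or intrinsics, and the wraparound behavior on overflow is an assumption.

```c
#include <stdint.h>

/* Hypothetical helper: pack two signed 16-bit lanes into one 32-bit word. */
uint32_t pack16x2(int16_t hi, int16_t lo) {
    return ((uint32_t)(uint16_t)hi << 16) | (uint16_t)lo;
}

/* Extract a lane as a signed value (relies on two's complement conversion). */
int16_t lane_hi(uint32_t w) { return (int16_t)(uint16_t)(w >> 16); }
int16_t lane_lo(uint32_t w) { return (int16_t)(uint16_t)(w & 0xFFFFu); }

/* One "instruction" that adds both 16-bit lanes independently.
 * Each lane wraps on overflow (an assumption), and no carry crosses
 * from the low lane into the high lane. */
uint32_t simd_add16x2(uint32_t a, uint32_t b) {
    uint16_t lo = (uint16_t)((a & 0xFFFFu) + (b & 0xFFFFu));
    uint16_t hi = (uint16_t)((a >> 16) + (b >> 16));
    return ((uint32_t)hi << 16) | lo;
}
```

A single call such as simd_add16x2(pack16x2(100, -5), pack16x2(-30, 7)) produces both sums at once. A VLIW machine would then issue several such operations in the same cycle.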
Scalable and compilable, the resulting CEVA-X architecture offers both a 16-bit integer path and a 32-bit path for performance growth. The core architecture, available as fully synthesizable RTL code, comes as part of a complete solution. Alongside the RTL code, the company supplies a compiler and a suite of development tools that includes the Xpert-Open Framework and the Xpert-Applications, a library of basic algorithms.
The first release is the CEVA-X1600 series. It consists of several 16-bit preconfigured cores that contain one, two, or four dual 16-bit multiplier-accumulator (MAC) units along with a scalar load-store processor and cache controllers for the program and data memories (see the figure).
Designed to operate at clock speeds of up to 450 MHz, the core can execute up to eight instructions in parallel. An implementation with just one dual-MAC unit, the CEVA-X1620, can deliver a throughput of 12 times that of the company's Teak DSP core. (The Teak core is a popular 16-bit dual-MAC core used in many telecom and audio applications.) With its four dual-MAC units, the CEVA-X1680 delivers a peak throughput of 11 billion instructions/s.
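As a rough sanity check on the multiplier resources alone, the short C sketch below estimates a peak multiply-accumulate rate from the number of dual-MAC units and the clock speed. It assumes each multiplier completes one MAC per clock cycle, which the description above does not state explicitly. The function name peak_macs_per_second is hypothetical, and the result is an idealized upper bound, not a vendor figure.

```c
#include <stdint.h>

/* Idealized peak multiply-accumulates per second.
 * ASSUMPTION: every multiplier completes one MAC on every clock cycle.
 * A dual-MAC unit contributes two multipliers. */
uint64_t peak_macs_per_second(unsigned dual_mac_units, uint64_t clock_hz) {
    return (uint64_t)dual_mac_units * 2u * clock_hz;
}
```

Under that assumption, four dual-MAC units at 450 MHz work out to 3.6 billion MACs/s, and a single unit to 900 million MACs/s. This counts only multiply-accumulates, not the other operations that can issue in the same cycle.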
Each dual-MAC computational unit starts with two 16- by 16-bit two's complement multipliers that feed results into 40-bit accumulators. Also in the unit are four 40-bit arithmetic units, a 40-bit logical unit, a 40-bit bit-manipulation unit (including a full barrel shifter and an exponent unit), two 40-bit pack and unpack units, and sixteen 40-bit accumulators. These abundant resources allow multiple operations to be performed in parallel. As a result, a butterfly computation in a fast Fourier transform requires just two cycles.
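The multiplier-to-accumulator path can be modeled in plain C. The sketch below is illustrative only: the names wrap40, dual_mac, and dot16 are hypothetical, the wraparound on 40-bit overflow is an assumption (real hardware may saturate instead), and nothing here reflects the actual CEVA-X instruction set. It shows two 16- by 16-bit products per step, each added to its own 40-bit accumulator held in a 64-bit variable.

```c
#include <stddef.h>
#include <stdint.h>

/* Wrap a 64-bit value to 40-bit two's complement.
 * ASSUMPTION: overflow wraps; the real hardware behavior is not specified here. */
int64_t wrap40(int64_t v) {
    uint64_t u = (uint64_t)v & 0xFFFFFFFFFFULL;   /* keep the low 40 bits */
    if (u & 0x8000000000ULL)                      /* bit 39 is the sign bit */
        return (int64_t)u - ((int64_t)1 << 40);
    return (int64_t)u;
}

/* One dual-MAC step: two 16x16-bit products, each added to its own accumulator. */
void dual_mac(int64_t acc[2], int16_t x0, int16_t y0, int16_t x1, int16_t y1) {
    acc[0] = wrap40(acc[0] + (int32_t)x0 * y0);
    acc[1] = wrap40(acc[1] + (int32_t)x1 * y1);
}

/* Dot product of n samples, consuming two sample pairs per step.
 * ASSUMPTION: n is even; a trailing odd sample would be ignored. */
int64_t dot16(const int16_t *x, const int16_t *y, size_t n) {
    int64_t acc[2] = {0, 0};
    for (size_t i = 0; i + 1 < n; i += 2)
        dual_mac(acc, x[i], y[i], x[i + 1], y[i + 1]);
    return wrap40(acc[0] + acc[1]);
}
```

Because a 16- by 16-bit product fits in 32 bits, a 40-bit accumulator leaves eight guard bits, so long sums of products can build up before overflow becomes a concern.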
The highly parallel architecture is very power-efficient, consuming just 60 µW per megaMAC. The core's various conservation schemes dynamically shut down unused resources, slow the clocks when no critical computations are under way, and more. Also, the architecture is "compiler driven." Designers can write applications in high-level languages such as C and C++, slashing development cost and time-to-market. The VLIW approach additionally enables designers to craft unique instructions and tailor the DSP core to their system needs.
Contact the company for licensing fees and arrangements.