Tiny IVP Plows Through Video Image Processing

Tensilica's new Image Video Processor (IVP) architecture falls into a different processor class than the conventional CPU/GPU mix. The IVP can be considered a Video Processing Unit (VPU). It is another building block for system-on-chip (SoC) solutions.

Many will assume that a GPU handles all graphics processing chores well, but existing GPU architectures are heavily slanted toward graphics output because that was their original job. GPUs can handle other image processing work, often faster and more efficiently than CPUs, but the IVP is more powerful and more power efficient than a GPU for these tasks. The IVP addresses a range of image processing applications such as feature matching (Fig. 1), image noise reduction, video image stabilization, face recognition and advanced driver assistance systems (ADAS).

Figure 1. Feature matching is just one application for the IVP.

FPGAs have been used for these kinds of video chores (see FPGA Video And All That Jazz). This works well for targeted applications, but FPGA programmability is less dynamic than that of CPUs, GPUs and now IVPs. The Stretch S7000 series is also based on Tensilica's Xtensa processor architecture (see Stretching Multicore For Video Processing). The S7000 shares some basic aspects with the IVP because both are based on the extensible Xtensa architecture, but the IVP's architecture (Fig. 2) is much different. For example, the 32-element VLIW IVP core incorporates two items not found on most systems: a cross-element select/reduce unit and a memory data rotator unit.

Figure 2. The IVP employs a highly parallel VLIW/SIMD architecture with a cross-element select/reduce unit and a memory data rotator that span the computational units. They operate in a coordinated fashion to minimize the data transfers that other architectures would need.

In a basic sense, these two units are akin to the select/split functions and barrel shifter/merge units found in more advanced ALUs. The difference is that the IVP units span the outputs of the processing elements rather than just bits or bytes within a word. Both units provide a significant performance improvement because they reduce the number of operations as well as the number of memory or register accesses needed to get the job done. These operations would normally require many RISC instructions on a conventional CPU versus a single cycle using the cross-element select/reduce and memory data rotator units.
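To make the cross-element idea concrete, here is a minimal software sketch of what rotating, selecting and reducing across lanes means. It is only a model: the function names `rotate`, `select` and `reduce_sum` are hypothetical and are not IVP instructions or Tensilica intrinsics, and the model says nothing about how the hardware implements these steps.

```python
# Software model only: these helpers are hypothetical and are NOT IVP
# intrinsics. A "vector" is a Python list holding one value per lane.
LANES = 32  # element count cited in the article


def rotate(vec, shift):
    """Rotate lanes left by `shift` positions, wrapping around the ends."""
    shift %= len(vec)
    return vec[shift:] + vec[:shift]


def select(vec_a, vec_b, mask):
    """Per lane, take the value from vec_a where mask is true, else vec_b."""
    return [a if m else b for a, b, m in zip(vec_a, vec_b, mask)]


def reduce_sum(vec):
    """Sum all lanes with a rotate-and-add tree.

    Assumes the lane count is a power of two. Each pass rotates the
    partial sums by half the previous distance and adds them lane by lane,
    so 32 lanes need 5 passes instead of 31 sequential additions.
    """
    acc = list(vec)
    step = len(acc) // 2
    while step >= 1:
        rotated = rotate(acc, step)
        acc = [x + y for x, y in zip(acc, rotated)]
        step //= 2
    return acc[0]  # every lane now holds the total; return lane 0


if __name__ == "__main__":
    lanes = list(range(LANES))
    print(reduce_sum(lanes))
    print(rotate([1, 2, 3, 4], 1))
    print(select([1, 1, 1, 1], [0, 0, 0, 0], [True, False, True, False]))
```

In this model a scalar loop would touch each of the 32 values in turn, while the tree finishes in five rotate-and-add passes. That is the kind of saving in operations and register traffic the paragraph above describes.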

The instruction flow starts with a SIMD architecture that includes the 32 processing elements. The VLIW instruction executes four operations per element per cycle. Each processing element has three 16-bit ALUs, a 16- by 16-bit MAC and a barrel shifter. There is also a 9-stage DSP pipeline with flexible memory timing.
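The figures above imply a peak operation count per cycle, which the short calculation below works out. It is a back-of-the-envelope sketch: the element and operation counts come from the text, while the function names and the `clock_hz` parameter are hypothetical. No clock frequency is given here, so any value passed in is the reader's own assumption.

```python
ELEMENTS = 32         # processing elements, from the article
OPS_PER_ELEMENT = 4   # VLIW operations per element per cycle, from the article


def peak_ops_per_cycle(elements=ELEMENTS, ops_per_element=OPS_PER_ELEMENT):
    """Upper bound on operations issued in one cycle."""
    return elements * ops_per_element


def peak_ops_per_second(clock_hz, elements=ELEMENTS,
                        ops_per_element=OPS_PER_ELEMENT):
    """Upper bound on operations per second for an assumed clock rate.

    clock_hz is a caller-supplied assumption, not a published figure.
    """
    return peak_ops_per_cycle(elements, ops_per_element) * clock_hz


if __name__ == "__main__":
    print(peak_ops_per_cycle())  # 32 * 4 = 128
```

This is a ceiling, not a measured result. Real throughput depends on how well the compiler fills the VLIW slots and on memory behavior.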

The ALU size hints at one reason that the IVP has a small footprint. High-performance CPUs and GPUs tend to support single- and double-precision floating point. Those computational units are large and complex. They are very useful for the applications running on those platforms, but image processing these days is based on pixels encoded as integers, typically 8 or 16 bits per color. Floating point is overkill for most of the algorithms. An IVP will not be able to address floating-point algorithms as efficiently as CPUs and GPUs, but it will run rings around those platforms when it comes to integer manipulation for image processing algorithms.
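As an illustration of why integer hardware is enough for this kind of work, here is a minimal sketch of a simple noise-reduction step done entirely with adds and shifts. The 1-2-1 smoothing kernel and the function name `smooth_row` are chosen for illustration only. They are assumptions, not an algorithm the IVP or its tools are known to ship with.

```python
def smooth_row(pixels):
    """Apply a 1-2-1 smoothing kernel to a row of 8-bit pixels.

    Uses integer adds and shifts only. Edge pixels are replicated so the
    output row has the same length as the input row.
    """
    n = len(pixels)
    out = []
    for i in range(n):
        left = pixels[max(i - 1, 0)]           # replicate the left edge
        right = pixels[min(i + 1, n - 1)]      # replicate the right edge
        total = left + 2 * pixels[i] + right   # at most 4 * 255 = 1020
        out.append((total + 2) >> 2)           # divide by 4 with rounding
    return out


if __name__ == "__main__":
    print(smooth_row([0, 0, 255, 0, 0]))  # [0, 64, 128, 64, 0]
```

The largest intermediate value is 1020, which fits comfortably in 16 bits, and the rounded result never exceeds 255. No floating-point unit is involved at any step.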

One might think that the IVP would have a rather large transistor footprint, but that is not the case. The IVP core is much smaller (Fig. 3) than CPU and GPU cores that provide less performance. A smaller core also translates into less power consumption. A GPU typically provides 3x to 6x better performance/watt than a CPU for graphics chores, while the IVP provides a 10x to 20x improvement.

Figure 3. The IVP core is much smaller than CPUs or GPUs while providing more video processing power.

The IVP IP (intellectual property) is designed to fit into custom SoCs. Developers can gain early access to an IVP using Tensilica's FPGA-based development board (Fig. 4). The nice thing about this approach is that the development tools will mask most differences between IVP hardware implementations. The development board allows real-time input and output processing.

Figure 4. Developers can get started using an FPGA-based IVP board from Tensilica.

The Xtensa architecture tools used to develop the IVP and other platforms like the S7000 also generate a matching complement of software tools. These new tools (Fig. 5) are used to develop applications for the target, in this case the IVP. They include an Eclipse-based tool suite with compilers and debuggers. Tensilica has enhanced these to expose IVP features.

Figure 5. Tensilica's Xtensa development tools generate the software tools for IVP development.

The IVP architecture is IP designed to be integrated into an SoC. That means programmers will likely see an IVP as a custom item from a particular vendor rather than as a standard component. On the other hand, the IVP software tools will be identical and the IVP compiler should address any differences. The architecture and tools are scalable, so the number of elements and even the number of cores can change with little or no change to the program. This makes it easier to include third-party algorithms. That is key because there are many image and video processing algorithms and they are not usually part of a package available with the standard development tools.

Video processing is exploding and so are the processing requirements. IVPs look to provide a better solution than CPUs and GPUs and a more flexible, programmable interface compared to FPGAs and dedicated logic. IVPs will find a home in everything from digital cameras and camcorders to ADAS-capable cars and robotics.