How to Optimize Component-Based Digital Video Systems

By Dr. Cheng Peng, Texas Instruments

The challenge in introducing digital video to emerging applications, such as smart video surveillance IP cameras (netcams), is controlling cost while maximizing performance without compromising reliability or quality. Traditionally, optimizing a design has entailed extensive hand-coding of algorithms in assembly. Today’s video development platforms, however, provide codec and other functionality off-the-shelf. Developers access codec functionality through APIs that abstract the specific implementation details of codecs. Such codecs are often already optimized for a particular hardware platform and are easily integrated into applications using the API.

From an application perspective, it doesn’t matter what video codec is in use. In fact, if the particular codec and its implementation details are transparent to the application, developers gain the ability to easily interchange codecs during camera operation (see Figure 1). Interchangeability enables many important features, such as running a camera at a low bit rate until an event of interest occurs and then switching it to a higher bit rate, higher quality codec. This maximizes the use of limited network bandwidth, reserving it for those cameras where the most quality and detail are required.

CODEC TRANSPARENCY
To keep different codecs transparent at the application level, an API must necessarily be somewhat generic, keeping the API focused on encoding, decoding, and control operations. If the API includes too many codec-specific operations, the application code will become tied to that particular codec.
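As a minimal sketch of what such a generic API might look like, the example below defines an encoder interface limited to encode and control operations, plus application code that swaps codecs at run time. Every name here (`VideoEncoder`, `LowRateEncoder`, `HighQualityEncoder`, `Camera`) is hypothetical and chosen for illustration; the stand-in encoders only prefix a tag rather than compress anything.

```python
from abc import ABC, abstractmethod


class VideoEncoder(ABC):
    """Hypothetical generic encoder interface: encode and control only."""

    @abstractmethod
    def control(self, **params) -> None:
        """Set generic parameters such as bit rate or frame rate."""

    @abstractmethod
    def encode(self, raw_frame: bytes) -> bytes:
        """Return the encoded form of one raw frame."""


class _TaggingEncoder(VideoEncoder):
    """Stand-in encoder: prefixes a tag instead of really compressing."""

    tag = b""

    def __init__(self) -> None:
        self.params = {}

    def control(self, **params) -> None:
        self.params.update(params)

    def encode(self, raw_frame: bytes) -> bytes:
        return self.tag + raw_frame


class LowRateEncoder(_TaggingEncoder):
    tag = b"LO:"


class HighQualityEncoder(_TaggingEncoder):
    tag = b"HQ:"


class Camera:
    """Application code that only ever sees the generic interface."""

    def __init__(self, idle: VideoEncoder, alert: VideoEncoder) -> None:
        self._idle, self._alert = idle, alert
        self.encoder = idle

    def on_event(self, active: bool) -> None:
        # Swap codecs at run time; no codec-specific calls are needed.
        self.encoder = self._alert if active else self._idle

    def capture(self, raw_frame: bytes) -> bytes:
        return self.encoder.encode(raw_frame)
```

Because `Camera` calls nothing beyond `encode` and `control`, either encoder could be replaced by a different implementation without touching the application code.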

A generic API can significantly speed time-to-market and enables digital video to be easily introduced into a wide range of new applications. Developers often assume, however, that the only way to optimize a system is to break past the API, calling directly into codec code to fine-tune codec performance. Doing so destroys the interchangeability of codecs.

Breaking APIs also erodes the reusability of application code since it becomes tied to the use of particular codecs. This is especially important to developers architecting a family of video products ranging in functionality. Ideally, when a function such as denoising is implemented, OEMs would like to use the same code across the product line rather than have to redesign the function for multiple codecs, display sizes, or bit rates.

System optimizations can result in substantial cost savings when developers can reduce the overall processing load (enabling a less expensive processor to be used in final production) or decrease the amount of memory a system requires. The key is to step beyond hand-coding optimizations or breaking APIs. By approaching optimization from a system perspective that balances performance, cost, quality, and reliability, developers achieve a cost-effective design that provides interchangeability without requiring tedious code optimizations.

COMPONENT-BASED ARCHITECTURES
Figure 2 shows the basic architecture for an intelligent IP netcam. One of the ways that an effective API can be defined is to abstract the system into various components which can be implemented in hardware or software, depending upon the actual hardware resources available.

Any of these components could be implemented completely in hardware or completely in software. For example, for years ASICs have dominated the camera market by providing the lowest power consumption with the highest performance. Unfortunately, ASICs have a long development cycle and result in a fixed implementation. Additionally, the video coding standards themselves are not stable, and an MPEG-4 ASIC that can only handle MPEG-4 but not H.264 is already obsolete.

Video development must also take into account perhaps the most important, and most compute-intensive, innovation in surveillance: video analytics. Video analytics lends intelligence to cameras through capabilities such as recognizing objects and triggering events based on their behavior. This emerging technology is extremely volatile, not only in the capabilities supported but also in the rapidly changing and innovative ways in which they are implemented.

FPGA or other programmable logic approaches introduce more flexibility than ASICs while maintaining high performance, but they still require a very long development cycle and carry a high cost. Nor is a software-only implementation feasible. Many video functions are relatively fixed in their base computations and are well-suited to being processed in parallel. The best approach for addressing the complexity and fluidity of video analytics and pre-processing functions such as de-interlacing, de-noising, and color space conversion is to utilize a mixed hardware and software approach that frees the main processor for other tasks. This minimizes development time, maximizes performance, and enables developers to abstract these functions efficiently through an API. It further results in an upgrade path that minimizes time-consuming changes at the application level since developers can interchange codecs without modifying either hardware or the main application.

COMPONENT-LEVEL OPTIMIZATION
When functions are broken into components that can be implemented in software or hardware and abstracted by an API, it becomes more difficult to optimize individual components. With codecs and most video analytic functions available off-the-shelf, however, it is no longer necessary or desirable for developers to optimize at the individual component level. In fact, it becomes extremely difficult to do so and in any case, the performance gains are rarely worth the development investment. Optimization, then, shifts to ensuring efficient interaction—both direct and indirect—between components, such as how an audio codec and video codec cooperate in their use of shared system resources including processor cycles, memory, and DMA bandwidth.

Memory is a critical system resource that directly affects system cost. Fortunately, codec implementations tend to be highly efficient relative to the overall memory available on a processor and do not require much internal memory. For example, an MPEG-4 encoder implemented on a TI DM642 processor needs just 256 KB of internal memory.

Hand-optimized codec implementations may take advantage of fixed resolution and buffer sizes, limiting their configurability. To achieve true interchangeability, codec implementations need to be flexible enough to support various frame sizes and resolutions so that the bit rate can be adjusted to best utilize network bandwidth.

The advantage of a configurable implementation is that interchangeability is supported directly by the algorithm and changing bit rate is a matter of changing the configuration of the algorithm—i.e., lossiness, resolution, or frame rate—which in turn defines how large and how many buffers are required. Additionally, developers are able to leverage this configurability to easily balance performance and memory usage. Reducing the size of frame buffers, for example, enables developers to trade off memory usage against performance. It also becomes possible to dynamically protect against network jitter by increasing buffer size when necessary, smoothing out network latencies.
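To make the link between configuration and buffer memory concrete, here is a rough sizing sketch. It assumes raw frames are stored as 8-bit YUV 4:2:0 (1.5 bytes per pixel), which the article does not specify, and the `EncoderConfig` record and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderConfig:
    """Hypothetical configuration record; field names are illustrative."""
    width: int
    height: int
    num_frame_buffers: int


def frame_bytes(cfg: EncoderConfig) -> int:
    # Assumed 8-bit YUV 4:2:0: a full-resolution luma plane plus two
    # quarter-resolution chroma planes, i.e. 1.5 bytes per pixel.
    return cfg.width * cfg.height * 3 // 2


def buffer_memory(cfg: EncoderConfig) -> int:
    # Total raw frame-buffer memory implied by the configuration.
    return frame_bytes(cfg) * cfg.num_frame_buffers
```

Under these assumptions, three 720x480 buffers need 1,555,200 bytes, while three 352x288 buffers need 456,192 bytes, so lowering the resolution shrinks the buffer footprint by more than a factor of three.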

One area where developers can preserve memory efficiency is in how codecs and data structures are instantiated at the application level. Consider a surveillance application where cameras are left on for months at a time. With any long-running application, fragmentation of heap memory can become an issue when algorithms are instantiated dynamically. Certain data structures offer better performance when allocated in the same memory page, so it becomes important to manage heap memory carefully. Video applications require a wide variety of large buffers, such as Groups of Pictures, I-B-P frames, overlays, etc. For the best algorithm efficiency, these buffers need to be contiguous. Thus static functions should be instantiated first so that their memory is allocated together at the top of the heap. If dynamic functions are allocated first, heap memory will be separated into multiple smaller pools that cannot be recombined.

One viable approach is to allocate buffers using the same base block size. Fragmentation is avoided because buffers are equal in size. While some memory may be allocated that is not used in a particular buffer, this is a small price to pay to avoid the extreme long-term difficulties eventually caused by fragmentation. Not only can algorithms process data in a continuous chain, but the loading of (multiple) buffers can also be easily accelerated through DMA mechanisms.
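The equal-block-size idea can be sketched as a simple pool that carves one contiguous region into fixed-size blocks up front. This is an illustrative model only; the `BlockPool` class and its methods are hypothetical, and a production allocator on a DSP would manage real memory segments rather than a Python `bytearray`.

```python
class BlockPool:
    """Pool of equal-sized blocks carved once from one contiguous region."""

    def __init__(self, block_size: int, num_blocks: int) -> None:
        self.block_size = block_size
        self._region = bytearray(block_size * num_blocks)  # allocated once
        self._free = list(range(num_blocks - 1, -1, -1))   # free block indices

    def acquire(self, nbytes: int) -> int:
        """Return a block index; requests larger than a block are rejected."""
        if nbytes > self.block_size:
            raise ValueError("request exceeds block size")
        if not self._free:
            raise MemoryError("pool exhausted")
        return self._free.pop()

    def release(self, index: int) -> None:
        """Return a block to the pool; any later request can reuse it."""
        if index in self._free:
            raise ValueError("block already free")
        self._free.append(index)

    def view(self, index: int) -> memoryview:
        """Zero-copy window onto one block of the shared region."""
        start = index * self.block_size
        return memoryview(self._region)[start:start + self.block_size]

    @property
    def free_blocks(self) -> int:
        return len(self._free)
```

Since every block is the same size, a released block always satisfies the next request, so the pool cannot fragment no matter how long the application runs. The cost is the unused tail of any block that holds less than `block_size` bytes.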

DMA is, without question, one of the essential elements of achieving optimal performance. Programmable DMA engines enable processors to move large blocks of data on and off chip directly into or out of codec data structures in the background without requiring direct interaction from the main processor.

The challenge that arises is that DMA requests within codecs are independent of DMA transfers initiated by the application. Without coordination, it is inevitable that DMA requests will interfere with each other, hindering the efficient movement of data. DMA also avoids the need to temporarily store large amounts of data when passing or receiving video buffers across codec APIs; passing buffers as parameters often requires that an extra copy be made to prevent unintended overwriting of data. Such copying quickly erodes performance and memory bandwidth. It may be tempting for developers to break codec APIs to handle DMA transfers directly, but because different codecs handle data differently, doing so locks the specific implementation into the application and destroys the interchangeability of codecs.

Rather than bypassing the API when moving data, codecs and application code can cooperate through the use of a single DMA interface which efficiently manages allocation of DMA resources. When both codec and application adhere to the interface, DMA efficiency is automatically maximized for both codec and application code without manual tuning. Interchangeability is thus preserved while at the same time eliminating the extra reads and writes associated with passing buffers as parameters.
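One way to picture a single DMA interface is as a shared arbiter through which both codec and application request channels. The sketch below is a simplified software model, not the interface of any particular platform: `DmaManager`, its methods, and the first-come, first-served policy are all assumptions made for illustration.

```python
from collections import deque


class DmaManager:
    """Hypothetical shared arbiter for a fixed set of DMA channels."""

    def __init__(self, num_channels: int) -> None:
        self._free = deque(range(num_channels))
        self._waiting = deque()   # owners queued for a channel, FIFO
        self.owner = {}           # channel -> name of current owner

    def request(self, owner: str):
        """Grant a channel if one is free; otherwise queue the owner."""
        if self._free:
            channel = self._free.popleft()
            self.owner[channel] = owner
            return channel
        self._waiting.append(owner)
        return None

    def release(self, channel: int):
        """Free a channel, handing it straight to the next waiter if any."""
        del self.owner[channel]
        if self._waiting:
            next_owner = self._waiting.popleft()
            self.owner[channel] = next_owner
            return next_owner
        self._free.append(channel)
        return None
```

Because codec and application both go through `request` and `release`, neither can claim a channel the other is using, and a swapped-in codec that follows the same interface needs no changes on the application side.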

DATA STRUCTURE ACCESS WITHOUT BREAKING APIS
Optimizing overall performance without affecting interchangeability also requires application-level access to key video data structures. If a codec is completely hidden behind the abstraction of an API, the result can be a significant loss of video quality or performance.

Consider that when the frame rate is dropped, the application must ensure that the highest quality frames are not the ones dropped. For example, MPEG-4 uses I-B-P frames, with I-frames capturing the most detail. Ideally, the application should drop B- or P-frames before it ever drops an I-frame. However, it can only do this if the codec tags I-frames so that they can be differentiated from B- and P-frames.
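A minimal sketch of tag-aware frame dropping follows, assuming the codec exposes a per-frame type tag. The `TaggedFrame` record, the `thin_frames` function, and the policy of dropping B-frames before P-frames (latest first, since later P-frames depend on earlier ones) are illustrative assumptions rather than part of any specific codec API.

```python
from dataclasses import dataclass

# Frame types in the order they should be sacrificed: B first, then P.
DROP_ORDER = ("B", "P")


@dataclass(frozen=True)
class TaggedFrame:
    """Hypothetical encoded frame carrying the codec's frame-type tag."""
    index: int
    frame_type: str   # "I", "P" or "B"


def thin_frames(frames, max_frames):
    """Drop B- then P-frames (latest first) until max_frames remain."""
    kept = list(frames)
    for frame_type in DROP_ORDER:
        # Iterate over a snapshot so removals do not disturb the loop.
        for frame in reversed(list(kept)):
            if len(kept) <= max_frames:
                return kept
            if frame.frame_type == frame_type:
                kept.remove(frame)
    return kept   # I-frames are never dropped, even if over budget
```

For a group of pictures tagged `IBBPBBP`, a budget of four frames leaves `IBPP` and a budget of two leaves `IP`. Without the type tag the application would have no way to make this choice and could just as easily discard the I-frame.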

The same applies to transcoding, where video is decoded from one format and encoded into another. If only the resultant decoded video stream is used, losses in the decoding process will propagate and further degrade video quality. When the transcoder has direct access to the motion vectors used to create the decoded frames, however, quality can be preserved.

Application access into codec data structures is even more important for the support of video analytics. An object recognition component that can utilize processing already completed by the codec (rather than having to duplicate such processing itself) preserves processing resources. Likewise, an event-triggering component must be able to alert the codec so that it increases the target bit rate as quickly as possible and uses JPEG compression for pre/post-alert snapshots.

Video analytics is a fast-moving field, with no standards and continuous innovation. APIs that don’t provide hooks past the API abstraction force developers to break the API, which ties the video analytic implementation to the particular codec implementation, limits reuse of the video analytic, and destroys interchangeability.

Interchangeability is a critical foundation of today’s digital video camera applications, and platforms such as TI’s DaVinci technology are configured to ensure this is possible with minimal retooling. The ability to dynamically adjust a video stream’s bit rate and quality based on resolution, frame rate, and codec format enables developers to maximize bandwidth utilization. By keeping codec implementation transparent to the application through the use of APIs, interchangeability is preserved. Optimization, then, becomes a system-level process. Rather than focusing on hand-optimization of codecs, which requires too much development investment for too little gain, developers instead optimize the interaction between codecs and application code. In this way, APIs can be preserved while enabling codec interchangeability and promoting application code reuse.

Dr. Cheng Peng is a DSP Video Systems Engineer at Texas Instruments. He can be reached at [email protected]

Company: TEXAS INSTRUMENTS
