Networked Video: A Single-Processor Solution
By David Katz and Rick Gentile,
Analog Devices, Inc.
Today more than ever, the worlds of networking and video are converging to create new embedded application possibilities. Yet only recently has a single processor been able to handle high-quality video with integrated Ethernet capability. There are several reasons for this. Microcontrollers (MCUs) are commonly used as Ethernet conduits, but their limited processing power restricts them to modest video resolutions or bit rates. Dedicated video processors, on the other hand, usually either lack a network interface (and thus rely on a host MCU) or are too inflexible to handle changing requirements and multiple video encode/decode scenarios. What's more, they're often power-hungry and too expensive for many markets.
The breakthrough comes in the form of a single processor that unifies both MCU and video processing capabilities into a cohesive architecture. An example of this class of device is the Blackfin processor family from Analog Devices. In particular, the ADSP-BF537 provides a useful platform for discussion of networked video applications. Not only does it have a high-performance processor core, but it also features a high-speed parallel peripheral interface (PPI) and an on-chip 10/100 Ethernet MAC port.
MORE ABOUT THE PPI
The PPI is a 16-bit multifunction parallel interface that supports bi-directional data flow and includes three synchronization lines and a clock pin for connection to an external clock. The PPI can gluelessly decode ITU-R BT.656 video frames and can also interface to ITU-R BT.601 video streams. But the PPI is not just a camera port. It is flexible and fast enough to serve as a channel for high-speed analog-to-digital converters (ADCs) and digital-to-analog converters (DACs). Moreover, it can act as a glueless LCD display controller.
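BT.656 carries no separate sync pins. Instead, each line's active video is bracketed by embedded timing reference codes: the bytes FF 00 00 followed by an "XY" byte whose bits encode the field (F), vertical blanking (V) and whether the code marks the start (SAV) or end (EAV) of active video (H), plus four protection bits. To give a feel for what decoding this format involves, here is a minimal C sketch that parses one such code and checks its protection bits. The type and function names are hypothetical, and this is a software model of the BT.656 format, not a description of how the PPI is programmed.

```c
#include <stdbool.h>
#include <stdint.h>

/* Decoded fields of a BT.656 timing reference "XY" byte. */
typedef struct {
    bool field;   /* F: 0 = field 1, 1 = field 2 */
    bool vblank;  /* V: 1 during vertical blanking */
    bool eav;     /* H: 1 = EAV (end of active video), 0 = SAV */
    bool valid;   /* protection bits agree with F, V and H */
} bt656_trs;

/* Returns true if p[0..3] is a timing reference sequence FF 00 00 XY,
   and fills *out with the decoded flags. */
bool bt656_parse_trs(const uint8_t p[4], bt656_trs *out)
{
    if (p[0] != 0xFF || p[1] != 0x00 || p[2] != 0x00)
        return false;

    uint8_t xy = p[3];
    unsigned f = (xy >> 6) & 1u;
    unsigned v = (xy >> 5) & 1u;
    unsigned h = (xy >> 4) & 1u;

    /* Protection bits P3..P0 as defined by BT.656. */
    unsigned p3 = v ^ h, p2 = f ^ h, p1 = f ^ v, p0 = f ^ v ^ h;
    unsigned expected = 0x80u | (f << 6) | (v << 5) | (h << 4) |
                        (p3 << 3) | (p2 << 2) | (p1 << 1) | p0;

    out->field = f;
    out->vblank = v;
    out->eav = h;
    out->valid = (xy == expected);
    return true;
}
```

For example, the sequence FF 00 00 9D decodes as a valid EAV code for active video in field 1; flipping any single bit of the XY byte makes `valid` false.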
Although embedded Ethernet implementations are fairly standardized, the BF537 contains some helpful extras aimed at reducing the number of times the processor has to "touch" network packets. For example, data movement is managed by a direct memory access (DMA) controller instead of requiring constant processor involvement. Likewise, hardware checksum calculation on received packets offloads this computation from the processor. These architectural enhancements ensure that processor bandwidth is not entirely consumed by managing the network side of an application.

There are two main data flows in networked video applications. In the first, compressed video arrives at the device via Ethernet, is decompressed, and is displayed locally. The second is the reverse: raw video streams into the processor and is compressed before being sent out over the Ethernet interface.
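To show what the receive-checksum offload saves, here is a minimal C sketch of the standard Internet checksum (RFC 1071): a 16-bit one's complement sum over big-endian words, folded and complemented. It illustrates the kind of per-byte work being offloaded; we do not claim it matches exactly which bytes or which checksum variant the BF537 hardware computes. The function name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* RFC 1071 Internet checksum over len bytes.
   A 32-bit accumulator cannot overflow for buffers under ~128 KB,
   which comfortably covers Ethernet frames. */
uint16_t inet_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;

    while (len > 1) {                 /* add 16-bit big-endian words */
        sum += (uint32_t)((data[0] << 8) | data[1]);
        data += 2;
        len -= 2;
    }
    if (len == 1)                     /* pad a trailing odd byte with zero */
        sum += (uint32_t)data[0] << 8;

    while (sum >> 16)                 /* fold carries into the low 16 bits */
        sum = (sum & 0xFFFFu) + (sum >> 16);

    return (uint16_t)~sum;            /* one's complement of the sum */
}
```

A receiver can verify a header by running the same sum over it with the checksum field included: a correct header yields 0. Doing this in software costs a loop iteration per 16-bit word of every packet, which is exactly the work the hardware takes off the core.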
ACCOMMODATING ETHERNET AND VIDEO
As implied earlier, interfacing to a network is very much a control-related task, which is why MCUs are a natural choice for integrating Ethernet. Video processing, however, is block-centric and dominated by processing loops, a profile very different from that of a control-oriented MCU. These differences, in turn, translate into different programming models. Allocating control/networking and media functions to separate processors makes life easier in some respects, but it raises the bill of materials and project costs, and inter-processor communication becomes a key liability.
Until now, when these functions were combined in the same processor, one side of the application (network or media) had to be severely restricted in performance. For example, perhaps only a limited, low-bandwidth network stack could run. On the media side, image resolution might have to be reduced from, say, VGA (640 x 480) to QVGA (320 x 240), or the video frame rate might need to be lowered.
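A quick calculation shows why dropping from VGA to QVGA helps so much: QVGA has half the width and half the height, so it carries one quarter of the pixels. The sketch below computes the raw memory bandwidth of an uncompressed video stream. The 2 bytes per pixel (8-bit YCbCr 4:2:2, as carried by BT.656) and 30 frames/s used in the usage note are assumptions for illustration, not figures from the article.

```c
#include <stdint.h>

/* Raw bytes per second needed to move an uncompressed video stream. */
uint64_t video_bytes_per_sec(uint32_t width, uint32_t height,
                             uint32_t bytes_per_pixel, uint32_t fps)
{
    return (uint64_t)width * height * bytes_per_pixel * fps;
}
```

Under those assumptions, `video_bytes_per_sec(640, 480, 2, 30)` gives 18,432,000 bytes/s for VGA, while `video_bytes_per_sec(320, 240, 2, 30)` gives 4,608,000 bytes/s for QVGA. Every extra pass through a frame buffer multiplies these figures again, and all of that traffic shares the external bus with the network stack.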
These performance restrictions stemmed from the different needs of the networking and video sides of the application. For instance, the two sides compete for external memory resources, such as accesses to SDRAM. The network stack consumes some external memory for code and some for data. Its instruction and data accesses tend to be "spread out" in memory, which can degrade performance because rows in external memory are opened and closed constantly.
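The cost of "spread out" accesses can be illustrated with a toy model that counts how often an SDRAM controller must open a new row. This is a deliberately simplified sketch: it assumes a single bank with one open row at a time and a hypothetical 1 KB row size, whereas real SDRAM parts have several banks (each with its own open row) and device-specific row sizes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical row size for this model; real parts differ. */
#define ROW_BYTES 1024u

/* Count row activations for a sequence of byte addresses in a
   single-bank model: every change of row closes the old row and
   opens a new one, which costs extra cycles on real SDRAM. */
size_t count_row_opens(const uint32_t *addrs, size_t n)
{
    size_t opens = 0;
    uint32_t open_row = UINT32_MAX;   /* sentinel: no row open yet */

    for (size_t i = 0; i < n; i++) {
        uint32_t row = addrs[i] / ROW_BYTES;
        if (row != open_row) {
            opens++;
            open_row = row;
        }
    }
    return opens;
}
```

Streaming 4 KB of video linearly opens only 4 rows in this model, but interleaving video fetches with network-stack accesses in a distant region of memory can force a new row open on every single access.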
This problem is compounded by the needs of the video algorithm, which accepts streaming video from a sensor or outputs a stream to a display. For example, if a memory fetch required to feed an LCD display is held off because the network side is requesting the external memory bus, the consequences will be plainly visible on the display in the form of synchronization losses.
Today, a single-processor solution such as the BF537 addresses these data bottlenecks. We have already seen how the Ethernet peripheral reduces processor load by moving data through DMA. Similarly, on the video side, a DMA controller that services the PPI frees the processor core from constant involvement in data transfers.
Not only do we want to keep the processor from having to move data, but it's also advantageous to minimize the number of passes through a particular buffer, because every additional pass wastes memory bandwidth. Even though video data is stored in SDRAM in a linear, one-dimensional manner, the DMA controller can access the data in arbitrarily sized blocks, as if it were stored in a two-dimensional manner. This saves considerable cycles that would otherwise be spent computing the transfer "strides," or step sizes, through the data.
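The following C sketch models how a two-dimensional DMA pattern walks a linear frame buffer to pull out a rectangular block. It is loosely patterned on the Blackfin style of describing a 2D transfer with an X count/modify pair for steps within a row and a Y count/modify pair applied after the last element of each row; the struct, the field names and the functions here are hypothetical software stand-ins, not the actual register interface.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of a 2D transfer over 8-bit pixels. */
typedef struct {
    uint32_t x_count;   /* elements per row */
    int32_t  x_modify;  /* byte step between elements within a row */
    uint32_t y_count;   /* number of rows */
    int32_t  y_modify;  /* byte step from a row's last element to the next row's first */
} dma2d_cfg;

/* Copy the pattern described by cfg, starting at src, into dst.
   Returns the number of elements copied. */
size_t dma2d_copy(const uint8_t *src, const dma2d_cfg *cfg, uint8_t *dst)
{
    const uint8_t *p = src;
    size_t n = 0;

    for (uint32_t y = 0; y < cfg->y_count; y++) {
        for (uint32_t x = 0; x < cfg->x_count; x++) {
            dst[n++] = *p;
            if (x + 1 < cfg->x_count)
                p += cfg->x_modify;          /* next element in this row */
            else if (y + 1 < cfg->y_count)
                p += cfg->y_modify;          /* jump to start of next row */
        }
    }
    return n;
}

/* Pattern for a w x h block inside a frame whose rows are stride bytes apart. */
dma2d_cfg dma2d_block(uint32_t w, uint32_t h, uint32_t stride)
{
    dma2d_cfg c = { w, 1, h, (int32_t)(stride - w + 1) };
    return c;
}
```

For an 8-pixel-wide frame, `dma2d_block(3, 2, 8)` sets the row-to-row step to 6 bytes, so a transfer starting at offset 10 gathers offsets 10, 11, 12, 18, 19 and 20. The stride arithmetic is set up once, and the address generator (hardware, in the real device) applies it to every element.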
The BF537 also includes some finer-grained controls to ease potential conflicts between the network and video sides of an application. These include the ability to control access patterns to external memory, programmable interrupt priority levels that prevent background tasks from "locking out" critical processing regions, and configurable DMA channel priorities for arbitrating between the data streams in an application.
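To illustrate what configurable channel priorities buy, here is a minimal fixed-priority arbiter in C: among the channels currently requesting service, it grants the one with the best (lowest) priority value. The four-channel layout and the mapping of channels to PPI and Ethernet traffic are hypothetical, and this does not reproduce the BF537's actual arbitration logic.

```c
#include <stdint.h>

#define NUM_CH 4  /* hypothetical: 0 = PPI video, 1 = Ethernet RX, 2 = Ethernet TX, 3 = memory copy */

/* Grant the pending channel with the lowest priority value.
   pending has bit ch set when channel ch is requesting.
   Ties go to the lower channel index. Returns -1 if nothing is pending. */
int arbitrate(uint32_t pending, const uint8_t prio[NUM_CH])
{
    int winner = -1;

    for (int ch = 0; ch < NUM_CH; ch++) {
        if (!(pending & (1u << ch)))
            continue;
        if (winner < 0 || prio[ch] < prio[winner])
            winner = ch;
    }
    return winner;
}
```

Giving the PPI channel priority 0 means a display refresh fetch always wins when it competes with network traffic, which is one way to avoid the synchronization losses described earlier; reprogramming the `prio` table changes that policy without touching the rest of the application.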
The Ethernet peripheral supports "line speed" operation while transmitting and receiving packets. For a protocol like the User Datagram Protocol (UDP), transfer rates approach the line speed of the 10/100 Mbps interface. Processor loading is a more important question for a protocol like TCP/IP: how much time is available between packets, and how many processor cycles must be spent managing the stack?
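The time available between packets follows from standard Ethernet framing. Each frame occupies its own length plus 20 bytes of wire overhead (7-byte preamble, 1-byte start delimiter, 12-byte minimum interframe gap), and a maximum-size 1518-byte frame carrying UDP over IPv4 (no VLAN tag, no IP options) holds 1472 bytes of payload. The sketch below turns those numbers into packet rates and per-packet instruction budgets; the 600 MIPS figure comes from the article, and treating it as a flat per-packet budget is a simplifying assumption.

```c
#include <stdint.h>

/* Wire overhead per frame: preamble (7) + SFD (1) + minimum interframe gap (12). */
#define WIRE_OVERHEAD 20u

/* Frames per second on a saturated link for a given frame size in bytes. */
double frames_per_sec(double link_bps, uint32_t frame_bytes)
{
    return link_bps / ((frame_bytes + WIRE_OVERHEAD) * 8.0);
}

/* UDP payload in a frame: minus Ethernet header (14), FCS (4),
   IPv4 header (20) and UDP header (8). Assumes frame_bytes >= 64. */
uint32_t udp_payload(uint32_t frame_bytes)
{
    return frame_bytes - 14u - 4u - 20u - 8u;
}

/* Instructions available per frame at a given processing rate. */
double instr_per_frame(double mips, double fps)
{
    return mips * 1e6 / fps;
}
```

At 100 Mbps, 1518-byte frames arrive at about 8,127 per second (roughly 123 microseconds apart), giving about 95.7 Mbps of UDP payload and a budget of about 73,800 instructions per frame at 600 MIPS. Minimum-size 64-byte frames arrive at about 148,800 per second, leaving only about 4,000 instructions each, which is why per-packet stack overhead matters so much for protocols like TCP/IP.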
In the end, a basic operating kernel and a network protocol such as UDP consume less than 30 MIPS of the BF537's total budget of 600 MIPS, or under 5%. This leaves plenty of room for basic video decode functionality.
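One way to gauge "plenty of room" is to divide the remaining instruction budget by the number of pixels processed per second. The sketch below does that arithmetic; the 30 frames/s rate and the idea of spreading the whole remainder evenly over pixels are illustrative assumptions, not measurements.

```c
/* Instructions available per pixel after reserving a budget for the
   kernel and network stack. */
double instr_per_pixel(double total_mips, double reserved_mips,
                       unsigned width, unsigned height, double fps)
{
    double spare = (total_mips - reserved_mips) * 1e6;   /* instructions per second */
    return spare / ((double)width * height * fps);       /* per pixel */
}
```

With 600 MIPS total and 30 MIPS reserved, `instr_per_pixel(600, 30, 640, 480, 30)` leaves about 62 instructions per pixel for VGA video, while the same call at 320 x 240 leaves about 247, four times as much, which is the headroom the QVGA approach trades on.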
A common practice that reduces time to market is to implement the video algorithm on a QVGA size image and then interpolate the image to, say, VGA. This has the dual benefit of reducing both the processor loading and the percentage of the external bus bandwidth used. For many applications, the scaled QVGA image is perfectly adequate, but in cases where it's not, higher resolutions are also achievable, because of the large processing margin on a 600 MIPS device.
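As a sketch of the interpolation step, the C function below doubles an 8-bit grayscale image in each dimension (QVGA to VGA, for instance) using a simple bilinear scheme: even output samples copy source pixels, odd samples average their neighbors, and edges are clamped. The function name, the grayscale format and this particular sample alignment are assumptions for illustration; a real pipeline would typically scale each YCbCr component and might use a different filter.

```c
#include <stddef.h>
#include <stdint.h>

/* 2x upscale of a w x h 8-bit image into dst, which must hold (2w)*(2h) bytes.
   Each output pixel is the rounded average of up to four source pixels. */
void upscale2x_bilinear(const uint8_t *src, size_t w, size_t h, uint8_t *dst)
{
    size_t ow = 2 * w, oh = 2 * h;

    for (size_t oy = 0; oy < oh; oy++) {
        size_t y0 = oy / 2;
        size_t y1 = ((oy & 1) && (y0 + 1 < h)) ? y0 + 1 : y0;  /* clamp at bottom edge */

        for (size_t ox = 0; ox < ow; ox++) {
            size_t x0 = ox / 2;
            size_t x1 = ((ox & 1) && (x0 + 1 < w)) ? x0 + 1 : x0;  /* clamp at right edge */

            unsigned sum = (unsigned)src[y0 * w + x0] + src[y0 * w + x1] +
                           src[y1 * w + x0] + src[y1 * w + x1];
            dst[oy * ow + ox] = (uint8_t)((sum + 2) / 4);  /* rounded average */
        }
    }
}
```

Because the heavy algorithm runs on a quarter of the pixels and this pass costs only a few additions per output pixel, the combined load stays well below that of processing at full VGA resolution.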
By promoting peaceful coexistence of networking, control and signal processing functions, integrated processors such as the Blackfin enable lower-cost, higher-performance applications in fields like video surveillance, remote sensing (using ADCs and DACs for measurements) and video-over-IP.
David Katz and Rick Gentile are Blackfin Senior Applications Engineers at Analog Devices. They are the authors of Embedded Media Processing.
Company: ANALOG DEVICES INC.