Electronic Design

Simplify Video De-interlacing And Reformatting

Certain video codecs require a progressively scanned 4:2:0 YUV input format. This straightforward method can be used in many applications to convert video from other formats at minimum MIPS cost.

Most common video signals require preprocessing before they can be encoded by video-compression codecs, which often require data in 4:2:0 planar format to achieve higher processing performance. For example, broadcast standards such as NTSC and PAL may need to be converted from an interlaced format to progressive scan, and they frequently require the chrominance and luminance information to be reformatted as well.

In particular, video from CCD cameras is captured in an interlaced 4:2:2 interleaved format. Certain profiles of video-compression standards, however, accept input only in a progressively scanned 4:2:0 format. In this case, interlacing artifacts must be removed since interlaced video content can be quite challenging for progressive encoders.

Engineers have a number of sophisticated de-interlacing algorithms to choose from, but not all applications require the highest level of video quality. Moreover, sophisticated algorithms tend to be compute-intensive, and developers always have a digital-signal-processor (DSP) MIPS budget to manage.

When the application doesn't require the highest video quality, resizing algorithms in hardware can be used for de-interlacing. This technique is particularly useful to save precious DSP MIPS by offloading the 4:2:2 to 4:2:0 conversion and de-interlacing operations to other hardware. Surprisingly, resizing hardware sometimes achieves de-interlacing quality on par with high-complexity de-interlacing algorithms after video compression is taken into account.

The simple method described in this article can be used to de-interlace video in many applications. This technique works best when there's a significant amount of motion in the video frames, since still images tend to highlight its deficiencies.

LUMINANCE AND CHROMINANCE CODING
NTSC defines standard-definition (NTSC SD) resolution as 720 pixels per line, 480 lines per frame, and 30 frames per second. The information for each pixel contains three components:

• Y is the luminance (luma) information
• Cb (U) is the blue color information
• Cr (V) is the red color information

When the NTSC standard was adopted, engineers faced both transmission bandwidth and computing power constraints for encoding video streams. Since the human eye is far more sensitive to luminance information, the NTSC standard lightened the load by calling for the chrominance information to be horizontally down-sampled by half.

Each captured frame from a CCD camera has 720 by 480 Y values, 360 by 480 U values, and 360 by 480 V values. Each value is eight bits (one byte) in the range [0, 255], which makes each NTSC SD frame (720 + 360 + 360) × 480 = 691,200 bytes.

The Y/U/V components in the captured frame are typically interleaved, usually in YUV 4:2:2 format. There are two ways to organize the data, but in the interest of simplicity, assume the data is organized in UYVY interleaved 4:2:2 format (Fig. 1).

As previously mentioned, most encoders require input video to be in YUV 4:2:0 format. There are two distinctions between 4:2:2 interleaved data and 4:2:0 planar data.

In 4:2:0 format, chroma information is further down-sampled vertically by half as well. That is, for each NTSC SD frame, the U and V components each contain 360 × 240 bytes instead of 360 × 480 bytes. Each NTSC SD frame in 4:2:0 format is 518,400 bytes [(720 × 480) + (360 × 240 × 2)]. The additional chroma down-sampling is required to balance real-time performance with acceptable picture quality.
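The frame-size arithmetic for both formats can be checked with a few lines of Python:

```python
# NTSC SD frame dimensions
WIDTH, HEIGHT = 720, 480

# 4:2:2: chroma is horizontally down-sampled by half
y_bytes = WIDTH * HEIGHT                  # 720 x 480 luma samples
chroma_422 = (WIDTH // 2) * HEIGHT        # 360 x 480, for each of U and V
frame_422 = y_bytes + 2 * chroma_422

# 4:2:0: chroma is also vertically down-sampled by half
chroma_420 = (WIDTH // 2) * (HEIGHT // 2) # 360 x 240, for each of U and V
frame_420 = y_bytes + 2 * chroma_420

print(frame_422)  # 691200
print(frame_420)  # 518400
```

The 4:2:0 frame is 25% smaller, which is one reason encoders favor it.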

Efficient implementations of video-compression standards also often require that luma and chroma components are separated in memory, because the encoding algorithms may process them in different ways. Figure 2 shows the NTSC SD video frame in 4:2:0 planar format.
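As a software reference for what the hardware pipeline produces, the repacking from UYVY 4:2:2 interleaved to 4:2:0 planar can be sketched in Python. This is a minimal sketch: a real implementation would use the resizer or SIMD code, and a proper filter rather than the simple row-dropping used here for the vertical chroma down-sampling.

```python
def uyvy_to_yuv420p(frame, width, height):
    """Repack a UYVY 4:2:2 byte sequence into Y, U, V planes (4:2:0).

    Chroma is vertically down-sampled by simply dropping odd rows,
    mirroring the low-complexity approach described in the article.
    """
    y = bytearray(width * height)
    u = bytearray((width // 2) * (height // 2))
    v = bytearray((width // 2) * (height // 2))
    for row in range(height):
        base = row * width * 2              # 2 bytes per pixel in UYVY
        for i in range(width // 2):
            u_b, y0, v_b, y1 = frame[base + 4 * i : base + 4 * i + 4]
            y[row * width + 2 * i] = y0
            y[row * width + 2 * i + 1] = y1
            if row % 2 == 0:                # keep chroma from even rows only
                c = (row // 2) * (width // 2) + i
                u[c] = u_b
                v[c] = v_b
    return bytes(y), bytes(u), bytes(v)
```

For an NTSC SD frame, the resulting planes are 345,600, 86,400, and 86,400 bytes, matching the 518,400-byte total given above.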

INTERLACED ARTIFACTS
Interlace scanning involves scanning a picture twice, with one scan capturing every even line and one scan capturing every odd line. The two captures are separated by a small difference in time and then merged together to form a complete frame.
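The merge step can be expressed directly: the top field supplies the even lines, the bottom field the odd lines, and the two are woven together into one frame. A minimal sketch:

```python
def weave(top_field, bottom_field):
    """Interleave two fields (sequences of scan lines) into one frame.

    top_field supplies the even lines (0, 2, 4, ...),
    bottom_field the odd lines (1, 3, 5, ...).
    """
    frame = []
    for even_line, odd_line in zip(top_field, bottom_field):
        frame.append(even_line)
        frame.append(odd_line)
    return frame
```

Because the two fields are captured at different instants, any object that moved between captures ends up drawn at two positions in the woven frame.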

Interlace artifacts can form when these two fields are merged. For example, the vertical edges of the rectangular box will lead to sawtooth effects, which are shown in the last frame of Figure 3. Such artifacts created by capturing a moving video object at different time instances are called interlaced artifacts.

For the NTSC standard, in which video frames are captured at 30 frames/s, the time between two sequential field captures (a top field and its complementary bottom field) is 16.67 ms. If fast motion in the scene is captured in the frame, this delay creates interlaced artifacts.

Because they're represented as high-frequency noise, these artifacts can potentially cause serious problems for progressive video encoders. This is mainly due to the sensitivity of the human eye and the way compression standards operate. Virtually all video-compression standards are based on two very important assumptions:

• The human eye is more sensitive to low-frequency information, which means some high-frequency information in the original frames can be removed while still maintaining acceptable visual quality.

• Encoding proceeds on a block basis, meaning each 16-by-16 or 8-by-8 block in a video frame can have very similar blocks in neighboring frames. So, the practice is to find a similar block in the previously coded frame and code only the delta between them. This achieves high compression ratios, and in most compression standards, a motion-estimation (ME) module is defined for this purpose.

Unfortunately, interlaced artifacts can appear in almost every block, which makes it very difficult for the ME module to find a similar block in the previously coded frame. As a result, the delta is bigger and the encoder uses more bits to encode it. Thus, it's a good idea to reduce or remove the interlaced artifacts in the captured frame before feeding it to a progressive video encoder.

DE-INTERLACING VIDEO
As previously noted, high-quality de-interlacing can be accomplished with sophisticated algorithms that use a good deal of compute power. A more direct method uses resizing hardware, such as the resizer in Texas Instruments' TMS320DM6446 digital media processor, to simply discard an entire set of field lines and use information from the remaining field to generate the missing data.

Discarding all bottom-field data of a 480i60 (480 lines, interlaced, 60 fields/s) video would yield a 240p30 (240 lines, progressive, 30 frames/s) video. This data is resized vertically to generate a 480p30 de-interlaced result. An advantage of this method is that it removes 100% of all interlacing artifacts, but with an obvious loss of vertical fidelity.
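The discard-and-resize path can be sketched in a few lines of Python. Simple neighbor averaging stands in for the resizer's programmable filter here, so this is an illustration of the data flow rather than of the hardware's exact interpolation:

```python
def deinterlace_by_discard(frame):
    """De-interlace by keeping only the even (top-field) lines, then
    interpolating the discarded odd lines from their vertical neighbors.

    frame is a list of scan lines; each line is a list of pixel values.
    Mirrors the 480i60 -> 240p30 -> 480p30 path described above.
    """
    top = frame[0::2]                          # keep the top field only
    out = []
    for i, line in enumerate(top):
        out.append(line[:])
        nxt = top[i + 1] if i + 1 < len(top) else line
        # 1:2 vertical up-scaling: synthesize the missing odd line
        out.append([(a + b) // 2 for a, b in zip(line, nxt)])
    return out
```

The output has the original line count, but every line now derives from a single capture instant, so no sawtooth artifacts remain.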

The approach can produce surprisingly good results when used as a pre-processing step prior to progressive compression. This is because lossy video compression algorithms, especially at low bit rates, typically discard high frequencies anyway.

Therefore, depending on the source content, it can provide the same quality as complex algorithms after compression is taken into account. For example, a low-complexity de-interlacer can be used to turn interlaced broadcast data into a low-bit-rate version for a progressive cell-phone screen display.

IMPLEMENTATION
The resizer within the TMS320DM6446 processor performs the same general functions as any resizer, but all resizers are a little different. Two important features to note are the resizer module's support for 1/4x to 4x scaling in each direction, horizontal and vertical, and the fact that the scaling factor is independent for each direction.

In addition, all of the filter coefficients are programmable. A simple example would use an input frame in a 4:2:2 interleaved format organized in a UYVY format (Fig. 1, again). Resolution would be 720 by 480 pixels per frame (NTSC SD).

To de-interlace, the resizer is first told that the input frame is 724 pixels wide, not the real width of 720 pixels. That's because, to implement exactly 1:1 scaling, the DM6446 processor's horizontal input size must be adjusted to 720 plus a delta, which is calculated by the resizer's equations.

Then the resizer is told the pitch is twice as wide as it really is so that it accepts the first two horizontal scan lines as if they were one. This allows the resizer to perform horizontal 1:1 scaling on even rows (on the top left in Figure 4) and discard the odd rows (on the top right). The input and output vertical size is set to 244 and 480, respectively, so the resizer performs 1:2 up-scaling vertically to interpolate the discarded odd rows.
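The pitch trick can be illustrated in software: when the stated line pitch is twice the true value, each line the hardware fetches starts two real lines after the previous one, so the odd rows are simply never read. A sketch of the addressing (the buffer contents are illustrative):

```python
def read_lines(buf, width_bytes, pitch_bytes, num_lines):
    """Fetch num_lines lines of width_bytes each, stepping by pitch_bytes.

    With pitch_bytes == 2 * true_pitch, successive fetches land on even
    rows only -- the odd rows are skipped without any copying.
    """
    return [buf[i * pitch_bytes : i * pitch_bytes + width_bytes]
            for i in range(num_lines)]

TRUE_PITCH = 1440          # 720 pixels x 2 bytes per pixel in UYVY
HEIGHT = 480

# Test buffer: every byte of row r holds the value r (mod 256)
buf = b"".join(bytes([row % 256]) * TRUE_PITCH for row in range(HEIGHT))

# Doubling the pitch makes the reader see only even rows
even_rows = read_lines(buf, TRUE_PITCH, 2 * TRUE_PITCH, HEIGHT // 2)
```

Each fetched line here comes from rows 0, 2, 4, ..., which is exactly the top field the vertical 1:2 pass then up-scales.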

The resizer is then told that the width of the output frame is 720 pixels and the output pitch is 1440 [720 + (360 × 2)] bytes, resulting in an output frame (Fig. 4, again).

To include the conversion from 4:2:2 to 4:2:0, so that a progressive encoder can use the data, the resizer is called three times for each input frame in the 4:2:2 interleaved format to generate de-interlaced 4:2:0 frames. Three sets of configuration parameters must be maintained, one each for the U, Y and V values. Hence, the three calls to the resizer.
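The three-pass sequence can be summarized as a driver loop. The parameter names below are illustrative, not the DM6446 driver's actual API; only the scaling ratios and the vertical sizes for the chroma passes come from the article:

```python
# Hypothetical parameter sets for the three resizer passes
RESIZER_PASSES = {
    "Y": {"h_scale": "2:1", "v_scale": "1:2",  # extract luma, interpolate rows
          "out_vsize": 480},
    "U": {"h_scale": "1:1", "v_scale": "2:1",  # vertical down-sample only
          "in_vsize": 484, "out_vsize": 240},
    "V": {"h_scale": "1:1", "v_scale": "2:1",
          "in_vsize": 484, "out_vsize": 240},
}

def convert_frame(frame, resizer_call):
    """Run the three resizer passes on one 4:2:2 interleaved frame.

    resizer_call stands in for the driver entry point; it is invoked
    once per component with that component's parameter set.
    """
    return {comp: resizer_call(frame, params)
            for comp, params in RESIZER_PASSES.items()}
```

The payoff is that all three passes run on the resizer hardware, leaving the DSP free for the encoder itself.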

The starting point is the input frame (in NTSC SD resolution) in UYVY 4:2:2 format. The output frame is defined as 4CIF resolution (704 by 480) instead of NTSC SD resolution (720 by 480). The rightmost 16 columns of the input frame must be discarded due to the 32-byte output alignment limitation of the resizer. Alternatively, eight columns from the right and eight from the left could have been cropped.

The first call extracts the Y components in the input frame and de-interlaces them. The de-interlacing operation should be applied only to the Y components, which is done by instructing the resizer to treat the input frame as an image in 4:2:0 planar format (Fig. 5). The resizer is also instructed to perform 2:1 scaling horizontally to extract every other Y component in the input frame, and 1:2 scaling vertically to interpolate the discarded Y components in the odd rows.

The second call to the resizer modifies the U components, which need to be further vertically down-sampled by a ratio of 2:1. The de-interlacing operation isn't necessary because the down-sampling involves discarding all of the odd rows, which automatically generates progressive U buffers. To perform vertical downscaling, the vertical input size is set to 484 and the output size set to 240.
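The effect of this chroma pass amounts to keeping the even rows only, which is why no separate de-interlacing step is needed. A sketch (plain decimation stands in for the hardware's filter):

```python
def downsample_rows_2to1(plane):
    """Vertical 2:1 down-scaling by decimation: keep even rows only.

    plane is a list of rows. Because the even rows all come from the
    same (top) field, the result is automatically progressive.
    """
    return plane[0::2]
```

Applied to the 480-row U plane of a 4:2:2 frame, this yields the 240-row U plane that 4:2:0 requires.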

The operation on V components is similar to that on U components. Operationally, this requires setting some of the resizer filter coefficients differently, but this level of detail is beyond the scope of this article.

A resizing engine can be used to implement the preprocessing of video that calls for de-interlacing and YUV format conversion prior to encoding by video-compression codecs. Due to several factors, including the video codec's tendency to remove high-frequency components, the resulting video quality can rival that of more sophisticated de-interlacing algorithms once compression is taken into account. This technique isn't suitable for all applications, however, and care must be taken to ensure that the output quality is acceptable for the application.
