Affordable Video Conferencing Primed For Takeoff

A look at the emerging standards and technology that are making video conferencing simpler and more affordable.

Nov. 17, 2011

Fig 1. Significant differences exist between two-way and 10-way calling. In general, two-way calling is much simpler, since it usually doesn’t require any intermediate processing of the audio/video media.

Fig 2. Round-trip delay is inevitable in video conferencing. Because the act of encoding (and decoding) audio and video naturally produces delay, minimizing such delay becomes a key design criterion.

Fig 3. Advanced packet-loss concealment helps boost performance. Two schemes can be implemented to minimize image-quality degradation due to packet loss. The football image shown represents a worst-case scenario because of the continual quick movement.

Video has finally become the killer app that many hoped it would. It has been fueled by the emergence of enterprise-focused tablets, residential services like Cisco’s umi, Google+ video Hangouts, and, of course, Apple’s new FaceTime service on the iPhone and iPad.

Why has it taken so long? Success doesn’t happen overnight and the sudden proliferation of video devices was long overdue. To put it in perspective, the world’s first public video telephone service, developed by Dr. Georg Schubert, was operated by the German Reichspost in 1936 using square displays of 8 inches (20 cm).

However, technology alone does not ensure market success. There needs to be a convergence—a perfect storm, if you like—of key elements that usually include user demand, technology capability, and cost-effectiveness. Given the abundance of gadgets available for the video-savvy user, and the accelerating adoption of video-enabled equipment in the enterprise, standardization must finally be addressed.

The Need For Standardization

Innovation followed by the inevitable race to commercialize typically leads to a slew of feature-rich products that don’t always work together. This is particularly problematic for those purchasing video-enabled equipment in the enterprise Unified Communications (UC) segment. Don’t despair, though. A strong thrust within the industry is working to unify some of these disparate means of communication.

Even though most video-conferencing tools employ open standards for many aspects, such as video encoding, Internet Protocol (IP) transport, or even call control, they remain incompatible. Such incompatibility often is caused by the introduction of proprietary features. Therefore, getting to market quickly while maintaining interoperability requires disabling these feature sets and downgrading other specifications.

Yair Wiener, CTO of Radvision, is one of the pioneers of the industry and co-chair of the Unified Communications Interoperability Forum (UCIF) H.264 SVC technical working group. He says that “while most traditional video-conference systems follow the same ‘standards’ (mainly H.323, H.264 etc), this is not the case with UC solutions. Currently, we have multiple UC solutions that are not interoperable with each other.”

“This situation becomes more problematic with the introduction of SVC (Scalable Video Coding),” he continues. “The complexity of the standard, as well as lack of standardization of different parts of the solution (e.g., error resilience mechanism), introduces a big challenge. The goal of the H.264 Technical Working Group (within UCIF) is to define the subsets of the standards that are required for interoperability.”

The UCIF has long recognized the issue of interoperability (specifically for the federation of UC Systems) as being one of the most “painful” points for customers. The “plugfest” approach, also known as bilateral testing, has become increasingly expensive and lengthy, especially as the market expands outside of traditional telecom equipment. Vendors thus turn to unilateral testing, which places considerable burden on individual vendors and results, at best, in mediocre interoperability coverage. To remedy this issue, the UCIF is working to create and certify UC interoperability scenarios.

True Video Conferencing

Many different Web-based services continue to expand the reach of video conferencing. In fact, Google now provides its WebRTC (Web Real-Time Communication) framework as open source. Operators such as Bell Canada offer Facebook-to-mobile video calling. Open-source players are also diving into the fray, offering lower-cost alternatives.

Not all of the above products offer true video conferencing, though. Most people still confuse two-way calling with video conferencing (Fig. 1). Two-way calling is much simpler. It usually doesn’t require any intermediate processing of the audio/video media. The two endpoints establish the call, sometimes with the help of a directory server, and then the audio/video streams flow directly between the two parties.

A video conference, on the other hand, requires a multipoint control unit (MCU). The MCU is responsible for mixing the various audio/video streams, laying out the various images onto the same screen. Very often, transcoding or resizing also is necessary. While many vendors do offer two-way calling, the MCU function has often been elusive due to signaling and media-processing complexity.
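The layout step can be illustrated with a minimal sketch. It assumes a simple equal-size grid on a 1280-by-720 canvas; the function name `grid_layout` and the grid policy are illustrative assumptions, not a description of any particular MCU product, which may use speaker-focused or per-participant layouts instead.

```python
import math


def grid_layout(num_streams, canvas_w=1280, canvas_h=720):
    """Return one (x, y, width, height) tile per incoming video stream.

    Hypothetical policy: a near-square grid of equal tiles. Each stream
    would then be resized to its tile before being composed on the canvas.
    """
    if num_streams < 1:
        raise ValueError("at least one stream is required")
    cols = math.ceil(math.sqrt(num_streams))
    rows = math.ceil(num_streams / cols)
    tile_w = canvas_w // cols
    tile_h = canvas_h // rows
    return [
        ((i % cols) * tile_w, (i // cols) * tile_h, tile_w, tile_h)
        for i in range(num_streams)
    ]


if __name__ == "__main__":
    # A 10-way call lands on a 4-by-3 grid of 320x240 tiles.
    for tile in grid_layout(10):
        print(tile)
```

Even this toy version shows why the MCU is costlier than two-way calling: every incoming stream has to be decoded, resized to its tile, composed, and re-encoded for each receiver.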

Affordable High-Performance Solutions

Video conferencing in the enterprise has experienced many false dawns, with potential benefits being quickly negated by ROI calculations and poor user experience (UX). The most obvious reasons for this involve technology cost, support cost, and technology performance.

Though this has begun to change, affordable systems have historically offered only low-resolution, low-frame-rate video. After the novelty of seeing the other party wears off, it becomes apparent that these systems lack many of the basic elements needed for natural and productive human interactions. Humans take cues from many micro gestures. These micro gestures are difficult to detect in anything other than full-on high definition (HD).

Until recently, HD video encoders and MCUs were too expensive for most small and medium companies, and certainly too expensive for personal use. Those that were deployed were expensive and bulky, and they required heavy lifting from the IT department. Consequently, at best they became a niche market reserved for government, blue-chip companies, and educational institutions. For the most part, though, they were unusable.

Recent advances in technology, ranging from HD tactile LCD screens to online collaboration tools and high-performance multicore DSPs, now make it possible to design affordable HD video-conferencing systems for enterprise or carrier networks.

Hardware plays a vital role in user adoption, but the arrival of DSP solutions that come “out-of-the-box” ready has really made the difference. Full-function, scalable MCU software is now available on several DSP platforms. Up to this point, complete hardware and software solutions that take the complexity out of building embedded video-conferencing systems were nonexistent.

Key software blocks include high-quality video codecs, a broad range of narrowband and wideband audio codecs, audio/video synchronization, sophisticated audio and video mixers, and a legacy telecom software suite. Some of the more advanced offerings include advanced quality-enhancement algorithms such as packet-loss concealment, as well as the ability to offer personalized screen layouts for each participant.

Armed with this new generation of integrated MCU solutions, system implementers can focus their attention on product differentiation and on developing products with a better user experience. Key factors influencing the overall user experience include:

Round-trip delay: The act of encoding (and decoding) audio and video inevitably incurs delay (Fig. 2). No doubt, then, that minimizing this delay becomes a critical design criterion. Generally speaking, one-way delay of less than 150 ms is desirable, while greater than 400 ms is unacceptable.

There are numerous ways to reduce video-encoding delays. When using a multicore DSP, the encoding or decoding task can be distributed and processed in parallel across multiple DSP cores. Advanced scheduling algorithms can reduce this delay to well below the 33-ms frame period of 30-frame/s video.
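The delay arithmetic can be sketched as a simple budget. The 150-ms and 400-ms thresholds come from the text above; the stage names, the per-stage values, the "marginal" label for the range in between, and the perfectly linear multicore speedup are all illustrative assumptions.

```python
def frame_period_ms(fps):
    """Time between consecutive frames, e.g. about 33 ms at 30 frames/s."""
    return 1000.0 / fps


def ideal_parallel_encode_ms(single_core_ms, cores):
    """Idealized encode time when a frame is split evenly across cores.

    Assumption: no synchronization or data-sharing overhead, which a real
    multicore scheduler would have to account for.
    """
    return single_core_ms / cores


def classify_one_way_delay(delay_ms):
    """Apply the thresholds quoted in the text to a one-way delay."""
    if delay_ms < 150:
        return "desirable"
    if delay_ms > 400:
        return "unacceptable"
    return "marginal"  # hypothetical label for the 150-to-400-ms range


if __name__ == "__main__":
    # Hypothetical one-way budget, in milliseconds per stage.
    budget = {
        "capture": 33,
        "encode": ideal_parallel_encode_ms(40, 4),
        "network": 40,
        "jitter_buffer": 30,
        "decode": 10,
        "render": 17,
    }
    total = sum(budget.values())
    print(total, classify_one_way_delay(total))
```

A budget like this makes it easy to see which stage dominates and how much headroom a faster encode path actually buys.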

Audio/video synchronization: Viewers expect to see the speaker’s lips move in step with the audio. There are limits, though, on how far the audio and video streams can be out of sync before perceived lip sync is lost. As audio/video skew approaches 50 ms, the most discerning of viewers will begin to notice the impact of poor synchronization. As the skew increases beyond 50 ms, it generally becomes more bothersome and distracting. Sensitivity to skew varies greatly from person to person.

To avoid this problem, media-processing software must use precise time-stamping at the input of the audio and video mixers to ensure that they’re being mixed on a coordinated time-base.
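As a rough sketch of what a coordinated time-base means in practice, the example below converts audio and video timestamps, which typically run on different clock rates, to a shared millisecond clock and compares them against the 50-ms figure cited above. The function names and the reference-point mapping are assumptions for illustration; timestamp wraparound and clock drift are ignored.

```python
def to_common_ms(ts, ref_ts, ref_wallclock_ms, clock_rate_hz):
    """Map a media timestamp onto a shared wall-clock time base.

    (ref_ts, ref_wallclock_ms) is a known pairing of a media timestamp with
    wall-clock time for that stream; how it is obtained is left open here.
    """
    return ref_wallclock_ms + (ts - ref_ts) * 1000.0 / clock_rate_hz


def av_skew_ms(audio_ms, video_ms):
    """Positive when video is presented later than the matching audio."""
    return video_ms - audio_ms


def lip_sync_ok(skew_ms, threshold_ms=50.0):
    """True while the skew stays within the threshold quoted in the text."""
    return abs(skew_ms) <= threshold_ms


if __name__ == "__main__":
    # Hypothetical samples: 48-kHz audio clock, 90-kHz video clock.
    audio_ms = to_common_ms(48000, 0, 1000.0, 48000)
    video_ms = to_common_ms(93600, 0, 1000.0, 90000)
    skew = av_skew_ms(audio_ms, video_ms)
    print(skew, lip_sync_ok(skew))
```

Once both streams are expressed on the same clock, the mixers can delay whichever stream is ahead so that the skew stays inside the tolerable range.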

Audio quality: While image quality, bandwidth efficiency, or other flashy marketing features like touchscreen controls often are emphasized, the most important feature for a video-conferencing system remains the audio path. If the audio stream fails, the conference is rendered useless.

Audio quality may be degraded due to the quality of the mixer, network impairments (e.g., lost packets), and/or environmental factors (e.g., echo, ambient noise). A complete software solution must provide algorithms to reduce or eliminate the effects of these impairments. Acoustic echo cancellation, noise reduction, and packet-loss concealment make up the minimum set of voice-quality-enhancement tools. As wideband codec support becomes more prevalent, these algorithms must be upgraded to act on the full spectrum.

Video quality and bit rate: There are many aspects to video quality. The first is the fundamental quality of the codec. Given the variable bit-rate (VBR) nature of video encoding, codec design embraces the art of preserving the highest possible quality at the lowest possible bit rate. Using more advanced coding tools, such as CABAC in H.264, leads to sizable bandwidth savings, in some cases 15% to 30%. Some of the more recent video-conferencing systems advertise H.264 High-Profile as a distinguishing advantage, highlighting the higher quality that’s possible on lower-bandwidth connections.

Another important aspect to video quality concerns performance in non-ideal situations, i.e., when faced with network impairments like packet loss. This can lead to many different visual artifacts, such as blockiness, trails, or object persistence across many frames.

Most of today’s high-end systems offer some form of packet-loss concealment to replace lost information. Technique #1 reuses the data from the previous frame (Fig. 3). Since video is based on a sequence of images, this represents a good approximation, but the results are less than desirable. Technique #2, deployed in high-end solutions, extrapolates where the motion was headed and continues along that path. It increases the chances of plugging the gap correctly, an advantage over technique #1.

Lost data is recreated by using packet information from previous frames or surrounding macro-blocks. Turnkey software solutions for video MCUs now include impressive packet-loss concealment algorithms.
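The difference between the two techniques can be shown with a toy sketch on a tiny grayscale frame. It is not how any particular product works: frames are plain lists of pixel rows, the lost region is a square block, and the motion vector is simply passed in, whereas a real decoder would estimate it from previous frames or neighboring macro-blocks. All function names are hypothetical.

```python
def make_frame(bar_col, width=8, height=8):
    """Test frame: black background with one bright vertical bar."""
    return [[255 if c == bar_col else 0 for c in range(width)]
            for _ in range(height)]


def conceal_by_copy(prev_frame, x, y, size):
    """Technique #1: reuse the co-located block from the previous frame."""
    return [row[x:x + size] for row in prev_frame[y:y + size]]


def conceal_by_motion(prev_frame, x, y, size, motion):
    """Technique #2: assume the content kept moving by `motion` = (dx, dy)
    pixels per frame and fetch the block from where it came from."""
    dx, dy = motion
    height, width = len(prev_frame), len(prev_frame[0])
    block = []
    for j in range(size):
        row = []
        for i in range(size):
            # Clamp source coordinates so they stay inside the frame.
            sx = min(max(x + i - dx, 0), width - 1)
            sy = min(max(y + j - dy, 0), height - 1)
            row.append(prev_frame[sy][sx])
        block.append(row)
    return block


if __name__ == "__main__":
    prev = make_frame(bar_col=2)  # bar at column 2
    cur = make_frame(bar_col=4)   # bar moved 2 pixels to the right
    truth = [row[4:8] for row in cur[0:4]]  # the block that was lost
    print(conceal_by_copy(prev, 4, 0, 4) == truth)
    print(conceal_by_motion(prev, 4, 0, 4, (2, 0)) == truth)
```

In this contrived case the plain copy misses the bar entirely because it has moved into the lost block, while the motion-based guess recovers it. With fast, erratic motion, as in the football image of Fig. 3, the extrapolated motion can of course be wrong as well.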

Security (encryption of media streams): Around 2008, full support for encryption of audio and video streams began appearing in systems. Today, most systems support SRTP (Secure Real-Time Transport Protocol) encryption of all media, including shared presentations, and transport layer security (TLS) protection for signaling protocols. Ideally, designers should pick a software solution that supports at least 256-bit AES encryption. From a hardware point of view, the selected DSP must be efficient at encrypting the traffic.

Codec support: The continually evolving field of codecs impacts both audio and video compression. It’s good news for the UX, but presents a difficult challenge for the MCU. Being at the heart of the conference, the MCU ideally should be codec-agnostic. Since this is unlikely, the MCU must support a broad range of codecs and provide the operator with a seamless upgrade path.

As a whole, the industry has embraced the concept of open standards, in part to facilitate interoperability. This shift has become more apparent as new technologies and new norms replace the traditional monopolistic telecommunication modus operandi.

The G.7xx series of audio codecs from the International Telecommunication Union (ITU) offers a good selection of wideband audio codecs. However, industry leaders are now coming together to standardize an unencumbered audio codec named “Opus.” It provides better audio quality than the current ITU codecs and is considered “free” since it’s not subject to any intellectual-property licensing.

In the video arena, the latest example can be found in Google’s approach to the VP8 video format. After purchasing On2 Technologies in early 2010, Google provided an irrevocable patent promise for underlying patents regarding the VP8 format. Google followed up by releasing the source code for libvpx (a reference implementation of VP8) under a BSD-like license, effectively making it accessible to everyone.

Conclusion

At long last, large-scale adoption of affordable video-conferencing solutions is possible. While they can never serve as a replacement for face-to-face meetings, video conferences can provide a usable and acceptable analogue.

The industry is finally in the midst of a perfect storm—a coming together of user demand, technology advancements, and standardization. Many equipment vendors, new to the video space, are building MCU equipment for this emerging market and leveraging the available DSPs and software solutions to get to market first.