DESIGN VIEW is the summary of the complete DESIGN SOLUTION contributed article, which begins on Page 2.
The relentless rise in network traffic rates and the ongoing shift from circuit-switched to packet-based architectures promise to bring a variety of new challenges to communications-systems design. Many of these systems will reach success only if the design team can maximize data-flow efficiency by building highly efficient and cost-effective subsystems for data segregation, data prioritization, and bandwidth aggregation.
So it's no surprise that memory subsystem design has become a full-time job for system architects and designers of networks, cellular basestations, and data-acquisition systems. In the process, they have seen a growing percentage of their design resources and development time spent on the arduous task of building highly specialized memory subsystems for bandwidth aggregation, data segregation, and data prioritization.
Complicating this task has been the rapidly changing mix of data types running across current networks. While most data remains sequential in nature, the rising use of audio and video has placed new time demands on the transfer process. As a result, there's a new premium on subsystems that can prioritize data as it flows through the system and provide the data-management functions needed to meet the demand.
This article takes a look at four options for building a data-flow-control subsystem: off-the-shelf specialty memories; a custom home-grown solution using an FPGA or ASIC with integrated memory devices; a custom home-grown approach based on external memory and a smaller, more affordable FPGA; and flow-control-management (FCM) ICs. FCM devices are discussed in depth, as the author concludes that these chips may provide the most attractive combination of performance and functionality at low cost.
Specialty Memories
As applications grow in complexity, it might be most practical and cost-effective to implement your memory subsystem design in an FPGA or ASIC. Another home-grown approach is to use external memory and stay with a smaller and more affordable FPGA.
Flow-Control Management Devices
FCM devices combine many characteristics of FIFOs, multi-port SRAMs, and specialty DRAMs with highly optimized flow-control logic in multiple configurations.
A sequential flow-control (SFC) device can be used to build a memory subsystem that transfers large amounts of data in a sequential fashion. It lowers cost, lessens the overhead, and shortens the design cycle.
Sidebar: Inside The Multi-Queue IC
Functionality embedded in a line card's FPGA or ASIC can be replaced by a single-chip multi-queue, flow-control IC with up to 32 discrete queues.
Full article begins on Page 2
The relentless rise in network traffic rates and the ongoing shift from circuit-switched to packet-based architectures promise to bring a variety of new challenges to communications-systems design. Many of these systems will only reach success if the design team can maximize data-flow efficiency by building highly efficient and cost-effective subsystems for data segregation, data prioritization, and bandwidth aggregation.
So it’s no surprise that memory subsystem design has become a full-time job for system architects and designers of networks, cellular basestations, and data-acquisition systems. In the process, they have seen a growing percentage of their design resources and development time spent on the arduous task of building highly specialized memory subsystems for bandwidth aggregation, data segregation, and data prioritization.
Complicating this task has been the rapidly changing mix of data types running across current networks. While most data remains sequential in nature, the rising use of audio and video has placed new time demands on the transfer process. Many of these new data types must arrive on time to function properly. That requirement places a new premium on subsystems that can prioritize data as it flows through the system, and provide the data-management functions needed to meet that demand.
If you’re building a data-flow-control subsystem today, you have four basic options. You can use off-the-shelf specialty memories. You can build a custom “home-grown” solution using an FPGA or ASIC with integrated memory devices. You can take a custom “home-grown” approach based on external memory and a smaller, more-affordable FPGA. Or, you can take advantage of something completely different: the recently introduced flow-control management (FCM) ICs. So let’s take a look at the merits of each approach.
The first question to ask is whether or not you can provide the throughput necessary in your system datapath by building a multichip board solution using off-the-shelf specialty memories. Historically, off-the-shelf specialty memories have represented the most popular solution for network equipment designers building memory subsystems. For example, many designers of SONET, Fibre Channel, and Gigabit Ethernet equipment use low-power FIFOs or dual-port SRAMs for rate matching, bus-width matching, or data-buffering applications. These specialized memories typically feature embedded flag mechanisms that aid in the data-monitoring process. Some dual-port devices even offer user-selectable I/O on each port to interface devices operating at different voltage levels.
Figure 1 illustrates a typical Ethernet router design using low-power FIFOs. In this application, frame traffic enters and exits via the FIFO buffers, and the control logic routes the frames within the system. An on-board microprocessor manages queues and ensures that traffic is routed appropriately throughout the network. A 9-Mbit dual-port SRAM serves as a storage buffer for both the header and payload information. It also acts as a lookup table and scratchpad to assist the microprocessor in performing calculations.
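The embedded flag mechanisms that make these specialty memories useful for data monitoring are easy to picture with a small behavioral model. The sketch below is illustrative only; the depth and the programmable almost-full/almost-empty thresholds are assumptions, not figures from any particular part:

```python
from collections import deque

class FlaggedFifo:
    """Behavioral sketch of a specialty FIFO with embedded status flags."""
    def __init__(self, depth, almost_full=None, almost_empty=None):
        self.depth = depth
        self.buf = deque()
        # Programmable flag offsets, as on typical synchronous FIFOs.
        self.af_level = almost_full if almost_full is not None else depth - 4
        self.ae_level = almost_empty if almost_empty is not None else 4

    def write(self, word):
        if len(self.buf) >= self.depth:
            return False          # full: the write is rejected
        self.buf.append(word)
        return True

    def read(self):
        return self.buf.popleft() if self.buf else None

    @property
    def flags(self):
        n = len(self.buf)
        return {
            "empty": n == 0,
            "almost_empty": n <= self.ae_level,
            "almost_full": n >= self.af_level,
            "full": n == self.depth,
        }

# Rate matching: a fast writer fills the FIFO while a slower reader drains it.
f = FlaggedFifo(depth=16)
for i in range(12):
    f.write(i)
print(f.flags)   # almost_full asserts once occupancy reaches depth - 4
```

In hardware the flags feed the control logic directly, so the upstream device can throttle itself before an overflow ever occurs.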
As long as the data is very sequential in nature, or the design calls for two devices to interface in a random fashion (and system performance is limited to around 50 MHz), a high-speed FIFO or dual-port SRAM can meet your requirements. Moreover, a memory subsystem built around off-the-shelf specialty memories is relatively simple to implement because it requires no additional logic. Component count is low and, because all key components are available off-the-shelf, the subsystem cost remains very attractive.
But if you consider this approach for higher-speed applications, you may run into some obstacles. Memory density is typically the primary restriction. Off-the-shelf FIFOs are currently available on a cost-effective basis only in densities up to 9 Mbits. Plus, vendors have only recently announced a dual-port SRAM at 18 Mbits. If your buffer requirements call for higher densities, or the data-management task demands control of multiple ports and multiple queues, you will need to make a decision. You can build a solution using multiple FIFOs in line—in which case you’ll have to deal with complex protocol translations and multiplexing—or you can look in an entirely different direction.
As network throughput rates rise and applications demand higher levels of content examination and packet manipulation, many subsystems are also gaining complexity. The higher the throughput rate of the network, the faster the processing resources required to support that rate, and the larger and faster the buffers it takes to support line-speed throughput.
If your application is growing in complexity, you’ll want to examine the advantages and disadvantages of building a custom “home-grown” solution. Instead of using off-the-shelf specialty memories, it might be most practical and cost-effective to implement your memory subsystem design in an FPGA or ASIC. You can use integrated memory blocks within the device to meet your requirements for quality of service (QoS), packet prioritization, and data-bandwidth aggregation.
The block diagram of a typical line-card implementation illustrates the advantages of this approach (Fig. 2). Usually, the system connects to a data-communications network, such as a corporate backbone, through a line-interface module. The module consists of the PHY, SERDES, and MAC interface layers of the OSI model. A local processor reduces the computation burden on downstream processing modules. The line card provides bus/rate matching and packet-buffering capability along the datapath between the PHY/MAC circuitry and the processor.
In this home-grown custom approach, that functionality is integrated into a single FPGA or ASIC. Typically, the FPGA or ASIC combines receive and data buffers with rate- or bus-matching logic, supporting bus interface and control logic, I/O logic, and various clock circuits. If the application requires larger buffers to support higher data rates, designers may opt to use DRAM or SRAM that is external to the FPGA.
This approach offers far more flexibility than one built around off-the-shelf FIFOs or multi-port SRAMs. You can program the FPGA to perform the exact functionality required by your application. As programmable logic vendors bring increasingly larger FPGAs to market, this home-grown approach can be used in a wider array of applications. Today you can buy FPGAs featuring millions of logic gates, and large blocks of memory that can support extremely complex memory subsystem designs. Moreover, programmable logic vendors now offer standardized IP blocks for their FPGAs, which significantly simplify the programming and implementation of a design.
The custom home-grown solution is very attractive because it allows the integration of multiple functions into the FPGA. In the process, you minimize component count, thus saving significant amounts of board space. The prospect of collapsing many board functions into one high-density FPGA and building a highly elegant solution is very tempting.
It doesn’t come without its costs, however. If you’re trying to control the flow of multimedia and other types of network traffic at high speed, your design will require very fast and very wide data pipes. Moreover, those applications usually require a higher-density memory. You can support these growing memory requirements by moving to a larger FPGA. But the larger FPGAs that support those memory requirements internally will often be prohibitively expensive. In most cases, you will find that the maximum internal memory density of affordable FPGAs is limited to about 1 Mbit.
As a result, a home-grown FPGA- or ASIC-based solution makes sense from a cost standpoint in applications running at relatively modest data rates and demanding fairly limited-size data buffers. As long as your application runs under approximately 50 MHz, and requires buffers no larger than about 8k by 16 bits, the custom approach offers a viable option. For higher-speed applications that need larger buffers on-chip, the expense of large FPGAs presents a serious impediment to a cost-effective design.
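The buffer limit quoted above is easy to sanity-check against the roughly 1-Mbit internal-memory ceiling. A quick back-of-the-envelope calculation (the four-channel, two-direction configuration is an assumption for illustration, not from the text):

```python
# Internal memory budget of an affordable FPGA, per the figure in the text.
budget_bits = 1 * 1024 * 1024            # ~1 Mbit

# One buffer at the stated practical limit: 8k words of 16 bits.
buffer_bits = 8 * 1024 * 16              # 131,072 bits = 128 kbits

# A hypothetical line card with one such buffer per direction on four
# channels already consumes the entire internal-memory budget, leaving
# nothing for lookup tables or control state.
channels = 4
needed = buffer_bits * 2 * channels
print(needed, budget_bits, needed <= budget_bits)   # exactly at the ceiling
```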
Another “Home-Grown” Option
The other home-grown custom option is to use external memory and stay with a smaller, more-affordable FPGA. Leading programmable-logic vendors, such as Altera and Xilinx, offer blocks they can integrate into their devices for external memory management. But if you choose this approach, you must carefully consider the I/O limitations of programmable logic and their impact on subsystem latency.
A typical FPGA-based solution using external memory presents multiple performance bottlenecks to the designer. As the data enters the FPGA, there’s latency associated with getting onto the device through its programmable I/O. The system then has to assign an address to the data and move it to the external memory device, and the controller must place the data at that address in the external memory. Both operations must pass through programmable-logic gates. When the data is fed back into the system, it must once again pass through the FPGA’s programmable I/O.
Accordingly, during the process of accessing a byte of data, the system must perform four separate accesses to the FPGA and two to the DRAM. With each access representing a separate clock cycle, the potential latency inherent in the system is very significant, even before you factor in the DRAM refresh cycles. Moreover, the data must be continually monitored to ensure that coherency is maintained throughout the system.
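That access count translates directly into a latency floor. A quick sketch of the arithmetic, using the one-cycle-per-access simplification stated above and ignoring DRAM refresh and arbitration overhead entirely:

```python
# Minimum cycles to move one byte through the FPGA-plus-external-DRAM path:
# four FPGA accesses and two DRAM accesses, one clock cycle each.
FPGA_ACCESSES = 4
DRAM_ACCESSES = 2

def min_latency_ns(clock_mhz):
    """Best-case per-byte latency; refresh and coherency checks excluded."""
    cycle_ns = 1000.0 / clock_mhz
    return (FPGA_ACCESSES + DRAM_ACCESSES) * cycle_ns

# At 50 MHz (a 20-ns cycle) the floor is already 120 ns per byte.
print(min_latency_ns(50))   # 120.0
```

Real designs fare worse than this floor, since refresh cycles and coherency monitoring add further stalls.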
You can avoid the latency inherent in an FPGA-based solution by building a high-performance, ASIC-based design with large amounts of internal memory on-chip. But the associated NRE costs usually make this option prohibitively expensive, except for extremely high-volume applications.
Another important factor to consider is design-time costs. Depending upon complexity, designing a custom solution, programming the FPGA, testing, and modeling the system can take weeks, if not months, out of the design cycle. Alternatively, working with off-the-shelf devices with pre-validated models can significantly shorten the process. Obviously, the more time you spend on designing and verifying your system, the longer the design cycle, the higher the costs, and the later your product gets to market.
Flow-Control Management Devices
Until recently, your primary alternatives for memory subsystem design in high-speed network and data-acquisition applications were the aforementioned off-the-shelf specialty memories or custom solutions. Now, a new option has emerged. It combines many benefits of an off-the-shelf approach with the performance attributes of a custom approach.
This new class of products, FCM devices, addresses the need of data flow-control applications by combining many characteristics of FIFOs, multi-port SRAMs, and specialty DRAMs with highly optimized flow-control logic in multiple configurations. These devices feature many control-logic functions found in memory controllers, multi-level queue controllers, and multiplexers/demultiplexers, as well as a variety of sequential blocks and clock-critical circuits. As a result, they can be used to address many of the data-segmentation and prioritization problems associated with controlling multimedia and other time-sensitive network traffic.
These off-the-shelf devices are expressly designed for complex network applications where you need to build a system that meets QoS or data-differentiation requirements, or buffer data streams in parallel. For example, one of IDT’s devices, called the multi-queue, mounts up to 32 synchronous queues on one chip. Each queue can provide independent read and write access simultaneously in different frequency domains, at speeds of up to 200 MHz. The queues in the device share a common data-input bus and a common data-output bus. Data can be written to and read from the device on each rising clock edge, even if the user is switching between queues. You can use integrated bus-matching capabilities to configure read and write buses independently of each other by bus width, speed, or data rate.
A multi-queue flow-control device such as this one can be used in a wide variety of applications, from general switching environments where you need to write or read from any or all ports, to those in which two x18-bit data streams must be combined into one x36-bit stream. You could also use it in a data-mirroring application in a storage area network, where data coming in from a system controller must be dispersed across three separate disk controllers. See Inside The Multi-Queue for more details about read and write operations.
If segregating or prioritizing data at full line rate through multiple queues is required, this device offers a more-integrated approach than using multiple FIFOs, multiplexers, and demultiplexers, plus supporting logic. For instance, if you want to assign Ethernet data coming into your system to one of several queues, depending on some user-defined packet priority, you can use each queue in this multi-queue device to represent a different level of service. To ensure QoS, a local processor can be assigned to run an algorithm to process high-priority packets first.
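The queue-per-service-level scheme described above can be sketched as a behavioral model. This is not the device's actual interface; the four-queue setup and the strict-priority servicing loop (standing in for the local processor's algorithm) are assumptions for illustration:

```python
from collections import deque

class MultiQueueModel:
    """Behavioral sketch of a multi-queue flow-control device:
    one queue per service level, written and read independently."""
    def __init__(self, num_queues):
        self.queues = [deque() for _ in range(num_queues)]

    def write(self, queue_id, packet):
        # Any queue is selectable on the write port.
        self.queues[queue_id].append(packet)

    def read_highest_priority(self):
        # Strict-priority service: drain lower-numbered (higher-priority)
        # queues before touching lower service levels.
        for q in self.queues:
            if q:
                return q.popleft()
        return None

mq = MultiQueueModel(num_queues=4)
mq.write(3, "bulk-data")
mq.write(0, "voice")      # user-defined high-priority traffic
mq.write(1, "video")
print(mq.read_highest_priority())   # voice is serviced first
```

In the actual silicon the queues are serviced at full line rate and the scheduling policy is up to the system designer; strict priority is only one choice.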
The line-card block diagram in Figure 3 shows how the functionality embedded in the FPGA or ASIC in Figure 2 is replaced by a single-chip, multi-queue, flow-control IC with up to 32 discrete queues. Encapsulated voice, video, or data “information grams” are written into each queue via the write port and read from the read port. The device supports writes and reads at frequencies up to 200 MHz, and transfer rates up to 6 Gbits/s.
All data read and write operations are completely independent. You can select any queue on the write port and access any queue on the read port. The queues also support simultaneous write and read operations. Moreover, it’s possible to configure the device for packet-mode operation to efficiently identify the start and end of packets stored in the queues. Other multi-queue devices can support up to four queues that contain up to 5 Mbits of storage with 200-Mbit/s SDR and 400-Mbit/s DDR data rates.
Obviously, this new class of FCM ICs offers designers a number of advantages. Available in a growing variety of configurations, they combine much of the functionality you may need for a complex memory subsystem design in a low-cost, off-the-shelf device. In many cases, they provide more functionality and higher performance than traditional off-the-shelf FIFOs or dual-port SRAMs for virtually the same cost. Those cost savings increase if your primary design alternative is an FPGA- or ASIC-based custom architecture.
For instance, you would need a fairly expensive multi-million-gate FPGA to replicate the functionality in the multi-queue flow-control device described above. One customer found that they could move the functionality from an FPGA to the multi-queue flow-control device and reduce their overall FPGA cost from $600 to $250. And the $50 multi-queue brought the overall subsystem cost down by about 50%.
These devices offer multiple advantages from a system’s perspective as well. Because an FCM IC integrates a variety of functions, including FIFOs, control circuitry, status registers, fan-out circuitry, and line drivers, it can significantly reduce pc-board real-estate requirements. Secondly, it reduces the number of clock line traces on the pc board. In contrast, a traditional multi-chip design built around FIFOs or multi-port SRAMs requires clock lines run to each device. And if you’re building an FPGA-based solution, clock distribution will often be a major design concern. Third, the FCM approach reduces metastability design issues by providing integrated rate matching between asynchronous timing domains, such as 125-MHz Fast Ethernet and 155-MHz ATM.
Using a single FCM device also simplifies pc-board layout, as you’ll only have to run traces to one device. Moreover, to help minimize pc-board trace routing, the multi-queue IC features common data buses for communicating to and from the queues.
It’s also important to consider overall development time. A custom approach will require that you spend considerable amounts of time programming and testing your FPGA. An off-the-shelf FCM device eliminates that portion of the development cycle. Moreover, the FCM further simplifies the design process by supporting a wide variety of standard I/O specifications, including 3.3-V LVTTL, 2.5-V LVTTL, 1.5-V HSTL, and 1.8-V eHSTL. Lastly, because these off-the-shelf devices come fully validated with complete Verilog and IBIS model support, you can save significant time verifying the design.
Until recently, if you were designing a memory subsystem to support the transfer of very high quantities of data in a sequential fashion, an off-the-shelf solution wasn’t an option. Most designers would implement a microcontroller or fairly high-density and expensive FPGA to manage external high-density RAM, and perform queuing in a sequential method. Building this type of custom sequential flow controller would typically involve writing large amounts of software, long design schedules, and relatively high NRE costs.
Today you can use a sequential flow-control (SFC) device to build the same subsystem with less overhead, lower cost, and a shorter design cycle. Featuring a seamless connection to external DDR SDRAM, the single-chip SFC supports management of up to 1 Gbit of data without user intervention. You can design this device into your data-flow stream, and read and write data with minimal control logic interfacing it from any upstream or downstream device. And you needn’t create circuitry to handle DDR SDRAM interactions; the SFC handles that task. Independent read and write ports with associated read and write clocks operate synchronously at speeds up to 166 MHz. The read and write ports can also function asynchronously, enabling the SFC to run without a free-running clock. Consequently, data can be written and read whenever it’s available at each port.
The block diagram in Figure 4 illustrates how this off-the-shelf device can be used in a high-performance CDMA basestation design. In this application, an IDT72T6480 SFC IC is used as a high-speed FIFO to support a Xilinx XC201000 FPGA. The data stream is then fed to an array of Analog Devices’ TigerSHARC DSPs for processing. The off-the-shelf SFC device manages up to 1 Gbit of data, without user intervention, through a seamless interface to external DDR SDRAM.
While the SFC uses off-chip storage to maximize density, it also supports extremely high throughput by adding a number of innovative features designed to cut through the traditional latency issues associated with storing data off-chip. Elastic buffers reduce latency by automatically transferring from input to output when throughput rates permit. Alternatively, a copy of recently requested data can be stored in high-speed caches. This allows the SFC to operate in parallel, supporting upstream devices reading out of one cache and downstream devices reading out of another cache. Simultaneously, it can maintain the bulk-storage SDRAM in the event that data requests outrun the size of the caches. To further minimize latency, the SFC device adds embedded prefetch capability to accelerate data from the off-chip memory device. A user-selectable error detection and correction (EDC) function identifies and corrects data errors when reading from the SDRAM.
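The interplay of elastic buffering, on-chip caching, and bulk SDRAM storage can be modeled roughly in software. The sketch below captures only the ordering behavior: data bypasses straight to the low-latency read path while there is room, spills to bulk storage otherwise, and is prefetched back on reads. Sizes and policies are illustrative assumptions, not the device's actual microarchitecture:

```python
from collections import deque

class SfcModel:
    """Rough behavioral sketch of a sequential flow controller's datapath:
    a small on-chip read cache in front of bulk off-chip storage."""
    def __init__(self, cache_words=8):
        self.cache_words = cache_words
        self.read_cache = deque()    # on-chip, low latency
        self.bulk = deque()          # models external DDR SDRAM

    def write(self, word):
        # Elastic behavior: bypass to the read cache while it has room and
        # no older data is waiting in bulk storage (preserving order).
        if len(self.read_cache) < self.cache_words and not self.bulk:
            self.read_cache.append(word)
        else:
            self.bulk.append(word)

    def read(self):
        word = self.read_cache.popleft() if self.read_cache else None
        # Prefetch: refill the cache from bulk storage in the background.
        while self.bulk and len(self.read_cache) < self.cache_words:
            self.read_cache.append(self.bulk.popleft())
        return word

sfc = SfcModel(cache_words=4)
for w in range(10):
    sfc.write(w)                 # first four land in the cache, rest spill
print([sfc.read() for _ in range(10)])   # reads 0 through 9, in order
```

The point of the model is that the downstream device always reads from the fast path, with the bulk store refilled behind it; that is the latency-hiding idea the SFC's caches and prefetch logic implement in hardware.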
To implement this same functionality in an FPGA, you would have to use a fairly large and expensive programmable device. Moreover, your investment in development time for programming and testing the device, then coping with the invariable latencies associated with moving data through programmable I/O gates, assigning addresses, and moving it out to external memory and back, would be daunting. The SFC allows you to use a smaller, less expensive FPGA, reduce the complexity of your design, and dramatically shorten design time.
So how can you determine which architectural approach best fits your design requirements? As we have seen, a wide variety of issues, from budget restrictions to time-to-market schedules, come into play.
Generally, if the performance and size of the datapath in your design are relatively low, and the access scheme is relatively simple, off-the-shelf FIFOs or multi-port SRAMs probably offer your most cost-effective option. If the complexity of the logic in the design rises, but the performance and size of the datapath remain moderate, you may be able to use the internal memory resources of moderate-size, affordable FPGAs to meet your design needs. As performance requirements rise to OC-3-type levels and the design calls for the complex interaction or manipulation of multiple streams of data, FCM devices offer a number of attractive advantages. In many applications, these off-the-shelf devices may provide the most attractive combination of performance and functionality at low cost.
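That closing guidance can be condensed into a rough rule of thumb. The 50-MHz threshold is the approximate figure quoted earlier in the article; treat this as a sketch of the reasoning, not a design rule:

```python
def suggest_architecture(clock_mhz, complex_logic, multi_stream):
    """Rough rule of thumb distilled from the selection guidance above."""
    # OC-3-class rates or complex multi-stream manipulation point to FCM ICs.
    if multi_stream or clock_mhz > 50:
        return "flow-control management (FCM) IC"
    # Rising logic complexity at moderate rates: FPGA internal memory.
    if complex_logic:
        return "FPGA with internal memory"
    # Low rate, simple access scheme: specialty memories are cheapest.
    return "off-the-shelf FIFO / multi-port SRAM"

print(suggest_architecture(clock_mhz=33, complex_logic=False, multi_stream=False))
```

Real selections weigh budget, board space, and schedule as well, so this function only captures the performance-and-complexity axis of the decision.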