Manage Network Traffic In Your 100-Gbit/s Designs

March 7, 2011
By taking advantage of the latest programmable packet-processing devices with integrated traffic managers, designers can create cost-effective Carrier Ethernet platforms.

Per Lembre
[Figure and table captions: Fig. 1, the network processor must handle a range of applications; Fig. 2, 100G Ethernet requires a single scheduler; Fig. 3, the queue manager writes packets to DRAM; table, key traffic-manager features.]

Traffic-manager design was once a straightforward process for packet-switching systems operating at a few gigabits per second. Packet buffers could simply be implemented in high-volume, power-efficient, cost-effective DRAM. That process is now much more complicated.

Some of the basic design assumptions just don’t apply in today’s high-performance networking systems, where single processing devices control 100 Gbits/s of traffic. To better understand the current environment, this article explains the limitations and explores how packet buffers and traffic-management logic can be designed to meet the requirements for 100-Gbit/s designs and beyond.

TMs In Today’s Broadband Networks

Internet routers, access switches, and mobile gateways include deep packet buffers and logic to manage traffic in times of congestion. This combination of buffering and control logic, generally called a traffic manager (TM), plays a critical role in modern broadband networks. TMs are used mainly for two purposes: to enforce service-level agreements as contracted between service providers and their customers, and to maintain quality-of-service (QoS) attributes for traffic across network bottlenecks such as switching or compute resources.

TMs and their associated packet buffers are often implemented in dedicated devices built with fast silicon. Recent network processors (NPUs) instead integrate this logic to provide consistent packet processing and traffic management in a single device, reducing the bill of materials (BOM) and power consumption.

NPUs are quickly becoming popular in the latest carrier-networking products. Consequently, we can expect traffic-management capabilities, which were previously limited to high-end broadband edge routers, to become more generally available in cost-optimized networking nodes (e.g., Optical Line Terminal, Metro Ethernet aggregation, packet-optical transport systems).

Traffic Engineering Objectives

Implementation of today’s broadband networks involves systems from a range of vendors. Each system has its particular feature set for managing traffic flows and maintaining QoS attributes. The number of queues in an aggregation switch may be limited to a few hundred, while an edge router may provide hundreds of thousands of queues per line card. Consequently, service providers have developed best-practice network configurations to meet service contracts and traffic objectives with the hardware available.

In practice, some models for traffic engineering in Internet Protocol/multiprotocol label switching (IP/MPLS) networks are widely accepted. However, no two carrier networks feature identical implementations, complicating providers’ traffic engineering objectives.

A high-performance and highly advanced TM with a rich feature set is therefore viewed as a strategic product advantage. Service providers can configure the TM to meet their objectives and seamlessly integrate with other systems in heterogeneous networks. A modern TM must implement a range of features to meet these objectives (see the table).

Wirespeed Classification

Integrated chip designs implement traffic-engineering features in two different subsystems: the packet-processing subsystem and the traffic-management subsystem (Fig. 1). Note that both subsystems must be designed for 100-Gbit/s operation. Failing to support wirespeed operation in packet processing will eventually cause the complete system to behave unpredictably, dropping packets without any control.

The packet-processing subsystem is responsible for classifying user traffic into flows and associating the correct class-of-service attributes. This is often called traffic conditioning and is carried out during ingress processing, on the incoming side of the line card. Meters check whether traffic is within or out of contract. If within the guaranteed contract, the packet is colored green. If within accepted excess rates, it’s colored yellow. Red is used for packets strictly out of contract.
The actual drop decisions for packets are made either within this logic or at a later stage. If there’s a strict policy for out-of-contract traffic, the packet-processing subsystem may police the traffic down by dropping packets above a defined threshold.
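To make the coloring step concrete, here is a minimal two-rate, dual-token-bucket meter in Python. It is a sketch only, in the spirit of the two-rate three-color markers used in carrier equipment; the class name, rates, and burst sizes are illustrative rather than taken from any particular product.

import time

class TwoRateMeter:
    # Colors packets green, yellow, or red against a committed and an excess
    # rate using two token buckets. A sketch; parameters are illustrative.
    def __init__(self, cir_bps, cbs_bytes, eir_bps, ebs_bytes):
        self.cir = cir_bps / 8.0   # committed rate, bytes/s
        self.eir = eir_bps / 8.0   # excess rate, bytes/s
        self.cbs, self.ebs = cbs_bytes, ebs_bytes
        self.c_tokens, self.e_tokens = cbs_bytes, ebs_bytes
        self.last = time.monotonic()

    def color(self, pkt_len):
        now = time.monotonic()
        elapsed, self.last = now - self.last, now
        # Refill both buckets, capped at their burst sizes.
        self.c_tokens = min(self.cbs, self.c_tokens + elapsed * self.cir)
        self.e_tokens = min(self.ebs, self.e_tokens + elapsed * self.eir)
        if pkt_len <= self.c_tokens:   # within the guaranteed contract
            self.c_tokens -= pkt_len
            return "green"
        if pkt_len <= self.e_tokens:   # within the accepted excess rate
            self.e_tokens -= pkt_len
            return "yellow"
        return "red"                   # strictly out of contract

meter = TwoRateMeter(cir_bps=50e6, cbs_bytes=10000, eir_bps=100e6, ebs_bytes=20000)
print(meter.color(1500))   # "green" at first; yellow, then red, under sustained load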

Alternatively, the subsystem may forward the packet for further processing even if it’s out of contract, taking additional information into account, such as TM queue status as well as filter and forwarding updates. By allowing the packet-processing subsystem to make drop decisions, counting operations for statistics and service-level agreement (SLA) reporting can be executed in direct connection with the drop decision, making them highly accurate for accounting purposes.

Once classified, the flows are enqueued to the traffic-management subsystem, where the packets are buffered in deep off-chip memories. The system includes advanced scheduling logic, which can be flexibly configured to manage flows per service, user, group, and port.

This component of the TM is critical because the logic that makes scheduling decisions must be both advanced and flexible. Its task is to select which queue’s packet should be transmitted to the outgoing port in the next clock cycle. Scaling such a decision process to tens or hundreds of thousands of queues at 100 Gbits/s is a daunting task.

For emerging 100GE applications, the performance of the scheduling logic can’t be increased by dividing the task among multiple schedulers (Fig. 2). Using multiple schedulers would create packet-reordering problems and add load-sharing complexity. As a result, 100GE requires a single scheduling decision tree.

More Flexible Scheduling

To be applicable in many applications and support the various traffic-engineering models in today’s carrier networks, the hierarchy of scheduling nodes must be extremely flexible. The operator should be allowed to flexibly map queues to the first scheduling level.

In service gateways, it’s popular to associate queues with services and then group the queues together for a user. Here, two to eight queues are typically allocated per user. For other applications, such as data-center load balancing, the operator may want to aggregate flows based on traffic types and current load on compute resources. This results in very different queue structures and scheduling configurations.

The scheduler implements a range of algorithms (round robin, deficit weighted round robin, and strict priority queuing). Available on each scheduling level, the algorithms can be used flexibly to achieve the desired traffic characteristics in terms of packet loss, delay, and bandwidth distribution between users and services.
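As an illustration of one of these algorithms, the following Python sketch implements deficit weighted round robin over a set of queues. It is a simplified software model (queue names and quantum values are hypothetical), but it shows the key property: each queue’s bandwidth share is proportional to its quantum, regardless of packet sizes.

from collections import deque

class DwrrScheduler:
    # Deficit weighted round robin: each queue earns credit ("deficit") in
    # proportion to its quantum. A sketch; values are illustrative.
    def __init__(self):
        self.queues = {}   # name -> {"pkts": deque, "quantum": int, "deficit": int}

    def add_queue(self, name, quantum_bytes):
        self.queues[name] = {"pkts": deque(), "quantum": quantum_bytes, "deficit": 0}

    def enqueue(self, name, pkt_len):
        self.queues[name]["pkts"].append(pkt_len)

    def service_round(self):
        # One DWRR round: every backlogged queue earns its quantum and may
        # send packets until its deficit counter is exhausted.
        sent = []
        for name, q in self.queues.items():
            if not q["pkts"]:
                q["deficit"] = 0   # idle queues accumulate no credit
                continue
            q["deficit"] += q["quantum"]
            while q["pkts"] and q["pkts"][0] <= q["deficit"]:
                pkt = q["pkts"].popleft()
                q["deficit"] -= pkt
                sent.append((name, pkt))
            if not q["pkts"]:
                q["deficit"] = 0   # classic DRR resets credit when a queue empties
        return sent

Giving one queue a quantum four times another’s yields roughly a 4:1 bandwidth split whenever both are backlogged.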
The scheduling hierarchy also includes shapers to enforce bandwidth attributes. This is achieved by implementing dual token buckets, one controlling the committed information rate (CIR) and one controlling the excess information rate (EIR).
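A dual token bucket of this kind can be pictured as follows. This is a conceptual sketch, not a hardware design; in a real TM the refill and eligibility checks are pipelined logic, and the parameters here are invented for illustration.

class DualTokenBucketShaper:
    # Shapes a scheduling node with two buckets, one refilled at the CIR
    # and one at the EIR. A sketch of the concept only.
    def __init__(self, cir_bps, eir_bps, burst_bytes):
        self.cir = cir_bps / 8.0   # committed rate, bytes/s
        self.eir = eir_bps / 8.0   # excess rate, bytes/s
        self.burst = burst_bytes
        self.c_tokens = self.e_tokens = float(burst_bytes)

    def refill(self, elapsed_s):
        self.c_tokens = min(self.burst, self.c_tokens + elapsed_s * self.cir)
        self.e_tokens = min(self.burst, self.e_tokens + elapsed_s * self.eir)

    def eligibility(self, pkt_len):
        # Returns "cir" if the node may send within its committed rate,
        # "eir" if only as excess traffic, or None if it must hold the packet.
        if pkt_len <= self.c_tokens:
            self.c_tokens -= pkt_len
            return "cir"
        if pkt_len <= self.e_tokens:
            self.e_tokens -= pkt_len
            return "eir"
        return None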

Services are given different priorities by configuring their associated queues with a corresponding priority level. This information should be allowed to propagate to any given point in the scheduling hierarchy. Every scheduling level can then be priority-aware, giving voice traffic priority throughout the hierarchy without jeopardizing SLA conformance (e.g., at the user level).

Last but not least, the scheduler includes a backpressure mechanism. Other subsystems, such as outgoing Ethernet ports, use it to signal their receive status and avoid buffer overrun. When the signal is asserted, the scheduler throttles traffic to the receiving subsystem until it is released.
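In software terms, the mechanism amounts to an XON/XOFF gate with hysteresis, along these lines (a sketch; the watermark values are illustrative):

class BackpressureGate:
    # A downstream port asserts backpressure near overflow and releases it
    # once its buffer drains; the scheduler checks the gate before dequeuing.
    def __init__(self, high_mark, low_mark):
        self.high, self.low = high_mark, low_mark
        self.xoff = False

    def update(self, fill_level):
        if fill_level >= self.high:
            self.xoff = True    # throttle the scheduler
        elif fill_level <= self.low:
            self.xoff = False   # resume dequeuing

    def may_dequeue(self):
        return not self.xoff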

The TM subsystem also includes a queue manager. It handles buffer reservation, tracks queue fill levels, and suggests drop decisions based on active queue-management algorithms. For UDP-based (User Datagram Protocol) traffic, used in real-time, delay-sensitive applications like voice, it’s better to drop a packet than to delay it in a deep buffer. As a result, this traffic class often uses tail drop, which discards incoming packets once the queue reaches a certain fill level.

Queue management for TCP-based (Transmission Control Protocol) data traffic is best implemented with weighted random early detection (WRED), where the likelihood of dropping a packet increases once queue depth exceeds a certain threshold. Essentially, the method works with TCP’s congestion-control behavior of probing for the effective data rate across the network. Traffic of different colors may be assigned different WRED profiles, allowing TCP sessions with non-conforming SLAs to face a higher drop probability than those with conforming contracts.
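Both drop policies fit in a few lines of Python. The thresholds and probabilities below are illustrative, and production WRED implementations typically operate on an exponentially weighted moving average of queue depth rather than the instantaneous value, so short bursts aren’t penalized.

import random

def tail_drop(queue_bytes, limit_bytes):
    # Tail drop for delay-sensitive (e.g., voice) queues: drop when full.
    return queue_bytes >= limit_bytes

def wred_drop(avg_queue, min_th, max_th, max_p):
    # WRED for TCP queues: drop probability rises linearly between the two
    # thresholds. Per-color profiles would simply use different
    # (min_th, max_th, max_p) triples for green, yellow, and red packets.
    if avg_queue < min_th:
        return False               # below the threshold, never drop
    if avg_queue >= max_th:
        return True                # above it, always drop
    p = max_p * (avg_queue - min_th) / (max_th - min_th)
    return random.random() < p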

It’s difficult to design hardware support for queue management and scheduling targeted at 100-Gbit/s Ethernet wirespeed operations. Since the problem can’t be parallelized, implementations tend to compromise on flexibility, features, or performance.

The Packet Buffer Challenge

The packet buffer consists of logic and memory to read and write packets to and from the queues. Key design criteria include low cost, low power dissipation, and dynamic sharing of buffer memory among the queues. Low cost and low power dissipation can be achieved with high-volume, low-cost DDR3 DRAM for queue memory. Dynamic usage of memory is attained via a page-based scheme, where queues allocate new pages on demand and unused pages are tracked in a linked list.
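A minimal model of such a page-based buffer looks like this. It is a sketch under simplifying assumptions: a Python deque stands in for the hardware free-page linked list, and the page size is illustrative.

from collections import deque

PAGE_SIZE = 2048   # bytes per page; illustrative

class PageBuffer:
    # Page-based shared packet buffer: all unused pages sit on one free list,
    # and queues allocate pages on demand, so memory is shared dynamically.
    def __init__(self, total_pages):
        self.free = deque(range(total_pages))
        self.queues = {}   # queue id -> list of allocated page ids

    def enqueue(self, qid, pkt_len):
        pages_needed = -(-pkt_len // PAGE_SIZE)   # ceiling division
        if pages_needed > len(self.free):
            return False                          # buffer exhausted: drop
        pages = [self.free.popleft() for _ in range(pages_needed)]
        self.queues.setdefault(qid, []).extend(pages)
        return True

    def dequeue(self, qid, pages_used):
        # Return a departed packet's pages to the free list for reuse by any queue.
        for _ in range(pages_used):
            self.free.append(self.queues[qid].pop(0))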

This concept works well for designs targeting applications of a few gigabits per second. The access time to write to or read from DRAM is slightly more than 50 ns, which is plenty of time to keep up with the interface speed. Things become much more complicated in today’s high-speed designs, where a single device manages a single stream of 100-Gbit/s traffic. Here, the inter-arrival time for smallest-size packets is as little as 6.6 ns. So, what’s the best way to overcome the random-access-time properties of DRAM?
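(For reference, the inter-arrival figure follows from the minimum Ethernet frame size. A 64-byte frame plus 8 bytes of preamble and a 12-byte interframe gap occupies 84 bytes, or 672 bits, on the wire; 672 bits divided by 100 x 10^9 bits/s gives roughly 6.7 ns per packet, with slightly less counted overhead yielding the 6.6-ns figure.)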

One common solution is to store the linked list, which holds the page information, in SRAM. SRAM delivers much higher performance than DRAM. However, it comes at a significantly higher price, approximately 500 times that of DRAM. Consequently, vendors tend to trade performance against cost, designing their TMs to perform well under low load or an “Internet traffic mix.” Because such a design can’t support wirespeed for smallest-size packets, the whole system will start to drop packets in an uncontrolled manner beyond a given load.

New packet-buffer design innovations allow DRAM to be used for the complete queue manager and buffer system without relying on SRAM, leading to reduced cost and low power dissipation. As such, traffic management is possible for a single stream of 100-Gbit/s traffic (Fig. 3). Packet data, packet information, and next pointers are all stored in DRAM.

The queue manager is responsible for the enqueue/dequeue logic between the TM subsystem and the off-chip DRAM. When a packet enters the queue manager, it has already been accepted for enqueuing by the drop unit. The drop unit makes drop decisions based on queue fill level, the drop algorithms associated with the queue, and packet information extracted from the classification process (e.g., class of service and color).

One of the design goals is high memory-bandwidth utilization. This is achieved by dividing packets into pages of data units aligned to the burst size for writes and reads toward the DRAM. In addition, an extra container layer is implemented so the queue manager and scheduler can operate on a higher volume of data than the packet layer provides in legacy queuing implementations. Thus, the queue manager and scheduler can scale to higher data rates, enabling designs of 100 Gbits/s and beyond.
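The description suggests a picture along these lines. This is a speculative sketch of the page and container layers based only on the text above; the page and container sizes are invented for illustration.

PAGE = 64        # bytes, aligned to the DRAM burst size (illustrative)
CONTAINER = 8    # pages per container, so the TM schedules 512-byte units

def pages_for(pkt_len):
    # Split a packet into burst-aligned pages for efficient DRAM access.
    return -(-pkt_len // PAGE)   # ceiling division

class ContainerQueue:
    # Aggregates pages into fixed-size containers so the queue manager and
    # scheduler track larger units than individual packets.
    def __init__(self):
        self.open_pages = 0        # pages in the partially filled container
        self.full_containers = 0   # units the scheduler actually operates on

    def enqueue(self, pkt_len):
        self.open_pages += pages_for(pkt_len)
        while self.open_pages >= CONTAINER:
            self.open_pages -= CONTAINER
            self.full_containers += 1   # a full container becomes schedulable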

Building packet buffers with SRAM performance characteristics and DRAM density characteristics is not an industry illusion. These systems are commercially available today. In fact, some of the latest systems feature rates of 100 Gbits/s and 150 million packets/s and offer sustained enqueuing and dequeuing rates using six 667-MHz DRAM banks. Moreover, the TM design will linearly benefit from future performance increases of DRAM in terms of both data rates and memory depth.

Conclusion

A “more of the same” approach simply will not work when designing TMs for 100 Gbits/s and beyond. Classification will have to perform at wirespeed under all conditions. The internal data rates between subsystems must guarantee 100 Gbits/s and 150 million packets/s. Moreover, the scheduler must be flexible enough to support any use case. Lastly, the packet buffer must preserve or improve the cost and power budgets of previous generations while supporting similar sustained packet rates.

Per Lembre, director of product marketing at Xelerated, holds an MSc degree from the Royal Institute of Technology.
