Multicast, a technique commonly used in networking systems, allows a processing unit to efficiently send a single data stream to multiple destinations at the same time. Typically, the switch employed as the interconnect backbone implements the actual multicast replication function. Because it’s programmable, the switch can duplicate a given packet to any device connected to it.
For example, a server can send a video stream to multiple receivers simultaneously with a single transaction. Since the same packet is sent to all eligible endpoints in a given multicast group, the endpoints also need to be aware of, and support, the multicast protocol to take appropriate action for a multicast message.
Although networking and communications systems have been implementing multicast schemes for some time, the concept of multicast is new to PCI Express (PCIe) systems. The PCI-SIG, the governing body for the PCI Express Base specification, recently ratified a multicast specification in the form of an Engineering Change Notice (ECN) specifically designed for PCIe. Yet there’s an alternative method for implementing multicast: using the integrated direct-memory-access (DMA) function inside a PCIe switch.
To get a sense of PCIe’s role in multicast, it helps to understand conventional PCI, the precursor to the PCIe standard. PCI was designed to be a bus-based protocol shared among several devices. Bus-based protocols have broadcast inherently built in, since data on the bus is seen by all devices. In the case of PCI, broadcast can be implemented when one initiator targets a receiver while other receivers listen in silent mode. This subset of multicast could be implemented using the Special Cycle command defined in the PCI protocol.
Figure 1 illustrates the broadcast mechanism implemented in the PCI bus. The flow process for issuing a broadcast message on the PCI bus is:
• The PCI bus master starts a transaction with the assertion of FRAME#.
• The Special Cycle (broadcast) command is issued in the C/BE[3:0]# lines.
• All slave devices accept the command and data from the master.
For such applications, the multicast function described in the ECN can be implemented using PCIe switches with integrated DMA, which are now shipping in volume. Using the DMA controller in the PCIe switch offers an efficient and attractive way to implement multicast with devices available today.
IMPLEMENTING MULTICAST USING DMA IN A PCIE SWITCH
A DMA engine is typically used to offload data transfers from the CPU’s local memory to devices connected on the other side of the interconnect. Generally, DMA engines reside in endpoints such as storage or network controllers. The DMA controllers on these devices are application-specific and can only transfer data between themselves and system memory. A generic DMA engine can also be used to transfer large amounts of data from one local memory to a remote memory.
PCIe continues to be the interconnect of choice in a wide range of applications across multiple industry segments. Integrated DMA in a PCIe switch provides the capability to move large amounts of data from local memory to devices attached to the switch, freeing CPU cycles for time-critical applications. This capability for offloading the CPU plays an even bigger role in embedded systems running real-time operating systems.
PCIe switches with 16 or fewer lanes that integrate DMA engines have recently become available. The DMA engine in these devices supports four DMA channels, which can be independently programmed and controlled. The DMA engine in the PCIe switch is also very flexible, resulting in a versatile implementation that can be used in a wide range of applications.
The DMA engine appears as another function on the upstream port (Fig. 2). This function has a TYPE 0 configuration header, and it follows the standard PCIe driver model. The driver for the DMA engine programs its DMA channels by writing to internal registers in the DMA function.
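On a platform that follows this standard driver model, binding to the DMA function looks like binding to any other PCIe device. Below is a minimal sketch in the style of a Linux PCI driver; the vendor/device IDs, the BAR index, and the driver name are placeholders rather than values from any particular switch.

/* Hedged sketch: claim the switch's DMA function and map its channel
 * registers. IDs and BAR number are placeholders for illustration. */
#include <linux/module.h>
#include <linux/pci.h>

#define DMA_FN_VENDOR_ID  0x0000          /* placeholder vendor ID */
#define DMA_FN_DEVICE_ID  0x0000          /* placeholder DMA-function ID */

static void __iomem *dma_regs;            /* mapped DMA channel registers */

static int dma_fn_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
    int rc = pcim_enable_device(pdev);
    if (rc)
        return rc;
    /* The Type 0 configuration header of the DMA function declares the
     * BAR that exposes its channel registers (BAR 0 assumed here). */
    dma_regs = pcim_iomap(pdev, 0, 0);
    return dma_regs ? 0 : -ENOMEM;
}

static const struct pci_device_id dma_fn_ids[] = {
    { PCI_DEVICE(DMA_FN_VENDOR_ID, DMA_FN_DEVICE_ID) },
    { }
};

static struct pci_driver dma_fn_driver = {
    .name     = "pcie-switch-dma",
    .id_table = dma_fn_ids,
    .probe    = dma_fn_probe,
};
module_pci_driver(dma_fn_driver);
MODULE_LICENSE("GPL");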
Using the DMA engine in the switch requires software to construct a multicast descriptor ring. Every descriptor in this ring has the same source address, pointing to a single transmit buffer, and a different destination address based on the destination port in the PCIe switch. The number of devices in a given multicast group determines the number of descriptors in the ring. The DMA engine can optionally generate an interrupt upon completion of the multicast ring.
Although multiple DMA channels are supported, a single channel is enough to support the multicast function. Each descriptor in the ring comprises four doublewords (Fig. 3); a hypothetical C layout of this descriptor follows the list:
• A destination address
• A source address
• Transfer size
• Control
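To make the layout concrete, the following C sketch shows a hypothetical descriptor and the construction of one multicast ring. The field order, the control-bit assignments, and the trailing fence convention are assumptions for illustration; the actual format is defined by the switch vendor’s data sheet.

#include <stdint.h>
#include <string.h>

/* Hypothetical four-doubleword descriptor (Fig. 3); field order and
 * control-bit positions are illustrative assumptions. */
struct dma_desc {
    uint32_t dst_addr;    /* destination address (downstream device) */
    uint32_t src_addr;    /* source address (shared transmit buffer) */
    uint32_t xfer_size;   /* transfer size in bytes */
    uint32_t control;     /* valid, interrupt-on-completion, fence, ... */
};

#define DESC_VALID  (1u << 0)   /* ownership: set = DMA engine owns it */
#define DESC_IRQ    (1u << 1)   /* interrupt host when descriptor completes */
#define DESC_FENCE  (1u << 2)   /* stop the engine; do not wrap the ring */

/* Build one multicast ring: the same source buffer, one descriptor per
 * member of the multicast group, plus a fence descriptor at the end. */
void build_multicast_ring(struct dma_desc *ring, uint32_t src, uint32_t size,
                          const uint32_t *dst, unsigned members)
{
    unsigned i;

    for (i = 0; i < members; i++) {
        ring[i].src_addr  = src;        /* constant source address */
        ring[i].dst_addr  = dst[i];     /* per-port destination address */
        ring[i].xfer_size = size;
        ring[i].control   = DESC_VALID;
    }
    ring[members - 1].control |= DESC_IRQ;   /* interrupt after the last copy */

    /* The fence descriptor keeps the engine from wrapping around the ring. */
    memset(&ring[members], 0, sizeof(ring[members]));
    ring[members].control = DESC_FENCE;
}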
The control field in the descriptor enables the driver to request an interrupt. Each descriptor may be transferred as one packet or as multiple packets. The driver examines the destination-port bitmap supplied for the multicast packet and, in turn, inserts a number of descriptors equal to the number of copies to be made.
A separate descriptor ring is set up in host memory for each multicast group supported. The number of members in that particular multicast group determines the number of descriptors in the ring. Each descriptor holds the same source address, while its destination address is set according to the address range of each downstream device in that multicast group.
One descriptor at the end of the ring is used as a fence to stop the DMA engine and keep it from wrapping around the descriptor ring. Once the descriptor ring is written, the CPU informs the DMA engine of the ring’s base address by writing the internal DMA registers and then setting the DMA start bit. The DMA engine reads a descriptor, fetches data from the source pointer, and sends it toward the downstream port.
Once the DMA engine finishes processing the current descriptor, it moves to the next descriptor, which points to the same buffer in local memory but carries the destination address of the next downstream port. The DMA engine can generate an interrupt after each descriptor is done or after the entire descriptor ring is done. Upon completion of a descriptor ring (representing one multicast group), the DMA driver can program the control bit to generate an interrupt to the host processor. At this point, the driver can also free up the source buffer. The overall flow (a condensed C sketch follows the numbered steps) is:
1. CPU programs descriptors in RAM
2. CPU enables DMA
3. DMA reads descriptors in RAM
4. DMA works on four descriptors at a time:
a. DMA reads source
b. Completions arrive in switch
c. Completions are converted to writes
d. DMA writes to destination
e. Repeat for all members of multicast group
f. Interrupt CPU after descriptor (optional)
g. Start next descriptor
5. DMA done—multicast group done
6. Generate interrupt to CPU
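The steps above can be condensed into a short, hedged C sketch. The register layout of a DMA channel (ring base, ring size, start and done bits) is an assumption standing in for the vendor’s register map, and polling stands in for the completion interrupt.

#include <stdint.h>

/* Hypothetical memory-mapped registers of one DMA channel in the switch's
 * DMA function; offsets and bit definitions are placeholders. */
struct dma_chan_regs {
    volatile uint32_t ring_base_lo;   /* descriptor ring base address, low */
    volatile uint32_t ring_base_hi;   /* descriptor ring base address, high */
    volatile uint32_t ring_size;      /* number of descriptors in the ring */
    volatile uint32_t control;        /* bit 0 assumed to be the start bit */
    volatile uint32_t status;         /* bit 0 assumed to flag "ring done" */
};

#define DMA_START  (1u << 0)
#define DMA_DONE   (1u << 0)

/* Steps 1-2: the descriptors are already written in RAM, so point the
 * channel at the ring and set the start bit; the engine then reads the
 * descriptors and copies the source buffer to each destination (steps 3-4). */
void start_multicast(struct dma_chan_regs *ch, uint64_t ring_phys,
                     uint32_t num_desc)
{
    ch->ring_base_lo = (uint32_t)ring_phys;
    ch->ring_base_hi = (uint32_t)(ring_phys >> 32);
    ch->ring_size    = num_desc;
    ch->control      = DMA_START;
}

/* Polling stand-in for steps 5-6: once the ring (one multicast group) is
 * done, the driver can check status and free the shared source buffer. */
int multicast_done(struct dma_chan_regs *ch)
{
    return (ch->status & DMA_DONE) != 0;
}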
For additional convenience and higher performance, these PCIe switches with integrated DMA also provide internal buffer space for up to 256 descriptors (Fig. 4). The multicast descriptor ring can be configured directly on the device. Using the internal descriptor buffer space removes the overhead of prefetching the descriptors from main RAM.
The internal descriptor buffer space, which is shared by all four DMA channels, is automatically divided evenly among all enabled channels. In other words, the user can create multiple rings in the internal descriptor buffer by enabling multiple DMA channels and programming the descriptors accordingly.
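As a rough illustration of that segmentation, assuming the 256-entry on-chip buffer is split evenly among whichever channels are enabled:

/* Hypothetical helper: on-chip descriptors available to each enabled
 * channel, e.g. 1 channel -> 256, 2 -> 128, 4 -> 64. */
#define ONCHIP_DESC_TOTAL 256u

unsigned onchip_desc_per_channel(unsigned enabled_channels)
{
    return enabled_channels ? ONCHIP_DESC_TOTAL / enabled_channels : 0;
}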
The DMA channels in the PCIe switch are independent and can run simultaneously. Each DMA channel can work on separate multicast groups, or multiple channels can collaborate on the same multicast group. The descriptors will need to be configured accordingly to enable the collaboration. When internal (on-chip) descriptors are used, Step 3 (described above) is omitted. All other steps apply.
MULTICAST DRIVERS
A multicast driver will be required regardless of whether the multicast protocol is natively implemented in the switch as defined by the PCI-SIG ECN or implemented using the DMA engine as described earlier. When the DMA engine is used to implement multicast, a software driver implements an application programming interface (API) to provide support for this function. The multicast driver must support the following items (a hedged sketch of such an API follows the list):
• The descriptor base and ring size parameters in the DMA engine must be configured.
• The descriptors are initialized to all zeros, with the descriptor valid bit disabled, which means the driver owns the descriptor.
• The multicast driver has registered for the interrupt.
• The multicast driver is aware of the destination addresses based on a system-wide address map.
• When a packet is designated as a multicast packet, the driver API call forwards the buffer pointer and the set of destination ports (bitmap).
• The multicast driver writes one descriptor at a time and enables the valid bit on the descriptor, which results in the transfer of ownership to the DMA engine.
• Once the DMA engine has completed processing the descriptor, it can optionally write the status back into the descriptor (completion). Furthermore, the DMA engine can be configured to generate an interrupt to the host after it completes processing the entire descriptor ring.
• At this time, the host can verify the completion status and clean up the descriptor ring buffer.
• When a multicast packet fails to reach the desired endpoint, the unsuccessful status can be written back to the appropriate descriptor.
• The multicast driver can write the descriptor again to make sure that the endpoint receives the packet. In a communications system, these endpoints are typically ASICs that are required to be in sync. Implementing multicast using DMA in the PCIe switch allows for this check, which can’t be done when multicast is natively implemented in a switch.
• The multicast driver waits for the next multicast transaction. Note that any PCIe switch port can be the source port for a multicast transaction. Similarly, any PCIe switch port can be the destination port for a multicast transaction, including the source port.
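A hedged sketch of the kind of API call such a driver might expose is shown below. The function name, the port-bitmap encoding, and the per-port address lookup are illustrative assumptions that build on the descriptor and register sketches given earlier.

#include <stdint.h>

/* Assumed helpers: the ring builder and channel kick-off from the earlier
 * sketches, plus an address-map lookup maintained by the driver. */
struct dma_desc;
struct dma_chan_regs;
uint32_t port_base_addr(unsigned port);
void build_multicast_ring(struct dma_desc *ring, uint32_t src, uint32_t size,
                          const uint32_t *dst, unsigned members);
void start_multicast(struct dma_chan_regs *ch, uint64_t ring_phys,
                     uint32_t num_desc);

#define MAX_PORTS 16

/* Illustrative API: copy one transmit buffer to every switch port set in
 * the destination bitmap, using one DMA channel and one descriptor ring. */
int multicast_send(struct dma_chan_regs *ch, struct dma_desc *ring,
                   uint64_t ring_phys, uint32_t src, uint32_t size,
                   uint16_t port_bitmap)
{
    uint32_t dst[MAX_PORTS];
    unsigned members = 0, port;

    /* One descriptor (one copy) per destination port in the bitmap. */
    for (port = 0; port < MAX_PORTS; port++)
        if (port_bitmap & (1u << port))
            dst[members++] = port_base_addr(port);

    if (members == 0)
        return -1;

    build_multicast_ring(ring, src, size, dst, members);
    start_multicast(ch, ring_phys, members + 1);   /* +1 for the fence */
    return 0;
}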
The internal DMA engine in a PCIe switch can be leveraged to perform the multicast function in applications that use small PCIe switches. The advantages of implementing a DMA-based multicast scheme are many: support for an unlimited number of multicast groups (each descriptor ring represents one such group); a reliable multicast scheme (status for each copy made can be provided by the DMA engine); multicast transfers that aren’t limited to posted transactions; and, most importantly, DMA-based multicast provides a PCIe multicast solution now.