System design no longer involves simply seeing a single vision through to completion. It represents the struggle to successfully join together several evolving technologies and standards. Even the development tools, circuit fabrication options, and software development tools needed to design systems are evolving. But somehow, designers must complete their projects faster than ever before to meet market demands.
Programmable logic has emerged as an increasingly useful tool for system designers dealing with these issues. With programmable logic come many decisions, such as which device to use, how to integrate it into an existing design flow, and how to plan the development cycle to maximize its effectiveness. To illustrate the use of programmable logic in system design, we'll examine a Gigabit Ethernet device recently developed by Packet Engines Inc.
The device is a PowerPC-compliant system controller that is the heart of the PR5200, a high-performance wire-speed router designed for the core of enterprise networks. Such devices are replacing traditional routers, whose low performance makes them system bottlenecks. Wire-speed routers perform complex routing functions in custom ICs, making them significantly faster and more cost-effective. One of the major reasons is the system controller, which supplies massive amounts of system bandwidth between the PowerPC and the Gigabit Ethernet switch fabric.
Higher Density, More Flexibility
Familiar to many designers in their role as interface or glue logic, complex programmable logic devices (CPLDs) have undergone many improvements in complexity (density) and flexibility. These advances make it possible, and even desirable, to implement large subsystems on one chip. Add to this the historic advantages of CPLDs (flexibility and rapid design turn-around), and you can apply programmable logic in a wider array of design situations than ever before.
The controller examined here was prototyped in a CPLD. The design supports a 6-Gbit/s memory bandwidth, a 2-Gbit/s direct memory access (DMA) receive channel, a 2-Gbit/s DMA transmit channel, an industry-compliant I2C (inter-IC) interface, and a high-performance 32-bit local bus. When coupled with a local-bus controller, the engine controls the entire computer system. Functionally, the Gigabit Ethernet controller integrates two independent, synchronous DRAM controllers; a receive DMA channel; a transmit DMA channel; a 32-bit local bus; an I2C controller; an interrupt controller; and a system-configuration controller.
An internal, multimaster/multislave parallel-bus structure connects six independent execution units, and allows up to seven concurrent transactions. An internal arbiter coordinates the switching and interconnection of the system execution units. Each execution unit supports a multidepth pipeline that allows for the execution of at least two concurrent transactions within each execution unit.
We estimated that the logic of this design would take about 100,000 gates, plus enough memory to implement a pair of 2-kbyte packet buffers—one for the receive DMA channel and one for the transmit DMA channel. These buffers must each operate as a single-port memory, with a clock-multiplexing architecture that allows access to the memory from two separate clock domains. The first domain frequency is defined by the DMA-channel-operating speed, while the second is defined by the processor-bus speed.
Also, the multiple 32- and 64-bit DMA and memory buses required by the controller, as well as the 64-bit host-system interface, need a device with at least 450 user I/O pins. The multiple buses are needed to support very high system bandwidth. The host interface to the PowerPC processor is a full 64-bit-wide data bus, with support for a two-level pipeline, using a split-bus configuration. Each of the two independent synchronous bus interfaces includes a 64-bit data path, a multiplexed memory address bus, and a control interface. Both DMA channels provide full-duplex operation by employing two independent, 32-bit-wide data paths. The local-bus interface uses a 32-bit-wide data path, with the ability to address a memory space of 128 Mbytes.
The execution units in control of each of the two DRAM interfaces, the receive DMA channel, the transmit DMA channel, the PowerPC interface, and the local bus operate independently of one another. By separating the architecture in this way, the controller can perform multiple transactions in parallel. Additionally, the multiple buses and pipelined structure deliver top-notch performance on the host PowerPC interface.
When To Use CPLDs?
With the size, complexity, and on-chip features of CPLDs on the rise, system designers must keep abreast of the latest enhancements to correctly evaluate when to use them and which ones to use. In general, CPLDs are quickly pushing further into the realm of 100k and more gates, with realizable system speeds of 40 MHz. That combination provides a serious alternative to gate arrays, with the added benefit of the short turnaround. Additionally, features such as improved on-board memory structures, multivoltage cores and I/O capability, and the increasing quality of integration into multiple EDA environments simplify the use of CPLDs in even the most complex systems.
The typical system-design timeline includes distinct stages that might be labeled prototyping, initial manufacturing, and full-scale production. Often there are good reasons to use programmable logic in some or all of these stages of development. In general, they're most compelling in the early stages of the design timeline, but recent advances in CPLDs, as well as lower costs, are making it more attractive to use programmable logic throughout the design's lifetime.
The decision to employ programmable logic at each of these three stages requires an analysis of several issues: gate density, pin density, system performance, time-to-market, unit cost, non-recurring engineering cost, and development risk. In the case of the Ethernet controller, programmable logic makes the most sense during the prototyping and initial manufacturing stages, with a transition to a 0.35-µm gate array or another ASIC option for full-scale production.
The decision to transition to a gate array was driven by the lower cost of a quick-turnaround custom device in the density range (roughly 100 kgates) that could accommodate the controller and deliver the performance requirements of the final implementation.
For lower-density designs that can meet system performance requirements in both programmable and semicustom technology, it's worth performing a cost analysis to determine if PLDs should be used throughout the product's lifetime. At today's PLD volume prices, many designs in the 50-kgate range could warrant full-scale production without any transition to custom ICs.
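The cost analysis mentioned above can be sketched as a simple break-even calculation. This is a minimal illustration, not the method used on the project: the function names and all dollar figures are hypothetical, since the article gives no prices. It assumes the gate array carries a one-time non-recurring engineering (NRE) charge and a lower unit price, while the PLD has no NRE.

```python
def break_even_units(nre_cost, pld_unit_cost, asic_unit_cost):
    """Volume at which the gate array's per-unit saving has repaid its NRE.

    Returns None when the PLD is not more expensive per unit, in which
    case the gate array never pays back its NRE.
    """
    saving_per_unit = pld_unit_cost - asic_unit_cost
    if saving_per_unit <= 0:
        return None
    return nre_cost / saving_per_unit


def cheaper_option(volume, nre_cost, pld_unit_cost, asic_unit_cost):
    """Compare total cost of both options at a given production volume."""
    pld_total = volume * pld_unit_cost
    asic_total = nre_cost + volume * asic_unit_cost
    return "PLD" if pld_total <= asic_total else "gate array"


# Hypothetical figures, chosen only to make the arithmetic easy to follow.
NRE = 100_000        # one-time gate-array charge (assumed)
PLD_PRICE = 150      # per-unit PLD price (assumed)
ASIC_PRICE = 50      # per-unit gate-array price (assumed)

print(break_even_units(NRE, PLD_PRICE, ASIC_PRICE))
print(cheaper_option(500, NRE, PLD_PRICE, ASIC_PRICE))
print(cheaper_option(5_000, NRE, PLD_PRICE, ASIC_PRICE))
```

A real analysis would also weigh the items listed earlier, such as time-to-market and development risk, which a unit-cost comparison does not capture.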
In principle, device selection should be focused on the device's characteristics and the system's requirements. But other issues are also relevant, including which tools will be used to develop the PLD configuration, and how that process will integrate into the existing overall design flow. Despite the importance of these issues (which will be discussed later), device characteristics remain at the heart of this decision.
Logical gates, memory bits, and pin counts, along with the performance characteristics of a specific architecture, are the most useful factors to evaluate. Previous experience with a given device or device family is the best guide for estimating how a design will fit into a PLD. But, when in doubt, vendors generally offer resources for determining the size of the device needed.
In some cases, specific device features, like support for on-chip memory, will further steer the decision. Many high-density PLDs offer ways to implement on-board memory. The two most popular schemes are embedded memory (in which the device includes dedicated memory structures) and distributed memory (where logic resources are converted into memory resources).
Both implementations have advantages and disadvantages. For the Ethernet circuit, a pair of 256-word-by-64-bit memory structures (single-port RAMs) are needed to interface between the DMA engines internal to the controller and a switch fabric. The memories must operate at 41 MHz. For memories of this size and speed, distributed memory is too slow and costly, so devices with embedded memory are the better fit.
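The buffer geometry can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted above (256 words, 64 bits, 41 MHz); the function names are ours, and the peak rate is a raw upper bound for one port that ignores any cycles lost to sharing the port between the two clock domains.

```python
WORDS = 256            # words per buffer (from the article)
WORD_BITS = 64         # bits per word (from the article)
CLOCK_HZ = 41_000_000  # required memory speed (from the article)


def buffer_bits(words, word_bits):
    """Total storage of one buffer, in bits."""
    return words * word_bits


def peak_port_rate_bps(word_bits, clock_hz):
    """Raw upper bound: one full word transferred on every clock."""
    return word_bits * clock_hz


bits_per_buffer = buffer_bits(WORDS, WORD_BITS)
bytes_per_buffer = bits_per_buffer // 8
pair_bytes = 2 * bytes_per_buffer

print(bits_per_buffer, "bits per buffer")
print(bytes_per_buffer, "bytes per buffer")
print(pair_bytes, "bytes for the receive/transmit pair")
print(peak_port_rate_bps(WORD_BITS, CLOCK_HZ) / 1e9, "Gbit/s raw peak per port")
```

Each buffer works out to 2 kbytes, matching the packet-buffer size estimated earlier, and the raw per-port peak sits above the 2-Gbit/s rate of each DMA channel.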
The next consideration is pin count. It's usually a good idea to choose devices that offer more I/O pins than needed. This allows you to address unforeseen changes and modifications, and to add test ports for debugging purposes. There are no hard and fast rules for choosing the right pin count; some designers prefer a buffer of anywhere from 5% to 10% extra I/O pins.
As estimated earlier, the Ethernet controller subsystem requires roughly 450 I/O pins and about 100k gate-array gates (the actual gate total will be reviewed later). An additional 4 kbytes of internal single-port memory are needed for the buffers. Among the embedded-memory PLDs, the EPF10K130 in the Altera FLEX 10K family appears to be a good fit. It offers up to 130,000 usable gates (including 32 kbits of RAM) and 470 I/O pins.
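A first-pass fit check of this kind is easy to script. The sketch below is an illustration only: the `Resources` type and helper names are hypothetical, 1 kbit is assumed to mean 1,024 bits, and the gate comparison is coarse because the quoted 130,000-gate figure includes the RAM.

```python
from dataclasses import dataclass

FIELDS = ("gates", "io_pins", "ram_bits")


@dataclass(frozen=True)
class Resources:
    gates: int
    io_pins: int
    ram_bits: int


def headroom(required, available):
    """Spare capacity of each resource as a fraction of the requirement."""
    return {
        name: (getattr(available, name) - getattr(required, name))
        / getattr(required, name)
        for name in FIELDS
    }


def fits(required, available):
    """True when no resource is oversubscribed."""
    return all(margin >= 0 for margin in headroom(required, available).values())


# Requirement: estimates from the article; RAM is two 256 x 64-bit buffers.
required = Resources(gates=100_000, io_pins=450, ram_bits=2 * 256 * 64)
# Candidate: EPF10K130 figures as quoted in the article (1 kbit = 1,024 bits assumed).
epf10k130 = Resources(gates=130_000, io_pins=470, ram_bits=32 * 1024)

margins = headroom(required, epf10k130)
print("fits:", fits(required, epf10k130))
for name in FIELDS:
    print(f"{name}: {margins[name]:.1%} headroom")
```

With these numbers the device fits, the embedded RAM is fully consumed by the two buffers, and the 20 spare I/O pins amount to a margin of a little under 5%.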
An added advantage is the pin-compatibility between members of the FLEX 10K family. Several PLD suppliers offer this capability, which is useful not only for flexibility with your present design, but also in planning upgrades or cost reductions. Upgrades presumably would require more logic and memory resources while using the same I/O. Accordingly, the EPF10K250, with nearly twice the logic and memory resources of the EPF10K130, could be dropped into the same socket, because it has the same pinout and package options.
Will It Go With The Flow?
Before committing to a specific PLD, look at how it fits into the existing design flow. For most devices, this is not a problem in principle, but the exact details of integrating a PLD into the flow of capture and verification tools will vary depending on the vendor and possibly the device family. Describing all the possible variations that exist could fill another article, so we'll focus on the design of the Ethernet controller, noting the areas that will likely apply to all design flows regardless of tools and methodology.
As with most ASIC design flows, the design of the controller begins with design capture using an industry-standard hardware description language (HDL). Following the successful implementation and verification of the design at the RTL level, the HDL is input into a synthesis tool to create a gate-level representation targeted toward a specific technology. From there, the physical design of the implementation is carried out. When targeting CPLD technology, this step is the responsibility of the logic designer rather than the silicon provider (Fig. 2).
Programmable devices (especially CPLDs) require device-specific design compilation tools that are provided by the vendor. These tools can typically be used either standalone, to develop PLD designs exclusively, or together with gate-array design tools, like those from Cadence and Synopsys, as part of a larger flow.
In this design, the Synopsys synthesizer was directed to produce a netlist that serves as the input to the programmable logic tools. The synthesis process involves the creation of a set of scripts, with a specific synthesis script associated with each Verilog design file. A bottom-up strategy produces gate representations of each synthesizable leaf-level module.
The gate-level output files produced by each of the leaf-level synthesis scripts are stored in a common directory. In the bottom-up synthesis strategy, the gate-level design files are connected (using scripts within the Synopsys environment) at higher levels of the hierarchy, until a top-level design file representing the entire design structure as a hierarchical gate-level netlist is created. This file is then tested against the original test fixture before being transferred from the Synopsys environment to the remaining tools as an Electronic Design Interchange Format (EDIF) hierarchical netlist. It is this netlist that is transferred to the physical place-and-route tools. Here, the physical place-and-route tool was the Altera MAX+PLUS II compiler.
Hierarchical EDIF is useful because it allows the designer to manage the place-and-route timing/area requirements with constraints assigned to any module within the hierarchy, at any level. MAX+PLUS II provides a detailed design hierarchy viewer/editor that serves this purpose well. Logic assignments that will carry over to MAX+PLUS II can also be made from within the Synopsys synthesis tool environment (in the case of Design Compiler, it requires entering commands at the dc_shell prompt).
The PLD vendor's own tools can provide much more accurate post-route timing information than can be provided with a pre-route estimate from the logic synthesis tool. Although this post-route information can be imported into the Synopsys static-timing tool for analysis, in this example, the MAX+PLUS II static-timing analyzer was used to verify the timing of critical paths. To check the functionality of the design, a Verilog file that includes the timing information can be exported from MAX+PLUS II. This file can then be imported into Cadence's gate-level simulator, Verilog-XL.
To cut time-to-market, separate design groups developed the controller's logic in parallel with the pc board and the embedded software. Although programmable logic is designed for flexibility, in many cases, a designer can reap benefits from intelligent placement of I/O pins, depending upon the device architecture and the needs of the design. In this controller, all the pins were expected to be used, so there would be no opportunity to change the pinout after the pc board was completed.
Accordingly, designers identified early the I/O buses that require the most stringent timing, and placed them on pins that corresponded to "rows" in the FLEX 10K device. This placement is best because the FLEX architecture employs rows and columns of interconnect. A simple observation of the FLEX architecture reveals that more I/O pins and logic resources are associated with a given row than with any column.
One implication of this structure is that for applications in which many data buses are passed through several levels of processing, it makes sense to orient these signals along rows. With this I/O placement scheme, the designer can lay out the board before the Verilog code is synthesized, and still achieve the desired timing. The 20 "spare" I/O pins in the device were brought out to probe points for diagnostic use later in the prototyping stage.
Incremental Releases
By incrementally releasing the PLD design, the team allowed early hardware/software integration using strategic subsets of the final PLD logic. That allowed designers to functionally check out portions of the logic as the design progressed, reducing the chance that the final circuit would not work. For the level of complexity in the Ethernet controller, four prototyping releases were made, with the fourth being the first production-ready release.
The first release, which took about five weeks to develop from specification to completion, contained the PowerPC interface, memory controllers, and the superscalar bus fabric. This release allowed software developers to get to work quickly with their PowerPC emulator, testing routines for transactions between the PowerPC and its DRAM.
A second release (about a week later) added the I/O bus, which offered access to several data sources in the overall system. These included a UART (to provide a monitor interface), flash memory, and a PCMCIA port. With this release, the software team developed code for the PowerPC (a 603 in this example) to talk to the data sources. One method is to boot from the flash memory or the PCMCIA port, and load the corresponding instruction sets into the DRAM.
With the inclusion of PowerPC-to-DRAM routines, the software team could focus on developing routines that deal with the new data sources. A few weeks later, a more-complete third version added the DMA engine and the single-port RAMs for communicating between the external Gigabit Ethernet switch fabric and the PowerPC.
The final (fourth) release followed a few weeks after. It contained minor enhancements, and permitted the software team to perform intensive software testing.
Following the formal release of the completed design in PLD format, the team began retargeting it to gate-array technology. The only design difference between the two is the structure of the single-port memory associated with the DMA engines. To manage this situation, the design was configured from its conception to isolate the logic implementing the interface to the memory into a single module within the design hierarchy. That ensured a relatively smooth changeover. But the timing and functionality of this portion of the design had to be carefully scrutinized during Verilog simulation and testing of the actual gate-array design, because it's the one area in which the PLD logic deviates from the gate-array logic.
One of the greatest benefits of using programmable logic is the ability to test real hardware under actual operating conditions. This capability proves the design, and potentially verifies some of its more difficult-to-simulate aspects, such as system timing. The testing was performed in parallel with the completion of the gate-array design, so that any required changes could be made before building the first gate-array samples.
Testing Performed
Also, while the team was retargeting the controller in Verilog to produce the gate-array version, it subjected the PLD implementation to literally billions of Ethernet packets. At this time they fully sounded out the design and tested its capabilities. Exercising the controller design within the context of the physical board, and using the actual software on a physical PowerPC 603, provides a very powerful verification platform. Additional operating insights regarding the logic are also possible, as the spare I/O pins were used to form a probe bus.
A note on the time it takes to compile a PLD design: As with gate arrays, times will vary with the size and complexity of the design and will definitely be a factor in the overall design-cycle efficiency. In this project, compilation times for the early releases were about 20 minutes (using MAX+PLUS II on a Sun UltraSPARC 2-based workstation). The final releases required compile times of up to six hours.
The final production version of the controller, when implemented in the EPF10K130, occupied 82% of the logic resources, and all of the memory resources of the device. (A comment on gate counts: Altera documentation states that the EPF10K130 provides from 82 to 211 kgates, depending on how the logic is implemented, and how the memory structures are used.) In comparison, a gate-array version required 95 "gate-array" kgates and two 2-kbyte single-port RAMs. So by the measure of this design, the total logic elements in an EPF10K130 could provide a maximum of about 115 kgates, and the EABs could provide a total of 4 kbytes of memory.
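The gate-count estimate above is a simple scaling: if 82% of the device's logic held what the gate array implemented in 95 kgates, the full logic array corresponds to 95 / 0.82 gate-array kgates. The short sketch below reproduces that arithmetic; the function name is ours, and the result is a measure for this one design, not a general rating of the device.

```python
def full_device_kgates(gate_array_kgates, logic_utilization):
    """Scale a design's gate-array size up to 100% of the PLD's logic."""
    if not 0 < logic_utilization <= 1:
        raise ValueError("utilization must be a fraction in (0, 1]")
    return gate_array_kgates / logic_utilization


# Figures from the article: 95 gate-array kgates at 82% logic utilization.
estimate = full_device_kgates(95, 0.82)
print(f"about {estimate:.1f} gate-array kgates at full utilization")

# Memory: all EABs were consumed by the two 2-kbyte single-port RAMs.
eab_kbytes = 2 * 2
print(f"{eab_kbytes} kbytes of embedded memory")
```

The result lands just under 116 kgates, in line with the figure of about 115 kgates given above.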
Early releases of a product, thanks to the use of programmable logic, provided extra months of market penetration and revenue generation that would have been lost if a gate-array-only strategy had been employed. Furthermore, the PLD-based version provides an invaluable verification platform for testing the logic design in an actual environment prior to converting it to a semicustom implementation. In addition to mitigating the risk associated with releasing a complex system to custom silicon, the programmable logic approach provides a contingency position just in case something delays the release of the gate-array or ASIC version.
Programmable logic and EDA tool vendors continue to make great strides in integrating PLDs into the familiar gate-array design flows. Moreover, future generations of PLDs will offer the system designer even more gates and memory, allowing direct system upgrades. For example, a future version of the controller will employ a FLEX 10KE device, which offers a higher memory-to-logic ratio than previous FLEX 10K family members. As a result, the FLEX 10KE solution in the same pinout and package could support deeper transmit and receive buffers, improving data bandwidth.