Architectural Advances Propel FPGAs Into High-End ASIC Turf

Finer features and architectural enhancements let the latest-generation FPGAs deliver higher gate counts and application-targeted resources to implement complex systems-on-a-chip.

Dave Bursky

Oct. 18, 2004

The rules of the ASIC game are changing as the cost to fabricate a custom solution creeps past the million-dollar mark. And, new rules are emerging due to the shorter market life of the end system and the continuous need to upgrade or update a product's features. The flexibility of FPGAs to meet these changes hits home at the FPGA's heart—its programmable architecture.

In addition, the old concept of an FPGA being just a collection of configurable gates and programmable interconnects has given way to resource-rich platforms. These new FPGAs contain dedicated but configurable high-speed I/O ports capable of multi-gigabit/s data rates, large blocks of single and multiported memories, phase-locked loops (PLLs), multiplier-accumulators (MACs) for DSP support, 32-bit CPUs, and other dedicated functions. And this is just the beginning (see "FPGAs Enter New Design Territory," p. 78).

To integrate those resources, the companies designing the FPGAs employ advanced processes that reduce feature sizes to just 90 nm. Thus, they can pack more gates on a chip while improving gate performance. In addition, these advanced processes use as many as 11 levels of copper metallization to provide better signal routing and configurability.

Device cost is always a key factor for FPGAs, especially those used in the final version of a product. The high-density, resource-rich, top-of-the-line FPGAs tend to cost from several hundred dollars to over $1000 apiece, even in moderate volume. These are used sparingly in production systems and replaced, where possible, with a full ASIC or one of the new structured or platform ASIC alternatives (see "Structured ASICs Compete With, And Complement, FPGAs," p. 81).

New classes of FPGAs with slightly fewer gates, less memory, and a limited number of other features have sprung up, though, to meet the cost constraints of production systems. This will push the crossover point, at which it makes more economic sense to switch to an ASIC, much further out. In some cases, it may never be economical to use an ASIC, because product life cycles are now shrinking below the time required to develop the ASIC replacement.
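The crossover argument can be made concrete with a back-of-the-envelope calculation. The sketch below is illustrative only: the $1 million development cost echoes the figure cited at the start of this article, while the per-unit prices and the function name are hypothetical assumptions, not vendor data.

```python
def breakeven_volume(asic_nre: float, fpga_unit: float, asic_unit: float) -> float:
    """Unit volume at which total ASIC cost (NRE + units) equals total FPGA cost.

    Returns infinity when the FPGA is no more expensive per unit than the ASIC,
    because the ASIC's up-front cost is then never recovered.
    """
    saving_per_unit = fpga_unit - asic_unit
    if saving_per_unit <= 0:
        return float("inf")
    return asic_nre / saving_per_unit


if __name__ == "__main__":
    ASIC_NRE = 1_000_000.0  # assumed one-time ASIC development cost, in dollars
    ASIC_UNIT = 5.0         # hypothetical ASIC unit price, in dollars
    for fpga_unit in (40.0, 12.0):  # hypothetical FPGA unit prices, in dollars
        volume = breakeven_volume(ASIC_NRE, fpga_unit, ASIC_UNIT)
        print(f"FPGA at ${fpga_unit:.2f}: ASIC pays off beyond {volume:,.0f} units")
```

Under these assumed prices, cutting the FPGA price from $40 to $12 moves the break-even volume from roughly 29,000 units to roughly 143,000 units, which is the "pushing out" effect described above.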

YOUR PAD OR MINE? Typically, as gate counts go up, so does the number of I/O pads needed to move signals on and off the chip. Today's FPGAs come with hundreds to nearly 1000 I/O pads. Most are housed in very large BGA packages and use traditional wire bonds to connect the chip's pads to the package. Yet as operating speeds accelerate and the number of I/O pads increases, traditional wire-bonding schemes keep the chip from delivering top performance due to inductance and parasitic losses. Furthermore, wire bonding may make the chip larger than the desired gate count would otherwise require.

These issues arise because traditional design approaches arrange the I/O pads around the chip's perimeter. Consequently, the physical chip size may actually be determined by the number of pads and how tightly they can be spaced. Designers then would fill in the area within the perimeter with as many configurable cells, memory, and other resources as possible.

But this forces system designers to route all high-speed signals to the edge of the chip, which could detract from overall performance of the function being implemented. For large I/O pad counts, chip sizes may expand more than necessary because pad-to-pad spacing must accommodate the automatic wire-bonding machines.

Staggered pad rings offer one partial solution. Pads are arranged in two concentric rings, with one ring offset from the other so that each pad in the inner ring sits in the gap between two adjacent pads in the outer ring.
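A rough model shows why pad count can set die size and why staggering helps. This is a simplified sketch, not a layout rule: it ignores corner keep-outs and I/O cell depth, the 60-µm pitch is a hypothetical value, and the 784-pad count is borrowed from the Spartan 3 figure quoted later in this article.

```python
import math


def pad_limited_edge_mm(pad_count: int, pitch_um: float, rings: int = 1) -> float:
    """Approximate edge length of a square die whose size is set by its pad ring.

    pad_count: total I/O pads to place around the perimeter
    pitch_um:  minimum pad-to-pad pitch within one ring, in micrometers
    rings:     1 for a single ring, 2 for a staggered dual ring
    """
    pads_per_side = math.ceil(pad_count / (4 * rings))
    return pads_per_side * pitch_um / 1000.0


if __name__ == "__main__":
    PADS = 784        # pad count quoted in the article for a 90-nm device
    PITCH_UM = 60.0   # hypothetical wire-bond pad pitch
    for rings in (1, 2):
        edge = pad_limited_edge_mm(PADS, PITCH_UM, rings)
        print(f"{rings} ring(s): die edge of at least {edge:.2f} mm")
```

With these assumed numbers, a single ring forces an edge of about 11.8 mm, while a staggered dual ring needs about 5.9 mm, so the logic area rather than the pad count is more likely to determine die size.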

Another emerging approach eliminates wire bonding outright, as well as all of the limitations imposed by pad rings. The flip-chip assembly approach, which uses solder-bump connections across the chip's surface, is already employed by a number of companies producing high-performance ASICs.

The flip-chip approach provides several advantages. First, it eliminates the need for a pad ring, considerably reducing chip area. Second, it allows the I/O pads to be more optimally placed right in the middle of the configurable logic array, which shortens the signal paths and reduces inductance and capacitive loading. Although a slightly more expensive packaging technology, the surface-bumped flip-chip approach will be deployed for the highest-performance and highest-I/O-count FPGAs, allowing vendors to charge a premium price for the chips.

MANY ARCHITECTURAL CHOICES Over the past year or so, FPGA vendors have made at least a half-dozen major introductions of new programmable architectures or families to better serve various segments of the OEM markets. To address the high-performance/high-density markets, both Altera and Xilinx unveiled their latest SRAM-based high-density architectures, the Stratix II and Virtex-4 families, which deliver densities exceeding 6 million gates. These families nearly double the resources available on each company's previous-generation families: the Stratix series from Altera, and the Virtex-II and Virtex-II Pro from Xilinx.

Competitors Actel and Lattice Semiconductor also released more modest families that offer top gate counts of 1 million to 2 million system gates. These families will perhaps compete with the recently released MAX II and Cyclone families from Altera and the Spartan-3 series from Xilinx. Plus, although it doesn't offer megagate densities, the Eclipse II family developed by QuickLogic offers up to 320k system gates and was designed to keep power consumption to a minimum. On the low-power, low-density side of complex programmable logic devices (CPLDs), Xilinx offers the CoolRunner family.

It's not just the gate capacity that makes the Stratix II and Virtex-4 series highly integrated system solutions. They also contain many features that support memory-intensive or DSP-intensive applications: up to 9 Mbits of SRAM, hundreds of MACs, dedicated PLLs, specialized high-speed serial I/O ports capable of up to 6-Gbit/s data-transfer rates, and embedded CPUs for high-performance control.

Both the Altera and Xilinx devices are based on multi-input, lookup-table (LUT), combinatorial-logic blocks and registers that form repetitive configurable logic-element (LE) cells. Each cell provides approximately the equivalent of 15 to 25 logic gates. Depending on the logic functions being implemented, these large-grain cells can be highly efficient, with the logic functions consuming all of the resources. Alternatively, they can end up being highly inefficient, with many of the resources left unused. Or, they may simply fall between the two extremes.
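To see why a LUT-based cell can be either fully used or largely wasted, it helps to model a LUT as a small truth-table memory. The sketch below is a generic illustration, not either vendor's actual cell: the `LUT` class and its methods are hypothetical names, and registers, carry chains, and routing are left out.

```python
from typing import Callable, List


class LUT:
    """An n-input lookup table: one stored configuration bit per input combination."""

    def __init__(self, n_inputs: int, func: Callable[..., int]) -> None:
        self.n_inputs = n_inputs
        # "Program" the LUT by evaluating the target function for every input pattern.
        self.table: List[int] = [
            func(*self._bits(index)) & 1 for index in range(2 ** n_inputs)
        ]

    def _bits(self, index: int) -> List[int]:
        # Input k corresponds to bit k of the table address.
        return [(index >> k) & 1 for k in range(self.n_inputs)]

    def __call__(self, *inputs: int) -> int:
        if len(inputs) != self.n_inputs:
            raise ValueError("wrong number of inputs")
        address = sum((bit & 1) << k for k, bit in enumerate(inputs))
        return self.table[address]


# A 4-input XOR uses every input of a 4-input LUT.
xor4 = LUT(4, lambda a, b, c, d: a ^ b ^ c ^ d)

# A 2-input AND still occupies all 16 configuration bits; inputs c and d are wasted.
and2 = LUT(4, lambda a, b, c, d: a & b)

if __name__ == "__main__":
    print(xor4(1, 0, 1, 1), and2(1, 1, 0, 1), len(and2.table))
```

Both functions consume one whole LUT, which is the efficiency gap between wide and narrow functions that the paragraph above describes.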

The forthcoming Stratix II series overcomes the efficiency/inefficiency issue by employing a new adaptive logic module (ALM) in each LE (Fig. 1). Developed to maximize logic efficiency and performance, it can reduce the number of logic levels needed to perform a desired function, which boosts performance. The adaptive module features eight inputs that can be flexibly divided between the two output functions. As a result, wide input functions can run fast, and narrow input functions can efficiently use the remaining resources.

The ability to expand and share the combinatorial LUT portion in the ALM allows the ALM to absorb more logic capacity than traditional four-input LUTs for an equivalent function. The larger logic capacity not only cuts down the total logic utilization, it also reduces the average routing utilization. Once again, this will enhance circuit performance.

The Stratix II family will initially feature six FPGAs that range in capacity from 6240 to 71,760 ALMs and from about 420 kbits to 9.383 Mbits of static RAM. To support DSP operations, dedicated 18- by 18-bit MACs are also pre-integrated on the chips. The multipliers can be subdivided into two 9- by 9-bit units, or two of them can be combined to form 36- by 36-bit units. The smallest Stratix II chip contains 48 multipliers, while the largest has 384. PLLs for clocking and other timing applications are also plentiful. There are six PLLs on the two smallest family members and a full dozen on the remaining chips.

As with the previous Stratix series, the Stratix II family won't pre-integrate dedicated embedded processors. Rather, Altera designers have just revamped their soft-core NIOS processor to create a 32-bit implementation that comes in three variations. To best fit system requirements, designers can trade performance against chip area (number of LEs) by choosing among the three NIOS II cores. Performance of the core ranges from 200 down to 28 MIPS and core complexity from 1120 down to 400 LEs when implemented on the Stratix II FPGAs. The cores can also fit in the logic fabrics of both the Stratix family and the Cyclone series FPGAs.

In its forthcoming Virtex-4 family, arch-rival Xilinx only made some minor fixes to the basic LEs to improve performance. But the overall FPGA architecture did undergo a major change. Dubbed ASMBL (application-specific modular block architecture), the concept behind the family is to craft a highly modular chip architecture with various logic resources—logic, DSP support, memory, processing, etc.—arranged in columns (Fig. 2). By combining columns with different resources on a chip, Xilinx can quickly create chips that are optimized for a particular application segment.

Initially, Xilinx plans to release three application platforms, each featuring several members. They include the Virtex-4LX for logic-intensive applications, the Virtex-4SX for signal processing, and the Virtex-4FX for embedded processing and high-speed serial connectivity. Chips in the Virtex-4 series will pack up to 200,000 logic cells and operate at up to 500 MHz. That's about twice the density and performance of today's FPGAs.

The LX version combines stripes of configurable logic cells with embedded block RAM, digital clock-management blocks, and some DSP/arithmetic functions to handle high-density, I/O-intensive, high-performance logic applications.

The SX platform includes an exceptionally high ratio of DSP/arithmetic blocks and memory blocks so that it can handle audio, video, wireless communications, and other DSP-intensive applications. Lastly, the FX version's columns provide multi-gigabit serial transceivers that can support any speed from 600 Mbits/s to 11.1 Gbits/s. Each of the included enhanced 32-bit PowerPC 405 cores has an auxiliary processor unit that provides hardware acceleration for critical computations. There are abundant logic cells, block RAM, and clock-management circuits as well. Soft-CPU cores such as the company's internally developed MicroBlaze or commercial third-party soft cores can readily be configured in the logic fabric.

SUPPRESSING COST In addition to these latest high-density offerings, both Altera and Xilinx offer solutions for applications more driven by low cost than high logic and memory density. The Spartan 3 family from Xilinx and the Cyclone II series just released by Altera aim at applications that could use FPGAs in large volumes, replacing full-custom designs.

The low-cost FPGAs are viewed as an attractive alternative to an ASIC. Due to the short life of many consumer products, the time and cost of developing a full ASIC solution would actually be more expensive over the life of the product than using an FPGA. Aggressive pricing by FPGA suppliers is also a bonus. Xilinx is quoting prices of less than $12 apiece for a 1 million-gate device and less than $2.95 each for a 50-kgate chip (in lots of 250,000 units) by the end of this year. Assuming about 17,000 LEs on the megagate FPGA, this translates into about $0.70 per 1000 LEs (assuming about 50 to 60 gates per LE). Altera says its forthcoming Cyclone II FPGAs will be even more economical, with a cost of less than $0.65 per 1000 LEs.
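The per-LE pricing follows directly from the numbers quoted above. The short sketch below just reproduces that arithmetic; the function names are illustrative, and the inputs are the article's own figures ($12, 1 million gates, about 17,000 LEs).

```python
def cost_per_thousand_les(unit_price: float, le_count: int) -> float:
    """Device price divided by its logic capacity in units of 1000 LEs."""
    return unit_price / (le_count / 1000.0)


def gates_per_le(system_gates: int, le_count: int) -> float:
    """Average number of system gates represented by one logic element."""
    return system_gates / le_count


if __name__ == "__main__":
    PRICE = 12.00            # quoted price ceiling for the 1 million-gate device
    SYSTEM_GATES = 1_000_000
    LE_COUNT = 17_000        # approximate LE count assumed in the article

    print(f"${cost_per_thousand_les(PRICE, LE_COUNT):.2f} per 1000 LEs")
    print(f"{gates_per_le(SYSTEM_GATES, LE_COUNT):.1f} gates per LE")
```

The result is about $0.71 per 1000 LEs at roughly 59 gates per LE, consistent with the rounded $0.70 and the 50-to-60-gate assumption in the text.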

The Spartan 3 series offers eight devices with capacities ranging from 50k to 5M system gates and up to 1.8 Mbits of block RAM. Up to 104 embedded 18- by 18-bit multipliers provide hardware assist to support high-performance DSP algorithms. By fabricating the FPGAs with 90-nm process rules and employing a dual, staggered pad ring, the chips can pack up to 784 I/O pads. Yet they're considerably smaller than equivalent-density devices fabricated on 130-nm processes. Or, for the same chip area, the FPGAs can pack more than twice the number of system gates.

Soft processors, such as the 32-bit MicroBlaze and the 8-bit PicoBlaze, require just a small portion of the FPGA logic. In fact, the MicroBlaze processor can be implemented with an effective cost of less than $1.40 worth of logic (in volumes of 250k units).

Going head to head with the Spartan 3 series is Altera's just-released Cyclone II family. Also based on a 90-nm process technology, the six chips in the initial family offer from 4600 to over 68k LEs (about 250k to 3.8 million system gates).

The Cyclone II chips also pack 120 kbits to over 1.1 Mbits of memory, and embedded multipliers range from 13 on the smallest device to 150 on the largest. The memory is set up as blocks of 4608 bits (4096 bits plus 512 parity bits) and can be configured for true dual-port operation (one read and one write, two reads, or two writes). Memory accesses occur at 250 MHz. On-chip multipliers can also run at 250 MHz.

One particularly novel feature on the Cyclone II FPGAs is a dedicated DDR II and QDR II memory controller that can operate at memory data-transfer speeds reaching 333 Mbits/s. I/O lines have also been certified for compliance to the PCI local bus specification, revision 3.0, for 3.3-V operation at 33 or 66 MHz with 32- or 64-bit interfaces, and 100-MHz PCI-X 1.0 compatibility.

To get higher performance than the original Cyclone family, Cyclone II designers increased logic-array-block (LAB) grouping size from 10 LEs to 16 per block. This helped shrink chip area and boost performance, because larger functions can be configured in the LAB.

The just-released SRAM-based EC and ECP families from Lattice Semiconductor also aim to replace ASICs. Available without dedicated DSP support, the EC series devices are basically a subset of the ECP series, which includes from four to 10 dedicated DSP blocks. Each block can implement up to eight 9-bit, four 18-bit, or one 36-bit multiplier (for full family details, see "FPGAs Bring Custom-ASICs Economy To System Design," Electronic Design, Aug. 9, p. 40).

Lattice also breathed some new life into a mature FPGA family, namely the ORCA. (It acquired ORCA from the company now known as Agere.) The recently released ORCA ORLI10G is an FPGA packing from 333k to 643k system gates and 111 kbits of RAM. Its high-speed dedicated I/O ports are OIF-standard-compliant and can handle from 10 to 12.5 Gbits/s using a 16-bit low-voltage differential-signaling interface operating at 850 Mbits/s. There are also four 2.5-Gbit/s interfaces, each with a separate clock to synchronize the transfer of data to the FPGA logic.
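A quick check shows how a 16-bit parallel interface covers the quoted 10- to 12.5-Gbit/s range. This sketch assumes that the 850-Mbit/s figure is the maximum rate of each of the 16 differential lines and ignores any framing or encoding overhead; the function names are illustrative.

```python
def aggregate_gbps(lanes: int, mbps_per_lane: float) -> float:
    """Raw aggregate throughput of a parallel interface, in Gbits/s."""
    return lanes * mbps_per_lane / 1000.0


def required_lane_rate_mbps(target_gbps: float, lanes: int) -> float:
    """Per-line rate needed to carry a target aggregate rate, in Mbits/s."""
    return target_gbps * 1000.0 / lanes


if __name__ == "__main__":
    LANES = 16
    MAX_LANE_MBPS = 850.0  # assumed to be a per-line maximum

    print(f"Raw ceiling: {aggregate_gbps(LANES, MAX_LANE_MBPS):.1f} Gbits/s")
    for target in (10.0, 12.5):
        rate = required_lane_rate_mbps(target, LANES)
        print(f"{target} Gbits/s needs {rate:.2f} Mbits/s per line")
```

Under that assumption the raw ceiling is 13.6 Gbits/s, and the 10- and 12.5-Gbit/s targets need 625 and 781.25 Mbits/s per line, both within the 850-Mbit/s limit.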

INSTANT GRATIFICATION For those applications that can't tolerate delays on power-up, on-chip configuration storage lets a chip start functioning in microseconds. This is in contrast with the tens or hundreds of milliseconds required by an SRAM-based FPGA to load its data from an external flash memory or host system. These same features also hold true for all flash-based programmable devices, such as Actel's ProASIC and ProASIC Plus families, Lattice's ispXPGA family, QuickLogic's Eclipse and other families, and Altera's MAX II series of flash-based CPLDs.

A bit more conservative on the gate count than the Altera and Xilinx SRAM-based families, the Actel Axcelerator family is manufactured on a 150-nm antifuse process. Chips in the family feature a gate capacity that ranges from 125k to 2 million system gates, with a typical actual usable number of about 82k to 1 million. However, the logic performance is right near the head of the class, with internal operating speeds that can exceed 500 MHz and system speeds reaching 350 MHz.

The antifuse technology used by Actel (and QuickLogic) makes the FPGAs one-time programmable. So once they're configured, the logic can't be altered. Although this rules out system updates after the devices are fielded, it does provide other benefits. For instance, the nonvolatile configuration pattern is stored on-chip. Thus, no off-chip flash memory device is needed to store the configuration pattern that's loaded during power-up. And, because the data is on-chip, it's safe from the prying eyes of anyone attempting to reverse-engineer the configuration.

The Axcelerator family, though, uses a different large-grain cell approach than that used by either Altera or Xilinx. Rather than use the basic SRAM-based lookup table, designers at Actel developed a block they call a "SuperCluster." It contains multiple combinatorial logic and register modules and transmit and receive routing buffers (Fig. 3). The basic architecture is an enhancement of the company's SX-A sea-of-modules architecture. Its logic fabric covers the chip within the pad ring. Virtually no chip area is lost to interconnect elements or routing since the antifuses lie between the metal layers above the silicon.

The SuperCluster includes two types of logic modules—a register cell (R-cell) and a combinatorial cell (C-cell). The C-cell, which can implement more than 4000 combinatorial functions of up to five inputs, includes carry logic to more efficiently implement arithmetic functions. The R-cell packs a flip-flop with asynchronous preset, active-low enable control signals, and programmable clock polarity. The clock source for the cell can be chosen from hardwired clocks, routed clocks, or internal logic.

Two C-cells, a single R-cell, two transmit buffers, and two receive routing buffers form a cluster. Two clusters form a SuperCluster. One additional independent buffer provides extra buffering on high-fanout nets. The AX architecture is fully fracturable, which means that if a particular signal path uses one or more of the logic modules in the SuperCluster, other signal paths can still use the other logic modules.

Though it doesn't pack quite as many equivalent ASIC gates, Actel's ProASIC flash-based FPGA family uses a smaller, finer-granularity basic LE, which puts that family in the high-gate-count race as well.

Flash memory on the same chip as the SRAM-based configurable logic holds the key to instant-on operation for both the ispXPGA family from Lattice and the MAX II family of CPLDs from Altera. Each family basically consists of SRAM-based FPGAs. But rather than use an off-chip flash memory to hold the configuration data, designers incorporated the memory on-chip. What that translates into is a very fast power-on configuration capability.

The ispXPGA family includes four devices with complexities ranging from 139k to 1.25 Mgates, 92 kbits to 414 kbits of dedicated RAM, 30k to 246k bits of distributed memory, and 160 to 496 I/O pads. The configurable logic is grouped in blocks called programmable function units. Each unit contains four quad-input lookup tables to support wide and narrow functions, dual flip-flops to allow for extensive pipelining, and dedicated logic for adders, multipliers, multiplexers, and counters.

Although internally based on an FPGA architecture, the MAX II family is classified by Altera as a CPLD, with an equivalent macrocell count ranging from 240 to 2210 LEs. Like the Lattice devices, the on-chip flash memory holds the configuration pattern for "instant-on" configuration. However, designers added a separate 8-kbit user flash memory. It can be used to hold additional system parameters, which effectively eliminates a small off-chip memory employed by many systems to store setup parameters.

The MAX II devices are relatively low power, consuming only about 2 mA during standby. But the lowest-power FPGAs to date are QuickLogic's new Eclipse II devices, which draw as little as 17 µA of standby current. Chip complexities range from about 47k to 248k gates, with 9 kbits to 46 kbits of embedded memory. The largest device, the QL8325, also features 12 special embedded computational units that pack an 8-bit multiplier, a 16-bit adder, a 17-bit register, several multiplexers, and a 3:4 decoder. As a result, up to 12 8-bit MAC functions can execute per cycle for a total of about 1.2 billion MACs/s when clocked at 100 MHz.
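The peak MAC rate is simply the number of computational units multiplied by the clock rate. The sketch below reproduces that calculation with the article's figures (12 units, 100 MHz); it assumes every unit completes one MAC per clock cycle, and the function name is illustrative.

```python
def peak_macs_per_second(mac_units: int, clock_hz: float) -> float:
    """Peak multiply-accumulate rate, assuming one MAC per unit per clock cycle."""
    return mac_units * clock_hz


if __name__ == "__main__":
    UNITS = 12          # embedded computational units on the largest device
    CLOCK_HZ = 100e6    # 100-MHz clock
    rate = peak_macs_per_second(UNITS, CLOCK_HZ)
    print(f"Peak rate: {rate / 1e9:.1f} billion MACs/s")
```

This is a best-case figure; sustained throughput depends on keeping all 12 units fed with data every cycle.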

Of course, nearly all FPGA vendors offer previous-generation families. Based on your system performance needs, such devices may also be a good, cost-effective solution.

NEED MORE INFORMATION?
Actel Corp.: www.actel.com
Altera Corp.: www.altera.com
Lattice Semiconductor Corp.: www.latticesemi.com
Leopard Logic: www.leopardlogic.com
QuickLogic Corp.: www.quicklogic.com
Stretch Inc.: www.stretchinc.com
Xilinx Inc.: www.xilinx.com

About the Author

Dave Bursky

Technologist

Dave Bursky is the founder of New Ideas in Communications, a publication website featuring the blog column Chipnastics – the Art and Science of Chip Design. He is also president of PRN Engineering, a technical writing and market consulting company. Prior to these organizations, he spent about a dozen years as a contributing editor to Chip Design magazine. Concurrent with Chip Design, he was also the technical editorial manager at Maxim Integrated Products. Prior to Maxim, Dave spent over 35 years working as an engineer for the U.S. Army Electronics Command and as an editor with Electronic Design magazine.