High-Density FPGAs Take On System ASIC Features And Performance Levels

Able to provide better system solutions, FPGAs are taking on system building blocks—SRAMs, PCI bus interfaces, CPUs, and more, to boost system throughput.

Dave Bursky

Sept. 18, 2000

Since their introduction, field-programmable gate arrays (FPGAs) have evolved from a prototyping tool to full-fledged commodity components. As chip complexities increased, however, the performance of complex logic functions formed with programmable-logic cells hasn't kept pace with the performance demanded by leading-edge applications. That has led to today's overwhelming interest in incorporating dedicated function blocks on the same silicon as the FPGA logic. The dedicated blocks can achieve much higher operating speeds than if the functions were implemented with the FPGA's logic elements.

Complex functions, like microprocessor cores, high-end PCI interfaces (64-bit/66-MHz), and other functions that need clock rates beyond about 35 MHz, just couldn't be routinely created. The limitation hasn't been the basic performance of the gates on the FPGAs, but rather the interconnection delays encountered when synthesis and other automated tools compute the placement and routing of the circuits in the logic cells. Newer tools are doing a better job, especially if they can take the physical placement into account. A few such tools are starting to appear (see "FPGA Synthesis Tools Coming Of Age," p. 102).

Yet hand-laid-out and optimized functions built from the FPGA's logic cells can reach clock speeds of 40 to 50 MHz, while dedicated blocks integrated into the base silicon can often reach 80 to 100 MHz. Complex functions placed and routed by automatic tools, by contrast, typically top out at 25 to 35 MHz. In today's performance-driven world, that amounts to lackluster performance.

FPGA manufacturers are fighting back in several ways to improve the base performance of the logic cells, the interconnect wiring, and the tools. That will provide designers with a better starting point for functions that don't require ultimate performance.

Finer-feature processes are allowing FPGA suppliers to shrink cell areas and pack more cells on a chip. Today, most companies are already shifting their chip production to processes that employ minimum design features of 0.18 µm. A few companies, such as Altera Corp. and Xilinx Inc., have already started to move products to processes implementing features as small as 0.13 µm.

It takes more than smaller features to solve the performance issues, though. Designers must also be able to interconnect the cells while minimizing the delay imposed by the interconnect wiring. To provide better performance in this area, FPGA vendors have added more levels of interconnect; today's high-density FPGAs employ as many as six or seven levels of metal. Additionally, both Altera and Xilinx are experimenting with copper metallization in place of aluminum. Copper promises a reduction in wiring delays of about 20% or better. On paths where wiring delay dominates, that translates into at least a 20% boost in operating speed.
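The link between a cut in wiring delay and the resulting clock speed can be checked with a short calculation. The sketch below is illustrative only: it assumes a critical path whose delay splits cleanly into a logic share and a wiring share, and the function name and the sample wiring fractions are hypothetical rather than vendor figures.

```python
def clock_gain_from_wire_delay_cut(wire_fraction: float, wire_delay_cut: float = 0.20) -> float:
    """Fractional clock-speed gain when only the wiring share of a path gets faster.

    wire_fraction  -- share of the original path delay due to wiring (0.0 to 1.0, assumed)
    wire_delay_cut -- fractional reduction in wiring delay (0.20 per the article)
    """
    if not 0.0 <= wire_fraction <= 1.0:
        raise ValueError("wire_fraction must lie between 0 and 1")
    # Normalize the original path delay to 1.0; only the wiring part shrinks.
    new_delay = (1.0 - wire_fraction) + wire_fraction * (1.0 - wire_delay_cut)
    return 1.0 / new_delay - 1.0


# Hypothetical wiring shares, chosen only to show the trend.
for fraction in (0.5, 0.8, 1.0):
    gain = clock_gain_from_wire_delay_cut(fraction)
    print(f"wiring share {fraction:.0%}: clock gain {gain:.1%}")
```

Under this simple model, a 20% cut in wiring delay gives a 25% clock gain when wiring accounts for the whole path, and the gain shrinks as the logic share of the path grows.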

One area where digital FPGAs still come up short is in the interface to the "real world"—that is, the analog world. The high-performance digital processes used to fabricate FPGAs don't lend themselves easily to the creation of mixed-signal functions, like op amps, comparators, analog-to-digital and digital-to-analog converters (ADCs and DACs), and other blocks.

For such functions, designers usually add discrete analog functions and components to the board to complete the system. But that circuitry then becomes "locked in," while the FPGA can still be reconfigured.

Two companies have attempted to solve this problem by developing programmable analog arrays. Both Lattice Semiconductor Corp. and Anadyne Microelectronics Inc. have developed single-chip solutions containing a mixture of analog building blocks. These can be interconnected with programmable elements and, therefore, form any desired interface function (see "Extending The Boundaries Of Digital Systems," p. 106).

Of all the functions that designers need on a chip, memory has been the most popular. Almost every FPGA design requires some storage, anywhere from a few bytes to tens of thousands of bytes. But the early FPGAs were very inefficient when it came to implementing blocks of SRAM. Because many FPGAs use SRAM cells to hold the configuration data, most FPGA suppliers first found ways to use the configuration lookup-table storage as general-purpose RAM. Thus, designers could make use of larger blocks of memory, typically 16 or 32 bits per logic cell, rather than losing multiple logic cells for every bit. That significantly improved the memory density on SRAM-based FPGAs.

Such small lookup-table memories, however, still aren't dense enough to craft the multikilobit and multikilobyte memory blocks that are needed to support CPU caches, large register files, and high-speed data-communication buffers. Therefore, not only have SRAM-based FPGA suppliers added dedicated blocks of configurable SRAM onto the FPGA, but antifuse and flash-based FPGA suppliers have done so as well. The amount of SRAM typically scales with the size of the FPGA, but designers now have anywhere from a few kilobits to as much as 16 kbytes of dense, high-performance SRAM at their disposal to support advanced system applications.

Companies are offering basic single-port SRAM capabilities in addition to dual-port memory cells or support circuitry that allows efficient dual-port implementation. Furthermore, a few FPGA suppliers are beginning to offer content-addressable memory support for applications in networking and other systems that must perform fast matching.

Additional Logic For Clocks

There's a need for additional support logic on the FPGAs to ease the implementation of high-speed clock systems. This has led most FPGA suppliers to include phase-locked loops (PLLs) or delay-locked loops (DLLs) on-chip. These allow systems to use low-speed external clocks and multiply up the frequency in order to operate the internal logic at a much higher speed. Additionally, using PLLs and DLLs eases the problem of distributing stable, high-speed clock signals across the logic on a large FPGA. Xilinx, for instance, packs eight DLLs on the Virtex-E family of arrays, which includes extended amounts of SRAM (four blocks of 4 kbits each).

Unlike other FPGA suppliers that incorporate such large dedicated functions as PCI interfaces or CPUs, though, Xilinx is banking on the performance of the Virtex series to allow the use of soft cores on the arrays. Therefore, as part of its library, Xilinx inked a deal with ARC Cores Ltd. to offer ARC's configurable/synthesizable CPU core on the Virtex architecture as well as the company's Spartan-II FPGA family.

On the Virtex-E arrays, designers at Xilinx employ an architecture similar to the previous Virtex family members (Fig. 1). The main logic element, the configurable logic block (CLB), contains two configurable "slices." Each slice, in turn, contains a pair of four-input lookup tables, carry and control logic, plus a pair of flip-flops. Groups of CLBs form a VersaBlock. These are alternated with blocks of SRAM (the BRAMs), and the entire grouping of logic and memory blocks is connected to the VersaRing I/O interface. The VersaRing provides the general routing matrix that interconnects all of the blocks. It consists of an array of routing switches located at the intersections of horizontal and vertical routing channels that route the signals onto the various wires.

Providing the I/O support, flexible I/O blocks (IOBs) surround the logic and memory. The blocks support over a dozen interface standards, ranging from low-voltage TTL to low-voltage differential signaling and Gunning transceiver logic.

A trio of storage elements in the I/O block function as either edge-triggered D-type flip-flops or as level-sensitive latches. The three storage elements in each IOB share a single input clock signal and a common set-reset line, but each element has its own chip-enable control line. The set-reset line can be used by each element to serve a slightly different function: synchronous set, synchronous reset, asynchronous preset, or asynchronous clear.

The output buffer and all of the IOB control signals have independent polarity controls. Optional pull-up and pull-down resistors and a weak keeper circuit are selectable too.

Even though many of the Virtex-E arrays are only just now being sampled on the market, the company has already started to release details of its next-generation family. This second-generation architecture, Virtex-II, is based on 0.1-µm design rules. It will support chips capable of holding 10-million-system-gate designs. The process will also deliver twice the system performance of the previous Virtex architecture—internal system clocks will be able to run at 200 MHz, and system IOBs will have the capability to handle over 800-Mbit/s signal data rates.

Furthermore, the chips will pack more block memory, four times that of the previous Virtex family members. Plus, the memory arrays will be able to provide true dual-port configurations with up to 18 kbits per memory block. The memory blocks will also support parity for applications requiring high data integrity. Also added are read-before-write and no-output-change/write modes to enhance the array's ability to perform DSP algorithms. Memory interfaces support external double-data-rate and quad-data-rate devices too.

To achieve all of these capabilities, Xilinx's designers started with a new CLB cell with more resources than ever before—four slices, each packing a pair of LUT-based logic cells. The 16-bit LUT memories in each part of the slice can be used as distributed memory on the FPGA, providing designers with a total of 128 bits of single-port SRAM or 64 bits of dual-port memory in each CLB.
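The distributed-memory totals follow directly from the cell counts. The sketch below reproduces the arithmetic; the function names are hypothetical, and the dual-port figure rests on the assumption that a dual-port cell pairs two LUTs holding the same data, which halves the usable capacity.

```python
def lut_bits(lut_inputs: int = 4) -> int:
    """A k-input lookup table stores 2**k bits (16 bits for four inputs)."""
    return 2 ** lut_inputs


def clb_distributed_ram_bits(slices: int, luts_per_slice: int, dual_port: bool = False) -> int:
    """Distributed RAM available in one CLB, in bits."""
    total = slices * luts_per_slice * lut_bits()
    # Assumption: dual-port operation consumes two LUTs per stored bit cell.
    return total // 2 if dual_port else total


# Virtex-II CLB as described in the article: four slices, two LUTs per slice.
print(clb_distributed_ram_bits(4, 2))                  # single-port bits per CLB
print(clb_distributed_ram_bits(4, 2, dual_port=True))  # dual-port bits per CLB
```

With four slices of two 16-bit LUTs each, the single-port and dual-port results come to 128 and 64 bits, matching the figures quoted for the new CLB.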

Additionally, the larger CLB contains an enhanced wide multiplexer capability to handle more bus signals within a single block. For example, a single CLB can implement a 16:1 multiplexer, while a 32:1 multiplexer can be formed with two adjacent CLBs.

The CLBs also are optimized to allow the easy implementation of arbitrary-length shift registers, a key function found in networking, communications, ciphering systems, and digital filtering applications. A single CLB could hold a 128-stage register, which Xilinx claims is a density improvement of 16 times over competing FPGA architectures.
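A behavioral model helps show how eight 16-bit LUTs chain into one 128-stage shift register. This is a sketch under stated assumptions, not a description of the actual silicon: the class names are hypothetical, and each LUT is simply modeled as a 16-deep serial register whose output feeds the next LUT in the chain.

```python
from collections import deque


class LutShiftRegister:
    """One 16-bit LUT modeled as a 16-stage serial shift register (assumed behavior)."""

    def __init__(self, depth: int = 16):
        self.stages = deque([0] * depth, maxlen=depth)

    def shift(self, bit_in: int) -> int:
        bit_out = self.stages[-1]       # value leaving the last stage this clock
        self.stages.appendleft(bit_in)  # maxlen drops the old last stage
        return bit_out


class ClbShiftRegister:
    """Eight chained LUT shift registers, as in a four-slice, two-LUT-per-slice CLB."""

    def __init__(self, luts: int = 8, depth: int = 16):
        self.chain = [LutShiftRegister(depth) for _ in range(luts)]

    @property
    def length(self) -> int:
        return sum(len(lut.stages) for lut in self.chain)

    def shift(self, bit_in: int) -> int:
        # Each LUT hands its outgoing bit to the next LUT in the same clock.
        for lut in self.chain:
            bit_in = lut.shift(bit_in)
        return bit_in


register = ClbShiftRegister()
print("stages:", register.length)

# Push a single 1 through and count the clocks until it emerges.
register.shift(1)
clocks = 1
while register.shift(0) == 0:
    clocks += 1
print("delay in clocks:", clocks)
```

The model only captures the stage count; features such as addressable taps are left out.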

The enhanced CLBs support DSP functions, such as multiplication, with the CLBs able to implement 18-bit multipliers. If an entire Virtex-II chip were to be used for multiplication, it could achieve a peak throughput of 0.6 tera multiply-and-accumulate operations per second (0.6 TMACs/s).

Underlying the enhanced CLB features is an improved active interconnect technology. This makes it easier for complex logic functions to be implemented on the FPGA by offering a wider choice of single-, double-, and longer-length routing resources. These routing resources grant shorter delays thanks to the use of up to eight levels of copper metallization. Additionally, active drivers for all routing connections, rather than passive transistor pass-gate structures, reduce signal margin problems and provide fanout-independent routing delays. That improves the performance predictability of complex blocks.

To support the complex functions, the I/O buffers on-chip must also be enhanced to handle the high data rates necessary to transfer data into or out of the chip. The RapidI/O technology available in the Virtex-II series addresses this need with a packet-based interconnection scheme that could deliver data bandwidths of up to 10 Gbits/s.

As part of its recently unveiled Excalibur program, Altera offers a soft CPU core too. But instead of going to an outside source, designers at the company crafted a core that was optimized for use on their FPGA architecture. The Nios core is a RISC processor with a configurable 16- or 32-bit datapath (Fig. 2). The simple architecture of the processor will allow synthesis tools to easily overlay the processor's logic on the FPGA logic cells.

The core employs a 16-bit instruction set to maximize code-storage efficiency. Each of the instructions can execute in a single clock cycle. With a top operating speed of about 50 MHz, the core can deliver a throughput of around 50 MIPS.

Through the use of synthesis tools, designers can not only set the datapath width, but also the size of the register file (up to 512 32-bit registers) and various support microperipherals. The peripheral functions include a UART, a parallel I/O port, a timer, and various memory interfaces for SRAM, flash, and other memory types.

Supporting designs with Nios, the MegaWizard software helps create and configure the Nios core. The companion SOPC (system on a programmable chip) Builder software helps generate the on-chip bus, configures the peripherals, and generates a C header file.

The Excalibur program is more than a single CPU core. In fact, Altera also plans to incorporate both the ARM 9TDMI and the MIPS-32 into the base silicon of APEX FPGA arrays. The ARM and MIPS cores are supported by various microperipherals and a cache module as well.

Because the ARM and MIPS cores are pre-optimized hard cores, they can offer substantially higher performance than the Nios soft core. Both can deliver about four times the throughput of the soft core—around 200 MIPS. That can open up many higher-performance applications which the high-density APEX family and competitive arrays just couldn't handle with soft cores. (For more about the Excalibur family, see Electronic Design, "On-Chip Processors Broaden Embedded Designers' Choice," Aug. 21, 2000, p. 68.)

In addition to the ARM and MIPS cores, Altera is working out the details to license the PowerPC architecture from Motorola Inc., thereby providing a third hard CPU core option. Further evaluation is ongoing to select a 64-bit processor core to handle the most performance-intensive applications. At the same time, the company will continue to expand its IP library offerings.

All of the cores and IP library elements are designed to meld with the APEX family of FPGAs. Current APEX family members include devices that range in complexity from about 60,000 to more than 1.5 million gates (or, about 160,000 to over 2.5 million system gates according to an alternate estimation scheme). The arrays are fabricated using a 0.18-µm, six-level metal process. They include large blocks of embedded memory and high-bandwidth I/O buffers. Embedded RAM bits range from about 24 kbits on the smallest device to over 440 kbits on the EP20K1500E.

Plus, Altera is developing a copper-based interconnect technology that promises significant speed improvements and substantial power reduction. The goal is to have a production process in 2001 that's capable of delivering chips with eight or more levels of metal when fabricated with 0.13-µm minimum features.

The Nios soft-core requires about 1000 logic cells, which is about equivalent to 12% of the capacity of the APEX EP20K200E, or around 2% that of the EP20K1500E. The core's small size allows designers to readily implement systems containing multiple instances of the CPU, permitting highly parallel algorithms to execute very efficiently. The ARM and MIPS cores occupy a larger percentage of the chips into which they will be embedded, because the cores are more complex and will typically include a cache and additional support functions.
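The percentages above imply rough device capacities and an upper bound on how many copies of the core could fit. The sketch below works only from the article's rounded figures; the function names are hypothetical, and the instance count ignores the routing, peripherals, and memory that a real multiprocessor design would also need.

```python
NIOS_LOGIC_CELLS = 1000  # approximate core size quoted in the article


def implied_capacity(core_cells: int, share_percent: int) -> float:
    """Device capacity in logic cells implied by the core's share of the device."""
    return core_cells * 100 / share_percent


def max_core_instances(share_percent: int) -> int:
    """Upper bound on core copies if the whole device held nothing else (assumption)."""
    return 100 // share_percent


for device, share in (("EP20K200E", 12), ("EP20K1500E", 2)):
    cells = implied_capacity(NIOS_LOGIC_CELLS, share)
    print(f"{device}: about {cells:,.0f} logic cells implied, "
          f"at most {max_core_instances(share)} cores")
```

Because both the core size and the percentages are rounded, the implied capacities are estimates rather than data-sheet values.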

Although not a newcomer to the use of embedded system blocks, Lucent Technologies had previously focused on PCI interfaces and some high-speed serial I/O blocks. The just-released ORCA 4 family, however, is a more complete system-on-a-chip strategy that will allow engineers to craft a system solution in several ways (Fig. 3).

Like the PCI solution in the ORCA 3 family, one option would be to design a custom logic block and integrate that on the chip along with the FPGA logic cells and memory. Alternately, Lucent has turned the FPGA logic into an embeddable block able to be integrated into a full ASIC design that's mostly a customer-developed standard-cell-based design. This approach is similar to the ASIC-plus-FPGA scheme released last year by LSI Logic Inc. Additionally, designers will have the basic FPGA with embedded memory blocks that can be used along with soft cores and other synthesizable IP blocks.

The basic logic forming the ORCA Series 4 architecture begins with a 0.16-µm, six-level metal process. It delivers an internal performance of greater than 200 MHz and a chip complexity of over 1.5 million usable "system" gates. A flexible I/O structure allows the FPGAs to tie into a variety of interface standards, including GTL, LVDS, LVPECL, and many others, as well as deliver data at rates of more than 416 MHz. Furthermore, the I/O lines can handle double-data-rate signaling, which effectively doubles the data rate versus the clock rate, allowing data transfers at up to 850 Mbits/s. The process technology used by the arrays allows for the lowest operating voltage in the industry—just 1.5 V—which trims about 30% of the power versus a chip that operates from a 1.8-V supply.
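Both the power claim and the double-data-rate claim can be sanity-checked numerically. The sketch below assumes that dynamic power scales with the square of the supply voltage at constant capacitance and clock frequency, and it ignores leakage; the function names are hypothetical.

```python
def dynamic_power_ratio(v_new: float, v_old: float) -> float:
    """Dynamic power at v_new relative to v_old, assuming P ~ C * V^2 * f
    with capacitance and frequency held constant."""
    return (v_new / v_old) ** 2


def ddr_data_rate_mbps(clock_mhz: float) -> float:
    """Per-pin data rate when data is transferred on both clock edges."""
    return 2 * clock_mhz


saving = 1 - dynamic_power_ratio(1.5, 1.8)
print(f"Dynamic power saving at 1.5 V versus 1.8 V: {saving:.1%}")
print(f"DDR rate at a 416-MHz clock: {ddr_data_rate_mbps(416):.0f} Mbits/s")
```

The voltage-squared model lands close to the 30% saving cited. A 416-MHz clock gives 832 Mbits/s with double-data-rate signaling, so the 850-Mbit/s ceiling corresponds to a clock of about 425 MHz, in line with the "more than 416 MHz" wording.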

Along with the advanced process technology, designers at Lucent also enhanced the basic programmable function unit (PFU), which now contains eight 4-input (16-bit) lookup-table logic cells, much like the forthcoming Virtex-II family from Xilinx. Each PFU has nine user registers—one following each lookup table plus an extra one for arithmetic operations. The registers and LUTs are arranged so that two 4-bit nibbles can act independently.

The LUT memory can be used as distributed RAM, and can be configured as either single- or dual-ported memory. In addition to the LUT RAM, however, the FPGAs incorporate embedded quad-port RAM blocks that pack two read ports, two write ports, and two sets of byte-lane enables.

The five initial members of the general-purpose ORCA Series 4 arrays, the OR4E2, E4, E6, E10, and E14, will pack from 74 kbits of embedded RAM on the smallest device to 221 kbits on the largest. Additionally, the chips will pack from 4992 to 36,960 LUT blocks of distributed RAM (about 80 to over 500 kbits).
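The distributed-RAM range follows from the LUT counts, since each 4-input LUT holds 16 bits. A minimal sketch of the conversion, with a hypothetical function name and decimal kilobits (1 kbit = 1000 bits) assumed:

```python
BITS_PER_LUT = 16  # a 4-input lookup table stores 2**4 bits


def distributed_ram_kbits(lut_blocks: int) -> float:
    """Distributed RAM in decimal kilobits for a given number of 16-bit LUT blocks."""
    return lut_blocks * BITS_PER_LUT / 1000


for lut_blocks in (4992, 36960):
    print(f"{lut_blocks} LUT blocks: {distributed_ram_kbits(lut_blocks):.1f} kbits")
```

The smallest device works out to just under 80 kbits and the largest to roughly 590 kbits, consistent with the "about 80 to over 500 kbits" range.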

Each embedded RAM block can be configured in a number of ways: as one 512-word by 18-bit quad-port block, one 256-word by 36-bit dual-port block, one 1-kword by 9-bit dual port, two 512-word by 9-bit dual-port RAMs, or as two RAMs with an arbitrary number of words whose sum is 512 words or less by 18 bits. Additionally, the memory blocks can be set to function as two 16-word by 8-bit content-addressable memories or as FIFO registers, or even used as constant or variable multipliers.
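The fixed configurations listed above all describe the same amount of storage, which a quick check makes visible. The sketch below is illustrative; the table and function names are hypothetical, and the split rule simply encodes the "sum is 512 words or less" condition for the two-RAM, 18-bit-wide case.

```python
BLOCK_BITS = 512 * 18  # capacity of one embedded RAM block, in bits

# (description, number of RAMs, words per RAM, bits per word)
CONFIGURATIONS = [
    ("one 512 x 18 quad-port", 1, 512, 18),
    ("one 256 x 36 dual-port", 1, 256, 36),
    ("one 1k x 9 dual-port", 1, 1024, 9),
    ("two 512 x 9 dual-port", 2, 512, 9),
]


def total_bits(count: int, words: int, width: int) -> int:
    """Storage consumed by `count` RAMs of `words` words, each `width` bits wide."""
    return count * words * width


def valid_18_bit_split(words_a: int, words_b: int, max_words: int = 512) -> bool:
    """True if two 18-bit-wide RAMs of these depths fit in one block."""
    return words_a > 0 and words_b > 0 and words_a + words_b <= max_words


for name, count, words, width in CONFIGURATIONS:
    print(f"{name}: {total_bits(count, words, width)} bits")

print(valid_18_bit_split(384, 128))  # depths sum to 512 words
print(valid_18_bit_split(400, 200))  # depths sum to 600 words
```

Each fixed configuration uses the full 9216 bits of the block, so the options trade depth against width and port count rather than capacity.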

An embedded 32-bit-wide bus plus four parity lines interconnects the PFUs, a microprocessor interface, the embedded RAM blocks, and embedded standard-cell blocks. Based on the ARM AMBA specification 2.0, the bus can clock at up to 100 MHz.

The embedded bus includes built-in system registers that serve as the control and status center. Additionally, the arrays offer a high-speed off-chip synchronous microprocessor interface, compatible with the PowerPC860 and PowerPC II CPUs from Motorola Inc. and IBM Corp., respectively.

To ensure high-speed operation, the on-chip logic also is supported by up to eight PLLs that can manipulate and condition clocks ranging from 20 to 420 MHz. Additional optimized PLLs on-chip are designed to meet the performance requirements for applications like DS-1/E-1 and STS-3/STM-1. Full boundary-scan (IEEE 1149.1 and draft 1149.2) JTAG test support is built into the arrays as well.

On-chip backplane transceivers make it easy for designers to implement network systems. Furthermore, two enhanced versions of the ORCA Series 4 arrays, the ORT8850L and 8850H, include an eight-channel CDR (clock-and-data-recovery) macrocell that delivers data at 850 Mbits/s per pin, thus providing a data bandwidth of up to 6.5 Gbits/s (full-duplex over serial links). The arrays include dedicated SONET support, too, for framing, pointer moving, and transport overhead handling. For non-SONET applications, all SONET functionality is hidden from the user and the FPGAs can be used as if the circuitry weren't present (Fig. 3d).

Also included on all of the chips are three full-duplex high-speed parallel interfaces, each consisting of an 8-bit data interface, control signals, and a clock signal. These interfaces are fully compatible with the Motorola-proposed RapidIO interface, and can operate at clock rates of up to 311 MHz. With the use of DDR signaling, the three ports deliver data at up to 622 Mbits/s per pin, for an aggregate data rate of 15 Gbits/s.
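The aggregate figure can be reproduced from the port count, data width, and clock rate. The sketch below assumes the total counts the data pins of all three ports in one direction; the function name is hypothetical.

```python
def aggregate_rate_gbps(ports: int, data_bits: int, clock_mhz: float, ddr: bool = True) -> float:
    """Combined data rate in Gbits/s across all data pins of all ports (one direction)."""
    per_pin_mbps = clock_mhz * (2 if ddr else 1)  # DDR moves data on both clock edges
    return ports * data_bits * per_pin_mbps / 1000


rate = aggregate_rate_gbps(ports=3, data_bits=8, clock_mhz=311)
print(f"Aggregate rate: {rate:.3f} Gbits/s")
```

Three 8-bit ports at 622 Mbits/s per pin come to just under 15 Gbits/s, which matches the rounded figure in the text.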

Additional companies focusing on the use of SRAM-based system platforms include Atmel Corp., Triscend Corp., and Chameleon Systems Inc. Designers at Atmel have leveraged their AVR 8-bit processor core and their ARM 32-bit license to create system chips that combine the CPU, RAM, and FPGA logic. The field-programmable system-level IC (FPSLIC) capability allows Atmel to address the market with an in-system configurable solution.

Yet Atmel wasn't the first to address such a market. Designers at Triscend were the first to offer a commercial FPGA/CPU/RAM combination as a system-chip platform, the E5 series. The CPU is an 8-bit 8032 microcontroller, and supporting it are 8 to 64 kbytes of SRAM and up to about 50 kgates of programmable logic.

This summer, the company upped the performance and system capabilities by offering the A7 configurable system platform. The A7 replaces the 8-bit processor of the E5 series with a 32-bit ARM7TDMI processor core (Fig. 4). In addition, Triscend's designers threw in a unified data/instruction cache of 8 kbytes to support the CPU, and provided a 16-kbyte scratchpad RAM. The A7 series also packs up to 3200 configurable logic cells (about 40,000 gates of programmable logic).

To support the processor, Triscend's designers added some system peripheral functions. Included is a memory interface unit that ties into SDRAM and/or static RAM, a four-channel DMA controller, a clock synthesizer, power-management support, and a hardware breakpoint unit for software debugging. Additional peripherals, common to the 8-bit version, include two timers, two UARTs, an interrupt controller, and a watchdog timer.

Four versions of the ARM-based system platform chip will be offered, each with different amounts of configurable logic and I/O pins. Packing 512 logic cells and 124 I/O pins, the smallest is the TA7S05. The largest is the TA7S32, which holds 3200 logic cells and 316 I/O lines.

An application-optimized programmable chip is the result of research done by Chameleon for a flexible communications system solution. The first member of the CS2000 family combines a 32-bit ARC RISC processor, an array of compute-optimized data paths, blocks of RAM associated with the data path blocks, a block of programmable logic, and various system interfaces, including a 32-bit PCI bus, a 64-bit memory bus, and 160 programmable I/O lines. (For more about Chameleon's solution, see the cover story, "Comm Processor Adjusts For Task At Hand," Electronic Design, May 15, 2000, p. 66.)

SRAM-based FPGAs aren't the only programmable platforms that can leverage embedded CPU cores or other dedicated function blocks. Both QuickLogic Corp. and Actel Corp., employing their antifuse technology, have developed approaches to create system platform solutions.

QuickLogic has already developed FPGAs that pack dedicated SRAM and a variety of PCI interfaces. More recently, designers at QuickLogic developed a family of arrays that include embedded multiplier support. This allows the execution of high-throughput DSP operations. Plus, only a few months ago, the company announced a license deal that will allow it to incorporate a MIPS32-4Kc core on its FPGAs. It also has an option to expand the license to include the 64-bit MIPS64-5Kc RISC processor core.

All of these developments at QuickLogic are also coupled to the company's unveiling of the next-generation Eclipse family of antifuse FPGAs. The Eclipse series will be based on a 0.25-µm five-layer metal process. The architecture includes an enhanced logic supercell that has a fan-in of 30 (17 simultaneous inputs) and provides six outputs (Fig. 5). There will initially be four chips in the family that will pack from 248k to 580k system gates and from 46 to 83 kbits of embedded dual-port SRAM. Each chip will additionally pack from 256 to 512 I/O pads, providing an abundant number of I/O lines to handle wide bus interfaces.

QuickLogic will also be releasing other FPGAs with additional embedded functions. The first new device, slated for release next quarter, will tackle high-speed data transfers.

Furthermore, designers at Actel have worked hard to develop an embedded FPGA IP strategy. Rather than offer a series of dedicated system platform chips, they will start by extending Actel's family of antifuse FPGAs that contain dedicated blocks of SRAM. Actel will also leverage the flash-memory-based FPGAs that it gained through its acquisition of Gatefield, the company that developed the ProASIC series of flash-based FPGAs. The ProASIC arrays offer densities ranging from 98 kgates to about 1.1 million gates, and from 14 kbits to 138 kbits of embedded SRAM. Antifuse FPGAs available from the company include the SX-A family, which packs from 12 to 108 kgates.

Actel will offer flash-based blocks of FPGA logic. By acquiring a small design house that developed an SRAM-based FPGA architecture, Actel will also offer embeddable blocks of SRAM-based FPGA. Designers can then use these blocks as embedded cores (blocks of IP) to create a system chip that combines the cores and other ASIC circuitry.

The two other players in the FPGA arena, Cypress Semiconductor Corp. and Lattice Semiconductor Corp., haven't developed complex embedded functions for their respective FPGA offerings, the Delta39K family and the ISP8000 series. Offering internal operating speeds of up to 250 MHz, the Delta39K arrays are fabricated using 0.18-µm design rules. Arrays in the family provide complexities ranging from around 15 kgates up to about 350 kgates. Plus, these chips include blocks of embedded SRAM, ranging from about 40 kbits on the smallest device (the 39K15) to 672 kbits on the largest (the 39K350). The embedded blocks of SRAM can be used to form single-port memory blocks. If dual-port blocks are needed, designers have from 8 kbits to 168 kbits to work with.

The Lattice ISP8000 series devices are basic FPGAs or complex CPLDs. They pack up to about 60 kgates, but have no embedded RAM or other functions. The company hasn't released any details about its next-generation ISP family, but it does plan to offer roughly a tenfold density improvement in that generation.

In general, FPGA activities are gravitating back to almost an ASIC solution, with application-specific functions finding their way onto the FPGA silicon. At the same time, ASIC solutions are incorporating more programmable features, including embeddable blocks of FPGA logic. As design tools improve, it might actually become increasingly difficult to tell the difference between an FPGA and an ASIC as the two grow closer together.

Manufacturers of High-Density FPGAs And Programmable System Platforms
Actel Corp., (408) 739-1010, www.actel.com
Altera Corp., (408) 544-7000, www.altera.com
Anadyne Microelectronics Inc., (408) 996-2091, www.anadyne-micro.com
Atmel Corp., (408) 441-0311, www.atmel.com
Chameleon Systems Inc., (408) 730-3300, www.chameleonsystems.com
Cypress Semiconductor Corp., (408) 943-2600, www.cypresssemi.com
Lattice Semiconductor Corp., (800) 327-8425, www.latticesemi.com
LSI Logic Inc., (408) 433-6855, www.lsil.com
Lucent Technologies, (800) 372-2447, www.lucent.com/orca
QuickLogic Corp., (408) 990-4035, www.quicklogic.com
Triscend Corp., (650) 968-8668, www.triscend.com
Xilinx Inc., (408) 879-6146, www.xilinx.com

About the Author

Dave Bursky

Technologist

Dave Bursky is the founder of New Ideas in Communications, a publication website featuring the blog column Chipnastics – the Art and Science of Chip Design. He is also president of PRN Engineering, a technical writing and market consulting company. Prior to these organizations, he spent about a dozen years as a contributing editor to Chip Design magazine. Concurrent with Chip Design, he was also the technical editorial manager at Maxim Integrated Products, and prior to Maxim, Dave spent over 35 years working as an engineer for the U.S. Army Electronics Command and an editor with Electronic Design Magazine.