Scalable, Reconfigurable Processor Adjusts Logic For Top Performance

The evolving nature of communications and other audio/video systems demands signal-processing approaches that are scalable and flexible. Scalability lets systems tackle increasingly complex tasks, add new features and resources, and improve task performance. An example is adding more filtering when noise overpowers a signal. Flexibility lets the system quickly implement new algorithms.

In communications systems such as cellular telephones, and in audio and video processing, all signal processing must occur in real time. Additionally, changes to the algorithms must be made quickly because market differentiation is short-lived. Millisecond delays are too long when bits are flying past at tens of thousands to millions per second. Updatable structures need to reconfigure themselves in a single cycle to avoid losing a large block of bits or dropping a connection.

In the past, reconfigurable DSP blocks were implemented on SRAM-based field-programmable gate arrays (FPGAs). But, typically, updating the SRAM configuration cells requires several milliseconds for the megagate-density devices. During those milliseconds, the array isn't usable and all signal-processing activity halts. That's not acceptable when fast operation is necessary.

Chameleon Systems' designers resolved the flexibility and scalability issues with the CS2000 family of reconfigurable communications processors (RCPs). The family combines features from multiple product types. Each RCP contains aspects of a DSP chip, a microprocessor, an FPGA, and a custom ASIC, so it can't be classified as any one of them. Instead, it forms a new class of product: a configurable compute platform. This solution delivers higher system performance than a multichip microprocessor/DSP and FPGA alternative. Programming is also easier, shortening the time to market.

Combined in the CS2000 architecture are a 32-bit RISC processor, blocks of embedded memory, a proprietary reconfigurable processing fabric, and a large number of programmable I/O pins (Fig. 1). The RISC processor is based on the ARC core from ARC Cores Ltd. Also on-chip is a PCI v. 2.1-compliant interface controller for connecting to a host system, along with an external-memory controller that connects to a 64-bit-wide memory bus. The CS2000 offers a DMA subsystem with 16 distributed DMA engines. These enable high-speed data transfers in and out of the reconfigurable processing fabric.

A 128-bit-wide split-transaction bus, dubbed the RoadRunner system bus, provides a time-division multiplexed communications path. This ties the control portion of the chip to the configurable processing fabric. The fabric contains a configurable interconnect structure and repetitive blocks of compute logic known as slices.

The slices are independently configurable. They have user-configurable compute resources in the form of three sub-blocks, referred to as tiles. Every tile has seven 32-bit datapaths, two 16- by 24-bit single-cycle multipliers, four local-storage-memory (LSM) blocks—each 128 words deep by 32 bits wide—and a control logic unit (Fig. 2).

The configurable interconnect fabric enveloping all the slices is key to the flexibility and real-time performance of the RCP. In a single clock cycle, it can be reconfigured. There's no delay when a new circuit configuration takes on a task from the existing system configuration. Designers employ a configuration plane and a second "shadow" plane to accomplish this. With a simple signal, the shadow bit plane is substituted for the configuration data plane. The old data plane can then be updated in the background. Therefore, while executing its current task, the system prepares for the next potential change.

At Chameleon, designers developed the tools and support structure needed to create the software and port the algorithms to the architecture. The C~Side tools cover the development flow, runtime services, hardware and software debug, and verification aspects of software development. They employ standard C and HDL languages for design entry. Among the tools included is an optimized GNU C compiler for the 32-bit RISC processor. There's also an optimized HDL synthesizer for the reconfigurable processing fabric and a full-chip simulator.

Specially developed firmware solves the challenge of interfacing the 32-bit RISC processor in the control portion of the chip to the reconfigurable fabric. Created by Chameleon, it's called the eConfigurable Basic I/O Services (eBIOS). This software provides a seamless interface, allowing the processor to easily hand off tasks to the processing fabric. The eBIOS performs resource allocation, configuration management, and DMA services. Its calls are generated automatically at compile time, but they can be edited for precise control of any function.

In a typical application, the eBIOS first allocates required fabric resources into one or more slices. Next, the configuration loads into those slices. The eBIOS then synchronizes the local store memories and registers in the datapath units. After the DMA transfers are done, the algorithm executes on the configurable fabric. Finally, the eBIOS manages the return from execution.
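The ordering of those steps can be sketched as a small state machine. This is only an illustration of the sequence described above: the phase names and the `FabricTask` class are hypothetical and are not the actual eBIOS calls, which the article doesn't list.

```python
from enum import IntEnum


class Phase(IntEnum):
    """Hypothetical labels for the steps in the text, in order."""
    ALLOCATE = 0       # reserve fabric resources in one or more slices
    LOAD_CONFIG = 1    # load the configuration into those slices
    SYNC_STATE = 2     # synchronize local store memories and registers
    DMA_TRANSFER = 3   # move data in through the DMA engines
    EXECUTE = 4        # run the algorithm on the fabric
    RETURN = 5         # manage the return from execution


class FabricTask:
    """Tracks one task and rejects phases requested out of order."""

    def __init__(self):
        self.completed = []

    def advance(self, phase):
        if len(self.completed) == len(Phase):
            raise RuntimeError("task already finished")
        expected = Phase(len(self.completed))
        if phase is not expected:
            raise RuntimeError(f"expected {expected.name}, got {phase.name}")
        self.completed.append(phase)


def run_in_order():
    """Walk a task through every phase in the documented order."""
    task = FabricTask()
    for phase in Phase:
        task.advance(phase)
    return task.completed
```

Calling `run_in_order()` walks through all six phases, while asking a fresh task to `EXECUTE` before allocation raises an error.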

In the first silicon implementation of the RCP architecture, the CS2112, designers opted to combine four slices, or 12 tiles, onto one chip. That gives the system designer 84 datapath units, 24 multipliers, and 48 local-store memories totalling 196 kbits. With all blocks active and the clock running at 125 MHz, the chip provides a maximum compute throughput, with 16-bit data, of 24 billion operations per second (BOPS) and 3 billion 16-bit multiply-accumulates/s. In terms of communications applications, that translates into the ability to implement 50 channels of cdma2000 processing.
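These totals can be reproduced from the per-tile figures. The sketch below assumes that each datapath unit counts as two 16-bit operations per cycle (one per 16-bit stream) and each multiplier as one. That breakdown is an inference that happens to match the quoted 24 BOPS, not something the article states.

```python
# Per-tile resources as listed in the article.
TILES_PER_SLICE = 3
DATAPATHS_PER_TILE = 7
MULTIPLIERS_PER_TILE = 2
LSM_BLOCKS_PER_TILE = 4
LSM_WORDS = 128
LSM_WORD_BITS = 32


def chip_resources(slices, clock_hz):
    """Return resource counts and peak 16-bit throughput for a slice count."""
    tiles = slices * TILES_PER_SLICE
    datapaths = tiles * DATAPATHS_PER_TILE
    multipliers = tiles * MULTIPLIERS_PER_TILE
    lsm_blocks = tiles * LSM_BLOCKS_PER_TILE
    return {
        "tiles": tiles,
        "datapaths": datapaths,
        "multipliers": multipliers,
        "lsm_blocks": lsm_blocks,
        "lsm_bits": lsm_blocks * LSM_WORDS * LSM_WORD_BITS,
        # Assumption: 2 ops/cycle per datapath, 1 op/cycle per multiplier.
        "ops_per_s": (2 * datapaths + multipliers) * clock_hz,
        # One multiply-accumulate per multiplier per cycle.
        "macs_per_s": multipliers * clock_hz,
    }


cs2112 = chip_resources(slices=4, clock_hz=125_000_000)
```

For four slices at 125 MHz this gives 84 datapath units, 24 multipliers, 48 memories holding 196,608 bits, 24 billion operations/s, and 3 billion multiply-accumulates/s.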

In addition to the four-slice CS2112, Chameleon plans to release two scaled-down versions, the two-slice CS2106 and the single-slice CS2103. Both chips can function in less-compute-intensive applications. Their performance and I/O buses equal half or a quarter of those in the CS2112.

The initial market for RCPs includes basestations, fixed-point wireless local loops, smart antennas, voice-over-IP, secure communications, and very high bit-rate digital-subscriber-line (VDSL) systems. It also encompasses various other communications applications that traditionally use DSPs and FPGAs.

Configurable compute tiles form the heart of the RCP. Each holds seven 32-bit datapath units, four local store memory blocks, two single-cycle multipliers, and a control unit. Routing multiplexers, a barrel shifter, registers, mask logic, a 32-bit operation block, and several output registers lie inside the datapath unit. The local store memories are multiported and can perform simultaneous reads and writes. They also can be concatenated to form wider or deeper memory blocks.

The datapath can handle 16- and 32-bit word operations, as well as dual, independent 16-bit data streams. These streams suit operations like single-instruction/multiple-data (SIMD) processing. Word- and byte-swapping and word duplication are done by the 32-bit barrel shifter. It can generate any 5-bit constant, to be employed by the register and the two 32-bit AND/OR mask operators.
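To illustrate what dual 16-bit streams in one 32-bit datapath mean, here is a minimal software model. It assumes the two lanes sit in the upper and lower halves of the word and that "word swapping" means exchanging those halves. The function names are made up for this sketch and don't describe the hardware's actual control interface.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF


def pack16(hi, lo):
    """Pack two 16-bit values into one 32-bit word."""
    return ((hi & MASK16) << 16) | (lo & MASK16)


def simd_add16(a, b):
    """Add two packed words lane by lane; carries never cross lanes."""
    hi = ((a >> 16) + (b >> 16)) & MASK16
    lo = ((a & MASK16) + (b & MASK16)) & MASK16
    return (hi << 16) | lo


def swap_halfwords(word):
    """Exchange the upper and lower 16-bit halves (a rotate by 16)."""
    return ((word << 16) | (word >> 16)) & MASK32


def duplicate_low(word):
    """Copy the lower 16-bit half into both lanes."""
    low = word & MASK16
    return (low << 16) | low
```

Adding `pack16(0xFFFF, 1)` to `pack16(1, 2)` wraps the upper lane to zero without disturbing the lower lane, which ends up holding 3.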

The 16- by 24-bit multipliers provide a result in a single cycle. In 16-bit mode, they can produce a signed 32-bit product. In full-resolution mode, they create a 40-bit product that's rounded to 32 bits.
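The two multiplier modes can be modeled in a few lines. The article doesn't say how the 40-bit product is rounded, so this sketch assumes the low 8 bits are dropped after adding half an LSB (round half up). The helper names are illustrative.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):
        return value - (1 << bits)
    return value


def mul16(a, b):
    """16-bit mode: signed 16 x 16 multiply, result fits in 32 bits."""
    return to_signed(a, 16) * to_signed(b, 16)


def mul16x24_rounded(a, b):
    """Full-resolution mode: 16 x 24 multiply gives up to a 40-bit product.

    Assumed rounding: add half of the discarded range, then shift right
    by 8 bits (Python's >> is an arithmetic shift for negative numbers).
    """
    product = to_signed(a, 16) * to_signed(b, 24)
    return (product + (1 << 7)) >> 8
```

With this rounding rule, a raw product of 384 (1.5 after scaling by 256) rounds to 2.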

Equivalent to an ALU, the 32-bit operation block directly implements all C and Verilog operators. It performs arithmetic, signed/unsigned shifting, and bit-field masking operations. All registers for the datapath have conditional enables to improve pipelining efficiency. At reconfiguration, registers can either initialize or preserve their state. Furthermore, there's an optional shifted-feedback mode for shift-register and LFSR implementations.
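As an example of what a shifted-feedback register computes, here is a generic Fibonacci LFSR in software. The 4-bit width and tap positions are chosen only to keep the example small and are not taken from the CS2000 or from any cdma2000 sequence.

```python
def lfsr_step(state, width, tap_shifts):
    """Advance a right-shifting Fibonacci LFSR by one clock.

    tap_shifts lists the bit positions XORed together to form the
    feedback bit, which enters at the most significant bit.
    """
    feedback = 0
    for shift in tap_shifts:
        feedback ^= (state >> shift) & 1
    return (state >> 1) | (feedback << (width - 1))


def lfsr_period(seed, width, tap_shifts):
    """Count clocks until the register returns to its seed.

    Terminates when bit 0 is one of the taps, because the update is
    then invertible and every state lies on a cycle.
    """
    state = lfsr_step(seed, width, tap_shifts)
    steps = 1
    while state != seed:
        state = lfsr_step(state, width, tap_shifts)
        steps += 1
    return steps


# Taps at bits 0 and 1 correspond to the polynomial x^4 + x^3 + 1.
period = lfsr_period(seed=0b0001, width=4, tap_shifts=(0, 1))
```

With these taps the 4-bit register visits all 15 nonzero states before repeating, which is the maximal-length behavior that pseudonoise generators rely on.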

Tying all of the logic together, the interconnect fabric guarantees 100% routability through a fully enumerated interconnect hierarchy. The routing employs a rule-based timing model that's simple and deterministic: one clock cycle within a slice, and two clock cycles between slices. Timing is independent of fanout.

There are three levels of hierarchy for routing in the dynamic interconnect. At the first level, local routes connect nearby datapath units with just a one-clock-cycle delay. With the same delay, intraslice routes connect all datapath units within a slice. Finally, interslice routes connect datapath units in different slices with a delay of two clock cycles. In each datapath unit, routing multiplexers steer signals through or around the unit. On a clock-by-clock basis, the multiplexers can be told to alter the data flow.
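Because the timing rule depends only on whether source and destination share a slice, it is easy to capture in code. The sketch below is a simplified model with a hypothetical addressing scheme (slice, tile, unit), not the addressing the tools actually use.

```python
from typing import NamedTuple


class DatapathUnit(NamedTuple):
    """Hypothetical address of a datapath unit in the fabric."""
    slice_id: int
    tile_id: int
    unit_id: int


def route_delay_cycles(src, dst):
    """Local and intraslice routes take 1 cycle; interslice routes take 2."""
    return 1 if src.slice_id == dst.slice_id else 2


def path_latency_cycles(hops):
    """Total delay along a chain of datapath units, independent of fanout."""
    return sum(route_delay_cycles(a, b) for a, b in zip(hops, hops[1:]))


a = DatapathUnit(0, 0, 0)
b = DatapathUnit(0, 2, 5)   # same slice, different tile
c = DatapathUnit(3, 1, 4)   # different slice
```

A path from `a` to `b` to `c` costs one cycle for the intraslice hop plus two for the interslice hop, three cycles in total.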

To conduct the quick personality change, the RCPs contain two configuration memory planes. The active plane holds the configuration for the function being executed. The shadow or background plane holds the most likely alternative configuration that the current algorithm may have to call upon. If it's called, the control logic only needs a single cycle to switch from the current to the background plane. The active plane then becomes the background plane. New configuration data can then load into it from external system memory in about 3 µs per slice.
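The two-plane scheme behaves like a double buffer. The following is a minimal software analogy, assuming a configuration can be treated as an opaque value. The class and method names are invented for illustration.

```python
class ConfigPlanes:
    """Software analogy of an active plane plus a background plane."""

    def __init__(self, initial_config):
        self.active = initial_config   # drives the fabric right now
        self.background = None         # preloaded alternative, if any

    def load_background(self, config):
        """Slow step: fill the background plane while the fabric runs."""
        self.background = config

    def swap(self):
        """Fast step: exchange the planes (one cycle in hardware)."""
        if self.background is None:
            raise RuntimeError("no configuration loaded in background plane")
        self.active, self.background = self.background, self.active
        return self.active


planes = ConfigPlanes("demodulation")
planes.load_background("finger_search")
now_running = planes.swap()
```

After the swap, the former active configuration sits in the background plane, where it can be overwritten by the next `load_background` call without disturbing the running function.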

Using this updating approach makes multipart algorithms possible. One example is the cdma2000 chip-rate processing algorithm, whose power-control group has four key parts: pseudonoise sequence generation, demodulation, finger searches, and access searches.

In a traditional ASIC, each of these functions is implemented as a separate logic block and the four blocks are cointegrated on the chip. That gives designers little flexibility for updates or function changes.

In contrast, Chameleon's approach requires allocating only enough resources for the most complex function. While one algorithm executes from the active plane, the pattern for the next function transfers to the shadow plane.

For the cdma2000 function, the four algorithms mentioned require 77, 615, 224, and 334 µs to execute. The four functions are referred to as one power control group. Depending on the algorithm, all or part of the reconfigurable processing fabric can be devoted to the computations. More reconfigurable resources can be used if that will improve the result's speed or quality.
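A quick check of these figures shows why background loading works here. The sketch pairs the four times with the four functions in the order they were listed, which is an assumption, and it also assumes that all four slices are reloaded one after another at the quoted 3 µs per slice.

```python
# Execution times in microseconds; the pairing with names is assumed.
STAGES_US = [
    ("pseudonoise generation", 77),
    ("demodulation", 615),
    ("finger search", 224),
    ("access search", 334),
]
RELOAD_US_PER_SLICE = 3
SLICES = 4

# Total time for one pass through all four functions.
total_us = sum(duration for _, duration in STAGES_US)

# Worst-case background reload if all slices load sequentially.
reload_us = RELOAD_US_PER_SLICE * SLICES

# A reload is hidden if it finishes before the current stage does.
reload_hidden = all(reload_us <= duration for _, duration in STAGES_US)
```

The four stages sum to 1250 µs, or 1.25 ms, which lines up with the length of a cdma2000 power control group. A full four-slice reload takes about 12 µs under these assumptions, well under even the shortest 77-µs stage, so each upcoming configuration can load while the current one runs.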

A single CS2112 chip can implement 50 channels of cdma2000 chip-rate processing, which is about twice the rate of other application-specific chips. Moving results onto or off the chip is possible with up to 160 programmable-I/O pads divided into banks of 40 I/Os each. Every bank delivers data at transfer rates of up to 0.5 Gbytes/s. That speed allows high-performance data streaming for signal-processing and protocol-processing applications.

The CS2112 holds four I/O port banks, providing a total possible bandwidth of 2 Gbytes/s. The programmable lines can be configured to provide interface and handshake signals for SRAMs, peripheral functions like analog-to-digital and digital-to-analog converters, FPGAs, and other support circuits.

The RISC processor core, associated PCI controller, memory controller, and DMA subsystem provide the intelligence to manage the chip's overhead operations. The core coordinates with the host to ensure that data and algorithms flow smoothly.

The ARC RISC core is a 32-bit processor. When clocked at 125 MHz, it delivers about 120 MIPS. The processor has a four-stage pipeline, 64 general-purpose 32-bit registers, a 32-bit address space, a 4-kbyte instruction cache, and a 4-kbyte data memory. Also, to ease the system debug phase, Chameleon's designers included a full JTAG interface that works with the debug tools.

The PCI controller offers a complete interface for the chip to tie directly to a 32-bit PCI backplane. Functioning at bus speeds of up to 66 MHz, it handles both master and slave modes. As part of the control logic resources, a 64-bit memory controller was integrated to control the external memory. It handles synchronous SRAM and SDRAM, in addition to flash memories. It permits automatic transparent SDRAM refresh and supports up to four banks of SDRAM. Burst sizes up to 8 kbytes are possible. Both parity and 8-bit ECC support are included.

A separate configuration controller manages the data flow into the two configuration planes. The controller is an optimized DMA controller that transfers configuration data from off-chip memory, through the 64-bit memory controller, and into the background configuration plane.

Developing configuration patterns for the chips is straightforward with the C~Side development environment. C source code can be written to run on the ARC processor. At the same time, library circuit elements and Verilog HDL source code can be assembled and then synthesized to craft the compute functions. After the logic is synthesized, placed, and routed, the configuration bitstream is linked with the ARC object code. The outcome is an executable file that will run on the reconfigurable communications processor (Fig. 3).

ChipSim is an integral part of the tool suite. It's a complete simulator that can model the entire RCP. On the front end, the popular GNU debugger is available. ChipSim guarantees 100% visibility into all memories and registers throughout the RCP, both in the fabric and in the ARC core. A development board speeds up the application integration phase and permits designers to test the applications at full speed. Data can be transferred over the PCI bus or the JTAG port. All memories and registers remain visible to the debug tools, just as with the simulator.

Price & Availability

Samples of the 12-tile CS2112 reconfigurable processor will be available in the third quarter of this year. In 100-unit sample quantities, the processor sells for $295 apiece, but high-volume pricing will decrease to the equivalent of less than $1 for a cdma2000 chip-rate processing channel by the middle of 2001. The C~Side development tools also will ship in the third quarter. The complete software tool suite, which runs on the Sun Solaris platform, sells for $25,000. The hardware development board and related driver software sells for $5000.

Chameleon Systems Inc., 1195 W. Fremont Ave., Sunnyvale, CA 94087; Bruce Kleinman, (408) 730-3300, www.cmln.com.