Programming parallel processors isn't easy, especially when the number of processing elements is large. No single technique applies to all situations. But in its Storm-1 architecture, Stream Processors narrows the focus to make parallel-processing hardware and software design significantly easier (Fig. 1).
One of the challenges of parallel processing is matching the architecture to the problem. Storm-1 addresses this by focusing on the signal processing of streaming data, which includes streaming video or data from radar. In both cases, the kind of signal-processing work remains consistent, and the chunks of data being processed at one time can be brought on-chip.
While this architecture may not fit many applications, the number of applications it does fit is growing rapidly. In fact, scalability is a key factor. Storm-1 is available in eight-lane and 16-lane versions. These lanes have no relation to the lanes used with hardware interfaces like PCI Express or Serial RapidIO. With Storm-1, a lane is a macro processing element (Fig. 2).
A pair of 32-bit MIPS 4KEc processors manages these lanes and handles the housekeeping. The data parallel unit (DPU) MIPS processor controls the DPU at a global level and drives the DPU dispatcher. The dispatcher controls the code that's loaded into the very-long-instruction-word (VLIW) instruction memory, which in turn is used by each lane. A scalar unit handles simple chores that wouldn't run any faster if they were distributed to a lane.
Each lane operates on its own data and is independent of the other lanes. The data passing through each lane gets the same general type of processing. Yet the data itself, along with any state provided when the lane starts processing, determines which algorithms are applied.
WE DON'T NEED NO STINKING CACHE
Though the Storm-1 is designed for streaming data, it doesn't grab new data continuously. Instead, incoming data streams are moved into main memory. A chunk of data is moved into a lane's local memory and then moved into operand register files as it is being processed. The resulting data is moved back into main memory once the lane is done processing.
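The staging described above can be sketched in plain C. This is only an illustration of the load, process, store pattern, not Storm-1 code: the buffer size LANE_MEM_WORDS, the function names, and the gain operation are all hypothetical, and an ordinary array stands in for a lane's local memory.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LANE_MEM_WORDS 64  /* hypothetical local-memory size for one lane, in words */

/* Hypothetical per-lane work: scale every sample held in the local buffer. */
void process_chunk(int *local, size_t n, int gain)
{
    for (size_t i = 0; i < n; i++)
        local[i] *= gain;
}

/* Stream `total` words through one lane: copy a chunk from main memory into
 * the local buffer, process it there, then copy the result back out. */
void stream_through_lane(const int *main_in, int *main_out, size_t total, int gain)
{
    int local[LANE_MEM_WORDS];  /* stands in for the lane's local memory */

    assert(main_in != NULL && main_out != NULL);

    for (size_t off = 0; off < total; off += LANE_MEM_WORDS) {
        size_t n = total - off;
        if (n > LANE_MEM_WORDS)
            n = LANE_MEM_WORDS;  /* last chunk may be shorter */

        memcpy(local, main_in + off, n * sizeof local[0]);   /* load chunk    */
        process_chunk(local, n, gain);                        /* compute       */
        memcpy(main_out + off, local, n * sizeof local[0]);  /* store results */
    }
}
```

Because every access inside process_chunk hits the local buffer, the time spent on each chunk depends only on its length, which is the property the cacheless design is after.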
This works well because the incoming and outgoing streaming data is typically buffered and often moves through different channels at different data rates. If compression or decompression is being performed, the input and output stream sizes will differ significantly. This buffering approach is quite common even with conventional architectures. The direct memory access (DMA) engine can move data directly into the lane's local memory if necessary.
Caches are complex and take up lots of space. Eliminating the cache can provide key performance and power advantages. On-chip accesses via the cache are often a hundred times more expensive than register accesses, and off-chip accesses are worse still.
In a conventional system with many DSPs, each DSP will cache information from main memory. With Storm-1, there are no caches, only very large local memory or register banks. This has several advantages, especially when it comes to determinism.
Primarily, it lets compilers generate very good code that will execute consistently, since cache misses cannot occur. In fact, the communication and memory subsystems complement each other and eliminate or reduce bottleneck effects. The current design handles both the eight- and 16-lane configurations.
The processing system within each lane is simpler than many DSP architectures because of this memory architecture. The five processing units have their own register files and ALUs, and they operate on the lane's data in parallel. Each lane works on its own data, which minimizes cross-lane communication.
The ALU architecture mirrors most DSPs, with multiple-operand instructions as well as specialized multiply-accumulate (MAC) hardware. The single-instruction multiple-data (SIMD) architecture is tailored for applications such as video manipulation. Scatter/gather operations within a lane are also supported when accessing local memory.
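For readers unfamiliar with the MAC pattern, here is a minimal sketch in portable C. It shows only the arithmetic that MAC hardware accelerates; the function name mac_dot and the 16-bit sample width are assumptions for illustration, not details of the Storm-1 instruction set.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Dot product of 16-bit samples and coefficients using a 32-bit accumulator.
 * Each loop iteration is one multiply-accumulate: acc += x[i] * h[i].
 * The caller must keep n small enough that the 32-bit sum cannot overflow. */
int32_t mac_dot(const int16_t *x, const int16_t *h, size_t n)
{
    int32_t acc = 0;

    assert(x != NULL && h != NULL);

    for (size_t i = 0; i < n; i++)
        acc += (int32_t)x[i] * (int32_t)h[i];
    return acc;
}
```

A loop of this shape sits at the core of filters and transforms, which is why DSP-style hardware dedicates a unit to it.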
Chunks of data can be moved from one lane to another, and the size of the chunks is chosen to fit into the confines of each lane. A scatter/gather DMA transfer approach enables logical data streams to be split among multiple lanes or even multiple chips.
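One way to picture splitting a logical stream among lanes is the round-robin sketch below. The article does not say how Storm-1's DMA actually partitions data, so the round-robin policy, the buffer sizes, and the function names here are assumptions chosen to keep the example small.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_LANES  8   /* eight-lane part; the larger part has 16 */
#define LANE_WORDS 32  /* hypothetical capacity of one lane's buffer */

/* Scatter: deal consecutive stream words round-robin into per-lane buffers.
 * counts[l] receives the number of words placed in lane l. */
void scatter_to_lanes(const int *stream, size_t n,
                      int lanes[NUM_LANES][LANE_WORDS],
                      size_t counts[NUM_LANES])
{
    assert(n <= (size_t)NUM_LANES * LANE_WORDS);

    for (size_t l = 0; l < NUM_LANES; l++)
        counts[l] = 0;

    for (size_t i = 0; i < n; i++) {
        size_t l = i % NUM_LANES;
        lanes[l][counts[l]++] = stream[i];
    }
}

/* Gather: reassemble the stream in its original order from the lane buffers. */
void gather_from_lanes(int lanes[NUM_LANES][LANE_WORDS],
                       const size_t counts[NUM_LANES],
                       int *stream, size_t n)
{
    size_t next[NUM_LANES] = {0};  /* read position within each lane */

    for (size_t i = 0; i < n; i++) {
        size_t l = i % NUM_LANES;
        assert(next[l] < counts[l]);
        stream[i] = lanes[l][next[l]++];
    }
}
```

Changing NUM_LANES from 8 to 16 alters only how the words are dealt out; the code that consumes each lane's buffer stays the same.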
Developers program the Storm-1 lanes using C. The original work was done using C++, but it was discarded in favor of C, which provided a more elegant and efficient solution because it matched the way many stream processing applications were designed.
One set of C functions, called kernel functions, runs on the lanes. These functions are used as necessary and process data in parallel in each lane regardless of how many lanes are involved. Limits are based on the physical number of lanes and the data loaded into the lanes.
If only one lane is needed, only one will operate. The others can idle, conserving power. The eight-lane version consumes about half the power of the 16-lane version when all lanes are operational. Running more lanes at a slower speed is more efficient than running fewer lanes at a higher speed.
Kernel functions operate only on local lane data. They're used after the stream data has been moved into the lane's memory. One kernel function is applied to all lanes at a time, and its execution can be conditional on a per-lane basis. Libraries of kernel functions are available for common transformation and processing requirements.
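The kernel model can be approximated in ordinary C as follows. This is a sequential sketch under stated assumptions, not the Stream Processors programming interface: the lane_t structure, the enabled flag standing in for per-lane conditional execution, and the names run_kernel and kernel_clamp are all hypothetical, and a simple loop stands in for lanes that would really run in parallel.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_LANES  8   /* hypothetical lane count for this sketch */
#define LANE_WORDS 16  /* hypothetical local-memory size per lane */

/* One lane's local data plus a flag that models per-lane conditional execution. */
typedef struct {
    int    data[LANE_WORDS];
    size_t count;    /* number of valid words in data[] */
    int    enabled;  /* nonzero: this lane runs the kernel */
} lane_t;

/* A kernel sees only one lane's local data. */
typedef void (*kernel_fn)(int *data, size_t count);

/* Example kernel: clamp each value to the 8-bit range 0..255. */
void kernel_clamp(int *data, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (data[i] < 0)
            data[i] = 0;
        else if (data[i] > 255)
            data[i] = 255;
    }
}

/* Apply the same kernel to every enabled lane. The kernel itself never
 * refers to num_lanes, so the same code serves any lane count. */
void run_kernel(lane_t *lanes, size_t num_lanes, kernel_fn kernel)
{
    assert(lanes != NULL && kernel != NULL);

    for (size_t l = 0; l < num_lanes; l++) {
        if (lanes[l].enabled)
            kernel(lanes[l].data, lanes[l].count);
    }
}
```

Note that the lane count appears only in the dispatch loop, never inside the kernel, which is what lets the same kernel run unchanged on eight or 16 lanes.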
Kernel functions don't depend on the number of lanes involved, so the architecture can be scaled up and down. This may lead to additional chips in the family or architectures that use multiple chips. In this case, the code to handle the lanes will be replicated but remain the same from chip to chip. Communication between lanes in different chips will be significantly more expensive, but this won't affect many applications.
The RapiDev Development Environment supports Storm-1. It includes the SPC compiler for Linux and Windows hosts and the cycle-accurate Target Code Simulator (TCS), which includes MIPSsim for the control processors. The Eclipse IDE ties everything together, including the simulator and VLIW profiler support.
Image processing, DSP, and general math libraries are included. The MIPS processors run Linux and can be programmed using any conventional set of programming tools. Libraries are provided for managing and load-balancing the memory, streams, and lanes.
Available individually, the SP16-G160 costs $99, and the SP8-G80 costs $59. A PCI board is available with a 16-lane version. The board has a Gigabit Ethernet interface, analog audio in/out, 512 Mbytes of SDRAM, and 32 Mbytes of flash. It can operate in standalone mode or be controlled by a host processor.
The Storm-1 architecture is just one of many. Architectures such as IBM's Cell processor or even symmetrical multiprocessing (SMP) systems will remain important in their niches using different parallel programming tools and techniques.
Versions: eight-lane SP8-G80 and 16-lane SP16-G160
Speed: 500 MHz
Memory: 128-bit DDR2
Stream I/O pins: 72 or 108 programmable pins, 165 MHz
Peripherals: 1-Gbit Ethernet, serial, 32-bit, 66-MHz PCI
Package: 31- by 31-mm 896-pin plastic ball-grid array (PBGA)