Third-generation (3G) base-station designers are facing tough design choices. They must keep up with increasing performance requirements while reducing system power dissipation. A hundredfold increase in performance at lower power can be achieved by using advanced process technology and partitioning the design into arrays of small, optimized data-processing functions. This distributed-processing approach increases 3G-system efficiency. After all, large arrays of small function-specific processing elements are inherently more efficient than the alternative—small arrays of large, do-everything processors.
For arrays of 3G function blocks to work in parallel, however, they must be connected with a dedicated system-level interconnect structure. It is possible to manage the system data flow using traditional methods like buses, crossbars, or tunnels. Beyond a small number of processing elements, though, data-transfer determinism is lost, and corner cases and unpredictable latencies can cause the whole system to come apart.
3G designers are now turning to two-dimensional (2D) connectivity fabrics to meet the hundredfold performance increase at lower power. Such fabrics efficiently link arrays of data-processing elements. Their uniform and deterministic structures spread the data traffic over the entire design area to eliminate signal-routing congestion and data-transfer bottlenecks.
With data processing distributed in two dimensions, designers can more easily partition a 3G design. This partitioning can strike the right balance between the size and number of data-processing components required to keep the chip circuitry busy most of the time. At the same time, partitioning will reduce the overall distance that data has to travel inside the ICs.
With billions of transistors switching at gigahertz frequencies, the problem of getting signals across a large die is becoming very complex. Timing, signal integrity, and power issues all have escalated to a new level. Figure 1 shows that instead of spanning the entire chip, 3G functions will have to be localized in small computing platforms. These platforms will be tuned to specific classes of tasks. To work in parallel, these computing islands will have to be bridged with an efficient data-communications structure.
Traditional interconnect methods, such as buses, crossbars, and linear tunnels, are all based on one-dimensional (1D) I/O structures. These structures can directly connect only a limited number of subsystems. In addition, they cannot provide any control over the data-transfer direction. Using these traditional I/O methods to connect large arrays of computing elements also introduces additional glue-logic components and long, non-uniform data routing paths.
As Figure 2 shows, those traditional I/O methods have worked for small designs. They become increasingly inefficient, however, for high-density systems operating at gigahertz clock rates. Finely tuned subsystems can be easily thrown out of tune if they are connected at the system level with a non-uniform mix of traditional I/O methods.
It is better to connect 2D arrays of data-processing elements with a uniform array of 2D data-transfer links. Such connectivity uses a uniform array of short point-to-point routing links. Plus, it requires no additional glue logic. One example of a 2D-fabric I/O structure comes from a company known as CrossBow Technologies. To efficiently connect arrays of subsystems, this company uses a uniform array of horizontal and vertical transport links. Subsystems can then talk to other subsystems through I/O wrappers, which are placed around connected subsystems. Each wrapper is assigned a unique set of YX coordinates. These coordinates are used to route data packets around the system.
On the inside, each 2D-fabric wrapper is accessed with conventional bus cycles—just like any other peripheral. On the outside, though, the adjacent wrappers are connected on all four sides with data transport links. This approach forms a deterministic system-level communications structure. It is uniform in two dimensions.
In Figure 3, a 2D-fabric structure is linking 12 data-processing subsystems inside a single IC. All connected subsystems are wrapped with 2D-fabric wrappers. To each subsystem, they appear as a conventional peripheral on the local bus. Adjacent wrappers are connected with an array of short horizontal and vertical data-transport links. These links enable the transfer of data between subsystems.
Figure 4 shows the write, transport, and read stages of a single data transfer between two subsystems. To launch and receive packets, the source and destination subsystems use conventional bus cycles. All system-level routing and arbitration is performed autonomously by 2D-fabric peripherals along the transfer path.
The source subsystem, which has 2D-fabric YX coordinates of 24h, sends data to subsystem 31h by issuing a write cycle. The local 2D-fabric peripheral converts the write cycle to a single packet. The payload consists of 4 bytes from the data bus plus an optional control byte from the address bus. The packet header, which also is derived from the address bus, contains the destination subsystem's YX coordinates, 31h.
The exit direction field of the address bus does not become a part of the packet. Instead, it is used to launch the packet in one of four possible directions (in this case, west). Following the launch, the packet is autonomously routed by three intermediate 2D-fabric wrappers (23h, 22h, and 21h) to arrive at the destination subsystem, 31h. Here, it is received with a read bus cycle.
To the software programmer, 2D-fabric appears as a conventional memory-mapped peripheral. Sending a word of data to another subsystem involves assigning a local variable to a pointer-referenced memory location. That local variable contains the data payload. The address pointer is assembled from four fields. The decode field selects the 2D-fabric peripheral. It also de-selects other peripherals on the local bus. The second field identifies the packet exit direction. The destination subsystem's YX coordinates are housed in the third field. Finally, the fourth field contains the optional control byte that describes the contents of the data payload.
At the destination subsystem, the programmer receives the arriving payload by assigning the contents of the 2D-fabric memory location to a local variable. The read is typically triggered by the ready-to-read interrupt from the 2D-fabric wrapper.
In Figure 5, a portion of C code launches one packet with a single assignment statement using both data and address pointer integers. The other side of the figure shows C code inside the receiving subsystem reading the packet with another assignment statement.
The latency of each transfer is strictly proportional to the number of intermediate nodes. As a result, each destination can be easily mapped as a function of distance from the source. To determine the transfer latency, the hardware engineer can simply count the intermediate nodes after taking into account the local traffic. To the software designer, latency looks just like memory wait states, with each wait state representing the best-case latency, the worst-case latency, or something in between. Using the wait-state model, system-communication latency can be determined simply by referencing a memory map that groups all potential destination subsystems into banks according to the number of wait states needed to access them.
3G-system efficiency is heavily influenced by design partitioning, the optimization of individual data-processing components, and the streamlining of data flow between components. In order to keep data-processing components busy, interprocessor data transfers must have low latency. They also have to be precisely deterministic. If this is not the case, the components will waste valuable processing cycles waiting for data.
Low latency requires the removal of data-communication choke points. To eliminate them, spread the data flow over the entire chip area. Transfer determinism is achieved through a combination of low latency and a uniform data-communications structure inside the chip. With low latency and deterministic transfers, multiple processors can work in parallel without having to wait for data.
2D-fabric packets are received from the fabric through four input ports. They are transmitted to the fabric through four output ports. The packets exiting through any of the output ports can come from one of three sources. Those sources are: the diametrically opposite input port, an adjacent perpendicular input port, or the interface port to the local subsystem on the inside of the wrapper.
Inside the 2D-fabric arbitration circuits, the three possible sources compete for access to the output port in round-robin fashion (Fig. 6). In the best case, the arriving packet is immediately routed to the output port without waiting. The worst-case latency occurs when the arriving packet is not allowed to exit until the other two packets have cleared the port.
Best-case latency takes place when no other data traffic interferes with a packet in transit. The best-case latency per 2D-fabric wrapper is about 25 ns at 200-MHz I/O clock. If a packet has to travel through three intermediate nodes before it gets to its destination, the total best-case latency will be 3 × 25 ns, or 75 ns. To transfer a block of data to a destination three nodes away, the source subsystem would thus be able to launch a new word every 75 ns.
In contrast, worst-case latency occurs when a packet in transit is temporarily stalled on the way to its destination. In the worst case, two other packets will stall it inside every intermediate node. The worst-case latency per 2D-fabric wrapper is about 75 ns at 200-MHz I/O clock. Each time an intermediate node is encountered, the data stream is slowed down by a factor of three. This is due to the fact that for every incoming packet, there may be up to three outgoing packets. For n = 3 intermediate nodes, the worst-case latency will therefore be 75 × 3^(n−1), or 675 ns. To transfer a data block to a destination that is three nodes away, the source subsystem will be able to launch a new packet every 675 ns—regardless of other system traffic.
AVOID WORST-CASE LATENCY
Data-processing systems depend on the communications infrastructure for the efficient delivery of data to individual processing blocks. The goal of I/O structures is to deliver data at precisely the time that it is needed. If it is delivered too early, a place must be found to store it. If it's too late, precious data-processing cycles are wasted. The on-time delivery of data hinges on how much blocking occurs during transfer. It also depends on how well one can predict data delivery, regardless of other data traffic. Fast and deterministic data transfers are key to the efficient processing of data in 3G-processing systems.
To avoid worst-case latency, 2D fabrics spread the data traffic over the entire design area in two dimensions. In addition, the uniform structure of 2D fabrics guarantees speedy on-time data delivery—even if worst-case latency cannot be avoided.
Conventional I/O structures transfer data along a line or through a central point. In contrast, 2D fabrics disperse the traffic over the design's entire surface in two dimensions. Figure 7 shows the best-case scenario, in which all transfers are non-interfering. It also points out the worst-case scenario. There, multiple transfers are arriving at their destination from the same direction.
Unlike the other methods, 2D fabrics can launch data packets in specific directions. They can then spread multiple transfers over a large area. This approach avoids interference and worst-case latency. Occasionally, however, worst-case latency cannot be avoided. The 2D fabric's uniform I/O structure then makes latency determination as simple as counting intermediate nodes along the transfer's path.
Two-dimensional fabric can easily map data-flow algorithms to arrays of functional blocks. With shrinking process geometry, it is often more efficient to duplicate dedicated data-processing elements. Otherwise, silicon area could be wasted on moving data over metal wires to and from shared processing elements. Implementing data-flow algorithms is a simple four-step process:
- First, individual data-processing elements are identified. These elements can be either processor-based or hardwired. Typically, one processor controls the data flow through an array of hardwired computing primitives or higher-level functions.
- The second step involves placing the functions. Assign unique YX coordinates to each block. Each placed function is then logically linked to the other functions using input/output lines. The arrival of all inputs to any one block triggers the function to execute within that block. The generated results then trigger downstream functions.
- In the third step, the logical connections are implemented. The data flow is now ready to start.
- In the fourth step, the processor primes the starting blocks with inputs. This causes the results to propagate autonomously through the 2D-fabric transport links. The manner of propagation reflects both the pre-programmed destination coordinates and exit directions.
Due to the inherent ease with which data flows in two dimensions, 2D fabric is the I/O structure of choice for implementing data-flow algorithms in silicon. Figure 8 shows a multiply and accumulate algorithm mapped into a 2D array of multiply and add data-processing primitives. This is accomplished by simply assigning YX destinations and exit directions to each 2D-fabric wrapper. In dedicated data-flow applications, each destination and exit direction is typically constant. In contrast, other processing architectures may extract them directly from the inputs. In tree database structures, for example, the YX destination and exit direction may be computed from the inputs inside the function itself, controlling the paths of the tree traversals.
With shrinking process geometry, reaching 3G-design goals will increasingly hinge on achieving the right balance between the size and number of data-processing subsystems. It also will depend on the ability to keep the chip busy at all times while reducing the overall distances that the data has to travel inside the ICs. In many cases, it is advisable to use a large number of subsystems based on small, specialized processors. They will be more efficient at performing a given task than a smaller number of large, do-everything processors. For the large number of processors to work in parallel on a common task, though, they must be connected with a deterministic and uniform interconnect structure. Such a structure will enable the efficient transfer of data, control events, and configuration sequences.
With exploding gate counts, the use of traditional I/O methods introduces too much complexity and uncertainty into the system. They then fail to efficiently meet large system data-communications goals. In contrast, uniform and deterministic 2D connectivity fabrics eliminate I/O complexity while removing data-communications bottlenecks. They match the natural 2D layout of system components with interconnect that is also 2D in form and function.
For the third-generation functions that are implemented inside integrated circuits (ICs), 2D interprocessor communication fabrics enable fast and efficient data transfers between hundreds of data-processing elements. Linking processing elements with two-dimensional fabric also increases system performance. Multiple processors can then process data in parallel. At the same time, two-dimensional fabrics reduce power consumption. They minimize the total distance that data has to travel both inside chips and chip-to-chip.
By creating arrays of processing components, system designers can drastically increase processing throughput and I/O bandwidth. At the same time, they can retain current processor architectures and design tools. The two-dimensional fabric looks like conventional memory to the processors. It does not force software programmers to change their programming styles in order to benefit from higher performance. Serial programming code investment is thus preserved, as each processing element uses only one processor.