A parallel bus is the simplest architecture. Adding peripherals is easy, but the bus bandwidth must then be shared when servicing them. Bus bandwidth often limits peripheral throughput, although many buses have enough headroom to service all of their peripherals without a problem.
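As a rough illustration of the budgeting involved, the sketch below compares the aggregate demand of a few peripherals against the peak bandwidth of a shared bus. The bus width, clock rate, and per-peripheral figures are assumptions chosen for illustration, not the specs of any particular system.

```c
/* Hypothetical bus-bandwidth budget check.
 * All figures are illustrative assumptions, not real device specs. */
#include <stdio.h>
#include <stddef.h>

int main(void)
{
    /* 32-bit bus at 100 MHz, one transfer per cycle: 400 MB/s peak */
    const double bus_bw_mb_s = 4.0 * 100.0;

    /* Assumed steady-state demand from each peripheral, in MB/s */
    const double demand_mb_s[] = { 12.5 /* Ethernet */,
                                   1.5  /* UARTs    */,
                                   25.0 /* display  */,
                                   60.0 /* disk     */ };
    double total = 0.0;
    for (size_t i = 0; i < sizeof demand_mb_s / sizeof demand_mb_s[0]; ++i)
        total += demand_mb_s[i];

    printf("peripheral demand: %.1f MB/s of %.1f MB/s available (%.0f%%)\n",
           total, bus_bw_mb_s, 100.0 * total / bus_bw_mb_s);
    return 0;
}
```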
Most low-end processors that lack DMA or coprocessor support can still service every peripheral. In that case, though, the processor’s speed in fetching and executing instructions ultimately limits performance. Because the processor is directly involved in every peripheral access, speeding up the processor boosts overall system performance.
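The programmed-I/O loop below sketches what "directly involved in every peripheral access" means in practice: the CPU polls a status register and writes each byte itself, so its instruction rate caps the transfer rate. The UART register addresses and status bit are hypothetical.

```c
/* Sketch of programmed I/O on a DMA-less system: the CPU itself moves
 * every byte, so its fetch/execute rate bounds peripheral throughput.
 * The UART register addresses and bit layout here are hypothetical. */
#include <stdint.h>
#include <stddef.h>

#define UART_STATUS   (*(volatile uint32_t *)0x40001000u)
#define UART_DATA     (*(volatile uint32_t *)0x40001004u)
#define UART_TX_READY 0x01u

static void uart_write(const uint8_t *buf, size_t len)
{
    for (size_t i = 0; i < len; ++i) {
        while (!(UART_STATUS & UART_TX_READY))
            ;                       /* CPU busy-waits on the peripheral */
        UART_DATA = buf[i];         /* CPU performs every bus transfer  */
    }
}
```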
Much of a processor’s work may consist of moving data to and from peripherals, but these chores can usually be offloaded to DMA controllers or additional cores. In a single-bus system, those devices must still share the bus bandwidth. Offloading does let the processor run at a slower speed and thereby consume less of the bus bandwidth itself. The peripheral data load on a system isn’t always constant, either. In bursty environments, the DMA controller or the processor may consume 100% of the bus bandwidth, to the possible detriment of the other intelligent devices on the bus.
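A minimal sketch of the offloading pattern follows: the CPU programs source, destination, and length into a DMA controller and starts the transfer, after which the engine moves the data itself. The register layout and addresses are invented for illustration; real controllers differ, but the pattern is typical. Note that the DMA engine still consumes bus bandwidth, and a large burst can briefly saturate the bus.

```c
/* Sketch of offloading a block transfer to a DMA controller so the CPU
 * is free for other work. Register layout and base address are
 * hypothetical. */
#include <stdint.h>
#include <stddef.h>

typedef struct {
    volatile uint32_t src;      /* source address                 */
    volatile uint32_t dst;      /* destination address            */
    volatile uint32_t len;      /* bytes to move                  */
    volatile uint32_t ctrl;     /* bit 0 = start, bit 1 = busy    */
} dma_regs_t;

#define DMA0 ((dma_regs_t *)0x40002000u)

static void dma_copy(uint32_t dst, uint32_t src, size_t len)
{
    DMA0->src  = src;
    DMA0->dst  = dst;
    DMA0->len  = (uint32_t)len;
    DMA0->ctrl = 0x1u;              /* kick off the transfer          */
    /* The CPU could do useful work here instead of polling...        */
    while (DMA0->ctrl & 0x2u)
        ;                           /* ...but this sketch just waits  */
}
```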
Splitting a single bus into multiple buses and distributing the intelligent peripherals among them is one way to raise potential throughput, assuming most transfers can be confined to a local bus. Moving data from one bus to another runs into the same limitations as a single bus. This is why a block of memory is often dedicated to a particular bus: a DMA controller can then move data between that local memory and a peripheral without tying up the other buses.
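The memory map below sketches this arrangement for a hypothetical two-bus system in which each bus has its own local SRAM and DMA controller, so transfers can stay local. All addresses and peripheral names are invented for illustration.

```c
/* Sketch of a two-bus memory map in which each bus has its own local
 * SRAM and DMA controller. All addresses are invented for illustration. */
#include <stdint.h>

/* Bus 0: networking peripherals plus a local packet buffer */
#define BUS0_SRAM_BASE   0x20000000u   /* local memory on bus 0 */
#define BUS0_DMA_BASE    0x40002000u
#define BUS0_MAC_BASE    0x40003000u

/* Bus 1: storage peripherals plus their own local buffer */
#define BUS1_SRAM_BASE   0x21000000u   /* local memory on bus 1 */
#define BUS1_DMA_BASE    0x41002000u
#define BUS1_SDIO_BASE   0x41003000u

/* A MAC -> BUS0_SRAM transfer uses only bus 0, so it does not steal
 * cycles from a simultaneous SDIO -> BUS1_SRAM transfer on bus 1. Only
 * when data must cross buses (e.g. BUS0_SRAM -> BUS1_SRAM) does the
 * system fall back to single-bus behavior through the bridge. */
```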
Higher-performance systems generally try to maximize throughput. One approach, a crossbar switch, connects one set of devices to another, typically a set of processors or DMA units on one side and a set of memory blocks or peripherals on the other. The crossbar switch can connect any device in one set to any device in the other set, but each device can control, or be controlled by, only one other device at a time. If A and B both want to use C, then A goes first and B waits, just as on a bus. The difference is that the crossbar switch can support any number of simultaneous exchanges, as long as they don’t conflict.
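The toy arbiter below models that rule: any number of master-to-slave connections may be granted in the same cycle as long as no slave is requested twice; a conflicting master simply waits. It is a small illustrative model, not the API of any real interconnect.

```c
/* Minimal model of crossbar arbitration: simultaneous connections are
 * allowed whenever they don't conflict over a slave. Toy illustration
 * only. */
#include <stdbool.h>
#include <stdio.h>

#define N_MASTERS 4
#define N_SLAVES  4

/* request[m] holds the slave index master m wants, or -1 for idle */
static bool grant_requests(const int request[N_MASTERS],
                           bool grant[N_MASTERS])
{
    bool slave_busy[N_SLAVES] = { false };
    bool all_granted = true;

    for (int m = 0; m < N_MASTERS; ++m) {
        grant[m] = false;
        int s = request[m];
        if (s < 0)
            continue;                 /* master is idle              */
        if (!slave_busy[s]) {
            slave_busy[s] = true;     /* connection established      */
            grant[m] = true;
        } else {
            all_granted = false;      /* conflict: this master waits */
        }
    }
    return all_granted;
}

int main(void)
{
    /* Masters 0 and 1 both want slave 2; one of them must wait,
     * while master 2's access to slave 0 proceeds in parallel. */
    int req[N_MASTERS] = { 2, 2, 0, -1 };
    bool grant[N_MASTERS];
    grant_requests(req, grant);
    for (int m = 0; m < N_MASTERS; ++m)
        printf("master %d: %s\n", m, grant[m] ? "granted" : "waits");
    return 0;
}
```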
Crossbar switches are expensive, especially when either set holds more than half a dozen devices. Larger systems typically use a packet-switched interconnect instead. One notable processor family, AMD’s Opteron, takes this approach: the Opteron is built around HyperTransport, a high-speed, chip-to-chip interconnect.
This type of switch fabric is built from switches that forward packets of data. Latency is higher than it would be with a crossbar switch, but caching can significantly reduce its impact. The advantage is that a fabric scales much better, and it can also be extended off-chip easily.
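To make the packet idea concrete, the structure below sketches the kind of unit a fabric forwards: a small routing header followed by a payload. The field layout is invented for illustration and is not HyperTransport’s actual packet format.

```c
/* Sketch of a switch-fabric packet: routing header plus payload.
 * Field layout is hypothetical, not any real protocol's format. */
#include <stdint.h>

typedef struct {
    uint8_t  dest_node;     /* node the fabric routes toward           */
    uint8_t  src_node;      /* originator, for the response path       */
    uint8_t  type;          /* e.g. read request, write, response      */
    uint8_t  length;        /* payload length in 4-byte words          */
    uint32_t address;       /* target address at the destination node  */
    uint32_t payload[16];   /* data for writes or returned read data   */
} fabric_packet_t;

/* Each switch inspects dest_node, picks an output port from its routing
 * table, and forwards the packet; those per-hop decisions are where the
 * extra latency relative to a crossbar comes from. */
```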