March Of The Multibus MCUs

High-end microcontrollers often use large, complex crossbar switches and other technologies to maximize throughput and performance. Low-end microcontrollers typically feature a simple bus structure. But as performance increases, so does the need for more advanced architectures. Low- to mid-range microcontrollers are now moving into a new realm where balance is key.

Crossbars offer some performance advantages. Unfortunately, they do not scale well. This has led to interesting switching architectures like the communication rings inside the Element Interconnect Bus (EIB) of IBM's Cell processor (see "CELL Processor Gets Ready To Entertain The Masses" at www.electronicdesign.com, ED Online 9748). Yet crossbars and even switched, on-chip communication systems are too expensive for low- to mid-range MCUs. Instead, multiple-bus architectures have found their way into a variety of novel designs. These tend to be easier to implement while still meeting the performance requirements to support a chip's memory and peripherals.

The need to meet performance requirements is key, of course. But issues such as power usage, multicore solutions, and load balancing also influence new chip designs.

MULTIPLE-BUS MICROCONTROLLERS

The growth of flash-memory sizes has had an interesting impact on low- to mid-range microcontrollers. Larger flash memory takes up more chip real estate. This increase does not translate into a corresponding space increase for the processor and memory support, making the processor core a much smaller percentage of the chip. This affects a chip designer's options for these components.

Moving to processors with higher performance, such as the 32-bit ARM or Freescale ColdFire architectures, is one way to exploit the space compared to using an 8- or 16-bit platform as the processor core. An equally interesting approach to enhancing a chip design is to add DMA support. A DMA channel is significantly simpler than a processor, so chip designers often can add a number of DMA channels.

Adding complexity to a DMA channel with features such as chained or double buffering is often a low-cost option. Of course, adding DMA or going with a wider processor bus increases bandwidth requirements.
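To make chained buffering concrete, here is a minimal Python model of a DMA channel that walks a linked list of transfer descriptors. The descriptor fields (src, dst, length, next_desc) and the function name run_chain are hypothetical, chosen for illustration rather than taken from any vendor's register map.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Descriptor:
    """One hypothetical DMA transfer descriptor."""
    src: int                                   # start offset in source memory
    dst: int                                   # start offset in destination memory
    length: int                                # number of bytes to copy
    next_desc: Optional["Descriptor"] = None   # next descriptor in the chain, if any


def run_chain(head, src_mem, dst_mem):
    """Walk the chain, copy each block, and return the number of blocks moved."""
    blocks = 0
    desc = head
    while desc is not None:
        dst_mem[desc.dst:desc.dst + desc.length] = \
            src_mem[desc.src:desc.src + desc.length]
        blocks += 1
        desc = desc.next_desc   # follow the chain without CPU involvement
    return blocks


source = bytearray(b"HEADERpayload")
dest = bytearray(16)

# Two chained transfers: the header goes to offset 0, the payload to offset 8.
second = Descriptor(src=6, dst=8, length=7)
first = Descriptor(src=0, dst=0, length=6, next_desc=second)
moved = run_chain(first, source, dest)
```

In this sketch the processor sets up the chain once, and the channel then completes both transfers on its own.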

Sharing the bandwidth of a single bus can be useful, especially if the aggregate bus bandwidth meets the system design requirements. On the other hand, it's frequently possible to dedicate one or more DMA channels to different buses. In many cases, the multiple buses aren't identical in nature. Rather, they form a hierarchical interconnect, like Atmel's AVR32 (Fig. 1).

The AVR32 uses a type of crossbar switch for its top-level interconnect. The 32-bit processor and the DMA unit can access the high-speed matrix, while the peripheral DMA can service the two advanced high-performance buses (AHBs). The processor can access the lower-speed peripherals off these two buses, but it's more efficient if the DMA can handle those chores.

Likewise, transfers on the faster AHB matrix can occur while slower transactions take place on the AHBs. Still, transfers for all buses occur between the on-chip memory or off-chip memory and the peripherals.

Less complex architectures don't necessarily mean lower performance. Instead, chip designers look to match the capabilities of the architecture with the system requirements. This is the case with Microchip's dsPIC line (Fig. 2). The architecture features four buses and one dual-port RAM. Even so, the main memory and code flash-memory buses resemble any Harvard architecture microcontroller.

The flash memory is dedicated to supplying the instruction decoder, while the two peripheral buses are dedicated to the processor and the multichannel DMA controller. The buses let the processor and DMA simultaneously access any peripheral as long as both don't access the same one at the same time. This condition is commonly met when DMA is used with a peripheral.

The other difference from most conventional microcontroller architectures is the dual-port DMA memory, which is primarily intended for use by the DMA controller. The CPU typically will move a block of data in and out of this memory after a block transfer is completed.

Not surprisingly, the DMA controller can be set up to operate in ping-pong mode. In this case, a DMA channel alternates between two buffers, interrupting the CPU when it has finished with one buffer and is switching to the other. The CPU simply needs to process the completed buffer before the DMA channel needs it again. The DMA channels still must contend for the bandwidth of the DMA bus, but transfers are independent of the processor bus.
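The ping-pong scheme can be sketched as a small Python simulation. This is a toy model, not dsPIC code: the class name, the callback standing in for the interrupt, and the buffer size are all assumptions made for illustration.

```python
class PingPongChannel:
    """Toy model of a DMA channel alternating between two buffers."""

    def __init__(self, buffer_size, on_buffer_full):
        self.buffers = [[], []]
        self.buffer_size = buffer_size
        self.active = 0                        # buffer the DMA is currently filling
        self.on_buffer_full = on_buffer_full   # stands in for the CPU interrupt

    def dma_write(self, sample):
        """Store one sample; swap buffers and 'interrupt' when the active one fills."""
        self.buffers[self.active].append(sample)
        if len(self.buffers[self.active]) == self.buffer_size:
            full = self.active
            self.active ^= 1                   # DMA moves on to the other buffer
            self.buffers[self.active] = []     # start the next block empty
            self.on_buffer_full(self.buffers[full])


processed = []


def cpu_handler(block):
    # The CPU must finish with this block before the DMA wraps back to it.
    processed.append(sum(block))


channel = PingPongChannel(buffer_size=4, on_buffer_full=cpu_handler)
for sample in range(10):
    channel.dma_write(sample)
```

After ten samples, the handler has run twice (once per full buffer) and the last two samples sit in the buffer that is still being filled.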

NXP's LPC2300 32-bit, ARM7-based microcontroller family sits between the Microchip dsPIC and the Atmel AVR32 in terms of complexity (Fig. 3). Like the dsPIC, it uses only buses, along with dedicated memory for peripherals, but it has a more complex hierarchy like the AVR32. This approach gives the chip a significantly higher peripheral data throughput without the complexity of a crossbar switch or other high-speed switching system.

The processor uses the LPC2300's high-speed local bus for its on-chip memory. Its different DMA channels are distributed throughout the system. For example, the Ethernet interface has its own DMA controller. Also, it can use a bank of 32 kbytes of SRAM. Transfers by the DMA between the Ethernet interface and the local SRAM don't affect the other buses.

The USB DMA controller offers a tighter linkage with its block of memory. A 4-kbyte SRAM block is dedicated to the USB interface and DMA, whereas the peripheral DMA that supports the low-speed peripherals uses the 8-kbyte SRAM block on the same AHB. The USB transfers don't go across the AHB, so both can operate simultaneously.

Of course, peripheral data still needs to move in and out of the processor memory. Generally, though, a program will move blocks quickly when a DMA controller interrupts the processor after completing a block transfer.

NXP designers looked closely at the performance of all the components when assembling the architecture. The processor runs at 72 MHz to match the flash memory and SRAM performance. It operates with zero wait states.

With a single AHB and all peripherals active, bus utilization would be about 98%, leaving little headroom for computation. Worse, even an average utilization of 60% would result in significant collisions. While the bus could easily handle this and still reach full utilization, a significant number of peripherals would have to wait before transfers could be performed, requiring greater amounts of per-peripheral buffering to prevent data loss.
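The reasoning about a shared bus can be illustrated with a rough calculation. The sketch below assumes a 32-bit bus that moves one word per clock at 72 MHz, and the peripheral data rates are illustrative assumptions rather than the figures behind the 98% estimate.

```python
def bus_utilization(rates_bytes_per_s, bus_clock_hz, bus_width_bytes):
    """Fraction of peak bus bandwidth consumed by the given traffic sources."""
    # Assumes one full-width transfer per bus clock (an optimistic peak).
    peak_bytes_per_s = bus_clock_hz * bus_width_bytes
    return sum(rates_bytes_per_s) / peak_bytes_per_s


BUS_CLOCK_HZ = 72_000_000   # assumed equal to the 72-MHz processor clock
BUS_WIDTH_BYTES = 4         # assumed 32-bit data path

# Illustrative sustained rates in bytes per second (assumptions, not measurements).
ETHERNET = 100_000_000 // 8   # 100-Mbit/s Ethernet
USB = 12_000_000 // 8         # full-speed USB, assumed
OTHER = 4_000_000             # hypothetical total for the slower peripherals

# Everything on one bus versus Ethernet moved to its own bus.
shared = bus_utilization([ETHERNET, USB, OTHER], BUS_CLOCK_HZ, BUS_WIDTH_BYTES)
ethernet_only = bus_utilization([ETHERNET], BUS_CLOCK_HZ, BUS_WIDTH_BYTES)
remaining = bus_utilization([USB, OTHER], BUS_CLOCK_HZ, BUS_WIDTH_BYTES)
```

This model ignores arbitration, wait states, and processor traffic, so it does not reproduce the 98% estimate. It only shows how moving a traffic source onto its own bus removes that source's share from the shared bus.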

By dedicating a bus to the Ethernet controller, the designers guaranteed 100-Mbit/s Ethernet support without contention. The processor can still operate at 100% efficiency. The designers also saved power in two ways.

First, overall chip speed can be slower because multiple operations are performed in parallel. Second, the processor's speed and operating voltage, as well as the other bus and DMA controllers, are independently controlled. As a result, the Ethernet interface can operate at full speed while the processor slows down if it has less work to do. NXP provides a range of power-down modes so sections of the chip can be turned off completely when they aren't required.

MULTIPLE-CORE PROCESSORS

DMA controllers can be seen as very simple processor cores. Increasing their intelligence helps reduce the main processor's support overhead. Often, this is accomplished without significantly increasing the chip size. It can also extend the benefits already seen in the multiple-DMA architectures.

Innovasic Semiconductor takes this approach with its fido1100 microcontroller line (see "Back To The Future With The 68K" at www.electronicdesign.com, ED Online 13613). It's based on the CPU32 architecture initially used in the 68K-based, 32-bit family from Freescale (then Motorola). The 68K architecture, in turn, was used in the first Apple Macintosh.

The fido1100 has four Universal IO Controllers (UICs). Each UIC is a microprogrammable machine that can emulate a range of peripherals, like an Ethernet interface with intelligence that exceeds the typical peripheral/DMA combination. Each UIC is independent in terms of operation, speed, and power drain. Developers can use them as necessary and not pay a penalty if a UIC is idle.

IntellaSys' SEAforth chip represents the extreme end of the spectrum (Fig. 4). It has 24 identical Forth microcontrollers linked in a 2D mesh. Each processor communicates with its nearest neighbors. The interfaces on the periphery of the mesh are conventional interfaces.

In essence, the outer processors are similar to the fido1100 UIC because they're programmable. Some overlap exists in capabilities, such as parallel and serial port support. But their purpose and designs diverge, as the UIC can handle Ethernet. Still, the Forth cores are programmable and very flexible.

This mesh architecture represents a logical plethora of buses within the chip, though the number of devices on each bus is limited. The communication interfaces are implemented more as small bidirectional FIFOs, since the bus in each processor core handles only its processor and four interfaces.
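The nearest-neighbor structure of such a mesh can be sketched in a few lines of Python. The 4-by-6 layout and the row-major core numbering are assumptions made for illustration; the text states only that 24 cores are linked in a 2D mesh.

```python
ROWS, COLS = 4, 6   # assumed layout; 4 * 6 = 24 cores


def neighbors(core):
    """Return the cores directly above, below, left, and right of `core`."""
    row, col = divmod(core, COLS)
    result = []
    for d_row, d_col in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        r, c = row + d_row, col + d_col
        if 0 <= r < ROWS and 0 <= c < COLS:   # skip positions off the mesh edge
            result.append(r * COLS + c)
    return result


# Each link joins two cores, so it is counted twice in the sum.
link_count = sum(len(neighbors(core)) for core in range(ROWS * COLS)) // 2
```

Under these assumptions, a corner core has two neighbors, an interior core has four, and the mesh contains 38 point-to-point links. On a real chip, the unused ports of the edge cores would be where the conventional interfaces attach.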

As with the other architectures, each Forth core and the peripherals can be powered down to conserve power. In fact, the communication interfaces can wake up a processor so portions of the chip will turn on and off as data flows through the system.

Developers will need to take a closer look at processor architectures when choosing a platform. Faster clock speeds no longer guarantee better performance, and low-power modes don't guarantee power savings. The balance will make the difference. Understanding an application's requirements as well as the platform's capabilities will be important in creating products that can meet the demands of low power and high throughput.

NEED MORE INFORMATION?

Atmel
www.atmel.com
Innovasic Semiconductor
www.innovasic.com
IntellaSys
www.intellasys.com
Microchip
www.microchip.com
NXP
www.nxp.com