Hardware-Based IP Speeds Up Processor Multitasking

Today's high-performance processors spend a significant portion of their processing time switching from task to task, often executing hundreds of instructions for each switch. In a PC, that frequently means juggling a multimedia task, such as viewing an MPEG file, alongside handling a soft modem, running a word processor, accessing the disk drive, and controlling a printer. As CPU speeds go up, the processor can handle more tasks. Current PCs can simultaneously handle several major time-critical tasks and a number of less performance-critical tasks.

Embedded processors often face the same challenges and need to minimize their task-switching overheads to deal with many real-time events. Speeding up the processor is one approach to handling more tasks. But this method increases power consumption and may only allow the processor to handle one or two additional tasks.

Designers at Xyron Semiconductor have a better approach. They crafted and patented a hardware-based task manager that switches between tasks without instruction overhead. With such zero-overhead task switching, the processor can handle many more tasks without increasing the clock frequency. In fact, for some applications, a 50-MHz CPU could end up doing the work that previously required a 500-MHz processor.

For example, in a Pentium-based system that switches between three tasks, the task-switch overhead is typically about 360 cycles per switch. Cycling through all three tasks therefore consumes 1080 cycles of overhead. Those 1080 cycles take about 2.2 µs at 500 MHz and about 1.1 µs at 1 GHz. Although that time may seem small, the overhead can get in the way of real-time operations, such as multimedia playback or high-speed data communications, if it makes up a large percentage of system execution time.
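The arithmetic behind those figures can be checked with a short Python sketch. The function name and parameters are illustrative only; the 360-cycle cost and the three-task loop come from the example above.

```python
def switch_overhead_us(cycles_per_switch: int, num_tasks: int, clock_hz: float) -> float:
    """Overhead, in microseconds, for one pass through all tasks."""
    total_cycles = cycles_per_switch * num_tasks   # 360 * 3 = 1080 cycles
    return total_cycles / clock_hz * 1e6           # seconds -> microseconds


if __name__ == "__main__":
    for clock_hz in (500e6, 1e9):
        overhead = switch_overhead_us(360, 3, clock_hz)
        print(f"{clock_hz / 1e6:.0f} MHz: {overhead:.2f} us per pass")
```

The same function makes it easy to see how the overhead grows with the number of tasks, since the cycle count scales linearly while the clock rate stays fixed.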

Xyron's technology is called ZOTS, short for zero-overhead task switching. It greatly reduces or eliminates the interrupt latency and task-change processing overhead delays in processor architectures. ZOTS greatly accelerates processor performance by enabling the system to completely save the task state, or restore the task state between cycles, without software intervention. It's available from Xyron in two forms.

The first form, a 32-bit proprietary architecture processor, executes a MIPS-like instruction set and has the ZOTS acceleration technology embedded within it. Also known as the Xyronium processor, it will come in the form of a synthesizable core configurable in a Virtex II FPGA from Xilinx Inc. To support application development using the Xyronium core, Xyron also offers a development board optimized to handle such applications as video manipulation and storage, real-time industrial control, and networking.

The second form of ZOTS technology is pure intellectual property (IP). Xyron expects to license the IP to companies that design CPUs, DSP engines, and other compute engines for internal ASIC consumption and resale.

Adding ZOTS to a processor requires the design team to add read and write ports to the register files and make small changes throughout the processor, so that it would become more task aware. The task manager and the task storage from Xyron are essentially unchanged from processor to processor, so minimal, if any, changes would be made to the ZOTS technology.

By placing the task management into the silicon and storing all task information in a local static RAM, Xyron's designers crafted a system that could manage hundreds of tasks. The ZOTS approach includes the ability to handle variable-priority ramping, enabling priorities to change and be re-evaluated every cycle (Fig. 1). It permits deadline scheduling for high-priority tasks, plus pre-emptive and round-robin scheduling.

Concentrate On Task Processing: The near elimination of overhead in the task-switching function lets the processor focus on task processing, rather than on task management. To eliminate the overhead, Xyron separated the function into two parts, task switching and task management. The task-switching portion allows the system to switch between two tasks without software-initiated intervention by the CPU. The task-management portion manages a set of tasks, complete with priority structure and arbitration.

Task management is functionally equivalent to the operations that a software-based RTOS might perform. But no CPU cycles are needed to execute the task switch, so there's zero overhead, and the task switch is totally transparent to the CPU.

In an ideal system that runs two tasks, two register sets would be adequate. One set would be used by the active instruction stream. Loaded into the second register set would be the pointers to the instruction stream to be switched to. When the task is switched, the system just swaps register sets with no noticeable latency. As more tasks are added, increasing the number of register sets gets unwieldy, silicon area increases, and bus loading slows down the system.

Taking a radically different approach, Xyron created a switching mechanism that lets every resident task use the CPU resources it needs. This is accomplished by first storing the task registers in a dedicated dual-ported on-chip memory called a Task RAM (Fig. 2). Two multiplexers on the input side of the registers connect the input bus, the Task RAM, and the CPU to either register set. The multiplexers on the output side connect either register set to the CPU, the output bus, and back to the Task RAM.
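As a rough software analogy for the two register sets and the Task RAM, the Python sketch below models the data movement only. The class and method names are hypothetical, and the real mechanism is hardware that works between cycles, which a sequential model can't capture.

```python
class RegisterSwitchModel:
    """Toy model of two register sets backed by a Task RAM (names are hypothetical)."""

    def __init__(self, num_regs=4):
        self.task_ram = {}                            # task id -> saved register values
        self.sets = [[0] * num_regs, [0] * num_regs]  # the two physical register sets
        self.active = 0                               # index of the set the CPU sees
        self.owner = [None, None]                     # task id held in each set

    def create_task(self, task_id, regs):
        """Place a task's initial register image in the Task RAM."""
        self.task_ram[task_id] = list(regs)

    def load_standby(self, task_id):
        """Write the standby set back to Task RAM, then load the next task into it."""
        standby = 1 - self.active
        if self.owner[standby] is not None:
            self.task_ram[self.owner[standby]] = list(self.sets[standby])
        self.sets[standby] = list(self.task_ram[task_id])
        self.owner[standby] = task_id

    def switch(self):
        """Swap which register set the CPU sees; no register data is copied."""
        self.active = 1 - self.active

    @property
    def regs(self):
        return self.sets[self.active]

    @property
    def running(self):
        return self.owner[self.active]


if __name__ == "__main__":
    model = RegisterSwitchModel()
    model.create_task("A", [1, 0, 0, 0])
    model.create_task("B", [2, 0, 0, 0])
    model.load_standby("A")
    model.switch()
    model.regs[0] += 10          # task A modifies its own registers
    model.load_standby("B")      # happens while A is still running
    model.switch()
    print(model.running, model.regs)
```

The point of the model is that the running task never waits for a save or restore: the standby set is filled in the background, and the switch itself is only a change of selector.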

A delay in memory access or in reading an input port typically causes task suspension. Suspending an executing task lets the next-selected task run immediately, optimizing CPU usage. Likewise, a cache miss causes suspension of the running task, so cache management becomes important. Also, when executing instructions that belong to a running task, the CPU has read/write access to the task's register set in the usual manner, and the standby set isn't visible to the CPU.

Usually, task switching is initiated by an interrupt that must propagate through the system. Propagation typically consumes many CPU cycles to evaluate the priority and determine how the system should service the interrupt. But with the ZOTS approach, interrupts are simply requests for service. They're handled as any other task via a priority scheme that's part of the task-management portion.

As in any RTOS, a task-priority mechanism controls the task switching. In the ZOTS approach, task control starts with a set of task modules, each containing task-specific information (Fig. 3). Each module acts much like a dedicated state machine for its task and includes an Active flag, a priority counter, and some additional logic. Priority registers are loaded with the chosen priority limit and start priority. Then, the priority-ramping rate is loaded into the rate generator, and the counter is loaded from the start-priority register.

During each priority-increase interval (determined by the ramping rate), the priority is incremented. This lets the task's priority level increase to its maximum value. Upon each increment, the priority limit is compared to the current priority (the output of the counter), and a stop command is issued to the counter when the predetermined high-priority limit is reached. At that point, the task's priority remains constant at the high limit. A task, however, can run before its maximum priority is reached.
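The ramping behavior described above can be pictured with a small Python model of one task module. This is a sketch under stated assumptions: the field names are invented, the ramping rate is expressed as a number of cycles between increments, and the Active flag is assumed to gate only the bid, not the counter.

```python
from dataclasses import dataclass


@dataclass
class TaskModule:
    """Toy model of one task module (field names are hypothetical)."""
    start_priority: int      # value loaded into the counter at start
    priority_limit: int      # high limit at which the counter stops
    ramp_interval: int       # assumed: cycles between priority increments
    active: bool = True      # Active flag
    priority: int = 0        # counter output (current priority)
    _cycles: int = 0         # stands in for the rate generator

    def __post_init__(self):
        self.priority = self.start_priority   # counter loaded from start register

    def tick(self):
        """Advance one cycle; ramp the priority until the limit is reached."""
        if self.priority >= self.priority_limit:
            return                            # stop command: hold at the high limit
        self._cycles += 1
        if self._cycles >= self.ramp_interval:
            self._cycles = 0
            self.priority += 1

    def bid(self):
        """Bid sent to the mediator; zero when the Active flag is clear."""
        return self.priority if self.active else 0


if __name__ == "__main__":
    module = TaskModule(start_priority=1, priority_limit=4, ramp_interval=2)
    for cycle in range(1, 9):
        module.tick()
        print(cycle, module.bid())
```

A task that waits longer bids higher, which is what lets low-priority work eventually win the CPU without any software scheduler stepping in.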

Basically, each task sends a "bid" for control of the CPU and its associated resources. If the task wins the bid, its priority can either remain at its current level or be reset to the low limit, by software or by a sleep instruction. If the priority is reset, the bidding process begins anew. If it's left unchanged, bidding continues from the current level when another task takes control of the CPU.

The output of a task module is simply that task's priority, or bid for control of the CPU and associated resources. The information from the module is sent to the task control manager through a priority mediator, which identifies a task by the position of the connection containing the priority value (Fig. 4).

The task control manager has two main blocks: the priority mediator and a task controller. The mediator has access to each task's bid, which can be zero, as determined by an Active flag. The mediator also implements a scheme that performs pairwise comparisons in the form of a binary tree. A zero bid doesn't necessarily mean the task's priority is zero, but merely that the task is prevented from bidding due to other conditions, or simply that it's not yet activated.
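One way to picture the mediator's binary tree of pairwise comparisons is the Python sketch below. The function name is hypothetical, and the tie-breaking rule (the lower position wins) is an assumption, since the text doesn't say how equal bids are resolved. As in the description above, a task is identified purely by the position of its bid.

```python
def mediate(bids):
    """Return (position, bid) of the winning task, or None if every bid is zero.

    Each pass of the loop corresponds to one level of the comparison tree.
    """
    contenders = list(enumerate(bids))        # (position, bid) pairs
    if not contenders:
        return None
    while len(contenders) > 1:
        next_level = []
        for i in range(0, len(contenders), 2):
            pair = contenders[i:i + 2]        # an odd one out advances unopposed
            # Assumed tie-break: max() keeps the first (lower-position) entry.
            next_level.append(max(pair, key=lambda c: c[1]))
        contenders = next_level
    winner = contenders[0]
    return winner if winner[1] > 0 else None  # a zero bid never wins


if __name__ == "__main__":
    print(mediate([0, 3, 7, 2, 7]))
    print(mediate([0, 0, 0]))
```

In hardware, all comparisons at one tree level happen in parallel, so the number of levels grows only with the logarithm of the task count.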

The winning task's identification and priority (nonzero bid) are presented to the task controller, which keeps track of the executing task, the standby task, and the task winning the bid. With this information, the manager decides which task to run next and whether it's a new task, or enters an idle state if no task is ready to run. This way, the controller can require that the standby register be loaded with the next task's state from the Task RAM, then demand a register switch.

These decisions are made in accordance with the relative priority levels among the three tasks of interest. All tasks are referenced by pointers to a block in the Task RAM, so carrying out any of these decisions is a straightforward control action.
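The controller's choices can be summarized in a simplified Python sketch. It assumes the mediator has already resolved the priorities, so the controller only compares task identities. The action names are invented for illustration, and the actual hardware weighs the relative priorities of all three tasks in ways this sketch doesn't model.

```python
from enum import Enum


class Action(Enum):
    """Hypothetical names for the controller's possible decisions."""
    IDLE = "idle"
    KEEP_RUNNING = "keep running"
    SWITCH_TO_STANDBY = "switch register sets"
    LOAD_THEN_SWITCH = "load standby from Task RAM, then switch"


def decide(running, standby, winner):
    """Pick an action from the running, standby, and bid-winning task ids."""
    if winner is None:
        return Action.IDLE                 # no task is bidding
    if winner == running:
        return Action.KEEP_RUNNING         # the running task won again
    if winner == standby:
        return Action.SWITCH_TO_STANDBY    # state is already in the standby set
    return Action.LOAD_THEN_SWITCH         # a new task: fetch its state first


if __name__ == "__main__":
    print(decide(running="A", standby="B", winner="C"))
```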

Together, the task controller and priority mediator manage the bidding process. A running task is blocked from bidding if tasks of the same priority are requesting, but not if it's the only task of that priority. All tasks of priority lower than that of the running task participate in the bidding, ensuring possible access for all active tasks. If a resource like cache memory or a math unit isn't available to an executing task, that task is put to sleep. No CPU cycles are wasted on wait states, unless it's the only task.
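The rule that blocks a running task from bidding against its equals can be sketched in Python as follows. The function and argument names are hypothetical, and the sketch covers only the same-priority rule, not sleeping on unavailable resources.

```python
def effective_bids(priorities, active, running):
    """Return the bids the mediator would see, by task position.

    priorities: current priority of each task
    active:     Active flag of each task
    running:    position of the running task, or None
    """
    bids = [p if a else 0 for p, a in zip(priorities, active)]
    if running is not None and bids[running] > 0:
        rival_at_same_level = any(
            i != running and b == bids[running] for i, b in enumerate(bids)
        )
        if rival_at_same_level:
            bids[running] = 0   # blocked, so an equal-priority task gets a turn
    return bids


if __name__ == "__main__":
    print(effective_bids([5, 5, 2], [True, True, True], running=0))
    print(effective_bids([5, 3, 2], [True, True, True], running=0))
```

Blocking the running task only when an equal-priority rival is requesting is what produces round-robin behavior among equals while leaving a lone highest-priority task undisturbed.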

When the resource becomes available and the task priority allows it, the task can run. Other aspects of the task manager handle semaphores in addition to priority levels, providing more software-transparent control over task activity.

The development board includes the Virtex II FPGA and the Xyronium processor IP. On-board are NTSC video inputs and outputs and a programmable XVGA output. For audio, the board also includes a 44-kHz, 16-bit stereo codec.

Software for application execution on the Xyronium processor will run just as it would on a standard MIPS processor. But the RTOS task management and interrupt functions should be redirected to the ZOTS hardware technology. This requires simple recompilation to target functions that were once included in the software RTOS, not a complete code rewrite as one might expect.

One way to ease into the ZOTS technology is to start with the existing code running as one task. If desired, multiple copies of an entire application or operating system can be run as separate tasks. The ZOTS-enabled processors will initially work with the GCC C and C++ compiler suite. Future software releases will include embedded Linux and commercial RTOS porting libraries for standard RTOS solutions.

Price & Availability: The Xyronium processor synthesizable core can be licensed for $5000 plus an additional prepaid royalty for a single-instance use in an FPGA. Licenses for the full-IP disclosure for use in a processor are negotiable. The development board sells for $1495. All options are immediately available.

Xyron Semiconductor Inc., 203 S.E. Park Plaza Drive, Suite 210, Vancouver, WA 98684; (360) 449-8822; www.xyronsemi.com.