Co-Verify To Optimize Your Embedded Design

When it’s part of an existing tool flow, performance analysis promises improved end products with little or no cost increase.

Jim Kenney

May 1, 2004


As the capabilities of wireless networks improve and become more sophisticated, the expectations and desires of wireless-device users seem to grow exponentially. The result is an ever-increasing demand for better levels of service and performance from mobile devices. For true data and communications mobility, bandwidth must increase—even as power requirements decrease and security improves. These pressures are driven by an accelerating, highly competitive wireless marketplace. Now more than ever before, design engineers must confront an increasingly difficult set of tasks.

The challenges of wireless systems design are many: integrating radio-frequency (RF), analog, and digital signals; reducing power requirements to help extend the battery life of mobile devices; and managing a growing set of functional demands that result in greater hardware and software complexity. More often than not, the growth of system-on-a-chip (SoC) and embedded-systems designs forces all of these competing priorities to be managed on a single piece of silicon (FIG. 1).

The consumer appetite for wireless voice and data transfer is driving the burgeoning feature sets of ever-evolving wireless laptops, smart phones, and entertainment devices. Yet concerns about design performance are being raised in tandem with demands for increased device functionality. To respond to these challenges effectively, wireless systems designers must be able to explore alternative configurations early in the design cycle. They need the ability to easily make tradeoffs across mixed-signal boundaries. Designers also must be able to quickly and efficiently pose "what ifs" that shift functions between hardware and software. Having such capabilities in hand greatly improves the chances of achieving the increased performance that balances the required size, function, and cost of embedded designs.

The answer to providing these options lies in performance-analysis technology. But embedded-systems design teams are already struggling to gain a high level of confidence in functional verification before tapeout. Who has any time to devote to performance analysis and tuning? Plus, performance analysis requires another toolset. With its associated learning curve and support burden, this toolset is difficult to wedge into the project schedule. But what if performance analysis could be accomplished with minimal effort using a tool that's already in the flow? The market value of the end product could be enhanced substantially at little or no additional cost.

Correct functionality is vitally important to a design. If the design is not functionally correct, the speed at which the tasks are completed will be of little consequence. Success in the marketplace also is measured in terms of performance. A host of forgotten or unsuccessful products attests to the frailty of functional designs that failed to perform as expected. To meet the broader expectations of the wireless marketplace, performance verification—as measurable throughput of specific design architectures—must become a priority early in the design process. Once correct functionality has been confirmed, there is significant value in delivering a product that exceeds expectations and outperforms the competition.

Today's electronic-design-automation (EDA) and design communities place their emphasis on functional verification. So how can the performance optimization of embedded systems be elevated in the verification process? It's possible to add another discrete toolset and process step to focus on performance analysis and design tuning. Yet such a move is sure to be met by resistance from design teams. They wouldn't welcome the idea of wedging another toolset into an already tight project schedule. Nor would they want to deplete scarce design resources or strain already constrained budgets with increased tool costs.

Ideally, performance analysis and optimization would be realized as a complementary augmentation of tools and steps that are already integrated into the flow. A tool is needed that can do performance analysis on data that's gathered in the simulation environment. To address the requirements of embedded-systems performance, data from both hardware and software execution will be essential.

DATA-GATHERING CHOICES The quality of the performance measurements that are presented to the design team depends on the environment in which the data is collected. Embedded systems are characterized by integrated hardware and software components. Each component acts in concert to effectively execute the tasks that are intended by the design. While there is value in analyzing them separately, hardware and software execution have interoperative effects. To draw meaningful conclusions about the performance of the design, those effects must be evaluated together. Examples of such effects include processor instruction fetches and data reads or writes on the bus; processor load on memory subsystems; and bus arbitration conflicts between the processor and other bus masters in the design.

In order for an environment to be considered for co-simulation of hardware and software, it must have sufficient visibility to collect the required performance data. Contending environments include: logic simulation that instantiates a full model of the processor; hardware emulation that incorporates the physical processor (or a model of it); and hardware/software co-verification.

Of these three environments, logic simulation is the least effective. Because of the difficulty in integrating a software debugger in this type of modeling, the data for profiling software functions isn't available. In addition, the execution speed of a logic simulator is limited to under 20 instructions per second. As a result, not enough software can be run to provide meaningful results. Logic simulators can be configured to provide data about bus and memory transactions. At best, however, they can derive only hardware performance data, because they lack the ability to correlate it with software execution. More robust, full-system analysis is beyond the reach of this approach.

As an environment for performance-data collection, hardware emulation offers a substantial improvement over logic simulation. By representing the processor with a physical device that interfaces with a symbolic debugger, it can quickly derive data. That data can then be used to profile code. But in cases where the processor is either an emulator primitive or is implemented in one, the opportunities for capturing symbolic data are constrained. Here, the emulator can report on memory transactions. In addition, bus monitoring can be instantiated in the emulated design just as it is with logic simulation. Yet restricted symbolic data gathering does limit the robustness of system-performance analysis that can be achieved.

The environment with the greatest potential to provide a rich set of performance data is hardware/software co-verification. Comprising both an instruction-set simulator (ISS) and a software debugger, co-verification processor models are able to provide data for code profiling as well as cache hits and misses. The co-verification kernel processes all memory transactions to the memory subsystem, which is modeled in the logic simulator. This satisfies the data requirements for depicting memory activity graphically. Instantiating a bus monitor in the logic simulator provides data on bus loading and arbitration delay.

Hardware/software co-verification is the most promising environment for delivering a fully robust set of performance data. As a result, the discussion that follows will assume the use of a properly instrumented hardware/software co-verification tool.

PROFILING TRANSACTIONS By plotting the time that's required to execute each software module, designers get a profile of software transactions. It also is possible to combine ISS and software debuggers to collect data about software functions. But this approach doesn't integrate the hardware's impact on software execution (i.e., bus waits and interrupt handling). The integration of a hardware simulator with the ISS significantly improves the accuracy of the software profiling. The elapsed time for each discrete software function is displayed in nanoseconds. Included in this display are the time-to-service interrupts that are asserted by the hardware during function execution.

The significance of software-profiling data can be displayed with bar and Gantt charts. A bar chart can show the percentage of CPU resources that's consumed for each function (FIG. 2). A Gantt chart provides a sequential display of the function execution, calls, and returns. It shows the time that is required to perform each step (FIG. 3).
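As a rough illustration of the bar-chart view described above, the following Python sketch computes each function's share of CPU time from a list of execution intervals. The trace format (function name, start, and end in nanoseconds), the function names, and the numbers are assumptions made for this example; they are not the export format of any particular co-verification tool.

```python
from collections import defaultdict

def cpu_share(trace):
    """Return {function: percent of total profiled time}.

    `trace` is a list of (function_name, start_ns, end_ns) tuples, a
    hypothetical format. Each tuple covers time spent in the function
    itself, with callees and interrupt handlers listed separately.
    """
    totals = defaultdict(int)
    for name, start_ns, end_ns in trace:
        totals[name] += end_ns - start_ns
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {name: 100.0 * t / grand_total for name, t in totals.items()}

# Illustrative numbers only, not measurements from a real design.
trace = [
    ("ram_copy", 0, 600),
    ("isr_uart", 600, 700),   # interrupt serviced during the copy
    ("ram_copy", 700, 900),
    ("main_loop", 900, 1000),
]

shares = cpu_share(trace)
# Text-mode bar chart, largest consumer first.
for name, pct in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name:10s} {pct:5.1f}% {'#' * round(pct / 5)}")
```

Keeping the raw intervals rather than only the totals means the same data could also be laid out sequentially, in the manner of a Gantt chart.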

By knowing exactly how long it takes a time-critical function to execute, one can prevent system errors like incomplete data transfers or dropped packets. For example, one design team was not confident that the RAM copy routine (part of software initialization) would complete within the required time. They considered adding hardware to perform the RAM copy separately from the processor. But this alternative was both expensive and time consuming. In contrast, software profiling based on the hardware/software co-verification model provided the team with explicit information about the duration of the RAM copy function. The team could then be confident that the routine ran within the specified window. They avoided the time, effort, and expense of developing RAM copy hardware.

Software profiling can draw attention to critical functions that don't execute within the required time frame. It allows designers to improve overall system performance by speeding up selected code executions. Functions can be rewritten to improve efficiency in a number of ways: by implementing in assembly code rather than C, by changing the interrupt priorities while the function is executing, or by changing the implementation of the function from firmware to hardware.

MEMORY TRANSACTIONS Typically, embedded systems place significant demands on memory subsystems. The CPU and multiple hardware functions constantly compete for access to shared memory regions. As a result, a complete system simulation can be instrumental in identifying memory bottlenecks prior to tapeout.

A graphical depiction of memory transactions over time can highlight peak memory utilization. Designers can then focus on the most critical demands for memory bandwidth. The peaks and valleys that are represented in the graph indicate opportunities to balance memory access. Less time-critical functions can be shifted to a point at which memory bandwidth is underutilized (FIG. 4).
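The memory-activity graph can be approximated by counting accesses in fixed time windows. The sketch below assumes a hypothetical log of (timestamp in nanoseconds, function name) pairs; the window size and the function names are illustrative only.

```python
def memory_activity(accesses, window_ns):
    """Count memory accesses per time window.

    `accesses` is a list of (timestamp_ns, function_name) pairs, a
    hypothetical log format. Returns one count per window.
    """
    if not accesses:
        return []
    last = max(t for t, _ in accesses)
    counts = [0] * (last // window_ns + 1)
    for t, _ in accesses:
        counts[t // window_ns] += 1
    return counts

def functions_in_window(accesses, window_ns, index):
    """Names of the functions that accessed memory in a given window."""
    return sorted({fn for t, fn in accesses if t // window_ns == index})

# Illustrative data only.
accesses = [(10, "dma_rx"), (20, "dma_rx"), (30, "decode"), (40, "dma_rx"),
            (120, "decode"), (250, "ui_update"), (260, "ui_update")]

counts = memory_activity(accesses, window_ns=100)
peak = counts.index(max(counts))
print("accesses per window:", counts)
print("busiest window:", peak, functions_in_window(accesses, 100, peak))
```

Windows with low counts are the valleys into which less time-critical work might be moved.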

Firmware or hardware calls are correlated with a particular point in time. This approach helps designers identify and correct memory bottlenecks. By clicking at any point along the memory-activity graph, one sees the function names displayed as visual keys to memory usage. This capability aids designers in the annotation of memory reads and writes to improve memory performance.

Efficient cache activity also is crucial to overall system performance. If instruction and data caches are used effectively, the CPU load on the main memory can be minimized. In addition, firmware execution speeds may be improved. Excessive cache misses can indicate poor data locality for a given function. They also may suggest that the cache size should be increased. Once the nature of the problem has been identified, both cache size and configuration can be optimized to improve firmware execution and relieve memory loads.

Here, the "virtual world" that's invoked by a hardware/software co-verification tool delivers a significant benefit. Designers have the ability to quickly iterate on different cache configurations for optimal performance. Fast iterations speed the evaluation of proposed changes to cache size or algorithm. Those iterations are supported by a graphical software debugger, logic simulator, and the clear display of memory transactions. In contrast, attempting to optimize cache by manipulating a hardware prototype alone restricts a designer's options for change. It also provides only indirect feedback on efforts to improve efficiency.
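To give a feel for this kind of iteration, the sketch below replays one address trace against several cache sizes and reports the miss rate for each. It models a simple direct-mapped cache, which is an assumption made for brevity; the line size and the synthetic access pattern are likewise illustrative and do not represent any particular processor model.

```python
def miss_rate(addresses, cache_bytes, line_bytes=16):
    """Miss rate of a direct-mapped cache for a sequence of byte addresses."""
    num_lines = cache_bytes // line_bytes
    resident = [None] * num_lines      # block number held by each cache line
    misses = 0
    for addr in addresses:
        block = addr // line_bytes     # memory block containing this address
        index = block % num_lines      # cache line that the block maps to
        if resident[index] != block:
            resident[index] = block    # fill the line on a miss
            misses += 1
    return misses / len(addresses)

# Synthetic pattern: four passes over a 256-byte buffer, one word at a time.
addresses = [a for _ in range(4) for a in range(0, 256, 4)]

for size in (64, 128, 256, 512):
    print(f"{size:4d}-byte cache: miss rate {miss_rate(addresses, size):.4f}")
```

In this toy case, the miss rate stops improving once the cache holds the whole buffer, which is the kind of knee a designer would look for when sizing a cache.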

MANAGING BUS UTILIZATION In today's wireless embedded designs, bus bandwidth is a precious commodity (see "Co-Verification In Action," p. 36). Hardware components, which include CPUs, DMA controllers, peripherals, and data search engines, are all in competition for this resource. To provide insight into the operation of a design, bus utilization can be charted over time as a percentage of total bus bandwidth (FIG. 5).

If bus utilization maxes out at 100%, it indicates a bandwidth-limited function. Such a condition can reflect a DMA transfer, in which every available bus cycle is commonly in use. It also may reflect an unexpected peak in bus usage that requires further investigation. By reviewing the appropriate software-profile information, designers can determine which function calls are occurring during the bandwidth limitation and act on that information to implement changes. Identifying and eliminating bus-usage bottlenecks can dramatically improve system throughput.
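Bus utilization as a percentage of total bandwidth can be sketched as the fraction of busy cycles in each time window. The per-cycle busy flags below stand in for what a bus monitor might record; that format, the window length, and the sample data are assumptions for illustration.

```python
def bus_utilization(busy_cycles, window):
    """Percent of cycles in each window during which the bus was busy.

    `busy_cycles` is a list of 0/1 flags, one per bus clock cycle, as a
    hypothetical bus monitor might record them.
    """
    result = []
    for start in range(0, len(busy_cycles), window):
        chunk = busy_cycles[start:start + window]
        result.append(100.0 * sum(chunk) / len(chunk))
    return result

def saturated_windows(utilization, threshold=100.0):
    """Indices of windows at or above the utilization threshold."""
    return [i for i, u in enumerate(utilization) if u >= threshold]

# Illustrative data: a light period, a fully busy burst, then near idle.
busy = [1, 0, 0, 1] + [1] * 4 + [0, 1, 0, 0]

util = bus_utilization(busy, window=4)
print("utilization per window (%):", util)
print("bandwidth-limited windows:", saturated_windows(util))
```

The saturated window indices give the time ranges to look up in the software profile when determining which function calls were active.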

The bus arbiter controls bus master access to a particular bus. It's difficult to choose the most effective arbitration scheme and set correct function call priorities in order to balance bus access. Adjustments are made to these parameters to ensure sufficient bus access for critical functions without ignoring lower-priority functions.

It can be challenging to correctly identify bus-arbitration problems. Buffers may back up and overflow or data may be dropped entirely. Often, it's unclear that the problematic behavior that's being observed is rooted in arbitration. By plotting the time that's required to grant bus access, one can gain insight into the source of arbitration problems. This process may help designers avoid what can otherwise be a slow and tedious task. Viewing arbitration delay makes balancing the bus access easier than monitoring changes in secondary effects.

Often, the task of balancing bus access requires multiple iterations. Improving access for one bus master can degrade access for others. To reduce the time that's required to complete each iteration, designers can review an arbitration-delay graph after changes are made to function call priorities or the arbitration scheme. Faster iterations make it easier to reach an optimum bus-access balance. They also save critical design time.
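The arbitration-delay data behind such a graph amounts to the time between each bus request and its grant. The sketch below computes the average delay per bus master from a hypothetical event log; the (cycle, master, kind) format, the master names, and the assumption of at most one outstanding request per master are all illustrative.

```python
from collections import defaultdict

def arbitration_delays(events):
    """Average request-to-grant delay per bus master, in cycles.

    `events` is a time-ordered list of (cycle, master, kind) tuples, where
    kind is "req" or "gnt". This is a hypothetical bus-monitor log format
    and assumes at most one outstanding request per master.
    """
    pending = {}                       # master -> cycle of its open request
    delays = defaultdict(list)
    for cycle, master, kind in events:
        if kind == "req":
            pending[master] = cycle
        elif kind == "gnt" and master in pending:
            delays[master].append(cycle - pending.pop(master))
    return {m: sum(d) / len(d) for m, d in delays.items()}

# Illustrative log: the CPU is granted quickly while the DMA waits.
events = [
    (0, "cpu", "req"), (0, "dma", "req"),
    (1, "cpu", "gnt"),
    (4, "cpu", "req"),
    (5, "cpu", "gnt"),
    (9, "dma", "gnt"),
]

avg = arbitration_delays(events)
for master, delay in sorted(avg.items()):
    print(f"{master}: average wait {delay:.1f} cycles")
```

Rerunning the same calculation after each change to priorities or arbitration scheme shows directly whether improving one master's access has lengthened the wait for another.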

VERIFICATION-TOOL PERFORMANCE From these examples, it's clear that performance analysis can deliver important benefits that enhance the value of embedded designs. The evaluation of functional-verification tools should include performance analysis as an important consideration along with speed, accuracy, capacity, and language support. The tools that incorporate performance-analysis technology can tune embedded hardware and software in order to achieve optimal throughput and efficiency. The future of this technology promises even greater benefits for automating changes that will enhance design performance (see "New Technology Spurs Performance Optimization," p. 38).

When a design's operation characteristics are presented in a clear, flexible, and easily accessible manner, substantial performance gains can be achieved. Such gains are possible even with a relatively small expenditure of development resources. The quick analysis of performance alternatives can result in designs that yield end products of superior value. These end products will be well poised for success in the demanding wireless marketplace.