Software Simulation Blasts Bugs In Network Hardware Designs

You can save time and money by finding system bugs and performance bottlenecks early in the design cycle without building a hardware prototype.

Michael J. Miller

July 7, 2003


DESIGN VIEW is the summary of the complete DESIGN SOLUTION contributed article, which begins on Page 2.

Conventional hardware debugging techniques can't keep up with today's complex network products. As a result, design bugs remain unseen until system hardware is prototyped. Fortunately, designers have a better alternative: using software simulation to find system bugs and performance bottlenecks early in the design cycle. Software simulation changes the development flow to let designers check performance and correct an architecture as needed, without building a hardware prototype. Bugs are then found and fixed early in the design cycle.

In a conventional design-flow process, designers can debug, test, and optimize the hardware and software only after the software is executed on a prototype board. If a large performance problem emerges, the designer must readjust the architecture and potentially modify the schematic and circuit board. This typically translates into a schedule slip of two to five months or a sacrifice in function or performance. Further hindering the conventional flow is the increased use of sophisticated network processor units (NPUs) to process packets in the datapath and perform management functions in a separate CPU.

Software simulation of the NPU and elements that connect to it can circumvent conventional debugging problems. This system-level architectural modeling offers maximum visibility and control. System-level simulations are usually data-accurate models, which don't consider timing of the device. To address timing, some models are both data- and cycle-accurate.

Discussed in this article are the many benefits of system-level simulation. The article also points out what to look for once the decision is made to take the software route.

HIGHLIGHTS:
The Old Flow	The conventional design flow starts by deciding on an architecture. Then the schematic is captured and the circuit board fabricated while writing software. Once software is executed on a prototype, then you can debug and test.
Growing Complications	Use of network processor units (NPUs) makes the conventional flow more difficult. Designing an NPU-based line card requires balancing of compute resources.
Key Goals	For network products to achieve performance/cost goals, tradeoffs are made and development tools tweak the system early in the cycle for optimal performance. But today's complex designs are obsoleting conventional debuggers.
A Simulation Solution	Avoid conventional debugging problems by doing software simulation of the NPU and elements (e.g., coprocessors) that connect to the NPU. Such simulation provides maximum visibility and control.

Full article begins on Page 2


When faced with the complexity of today’s latest network products, conventional hardware debugging techniques are running out of steam. These complex products include routers, switches, line cards, and other network equipment that use network processors. As a result, design bugs typically remain unseen until system hardware is prototyped. The result is serious schedule slips that ultimately endanger a product’s timely entry into the market.

Fortunately, designers now have a better alternative to conventional hardware debugging: using software simulation to find system bugs and performance bottlenecks early in the design cycle. Specifically, software simulation changes the development flow to let designers check performance and correct an architecture as needed, without building a hardware prototype. Thus, bugs are found and fixed early in the design cycle, reducing their impact on time-to-market.

With this level of visibility and control, designers can determine the best bus loading mixture, experiment with different component combinations on the buses, and try different mixes of macros to analyze the most efficient sequences. These advantages have prompted several vendors, including Integrated Device Technology Inc., Applied Micro Circuits Corp., and Intel Corp., to develop simulation models for their key network devices. Moreover, third-party software vendors such as Teja Technologies offer tools with common software environments that integrate the different models together to simplify system-level development and optimization.

The Old Flow

In the older conventional design flow, the process starts by deciding on an architecture, in particular by choosing between application-specific integrated circuits (ASICs) and a central processing unit (CPU) as the core packet processing system. ASICs offer a "fine-grain" architecture, giving designers the full flexibility of working at the gate level, albeit at the expense of hardware development time. In contrast, a CPU-based system can be assembled quickly, though software development time may become significant.

After selecting the architecture, the next steps in a conventional development flow are to capture the schematic and fabricate a circuit board while simultaneously writing the software (Fig. 1). In the past, when using a single-CPU system, application code would typically be written in C/C++ and coupled to a real-time operating system, libraries, drivers, and graphical-user and application-programming interfaces.

Only after executing the software on a prototype circuit board could the designer debug, test, and optimize the hardware and software to meet performance and functional requirements. In most cases, a source-level debugger would be used to step through the code to ensure that the software did what was expected. In addition, following this first level of debugging, the designer would wield profiling, coverage, and bounds-checking tools to further test and optimize the software.

Hardware tools that assist in debugging include an in-circuit emulator (ICE) for reading register contents and setting breakpoints, and a logic analyzer to examine traces of code. When all else fails, the printf command could be used to print out values that may not be accessible by hardware tools.

If a large performance problem appeared, the designer would likely need to go back and adjust the architecture, and possibly modify the schematic and circuit board. This typically translates into either a big schedule slip or a sacrifice in function or performance. For instance, changes to a schematic and the subsequent board can take two to three weeks; modifications to an FPGA take one week; and changing a key component takes up to several weeks. Accounting for software changes, it’s not excessive to expect even a moderate architectural change to push back a schedule by two to five months.

Growing Complications

Making this conventional development flow more difficult is the increased use of sophisticated network processor units (NPUs) to process packets in the datapath and perform management functions in a separate control-path CPU. Where single-CPU systems limit software debugging to one active thread executing at a time, even in a multitasking environment, NP-based systems use multiple threads executing on multiple packet-processing elements, each executing short segments of microcode.

Indeed, today’s network-equipment marketplace calls for very complex real-time systems consisting of interconnected chassis, each with multiple shelves, and each shelf containing multiple line cards. Moreover, these line cards must handle numerous protocols and packet formats, as well as multiple simultaneous events occurring throughout the system.

To meet such expansive requirements, the latest line cards consist of a complex integration of a CPU, one or more NPU(s), dedicated coprocessors, packet framers, and physical-layer interfaces (Fig. 2). Among the most complex of these devices is the NPU, which typically contains packet interfaces to a framer and a switch fabric, a number of DRAM and SRAM controllers and buses, multiple internal packet-processing elements, and an external control-plane processor interface.

Consequently, designing an NPU-based line card, with its multiple buses and packet-processing elements, brings up a host of issues to resolve. These include memory utilization, computation scheduling, control-plane interaction, and power utilization.

For example, in considering multiple sets of DRAMs with which to work, a designer must decide what goes in which DRAM. In other words, what’s the best way to partition the contents so that all of the packet-processing elements don’t try to access the same DRAM at the same time? Like partitioning problems, scheduling problems can create inefficiency if several packet-processing elements try to access the same memory interface at once, causing them to wait for one packet-processing element to complete a task.
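The partitioning question above can be explored with a very small model before any detailed simulation exists. The following is a minimal sketch, assuming single-ported DRAM banks that serve one request per cycle and a deliberately crude stall rule in which every extra requester loses one cycle. The function name `stall_cycles` and both access patterns are hypothetical and only illustrate the comparison.

```python
from collections import Counter


def stall_cycles(accesses_per_cycle):
    """Count cycles lost to bank conflicts.

    accesses_per_cycle: one list per cycle, holding the bank index that
    each packet-processing element requests in that cycle.
    Assumption: a bank serves one request per cycle, and every other
    requester of the same bank loses one cycle.
    """
    stalls = 0
    for banks in accesses_per_cycle:
        for count in Counter(banks).values():
            stalls += count - 1
    return stalls


# Hypothetical partitioning A: everything lives in bank 0.
shared = [[0, 0, 0, 0]] * 100
# Hypothetical partitioning B: contents spread across four banks.
spread = [[0, 1, 2, 3]] * 100

print(stall_cycles(shared))  # 300
print(stall_cycles(spread))  # 0
```

A real model would requeue the waiting elements and account for DRAM timing, but even this rough count shows why the placement of contents matters.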

As for the control plane, which performs maintenance tasks on the NPU’s high-speed data plane, it also must interact efficiently with the memory system. Lastly, the designer must contend with the power utilization of NPUs, network search engines (NSEs), and real-time coprocessors, which can make up a considerable portion of a board’s power requirements.

Key Goals

Ultimately, most network products, like line cards, aim to achieve the required data throughput, bus bandwidth, and system latency; keep power consumption low; minimize code space and memory use; meet specified functionality requirements; and hit cost and time-to-market targets. These goals are reached by making tradeoffs and using development tools that can tune, analyze, and retune a system to optimize performance as early as possible in the development cycle. Unfortunately, applying conventional debugging tools to a hardware prototype is not only too late, but also, given the complexity of today’s designs, wholly inadequate.

For example, an NPU that has multiple packet-processing elements, each executing multiple threads simultaneously with no operating system, outstrips the capability of conventional source-level debuggers. Conventional debuggers can’t trace multiple simultaneous packet-processing elements and other components. On top of that, they can’t show the timing interactions among computational elements. Moreover, in-circuit emulators and logic analyzers have access only through available package pins, limiting visibility.

As for the printf command, adding these instructions to existing code slows real-time execution. This creates an inaccurate performance picture.

A Simulation Solution

To circumvent the problems of conventional debugging, designers can turn to software simulation of the NPU and other elements, such as coprocessors, that connect to the NPU. Software simulation, or system-level architectural modeling, offers the crucial benefits of maximum visibility and control. Packet streams, captured from real-time traffic, can be used as input data to the simulator to represent actual system data.

System-level simulations are typically data-accurate models that process and respond to commands exactly like the actual devices. However, while data-accurate models accurately simulate the functionality of the device, they do so immediately without considering the timing of the device.

To address the timing issue, some models are both data- and cycle-accurate. These models take the device timing and system modeling requirements into consideration. That is, the model typically behaves as the real device would in an actual system.

An example of a cycle-accurate model is our system-level architecture model, which provides a cycle-accurate ‘C’ model of IDT network search engines (NSEs) with integrated quad-data-rate (QDR) interfaces. Cycle-accurate models include a transactor that calls on all modules of the simulated design to perform the functions for each clock cycle, thus synchronizing the "execution" of elements in the simulation. Languages like Verilog and VHDL have this timing concept built into them, whereas C/C++ do not. Therefore, designers using C/C++ must build it into the simulator’s architecture.
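To make the distinction concrete, here is a minimal sketch of a transactor that calls every module once per simulated clock. It is not the IDT model; the class names, the `tick` method, the lookup table, and the four-cycle latency are all assumptions chosen for illustration. The same device object offers a data-accurate call that answers immediately and a cycle-accurate path whose answer appears only after the latency has elapsed.

```python
class SearchEngineModel:
    """Hypothetical lookup device with a fixed latency in clock cycles."""

    def __init__(self, table, latency):
        self.table = table
        self.latency = latency
        self.pipeline = []   # entries: [cycles_remaining, key]
        self.results = []    # entries: (cycle_completed, key, value)

    def lookup_now(self, key):
        # Data-accurate view: correct answer, no notion of time.
        return self.table.get(key)

    def issue(self, key):
        # Cycle-accurate view: the answer appears `latency` ticks later.
        self.pipeline.append([self.latency, key])

    def tick(self, cycle):
        # Advance every in-flight request by one clock.
        for entry in self.pipeline:
            entry[0] -= 1
        done = [e for e in self.pipeline if e[0] == 0]
        self.pipeline = [e for e in self.pipeline if e[0] > 0]
        for _, key in done:
            self.results.append((cycle, key, self.table.get(key)))


class Transactor:
    """Calls tick() on every registered module once per simulated clock."""

    def __init__(self, modules):
        self.modules = modules
        self.cycle = 0

    def run(self, cycles):
        for _ in range(cycles):
            self.cycle += 1
            for module in self.modules:
                module.tick(self.cycle)


nse = SearchEngineModel({"10.0.0.1": "port3"}, latency=4)
sim = Transactor([nse])
nse.issue("10.0.0.1")
sim.run(5)
print(nse.lookup_now("10.0.0.1"))  # port3
print(nse.results)                 # [(4, '10.0.0.1', 'port3')]
```

Because the transactor owns the clock, every module sees the same cycle number, which is the synchronization that Verilog and VHDL provide natively and that a C/C++ simulator has to build for itself.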

Importantly, system-level simulation clears a path to an improved development flow in which the architecture is decided, simulated, and verified early in the design process (Fig. 3). In this way, the designer can flag performance problems and make the necessary changes without building a hardware prototype. After performance and functional requirements are confirmed, the designer would have the confidence to capture the schematic, fabricate a circuit board, and integrate and run the software.

Clearly, this development flow greatly reduces the risk of encountering a serious problem in the prototype. The benefit is a much smaller likelihood of schedule slips and missed market opportunities. Of course, this improved development flow requires that the NPU and related elements be accurately simulated. Each silicon manufacturer can guarantee this accuracy if the same model that simulates the architecture is also used to verify the product—thus correlating the model against the actual design.

In following this new development flow for NPU-based equipment, the hardware design proceeds much as before: It starts with a circuit design and schematic capture and ends in a prototype. From a software perspective, however, code must be developed and tested for both the control and data planes (Fig. 4). For the control plane that manages the NPU, a conventional source-level debugger is generally adequate. However, for the data-plane code with its real-time, high-speed, multithreaded and multiprocessor requirements, software debugging would need to be performed in a much more sophisticated environment.

Running code on the simulated NPU and related devices lets designers see what’s going on inside those components. The ability to analyze the operation step by step in relation to the original source-code macros, without affecting the real-time nature of code execution and without having to work with hardware, is mandatory to achieve prompt time-to-market performance. Among the requirements for this simulation environment would be to make all operations visible, display the different threads of execution, show when a thread is active as well as when it’s blocked, and reveal the utilization of each processing element.
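As a rough illustration of the last two requirements, the sketch below derives utilization from a per-cycle state trace. The trace format (one character per cycle, with "A" for active and "B" for blocked) and the element names are assumptions made for this example, not the output of any particular vendor tool.

```python
def utilization(trace):
    """Return the fraction of cycles each processing element was active.

    trace maps an element name to a string of per-cycle states:
    'A' = a thread is executing, 'B' = all threads are blocked.
    """
    return {name: states.count("A") / len(states)
            for name, states in trace.items()}


# Hypothetical ten-cycle trace for two packet-processing elements.
trace = {
    "pe0": "AAAABBAAAA",
    "pe1": "AABBBBBBAA",
}

for name, states in trace.items():
    print(name, states)          # a crude text version of a thread timeline
print(utilization(trace))        # {'pe0': 0.8, 'pe1': 0.4}
```

A graphical tool would draw the same data as a timeline, but the underlying bookkeeping can be this simple because the simulator, unlike real silicon, can record every state change without disturbing execution.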

The multitude of statistics that a simulation can acquire, and the vast variety of possible graphic display and results processing mechanisms, are just the beginning. Designers constantly look for capabilities that extend beyond just mimicking the device, subsystems, and systems. So tools that permit viewing the internal functions, or that measure bus utilization and power consumption, are also integral to the development process. For example, power-tracking software tools let designers analyze average and worst-case power-consumption scenarios using the expected database configurations and the developed microcode (Fig. 5).
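The average and worst-case figures mentioned above reduce to simple arithmetic once the simulator logs what the device did on each cycle. The following sketch assumes a per-operation cost table; the operation names and milliwatt values are invented for illustration and do not describe any real device.

```python
def power_report(activity, power_mw):
    """Return (average, worst-case) power over a simulated run.

    activity: one list per cycle naming the operations performed.
    power_mw: hypothetical cost in milliwatts for each operation type.
    """
    per_cycle = [sum(power_mw[op] for op in ops) for ops in activity]
    return sum(per_cycle) / len(per_cycle), max(per_cycle)


# Made-up numbers, for illustration only.
power_mw = {"idle": 50, "search": 400, "write": 250}
activity = [["idle"], ["search"], ["search", "write"], ["idle"]]

average, worst = power_report(activity, power_mw)
print(average, worst)  # 287.5 650
```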

Importantly, simulation can enable multiple "peering" points normally unavailable with real silicon. These peering points become increasingly important as standalone devices continue to be integrated into a single monolithic entity.

In the past, developers could connect logic analyzers and other monitoring and debugging devices to the outputs and inputs. Now, these buses are no longer visible to the developer. Simulations need to expose both the internal and external buses so that they can be monitored, examined, and graphically displayed via methods and manners appropriate to the bus type. Simulation must also allow peering into internal data structures. For example, the ability to examine the contents of a FIFO or free list, or the fullness of a ring, can be extremely advantageous to a developer at multiple times during the development process.
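A peering point into an internal data structure can be as simple as a model that records its own fill level every cycle. The sketch below is a hypothetical FIFO model, not taken from any vendor's simulator; the class name, the `sample` method, and the producer and consumer rates are assumptions.

```python
from collections import deque


class ProbedFifo:
    """Hypothetical FIFO model that records its fill level each cycle."""

    def __init__(self, depth):
        self.depth = depth
        self.items = deque()
        self.history = []

    def push(self, item):
        if len(self.items) >= self.depth:
            return False         # full: the caller must stall
        self.items.append(item)
        return True

    def pop(self):
        return self.items.popleft() if self.items else None

    def sample(self):
        # Called once per simulated cycle by the test bench.
        self.history.append(len(self.items))


fifo = ProbedFifo(depth=4)
for cycle in range(6):
    fifo.push(cycle)             # producer: one item per cycle
    if cycle % 2:
        fifo.pop()               # consumer: one item every other cycle
    fifo.sample()

print(fifo.history)              # [1, 1, 2, 2, 3, 3]
```

The steadily rising history shows a producer outrunning its consumer several cycles before the FIFO actually fills, which is the kind of early warning that package pins alone cannot provide.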

Besides supporting multiple peering points, simulation environments must become smarter and identify errors and bottlenecks not easily observed by the developer. Smarter simulations generally mean an increase in both positive and negative feedback, as well as the ability for the developer to specify the warning or error notification levels.

Errors such as internal device FIFO overruns, resource contention, or writing into nonexistent memory spaces are examples of events that the developer needs to identify immediately. Other errors, such as exceeding the system power budget, need to be monitored and evaluated by the system simulation on every cycle. If the error condition is met, then it must be reported to the developer. Execution stall conditions should be monitored and graphically reported, and the developer should be notified as desired on each occurrence. These include stalls caused by resource depletion and stalls in which a full command FIFO halts execution until the FIFO empties enough to accept a new command.
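One way to picture these checks, together with developer-selectable notification levels, is a monitor that is evaluated on every simulated cycle. This is a sketch under stated assumptions: the `Monitor` class, the `check_cycle` function, its parameters, and the mapping of events to severities are all hypothetical. Only the severity constants and `getLevelName` come from Python's standard `logging` module.

```python
import logging


class Monitor:
    """Collects simulation events at or above a chosen severity level."""

    def __init__(self, level=logging.WARNING):
        self.level = level
        self.events = []

    def report(self, level, cycle, message):
        if level >= self.level:
            self.events.append((cycle, logging.getLevelName(level), message))


def check_cycle(mon, cycle, fifo_fill, fifo_depth,
                write_addr, mem_size, power_mw, budget_mw):
    """Evaluate the per-cycle rules; the severities are an assumption."""
    if fifo_fill > fifo_depth:
        mon.report(logging.ERROR, cycle, "FIFO overrun")
    elif fifo_fill == fifo_depth:
        mon.report(logging.INFO, cycle, "command FIFO full, execution stalled")
    if write_addr is not None and not 0 <= write_addr < mem_size:
        mon.report(logging.ERROR, cycle, "write to nonexistent memory")
    if power_mw > budget_mw:
        mon.report(logging.WARNING, cycle, "power budget exceeded")


mon = Monitor(level=logging.WARNING)   # developer hides INFO-level stalls
check_cycle(mon, 1, fifo_fill=4, fifo_depth=4, write_addr=0x10,
            mem_size=0x100, power_mw=300, budget_mw=500)
check_cycle(mon, 2, fifo_fill=2, fifo_depth=4, write_addr=0x200,
            mem_size=0x100, power_mw=650, budget_mw=500)
print(mon.events)
```

With the level set to `logging.WARNING`, the stall in cycle 1 is suppressed while both problems in cycle 2 are recorded; lowering the level to `logging.INFO` would surface the stall as well.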

Another challenge for the simulation model is getting to the "ready-to-run" or "running" state as quickly and easily as possible, especially when not all components are available. A good example of this today is the lack of control-plane processors, which typically perform most initialization and configuration processing.

Again, using the NSE as an example, the NSE database is generally initialized and managed via the control-plane processor. The lack of a control-plane processor poses the problem of both initially configuring the database characteristics and adding search entries to the database.

Simulation models must make it quick and easy to supply the initial device configuration state along with the device’s data contents. The models must be able to consistently load or reload the data. That way, the same simulation configuration can be run multiple times as required to generate accurate and reproducible results.

The importance of being able to quickly reload the state outside of the running simulation execution time can’t be overstated. The faster the simulation reaches the ready-to-run state, the more simulation sequences can be run.

It also is important for the simulations to be able to save the current state. Then these saved profiles can be used to reload the system, provide for post-mortem analysis, and perhaps most importantly, become incorporated into an automated regression system. Enabling the system designer to configure the system quickly, and easily insert system-modeling data for a complete simulation run, is a necessity.
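The save-and-reload idea can be sketched with a model whose whole state serializes to a string. The `NseModel` class below is hypothetical and stands in for any device model; JSON is simply one convenient format, and the "learn on miss" behavior exists only to make the run change the database so that the value of reloading is visible.

```python
import json


class NseModel:
    """Hypothetical search-engine model: a config plus a key/value database."""

    def __init__(self, config, entries):
        self.config = dict(config)
        self.entries = dict(entries)

    def save(self):
        # Serialize the full state; the string could be written to a file.
        return json.dumps({"config": self.config, "entries": self.entries},
                          sort_keys=True)

    @classmethod
    def load(cls, blob):
        data = json.loads(blob)
        return cls(data["config"], data["entries"])

    def run(self, keys):
        """Look up each key; a miss adds the key, mutating the database."""
        hits = 0
        for key in keys:
            if key in self.entries:
                hits += 1
            else:
                self.entries[key] = "learned"
        return hits


traffic = ["a", "x", "b", "x", "y"]
model = NseModel({"width": 72}, {"a": "port1", "b": "port2"})
profile = model.save()                      # saved before the run

first = model.run(traffic)                  # mutates model.entries
rerun_dirty = model.run(traffic)            # database has changed
rerun_clean = NseModel.load(profile).run(traffic)

print(first, rerun_dirty, rerun_clean)      # 3 5 3
```

Rerunning on the modified model gives a different hit count, while reloading the saved profile reproduces the first result exactly. An automated regression system relies on that property when it compares each run against a stored expected value.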

With the growing complexity of networking systems and the challenges involved with developing an NP-based system, software modeling is playing an increasingly important role for system designers. Going beyond these simulation models, some component vendors are introducing complete software development kits. These kits include application software, such as control-plane library code and data-plane macros, as well as system-analysis tools, diagnostic code, and flags to indicate when rules of a device’s operation are violated. The advent of these software-development tools, when paired with system-level simulation models, lets designers optimize their systems prior to silicon availability. This saves valuable development time and, ultimately, gives customers an enhanced, reliable solution.