It’s shortsighted to think that boundary scan can be used only for interconnect testing. The Computer Division at Unisys, with the help of ASSET InterTech, recently discovered that troubleshooting and ongoing maintenance support of systems in the field can also reap big benefits from scan test.
The Unisys commitment to providing maximum boundary scan coverage on the A11 central processing module (CPM) paid off when several systems began booting out jobs because of “bad records on disk.” The problems encountered in one system were particularly critical since it was being used by a large and important OEM customer.
The A11 CPM in question would run for hours at peak levels of performance and then suddenly and mysteriously boot out a job. By the time the frustrated and angry customer requested assistance, the system’s productivity had plummeted.
System error messages led technical field support to suspect problems with cache memory coherency. Field engineering swapped out the CPM card and shipped it back to the factory where test engineers and designers began asking questions. Boundary scan had been designed into the system for interconnect testing. In addition, maximum boundary scan coverage was implemented so the board could be more thoroughly tested.
How did an error get through manufacturing test and make it into the field? Could the problem be repeated in the lab? Was it a manufacturing or design fault? What would be the most effective way to fix the problem and make customers happy again?
Virtually all the components on the CPM card are connected to one of four boundary scan shift chains, giving the card extensive scan test capabilities, such as built-in self-test (BIST) and interconnect testing (Figure 1). Even the on-card control logic, implemented in very dense field programmable gate arrays (FPGAs) or programmable array logic (PAL), is connected to the scan chains.
These FPGAs are very powerful and highly integrated components that contain the equivalent of 5,000 logic gates and as many as 80 flip-flops. In the end, the programmability of all the CPM’s control logic through boundary scan became the system’s ace in the hole.
Replicating and Isolating the Problem
As a first step toward tracking down the problem, the design and test team reviewed the design debug phase of the A11 CPM card. An ASSET diagnostic system was used to perform extensive interconnect testing and the card had passed with flying colors.
Under normal operating conditions in the lab, the bad-records-on-disk problem did not occur on the customer’s failing CPM card. But once the voltage was lowered, the error occurred with regularity. The troubleshooting team began to suspect either a design bug that had somehow slipped through the debug process or components with timing discrepancies. The manufacturing department had changed PAL vendors before the design debug.
Next, the team isolated the problem to the marginal timing on the cache memory’s bus monitoring logic. This logic monitored the system’s buses to ensure all memory addresses in the cache were up to date.
The CPM card included two system buses: the System A (SA) and System B (SB) buses (Figure 1). The cache monitoring logic for each of these buses was identical, except for the physical location of the components on the card. The nature of the timing problem made it particularly difficult to predict since it would occur rarely under normal operating conditions.
The cache monitoring logic asynchronously latched memory addresses from either of the two system buses. This mechanism was triggered when the monitoring logic received a command valid signal from the system module that was driving the bus at that particular time. Unfortunately, the propagation time of the control signal relative to the bus address signals varied over a range that was significant compared with the clock period of the bus.
A logic analyzer showed that the latching of addresses from modules located at the far end of the bus was marginal; sometimes an address was missed entirely. If the missed address involved a modification of system memory, data in the cache memory could become out of date (cache data incoherency). If old cache data was written to the disk, the operating system would detect it and report a disk-error condition, which was the error message given by the system when it booted out a job.
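The failure chain described above can be illustrated with a toy model. This is only a sketch and not the A11 design: the SnoopingCache class, its method names, the latched flag and the address used are all hypothetical, and a simple write-invalidate policy is assumed because the article does not say how the monitoring logic updated the cache.

```python
class SnoopingCache:
    """Toy write-invalidate cache that watches bus writes (hypothetical model)."""

    def __init__(self, memory):
        self.memory = memory      # shared system memory: address -> value
        self.lines = {}           # cached copies: address -> value

    def read(self, addr):
        # Fill on a miss, then serve from the cached copy.
        if addr not in self.lines:
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def snoop_write(self, addr, latched=True):
        # Called when another module writes memory over the bus.
        # latched=False models the marginal case in which the address
        # was never captured, so the old line is left in place.
        if latched:
            self.lines.pop(addr, None)


def bus_write(memory, cache, addr, value, latched=True):
    """Another module modifies memory while the cache monitors the bus."""
    memory[addr] = value
    cache.snoop_write(addr, latched)


memory = {0x100: "old record"}
cache = SnoopingCache(memory)
cache.read(0x100)                                    # cache now holds a copy

bus_write(memory, cache, 0x100, "new record", latched=True)
coherent = cache.read(0x100)                         # refetched after the invalidate

bus_write(memory, cache, 0x100, "newer record", latched=False)
stale = cache.read(0x100)                            # earlier copy is still served
```

In the second write the address is missed, so the cache keeps serving the earlier value while memory holds the newer one. That mismatch is the incoherency that would surface later as a disk-error report.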
Boundary Scan to the Rescue
Finding the problem was difficult enough; solving it was even more complex. The system had been in manufacturing for months and many systems were already installed around the world.
Fortunately, a short-term and easy-to-implement solution was devised for installed systems while the troubleshooting team developed a permanent solution to implement during manufacturing. Since the System B bus seemed more prone to the failure, the short-term solution was to run the failing systems on just the System A bus with slightly degraded performance.
To solve the problem in the cache monitoring logic, several iterations of changes were necessary. The troubleshooting team realized that the programmability of the control logic, combined with the access provided by boundary scan, offered a very effective solution.
Not an etch was cut on the card. Nor was any component ever removed. Simply stated, no hardware change had to be made. By using the boundary scan technology designed into the board and a boundary scan interactive diagnostic tool, this serious timing problem was corrected by making soft changes to the system’s data files.
Two external sources provided intelligent control for the boundary scan shift chains. For the engineering lab and manufacturing test, the ASSET interactive diagnostic tool was connected directly to the card. The interactive debugging functions helped set and monitor all the on-card logic. The team also realized that simple changes to the control logic PAL devices could be made through this mechanism.
Under normal operating conditions, a system control console (UNIX-based PC) provided the operator interface and was used to perform maintenance functions on all system modules. Figure 2 depicts the on-card maintenance logic.
The on-card microcontroller interfaced to the system console and drove the boundary scan controller and scan path linker devices. The system was designed so that the boundary scan chains would be used to initialize the board. All initialization states, including the code for the PAL devices, were held in on-card flash memory.
With the embedded software and the boundary scan chains, all initialization states could be loaded quickly at start-up. New software releases for the flash memory could be loaded from the system control console, which was equipped with a cartridge tape and other portable media.
All the changes needed to solve the timing problem with the cache monitoring logic were made with boundary scan chains and the software tool. PAL changes were made on a PC and compiled into a .JED file. Then a utility was used to convert the .JED file into a .MAC file, which ASSET could use to drive the boundary scan path on the card.
The file was then converted into a serial vector format for fast drive to the boundary scan parts on the A11 CPM card. This file was also distributed to those systems already installed in the field and the PAL devices were reprogrammed through boundary scan.
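The first step of that conversion, reading the fuse map out of a .JED file, can be sketched as follows. This is a minimal illustration and not the utility the team used: the function name and the sample design are hypothetical, only the QF (fuse count), F (default state) and L (fuse list) fields of the JEDEC format are handled, and checksums are ignored. The .MAC and serial vector formats are not described in this article, so no attempt is made to reproduce them.

```python
def parse_jedec_fuses(text):
    """Return the fuse map from a JEDEC (.JED) file as a list of 0/1 ints.

    Handles only the QF, F and L fields; every other field, including
    the checksums, is ignored.
    """
    # Keep only what lies between the STX and ETX framing characters, if present.
    if "\x02" in text:
        text = text.split("\x02", 1)[1]
    text = text.split("\x03", 1)[0]

    fields = [f.strip() for f in text.split("*")]
    count, default, lists = None, 0, []
    for field in fields[1:]:              # fields[0] is the free-text header
        if field.startswith("QF"):
            count = int(field[2:])
        elif field.startswith("F") and field[1:].strip() in ("0", "1"):
            default = int(field[1:])
        elif field.startswith("L"):
            parts = field[1:].split()
            lists.append((int(parts[0]), "".join(parts[1:])))

    if count is None:
        raise ValueError("no QF field found")

    # Start from the default state, then overlay each listed run of fuses.
    fuses = [default] * count
    for start, bits in lists:
        for offset, bit in enumerate(bits):
            fuses[start + offset] = int(bit)
    return fuses


# Abbreviated, made-up file: 16 fuses, default 0, two short fuse lists.
sample = "\x02Hypothetical PAL design*\nQF16*\nF0*\nL0004 1011*\nL0012 11*\n\x03"
fuses = parse_jedec_fuses(sample)
```

A real converter would also verify the fuse checksum and then reorder the bits into whatever sequence the target device expects on its scan path, which depends on the device.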
The solution was eventually implemented by merging this code into the system console database for the CPM card and then loading the code into the on-board microcontroller at initialization. The microcontroller, in turn, would drive the new code down the boundary scan chains to the cache monitoring logic.
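Driving data down a scan chain amounts to clocking bits through one long shift register that threads through every device on the chain. The sketch below models only that shifting. It leaves out the IEEE 1149.1 TAP controller states and the TMS signal entirely, and the function names, device lengths and bit patterns are hypothetical rather than taken from the A11 card.

```python
def shift_chain(chain, tdi_bits):
    """Clock tdi_bits into a serial scan chain, one bit per clock.

    chain holds the current register contents, index 0 nearest TDI.
    Returns (new_chain, tdo_bits), where tdo_bits are the bits that
    were pushed out of the far end.
    """
    chain = list(chain)
    tdo = []
    for bit in tdi_bits:
        tdo.append(chain.pop())   # the last cell drives TDO
        chain.insert(0, bit)      # the new bit enters at the TDI end
    return chain, tdo


def build_stream(segments):
    """Order per-device data so each segment lands in its own device.

    segments runs from the device nearest TDI to the device nearest TDO;
    within a segment, index 0 is the cell nearest TDI.
    """
    flat = [bit for seg in segments for bit in seg]
    return flat[::-1]             # the cell farthest from TDI is shifted first


# Three made-up devices with 3-, 2- and 4-bit registers on one chain.
segments = [[1, 0, 1], [0, 0], [1, 1, 1, 0]]
new_chain, tdo = shift_chain([0] * 9, build_stream(segments))
```

After nine clocks each device holds its own segment, and the nine bits that emerged at the far end are the chain's previous contents. That is the same mechanism that lets a tool read back on-card state while loading new values.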
Hardware Changes the Soft Way
In this way, extensive hardware changes to the CPM card’s control logic were made in a totally soft manner. The final solution was tested completely and then released as a system console software release. A new cartridge tape was shipped to all customer sites in the field. The new code could also have been transmitted electronically by e-mail, as an FTP file or through a bulletin-board system.
Keeping It Running
Once the A11’s design team realized the power of the boundary scan test technology that was already built into the card, several other functions were implemented during a design phase update, including:
The BIST of all ASIC devices on the A11 was initialized and monitored through a boundary scan path.
Extensive testing and initialization procedures for memory devices not equipped with boundary scan connections were done through adjacent devices connected to the boundary scan paths. This was performed on RAM devices that stored processor microcode as well as two levels of cache memory RAM devices.
Very rapid initialization of cache tag RAMs was accomplished through boundary scan. When the card is initialized, the cache tag RAMs must be loaded with known data and valid parity values. Because the cache tag RAMs are large, serially shifting in every address can be unacceptably slow.
The cache tag RAM devices, which were not compatible with boundary scan, were surrounded by buffers and registers that were fully compatible with boundary scan. By scanning the initial data and parity values into the surrounding buffers and registers and by using boundary scan’s control-counting feature on the address lines, the entire cache tag RAM array could be initialized in a fraction of a second.
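A rough clock count shows why stepping the address matters. Every figure below is an assumption chosen for illustration, since the article gives no sizes: the RAM depth, the chain length, the test clock rate and the number of scans or clocks per write are all hypothetical, as are the function names.

```python
def full_scan_clocks(entries, chain_bits, scans_per_write=2):
    """Clocks needed if every RAM write re-shifts the whole chain."""
    return entries * chain_bits * scans_per_write


def counted_clocks(entries, chain_bits, clocks_per_write=2):
    """Clocks needed if data is shifted once and the address is stepped."""
    return chain_bits + entries * clocks_per_write


ENTRIES = 65_536        # hypothetical tag RAM depth
CHAIN_BITS = 2_000      # hypothetical scan chain length
TCK_HZ = 10_000_000     # hypothetical test clock rate

slow_s = full_scan_clocks(ENTRIES, CHAIN_BITS) / TCK_HZ
fast_s = counted_clocks(ENTRIES, CHAIN_BITS) / TCK_HZ
print(f"full scan per write: {slow_s:.1f} s, counted addressing: {fast_s:.4f} s")
```

Under these assumptions the full-scan approach takes tens of seconds, while the counted approach finishes in a small fraction of a second, because the cost of shifting the chain is paid once instead of once per address.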
No Pain Was User’s Gain
In the end, the decision to design maximum boundary scan coverage into the A11 CPM card proved fortunate indeed. With extensive boundary scan on board, the system could be thoroughly tested before it left the factory and effectively maintained throughout the rest of its installed life.
The system’s control-logic timing problem demonstrated that boundary scan can make a very difficult change quite painless for the user. The A11 CPM card’s hardware can be modified extensively and a timing problem solved without physically modifying the board. With boundary scan, hardware was made soft.
About the Authors
Bruce E. Whittaker is a Staff Engineer and Project Leader at Unisys. He is a principal on 50 active patents (16 granted, 34 pending). Mr. Whittaker has a B.S.E.E. degree from California State University and an M.B.A. degree from Pepperdine University. Unisys, Computer Division, 25725 Jeronimo Rd., M/S 225, Mission Viejo, CA 92691, (714) 380-5188.
Nicole A. Doherty is Director of Marketing at ASSET InterTech. She has a B.S. degree in electrical engineering and is an active member of the IEEE 1149.1 Working Group on Boundary Scan Test. ASSET InterTech, 2201 N. Central Expressway, Suite 105, Richardson, TX 75080, (214) 437-2800.
Copyright 1996 Nelson Publishing Inc.
October 1996