A Cyberspace Debug Session in the Not-Too-Distant Future

Feb. 1, 2007
[system admin] I have an intermittent, high-fan-speed warning on one of my blade servers; it seems to have started just after the last repair on the cabinet. When I bring up the 9th blade, the fan speed jumps 200 rpm.
[hardware repair] What does the cabinet-controller event log show?
[system admin] The speed increase is due to a temperature rise in the 3.3-V power supplies.
[hardware repair] Any alarms out of the power supply?
[system admin] No. And this box has a 3.3-V backup supply, but it has the same problem.
[hardware repair] What's the model number of the server and the supply?
[system admin] Cabinet controller says the server is an MS-class box, built 081509 and in service 4322 hours. The supply is a Dragon 120-45-2160, built 092010 and in service 336 hours.
[hardware repair] I don't see a flag on either model or date code in the repair database. Try bringing the blades up one at a time and watch the 3.3-V load current.
[system admin] Blades 3, 7 and 9 are averaging 40.3 to 42.6 A; the others are only pulling 30.6 A, worst case.
[hardware repair] Are all the blades in the cabinet the same model?
[system admin] Yes. Wait, no! Blades 3, 7 and 9 have standard-power processors in them.
[hardware repair] That's the problem. The standard-power processors draw over 120 W on the 3.3-V supply, but the new cabinets are only rated for the new low-power processors. The repair guys must have recycled a couple of older blades into the new cabinet during the last repair.
[system admin] So, why didn't the cabinet controller complain?
[hardware repair] Well, the blades are compatible with the cabinet; the management software just assumed that some of the blades would be in-place spares and would only be powered when one of the others was shut down.
[system admin] So I can run the standard-power blades?
[hardware repair] You can. You just can't run all of them at the same time.
[system admin] Can I upgrade to a bigger supply?
[hardware repair] Sure, but the better solution is to put the right blades in the cabinet. I'll put you on the repair list, and I'll also ping the repair guys about not using the older blades in the new cabinets.

In the above scenario, two people, probably in two different cities, debug a heating problem in a server cabinet located in a third city. The cabinet could interrogate its power supplies for model and operational data, turn individual servers on and off, switch in-place spare supplies on and off, and both control and monitor fan speed.
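The arithmetic behind the diagnosis is simply power = voltage x current on the 3.3-V rail. The sketch below is one hypothetical way a cabinet controller could run the check that the scenario's management software skipped. The currents and the 120-W figure come from the dialogue; the nine-slot cabinet, the individual per-slot readings for blades 3, 7 and 9, the 1000-W cabinet budget and all names (`Blade`, `audit`) are assumptions made purely for illustration.

```python
from dataclasses import dataclass

RAIL_VOLTS = 3.3            # the 3.3-V rail from the dialogue
LOW_POWER_LIMIT_W = 120.0   # dialogue: standard-power parts draw "over 120 W"
CABINET_BUDGET_W = 1000.0   # ASSUMPTION: hypothetical rating, not from the article


@dataclass
class Blade:
    slot: int
    amps: float  # 3.3-V load current measured with this blade powered

    @property
    def watts(self) -> float:
        return RAIL_VOLTS * self.amps


def audit(blades, budget_w=CABINET_BUDGET_W, limit_w=LOW_POWER_LIMIT_W):
    """Return (slots above the per-blade limit, total watts, fits budget?)."""
    suspects = [b.slot for b in blades if b.watts > limit_w]
    total = sum(b.watts for b in blades)
    return suspects, total, total <= budget_w


# Dialogue: blades 3, 7 and 9 average 40.3 to 42.6 A; the rest pull
# 30.6 A worst case. The exact per-slot split here is illustrative.
readings = {3: 40.3, 7: 42.6, 9: 41.5}
blades = [Blade(slot, readings.get(slot, 30.6)) for slot in range(1, 10)]

suspects, total_w, ok = audit(blades)
print(f"over-limit slots: {suspects}")
print(f"total 3.3-V load: {total_w:.1f} W, within budget: {ok}")
```

With these assumed numbers, the three standard-power blades each come out above 120 W and the cabinet as a whole exceeds the hypothetical budget when every slot is powered, which matches the repair tech's advice that the blades can run, just not all at the same time.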

Does this sound like science fiction? It isn't. All of the features and functions described, as well as others, already exist in many of the new power-supply control and monitoring communications protocols. This means that designers building the next generation of server and telecom boxes are probably planning to include these capabilities in their designs.

Simply put, cost is driving this remote monitoring and control craze. Current surveys put the cost of downtime at $20,000 to $40,000 an hour for the average e-commerce company. So even a relatively quick response from an on-call repairman can cost tens of thousands of dollars in downtime. And you can bet that every server farm, cell-phone site, auction house and Internet provider out there will want the capability to remotely troubleshoot and repair all but the most drastic failures on-line and in real time.

What this means for us, as suppliers in the server market, is a relentless migration toward remote monitoring and control in power supplies, cooling, ac power and lighting, and physical security. Office buildings have already gone in this direction for energy efficiency, using site-management software to control lighting, heating/cooling, phones and elevators. So it takes no great leap to realize that high-tech capital equipment will follow suit. That leaves us with a simple choice: Heed the message and help define this brave new world, or wait until it gets here and play catch-up.

Keith Curtis is principal applications engineer with Microchip Technology's Security, Microcontroller and Technology Development division, where he is responsible for developing training and reference designs for incorporating microcontrollers into intelligent power-supply designs.

