No-fault computing

By Vince "TC" Cavasin

Tired of crashes? Fault tolerant computers virtually never go down.

"Hello, this is 911; can you hold?"

It sounds like a bad Saturday Night Live skit, but when you consider the complexity of a modern 911 system, it’s amazing it doesn’t happen more often. Fault tolerant computers help to ensure this reliability.

As you’ve probably already guessed, in the context of ultra-mission-critical systems like those used to track 911 calls, a "fault tolerant" computer is one that seldom--if ever--crashes. But levels of fault tolerance can vary widely.

In the context of software, a fault is typically caused by the software trying to do something and getting an unexpected result; in the context of hardware, it’s typically caused by a physical component failure.

Thus "fault tolerance" can be used to describe everything from Windows 95’s "application has been shut down" messages to systems that are literally guaranteed to stay up 24 hours a day, 7 days a week.

In between these extremes are numerous fault tolerant hardware and software components that can often be added individually, like optional safety equipment on a car. One that you’re sure to see more of is Microsoft’s Wolfpack product. This software allows two or more NT servers to be networked together so that they share mass storage (disk). The servers check each other periodically, and if one ever discovers that the other has crashed, it can take over the nonfunctioning system’s work. Typically, users will experience a delay and may lose some data, but can usually be automatically migrated to the backup machine.
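The check-and-take-over idea can be sketched in a few lines of Python. This is only an illustration of the general technique, not how Wolfpack itself is implemented; the Server class, the check_and_fail_over function, and the limit of three missed checks are all assumptions made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Hypothetical cluster node; not part of any real product's API."""
    name: str
    alive: bool = True
    workload: list = field(default_factory=list)

    def heartbeat(self) -> bool:
        # A real cluster would answer over the network; here we just report state.
        return self.alive


def check_and_fail_over(monitor: Server, peer: Server, missed: int, limit: int = 3) -> int:
    """Run one periodic check and return the updated count of missed heartbeats."""
    if peer.heartbeat():
        return 0  # peer answered, so reset the counter
    missed += 1
    if missed >= limit and peer.workload:
        # The peer has been silent too long: take over its work.
        monitor.workload.extend(peer.workload)
        peer.workload.clear()
    return missed


a = Server("A", workload=["orders-db"])
b = Server("B", workload=["web"])

a.alive = False  # simulate a crash on server A
missed = 0
for _ in range(3):  # three periodic checks made by server B
    missed = check_and_fail_over(b, a, missed)

print(b.workload)  # ['web', 'orders-db']
```

Waiting for several missed checks before taking over is a common design choice: it keeps one slow reply from triggering a failover, at the price of the delay users see.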

Such systems are a cost-effective fit for semi-critical jobs, for example as database servers for a web-based business. Customers ordering through the machine that crashes may lose their orders, but upon reestablishing the connection they would be automatically routed to the backup machine.

The weak link in such systems is the shared storage, which presents a single point of failure; it doesn’t matter how many servers can access the disk if the disk crashes. Some failover systems, like NSI’s DoubleTake for Windows NT, get around this problem by mirroring (maintaining two copies of) the data on redundant disk drives. These systems protect against both computer and disk crashes.
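Mirroring itself is simple enough to sketch. The Python below is a toy model under stated assumptions, not DoubleTake’s actual design: the Disk and MirroredStore classes are hypothetical, and each "disk" is just an in-memory dictionary.

```python
class Disk:
    """Hypothetical stand-in for a physical drive."""

    def __init__(self):
        self.blocks = {}
        self.failed = False

    def write(self, key, value):
        if self.failed:
            raise OSError("disk failed")
        self.blocks[key] = value

    def read(self, key):
        if self.failed:
            raise OSError("disk failed")
        return self.blocks[key]


class MirroredStore:
    """Keeps two copies of everything, one on each disk."""

    def __init__(self, primary, mirror):
        self.disks = [primary, mirror]

    def write(self, key, value):
        written = 0
        for disk in self.disks:
            try:
                disk.write(key, value)
                written += 1
            except OSError:
                pass  # skip a failed disk and rely on the other copy
        if written == 0:
            raise OSError("all copies failed")

    def read(self, key):
        for disk in self.disks:
            try:
                return disk.read(key)
            except OSError:
                continue  # try the next copy
        raise OSError("all copies failed")


store = MirroredStore(Disk(), Disk())
store.write("order-1001", "2 widgets")

store.disks[0].failed = True     # simulate a crash of the first disk
print(store.read("order-1001"))  # 2 widgets
```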

A Redundant Array of Inexpensive Disks (RAID) system is a hardware component that can provide fault tolerance for any computer. RAID systems are prepackaged groups of disks that can provide fault tolerance by mirroring data within the package, or by "striping" the data with parity. Striping spreads reading and writing operations across several disks in parallel, increasing the speed with which they can execute. By itself striping adds no protection, so fault tolerant configurations also store parity data, in the simplest layout on an extra disk, that allows the original data to be mathematically reconstructed if any one of the disks fails.
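The "mathematical reconstruction" is usually a bitwise exclusive-or (XOR). The Python sketch below assumes three data disks plus one parity disk and uses short byte strings in place of real disk blocks; the xor_blocks helper is a name invented for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, one byte position at a time."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


# One stripe of data spread across three disks (made-up contents).
data_disks = [b"FAUL", b"T TO", b"LERA"]

# The parity block is the XOR of all the data blocks.
parity = xor_blocks(data_disks)

# Simulate losing the second disk: XOR the survivors with the parity block.
survivors = [data_disks[0], data_disks[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt)  # b'T TO'
```

This works because XOR-ing a value with itself cancels it out, so combining the parity block with the surviving blocks leaves only the missing one. It also shows the limit of the scheme: one parity block can recover one lost disk, not two.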

"Hot swappable" components are a step up on the fault tolerance scale. These components are usually redundant, and are designed so that when they fail they can be replaced without shutting down the machine. However, they must be designed into a system and require software support. Many high-end parallel servers offer hot swappable CPU boards.

The ultimate fault tolerant computers are made by Stratus (an HP partner) and Tandem (a Compaq subsidiary). Such systems are basically two computers in one; every component is redundant, and two copies of a given application run simultaneously in perfect lockstep on two autonomous CPUs. If any component--CPU, disk, I/O port, power supply, power cord, anything--fails, the OS automatically shuts it down gracefully and notifies the machine’s administrator, while continuing to run with the backup component. The user never knows anything went wrong. Of course, such reliability doesn’t come cheap; an entry-level Stratus goes for over $60k, and a high-end Tandem can sell for several million dollars.

It’s no surprise, then, that these systems are only used in situations where a crash can cost millions of dollars--or a human life. For example, Automated Teller Machine networks and other banking applications as well as 911 databases typically use such systems.

While your desktop machine isn’t likely to be fault tolerant anytime in the foreseeable future, you’ll certainly encounter these machines in the business world. Hopefully not during a 911 call!

Vince Cavasin, ’99, usually crashes when he encounters a fault.