The 1ESS Emergency Action System

I’ve done two previous posts on the Bell Laboratories 1ESS telephone switch: one on the central processor architecture and another on the compiler and programming language developed for it. In this third and final (well, for now…) post on the 1ESS, I want to look at one of its most interesting components: the Emergency Action Sequencer.

AT&T and the Bell System, in what I’d call the “golden age of the telephone,” ran on a huge amount of pride in the reliability of the Bell System and the professionalism of the “telephone men” and operators (female, of course) that made it possible. Uptime was paramount; any interruption in telephone service was considered a very serious matter, and systems were carefully engineered for maximum durability. The 1ESS was no exception. It had two completely redundant processors in lockstep and a huge number of hardware and software checks for any malfunctions or inconsistencies. Generally, hardware failures in the switching fabric and other peripheral components were detected, the failing components were isolated and disabled without interrupting service, and the operator at the Central Control console was informed so that the faulty component could be replaced when convenient. As the BSTJ article on the 1ESS design notes, though, this system only works when the central processors are functioning and able to perform the relevant tests. What about when a fault occurs in the computer itself? This is when the Emergency Action Sequencer steps in.

The first secret to the 1ESS’s reliability design is, as mentioned, duplication. The entire computer system (central processor, program and data memory, and input/output) is present in duplicate, and during normal operation the two systems run in lockstep. A diagnostic circuit verifies that the output of the two duplicate systems is always identical, and any time that it is not, an interrupt is set which triggers a diagnostic program. This program runs both central controls through a set of “exercises” intended to test every function. If all exercises are successful, the error is assumed to be ephemeral and the system is set back to a good state and continues as usual. If there is an error, the component which failed is assumed to be faulty and is disabled until further servicing can be performed by the operators.
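The decision made after a mismatch can be sketched in a few lines of Python. This is only an illustration of the logic described above, not the 1ESS's actual diagnostic program: the function name, the component names, and the idea of handing the exercise results over as a dictionary are all assumptions made for the example.

```python
def handle_mismatch(exercise_results):
    """Decide what to do after the two central controls disagree.

    exercise_results maps a (hypothetical) component name to True if it
    passed its exercises and False if it failed.
    """
    failed = sorted(name for name, passed in exercise_results.items() if not passed)
    if not failed:
        # Every exercise passed: treat the mismatch as ephemeral and carry on.
        return ("resume", [])
    # At least one exercise failed: take those components out of service.
    return ("disable", failed)


print(handle_mismatch({"cc0": True, "cc1": True}))
print(handle_mismatch({"cc0": True, "cc1": False}))
```

The first call models an ephemeral error (everything passes, so the machine resumes), and the second models a genuinely faulty component that is disabled until it can be serviced.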

This arrangement is actually somewhat complicated by the fact that there is always a notion of an “active” and “standby” central control, even during normal lockstep operation. If a failure is found in the standby central control, the active central control can easily remove it from service by toggling a few flip-flops. If the failure is in the active central control, however, a somewhat more complicated process occurs in which the active central control disables itself and then triggers a high-level interrupt that essentially resets the machine, so that control will resume under the standby central control. In practice, this process is somewhat risky, which is one of the key reasons for the existence of the emergency action sequencer.

When a central control error is encountered, a 40ms timer is started. Normally, the diagnostic program completes and takes action within 40ms. If, however, a central control is faulty (particularly the active central control), it may not complete in this time period. In that case, the emergency action sequencer is triggered and begins intervening under the assumption that the active central control is nonfunctional.
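The 40ms timer is a classic watchdog, and its rule fits in one small function. In this sketch the function name and the use of `None` to mean "the diagnostic never finished" are my own conventions, and treating exactly 40ms as "on time" is an assumption; the real timer is a piece of hardware, not a program.

```python
RECOVERY_DEADLINE_MS = 40  # from the text: the diagnostic should act within 40ms


def sequencer_triggered(diagnostic_elapsed_ms):
    """Return True if the emergency action sequencer should take over.

    diagnostic_elapsed_ms is how long the diagnostic program took to act,
    or None if it never finished (for example, because the active central
    control is faulty).
    """
    if diagnostic_elapsed_ms is None:
        return True
    return diagnostic_elapsed_ms > RECOVERY_DEADLINE_MS


print(sequencer_triggered(12))    # diagnostic finished comfortably in time
print(sequencer_triggered(None))  # diagnostic hung on a faulty central control
```

The important design point is that nothing here depends on the diagnostic program cooperating: silence is treated as failure.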

The emergency action sequencer works in a somewhat curious way. Although two complete redundant central controls are present and they normally work as two units, flip-flops are provided that allow for switching in and out specific components of each central control when necessary. The goal of the emergency action sequencer is to identify some set of components of both central controls that, when combined, make up one fully-functional central control. In this way, even a situation where both central controls are malfunctioning can be recovered so long as they are not malfunctioning in the exact same subsystems.

This happens, as the name suggests, in a sequence. The emergency action sequencer keeps track of the number of times that it has been triggered, and this is the main input to its operation. Note that its operation is very simple, and this is for a reason: to serve its function, the emergency action sequencer must be completely independent and not rely on the program store or processors. For this reason, all of its logic must be implemented in hardware!

Each time the emergency action sequencer is triggered, it increments its counter and then, from the counter state, selects a new combination of components and performs a test. This is essentially a random-guessing approach, which sounds a bit kludgy, but keep in mind that this is an absolute last-resort system for recovering a machine that is assumed to be insane – indeed, as you often see in computers of the era and their descendants, this process is referred to as “sanity checking.”
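To make the counter-driven selection concrete, here is a toy version in Python. The real mapping from counter state to component combination is not something I have described here, so everything specific in this sketch is an assumption: the three subsystem names, the idea that each has exactly two copies numbered 0 and 1, and the choice to let each bit of the counter pick a copy.

```python
# Hypothetical grouping of switchable subsystems; the real machine differs.
SUBSYSTEMS = ("program_store", "processor", "call_store")


def select_configuration(counter):
    """Map the trigger counter to a choice of copy (0 or 1) per subsystem.

    Bit i of the counter picks which duplicate of SUBSYSTEMS[i] is switched
    in. The counter wraps, so repeated triggers cycle through every
    combination.
    """
    state = counter % (2 ** len(SUBSYSTEMS))
    return {name: (state >> i) & 1 for i, name in enumerate(SUBSYSTEMS)}


for count in range(4):
    print(count, select_configuration(count))
```

Because the configuration is a pure function of the counter, this kind of logic needs no program store or processor at all, which is the property the hardware implementation depends on.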

Sanity is tested using a diagnostic program that is rather dramatically referred to as “the maze.” The maze program runs through a large number of operations with various control transfers intended to throw the central control “off course” if there are any errors, and is designed to take only about 100 cycles to complete. This is important, since it may be executed many times. Every time the maze is started, a timer is set for 128 cycles. If 128 cycles pass without the maze reaching its final instruction (which resets the timer), the emergency action sequencer reshuffles components and tries again. This repeats until a configuration completes the maze. At this point, an “emergency action recovery” program initiates and performs a few more tests before reloading data to resume call processing.
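The interplay between the maze and its timer can be modeled with a toy maze. In this sketch the maze is just a chain of 100 steps, and `compute_next` is a hypothetical stand-in for the central control working out where to transfer control next; the real maze is a program in the program store, and none of these names come from the 1ESS itself.

```python
MAZE_LENGTH = 100   # from the text: about 100 cycles on a sane machine
TIMER_CYCLES = 128  # from the text: the timer allows 128 cycles


def run_maze(compute_next, timer_cycles=TIMER_CYCLES):
    """Run a toy maze; return True if the final step is reached in time.

    compute_next(step) stands in for the central control: a sane machine
    returns step + 1, while a faulty one may return anything else.
    """
    step = 0
    for _cycle in range(timer_cycles):
        step = compute_next(step)
        if step == MAZE_LENGTH:
            return True  # final instruction reached: the timer is reset
    return False  # timer expired: the sequencer must try another combination


def sane(step):
    return step + 1


def faulty(step):
    # A bad computation at step 37 sends control back to the start.
    return 0 if step == 37 else step + 1


print(run_maze(sane))
print(run_maze(faulty))
```

The sane machine walks straight through in 100 cycles, while the faulty one is thrown off course and keeps looping until the 128-cycle timer runs out. The maze does not need to diagnose anything; it only needs to be hard to finish by accident.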

So, let’s look at this in narrative form. Imagine that an operating 1ESS encounters an error in the standby call store (such as a failed checksum on some loaded data). This is detected and results in a C-level interrupt, which triggers the diagnostic program to determine if the error was ephemeral or if the call store is faulty. This also triggers a 40ms recovery timer in the emergency action sequencer. Let’s imagine that the processor itself of the active central control turns out to be faulty. Perhaps these two faults are related, but they may simply be coincidental. Because of the fault in the active processor, the diagnostic program fails to complete, and after 40ms the emergency action timer hits zero, triggering the emergency action sequencer. The emergency action sequencer advances to whatever its next state is: depending on where it is in its state loop, it may swap out the active program store for the standby program store, for example. The emergency action sequencer then resets the execution pointers in the active central control (which now includes a component that used to be part of the standby central control!) and triggers a B-level interrupt, resetting the central control so that the maze program will begin, and sets a timer for 128 cycles (less than one millisecond).

Let’s imagine that the faulty processor was not swapped out, and so the central control is still not sane. A control transfer in the maze program goes awry due to some bad computation, as intended in this case, and after 128 cycles the maze has not been finished. The timer triggers the emergency action sequencer again, which swaps out yet another component and restarts the maze program. This time, we’ll imagine that the faulty processor has been swapped for the standby, and the maze program completes within 128 cycles. The end of the maze program disables the timer and begins the emergency action recovery program, which will run a more complete test and then prepare the machine to handle phone calls once again.
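The whole scenario can be played out as a small simulation. As before, this is a sketch under loudly stated assumptions rather than a model of the real hardware: the subsystem names, the numbering of the active copy as 0 and the standby copy as 1, the bit-per-subsystem counter mapping, and the `max_attempts` bound (the real sequencer simply keeps going) are all invented for the example. The maze is reduced to a single check of whether any faulty part is switched in.

```python
def recover(faulty_parts, start_counter=0, max_attempts=16):
    """Step the sequencer until a configuration with no faulty part is found.

    faulty_parts is a set of (subsystem, copy) pairs known to be broken.
    Returns (attempts, configuration), or (max_attempts, None) if no sane
    configuration turns up within the bound.
    """
    # Hypothetical subsystems; bit i of the counter selects the copy in use.
    subsystems = ("program_store", "processor", "call_store")
    counter = start_counter
    for attempt in range(1, max_attempts + 1):
        counter += 1  # each trigger increments the counter first
        config = {name: (counter >> i) & 1 for i, name in enumerate(subsystems)}
        # Stand-in for the maze: it completes only if no faulty part is in use.
        maze_completed = not any(
            (name, copy) in faulty_parts for name, copy in config.items()
        )
        if maze_completed:
            return attempt, config
    return max_attempts, None


# The scenario above: faulty active processor (copy 0), faulty standby call store (copy 1).
print(recover({("processor", 0), ("call_store", 1)}))
```

With these assumptions the first attempt swaps in the standby program store but keeps the bad processor, so the maze fails; the second attempt switches to the standby processor and the maze completes. If both copies of a subsystem are faulty, no combination can ever pass, which is the one case the sequencer cannot recover from.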

This entire process should actually complete in a very short period of time, which is quite important, as during this time the machine will not handle incoming calls. The operator at the master control will also be kept apprised of the current situation, and at the end of the recovery process will receive a printout of maintenance orders (produced by the recovery program’s diagnostics) indicating that the components now in standby require more thorough testing. Hopefully, the operators isolate the problem and repair the hardware, allowing the machine to return to its normal lockstep operation.

And that’s just another day in the life of a 1ESS operator.

  • Last modified: 2020/11/16 23:46