Chapter 4. Continuous availability and manageability
Error checkers
IBM POWER6 processor-based systems contain specialized circuitry that detects erroneous
hardware operations. Error checking hardware ranges from parity error detection coupled with
processor instruction retry and bus retry, to ECC correction on caches and system buses.
All IBM hardware error checkers have distinct attributes, as follows:
Continually monitoring system operations to detect potential calculation errors.
Attempting to isolate physical faults based on runtime detection of each unique failure.
Initiating a wide variety of recovery mechanisms designed to correct the problem. The
POWER6 processor-based systems include extensive hardware and firmware recovery
logic.
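The pairing of parity detection with retry can be illustrated in software. The following is a minimal sketch only, not the POWER6 hardware logic: the names `parity_bit`, `read_with_retry`, and `make_flaky_bus` are hypothetical, and a single even-parity bit per word and a one-retry limit are assumed for illustration.

```python
def parity_bit(word: int) -> int:
    """Even-parity bit for a word: 1 if the word has an odd number of set bits."""
    return bin(word).count("1") % 2


def read_with_retry(read_fn, max_retries: int = 1):
    """Call read_fn() -> (word, stored_parity); retry the read on a parity mismatch.

    Returns (word, retries_used). Raises if every attempt fails the check.
    """
    for attempt in range(max_retries + 1):
        word, stored_parity = read_fn()
        if parity_bit(word) == stored_parity:
            return word, attempt
    raise RuntimeError("parity error persisted after retry")


def make_flaky_bus(word: int, bad_reads: int):
    """Hypothetical bus model: the first `bad_reads` reads flip the lowest data bit."""
    state = {"reads": 0}

    def read():
        state["reads"] += 1
        good_parity = parity_bit(word)
        if state["reads"] <= bad_reads:
            return word ^ 0b1, good_parity  # transient single-bit error
        return word, good_parity

    return read


# A transient error on the first read is detected and cleared by one retry.
value, retries = read_with_retry(make_flaky_bus(0b1011, bad_reads=1))
```

A single parity bit detects any odd number of flipped bits but cannot correct them, which is why the sketch recovers by retrying rather than by repairing the word. ECC, by contrast, carries enough redundancy to correct certain errors in place.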
Fault isolation registers
Error checker signals are captured and stored in hardware Fault Isolation Registers (FIRs).
The associated “who’s on first” logic circuitry is used to limit the domain of an error to the first
checker that encounters the error. In this way, runtime error diagnostics can be deterministic
so that for every check station, the unique error domain for that checker is defined and
documented. Ultimately, the error domain becomes the field replaceable unit (FRU) call, and
manual interpretation of the data is not normally required.
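The idea of latching the first checker and mapping it to an FRU can be modeled in a few lines. This is a conceptual sketch under stated assumptions, not the actual FIR layout: the class `FaultIsolationRegister`, the checker names, and the FRU labels are all hypothetical.

```python
class FaultIsolationRegister:
    """Toy model of a FIR with "who's on first" behavior.

    Each checker is assumed to have one documented error domain (an FRU).
    Only the first checker to report is treated as the origin of the error.
    """

    def __init__(self, fru_by_checker):
        self.fru_by_checker = dict(fru_by_checker)
        self.first = None      # latched once, never overwritten
        self.later = []        # downstream checkers that fired afterwards

    def report(self, checker: str) -> None:
        if checker not in self.fru_by_checker:
            raise KeyError(f"unknown checker: {checker}")
        if self.first is None:
            self.first = checker
        else:
            self.later.append(checker)

    def fru_call(self):
        """Return the FRU for the first checker, or None if nothing fired."""
        if self.first is None:
            return None
        return self.fru_by_checker[self.first]


# Hypothetical checkers and FRU names, for illustration only.
fir = FaultIsolationRegister({
    "bus_ecc": "FRU-B",
    "l2_cache_parity": "FRU-A",
})
fir.report("bus_ecc")           # the error is first seen on the bus
fir.report("l2_cache_parity")   # the same error propagates to a later checker
call = fir.fru_call()
```

Because the first latch is never overwritten, the FRU call stays tied to the checker nearest the origin of the fault, even when the error propagates and trips other checkers afterwards.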
First failure data capture (FFDC)
First failure data capture (FFDC) is an error isolation technique that ensures that, when a
fault is detected in a system through error checkers or other types of detection methods, the
root cause of the fault is captured without the need to recreate the problem or run an
extended tracing or diagnostics program.
For the vast majority of faults, a good FFDC design means that the root cause can be
detected automatically without intervention of a service representative. Pertinent error data
related to the fault is captured and saved for analysis. In hardware, FFDC data is collected
from the fault isolation registers and “who’s on first” logic. In firmware, this data consists of
return codes, function calls, and other such data.
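On the firmware side, capturing return codes and function calls at the moment of first failure might look like the following. This is a sketch under assumptions, not IBM firmware code: `FfdcLog`, `FfdcRecord`, and the example functions are hypothetical, and a nonzero return code is assumed to signal failure.

```python
from dataclasses import dataclass
from functools import wraps
from typing import Optional


@dataclass
class FfdcRecord:
    function: str      # name of the call that failed
    args: tuple        # arguments it was called with
    return_code: int   # nonzero code it returned


class FfdcLog:
    """Keeps the data for the first failing call; later failures do not overwrite it."""

    def __init__(self):
        self.first_failure: Optional[FfdcRecord] = None

    def traced(self, fn):
        @wraps(fn)
        def wrapper(*args):
            rc = fn(*args)
            if rc != 0 and self.first_failure is None:
                self.first_failure = FfdcRecord(fn.__name__, args, rc)
            return rc
        return wrapper


log = FfdcLog()


@log.traced
def init_bus(bus_id: int) -> int:
    # Hypothetical routine: only bus 0 initializes cleanly.
    return 0 if bus_id == 0 else 5


@log.traced
def start_dma(channel: int) -> int:
    # Hypothetical routine that always fails, standing in for a follow-on error.
    return 7


init_bus(0)    # succeeds, nothing captured
init_bus(3)    # first failure: captured
start_dma(1)   # later failure: does not replace the first record
```

Capturing at the point of failure means the context is saved while it is still available, so nothing has to be reproduced later to find out which call failed and with what code.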
FFDC check stations are carefully positioned within the server logic and data paths to ensure
that potential errors can be quickly identified and accurately tracked to an FRU.
This proactive diagnostic strategy is a significant improvement over the classic, less accurate
“reboot and diagnose” service approaches.
Figure 4-7 on page 154 shows a schematic of a fault isolation register implementation.