IBM Power System E850C: Technical Overview and Introduction

4.3 Availability

The more reliable a system or subsystem is, the more available it should be. Nevertheless, a
lot of effort is made to design systems that can detect faults that do occur and take steps to
minimize or eliminate the outages that are associated with them. These design capabilities
extend availability beyond what can be obtained through the underlying reliability of the
hardware.

This design for availability begins with implementing an architecture for error detection and
fault isolation (ED/FI).

First-failure data capture (FFDC) is the capability of IBM hardware and microcode to
continuously monitor hardware functions. Within the processor and memory subsystem,
detailed monitoring is done by circuits within the hardware components themselves. Fault
information is gathered into fault isolation registers (FIRs) and reported to the appropriate
components for handling.
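
As a rough illustration of how fault information gathered in a FIR might be turned into
records for handling, the following Python sketch decodes a register snapshot into named
faults. The bit assignments, the FaultRecord type, and the capture_first_failure function
are all hypothetical; the actual POWER8 FIR layouts are not described here.

```python
from dataclasses import dataclass

# Hypothetical bit assignments, for illustration only. Real FIR layouts
# are hardware specific and are not given in this text.
FIR_BITS = {
    0: "l2_cache_ce",
    1: "l3_cache_ce",
    2: "memory_bus_crc",
    3: "core_recoverable",
}


@dataclass(frozen=True)
class FaultRecord:
    unit: str   # component that owns the register (assumed naming)
    bit: int    # bit position that was set in the FIR snapshot
    name: str   # symbolic fault name from FIR_BITS


def capture_first_failure(unit: str, fir_value: int) -> list[FaultRecord]:
    """Decode every known set bit in a FIR snapshot into a fault record."""
    records = []
    for bit, name in sorted(FIR_BITS.items()):
        if fir_value & (1 << bit):
            records.append(FaultRecord(unit, bit, name))
    return records


if __name__ == "__main__":
    for record in capture_first_failure("proc0", 0b0101):
        print(record)
```

Capturing the register contents at the time of the first failure, rather than trying to
re-create the fault later, is what lets the fault be isolated to a specific component.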

Processor and memory errors that are recoverable in nature are typically reported to the
dedicated service processor built into each system. The dedicated service processor then
works with the hardware to determine the course of action to be taken for each fault.

4.3.1 Correctable error introduction

Intermittent or soft errors are typically tolerated within the hardware design by using error
correction code (ECC) or advanced techniques to retry operations after a fault.

Tolerating a correctable solid fault runs the risk that the fault aligns with a soft error and
causes an uncorrectable error situation. A correctable error might also be predictive of a fault
that continues to worsen over time, resulting in an uncorrectable error condition.
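
The risk of a solid fault aligning with a soft error can be made concrete with a small
single-error-correct, double-error-detect (SECDED) code. The sketch below uses a textbook
extended Hamming(8,4) code purely as an illustration; it is not the ECC scheme that the
POWER8 hardware uses, which this text does not specify. The function names encode, decode,
and flip are assumptions made for the example.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming codeword.

    Index 0 holds the overall parity bit; indexes 1, 2, and 4 hold the
    Hamming parity bits; indexes 3, 5, 6, and 7 hold the data bits.
    """
    code = [0] * 8
    code[3] = (nibble >> 0) & 1
    code[5] = (nibble >> 1) & 1
    code[6] = (nibble >> 2) & 1
    code[7] = (nibble >> 3) & 1
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2  # make the parity of the whole word even
    return code


def decode(code):
    """Return (status, nibble); nibble is None when the error is uncorrectable."""
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    overall = sum(code) % 2
    code = list(code)
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd overall parity: exactly one bit is wrong, and the syndrome
        # names its position (0 means the overall parity bit itself).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Nonzero syndrome with even overall parity: two bits are wrong.
        return "uncorrectable", None
    nibble = code[3] | code[5] << 1 | code[6] << 2 | code[7] << 3
    return status, nibble


def flip(code, pos):
    """Return a copy of the codeword with one bit inverted."""
    out = list(code)
    out[pos] ^= 1
    return out


if __name__ == "__main__":
    word = encode(0b1001)
    solid = flip(word, 5)   # solid fault: this bit always reads back wrong
    both = flip(solid, 2)   # a soft error lands in the same word
    print(decode(solid))
    print(decode(both))
```

With only the solid fault present, every read is corrected. As soon as a soft error lands in
the same word, the code can detect the problem but can no longer correct it.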

You can predictively deallocate a component to prevent correctable errors from aligning with
soft errors or other hardware faults and causing uncorrectable errors. However, unconfiguring
components, such as processor cores or entire caches in memory, can reduce the
performance or capacity of a system. Unconfiguring a component in turn typically requires
that the failing hardware be replaced in the system. The resulting service action can also
temporarily affect system availability.

To avoid such situations, solid faults in POWER8 processors or memory might be
candidates for correction by using the “self-healing” features built into the hardware. These
features can, for example, take advantage of a spare DRAM module within a memory DIMM,
a spare data lane on a processor or memory bus, or spare capacity within a cache module.

When such self-healing is successful, the need to replace any hardware for a solid
correctable fault is avoided. The ability to predictively unconfigure a processor core is still
available for faults that cannot be repaired by self-healing techniques, or when the sparing
or self-healing capacity is exhausted.
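
The order of preference described above (use spare capacity first, and predictively
unconfigure only when no spare remains) can be sketched as a simple policy. The Component
type, its spares_left field, and the handle_solid_correctable_fault function are
hypothetical names for this example and do not correspond to any IBM firmware interface.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    spares_left: int         # hypothetical count of spare DRAM modules, lanes, and so on
    configured: bool = True  # False after the component is predictively unconfigured


def handle_solid_correctable_fault(component: Component) -> str:
    """Prefer self-healing; fall back to predictive deallocation."""
    if component.spares_left > 0:
        # Substitute a spare for the failing element. No hardware
        # replacement is needed and capacity is preserved.
        component.spares_left -= 1
        return "self-healed"
    # Sparing capacity is exhausted: unconfigure the component so that the
    # solid fault cannot align with a later soft error.
    component.configured = False
    return "deallocated"


if __name__ == "__main__":
    dimm = Component("dimm0", spares_left=1)
    print(handle_solid_correctable_fault(dimm))
    print(handle_solid_correctable_fault(dimm))
```

In this sketch, only the second outcome reduces capacity and leads to a service action.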

4.3.2 Uncorrectable error introduction

An uncorrectable error can be defined as a fault that can cause incorrect instruction execution
within logic functions, or an uncorrectable error in data that is stored in caches, registers, or
other data structures. In less sophisticated designs, a detected uncorrectable error nearly
always results in the termination of an entire system. More advanced system designs in some
cases might be able to terminate just the application that uses the hardware that failed. Such
designs might require that uncorrectable errors are detected by the hardware and reported to