IBM E850C Technical Overview And Introduction

Chapter 4. Reliability, availability, and serviceability

software layers. The software layers must then be responsible for determining how to
minimize the impact of faults.

The advanced RAS features that are built in to POWER8 processor-based systems handle
certain “uncorrectable” errors in ways that minimize the impact of the faults. These features
can even keep an entire system up and running after experiencing such a failure.

Depending on the fault, such recovery might use the virtualization capabilities of PowerVM in
such a way that the operating system or any applications that are running in the system are
not affected or required to participate in the recovery.

4.3.3 Processor Core/Cache correctable error handling

Level 2 (L2) and Level 3 (L3) caches and directories can correct single bit errors and detect
double bit errors (SEC/DED ECC). Soft errors that are detected in the level 1 caches are also
correctable by a try again operation that is handled by the hardware. Internal and external
processor “fabric” buses have SEC/DED ECC protection as well.

SEC/DED capabilities are also included in other data arrays that are not directly visible to
customers.
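
To make the SEC/DED idea concrete, the following is a minimal sketch that uses a textbook
extended Hamming (8,4) code. It is an illustration only: the ECC codes that POWER8 hardware
uses are wider and are not described in this section, and the function names encode and
decode are hypothetical.

```python
# Illustrative only: a textbook extended Hamming(8,4) code, not the POWER8 ECC.

def encode(nibble):
    """Encode 4 data bits as an 8-bit codeword (list of bits).

    Index 0 holds the overall parity bit; indexes 1..7 are Hamming(7,4) positions.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # makes the total parity even
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected', or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                 # XOR of the positions that hold a 1
    overall = sum(code) % 2
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd total parity: exactly one bit flipped, at index `syndrome`
        # (index 0 means the overall parity bit itself).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Nonzero syndrome with even total parity: two bits flipped.
        return "uncorrectable", None
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble
```

Flipping any one bit of a codeword yields the status corrected together with the original
data, and flipping any two bits yields uncorrectable, which is the detect-only case.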

Beyond soft error correction, the intent of the POWER8 design is to manage a solid
correctable error in an L2 or L3 cache by using techniques to delete a cache line with a
persistent issue, or to repair a column of an L3 cache dynamically by using spare capability.

Information about column and row repair operations is stored persistently for processors. This
process allows more permanent repairs to be made during processor reinitialization (during
system reboot, or individual core Power On Reset by using the Power On Reset Engine).
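
The line delete and persistent repair record can be pictured with a small software model.
This is a sketch under stated assumptions, not the hardware or firmware logic: the class
name CacheRepairLog, the threshold of three repeated errors, and the JSON storage format
are all hypothetical, and column repair with spare capacity is not modeled.

```python
import json
from collections import Counter

SOLID_THRESHOLD = 3  # assumed: repeats on one line before the error is treated as solid


class CacheRepairLog:
    """Hypothetical model of cache line delete with a persistent repair record."""

    def __init__(self, saved=None):
        self.hits = Counter()
        # Repairs recorded earlier are reapplied when the model is reinitialized.
        self.deleted_lines = set(json.loads(saved)) if saved else set()

    def correctable_error(self, line):
        """Record one corrected error on `line` and return the action taken."""
        if line in self.deleted_lines:
            return "ignored"                # line is already out of service
        self.hits[line] += 1
        if self.hits[line] >= SOLID_THRESHOLD:
            self.deleted_lines.add(line)    # stop using the persistently bad line
            return "line-deleted"
        return "corrected"

    def save(self):
        """Serialize the repair record so that it survives reinitialization."""
        return json.dumps(sorted(self.deleted_lines))
```

Passing the saved record to a new CacheRepairLog stands in for the reinitialization step,
where previously recorded repairs are applied again.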

4.3.4 Processor Instruction Retry and other try again techniques

Within the processor core, soft error events might occur that interfere with the various
computation units. When such an event is detected before a failing instruction is completed,
the processor hardware might be able to try the operation again by using the advanced RAS
feature that is known as Processor Instruction Retry.

Processor Instruction Retry allows the system to recover from soft faults that otherwise result
in an outage of applications or the entire server.
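
The general pattern behind this kind of recovery is checkpoint and re-execute. The sketch
below is a software analogy only and does not describe the POWER8 implementation: the names
SoftFault and run_with_retry, the retry limit, and the fault injection through
fault_on_attempts are assumptions made for illustration.

```python
class SoftFault(Exception):
    """Raised when the simulated checker detects a fault before completion."""


def run_with_retry(state, instruction, fault_on_attempts=(), max_retries=2):
    """Apply instruction(state) and try again from a checkpoint on a detected fault.

    Returns (new_state, attempts_used). `fault_on_attempts` lists the attempt
    numbers (starting at 1) on which a soft fault is injected.
    """
    checkpoint = dict(state)            # known-good state before the instruction
    for attempt in range(1, max_retries + 2):
        work = dict(checkpoint)         # always restart from the clean checkpoint
        try:
            if attempt in fault_on_attempts:
                work["r1"] = -1         # simulated corrupted intermediate value
                raise SoftFault(f"fault on attempt {attempt}")
            return instruction(work), attempt
        except SoftFault:
            continue                    # discard the corrupted work and try again
    raise RuntimeError("retries exhausted: treat the fault as solid")
```

Because each attempt starts from a copy of the checkpoint, a corrupted intermediate value
never reaches the returned state.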

Try again techniques are used in other parts of the system as well. Faults that are detected on
the memory bus that connects processor memory controllers to DIMMs can be tried again. In
POWER8 systems, the memory controller is designed with a replay buffer that allows memory
transactions to be tried again after certain faults internal to the memory controller are
detected. This process complements the try again abilities of the memory buffer module.
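
A replay buffer can be sketched as a queue that keeps each transaction until it completes
cleanly. The model below is a hedged illustration, not the memory controller design: the
class name ReplayBuffer, the send callback, and the replay limit are hypothetical.

```python
from collections import deque


class ReplayBuffer:
    """Hypothetical model: hold each transaction until it completes cleanly."""

    def __init__(self, send, max_replays=3):
        self.send = send                # callable(txn) -> True on clean completion
        self.max_replays = max_replays
        self.pending = deque()

    def submit(self, txn):
        self.pending.append(txn)        # keep a copy until the transaction succeeds

    def drain(self):
        """Issue pending transactions in order; return the number of faults seen."""
        faults = 0
        while self.pending:
            txn = self.pending[0]
            for _ in range(self.max_replays + 1):
                if self.send(txn):
                    break               # completed cleanly
                faults += 1             # fault detected: replay the same transaction
            else:
                raise RuntimeError(f"replay limit reached for {txn!r}")
            self.pending.popleft()      # retire only after a clean completion
        return faults
```

Retiring a transaction only after a clean completion preserves ordering, so a replayed
write is never overtaken by a later one.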

4.3.5 Alternative processor recovery and Partition Availability Priority

If Processor Instruction Retry for a fault within a core occurs multiple times without success,
the fault is considered to be a solid failure. In some instances, PowerVM can work with the
processor hardware to migrate a workload running on the failing processor to a spare or
alternative processor. PowerVM does this by moving the pertinent processor
core state from one core to another, with the new core taking over at the instruction that failed
on the faulty core. A successful migration keeps the application running, so the application
does not need to be terminated.
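
The hand-off from a failing core to a spare can be modeled in a few lines. This is a
simplified sketch under stated assumptions and does not reflect how PowerVM or the processor
hardware implements the migration: the Core class, the run function, the retry limit, and
the fail_at fault injection are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Core:
    name: str
    healthy: bool = True
    pc: int = 0                          # index of the next instruction
    regs: dict = field(default_factory=dict)

    def step(self, program):
        if not self.healthy:
            # The fault is detected before the instruction changes any state.
            raise RuntimeError(f"{self.name}: fault detected")
        program[self.pc](self.regs)
        self.pc += 1


def run(program, core, spare, max_retries=2, fail_at=None):
    """Run `program`; `fail_at` is the pc at which `core` develops a solid fault."""
    active = core
    while active.pc < len(program):
        if active is core and active.pc == fail_at:
            core.healthy = False         # inject a fault that retries cannot clear
        for _ in range(max_retries + 1):
            try:
                active.step(program)
                break
            except RuntimeError:
                continue                 # try the same instruction again
        else:
            if active is spare:
                raise RuntimeError("no spare core available")
            # Solid fault: copy the core state and resume at the failed instruction.
            spare.pc, spare.regs = active.pc, dict(active.regs)
            active = spare
    return active
```

The spare core starts at the same pc value that failed on the original core, so no
completed work is repeated and no instruction is skipped.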