IBM Power 595 Technical Overview and Introduction


actions include incidents that affect system availability and incidents that are concurrently
repaired.

Unscheduled incident repair action (UIRA) is a hardware event that causes a system or
partition to be rebooted in full or degraded mode. The system or partition will experience
an unscheduled outage. The restart can include some level of capability degradation, but
remaining resources are made available for productive work.

High impact outage (HIO) is a hardware failure that triggers a system crash, which is not
recoverable by immediate reboot. This is usually caused by failure of a component that is
critical to system operation and is, in some sense, a measure of system single
points-of-failure. HIOs result in the most significant availability impact on the system,
because repairs cannot be effected without a service call.
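
The distinction between these categories can be sketched in code. This is a minimal illustration only: the two boolean inputs and the names `Incident` and `classify` are hypothetical and are not part of any IBM tool or firmware interface.

```python
from enum import Enum


class Incident(Enum):
    CONCURRENT_REPAIR = "repaired while the system keeps running"
    UIRA = "unscheduled incident repair action"
    HIO = "high impact outage"


def classify(caused_outage: bool, recovered_by_reboot: bool) -> Incident:
    """Map two hypothetical observations about a hardware event to a category."""
    if not caused_outage:
        # The repair was made concurrently; no system or partition outage.
        return Incident.CONCURRENT_REPAIR
    if recovered_by_reboot:
        # The system or partition rebooted, possibly in degraded mode.
        return Incident.UIRA
    # A crash that an immediate reboot cannot recover needs a service call.
    return Incident.HIO


print(classify(caused_outage=True, recovered_by_reboot=True).name)   # UIRA
print(classify(caused_outage=True, recovered_by_reboot=False).name)  # HIO
```

The ordering of the checks mirrors the definitions above: an HIO is what remains when an outage occurs and a reboot does not restore service.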

A consistent, architecture-driven focus on system RAS (using the techniques described in this
document and deploying appropriate configurations for availability) has led to almost
complete elimination of HIOs in currently available POWER processor servers.

The clear design goal for Power Systems is to prevent hardware faults from causing an
outage of either the platform or a partition. Part selection for reliability, redundancy, recovery and
self-healing techniques, and degraded operational modes are used in a coherent, methodical
strategy to avoid HIOs and UIRAs.

4.2 Availability

IBM's extensive system of FFDC error checkers also supports a strategy of Predictive Failure
Analysis®, which is the ability to track intermittent correctable errors and to vary components
offline before they reach the point of hard failure and cause a crash.

The FFDC methodology supports IBM's autonomic computing initiative. The primary RAS
design goal of any POWER processor-based server is to prevent unexpected application loss
due to unscheduled server hardware outages. To accomplish this goal, a system must have a
quality design that includes critical attributes for:

Self-diagnosing and self-correcting during run time

Automatically reconfiguring to mitigate potential problems from suspect hardware

Self-healing or automatically substituting good components for failing components

4.2.1 Detecting and deallocating failing components

Runtime correctable or recoverable errors are monitored to determine if there is a pattern of
errors. If a component reaches a predefined error limit, the service processor initiates an
action to deconfigure the faulty hardware, helping to avoid a potential system outage and to
enhance system availability.
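
The threshold behavior described above can be sketched as a simple per-component counter. This is an illustration under stated assumptions, not the service processor's actual logic: the class name, the component labels, and the limit of 3 are all hypothetical, and real error limits are defined by the platform firmware.

```python
from collections import Counter


class ErrorThresholdMonitor:
    """Counts recoverable errors per component and flags those that hit a limit."""

    def __init__(self, error_limit: int):
        self.error_limit = error_limit  # assumed value for illustration
        self.counts = Counter()
        self.deconfigured = set()

    def record_recoverable_error(self, component: str) -> bool:
        """Record one error; return True if this call triggers deconfiguration."""
        if component in self.deconfigured:
            return False  # already removed from the configuration
        self.counts[component] += 1
        if self.counts[component] >= self.error_limit:
            self.deconfigured.add(component)
            return True
        return False


monitor = ErrorThresholdMonitor(error_limit=3)
events = ["core0", "dimm4", "core0", "core0", "dimm4"]
triggered = [c for c in events if monitor.record_recoverable_error(c)]
print(triggered)  # ['core0']
```

Here `core0` is deconfigured on its third recoverable error, while `dimm4` stays in service because it has logged only two.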

Persistent deallocation

To enhance system availability, a component that is identified for deallocation or
deconfiguration on a POWER6 processor-based system is flagged for persistent deallocation.
Component removal can occur either dynamically (while the system is running) or at
boot time (initial program load, or IPL), depending both on the type of fault and when the fault is
detected.
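
The idea of a deallocation flag that persists across boots can be sketched as follows. The source does not say how or where the flag is stored, so the JSON file here is only a stand-in for nonvolatile storage, and every function and component name is hypothetical.

```python
import json
import os
import tempfile


def load_flagged(store_path: str) -> set:
    """Read the persistent record; an absent file means nothing is flagged."""
    if not os.path.exists(store_path):
        return set()
    with open(store_path) as f:
        return set(json.load(f))


def flag_for_deallocation(store_path: str, component: str) -> None:
    """Add a component to the persistent deallocation record."""
    flagged = load_flagged(store_path)
    flagged.add(component)
    with open(store_path, "w") as f:
        json.dump(sorted(flagged), f)


def configure_at_ipl(store_path: str, installed: list) -> list:
    """Return the components to bring online at boot, skipping flagged ones."""
    flagged = load_flagged(store_path)
    return [c for c in installed if c not in flagged]


with tempfile.TemporaryDirectory() as tmp:
    store = os.path.join(tmp, "dealloc.json")
    installed = ["core0", "core1", "dimm4"]
    flag_for_deallocation(store, "core1")             # fault found at run time
    first_boot = configure_at_ipl(store, installed)
    second_boot = configure_at_ipl(store, installed)  # flag survives later boots

print(first_boot)  # ['core0', 'dimm4']
```

Because the record outlives any single boot, the flagged component stays out of the configuration on every later IPL until the record is changed.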

In addition, components with runtime unrecoverable hardware faults can be deconfigured from the system after
the first occurrence. The system can be rebooted immediately after failure and resume