IBM Power 595 Technical Overview and Introduction

Chapter 4. Continuous availability and manageability

As the manager of system memory, at boot time the POWER Hypervisor decides which
memory to make available for server use and which to put in the unlicensed or spare pool,
based upon system performance and availability considerations.

– If the service processor identifies faulty memory in a server that includes CoD memory,
the POWER Hypervisor attempts to replace the faulty memory with available CoD
memory. As faulty resources on POWER6 or POWER5 processor-based offerings are
automatically demoted to the system's unlicensed resource pool, working resources
are included in the active memory space.

– On POWER5 midrange systems (p5-570, i5-570), only memory associated with the
first card failure is spared to available CoD memory. If simultaneous failures occur on
multiple memory cards, only the first memory failure found is spared.

– Because these activities reduce the amount of CoD memory available for future use,
repair of the faulty memory should be scheduled as soon as is convenient.

Upon reboot, if not enough memory is available, the POWER Hypervisor reduces the
capacity of one or more partitions. The HMC receives notification of the failed component,
triggering a service call.
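
The sparing arithmetic described above can be illustrated with a small sketch. This is a toy
model only, not the POWER Hypervisor's actual logic: the MemoryPool type, its field names,
and the spare_faulty_memory function are assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class MemoryPool:
    """Hypothetical bookkeeping for one server (all sizes in GB)."""
    active_gb: int       # licensed memory available to partitions
    cod_spare_gb: int    # unlicensed CoD memory that can act as a spare
    faulty_gb: int = 0   # memory demoted out of service


def spare_faulty_memory(pool: MemoryPool, failed_gb: int) -> int:
    """Cover failed active memory with CoD memory where possible.

    Returns the shortfall in GB, that is, the amount by which partition
    capacity would have to be reduced after a reboot.
    """
    covered = min(failed_gb, pool.cod_spare_gb)
    pool.cod_spare_gb -= covered      # spare capacity is consumed
    pool.faulty_gb += failed_gb       # failed memory leaves service
    shortfall = failed_gb - covered
    pool.active_gb -= shortfall       # only the uncovered part is lost
    return shortfall
```

In this model, a 4 GB failure on a pool with 64 GB active and 8 GB of CoD spare leaves the
active capacity at 64 GB and the spare at 4 GB, which is why the text recommends scheduling
a repair before the spare pool is exhausted.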

4.2.2 Special uncorrectable error handling

Although it is rare, an uncorrectable data error can occur in memory or a cache. POWER6
processor systems attempt to limit the impact of an uncorrectable error to the least possible
disruption, using a well-defined strategy that first considers the data source.
Sometimes, an uncorrectable error is temporary in nature and occurs in data that can be
recovered from another repository. For example:

– Data in the instruction L1 cache is never modified within the cache itself. Therefore, an
uncorrectable error discovered in the cache is treated like an ordinary cache miss, and
correct data is loaded from the L2 cache.

– The POWER6 processor's L3 cache can hold an unmodified copy of data in a portion of
main memory. In this case, an uncorrectable error in the L3 cache would simply trigger a
reload of a cache line from main memory. This capability is also available in the L2 cache.
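
The "treat the error like a cache miss" idea in these examples can be sketched as follows.
This is a simplified software analogy, not a description of the POWER6 cache hardware; the
ReadOnlyCache class and its attributes are hypothetical names chosen for illustration.

```python
class ReadOnlyCache:
    """Toy cache that only ever holds unmodified copies of a backing store."""

    def __init__(self, backing):
        self.backing = backing   # next level: dict of address -> data
        self.lines = {}          # cached copies
        self.corrupt = set()     # addresses flagged as uncorrectable
        self.reloads = 0         # number of fetches from the backing store

    def read(self, addr):
        # An uncorrectable line is handled exactly like a missing line:
        # discard it and fetch a clean copy from the next level.
        if addr in self.corrupt or addr not in self.lines:
            self.lines[addr] = self.backing[addr]
            self.corrupt.discard(addr)
            self.reloads += 1
        return self.lines[addr]
```

The approach works only because the cached copy was never modified, so the backing store
is guaranteed to hold identical, valid data.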

In cases where the data cannot be recovered from another source, a technique called special
uncorrectable error (SUE) handling is used to determine whether the corruption is truly a
threat to the system. If, as is sometimes the case, the data is never actually used but is simply
overwritten, then the error condition can safely be voided and the system will continue to
operate normally.

When an uncorrectable error is detected, the system modifies the associated ECC word,
thereby signaling to the rest of the system that the standard ECC is no longer valid. The
service processor is then notified, and takes appropriate actions. When AIX V5.2 or
later, or Linux, is running and a process attempts to use the data, the operating system is
informed of the error and terminates only the specific user program.
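
The deferred nature of SUE handling (mark the data now, act only if it is consumed) can be
sketched in a few lines. This is an illustrative model under stated assumptions, not IBM's
implementation: the Memory class, the mark_sue method, and the UncorrectableDataError
exception are hypothetical names, and a Python set stands in for the modified ECC word.

```python
class UncorrectableDataError(Exception):
    """Raised when a consumer actually tries to use marked data."""


class Memory:
    def __init__(self):
        self.data = {}
        self.marked = set()   # stands in for the modified ECC word

    def mark_sue(self, addr):
        # Detection only records the problem; nothing is terminated yet.
        self.marked.add(addr)

    def write(self, addr, value):
        # Overwriting replaces the corrupt contents, so the mark is cleared.
        self.data[addr] = value
        self.marked.discard(addr)

    def read(self, addr):
        # The error surfaces only when the bad data is consumed.
        if addr in self.marked:
            raise UncorrectableDataError(hex(addr))
        return self.data[addr]
```

A marked location that is overwritten before any read never raises, which mirrors the case
in the text where the error condition is safely voided.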

Which data is critical depends on the system type and the firmware level. For example, on
POWER6 processor-based servers, the POWER Hypervisor will, in most cases, tolerate
uncorrectable errors in partition data without causing system termination. Only when the
corrupt data is used by the POWER Hypervisor must the entire system be
rebooted, thereby preserving overall system integrity.
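
The containment outcomes described in this section can be summarized as a lookup keyed on
who consumes the corrupt data. The sketch below is a simplification that covers only the
three cases named in the text; the Consumer and Action enumerations and the
containment_action function are hypothetical, and real behavior depends on system type and
firmware level.

```python
from enum import Enum


class Consumer(Enum):
    NONE = "data overwritten before use"
    USER_PROCESS = "user program in a partition"
    HYPERVISOR = "POWER Hypervisor"


class Action(Enum):
    CONTINUE = "continue normal operation"
    TERMINATE_PROCESS = "terminate only the affected user program"
    REBOOT_SYSTEM = "reboot the entire system"


# Assumed policy table, derived from the prose above.
_POLICY = {
    Consumer.NONE: Action.CONTINUE,
    Consumer.USER_PROCESS: Action.TERMINATE_PROCESS,
    Consumer.HYPERVISOR: Action.REBOOT_SYSTEM,
}


def containment_action(consumer: Consumer) -> Action:
    """Return the narrowest action that still preserves system integrity."""
    return _POLICY[consumer]
```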

Depending on system configuration and source of the data, errors encountered during I/O
operations might not result in a machine check. Instead, the incorrect data is handled by the

Footnote: Linux here means SLES 10 SP1 or later, and RHEL 4.5 or later (including RHEL 5.1).