IBM Power System E850C: Technical Overview and Introduction

Successful migration requires that enough spare capacity is available to reduce the overall
processing capacity within the system by one processor core. Typically, in highly virtualized
environments, the requirements of partitions can be reduced to accomplish this task without
any further impact to running applications.

In systems without sufficient reserve capacity, it might be necessary to terminate at least one
partition to free resources for the migration. PowerVM users can identify in advance which
partitions have the highest priority and which do not. When this Partition Availability Priority
feature of PowerVM is used and a partition must be terminated for alternative processor
recovery to complete, the system can terminate lower priority partitions to keep the higher
priority partitions up and running. This prioritization applies even when the unrecoverable
error occurred on a core running the highest priority workload.

Partition Availability Priority is assigned to partitions by using a weight value or integer rating.
The lowest priority partition is rated at zero and the highest priority partition is rated at 255.
The default value is set to 127 for standard partitions and 192 for Virtual I/O Server (VIOS)
partitions. Priorities can be modified through the Hardware Management Console (HMC).
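
The prioritization described above can be illustrated with a minimal sketch. It assumes a simple greedy policy (terminate the lowest priority partitions first until one core of capacity is freed). The source does not describe the actual selection logic used by PowerVM, and the Partition class, its cores field, and the partitions_to_terminate function are hypothetical names introduced only for this example. The value range and the two default values are taken from the text.

```python
from dataclasses import dataclass

DEFAULT_PRIORITY = 127       # default for standard partitions (from the text)
DEFAULT_VIOS_PRIORITY = 192  # default for VIOS partitions (from the text)


@dataclass
class Partition:
    name: str
    cores: float                      # capacity held by the partition (hypothetical field)
    priority: int = DEFAULT_PRIORITY  # Partition Availability Priority, 0 (lowest) to 255

    def __post_init__(self):
        if not 0 <= self.priority <= 255:
            raise ValueError("priority must be in the range 0..255")


def partitions_to_terminate(partitions, cores_needed=1.0):
    """Greedy illustration: pick lowest priority partitions until enough
    capacity is freed. This is an assumed policy, not the firmware algorithm."""
    chosen = []
    freed = 0.0
    for part in sorted(partitions, key=lambda p: p.priority):
        if freed >= cores_needed:
            break
        chosen.append(part.name)
        freed += part.cores
    if freed < cores_needed:
        raise RuntimeError("not enough capacity can be freed")
    return chosen


example = [
    Partition("vios1", 0.5, DEFAULT_VIOS_PRIORITY),
    Partition("prod_db", 2.0, 200),
    Partition("test", 0.5, 10),
    Partition("dev", 1.0, 50),
]
print(partitions_to_terminate(example))
```

In this example the two lowest rated partitions (test and dev) are selected, while the VIOS partition and the highest rated partition keep running.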

4.3.6 Core Contained Checkstops and other PowerVM error recovery

PowerVM can handle certain other hardware faults without terminating applications, such as
an error in certain data structures (faults in translation tables or lookaside buffers).

Other core hardware faults that alternative processor recovery or Processor Instruction Retry
cannot contain might be handled in PowerVM by a technique called Core Contained Checkstops.
This technique allows PowerVM to be signaled when such faults occur and to terminate the code
in use by the failing processor core (typically just a partition, although potentially PowerVM
itself if the failing instruction were in a critical area of PowerVM code).

Processor designs without Processor Instruction Retry typically must resort to such
techniques for all faults that can be contained to an instruction in a processor core.
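
The escalation between these recovery techniques can be sketched as a small decision function. This is only an illustration of the order implied by the text. The function name, its boolean parameters, and the Owner enumeration are assumptions made for the example and do not correspond to any PowerVM interface.

```python
from enum import Enum, auto


class Owner(Enum):
    """Whose code was running on the failing core (hypothetical type)."""
    PARTITION = auto()
    HYPERVISOR = auto()


def handle_core_fault(retry_recovers, alt_recovery_succeeds, owner):
    """Return the outcome of a core fault under the assumed escalation order."""
    if retry_recovers:
        return "recovered by Processor Instruction Retry"
    if alt_recovery_succeeds:
        return "recovered by alternative processor recovery"
    # Neither technique contained the fault: Core Contained Checkstop.
    if owner is Owner.PARTITION:
        return "checkstop: terminate the affected partition"
    return "checkstop: terminate PowerVM"


print(handle_core_fault(False, False, Owner.PARTITION))
```

The point of the sketch is that termination is the last resort and that its scope depends on which code was in use by the failing core.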

4.3.7 Cache uncorrectable error handling

If a fault within a cache occurs that cannot be corrected with SEC/DED ECC, the faulty cache
element is unconfigured from the system. Typically, this process is done by purging and
deleting a single cache line. Such purge and delete operations are contained within the
hardware itself, and prevent a faulty cache line from being reused and causing multiple errors.

During the cache purge operation, the data that is stored in the cache line is corrected where
possible. If correction is not possible, the associated cache line is marked with a special ECC
code that indicates that the cache line itself has bad data.

Nothing within the system terminates just because such an event is encountered. Instead, the
hardware monitors the usage of pages that contain marked data. If the marked data is never
used, hardware replacement is requested, but nothing terminates as a result of the operation.
Software layers are not required to handle such faults.

Further action is needed only when the marked data is loaded to be processed by a processor
core or sent out to an I/O adapter. In such cases, if the data belongs to a logical partition,
the partition operating system might be responsible for terminating itself or just the program
using the marked page. If the data is owned by the hypervisor, the hypervisor might choose to
terminate, resulting in a system-wide outage.
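
The deferred handling described in this section can be modeled with a minimal sketch: the purge step only marks data that cannot be corrected, and a decision is made only when the marked data is consumed. The CacheLine class, the poisoned flag (standing in for the special ECC code), and the purge and consume functions are hypothetical names used for illustration. The real mechanism is implemented in hardware.

```python
from dataclasses import dataclass


@dataclass
class CacheLine:
    data: bytes
    correctable: bool       # whether the data can be corrected during the purge
    poisoned: bool = False  # stands in for the special ECC mark (assumption)


def purge(line):
    """Purge step: mark the line if its data cannot be corrected.
    Nothing terminates here."""
    if not line.correctable:
        line.poisoned = True
    return line


def consume(line, owner):
    """Called when data is loaded by a core or sent to an I/O adapter.
    owner is "partition" or "hypervisor" (hypothetical labels)."""
    if not line.poisoned:
        return "ok"
    if owner == "partition":
        return "partition OS terminates itself or the program using the page"
    return "hypervisor might terminate (system-wide outage)"


bad = purge(CacheLine(b"\x00" * 8, correctable=False))
good = purge(CacheLine(b"\x01" * 8, correctable=True))
print(consume(good, "partition"))
print(consume(bad, "partition"))
```

If consume is never called on a marked line, the sketch produces no termination at all, which mirrors the case in which marked data is never used.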