Chapter 2. Architecture and technical overview
After a failure occurs on one of the CDIMMs containing hypervisor data, all server operations
remain active and the flexible service processor (FSP) isolates the failing CDIMM. The system
stays in the partially mirrored state until the failing CDIMM is replaced.
Some components are not mirrored because they are not vital to the regular server
operations and require a larger amount of memory to accommodate their data:
- Advanced Memory Sharing Pool
- Memory used to hold the contents of platform dumps
With AMM, uncorrectable errors in data that is owned by a partition or application are handled
by the existing Special Uncorrectable Error handling methods in the hardware, firmware, and
operating system.
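The failover behavior described above can be pictured with a small toy model. This is only a
sketch, not the firmware's actual mechanism: the class name MirroredRegion, its methods, and the
resynchronization step after replacement are all assumptions made for illustration.

```python
class MirroredRegion:
    """Toy model of a mirrored hypervisor memory region (hypothetical names)."""

    def __init__(self):
        # Two independent copies, standing in for two CDIMMs.
        self.copies = [{}, {}]
        self.healthy = [True, True]

    def write(self, addr, value):
        # Every write goes to all copies that are still healthy.
        for i, copy in enumerate(self.copies):
            if self.healthy[i]:
                copy[addr] = value

    def fail(self, index):
        # Isolate a failing copy; the region is now only partially mirrored.
        self.healthy[index] = False

    def read(self, addr):
        # Serve the read from the first healthy copy.
        for i, copy in enumerate(self.copies):
            if self.healthy[i]:
                return copy[addr]
        raise RuntimeError("no healthy copy left")

    def replace(self, index):
        # Assumption: after replacement, the new part is resynchronized
        # from a copy that is still healthy.
        source = next(c for i, c in enumerate(self.copies) if self.healthy[i])
        self.copies[index] = dict(source)
        self.healthy[index] = True

    @property
    def fully_mirrored(self):
        return all(self.healthy)
```

In this model, reads keep succeeding after fail(0), and fully_mirrored stays False until
replace(0) is called, which mirrors the "partially mirrored until replaced" state in the text.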
2.5.7 Memory Error Correction and Recovery
Many features within the Power E850C memory subsystem are designed to reduce the risk of
errors, or to minimize the impact of any errors that do occur. These features help ensure that
those errors do not affect critical enterprise data.
Each memory chip has error detection and correction circuitry built in. This circuitry is
designed so that the failure of any one specific memory module within an ECC word can be
corrected without any other fault.
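The ECC code that the Power E850C uses is not described here, and it corrects the failure of a
whole memory module within an ECC word. As a much simpler stand-in for the general idea of
correcting an error inside a code word, the following sketch uses the textbook Hamming(7,4)
code, which corrects a single flipped bit. The function names are illustrative only.

```python
def hamming74_encode(data):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit code word."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    # Bit positions 1..7 hold: p1 p2 d1 p3 d2 d3 d4
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(word):
    """Return (data_bits, error_position); error_position is 0 if no error was found."""
    c = list(word)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]  # parity over positions 1, 3, 5, 7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]  # parity over positions 2, 3, 6, 7
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]  # parity over positions 4, 5, 6, 7
    position = s1 + 2 * s2 + 4 * s3  # syndrome is the 1-based position of the bad bit
    if position:
        c[position - 1] ^= 1         # flip the bad bit back
    return [c[2], c[4], c[5], c[6]], position


if __name__ == "__main__":
    word = hamming74_encode([1, 0, 1, 1])
    word[4] ^= 1                     # inject a single-bit error
    print(hamming74_decode(word))
```

A production memory ECC works on wider words and on multi-bit symbols, so that all bits
supplied by one failed device can be corrected together; the toy code above does not do that.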
In addition, a spare DRAM per rank on each memory CDIMM provides for dynamic DRAM
device replacement during runtime operation. Also, dynamic lane sparing on the memory link
allows for replacement of a faulty data lane without affecting performance or throughput.
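Dynamic DRAM sparing can be thought of as a steering table that redirects one logical device
slot to the spare while the system keeps running. The sketch below is a simplified model under
that assumption; the Rank class, the device count, and the method names are hypothetical and
do not reflect the actual memory buffer design.

```python
class Rank:
    """Toy rank: several data devices plus one spare (counts are illustrative)."""

    def __init__(self, data_devices=4):
        self.devices = [0] * (data_devices + 1)  # last entry is the spare
        self.spare = data_devices
        self.steer = list(range(data_devices))   # logical slot -> physical device
        self.spare_used = False

    def write(self, slot, value):
        self.devices[self.steer[slot]] = value

    def read(self, slot):
        return self.devices[self.steer[slot]]

    def spare_out(self, slot):
        """Steer a logical slot to the spare device without stopping accesses."""
        if self.spare_used:
            raise RuntimeError("spare already in use")
        # Copy the current (assumed already corrected) contents to the spare,
        # then redirect all future accesses for this slot.
        self.devices[self.spare] = self.read(slot)
        self.steer[slot] = self.spare
        self.spare_used = True
```

After spare_out(2), reads and writes to slot 2 land on the spare device, and the faulty device
is no longer touched.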
Other memory protection features include retry capabilities for certain faults detected at both
the memory controller and the memory buffer.
Memory is also periodically scrubbed so that soft errors can be corrected and solid
single-cell errors can be reported to the hypervisor. This process supports operating system
deallocation of a page that is associated with a hard single-cell fault.
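The scrubbing pass can be sketched as a loop that rewrites each bad cell and then checks
whether the error persists. This is a simplified simulation, not the hardware algorithm: the
Cell class, the stuck_at field, and the report callback are assumptions made for illustration.

```python
class Cell:
    """Toy memory cell that may hold a soft error or be permanently stuck."""

    def __init__(self, value, stuck_at=None):
        self.good = value         # value that ECC would reconstruct
        self.stored = value       # value physically held in the cell
        self.stuck_at = stuck_at  # None for a healthy cell, else the stuck value

    def rewrite(self):
        # A rewrite clears a soft error; a stuck cell ignores the new value.
        self.stored = self.good if self.stuck_at is None else self.stuck_at


def scrub(memory, page_size, report):
    """Walk all cells, correct soft errors, and report hard faults by page number."""
    corrected = 0
    for addr, cell in enumerate(memory):
        if cell.stored == cell.good:
            continue                  # nothing to fix
        cell.rewrite()
        if cell.stored == cell.good:
            corrected += 1            # soft error: gone after the rewrite
        else:
            report(addr // page_size) # hard error: persists, so report the page
    return corrected
```

Passing a set's add method as report, for example, collects the pages that an operating system
could then deallocate.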
For more information about Memory RAS, see 4.3.10, “Memory protection” on page 110.
2.5.8 Special Uncorrectable Error handling
Special Uncorrectable Error (SUE) handling prevents an uncorrectable error in memory or
cache from immediately causing the system to terminate. Rather, the system tags the data
and determines whether it will ever be used again. If the error is irrelevant, it does not force a
checkstop. If the data is used, termination can be limited to the program, kernel, or hypervisor
that owns the data, or the I/O adapters controlled by an I/O hub controller can be frozen if the
data is to be transferred to an I/O device.
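The key idea of SUE handling is that detection only tags the data, and the consequence is
deferred until the data is consumed. The following sketch models that deferral in software.
The Memory class, the exception type, and the rule that overwriting tagged data clears the tag
are assumptions for illustration, not a description of the hardware.

```python
class SpecialUncorrectableError(Exception):
    """Raised when tagged (bad) data is actually consumed."""

    def __init__(self, owner):
        super().__init__(f"SUE consumed by {owner}")
        self.owner = owner


class Memory:
    def __init__(self):
        self.data = {}    # address -> (owner, value)
        self.sue = set()  # addresses tagged as holding uncorrectable data

    def store(self, addr, owner, value):
        self.data[addr] = (owner, value)
        # Assumption: fresh data replaces the bad data, so the tag is cleared.
        self.sue.discard(addr)

    def mark_uncorrectable(self, addr):
        # Detection only tags the data; nothing is terminated yet.
        self.sue.add(addr)

    def load(self, addr):
        owner, value = self.data[addr]
        if addr in self.sue:
            # Only the owner of the consumed data is affected.
            raise SpecialUncorrectableError(owner)
        return value
```

In this model, a tagged address that is never loaded, or that is overwritten first, has no
effect, and a load of tagged data identifies only the owner that must be terminated.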
Partition data: Active Memory Mirroring will not mirror partition data. It was designed to
mirror only the hypervisor code and its components, allowing this data to be protected
against a CDIMM failure.