IBM E850C Technical Overview And Introduction Download

Chapter 4. Reliability, availability, and serviceability

Two ports are combined into an ECC word and supply 128 bits of data. The ECC that is
deployed can correct the result of an entire DRAM module that is faulty. This is also known as
Chipkill correction. Then, it can correct at least an additional bit within the ECC word.

The additional spare DRAM modules are used so that when a DIMM experiences a Chipkill
event within the DRAM modules under a port, the spare DRAM module can be substituted for
a failing module. This process avoids the need to replace the DIMM for a single Chipkill event.

Depending on how DRAM modules fail, it might be possible to tolerate up to four DRAM
modules failing on a single DIMM without needing to replace the DIMM, and then still correct
an additional DRAM module that is failing within the DIMM.
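
The interplay between sparing and ECC that is described above can be illustrated with a small
model. This is only a sketch under stated assumptions: the class ToyDimm, its method names,
and the choice of one spare DRAM module per port on a four-port DIMM are hypothetical and are
not taken from the product documentation. The model shows why the outcome depends on how the
DRAM modules fail: failures on different ports can each consume a spare, while a second failure
on the same port must be absorbed by ECC.

```python
class ToyDimm:
    """Toy model of DRAM sparing on one DIMM (illustrative assumptions only)."""

    def __init__(self, ports=4, spares_per_port=1):
        # Assumption: each port has its own spare DRAM module(s).
        self.spares = [spares_per_port] * ports
        # Failed DRAM modules that could not be replaced by a spare.
        self.unspared_failures = 0

    def dram_failure(self, port):
        """Record a Chipkill event on the given port and return the outcome."""
        if self.spares[port] > 0:
            # A spare under this port is substituted for the failing module.
            self.spares[port] -= 1
            return "spared"
        self.unspared_failures += 1
        if self.unspared_failures == 1:
            # No spare left on this port: ECC still corrects one failing module.
            return "corrected by ECC"
        # Beyond that point the model treats the DIMM as needing replacement.
        return "replace DIMM"


# Four failures spread across four ports, then one more.
dimm = ToyDimm()
outcomes = [dimm.dram_failure(port) for port in range(4)]
outcomes.append(dimm.dram_failure(0))
```

In this model, a second failure on a port that has already used its spare falls through to ECC
correction immediately, which is one way to read the "depending on how DRAM modules fail"
qualifier.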

Other DIMMs are offered with these systems. A 32 GB DIMM has two ranks, where each rank
is similar to the 16 GB DIMM with DRAM modules on four ports, and each port has 10 x8
DRAM modules.

In addition, a 64 GB DIMM is offered. It is built from x4 DRAM modules that are organized in
four ranks.

In addition to the protection that is provided by the ECC and sparing capabilities, the memory
subsystem also implements scrubbing of memory to identify and correct single-bit soft errors.
Hypervisors are informed of incidents of single-cell persistent (hard) faults for deallocation of
associated pages. However, because of the ECC and sparing capabilities that are used, such
memory page deallocation is not relied on for repair of faulty hardware.
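
The idea behind memory scrubbing can be sketched with a toy example. The code below is not the
ECC that these systems use; it swaps in a textbook Hamming(7,4) single-error-correcting code
purely for illustration, and the names encode, decode, syndrome, and scrub are hypothetical.
A scrub pass walks every stored word, recomputes the syndrome, and writes back a corrected
word whenever a single flipped bit is found.

```python
# Positions are numbered 1..7; parity bits sit at positions 1, 2, and 4.
DATA_POSITIONS = (3, 5, 6, 7)


def syndrome(code):
    """XOR of the positions of all set bits; 0 means no detectable error."""
    s = 0
    for pos in range(1, 8):
        if (code >> (pos - 1)) & 1:
            s ^= pos
    return s


def encode(nibble):
    """Encode 4 data bits into a 7-bit Hamming codeword."""
    code = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (nibble >> i) & 1:
            code |= 1 << (pos - 1)
    # Set the parity bits so that the syndrome of the codeword becomes 0.
    s = syndrome(code)
    for parity_pos in (1, 2, 4):
        if s & parity_pos:
            code |= 1 << (parity_pos - 1)
    return code


def decode(code):
    """Extract the 4 data bits from a (corrected) codeword."""
    return sum(((code >> (pos - 1)) & 1) << i
               for i, pos in enumerate(DATA_POSITIONS))


def scrub(memory):
    """Correct single-bit errors in place; return the addresses that were fixed."""
    corrected = []
    for addr, code in enumerate(memory):
        s = syndrome(code)
        if s:
            # For a single-bit error, the syndrome is the position of the bad bit.
            memory[addr] = code ^ (1 << (s - 1))
            corrected.append(addr)
    return corrected
```

In a fuller model, an address that keeps reappearing in the list returned by successive scrub
passes would correspond to a persistent (hard) fault, which is the kind of incident that the
text says is reported to the hypervisor.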

If a more substantial fault persists after all the self-healing capabilities are used, the
hypervisor can also dynamically move logical memory blocks from faulty memory to unused
memory blocks in other parts of the system. This feature can take advantage of memory that
is otherwise reserved for Capacity on Demand capabilities.

Finally, if an uncorrectable error in data is encountered, the memory that is affected is marked
with a special uncorrectable error code and handled as described for cache uncorrectable
errors.

4.3.11 I/O subsystem availability and Enhanced Error Handling

Use multi-path I/O and VIOS for I/O adapters and RAID for storage devices to prevent
application outages when I/O adapter faults occur.

To permit soft or intermittent faults to be recovered without failover to an alternative device or
I/O path, Power Systems hardware supports Enhanced Error Handling (EEH) for I/O adapters
and PCIe bus faults.

EEH allows EEH-aware device drivers to try again after certain non-fatal I/O events to avoid
failover, especially in cases where a soft error is encountered. EEH also allows device drivers
to terminate if an intermittent hard error or other unrecoverable errors occur, while protecting
against reliance on data that cannot be corrected. This action typically is done by “freezing”
access to the I/O subsystem with the fault. Freezing prevents data from flowing to and from an
I/O adapter and causes the hardware/firmware to respond with a defined error signature
whenever an attempt is made to access the device. If necessary, a special uncorrectable error
code can be used to mark a section of data as bad when the freeze is started.
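
The freeze-and-retry behavior that is described above can be sketched as follows. This is a
simplified model and not the actual firmware or device driver interface: ToySlot,
eeh_aware_read, FrozenSlotError, and the all-ones value that is used as the error signature
are assumptions made for illustration. The sketch shows a driver that retries after a soft
fault and terminates on a persistent fault instead of relying on data that cannot be trusted.

```python
from dataclasses import dataclass

# Assumption: a frozen slot answers every read with this fixed signature.
FROZEN_SIGNATURE = 0xFFFFFFFF


class FrozenSlotError(Exception):
    """Raised when the slot stays frozen after all retries."""


@dataclass
class ToySlot:
    """Hypothetical I/O slot that freezes when a fault is detected."""
    value: int = 0x1234
    frozen: bool = False
    soft_faults_left: int = 0   # number of upcoming reads that hit a soft fault
    hard_fault: bool = False    # persistent fault: every access freezes the slot

    def read(self):
        if self.soft_faults_left > 0:
            self.soft_faults_left -= 1
            self.frozen = True
        if self.hard_fault:
            self.frozen = True
        # While frozen, no real data flows; only the error signature is returned.
        return FROZEN_SIGNATURE if self.frozen else self.value

    def reset(self):
        """Stand-in for the recovery step that lifts the freeze."""
        self.frozen = False


def eeh_aware_read(slot, max_retries=3):
    """Retry after non-fatal events; give up if the fault persists."""
    for attempt in range(max_retries + 1):
        data = slot.read()
        if data != FROZEN_SIGNATURE:
            return data
        if attempt < max_retries:
            slot.reset()
    raise FrozenSlotError("slot remains frozen; refusing to use its data")
```

A slot with one pending soft fault recovers on the first retry, so no failover is needed. A
slot with a hard fault stays frozen and the call raises FrozenSlotError, at which point a
multi-path or redundant configuration would take over.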