IBM Power 750 and 760 Technical Overview and Introduction
4.2.3 Memory protection
A memory protection architecture that provides good error resilience for a relatively small L1
cache might be inadequate for protecting the much larger system main store. Therefore, a
variety of protection methods are used in all POWER processor-based systems to avoid
uncorrectable errors in memory.
Memory protection plans must take into account many factors, including the following items:
Size
Desired performance
Memory array manufacturing characteristics
POWER processor-based systems have several protection schemes designed to prevent, protect
against, or limit the effect of errors in main memory:
Chipkill
Chipkill is an enhancement that enables a system to sustain the failure of an entire
DRAM chip. An ECC word uses 18 DRAM chips from two DIMM pairs, and a failure on any
of the DRAM chips can be fully recovered by the ECC algorithm. The system can continue
indefinitely in this state with no performance degradation until the failed DIMM can
be replaced.
72-byte ECC
An ECC word consists of 72 bytes. Of these, 64 bytes hold application data. The remaining
eight bytes hold check bits and additional information about the ECC word.
This innovative ECC algorithm from IBM research works on DIMM pairs on a rank basis. (A rank
is a group of nine DRAM chips.) With this ECC code, the system can dynamically recover from
an entire DRAM failure (by Chipkill) and can also correct an error even if another symbol
(a byte, accessed by a 2-bit line pair) experiences a fault. This is an improvement over the
double-error detection and single-error correction ECC implementation found on the POWER6
processor-based systems.
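The sizes quoted for the ECC word can be checked with a few lines of arithmetic. The following Python sketch uses only the figures given in this section (72-byte word, 64 data bytes, nine chips per rank, 18 chips per ECC word). The constant and function names are illustrative and are not taken from any IBM interface.

```python
# Figures taken from the text of this section.
ECC_WORD_BYTES = 72        # total size of one ECC word
DATA_BYTES = 64            # bytes holding application data
CHIPS_PER_RANK = 9         # a rank is a group of nine DRAM chips
CHIPS_PER_ECC_WORD = 18    # an ECC word uses 18 DRAM chips


def check_bytes():
    """Bytes left for check bits and additional ECC word information."""
    return ECC_WORD_BYTES - DATA_BYTES


def overhead_vs_data():
    """Check bytes as a fraction of the application data they protect."""
    return check_bytes() / DATA_BYTES


def nine_chip_groups_per_ecc_word():
    """How many nine-chip groups are needed to supply 18 chips."""
    return CHIPS_PER_ECC_WORD // CHIPS_PER_RANK
```

With these figures, `check_bytes()` is 8, `overhead_vs_data()` is 0.125 (12.5 percent of the data size), and an ECC word spans two nine-chip groups.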
Hardware scrubbing
Hardware scrubbing is a method used to deal with intermittent errors. IBM POWER
processor-based systems periodically address all memory locations. Any memory
locations with a correctable error are rewritten with the correct data.
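The scrubbing loop can be illustrated with a toy model. The sketch below is not the ECC that POWER systems use: it assumes a textbook Hamming(7,4) code, which corrects one flipped bit per word, and models memory as a Python list. All function names are hypothetical. A real scrubber runs in hardware and must also handle uncorrectable errors, which this toy code cannot detect.

```python
def hamming74_encode(nibble):
    """Encode 4 data bits as a 7-bit Hamming codeword (toy stand-in for real ECC).

    Bit i of the result holds codeword position i + 1.
    Positions 1, 2 and 4 are parity bits; 3, 5, 6 and 7 are data bits.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    p1 = d[0] ^ d[1] ^ d[3]   # covers positions 1, 3, 5, 7
    p2 = d[0] ^ d[2] ^ d[3]   # covers positions 2, 3, 6, 7
    p4 = d[1] ^ d[2] ^ d[3]   # covers positions 4, 5, 6, 7
    bits = [p1, p2, d[0], p4, d[1], d[2], d[3]]
    return sum(bit << i for i, bit in enumerate(bits))


def hamming74_syndrome(word):
    """Return 0 for a clean codeword, else the 1-based position of the flipped bit."""
    syndrome = 0
    for pos in range(1, 8):
        if (word >> (pos - 1)) & 1:
            syndrome ^= pos
    return syndrome


def scrub(memory):
    """Visit every location and rewrite any word that has a correctable error.

    Returns the list of addresses that were corrected.
    """
    corrected = []
    for addr, word in enumerate(memory):
        pos = hamming74_syndrome(word)
        if pos:
            memory[addr] = word ^ (1 << (pos - 1))  # flip the bad bit back
            corrected.append(addr)
    return corrected
```

For example, after filling `memory` with `hamming74_encode(n)` for `n` in `range(16)` and flipping one bit at address 5, a call to `scrub(memory)` returns `[5]` and restores the original contents.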
Cyclic redundancy check (CRC)
The bus that is transferring data between the processor and the memory uses CRC error
detection with a failed operation-retry mechanism and the ability to dynamically retune the
bus parameters when a fault occurs. In addition, the memory bus has spare capacity to
substitute a data bit-line whenever it is determined to be faulty.
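The detect-and-retry behavior can be sketched in software. The example below is an illustration under stated assumptions, not a model of the actual memory bus: it uses CRC-32 from Python's `zlib` module as a stand-in because this section does not specify the CRC that the bus uses, and the `channel` callable is a hypothetical placeholder for the link. Dynamic retuning of bus parameters and spare bit-line substitution are not modeled.

```python
import zlib


def build_frame(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 to the payload (CRC-32 is an assumed stand-in)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check_frame(frame: bytes):
    """Return the payload if the CRC matches, else None."""
    payload, received = frame[:-4], int.from_bytes(frame[-4:], "big")
    return payload if zlib.crc32(payload) == received else None


def transfer_with_retry(payload: bytes, channel, max_attempts: int = 3):
    """Send payload through `channel`, retrying when the CRC check fails.

    `channel` is a hypothetical callable that takes the frame bytes and
    returns the bytes seen by the receiver, possibly corrupted.
    Returns (payload, number_of_attempts_used).
    """
    for attempt in range(1, max_attempts + 1):
        received = check_frame(channel(build_frame(payload)))
        if received is not None:
            return received, attempt
    raise IOError("transfer failed after %d attempts" % max_attempts)
```

A channel that corrupts only the first frame causes one failed CRC check followed by a successful retry, so the call returns the original payload with an attempt count of 2.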
Memory subsystem
The processor chip contains two memory controllers with four channels per
memory controller. Each channel connects to a single DIMM, but as the channels work in
pairs, a processor chip can address four DIMM pairs, two pairs per memory controller.
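The DIMM counts in the preceding paragraph follow from the controller and channel counts. The short Python sketch below reproduces that arithmetic using only the figures stated there. The names are illustrative.

```python
# Figures taken from the preceding paragraph.
MEMORY_CONTROLLERS_PER_CHIP = 2
CHANNELS_PER_CONTROLLER = 4
DIMMS_PER_CHANNEL = 1
CHANNELS_PER_PAIR = 2      # channels work in pairs


def dimm_topology():
    """Derive per-chip channel, DIMM, and DIMM-pair counts."""
    channels = MEMORY_CONTROLLERS_PER_CHIP * CHANNELS_PER_CONTROLLER
    dimms = channels * DIMMS_PER_CHANNEL
    pairs = channels // CHANNELS_PER_PAIR
    return {
        "channels": channels,
        "dimms": dimms,
        "dimm_pairs": pairs,
        "pairs_per_controller": pairs // MEMORY_CONTROLLERS_PER_CHIP,
    }
```

The result is eight channels and eight DIMMs per processor chip, grouped as four DIMM pairs, two per memory controller.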
The bus transferring data between the processor and the memory uses CRC error detection
with a failed operation retry mechanism and the ability to dynamically retune bus parameters
when a fault occurs. In addition, the memory bus has spare capacity to substitute a spare
data bit-line for one that is determined to be faulty.