Intel® Server Board SE7520BD2 Technical Product Specification
Product Overview
Revision 1.3
Intel Confidential
Hardware additions for this feature include the implementation of a tracking register per DIMM
to maintain a history of error occurrence, and a programmable register to hold the fail-over error
threshold level. The operational model is straightforward: set the fail-over threshold register to a
non-zero value to enable the feature, and if the count of errors on any DIMM exceeds that
value, fail-over will commence. The tracking registers themselves are implemented as “leaky
buckets,” such that they do not contain an absolute cumulative count of all errors since power-
on; rather, they contain an aggregate count of the number of errors received over a running time
period. The “drip rate” of the bucket is selectable by software, so it is possible
to set the threshold to a value that will never be reached by a “healthy” memory subsystem
experiencing the rate of errors expected for the size and type of memory devices in use.
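The leaky-bucket behavior described above can be illustrated with a small software model. This is a sketch only: the class name, the tick-based drip model, and the parameter names are assumptions made for illustration and do not describe the actual MCH register layout or drip-rate encoding.

```python
class LeakyBucketCounter:
    """Hypothetical model of one per-DIMM error tracking register."""

    def __init__(self, threshold, drip_interval):
        # A threshold of zero models the feature being disabled.
        self.threshold = threshold
        # Assumed drip model: the count decays by one every drip_interval ticks.
        self.drip_interval = drip_interval
        self.count = 0
        self._ticks = 0

    def tick(self):
        """Advance time by one unit; leak one error when the interval elapses."""
        self._ticks += 1
        if self._ticks >= self.drip_interval:
            self._ticks = 0
            if self.count > 0:
                self.count -= 1

    def record_error(self):
        """Count one error; return True if the count now exceeds the threshold."""
        self.count += 1
        return self.threshold != 0 and self.count > self.threshold


def run(error_period, ticks, threshold=4, drip_interval=10):
    """Inject one error every error_period ticks; report whether fail-over trips."""
    bucket = LeakyBucketCounter(threshold, drip_interval)
    for t in range(1, ticks + 1):
        bucket.tick()
        if t % error_period == 0 and bucket.record_error():
            return True
    return False
```

In this model, a DIMM whose errors arrive more slowly than the bucket drains never reaches the threshold, while a DIMM whose errors arrive faster than the drip rate eventually exceeds it.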
The fail-over mechanism is slightly more complex. Once fail-over has been initiated, the MCH
must execute every write twice: once to the primary DIMM, and once to the spare. (This
requires that the spare DIMM be at least the size of the largest primary DIMM in use.) The MCH
will also begin tracking the progress of its built-in memory scrub engine. Once the scrub engine
has covered every location in the primary DIMM, the duplicate write function will have copied
every data location to the spare. At that point, the MCH can switch the spare into primary use,
and take the failing DIMM off-line.
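The copy sequence above can be sketched as a simple simulation. The class and method names are hypothetical, and the model assumes that each scrub write-back travels through the same duplicated write path as an ordinary write; the source does not spell out that detail.

```python
class SparingModel:
    """Hypothetical model of the duplicate-write plus scrub copy sequence."""

    def __init__(self, primary, spare_size):
        # The spare must be at least as large as the primary it replaces.
        if spare_size < len(primary):
            raise ValueError("spare must be at least as large as the primary")
        self.primary = list(primary)
        self.spare = [None] * spare_size
        self.active = self.primary   # DIMM currently servicing reads
        self.copying = False
        self.scrub_addr = 0          # next location the scrub engine visits

    def start_failover(self):
        """Begin duplicating writes and tracking scrub progress."""
        self.copying = True
        self.scrub_addr = 0

    def write(self, addr, value):
        self.active[addr] = value
        if self.copying:
            # During fail-over every write is executed twice.
            self.spare[addr] = value

    def read(self, addr):
        return self.active[addr]

    def scrub_step(self):
        """Scrub one location by reading it and writing it back."""
        if not self.copying:
            return
        self.write(self.scrub_addr, self.primary[self.scrub_addr])
        self.scrub_addr += 1
        if self.scrub_addr == len(self.primary):
            # Every primary location has been covered: switch to the spare.
            self.active = self.spare
            self.copying = False
```

Because ordinary writes are also duplicated while the scrub is in progress, a location that is modified after the scrub engine has passed it still ends up current in the spare.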
Note that because this mechanism is programmed in BIOS, it requires no additional software
support until the threshold detection has been triggered to request a data copy. Hardware will
detect that the fail-over threshold has been crossed, and escalate the occurrence of that event as directed
(signal an SMI, generate an interrupt, or wait to be discovered via polling). Whatever software
routine responds to the threshold detection must select a victim DIMM (in case multiple DIMMs
have crossed the threshold prior to sparing invocation) and initiate the memory copy. Hardware
will automatically isolate the “failed” DIMM once the copy has completed. The data copy is
accomplished by address aliasing within the DDR control interface, so it does not require
reprogramming of the DRAM row boundary (DRB) registers, nor does it require notification to
the operating system that anything has occurred in memory.
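The victim-selection step performed by the responding software routine might look like the sketch below. The text does not define a selection policy, so choosing the DIMM with the highest error count (ties broken by label) is purely an assumption, as are the function name and the dictionary of per-DIMM counts.

```python
def select_victim(error_counts, threshold):
    """Return the label of the DIMM to spare out, or None if none qualifies.

    error_counts maps a DIMM label (for example "1A") to its current
    leaky-bucket count. Both the mapping and the policy are hypothetical.
    """
    if threshold == 0:
        return None  # a zero threshold means sparing is disabled
    over = {dimm: n for dimm, n in error_counts.items() if n > threshold}
    if not over:
        return None
    # Assumed policy: pick the highest count; sorting first makes ties
    # resolve to the alphabetically first label.
    return max(sorted(over), key=lambda dimm: over[dimm])
```

For example, with counts of 3, 9 and 7 on DIMMs 1A, 1B and 2A and a threshold of 4, this policy would select DIMM 1B and leave 2A in service.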
The memory mirroring feature and DIMM sparing are mutually exclusive; only one may be
activated during initialization. The selected feature must remain enabled until the next power
cycle. (There is no provision in hardware to switch from one feature to the other without
rebooting, nor is there a provision to “back out” of a feature once enabled without a full reboot.)
2.8.4.2 Memory Mirroring
The memory mirroring feature is fundamentally a way for hardware to maintain two copies of all
data in the memory subsystem, such that a hardware failure or uncorrectable error is no longer
fatal to the system. When an uncorrectable error is encountered during normal operation,
hardware simply retrieves the “mirror” copy of the corrupted data, and no system failure will
occur unless both the primary and mirror copies of the same data are corrupt simultaneously
(statistically very unlikely).
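The read behavior described above can be sketched as follows. This is an illustrative model, not the hardware implementation: each copy is represented as a list in which None marks a location with an uncorrectable error, and the function and exception names are hypothetical.

```python
class UncorrectableError(Exception):
    """Raised when both copies of a location are corrupt."""


def mirrored_read(addr, primary, mirror):
    """Return the data at addr, falling back to the mirror copy if needed."""
    value = primary[addr]
    if value is not None:
        return value
    # The primary copy is corrupt: retrieve the mirror copy instead.
    value = mirror[addr]
    if value is None:
        # Both copies of the same data are corrupt simultaneously.
        raise UncorrectableError(f"both copies corrupt at address {addr}")
    return value
```

Only the case in which the same address is corrupt in both copies results in a failure; a corrupt location in either copy alone is masked by the other.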
Mirroring is supported on dual-channel DIMM populations that are symmetric across both channels and
within each channel. As seen in the following figure, potential mirroring pairs are DIMM 1A with
DIMM 2B, or DIMM 1B with DIMM 2A. As a result, on the Server Board SE7520BD2 there are
two supported configurations for memory mirroring:
• Four-DIMM population of completely identical devices (two per channel). DIMMs
labeled 1A, 2A, 1B and 2B must be identical.