Chapter 4. Reliability, availability, and serviceability
The processor design now supports a spare data lane on each fabric bus that is used to
communicate between processor modules. A spare data lane can be substituted dynamically
for a failing one during system operation.
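The lane substitution described above can be pictured with a small software model. This is only an illustrative sketch: the class name, the lane count, and the single-spare bookkeeping are assumptions made for the example, and the actual substitution is performed by the hardware in a way this document does not detail.

```python
class FabricBus:
    """Toy model of a bus with active data lanes plus one spare (hypothetical)."""

    def __init__(self, data_lanes: int):
        # Logical lane i initially maps to physical lane i.
        self.lane_map = list(range(data_lanes))
        # One extra physical lane is held in reserve.
        self.spare = data_lanes
        self.failed = set()

    def report_failure(self, physical_lane: int) -> bool:
        """Swap the spare in for a failing lane; return False if that is not possible."""
        if physical_lane not in self.lane_map:
            return False  # lane is not carrying traffic, nothing to substitute
        if self.spare is None:
            return False  # the spare has already been consumed
        logical = self.lane_map.index(physical_lane)
        self.lane_map[logical] = self.spare
        self.failed.add(physical_lane)
        self.spare = None
        return True


bus = FabricBus(data_lanes=8)      # lane count chosen arbitrarily for the example
substituted = bus.report_failure(3)
second = bus.report_failure(5)     # no spare left after the first substitution
```

In this model the logical lane numbering seen by the sender never changes; only the mapping to physical lanes does, which is why traffic can continue while the substitution takes place.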
A POWER8 processor module has improved performance compared to the previous processor
generation, including support of a maximum of 12 cores compared to a maximum of eight
cores in that generation. Doing more work with less hardware in a system provides greater
reliability by concentrating the processing power and reducing the need for extra
communication fabrics and components.
The memory controller within the processor is redesigned. From a RAS standpoint, the
ability to use a replay buffer to recover from soft errors is added.
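The idea behind a replay buffer can be sketched in software as follows. This is a simplified illustration under stated assumptions, not a description of the memory controller's actual logic: the function name, the `transmit` callable, and the `max_replays` limit are hypothetical and are introduced only for the example.

```python
from collections import deque


def send_with_replay(commands, transmit, max_replays=3):
    """Issue commands in order and re-issue from the buffer on a transient error.

    `transmit(cmd)` is a hypothetical callable that returns True on success.
    Returns the total number of replays that were needed.
    """
    replay_buffer = deque(commands)  # commands stay buffered until acknowledged
    replays = 0
    while replay_buffer:
        cmd = replay_buffer[0]
        attempts = 0
        while not transmit(cmd):
            if attempts == max_replays:
                # The error did not clear, so it is not treated as a soft error.
                raise RuntimeError(f"persistent error on {cmd!r}")
            attempts += 1
            replays += 1
        replay_buffer.popleft()      # acknowledged: safe to drop from the buffer
    return replays


# Usage: the second command fails once, then succeeds when replayed.
failures_left = {"write B": 1}
log = []


def flaky_transmit(cmd):
    if failures_left.get(cmd, 0) > 0:
        failures_left[cmd] -= 1
        return False
    log.append(cmd)
    return True


replays = send_with_replay(["read A", "write B", "read C"], flaky_transmit)
```

Because a command is removed from the buffer only after it succeeds, a transient fault costs one extra transmission rather than an outage, while a fault that persists past the retry limit is surfaced as a hard error.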
I/O subsystem
The POWER8 processor now integrates PCIe controllers. PCIe slots that are directly
driven by PCIe controllers can be used to support I/O adapters directly in the systems, or
can be used to attach external I/O drawers.
The POWER8 processor-based Power E850C server also supports a PCIe switch, which
provides additional integrated I/O capacity.
Memory subsystem
Custom DIMMs (CDIMMs) are used. These provide the ability to correct a single dynamic
random access memory (DRAM) fault within an error-correction code (ECC) word (and
then an additional bit fault) to avoid unplanned outages. They also contain a spare DRAM
module per port (per nine DRAMs for x8 DIMMs), which can be used to avoid replacing
memory.
After all self-healing and other RAS-related features are implemented, the hypervisor
might still detect that a DIMM has a substantial fault that, when combined with a future
fault, could cause an outage. In such a case, the hypervisor attempts to migrate data from
the failing memory to other available memory in the system. This feature
is intended to further reduce the chances of an unplanned outage, and can take
advantage of any deallocated memory, including memory reserved for Capacity on
Demand capabilities.
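The best-effort nature of this migration can be sketched as a simple planning step. The code below is an illustration only: the function name, the page identifiers, and the DIMM names are hypothetical, and the hypervisor's real policy is not described in this document.

```python
def plan_migration(failing_pages, free_by_dimm):
    """Best-effort plan: map pages on a suspect DIMM to free pages elsewhere.

    failing_pages: list of page ids that reside on the suspect DIMM.
    free_by_dimm: dict of DIMM name -> number of free pages, which may include
                  memory that is present but not currently allocated.
    Returns (moves, unmoved), where moves maps page id -> target DIMM name.
    """
    moves = {}
    pending = list(failing_pages)
    for dimm, free in free_by_dimm.items():
        while free > 0 and pending:
            moves[pending.pop(0)] = dimm
            free -= 1
    return moves, pending


# Usage: four pages to move, but only three free pages exist elsewhere.
moves, unmoved = plan_migration([10, 11, 12, 13], {"dimm1": 1, "dimm2": 2})
```

Pages that cannot be placed stay where they are, which mirrors the point made above: the migration reduces exposure to a future fault when spare capacity exists, but it cannot be guaranteed.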
Power distribution and temperature monitoring
The processor module integrates a new On Chip Controller (OCC). This OCC is used to
handle Power Management and Thermal Monitoring without the need for a separate
controller, as was required in the previous processor generation. In addition, the OCC
can be programmed to run other RAS-related functions independent of any host processor.
4.1.2 RAS enhancements for enterprise servers
The following are RAS enhancements for enterprise servers:
Memory subsystem
The Power E850C server has the option of mirroring the memory used by the hypervisor.
Mirroring reduces the risk of a system outage caused by memory faults because the
hypervisor memory is stored on two distinct CDIMMs. The Active Memory Mirroring feature
is available only on enterprise systems.
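The benefit of keeping two copies can be shown with a minimal software model. This is a sketch under stated assumptions rather than a description of how Active Memory Mirroring is implemented: the class, its method names, and the "a"/"b" copy labels are hypothetical.

```python
class MirroredStore:
    """Toy model of data kept in two independent copies (hypothetical)."""

    def __init__(self):
        self.copies = {"a": {}, "b": {}}
        self.failed = set()  # names of copies that have been marked bad

    def write(self, addr, value):
        # Every write is applied to both copies so that they stay identical.
        for copy in self.copies.values():
            copy[addr] = value

    def mark_failed(self, name):
        self.failed.add(name)

    def read(self, addr):
        # Serve the read from the first copy that has not been marked bad.
        for name, copy in self.copies.items():
            if name not in self.failed:
                return copy[addr]
        raise RuntimeError("both mirrored copies are unavailable")


store = MirroredStore()
store.write(0x100, "hypervisor data")
store.mark_failed("a")           # simulate a fault in the first copy
value = store.read(0x100)        # still served from the second copy
```

A single failed copy is invisible to the reader in this model; only the loss of both copies results in an error.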
Power distribution and temperature monitoring
All systems use voltage converters that transform the voltage level provided by
the power supply to the voltage level needed for the various components within the
system. The Power E850C server has redundant or spare voltage converters for each
voltage level provided to any processor or memory CDIMM.
IBM Power System E850C Technical Overview and Introduction (REDP-5412-00, ISBN 0738455687)