Chapter 3. Reliability, availability, and serviceability
L1 and L2 caches and L2 and L3 directories on the POWER5 chip are manufactured with
spare bits in their arrays that can be accessed via programmable steering logic to replace
faulty bits in the respective arrays. This is analogous to the redundant bit-steering
employed in main storage, a mechanism designed to help avoid physical repair that
is also implemented in POWER5 systems. The steering logic is activated during
processor initialization and is initiated by the built-in self-test (BIST) at power-on time.
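The spare-bit steering described above can be illustrated with a small software model. This is only a sketch under stated assumptions: the class name SteeredArray, its methods, and the stuck-at fault injection are hypothetical and do not represent the POWER5 hardware logic or any IBM interface. It shows a self-test pass that writes and reads back both bit values and redirects any failing bit position to a spare column.

```python
class SteeredArray:
    """Toy model of one array row with spare bit columns (hypothetical)."""

    def __init__(self, data_bits, spare_bits):
        self.data_bits = data_bits
        # Physical columns: data columns first, then the spares.
        self.cells = [0] * (data_bits + spare_bits)
        self.free_spares = list(range(data_bits, data_bits + spare_bits))
        self.steer = {}   # logical bit -> spare physical column
        self.stuck = {}   # physical column -> stuck value (fault injection)

    def _column(self, bit):
        # Steering: use the spare column if this bit has been redirected.
        return self.steer.get(bit, bit)

    def write(self, bit, value):
        col = self._column(bit)
        # A stuck column ignores the written value.
        self.cells[col] = self.stuck.get(col, value)

    def read(self, bit):
        col = self._column(bit)
        return self.stuck.get(col, self.cells[col])

    def self_test(self):
        """Write and read back 0 and 1 on every logical bit; steer failures."""
        repaired = []
        for bit in range(self.data_bits):
            ok = True
            for pattern in (0, 1):
                self.write(bit, pattern)
                if self.read(bit) != pattern:
                    ok = False
            if not ok:
                if not self.free_spares:
                    raise RuntimeError("no spare bits left")
                self.steer[bit] = self.free_spares.pop(0)
                repaired.append(bit)
        return repaired


arr = SteeredArray(data_bits=8, spare_bits=2)
arr.stuck[3] = 0            # inject a stuck-at-0 fault on column 3
repaired = arr.self_test()  # bit 3 is redirected to the first spare
arr.write(3, 1)
print(repaired, arr.read(3))
```

For simplicity the model does not test the spare columns themselves; a real self-test would also have to verify a spare before steering to it.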
L3 cache redundancy is implemented at the cache line level. Exceeding correctable error
thresholds while running causes a dynamic L3 cache line delete function to be invoked.
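The threshold behavior can be sketched as a per-line error counter. The example below is a minimal illustration, assuming a hypothetical LineDeleteTracker class and an arbitrary threshold value; the actual thresholds and the hardware and firmware mechanism are not given in this document.

```python
from collections import Counter


class LineDeleteTracker:
    """Counts correctable errors per cache line (hypothetical model)."""

    def __init__(self, threshold=3):
        self.threshold = threshold   # assumed value, for illustration only
        self.errors = Counter()
        self.deleted = set()

    def record_correctable_error(self, line):
        """Return True if this error causes the line to be deleted."""
        if line in self.deleted:
            return False
        self.errors[line] += 1
        # The line is removed from use once the threshold is exceeded.
        if self.errors[line] > self.threshold:
            self.deleted.add(line)
            return True
        return False

    def is_usable(self, line):
        return line not in self.deleted


tracker = LineDeleteTracker(threshold=2)
for _ in range(3):
    tracker.record_correctable_error(0x40)
print(tracker.is_usable(0x40), tracker.is_usable(0x41))
```

Tracking errors per line rather than per cache keeps a single weak line from taking the rest of the cache out of service.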
3.1.3 Service processor
The service processor included in the OpenPower 710 server is designed to provide an
immediate means to diagnose, check the status of, and sense the operational conditions of a
remote system, even when the main processor is inoperable.
The service processor enables firmware and operating system surveillance, several
remote power controls, environmental monitoring (only critical errors are supported under
Linux), reset, boot features, remote maintenance, and diagnostic activities, including
console mirroring.
The service processor can place calls to report surveillance failures, critical environmental
faults, and critical processing faults.
For more detailed information on the service processor, refer to 2.9.5, “Service processor” on
page 44.
3.1.4 Fault monitoring functions
The following are a few of the fault monitoring functions included with an OpenPower 710
server:
- BIST and power-on self-test (POST) check the processor, L3 cache, memory, and
  associated hardware required for proper booting of the operating system every time the
  system is powered on. If a noncritical error is detected, or if the errors occur in
  resources that can be removed from the system configuration, the booting process is
  designed to proceed to completion. The errors are logged in the system nonvolatile RAM
  (NVRAM).
- Disk drive fault tracking can alert the system administrator to an impending disk failure
  before it impacts client operation.
- The Linux log, where hardware and software failures are recorded and analyzed by the
  Error Log Analysis (ELA) routine, warns the system administrator about the causes of
  system problems. This also enables service representatives to bring along probable
  replacement hardware components when a service call is placed, thus minimizing system
  repair time.
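The boot decision described for BIST and POST above can be sketched as follows. This is a simplified model, not the actual firmware: the CheckResult type, the run_post function, and the resource names are hypothetical, and a Python list stands in for the NVRAM error log.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    resource: str
    passed: bool
    critical: bool = False    # a failure here would normally stop the boot
    removable: bool = False   # resource can be removed from the configuration


def run_post(results):
    """Return (boot_proceeds, deconfigured_resources, nvram_log)."""
    nvram_log = []
    deconfigured = []
    boot_proceeds = True
    for r in results:
        if r.passed:
            continue
        # Every detected error is logged, whether or not the boot continues.
        nvram_log.append(f"{r.resource}: self-test failed")
        if r.removable:
            # Drop the failing resource from the configuration and continue.
            deconfigured.append(r.resource)
        elif r.critical:
            boot_proceeds = False
        # Noncritical, non-removable failures are logged only.
    return boot_proceeds, deconfigured, nvram_log


results = [
    CheckResult("cpu0", passed=True),
    CheckResult("dimm3", passed=False, critical=True, removable=True),
    CheckResult("fan-sensor", passed=False),
]
print(run_post(results))
```

In this model only a critical failure in a resource that cannot be removed stops the boot, which mirrors the two conditions under which the text says booting proceeds to completion.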
3.1.5 Mutual surveillance
The service processor monitors the operation of the POWER Hypervisor firmware during the
boot process and watches for loss of control during system operation. It also allows the
POWER Hypervisor to monitor service processor activity.
The service processor can take appropriate action, including calling for service, when it
detects that the POWER Hypervisor firmware has lost control. Likewise, the POWER Hypervisor
can request a service processor repair action if necessary.
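Mutual surveillance of this kind is commonly modeled as two heartbeats, each checked by the other party. The sketch below assumes that pattern; the Heartbeat class, the surveil function, the timeout values, and the action strings are all hypothetical and are not taken from the service processor or POWER Hypervisor firmware.

```python
class Heartbeat:
    """Records the last time a component reported in (hypothetical model)."""

    def __init__(self, name, timeout):
        self.name = name
        self.timeout = timeout   # seconds of silence tolerated (assumed)
        self.last_seen = 0.0

    def beat(self, now):
        self.last_seen = now

    def is_alive(self, now):
        return (now - self.last_seen) <= self.timeout


def surveil(service_processor, hypervisor, now):
    """Each side checks the other's heartbeat; return the actions to take."""
    actions = []
    if not hypervisor.is_alive(now):
        actions.append("service processor: hypervisor lost control, call for service")
    if not service_processor.is_alive(now):
        actions.append("hypervisor: request service processor repair action")
    return actions


sp = Heartbeat("service processor", timeout=5.0)
hv = Heartbeat("hypervisor", timeout=5.0)
sp.beat(10.0)
hv.beat(10.0)
print(surveil(sp, hv, now=12.0))   # both sides reported in recently
sp.beat(20.0)                      # the hypervisor has stopped reporting
print(surveil(sp, hv, now=20.0))
```

Passing the current time in explicitly, rather than reading a clock inside the functions, keeps the model deterministic and easy to test.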