IBM E850C Technical Overview And Introduction Download


Chapter 4. Reliability, availability, and serviceability


With this Host Boot initialization, new progress codes are available. An example of an FSP
progress code is C1009003. During the Host Boot IPL, progress codes, such as CC009344,
appear.

If a failure occurs during the Host Boot process, a new Host Boot System Dump is collected
and stored. This type of memory dump includes Host Boot memory and is offloaded to the
HMC when it is available.

Run time

All Power Systems servers can monitor critical system components during run time, and they
can take corrective actions when recoverable faults occur. The hardware error-check
architecture can report non-critical errors in the system in an out-of-band communications
path to the service processor without affecting system performance.

A significant part of the runtime diagnostic capabilities originates with the service processor.
Extensive diagnostic and fault analysis routines were developed and improved over many
generations of POWER processor-based servers, and enable quick and accurate predefined
responses to both actual and potential system problems.

The service processor correlates and processes runtime error information by using logic that
is derived from IBM engineering expertise. This logic is used to count recoverable errors
(called thresholding) and predict when corrective actions must be automatically initiated by
the system. These actions can include the following items:

• Requests for a part to be replaced
• Dynamic invocation of built-in redundancy for automatic replacement of a failing part
• Dynamic deallocation of failing components so that system availability is maintained
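
The idea of thresholding can be illustrated with a minimal sketch. This is not the service processor's actual logic, which is not described here. The class name, the sliding-window approach, and the limit and window values are all assumptions chosen for illustration: recoverable errors are counted per component, and a corrective action is signaled once the count within a time window reaches a limit.

```python
from collections import defaultdict, deque


class ErrorThreshold:
    """Illustrative sliding-window counter for recoverable errors.

    Hypothetical helper: the limit and window are assumed values,
    not parameters documented for any IBM service processor.
    """

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        # One queue of error timestamps per component name.
        self.events = defaultdict(deque)

    def record(self, component, timestamp):
        """Record one recoverable error.

        Returns True when the number of errors seen for this component
        inside the window has reached the limit, meaning a corrective
        action (for example, deallocation) would be considered.
        """
        queue = self.events[component]
        queue.append(timestamp)
        # Discard errors that are older than the window.
        while queue and timestamp - queue[0] > self.window:
            queue.popleft()
        return len(queue) >= self.limit
```

In this sketch, isolated errors that are spread far apart never trigger an action, while a burst of errors on the same component does, which is the behavior that the term thresholding implies.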

Device drivers

In certain cases, diagnostic tests are best performed by operating system-specific drivers,
most notably for adapters or I/O devices that are owned directly by a logical partition. In these
cases, the operating system device driver often works with I/O device microcode to isolate
and recover from problems. Potential problems are reported to an operating system device
driver that logs the error. In non-HMC managed servers, the OS can start the Call Home
application to report the service event to IBM. For HMC managed servers, the event is
reported to the HMC, which can initiate the Call Home request to IBM. I/O devices can also
include specific exercisers that can be started by the diagnostic facilities for problem
re-creation (if required by service procedures).
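
The two reporting paths described above can be summarized in a short sketch. The function and step names are hypothetical labels for illustration only and do not correspond to any IBM API or tool.

```python
def route_service_event(hmc_managed):
    """Return the ordered reporting steps for a device driver error.

    Hypothetical illustration of the paths described in the text:
    the OS device driver always logs the error first, and the
    Call Home request then originates from either the OS or the HMC.
    """
    steps = ["os_device_driver_logs_error"]
    if hmc_managed:
        # HMC-managed: the event goes to the HMC, which can call home.
        steps.append("event_reported_to_hmc")
        steps.append("hmc_can_initiate_call_home")
    else:
        # Non-HMC managed: the OS itself can start Call Home.
        steps.append("os_can_start_call_home")
    return steps
```

The only decision point in this flow is whether the server is HMC managed. In both cases the operating system device driver is the component that first logs the error.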

4.5.5 Reporting

In the unlikely event that a system hardware or environmentally induced failure is diagnosed,
IBM Power Systems servers report the error through various mechanisms. The analysis
result is stored in system NVRAM. Error log analysis (ELA) can be used to display the failure
cause and the physical location of the failing hardware.

Using the Call Home infrastructure, the system can automatically send an alert or call for
service during a critical system failure. A hardware fault also illuminates the amber system
fault LED, which is on the system unit, to alert the user of an internal hardware problem.

On POWER8 processor-based servers, hardware and software failures are recorded in the
system log. When a management console is attached, an ELA routine analyzes the error. The
routine then forwards the event to the Service Focal Point (SFP) application that runs on the
management console, and can notify the system administrator that it has isolated a likely cause of
the system problem. The service processor event log also records unrecoverable checkstop