Chapter 4. Continuous availability and manageability
Draft Document for Review September 2, 2008 5:05 pm
4.3.3 Reporting problems
In the unlikely event that a system hardware or environmentally induced failure is diagnosed,
POWER6 processor-based systems report the error through a number of mechanisms. This
ensures that appropriate entities are aware that the system may be operating in an error
state. However, a crucial piece of a solid reporting strategy is ensuring that a single error
communicated through multiple error paths is correctly aggregated, so that later notifications
are not accidentally duplicated.
Error logging and analysis
Once the root cause of an error has been identified by a fault isolation component, an error
log entry is created with some basic data such as:
An error code uniquely describing the error event
The location of the failing component
The part number of the component to be replaced, including pertinent data like engineering and manufacturing levels
Return codes
Resource identifiers
First Failure Data Capture data
Data describing the effect that the repair will have on the system is also included. Error log
routines in the operating system can then use this information to decide whether to call home
to contact service and support, send a notification message, or continue without an alert.
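
The log entry and the decision described above can be sketched as a small data structure. This is a minimal illustration only, not the operating system's actual error log format: the field names, the `severity` values, and the policy in `decide_action` are all assumptions introduced for the example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Action(Enum):
    CALL_HOME = "call home"
    NOTIFY = "send notification"
    CONTINUE = "continue without alert"


@dataclass
class ErrorLogEntry:
    # Fields mirror the list above; the names themselves are hypothetical.
    error_code: str                 # uniquely describes the error event
    location: str                   # location of the failing component
    part_number: str                # part to be replaced
    return_codes: List[int] = field(default_factory=list)
    resource_ids: List[str] = field(default_factory=list)
    ffdc_data: bytes = b""          # First Failure Data Capture data
    repair_effect: str = ""         # effect the repair will have on the system
    severity: str = "informational" # hypothetical classification


def decide_action(entry: ErrorLogEntry) -> Action:
    """Assumed policy: map a made-up severity level to one of three outcomes."""
    if entry.severity == "unrecoverable":
        return Action.CALL_HOME
    if entry.severity == "recoverable":
        return Action.NOTIFY
    return Action.CONTINUE


# All values below are placeholders, not real error or location codes.
entry = ErrorLogEntry(
    error_code="EXAMPLE-0001",
    location="EXAMPLE-SLOT-3",
    part_number="EXAMPLE-PN",
    severity="unrecoverable",
)
print(decide_action(entry).value)  # prints: call home
```

A real implementation would base the decision on much richer data, including the repair effect information; the three-way outcome is the only part taken from the text.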
Remote support
The Resource Monitoring and Control (RMC) application is delivered as part of the base
operating system, including the operating system running on the Hardware Management
Console. RMC provides a secure transport mechanism across the LAN interface between the
operating system and the Hardware Management Console and is used by the operating
system diagnostic application for transmitting error information. It performs a number of other
functions as well, but these are not used for the service infrastructure.
Manage serviceable events
A critical requirement in a logically partitioned environment is to ensure that errors are not lost
before being reported for service, and that each error is reported only once, regardless
of how many logical partitions experience the potential effect of the error. The Manage
Serviceable Events task on the Hardware Management Console (HMC) is responsible for
aggregating duplicate error reports, and ensures that all errors are recorded for review and
management.
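
The aggregation requirement can be illustrated with a toy example. The sketch below is not the HMC implementation; the class name, the choice of (error code, location) as the identity of a failure, and all sample values are assumptions made for illustration.

```python
from collections import defaultdict


class ServiceableEventLog:
    """Toy aggregator: one serviceable event per distinct failure."""

    def __init__(self):
        self.events = {}                     # failure key -> first reporter
        self.duplicates = defaultdict(list)  # failure key -> later reporters

    def report(self, error_code, location, reporter):
        """Record a report; return True only if it opens a new event."""
        key = (error_code, location)  # assumed identity of a failure
        if key not in self.events:
            self.events[key] = reporter
            return True
        # Keep a history of duplicates so that no report is lost.
        self.duplicates[key].append(reporter)
        return False


log = ServiceableEventLog()
# Placeholder codes and reporter names, not real identifiers.
print(log.report("EXAMPLE-0001", "EXAMPLE-SLOT-3", "lpar1"))              # True
print(log.report("EXAMPLE-0001", "EXAMPLE-SLOT-3", "lpar2"))              # False
print(log.report("EXAMPLE-0001", "EXAMPLE-SLOT-3", "service_processor"))  # False
print(len(log.events))                                                    # 1
```

Three reports of the same failure produce a single serviceable event, while the duplicate reporters remain available for review.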
When a locally or globally reported service request is made to the operating system, the
operating system diagnostic subsystem uses the Resource Monitoring and Control (RMC)
subsystem to relay error information to the Hardware Management Console. For
global events (platform unrecoverable errors, for example), the Service Processor also
forwards error notification of these events to the Hardware Management Console, providing a
redundant error-reporting path in case of errors in the RMC network.
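
The two reporting paths described above can be summarized in a few lines. This is a simplified sketch: the function name, the scope strings, and the path labels are hypothetical and do not correspond to any real interface.

```python
def reporting_paths(scope):
    """Return the paths that carry an error report to the HMC.

    scope is "local" or "global" (labels assumed for this example).
    """
    if scope not in ("local", "global"):
        raise ValueError(f"unknown scope: {scope!r}")
    # The operating system diagnostics always relay the error over RMC.
    paths = ["rmc"]
    if scope == "global":
        # Global events are also forwarded by the Service Processor,
        # so the report still arrives if the RMC network has errors.
        paths.append("service_processor")
    return paths


print(reporting_paths("local"))   # ['rmc']
print(reporting_paths("global"))  # ['rmc', 'service_processor']
```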
The first occurrence of each failure type will be recorded in the Manage Serviceable Events
task on the Hardware Management Console. This task will then filter and maintain a history of
duplicate reports from other logical partitions or the Service Processor. It then looks across all
active service event requests, analyzes the failure to ascertain the root cause and, if enabled,