IBM Power 595 Technical Overview and Introduction

Error logging and analysis

After a fault isolation component has identified the root cause of an error, an error log
entry is created that includes basic data such as:

An error code uniquely describing the error event

The location of the failing component

The part number of the component to be replaced, including pertinent data like
engineering and manufacturing levels

Return codes

Resource identifiers

FFDC data

Data describing the effect the repair can have on the system is also included.
Error log routines in the operating system can use this information and decide to call home to
contact service and support, send a notification message, or continue without an alert.
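
The following is a minimal sketch, in Python, of how an error log entry and the resulting alert decision could be modeled. It is an illustration only: the class, field, and function names are hypothetical, and the decision policy (call home on an unrecoverable error or a repair that causes an outage, notify on a degraded repair, otherwise continue) is an assumption that this document does not specify.

```python
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    CALL_HOME = "call home"
    NOTIFY = "send notification"
    CONTINUE = "continue without alert"


@dataclass
class ErrorLogEntry:
    # Fields mirror the basic data listed above; names are hypothetical.
    error_code: str                 # uniquely describes the error event
    location: str                   # location of the failing component
    part_number: str                # part to be replaced
    return_codes: list = field(default_factory=list)
    resource_ids: list = field(default_factory=list)
    ffdc: bytes = b""               # first-failure data capture payload
    # Assumed encoding of "effect the repair can have on the system".
    repair_impact: str = "none"     # "none", "degraded", or "outage"
    recoverable: bool = True        # assumed flag, not from the source


def decide_action(entry: ErrorLogEntry) -> Action:
    """Pick one of the three outcomes named in the text (assumed policy)."""
    if not entry.recoverable or entry.repair_impact == "outage":
        return Action.CALL_HOME
    if entry.repair_impact == "degraded":
        return Action.NOTIFY
    return Action.CONTINUE


# Placeholder values; these are not real error codes or location codes.
entry = ErrorLogEntry("EXAMPLE-0001", "example-location", "example-part",
                      repair_impact="degraded")
print(decide_action(entry).value)
```

A real policy would depend on the error class and on how the system is configured for service; the three-way outcome is the only part taken from the text.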

Remote support

The Resource Monitoring and Control (RMC) application is delivered as part of the base
operating system, including the operating system running on the HMC. The RMC provides a
secure transport mechanism across the LAN interface between the operating system and the
HMC and is used by the operating system diagnostic application for transmitting error
information. It performs a number of other functions as well, but these are not used for the
service infrastructure.

Manage serviceable events

A critical requirement in a logically partitioned environment is to ensure that errors are not lost
before being reported for service, and that each error is reported only once, regardless
of how many logical partitions experience the potential effect of the error. The Manage
Serviceable Events task on the HMC is responsible for aggregating duplicate error reports
and ensuring that all errors are recorded for review and management.

When a locally or globally reported service request is made to the operating system, the
operating system diagnostic subsystem uses the RMC subsystem to relay error information to
the HMC. For global events (platform unrecoverable errors, for example), the service
processor also forwards error notification of these events to the HMC, providing a
redundant error-reporting path in case of errors in the RMC network.

The first occurrence of each failure type will be recorded in the Manage Serviceable Events
task on the HMC. This task then filters and maintains a history of duplicate reports from other
logical partitions or the service processor. It then looks across all active service event
requests, analyzes the failure to ascertain the root cause, and, if enabled, initiates a call home
for service. This method ensures that all platform errors will be reported through at least one
functional path, ultimately resulting in a single notification for a single problem.
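
The aggregation behavior described above can be sketched as follows. This is a simplified Python illustration, not the HMC implementation: the class and method names are hypothetical, and matching duplicates on the pair (error code, location) is an assumption, because the text does not state how duplicate reports are recognized. Root-cause analysis across events is left out.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceableEvent:
    error_code: str
    location: str
    first_reporter: str             # partition or service processor
    duplicates: list = field(default_factory=list)
    called_home: bool = False


class EventManager:
    """Hypothetical stand-in for the Manage Serviceable Events task."""

    def __init__(self, call_home_enabled=True):
        self.call_home_enabled = call_home_enabled
        self.events = {}            # (error_code, location) -> event
        self.calls_home = []        # one entry per distinct problem

    def report(self, reporter, error_code, location):
        key = (error_code, location)    # assumed duplicate-matching key
        event = self.events.get(key)
        if event is not None:
            # Duplicate report: keep it in the history, do not notify again.
            event.duplicates.append(reporter)
            return event
        # First occurrence: record it and, if enabled, call home once.
        event = ServiceableEvent(error_code, location, reporter)
        self.events[key] = event
        if self.call_home_enabled:
            self.calls_home.append(key)
            event.called_home = True
        return event


manager = EventManager()
# The same platform error arrives from two partitions (over RMC) and from
# the service processor (the redundant path). Values are placeholders.
for source in ("lpar1", "lpar2", "service-processor"):
    manager.report(source, "EXAMPLE-0001", "example-location")

event = manager.events[("EXAMPLE-0001", "example-location")]
print(len(manager.calls_home), event.first_reporter, event.duplicates)
```

Because whichever report arrives first creates the event, the problem is still recorded if only one of the paths is working, and later arrivals only extend the duplicate history.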

Extended error data (EED)

Extended error data (EED) is additional data collected either automatically at the time of a
failure or manually at a later time. The data collected depends on the invocation method but
includes information like firmware levels, operating system levels, additional fault isolation
register values, recoverable error threshold register values, system status, and any other
pertinent data.