IBM Power 595 Technical Overview and Introduction
GX+ bus adapters
The GX+ bus provides the primary high bandwidth path for RIO-2 or GX 12x Dual Channel
adapter connection to the system CEC. Errors in a GX+ bus adapter, flagged by system
persistent deallocation logic, cause the adapter to be varied offline upon a server reboot.
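The deferred nature of this behavior can be illustrated with a small model. The following is only a sketch under stated assumptions: the names DeallocationRecord, flag, and reboot are hypothetical and do not correspond to any firmware interface. It shows an adapter being flagged at run time and varied offline only at the next reboot.

```python
from dataclasses import dataclass, field


@dataclass
class DeallocationRecord:
    """Hypothetical stand-in for a record of flagged components that survives reboots."""
    flagged: set = field(default_factory=set)

    def flag(self, adapter_id: str) -> None:
        # Flagging records the adapter; it does not change its state right away.
        self.flagged.add(adapter_id)


def reboot(adapters, record):
    """Return the state of each adapter after a reboot.

    Adapters present in the record are varied offline; all others come up online.
    """
    return {a: ("offline" if a in record.flagged else "online") for a in adapters}


record = DeallocationRecord()
record.flag("gx0")  # error detected on a hypothetical adapter named gx0
states = reboot(["gx0", "gx1"], record)
```

In this toy model, states maps gx0 to offline and gx1 to online, because only the flagged adapter is excluded at reboot.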
PCI error recovery
IBM estimates that PCI adapters can account for a significant portion of the hardware-based
errors on a large server. Although servers that rely on boot-time diagnostics can identify
failing components to be replaced by hot-swap and reconfiguration, run-time errors pose a
more significant problem.
PCI adapters are generally complex designs involving extensive onboard instruction
processing, often on embedded microcontrollers. They tend to use industry standard grade
components, with an emphasis on product cost over high reliability. As a result, they
might be more likely to encounter internal microcode errors, as well as many of the hardware
errors described for the rest of the server.
The traditional means of handling these problems is through adapter internal error reporting
and recovery techniques in combination with operating system device driver management
and diagnostics. In some cases, an error in the adapter might cause transmission of bad data
on the PCI bus itself, resulting in a hardware detected parity error and causing a global
machine check interrupt, eventually requiring a system reboot to continue.
In 2001, IBM introduced a methodology that uses a combination of system firmware and
Enhanced Error Handling (EEH) device drivers that allows recovery from intermittent PCI bus
errors. This approach works by recovering and resetting the adapter, thereby initiating system
recovery for a permanent PCI bus error. Rather than failing immediately, the faulty device is
frozen and restarted, preventing a machine check. POWER6 and POWER5 processor
servers extend the capabilities of the EEH methodology. Generally, all PCI adapters
controlled by operating system device drivers are connected to a PCI secondary bus created
through a PCI-to-PCI bridge, designed by IBM. This bridge isolates the PCI adapters and
supports hot-plug by allowing program control of the power state of the I/O slot. PCI bus
errors related to individual PCI adapters under partition control can be transformed into a PCI
slot freeze condition and reported to the EEH device driver for error handling. Errors that
occur on the interface between the PCI-to-PCI bridge chip and the Processor Host Bridge
(the link between the processor remote I/O bus and the primary PCI bus) result in a
bridge freeze condition, effectively stopping all of the PCI adapters attached to the bridge chip. An
operating system can recover an adapter from a bridge freeze condition by using POWER
Hypervisor functions to remove the bridge from freeze state and resetting or reinitializing the
adapters. This same EEH technology will allow system recovery of PCIe bus errors in
POWER6 processor-based servers.
Figure 4-6 on page 145 illustrates PCI error recovery in GX bus connections.
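The difference between a slot freeze and a bridge freeze can be sketched in code. This is an illustrative model only, not the EEH firmware or device driver interface: the classes Slot and Bridge, the functions recover_slot and recover_bridge, the reset_adapter callback, and the max_attempts retry limit are all assumptions introduced for the example.

```python
from enum import Enum


class SlotState(Enum):
    ACTIVE = "active"
    FROZEN = "frozen"
    FAILED = "failed"


class Slot:
    def __init__(self, name):
        self.name = name
        self.state = SlotState.ACTIVE


class Bridge:
    """Hypothetical model of a PCI-to-PCI bridge with its attached slots."""

    def __init__(self, slots):
        self.slots = {s.name: s for s in slots}
        self.frozen = False

    def slot_error(self, name):
        # An error tied to one adapter freezes only that slot.
        self.slots[name].state = SlotState.FROZEN

    def bridge_error(self):
        # An error on the link to the host bridge stops every attached adapter.
        self.frozen = True
        for slot in self.slots.values():
            slot.state = SlotState.FROZEN


def recover_slot(slot, reset_adapter, max_attempts=3):
    """Try to reset a frozen slot; mark it failed if no attempt succeeds.

    max_attempts is an assumed limit chosen for this sketch.
    """
    for _ in range(max_attempts):
        if reset_adapter(slot):
            slot.state = SlotState.ACTIVE
            return True
    slot.state = SlotState.FAILED
    return False


def recover_bridge(bridge, reset_adapter):
    """Clear the bridge freeze first, then reset each frozen adapter."""
    bridge.frozen = False  # stand-in for the hypervisor call that unfreezes the bridge
    return {
        name: recover_slot(slot, reset_adapter)
        for name, slot in bridge.slots.items()
        if slot.state is SlotState.FROZEN
    }
```

A driver-side caller would pass its own reset routine as reset_adapter. In this model an intermittent error is one where the reset eventually succeeds, and a permanent error is one where every attempt fails and the slot is left in the FAILED state instead of triggering a machine check.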