
RAS Design Philosophy

Realization of mainframe-class continuous operation through the pursuit of reliability and availability in a single server construct

Mainframe-class RAS Features

Clustering

Dependable Server Technology

Continuous operations through failures

Redundant components, error prediction, and error correction allow for continuous operation.

Minimized spread of failures

Technology to minimize the effects of hardware failures on the system. Reduction of performance degradation and multi-node shutdown.

Smooth recovery after failures

Ability to replace failed components without shutting down operations.

Improved system availability

Improved reliability and availability as a stand-alone server

Reliability and availability on an open server are generally 
achieved through clustering.  Clustering, however, adds cost.  
To keep costs to a minimum, the Express5800/1000 series 
servers were designed to achieve a high level of reliability 
and availability within a single server.

The Express5800/1000 series server’s powerful RAS features 
were developed through the pursuit of dependable server 
technology.

Continuous operation through failures, minimized spread of 
failures, and smooth recovery after failures were the goals 
that led to the implementation of technologies such as memory 
mirroring, increased redundancy of intricate components, and 
modularization.  Through these technologies, a mainframe level 
of continuous operation was achieved.

Mainframe Level

Conventional open server Level

PC Server Level

RAS features by component (figure; columns: Reliability, Availability, Serviceability)

Components: Center plane, Chipset, Clock, Core I/O, PCI card, Memory, CPU, L3 cache, Power, HDD

Features shown:
No chipset on the center plane
ECC protection of main data paths
Intricate error detection of the high-speed interconnects
Partial chipset degradation / Dynamic recovery
Hot Pluggable*4 (shown for six components)
Duplexed*1
16 processor domain segmentation*2
Core I/O Relief
ECC protection
SDDC Memory
Memory Mirroring*1
Intel® Cache Safe Technology*3
N+1 Redundant
Two independent power sources
Software RAID
Hardware RAID

*1 Available only on the 1320Xf/1160Xf
*2 Available only on the 1320Xf
*3 Intel® technology designed to avoid cache-based failures

*4 Replacement of failed component without shutting down other partitions.

The Dual-Core Intel® Itanium® processor MCA (Machine Check Architecture)

The framework for hardware, firmware and OS error handling

The Dual-Core Intel® Itanium® processor, designed for high-end 
enterprise servers, not only excels in performance but is also 
rich in RAS features. At the core of the processor’s RAS 
feature set is the error handling framework called MCA.

MCA provides a three-stage error handling mechanism: hardware, 
firmware, and operating system. In the first stage, the CPU and 
chipset attempt to handle errors through ECC (Error Correcting 
Code) and parity protection. If the error cannot be handled by 
the hardware, it is passed to the second stage, where the 
firmware attempts to resolve the issue.  In the third stage, if the 
error cannot be handled by the first two stages, the operating 
system runs recovery procedures based on the error report 
and error log that it received. In the event of a critical error, 
the system resets automatically, which significantly reduces the 
possibility of a system failure.
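
The escalation order described above can be illustrated with a short Python sketch. It is a conceptual model only, not the actual MCA interface: the `Stage` enum, the `MachineError` class, and the numeric severity scale are all hypothetical names introduced for this illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Stage(Enum):
    HARDWARE = 1
    FIRMWARE = 2
    OPERATING_SYSTEM = 3
    SYSTEM_RESET = 4


@dataclass
class MachineError:
    description: str
    # Hypothetical severity scale (an assumption, not part of MCA):
    # 1 = correctable by ECC/parity, 2 = needs firmware,
    # 3 = needs OS recovery, 4 = critical.
    severity: int
    log: list = field(default_factory=list)


def handle_error(error: MachineError) -> Stage:
    """Pass the error from stage to stage until one of them resolves it."""
    for stage in (Stage.HARDWARE, Stage.FIRMWARE, Stage.OPERATING_SYSTEM):
        error.log.append(f"{stage.name}: attempting '{error.description}'")
        if error.severity <= stage.value:
            error.log.append(f"{stage.name}: resolved")
            return stage
    # No stage could handle the error: model the automatic reset.
    error.log.append("critical error: resetting system")
    return Stage.SYSTEM_RESET
```

Calling `handle_error(MachineError("single-bit memory error", 1))` returns `Stage.HARDWARE`, while an error with severity 4 falls through all three stages and returns `Stage.SYSTEM_RESET`. The `log` list records each attempt, loosely mirroring the error log that is handed to the operating system.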

Application Layer

Operating System

The OS logs the error, and then starts the recovery process
The Firmware and OS aid in the correction of complex platform errors to restore the system

Firmware

Seamlessly handles the error
Error details are logged, and then a report flow is defined for the OS

Hardware

CPU and chipset ECC and parity protection
Detects and corrects a wide range of hardware errors for main data structures
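
The hardware stage relies on ECC and parity protection. As a toy illustration of the difference, the Python sketch below shows a parity bit, which can only detect a single flipped bit, next to a textbook Hamming(7,4) code, which can also locate and correct it. The source does not state which codes the processor and chipset use; Hamming(7,4) is a stand-in chosen for brevity, and all function names are hypothetical.

```python
def parity_bit(bits):
    """Even parity: the bit that makes the total number of 1s even."""
    return sum(bits) % 2


def hamming74_encode(d):
    """Encode 4 data bits as a 7-bit codeword laid out p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4  # covers codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_correct(c):
    """Return (data bits, 1-based position of the corrected bit, or 0 if none)."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3  # the syndrome points at the bad bit
    if position:
        c[position - 1] ^= 1  # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], position
```

If any single bit of an encoded word is flipped, `hamming74_correct` recovers the original four data bits and reports which position was repaired. A lone parity bit over the same data would only signal that something changed.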
