General Optimization Guidelines
Recommendation: Use the compiler switch to generate SSE2 scalar
floating-point code rather than x87 code.
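As a rough way to confirm which kind of floating-point code a build produces, the sketch below reports two compile-time indicators. The function names are hypothetical. FLT_EVAL_METHOD is the standard C99 macro from <float.h>, and __SSE2__ and _M_IX86_FP are macros predefined by GCC/Clang and the Microsoft compiler respectively. The switches named in the comments are examples and are not taken from this manual; consult your compiler's documentation for the exact option.

```c
#include <assert.h>
#include <float.h>

/* Hypothetical helper: returns 1 when the compiler advertises SSE2 code
 * generation through a predefined macro, 0 otherwise.
 * Example switches (assumptions, verify for your toolchain):
 *   GCC/Clang, 32-bit x86:  -msse2 -mfpmath=sse
 *   Microsoft, 32-bit x86:  /arch:SSE2 */
int built_with_sse2(void) {
#if defined(__SSE2__) || defined(_M_X64) || \
    (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
    return 1;
#else
    return 0;
#endif
}

/* Returns the C99 evaluation method. 0 means float and double
 * expressions are evaluated in their own type, which is what SSE2
 * scalar code typically gives; 2 means evaluation in long double,
 * which is typical of x87 code; -1 means indeterminable. */
int fp_eval_method(void) {
    return (int)FLT_EVAL_METHOD;
}

/* Sanity check on the values this sketch expects to see. */
void check_fp_build(void) {
    int method = fp_eval_method();
    assert(method >= -1 && method <= 2);
}
```

A build that reports built_with_sse2() == 1 and fp_eval_method() == 0 is most likely using SSE2 scalar arithmetic, but inspecting the generated assembly remains the definitive check.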
When working with scalar SSE/SSE2 code, pay attention to the need for
clearing the content of unused slots in an xmm register and the
associated performance impact. For example, loading data from memory
with movss or movsd causes an extra micro-op for zeroing the upper
part of the xmm register.
On Pentium M, Intel Core Solo and Intel Core Duo processors, this
penalty can be avoided by using movlpd. However, using movlpd causes
a performance penalty on Pentium 4 processors.
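The difference between the two kinds of load can be illustrated with SSE2 intrinsics. This is a minimal sketch, assuming an x86 target with SSE2; the function names and the HAVE_SSE2 macro are hypothetical, and a portable stand-in is used on other targets. _mm_load_sd corresponds to a movsd load and _mm_loadl_pd to movlpd, but a compiler is free to pick other instructions, so the example demonstrates the register semantics and not the performance.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE2__) || defined(_M_X64) || \
    (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define HAVE_SSE2 1
#else
#define HAVE_SSE2 0
#endif

/* movsd-style load: low lane = *p, upper lane is forced to zero.
 * Returns the upper lane so the zeroing can be observed. */
double upper_after_load_sd(const double *p) {
    assert(p != NULL);
#if HAVE_SSE2
    __m128d v = _mm_load_sd(p);
    return _mm_cvtsd_f64(_mm_unpackhi_pd(v, v));
#else
    return 0.0; /* portable stand-in for the zeroed upper lane */
#endif
}

/* movlpd-style load: low lane = *p, upper lane keeps its old value.
 * Returns the upper lane so the preserved value can be observed. */
double upper_after_loadl_pd(double old_upper, const double *p) {
    assert(p != NULL);
#if HAVE_SSE2
    __m128d v = _mm_set_pd(old_upper, 0.0); /* high = old_upper, low = 0 */
    v = _mm_loadl_pd(v, p);                 /* replaces the low lane only */
    return _mm_cvtsd_f64(_mm_unpackhi_pd(v, v));
#else
    return old_upper; /* portable stand-in for the preserved upper lane */
#endif
}
```

Because the movlpd form leaves the upper half untouched, no zeroing work is needed, which is consistent with the text above; the cost is that the result depends on the previous register contents.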
Another situation occurs when mixing single-precision and
double-precision code. On Pentium 4 processors, using cvtss2sd has a
performance penalty relative to the alternative sequence:
    xorps    xmm1, xmm1
    movss    xmm1, xmm2
    cvtps2pd xmm1, xmm1
On Intel Core Solo and Intel Core Duo processors, using cvtss2sd is
preferable to the alternative sequence.
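The two conversion paths can be sketched with SSE/SSE2 intrinsics to show that they yield the same value. This is an illustration under assumptions: the function names and the HAVE_SSE2 macro are hypothetical, an x86 target with SSE2 is assumed (a plain cast is used elsewhere), and the compiler may not emit exactly the instructions named in the comments.

```c
#include <assert.h>
#if defined(__SSE2__) || defined(_M_X64) || \
    (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define HAVE_SSE2 1
#else
#define HAVE_SSE2 0
#endif

/* Direct scalar conversion, corresponding to cvtss2sd. */
double widen_direct(float x) {
#if HAVE_SSE2
    __m128 s = _mm_set_ss(x);
    __m128d d = _mm_cvtss_sd(_mm_setzero_pd(), s);
    return _mm_cvtsd_f64(d);
#else
    return (double)x; /* portable stand-in */
#endif
}

/* Alternative sequence: clear the register, merge the scalar into the
 * low lane, then convert with the packed instruction. */
double widen_via_packed(float x) {
#if HAVE_SSE2
    __m128 zero = _mm_setzero_ps();               /* xorps    xmm1, xmm1 */
    __m128 lo = _mm_move_ss(zero, _mm_set_ss(x)); /* movss    xmm1, xmm2 */
    __m128d d = _mm_cvtps_pd(lo);                 /* cvtps2pd xmm1, xmm1 */
    return _mm_cvtsd_f64(d);
#else
    return (double)x; /* portable stand-in */
#endif
}

/* Self-check: both paths must agree on a value exactly representable
 * in single precision. */
void check_widen(void) {
    assert(widen_direct(1.5f) == widen_via_packed(1.5f));
}
```

Since both paths are functionally equivalent, the choice between them is purely a per-processor performance decision, as described above.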
Memory Operands
Double-precision floating-point operands that are eight-byte aligned
have better performance than operands that are not eight-byte aligned,
since they are less likely to incur penalties for cache and MOB splits.
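One way to obtain and verify eight-byte alignment in C is sketched below. It assumes a C11 compiler (for _Alignas) and a platform where converting a pointer to uintptr_t yields its numeric address; the structure and function names are hypothetical. The explicit alignment matters mostly on 32-bit ABIs that place a double member on a four-byte boundary by default.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 when p sits on an eight-byte boundary, 0 otherwise. */
int is_8_byte_aligned(const void *p) {
    assert(p != NULL);
    return ((uintptr_t)p % 8u) == 0;
}

/* Hypothetical record: _Alignas(8) forces the double member to an
 * eight-byte offset and raises the alignment of the whole struct. */
struct sample {
    char tag;
    _Alignas(8) double value;
};

/* File-scope instance; the compiler places it according to the
 * struct's (now eight-byte) alignment requirement. */
static struct sample g_sample = { 'a', 1.0 };

/* Returns the address of the aligned double for inspection. */
const double *sample_value_ptr(void) {
    return &g_sample.value;
}
```

Calling is_8_byte_aligned(sample_value_ptr()) in a debug build gives a cheap check that a hot double-precision operand is not at risk of straddling a cache line.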
A floating-point operation on a memory operand requires that the
operand be loaded from memory. This incurs an additional µop, which
can have a minor negative impact on front end bandwidth. Additionally,
memory operands may cause a data cache miss, causing a penalty.