IA-32 Intel® Architecture Optimization
SSE3 provides the LDDQU instruction for loading from memory
addresses that are not 16-byte aligned. LDDQU is a special 128-bit
unaligned load designed to avoid cache line splits. If the address of the
load is aligned on a 16-byte boundary, LDDQU loads the 16 bytes
requested. If the address of the load is not aligned on a 16-byte
boundary, LDDQU loads a 32-byte block starting at the 16-byte aligned
address immediately below the address of the load request. It then
provides the requested 16 bytes. If the address is aligned on a 16-byte
boundary, the effective number of memory requests is implementation
dependent (one or more).
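The load behavior described above can be modeled in portable C. This is only a sketch of the architectural description, not of how any processor implements LDDQU; the function name lddqu_model, the byte array mem standing in for memory, and the offset addr are hypothetical names introduced for illustration. The caller is assumed to provide at least 32 readable bytes from the aligned base.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model: `mem` stands in for memory and `addr` is a byte
   offset into it that plays the role of the load address. */
static void lddqu_model(const uint8_t *mem, size_t addr, uint8_t out[16])
{
    size_t base = addr & ~(size_t)15;  /* 16-byte aligned address at or below addr */
    size_t skew = addr - base;         /* 0..15: distance from the aligned base */

    if (skew == 0) {
        memcpy(out, mem + base, 16);   /* aligned: load exactly the 16 bytes requested */
        return;
    }

    uint8_t block[32];
    memcpy(block, mem + base, 32);     /* unaligned: load the enclosing 32-byte block */
    memcpy(out, block + skew, 16);     /* then provide the 16 requested bytes */
}
```

Because the 32-byte block always starts on a 16-byte boundary, the model never issues a 16-byte access that straddles two aligned 16-byte chunks, which is the property that lets the instruction avoid cache line splits.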
LDDQU is intended for code that loads data from memory without
storing modified data back to the same address. Thus, the usage of
LDDQU should be restricted to situations where no store-to-load
forwarding is expected. For situations where store-to-load forwarding
is expected, use regular store/load pairs (either aligned or
unaligned based on the alignment of the data accessed).
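As an illustration of the second case, the sketch below updates a buffer in place, so modified data is stored back to the address it was loaded from and a regular unaligned load/store pair (MOVDQU) is used instead of LDDQU. The helper name brighten_in_place and its parameters are hypothetical. The SSE2 path is compiled only when the compiler defines __SSE2__; otherwise the scalar loop handles the whole buffer.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* _mm_loadu_si128, _mm_adds_epu8, _mm_storeu_si128 */
#endif

/* Hypothetical helper: add `delta` to every byte of `buf` in place,
   saturating at 255. Data is written back to the address it was read
   from, so regular unaligned loads and stores are used, not LDDQU. */
static void brighten_in_place(uint8_t *buf, size_t n, uint8_t delta)
{
    size_t i = 0;
#if defined(__SSE2__)
    const __m128i d = _mm_set1_epi8((char)delta);
    for (; i + 16 <= n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(buf + i)); /* movdqu load */
        v = _mm_adds_epu8(v, d);                                 /* paddusb */
        _mm_storeu_si128((__m128i *)(buf + i), v);               /* movdqu store */
    }
#endif
    for (; i < n; ++i) {  /* scalar tail, or the whole buffer without SSE2 */
        unsigned s = (unsigned)buf[i] + delta;
        buf[i] = (uint8_t)(s > 255u ? 255u : s);
    }
}
```

If the buffer is known to be 16-byte aligned, the aligned forms (_mm_load_si128 and _mm_store_si128) would be the natural substitute.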
Example 4-29 Video Processing Using LDDQU to Avoid Cache Line Splits
// Average half-pels horizontally (on the "x" axis),
// from one reference frame only.
nextLinesLoop:
lddqu xmm0, XMMWORD PTR [edx] // may not be 16B aligned
lddqu xmm1, XMMWORD PTR [edx+1]
lddqu xmm2, XMMWORD PTR [edx+eax]
lddqu xmm3, XMMWORD PTR [edx+eax+1]
pavgb xmm0, xmm1 // average line 0 with its one-pixel shift
pavgb xmm2, xmm3 // average line 1 with its one-pixel shift
movdqa XMMWORD PTR [ecx], xmm0 // results stored elsewhere
movdqa XMMWORD PTR [ecx+eax], xmm2
// (repeat ...)
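The same two-line averaging step can be sketched in C with compiler intrinsics. The function name avg_halfpel_x_2lines and the parameters src, dst, and stride are hypothetical stand-ins for the roles of edx, ecx, and eax in the assembly. The sketch assumes that dst is 16-byte aligned, that stride is a multiple of 16, and that 17 bytes are readable on each source line. The SSE3 path is compiled only when __SSE3__ is defined (for example with -msse3 on GCC or Clang); otherwise a scalar loop computes the same rounded average.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE3__)
#include <pmmintrin.h>  /* _mm_lddqu_si128 (SSE3); also brings in the SSE2 intrinsics */
#endif

/* Hypothetical helper: average each pixel with its right-hand neighbor
   for 16 pixels on each of two consecutive lines.
   src: reference frame, any alignment. dst: output, 16-byte aligned.
   stride: bytes per line for both buffers, a multiple of 16. */
static void avg_halfpel_x_2lines(const uint8_t *src, uint8_t *dst, size_t stride)
{
#if defined(__SSE3__)
    __m128i a0 = _mm_lddqu_si128((const __m128i *)(src));              /* may not be 16B aligned */
    __m128i a1 = _mm_lddqu_si128((const __m128i *)(src + 1));
    __m128i b0 = _mm_lddqu_si128((const __m128i *)(src + stride));
    __m128i b1 = _mm_lddqu_si128((const __m128i *)(src + stride + 1));
    _mm_store_si128((__m128i *)dst, _mm_avg_epu8(a0, a1));             /* pavgb, movdqa */
    _mm_store_si128((__m128i *)(dst + stride), _mm_avg_epu8(b0, b1));  /* results stored elsewhere */
#else
    for (size_t line = 0; line < 2; ++line) {
        for (size_t x = 0; x < 16; ++x) {
            const uint8_t *p = src + line * stride + x;
            /* same rounding as pavgb: (a + b + 1) >> 1 */
            dst[line * stride + x] = (uint8_t)((p[0] + p[1] + 1) >> 1);
        }
    }
#endif
}
```

The results go to a separate buffer, so nothing is stored back to the addresses that LDDQU reads, which matches the usage restriction stated earlier.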
IA-32 Intel Architecture Optimization Reference Manual, Order Number 248966-013US, April 2006