IA-32 Intel® Architecture Optimization
Let us now consider a case with a series of small loads after a large store
to the same area of memory (beginning at memory address mem), as shown in
Example 4-26. Most of the small loads will stall because they are not
aligned with the store; see “Store Forwarding” in Chapter 2 for more details.
The word loads must wait for the quadword store to write to memory
before they can access the data they require. This stall can also occur
with other data types (for example, when doublewords or words are
stored and then words or bytes are read from the same area of memory).
When you change the code sequence as shown in Example 4-27, the
processor can access the data without delay.
Example 4-26  A Series of Small Loads after a Large Store
    movq    mem, mm0        ; store qword to address "mem"
    :
    :
    mov     bx, mem + 2     ; load word at "mem + 2" stalls
    mov     cx, mem + 4     ; load word at "mem + 4" stalls
Example 4-27  Eliminating Delay for a Series of Small Loads after a Large Store
    movq    mem, mm0        ; store qword to address "mem"
    :
    :
    movq    mm1, mem        ; load qword at address "mem"
    movd    eax, mm1        ; transfer "mem + 2" to eax from
                            ; MMX register, not memory
    psrlq   mm1, 32
    shr     eax, 16
    movd    ebx, mm1        ; transfer "mem + 4" to bx from
                            ; MMX register, not memory
    and     ebx, 0ffffh
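The shift-and-mask arithmetic behind Example 4-27 can be modeled in C as a rough sketch. The helper names load_qword and word_at are hypothetical and do not come from the manual. The sketch assumes a little-endian layout, as on IA-32, and it illustrates only the data flow (one wide load, then register-side extraction). Whether a given compiler turns it into a single qword load is an assumption, not a guarantee.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: one 8-byte read of the stored area, the
   counterpart of "movq mm1, mem". memcpy is the portable way to
   express the read; a single qword load is likely but not guaranteed. */
static uint64_t load_qword(const void *mem)
{
    uint64_t q;
    memcpy(&q, mem, sizeof q);
    return q;
}

/* Hypothetical helper: the 16-bit word that sat at byte offset `off`
   of the qword, extracted by shifting in a register instead of
   issuing another small load. Assumes little-endian layout. */
static uint16_t word_at(uint64_t q, unsigned off)
{
    assert(off <= 6);                    /* word must lie inside the qword */
    return (uint16_t)(q >> (off * 8));   /* shift down, keep the low 16 bits */
}
```

With these helpers, a caller would read the area once with load_qword(mem) and then take word_at(q, 2) and word_at(q, 4), which correspond to the values that Example 4-26 loads into bx and cx directly from memory.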
IA-32 Intel Architecture Optimization Reference Manual, Order Number 248966-013US, April 2006