Optimizing Cache Usage
In the scenario shown on the right side of Figure 6-7, keeping the data in one way of the second-level cache does not improve cache locality. Therefore, use prefetcht0 to prefetch the data. This amortizes the latency of the memory references in passes 1 and 2, and keeps a copy of the data in the second-level cache, which reduces memory traffic and latencies for passes 3 and 4. To further reduce the latency, it might be worth considering extra prefetchnta instructions prior to the memory references in passes 3 and 4.
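The intent of that sequence can be sketched in C using the _mm_prefetch intrinsic, which compiles to the prefetcht0 and prefetchnta instructions. This is a minimal sketch only: the four-pass structure, the array and function names, and the work placeholders are assumptions for illustration; only the choice of prefetch hint per pass follows the text above.

#include <xmmintrin.h>

#define STRIDE 16   /* floats per iteration, assuming 64-byte cache lines */

/* Hypothetical four-pass kernel over a working set `data` of `n` floats. */
void four_pass_kernel(float *data, int n)
{
    /* Passes 1 and 2: prefetcht0 pulls the data into the cache hierarchy,
       keeping a copy in the second-level cache for the later passes. */
    for (int pass = 0; pass < 2; pass++) {
        for (int i = 0; i < n; i += STRIDE) {
            int next = (i + STRIDE < n) ? i + STRIDE : i;
            _mm_prefetch((const char *)(data + next), _MM_HINT_T0);
            /* ... pass 1 / pass 2 work on data[i .. i+STRIDE-1] ... */
        }
    }

    /* Passes 3 and 4: the data should now hit in the second-level cache;
       optional prefetchnta hints ahead of the references can hide the
       remaining latency without displacing other useful cached data. */
    for (int pass = 0; pass < 2; pass++) {
        for (int i = 0; i < n; i += STRIDE) {
            int next = (i + STRIDE < n) ? i + STRIDE : i;
            _mm_prefetch((const char *)(data + next), _MM_HINT_NTA);
            /* ... pass 3 / pass 4 work on data[i .. i+STRIDE-1] ... */
        }
    }
}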
In Example 6-7, consider the data access patterns of a 3D geometry engine, first without strip-mining and then with strip-mining incorporated. Note that the 4-wide SIMD instructions of the Pentium III processor can process 4 vertices per iteration.
Example 6-7  Data Access of a 3D Geometry Engine without Strip-mining
while (nvtx < MAX_NUM_VTX) {
    prefetchnta vertex_i data       // v = [x,y,z,nx,ny,nz,tu,tv]
    prefetchnta vertex_i+1 data
    prefetchnta vertex_i+2 data
    prefetchnta vertex_i+3 data
    TRANSFORMATION code             // use only x,y,z,tu,tv of a vertex
    nvtx+=4
}
while (nvtx < MAX_NUM_VTX) {
    prefetchnta vertex_i data       // v = [x,y,z,nx,ny,nz,tu,tv]
                                    // x,y,z fetched again
    prefetchnta vertex_i+1 data
    prefetchnta vertex_i+2 data
    prefetchnta vertex_i+3 data
    compute the light vectors       // use only x,y,z
    LOCAL LIGHTING code             // use only nx,ny,nz
    nvtx+=4
}
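The strip-mined variant referred to above is developed in the manual's following example; the C sketch below only illustrates the idea and is not quoted from the manual. The vertex layout, function name, and strip-size handling are assumptions; the point is that each strip is sized to fit in the second-level cache, so the lighting pass reuses the data the transformation pass already brought in rather than fetching x, y, z from memory a second time.

#include <xmmintrin.h>

typedef struct { float x, y, z, nx, ny, nz, tu, tv; } Vertex;   /* 32 bytes */

/* Process vertices strip by strip; strip_size is chosen so that one strip
   fits in (one way of) the second-level cache. Placeholder comments stand in
   for the TRANSFORMATION and LOCAL LIGHTING code of the example above. */
void process_vertices(Vertex *v, int num_vtx, int strip_size)
{
    for (int base = 0; base < num_vtx; base += strip_size) {
        int end = (base + strip_size < num_vtx) ? base + strip_size : num_vtx;

        /* Pass 1: transformation. Prefetch upcoming vertices ahead of use. */
        for (int i = base; i < end; i += 4) {
            int next = (i + 4 < end) ? i + 4 : i;
            _mm_prefetch((const char *)&v[next], _MM_HINT_NTA);
            /* TRANSFORMATION code: uses x, y, z, tu, tv of v[i..i+3] */
        }

        /* Pass 2: lighting. The strip just transformed should still be
           cache resident, so no prefetch is issued here. */
        for (int i = base; i < end; i += 4) {
            /* compute the light vectors and LOCAL LIGHTING code:
               uses x, y, z and nx, ny, nz of v[i..i+3] */
        }
    }
}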