IA-32 Intel® Architecture Optimization
Prefetch concatenation can bridge the execution pipeline bubbles at the
boundary between an inner loop and its associated outer loop. Unrolling
the last iteration out of the inner loop and specifying the effective
prefetch address for the data used in the following iteration removes
the performance loss of memory de-pipelining. Example 6-5 gives the
rewritten code.
With the prefetching rewritten this way, only the first iteration of the
outer loop suffers any memory access latency penalty, assuming the
computation time is larger than the memory latency. Inserting a prefetch
of the first data element needed before entering the nested loop
computation would eliminate or reduce the start-up penalty for that very
first iteration of the outer loop. This simple high-level code
optimization can improve memory performance significantly.
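The claim above can be checked with a small bookkeeping model. The sketch
below is not taken from the manual: it issues no real prefetches and only
records which 8-element blocks have been requested before they are used. It
assumes that a prefetch always completes before the next computation step
(the "computation time is larger than the memory latency" condition) and
that a[ii][32] does not alias a[ii+1][0], for example because rows are
wider than the 32 elements touched. The names count_unprefetched_blocks,
model_prefetch and model_compute are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define ROWS   100            /* outer-loop trip count from the examples      */
#define COLS   32             /* elements touched per row                     */
#define STEP   8              /* elements per block (assumed prefetch unit)   */
#define BLOCKS (COLS / STEP)

/* One flag per block; the extra row and column absorb the a[ii][32] and
   a[ROWS][0] requests so the model never indexes out of bounds. */
static bool fetched[ROWS + 1][BLOCKS + 1];

/* Record that the block holding a[row][col] has been requested. */
static void model_prefetch(int row, int col)
{
    assert(row <= ROWS && col <= COLS && col % STEP == 0);
    fetched[row][col / STEP] = true;
}

/* Return 1 if the block is used without an earlier prefetch, else 0. */
static int model_compute(int row, int col)
{
    assert(row < ROWS && col < COLS && col % STEP == 0);
    return fetched[row][col / STEP] ? 0 : 1;
}

/* concatenated = false models the plain nested loop (Example 6-4 shape);
   concatenated = true models the unrolled last iteration (Example 6-5 shape).
   startup_prefetch adds a prefetch of a[0][0] before the loop nest. */
int count_unprefetched_blocks(bool concatenated, bool startup_prefetch)
{
    int misses = 0;
    memset(fetched, 0, sizeof fetched);

    if (startup_prefetch)
        model_prefetch(0, 0);

    for (int ii = 0; ii < ROWS; ii++) {
        int jj;
        int inner_end = concatenated ? COLS - STEP : COLS;

        for (jj = 0; jj < inner_end; jj += STEP) {
            model_prefetch(ii, jj + STEP);
            misses += model_compute(ii, jj);
        }
        if (concatenated) {
            model_prefetch(ii + 1, 0);        /* first block of the next row */
            misses += model_compute(ii, jj);  /* unrolled last iteration     */
        }
    }
    return misses;
}
```

Under these assumptions the model reports one unprefetched block per row
(100 in total) for the plain loop, a single one for the concatenated loop,
and none once the start-up prefetch is added.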
Example 6-4 Using Prefetch Concatenation
for (ii = 0; ii < 100; ii++) {
    for (jj = 0; jj < 32; jj+=8) {
        prefetch a[ii][jj+8]
        computation a[ii][jj]
    }
}
Example 6-5 Concatenation and Unrolling the Last Iteration of Inner Loop
for (ii = 0; ii < 100; ii++) {
    for (jj = 0; jj < 24; jj+=8) { /* N-1 iterations */
        prefetch a[ii][jj+8]
        computation a[ii][jj]
    }
    prefetch a[ii+1][0]
    computation a[ii][jj] /* Last iteration */
}
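A compilable rendering of the concatenated loop might look like the sketch
below. Several details are assumptions rather than the manual's code: the
GCC/Clang builtin __builtin_prefetch stands in for the prefetch
pseudo-operation, compute_block is a hypothetical placeholder for
"computation a[ii][jj]" that sums one 8-element block, and the element type
float is chosen only for illustration. The sketch also issues the start-up
prefetch of a[0][0] described in the text, and skips the next-row prefetch
on the final row so that no address past the array is formed.

```c
#include <assert.h>
#include <stddef.h>

#define ROWS 100
#define COLS 32
#define STEP 8   /* elements handled per computation step */

/* Assumption: map the manual's "prefetch" onto the GCC/Clang builtin
   (read access, high temporal locality); elsewhere it becomes a no-op. */
#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH(addr) __builtin_prefetch((addr), 0, 3)
#else
#define PREFETCH(addr) ((void)(addr))
#endif

/* Hypothetical stand-in for "computation a[ii][jj]": sum one block. */
static float compute_block(const float *block)
{
    float sum = 0.0f;
    for (int k = 0; k < STEP; k++)
        sum += block[k];
    return sum;
}

/* Sums every element of a[0..rows-1][0..COLS-1] using the loop shape of
   Example 6-5 plus a start-up prefetch before the loop nest. */
float sum_with_concatenated_prefetch(float (*a)[COLS], int rows)
{
    assert(a != NULL && rows >= 0);

    float total = 0.0f;
    PREFETCH(&a[0][0]);                       /* start-up prefetch */

    for (int ii = 0; ii < rows; ii++) {
        int jj;
        for (jj = 0; jj < COLS - STEP; jj += STEP) {   /* N-1 iterations */
            PREFETCH(&a[ii][jj + STEP]);
            total += compute_block(&a[ii][jj]);
        }
        if (ii + 1 < rows)
            PREFETCH(&a[ii + 1][0]);          /* first block of next row */
        total += compute_block(&a[ii][jj]);   /* last iteration, jj == 24 */
    }
    return total;
}
```

A caller would declare an array such as float a[ROWS][COLS] and pass it with
its row count. The prefetches are only hints and do not change the result;
whether they help depends on the target processor and on how the
computation time compares with the memory latency.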
IA-32 Intel Architecture Optimization Reference Manual, Order Number 248966-013US, April 2006