IA-32 Intel® Architecture Optimization
Thus, software optimization of a data access pattern should first
emphasize tuning for hardware prefetch, by increasing the proportion of
smaller-stride data accesses in the workload, before attempting to
provide hints to the processor with software prefetch instructions.
Loads and Stores
The Pentium 4 processor employs the following techniques to speed up
the execution of memory operations:
• speculative execution of loads
• reordering of loads with respect to loads and stores
• multiple outstanding misses
• buffering of writes
• forwarding of data from stores to dependent loads
Performance may be enhanced by not exceeding the memory issue
bandwidth and buffer resources provided by the processor. Up to one
load and one store may be issued each cycle from a memory port
reservation station. For a memory operation to be dispatched to a
reservation station, a buffer entry must be available for it. There
are 48 load buffers and 24 store buffers³. These buffers hold the µop
and address information until the operation is completed, retired, and
deallocated.
The Pentium 4 processor is designed to enable the execution of memory
operations out of order with respect to other instructions and with
respect to each other. Loads can be carried out speculatively, that is,
before all preceding branches are resolved. However, speculative loads
cannot cause page faults.
3. Pentium 4 processors with CPUID model encoding equal to 3 have more
than 24 store buffers.
IA-32 Intel Architecture Optimization Reference Manual, Order Number 248966-013US, April 2006