Optimizing Cache Usage
The baseline for performance comparison is the throughput (bytes/sec)
of an 8-MByte memory copy on a first-generation Pentium M processor
(CPUID signature 0x69n) with a 400-MHz system bus, using a
byte-sequential technique similar to that shown in Example 6-10. The
degrees of improvement relative to this baseline are compared for newer
IA-32 processors and platforms with higher system bus speeds, using
different coding techniques.
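As a rough illustration of how such a baseline might be measured, the
sketch below times a byte-sequential copy and converts the result to
bytes/sec. It is not the code of Example 6-10. The names copy_bytewise,
measure_throughput, and relative_throughput are hypothetical, and
clock() reports CPU time at coarse resolution, so the numbers it yields
are approximate at best.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical baseline: copy one byte at a time, in address order. */
static void copy_bytewise(unsigned char *dst, const unsigned char *src,
                          size_t n)
{
    /* volatile discourages the compiler from turning the loop into memcpy */
    volatile unsigned char *d = dst;
    for (size_t i = 0; i < n; i++)
        d[i] = src[i];
}

typedef void (*copy_fn)(unsigned char *, const unsigned char *, size_t);

/* Run `reps` copies of n bytes and return bytes/sec.
 * Returns 0.0 if the timer registered no elapsed time. */
static double measure_throughput(copy_fn fn, unsigned char *dst,
                                 const unsigned char *src, size_t n,
                                 int reps)
{
    assert(reps > 0);
    clock_t start = clock();
    for (int r = 0; r < reps; r++)
        fn(dst, src, n);
    clock_t end = clock();

    double seconds = (double)(end - start) / CLOCKS_PER_SEC;
    return seconds > 0.0 ? ((double)n * (double)reps) / seconds : 0.0;
}

/* Throughput of a technique expressed as a multiple of the baseline. */
static double relative_throughput(double technique, double baseline)
{
    assert(baseline > 0.0);
    return technique / baseline;
}
```

A caller would measure copy_bytewise on an 8-MByte buffer first, then
pass each other routine through measure_throughput and divide by the
baseline figure with relative_throughput.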
The second coding technique moves data at 4-Byte granularity using the
REP string instruction. The third column compares the performance of
the coding technique listed in Example 6-11. The fourth column compares
the throughput of fetching 4 KBytes of data at a time (using hardware
prefetch to aggregate bus read transactions) and writing to memory via
16-Byte streaming stores.
Increases in bus speed are the primary contributor to throughput
improvements. The technique shown in Example 6-12 will likely take
advantage of the faster bus speed in the platform more efficiently.
Additionally, increasing the block size to multiples of 4-KBytes while
keeping the total working set within the second-level cache can improve
the throughput slightly.
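The block-oriented approach described above can be sketched in C with
SSE2 intrinsics. This is an illustrative sketch, not the code of
Example 6-12: the function name copy_blocked, the block_bytes
parameter, and the 64-byte cache line size are assumptions made here.
Each block is first read sequentially, one touch per cache line, so that
it is brought into cache, and is then written out with 16-Byte streaming
stores. Where SSE2 is unavailable or the pointers are not 16-Byte
aligned, the sketch falls back to memcpy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* _mm_load_si128, _mm_stream_si128, _mm_sfence */
#endif

#define CACHE_LINE 64   /* assumed cache line size in bytes */

/* Copy n bytes in blocks of block_bytes (a multiple of 4 KBytes). */
static void copy_blocked(void *dst, const void *src, size_t n,
                         size_t block_bytes)
{
    const unsigned char *s = (const unsigned char *)src;
    unsigned char *d = (unsigned char *)dst;

    assert(block_bytes > 0 && block_bytes % 4096 == 0);

    while (n >= block_bytes) {
        /* Pass 1: touch one byte per cache line, in address order, so
         * the whole block is read into cache before it is stored. */
        volatile unsigned char sink = 0;
        for (size_t i = 0; i < block_bytes; i += CACHE_LINE)
            sink = s[i];
        (void)sink;

        /* Pass 2: write the block out, bypassing the cache if we can. */
#if defined(__SSE2__)
        if ((((uintptr_t)s | (uintptr_t)d) & 15) == 0) {
            for (size_t i = 0; i < block_bytes; i += 16) {
                __m128i v = _mm_load_si128((const __m128i *)(s + i));
                _mm_stream_si128((__m128i *)(d + i), v);
            }
        } else
#endif
        {
            memcpy(d, s, block_bytes);
        }

        s += block_bytes;
        d += block_bytes;
        n -= block_bytes;
    }

#if defined(__SSE2__)
    _mm_sfence();       /* make the streaming stores globally visible */
#endif
    if (n > 0)
        memcpy(d, s, n); /* tail shorter than one block */
}
```

Passing 4096 for block_bytes matches the 4-KByte granularity discussed
above. Larger multiples can be tried, provided that the block still fits
comfortably within the second-level cache.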
The relative performance figures shown in Table 6-2 are representative
of clean microarchitectural conditions within a processor (e.g. looping
a simple sequence of code many times). The net benefit of integrating a
specific memory copy routine into an application (full-featured
applications tend to create many complicated microarchitectural
conditions) will vary for each application.
Deterministic Cache Parameters
If CPUID supports the function leaf with input EAX = 4, this leaf is
referred to as the deterministic cache parameter leaf of CPUID (see the
CPUID instruction in the IA-32 Intel® Architecture Software Developer’s
Manual). Software can use the deterministic cache parameter leaf to
IA-32 Intel Architecture Optimization Reference Manual, Order Number 248966-013US, April 2006