January, 2004
Intel XScale® Core Developer’s Manual
Optimization Guide
A.5.1.1. Scheduling Load and Store Double (LDRD/STRD)
The Intel XScale® core introduces two new double-word instructions: LDRD and STRD. LDRD
loads 64 bits of data from an effective address into two consecutive registers; conversely, STRD
stores 64 bits from two consecutive registers to an effective address. There are two important
restrictions on how these instructions may be used:
• the effective address must be aligned on an 8-byte boundary
• the specified register must be even (r0, r2, etc.).
When both restrictions are met, using LDRD/STRD instead of LDM/STM to move the same 64 bits is
more efficient: LDRD issues in one clock cycle and STRD in two, whereas LDM/STM issues in four
clock cycles. Avoid LDRDs targeting R12; this incurs an extra cycle of issue latency.
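As a rough illustration of the substitution described above, the sketch below shows a two-register
LDM and STM next to their LDRD/STRD equivalents. The register choices are hypothetical, and the
sketch assumes that r2 holds an 8-byte-aligned address; if that cannot be guaranteed, the LDM/STM
forms should be kept.

```asm
; Assumption: r2 holds an address aligned on an 8-byte boundary

; Load two words with a load multiple
ldmia r2, {r0, r1}
; Same two words loaded with a single ldrd (even first register: r0)
ldrd r0, [r2]

; Store two words with a store multiple
stmia r2, {r4, r5}
; Same two words stored with a single strd (even first register: r4)
strd r4, [r2]
```

Each pair is shown back to back only for comparison; real code would use one form or the other.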
The LDRD instruction has a result latency of 3 or 4 cycles depending on the destination register
being accessed (assuming the data being loaded is in the data cache).
add r6, r7, r8
sub r5, r6, r9
; The following ldrd instruction would load values
; into registers r0 and r1
ldrd r0, [r3]
orr r8, r1, #0xf
mul r7, r0, r7
In the code example above, the ORR instruction would stall for 3 cycles because of the 4-cycle
result latency for the second destination register (r1) of the LDRD instruction. The code shown
above can be rearranged to remove the pipeline stalls:
; The following ldrd instruction would load values
; into registers r0 and r1
ldrd r0, [r3]
add r6, r7, r8
sub r5, r6, r9
mul r7, r0, r7
orr r8, r1, #0xf
Any memory operation following an LDRD instruction (LDR, LDRD, STR, and so on) stalls
for 1 cycle.
; The str instruction below would stall for 1 cycle
ldrd r0, [r3]
str r4, [r5]
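One possible way to hide this stall is to schedule an independent, non-memory instruction between
the LDRD and the next memory operation. The sketch below assumes that the stall applies only to a
memory operation issued immediately after the LDRD; the ADD and its registers are hypothetical
filler chosen so that it depends on neither the LDRD nor the STR.

```asm
ldrd r0, [r3]
; Independent ALU work occupies the cycle in which the str would stall
add r6, r7, r8
str r4, [r5]
```

If no independent instruction is available, the original order costs only the single stall cycle.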