background image

TMS DSP

Little-Endian DSP Library

Programmer’s Reference

Literature Number: SPRUEB8

February 2006

Summary of Contents for TMS320C64x DSP

Page 1: ...TMS320C64x DSP Little Endian DSP Library Programmer s Reference Literature Number SPRUEB8 February 2006 ...

Page 2: ...tute a license from TI to use such products or services or a warranty or endorsement thereof Use of such information may require a license from a third party under the patents or other intellectual property of the third party or a license from TI under the patents or other intellectual property of TI Reproduction of information in TI data books or data sheets is permissible only if reproduction is...

Page 3: ...ers above and its read write properties below A legend explains the notation used for the properties J Reserved bits in a register figure designate a bit that is used for future device expansion Related Documentation From Texas Instruments The following books describe the C6000 devices and related support tools Copies of these documents are available on the Internet at www ti com Tip Enter the lit...

Page 4: ...as Instruments TMS320C64x digital signal processor DSP to the TMS320C64x DSP The objective of this document is to indicate differences between the two cores Functionality in the devices that is identical is not included Trademarks C6000 TMS320C64x TMS320C64x C64x are trademarks of Texas Instruments ...

Page 5: ... 2 2 4 DSPLIB Testing Allowable Error 2 4 2 2 5 DSPLIB Overflow and Scaling Issues 2 4 2 2 6 Interrupt Behavior of DSPLIB Functions 2 5 2 3 How to Rebuild DSPLIB 2 5 3 DSPLIB Function Tables 3 1 Provides tables containing all DSPLIB functions a brief description of each and a page refer ence for more detailed information 3 1 Arguments and Conventions Used 3 2 3 2 DSPLIB Functions 3 3 3 3 DSPLIB Fu...

Page 6: ... A 1 Performance Considerations A 2 A 2 Fractional Q Formats A 3 A 2 1 Q3 12 Format A 3 A 2 2 Q 15 Format A 3 A 2 3 Q 31 Format A 4 B Software Updates and Customer Support B 1 Provides information about warranty issues software updates and customer support B 1 DSPLIB Software Updates B 2 B 2 DSPLIB Customer Support B 2 C Glossary C 1 Defines terms and abbreviations used in this book ...

Page 7: ...ation 3 4 3 4 FFT 3 4 3 5 Filtering and Convolution 3 5 3 6 Math 3 6 3 7 Matrix 3 6 3 8 Miscellaneous 3 7 3 9 Obsolete Functions 3 7 3 10 Functions Optimized in the C64x DSPLIB 3 8 A 1 Q3 12 Bit Fields A 3 A 2 Q 15 Bit Fields A 3 A 3 Q 31 Low Memory Location Bit Fields A 4 A 4 Q 31 High Memory Location Bit Fields A 4 ...

Page 8: ...vi ...

Page 9: ...oduction to the TI C64x DSP Libraries DSPLIB shows the organization of the routines contained in the library and lists the features and benefits of the DSPLIB Topic Page 1 1 Introduction to the TI C64x DSPLIB 1 2 1 2 Features and Benefits 1 4 Chapter 1 ...

Page 10: ... use DSP functions TI DSPLIB can significantly shorten your DSP application development time The TI DSPLIB includes commonly used DSP routines Source code is provided that allows you to modify functions to match your specific needs The routines contained in the library are organized into the following seven different functional categories Adaptive filtering J DSP_firlms2 Correlation J DSP_autocor ...

Page 11: ...SP_fir_r8 J DSP_fir_r8_hM16_rM8A8X8 J DSP_fir_sym J DSP_iir Math J DSP_dotp_sqr J DSP_dotprod J DSP_maxval J DSP_maxidx J DSP_minval J DSP_mul32 J DSP_neg32 J DSP_recip16 J DSP_vecsumsq J DSP_w_vec Matrix J DSP_mat_mul J DSP_mat_trans Miscellaneous J DSP_bexp J DSP_blk_eswap16 J DSP_blk_eswap32 J DSP_blk_eswap64 J DSP_blk_move J DSP_fltoq15 J DSP_minerror J DSP_q15tofl ...

Page 12: ...fits Hand coded assembly optimized routines C and linear assembly source code C callable routines fully compatible with the TI C6x compiler Fractional Q 15 format operands supported on some benchmarks Benchmarks time and code Tested against C model ...

Page 13: ...talling and Using DSPLIB This chapter provides information on how to install and rebuild the TI C64x DSPLIB Topic Page 2 1 How to Install DSPLIB 2 2 2 2 Using DSPLIB 2 3 2 3 How to Rebuild DSPLIB 2 5 Chapter 2 ...

Page 14: ... files lib library and source archives support fft twiddle generation functions Please install the contents of the lib directory in the default directory indicated by your C_DIR environment If you choose not to install the contents in the default directory update the C_DIR environment variable for example by adding the following line in autoexec bat file SET C_DIR install_dir lib install_dir inclu...

Page 15: ...931348623157e 308 Unless specifically noted DSPLIB operates on Q 15 fractional data type elements Appendix A presents an overview of Fractional Q formats 2 2 1 2 DSPLIB Arguments TI DSPLIB functions typically operate over vector operands for greater efficiency Even though these routines can be used to process short arrays or even scalars unless a minimum size requirement is noted they will be slow...

Page 16: ... can expect identical results between Reference C implementation and its Assembly implementation when using test routines that focus on fixed point type results The test routines that deal with floating points typically allow an error margin of 0 000001 when comparing the results of reference C code and DSPLIB assembly code 2 2 5 DSPLIB Overflow and Scaling Issues The DSPLIB functions implement th...

Page 17: ...iods of time with small windows of interruptibility Examples include a function with a nested loop where the inner loop is non interruptible and the outer loop permits interrupts between executions of the inner loop Non interruptible These functions disable interrupts for nearly their entire duration Interrupts may happen for a short time during the setup and exit sequence Note that all three func...

Page 18: ...2 6 ...

Page 19: ...all DSPLIB functions a brief description of each and a page reference for more detailed information Topic Page 3 1 Arguments and Conventions Used 3 2 3 2 DSPLIB Functions 3 3 3 3 DSPLIB Function Tables 3 4 3 4 Differences Between the C64x and C64x DSPLIBs 3 8 Chapter 3 ...

Page 20: ... such as higher multiply throughput While these new functions perform better they can also lead to problems if not carefully used For example DSP_autocor_rA8 is faster than DSP_autocor but the output buffer must be aligned to an 8 byte boundary Therefore the new functions are named with any additional restrictions Three types of restrictions are specified to a pointer minimum buffer size M buffer ...

Page 21: ...B Functions The routines included in the DSP library are organized into eight functional categories and listed below in alphabetical order Adaptive filtering Correlation FFT Filtering and convolution Math Matrix functions Miscellaneous Obsolete functions ...

Page 22: ...place Forward FFT mixed radix with digit reversal Input Output data in Im Re order 4 11 void DSP_fft16x16r int nx short x short w unsigned char brev short y int radix int offset int n_max Cache optimized mixed radix FFT with scaling and rounding digit reversal out of place Input and output 16 bits Twiddle factor 16 bits 4 14 void DSP_fft16x32 short w int nx int x int y Extended precision mixed rad...

Page 23: ... Twiddle factor 32 bits 4 36 Table 3 5 Filtering and Convolution Functions Description Page void DSP_fir_cplx short x short h short r int nh int nx Complex FIR Filter nh is a multiple of 2 4 38 void DSP_fir_cplx_hM4X4 short x short h short r int nh int nx Complex FIR Filter nh is a multiple of 4 4 38 void DSP_fir_gen short x short h short r int nh int nr FIR Filter any nh 4 42 void DSP_fir_gen_hM1...

Page 24: ...e of a Vector 4 62 int DSP_maxidx short x int nx Index of the Maximum Element of a Vector 4 63 short DSP_minval short x int nx Minimum Value of a Vector 4 65 void DSP_mul32 int x int y int r short nx 32 bit Vector Multiply 4 66 void DSP_neg32 int x int r short nx 32 bit Vector Negate 4 68 void DSP_recip16 short x short rfrac short rexp short nx 16 bit Reciprocal 4 69 int DSP_vecsumsq short x int n...

Page 25: ... of Memory 4 84 void DSP_fltoq15 float x short r short nx Float to Q15 Conversion 4 85 int DSP_minerror short GSP0_TABLE short errCoefs int savePtr_ret Minimum Energy Error Search 4 87 void DSP_q15tofl short x float r short nx Q15 to Float Conversion 4 89 Table 3 9 Obsolete Functions Functions Description Page void DSP_bitrev_cplx int x short index int nx Use DSP_fft16x16 instead 4 88 void DSP_rad...

Page 26: ...itten to take advantage of the new C64x instructions and of the SPLOOP feature Table 3 10 Functions Optimized in the C64x DSPLIB Function C64x Optimized Optimization Type DSP_firlms2 No DSP_autocor No DSP_autocor_rA8 Yes Kernel re design SPLOOP Optimization resulted in new requirements New name is used DSP_fft16x16 Yes New Function Optimized C64x DSP_fft16x16_imre Yes New Function Optimized C64x D...

Page 27: ...o DSP_fir_gen_hM17_rA8X8 Yes Kernel re design SPLOOP Optimization resulted in new requirements New name is used DSP_fir_r4 No DSP_fir_r8 No DSP_fir_r8_hM16_rM8A8X8 Yes Kernel re design SPLOOP Optimization resulted in new requirements New name is used DSP_fir_sym No DSP_iir No DSP_iirlat No DSP_dotp_sqr No DSP_dotprod Yes SPLOOP conversion DSP_maxval No DSP_maxidx No DSP_minval No DSP_mul32 No DSP_...

Page 28: ...x Optimized DSP_blk_eswap16 No DSP_blk_eswap32 No DSP_blk_move Yes SPLOOP conversion DSP_fltoq15 No DSP_minerror No DSP_q15tofl No DSP_bitrev_cplx No Obsolete DSP_radix2 No Obsolete DSP_r4fft No Obsolete DSP_fft No Obsolete DSP_fft16x16t No Obsolete Any functions which were not optimized for the C64x have the same performance as on the C64x ...

Page 29: ...unctions within each category are listed in alphabetical order and include arguments descriptions algorithms benchmarks and special requirements Topic Page 4 1 Adaptive Filtering 4 2 4 2 Correlation 4 4 4 3 FFT 4 8 4 4 Filtering and Convolution 4 38 4 5 Math 4 58 4 6 Matirx 4 73 4 7 Miscellaneous 4 76 4 8 Obsolete Functions 4 90 Chapter 4 ...

Page 30: ...by adding the weighted error times the inputs to the original coefficients The input array includes the last nh inputs followed by a new single sample input The coefficient array includes nh coefficients Algorithm This is the C equivalent of the assembly code without restrictions Note that the assembly code is hand optimized and restrictions may apply long DSP_firlms2 short h short x short b int n...

Page 31: ...DSPLIB Reference Implementation Notes Bank Conflicts No bank conflicts occur Interruptibility The code is interrupt tolerant but not interruptible The loop is unrolled 4 times Benchmarks Cycles 3 nh 4 17 Codesize 148 bytes ...

Page 32: ...e accepts an input array of length nx nr and performs nr autocorrelations each of length nx producing nr output results This is typically used in VSELP code Algorithm This is the C equivalent of the assembly code without restrictions Note that the assembly code is hand optimized and restrictions may apply void DSP_autocor short r short x int nx int nr int i k sum for i 0 i nr i sum 0 for k nr k nx...

Page 33: ...The code is interrupt tolerant but not interruptible The inner loop is unrolled 8 times The outer loop is unrolled 4 times The outer loop is conditionally executed in parallel with the inner loop This allows for a zero overhead outer loop Benchmarks Cycles nx 40 6 nr nr 4 20 nx 40 nx nr 8 2 nr 20 Codesize 304 bytes ...

Page 34: ...nx producing nr output results This is typically used in VSELP code Algorithm This is the C equivalent of the assembly code without restrictions Note that the assembly code is hand optimized and restrictions may apply void DSP_autocor short r short x int nx int nr int i k sum for i 0 i nr i sum 0 for k nr k nx nr k sum x k x k i r i sum 15 Special Requirements nx must be a multiple of 8 nr must be...

Page 35: ...DSP_autocor_rA8 4 7 C64x DSPLIB Reference Benchmarks Cycles nx 40 6 nr 20 nx 40 nx nr 8 2 nr 20 Codesize 304 bytes ...

Page 36: ...real and imaginary parts The code uses a special ordering of FFT coefficients also called twiddle factors and memory accesses to improve performance in the presence of cache Algorithm All stages are radix 4 except the last one which can be radix 2 or radix 4 depending on the size of the FFT All stages except the last one scale by two the stage output data Special Requirements In place computation ...

Page 37: ...ts can be made based on above observations 1 Inner loop i0 iterates a variable number of times In particular the number of iterations quadruples every time from 1 N 4 Hence software pipelining a loop that iterates a variable number of times is not profitable 2 Outer loop j iterates a variable number of times as well However the number of iterations is quartered every time from N 4 1 Hence the beha...

Page 38: ...4 where N is the number of complex points to be transformed The routine that generates the modified twiddle factor array was presented earlier With the above transformation of the FFT both the input data and the twiddle factor array can be accessed using double word wide loads to enable packed data processing The final stage is optimized to remove the multiplication as w0 1 This stage also perform...

Page 39: ...The code uses a special ordering of FFT coefficients also called twiddle factors and memory accesses to improve performance in the presence of cache Algorithm All stages are radix 4 except the last one which can be radix 2 or radix 4 depending on the size of the FFT All stages except the last one scale by two the stage output data Special Requirements In place computation is not allowed The size o...

Page 40: ...iterates a variable number of times In particular the number of iterations quadruples every time from 1 N 4 Hence software pipelining a loop that iterates a variable number of times is not profitable 2 Outer loop j iterates a variable number of times as well However the number of iterations is quartered every time from N 4 1 Hence the behavior in a and b are exactly opposite to each other 3 If the...

Page 41: ...f size 3N 4 where N is the number of complex points to be transformed The routine that generates the modified twiddle factor array was presented earlier With the above transformation of the FFT both the input data and the twiddle factor array can be accessed using double word wide loads to enable packed data processing The final stage is optimized to remove the multiplication as w0 1 This stage al...

Page 42: ...d for decomposing FFT into sub FFTs See notes offset Index in complex samples of sub FFT from start of main FFT nmax Size of main FFT in complex samples Description This routine implements a cache optimized complex forward mixed radix FFT with scaling rounding and digit reversal Input data x output data y and coefficients w are 16 bit The output is returned in the separate array y in normal order ...

Page 43: ...arg fy_0 fx_0 co fx_1 si fy_1 fx_1 co fx_0 si y 2 k short 2 fy_0 sqrt n y 2 k 1 short 2 fy_1 sqrt n Scaling by 2 i e 1 takes place at each radix 4 stage except the last one A radix 4 stage could give a maximum bit growth of 2 bits which would require scaling by 4 To completely prevent overflow the input data must be scaled by 2 BT BS where BT total number of bit growth log2 nx and BS number of sca...

Page 44: ... the size These continue down to the end when the butterfly is of size 4 They use an index to the main twiddle factor array of 0 75 2 N This is because the twiddle factor array is composed of successively decimated versions of the main array N not equal to a power of 4 can be used i e 512 In this case the following would be needed to decompose the FFT DSP_fft16x16r 512 x 0 w 0 y brev 2 0 512 is eq...

Page 45: ... are required to perform N 4 size FFTs In fact the ordering of these 4 FFTs amongst themselves does not matter and thus from a cache perspective it helps to go through the remaining 4 FFTs in exactly the opposite order to the first This is illustrated as follows DSP_fft16x16r N x 0 w 0 brev y N 4 0 N DSP_fft16x16r N 4 x 2 3 N 4 w 2 3 N 4 brev y rad 3 N 4 N DSP_fft16x16r N 4 x 2 N 2 w 2 3 N 4 brev ...

Page 46: ...x w short ptr_x0 ptr_x2 y0 unsigned int j k j0 j1 k0 k1 short x0 x1 x2 x3 x4 x5 x6 x7 short xh0_0 xh1_0 xh0_1 xh1_1 short xl0_0 xl1_0 xl0_1 xl1_1 short yt3 yt4 yt5 yt6 yt7 unsigned a num stride n n is the number of complex samples tw_offset 0 while stride radix j 0 fft_jmp stride stride 1 h2 stride 1 l1 stride l2 stride stride 1 x ptr_x w ptr_w tw_offset for i 0 i n 1 i 2 co1 w j 0 si1 w j 1 co2 w...

Page 47: ...2p1 x_l2p1 xl20 x_h2 x_l2 xl21 x_h2p1 x_l2p1 ptr_x0 x ptr_x0 0 short xh0 xh20 1 ptr_x0 1 short xh1 xh21 1 ptr_x2 ptr_x0 x 2 predj j fft_jmp if predj x fft_jmp if predj j 0 xt0 xh0 xh20 yt0 xh1 xh21 xt1 xl0 xl21 yt2 xl1 xl20 xt2 xl0 xl21 yt1 xl1 xl20 l1p1 l1 1 h2p1 h2 1 l2p1 l2 1 ptr_x2 l1 xt1 co1 yt1 si1 0x00008000 16 ptr_x2 l1p1 yt1 co1 xt1 si1 0x00008000 16 ptr_x2 h2 xt0 co2 yt0 si2 0x00008000 1...

Page 48: ...e j offset 2 ptr_x0 ptr_x y0 y determine _norm nmax 17 l0 31 if nmax 31 1 1 num nmax else num nmax if num l0 32 else a num 0xFFFF0000 if a l0 16 num a a num 0xFF00FF00 if a l0 8 num a a num 0xF0F0F0F0 if a l0 4 num a a num 0xCCCCCCCC if a l0 2 num a a num 0xAAAAAAAA if a l0 1 l0 1 l0 17 if radix 2 radix 4 for i 0 i n i 4 reversal computation j0 j 0x3F j1 j 6 0x3F k0 brev j0 k1 brev j1 ...

Page 49: ...x0 1 x2 ptr_x0 2 x3 ptr_x0 3 x4 ptr_x0 4 x5 ptr_x0 5 x6 ptr_x0 6 x7 ptr_x0 7 ptr_x0 8 xh0_0 x0 x4 xh1_0 x1 x5 xh0_1 x2 x6 xh1_1 x3 x7 if radix 2 xh0_0 x0 xh1_0 x1 xh0_1 x2 xh1_1 x3 yt0 xh0_0 xh0_1 yt1 xh1_0 xh1_1 yt4 xh0_0 xh0_1 yt5 xh1_0 xh1_1 xl0_0 x0 x4 xl1_0 x1 x5 xl0_1 x2 x6 xl1_1 x3 x7 if radix 2 xl0_0 x4 xl1_0 x5 ...

Page 50: ... Complex input data x twiddle factors w and output array y must be double word aligned Real values are stored in even word imaginary in odd All data are in short precision or Q 15 format Allowed input dynamic range is 16 log2 nx ceil log4 nx 1 Output results are returned in normal order The FFT coefficients twiddle factors are generated using the program tw_fft16x16 provided in the directory suppo...

Page 51: ... generated above produces the FFT This collapses the inner 2 loops in the traditional Burrus and Parks implementation The revised FFT uses a redundant sequence of twiddle factors to allow a linear access through the data This linear access enables data and instruction level parallelism The butterfly is bit reversed i e the inner 2 points of the butterfly are crossed over This makes the data come o...

Page 52: ...me as the one used for the DSP_fft16x16r routine Algorithm The C equivalent of the assembly code without restrictions is similar to the one shown for the DSP_fft16x16t routine For further details see the source code of the C version of this function which is provided with this library Note that the assembly code is hand optimized and restrictions may apply Special Requirements In place computation...

Page 53: ...nx 1 stages of radix 4 transform and performs either a radix 2 or radix 4 transform on the last stage depending on nx If nx is a power of 4 then this last stage is also a radix 4 transform otherwise it is a radix 2 transform See the fft16x16t implementation notes as similar ideas are used Benchmarks Cycles 10 25 nx 8 10 ceil log4 nx 1 6 nx 4 81 Codesize 1056 bytes ...

Page 54: ...6r routine except that the factors are maintained at 32 bit precision Algorithm The C equivalent of the assembly code without restrictions is similar to the one shown for the DSP_fft16x16t routine For further details see the source code of the C version of this function which is provided with this library Note that the assembly code is hand optimized and restrictions may apply Special Requirements...

Page 55: ...4 nx 1 stages of radix 4 transform and performs either a radix 2 or radix 4 transform on the last stage depending on nx If nx is a power of 4 then this last stage is also a radix 4 transform otherwise it is a radix 2 transform See the fft16x16t implementation notes as similar ideas are used Benchmarks Cycles 12 nx 8 12 ceil log4 nx 1 6 nx 4 79 Codesize 1056 bytes ...

Page 56: ...actors is the same one used for the DSP_fft32x32 routine Scaling by 2 i e 1 takes place at each radix 4 stage except for the last one A radix 4 stage can add a maximum of 2 bits which would require scaling by 4 to completely prevent overflow Thus the input data must be scaled by 2log2 nx ceil log4 nx 1 Algorithm The C equivalent of the assembly code without restrictions is similar to the one shown...

Page 57: ...conflicts occur Interruptibility The code is interruptible Scaling is performed at each stage by shifting the results right by 1 preventing overflow The routine uses log4 nx 1 stages of radix 4 transform and performs either a radix 2 or radix 4 transform on the last stage depending on nx If nx is a power of 4 then this last stage is also a radix 4 transform otherwise it is a radix 2 transform See ...

Page 58: ...onjugating again This allows fft16x16 to perform the IFFT as well However if the double conjugation needs to be avoided then this routine uses the same twiddle factors as the FFT and performs an IFFT The change in the sign of the twiddle factors is adjusted for in the routine Hence this routine uses the same twiddle factors as the fft16x16 routine Algorithm The C equivalent of the assembly code wi...

Page 59: ...og4 nx 1 stages of radix 4 transform and performs either a radix 2 or radix 4 transform on the last stage depending on nx If nx is a power of 4 then this last stage is also a radix 4 transform otherwise it is a radix 2 transform See the fft16x16 implementation notes as similar ideas are used Benchmarks Cycles 6 nx 8 19 ceil log4 nx 1 8 nx 8 30 Codesize 864 bytes ...

Page 60: ...njugating again This allows fft16x16_imre to perform the IFFT as well However if the double conjugation needs to be avoided then this routine uses the same twiddle factors as the FFT and performs an IFFT The change in the sign of the twiddle factors is adjusted for in the routine Hence this routine uses the same twiddle factors as the fft16x16_imre routine Algorithm The C equivalent of the assembl...

Page 61: ... log4 nx 1 stages of radix 4 transform and performs either a radix 2 or radix 4 transform on the last stage depending on nx If nx is a power of 4 then this last stage is also a radix 4 transform otherwise it is a radix 2 transform See the fft16x16 implementation notes as similar ideas are used Benchmarks Cycles 6 nx 8 19 ceil log4 nx 1 8 nx 8 30 Codesize 864 bytes ...

Page 62: ...FT and conjugating again This allows fft16x32 to perform the IFFT as well However if the double conjugation needs to be avoided then this routine uses the same twiddle factors as the FFT and performs an IFFT The change in the sign of the twiddle factors is adjusted for in the routine Hence this routine uses the same twiddle factors as the fft16x32 routine Algorithm The C equivalent of the assembly...

Page 63: ... overflow Implementation Notes Bank Conflicts No bank conflicts occur Interruptibility The code is interruptible The routine uses log4 nx 1 stages of radix 4 transform and performs either a radix 2 or radix 4 transform on the last stage depending on nx If nx is a power of 4 then this last stage is also a radix 4 transform otherwise it is a radix 2 transform See the fft16x16t implementation notes a...

Page 64: ... conjugating again This allows fft32x32 to perform the IFFT as well However if the double conjugation needs to be avoided then this routine uses the same twiddle factors as the FFT and performs an IFFT The change in the sign of the twiddle factors is adjusted for in the routine Hence this routine uses the same twiddle factors as the fft32x32 routine Algorithm The C equivalent of the assembly code ...

Page 65: ...vent overflow Implementation Notes Bank Conflicts No bank conflicts occur Interruptibility The code is interruptible The routine uses log4 nx 1 stages of radix 4 transform and performs either a radix 2 or radix 4 transform on the last stage depending on nx If nx is a power of 4 then this last stage is also a radix 4 transform otherwise it is a radix 2 transform See the fft16x16t implementation not...

Page 66: ... filter has nr output samples and nh coefficients Each array consists of an even and odd term with even terms representing the real part and the odd terms the imaginary part of the element The pointer to input array x must point to the nh th complex sample i e element 2 nh 1 upon entry to the function The coefficients are expected in normal order Algorithm This is the C equivalent of the assembly ...

Page 67: ...cts No bank conflicts occur Interruptibility The code is interrupt tolerant but not interruptible The outer loop is unrolled 4 times while the inner loop is not unrolled Both inner and outer loops are collapsed in one loop ADDAH and SUBAH are used along with PACKH2 to perform accumulation shift and data packing Collapsed one stage of epilog and prolog each Benchmarks Cycles nr nh 2 7 Codesize 448 ...

Page 68: ...er has nr output samples and nh coefficients Each array consists of an even and odd term with even terms representing the real part and the odd terms the imaginary part of the element The pointer to input array x must point to the nh th complex sample i e element 2 nh 1 upon entry to the function The coefficients are expected in normal order Algorithm This is the C equivalent of the assembly code ...

Page 69: ...tation Notes Bank Conflicts No bank conflicts occur Interruptibility The code is fully interruptible The outer loop is unrolled 4 times while the inner loop is not unrolled Both inner and outer loops are collapsed in one loop ADDAH and SUBAH are used along with PACKH2 to perform accumulation shift and data packing Collapsed one stage of epilog and prolog each Benchmarks Cycles nr nh 9 16 40 Codesi...

Page 70: ...o calculate Must be a multiple of 4 Description Computes a real FIR filter direct form using coefficients stored in vector h The real data input is stored in vector x The filter output result is stored in vector r It operates on 16 bit data with a 32 bit accumulate The filter calculates nr output samples using nh coefficients Algorithm This is the C equivalent of the assembly code without restrict...

Page 71: ... bank conflicts occur Interruptibility The code is interrupt tolerant but not interruptible Load double word instruction is used to simultaneously load four values in a single clock cycle The inner loop is unrolled four times and will always compute a multiple of 4 of nh and nr If nh is not a multiple of 4 the code will fill in zeros to make nh a multiple of 4 This code yields best performance whe...

Page 72: ...r Number of samples to calculate Must be a multiple of 8 Description Computes a real FIR filter direct form using coefficients stored in vector h The real data input is stored in vector x The filter output result is stored in vector r It operates on 16 bit data with a 32 bit accumulate The filter calculates nr output samples using nh coefficients Algorithm This is the C equivalent of the assembly ...

Page 73: ...nk Conflicts No bank conflicts occur Interruptibility The code is fully interruptible Load double word instruction is used to simultaneously load four values in a single clock cycle The inner loop is unrolled four times and will always compute a multiple of 4 of nh and nr If nh is not a multiple of 4 the code will fill in zeros to make nh a multiple of 4 This code yields best performance when the ...

Page 74: ...r Number of samples to calculate Must be multiple of 4 Description Computes a real FIR filter direct form using coefficients stored in vector h The real data input is stored in vector x The filter output result is stored in vector r This FIR operates on 16 bit data with a 32 bit accumulate The filter calculates nr output samples using nh coefficients Algorithm This is the C equivalent of the assem...

Page 75: ... be a multiple of 4 and greater than or equal to 4 Implementation Notes Bank Conflicts No bank conflicts occur Interruptibility The code is interrupt tolerant but not interruptible The load double word instruction is used to simultaneously load four values in a single clock cycle The inner loop is unrolled four times and will always compute a multiple of 4 output samples Benchmarks Cycles 8 nh nr ...

Page 76: ...vector h The real data input is stored in vector x The filter output result is stored in vector r This FIR operates on 16 bit data with a 32 bit accumulate The filter calculates nr output samples using nh coefficients Algorithm This is the C equivalent of the assembly code without restrictions Note that the assembly code is hand optimized and restrictions may apply void DSP_fir_r8 short x short h ...

Page 77: ...oad double word instruction is used to simultaneously load four values in a single clock cycle The inner loop is unrolled 4 times and will always compute a multiple of 4 output samples The outer loop is conditionally executed in parallel with the inner loop This allows for a zero overhead outer loop Benchmarks Cycles nh nr 4 17 Codesize 336 bytes ...

Page 78: ...of 8 16 nr Number of samples to calculate Must be multiple of 8 8 Description Computes a real FIR filter direct form using coefficients stored in vector h The real data input is stored in vector x The filter output result is stored in vector r This FIR operates on 16 bit data with a 32 bit accumulate The filter calculates nr output samples using nh coefficients Algorithm This is the C equivalent o...

Page 79: ...ble word aligned Implementation Notes Bank Conflicts No bank conflicts occur Interruptibility The code is interruptible The load double word instruction is used to simultaneously load four values in a single clock cycle The inner loop is unrolled 4 times and will always compute a multiple of 4 output samples The outer loop is conditionally executed in parallel with the inner loop This allows for a...

Page 80: ...or Q 15 input data and coefficients Description This function applies a symmetric filter to the input samples The filter tap array h provides nh 1 total filter taps The filter tap at h nh forms the center point of the filter The taps at h nh 1 through h 0 form a symmetric filter about this central tap The effective filter length is thus 2 nh 1 taps The filter is performed on 16 bit data with 16 bi...

Page 81: ... half nh 1 are required nr must be a multiple of 4 x and h must be double word aligned r must be word aligned Implementation Notes Bank Conflicts No bank conflicts occur Interruptibility The code is interruptible The load double word instruction is used to simultaneously load four values in a single clock cycle The inner loop is unrolled eight times Benchmarks Cycles 10 nh 8 15 nr 4 26 Codesize 66...

Page 82: ...ssive filter coefficients h1 0 is not used nr Number of output samples Must be 8 Description The IIR performs an auto regressive moving average ARMA filter with 4 auto regressive filter coefficients and 5 moving average filter coefficients for nr output samples Algorithm This is the C equivalent of the assembly code without restrictions Note that the assembly code is hand optimized and restriction...

Page 83: ...onflicts No bank conflicts occur Interruptibility The code is interrupt tolerant but not interruptible Output array r1 contains nr 4 locations r2 contains nr locations for storing nr output samples The output samples are stored with an offset of 4 into the r1 array The inner loop that iterated through the filter coefficients is completely unrolled Benchmarks Cycles 4 nr 21 Codesize 276 bytes ...

Page 84: ...ilter consists of nk lattice stages Each stage requires one reflection coefficient k and one delay element b The routine takes an input vector x and returns the filter output in r Prior to the first call of the routine the delay elements in b should be set to zero The input data may have to be pre scaled to avoid overflow or achieve better SNR The reflections coefficients lie in the range 1 0 k 1 ...

Page 85: ... Conflicts for avoiding bank conflicts Implementation Notes Bank Conflicts nk should be a multiple of 2 otherwise bank conflicts occur Interruptibility The code is interrupt tolerant but not interruptible Prolog and epilog of the inner loop are partially collapsed and overlapped to reduce outer loop overhead Benchmarks Cycles 2 nk 7 nx 9 without bank conflicts Codesize 352 bytes ...

Page 86: ... 12 return int New value of G Description This routine performs an nx element dot product of x and y and stores it in r It also squares each element of y and accumulates it in G G is passed back to the calling function in register A4 This computation of G is used in the VSELP coder Algorithm This is the C equivalent of the assembly code without restrictions Note that the assembly code is hand opti...

Page 87: ...cial Requirements nx must be a multiple of 4 and greater than or equal to 12 Implementation Notes Bank Conflicts No bank conflicts occur Interruptibility The code is interrupt tolerant but not interruptible Benchmarks Cycles nx 2 21 Codesize 128 ...

Page 88: ...tors and calculates their dot product The inputs are 16 bit short data and the output is a 32 bit number Algorithm This is the C equivalent of the assembly code without restrictions Note that the assembly code is hand optimized and restrictions may apply int DSP_dotprod short x short y int nx int sum int i sum 0 for i 0 i nx i sum x i y i return sum Special Requirements The input length must be a ...

Page 89: ... y are offset by 4 half words 8 bytes Interruptibility The code is fully interruptible The code is unrolled 4 times to enable full memory and multiplier bandwidth to be utilized Interrupts are masked by branch delay slots only Prolog collapsing has been performed to reduce codesize Benchmarks Cycles nx 4 14 Codesize 64 bytes ...

Page 90: ...ector and returns that value Algorithm This is the C equivalent of the assembly code without restrictions Note that the assembly code is hand optimized and restrictions may apply short DSP_maxval short x int nx int i max max 32768 for i 0 i nx i if x i max max x i return max Special Requirements nx is a multiple of 8 and greater than or equal to 32 Implementation Notes Bank Conflicts No bank confl...

Page 91: ...re interleaved throughout the array If values in different columns are equal to the maximum value then the element in the leftmost column is returned If two values within a column are equal to the maximum then the one with the lower index is returned Column takes precedence over index Algorithm This is the C equivalent of the assembly code without restrictions Note that the assembly code is hand o...

Page 92: ... utilized This splits the search into 16 sub ranges The global maximum is then found from the list of maximums of the sub ranges Then using this offset from the sub ranges the global maximum and the index of it are found using a simple match For common maximums in multiple ranges the index will be different to the above C code This code requires 40 bytes of stack space for a temporary buffer Bench...

Page 93: ...uivalent of the assembly code without restrictions Note that the assembly code is hand optimized and restrictions may apply short DSP_minval short x int nx int i min min 32767 for i 0 i nx i if x i min min x i return min Special Requirements nx is a multiple of 4 and greater than or equal to 20 Implementation Notes Bank Conflicts No bank conflicts occur Interruptibility The code is interrupt toler...

Page 94: ... intermediate multiplies are accumulated into a 40 bit long register pair as there could be potential overflow The contribution of the multiplication of the two lower 16 bit halves are not considered The output is in Q 30 format Results are accurate to least significant bit Algorithm In the comments below X and Y are the two input values Xhigh and Xlow represent the upper and lower 16 bits of X Th...

Page 95: ...han or equal to 16 Input and output vectors must be double word aligned Implementation Notes Bank Conflicts No bank conflicts occur Interruptibility The code is interrupt tolerant but not interruptible The MPYHI instruction is used to perform 16 x 32 multiplies to form 48 bit intermediate results Benchmarks Cycles 9 nx 8 18 Codesize 512 bytes ...

Page 96: ...nd output arrays must not be overlapped except for where the input and output pointers are exactly equal Algorithm This is the C equivalent of the assembly code without restrictions Note that the assembly code is hand optimized and restrictions may apply void DSP_neg32 int x int r short nx short i for i nx i 0 i r x Special Requirements nx must be a multiple of 4 and greater than or equal to 8 The...

Page 97: ...ion rfrac is returned in Q 15 format Since the reciprocal is always greater than 1 it returns an exponent such that rfrac i 2rexp i true reciprocal The output is accurate up to the least significant bit of rfrac but note that this bit could carry over and change rexp For a reciprocal of 0 the procedure will return a fractional part of 7FFFh and an exponent of 16 Algorithm This is the C equivalent ...

Page 98: ... neg b b if originally negative negate rfrac b store fraction Special Requirements None Implementation Notes Bank Conflicts No bank conflicts occur Interruptibility The code is interruptible The conditional subtract instruction SUBC is used for division SUBC is used once for every bit of quotient needed 15 Benchmarks Cycles 8 nx 14 Codesize 196 bytes ...

Page 99: ...ivalent of the assembly code without restrictions Note that the assembly code is hand optimized and restrictions may apply int DSP_vecsumsq short x int nx int i sum 0 for i 0 i nx i sum x i x i return sum Special Requirements nx must be a multiple of 4 and greater than or equal to 32 Implementation Notes Bank Conflicts No bank conflicts occur Interruptibility The code is interrupt tolerant but not...

Page 100: ...puts and output are 16 bit numbers Algorithm This is the C equivalent of the assembly code without restrictions Note that the assembly code is hand optimized and restrictions may apply void DSP_w_vec short x short y short m short r short nr short i for i 0 i nr i r i m x i 15 y i Special Requirements nr must be a multiple of 8 and 8 Vectors x and y must be double word aligned Implementation Notes ...

Page 101: ...s x and y The columnar dimension of x must match the row dimension of y The resulting matrix has the same number of rows as x and the same number of columns as y The values stored in the matrices are assumed to be fixed point or integer values All intermediate sums are retained to 32 bit precision and no overflow checking is performed The results are right shifted by a user specified amount and th...

Page 102: ...therwise As a result interrupts can be blocked for up to 0 25 c1 16 cycles at a time The i loop and k loops are unrolled 2x The j loop is unrolled 4x For dimensions that are not multiples of the various loops unroll factors this code calculates extra results beyond the edges of the matrix These extra results are ultimately discarded This allows the loops to be unrolled for efficient operation on l...

Page 103: ... Note that the assembly code is hand optimized and restrictions may apply void DSP_mat_trans short x short rows short columns short r short i j for i 0 i columns i for j 0 j rows j r i rows j x i columns j Special Requirements Rows and columns must be a multiple of 4 Matrices are assumed to have 16 bit elements Implementation Notes Bank Conflicts No bank conflicts occur Interruptibility The code i...

Page 104: ...s of all values in the input vector x and returns the minimum exponent This will be useful in determining the maximum shift value that may be used in scaling a block of data Algorithm This is the C equivalent of the assembly code without restrictions Note that the assembly code is hand optimized and restrictions may apply short DSP_bexp const int x short nx int min_val _norm x 0 short n int i for ...

Page 105: ...bexp 4 77 C64x DSPLIB Reference Implementation Notes Bank Conflicts No bank conflicts occur Interruptibility The code is interrupt tolerant but not interruptible Benchmarks Cycles nx 2 21 Codesize 216 bytes ...

Page 106: ...s reversed This facilitates moving big endian data to a little endian system or vice versa When the r pointer is non NULL the endian swap occurs out of place similar to a block move When the r pointer is NULL the endian swap occurs in place allowing the swap to occur without using any additional storage Algorithm This is the C equivalent of the assembly code without restrictions Note that the asse...

Page 107: ...at the operation occurs in place The input array and output array are expected to be double word aligned and a multiple of 8 half words must be processed Implementation Notes Bank Conflicts No bank conflicts occur Interruptibility The code is interrupt tolerant but not interruptible Benchmarks Cycles nx 8 18 Codesize 104 bytes ...

Page 108: ...rray is reversed This facilitates moving big endian data to a little endian system or vice versa When the r pointer is non NULL the endian swap occurs out of place similar to a block move When the r pointer is NULL the endian swap occurs in place allowing the swap to occur without using any additional storage Algorithm This is the C equivalent of the assembly code without restrictions Note that th...

Page 109: ...o not overlap except where r NULL so that the operation occurs in place The input array and output array are expected to be double word aligned and a multiple of 4 words must be processed Implementation Notes Bank Conflicts No bank conflicts occur Interruptibility The code is interrupt tolerant but not interruptible Benchmarks Cycles nx 4 20 Codesize 116 bytes ...

Page 110: ...ay is reversed This facilitates moving big endian data to a little endian system or vice versa When the r pointer is non NULL the endian swap occurs out of place similar to a block move When the r pointer is NULL the endian swap occurs in place allowing the swap to occur without using any additional storage Algorithm This is the C equivalent of the assembly code without restrictions Note that the ...

Page 111: ...pecial Requirements Input and output arrays do not overlap except when r NULL so that the operation occurs in place The input array and output array are expected to be double word aligned and a multiple of 2 double words must be processed Implementation Notes Bank Conflicts No bank conflicts occur Interruptibility The code is interrupt tolerant but not interruptible Benchmarks Cycles nx 2 20 Codes...

Page 112: ... The source and destination blocks can be overlapped Algorithm This is the C equivalent of the assembly code without restrictions Note that the assembly code is hand optimized and restrictions may apply void DSP_blk_move short x short r int nx int i if r x for I 0 I nx i r i x i else for I nx 1 I 0 i r i x i Special Requirements nx must be a multiple of 8 and 32 Implementation Notes Twin input and...

Page 113: ...point numbers stored in vector x into Q 15 format numbers stored in vector r Results are truncated toward zero Values that exceed the size limit will be saturated to 0x7fff if value is positive and 0x8000 if value is negative All values too small to be correctly represented will be truncated to 0 Algorithm This is the C equivalent of the assembly code without restrictions Note that the assembly co...

Page 114: ...oq15 4 86 Implementation Notes Loop is unrolled twice Bank Conflicts No bank conflicts occur Interruptibility The code is interrupt tolerant but not interruptible Benchmarks Cycles 3 nx 2 14 Codesize 224 bytes ...

Page 115: ...ndex Pointer to GSP0_TABLE max_index found return int Maximum dot product result Algorithm This is the C equivalent of the assembly code without restrictions Note that the assembly code is hand optimized and restrictions may apply int minerr const short restrict GSP0_TABLE const short restrict errCoefs int restrict max_index int val maxVal 50 int i j for i 0 i GSP0_NUM i for val 0 j 0 j GSP0_TERMS...

Page 116: ...s No bank conflicts occur Interruptibility The code is interrupt tolerant but not interruptible The load double word instruction is used to simultaneously load four values in a single clock cycle The inner loop is completely unrolled The outer loop is 4 times unrolled Benchmarks Cycles 256 4 9 17 593 Codesize 352 bytes ...

Page 117: ...rts the values stored in vector x in Q 15 format to IEEE floating point numbers in output vector r Algorithm This is the C equivalent of the assembly code without restrictions Note that the assembly code is hand optimized and restrictions may apply void DSP_q15tofl short x float r int nx int i for i 0 i nx i r i float x i 0x8000 Special Requirements nx must be a multiple of 2 Implementation Notes ...

Page 118: ...ts in vector x nx must be a power of 2 Description This function bit reverses the position of elements in complex vector x This function is used in conjunction with FFT routines to provide the correct format for the FFT input or output data The bit reversal of a bit reversed order array yields a linear order array Algorithm TI retains all rights title and interest in this code and only authorizes ...

Page 119: ...i i 1 nbits nbot nbits 1 ndiff nbits 1 ntop nbot ndiff n2 1 ntop mask n2 1 halfn nx 1 for i0 0 i0 halfn i0 2 b i0 mask a i0 nbot if b ia index a ib index b ibs ib nbot j0 ibs ia t i0 j0 xi0 x i0 xj0 x j0 if t x i0 xj0 x j0 xi0 i1 i0 1 j1 j0 halfn xi1 x i1 xj1 x j1 x i1 xj1 x j1 xi1 i3 i1 halfn j3 j1 1 xi3 x i3 xj3 x j3 ...

Page 120: ...nx 4K you can use the char 8 bit data type for the index variable This requires changing the LDH when loading index values in the assembly routine to LDB This further reduces the size of the Index Table by half Implementation Notes Interruptibility The code is interrupt tolerant but not interruptible Benchmarks The performance of this function has not yet been characterized on the C64x ...

Page 121: ... output sequences Size 2 nx elements w nx Pointer to vector of FFT coefficients of size nx elements Description This routine is used to compute FFT of a complex sequence of size nx a power of 2 with decimation in frequency decomposition method The output is in bit reversed order Each complex value is with interleaved 16 bit real and imaginary parts Algorithm This is the C equivalent of the assembl...

Page 122: ...erent word boundaries to minimize memory bank hits x data is stored in the order real 0 image 0 real 1 The FFT coefficients twiddle factors are generated using the program tw_radix2 provided in the directory support fft Implementation Notes Bank Conflicts See Benchmarks Interruptibility The code is interrupt tolerant but not interruptible Loads input x and coefficient w as words Both loops j and i...

Page 123: ...uences Size 2 nx elements w nx Pointer to vector of FFT coefficients of size nx elements Description This routine is used to compute FFT of a complex sequence size nx a power of 4 with decimation in frequency decomposition method The output is in digit reversed order Each complex value is with interleaved 16 bit real and imaginary parts Algorithm This is the C equivalent of the assembly code witho...

Page 124: ...1 x 2 i0 x 2 i2 r2 x 2 i0 x 2 i2 t x 2 i1 x 2 i3 x 2 i0 r1 t r1 r1 t s1 x 2 i0 1 x 2 i2 1 s2 x 2 i0 1 x 2 i2 1 t x 2 i1 1 x 2 i3 1 x 2 i0 1 s1 t s1 s1 t x 2 i2 r1 co2 s1 si2 15 x 2 i2 1 s1 co2 r1 si2 15 t x 2 i1 1 x 2 i3 1 r1 r2 t r2 r2 t t x 2 i1 x 2 i3 s1 s2 t s2 s2 t x 2 i1 r1 co1 s1 si1 15 x 2 i1 1 s1 co1 r1 si1 15 x 2 i3 r2 co3 s2 si3 ...

Page 125: ...ord boundary to minimize memory bank hits x data is stored in the order real 0 image 0 real 1 The FFT coefficients twiddle factors are generated using the program tw_r4fft provided in the directory support fft Implementation Notes Bank Conflicts See Benchmarks Interruptibility The code is interrupt tolerant but not interruptible Loads input x and coefficient w as words Both loops j and i0 shown in...

Page 126: ... FFT factors and memory accesses to improve performance in the presence of cache Algorithm This is the C equivalent of the assembly code without restrictions Note that the assembly code is hand optimized and restrictions may apply The following macro is used to obtain a digit reversed index of a given number i into j where the number of bits in i is m For the natural form of C code this is done by...

Page 127: ...0FFFF 16 _ 0x0000FFFF 16 j _ m while 0 endif void fft_cn const short restrict w int n short restrict x short restrict y int stride i j k t s m short xh0 xh1 xh20 xh21 short xl0 xl1 xl20 xl21 short xt0 yt0 xt1 yt1 short xt2 yt2 xt3 yt3 Inform the compiler that the input array x twiddle factor array w and output array y are double word aligned In addition the number of points to be transformed is as...

Page 128: ... inputs of the radix 4 butterfly The C code breaks the FFT into two cases one when the stride between the elements is greater than 4 other when the stride is less than 4 Since stride is greater than 16 it can be guaranteed that s is greater than or equal to 4 In addition it can also be shown that the loop that shares this stride will iterate at least once The number of times this loop iterates dep...

Page 129: ... i j k s 1 x2r x 2 i j k 2 s 0 x2i x 2 i j k 2 s 1 x3r x 2 i j k 3 s 0 x3i x 2 i j k 3 s 1 Read the six twiddle factors that are needed for 3 of the four outputs The first output has no mpys w1s w t 2 k 6 j 0 w1c w t 2 k 6 j 1 w2s w t 2 k 6 j 4 w2c w t 2 k 6 j 5 w3s w t 2 k 6 j 8 w3c w t 2 k 6 j 9 Calculate the four outputs remembering that radix4 FFT accepts 4 inputs and produces 4 outputs If we ...

Page 130: ...he underlying DIF structure similar to DSP_radix2 as follows X 4k x n x n N 2 x n N 4 x n 3N 4 X 4k 1 x n x n N 2 j x n N 4 x n 3N 4 x 4k 2 x n x n N 2 x n N 4 x n 3N 4 X 4k 3 x n x n N 2 j x n N 4 x n 3N 4 which leads to the real and imaginary values as foll y0r x0r x2r x1r x3r xh0 xh20 y0i x0i x2i x1i x3i xh1 xh21 y1r x0r x2r x1i x3i xl0 xl21 y1i x0i x2i x1r x3r xl1 xl20 y2r x0r x2r x1r x3r xh0 ...

Page 131: ... for a normal FFT are C j S Since the factors that are stored are C j S this is corrected for in the multiplies Y1 xt1 jyt1 c js xc ys yc xs y0r xt0 y0i yt0 y1r xt1 w1c yt1 w1s 15 y1i yt1 w1c xt1 w1s 15 y2r xt2 w2c yt2 w2s 15 y2i yt2 w2c xt2 w2s 15 y3r xt3 w3c yt3 w3s 15 y3i yt3 w3c xt3 w3s 15 Store the final results back to the input array x 2 i j k 0 y0r x 2 i j k 1 y0i x 2 i j k s 0 y1r x 2 i j...

Page 132: ...ne of the nice features of this last stage are that no multiplies are required In addition the data always strides by a fixed amount namely 8 elements Since the data is stored as interleaved pairs of real and imaginary data the first eight elements contain the data for the first four complex inputs These are the inputs to the first radix4 butterfly ifndef NOASSUME pragma MUST_ITERATE 4 4 endif for...

Page 133: ...i x0i x2i x1i x3i y3r x0r x2r x1i x3i y3i x0i x2i x1r x3r Digit reverse our address to convert the digit reversed input into a linearized output order This actually results in a digit reversed store pattern since we re loading linearly but the end result is that the FFT bins are in linear order DIG_REV i m j Note Result is assigned to j by the macro Store out the final FFT results y 2 j 0 0 y0r y ...

Page 134: ... factors must be double word aligned and are generated using the program tw_fft16x16 provided in the directory support fft Implementation Notes Bank Conflicts No bank conflicts occur Interruptibility The code is interrupt tolerant but not interruptible Loads input x and coefficient w as double words Both loops j and i0 shown in the C code are placed in the inner loop of the assembly code Benchmark...

Page 135: ...e in the presence of cache Algorithm This is the C equivalent of the assembly code without restrictions Note that the assembly code is hand optimized and restrictions may apply The following macro is used to obtain a digit reversed index of a given number i into j where the number of bits in i is m For the natural form of C code this is done by first interchanging every set of 2 bit pairs followed...

Page 136: ..._0 yt2_0 short xt0_1 yt0_1 xt1_1 yt1_1 xt2_1 yt2_1 short xh0_0 xh1_0 xh20_0 xh21_0 xl0_0 xl1_0 xl20_0 xl21_0 short xh0_1 xh1_1 xh20_1 xh21_1 xl0_1 xl1_1 xl20_1 xl21_1 short x_0 x_1 x_2 x_3 x_l1_0 x_l1_1 x_l1_2 x_l1_3 x_l2_0 x_l2_1 short xh0_2 xh1_2 xl0_2 xl1_2 xh0_3 xh1_3 xl0_3 xl1_3 short x_4 x_5 x_6 x_7 x_l2_2 x_l2_3 x_h2_0 x_h2_1 x_h2_2 x_h2_3 short x_8 x_9 x_a x_b x_c x_d x_e x_f short si10 si...

Page 137: ...et to N For every stride 6 stride twiddle factors are accessed The tw_offset is the offset within the current twiddle factor sub table This is set to zero at the start of the code and is used to obtain the appropriate sub table twiddle pointer by offsetting it with the base pointer ptr_w stride npoints tw_offset 0 fft_jmp 6 stride while stride radix At the start of every iteration of the outer loo...

Page 138: ...utterflies share the twiddle factors These are grouped together On the very first stage there are no butterflies that share the twiddle factor all N 4 butter flies have different factors On the next stage two sets of N 8 butterflies share the same twiddle factor Hence after half the butterflies are performed j the index into the factor array resets to 0 and the twiddle factors are reused When this...

Page 139: ...the successive complex inputs of the butterfly are seperated by a fixed amount known as stride The stride starts out at N 4 and quarters with every stage x_l1_0 x l1 x_l1_1 x l1 1 x_l1_2 x l1 2 x_l1_3 x l1 3 x_l2_0 x l2 x_l2_1 x l2 1 x_l2_2 x l2 2 x_l2_3 x l2 3 x_h2_0 x h2 x_h2_1 x h2 1 x_h2_2 x h2 2 x_h2_3 x h2 3 Two butterflies are evaluated in parallel The following results will be shown for on...

Page 140: ...words are consumed in every iteration The input data pointer increments by 4 Note that within a stage the stride does not change and hence the offsets for the other three legs 0 h2 l1 l2 j 12 x 4 predj j fft_jmp if predj x fft_jmp if predj j 0 These four partial results can be re written to show the underlying DIF structure similar to DSP_radix2 as follows X 4k x n x n N 2 x n N 4 x n 3N 4 X 4k 1 ...

Page 141: ...l0_1 xl21_1 yt2_1 xl1_1 xl20_1 xt2_1 xl0_1 xl21_1 yt1_1 xl1_1 xl20_1 Store out first output of the four outputs of a radix4 butterfly Since two such radix4 butterflies are per formed in parallel there are 2 such 1st outputs x2 0 y0r x2 1 y0i x2 2 y4r x2 3 y4i Perform twiddle factor multiplies of three terms top term does not have any multiplies Note the twiddle factors for a normal FFT are C j S S...

Page 142: ...2_1 si31 xt2_1 15 The following code performs either a standard radix4 pass or a DSP_radix2 pass Two pointers are used to access the input data The input data is read N 4 complex samples apart or N 2 words apart using pointers x0 and x2 This produces out puts that are 0 N 4 N 2 3N 4 for a radix4 FFT and 0 N 8 N 2 3N 8 for radix 2 y0 ptr_y y2 ptr_y int npoints x0 ptr_x x2 ptr_x int npoints 1 if rad...

Page 143: ...oints or a quarter of the complex points have been exhausted to jump to pervent double reversal j 0 for i 0 i npoints i 8 Digit reverse the index starting from 0 The increment to j is either by 4 or 8 DIG_REV j l1 h2 Read in the input data from the first eight locations These are transformed either as a radix4 or as a radix 2 x_0 x0 0 x_1 x0 1 x_2 x0 2 x_3 x0 3 x_4 x0 4 x_5 x0 5 x_6 x0 6 x_7 x0 7 ...

Page 144: ... n31 x_5 x_7 y0 2 h2 n00 y0 2 h2 1 n01 y1 2 h2 n10 y1 2 h2 1 n11 y2 2 h2 n20 y2 2 h2 1 n21 y3 2 h2 n30 y3 2 h2 1 n31 Read in the next eight inputs and perform radix4 or DSP_radix2 decomposition x_8 x2 0 x_9 x2 1 x_a x2 2 x_b x2 3 x_c x2 4 x_d x2 5 x_e x2 6 x_f x2 7 x2 8 xh0_2 x_8 x_c xh1_2 x_9 x_d xl0_2 x_8 x_c xl1_2 x_9 x_d xh0_3 x_a x_e xh1_3 x_b x_f xl0_3 x_a x_e xl1_3 x_b x_f n02 xh0_2 xh0_3 n...

Page 145: ...re read from succesive locations map to y y N 4 y N 2 y 3N 4 in a radix4 scheme y y N 8 y N 2 y 5N 8 y0 2 h2 2 n02 y0 2 h2 3 n03 y1 2 h2 2 n12 y1 2 h2 3 n13 y2 2 h2 2 n22 y2 2 h2 3 n23 y3 2 h2 2 n32 y3 2 h2 3 n33 Increment j by j0 If j equals n0 then increment both x0 and x2 so that double inversion is avoided j j0 if j n0 j n0 x0 int npoints 1 x2 int npoints 1 ...

Page 146: ...ction thus the input data must be scaled by 2log2 nx to completely prevent overflow Implementation Notes Bank Conflicts nx 8 bank conflicts occur Interruptibility The code is interrupt tolerant but not interruptible The routine uses log4 nx 1 stages of radix 4 transform and performs either a radix 2 or radix 4 transform on the last stage depending on nx If nx is a power of 4 then this last stage i...

Page 147: ...ord wide loads and fetch the twiddle factors needed To do this a modified twiddle factor array is created in which the factors WN 4 WN 2 W3N 4 are arranged to be contiguous This eliminates the separation between twiddle factors within a butterfly However this implies that we maintain a redundant version of the twiddle factor array as the loop is traversed from one stage to another Hence the size o...

Page 148: ... 16 yt2_xt1 _add2 xl1_xl0 xl20_xl21 yt1_xt2 _sub2 xl1_xl0 xl20_xl21 Also notice that xt1 yt1 end up on separate words these need to be packed together to take advantage of the packed twiddle factors that have been loaded To achiev this they are re aligned as follows yt1_xt1 _packhl2 yt1_xt2 yt2_xt1 yt2_xt2 _packhl2 yt2_xt1 yt1_xt2 The packed words yt1_xt1 allow the loaded sc twiddle factor to be u...

Page 149: ... Formats This appendix describes performance considerations related to the C64x DSPLIB and provides information about the Q format used by DSPLIB functions Topic Page A 1 Performance Considerations A 2 A 2 Fractional Q Formats A 3 Appendix A ...

Page 150: ...laced in L1 memory Any extra cycles due to placement of code or data in L2 external memory or cache associated effects cache hits or misses are not considered when computing the cycle counts You should also be aware that execution speed in a system is dependent on where the different sections of program and data are located in memory You should account for such differences when trying to explain w...

Page 151: ... as the sign bit in any Q format number Thus in Q 15 format the decimal point is placed immediately to the right of the sign bit The fractional portion to the right of the sign bit is stored in regular two s complement format A 2 1 Q3 12 Format Q 3 12 format places the sign bit after the fourth binary digit from the right and the next 12 bits contain the two s complement fractional component The a...

Page 152: ...cation contains the most significant 15 bits and the sign bit The approximate allowable range of numbers in Q 31 representation is 1 1 and the finest fractional resolution is 2 31 4 66 10 10 Table A 3 Q 31 Low Memory Location Bit Fields Bit 15 14 13 12 3 2 1 0 Value Q15 Q14 Q13 Q12 Q3 Q2 Q1 Q0 Table A 4 Q 31 High Memory Location Bit Fields Bit 15 14 13 12 3 2 1 0 Value S Q30 Q29 Q28 Q19 Q18 Q17 Q1...

Page 153: ...ndix A Software Updates and Customer Support This appendix provides information about software updates and customer support Topic Page B 1 DSPLIB Software Updates B 2 B 2 DSPLIB Customer Support B 2 Appendix B ...

Page 154: ...t enhancements and fixes as they become available You should read the README TXT available in the root directory of every release B 2 DSPLIB Customer Support If you have questions or want to report problems or suggestions regarding the C64x DSPLIB contact Texas Instruments at dsph ti com DSPLIB Software Updates DSPLIB Customer Support ...

Page 155: ...bler substitutes absolute operation codes for symbolic operation codes and absolute or relocatable addresses for symbolic addresses assert To make a digital logic device pin active If the pin is active low then a low voltage on the pin asserts it If the pin is active high then a high voltage asserts it B bit A binary digit either a 0 or 1 big endian An addressing protocol in which bytes are number...

Page 156: ...g and Boolean logic operations as well as the generation of data and program memory addresses The CPU includes the central arithmetic logic unit CALU the multiplier and the auxiliary register arithmetic unit ARAU chip support library CSL The CSL is a set of application programming interfaces APIs consisting of target side DSP code used to configure and control all on chip peripherals clock cycle A...

Page 157: ...ich are discrete or discontinuous electrical impulses so that they can be manipulated direct memory access DMA A mechanism whereby a device other than the host processor contends for and receives mastery of the memory bus so that data transfers can take place independent of the host DMA See direct memory access DMA source The module where the DMA data originates DMA data is read from the DMA sourc...

Page 158: ...orward mixed radix 32 x 32 bit FFT with scaling DSP_fir_cplx Complex FIR filter radix 2 DSP_fir_gen FIR filter general purpose DSP_firlms2 LMS FIR radix 2 DSP_fir_r4 FIR filter radix 4 DSP_fir_r8 FIR filter radix 8 DSP_fir_sym Symmetric FIR filter radix 8 DSP_fltoq15 Float to Q15 conversion DSP_ifft16x32 Complex inverse mixed radix 16 x 32 bit FFT with rounding DSP_ifft32x32 Complex inverse mixed ...

Page 159: ...o and write from off chip memory F fast Fourier transform FFT An efficient method of computing the discrete Fourier transform algorithm which transforms functions between the time domain and the frequency domain fetch packet A contiguous 8 word series of instructions fetched by the CPU and aligned on an 8 word boundary FFT See fast fourier transform flag A binary status indicator whose state indic...

Page 160: ... points to another pointer rather than to the actual data this mode is prohibited in RISC architecture instruction fetch packet A group of up to eight instructions held in memory for execution by the CPU internal interrupt A hardware interrupt caused by an on chip peripheral interrupt A signal sent by hardware or software to a processor requesting attention An interrupt tells the processor to susp...

Page 161: ...from right to left within a word More significant bytes in a word have higher numbered addresses Endian ordering is specific to hardware and is determined at reset See also big endian M maskable interrupt A hardware interrupt that can be enabled or disabled through software memory map A graphical representation of a computer system s memory showing the locations of program space data space reserve...

Page 162: ...e that is used to configure the power down control registers if applicable and to invoke various power down modes R random access memory RAM A type of memory device in which the individual locations can be accessed in any order register A small area of high speed memory located within a processor or electronic device that is used for temporarily storing data or instructions Each register is given ...

Page 163: ...ncreased synchronous dynamic random access memory SDRAM RAM whose contents are refreshed periodically so the data is not lost Transfer of data is at a fixed rate relative to the clock speed of the device syntax The grammatical and structural rules of a language All higher level programming languages possess a formal syntax system software The blanketing term used to denote collectively the chip su...

Page 164: ...C 10 ...

Page 165: ... clock modes defined C 2 code defined C 2 coder decoder defined C 2 compiler defined C 2 compress and expand compand defined C 3 control register defined C 3 control register file defined C 3 correlation functions 3 4 DSPLIB reference 4 4 CSL defined C 3 customer support B 2 D data types DSPLIB table 2 3 device ID defined C 3 digital signal processor DSP defined C 3 direct memory access DMA define...

Page 166: ...fltoq15 defined C 4 DSPLIB reference 4 85 DSP_ifft16x32 defined C 4 DSPLIB reference 4 30 4 32 4 34 DSP_ifft32x32 defined C 4 DSPLIB reference 4 36 DSP_iir defined C 4 DSPLIB reference 4 54 DSP_iirlat DSPLIB reference 4 56 DSP_lat_fwd DSPLIB reference 4 56 DSP_mat_trans defined C 4 DSPLIB reference 4 75 DSP_maxidx defined C 4 DSPLIB reference 4 63 DSP_maxval defined C 4 DSPLIB reference 4 62 DSP_m...

Page 167: ...ting how DSPLIB is tested 2 4 using DSPLIB 2 3 DSPLIB reference adaptive filtering functions 4 2 correlation functions 4 4 DSP_autocor 4 4 4 6 DSP_bexp 4 76 DSP_bitrev_cplx 4 90 DSP_blk_move 4 78 4 80 4 82 4 84 DSP_dotp_sqr 4 58 DSP_dotprod 4 60 DSP_fft 4 98 DSP_fft16x16r 4 14 DSP_fft16x16t 4 8 4 11 4 107 DSP_fft16x32 4 24 DSP_fft32x32 4 26 DSP_fft32x32s 4 28 DSP_fir_cplx 4 38 4 40 DSP_fir_gen 4 4...

Page 168: ...errupt service fetch packet ISFP defined C 6 interrupt service routine ISR defined C 6 interrupt service table IST defined C 7 IST defined C 7 L least significant bit LSB defined C 7 lib directory 2 2 linker defined C 7 little endian defined C 7 M maskable interrupt defined C 7 math functions 3 6 DSPLIB reference 4 58 matrix functions 3 6 DSPLIB reference 4 73 memory map defined C 7 memory mapped ...

Page 169: ...defined C 8 register defined C 8 reset defined C 9 routines DSPLIB functional categories 1 2 RTOS defined C 9 S service layer defined C 9 software updates B 2 STDINC module defined C 9 synchronous burst static random access memory SBSRAM defined C 9 synchronous dynamic random access memory SDRAM defined C 9 syntax defined C 9 system software defined C 9 T tag defined C 9 testing how DSPLIB is test...

Reviews: