SPRAA56 

DSP/BIOS Real-Time Analysis (RTA) and Debugging Applied to a Video Application 

The low-resolution CLK_getltime API is used instead of the high-resolution CLK_gethtime 
because the latency is known to be on the order of one or more frame times, where 
a frame time is 33.33 ms in NTSC systems. The low-resolution timing measurement provided by 
CLK_getltime takes fewer cycles and reports time in milliseconds. Because the data is displayed in 
milliseconds, the lower-resolution time base gives a faster measurement with sufficient 
accuracy for the latency benchmark. 
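As a rough illustration of why millisecond resolution is sufficient here, the following sketch converts two low-resolution time readings into an elapsed time and then into a whole number of frame times. The helper names (elapsedMs, latencyInFrames) are hypothetical and are not part of DSP/BIOS or of the example application. The sketch assumes the readings are already in milliseconds, as described above.

```c
#include <assert.h>
#include <stdint.h>

/* NTSC frame time in microseconds (33.33 ms, from the text). */
#define NTSC_FRAME_US 33333u

/* Hypothetical helper: elapsed milliseconds between two readings.
 * Unsigned 32-bit subtraction stays correct across one counter wrap. */
uint32_t elapsedMs(uint32_t startMs, uint32_t endMs)
{
    return (uint32_t)(endMs - startMs);
}

/* Hypothetical helper: latency rounded to the nearest number of frame
 * times. Intended for latencies of a few frames; very large inputs
 * (above about 4,000,000 ms) would overflow the 32-bit product. */
uint32_t latencyInFrames(uint32_t latencyMs, uint32_t frameTimeUs)
{
    assert(frameTimeUs > 0u);
    return (latencyMs * 1000u + frameTimeUs / 2u) / frameTimeUs;
}
```

With a 1 ms time base, the worst-case quantization error is about 1 ms, which is small compared with a latency of two or three 33.33 ms frame times.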

The corresponding code in the video output task finishes the benchmark once the frame has 
propagated through the system: 

 

if (!benchCapVid.captodisplay.done) { // benchVideoDisRta.captodisplay 
  benchCapVid.captodisplay.latency
         = CLK_getltime() - benchCapVid.captodisplay.latency;
  // current time - last captured frame timestamp = latency
  UTL_logDebug2("Latency = %d [ms], for frame %d ", benchCapVid.captodisplay.latency,
         benchCapVid.captodisplay.frameNum);
  benchCapVid.captodisplay.done = 1;
}

Note that this measurement does not include the latency introduced by the capture and display 
drivers. Similar techniques could be applied, using the UTL or STS APIs, to measure the driver 
latency. However, this would require modifying and rebuilding the driver, which is outside the 
scope of this application note. To obtain the total input-to-output latency, add the driver 
latencies to the benchmark reported here. 
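The start/finish pattern behind this benchmark can be sketched in isolation as follows. This is a minimal sketch, not the application's code: the LatencyBench structure and the benchStart, benchFinish, and totalLatencyMs functions are hypothetical stand-ins, the capture-side behavior is an assumption that mirrors the display-side listing, and the driver latencies are values the caller must supply from its own estimates or measurements.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the benchmark fields used in the listing. */
typedef struct {
    uint32_t latency;  /* start timestamp, then the measured latency (ms) */
    uint32_t frameNum; /* frame being tracked */
    int      done;     /* 1 when no measurement is pending */
} LatencyBench;

/* Capture side (assumed): stamp the tracked frame with the current time. */
void benchStart(LatencyBench *b, uint32_t nowMs, uint32_t frameNum)
{
    assert(b != NULL);
    b->latency  = nowMs;
    b->frameNum = frameNum;
    b->done     = 0;
}

/* Display side: finish each started measurement exactly once.
 * Returns 1 if a latency was recorded, 0 if nothing was pending. */
int benchFinish(LatencyBench *b, uint32_t nowMs)
{
    assert(b != NULL);
    if (b->done) {
        return 0;
    }
    b->latency = nowMs - b->latency; /* current time - capture timestamp */
    b->done    = 1;
    return 1;
}

/* Total input-to-output latency: benchmark plus caller-supplied
 * capture and display driver latencies, all in milliseconds. */
uint32_t totalLatencyMs(const LatencyBench *b,
                        uint32_t captureDriverMs,
                        uint32_t displayDriverMs)
{
    assert(b != NULL);
    return b->latency + captureDriverMs + displayDriverMs;
}
```

The done flag prevents the display side from overwriting a completed measurement when it runs again before the capture side has started a new one.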

4.4  Measuring the Frame Rate 

Frame rate is the rate, in frames per second or Hz, of the capture, processing, or display of 
video frames by the system. In video systems it is possible for the display frame rate to exceed 
the capture and/or processing frame rate, so it is often important to measure it separately for the 
capture, processing, and display stages in the data stream.  

In this example application, the actual frame rate is measured at each stage, and user control of 
the frame rate is provided for the processing stage. 

During periods of peak CPU loading, the processing rate of the DSP can fall below the display 
rate of the output device, resulting in dropped frames. Dropped frames are frames that were 
received during capture or decode but not displayed, or frames that were captured but not 
encoded. Frame dropping can occur when the CPU is overloaded by the processing required for 
real-time encoding or decoding. 

The VPORT display driver from the DDK is written to handle this condition gracefully. If a new 
frame is not received from the application in time for the video port to display it, the device driver 
continues to show the previously displayed frame. With high-motion video, this condition can 
sometimes result in noticeable “jerkiness”. At other times, dropped frames can be difficult to 
detect or quantify, so a method of detecting dropped frames is useful during development, 
debugging, and demonstrations. A method for detecting dropped frames is implemented in this 
application using the UTL and CLK services. 

The following code from the tskProcess function measures the number of dropped frames by 
subtracting the reference time from the actual time required to capture 30 frames (NTSC) or 
25 frames (PAL). In either case, the reference time should be approximately 1 second. 
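The arithmetic behind this measurement can be sketched as follows. This is not the application's listing. The function name estimateDroppedFrames is hypothetical, and the step that converts excess time into a frame count (dividing by the nominal frame period) is an assumption made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Reference time for 30 NTSC frames or 25 PAL frames, in milliseconds. */
#define REFERENCE_MS 1000u

/* Hypothetical helper: estimate how many frames were dropped while one
 * second's worth of frames was being collected.
 *   actualMs        - measured time for 30 (NTSC) or 25 (PAL) frames
 *   framesPerSecond - nominal rate: 30 for NTSC, 25 for PAL
 * Assumption: each dropped frame adds one nominal frame period
 * (1000 / framesPerSecond ms) to the measured time. */
uint32_t estimateDroppedFrames(uint32_t actualMs, uint32_t framesPerSecond)
{
    uint32_t excessMs;

    assert(framesPerSecond > 0u);
    if (actualMs <= REFERENCE_MS) {
        return 0u; /* on time or early: nothing dropped */
    }
    excessMs = actualMs - REFERENCE_MS; /* actual time - reference time */

    /* excess / frame period, rounded to the nearest whole frame */
    return (excessMs * framesPerSecond + REFERENCE_MS / 2u) / REFERENCE_MS;
}
```

For example, if 30 NTSC frames take 1100 ms instead of 1000 ms, the 100 ms excess corresponds to about three frame periods of 33.33 ms.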

 
