--mod-filename=<expr> [default: none]
Specifies a Perl search-and-replace expression that is applied to all filenames. Useful for removing minor differences in paths between two different versions of a program that are sitting in different directories.
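For example, if the two versions live in source trees named something like myprog-1.0 and myprog-1.1 (the names here are purely illustrative), an expression along the following lines maps both onto a common path so their results line up:

--mod-filename='s/myprog-1\.[0-9]+/myprog/'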
--mod-funcname=<expr> [default: none]
Like --mod-filename, but for function names. Useful for removing minor differences in the randomized names of auto-generated functions produced by some compilers.
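For example, GCC sometimes appends suffixes such as .isra.3 or .constprop.7 to the names of automatically generated clones of a function, and the numeric part can change from build to build. An expression along the following lines (adjust it to whatever suffixes your compiler actually produces) strips that varying part:

--mod-funcname='s/\.(isra|constprop)\.[0-9]+//'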
5.6. Acting on Cachegrind’s Information
Cachegrind gives you lots of information, but acting on that information isn’t always easy. Here are some rules of thumb that we have found to be useful.
First of all, the global hit/miss counts and miss rates are not that useful. If you have multiple programs, or multiple runs of a program, comparing the numbers can show whether any of them are outliers worthy of closer investigation. Otherwise, they’re not enough to act on.
The function-by-function counts are more useful to look at, as they pinpoint which functions are responsible for large numbers of counts. However, beware that inlining can make these counts misleading. If a function f is always inlined, its counts will be attributed to the functions it is inlined into, rather than to f itself. However, if you look at the line-by-line annotations for f you’ll see the counts that belong to f. (This is hard to avoid; it’s how the debug info is structured.) So it’s worth looking for large numbers in the line-by-line annotations.
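As a small illustration of this effect (the function names are invented for the example), the counts for an inlined helper show up under its caller in the function-by-function table, but still appear against the helper’s own source lines in the annotated listing:

/* helper() is small enough that a compiler will typically inline it at -O2,
   so its instructions are counted under caller() in the function-by-function
   table, but still appear on these source lines in the annotation. */
static inline int helper(int x)
{
    return x * 2 + 1;
}

int caller(int x)
{
    return helper(x) + helper(x + 1);
}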
The line-by-line source code annotations are much more useful. In our experience, the best place to start is by looking at the Ir numbers. They simply measure how many instructions were executed for each line, and don’t include any cache information, but they can still be very useful for identifying bottlenecks.
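For example, a command along these lines (assuming the default output file name that Cachegrind produces) sorts the annotated output by instruction counts, so the heaviest-hitting entries appear first:

cg_annotate --show=Ir --sort=Ir cachegrind.out.<pid>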
After that, we have found that LL misses are typically a much bigger source of slow-downs than L1 misses. So it’s worth looking for any snippets of code with high DLmr or DLmw counts. (You can use --show=DLmr --sort=DLmr with cg_annotate to focus just on DLmr counts, for example.) If you find any, it’s still not always easy to work out how to improve things. You need to have a reasonable understanding of how caches work, the principles of locality, and your program’s data access patterns. Improving things may require redesigning a data structure, for example.
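As a simple illustration of the kind of issue involved (a standard locality example, not taken from Cachegrind itself), traversing a large two-dimensional array in the order it is laid out in memory uses each cache line fully, whereas striding across it touches a new line on almost every access and shows up as high DLmr counts on those lines:

#define N 1024

static double a[N][N];

/* Column-wise traversal of a row-major array: consecutive accesses are
   N * sizeof(double) bytes apart, so nearly every access can miss in the
   data cache. */
double sum_columnwise(void)
{
    double sum = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];
    return sum;
}

/* Row-wise traversal: consecutive accesses are adjacent in memory, so each
   cache line is loaded once and then fully used. */
double sum_rowwise(void)
{
    double sum = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}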
Looking at the Bcm and Bim misses can also be helpful. In particular, Bim misses are often caused by switch statements, and in some cases these switch statements can be replaced with table-driven code. For example, you might replace code like this:
enum E { A, B, C };
enum E e;
int i;
...
switch (e)
{
    case A: i += 1; break;
    case B: i += 2; break;
    case C: i += 3; break;
}
with code like this:
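(The following is a minimal sketch of one possible table-driven version; the array name table is chosen for this example.)

enum E { A, B, C };
enum E e;
int i;
/* One entry per enumerator: table[A] == 1, table[B] == 2, table[C] == 3. */
static const int table[] = { 1, 2, 3 };
...
i += table[e];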