Profiling your Applications using the Linux Perf Tools

Kevin Funk, KDAB
emBO++ Conf 2017, Bochum

Agenda
Perf setup
Benchmarking
Profiling
More Topics?

Intermission: Who am I?
Software Engineering Consultant at KDAB since 2010
FOSS enthusiast working on Qt/C++ at KDE since 2006
Lead developer of the KDevelop IDE
mainly on the C/C++ support backed by Clang
as well as cross-platform support

Setup
Hardware
Linux Kernel prerequisites
Building user-space perf
Cross-compiling
Permissions

Hardware
Hardware performance counters
Working PMU

Linux Kernel Prerequisites
$ uname -r # should be at least 3.7
4.7.1-1-ARCH
$ zgrep PERF /proc/config.gz
CONFIG_HAVE_PERF_EVENTS=y
CONFIG_PERF_EVENTS=y
CONFIG_HAVE_PERF_USER_STACK_DUMP=y
CONFIG_HAVE_PERF_REGS=y
...
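The two checks above can also be scripted. The following is a minimal sketch, assuming the output of `uname -r` and the kernel config are already available as strings; the helper names and the choice of which options count as required are illustrative assumptions, not part of perf.

```python
import re

# Assumption: these two options are treated as the hard requirements here.
REQUIRED_OPTIONS = ("CONFIG_PERF_EVENTS", "CONFIG_HAVE_PERF_EVENTS")


def kernel_at_least(release, minimum=(3, 7)):
    """Return True if a `uname -r` string such as '4.7.1-1-ARCH' is >= minimum."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return (int(match.group(1)), int(match.group(2))) >= minimum


def missing_options(config_text, required=REQUIRED_OPTIONS):
    """Return the required options that are not set to 'y' in the config text."""
    enabled = set()
    for line in config_text.splitlines():
        name, _, value = line.strip().partition("=")
        if value == "y":
            enabled.add(name)
    return [opt for opt in required if opt not in enabled]


if __name__ == "__main__":
    sample_config = "\n".join([
        "CONFIG_HAVE_PERF_EVENTS=y",
        "CONFIG_PERF_EVENTS=y",
        "CONFIG_HAVE_PERF_USER_STACK_DUMP=y",
        "CONFIG_HAVE_PERF_REGS=y",
    ])
    print(kernel_at_least("4.7.1-1-ARCH"))
    print(missing_options(sample_config))
```

On a real system, the config text could come from decompressing /proc/config.gz, provided the kernel exposes it.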

Building User-space perf
git clone https://github.com/torvalds/linux.git
cd linux/tools/perf
export CC=gcc # clang is not supported
make

Dependencies
Auto-detecting system features:
... dwarf: [ on ] # for symbol resolution
... dwarf_getlocations: [ on ] # for symbol resolution
... glibc: [ on ]
... gtk2: [ on ]
... libaudit: [ on ] # for syscall tracing
... libbfd: [ on ] # for symbol resolution
... libelf: [ on ] # for symbol resolution
... libnuma: [ on ]
... numa_num_possible_cpus: [ on ]
... libperl: [ on ] # for perl bindings
... libpython: [ on ] # for python bindings
... libslang: [ on ] # for TUI
... libcrypto: [ on ] # for JITed probe points
... libunwind: [ on ] # for unwinding
... libdw-dwarf-unwind: [ on ] # for unwinding
... zlib: [ on ]
... lzma: [ on ]
... get_cpuid: [ on ]
... bpf: [ on ]

Cross-compiling
make prefix=somepath ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu-
Common pitfalls:
CC must not contain any flags
CFLAGS is ignored, use EXTRA_CFLAGS
prefix path ignored for include and library paths
Dependency issues:
linux/tools/build/feature/test-$FEATURE.make.output

Permissions
#!/bin/bash
sudo mount -o remount,mode=755 /sys/kernel/debug
sudo mount -o remount,mode=755 /sys/kernel/debug/tracing
echo "0" | sudo tee /proc/sys/kernel/kptr_restrict
echo "-1" | sudo tee /proc/sys/kernel/perf_event_paranoid
sudo chown root:tracing /sys/kernel/debug/tracing/uprobe_events
sudo chmod g+rw /sys/kernel/debug/tracing/uprobe_events
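To see why the script sets perf_event_paranoid to -1, it can help to spell out what each level restricts for unprivileged users. This is a rough sketch: the function name is hypothetical, and the level descriptions are summarised from the kernel's sysctl documentation as I understand it, so check the documentation for your kernel version.

```python
def paranoid_restrictions(level):
    """List what an unprivileged user may not do at a given perf_event_paranoid level.

    Restrictions accumulate: each higher level keeps the lower ones.
    """
    restrictions = []
    if level >= 0:
        restrictions.append("raw tracepoint access")
    if level >= 1:
        restrictions.append("CPU event access")
    if level >= 2:
        restrictions.append("kernel profiling")
    return restrictions


if __name__ == "__main__":
    try:
        with open("/proc/sys/kernel/perf_event_paranoid") as f:
            level = int(f.read().strip())
    except (OSError, ValueError):
        # Assumption: fall back to a restrictive value if the file is unreadable.
        level = 2
    print(level, paranoid_restrictions(level))
```

With -1, as set in the script above, the list is empty. That is convenient on a development machine but may be too permissive on a shared one.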

Benchmarking
Be scientific!
Take variance into account
Compare before/after measurements
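A before/after comparison only means something if the difference is larger than the run-to-run noise. The sketch below illustrates that idea with made-up timings; the "difference exceeds the sum of both standard deviations" rule is a deliberately crude heuristic chosen for illustration, not a proper statistical test.

```python
import statistics


def summarize(samples):
    """Return (mean, sample standard deviation) of repeated measurements."""
    return statistics.mean(samples), statistics.stdev(samples)


def compare(before, after):
    """Return (relative change of the mean, whether it exceeds the noise).

    Crude heuristic (an assumption of this sketch): a change counts as real
    only if the means differ by more than the sum of both standard deviations.
    """
    mean_before, sd_before = summarize(before)
    mean_after, sd_after = summarize(after)
    change = (mean_after - mean_before) / mean_before
    exceeds_noise = abs(mean_after - mean_before) > (sd_before + sd_after)
    return change, exceeds_noise


if __name__ == "__main__":
    # Hypothetical wall-clock times in seconds, five runs each.
    before = [0.81, 0.80, 0.83, 0.79, 0.82]
    after = [0.70, 0.71, 0.69, 0.72, 0.70]
    change, exceeds_noise = compare(before, after)
    print(f"change: {change:+.1%}, exceeds noise: {exceeds_noise}")
```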

perf stat
$ perf stat -r 5 -o baseline.txt -- ./ex_branches
$ cat baseline.txt
Performance counter stats for './ex_branches' (5 runs):
807.951072 task-clock:u (msec) # 0.999 CPUs utilized ( +- 1.97% )
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
520 page-faults:u # 0.643 K/sec ( +- 0.15% )
2,487,366,239 cycles:u # 3.079 GHz ( +- 1.97% )
1,484,737,283 instructions:u # 0.60 insn per cycle ( +- 0.00% )
329,602,843 branches:u # 407.949 M/sec ( +- 0.00% )
80,476,858 branch-misses:u # 24.42% of all branches ( +- 0.06% )
0.808952447 seconds time elapsed ( +- 1.97% )
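The derived figures in the right-hand column follow directly from the raw counts. As a small sketch, assuming the text layout shown above (the parser and its regular expression are illustrative and not a stable interface of perf), the instructions-per-cycle and branch-miss ratios can be recomputed like this:

```python
import re

SAMPLE = """\
2,487,366,239 cycles:u # 3.079 GHz ( +- 1.97% )
1,484,737,283 instructions:u # 0.60 insn per cycle ( +- 0.00% )
329,602,843 branches:u # 407.949 M/sec ( +- 0.00% )
80,476,858 branch-misses:u # 24.42% of all branches ( +- 0.06% )
0.808952447 seconds time elapsed ( +- 1.97% )
"""


def parse_counters(text):
    """Map event name -> integer count for lines like '2,487,366,239 cycles:u # ...'.

    Lines whose first field is not an integer count (e.g. elapsed time) are skipped.
    """
    counters = {}
    for line in text.splitlines():
        match = re.match(r"\s*([\d,]+)\s+([\w-]+)(?::\w+)?\s", line)
        if match:
            counters[match.group(2)] = int(match.group(1).replace(",", ""))
    return counters


if __name__ == "__main__":
    c = parse_counters(SAMPLE)
    ipc = c["instructions"] / c["cycles"]
    miss_rate = c["branch-misses"] / c["branches"]
    print(f"{ipc:.2f} insn per cycle, {miss_rate:.2%} branch misses")
```

A low instructions-per-cycle value combined with a branch-miss rate of roughly a quarter suggests that branch misprediction is worth investigating in this example.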

Kernel vs. Userspace
Use event modifiers to separate domains:
$ perf stat -r 5 --event=cycles:{k,u} -- ./ex_qdatetime
Performance counter stats for './ex_qdatetime' (5 runs):
13,337,722 cycles:k ( +- 3.82% )
9,745,474 cycles:u ( +- 1.58% )
0.008018321 seconds time elapsed ( +- 4.02% )
See man perf list for more.

perf list
$ perf list
List of pre-defined events (to be used in -e):
branch-misses [Hardware event]
cache-misses [Hardware event]
cpu-cycles OR cycles [Hardware event]
instructions [Hardware event]
ref-cycles [Hardware event]
...
alignment-faults [Software event]
context-switches OR cs [Software event]
page-faults OR faults [Software event]
...
sched:sched_stat_sleep [Tracepoint event]
sched:sched_stat_iowait [Tracepoint event]
sched:sched_stat_runtime [Tracepoint event]
...
syscalls:sys_enter_futex [Tracepoint event]
syscalls:sys_exit_futex [Tracepoint event]
...

Profiling
CPU profiling
Sleep-time profiling

perf top
System-wide live profiling:
$ perf top
Samples: 12K of event 'cycles:ppp', Event count (approx.): 5456372201
Overhead Shared Object Symbol
13.11% libQt5Core.so.5.7.0 [.] QHashData::nextNode
5.08% libQt5Core.so.5.7.0 [.] operator==
2.90% libQt5Core.so.5.7.0 [.] 0x000000000012f0d1
2.33% libQt5DBus.so.5.7.0 [.] 0x000000000002281f
1.62% libQt5DBus.so.5.7.0 [.] 0x0000000000022810
...

Statistical Profiling
Sampling the call stack is crucial!

Unwinding and Call Stacks
frame pointers (fp)
debug information (dwarf)
Last Branch Record (lbr)

Recommendation
On embedded: enable frame pointers
On the desktop: rely on DWARF
On Intel: play with LBR

perf record
Profile new application and its children:
$ perf record --call-graph dwarf -- ./lab_mandelbrot -b 5
[ perf record: Woken up 256 times to write data ]
[ perf record: Captured and wrote 64.174 MB perf.data (7963 samples) ]

perf record
Attach to running process:
$ perf record --call-graph dwarf --pid $(pidof ...)
# wait for some time, then quit with CTRL + C

perf record
Profile whole system for some time:
$ perf record -a -- sleep 5

perf report
Top-down inclusive cost report:
$ perf report
Children Self Command Shared Object Symbol
- 93.67% 31.76% lab_mandelbrot lab_mandelbrot [.] main
- 72.22% main
+ 28.42% hypot
__hypot_finite
19.87% __muldc3
3.45% __muldc3@plt
2.19% cabs@plt
+ 1.85% QColor::rgb
1.61% QImage::width@plt
1.26% QImage::height@plt
0.97% QColor::fromHsvF
+ 0.90% QApplicationPrivate::init
0.66% QImage::setPixel
+ 21.44% _start
+ 83.34% 0.00% lab_mandelbrot libc-2.24.so [.] __libc_start_main
+ 83.33% 0.00% lab_mandelbrot lab_mandelbrot [.] _start
...
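The Children and Self columns can be understood as two ways of counting the same stack samples: Self credits only the function that was executing, while Children (inclusive cost) credits every function on the stack. The sketch below uses a handful of invented samples to show the idea; it is not how perf itself is implemented.

```python
from collections import Counter


def self_and_children(stacks):
    """Compute {symbol: (children %, self %)} from sampled call stacks.

    Each stack is ordered from outermost caller to the leaf that was running.
    """
    self_cost, children_cost = Counter(), Counter()
    for stack in stacks:
        self_cost[stack[-1]] += 1
        # Count each symbol once per sample, even if it recurses.
        for symbol in set(stack):
            children_cost[symbol] += 1
    total = len(stacks)
    return {
        symbol: (100.0 * children_cost[symbol] / total,
                 100.0 * self_cost[symbol] / total)
        for symbol in children_cost
    }


if __name__ == "__main__":
    # Hypothetical samples, loosely modelled on the report above.
    samples = [
        ["_start", "main"],
        ["_start", "main", "hypot"],
        ["_start", "main", "hypot", "__hypot_finite"],
        ["_start", "main", "__muldc3"],
    ]
    for symbol, (children, own) in sorted(self_and_children(samples).items()):
        print(f"{children:6.2f}% {own:6.2f}%  {symbol}")
```

This is also why _start shows a high Children value but 0.00% Self in the report: it is on almost every stack but is rarely the function actually executing.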

perf report
Bottom-up self cost report:
$ perf report --no-children
Overhead Command Shared Object Symbol
- 31.76% lab_mandelbrot lab_mandelbrot [.] main
- main
- __libc_start_main
_start
- 23.31% lab_mandelbrot libm-2.24.so [.] __hypot_finite
- __hypot_finite
- 22.56% hypot
main
__libc_start_main
_start
- 23.04% lab_mandelbrot libgcc_s.so.1 [.] __muldc3
- __muldc3
+ main
- 5.90% lab_mandelbrot libm-2.24.so [.] hypot
+ hypot
...

perf report
Show file and line numbers:
$ perf report --no-children -s dso,sym,srcline
Overhead Shared Object Symbol Source:Line
- 7.82% lab_mandelbrot [.] main mandelbrot.h:41
+ main
- 7.79% libgcc_s.so.1 [.] __muldc3 libgcc2.c:1945
__muldc3
main
__libc_start_main
_start
- 7.46% lab_mandelbrot [.] main complex:1326
- main
+ __libc_start_main
__muldc3
main
__libc_start_main
_start
...

perf report
Show file and line numbers in backtraces:
$ perf report --no-children -s dso,sym,srcline -g address
Overhead Shared Object Symbol Source:Line
- 7.82% lab_mandelbrot [.] main mandelbrot.h:41
- 2.84% main mandelbrot.h:41
__libc_start_main +241
_start +4194346
2.58% main mandelbrot.h:41
- 2.01% main mandelbrot.h:41
_start +4194346
+ 3.93% __muldc3 libgcc2.c:1945
+ 3.72% __muldc3 libgcc2.c:1945
- 7.46% lab_mandelbrot [.] main complex:1326
- 4.65% main complex:1326
_start +4194346
2.81% main complex:1326
...

perf config
Configure default output format (in ~/.perfconfig):
[report]
children = false
sort_order = dso,sym,srcline
[call-graph]
record-mode = dwarf
print-type = graph
order = caller
sort-key = address
See man perf config for more.

Flame Graphs
perf script report stackcollapse | flamegraph.pl > graph.svg
[Figure: flame graph of lab_mandelbrot. Visible frames: _start, __libc_start_main, main, hypot, __hypot_finite, __muldc3 and a truncated "c.." frame]
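The stack-collapse step turns each sampled call stack into one line of semicolon-separated frames followed by a sample count, which is the "folded" input format that flamegraph.pl reads. Here is a minimal sketch of that folding on invented samples; the function name is hypothetical and this is not the code of the stackcollapse script itself.

```python
from collections import Counter


def fold_stacks(stacks):
    """Collapse identical call stacks into 'frame;frame;frame count' lines.

    Each stack is ordered from outermost caller to leaf.
    """
    counts = Counter(";".join(stack) for stack in stacks)
    return [f"{folded} {count}" for folded, count in sorted(counts.items())]


if __name__ == "__main__":
    # Hypothetical samples for illustration.
    samples = [
        ["_start", "main"],
        ["_start", "main", "hypot"],
        ["_start", "main"],
    ]
    print("\n".join(fold_stacks(samples)))
```

In the rendered graph, the width of each frame is proportional to the number of samples whose stack contains that frame at that position.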

Cross-machine Reporting
When recording machine has symbols available:
# on first machine:
$ perf record ...
$ perf archive
Now please run:
$ tar xvf perf.data.tar.bz2 -C ~/.debug
wherever you need to run 'perf report' on.
# on second machine:
$ rsync machine1:path/to/perf.data{,.tar.bz2} .
$ tar xf perf.data.tar.bz2 -C ~/.debug
$ perf report

Cross-machine Reporting
When reporting machine has symbols available:
# on first machine:
$ perf record ...
# on second machine:
$ rsync machine1:path/to/perf.data .
$ perf report --symfs /path/to/sysroot

Sleep-time Profiling
#!/bin/bash
echo 1 | sudo tee /proc/sys/kernel/sched_schedstats
perf record \
    --event sched:sched_stat_sleep/call-graph=fp/ \
    --event sched:sched_process_exit/call-graph=fp/ \
    --event sched:sched_switch/call-graph=dwarf/ \
    --output perf.data.raw "$@"
echo 0 | sudo tee /proc/sys/kernel/sched_schedstats
perf inject --sched-stat --input perf.data.raw --output perf.data

Sleep-time Profiling
$ perf-sleep-record ./ex_sleep
$ perf report
Samples: 24 of event 'sched:sched_switch', Event count (approx.): 8883195296
Overhead Trace output
- 100.00% ex_sleep:24938 [120] S ==> swapper/7:0 [120]
- 90.07% main main.cpp:10
QThread::sleep +11
0x1521ed
__nanosleep .:0
entry_SYSCALL_64_fastpath entry_64.o:0
sys_nanosleep +18446744071576748154
hrtimer_nanosleep +18446744071576748225
do_nanosleep hrtimer.c:0
schedule +18446744071576748092
__schedule core.c:0
+ 9.02% main main.cpp:11
+ 0.91% main main.cpp:6

perf script
Convert perf.data to callgrind format:
$ perf record --call-graph dwarf ...
$ perf script report callgrind > perf.callgrind
$ kcachegrind perf.callgrind
github.com/milianw/linux/.../callgrind.py

Questions?
kevin.funk@kdab.com
https://www.kdab.com/
We offer trainings and workshops! Debugging and Profiling
More perf work from my colleague:
github.com/milianw/linux/tree/milian/perf
git clone -b milian/perf https://github.com/milianw/linux.git
