Profiling your Applications using the Linux Perf Tools
Kevin Funk / KDAB
emBO++ Conf 2017, Bochum
Agenda
Perf setup
Benchmarking
Profiling
More Topics?
Intermission: Who am I?
Software Engineering Consultant at KDAB since 2010
FOSS enthusiast working on Qt/C++ at KDE since 2006
Lead developer of the KDevelop IDE
mainly on the C/C++ support backed by Clang
as well as cross-platform support
Setup
Hardware
Linux Kernel prerequisites
Building user-space perf
Cross-compiling
Permissions
Hardware
Hardware performance counters
Working PMU
Linux Kernel Prerequisites
$ uname -r # should be at least 3.7
4.7.1-1-ARCH
$ zgrep PERF /proc/config.gz
CONFIG_HAVE_PERF_EVENTS=y
CONFIG_PERF_EVENTS=y
CONFIG_HAVE_PERF_USER_STACK_DUMP=y
CONFIG_HAVE_PERF_REGS=y
...
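A minimal sketch of the same checks as a script. The helper name `kernel_at_least` is made up for illustration, and the fallback to `/boot/config-$(uname -r)` is an assumption about how some distributions ship their kernel config when `/proc/config.gz` is absent (it requires CONFIG_IKCONFIG_PROC).

```shell
#!/bin/sh
# Succeeds if kernel release $1 (e.g. "4.7.1-1-ARCH") is at least $2.$3.
# Only the major and minor components are compared.
kernel_at_least() {
    release=$1 want_major=$2 want_minor=$3
    major=${release%%.*}            # text before the first dot
    rest=${release#*.}              # text after the first dot
    minor=${rest%%[!0-9]*}          # leading digits of the remainder
    [ "$major" -gt "$want_major" ] ||
        { [ "$major" -eq "$want_major" ] && [ "$minor" -ge "$want_minor" ]; }
}

if kernel_at_least "$(uname -r)" 3 7; then
    echo "kernel version OK"
else
    echo "kernel older than 3.7"
fi

# Look for the perf options in whichever kernel config is readable.
if [ -r /proc/config.gz ]; then
    zgrep -E '^CONFIG_(HAVE_)?PERF_EVENTS=' /proc/config.gz ||
        echo "perf events not enabled in /proc/config.gz"
elif [ -r "/boot/config-$(uname -r)" ]; then
    grep -E '^CONFIG_(HAVE_)?PERF_EVENTS=' "/boot/config-$(uname -r)" ||
        echo "perf events not enabled in /boot/config-$(uname -r)"
else
    echo "no readable kernel config found"
fi
```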
Building User-space perf
git clone https://github.com/torvalds/linux.git
cd linux/tools/perf
export CC=gcc # clang is not supported
make
Dependencies
Auto-detecting system features:
... dwarf: [ on ] # for symbol resolution
... dwarf_getlocations: [ on ] # for symbol resolution
... glibc: [ on ]
... gtk2: [ on ]
... libaudit: [ on ] # for syscall tracing
... libbfd: [ on ] # for symbol resolution
... libelf: [ on ] # for symbol resolution
... libnuma: [ on ]
... numa_num_possible_cpus: [ on ]
... libperl: [ on ] # for perl bindings
... libpython: [ on ] # for python bindings
... libslang: [ on ] # for TUI
... libcrypto: [ on ] # for JITed probe points
... libunwind: [ on ] # for unwinding
... libdw-dwarf-unwind: [ on ] # for unwinding
... zlib: [ on ]
... lzma: [ on ]
... get_cpuid: [ on ]
... bpf: [ on ]
Cross-compiling
make prefix=somepath ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu-
Common pitfalls:
CC must not contain any flags
CFLAGS is ignored, use EXTRA_CFLAGS
prefix path ignored for include and library paths
Dependency issues:
linux/tools/build/feature/test-$FEATURE.make.output
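When a feature shows up as "off", the log named above usually says which header or library was missing. A small sketch for printing it; the function name `show_feature_log` and the `linux` checkout path are illustrative, not part of the perf build system.

```shell
#!/bin/sh
# Print the feature-detection build log for one feature.
# $1: path to the kernel source tree
# $2: feature name as printed by the build, e.g. "libunwind"
show_feature_log() {
    log="$1/tools/build/feature/test-$2.make.output"
    if [ -s "$log" ]; then
        cat "$log"
    else
        echo "no log output for feature '$2' (looked at $log)"
    fi
}

# Example: inspect why libunwind was not detected in ./linux
show_feature_log linux libunwind
```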
Permissions
#!/bin/bash
sudo mount -o remount,mode=755 /sys/kernel/debug
sudo mount -o remount,mode=755 /sys/kernel/debug/tracing
echo "0" | sudo tee /proc/sys/kernel/kptr_restrict
echo "-1" | sudo tee /proc/sys/kernel/perf_event_paranoid
sudo chown root:tracing /sys/kernel/debug/tracing/uprobe_events
sudo chmod g+rw /sys/kernel/debug/tracing/uprobe_events
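Before loosening anything, it helps to see what is currently set. The sketch below only reads the sysctl; the level descriptions paraphrase the kernel's sysctl documentation, and the function name `describe_paranoid` is made up for illustration. Values written via /proc do not survive a reboot.

```shell
#!/bin/sh
# Describe what a perf_event_paranoid level means for unprivileged users.
describe_paranoid() {
    case "$1" in
        -1) echo "no restrictions" ;;
        0)  echo "raw tracepoint access is restricted" ;;
        1)  echo "CPU-wide event access is restricted as well" ;;
        2)  echo "kernel profiling is restricted as well (user space only)" ;;
        *)  echo "unknown or distribution-specific level" ;;
    esac
}

if [ -r /proc/sys/kernel/perf_event_paranoid ]; then
    level=$(cat /proc/sys/kernel/perf_event_paranoid)
    echo "perf_event_paranoid=$level: $(describe_paranoid "$level")"
fi
```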
Benchmarking
Be scientific!
Take variance into account
Compare before/after measurements
perf stat
$ perf stat -r 5 -o baseline.txt -- ./ex_branches
$ cat baseline.txt
Performance counter stats for './ex_branches' (5 runs):
807.951072 task-clock:u (msec) # 0.999 CPUs utilized ( +- 1.97% )
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
520 page-faults:u # 0.643 K/sec ( +- 0.15% )
2,487,366,239 cycles:u # 3.079 GHz ( +- 1.97% )
1,484,737,283 instructions:u # 0.60 insn per cycle ( +- 0.00% )
329,602,843 branches:u # 407.949 M/sec ( +- 0.00% )
80,476,858 branch-misses:u # 24.42% of all branches ( +- 0.06% )
0.808952447 seconds time elapsed ( +- 1.97% )
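A before/after comparison can be sketched by pulling the elapsed time out of two `perf stat -o` reports. The function names are illustrative, and the "after" number below is invented purely to demonstrate the arithmetic; only the baseline line comes from the run above. A change smaller than the reported `+-` variance should not be read as a real difference.

```shell
#!/bin/sh
# Extract the mean wall-clock time from a 'perf stat -o FILE' report.
elapsed_seconds() {
    awk '/seconds time elapsed/ { print $1; exit }' "$1"
}

# Print the relative change of AFTER ($2) versus BEFORE ($1) in percent.
# Negative values mean the second measurement was faster.
relative_change() {
    awk -v before="$(elapsed_seconds "$1")" -v after="$(elapsed_seconds "$2")" \
        'BEGIN { printf "%+.2f%%\n", (after - before) / before * 100 }'
}

# Demo with two small report files in a temporary directory.
tmpdir=$(mktemp -d)
cat > "$tmpdir/baseline.txt" <<'EOF'
       0.808952447 seconds time elapsed   ( +-  1.97% )
EOF
cat > "$tmpdir/optimized.txt" <<'EOF'
       0.606714335 seconds time elapsed   ( +-  1.50% )
EOF
relative_change "$tmpdir/baseline.txt" "$tmpdir/optimized.txt"
rm -rf "$tmpdir"
```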
Kernel vs. Userspace
Use event modifiers to separate domains:
$ perf stat -r 5 --event=cycles:{k,u} -- ./ex_qdatetime
Performance counter stats for './ex_qdatetime' (5 runs):
13,337,722 cycles:k ( +- 3.82% )
9,745,474 cycles:u ( +- 1.58% )
0.008018321 seconds time elapsed ( +- 4.02% )
See man perf list for more.
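Note that `cycles:{k,u}` relies on bash brace expansion: the shell turns it into `--event=cycles:k --event=cycles:u` before perf sees it. To read the two counts as a ratio, a small helper like the following can be used; `kernel_share` is an illustrative name, and the inputs are the counts from the run above.

```shell
#!/bin/sh
# Share of cycles spent in the kernel, given the cycles:k and cycles:u counts.
# Thousands separators as printed by perf are stripped first.
kernel_share() {
    k=$(printf '%s' "$1" | tr -d ,)
    u=$(printf '%s' "$2" | tr -d ,)
    awk -v k="$k" -v u="$u" 'BEGIN { printf "%.1f%%\n", k / (k + u) * 100 }'
}

kernel_share 13,337,722 9,745,474   # prints 57.8%
```

For this tiny example program more than half of the cycles are spent in the kernel, which is the kind of split the modifiers are meant to reveal.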
perf list
$ perf list
List of pre-defined events (to be used in -e):
branch-misses [Hardware event]
cache-misses [Hardware event]
cpu-cycles OR cycles [Hardware event]
instructions [Hardware event]
ref-cycles [Hardware event]
...
alignment-faults [Software event]
context-switches OR cs [Software event]
page-faults OR faults [Software event]
...
sched:sched_stat_sleep [Tracepoint event]
sched:sched_stat_iowait [Tracepoint event]
sched:sched_stat_runtime [Tracepoint event]
...
syscalls:sys_enter_futex [Tracepoint event]
syscalls:sys_exit_futex [Tracepoint event]
...
Profiling
CPU profiling
Sleep-time profiling
perf top
System-wide live profiling:
$ perf top
Samples: 12K of event 'cycles:ppp', Event count (approx.): 5456372201
Overhead Shared Object Symbol
13.11% libQt5Core.so.5.7.0 [.] QHashData::nextNode
5.08% libQt5Core.so.5.7.0 [.] operator==
2.90% libQt5Core.so.5.7.0 [.] 0x000000000012f0d1
2.33% libQt5DBus.so.5.7.0 [.] 0x000000000002281f
1.62% libQt5DBus.so.5.7.0 [.] 0x0000000000022810
...
Statistical Profiling
Sampling the call stack is crucial!
Unwinding and Call Stacks
frame pointers (fp)
debug information (dwarf)
Last Branch Record (lbr)
Recommendation
On embedded: enable frame pointers
On the desktop: rely on DWARF
On Intel: play with LBR
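The three unwinders map to the `fp`, `dwarf` and `lbr` arguments of `perf record --call-graph`. The sketch below encodes the recommendation as a lookup; the function name, the target labels and `./your_app` are placeholders, not anything perf defines.

```shell
#!/bin/sh
# Map a target class to a --call-graph mode, following the recommendation above.
call_graph_mode() {
    case "$1" in
        embedded) echo fp ;;
        desktop)  echo dwarf ;;
        intel)    echo lbr ;;
        *)        echo "unknown target: $1" >&2; return 1 ;;
    esac
}

# Frame-pointer unwinding only works if the code keeps frame pointers,
# e.g. when built with GCC or Clang using -fno-omit-frame-pointer.
echo "perf record --call-graph $(call_graph_mode embedded) -- ./your_app"
```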
perf record
Profile new application and its children:
$ perf record --call-graph dwarf -- ./lab_mandelbrot -b 5
[ perf record: Woken up 256 times to write data ]
[ perf record: Captured and wrote 64.174 MB perf.data (7963 samples) ]
perf record
Attach to running process:
$ perf record --call-graph dwarf --pid $(pidof ...)
# wait for some time, then quit with CTRL + C
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 3.904 MB perf.data (70 samples) ]
perf record
Profile whole system for some time:
$ perf record -a -- sleep 5
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 1.498 MB perf.data (2731 samples) ]
perf report
Top-down inclusive cost report:
$ perf report
Samples: 8K of event 'cycles:ppp', Event count (approx.): 8164367769
Children Self Command Shared Object Symbol
- 93.67% 31.76% lab_mandelbrot lab_mandelbrot [.] main
- 72.22% main
+ 28.42% hypot
__hypot_finite
19.87% __muldc3
3.45% __muldc3@plt
2.19% cabs@plt
+ 1.85% QColor::rgb
1.61% QImage::width@plt
1.26% QImage::height@plt
0.97% QColor::fromHsvF
+ 0.90% QApplicationPrivate::init
0.66% QImage::setPixel
+ 21.44% _start
+ 83.34% 0.00% lab_mandelbrot libc-2.24.so [.] __libc_start_main
+ 83.33% 0.00% lab_mandelbrot lab_mandelbrot [.] _start
...
perf report
Bottom-up self cost report:
$ perf report --no-children
Samples: 8K of event 'cycles:ppp', Event count (approx.): 8164367769
Overhead Command Shared Object Symbol
- 31.76% lab_mandelbrot lab_mandelbrot [.] main
- main
- __libc_start_main
_start
- 23.31% lab_mandelbrot libm-2.24.so [.] __hypot_finite
- __hypot_finite
- 22.56% hypot
main
__libc_start_main
_start
- 23.04% lab_mandelbrot libgcc_s.so.1 [.] __muldc3
- __muldc3
+ main
- 5.90% lab_mandelbrot libm-2.24.so [.] hypot
+ hypot
...
perf report
Show file and line numbers:
$ perf report --no-children -s dso,sym,srcline
Samples: 8K of event 'cycles:ppp', Event count (approx.): 8164367769
Overhead Shared Object Symbol Source:Line
- 7.82% lab_mandelbrot [.] main mandelbrot.h:41
+ main
- 7.79% libgcc_s.so.1 [.] __muldc3 libgcc2.c:1945
__muldc3
main
__libc_start_main
_start
- 7.46% lab_mandelbrot [.] main complex:1326
- main
+ __libc_start_main
- 6.94% libgcc_s.so.1 [.] __muldc3 libgcc2.c:1944
__muldc3
main
__libc_start_main
_start
...
perf report
Show file and line numbers in backtraces:
$ perf report --no-children -s dso,sym,srcline -g address
Samples: 8K of event 'cycles:ppp', Event count (approx.): 8164367769
Overhead Shared Object Symbol Source:Line
- 7.82% lab_mandelbrot [.] main mandelbrot.h:41
- 2.84% main mandelbrot.h:41
__libc_start_main +241
_start +4194346
2.58% main mandelbrot.h:41
- 2.01% main mandelbrot.h:41
__libc_start_main +241
_start +4194346
- 7.79% libgcc_s.so.1 [.] __muldc3 libgcc2.c:1945
+ 3.93% __muldc3 libgcc2.c:1945
+ 3.72% __muldc3 libgcc2.c:1945
- 7.46% lab_mandelbrot [.] main complex:1326
- 4.65% main complex:1326
__libc_start_main +241
_start +4194346
2.81% main complex:1326
...
perf config
Configure default output format:
[report]
children = false
sort_order = dso,sym,srcline
[call-graph]
record-mode = dwarf
print-type = graph
order = caller
sort-key = address
man perf config
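perf reads per-user settings from `~/.perfconfig`. A sketch that writes the settings shown above into a file; `write_perfconfig` is an illustrative helper, and the demo writes to a temporary file rather than overwriting an existing `~/.perfconfig`.

```shell
#!/bin/sh
# Write the report and call-graph defaults from the slide into file $1.
write_perfconfig() {
    cat > "$1" <<'EOF'
[report]
children = false
sort_order = dso,sym,srcline
[call-graph]
record-mode = dwarf
print-type = graph
order = caller
sort-key = address
EOF
}

# Demo: write to a scratch file and show the result.
example=$(mktemp)
write_perfconfig "$example"
cat "$example"
rm -f "$example"
```

To apply it for real, pass `"$HOME/.perfconfig"` as the target after backing up any existing file.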
Flame Graphs
perf script report stackcollapse | flamegraph.pl > graph.svg
[Figure: flame graph of lab_mandelbrot. Visible frames: _start, __libc_start_main, main, hypot, __hypot_finite, __muldc3]
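flamegraph.pl consumes "folded" stacks: one line per unique call stack, frames joined by semicolons from root to leaf, followed by a sample count. The toy sketch below shows that format by collapsing three hand-written stacks. It is not the stackcollapse script itself, and it assumes frame names contain no spaces.

```shell
#!/bin/sh
# Collapse identical stacks read from stdin (one semicolon-joined stack per
# line, root first) into "stack count" lines.
fold_stacks() {
    LC_ALL=C sort | uniq -c | awk '{ print $2, $1 }'
}

printf '%s\n' \
    '_start;__libc_start_main;main;hypot;__hypot_finite' \
    '_start;__libc_start_main;main;__muldc3' \
    '_start;__libc_start_main;main;hypot;__hypot_finite' |
    fold_stacks
```

Each output line becomes one tower in the graph, and the count determines its width.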
More topics?
Cross-machine Reporting
When recording machine has symbols available:
# on first machine:
$ perf record ...
$ perf archive
Now please run:
$ tar xvf perf.data.tar.bz2 -C ~/.debug
wherever you need to run 'perf report' on.
# on second machine:
$ rsync machine1:path/to/perf.data{,.tar.bz2} .
$ tar xf perf.data.tar.bz2 -C ~/.debug
$ perf report
Cross-machine Reporting
When reporting machine has symbols available:
# on first machine:
$ perf record ...
# on second machine:
$ rsync machine1:path/to/perf.data .
$ perf report --symfs /path/to/sysroot
Sleep-time Profiling
#!/bin/bash
echo 1 | sudo tee /proc/sys/kernel/sched_schedstats
perf record \
    --event sched:sched_stat_sleep/call-graph=fp/ \
    --event sched:sched_process_exit/call-graph=fp/ \
    --event sched:sched_switch/call-graph=dwarf/ \
    --output perf.data.raw "$@"
echo 0 | sudo tee /proc/sys/kernel/sched_schedstats
perf inject --sched-stat --input perf.data.raw --output perf.data
Sleep-time Profiling
$ perf-sleep-record ./ex_sleep
$ perf report
Samples: 24 of event 'sched:sched_switch', Event count (approx.): 8883195296
Overhead Trace output
- 100.00% ex_sleep:24938 [120] S ==> swapper/7:0 [120]
- 90.07% main main.cpp:10
QThread::sleep +11
0x1521ed
__nanosleep .:0
entry_SYSCALL_64_fastpath entry_64.o:0
sys_nanosleep +18446744071576748154
hrtimer_nanosleep +18446744071576748225
do_nanosleep hrtimer.c:0
schedule +18446744071576748092
__schedule core.c:0
+ 9.02% main main.cpp:11
+ 0.91% main main.cpp:6
perf script
Convert perf.data to callgrind format:
$ perf record --call-graph dwarf ...
$ perf script report callgrind > perf.callgrind
$ kcachegrind perf.callgrind
github.com/milianw/linux/.../callgrind.py
Questions?
kevin.funk@kdab.com
https://www.kdab.com/
We offer trainings and workshops on debugging and profiling!
More perf work from my colleague:
github.com/milianw/linux/tree/milian/perf
git clone -b milian/perf https://github.com/milianw/linux.git
