
Table of Contents
Open Table of Contents
What perf is
perf is the user-space front-end to the Linux perf_event subsystem (the perf_event_open(2) syscall).
It offers a uniform command-line interface to hardware Performance-Monitoring Units (PMU), kernel trace-points, software counters, kprobes/uprobes, and eBPF events, hiding the architectural quirks of each CPU family. It ships in the kernel tree (tools/perf) and is packaged by most distributions as linux-tools-$(uname -r).1
Why you might care
- Pinpoint hot spots — attribute CPU cycles, stalled slots, cache-misses or branch mis-predictions to lines of code without recompiling.
- Measure whole-system behaviour — sample across all tasks (
-a) or within cgroups, profile kernels, containers and virtual machines (perf kvm). - Low overhead, production-safe — sampling incurs < 1–5 % overhead at typical frequencies.
- Rich ecosystem — outputs feed directly into FlameGraph, speedscope, or BPF-based visualizers.
- Always up-to-date — because perf ships with each kernel release, new CPU features (Hybrid/Big-LITTLE core distinctions, Intel LBR call-stacks, AMD IBS, Arm SPE, etc.) appear as soon as your distro updates its kernel.2
Mental model
| Concept | What it means | Example flags |
|---|---|---|
| Event | Thing you count or sample | -e cycles, -e mem-loads,mem-stores |
| Counting vs. Sampling | Simple aggregate counters (perf stat) vs. periodic PC/IP snapshots (perf record -F99) | perf stat, perf record |
| Call-graph mode | Capture stack traces with frame-pointer unwind, DWARF, Last-Branch-Records or BPF | -g fp, -g dwarf, --call-graph lbr |
| perf.data | Binary file holding raw samples; post-processed by sub-commands | perf report, perf script |
Quick-start workflow
# 1. Get a high-level performance baseline
perf stat -e cycles,instructions,cache-misses ./my_app
# 2. Sample 99 Hz, record user-space call-graphs, system-wide (-a)
sudo perf record -F 99 -g -a -- ./my_loadgen.sh
# 3. Inspect the profile (TUI or stdio)
perf report # interactive
perf report --stdio # pipe to less
# 4. Zoom into specific functions/lines
perf annotate -p <PID>
perf top gives a live, htop-like view, continuously refreshing hottest symbols.3 4
Most-used sub-commands
- perf stat aggregate counters for one run, multiple events in parallel.
- perf record / report / annotate sampling → analysis cycle; supports user stacks, kernel stacks, mixed mode.
- perf top real-time sampling (hit a to toggle kernel/user).
- perf trace lightweight syscall/ftrace tracer (a faster
strace). - perf sched detect run-queue latency and involuntary context-switch delays.
- perf mem –stat / –live NUMA and memory-access profiling.
- perf c2c cache-to-cache false-sharing detector on multi-socket systems.
- perf bench micro-benchmarks for cpuhog, syscall, numa, memset, etc.
Choosing and grouping events
# Sample multiple hardware events as a group so they are counted together
sudo perf record -e '{cycles,cache-misses,branch-misses}:u' -c 100000 -g ./app
# Use raw event codes if the alias is missing on your CPU
# Intel “frontend stalled” & “cycle activity:stalls_l2_miss”
sudo perf stat -e r003c,r0041 ./app
Individual events can be limited to user (:u), kernel (:k), or hypervisor (:h) privilege levels. Event selection varies by micro-architecture, but perf list prints everything supported on the running kernel.
Scope and filtering
- Per-PID / thread
-p PID,-t TID - CPU mask
-C 0-3,6(profile big cores only) - Duration / delay
--timeout 10s,--delay 5 - cgroup
--cgroup=/sys/fs/cgroup/my_ctn(container-only profiling)
Call-graph collection nuances
| Mode | Pros | Cons | Kernel ≥ 6.8 notes |
|---|---|---|---|
-g fp (frame-pointer) | zero config, low overhead | needs FP-enabled build | default on many distros |
-g dwarf | works with FP-omitted builds | higher unwind cost | faster thanks to ORC |
--call-graph lbr (Intel/AMD) | near-zero overhead, deep stacks | requires hardware LBR | hybrid-core aware |
--call-graph lbr,bpf | uses BPF helper to unwind userspace | best for mixed languages | new in kernel 6.9 |
Stacks from JIT runtimes (JVM, V8, .NET) need perf map support or BPF CO-RE unwinders.
Post-processing and visualisation
# Convert to folded stacks for FlameGraph
perf script | stackcollapse-perf.pl > out.folded
flamegraph.pl out.folded > flame.svg
# Generate speedscope JSON
perf script -F +pid,comm,ip,sym | perf_script_speedscope > profile.speedscope.json
perf inject can post-process LBR data to enrich samples, and perf archive bundles symbol files for offline analysis.
Practical tips & caveats
- Privileges: events marked “Precise” or raw PMU codes usually require
sudoorperf_event_paranoid = 1. - Match versions: userspace
perfmust match (or be newer than) the running kernel;perf versionprints both hashes. - Build-ID cache:
perf buildid-cache --add /path/to/lib.soensures symbols for stripped binaries. - Minimise distortion: prefer period-based sampling (
-c) for very short-lived functions; throttle frequency on production (e.g.-F 400). - Containerised kernels: under Kubernetes use
--cgroupor--uidfilters; ensure /proc/sys/kernel/perf_event_paranoid inside the host allows sampling. - Hybrid (P-/E-core) systems: pin to core type with
--cpu-type=performance(kernel 6.8+) to avoid mixed counters.
Conclusion
perf combines a profiler, tracer, and benchmark suite into a single first-party, always-available tool. By counting or sampling nearly every performance-relevant event the kernel exposes, it lets you move from “the code feels slow” to a quantified, line-level diagnosis in minutes—without invasive instrumentation or proprietary SDKs. Armed with the commands above you can start measuring before guessing and make data-driven optimisation a routine part of your Linux workflow.