Go back

Practical Profiling with perf on Linux

Published:

Linux Perf

Table of Contents

Open Table of Contents

What perf is

perf is the user-space front-end to the Linux perf_event subsystem (the perf_event_open(2) syscall). It offers a uniform command-line interface to hardware Performance-Monitoring Units (PMU), kernel trace-points, software counters, kprobes/uprobes, and eBPF events, hiding the architectural quirks of each CPU family. It ships in the kernel tree (tools/perf) and is packaged by most distributions as linux-tools-$(uname -r).1

Why you might care

Mental model

ConceptWhat it meansExample flags
EventThing you count or sample-e cycles, -e mem-loads,mem-stores
Counting vs. SamplingSimple aggregate counters (perf stat) vs. periodic PC/IP snapshots (perf record -F99)perf stat, perf record
Call-graph modeCapture stack traces with frame-pointer unwind, DWARF, Last-Branch-Records or BPF-g fp, -g dwarf, --call-graph lbr
perf.dataBinary file holding raw samples; post-processed by sub-commandsperf report, perf script

Quick-start workflow

# 1. Get a high-level performance baseline
perf stat -e cycles,instructions,cache-misses ./my_app

# 2. Sample 99 Hz, record user-space call-graphs, system-wide (-a)
sudo perf record -F 99 -g -a -- ./my_loadgen.sh

# 3. Inspect the profile (TUI or stdio)
perf report            # interactive
perf report --stdio    # pipe to less

# 4. Zoom into specific functions/lines
perf annotate -p <PID>

perf top gives a live, htop-like view, continuously refreshing hottest symbols.3 4

Most-used sub-commands

Choosing and grouping events

# Sample multiple hardware events as a group so they are counted together
sudo perf record -e '{cycles,cache-misses,branch-misses}:u' -c 100000 -g ./app

# Use raw event codes if the alias is missing on your CPU
# Intel “frontend stalled” & “cycle activity:stalls_l2_miss”
sudo perf stat -e r003c,r0041 ./app

Individual events can be limited to user (:u), kernel (:k), or hypervisor (:h) privilege levels. Event selection varies by micro-architecture, but perf list prints everything supported on the running kernel.

Scope and filtering

Call-graph collection nuances

ModeProsConsKernel ≥ 6.8 notes
-g fp (frame-pointer)zero config, low overheadneeds FP-enabled builddefault on many distros
-g dwarfworks with FP-omitted buildshigher unwind costfaster thanks to ORC
--call-graph lbr (Intel/AMD)near-zero overhead, deep stacksrequires hardware LBRhybrid-core aware
--call-graph lbr,bpfuses BPF helper to unwind userspacebest for mixed languagesnew in kernel 6.9

Stacks from JIT runtimes (JVM, V8, .NET) need perf map support or BPF CO-RE unwinders.

Post-processing and visualisation

# Convert to folded stacks for FlameGraph
perf script | stackcollapse-perf.pl > out.folded
flamegraph.pl out.folded > flame.svg

# Generate speedscope JSON
perf script -F +pid,comm,ip,sym | perf_script_speedscope > profile.speedscope.json

perf inject can post-process LBR data to enrich samples, and perf archive bundles symbol files for offline analysis.

Practical tips & caveats

Conclusion

perf combines a profiler, tracer, and benchmark suite into a single first-party, always-available tool. By counting or sampling nearly every performance-relevant event the kernel exposes, it lets you move from “the code feels slow” to a quantified, line-level diagnosis in minutes—without invasive instrumentation or proprietary SDKs. Armed with the commands above you can start measuring before guessing and make data-driven optimisation a routine part of your Linux workflow.

Footnotes

  1. Introduction - perf: Linux profiling with performance counters

  2. Linux perf Examples - Brendan Gregg

  3. Chapter 21. Recording and analyzing performance profiles with perf

  4. Linux perf: How to Use the Command and Profiler | phoenixNAP KB


Suggest Changes

Previous Post
Effective Modern C++: Writing Clean, Bug-Free, and High-Performance Code
Next Post
Mastering the Art of LLM Integration: A Developer's Guide to Effective AI Utilization