Module 4 · Lesson 4 · ~20 min read
Two tools cover most production performance work in Go: `pprof` for "where is CPU and memory going?" and `trace` for "what is the goroutine scheduler actually doing?" Both ship with the standard library, and both take about 30 seconds to set up.
| Profile | Question it answers |
|---|---|
| `cpu` | Where is CPU time being spent? |
| `heap` | Which lines allocated the bytes still live in memory? |
| `allocs` | Which lines allocated bytes ever (live or freed)? |
| `goroutine` | How many goroutines are alive, and where are they parked? |
| `block` / `mutex` | What's blocking on channel/lock contention? |
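Everything except `cpu` is also available in-process as a named profile via `runtime/pprof`, which is handy for dumping state from a signal handler or a debug command rather than over HTTP. A minimal sketch (the profile names match the table above; `heap.out` is an arbitrary file name):

```go
package main

import (
	"os"
	"runtime/pprof"
)

func main() {
	// Dump every goroutine's stack as text — the same output as the
	// /debug/pprof/goroutine?debug=2 endpoint shown later.
	pprof.Lookup("goroutine").WriteTo(os.Stdout, 2)

	// Write a heap snapshot in the binary format go tool pprof reads.
	f, err := os.Create("heap.out")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	pprof.Lookup("heap").WriteTo(f, 0)
}
```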
import _ "net/http/pprof" // blank import — registers handlers on http.DefaultServeMux
func main() {
go http.ListenAndServe("localhost:6060", nil)
// ... rest of your service
}
Now http://localhost:6060/debug/pprof/ serves profiles. Grab them via `go tool pprof`:
```sh
# 30-second CPU profile (the URL is quoted so the shell doesn't glob the ?)
go tool pprof 'http://localhost:6060/debug/pprof/profile?seconds=30'

# live heap snapshot
go tool pprof http://localhost:6060/debug/pprof/heap

# goroutine dump (text)
curl 'http://localhost:6060/debug/pprof/goroutine?debug=2'
```
Don't expose :6060 publicly: the pprof endpoints leak goroutine stack traces, memory statistics, and the process command line. Bind to localhost only, or to a separate admin network. And remember that the blank import registers its handlers on `http.DefaultServeMux` — if your public server also serves the default mux, it serves the profiling endpoints too.
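One way to make that separation explicit: register the `net/http/pprof` handlers on a dedicated admin mux and keep your public routes on their own mux. A sketch (the ports and `handleRoot` are placeholders):

```go
package main

import (
	"net/http"
	"net/http/pprof" // note: its init() still registers on http.DefaultServeMux
)

func main() {
	// Admin mux: profiling endpoints, bound to localhost only.
	adminMux := http.NewServeMux()
	adminMux.HandleFunc("/debug/pprof/", pprof.Index)
	adminMux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
	adminMux.HandleFunc("/debug/pprof/profile", pprof.Profile)
	adminMux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
	adminMux.HandleFunc("/debug/pprof/trace", pprof.Trace)
	go http.ListenAndServe("localhost:6060", adminMux)

	// Public mux: your real routes. Because it is NOT DefaultServeMux,
	// the auto-registered pprof handlers never reach the public port.
	appMux := http.NewServeMux()
	appMux.HandleFunc("/", handleRoot) // hypothetical handler
	http.ListenAndServe(":8080", appMux)
}

func handleRoot(w http.ResponseWriter, r *http.Request) {}
```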
Benchmarks can emit the same profile formats directly:

```sh
go test -bench=. -cpuprofile=cpu.out -memprofile=mem.out
go tool pprof cpu.out
```
Inside `go tool pprof` interactive mode:

- `top` — top functions by CPU/memory.
- `top -cum` — by cumulative time (a function plus everything it calls).
- `list FuncName` — annotate a function's source line by line.
- `web` — render the call graph as SVG in your browser (needs Graphviz). For a flame graph, run `go tool pprof -http=:8080 cpu.out` and use the web UI instead.

Don't optimize without a profile. Engineers' intuition about where time is spent is wrong more than half the time.
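For a CLI tool or batch job with no HTTP server, `runtime/pprof` captures the same CPU profile programmatically. A minimal sketch (the file name and `doWork` are placeholders):

```go
package main

import (
	"os"
	"runtime/pprof"
)

func main() {
	f, err := os.Create("cpu.out")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Samples the CPU (~100 Hz) until StopCPUProfile is called.
	if err := pprof.StartCPUProfile(f); err != nil {
		panic(err)
	}
	defer pprof.StopCPUProfile()

	doWork() // hypothetical: the code path you want profiled
}

func doWork() {}
```

The resulting `cpu.out` is inspected with `go tool pprof cpu.out` exactly as above.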
pprof samples statistically; `runtime/trace` records every scheduling event exactly. It's heavier, but it answers questions pprof can't: where goroutines block, what the GC is doing, why latency spikes even when CPU is idle.
```sh
# Capture a 5-second trace from your service:
curl -o trace.out 'http://localhost:6060/debug/pprof/trace?seconds=5'

# View it:
go tool trace trace.out
```
This opens a browser UI with several views; the most useful is "Goroutine analysis", which shows what each goroutine actually did during the trace.
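You can also annotate your own code so those views group work meaningfully, using tasks and regions from `runtime/trace`. A sketch (the task/region names and `decode` are placeholders):

```go
package main

import (
	"context"
	"os"
	"runtime/trace"
)

func main() {
	// Write trace events to a file instead of fetching them over HTTP.
	f, err := os.Create("trace.out")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	if err := trace.Start(f); err != nil {
		panic(err)
	}
	defer trace.Stop()

	// A task groups related work, even across goroutines.
	ctx, task := trace.NewTask(context.Background(), "handleRequest")
	defer task.End()

	// A region marks a span of work within a single goroutine; both
	// show up by name in the trace UI.
	region := trace.StartRegion(ctx, "decode")
	decode() // hypothetical work
	region.End()
}

func decode() {}
```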
| Symptom in CPU profile | Likely cause |
|---|---|
| Lots of time in `runtime.mallocgc` | Excessive allocation. Check the `allocs` profile. |
| Lots of time in `runtime.gcBgMarkWorker` | GC pressure. Reduce allocations or raise `GOGC`. |
| Lots of time in `syscall` | Heavy I/O. Batch, cache, or stream more efficiently. |
| Wide tower under `encoding/json` decoding | JSON is expensive. Consider a streaming decoder, a faster encoder, or protobuf. |
| Lots of time under `regexp.compile` | You're recompiling regexes per call. Use `regexp.MustCompile` at package init (sketch below). |
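That last row is the most common quick win. A before/after sketch (the `validate` package and the pattern are hypothetical):

```go
package validate // hypothetical package for illustration

import "regexp"

// Bad: compiles the pattern on every call — this is exactly what
// shows up under regexp.compile in the CPU profile.
func isHexIDSlow(s string) bool {
	re := regexp.MustCompile(`^[0-9a-f]{8}$`)
	return re.MatchString(s)
}

// Good: compile once at package init. *regexp.Regexp is safe for
// concurrent use, so one shared value is fine.
var hexIDRE = regexp.MustCompile(`^[0-9a-f]{8}$`)

func isHexID(s string) bool {
	return hexIDRE.MatchString(s)
}
```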
Live goroutine count climbing over time = leak. Check:
```sh
curl 'http://localhost:6060/debug/pprof/goroutine?debug=2'
```
You get a stack trace per goroutine. Look for many goroutines parked at the same line — that's where they're stuck. Common causes: a channel no one is writing to, a context that never gets cancelled, a blocking call without a timeout.
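A sketch of the most common leak — a worker blocked on a send after the receiver has gone away — and the context-based fix (the package and function names are hypothetical):

```go
package worker // hypothetical package for illustration

import "context"

// Leaks: if the caller stops receiving, the goroutine blocks on the
// send forever and can never be collected.
func fetchLeaky(results chan<- string) {
	go func() {
		results <- doFetch() // parked here for good if nobody reads
	}()
}

// Fixed: the send races against cancellation, so the goroutine always
// exits once the caller gives up.
func fetch(ctx context.Context, results chan<- string) {
	go func() {
		select {
		case results <- doFetch():
		case <-ctx.Done():
		}
	}()
}

func doFetch() string { return "data" }
```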
To measure a single hot path rather than the whole service, write a benchmark:

```go
package submit // assumed: the package that defines Submitter and Command

import "testing"

func BenchmarkSubmit(b *testing.B) {
	s := NewSubmitter()
	cmd := Command{ID: "x"}
	b.ResetTimer() // exclude the setup above from the measurement
	for i := 0; i < b.N; i++ {
		s.Submit(cmd)
	}
}
```
```sh
go test -bench=BenchmarkSubmit -benchmem -count=10
# BenchmarkSubmit-8   1234567   872 ns/op   128 B/op   3 allocs/op
```
`-benchmem` reports allocations per operation. `-count=10` runs the whole benchmark ten times, giving `benchstat` enough samples to compare before/after with statistical significance.
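The before/after workflow with `benchstat` (installable via `go install golang.org/x/perf/cmd/benchstat@latest`) looks like this; the file names are arbitrary:

```sh
# Baseline on the current code.
go test -bench=BenchmarkSubmit -benchmem -count=10 > old.txt

# ...apply your optimization, then re-measure...
go test -bench=BenchmarkSubmit -benchmem -count=10 > new.txt

# benchstat reports the delta with a p-value, so run-to-run noise
# doesn't masquerade as a win.
benchstat old.txt new.txt
```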
- `net/http/pprof` + a localhost server = production profiling in 3 lines.
- `runtime/trace` for scheduler-level visibility (per-goroutine timelines, GC phases).
- `go test -bench -benchmem` for tight microbenchmarks.