Module 4 · Lesson 4 · ~20 min read
Two tools cover most production performance work in Go: `pprof` for "where is CPU and memory going?" and `trace` for "what is the goroutine scheduler actually doing?" Both ship with the standard library, and both take about 30 seconds to set up.
| Profile | Question it answers |
|---|---|
| `cpu` | Where is CPU time being spent? |
| `heap` | Which lines allocated the bytes still live in memory? |
| `allocs` | Which lines allocated bytes ever (live or freed)? |
| `goroutine` | How many goroutines are alive, and where are they parked? |
| `block` / `mutex` | What's blocking on channel/lock contention? |
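Everything except `cpu` is also available in-process as a named profile via `runtime/pprof`, which is handy for dumping state from a signal handler or a debug command rather than over HTTP. A minimal sketch (the profile names match the table above; `heap.out` is an arbitrary file name):

```go
package main

import (
	"os"
	"runtime/pprof"
)

func main() {
	// Dump every goroutine's stack as text — the same output as the
	// /debug/pprof/goroutine?debug=2 endpoint shown later.
	pprof.Lookup("goroutine").WriteTo(os.Stdout, 2)

	// Write a heap snapshot in the binary format go tool pprof reads.
	f, err := os.Create("heap.out")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	pprof.Lookup("heap").WriteTo(f, 0)
}
```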
import _ "net/http/pprof" // blank import — registers handlers on http.DefaultServeMux
func main() {
go http.ListenAndServe("localhost:6060", nil)
// ... rest of your service
}
Now http://localhost:6060/debug/pprof/ serves profiles. Grab them via `go tool pprof`:
```sh
# 30-second CPU profile (the URL is quoted so the shell doesn't glob the ?)
go tool pprof 'http://localhost:6060/debug/pprof/profile?seconds=30'

# live heap snapshot
go tool pprof http://localhost:6060/debug/pprof/heap

# goroutine dump (text)
curl 'http://localhost:6060/debug/pprof/goroutine?debug=2'
```
Don't expose :6060 publicly: the pprof endpoints leak goroutine stack traces, memory statistics, and the process command line. Bind to localhost only, or to a separate admin network. And remember that the blank import registers its handlers on `http.DefaultServeMux` — if your public server also serves the default mux, it serves the profiling endpoints too.
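One way to make that separation explicit: register the `net/http/pprof` handlers on a dedicated admin mux and keep your public routes on their own mux. A sketch (the ports and `handleRoot` are placeholders):

```go
package main

import (
	"net/http"
	"net/http/pprof" // note: its init() still registers on http.DefaultServeMux
)

func main() {
	// Admin mux: profiling endpoints, bound to localhost only.
	adminMux := http.NewServeMux()
	adminMux.HandleFunc("/debug/pprof/", pprof.Index)
	adminMux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
	adminMux.HandleFunc("/debug/pprof/profile", pprof.Profile)
	adminMux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
	adminMux.HandleFunc("/debug/pprof/trace", pprof.Trace)
	go http.ListenAndServe("localhost:6060", adminMux)

	// Public mux: your real routes. Because it is NOT DefaultServeMux,
	// the auto-registered pprof handlers never reach the public port.
	appMux := http.NewServeMux()
	appMux.HandleFunc("/", handleRoot) // hypothetical handler
	http.ListenAndServe(":8080", appMux)
}

func handleRoot(w http.ResponseWriter, r *http.Request) {}
```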
Benchmarks can emit the same profile formats directly:

```sh
go test -bench=. -cpuprofile=cpu.out -memprofile=mem.out
go tool pprof cpu.out
```
Inside `go tool pprof` interactive mode:

- `top` — top functions by CPU/memory.
- `top -cum` — by cumulative time (a function plus everything it calls).
- `list FuncName` — annotate a function's source line by line.
- `web` — render the call graph as SVG in your browser (needs Graphviz). For a flame graph, run `go tool pprof -http=:8080 cpu.out` and use the web UI instead.

Don't optimize without a profile. Engineers' intuition about where time is spent is wrong more than half the time.
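For a CLI tool or batch job with no HTTP server, `runtime/pprof` captures the same CPU profile programmatically. A minimal sketch (the file name and `doWork` are placeholders):

```go
package main

import (
	"os"
	"runtime/pprof"
)

func main() {
	f, err := os.Create("cpu.out")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Samples the CPU (~100 Hz) until StopCPUProfile is called.
	if err := pprof.StartCPUProfile(f); err != nil {
		panic(err)
	}
	defer pprof.StopCPUProfile()

	doWork() // hypothetical: the code path you want profiled
}

func doWork() {}
```

The resulting `cpu.out` is inspected with `go tool pprof cpu.out` exactly as above.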
pprof samples statistically; `runtime/trace` records every scheduling event exactly. It's heavier, but it answers questions pprof can't: where goroutines block, what the GC is doing, why latency spikes even when CPU is idle.
```sh
# Capture a 5-second trace from your service:
curl -o trace.out 'http://localhost:6060/debug/pprof/trace?seconds=5'

# View it:
go tool trace trace.out
```
This opens a browser UI with several views; the most useful is "Goroutine analysis", which shows what each goroutine actually did during the trace.
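You can also annotate your own code so those views group work meaningfully, using tasks and regions from `runtime/trace`. A sketch (the task/region names and `decode` are placeholders):

```go
package main

import (
	"context"
	"os"
	"runtime/trace"
)

func main() {
	// Write trace events to a file instead of fetching them over HTTP.
	f, err := os.Create("trace.out")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	if err := trace.Start(f); err != nil {
		panic(err)
	}
	defer trace.Stop()

	// A task groups related work, even across goroutines.
	ctx, task := trace.NewTask(context.Background(), "handleRequest")
	defer task.End()

	// A region marks a span of work within a single goroutine; both
	// show up by name in the trace UI.
	region := trace.StartRegion(ctx, "decode")
	decode() // hypothetical work
	region.End()
}

func decode() {}
```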
| Symptom in CPU profile | Likely cause |
|---|---|
| Lots of time in `runtime.mallocgc` | Excessive allocation. Check the `allocs` profile. |
| Lots of time in `runtime.gcBgMarkWorker` | GC pressure. Reduce allocations or raise `GOGC`. |
| Lots of time in `syscall` | Heavy I/O. Batch, cache, or stream more efficiently. |
| Wide tower under `encoding/json` decoding | JSON is expensive. Consider a streaming decoder, a faster encoder, or protobuf. |
| Lots of time under `regexp.compile` | You're recompiling regexes per call. Use `regexp.MustCompile` at package init (sketch below). |
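That last row is the most common quick win. A before/after sketch (the `validate` package and the pattern are hypothetical):

```go
package validate // hypothetical package for illustration

import "regexp"

// Bad: compiles the pattern on every call — this is exactly what
// shows up under regexp.compile in the CPU profile.
func isHexIDSlow(s string) bool {
	re := regexp.MustCompile(`^[0-9a-f]{8}$`)
	return re.MatchString(s)
}

// Good: compile once at package init. *regexp.Regexp is safe for
// concurrent use, so one shared value is fine.
var hexIDRE = regexp.MustCompile(`^[0-9a-f]{8}$`)

func isHexID(s string) bool {
	return hexIDRE.MatchString(s)
}
```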
Live goroutine count climbing over time = leak. Check:
```sh
curl 'http://localhost:6060/debug/pprof/goroutine?debug=2'
```
You get a stack trace per goroutine. Look for many goroutines parked at the same line — that's where they're stuck. Common causes: a channel no one is writing to, a context that never gets cancelled, a blocking call without a timeout.
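A sketch of the most common leak — a worker blocked on a send after the receiver has gone away — and the context-based fix (the package and function names are hypothetical):

```go
package worker // hypothetical package for illustration

import "context"

// Leaks: if the caller stops receiving, the goroutine blocks on the
// send forever and can never be collected.
func fetchLeaky(results chan<- string) {
	go func() {
		results <- doFetch() // parked here for good if nobody reads
	}()
}

// Fixed: the send races against cancellation, so the goroutine always
// exits once the caller gives up.
func fetch(ctx context.Context, results chan<- string) {
	go func() {
		select {
		case results <- doFetch():
		case <-ctx.Done():
		}
	}()
}

func doFetch() string { return "data" }
```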
To measure a single hot path rather than the whole service, write a benchmark:

```go
package submit // assumed: the package that defines Submitter and Command

import "testing"

func BenchmarkSubmit(b *testing.B) {
	s := NewSubmitter()
	cmd := Command{ID: "x"}
	b.ResetTimer() // exclude the setup above from the measurement
	for i := 0; i < b.N; i++ {
		s.Submit(cmd)
	}
}
```
```sh
go test -bench=BenchmarkSubmit -benchmem -count=10
# BenchmarkSubmit-8   1234567   872 ns/op   128 B/op   3 allocs/op
```
`-benchmem` reports allocations per operation. `-count=10` runs the whole benchmark ten times, giving `benchstat` enough samples to compare before/after with statistical significance.
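The before/after workflow with `benchstat` (installable via `go install golang.org/x/perf/cmd/benchstat@latest`) looks like this; the file names are arbitrary:

```sh
# Baseline on the current code.
go test -bench=BenchmarkSubmit -benchmem -count=10 > old.txt

# ...apply your optimization, then re-measure...
go test -bench=BenchmarkSubmit -benchmem -count=10 > new.txt

# benchstat reports the delta with a p-value, so run-to-run noise
# doesn't masquerade as a win.
benchstat old.txt new.txt
```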
- `net/http/pprof` + a localhost server = production profiling in 3 lines.
- `runtime/trace` for scheduler-level visibility (per-goroutine timelines, GC phases).
- `go test -bench -benchmem` for tight microbenchmarks.