Module 6 · Lesson 3 · ~20 min read
In any distributed system — and Canton is one — calls fail. Sometimes they fail after they've succeeded (the network drops the response). The only sane way to live with this is to make calls idempotent and retry transient failures with exponential backoff. Three patterns; each has a Go shape.
Your client submits a command. The connection drops before you receive the response. Did the participant see your command? You don't know.
Three options for what to do next:

1. Give up. Safe, but the operation may simply be lost.
2. Retry blindly. If the first submission did land, you've now executed it twice.
3. Make the operation idempotent, then retry freely.
Option 3 is the right answer in 95% of cases. Idempotency turns "did this happen?" from a hair-on-fire emergency into a non-issue.
An operation is idempotent if applying it twice has the same effect as applying it once. Examples:
| Idempotent | Not idempotent |
|---|---|
| `SET balance = 100` | `UPDATE balance += 100` |
| `PUT /resource/123` (replace) | `POST /resource` (create new) |
| `kubectl apply` | `kubectl create` |
| "Submit command with this dedup key" | "Submit a new command" |
For operations that aren't naturally idempotent (e.g., "transfer 100 from A to B"), make them idempotent by attaching a unique key the server uses to dedupe:
```go
req := &SubmitRequest{
	CommandId: "transfer-2026-04-23-abc123", // unique per LOGICAL operation
	Payload:   payload,
}
resp, err := client.Submit(ctx, req)
// On retry, send the SAME CommandId.
// Server sees: "I already processed this command_id; return the original outcome."
```
Canton's Ledger API supports this — every command submission has a command_id, and submitting the same command_id twice returns the same outcome rather than executing twice.
The dedup key has to:

- Be unique per logical operation, not per attempt — a fresh key on every retry defeats the purpose.
- Stay stable across retries: every retry of the same operation sends the same key.
- Be generated before the first attempt, so a crash between attempts can't lose it.
Not every error is retryable. Retry only transient failures:

- gRPC: `UNAVAILABLE`, `DEADLINE_EXCEEDED`, sometimes `RESOURCE_EXHAUSTED`.
- HTTP: 502/503/504.

Don't retry application errors: `INVALID_ARGUMENT`, `NOT_FOUND`, `PERMISSION_DENIED`. These won't get better; retrying just wastes time.

Naive retry: try, fail, try again immediately. This hammers the server. Worse, if a thousand clients are all retrying simultaneously after a brief outage, they all hit the recovering server at exactly the same moment (the "thundering herd").
Better: exponential backoff. Wait longer between each retry.
```go
func backoff(attempt int) time.Duration {
	// 100ms, 200ms, 400ms, 800ms, ...
	base := 100 * time.Millisecond
	d := base << attempt
	if d > 10*time.Second {
		d = 10 * time.Second // cap
	}
	return d
}
```
Even better: add jitter. A small random perturbation prevents synchronized retries:
```go
func backoffWithJitter(attempt int) time.Duration {
	base := backoff(attempt)
	jitter := time.Duration(rand.Int63n(int64(base / 2))) // math/rand
	return base + jitter
}
```
"Full jitter" — pick a random value in [0, exponential) rather than adding small jitter to a deterministic base — is also widely used and behaves slightly better under heavy contention. AWS's docs are the canonical reference.
```go
func submitWithRetry(ctx context.Context, client SubmitterClient, req *SubmitRequest) (*SubmitResponse, error) {
	var lastErr error
	for attempt := 0; attempt < 5; attempt++ {
		resp, err := client.Submit(ctx, req)
		if err == nil {
			return resp, nil
		}
		if !isTransient(err) {
			return nil, fmt.Errorf("non-transient error, no retry: %w", err)
		}
		lastErr = err
		select {
		case <-time.After(backoffWithJitter(attempt)):
			// loop and retry
		case <-ctx.Done():
			return nil, fmt.Errorf("context done during retry: %w", ctx.Err())
		}
	}
	return nil, fmt.Errorf("exhausted retries: %w", lastErr)
}
```
Five things this loop does right:

- Bounded attempts — five, then give up with a real error instead of looping forever.
- Retries only transient errors; non-transient errors return immediately.
- Exponential backoff with jitter between attempts.
- `select` on `ctx.Done()` — cancellation propagates through the wait.
- Wraps errors with `%w`, so callers can `errors.Is` the underlying cause.

If 100% of calls to a downstream are failing, retrying is pointless and harmful. A circuit breaker tracks the recent failure rate and "opens" when it crosses a threshold — meaning the next call fails fast without even trying.
Canonical libraries: sony/gobreaker for a simple implementation. Use one when:

- A downstream can be hard-down for minutes at a time, and failing fast is kinder to callers than queuing retries.
- Retry storms from many clients could prevent the downstream from recovering at all.
Don't add a circuit breaker until you have a real reason. Premature circuit breakers mask bugs.
Sometimes you can't make the upstream operation idempotent — you're consuming from a stream that may replay events on reconnect. Then idempotency is your responsibility:
```go
seen := map[string]struct{}{}
for {
	upd, err := stream.Recv()
	if err != nil {
		break
	}
	if _, ok := seen[upd.GetEventId()]; ok {
		continue // duplicate delivery — already handled
	}
	seen[upd.GetEventId()] = struct{}{}
	handle(upd)
}
```
For real production, you'd persist seen (so dedup survives restart) and bound it (so it doesn't grow forever). For Canton's transaction stream specifically, the offset itself is your dedup key — track the highest offset processed and skip anything older.
Three mistakes to avoid:

- Retrying application errors. A 400 won't become a 200 by trying again; you waste time and burn quota.
- Forgetting the dedup key. If you're retrying and the upstream supports a dedup key (Canton commands do), USE IT.
- Using a bare `time.Sleep` in a retry loop. It blocks even if the parent context is canceled. Use `select` with `time.After` and `ctx.Done()`.
Key takeaways:

- Attach a dedup key (for Canton, `command_id`) so retries can't double-execute.
- Classify errors by `codes.Code`; retry only the transient set, with exponential backoff and jitter.
- Wait with `select`, not `time.Sleep`.