Module 6 · Lesson 3 · ~20 min read

Idempotency, Retries, and Backoff

In any distributed system — and Canton is one — calls fail. Sometimes they fail after they've succeeded (the network drops the response). The only sane way to live with this is to make calls idempotent and retry transient failures with exponential backoff. Three patterns; each has a Go shape.

The "happened or didn't?" problem

Your client submits a command. The connection drops before you receive the response. Did the participant see your command? You don't know.

Three options for what to do next:

  1. Don't retry. The command may have been lost. Up to a human to resolve.
  2. Retry. The command may have already happened — now it might happen twice.
  3. Make the operation idempotent. Retry is safe regardless.

Option 3 is the right answer in 95% of cases. Idempotency turns "did this happen?" from a hair-on-fire emergency into a non-issue.

What idempotent means

An operation is idempotent if applying it twice has the same effect as applying it once. Examples:

Idempotent                              Not idempotent
----------                              --------------
SET balance = 100                       UPDATE balance += 100
PUT /resource/123 (replace)             POST /resource (create new)
kubectl apply                           kubectl create
"Submit command with this dedup key"    "Submit a new command"

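As a toy Go illustration (hypothetical helpers, nothing Canton-specific): applying Set twice leaves the same state as applying it once; applying Add twice does not.

// Set is idempotent: calling it twice with 100 leaves balance at 100 either way.
func Set(balance *int64, v int64) {
    *balance = v
}

// Add is not: a second call moves the balance again.
func Add(balance *int64, v int64) {
    *balance += v
}
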
The dedup key pattern — the workhorse

For operations that aren't naturally idempotent (e.g., "transfer 100 from A to B"), make them idempotent by attaching a unique key the server uses to dedupe:

req := &SubmitRequest{
    CommandId: "transfer-2026-04-23-abc123",  // unique per LOGICAL operation
    Payload:   payload,
}
resp, err := client.Submit(ctx, req)
// On retry, send the SAME CommandId.
// Server sees: "I already processed this command_id; return the original outcome."

Canton's Ledger API supports this — every command submission carries a command_id, and resubmitting the same command_id within the deduplication window is detected as a duplicate rather than executed a second time.

The dedup key has to:

  1. Be unique per logical operation — a new transfer gets a new key.
  2. Be stable across retries — every retry of the same operation sends the same key (a sketch of deriving one follows).
  3. Be chosen before the first attempt, so it already exists when you hit the "happened or didn't?" window.
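
One way to get such a key — a sketch with hypothetical field names — is to derive it deterministically from the logical operation plus a request ID minted once, before the first attempt (assumes crypto/sha256, encoding/hex, and fmt are imported):

// Hypothetical helper: the same logical transfer always yields the same key, so every
// retry carries the same CommandId, while a genuinely new transfer gets a new requestUUID.
func commandID(from, to string, amount int64, requestUUID string) string {
    sum := sha256.Sum256([]byte(fmt.Sprintf("transfer|%s|%s|%d|%s", from, to, amount, requestUUID)))
    return "transfer-" + hex.EncodeToString(sum[:8])
}

Minting a UUID per logical operation and storing it before the first submit works just as well; the property that matters is that the key is decided before attempt one, not regenerated per attempt.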

What to retry

Not every error is retryable. Retry only transient failures — the kind a later attempt can plausibly succeed at (see the isTransient sketch below):

  1. Network errors — connection reset, timeout, DNS hiccup.
  2. Server overload or unavailability — gRPC UNAVAILABLE or DEADLINE_EXCEEDED, HTTP 502/503.
  3. Explicit "slow down" signals — rate limiting, RESOURCE_EXHAUSTED.

Don't retry:

  1. Validation and application errors — a 400 or INVALID_ARGUMENT won't become a 200 by trying again.
  2. Authentication and authorization failures.
  3. Business-logic rejections — the command was received, evaluated, and refused.
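
Here is one way to write the isTransient check that the retry loop below relies on — a sketch using gRPC status codes (google.golang.org/grpc/codes and .../status); which codes deserve a retry ultimately depends on the API you're calling:

// Treats the classic "server is struggling, try again later" codes as transient.
// Everything else — InvalidArgument, PermissionDenied, AlreadyExists, ... — is not.
func isTransient(err error) bool {
    switch status.Code(err) {
    case codes.Unavailable, codes.DeadlineExceeded, codes.ResourceExhausted, codes.Aborted:
        return true
    default:
        return false
    }
}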

Exponential backoff with jitter

Naive retry: try, fail, try again immediately. Hammers the server. Worse, if a thousand clients are all retrying simultaneously after a brief outage, they all hit the recovering server at exactly the same moment ("thundering herd").

Better: exponential backoff. Wait longer between each retry.

func backoff(attempt int) time.Duration {
    // 100ms, 200ms, 400ms, 800ms, ... capped at 10s
    base := 100 * time.Millisecond
    if attempt > 7 {
        attempt = 7 // past this the cap below applies anyway; also avoids shift overflow
    }
    d := base << attempt
    if d > 10*time.Second {
        d = 10 * time.Second
    }
    return d
}

Even better: add jitter. A small random perturbation prevents synchronized retries:

func backoffWithJitter(attempt int) time.Duration {
    base := backoff(attempt)
    jitter := time.Duration(rand.Int63n(int64(base / 2)))
    return base + jitter
}

"Full jitter" — picking a random value in [0, exponential) rather than adding a small jitter to a deterministic base — is also widely used and behaves slightly better under heavy contention. AWS's Architecture Blog post on backoff and jitter is the canonical reference.

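A minimal full-jitter variant of the helper above (math/rand assumed; rand.Int63n is safe here because backoff never returns less than 100ms):

// Sleep a uniformly random duration in [0, backoff(attempt)).
func fullJitter(attempt int) time.Duration {
    return time.Duration(rand.Int63n(int64(backoff(attempt))))
}
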
The retry shape

func submitWithRetry(ctx context.Context, client SubmitterClient, req *SubmitRequest) (*SubmitResponse, error) {
    var lastErr error
    for attempt := 0; attempt < 5; attempt++ {
        resp, err := client.Submit(ctx, req)
        if err == nil {
            return resp, nil
        }
        if !isTransient(err) {
            return nil, fmt.Errorf("non-transient error, no retry: %w", err)
        }
        lastErr = err

        select {
        case <-time.After(backoffWithJitter(attempt)):
            // loop and retry
        case <-ctx.Done():
            return nil, fmt.Errorf("context done during retry: %w", ctx.Err())
        }
    }
    return nil, fmt.Errorf("exhausted retries: %w", lastErr)
}

Five things this loop does right:

  1. Bounded attempts.
  2. Distinguishes transient from permanent errors.
  3. Waits between attempts via select on ctx.Done() — cancellation propagates through the wait.
  4. Captures the last error to return on exhaustion.
  5. Wraps all three error paths (non-transient, context canceled, retries exhausted) so callers can errors.Is the underlying cause.
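
Putting the pieces together at a call site — the dedup key is fixed before the call, so every attempt inside the loop carries the same one (commandID is the hypothetical helper from earlier; requestUUID is minted once per logical operation):

req := &SubmitRequest{
    CommandId: commandID("alice", "bob", 100, requestUUID), // decided once, reused on every retry
    Payload:   payload,
}
resp, err := submitWithRetry(ctx, client, req)
if err != nil {
    // Retries exhausted or a permanent error. The command may still have landed —
    // check the ledger, or resubmit with the same CommandId, before assuming it didn't.
}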

Circuit breakers — the next level

If 100% of calls to a downstream are failing, retrying is pointless and harmful. A circuit breaker tracks recent failure rate and "opens" when it crosses a threshold — meaning the next call fails fast without even trying.

Canonical library: sony/gobreaker is a simple, widely used implementation. Use one when:

  1. A downstream dependency can be completely down for extended periods.
  2. Failing fast is better than stacking retries behind a dead dependency.
  3. You want to shed load so the recovering service isn't hammered the moment it comes back.
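
If you do reach for one, a minimal sketch with sony/gobreaker (its classic, pre-generics API; errors and time assumed imported; settings are illustrative, not recommendations):

cb := gobreaker.NewCircuitBreaker(gobreaker.Settings{
    Name:    "ledger-submit",
    Timeout: 30 * time.Second, // how long the breaker stays open before letting a probe through
})

out, err := cb.Execute(func() (interface{}, error) {
    return submitWithRetry(ctx, client, req)
})
if errors.Is(err, gobreaker.ErrOpenState) {
    // Failed fast: the breaker is open and the server was never called.
}
// On success, out holds the *SubmitResponse returned by submitWithRetry.
_ = out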

Don't add a circuit breaker until you have a real reason. Premature circuit breakers mask bugs.

Deduplication on the consumer side

Sometimes you can't make the upstream operation idempotent — you're consuming from a stream that may replay events on reconnect. Then idempotency is your responsibility:

var seen = map[string]struct{}{}

for {
    upd, err := stream.Recv()
    if err != nil {
        break
    }
    if _, ok := seen[upd.GetEventId()]; ok {
        continue // already handled — the stream replayed it
    }
    seen[upd.GetEventId()] = struct{}{}
    handle(upd)
}

For real production, you'd persist seen (so dedup survives restart) and bound it (so it doesn't grow forever). For Canton's transaction stream specifically, the offset itself is your dedup key — track the highest offset processed and skip anything older.
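
A sketch of the offset-based version — the accessor name is hypothetical, and in real code lastOffset would live in durable storage and be committed together with whatever handle does:

var lastOffset int64 // restore from durable storage on startup

for {
    upd, err := stream.Recv()
    if err != nil {
        break
    }
    if upd.GetOffset() <= lastOffset {
        continue // replayed on reconnect — already processed
    }
    handle(upd)
    lastOffset = upd.GetOffset() // persist alongside handle's effects in real code
}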

Common mistakes

Don't retry application errors. A 400 won't become a 200 by trying again. You waste time and burn quota.

Don't forget the dedup key. If you're retrying and the upstream supports a dedup key (Canton commands do), USE IT.

Don't use a bare time.Sleep in a retry loop. It blocks even if the parent context is canceled. Use select with time.After and ctx.Done().

Takeaways

  1. Make operations idempotent — attach a dedup key chosen once per logical operation, and send the same key on every retry.
  2. Retry only transient errors, with exponential backoff and jitter, a bounded attempt count, and context-aware waits.
  3. When you consume a replayable stream, dedup on your side — by event ID or, for Canton's transaction stream, by offset.
  4. Reach for circuit breakers only when failing fast genuinely beats retrying.