Module 6 · Lesson 3 · ~20 min read

Idempotency, Retries, and Backoff

In any distributed system — and Canton is one — calls fail. Sometimes they fail after they've succeeded (the network drops the response). The only sane way to live with this is to make calls idempotent and retry transient failures with exponential backoff. Three patterns; each has a Go shape.

The "happened or didn't?" problem

Your client submits a command. The connection drops before you receive the response. Did the participant see your command? You don't know.

Three options for what to do next:

  1. Don't retry. The command may have been lost. Up to a human to resolve.
  2. Retry. The command may have already happened — now it might happen twice.
  3. Make the operation idempotent. Retry is safe regardless.

Option 3 is the right answer in 95% of cases. Idempotency turns "did this happen?" from a hair-on-fire emergency into a non-issue.

What idempotent means

An operation is idempotent if applying it twice has the same effect as applying it once. Examples:

Idempotent                              Not idempotent
----------                              --------------
SET balance = 100                       UPDATE balance += 100
PUT /resource/123 (replace)             POST /resource (create new)
kubectl apply                           kubectl create
"Submit command with this dedup key"    "Submit a new command"

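As a toy Go illustration (hypothetical helpers, nothing Canton-specific): applying Set twice leaves the same state as applying it once; applying Add twice does not.

// Set is idempotent: calling it twice with 100 leaves balance at 100 either way.
func Set(balance *int64, v int64) {
    *balance = v
}

// Add is not: a second call moves the balance again.
func Add(balance *int64, v int64) {
    *balance += v
}
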
The dedup key pattern — the workhorse

For operations that aren't naturally idempotent (e.g., "transfer 100 from A to B"), make them idempotent by attaching a unique key the server uses to dedupe:

req := &SubmitRequest{
    CommandId: "transfer-2026-04-23-abc123",  // unique per LOGICAL operation
    Payload:   payload,
}
resp, err := client.Submit(ctx, req)
// On retry, send the SAME CommandId.
// Server sees: "I already processed this command_id; return the original outcome."

Canton's Ledger API supports this — every command submission carries a command_id, and resubmitting the same command_id within the deduplication window is detected as a duplicate rather than executed a second time.

The dedup key has to:

  1. Be unique per logical operation — a new transfer gets a new key.
  2. Be stable across retries — every retry of the same operation sends the same key (a sketch of deriving one follows).
  3. Be chosen before the first attempt, so it already exists when you hit the "happened or didn't?" window.
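
One way to get such a key — a sketch with hypothetical field names — is to derive it deterministically from the logical operation plus a request ID minted once, before the first attempt (assumes crypto/sha256, encoding/hex, and fmt are imported):

// Hypothetical helper: the same logical transfer always yields the same key, so every
// retry carries the same CommandId, while a genuinely new transfer gets a new requestUUID.
func commandID(from, to string, amount int64, requestUUID string) string {
    sum := sha256.Sum256([]byte(fmt.Sprintf("transfer|%s|%s|%d|%s", from, to, amount, requestUUID)))
    return "transfer-" + hex.EncodeToString(sum[:8])
}

Minting a UUID per logical operation and storing it before the first submit works just as well; the property that matters is that the key is decided before attempt one, not regenerated per attempt.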

What to retry

Not every error is retryable. Retry only transient failures — the kind a later attempt can plausibly succeed at (see the isTransient sketch below):

  1. Network errors — connection reset, timeout, DNS hiccup.
  2. Server overload or unavailability — gRPC UNAVAILABLE or DEADLINE_EXCEEDED, HTTP 502/503.
  3. Explicit "slow down" signals — rate limiting, RESOURCE_EXHAUSTED.

Don't retry:

  1. Validation and application errors — a 400 or INVALID_ARGUMENT won't become a 200 by trying again.
  2. Authentication and authorization failures.
  3. Business-logic rejections — the command was received, evaluated, and refused.
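
Here is one way to write the isTransient check that the retry loop below relies on — a sketch using gRPC status codes (google.golang.org/grpc/codes and .../status); which codes deserve a retry ultimately depends on the API you're calling:

// Treats the classic "server is struggling, try again later" codes as transient.
// Everything else — InvalidArgument, PermissionDenied, AlreadyExists, ... — is not.
func isTransient(err error) bool {
    switch status.Code(err) {
    case codes.Unavailable, codes.DeadlineExceeded, codes.ResourceExhausted, codes.Aborted:
        return true
    default:
        return false
    }
}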

Exponential backoff with jitter

Naive retry: try, fail, try again immediately. Hammers the server. Worse, if a thousand clients are all retrying simultaneously after a brief outage, they all hit the recovering server at exactly the same moment ("thundering herd").

Better: exponential backoff. Wait longer between each retry.

func backoff(attempt int) time.Duration {
    // 100ms, 200ms, 400ms, 800ms, ... capped at 10s
    base := 100 * time.Millisecond
    if attempt > 7 {
        attempt = 7 // past this the cap below applies anyway; also avoids shift overflow
    }
    d := base << attempt
    if d > 10*time.Second {
        d = 10 * time.Second
    }
    return d
}

Even better: add jitter. A small random perturbation prevents synchronized retries:

func backoffWithJitter(attempt int) time.Duration {
    base := backoff(attempt)
    jitter := time.Duration(rand.Int63n(int64(base / 2)))
    return base + jitter
}

"Full jitter" — picking a random value in [0, exponential) rather than adding a small jitter to a deterministic base — is also widely used and behaves slightly better under heavy contention. AWS's Architecture Blog post on backoff and jitter is the canonical reference.

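A minimal full-jitter variant of the helper above (math/rand assumed; rand.Int63n is safe here because backoff never returns less than 100ms):

// Sleep a uniformly random duration in [0, backoff(attempt)).
func fullJitter(attempt int) time.Duration {
    return time.Duration(rand.Int63n(int64(backoff(attempt))))
}
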
The retry shape

func submitWithRetry(ctx context.Context, client SubmitterClient, req *SubmitRequest) (*SubmitResponse, error) {
    var lastErr error
    for attempt := 0; attempt < 5; attempt++ {
        resp, err := client.Submit(ctx, req)
        if err == nil {
            return resp, nil
        }
        if !isTransient(err) {
            return nil, fmt.Errorf("non-transient error, no retry: %w", err)
        }
        lastErr = err

        select {
        case <-time.After(backoffWithJitter(attempt)):
            // loop and retry
        case <-ctx.Done():
            return nil, fmt.Errorf("context done during retry: %w", ctx.Err())
        }
    }
    return nil, fmt.Errorf("exhausted retries: %w", lastErr)
}

Five things this loop does right:

  1. Bounded attempts.
  2. Distinguishes transient from permanent errors.
  3. Waits between attempts via select on ctx.Done() — cancellation propagates through the wait.
  4. Captures the last error to return on exhaustion.
  5. Wraps all three error paths (non-transient, context canceled, retries exhausted) so callers can errors.Is the underlying cause.
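
Putting the pieces together at a call site — the dedup key is fixed before the call, so every attempt inside the loop carries the same one (commandID is the hypothetical helper from earlier; requestUUID is minted once per logical operation):

req := &SubmitRequest{
    CommandId: commandID("alice", "bob", 100, requestUUID), // decided once, reused on every retry
    Payload:   payload,
}
resp, err := submitWithRetry(ctx, client, req)
if err != nil {
    // Retries exhausted or a permanent error. The command may still have landed —
    // check the ledger, or resubmit with the same CommandId, before assuming it didn't.
}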

Circuit breakers — the next level

If 100% of calls to a downstream are failing, retrying is pointless and harmful. A circuit breaker tracks recent failure rate and "opens" when it crosses a threshold — meaning the next call fails fast without even trying.

Canonical library: sony/gobreaker is a simple, widely used implementation. Use one when:

  1. A downstream dependency can be completely down for extended periods.
  2. Failing fast is better than stacking retries behind a dead dependency.
  3. You want to shed load so the recovering service isn't hammered the moment it comes back.
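
If you do reach for one, a minimal sketch with sony/gobreaker (its classic, pre-generics API; errors and time assumed imported; settings are illustrative, not recommendations):

cb := gobreaker.NewCircuitBreaker(gobreaker.Settings{
    Name:    "ledger-submit",
    Timeout: 30 * time.Second, // how long the breaker stays open before letting a probe through
})

out, err := cb.Execute(func() (interface{}, error) {
    return submitWithRetry(ctx, client, req)
})
if errors.Is(err, gobreaker.ErrOpenState) {
    // Failed fast: the breaker is open and the server was never called.
}
// On success, out holds the *SubmitResponse returned by submitWithRetry.
_ = out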

Don't add a circuit breaker until you have a real reason. Premature circuit breakers mask bugs.

Deduplication on the consumer side

Sometimes you can't make the upstream operation idempotent — you're consuming from a stream that may replay events on reconnect. Then idempotency is your responsibility:

var seen = map[string]struct{}{}

for {
    upd, err := stream.Recv()
    if err != nil {
        break
    }
    if _, ok := seen[upd.GetEventId()]; ok {
        continue // already handled — the stream replayed it
    }
    seen[upd.GetEventId()] = struct{}{}
    handle(upd)
}

For real production, you'd persist seen (so dedup survives restart) and bound it (so it doesn't grow forever). For Canton's transaction stream specifically, the offset itself is your dedup key — track the highest offset processed and skip anything older.
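
A sketch of the offset-based version — the accessor name is hypothetical, and in real code lastOffset would live in durable storage and be committed together with whatever handle does:

var lastOffset int64 // restore from durable storage on startup

for {
    upd, err := stream.Recv()
    if err != nil {
        break
    }
    if upd.GetOffset() <= lastOffset {
        continue // replayed on reconnect — already processed
    }
    handle(upd)
    lastOffset = upd.GetOffset() // persist alongside handle's effects in real code
}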

Common mistakes

Don't retry application errors. A 400 won't become a 200 by trying again. You waste time and burn quota.

Don't forget the dedup key. If you're retrying and the upstream supports a dedup key (Canton commands do), USE IT.

Don't use a bare time.Sleep in a retry loop. It blocks even if the parent context is canceled. Use select with time.After and ctx.Done().

Takeaways

  1. Make operations idempotent — attach a dedup key chosen once per logical operation, and send the same key on every retry.
  2. Retry only transient errors, with exponential backoff and jitter, a bounded attempt count, and context-aware waits.
  3. When you consume a replayable stream, dedup on your side — by event ID or, for Canton's transaction stream, by offset.
  4. Reach for circuit breakers only when failing fast genuinely beats retrying.