Implementing Auto-Retry for Agent CLIs like Claude Code and Codex

Auto-retry might seem like a simple switch, but in real-world engineering, it's anything but. Hello everyone, I'm Yu Kun, creator of HagiCode. Today, let's skip the fluff and talk about how auto-retry for Agent CLIs like Claude Code and Codex should actually be implemented—to handle exceptions properly without spiraling into endless retry loops.

Background

If you've been working with AI programming recently, you've likely encountered this: tasks don't fail immediately—they break halfway through.

With ordinary HTTP requests, you often just retry with some exponential backoff. But Agent CLIs are different. Tools like Claude Code and Codex typically execute in streaming fashion, pushing output in chunks, while binding to threads, sessions, or resume tokens. In other words, it's not just "did this request fail" but rather:

Is the content already output still valid
Can the current context continue running
Should this failure auto-recover
If recovering, how long to wait, what to send, whether to reuse the original context

Many teams building this for the first time instinctively write the simplest version: retry on error. That's natural, but once it's in the project, problems start popping up one after another:

Some clearly transient errors get treated as final failures
Some errors not worth retrying get replayed repeatedly
Requests with threads and without get treated identically
Unbounded backoff strategies pound the backend with self-inflicted load

HagiCode has stepped in these pits while integrating multiple Agent CLIs. Especially on the Codex side, the initially exposed problem was that certain reconnect messages weren't recognized as retryable terminal states, so existing recovery mechanisms never got a chance to kick in. Basically, the system didn't lack auto-retry—it failed to recognize "this is worth retrying."

So the core point of this article is clear: Auto-retry is not a button, but a layered design.

About HagiCode

The solution shared here comes from our real practice in the HagiCode project. HagiCode's mission isn't just to hook up some model and call it done—it's to unify streaming messages, tool calls, failure recovery, and session context across multiple Agent CLIs into a maintainable execution model.

One of my main concerns is how to make AI programming actually work in production environments. Writing demos isn't hard; turning demos into something teams will use long-term is. HagiCode takes auto-retry seriously not because it looks fancy, but because if long-running, streaming, resumable CLI execution isn't stable, users see an unreliable wrapper that drops connections mid-task, not an intelligent assistant.

If you want to check out the project first, here are two entry points:

GitHub: github.com/HagiCode-org/site
Official site: hagicode.com

Taking it a step further, HagiCode is now on Steam. If you're on Steam, feel free to wishlist:

Steam Store Page (Wishlist / Details)

Why Agent CLI Auto-Retry is Harder Than Ordinary Retry

This is a practical question—let's jump to the conclusion: The difficulty with Agent CLI auto-retry isn't "wait a few seconds and try again," but "can we continue within the original context?"

Think of it like a long conversation. Ordinary API retry is like a busy line—redial. Agent CLI retry is like the other person's signal cutting mid-sentence. You have to decide: call back? Start over? Do they remember where we left off? These aren't the same problem at all.

Specifically, four challenges are most typical.

1. It's Streaming

Once output is sent to the user, you can't quietly swallow failures and retry like with ordinary requests. Because that initial content was already seen, improper replay leads to duplicate text, confused state, and scrambled tool call lifecycles. This isn't magic—it's engineering.

2. It Binds Session Context

Providers like Codex bind to threads; Claude Code-type implementations have continuation targets or equivalent resume context. Real auto-retry prerequisites aren't just "this error looks like transient failure," but also "does this execution still have a medium to continue?"

3. Not All Errors Are Worth Retrying

Network jitter, SSE idle timeout, upstream transient failures—usually worth a try. But authentication failure, lost context, or providers without resume capability? Continued retrying isn't recovery, it's noise.

4. It Needs Boundaries

Infinite auto-retry is almost always wrong. One stable engineering principle is: failure recovery must have boundaries. The system must know: max attempts, spacing between attempts, when to stop and admit defeat.

Because of these characteristics, HagiCode didn't implement auto-retry as a few try/catch lines in some provider—we extracted it as shared capability. Engineering problems need engineering solutions.

HagiCode's Approach: Extract Retry from Provider

HagiCode's current real implementation can be compressed to one sentence:

Shared layer uniformly manages retry flow; specific Providers only answer two questions: Is this terminal state worth retrying? Can current context continue?

This isn't complex, but it's critical. Once responsibilities are separated, Claude Code, Codex, and even other Agent CLIs can all reuse the same skeleton. Models change, tools change, workflows upgrade, but the engineering foundation remains.

Layer 1: Unified Coordinator Manages Retry Loop

The core implementation fragment looks roughly like this:

internal static class ProviderErrorAutoRetryCoordinator
{
    public static async IAsyncEnumerable<CliMessage> ExecuteAsync(
        string prompt,
        ProviderErrorAutoRetrySettings? settings,
        Func<string, IAsyncEnumerable<CliMessage>> executeAttemptAsync,
        Func<bool> canRetryInSameContext,
        Func<TimeSpan, CancellationToken, Task> delayAsync,
        Func<CliMessage, bool> isRetryableTerminalMessage,
        [EnumeratorCancellation] CancellationToken cancellationToken)
    {
        var normalizedSettings = ProviderErrorAutoRetrySettings.Normalize(settings);
        var retrySchedule = normalizedSettings.Enabled
            ? normalizedSettings.GetRetrySchedule()
            : [];

        for (var attempt = 0; ; attempt++)
        {
            var attemptPrompt = attempt == 0
                ? prompt
                : ProviderErrorAutoRetrySettings.ContinuationPrompt;

            CliMessage? terminalFailure = null;

            await foreach (var message in executeAttemptAsync(attemptPrompt)
                               .WithCancellation(cancellationToken))
            {
                if (isRetryableTerminalMessage(message))
                {
                    terminalFailure = message;
                    break;
                }

                yield return message;
            }

            if (terminalFailure is null)
            {
                yield break;
            }

            if (attempt >= retrySchedule.Count || !canRetryInSameContext())
            {
                yield return terminalFailure;
                yield break;
            }

            await delayAsync(retrySchedule[attempt], cancellationToken);
        }
    }
}

This code does something simple but powerful:

Intermediate failures aren't directly passed through; coordinator first judges if recovery is possible
Only when retry budget is exhausted does final failure return to upper layers
From round 2 onward, don't send original prompt—send unified continuation prompt

This is why I keep emphasizing: auto-retry isn't simply "request again." It's not patching an exception branch—it's managing an execution lifecycle. Sounds product-manager-ish, but that's how it works in engineering.

Layer 2: Snapshot Retry Strategy

Another easily overlooked issue: Who decides whether this request enables auto-retry?

HagiCode's answer: Don't rely on "current global configuration"—snapshot the strategy and let it travel with the request. This way, session queuing, message persistence, execution forwarding, provider adaptation won't lose the strategy. One success isn't a system; sustained success is.

Core structure simplifies to:

public sealed record ProviderErrorAutoRetrySnapshot
{
    public const string DefaultStrategy = "default";

    public bool Enabled { get; init; }

    public string Strategy { get; init; } = DefaultStrategy;

    public static ProviderErrorAutoRetrySnapshot Normalize(bool? enabled, string? strategy)
    {
        return new ProviderErrorAutoRetrySnapshot
        {
            Enabled = enabled ?? true,
            Strategy = string.IsNullOrWhiteSpace(strategy)
                ? DefaultStrategy
                : strategy.Trim()
        };
    }
}

Then map to settings objects actually consumed by providers at execution time. The value is direct:

Business layer decides "whether to retry"
Runtime decides "how to retry"

Each manages its own concern. Many problems aren't impossible, just not properly costed. Snapshotting strategy essentially calculates costs upfront.

Layer 3: Provider Only Does Terminal and Context Determination

At specific Claude Code or Codex provider level, responsibilities are actually thin. Think of it as enhancement, not replacement.

Take Codex—it essentially only needs to provide three things when integrating the shared coordinator:

await foreach (var message in ProviderErrorAutoRetryCoordinator.ExecuteAsync(
                   prompt,
                   options.ProviderErrorAutoRetry,
                   retryPrompt => ExecuteCodexAttemptAsync(...),
                   () => !string.IsNullOrWhiteSpace(resolvedThreadId),
                   DelayAsync,
                   IsRetryableTerminalFailure,
                   cancellationToken))
{
    yield return message;
}

You'll find truly Provider-specific judgments are only two:

IsRetryableTerminalFailure
canRetryInSameContext

Codex checks if thread can continue; Claude Code checks if continuation target exists. Backoff strategy, retry count, subsequent prompts—none of these should be reinvented by each Provider.

Once this layer is extracted, HagiCode's cost to integrate more CLIs drops significantly. You don't duplicate the entire retry state machine—just plug in "this provider's boundary conditions." Fast doesn't mean stable; handling doesn't mean handling well; runnable doesn't means maintainable.

An Easy Mistake: Don't Treat All Errors as Retryable

In this analysis, what's most worth calling out separately isn't "how to implement retry" but "how to avoid wrong retry."

The initial problem entry point was Codex missing recognition of a reconnect message. Intuitively, many would choose minimal fix: add another string prefix to whitelist. This isn't wrong per se, but it's more like a demo-era solution, not a long-term maintainable one.

From current HagiCode implementation, the system has moved toward more stable direction. It no longer focuses on specific literal strings but uniformly hands recoverable terminal states to shared coordinator. Benefits are obvious:

Won't completely break from minor text copy changes
Test coverage can focus on "terminal state envelope" rather than single hardcoded text
Same provider's retry logic stays more consistent

Of course, set a boundary: more generic doesn't mean more permissive. If current context cannot continue, even if error looks like transient failure, don't blindly replay.

This is critical. What's truly reassuring isn't that it occasionally works, but that it's reliable most of the time. If a process requires experts to maintain, it's far from mainstream.

Three Most Worthwhile Practical Lessons

Let's wrap this up at the practical level. If you're implementing similar capability in your project, I most recommend guarding these three principles.

1. Retry Budget Must Have Boundaries

HagiCode's current default backoff rhythm:

10 seconds
20 seconds
60 seconds

This rhythm might not fit all systems, but "having boundaries" must be kept. Otherwise, auto-retry quickly transforms from recovery mechanism into disaster amplifier. Don't rush to name it grand—first see if it survives two iterations in your team.

2. Continuation Prompt Should Be Unified

The project uses fixed continuation prompts, letting subsequent attempts explicitly take "continue current context" path rather than initiating a fresh complete request. This isn't flashy, but it's indispensable in real projects. Many capabilities look like magic, but unpacked they're just polished engineering workflows.

3. Both Shared Library and Adapter Layer Need Mirror Tests

I want to emphasize this. Many teams write tests in shared runtime and call it good enough. It's not.

What makes me confident about HagiCode is both layers have test coverage:

Shared Provider tests "whether auto-resume actually occurred"
Adapter layer tests "whether final errors and streaming messages were corrupted"

I additionally ran two related test suites this time—all 31 cases passed. This result itself doesn't prove perfect design, but it shows at least one thing: current auto-retry isn't a paper plan—it's capability constrained by both code and tests. Talk is cheap. Show me the code. Fits perfectly here.

Summary

If we compress this article to one sentence:

Auto-retry for Agent CLIs like Claude Code and Codex is best implemented not as local tricks inside some Provider, but as a combination of shared coordinator + strategy snapshot + context determination + mirror tests.

Benefits are very real:

Logic written once, reused by multiple Providers
Whether request allows retrying can stably follow execution chain
Continue with context, stop without context
Frontend sees stable completion or failure states, not abandoned intermediate noise

This solution was polished through HagiCode's real integration of multiple Agent CLIs. Who says AI-assisted programming isn't the new pair programming? Models help you start, complete, diverge—but what ultimately determines experience ceiling is context, workflow, and constraints.

If this helped you, also feel free to check out HagiCode's public entry points:

GitHub: github.com/HagiCode-org/site
Official site: hagicode.com
30-minute demo: www.bilibili.com/video/BV1pirZBuEzq/
Desktop install: hagicode.com/desktop/
Steam: Steam Store Page (Wishlist / Details)

HagiCode is now on Steam—this isn't vaporware, link's right here. If you're on Steam, wishlist it and click through. More direct than me saying ten sentences here.

That's it for now—see you in real projects.

References

HagiCode project homepage: https://hagicode.com
HagiCode GitHub repo: https://github.com/HagiCode-org/site
Official demo video: https://www.bilibili.com/video/BV1pirZBuEzq/
Desktop install guide: https://hagicode.com/desktop/

Original Article & License

Thanks for reading. If this article helped, consider liking, bookmarking, or sharing it.
This article was created with AI assistance and reviewed by the author before publication.

Author: newbe36524
Original URL: https://docs.hagicode.com/go?platform=devto&target=%2Fblog%2F2026-02-11-agent-cli-automatic-retry%2F
License: Unless otherwise stated, this article is licensed under CC BY-NC-SA. Please retain attribution when sharing.

DE

Source

This article was originally published by DEV Community and written by Hagicode.

Read original article on DEV Community

Back to Discover

Implementing Auto-Retry for Agent CLIs like Claude Code and Codex

Implementing Auto-Retry for Agent CLIs like Claude Code and Codex

Background

About HagiCode

Why Agent CLI Auto-Retry is Harder Than Ordinary Retry

1. It's Streaming

2. It Binds Session Context

3. Not All Errors Are Worth Retrying

4. It Needs Boundaries

HagiCode's Approach: Extract Retry from Provider

Layer 1: Unified Coordinator Manages Retry Loop

Layer 2: Snapshot Retry Strategy

Layer 3: Provider Only Does Terminal and Context Determination

An Easy Mistake: Don't Treat All Errors as Retryable

Three Most Worthwhile Practical Lessons

1. Retry Budget Must Have Boundaries

2. Continuation Prompt Should Be Unified

3. Both Shared Library and Adapter Layer Need Mirror Tests

Summary

References

Original Article & License

Reading List