Figure: Nexus OS architecture diagram showing three layers
Explainer · supervisors · fault-tolerance · erlang

Understanding Supervisor Strategies in Nexus OS

Leonidas Esquire Williamson · April 12, 2026 · 6 min read

What Is a Supervisor?

In Erlang/OTP, a supervisor is a process whose sole job is to monitor other processes and restart them when they fail. Nexus OS brings this battle-tested pattern to AI agents.

When an agent crashes, whether from an API timeout, a malformed response, or an out-of-budget error, the supervisor detects the failure and applies a restart strategy.
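
The detect-and-restart loop can be sketched in a few lines of Python. This illustrates the general pattern only and is not Nexus OS code: supervise, AgentCrash, and make_flaky_agent are hypothetical names, and a plain function call stands in for an agent process.

```python
class AgentCrash(Exception):
    """Hypothetical stand-in for any agent failure (timeout, bad response, budget)."""


def supervise(agent, max_restarts):
    """Run `agent` until it returns; restart it each time it raises AgentCrash.

    Returns (result, restarts). Re-raises the crash once `max_restarts`
    restarts have been used up.
    """
    restarts = 0
    while True:
        try:
            return agent(), restarts
        except AgentCrash:
            if restarts >= max_restarts:
                raise  # give up and let the caller (or a parent supervisor) decide
            restarts += 1


def make_flaky_agent(failures):
    """Build an agent that crashes `failures` times, then succeeds."""
    remaining = [failures]

    def agent():
        if remaining[0] > 0:
            remaining[0] -= 1
            raise AgentCrash("simulated API timeout")
        return "done"

    return agent


result, restarts = supervise(make_flaky_agent(failures=2), max_restarts=5)
print(result, restarts)  # done 2
```

A real supervisor would monitor separate processes rather than call functions inline, but the control flow has the same shape: catch the failure, decide whether a restart is still allowed, and try again.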

The Three Strategies

One-for-One

If one child agent crashes, only that agent is restarted. Other agents continue running undisturbed.

supervisor:
  strategy: one-for-one
  max_restarts: 5
  window: 300s

Use when: Agents are independent. A researcher crashing shouldn't affect a data-bot.

One-for-All

If any child agent crashes, all children are stopped and restarted together.

supervisor:
  strategy: one-for-all
  max_restarts: 3
  window: 600s

Use when: Agents share state or depend on each other. If one fails, the shared state may be corrupted.

Rest-for-One

If a child crashes, that child and all children started after it are restarted. Children started before it continue running.

supervisor:
  strategy: rest-for-one
  max_restarts: 5
  window: 300s

Use when: Agents form a pipeline. If step 2 fails, steps 3 and 4 need to restart, but step 1 is fine.
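
To compare the three strategies side by side, the sketch below computes which children would be restarted when one of them crashes. It is a Python illustration of the rules described above, not the Nexus OS implementation; the function name children_to_restart and the four pipeline step names are made up for the example.

```python
def children_to_restart(strategy, children, failed):
    """Return the children a supervisor restarts when `failed` crashes.

    `children` must be listed in the order they were started.
    """
    children = list(children)
    if failed not in children:
        raise ValueError(f"unknown child: {failed}")
    if strategy == "one-for-one":
        # Only the crashed child.
        return [failed]
    if strategy == "one-for-all":
        # Every child, crashed or not.
        return children
    if strategy == "rest-for-one":
        # The crashed child plus everything started after it.
        return children[children.index(failed):]
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical four-step pipeline, in start order.
pipeline = ["fetch", "parse", "summarize", "publish"]
for strategy in ("one-for-one", "one-for-all", "rest-for-one"):
    print(strategy, children_to_restart(strategy, pipeline, "parse"))
# one-for-one ['parse']
# one-for-all ['fetch', 'parse', 'summarize', 'publish']
# rest-for-one ['parse', 'summarize', 'publish']
```

With "parse" as the failing step, only rest-for-one leaves "fetch" running while still restarting the downstream steps, which is why it suits pipelines.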

Restart Windows

The max_restarts and window settings prevent restart loops. If an agent crashes more than max_restarts times within the window period, the supervisor stops restarting it and escalates, either by shutting down or by notifying the operator.

This is critical for production. Without restart limits, a buggy agent could consume your entire LLM budget in minutes.
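
One way to picture the limit is as a sliding window over crash timestamps. The Python sketch below is a hypothetical helper, not Nexus OS code: the class name RestartWindow is invented, and it assumes that a crash exactly window seconds old no longer counts. Timestamps are passed in explicitly so the behavior is easy to follow.

```python
from collections import deque


class RestartWindow:
    """Track crashes in a sliding time window (hypothetical helper)."""

    def __init__(self, max_restarts, window_seconds):
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self._crashes = deque()  # timestamps of crashes still inside the window

    def record_crash(self, now):
        """Record a crash at time `now` (in seconds).

        Returns True if a restart is still allowed, False if the
        supervisor should escalate instead.
        """
        self._crashes.append(now)
        # Forget crashes that have aged out of the window.
        while now - self._crashes[0] >= self.window_seconds:
            self._crashes.popleft()
        return len(self._crashes) <= self.max_restarts


window = RestartWindow(max_restarts=5, window_seconds=300)
decisions = [window.record_crash(t) for t in (0, 10, 20, 30, 40, 50)]
print(decisions)                 # [True, True, True, True, True, False]
print(window.record_crash(400))  # True: the earlier crashes have aged out
```

The sixth crash inside 300 seconds exceeds max_restarts and triggers escalation, while a crash long after the burst is treated as a fresh failure.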

Practical Example

naos deploy researcher --supervisor one-for-one --max-restarts 5
naos status

The dashboard at localhost:4200/supervisors shows the full supervisor tree with restart counts and child states in real time.