
Understanding Supervisor Strategies in Nexus OS
What Is a Supervisor?
In Erlang/OTP, a supervisor is a process whose sole job is to monitor other processes and restart them when they fail. Nexus OS brings this battle-tested pattern to AI agents.
When an agent crashes, whether from an API timeout, a malformed response, or an out-of-budget error, the supervisor detects the failure and applies a restart strategy.
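To make the detect-and-restart idea concrete, here is a minimal Python sketch. It is not Nexus OS code: the supervise function and flaky_agent are hypothetical names, and the "agent" is a plain function running in the same process, whereas a real supervisor monitors separate processes.

```python
from typing import Callable


def supervise(child: Callable[[], str], max_restarts: int) -> str:
    """Run `child`, restarting it after each exception, up to max_restarts times."""
    restarts = 0
    while True:
        try:
            return child()
        except Exception as exc:
            if restarts >= max_restarts:
                # Out of restarts: give up and surface the failure.
                raise RuntimeError("restart limit reached, escalating") from exc
            restarts += 1
            print(f"child crashed ({exc!r}); restart #{restarts}")


# Hypothetical agent that fails twice with a timeout, then succeeds.
attempts = {"n": 0}


def flaky_agent() -> str:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TimeoutError("API timeout")
    return "done"


result = supervise(flaky_agent, max_restarts=5)
print(result)
```

The caller only sees the final result; the two intermediate crashes are absorbed by the supervising loop.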
The Three Strategies
One-for-One
If one child agent crashes, only that agent is restarted. Other agents continue running undisturbed.
supervisor:
  strategy: one-for-one
  max_restarts: 5
  window: 300s
Use when: Agents are independent. A researcher crashing shouldn't affect a data-bot.
One-for-All
If any child agent crashes, all children are stopped and restarted together.
supervisor:
  strategy: one-for-all
  max_restarts: 3
  window: 600s
Use when: Agents share state or depend on each other. If one fails, the shared state may be corrupted.
Rest-for-One
If a child crashes, that child and all children started after it are restarted. Children started before it continue running.
supervisor:
  strategy: rest-for-one
  max_restarts: 5
  window: 300s
Use when: Agents form a pipeline. If step 2 fails, steps 3 and 4 need to restart, but step 1 is fine.
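The difference between the three strategies comes down to which children get restarted when one fails. The Python sketch below models only that selection step. The function name children_to_restart and the agent names are made up for illustration, and the sketch assumes the list holds the children in the order they were started.

```python
from typing import List


def children_to_restart(strategy: str, children: List[str], failed: str) -> List[str]:
    """Return the children a supervisor would restart, in start order."""
    idx = children.index(failed)  # raises ValueError for an unknown child
    if strategy == "one-for-one":
        return [failed]            # only the crashed child
    if strategy == "one-for-all":
        return list(children)      # every child, crashed or not
    if strategy == "rest-for-one":
        return children[idx:]      # the crashed child and those started after it
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical four-step pipeline, listed in start order.
pipeline = ["fetcher", "parser", "summarizer", "publisher"]
for strategy in ("one-for-one", "one-for-all", "rest-for-one"):
    print(strategy, children_to_restart(strategy, pipeline, "parser"))
```

With "parser" as the failed step, one-for-one selects only the parser, one-for-all selects all four agents, and rest-for-one selects the parser, summarizer, and publisher while leaving the fetcher running.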
Restart Windows
The max_restarts and window settings prevent restart loops. If an agent crashes more than max_restarts times within the window period, the supervisor stops restarting it and escalates, either by shutting down or by notifying the operator.
This is critical for production. Without restart limits, a buggy agent could consume your entire LLM budget in minutes.
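One way to picture the restart window is as a sliding window over crash timestamps. The Python sketch below is an assumption about how such a limit could be tracked, not the Nexus OS implementation. RestartLimiter and record_crash are hypothetical names, and timestamps are passed in explicitly to keep the example deterministic (a real supervisor would more likely read a monotonic clock).

```python
from collections import deque


class RestartLimiter:
    """Tracks recent crashes and decides whether to restart or escalate."""

    def __init__(self, max_restarts: int, window_s: float) -> None:
        self.max_restarts = max_restarts
        self.window_s = window_s
        self._crashes = deque()  # crash timestamps, oldest first

    def record_crash(self, now: float) -> str:
        """Record a crash at time `now` (seconds); return 'restart' or 'escalate'."""
        self._crashes.append(now)
        # Forget crashes that have fallen out of the window.
        while self._crashes and now - self._crashes[0] > self.window_s:
            self._crashes.popleft()
        # More than max_restarts crashes inside the window means escalate.
        if len(self._crashes) > self.max_restarts:
            return "escalate"
        return "restart"


# Mirrors max_restarts: 5 and window: 300s; crashes arrive 10 seconds apart.
limiter = RestartLimiter(max_restarts=5, window_s=300.0)
decisions = [limiter.record_crash(t) for t in (0, 10, 20, 30, 40, 50)]
print(decisions)
```

The first five crashes are answered with a restart, and the sixth crash inside the same 300-second window triggers escalation. Crashes spaced further apart than the window never accumulate, so an occasionally flaky agent keeps being restarted.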
Practical Example
naos deploy researcher --supervisor one-for-one --max-restarts 5
naos status
The dashboard at localhost:4200/supervisors shows the full supervisor tree with restart counts and child states in real time.