Building a self-healing server with Claude Code and a Bash script

Servers go down. It’s not a question of if, but when. And if you’re self-hosting an OpenClaw gateway, you’ve probably experienced the 3am Discord message from a friend asking why your agent went silent — usually because the gateway crashed from an OOM kill, a corrupted build artifact, or an unhandled promise rejection.

The typical fix? SSH in, check the logs, restart the service, go back to sleep. Repeat next week.

I wanted something better. Not just a process monitor that blindly restarts on failure, but something that could actually understand what went wrong and fix the root cause. So I built openclaw-watchdog — a single Bash script that monitors your OpenClaw gateway and, when things break, hands the problem to Claude Code to diagnose and repair autonomously.

Why not just use systemd restart?

Systemd’s Restart=on-failure handles the simplest case: process dies, process restarts. But OpenClaw gateway failures aren’t always that clean. The process might still be running but not listening on its port. The health endpoint might return errors. The disk might be full, causing silent failures. Memory pressure might be degrading performance without triggering an OOM kill.

A blind restart doesn’t help when the disk is at 98% capacity, or when a corrupted config file causes an immediate crash loop. You need something that can look at the full picture and make a judgment call.

Five health checks in 60 lines

The watchdog runs five checks, each deliberately simple:

bash

# 1. Is the process running?
if ! pgrep -f "$OPENCLAW_PROCESS" > /dev/null 2>&1; then
    ISSUES+=("Process '$OPENCLAW_PROCESS' is not running")
fi

# 2. Is the port listening?
if ! ss -tlnp 2>/dev/null | grep -q ":${OPENCLAW_PORT} "; then
    ISSUES+=("Port $OPENCLAW_PORT is not listening")
fi

# 3. Does the CLI health check pass?
HEALTH_OUTPUT=$("$OPENCLAW_BIN" health 2>&1) || true
if echo "$HEALTH_OUTPUT" | grep -qiE "error|fail|refused|timeout"; then
    ISSUES+=("CLI health check reports problems: $HEALTH_OUTPUT")
fi

# 4. Is there enough disk space?
AVAIL_KB=$(df --output=avail / | tail -1 | tr -d ' ')
if [ "$AVAIL_KB" -lt "$MIN_DISK_KB" ]; then
    ISSUES+=("Low disk space: ${AVAIL_KB}KB available")
fi

# 5. Is there enough memory?
AVAIL_MEM_MB=$(awk '/MemAvailable/ {printf "%d", $2/1024}' /proc/meminfo)
if [ "$AVAIL_MEM_MB" -lt "$MIN_MEMORY_MB" ]; then
    ISSUES+=("Low memory: ${AVAIL_MEM_MB}MB available")
fi

If all five pass, the script prunes old logs and exits quietly. A cron job runs it every five minutes, which keeps it unobtrusive and lightweight.
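The checks all share one pattern: append a message to an array on failure, then branch on whether the array is empty. Here is a minimal sketch of that pattern. The summarise_issues helper is my own illustration and is not taken from the script.

```bash
#!/usr/bin/env bash
# Sketch only: summarise_issues is a hypothetical helper, not the project's code.

ISSUES=()   # initialise before any check appends with ISSUES+=("...")

# ... the five checks run here and append to ISSUES on failure ...

summarise_issues() {
    # An empty array means every check passed.
    if [ "${#ISSUES[@]}" -eq 0 ]; then
        echo "healthy"
        return 0
    fi
    # printf repeats its format once per array element.
    printf 'issue: %s\n' "${ISSUES[@]}"
    return 1
}
```

A non-zero return from summarise_issues is the signal to move on to the repair stage; a zero return means there is nothing to do.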

When things go wrong: calling in the AI

Here’s where it gets interesting. When any check fails, the script doesn’t just restart the service. It gathers a comprehensive diagnostic snapshot:

The specific issues detected
Process list filtered for openclaw
All listening ports
Systemd service status
OpenClaw CLI health output
Disk and memory usage
The last 80 lines of OpenClaw logs
Recent kernel messages (to catch OOM kills)
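A snapshot like the one listed above could be collected roughly as follows. Treat this as a sketch under assumptions: the section helper, the OPENCLAW_SERVICE unit name, and the use of journalctl for the OpenClaw logs are placeholders of mine, since where the logs live depends on how the gateway was installed.

```bash
#!/usr/bin/env bash
# Sketch of a diagnostic snapshot. The helper names, the section titles, and
# the OPENCLAW_SERVICE unit name are assumptions for illustration.

OPENCLAW_SERVICE="${OPENCLAW_SERVICE:-openclaw-gateway}"   # hypothetical unit name

# Print a titled section. A failing or missing command must not abort the
# snapshot, so its exit status is ignored and stderr is folded into the output.
section() {
    local title=$1
    shift
    printf '=== %s ===\n' "$title"
    "$@" 2>&1 || true
    printf '\n'
}

# Arguments: the issue messages collected by the health checks.
gather_diagnostics() {
    section "Detected issues"  printf '%s\n' "$@"
    section "Processes"        pgrep -af openclaw
    section "Listening ports"  ss -tln
    section "Service status"   systemctl status "$OPENCLAW_SERVICE" --no-pager
    section "CLI health"       "${OPENCLAW_BIN:-openclaw}" health
    section "Disk usage"       df -h /
    section "Memory usage"     free -m
    section "Recent logs"      journalctl -u "$OPENCLAW_SERVICE" -n 80 --no-pager
    section "Kernel messages"  sh -c 'dmesg | tail -n 30'
}
```

Calling gather_diagnostics "${ISSUES[@]}" prints the whole snapshot on stdout, so it can be captured into a variable with command substitution.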

All of this gets wrapped into a structured prompt and piped directly to Claude Code in non-interactive mode:

bash

AGENT_OUTPUT=$(echo "$PROMPT" | "$CLAUDE_BIN" -p --allowedTools "Bash" 2>&1)

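The -p flag runs Claude Code in print (non-interactive) mode: it reads the prompt, does its work, prints the result, and exits. --allowedTools "Bash" pre-approves only the Bash tool, so Claude Code's dedicated file-editing and web tools stay unavailable unless other permission settings allow them. Shell commands can still modify files and reach the network, though, so this narrows how the agent acts rather than what it can touch. That is why the prompt-level rules in the next section matter.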
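One thing a non-interactive call like this benefits from is an upper bound on runtime, so a stuck agent cannot hold up the next cron run. The sketch below wraps the same invocation in timeout(1) from coreutils. AGENT_TIMEOUT and run_agent are names I made up for the example, and the ten-minute budget is an arbitrary assumption.

```bash
#!/usr/bin/env bash
# Sketch: bound the agent's runtime and keep its exit status.
# AGENT_TIMEOUT and run_agent are illustrative names, not part of the project.

CLAUDE_BIN="${CLAUDE_BIN:-$HOME/.local/bin/claude}"
AGENT_TIMEOUT="${AGENT_TIMEOUT:-600}"    # seconds; an assumed budget

run_agent() {
    local prompt=$1
    # The function's exit status is that of the last pipeline stage, i.e.
    # timeout, which exits with 124 if it had to stop the command.
    printf '%s\n' "$prompt" |
        timeout "$AGENT_TIMEOUT" "$CLAUDE_BIN" -p --allowedTools "Bash" 2>&1
}

# Usage:
#   AGENT_OUTPUT=$(run_agent "$PROMPT"); AGENT_STATUS=$?
```

An exit status of 124 then distinguishes "the agent ran out of time" from "the agent finished but reported a failure".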

Guardrails matter

Giving an AI agent shell access to a production server sounds alarming, and it should. The prompt includes explicit safety rules:

plaintext

RULES:
- Be conservative. Prefer restarts over config changes.
- Do NOT modify openclaw.json unless absolutely necessary.
- Do NOT delete data or session files.
- If disk is full, clean up logs and temp files only.
- If memory is low, identify the biggest consumer and restart it.
- If you cannot fix the issue, clearly state what went wrong.
- After fixing, verify the fix by running: $OPENCLAW_BIN health

These constraints push the agent toward the safest effective action. A full disk gets cleaned up — not reconfigured. A crashed process gets restarted — not reinstalled. And if the agent can’t fix the problem, it says so rather than making things worse.
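For illustration, here is one way rules like these could be combined with the diagnostics into a single prompt using a here-document. The build_prompt function and the prompt layout are assumptions of mine, and the rule list is abbreviated from the one above.

```bash
#!/usr/bin/env bash
# Sketch: wrap diagnostics and rules into one prompt with a here-document.
# build_prompt and the prompt layout are assumptions; the rules are abbreviated.

OPENCLAW_BIN="${OPENCLAW_BIN:-$HOME/.npm-global/bin/openclaw}"

build_prompt() {
    local diagnostics=$1
    # The EOF delimiter is unquoted, so $diagnostics and $OPENCLAW_BIN are
    # expanded while the prompt is built: the agent sees the real path.
    cat <<EOF
The OpenClaw gateway failed its health checks. Diagnose and repair it.

DIAGNOSTICS:
$diagnostics

RULES:
- Be conservative. Prefer restarts over config changes.
- Do NOT delete data or session files.
- If you cannot fix the issue, clearly state what went wrong.
- After fixing, verify the fix by running: $OPENCLAW_BIN health
EOF
}
```

Because the expansion happens once, text inside the diagnostics that happens to look like a variable or a command substitution is passed through literally and is not evaluated.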

Beyond the prompt-level rules, the script itself has structural safety mechanisms:

Lockfile with PID checking prevents concurrent runs from stomping on each other
Tool restriction (--allowedTools "Bash") limits the agent’s capabilities to shell commands only
Automatic log rotation keeps the log directory from becoming its own disk space problem
Trap-based cleanup ensures the lockfile is always removed, even on unexpected exits
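The lockfile and trap mechanisms fit in a few lines. This is a minimal sketch of the general technique and not a copy of the script: the LOCKFILE path and the acquire_lock name are assumptions.

```bash
#!/usr/bin/env bash
# Sketch of a PID lockfile with trap-based cleanup. The LOCKFILE path and the
# function name are assumptions.

LOCKFILE="${LOCKFILE:-/tmp/openclaw-watchdog.lock}"

acquire_lock() {
    if [ -f "$LOCKFILE" ]; then
        local pid
        pid=$(cat "$LOCKFILE" 2>/dev/null)
        # kill -0 sends no signal; it only tests whether the PID can be signalled.
        if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
            echo "watchdog already running (PID $pid)" >&2
            return 1
        fi
        rm -f "$LOCKFILE"    # stale lock left behind by a run that died
    fi
    echo "$$" > "$LOCKFILE"
    # Remove the lock when the script exits. SIGKILL cannot be trapped, so a
    # stale lock is still possible; the PID check above handles that case.
    trap 'rm -f "$LOCKFILE"' EXIT
}
```

The check-then-write sequence here is not atomic, so two runs starting at the same instant could both get through. With a five-minute cron interval that is unlikely, and flock(1) from util-linux is an option if it ever matters.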

Staying in the loop with Discord

Nobody wants to discover their server was down for six hours because they didn’t check the logs. The watchdog sends colour-coded Discord notifications at each stage:

Orange — watchdog triggered, agent is investigating
Green — agent successfully fixed the issue
Red — agent couldn’t fix it, manual intervention needed
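A Discord webhook takes a JSON body, and the colour of an embed is a decimal integer. The sketch below builds such a payload in pure Bash so it does not depend on jq. The function names and the exact colour values are my own choices, and json_escape only covers backslashes, double quotes, and common whitespace, so other control characters in the text would still need handling.

```bash
#!/usr/bin/env bash
# Sketch: build a colour-coded Discord webhook payload without jq.
# Function names and colour values are assumptions.

COLOR_ORANGE=16753920   # 0xFFA500
COLOR_GREEN=3066993     # 0x2ECC71
COLOR_RED=15158332      # 0xE74C3C

# Escape a string for use inside a JSON string literal.
# Backslashes must be handled first so later escapes are not doubled.
json_escape() {
    local s=$1
    s=${s//\\/\\\\}
    s=${s//\"/\\\"}
    s=${s//$'\n'/\\n}
    s=${s//$'\t'/\\t}
    s=${s//$'\r'/\\r}
    printf '%s' "$s"
}

# Arguments: title, colour (decimal integer), description.
discord_payload() {
    local title=$1 color=$2 description=$3
    printf '{"embeds":[{"title":"%s","color":%d,"description":"%s"}]}' \
        "$(json_escape "$title")" "$color" "$(json_escape "$description")"
}

notify_discord() {
    [ -n "${DISCORD_WEBHOOK:-}" ] || return 0    # notifications disabled
    curl -fsS -m 10 -H 'Content-Type: application/json' \
        -d "$(discord_payload "$@")" "$DISCORD_WEBHOOK" > /dev/null
}
```

A call such as notify_discord "Watchdog triggered" "$COLOR_ORANGE" "Agent is investigating" would then post the orange message, and does nothing when DISCORD_WEBHOOK is unset.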

The notification includes the last 500 characters of the agent’s output, so you can see what it did (or tried to do) without SSHing in. After the agent finishes, the script waits five seconds and re-runs the process and port checks to verify the fix actually worked before sending the success or failure notification.
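Both steps are small enough to sketch. The function names and the VERIFY_DELAY variable below are hypothetical; the checks mirror the process and port checks from earlier. One Bash detail is worth knowing here: "${var: -500}" expands to an empty string when the value is shorter than 500 characters, so the length needs checking first.

```bash
#!/usr/bin/env bash
# Sketch: trim agent output for a notification and re-check after a repair.
# last_chars, verify_fix, and VERIFY_DELAY are illustrative names.

# Print the last N characters of a string, or all of it if it is shorter.
last_chars() {
    local text=$1 n=$2
    if [ "${#text}" -le "$n" ]; then
        printf '%s' "$text"
    else
        printf '%s' "${text:$(( ${#text} - n ))}"
    fi
}

# Return 0 only if the process is running and the port is listening.
verify_fix() {
    sleep "${VERIFY_DELAY:-5}"    # give the service time to come up
    pgrep -f "$OPENCLAW_PROCESS" > /dev/null 2>&1 || return 1
    ss -tln 2>/dev/null | grep -q ":${OPENCLAW_PORT} " || return 1
}
```

The result of verify_fix, and not the agent's own claim of success, is what should decide between the green and red notification.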

Getting started

Setup is deliberately minimal:

bash

git clone https://github.com/danjdewhurst/openclaw-watchdog.git
cd openclaw-watchdog
cp watchdog.conf.example watchdog.conf
# Edit watchdog.conf with your paths and Discord webhook URL
chmod +x watchdog.sh

The configuration file covers the essentials:

bash

# Required
OPENCLAW_BIN="$HOME/.npm-global/bin/openclaw"
CLAUDE_BIN="$HOME/.local/bin/claude"

# Optional — remove to disable notifications
DISCORD_WEBHOOK="https://discord.com/api/webhooks/..."

# Tunable thresholds
MIN_DISK_KB="1048576"    # 1GB minimum free disk
MIN_MEMORY_MB="200"      # 200MB minimum available memory (MemAvailable)
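Since the configuration file is plain Bash, a script can load it by sourcing it and then filling in defaults. The following is a sketch of that approach, not the script's actual loader: load_config and the validation of required values are assumptions.

```bash
#!/usr/bin/env bash
# Sketch: load watchdog.conf and fall back to defaults. load_config and the
# required-variable check are assumptions.

load_config() {
    local config_file=$1
    # Sourcing executes the file, so it should be writable only by its owner.
    if [ -r "$config_file" ]; then
        . "$config_file"
    fi

    # ":=" assigns the default only when the variable is unset or empty.
    : "${MIN_DISK_KB:=1048576}"     # 1 GiB expressed in KiB
    : "${MIN_MEMORY_MB:=200}"

    local name
    for name in OPENCLAW_BIN CLAUDE_BIN; do
        # ${!name} is indirect expansion: the value of the variable named by $name.
        if [ -z "${!name:-}" ]; then
            echo "watchdog.conf: $name is required" >&2
            return 1
        fi
    done
}
```

Failing early on a missing required path gives a clearer error than a "command not found" in the middle of a health check.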

Test it manually first, then schedule with cron:

bash

# Run every 5 minutes
*/5 * * * * /path/to/openclaw-watchdog/watchdog.sh
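Cron runs jobs with a much smaller environment than a login shell, which is a common reason a script works by hand and fails on schedule (for example, when node is only on your interactive PATH). A rough way to catch that before scheduling is to run the script with a stripped-down environment. The run_like_cron helper is my own illustration, and the exact PATH cron provides varies by system.

```bash
#!/usr/bin/env bash
# Sketch: run a command with an environment close to what cron provides.
# run_like_cron is an illustrative helper; cron's real PATH varies by system.

run_like_cron() {
    # env -i starts from an empty environment, then sets only these variables.
    env -i HOME="$HOME" SHELL=/bin/sh PATH=/usr/bin:/bin "$@"
}

# Usage:
#   run_like_cron /path/to/openclaw-watchdog/watchdog.sh; echo "exit status: $?"
```

If the script only fails under run_like_cron, the fix is usually to use absolute paths in watchdog.conf or to set PATH at the top of the crontab.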

The project uses MiniMax M2.5 as the default model through Claude Code, keeping operational costs low — but any Claude Code-compatible model works.

What’s next

The watchdog is intentionally simple: a single Bash script with no build step and no dependencies beyond standard Linux tools and Claude Code. That simplicity is a feature, not a limitation. It’s easy to audit, easy to modify, and easy to trust on a production server.

That said, there’s room to grow. Pattern detection across multiple runs could identify recurring issues before they become critical. Integration with other notification platforms beyond Discord is straightforward to add. And the same approach — structured diagnostics piped to an AI agent with constrained tool access — could work for monitoring virtually any self-hosted service.

The code is MIT-licensed and available at github.com/danjdewhurst/openclaw-watchdog. If you’re running an OpenClaw gateway, give it a try. Your 3am self will thank you.