Health API

Monitor system health and configure heartbeat schedules.

Health check (web)

GET /api/health
No authentication required. Returns system health status.
The backend service exposes its own health check at GET /health (without the /api prefix). The two endpoints are independent: the web endpoint reports on the web application process, while the backend endpoint reports on the API service. See "Backend health check" below for details.
Breaking change: The health endpoint no longer returns cpu, memory, or uptime fields. These hardware details are now restricted to the admin-only endpoint at /api/admin/health. If you were consuming CPU, memory, or uptime data from this endpoint, update your integration to use the admin endpoint instead.

Response

{
  "status": "ok",
  "health": "healthy",
  "timestamp": "2026-03-19T00:00:00Z"
}
| Field | Type | Description |
| --- | --- | --- |
| status | string | ok when the health check completed successfully |
| health | string | Overall system health: healthy, degraded, or unhealthy |
| timestamp | string | ISO 8601 timestamp of the health check |
The health field reflects overall system status based on internal CPU and memory thresholds:
| Value | Condition |
| --- | --- |
| healthy | CPU and memory usage both at or below 70% |
| degraded | CPU or memory usage above 70% but at or below 85% |
| unhealthy | CPU or memory usage above 85% |
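
The threshold rules above can be sketched as a small function. This is an illustration only: the name classify_health and the idea of taking the worse of the two readings are assumptions made for the example, not the service's actual implementation.

```python
def classify_health(cpu_percent: float, memory_percent: float) -> str:
    """Map CPU and memory usage (0-100) to a health value.

    Hypothetical helper: applies the documented 70% and 85% thresholds
    to whichever of the two readings is higher.
    """
    worst = max(cpu_percent, memory_percent)
    if worst > 85:
        return "unhealthy"
    if worst > 70:
        return "degraded"
    return "healthy"
```

Both boundaries are inclusive on the lower band, so a reading of exactly 70% is still healthy and exactly 85% is still degraded.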

Degraded and unhealthy responses

When the system is degraded or unhealthy, the endpoint still returns HTTP 200 with the health field set to degraded or unhealthy. The status field remains ok.
{
  "status": "ok",
  "health": "unhealthy",
  "timestamp": "2026-03-19T00:00:00Z"
}

Error response

An HTTP 500 is returned only when an unexpected error occurs while collecting health metrics, not for degraded or unhealthy status:
{
  "status": "error",
  "health": "unhealthy",
  "timestamp": "2026-03-19T00:00:00Z"
}
| Code | Description |
| --- | --- |
| 200 | Health check succeeded. Check the health field for healthy, degraded, or unhealthy. |
| 500 | Unexpected error collecting health metrics. |
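
Because a degraded or unhealthy system still answers with HTTP 200, a client has to inspect the body rather than the status code alone. The following is a minimal sketch of that check; interpret_health and its "error" and "unknown" return values are hypothetical names chosen for the example.

```python
def interpret_health(status_code: int, body: dict) -> str:
    """Reduce a /api/health response to one word.

    Hypothetical helper. Returns "error" for the documented HTTP 500 case,
    the health field for HTTP 200, and "unknown" for anything else.
    """
    if status_code == 500 or body.get("status") == "error":
        return "error"
    if status_code != 200:
        return "unknown"
    health = body.get("health")
    # Only pass through the three documented values.
    return health if health in ("healthy", "degraded", "unhealthy") else "unknown"
```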

Backend health check

GET /health
No authentication required. Returns backend service status including Railway API availability. This endpoint is served by the backend API service (without the /api prefix).
The backend API continues to serve non-provisioning endpoints (health, metrics, auth, AI, registration) even when the Railway API is not reachable. Agent provisioning and lifecycle operations are disabled until the Railway API becomes available.

Response

{
  "status": "ok",
  "timestamp": "2026-03-19T00:00:00Z",
  "docker": "available",
  "provisioning": "enabled",
  "provider": "render"
}
| Field | Type | Description |
| --- | --- | --- |
| status | string | Always ok when the backend is running |
| timestamp | string | ISO 8601 timestamp of the health check |
| docker | string | Provisioning infrastructure availability. available when the Railway API is reachable, unavailable otherwise. This field name is retained for backward compatibility. |
| provisioning | string | Agent provisioning capability. enabled when the Railway API is reachable, disabled otherwise. |
| provider | string | Provisioning infrastructure provider. Currently returns render for backward compatibility, but the underlying infrastructure uses Railway. |

Response when the Railway API is unavailable

When the Railway API is not reachable, the health endpoint still returns HTTP 200 but reports degraded capabilities:
{
  "status": "ok",
  "timestamp": "2026-03-19T00:00:00Z",
  "docker": "unavailable",
  "provisioning": "disabled",
  "provider": "render"
}
When provisioning is disabled, any request to a provisioning-dependent endpoint (such as deploying, starting, stopping, or restarting an agent) returns a 500 error. Non-provisioning endpoints continue to operate normally.
The provider field currently returns render for backward compatibility. Agent containers are now provisioned on Railway. This value may be updated to railway in a future release.
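
A client that wants to avoid the 500 errors described above could check the capability fields before calling a provisioning endpoint. This is a sketch under that assumption; can_provision is a hypothetical helper name.

```python
def can_provision(body: dict) -> bool:
    """Return True when a GET /health body reports provisioning as usable.

    Hypothetical helper. It deliberately ignores the provider field, since
    the documentation says that value may change from render to railway.
    """
    return body.get("provisioning") == "enabled" and body.get("docker") == "available"
```

Keying on provisioning and docker rather than provider keeps the check valid if the provider value is updated in a future release.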

Get heartbeat settings

GET /api/heartbeat?agentId=agent_123
Requires session authentication. Returns the heartbeat configuration for a specific agent.

Query parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| agentId | string | Yes | The agent to retrieve heartbeat settings for |

Response

{
  "heartbeat": {
    "frequency": "3h",
    "enabled": true,
    "lastHeartbeat": "2026-03-19T00:00:00Z",
    "nextHeartbeat": "2026-03-19T03:00:00Z"
  }
}
When no settings have been saved for the agent, the endpoint returns defaults:
{
  "heartbeat": {
    "frequency": "3h",
    "enabled": true,
    "lastHeartbeat": null,
    "nextHeartbeat": null
  }
}
| Field | Type | Description |
| --- | --- | --- |
| heartbeat.frequency | string | Heartbeat interval (for example, 3h, 30m, 1d) |
| heartbeat.enabled | boolean | Whether heartbeats are enabled |
| heartbeat.lastHeartbeat | string or null | ISO 8601 timestamp of the last heartbeat, or null if never set |
| heartbeat.nextHeartbeat | string or null | ISO 8601 timestamp of the next scheduled heartbeat, or null if never set |

Errors

| Code | Description |
| --- | --- |
| 400 | agentId required: the agentId query parameter is missing |
| 401 | Unauthorized |
| 500 | Failed to fetch heartbeat settings |
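
As an illustration of calling this endpoint, the sketch below builds the request URL and tells a saved schedule apart from the null defaults. The base URL and both helper names are assumptions made for the example; attaching the session credentials is left to whatever HTTP client you use.

```python
from urllib.parse import urlencode


def heartbeat_url(base_url: str, agent_id: str) -> str:
    """Build the GET /api/heartbeat URL for one agent (hypothetical helper)."""
    if not agent_id:
        # Mirrors the documented 400 response for a missing agentId.
        raise ValueError("agentId required")
    query = urlencode({"agentId": agent_id})
    return f"{base_url.rstrip('/')}/api/heartbeat?{query}"


def has_recorded_heartbeats(body: dict) -> bool:
    """True when the response carries timestamps rather than the null defaults."""
    heartbeat = body["heartbeat"]
    return heartbeat["lastHeartbeat"] is not None and heartbeat["nextHeartbeat"] is not None
```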

Update heartbeat settings

POST /api/heartbeat
Requires session authentication. Updates heartbeat settings for a specific agent. The agent must belong to the authenticated user.

Request body

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| agentId | string | Yes | The agent to update heartbeat settings for |
| frequency | string | No | Heartbeat interval (for example, 3h, 30m, 1d). Defaults to 3h. Supported units: m (minutes), h (hours), d (days). |
| enabled | boolean | No | Enable or disable heartbeats. Defaults to true. |

Response

{
  "heartbeat": {
    "frequency": "3h",
    "enabled": true,
    "lastHeartbeat": "2026-03-19T00:00:00Z",
    "nextHeartbeat": "2026-03-19T03:00:00Z",
    "lastUpdated": "2026-03-19T00:00:00Z"
  }
}
| Field | Type | Description |
| --- | --- | --- |
| heartbeat.frequency | string | Configured heartbeat interval |
| heartbeat.enabled | boolean | Whether heartbeats are enabled |
| heartbeat.lastHeartbeat | string | ISO 8601 timestamp when the settings were saved |
| heartbeat.nextHeartbeat | string | ISO 8601 timestamp of the next scheduled heartbeat, calculated from the current time plus the frequency |
| heartbeat.lastUpdated | string | ISO 8601 timestamp of the last settings update |
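
The nextHeartbeat calculation (current time plus the frequency) can be sketched as follows. The exact grammar the server accepts is not documented here, so this example assumes a frequency is a whole number followed by one of the supported units m, h, or d; parse_frequency and next_heartbeat are hypothetical names.

```python
import re
from datetime import datetime, timedelta, timezone

# Supported units from the request body table.
_UNITS = {"m": "minutes", "h": "hours", "d": "days"}


def parse_frequency(frequency: str) -> timedelta:
    """Convert a string such as 3h, 30m, or 1d into a timedelta."""
    match = re.fullmatch(r"(\d+)([mhd])", frequency)
    if match is None:
        raise ValueError(f"unsupported frequency: {frequency!r}")
    amount, unit = int(match.group(1)), match.group(2)
    return timedelta(**{_UNITS[unit]: amount})


def next_heartbeat(now: datetime, frequency: str = "3h") -> str:
    """Return now plus the frequency as an ISO 8601 UTC timestamp.

    Pass a timezone-aware datetime; naive values are treated as local time.
    """
    moment = now.astimezone(timezone.utc) + parse_frequency(frequency)
    return moment.strftime("%Y-%m-%dT%H:%M:%SZ")
```

With the default 3h frequency, a save at 2026-03-19T00:00:00Z gives the 2026-03-19T03:00:00Z value shown in the response example.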

Errors

| Code | Description |
| --- | --- |
| 400 | agentId required: the agentId field is missing from the request body |
| 401 | Unauthorized |
| 404 | Agent not found: the agent does not exist or is not owned by the authenticated user |
| 500 | Heartbeat update failed |

Delete heartbeat settings

DELETE /api/heartbeat
Requires session authentication. Resets heartbeat configuration for a specific agent by removing saved settings.

Request body

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| agentId | string | Yes | The agent to reset heartbeat settings for |

Response

{
  "success": true
}

Errors

| Code | Description |
| --- | --- |
| 400 | agentId required: the agentId field is missing from the request body |
| 401 | Unauthorized |
| 500 | Heartbeat reset failed |

Container health checks

Agent services run the official OpenClaw image, which exposes built-in health endpoints on port 18789. The backend uses these to determine service readiness during provisioning and ongoing monitoring.

Built-in health endpoints

The OpenClaw image (ghcr.io/openclaw/openclaw:2026.3.22) provides two health endpoints on each agent service:
| Endpoint | Purpose | Description |
| --- | --- | --- |
| GET /healthz | Liveness | Returns 200 when the gateway process is running. Used by the health check to detect crashed or hung services. |
| GET /readyz | Readiness | Returns 200 when the gateway is ready to accept requests. Use this to verify the service has completed startup before routing traffic. |
Both endpoints are unauthenticated and bind to the service’s internal port (18789).

/healthz response

{
  "ok": true,
  "status": "live"
}
| Field | Type | Description |
| --- | --- | --- |
| ok | boolean | true when the gateway process is running |
| status | string | Always live when the endpoint responds |

/readyz response

{
  "ready": true,
  "failing": [],
  "uptimeMs": 68163
}
| Field | Type | Description |
| --- | --- | --- |
| ready | boolean | true when the gateway is ready to accept requests |
| failing | array | List of failing readiness checks. Empty when all checks pass. |
| uptimeMs | number | Gateway uptime in milliseconds since startup |
The backend probes /healthz on the agent’s public Railway URL for health checks (with a 5-second timeout). The /healthz and /readyz endpoints are provided by the OpenClaw image itself and are available on all agent services.
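
For tooling that reads /readyz directly, a response can be reduced to a one-line summary as sketched below. readiness_summary is a hypothetical helper, and the sketch assumes the entries in failing can be printed as strings, which the documentation does not specify.

```python
def readiness_summary(body: dict) -> str:
    """Summarize a /readyz body as ready or not ready (hypothetical helper)."""
    failing = body.get("failing") or []
    if body.get("ready") and not failing:
        uptime_seconds = body.get("uptimeMs", 0) / 1000
        return f"ready (up {uptime_seconds:.1f}s)"
    # Assumption: each failing entry has a readable string form.
    names = ", ".join(str(item) for item in failing)
    return "not ready: " + (names or "no failing checks reported")
```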

Container health statuses

| Status | Condition |
| --- | --- |
| healthy | Service is running and the internal health endpoint responds successfully |
| starting | Service is running but the health endpoint is not yet responding after all retries |
| running | Service is active on Railway and responding |
| stopped | Service has exited |
| suspended | Service has been suspended (saves resources, retains data). Railway does not natively support suspension, so this status indicates the service has been marked idle. |
| not_found | No matching Railway service exists for this agent |
| error | Service is in an unexpected state, build failed, or cannot be inspected |

Health check behavior

  • The backend probes each agent’s /healthz endpoint to determine service health. The health check uses a 5-second timeout per request.
  • The waitForHealthy function polls service health every 2 seconds, with a default overall timeout of 60 seconds.
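
The polling behavior described above can be sketched as follows. This is not the backend's waitForHealthy implementation: the Python name wait_for_healthy, the injected probe callable, and the clock and sleep parameters are assumptions made so the example is self-contained. In a real client, probe would issue GET /healthz with a 5-second request timeout.

```python
import time
from typing import Callable


def wait_for_healthy(
    probe: Callable[[], bool],
    interval: float = 2.0,   # documented polling interval, in seconds
    timeout: float = 60.0,   # documented default overall timeout, in seconds
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll probe() until it returns True or the overall timeout elapses."""
    deadline = clock() + timeout
    while True:
        if probe():
            return True
        # Stop once another full interval would overrun the deadline.
        if clock() + interval > deadline:
            return False
        sleep(interval)
```

Injecting clock and sleep keeps the loop testable without real waiting.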

Watchdog monitoring

The backend runs a per-agent watchdog that continuously monitors agent health, detects crash loops, and performs automatic recovery. The watchdog operates internally and does not expose dedicated API endpoints. Status information is surfaced through the existing agent status and lifecycle endpoints.

Health check cycle

The watchdog probes each agent’s gateway at GET /healthz on the agent’s internal port. Health checks run on a configurable interval (default: every 2 minutes). When the gateway reports unhealthy, the watchdog transitions the agent to a degraded state and increases the check frequency to every 5 seconds.
| Parameter | Default | Environment variable |
| --- | --- | --- |
| Health check interval | 120 seconds | WATCHDOG_CHECK_INTERVAL |
| Degraded check interval | 5 seconds | WATCHDOG_DEGRADED_CHECK_INTERVAL |
| Startup failure threshold | 3 consecutive failures | WATCHDOG_STARTUP_FAILURE_THRESHOLD |
| Max repair attempts | 2 | WATCHDOG_MAX_REPAIR_ATTEMPTS |
| Crash loop window | 5 minutes | WATCHDOG_CRASH_LOOP_WINDOW |
| Crash loop threshold | 3 crashes in window | WATCHDOG_CRASH_LOOP_THRESHOLD |
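
One way to model these settings is a small configuration object that falls back to the documented defaults. The sketch below assumes every variable holds a whole number and that the interval and window values are expressed in seconds; the documentation does not state the units the environment variables use, so verify this before relying on it. WatchdogConfig is a hypothetical name.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Field name -> documented environment variable.
_ENV_NAMES = {
    "check_interval": "WATCHDOG_CHECK_INTERVAL",
    "degraded_check_interval": "WATCHDOG_DEGRADED_CHECK_INTERVAL",
    "startup_failure_threshold": "WATCHDOG_STARTUP_FAILURE_THRESHOLD",
    "max_repair_attempts": "WATCHDOG_MAX_REPAIR_ATTEMPTS",
    "crash_loop_window": "WATCHDOG_CRASH_LOOP_WINDOW",
    "crash_loop_threshold": "WATCHDOG_CRASH_LOOP_THRESHOLD",
}


@dataclass(frozen=True)
class WatchdogConfig:
    check_interval: int = 120           # seconds (assumed unit)
    degraded_check_interval: int = 5    # seconds (assumed unit)
    startup_failure_threshold: int = 3  # consecutive failures
    max_repair_attempts: int = 2
    crash_loop_window: int = 300        # 5 minutes, in seconds (assumed unit)
    crash_loop_threshold: int = 3       # crashes within the window

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "WatchdogConfig":
        """Read overrides from env, keeping defaults for unset or empty values."""
        overrides = {
            field: int(env[name])
            for field, name in _ENV_NAMES.items()
            if env.get(name)
        }
        return cls(**overrides)
```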

Lifecycle states

The watchdog tracks the following lifecycle states for each agent:
| State | Description |
| --- | --- |
| stopped | Agent is not running |
| starting | Agent service has started; waiting for the first successful health check |
| running | Agent is healthy and serving requests |
| degraded | Health checks are failing after a previous healthy state |
| crash_loop | Multiple crashes detected within the crash loop window |
| repairing | Auto-repair is in progress |

Auto-repair

When the watchdog detects an unhealthy agent, it can automatically attempt recovery. Auto-repair is enabled by default and can be disabled by setting the WATCHDOG_AUTO_REPAIR environment variable to false. The repair sequence is:
  1. Kill the agent gateway process
  2. Wait 5 seconds
  3. Restart the gateway
  4. Wait 30 seconds (startup grace period)
  5. Verify health
If the repair fails, the watchdog retries up to the configured maximum (default: 2 attempts). After exhausting all repair attempts, the agent transitions to the crash_loop state.
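
The repair sequence and its retry limit can be sketched as a loop. The watchdog's real code is internal and not shown in this document, so run_auto_repair and the three injected callables (kill_gateway, restart_gateway, is_healthy) are hypothetical stand-ins.

```python
import time
from typing import Callable


def run_auto_repair(
    kill_gateway: Callable[[], None],
    restart_gateway: Callable[[], None],
    is_healthy: Callable[[], bool],
    max_attempts: int = 2,       # default for WATCHDOG_MAX_REPAIR_ATTEMPTS
    kill_wait: float = 5.0,      # step 2: wait after killing the gateway
    startup_grace: float = 30.0, # step 4: startup grace period
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Run the five-step repair sequence, returning the resulting state."""
    for _ in range(max_attempts):
        kill_gateway()
        sleep(kill_wait)
        restart_gateway()
        sleep(startup_grace)
        if is_healthy():
            return "running"
    # All attempts exhausted.
    return "crash_loop"
```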

Crash loop detection

The watchdog tracks crash timestamps within a sliding window (default: 5 minutes). When the number of crashes in the window reaches the threshold (default: 3), the agent enters the crash_loop state. This prevents infinite restart loops for agents with persistent failures.
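
A sliding-window counter of this kind can be sketched with a deque of timestamps. CrashLoopDetector is a hypothetical class written for illustration; timestamps are plain seconds supplied by the caller.

```python
from collections import deque


class CrashLoopDetector:
    """Track crash timestamps and flag a crash loop within a sliding window."""

    def __init__(self, window_seconds: float = 300.0, threshold: int = 3) -> None:
        self.window_seconds = window_seconds  # default: 5 minutes
        self.threshold = threshold            # default: 3 crashes
        self._crashes: deque = deque()

    def record_crash(self, now: float) -> bool:
        """Record a crash at time now; return True if the threshold is reached."""
        self._crashes.append(now)
        cutoff = now - self.window_seconds
        # Drop crashes that have aged out of the window.
        while self._crashes and self._crashes[0] < cutoff:
            self._crashes.popleft()
        return len(self._crashes) >= self.threshold
```

Three crashes at 0, 100, and 200 seconds trip the detector, while crashes at 0, 200, and 400 seconds do not, because the first has left the window by the time the third arrives.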

Notifications

The watchdog sends notifications for critical events (degraded, crash loop, repair attempts) through configured channels:
  • Telegram — when TELEGRAM_BOT_TOKEN and TELEGRAM_ADMIN_CHAT_ID are set
  • Discord — when DISCORD_WEBHOOK_URL is set

Railway status webhook

POST /api/webhooks/railway-status
Receives platform status notifications from Railway’s status page and deployment events from the Railway dashboard. This endpoint processes deployment events, incident updates, component status changes, and page-level notifications. Events are persisted to Redis so the dashboard can display real-time Railway status.
This endpoint accepts webhooks from both status.railway.com (incident and component updates) and the Railway dashboard (deployment events). Configure webhook subscriptions in both locations to point to this URL.

Authentication

When the RAILWAY_WEBHOOK_SECRET environment variable is configured, requests must include a valid secret via one of the following methods:
| Method | Location | Description |
| --- | --- | --- |
| Header | x-railway-secret | Shared secret in a custom request header |
| Query parameter | ?secret= | Shared secret as a URL query parameter |
The secret is verified using a constant-time comparison to prevent timing attacks. When RAILWAY_WEBHOOK_SECRET is not configured, requests are accepted without verification (development mode only).
You should always configure RAILWAY_WEBHOOK_SECRET in production to prevent unauthorized parties from injecting fake status notifications.
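
The verification rule can be sketched with the standard library's constant-time comparison. This illustrates the documented behavior and is not the endpoint's source: is_authorized is a hypothetical name, and preferring the header over the query parameter when both are present is an assumption.

```python
import hmac
from typing import Optional


def is_authorized(
    configured_secret: Optional[str],
    header_secret: Optional[str],
    query_secret: Optional[str],
) -> bool:
    """Check a webhook request against RAILWAY_WEBHOOK_SECRET."""
    if not configured_secret:
        # No secret configured: requests are accepted (development mode only).
        return True
    # Assumption: the x-railway-secret header wins if both are supplied.
    supplied = header_secret if header_secret is not None else query_secret
    if supplied is None:
        return False
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(supplied.encode("utf-8"), configured_secret.encode("utf-8"))
```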

Request body

The endpoint accepts two payload formats: deployment events from the Railway dashboard and status-page events from Railway’s status page.

Deployment event

Sent by Railway when a deployment status changes.
| Field | Type | Required | Description |
| --- | --- | --- | --- |
| type | string | No | Event type identifier |
| deployment | object | No | Deployment details |
| deployment.id | string | No | Deployment identifier |
| deployment.status | string | No | Current deployment status (for example, SUCCESS, FAILED, BUILDING, DEPLOYING) |
| deployment.url | string | No | Deployment URL |
| deployment.service | object | No | Service metadata |
| deployment.service.name | string | No | Name of the deployed service |

Status-page event

Sent by Railway’s status page for incident and component updates. The payload follows the Railway status page webhook format.
| Field | Type | Required | Description |
| --- | --- | --- | --- |
| incident | object | No | Incident details including name, status, and incident_updates |
| incident.name | string | No | Name of the incident |
| incident.status | string | No | Current incident status (for example, investigating, identified, monitoring, resolved) |
| incident.incident_updates | array | No | List of update objects. The first entry’s body field contains the latest update message. |
| component | object | No | Component status change details |
| component.name | string | No | Name of the affected component |
| component.status | string | No | Current component status (for example, operational, degraded_performance, partial_outage, major_outage) |
| page | object | No | Page-level status information |

Response

On success, the endpoint returns the received event along with the persisted record:
{
  "received": true,
  "record": {
    "status": "SUCCESS",
    "name": "my-service",
    "message": "https://my-service.up.railway.app",
    "eventType": "deployment",
    "receivedAt": "2026-03-27T12:00:00.000Z"
  }
}
| Field | Type | Description |
| --- | --- | --- |
| received | boolean | Always true on success |
| record | object | The status record persisted to Redis |
| record.status | string | Normalized status value from the event |
| record.name | string | Service or incident name. Defaults to "Railway" for status-page events. |
| record.message | string | Deployment URL or latest incident update body |
| record.eventType | string | One of deployment, incident, component, or the type field from the payload |
| record.receivedAt | string | ISO 8601 timestamp when the event was received |
The record is stored in Redis under the key railway:status:latest with a 7-day TTL. When Redis is not configured (KV_REST_API_URL and KV_REST_API_TOKEN not set), the endpoint still processes the event and returns the record but does not persist it.

Error response

Returned when the request body is not valid JSON:
{
  "error": "Invalid payload"
}
| Code | Description |
| --- | --- |
| 200 | Webhook payload received and processed |
| 400 | Invalid JSON payload |
| 401 | Unauthorized: missing or invalid secret when RAILWAY_WEBHOOK_SECRET is configured |

Example payloads

Deployment event

{
  "type": "deployment.completed",
  "deployment": {
    "id": "dep_abc123",
    "status": "SUCCESS",
    "url": "https://my-service.up.railway.app",
    "service": {
      "name": "my-service"
    }
  }
}

Incident event

{
  "incident": {
    "name": "Elevated error rates on US-West deployments",
    "status": "investigating",
    "incident_updates": [
      {
        "body": "We are investigating elevated error rates affecting deployments in the US-West region."
      }
    ]
  }
}
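
To show how payloads like the two above could map onto the record fields described in the Response section, here is one plausible reading of that mapping. It is a sketch, not the endpoint's code: build_record is a hypothetical name, and the handling of component events and of payloads with none of the three objects is an assumption.

```python
from datetime import datetime, timezone


def build_record(payload: dict, now: datetime) -> dict:
    """Normalize a webhook payload into a status record (illustrative only)."""
    deployment = payload.get("deployment")
    incident = payload.get("incident")
    component = payload.get("component")
    if deployment:
        status = deployment.get("status")
        name = (deployment.get("service") or {}).get("name")
        message = deployment.get("url")
        event_type = "deployment"
    elif incident:
        updates = incident.get("incident_updates") or []
        status = incident.get("status")
        name = incident.get("name") or "Railway"
        # The first entry's body holds the latest update message.
        message = updates[0].get("body") if updates else None
        event_type = "incident"
    elif component:
        # Assumption: component events reuse the component's name and status.
        status = component.get("status")
        name = component.get("name") or "Railway"
        message = None
        event_type = "component"
    else:
        # Assumption: fall back to the payload's type field.
        status, name, message = None, "Railway", None
        event_type = payload.get("type")
    received_at = now.astimezone(timezone.utc).isoformat(timespec="milliseconds")
    return {
        "status": status,
        "name": name,
        "message": message,
        "eventType": event_type,
        "receivedAt": received_at.replace("+00:00", "Z"),
    }
```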

Railway status polling

GET /api/webhooks/railway-status
Returns the last-known Railway status from Redis. No authentication required. Use this endpoint to display Railway platform status on your dashboard.

Response

When a status event has been received and persisted:
{
  "status": "SUCCESS",
  "lastEvent": {
    "status": "SUCCESS",
    "name": "my-service",
    "message": "https://my-service.up.railway.app",
    "eventType": "deployment",
    "receivedAt": "2026-03-27T12:00:00.000Z"
  },
  "endpoint": "railway-status-webhook"
}
| Field | Type | Description |
| --- | --- | --- |
| status | string | Status from the most recent event, or no-events if no events have been received |
| lastEvent | object or null | The full status record from the last webhook event, or null if no events exist |
| endpoint | string | Always railway-status-webhook |
When no events have been received:
{
  "status": "no-events",
  "lastEvent": null,
  "endpoint": "railway-status-webhook"
}
When Redis is not configured (KV_REST_API_URL and KV_REST_API_TOKEN not set):
{
  "status": "unknown",
  "message": "Redis not configured",
  "endpoint": "railway-status-webhook"
}
| Code | Description |
| --- | --- |
| 200 | Status retrieved (or fallback returned when Redis is unavailable) |
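
Since all three response shapes arrive with HTTP 200, a dashboard has to branch on the body. The sketch below turns each shape into a display string; describe_railway_status and the wording of the returned strings are assumptions made for the example.

```python
def describe_railway_status(body: dict) -> str:
    """Turn a GET /api/webhooks/railway-status body into a display string."""
    status = body.get("status")
    if status == "unknown":
        # Redis is not configured; the body carries a message instead of lastEvent.
        return body.get("message", "status unavailable")
    last_event = body.get("lastEvent")
    if status == "no-events" or last_event is None:
        return "no Railway events received yet"
    return f"{last_event['name']}: {last_event['status']} ({last_event['eventType']})"
```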