
Health Monitoring

Keeping services running is not enough — you also need to know when something has gone wrong. A lightweight monitoring layer checks every service periodically and sends an alert if anything is down.

Two layers of monitoring

There are two complementary monitoring mechanisms in place. The first is a watchdog script that runs every few minutes, sends an HTTP request to each service's health endpoint, and asks systemd to restart any that don't respond. This handles transient failures that systemd's own restart logic might not catch (for example, a process that is running but not serving requests).
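A watchdog of this kind can be sketched in a few dozen lines. The service names, ports, and the `name.service` unit naming convention below are placeholders, not the actual configuration:

```python
import subprocess
import urllib.error
import urllib.request

# Hypothetical service list -- names and health endpoints are illustrative.
SERVICES = {
    "webapp": "http://127.0.0.1:8000/health",
    "api": "http://127.0.0.1:8100/health",
}

def is_healthy(url: str, timeout: float = 5.0) -> bool:
    """Return True if the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers refused connections, timeouts, and non-2xx responses.
        return False

def find_unhealthy(results: dict[str, bool]) -> list[str]:
    """Names of services whose health check failed."""
    return [name for name, ok in results.items() if not ok]

def watchdog() -> None:
    results = {name: is_healthy(url) for name, url in SERVICES.items()}
    for name in find_unhealthy(results):
        # Ask systemd to restart the unit; requires suitable privileges.
        subprocess.run(["systemctl", "restart", f"{name}.service"], check=False)

if __name__ == "__main__":
    watchdog()
```

Because the check goes through the HTTP endpoint rather than the process table, a hung-but-running process is correctly flagged as down.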

The second is a deeper health reporter — a Python script that checks every service, formats the results into an HTML status table, and sends it as an email report. It runs on a schedule and can also be triggered on demand. If the reporter itself crashes, a fallback notification is sent.
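The reporting half can be sketched with the standard library alone. The sender, recipient, and local-MTA assumption are illustrative, not the actual setup:

```python
import smtplib
from email.message import EmailMessage

def build_status_table(results: dict[str, bool]) -> str:
    """Render health-check results as a small HTML status table."""
    rows = "".join(
        f"<tr><td>{name}</td><td>{'OK' if ok else 'DOWN'}</td></tr>"
        for name, ok in sorted(results.items())
    )
    return f"<table><tr><th>Service</th><th>Status</th></tr>{rows}</table>"

def send_report(results: dict[str, bool], sender: str, recipient: str) -> None:
    """Email the HTML table, with a plain-text fallback part."""
    msg = EmailMessage()
    msg["Subject"] = "Service health report"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("See the HTML part for the status table.")
    msg.add_alternative(build_status_table(results), subtype="html")
    with smtplib.SMTP("localhost") as smtp:  # assumes a local MTA is listening
        smtp.send_message(msg)
```

Wrapping the call to `send_report` in a `try`/`except` that fires a simpler notification is one way to implement the fallback for a crashed reporter.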

Alert behaviour

The monitoring design is intentionally low-noise. Alerts are only sent when something is actually wrong — the script runs silently when everything is healthy. This avoids alert fatigue and keeps the signal-to-noise ratio high.
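The low-noise policy reduces to a single gate before any mail is sent, sketched here as a one-line predicate:

```python
def should_alert(results: dict[str, bool]) -> bool:
    """Send a report only when at least one service is down."""
    return not all(results.values())
```

When `should_alert` returns False, the script exits without producing any output at all, so a healthy week generates zero messages.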

Why not use a full observability stack?

For a single VPS running a handful of services, a full observability stack (Prometheus, Grafana, Alertmanager) would add significant operational overhead for diminishing returns. The current approach uses about 100 lines of Python and standard Linux tooling to achieve the same practical outcome: knowing immediately when something is down and having enough context to fix it.

This is a deliberate trade-off — matching tooling complexity to operational scale.
