Infrastructure
Health Monitoring
Keeping services running is not enough — you also need to know when something has gone wrong. A lightweight monitoring layer checks every service periodically and sends an alert if anything is down.
Two layers of monitoring
There are two complementary monitoring mechanisms in place. The first is a watchdog script that runs every few minutes, sends an HTTP request to each service's health endpoint, and asks systemd to restart any that don't respond. This handles transient failures that systemd's own restart logic might not catch (for example, a process that is running but not serving requests).
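The core of such a watchdog fits in a few lines of Python. This is a minimal sketch, not the actual script: the service names and health-endpoint URLs are hypothetical placeholders, and it assumes each service runs as a systemd unit whose name matches the key in the map.

```python
import subprocess
import urllib.request
import urllib.error

# Hypothetical service -> health endpoint map; adjust to your deployment.
SERVICES = {
    "api": "http://127.0.0.1:8000/health",
    "worker": "http://127.0.0.1:8001/health",
}

def is_healthy(url, timeout=5):
    """Return True if the endpoint answers HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure: all count as down.
        return False

def find_unhealthy(services, probe=is_healthy):
    """Return the names of services whose health check fails."""
    return [name for name, url in services.items() if not probe(url)]

def restart(name):
    """Ask systemd to restart the unit (unit name assumed to match the key)."""
    try:
        subprocess.run(["systemctl", "restart", name], check=False)
    except FileNotFoundError:
        pass  # systemctl not available (e.g. running outside the VPS)
```

Run from a cron entry or a systemd timer every few minutes: `for name in find_unhealthy(SERVICES): restart(name)`. Note that this catches the "running but not serving" case precisely because the probe is an HTTP request, not a process check.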
The second is a deeper health reporter — a Python script that checks every service, formats the results into an HTML status table, and sends it as an email report. It runs on a schedule and can also be triggered on demand. If the reporter itself crashes, a fallback notification is sent.
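The reporter side can be sketched similarly. The addresses and colours below are illustrative assumptions, and the sketch assumes a local MTA is listening on localhost; the real script's details may differ.

```python
import smtplib
from email.message import EmailMessage

def render_table(results):
    """Render {service: ok_bool} as a colour-coded HTML status table."""
    rows = "".join(
        f"<tr><td>{name}</td>"
        f'<td style="background:{"#c8e6c9" if ok else "#ffcdd2"}">'
        f'{"UP" if ok else "DOWN"}</td></tr>'
        for name, ok in sorted(results.items())
    )
    return f"<table><tr><th>Service</th><th>Status</th></tr>{rows}</table>"

def send_report(results, sender="monitor@example.com", recipient="ops@example.com"):
    """Email the status table via the local MTA (assumed on localhost:25)."""
    msg = EmailMessage()
    msg["Subject"] = "Service status report"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("This report is best viewed as HTML.")
    msg.add_alternative(render_table(results), subtype="html")
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)
```

Keeping the rendering separate from the sending makes the table easy to test without a mail server.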
Alert behaviour
The monitoring design is intentionally low-noise. Alerts are only sent when something is actually wrong — the script runs silently when everything is healthy. This avoids alert fatigue and keeps the signal-to-noise ratio high.
- HTTP-level liveness checks — verifies each service is actually responding, not just running.
- Email alerts include a colour-coded status table so the problem is obvious at a glance.
- A --force flag allows sending reports on demand for testing or status reviews.
- Fallback alert fires if the monitoring script itself encounters an unhandled exception.
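The alerting policy above (silent when healthy, --force override, fallback on crash) can be sketched as a small driver. The check, report, and fallback callables are injected placeholders here, standing in for the real functions:

```python
import argparse
import traceback

def should_report(results, force=False):
    """Report only when some service is down, unless --force overrides."""
    return force or not all(results.values())

def main(argv=None, check=None, report=None, fallback=None):
    """Run one monitoring pass. check/report/fallback are injected callables."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--force", action="store_true",
                        help="send a report even when everything is healthy")
    args = parser.parse_args(argv)
    try:
        results = check()
        if should_report(results, force=args.force):
            report(results)
        # Otherwise stay silent: no news is good news.
    except Exception:
        # The monitor itself failed; send a minimal fallback alert
        # rather than failing silently.
        fallback(traceback.format_exc())
```

Routing every unhandled exception through the fallback is what guarantees an alert even when the reporter itself is the thing that broke.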
Why not use a full observability stack?
For a single VPS running a handful of services, a full observability stack (Prometheus, Grafana, alertmanager) would be significant operational overhead with diminishing returns. The current approach uses about 100 lines of Python and standard Linux tooling to achieve the same outcome: knowing immediately when something is down and having enough context to fix it.
This is a deliberate trade-off — matching tooling complexity to operational scale.