Alerting: Getting Notified When Things Actually Break

[Part 5] Or: how I learned to stop worrying and love the threshold

Everything we've built so far is passive. You open Grafana, you look at things, you notice problems. That's fine if you enjoy staring at dashboards (no judgment, we're all here for the same reason). But the real value of an observability stack shows up when it taps you on the shoulder at 2am and says "hey, that ZFS drive just went offline" before you wake up to find your media library gone.

This article is about building that tap on the shoulder. We're going to configure Grafana alerting to notify you via Discord and email when something genuinely needs attention — with enough tuning to avoid the opposite problem, where your phone buzzes every five minutes because nginx got a 404 from a bot somewhere and your threshold is set to "literally anything."

The goal is actionable alerts. Not noisy alerts. Not silent alerts. Actionable ones.


How Grafana Alerting Works

Three components:

  • Alert Rules — the conditions being evaluated ("disk usage above 85%")
  • Contact Points — where notifications go (Discord, email, etc.)
  • Notification Policies — which alerts go where, and how often

We're provisioning all of this as code using Grafana's provisioning system. Alert rules live in YAML files on disk, load automatically when Grafana starts, and get version controlled in the git repo. The alternative is clicking through the UI to create rules manually — which works until you recreate the container and lose everything. We've been burned enough times in this project to know better.


Setting Up Contact Points

To get a Discord webhook URL:

  1. Open Discord and go to your server
  2. Edit a channel (or create a new #homelab-alerts channel)
  3. Go to Integrations → Webhooks → New Webhook
  4. Name it Grafana and copy the webhook URL

Add it to your .env file on Nexus:

DISCORD_WEBHOOK_URL=https://discord.com/api/webhooks/your/webhook/url

Make sure Grafana's environment variables include it so the reference in the contact points file resolves correctly:

  grafana:
    environment:
      - DISCORD_WEBHOOK_URL=${DISCORD_WEBHOOK_URL}
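
Before pointing Grafana at the webhook, it can save some head-scratching to confirm the URL works on its own. Here's a minimal Python sketch, assuming DISCORD_WEBHOOK_URL is exported in your shell. The function names are made up for illustration; the only Discord-specific part is that a webhook accepts a JSON body with a content field.

```python
import json
import os
import urllib.request


def build_payload(text: str) -> bytes:
    """Encode a plain-text message as the JSON body a Discord webhook accepts."""
    return json.dumps({"content": text}).encode("utf-8")


def send_test_message(webhook_url: str, text: str) -> int:
    """POST the message and return the HTTP status code (typically 204 on success)."""
    request = urllib.request.Request(
        webhook_url,
        data=build_payload(text),
        headers={
            "Content-Type": "application/json",
            # Assumption: an explicit User-Agent avoids the default urllib one being rejected.
            "User-Agent": "homelab-webhook-test",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


# Usage (hits the network, so it is left commented out here):
# print(send_test_message(os.environ["DISCORD_WEBHOOK_URL"], "Webhook test from the homelab"))
```

If a message shows up in the channel, any later delivery problem is on the Grafana side rather than the webhook itself.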

Create config/grafana/provisioning/alerting/contact-points.yaml:

apiVersion: 1

contactPoints:
  - orgId: 1
    name: Discord and Email
    receivers:
      - uid: discord_critical
        type: discord
        settings:
          url: ${DISCORD_WEBHOOK_URL}
          message: "**{{ .CommonAnnotations.summary }}**\n{{ .CommonAnnotations.description }}\n**Status:** {{ .Status | toUpper }}\n**Severity:** {{ .CommonLabels.severity | toUpper }}"
      - uid: email_critical
        type: email
        settings:
          addresses: you@example.com  # replace with your own address

  - orgId: 1
    name: Discord
    receivers:
      - uid: discord_warning
        type: discord
        settings:
          url: ${DISCORD_WEBHOOK_URL}
          message: "**{{ .CommonAnnotations.summary }}**\n{{ .CommonAnnotations.description }}\n**Status:** {{ .Status | toUpper }}\n**Severity:** {{ .CommonLabels.severity | toUpper }}"

The explicit message: field in the contact point settings is important. Grafana's default Discord message template sends raw Go template syntax as literal text in some versions. The explicit field overrides the default and keeps your Discord notifications clean. That one cost me about two hours to figure out.
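
To get a feel for what that template produces, here's a rough Python imitation of it. This is only a sketch of the expected output, not how Grafana renders templates internally; the function name and sample values are hypothetical.

```python
def render_message(summary: str, description: str, status: str, severity: str) -> str:
    """Approximate the text produced by the contact point's message template."""
    return (
        f"**{summary}**\n"
        f"{description}\n"
        f"**Status:** {status.upper()}\n"
        f"**Severity:** {severity.upper()}"
    )


# Sample values borrowed from the Host Down rule later in this article.
print(render_message(
    summary="Host down",
    description="A host has been unreachable for more than 2 minutes - check Prometheus targets",
    status="firing",
    severity="critical",
))
```

If what lands in Discord looks like this (bold summary, description, then status and severity in caps), the template is rendering. If you see curly braces, it isn't.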


Notification Policies

Create config/grafana/provisioning/alerting/notification-policies.yaml:

apiVersion: 1

policies:
  - orgId: 1
    receiver: Discord
    group_by: ['alertname', 'instance']
    group_wait: 30s
    group_interval: 5m
    repeat_interval: 12h
    routes:
      - receiver: Discord and Email
        matchers:
          - severity = critical
        repeat_interval: 4h
      - receiver: Discord
        matchers:
          - severity = warning
        repeat_interval: 12h

The repeat_interval settings matter more than you'd think. We learned this during a power outage that took down all three Proxmox nodes — with repeat_interval: 1h on critical alerts, Discord became a very loud place very quickly at 2am.

The policy above means:

  • Critical alerts: notify Discord and email after the 30-second group_wait, then remind every 4 hours while still firing
  • Warning alerts: notify Discord once, then remind every 12 hours while still firing

Reasonable balance between "I need to know about this" and "please stop."
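
Here's a small Python model of that policy, mostly to make the repeat_interval arithmetic concrete. It's a deliberate simplification: it ignores group_wait and group_interval, and the function names are my own rather than anything from Grafana.

```python
from datetime import timedelta

# Mirrors the routes in notification-policies.yaml: (label, value, receiver, repeat_interval).
ROUTES = [
    ("severity", "critical", "Discord and Email", timedelta(hours=4)),
    ("severity", "warning", "Discord", timedelta(hours=12)),
]
DEFAULT = ("Discord", timedelta(hours=12))


def route(labels: dict) -> tuple:
    """Return (receiver, repeat_interval) for the first matching route, else the default."""
    for key, value, receiver, repeat in ROUTES:
        if labels.get(key) == value:
            return receiver, repeat
    return DEFAULT


def notifications_sent(firing_for: timedelta, repeat: timedelta) -> int:
    """Count the first notification plus one reminder per elapsed repeat interval."""
    return 1 + firing_for // repeat


outage = timedelta(hours=8)
print(route({"severity": "critical"})[0])               # Discord and Email
print(notifications_sent(outage, timedelta(hours=1)))   # 9 per alert group
print(notifications_sent(outage, timedelta(hours=4)))   # 3 per alert group
```

Multiply those counts by the number of hosts that are down and the difference between 1h and 4h becomes obvious.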


Prometheus Alert Rules

Create config/grafana/provisioning/alerting/prometheus-rules.yaml.

Before you use this file, get your Prometheus datasource UID:

curl -s -u admin:your_password http://localhost:3001/api/datasources | python3 -m json.tool | grep -E '"uid"|"name"'

Replace YOUR_PROMETHEUS_UID throughout the file with the actual UID.
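
If you'd rather not copy and paste the UID by hand, the lookup and replacement can be scripted. A sketch, assuming the datasource is named "Prometheus" in Grafana (yours may differ) and that you pass in the JSON returned by the curl command above; both helper names are hypothetical.

```python
import json


def datasource_uids(api_response: str) -> dict:
    """Map datasource name to UID from the JSON list returned by /api/datasources."""
    return {ds["name"]: ds["uid"] for ds in json.loads(api_response)}


def fill_placeholder(rules_yaml: str, placeholder: str, uid: str) -> str:
    """Replace every occurrence of a placeholder such as YOUR_PROMETHEUS_UID."""
    return rules_yaml.replace(placeholder, uid)


# Made-up response for illustration; real UIDs will look different.
sample = '[{"name": "Prometheus", "type": "prometheus", "uid": "abc123"}]'
uids = datasource_uids(sample)
print(fill_placeholder("datasourceUid: YOUR_PROMETHEUS_UID",
                       "YOUR_PROMETHEUS_UID", uids["Prometheus"]))
```

The same two helpers work for the Loki rules file later on, with YOUR_LOKI_UID as the placeholder.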

apiVersion: 1

groups:
  - orgId: 1
    name: host-alerts
    folder: Infrastructure
    interval: 1m
    rules:

      # Host Down
      - uid: host_down
        title: Host Down
        condition: C
        data:
          - refId: A
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: YOUR_PROMETHEUS_UID
            model:
              expr: up{job="node"}
              intervalMs: 1000
              maxDataPoints: 43200
          - refId: B
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: __expr__
            model:
              expression: A
              reducer: last
              settings:
                mode: dropNN
              type: reduce
          - refId: C
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: __expr__
            model:
              conditions:
                - evaluator:
                    params: [1]
                    type: lt
                  operator:
                    type: and
                  query:
                    params: [B]
                  reducer:
                    type: last
              expression: B
              type: threshold
        noDataState: Alerting
        execErrState: Alerting
        for: 2m
        annotations:
          summary: "Host down"
          description: "A host has been unreachable for more than 2 minutes - check Prometheus targets"
        labels:
          severity: critical

      # Disk Space Warning - Boot Drives
      - uid: disk_warning_boot
        title: Disk Space Warning - Boot Drive
        condition: C
        data:
          - refId: A
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: YOUR_PROMETHEUS_UID
            model:
              expr: |
                (1 - node_filesystem_avail_bytes{mountpoint="/", fstype!="tmpfs"}
                / node_filesystem_size_bytes{mountpoint="/", fstype!="tmpfs"}) * 100
              intervalMs: 1000
              maxDataPoints: 43200
          - refId: B
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: __expr__
            model:
              expression: A
              reducer: last
              settings:
                mode: dropNN
              type: reduce
          - refId: C
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: __expr__
            model:
              conditions:
                - evaluator:
                    params: [75]
                    type: gt
                  operator:
                    type: and
                  query:
                    params: [B]
                  reducer:
                    type: last
              expression: B
              type: threshold
        noDataState: NoData
        execErrState: Error
        for: 5m
        annotations:
          summary: "Disk space warning - boot drive"
          description: "A boot drive is above 75% capacity - check node_exporter metrics"
        labels:
          severity: warning

      # Disk Space Critical - Boot Drives
      - uid: disk_critical_boot
        title: Disk Space Critical - Boot Drive
        condition: C
        data:
          - refId: A
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: YOUR_PROMETHEUS_UID
            model:
              expr: |
                (1 - node_filesystem_avail_bytes{mountpoint="/", fstype!="tmpfs"}
                / node_filesystem_size_bytes{mountpoint="/", fstype!="tmpfs"}) * 100
              intervalMs: 1000
              maxDataPoints: 43200
          - refId: B
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: __expr__
            model:
              expression: A
              reducer: last
              settings:
                mode: dropNN
              type: reduce
          - refId: C
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: __expr__
            model:
              conditions:
                - evaluator:
                    params: [90]
                    type: gt
                  operator:
                    type: and
                  query:
                    params: [B]
                  reducer:
                    type: last
              expression: B
              type: threshold
        noDataState: NoData
        execErrState: Error
        for: 5m
        annotations:
          summary: "Disk space critical - boot drive"
          description: "A boot drive is above 90% capacity - immediate attention required"
        labels:
          severity: critical

      # Disk Space Warning - Log Storage
      - uid: disk_warning_logs
        title: Disk Space Warning - Log Storage
        condition: C
        data:
          - refId: A
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: YOUR_PROMETHEUS_UID
            model:
              expr: |
                (1 - node_filesystem_avail_bytes{instance="nexus", mountpoint="/media/disk1"}
                / node_filesystem_size_bytes{instance="nexus", mountpoint="/media/disk1"}) * 100
              intervalMs: 1000
              maxDataPoints: 43200
          - refId: B
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: __expr__
            model:
              expression: A
              reducer: last
              settings:
                mode: dropNN
              type: reduce
          - refId: C
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: __expr__
            model:
              conditions:
                - evaluator:
                    params: [70]
                    type: gt
                  operator:
                    type: and
                  query:
                    params: [B]
                  reducer:
                    type: last
              expression: B
              type: threshold
        noDataState: NoData
        execErrState: Error
        for: 5m
        annotations:
          summary: "Log storage space warning"
          description: "Log storage on Nexus is above 70% - consider adjusting retention settings"
        labels:
          severity: warning

      # Disk Space Critical - Log Storage
      - uid: disk_critical_logs
        title: Disk Space Critical - Log Storage
        condition: C
        data:
          - refId: A
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: YOUR_PROMETHEUS_UID
            model:
              expr: |
                (1 - node_filesystem_avail_bytes{instance="nexus", mountpoint="/media/disk1"}
                / node_filesystem_size_bytes{instance="nexus", mountpoint="/media/disk1"}) * 100
              intervalMs: 1000
              maxDataPoints: 43200
          - refId: B
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: __expr__
            model:
              expression: A
              reducer: last
              settings:
                mode: dropNN
              type: reduce
          - refId: C
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: __expr__
            model:
              conditions:
                - evaluator:
                    params: [85]
                    type: gt
                  operator:
                    type: and
                  query:
                    params: [B]
                  reducer:
                    type: last
              expression: B
              type: threshold
        noDataState: NoData
        execErrState: Error
        for: 5m
        annotations:
          summary: "Log storage space critical"
          description: "Log storage on Nexus is above 85% - reduce retention or expand storage"
        labels:
          severity: critical

      # Disk Space Warning - Media Pool
      - uid: disk_warning_mediapool
        title: Disk Space Warning - Media Pool
        condition: C
        data:
          - refId: A
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: YOUR_PROMETHEUS_UID
            model:
              expr: |
                (1 - node_filesystem_avail_bytes{instance="vault", mountpoint="/MediaPool"}
                / node_filesystem_size_bytes{instance="vault", mountpoint="/MediaPool"}) * 100
              intervalMs: 1000
              maxDataPoints: 43200
          - refId: B
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: __expr__
            model:
              expression: A
              reducer: last
              settings:
                mode: dropNN
              type: reduce
          - refId: C
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: __expr__
            model:
              conditions:
                - evaluator:
                    params: [80]
                    type: gt
                  operator:
                    type: and
                  query:
                    params: [B]
                  reducer:
                    type: last
              expression: B
              type: threshold
        noDataState: NoData
        execErrState: Error
        for: 5m
        annotations:
          summary: "Media pool space warning"
          description: "MediaPool on Vault is above 80% - time to think about expansion"
        labels:
          severity: warning

      # Disk Space Critical - Media Pool
      - uid: disk_critical_mediapool
        title: Disk Space Critical - Media Pool
        condition: C
        data:
          - refId: A
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: YOUR_PROMETHEUS_UID
            model:
              expr: |
                (1 - node_filesystem_avail_bytes{instance="vault", mountpoint="/MediaPool"}
                / node_filesystem_size_bytes{instance="vault", mountpoint="/MediaPool"}) * 100
              intervalMs: 1000
              maxDataPoints: 43200
          - refId: B
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: __expr__
            model:
              expression: A
              reducer: last
              settings:
                mode: dropNN
              type: reduce
          - refId: C
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: __expr__
            model:
              conditions:
                - evaluator:
                    params: [90]
                    type: gt
                  operator:
                    type: and
                  query:
                    params: [B]
                  reducer:
                    type: last
              expression: B
              type: threshold
        noDataState: NoData
        execErrState: Error
        for: 5m
        annotations:
          summary: "Media pool space critical"
          description: "MediaPool on Vault is above 90% - immediate attention required"
        labels:
          severity: critical

      # ZFS Pool Health
      - uid: zfs_pool_health
        title: ZFS Pool Degraded
        condition: C
        data:
          - refId: A
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: YOUR_PROMETHEUS_UID
            model:
              # Assumes node_exporter's ZFS collector, which exposes one series per
              # state: 1 for the pool's current state, 0 for every other state.
              expr: node_zfs_zpool_state{instance="vault", state="online"}
              intervalMs: 1000
              maxDataPoints: 43200
          - refId: B
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: __expr__
            model:
              expression: A
              reducer: last
              settings:
                mode: dropNN
              type: reduce
          - refId: C
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: __expr__
            model:
              conditions:
                - evaluator:
                    params: [1]
                    type: lt
                  operator:
                    type: and
                  query:
                    params: [B]
                  reducer:
                    type: last
              expression: B
              type: threshold
        noDataState: Alerting
        execErrState: Alerting
        for: 1m
        annotations:
          summary: "ZFS pool degraded"
          description: "MediaPool on Vault is no longer in ONLINE state - immediate attention required"
        labels:
          severity: critical

      # High Memory Usage
      - uid: high_memory
        title: High Memory Usage
        condition: C
        data:
          - refId: A
            relativeTimeRange:
              from: 600
              to: 0
            datasourceUid: YOUR_PROMETHEUS_UID
            model:
              expr: |
                (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
              intervalMs: 1000
              maxDataPoints: 43200
          - refId: B
            relativeTimeRange:
              from: 600
              to: 0
            datasourceUid: __expr__
            model:
              expression: A
              reducer: last
              settings:
                mode: dropNN
              type: reduce
          - refId: C
            relativeTimeRange:
              from: 600
              to: 0
            datasourceUid: __expr__
            model:
              conditions:
                - evaluator:
                    params: [90]
                    type: gt
                  operator:
                    type: and
                  query:
                    params: [B]
                  reducer:
                    type: last
              expression: B
              type: threshold
        noDataState: NoData
        execErrState: Error
        for: 10m
        annotations:
          summary: "High memory usage"
          description: "A host has been above 90% memory usage for more than 10 minutes"
        labels:
          severity: warning

      # High CPU Usage
      - uid: high_cpu
        title: High CPU Usage
        condition: C
        data:
          - refId: A
            relativeTimeRange:
              from: 900
              to: 0
            datasourceUid: YOUR_PROMETHEUS_UID
            model:
              expr: |
                100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
              intervalMs: 1000
              maxDataPoints: 43200
          - refId: B
            relativeTimeRange:
              from: 900
              to: 0
            datasourceUid: __expr__
            model:
              expression: A
              reducer: last
              settings:
                mode: dropNN
              type: reduce
          - refId: C
            relativeTimeRange:
              from: 900
              to: 0
            datasourceUid: __expr__
            model:
              conditions:
                - evaluator:
                    params: [85]
                    type: gt
                  operator:
                    type: and
                  query:
                    params: [B]
                  reducer:
                    type: last
              expression: B
              type: threshold
        noDataState: NoData
        execErrState: Error
        for: 15m
        annotations:
          summary: "High CPU usage"
          description: "A host has been above 85% CPU for more than 15 minutes"
        labels:
          severity: warning

      # Pi-hole Down
      - uid: pihole_down
        title: Pi-hole Down
        condition: C
        data:
          - refId: A
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: YOUR_PROMETHEUS_UID
            model:
              expr: up{job="pihole"}
              intervalMs: 1000
              maxDataPoints: 43200
          - refId: B
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: __expr__
            model:
              expression: A
              reducer: last
              settings:
                mode: dropNN
              type: reduce
          - refId: C
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: __expr__
            model:
              conditions:
                - evaluator:
                    params: [1]
                    type: lt
                  operator:
                    type: and
                  query:
                    params: [B]
                  reducer:
                    type: last
              expression: B
              type: threshold
        noDataState: Alerting
        execErrState: Alerting
        for: 2m
        annotations:
          summary: "Pi-hole down"
          description: "A Pi-hole instance has stopped reporting - DNS resolution may be affected"
        labels:
          severity: critical
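
Every rule in that file has the same three-step shape: query (A), reduce to the last value (B), compare against a threshold (C). If the repetition bothers you, a small generator can build the structure for you. This is a sketch of one way to do it, not something Grafana provides; threshold_rule and its parameters are hypothetical names. It prints JSON, which a YAML parser should generally accept as-is since JSON-style mappings and lists are valid YAML.

```python
import json


def threshold_rule(uid, title, expr, threshold, comparison, pending, severity,
                   summary, description, window=300,
                   datasource_uid="YOUR_PROMETHEUS_UID",
                   no_data="NoData", exec_err="Error"):
    """Build one alert rule: query (A) -> reduce to last value (B) -> threshold (C)."""
    time_range = {"from": window, "to": 0}
    return {
        "uid": uid,
        "title": title,
        "condition": "C",
        "data": [
            {"refId": "A", "relativeTimeRange": time_range,
             "datasourceUid": datasource_uid,
             "model": {"expr": expr, "intervalMs": 1000, "maxDataPoints": 43200}},
            {"refId": "B", "relativeTimeRange": time_range,
             "datasourceUid": "__expr__",
             "model": {"expression": "A", "reducer": "last",
                       "settings": {"mode": "dropNN"}, "type": "reduce"}},
            {"refId": "C", "relativeTimeRange": time_range,
             "datasourceUid": "__expr__",
             "model": {"conditions": [{
                 "evaluator": {"params": [threshold], "type": comparison},
                 "operator": {"type": "and"},
                 "query": {"params": ["B"]},
                 "reducer": {"type": "last"}}],
                 "expression": "B", "type": "threshold"}},
        ],
        "noDataState": no_data,
        "execErrState": exec_err,
        "for": pending,
        "annotations": {"summary": summary, "description": description},
        "labels": {"severity": severity},
    }


boot_usage = ('(1 - node_filesystem_avail_bytes{mountpoint="/", fstype!="tmpfs"} '
              '/ node_filesystem_size_bytes{mountpoint="/", fstype!="tmpfs"}) * 100')
rule = threshold_rule(
    uid="disk_warning_boot", title="Disk Space Warning - Boot Drive",
    expr=boot_usage, threshold=75, comparison="gt", pending="5m",
    severity="warning", summary="Disk space warning - boot drive",
    description="A boot drive is above 75% capacity - check node_exporter metrics",
)
print(json.dumps(rule, indent=2))
```

The warning and critical variants of each disk rule then differ only in uid, title, threshold, severity, and annotations.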

Loki Alert Rules

Create config/grafana/provisioning/alerting/loki-rules.yaml. Get your Loki datasource UID the same way as Prometheus above and replace YOUR_LOKI_UID throughout.

apiVersion: 1

groups:
  - orgId: 1
    name: loki-alerts
    folder: Logs
    interval: 5m
    rules:

      # Error Rate Spike
      - uid: error_rate_spike
        title: Error Rate Spike
        condition: C
        data:
          - refId: A
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: YOUR_LOKI_UID
            model:
              expr: |
                sum by (container_name, host) (
                  count_over_time({job="docker", container_name!="grafana"}
                  |~ "(?i)(error|exception|fatal|panic)"
                  != "401"
                  != "Unauthorized"
                  != "token needs to be rotated"
                  != "Ignoring invalid configuration option"
                  != "Error parsing filter"
                  [5m])
                )
              intervalMs: 1000
              maxDataPoints: 43200
          - refId: B
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: __expr__
            model:
              expression: A
              reducer: last
              settings:
                mode: dropNN
              type: reduce
          - refId: C
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: __expr__
            model:
              conditions:
                - evaluator:
                    params: [25]
                    type: gt
                  operator:
                    type: and
                  query:
                    params: [B]
                  reducer:
                    type: last
              expression: B
              type: threshold
        noDataState: OK  # no matching log lines means nothing is wrong, so don't raise a NoData alert
        execErrState: Error
        for: 5m
        annotations:
          summary: "Error rate spike detected"
          description: "A container has logged more than 25 errors in the last 5 minutes - check Grafana Explore with query: {job=\"docker\"} |~ \"(?i)(error|exception|fatal|panic)\""
        labels:
          severity: warning

      # Plex Transcoding Error
      - uid: plex_transcode_error
        title: Plex Transcoding Error
        isPaused: false
        condition: C
        data:
          - refId: A
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: YOUR_LOKI_UID
            model:
              expr: |
                count_over_time({job="plex"}
                |~ "(?i)(transcoder exited with error|transcode.*failed|failed.*transcode|error starting transcode|transcoder crashed)" [5m])
              intervalMs: 1000
              maxDataPoints: 43200
          - refId: B
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: __expr__
            model:
              expression: A
              reducer: last
              settings:
                mode: dropNN
              type: reduce
          - refId: C
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: __expr__
            model:
              conditions:
                - evaluator:
                    params: [1]
                    type: gt
                  operator:
                    type: and
                  query:
                    params: [B]
                  reducer:
                    type: last
              expression: B
              type: threshold
        noDataState: OK  # no matching log lines means nothing is wrong, so don't raise a NoData alert
        execErrState: Error
        for: 0m
        annotations:
          summary: "Plex transcoding error detected"
          description: "Plex has logged transcoding failures in the last 5 minutes - check Grafana Explore with query: {job=\"plex\"} |~ \"(?i)(transcoder exited with error|transcode.*failed)\""
        labels:
          severity: warning

Applying the Config

Make sure the provisioning directory is mounted in Grafana's compose:

  grafana:
    volumes:
      - /media/disk1/logStorage/grafana:/var/lib/grafana
      - ./config/grafana/provisioning:/etc/grafana/provisioning:ro

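Make the provisioning directory readable by the Grafana container. The mount is read-only and Grafana only needs to read these files, so there's no need for a world-writable 777:

sudo chmod -R a+rX /home/youruser/docker-projects/logStack/config/grafana/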

Restart Grafana:

docker compose restart grafana

Check the logs to confirm provisioning succeeded:

docker compose logs grafana | grep -i "provision\|alert\|error" | tail -20

You're looking for this line:

logger=provisioning.alerting msg="finished to provision alerting"

There should be no errors logged between the start of alert provisioning and that finish message.
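
If you restart Grafana often, that check is easy to script. A sketch, with two assumptions on my part: that the start message reads "starting to provision alerting", and that error lines carry level=error. Adjust both strings if your Grafana version logs differently.

```python
START = "starting to provision alerting"   # assumed wording of the start message
FINISH = "finished to provision alerting"  # wording shown in the log line above


def provisioning_succeeded(log_text: str) -> bool:
    """True if the latest provisioning run finished with no error line before the finish."""
    lines = log_text.splitlines()
    # Only inspect the most recent run; older restarts may still be in the log.
    starts = [i for i, line in enumerate(lines) if START in line]
    if not starts:
        return False
    for line in lines[starts[-1] + 1:]:
        if "level=error" in line:
            return False
        if FINISH in line:
            return True
    return False
```

Feed it the text from docker compose logs grafana and it returns True or False for the most recent provisioning run.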


Verifying in the UI

Go to Alerting → Alert Rules. You should see two folders — Infrastructure and Logs — with rules inside them. All rules should show Normal state with ok health after a few evaluation cycles.
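
The "few evaluation cycles" part comes from the interval and for settings in each rule. As I understand it, a rule that breaches its threshold first goes Pending, and only becomes Alerting once the breach has lasted for the full for duration. Here's a simplified Python model of that behaviour (the function name is hypothetical and it ignores NoData and Error states):

```python
def rule_states(breaches, interval_s, for_s):
    """Walk a list of evaluation results and return the rule state after each one."""
    states = []
    pending_since = None  # time of the first breach in the current streak
    for i, breached in enumerate(breaches):
        now = i * interval_s
        if not breached:
            pending_since = None
            states.append("Normal")
            continue
        if pending_since is None:
            pending_since = now
        states.append("Alerting" if now - pending_since >= for_s else "Pending")
    return states


# Host Down style rule: evaluated every 1m, for: 2m.
print(rule_states([False, True, True, True, False], interval_s=60, for_s=120))
# ['Normal', 'Pending', 'Pending', 'Alerting', 'Normal']
```

With for: 0m, as in the Plex rule, the first breach goes straight to Alerting.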

If any show Error health, click View to see the error message. The most common cause is a datasource UID mismatch, so double-check the UIDs in your YAML files against what Grafana has configured.
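
A quick way to catch a leftover placeholder or a mistyped UID is to scan the rules file for every datasourceUid and compare against the UIDs Grafana reports. A sketch using only a regex, so it assumes each datasourceUid sits on its own line as in the files above; the helper name is made up.

```python
import re

# Matches lines such as "datasourceUid: abc123" and captures the value.
UID_PATTERN = re.compile(r"^[ \t]*datasourceUid:[ \t]*(\S+)[ \t]*$", re.MULTILINE)


def unknown_uids(rules_yaml: str, known_uids: set) -> set:
    """Return UIDs referenced in a rules file that are not in known_uids."""
    referenced = set(UID_PATTERN.findall(rules_yaml))
    referenced.discard("__expr__")  # expression steps, not a real datasource
    return referenced - known_uids


sample = """
            datasourceUid: YOUR_PROMETHEUS_UID
            datasourceUid: __expr__
            datasourceUid: abc123
"""
print(unknown_uids(sample, {"abc123"}))
```

An empty set means every query in the file points at a datasource Grafana actually knows about.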


Tuning Your Alerts

Out of the box the error rate spike alert will probably fire immediately. Here's what triggered false positives in my setup and how I handled each one:

Grafana's own 401 errors — if Grafana is behind a Cloudflare tunnel, session token rotation generates a steady stream of 401 errors. These match the error filter. Exclude Grafana entirely with container_name!="grafana" and filter out 401 and Unauthorized strings.

nginx 404 errors — any nginx container with internet exposure logs [error] for every 404. Bots constantly probe for robots.txt, favicon.ico, /.git/index, and other files that don't exist. Harmless but noisy. Worth noting: if you have port forwarding rules on your router pointing at these containers, turn them off. Route everything through Cloudflare tunnels instead. You'll be amazed how much noise disappears.

Application-specific noise — Ghost logs a MySQL2 warning as error level on every database connection. Completely harmless, extremely chatty. Add it to the exclusions.

The general exclusion pattern in LogQL:

{job="docker", container_name!="grafana"}
|~ "(?i)(error|exception|fatal|panic)"
!= "401"
!= "Unauthorized"
!= "your noisy string here"

Raising the threshold is also valid. A threshold of 25 errors per 5 minutes means a container has to be genuinely misbehaving before you hear about it.
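
If you want to try out exclusions without waiting for the alert to fire, the filter logic is simple enough to imitate offline. This is a rough Python stand-in for the LogQL pipeline, not a LogQL implementation: it counts per container only (the real query also groups by host), and the function names are hypothetical.

```python
import re
from collections import Counter

ERROR_PATTERN = re.compile(r"(error|exception|fatal|panic)", re.IGNORECASE)
EXCLUSIONS = [
    "401",
    "Unauthorized",
    "token needs to be rotated",
    "Ignoring invalid configuration option",
    "Error parsing filter",
]


def counts_as_error(line: str) -> bool:
    """Apply the |~ regex match, then every != exclusion, in the spirit of the query."""
    if not ERROR_PATTERN.search(line):
        return False
    return not any(text in line for text in EXCLUSIONS)


def noisy_containers(entries, threshold=25):
    """entries: (container_name, log_line) pairs from one 5-minute window."""
    counts = Counter(
        name for name, line in entries
        if name != "grafana" and counts_as_error(line)
    )
    return {name: n for name, n in counts.items() if n > threshold}
```

Paste in a sample of real log lines, add your candidate exclusion string to EXCLUSIONS, and see whether the container still crosses the threshold.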


Testing Delivery

Grafana has a built-in test feature:

  1. Go to Alerting → Contact Points
  2. Click the test button next to any receiver
  3. Click Send test notification

Check Discord and your inbox. If the Discord message shows raw Go template syntax instead of rendered content, you probably have a duplicate contact point: one created manually in the UI and one from provisioning. List the contact points, find the duplicate's UID, and delete it via the API:

curl -s -u admin:your_password http://localhost:3001/api/v1/provisioning/contact-points | python3 -m json.tool | grep -E '"uid"|"name"'

curl -X DELETE -u admin:your_password http://localhost:3001/api/v1/provisioning/contact-points/THE_DUPLICATE_UID
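
Spotting the duplicate by eye gets tedious with more than a handful of receivers. A sketch that groups the output of the first curl command by name and type and reports anything that appears more than once; the helper name is hypothetical, and the sample UID "manual_xyz" is made up.

```python
import json
from collections import defaultdict


def duplicate_contact_points(api_response: str) -> dict:
    """Group receivers by (name, type) and keep the groups with more than one UID."""
    groups = defaultdict(list)
    for receiver in json.loads(api_response):
        groups[(receiver["name"], receiver["type"])].append(receiver["uid"])
    return {key: uids for key, uids in groups.items() if len(uids) > 1}


sample = json.dumps([
    {"uid": "discord_warning", "name": "Discord", "type": "discord"},
    {"uid": "manual_xyz", "name": "Discord", "type": "discord"},
    {"uid": "email_critical", "name": "Discord and Email", "type": "email"},
])
print(duplicate_contact_points(sample))
```

The UIDs that match your contact-points.yaml are the provisioned ones to keep; anything else in the same group is the candidate for deletion.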

Where We Are

  • ✅ Discord notifications for all alert severities
  • ✅ Email notifications for critical alerts
  • ✅ Host down detection with a 2-minute grace period
  • ✅ Disk space warnings and critical alerts for all drives
  • ✅ ZFS pool health monitoring
  • ✅ High CPU and memory alerts with sensible thresholds
  • ✅ Pi-hole availability monitoring
  • ✅ Error rate spike detection with noise filtering
  • ✅ Sane repeat intervals that won't wake your household during a power outage

The Series

  1. Introduction & Architecture –– Stop Flying Blind, Series Introduction
  2. Setting Up the Core Stack — Loki, Grafana, and Fluent Bit on your main host
  3. Shipping Logs from Multiple Hosts — expanding Fluent Bit across your network
  4. Metrics with Prometheus — node_exporter, Pi-hole metrics, and Proxmox monitoring
  5. Alerting — getting notified when things actually break
  6. Lessons Learned — everything that went wrong and how we fixed it

In the final article we pull back and talk about everything that went wrong during this build — the version tag that didn't exist, the AppArmor permissions saga, the GELF logs firing into the void, and the Proxmox firewall that kept eating our iptables rules.

One more coffee. You're almost there.

Chris R. Miller

Austin, TX
I like computers.