Lessons Learned: Everything That Went Wrong (So You Don't Have To)

[Part 6] A love letter to every error message that made us stronger

We made it. Six articles, one complete observability stack, and more error messages than I care to count. If you've followed along from Article 1, you now have centralized logging, metrics dashboards, and intelligent alerting running across your entire homelab.

This article is different from the others. No step-by-step instructions, no config files. Just an honest accounting of everything that went sideways during this build and what to do about it. Consider it the appendix that every technical tutorial should have but most don't bother writing.


Lesson 1: :latest Is A Lie

This one cost the most time for the least interesting reason.

I specified fluent/fluent-bit:3.3.3 in my compose file. Pinned, specific, exactly as you're supposed to do. The problem: that version didn't exist. Docker pulled :latest silently, gave no error, and I spent an afternoon debugging behavior that made no sense — because I was running a completely different version than I thought.

Before you use any version number you find in a blog post — including this one — verify it actually exists:

docker pull fluent/fluent-bit:3.3.3
# Error response from daemon: manifest unknown

That single command would have saved hours. Run it for every image version you put in a compose file. Every single one.
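If you have more than a handful of images, that check is easy to script. Here is a minimal sketch, not something from the original build: it pulls every image reference out of a compose file, flags anything unpinned, and can run the same docker pull shown above for each one. The function names are hypothetical, and the regex assumes simple one-line image: entries.

```python
import re
import subprocess

# Matches lines such as:  image: fluent/fluent-bit:3.3.3
IMAGE_LINE = re.compile(r"""^\s*image:\s*["']?([^\s"'#]+)""")


def extract_images(compose_text):
    """Return every image reference found on an `image:` line."""
    images = []
    for line in compose_text.splitlines():
        match = IMAGE_LINE.match(line)
        if match:
            images.append(match.group(1))
    return images


def is_pinned(image):
    """True when the reference has an explicit tag other than 'latest'."""
    name = image.rsplit("/", 1)[-1]  # ignore a registry host and its port
    if "@" in name:                  # digest references count as pinned
        return True
    if ":" not in name:
        return False
    return name.rsplit(":", 1)[1] != "latest"


def tag_exists(image):
    """Run `docker pull` and report success; needs the docker CLI."""
    result = subprocess.run(["docker", "pull", image], capture_output=True)
    return result.returncode == 0
```

Feed it the contents of docker-compose.yml, then call tag_exists on each reference before the first docker compose up.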


Lesson 2: AppArmor Does Not Care About Your chmod

Ubuntu 24 and 25 ship with AppArmor enabled. AppArmor is a mandatory access control system that enforces security policies at the kernel level — and it can override filesystem permissions entirely for bind-mounted directories in Docker containers.

The symptom is maddening. You chown the directory to the correct UID. Nothing. You chmod 777. Still nothing.

grafana | mkdir: can't create directory '/var/lib/grafana/plugins': Permission denied

The fix — running the container as root with user: "0" — feels wrong. For a homelab logging stack on a dedicated internal drive, it's fine. The alternative is a weekend spent configuring AppArmor profiles. Pick your battles.


Lesson 3: The Deprecated Config Graveyard

Loki moves fast. Between versions 2.x and 3.x a meaningful chunk of configuration syntax changed. If you find a Loki config online that uses any of the following, it was written for an older version:

fifocache — replaced by embedded_cache:

# Old (broken in Loki 3.x)
results_cache:
  cache:
    enable_fifocache: true
    fifocache:
      max_size_bytes: 256MB

# New (correct for Loki 3.x)
results_cache:
  cache:
    embedded_cache:
      enabled: true
      max_size_mb: 256

version: in docker-compose.yml — deprecated since Docker Compose V2. Doesn't break anything but generates a warning on every command. Remove it.

Duplicate config fields — if you get a Loki startup error about a field being "already set," you've accidentally duplicated a key. YAML doesn't always throw an obvious error for this — sometimes it just silently applies one value and ignores the other.
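A duplicated key is easy to miss by eye in a long config. The following is a rough sketch of a standard-library-only checker, under the assumption that the file uses plain block-style YAML (nested mappings and lists). It will not cope with block scalars, flow style, or anchors, and find_duplicate_keys is a hypothetical helper, not part of Loki or any YAML library.

```python
import re

# A key is everything before the first colon that is followed by
# whitespace or end of line, so "3001:3000" and URLs are not keys.
KEY = re.compile(r"([^\s:#][^:#]*?)\s*:(\s|$)")


def find_duplicate_keys(yaml_text):
    """Return (line, key, first_line) for each key repeated in one mapping."""
    stack = []       # one (indent, {key: first_line}) entry per open mapping
    duplicates = []
    for lineno, raw in enumerate(yaml_text.splitlines(), start=1):
        stripped = raw.strip()
        if not stripped or stripped.startswith("#"):
            continue
        indent = len(raw) - len(raw.lstrip())
        if stripped == "-" or stripped.startswith("- "):
            # A new list item starts a fresh mapping: drop deeper scopes.
            while stack and stack[-1][0] > indent:
                stack.pop()
            rest = stripped[1:]
            indent += 1 + (len(rest) - len(rest.lstrip()))
            stripped = rest.lstrip()
        match = KEY.match(stripped)
        if not match:
            continue
        key = match.group(1)
        # Close mappings that are indented deeper than this key.
        while stack and stack[-1][0] > indent:
            stack.pop()
        if not stack or stack[-1][0] < indent:
            stack.append((indent, {}))
        seen = stack[-1][1]
        if key in seen:
            duplicates.append((lineno, key, seen[key]))
        else:
            seen[key] = lineno
    return duplicates
```

Each result names the repeated key, the line where it repeats, and the line where it first appeared, which is usually enough to spot the copy-paste accident.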


Lesson 4: The Port That Was Already Taken

Grafana defaults to port 3000. So does Gitea. And Homepage. And the AdGuard Home setup wizard. And about forty other things homelab people tend to run.

Check what's on a port before mapping to it:

sudo ss -tlnp | grep 3000

The fix is one line:

ports:
  - "3001:3000"

The container still thinks it's on 3000. The outside world uses 3001. Add this to your pre-flight checklist for any new container.
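If you want that pre-flight check in a script rather than in your memory, here is a small sketch. It assumes an IPv4 TCP port and simply tries to bind to it; port_is_free is a hypothetical helper, and ports below 1024 will read as "taken" unless you run it as root.

```python
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if this process can bind the TCP port on the host address."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            # Already bound by something else, or not permitted.
            return False
    return True
```

Calling port_is_free(3000) before writing the ports: mapping tells you whether you need the 3001 trick at all.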


Lesson 5: Container Names Are Not What You Think

Docker stores container logs at paths like this:

/var/lib/docker/containers/7f314efc44b80a5fe.../7f314efc44b8...-json.log

That 64-character hex string is the container ID. Without intervention, your Grafana label browser looks like a blockchain transaction history.

The fix is the Lua script from Article 2 that reads each container's config.v2.json to extract the human-readable name. Getting that script right took longer than it should have because:

  1. The Fluent Bit distroless image has no shell — you can't exec into it to debug
  2. The docker_metadata filter I tried to use doesn't exist in Fluent Bit — I apparently invented it
  3. The regex to extract the container ID from the tag needed to account for .log at the end of the filename, which the first three attempts missed

The final working pattern: split the tag on dots, find the segment ending in -json, and strip that suffix to get the hex ID. Simple in retrospect.
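The real script is the Lua from Article 2. The same logic in Python looks roughly like this. It is an illustration only, and it assumes the tag has the shape that a wildcard tail tag typically expands to, with the file path's slashes turned into dots. container_id_from_tag is a hypothetical name.

```python
import re

HEX_ID = re.compile(r"^[0-9a-f]{64}$")


def container_id_from_tag(tag):
    """Return the 64-character container ID embedded in a tag, or None."""
    for segment in tag.split("."):
        # The file is named <id>-json.log, so after splitting on dots
        # the ID lives in the segment that ends with "-json".
        if segment.endswith("-json"):
            candidate = segment[: -len("-json")]
            if HEX_ID.match(candidate):
                return candidate
    return None
```

Because the trailing .log becomes its own segment after the split, it never needs special handling, which is exactly what the first three attempts got wrong.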


Lesson 6: The GELF Logs That Fired Into The Void

This one is my favorite because it's a perfect archaeological discovery of my own past mistakes.

When setting up log collection on Vault, several containers weren't showing up in Grafana. No log files at /var/lib/docker/containers/. Empty LogPath in docker inspect. Fluent Bit had nothing to tail.

The culprit was a logging block buried in old compose files:

    logging:
      driver: gelf
      options:
        gelf-address: "udp://192.168.1.50:12204"

GELF is a log shipping protocol used by Graylog. At some point during a previous failed attempt at centralized logging, I'd configured these containers to ship logs to a Graylog instance that had since been decommissioned. The containers had been faithfully firing logs into the void for — I genuinely don't know how long. Weeks. Possibly months.

When a container uses an alternative log driver, Docker doesn't write to the standard JSON log file at all. The fix is removing the logging block and recreating the containers.

If containers aren't showing up in Loki, check for alternative log drivers first:

docker inspect container_name | grep -A3 "LogConfig"

You want "Type": "json-file". Anything else is your answer.
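To check every container at once, you can parse the JSON that docker inspect prints instead of grepping it. This is a minimal sketch that assumes the usual inspect layout, where the driver sits under HostConfig.LogConfig.Type. non_json_file_containers is a hypothetical helper name.

```python
import json


def non_json_file_containers(inspect_output):
    """Map container name to log driver for anything not using json-file."""
    offenders = {}
    for container in json.loads(inspect_output):
        driver = container["HostConfig"]["LogConfig"]["Type"]
        if driver != "json-file":
            # Docker reports names with a leading slash.
            offenders[container["Name"].lstrip("/")] = driver
    return offenders
```

Pass it the output of docker inspect run against all your container IDs. An empty result means every container is writing the log files Fluent Bit expects to tail.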


Lesson 7: Spaces In Paths Are A Silent Killer

The Plex log directory path contains spaces:

/config/Library/Application Support/Plex Media Server/Logs/

Docker volume mounts with spaces in the path can fail silently. The mount just doesn't happen. No error. The container starts fine, Fluent Bit runs fine, and zero Plex logs appear in Loki. You check the config. You check the container. You check your life choices.

The fix is a symlink:

ln -s "/path/to/Application Support/Plex Media Server/Logs" /home/youruser/plex-logs

Mount the symlink instead. Spaces gone, logs flowing.


Lesson 8: The Proxmox Firewall That Kept Eating Our Rules

We installed node_exporter on the Proxmox nodes. Added iptables rules to allow port 9100. Verified Prometheus could scrape them. Everything worked.

Then the power went out.

When the nodes came back online, Prometheus couldn't reach them. The iptables rules were gone. We added them again. Everything worked. Then the power went out again.

Proxmox manages its own firewall, and when it starts on boot it rebuilds iptables from its own configuration — overwriting any rules added manually. Even with netfilter-persistent installed, Proxmox wins this fight every time.

The correct fix is adding the rule through Proxmox's own interface at Datacenter → Firewall. A datacenter-level rule lives in the cluster config and survives every reboot and power outage. One rule covers all nodes.

We learned this lesson twice. You should only need to learn it once.


Lesson 9: Your Own IP Is Not An Attacker

During alert tuning, Grafana was generating 90 "errors" per 5 minutes and triggering the error rate spike alert constantly. The errors were all 401 Unauthorized responses hammering the /api/live/ws endpoint from a single IP.

I was ready to block that IP in Cloudflare.

It was my own browser. An expired Grafana session was repeatedly trying to reconnect to the WebSocket endpoint. The "token needs to be rotated" message in the logs is completely normal session management behavior.

Two takeaways:

  1. Always identify an IP before blocking it. A quick lookup at ipinfo.io/your-ip would have saved some embarrassment.
  2. While you're at it — audit your router's port forwarding rules. During this build I discovered a port 80 forwarding rule that had been open for months, exposing a container directly to the internet and bypassing Cloudflare entirely. Route everything through Cloudflare tunnels. The reduction in noise is immediate and significant.
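The first takeaway can be partly automated. As a sketch, the check below uses Python's ipaddress module to ask whether an address is private, loopback, or inside a network you have listed as your own, such as your WAN address. is_probably_yours and own_networks are hypothetical names, and the check is a sanity test before blocking, not proof of who owns an address.

```python
import ipaddress


def is_probably_yours(ip, own_networks=()):
    """True if the address is private, loopback, or in a listed network."""
    addr = ipaddress.ip_address(ip)
    if addr.is_private or addr.is_loopback:
        return True
    # own_networks holds CIDR strings you control, e.g. your public IP as a /32.
    return any(addr in ipaddress.ip_network(net) for net in own_networks)
```

If this returns True for the "attacker," close the firewall tab and go check your browser sessions first.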


Lesson 10: Hardcode Your Host Labels

Fluent Bit's ${HOSTNAME} resolves to the container's internal hostname — a random hex string Docker assigns at creation. Useless as a label.

Always hardcode:

[FILTER]
    Name          record_modifier
    Match         *
    Record        host nexus

[OUTPUT]
    Labels        job=docker, host=nexus

If you let ${HOSTNAME} resolve, you get a different container ID every time the container is recreated. Your Loki label browser fills up with orphaned hex strings and cleaning them up requires the Loki delete API and time you didn't plan to spend.


What We'd Do Differently

Looking back at the full build, a few things are worth changing:

Start with provisioning files. We set up Grafana contact points manually first, then switched to provisioning files later. This created duplicate contact points that caused the raw template Discord message issue. Start with provisioning files from day one and never touch the UI for alerting configuration.

Document your label strategy upfront. We ended up with inconsistent label names across different log sources before standardizing on job, host, and container_name. Decide on your label taxonomy before you ingest a single log line.

Set reject_old_samples_max_age from the start. When we tried to backfill historical logs, Loki rejected everything older than the default window. A generous reject_old_samples_max_age set from the beginning would have let the historical data ingest cleanly.
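The arithmetic behind that rejection is simple, and a tiny sketch makes it concrete. It assumes Loki compares each entry's timestamp with the current time and rejects anything older than the configured window. My understanding is that the default window is one week, but check the reference for your version. would_be_rejected is a hypothetical helper, not a Loki API.

```python
from datetime import datetime, timedelta, timezone


def would_be_rejected(entry_time, max_age, now=None):
    """True if a log entry is older than the allowed ingestion window."""
    now = now or datetime.now(timezone.utc)
    return now - entry_time > max_age
```

A 30-day-old entry fails a 7-day window and passes a 90-day one, so size the window to the oldest data you expect to backfill.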

Use Proxmox's own UI for anything Proxmox-related. Manual iptables rules on Proxmox nodes are a trap. If you want something to persist on a Proxmox host, configure it through Proxmox's management interface.


What's Next

A few directions worth exploring once the basics are running:

Watchtower logging — Watchtower's update logs are genuinely useful. When something breaks after an automatic container update, knowing exactly which container was updated and when is the first step to diagnosing the problem. It flows through Docker's standard log driver and appears in Loki automatically.

Backup job monitoring — if you're running scheduled backups, adding success/failure logging to your stack is straightforward. A backup that silently fails for three months is just a data loss event with extra steps.

Synthetic monitoring — Prometheus's blackbox_exporter lets you monitor HTTP endpoints and alert when your self-hosted services return errors or go unreachable. A natural extension of everything we've built here.


The Stack, Complete

Component          Role                       Where It Runs
Loki               Log storage                Nexus (Docker)
Grafana            Visualization & alerting   Nexus (Docker)
Fluent Bit         Log collection             All hosts (Docker/agent)
Prometheus         Metrics storage            Nexus (Docker)
Node Exporter      Host metrics               All hosts (bare metal)
pihole6-exporter   Pi-hole metrics            Nexus (Docker)
pve-exporter       Proxmox metrics            Nexus (Docker)

Total licensing cost: $0. Total coffee consumed during setup: significant. Total sleep lost to 2am alerts about things that actually needed fixing: surprisingly little, once the thresholds were tuned correctly.

If you run into issues, this article is the first place to look. If your problem isn't covered here, it probably happened to me too; I just haven't written it up yet.

Good luck out there. May your pools stay ONLINE, your disks stay uncrowded, and your alerts stay actionable.

Now go make a coffee. You've earned it.


The Series

  1. Introduction & Architecture: Stop Flying Blind, Series Introduction
  2. Setting Up the Core Stack — Loki, Grafana, and Fluent Bit on your main host
  3. Shipping Logs from Multiple Hosts — expanding Fluent Bit across your network
  4. Metrics with Prometheus — node_exporter, Pi-hole metrics, and Proxmox monitoring
  5. Alerting — getting notified when things actually break
  6. Lessons Learned — everything that went wrong and how we fixed it

Chris R. Miller

Austin, TX
I like computers.