Self-hosted state of the union

“Homelab” in massive quotes

After reaching a threshold of how critical my self-hosted setup is to my day-to-day life, it’s hard to call it a homelab with a straight face.

For example: a botched upgrade or downtime can remove the ability to access my password manager and thus my entire online existence! It’s hardly a “lab” I can safely play in.

That not necessarily a bad thing! It lets me train my operational muscles, especially when my well-being is on the line to force me to do so.

Actually, let’s use that mindset to ask the following about each system in my self-hosted setup:

What is it and why do I need it?

How stable has it been? Over what timeframe?

What is the blast radius for it failing?

State of monitors and alerts?

While I’ve gravitated toward robust systems and setups over time, it’s good to write everything down and take a hard look at each one. Once that’s done, I can evaluate if they need adjusting or removal if they’re too much of a risk or not really needed.

Okay. Let’s list what is deployed on my server cluster, what is the impact of their failure, and how stable have they been across time and across upgrades.

This is specific to my setup and personal experience, definitely not any kind of recommendation. It is first a self-reflection piece that will hopefully be useful for others.

Please skim through and see if there’s anything you care about :)

Foundational systems

These are less critical and more like “everything else is standing atop these”.

Debian (server OS)

Project link: https://www.debian.org/

Description:

Granddaddy of many a Linux distro, and for a good reason: it’s slow-moving and stable.
It also helps that a lot of online Linux Q&As are Debian-centric, making setup and debugging easier.

Stability: no issues to report!

I’ve never done an in-place Debian upgrade though, so can’t speak to that.

Failure impact: probably some temporary service interruptions

If the package manager and the multitude of mirrors are unavailable, I can’t apply a critical security update.
- Timing of that is highly unlikely.
Otherwise a Debian / Linux bug, hardware issue, or resource exhaustion might take down a server.
- Unlikely to take down all at the same time in my high-availability 3-node setup though.

Monitoring: looking good

Prometheus node-exporter alerts should cover most common failure scenarios.

OpenZFS (server filesystem)

Project link: https://openzfs.org/wiki/Main_Page

Description:

Provides disk-level redundancy, encryption, integrity checks, app volumes, and much more.
Can’t heap enough praise on ZFS. Robustly solves so many problems all in one package.

Stability: never had a reliability issue

Over time and across upgrades, things never went wrong reliability-wise due to using ZFS.
Caveats:
- You NEED to be running periodic scrubs to fix any bitrot.
  - Don’t be a Linus
- You NEED to have enough redundancy to make sure you can comfortably survive disk failures.

Failure impact: I’m screwed

It’s literally where the data lives, it CANNOT fail no matter what.
Really this is only if ZFS starts returning corrupt data somehow and propagates across nodes, HIGHLY unlikely given data is validated against checksums during reads combined with ECC RAM.
- There were some known issues, particularly with zfs send / recv in combination with encryption, but I don’t use that in my environment.
If it just starts erroring due to pool failure, can failover to another node where the data was synchronously replicated via DRBD.

Monitoring: only alerting on pool state

Partially covered by some node-exporter custom alerts, need to expand them.

k3s (cluster OS)

Project link: https://k3s.io/

Description:

Kubernetes, but greatly simplified into one binary.
Bunch of convenient things built in to get you going, fairly customizable, and supports various topologies.

Stability: no issues across 4+ major version upgrades!

Upgrades are as easy as: shutdown, replace existing binary, startup.
Of course, need to carefully read any release notes on breaking changes, but there are typically warnings and grace periods between versions before that happens.

Failure impact: it’s the OS for my cluster

If it goes down, so does everything running on it.

Monitoring: probably complete

Covered by various k8s exporters and alerts that come with kube-prometheus-stack Helm chart.
- Although getting k3s to work with the exporters is a bit of a pain.

Piraeus Operator (cluster filesystem)

Project link: https://github.com/piraeusdatastore/piraeus-operator

Description:

In the usual Unix™ way, this is a sandwich of systems to make data replication happen.
- DRBD: the Linux kernel driver to enable replication of block devices across systems.
- LINSTOR: management engine that handles generating the archaic DRBD configs and keeping them in sync across systems, as well as volume provisioning using ZFS.
- Piraeus: the k8s CSI layer on top of LINSTOR to let us use native k8s resources to provision and manage replicated volumes.
In practice, Piraeus Operator sets all this up including building and loading in the DRBD Linux kernel driver, which makes it easy to get going.
- It does limit you to specific Linux distros and versions due to said kernel driver functionality.

Stability: not great…

There’s been a handful of issues like this one that have caused the dreaded corruption to (silently) occur, truly the worst scenario.
Thankfully, no issues with Piraeus Operator v2.10.4 (DRBD v9.2.16) so far.
- I might just stay on this version until I’m forced to move due to an OS upgrade…
Definitely the problem child of the whole setup, have not found any solid alternatives though.
- Evaluated Longhorn, Rook, and Mayastor. All have dealbreakers in one way or another.
- Paid solutions might be better, but my setup is already costing me enough as it is…

Failure impact: I’m triple screwed!

It’s like ZFS in terms of impact, but across ALL nodes!
- Corruption will propagate automatically and will make sure all your nodes are screwed at the same time.
  - ZFS will not save you since it is faithfully storing the corrupted data given to it from the layer above.
- Only recourse is restoring from ZFS snapshot or remote backup.

Monitoring: probably good

Covered by Prometheus monitoring resources provided by the operator.

MetalLB (network load-balancer)

Project link: https://metallb.io/

Description:

Uses clever methods to let you set up a proper load-balancer across physical nodes without dedicated hardware.
If you don’t want to play with BGP or if your router doesn’t support it then you need to use layer 2 mode.
- This means one node is tasked with receiving and routing all traffic, with MetalLB performing failover as needed.

Stability: I’ve never had to think about it

Layer 2 mode has been incredibly stable and I’ve had no issues.
Can’t speak to the BGP mode, been too lazy to try it out, although that does make it act like a true load balancer where all nodes can receive inbound traffic in parallel.

Failure impact: full outage, it handles external network traffic

Internal traffic will keep going fine, but that’s not exactly useful.

Monitoring: double coverage

Covered by Prometheus monitoring resources deployed as part of the Helm chart + blackbox-exporter as a failsafe.

Traefik (web load balancer)

Project link: https://traefik.io/traefik

Description:

Does the usual reverse proxy thing to let me map various subdomains to different apps, terminating TLS, load-balancing, etc.
Comes built into k3s, but haven’t had a reason to switch away from it yet.
- Customizing installation within k3s is a bit hacky, but that’s not Traefik’s fault.

Stability: been fine

No issues to note during upgrades outside some deprecations here and there.

Failure impact: most services down

Some non-web apps like Samba and Forgejo git do survive though, so I guess not a full outage?

Monitoring: same as MetalLB

Covered by Prometheus monitoring resources deployed as part of the Helm chart (once customized) + blackbox-exporter as a failsafe.

cert-manager (certificate provisioning)

Project link: https://cert-manager.io/

Description:

Used with Let’s Encrypt and Cloudflare DNS to automatically provision and renew HTTPS certs for all webapps.

Stability: all clear!

Do vaguely remember having pain with an upgrade at work in the long ago, but not gonna count that against my own setup.

Failure impact: low if caught in time

Expiration should give me at least a few days to get alerted about and respond to any renewal issues.

Monitoring: canned alerts

While a ServiceMonitor is exposed, no default alerting rules are setup in the Helm chart. Ended up using some generic ones I found online that I should really review.

Uptime / debug enablers

Things that let me know if something went wrong and dig into why.

Issues can’t always be prevented, but we can try and reduce the time-to-recovery by setting ourselves up for success.

Prometheus (time series metrics / alerting)

Project link: https://prometheus.io/

Description:

One of the OGs of time series DBs and alerting (via Alertmanager)
There are lots of alternatives (like Influx or Mimir), but it’s what I know and comes with a lot built in via kube-prometheus-stack.
- Prometheus Operator is pretty great too and lots of Kubernetes deploys optionally come with Prometheus resources for automatically setting up alerting.

Stability: no issues for low volume

Definitely fine for my self-hosted setup
- Though I know it doesn’t scale particularly well with number of metrics / cardinality

Failure impact: alerting down

Most alerts are based on Prometheus metrics, so if it’s down then I’m flying blind.

Monitoring: no backup alerting!

Definitely need something that monitors the monitor, like a dead man’s switch.
- ntfy just enabled this functionality in the latest version, so can probably leverage that.

Influx (time-series DB)

Project link: https://www.influxdata.com/

Description:

Prometheus-alternative with a much better UI and honestly more sane query langauge (at least Flux is).
- But has a lot less support out of the box from other apps.

Stability: several major breaking versions

Stuck on InfluxDB v2 since Flux language support was dropped in v2, not great…
Otherwise, InfluxDB v2 has been great.

Failure impact: critical metrics down

Metrics for solar panels, power distribution, and hard drives go into InfluxDB.

Monitoring: we ping the site

Only blackbox-exporter alerts for the website frontend, but no monitoring on the actual functionality.

Loki (log aggregation / query)

Project link: https://grafana.com/oss/loki/

Description:

Like Prometheus, but for log aggregation (logs pushed from each node via Alloy).
Meant for large scale deployments via autoscaling microservices, so is quite complicated to setup even in “SingleBinary” mode.
- Loki query language is hard to beat though

Stability: no issues for in my setup

Built for scaling up, but works fine as well as one pod writing to a file system

Failure impact: no way to debug issues

Pod logs don’t persist locally for long, so anything that requires looking into the past will be down.
Don’t have any log-based alerts yet, so doesn’t impact alerting at least right now.

Monitoring: it’s there

Comes built in with Prometheus alerting rules, so it’s probably good?

Grafana (dashboards / metrics querying)

Project link: https://grafana.com/grafana/

Description:

Gold standard dashboarding and metrics querying solution, lots of integrations available for whatever datasources you care about.
Large ecosystem of dashboards and plugins to pull from as well.

Stability: fine

There were some known issues due to the removal of Angular in Grafana 12 that broke some legacy plugin functionality.
- I wasn’t affected, but it is something to note.

Failure impact: oh no my pretty dashboards!

It would be a pain to have to query each data source directly if Grafana went down, but it’s not the end of the world.

Monitoring: we ping the site

The blackbox-exporter based alerts on whether the site is up or not should be sufficient.

Scrutiny (drive monitoring)

Project link: https://github.com/AnalogJ/scrutiny

Description:

Provides predictive alerts on potential drive failure by comparing SMART data against statistics from Backblaze.
- Though not sure how well the statistics apply to SSDs.
There’s some overlap with Prometheus’ smartctl-exporter, but drive health is important enough that I don’t mind having two systems for this.

Stability: nothing to report

No complaints so far about stability, though this is one of the newer apps I’ve installed so it hasn’t been around the block yet.

Failure impact: predicted drive failure detection miss

smartctl-exporter will let me alert on full-blown drive failure, but being proactive rather than reactive is always better.

Monitoring: only dashboard site uptime

blackbox-exporter alert exists for the dasbhoard site, but no monitoring on if node SMART data scraping / alerting is working.

ntfy (Android push notifications)

Project link: https://ntfy.sh/

Description:

Easy to use and flexible way to enable push notifications to phones.
- Lots of ways to specify data, priority, etc via query params, headers, etc.
Self-hosted setup is kind of a pain though.
- Lots of settings to configure, though the docs do help.
- To prevent the Android app from running as a foreground service, need to create a Google Firebase account and build your own Android APK.

Stability: occasional delayed alerts

At least when using the Firebase-based background notifications, have had delays with pretty critical things like doorbell notifications up to a minute or so.
- Not sure how much of that is the app though vs Android issues.

Failure impact: most alerts sent in parallel to email

Some alerts aren’t sent to email though, so there is some impact.

Monitoring: only website uptime

Only blackbox-exporter alerts for the admin website, but no monitoring on the actual functionality (like a heartbeat check).

Unpoller (Ubiquiti metrics)

Project link: https://unpoller.com/

Description:

Most of my networking and security hardware is Ubiquiti. This lets me scrape them as Prometheus metrics and expose the data as Grafana dashboards.
Not using the data for alerts yet, though Ubiquiti provides some of that out of the box natively.

Stability: probably good

The rare times I look at the provided dashboards, the data is there so no issues from what I can tell.

Failure impact: none really

The data isn’t used for alerts so no impact if there’s downtime.
- Should probably change that though, would be good to get some secondary monitoring going.

Monitoring: none really

Kubernetes will let me know if the pod is crashlooping, but nothing about if the functionality works okay.

Database-like things

Thankfully lots of self-hosted apps now just use SQLite or filesystem for storing data and make things easy. Some don’t though, and so have setup some additional things.

MariaDB Operator (MySQL fork)

Project link: https://github.com/mariadb-operator

Description:

Provides a Kubernetes-native way to spin up MariaDB instances / clusters.
- I like how databases, users, grants, etc are all k8s resources, makes it easy to prep a database for an app in a declaritive way.
- Really, this is only really used for Seafile as MySQL / MariaDB is a hard requirement.

Stability: gewd

I’m only running a single instance for one app, so not exactly high load but it’s never caused me issues.
Updates are pretty easy too and handled by the operator mostly.

Failure impact: Seafile down

It’s not the end of the world if the DB for Seafile is down since the data is replicated to a bunch of clients.
- Although it means client uploads are paused, which can cause conflicts which are a pain to untangle.

Monitoring: none!

The metrics should be exposed via ServiceMonitors, but no rules actually alert based on them, not good.

Valkey (Redis fork)

Project link: https://valkey.io

Description:

Lots of apps rely on Redis for caching, queues, etc, and Valkey seems like the OSS fork of choice now after the licensing debacle.
- Many app Helm charts baked in OG Redis cluster provisioning, and untangling that was annoying.
Sadly, instances are still provisioned by creating individual deployments for each app as a k8s operator didn’t really exist at setup time.

Stability: haven’t root caused an issue to it yet

Mone of the Valkey instances have a persistent volume attached so clearly data preservation is not an issue.
Upgrades have been completely seemless too so far.

Failure impact: higher latencies maybe?

Most apps just use it as a cache and fallback to hitting the DB if needed.
- My cluster has low enough load that it’s probably not impactful.

Monitoring: crickets

No metrics endpoints configured since they’re just raw k8s Deployments, so need to figure that out then finding some alerting rules based on those.

Stackgres (PostgreSQL stack)

Project link: https://stackgres.io/""

Description:

Like the MariaDB Operator, this let’s us spin up Postgres clusters and manage them using k8s resources.
- Interestingly, this includes things like restarts, upgrades, backups, etc. All of these can be triggered via applying a k8s resource.
Has a nice UI interface, and integrates well with Prometheus / Grafana.

Stability: lotta bugs

This is mostly from my experience at work, but there have been many issues that we had to work through with the vendor and which required either manual intervention or continuous patch version upgrades.
- Learnings from that lets me sidestep most of them at home, but not all.

Failure impact: many apps down!

If an app doesn’t use SQLite, then it most likely uses PostgreSQL.
- Lots of apps like Forgejo and Immich pretty much require it.

Monitoring: fantastic!

Operator provides out of the box metrics for each cluster an automatically sets up dashboards in Grafana.
- While the operator weirdly doesn’t setup PrometheusRules itself, the repo provides a YAML file to apply.