<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Nerdy Lyon's Den... Tech Blog]]></title><description><![CDATA[Hello! I'm just a curious tech Lyon here to explore and share some fun experiences in tech. Despite being told by a previous manager I have no relevant tech ski]]></description><link>https://blog.nerdylyonsden.io</link><image><url>https://cdn.hashnode.com/res/hashnode/image/upload/v1724528104871/9a1703a6-34c8-4bf5-90a4-45f60869d4aa.png</url><title>Nerdy Lyon&apos;s Den... Tech Blog</title><link>https://blog.nerdylyonsden.io</link></image><generator>RSS for Node</generator><lastBuildDate>Tue, 14 Apr 2026 22:44:58 GMT</lastBuildDate><atom:link href="https://blog.nerdylyonsden.io/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[GitOps for Network Engineers - Deploying Nautobot]]></title><description><![CDATA[Previous Articles in the Series
Bridging the Gap: GitOps for Network Engineers - Part 1 (Deploying ArgoCD)
Bridging the Gap: GitOps for Network Engineers - Part 2 (Deploying Critical Infrastructure with ArgoCD)
Intro
Here we go! Time to deploy someth...]]></description><link>https://blog.nerdylyonsden.io/gitops-for-network-engineers-deploying-nautobot</link><guid isPermaLink="true">https://blog.nerdylyonsden.io/gitops-for-network-engineers-deploying-nautobot</guid><category><![CDATA[nautobot]]></category><category><![CDATA[gitops]]></category><category><![CDATA[ArgoCD]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[metallb]]></category><category><![CDATA[Traefik]]></category><category><![CDATA[hashicorp-vault]]></category><category><![CDATA[external-secrets]]></category><category><![CDATA[Helm]]></category><category><![CDATA[Kustomize]]></category><category><![CDATA[Network Automation]]></category><category><![CDATA[networking]]></category><dc:creator><![CDATA[Jeffrey Lyon]]></dc:creator><pubDate>Tue, 16 Sep 2025 12:13:28 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1756162643734/ee38d38f-a5bf-4d88-97ea-562bc82a104f.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3 id="heading-previous-articles-in-the-series">Previous Articles in the Series</h3>
<p><a target="_blank" href="https://blog.nerdylyonsden.io/bridging-the-gap-gitops-for-network-engineers-part-1">Bridging the Gap: GitOps for Network Engineers - Part 1</a> (Deploying ArgoCD)</p>
<p><a target="_blank" href="https://blog.nerdylyonsden.io/bridging-the-gap-gitops-for-network-engineers-part-2">Bridging the Gap: GitOps for Network Engineers - Part 2</a> (Deploying Critical Infrastructure with ArgoCD)</p>
<h1 id="heading-intro">Intro</h1>
<p>Here we go! Time to deploy something network automation engineers actually use: <a target="_blank" href="https://docs.nautobot.com/projects/core/en/stable/"><strong>Nautobot</strong></a>. For those who are unfamiliar, Nautobot is an open-source Network Source of Truth and automation platform. It gives you a clean API, GraphQL, plugins, and jobs for modeling your network and driving intent-based automation. In a GitOps workflow, Nautobot becomes the living database of network intent and inventory, while Argo CD ensures the platform itself is deployed and maintained declaratively. It’s one of my favorite tools because you can’t have a solid network automation foundation without a solid source of truth (okay, “source of intent” if you prefer). Either way, Nautobot is among the best; kudos to the <strong>Network to Code</strong> team for a great product. Before we dive in, let’s quickly recap the previous <strong>GitOps for Network Engineers</strong> posts. If you haven’t read those yet, I’d recommend starting there first. The links are posted above.</p>
<p><strong>Part 1</strong> established the groundwork: why GitOps matters for network engineers (intent-as-code, reviews, rollbacks), installing <strong>Argo CD</strong>, connecting it to Git, and proving the reconcile loop with a simple, Git-managed deployment.</p>
<p><strong>Part 2</strong> leveled that foundation into a production-ready platform. We declaratively integrated:</p>
<ul>
<li><p><strong>MetalLB</strong> for external service IPs</p>
</li>
<li><p><strong>Traefik</strong> for ingress routing and TLS</p>
</li>
<li><p><strong>Rook-Ceph</strong> for durable, cluster-native storage</p>
</li>
<li><p>A secrets stack using <strong>External Secrets</strong> backed by <strong>HashiCorp Vault</strong>, all continuously managed by <strong>ArgoCD</strong>.</p>
</li>
</ul>
<p>As a result, the platform can now:</p>
<ul>
<li><p>Expose apps securely via external IPs and ingress rules</p>
</li>
<li><p>Persist data with Ceph-backed volumes</p>
</li>
<li><p>Manage secrets without committing them to Git</p>
</li>
<li><p>Treat infrastructure the same as applications: defined in code, reconciled by Argo CD</p>
</li>
</ul>
<p>Instead of stamping this post as “Part 3,” I’m branching it off from the foundation posts. That gives me room to play with future installments while still keeping them under the <strong>GitOps for Network Engineers</strong> umbrella when it makes sense. The goal here is simple: bring a <strong>basic</strong> Nautobot deployment online, fully managed by ArgoCD, using the same GitOps patterns we established earlier. Specifically, we will:</p>
<ul>
<li><p>Add the main Nautobot Helm chart to ArgoCD</p>
</li>
<li><p>Define (or confirm) a StorageClass for Nautobot’s persistent needs</p>
</li>
<li><p>Allocate a MetalLB IP for Traefik to serve Nautobot externally</p>
</li>
<li><p>Create Secrets for DB, Redis, and an initial Nautobot superuser</p>
</li>
<li><p>Compose Kustomize resources to wrap Helm and environment overlays</p>
</li>
<li><p>Author custom <code>values.yml</code> for your environment</p>
</li>
<li><p>Deploy the App</p>
</li>
</ul>
<p>When we are done, our deployment will include five pods:</p>
<ul>
<li><p><strong>Nautobot Web (frontend/API)</strong> - serves the UI plus REST/GraphQL endpoints</p>
</li>
<li><p><strong>Nautobot Celery Worker</strong> - executes background jobs and plugin tasks</p>
</li>
<li><p><strong>Nautobot Celery Beat</strong> - schedules periodic tasks for the worker</p>
</li>
<li><p><strong>PostgreSQL</strong> - primary application database for Nautobot objects/state</p>
</li>
<li><p><strong>Redis</strong> - cache and message broker backing Celery queues</p>
</li>
</ul>
<p>This deployment will not include any building of custom container images, Nautobot plugins, or custom Nautobot configurations. I’m planning that for a future post.</p>
<p>Let’s dive in.</p>
<h1 id="heading-adding-nautobots-helm-chart">Adding Nautobot’s Helm Chart</h1>
<p>First things first: let’s add the Nautobot Helm chart to <strong>Argo CD</strong>. If you followed the earlier posts, this will feel familiar. In the examples below, I’m using my <code>prod-home</code> Argo CD Project; you’ll see that name throughout. Your Project name can (and likely will) be different, so substitute your own wherever you see <code>prod-home</code>.</p>
<h3 id="heading-step-1-add-the-helm-repo"><strong>Step 1: Add the Helm Repo</strong></h3>
<ul>
<li><strong>Helm Repo URL</strong>:<br />  <code>https://nautobot.github.io/helm-charts/</code></li>
</ul>
<p>In the ArgoCD UI:</p>
<ul>
<li><p>Go to <strong>Settings → Repositories</strong></p>
</li>
<li><p>Click <strong>+ CONNECT REPO</strong></p>
</li>
<li><p>Enter the Helm repo URL</p>
</li>
<li><p>Choose <strong>Helm</strong> as the type</p>
</li>
<li><p>Give the repo a name (Optional)</p>
</li>
<li><p>Choose the project you created earlier to associate this repo with (mine was ‘prod-home’)</p>
</li>
<li><p>No authentication is needed for this public repo</p>
</li>
<li><p>When done, click <strong>CONNECT</strong></p>
</li>
</ul>
<p>Once added, ArgoCD can now pull charts from this source.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756514554340/1bdcaa3f-b118-4108-bd9a-d7f81d8697b0.png" alt class="image--center mx-auto" /></p>
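<p>If you prefer the CLI, the same repo connection can be made with the <code>argocd</code> client. This is a sketch: the login host is a placeholder, and the <code>--name</code>/<code>--project</code> values are assumptions matching my <code>prod-home</code> setup.</p>
<pre><code class="lang-bash"># Log in to the Argo CD API server first (replace the host with your install)
argocd login argocd.example.local

# Register the public Nautobot Helm repo; no credentials required
argocd repo add https://nautobot.github.io/helm-charts/ \
  --type helm \
  --name nautobot \
  --project prod-home
</code></pre>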
<p><strong>Note:</strong> As seen in Part 2, you’ll also need to add the <strong>GitHub repo</strong> that contains your custom configuration files, like Helm <code>values.yml</code> files and Kustomize overlays.</p>
<ul>
<li><p>If you're using <strong>my example repo</strong>, add <code>https://github.com/leothelyon17/kubernetes-gitops-playground.git</code> as another source, of type Git.</p>
</li>
<li><p>If you're using <strong>your own repo</strong>, just make sure it's added in the same way so ArgoCD can pull your values and overlays when syncing.</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756824233567/6010ca27-6e7b-4ccf-b34e-2546baace246.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-step-2-create-the-argocd-application"><strong>Step 2: Create the ArgoCD Application</strong></h3>
<p>Head to the <strong>Applications</strong> tab and click <strong>+ NEW APP</strong> to start the deployment.</p>
<p>Here’s how to fill it out:</p>
<ul>
<li><p><strong>Application Name</strong>: <code>nautobot</code> (or in my case <code>nautobot-prod</code>)</p>
</li>
<li><p><strong>Project</strong>: Select your project (e.g., prod-home)</p>
</li>
<li><p><strong>Sync Policy</strong>: Manual for now (we’ll automate later)</p>
</li>
<li><p><strong>Repository URL</strong>: Select the Helm repo you just added</p>
</li>
<li><p><strong>Chart Name</strong>: <code>nautobot</code></p>
</li>
<li><p><strong>Target Revision</strong>: Use the latest or specify a version (latest is recommended)</p>
</li>
<li><p><strong>Cluster URL</strong>: Use <a target="_blank" href="https://kubernetes.default.svc"><code>https://kubernetes.default.svc</code></a> if deploying to the same cluster (mine might be different from the default; don’t worry.)</p>
</li>
<li><p><strong>Namespace</strong>: <code>nautobot</code> or <code>nautobot-prod</code> to match the ArgoCD application name. Check the box to create the namespace if it doesn’t already exist in your Kubernetes cluster</p>
</li>
</ul>
<p>Click CREATE when finished.</p>
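<p>For reference, the UI steps above map to a declarative Argo CD <code>Application</code> manifest you could commit to Git instead. This is a sketch only; the project, application name, and namespace are assumptions from my environment.</p>
<pre><code class="lang-yaml">apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: nautobot-prod
  namespace: argocd
spec:
  project: prod-home
  source:
    repoURL: https://nautobot.github.io/helm-charts/
    chart: nautobot
    targetRevision: "*"   # latest; pin a specific chart version if you prefer
  destination:
    server: https://kubernetes.default.svc
    namespace: nautobot-prod
  syncPolicy:
    syncOptions:
      - CreateNamespace=true   # matches the "create namespace" checkbox
</code></pre>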
<p>If everything is in order you should see the App created like the screenshot below, though yours will show all-yellow status and ‘OutOfSync’ -</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756515387339/a0339e2f-b27f-4c62-a6a2-67d1180a61de.png" alt class="image--center mx-auto" /></p>
<p>Just like before, ArgoCD will immediately show you all the Kubernetes objects it plans to create. <strong>Don’t hit Sync yet.</strong> We haven’t configured the databases, secrets, or persistent storage, so a deploy right now would fail: the databases would be unable to mount their volumes. We’ll get there.</p>
<p>For this first section, the goal was simple: pull in the main <strong>Nautobot</strong> Helm chart, which we’ve done. In previous posts, we’d usually fine-tune the ArgoCD Application to point at our Kustomize overlays or custom Helm values. We’ll come back to that once all those pieces exist; if you do this in the Application now, ArgoCD will fail on the missing paths. Onward.</p>
<h1 id="heading-overview-for-nautobots-helm-values">Overview for Nautobot’s Helm Values</h1>
<p>Here we’ll take a quick pass over Nautobot’s default Helm values so we know exactly where our overrides will land later.</p>
<p>Defaults can be found here:<br /><a target="_blank" href="https://github.com/nautobot/helm-charts/blob/develop/charts/nautobot/values.yaml">https://github.com/nautobot/helm-charts/blob/develop/charts/nautobot/values.yaml</a></p>
<p>For this deployment, we’ll customize these core sections:</p>
<ul>
<li><code>superuser</code> - bootstrap admin (username/email/password).</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756863994072/f0340d09-1759-4174-a99b-0f3837d0a327.png" alt class="image--center mx-auto" /></p>
<ul>
<li><code>postgresql</code> - point at our Postgres (in-cluster or external), version, storage, and connection settings.</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756864141999/c9a1d046-53e6-4579-97a7-b0eb2bf8b6ef.png" alt class="image--center mx-auto" /></p>
<ul>
<li><code>redis</code> - enable/disable and wire the cache/queue endpoint (persistence optional).</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756864187629/81674508-8ad8-4cde-896f-bde3b28e14eb.png" alt class="image--center mx-auto" /></p>
<p>A few optional knobs worth calling out:</p>
<ul>
<li><p><strong>Replicas</strong>: under both <code>nautobot</code> and <code>celery</code>, you can set <code>replicas: 1</code> for dev or tight clusters; bump later as you scale. I will be setting the replicas to ‘1’.</p>
</li>
<li><p><strong>Image</strong>: under <code>nautobot.image</code> set a specific <code>tag</code> (or a custom image) if you don’t want “latest.” Unless you know what you are doing, leave the defaults for this deployment.</p>
</li>
<li><p><strong>Ingress</strong>: the chart can create it, but we’re keeping that <strong>off</strong> and handling exposure via our Kustomize <strong>IngressRoute</strong> pattern.</p>
</li>
</ul>
<p>That’s it for the big call-outs. We’ll circle back and set those values once the rest of the pieces (storage, secrets, and ingress) are in place later in the post.</p>
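<p>As a rough preview, the overrides we’ll land on later follow this general shape. This is a sketch only: the exact key layout can differ between chart versions (check the linked defaults), and the storage class names are from my environment.</p>
<pre><code class="lang-yaml">nautobot:
  replicas: 1          # single web pod for a small cluster
celery:
  replicas: 1          # single worker
postgresql:
  enabled: true
  primary:
    persistence:
      storageClass: rook-cephfs-retain   # durable storage for the DB
redis:
  enabled: true
  master:
    persistence:
      storageClass: rook-cephfs-delete   # optional; cache can be ephemeral
</code></pre>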
<h1 id="heading-add-persistent-storage">Add Persistent Storage</h1>
<p>For Nautobot, the one thing that absolutely needs persistence is the primary database, by default that’s PostgreSQL, and it should live on durable storage. Redis handles caching/queuing, and persistence there is optional: if you need cached data to survive pod restarts or rolling updates, back it with a PVC; otherwise keep it ephemeral and let it rebuild as needed.</p>
<p>In the <strong>Part 2</strong> post we created two CephFS storage classes with Rook-Ceph. For this post I’m using the <code>rook-cephfs-retain</code> class for Postgres and <code>rook-cephfs-delete</code> for Redis (optional), which we will see later in our custom Helm values.</p>
<h3 id="heading-cephfs-storageclass-retain">CephFS StorageClass (Retain)</h3>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">storage.k8s.io/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">StorageClass</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-comment"># Name you’ll reference from PVCs (spec.storageClassName)</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">rook-cephfs-retain</span>
<span class="hljs-comment"># CSI driver that provisions CephFS-backed volumes via Rook</span>
<span class="hljs-attr">provisioner:</span> <span class="hljs-string">rook-ceph.cephfs.csi.ceph.com</span>

<span class="hljs-attr">parameters:</span>
  <span class="hljs-comment"># ----- Tell the CSI driver which Ceph cluster/filesystem to use -----</span>

  <span class="hljs-comment"># Namespace where your Rook-Ceph cluster runs (operator, mons/osds, etc.)</span>
  <span class="hljs-comment"># If your cluster is in a different namespace, update this and the secrets below.</span>
  <span class="hljs-attr">clusterID:</span> <span class="hljs-string">rook-ceph</span>

  <span class="hljs-comment"># Name of the CephFS filesystem (created during CephFS setup)</span>
  <span class="hljs-comment"># You can confirm with `ceph fs ls`.</span>
  <span class="hljs-attr">fsName:</span> <span class="hljs-string">k8s-ceph-fs</span>

  <span class="hljs-comment"># Ceph pool backing the filesystem (required when provisionVolume is true)</span>
  <span class="hljs-comment"># Must match the pool configured for your fsName.</span>
  <span class="hljs-attr">pool:</span> <span class="hljs-string">k8s-ceph-fs-replicated</span>

  <span class="hljs-comment"># ----- CSI secrets for provisioning/expansion/node-stage (auto-created by Rook) -----</span>

  <span class="hljs-comment"># Secret used by the provisioner sidecar to create volumes</span>
  <span class="hljs-attr">csi.storage.k8s.io/provisioner-secret-name:</span> <span class="hljs-string">rook-csi-cephfs-provisioner</span>
  <span class="hljs-attr">csi.storage.k8s.io/provisioner-secret-namespace:</span> <span class="hljs-string">rook-ceph</span>

  <span class="hljs-comment"># Secret used by the controller for volume expansion operations</span>
  <span class="hljs-attr">csi.storage.k8s.io/controller-expand-secret-name:</span> <span class="hljs-string">rook-csi-cephfs-provisioner</span>
  <span class="hljs-attr">csi.storage.k8s.io/controller-expand-secret-namespace:</span> <span class="hljs-string">rook-ceph</span>

  <span class="hljs-comment"># Secret used on the node to stage/mount volumes</span>
  <span class="hljs-attr">csi.storage.k8s.io/node-stage-secret-name:</span> <span class="hljs-string">rook-csi-cephfs-node</span>
  <span class="hljs-attr">csi.storage.k8s.io/node-stage-secret-namespace:</span> <span class="hljs-string">rook-ceph</span>

  <span class="hljs-comment"># ----- Optional: choose the client implementation for CephFS mounts -----</span>
  <span class="hljs-comment"># If omitted, CSI auto-detects. Kernel client is typical in prod.</span>
  <span class="hljs-comment"># mounter: kernel</span>

<span class="hljs-comment"># Keep PVs (and data) when PVCs are deleted—safer for DBs and long-lived data</span>
<span class="hljs-attr">reclaimPolicy:</span> <span class="hljs-string">Retain</span>

<span class="hljs-comment"># Allow growing PVCs in place (kubectl patch ... size: 40Gi, etc.)</span>
<span class="hljs-attr">allowVolumeExpansion:</span> <span class="hljs-literal">true</span>

<span class="hljs-comment"># Mount-time options passed to the client</span>
<span class="hljs-attr">mountOptions:</span>
  <span class="hljs-comment"># Uncomment for verbose client debug logs during troubleshooting</span>
  <span class="hljs-comment"># - debug</span>
</code></pre>
<h3 id="heading-why-choose-retain-vs-delete">Why choose <strong>Retain</strong> vs <strong>Delete</strong></h3>
<ul>
<li><p><code>Retain</code> keeps the PV (and data) when its PVC is deleted.<br />  Use it for anything you don’t want accidentally destroyed (databases, long-lived app data, easy rollbacks). The trade-off is manual cleanup later.</p>
</li>
<li><p><code>Delete</code> removes the PV and backend data when the PVC goes away.<br />  Great for ephemeral/dev workloads where you don’t care about the data. Trade-off: once it’s gone, it’s gone.</p>
</li>
</ul>
<h3 id="heading-why-allowvolumeexpansion-is-important">Why <strong>allowVolumeExpansion</strong> is important</h3>
<ul>
<li><p>Lets you grow PVCs <strong>in place</strong> as your data grows (no migrate-and-restore dance).</p>
</li>
<li><p>With CephFS + CSI, online expansion is supported; Kubernetes handles the resize.</p>
</li>
<li><p>You still need available capacity in the Ceph cluster. This just makes growth operationally simple.</p>
</li>
</ul>
<p>Use this class for your Nautobot <strong>Postgres</strong> PVC. Redis persistence is optional. Enable it only if you truly need cache durability.</p>
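<p>For reference, a PVC against the retain class looks like this. Note that the chart’s bundled Postgres normally creates its PVC for you from Helm values, so in practice you’d just set the <code>storageClass</code> there; the names below are illustrative.</p>
<pre><code class="lang-yaml">apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data          # example name
  namespace: nautobot-prod
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: rook-cephfs-retain   # the class defined above
  resources:
    requests:
      storage: 10Gi
</code></pre>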
<p>Add this storage class (or classes) to your Rook-Ceph deployment if you haven’t already, as shown below, and let’s move forward.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756842384980/3ee3ea61-aa87-4314-b611-715bfa7b35ef.png" alt class="image--center mx-auto" /></p>
<h1 id="heading-metallb-ip-pool-for-traefik">MetalLB IP Pool for Traefik</h1>
<p>Before we can expose apps to the outside world, <strong>Traefik</strong> needs an externally reachable IP from <strong>MetalLB</strong>. “Public” here just means <strong>outside the cluster</strong> (it can still be RFC1918). Since we already set up MetalLB in the earlier posts, this is a quick tweak.</p>
<h3 id="heading-1-give-metallb-an-address-to-hand-out">1) Give MetalLB an address to hand out</h3>
<p>Add a single IP (or a range) to your existing <code>IPAddressPool</code>. I like dedicating a single /32 for Traefik so DNS stays stable.</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">metallb.io/v1beta1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">IPAddressPool</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">prod-traefik-pool</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">metallb-prod</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">addresses:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-number">192.168</span><span class="hljs-number">.101</span><span class="hljs-number">.161</span><span class="hljs-string">/32</span>  <span class="hljs-comment"># Traefik LB IP</span>
</code></pre>
<p>If you don’t already have one, pair the pool with an <code>L2Advertisement</code> (MetalLB won’t announce addresses without it):</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">metallb.io/v1beta1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">L2Advertisement</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">prod-traefik-l2adv</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">metallb-prod</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">ipAddressPools:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">prod-traefik-pool</span>
</code></pre>
<p><strong>Note:</strong> Pick an <strong>unused</strong> IP in your LAN (outside DHCP scope). Then sync your Argo CD app for MetalLB.</p>
<h3 id="heading-2-pin-that-ip-on-the-traefik-service">2) Pin that IP on the Traefik Service</h3>
<p>In your Traefik Helm values, set the Service to <code>LoadBalancer</code> and assign the static IP:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">service:</span>
  <span class="hljs-attr">enabled:</span> <span class="hljs-literal">true</span>
  <span class="hljs-attr">type:</span> <span class="hljs-string">LoadBalancer</span>
  <span class="hljs-attr">spec:</span>
    <span class="hljs-attr">loadBalancerIP:</span> <span class="hljs-number">192.168</span><span class="hljs-number">.101</span><span class="hljs-number">.161</span>
  <span class="hljs-comment"># optional, preserves client source IP if you care about logs:</span>
  <span class="hljs-attr">externalTrafficPolicy:</span> <span class="hljs-string">Local</span>
</code></pre>
<p>Sync your Traefik app. You should see the EXTERNAL-IP appear:</p>
<pre><code class="lang-bash">kubectl -n kube-system get svc

traefik-prod  LoadBalancer   10.233.23.76   192.168.101.161   32400:30228/TCP,80:31007/TCP,443:32150/TCP   108d
</code></pre>
<h3 id="heading-3-optional-dns-now-or-later">3) (Optional) DNS now or later</h3>
<p>Once the IP is live, create a DNS <strong>A record</strong> (e.g., <code>nautobot.example.local → 192.168.101.161</code>). We’ll wire the IngressRoute host to match this in the next steps.</p>
<p>That’s it. Traefik now has a stable, outside-facing address; we can safely publish Nautobot behind it.</p>
<h1 id="heading-exposing-nautobot-using-an-ingress-route">Exposing Nautobot using an Ingress Route</h1>
<p>With Traefik now holding an external IP, we can move on to exposing <strong>Nautobot</strong> to the outside world. Time to configure the IngressRoute so users and devices can reach it.</p>
<p>This part is straightforward if you already have an ingress controller. If not, jump back to the Part 2 post for deploying Traefik in-cluster. By default, the Nautobot Helm chart does <strong>not</strong> create any Ingress/IngressRoute resources.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756833851243/18758fa6-db12-4d93-ba0a-c74b4aa595a5.png" alt class="image--center mx-auto" /></p>
<p>You can use the Nautobot chart values to let it create ingress, but we’re leaving those at the defaults. Instead, we’ll handle exposure in the overlay with a <strong>Traefik IngressRoute</strong>. I prefer this split: Helm owns the app; Kustomize owns how it’s exposed. It’s a repeatable, cookie-cutter pattern across apps and keeps odd edge cases out of chart values. Goal here is simple, publish the web UI outside the cluster. Nothing fancy.</p>
<p>A working IngressRoute example is below, and can also be found in my GitOps Playground repository in the apps/nautobot/overlays/prod folder -</p>
<pre><code class="lang-yaml">---
# --&gt; (Example) Create an IngressRoute for your service...
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: nautobot-prod-ingressroute  # &lt;-- Replace with your IngressRoute name
  namespace: nautobot-prod  # &lt;-- Replace with your namespace
spec:
  entryPoints:
    - websecure
  routes:
    - match: Host(`nautobot.home.nerdylyonsden.io`)  # &lt;-- Replace with your FQDN
      kind: Rule
      services:
        - name: nautobot-prod-default  # &lt;-- Replace with your service name
          port: 80
  # --&gt; (Optional) Add certificate secret
  tls:
    secretName: prod-apps-certificate-secret # &lt;-- cert-manager will store the created certificate in this secret.
</code></pre>
<p>The main points to cover here are:</p>
<ul>
<li><p><strong>Namespace</strong> - Make sure the manifest’s namespace matches where Nautobot will live.</p>
</li>
<li><p><strong>EntryPoints</strong> - Use only <code>websecure</code> so traffic is encrypted at least up to Traefik inside the cluster.</p>
</li>
<li><p><strong>Host rule</strong> - <code>routes.match</code> must match the public DNS A record users will hit for Nautobot.</p>
</li>
<li><p><strong>Service wiring</strong> - <code>services.name</code> and <code>services.port</code> must match the Nautobot Service. In my setup the name is <code>&lt;namespace&gt;-default</code>; adjust if yours differs.</p>
</li>
<li><p><strong>Port</strong> - Defaults to <code>80</code> unless you’ve changed it in the Service.</p>
</li>
<li><p><strong>TLS / certs</strong> - If you have a cluster cert solution (e.g., cert-manager), wire it here. If not, leave the TLS section out for now; I’ll cover this in an advanced post.</p>
</li>
</ul>
<p><strong>Note:</strong> To check the Service name and port, you can click into the app in ArgoCD, whether fully deployed or not, then click the Service → Summary tab → Desired Manifest as shown below -</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756836806375/d2b5e972-0227-41e9-90a1-170ed32b1487.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756837047208/f97d7b21-38b8-43f1-9449-a6902dec8280.png" alt class="image--center mx-auto" /></p>
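<p>If you’d rather use the CLI, the same information is available with <code>kubectl</code> (the namespace and Service name below are assumptions from my setup):</p>
<pre><code class="lang-bash"># List Services in the Nautobot namespace; the NAME and PORT(S) columns
# are what the IngressRoute's services.name / services.port must match
kubectl -n nautobot-prod get svc

# Inspect the full manifest of the default web Service
kubectl -n nautobot-prod get svc nautobot-prod-default -o yaml
</code></pre>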
<p><strong>Note (again):</strong> The Service also exposes port <strong>443</strong>, but we’re not using it. Nautobot needs additional app-level config to terminate HTTPS directly. For now we’ll keep TLS at Traefik and speak HTTP to the Service. End-to-end HTTPS on Nautobot itself is out of scope for this post (maybe a future one).</p>
<p>Once the IngressRoute is set the way you want, drop it into your environment overlay (e.g., <code>apps/nautobot/overlays/prod/ingress-route.yml</code>) and commit it. That’s it for this piece, on to the next section.</p>
<h1 id="heading-deploying-securely-creating-our-secrets">Deploying Securely - Creating Our Secrets</h1>
<p>For a starter implementation of Nautobot with some basic security, we are going to need the following secrets stored in Vault -</p>
<ul>
<li><p>Super User Login Credentials (which will include a password and API token)</p>
</li>
<li><p>Postgres DB Credentials</p>
</li>
<li><p>Redis DB Credentials</p>
</li>
</ul>
<p>We’re going to keep credentials out of Git and let <strong>External Secrets (ESO)</strong> fetch them from <strong>HashiCorp Vault</strong> at deploy time. The two things we need to cover here are: (1) enabling <strong>Kubernetes authentication</strong> in Vault with a <strong>role</strong> dedicated to Nautobot, and (2) <strong>adding the actual secrets</strong> into Vault under the <code>/secret</code> path.</p>
<p>You should hopefully have an existing instance of Hashicorp Vault already if you’ve been following along with the previous posts.</p>
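<p>For the record, the three secrets can be seeded from the Vault CLI like this. This is a sketch: the paths and key names under <code>secret/</code> are assumptions, and the values are obviously placeholders.</p>
<pre><code class="lang-bash"># Superuser bootstrap credentials (password + API token)
vault kv put secret/nautobot/superuser \
  password='change-me' \
  api_token='change-me-too'

# PostgreSQL credentials
vault kv put secret/nautobot/postgres \
  password='change-me'

# Redis credentials
vault kv put secret/nautobot/redis \
  password='change-me'
</code></pre>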
<h2 id="heading-kubernetes-authentication-nautobot-role">Kubernetes Authentication + Nautobot Role</h2>
<p><strong>Why we need it:</strong><br />External Secrets runs inside your cluster. It needs a secure, short-lived way to prove to Vault, “I’m allowed to read <strong>only</strong> the Nautobot secrets.” Vault’s <strong>Kubernetes auth</strong> method does exactly that by validating a pod’s <strong>service account</strong> token against the cluster API and mapping it to a <strong>least-privilege policy</strong>.</p>
<p><strong>What the role does:</strong></p>
<ul>
<li><p>Binds a specific <strong>ServiceAccount + Namespace</strong> (e.g., the one where Nautobot lives) to a <strong>read-only policy</strong> for your Nautobot secret paths.</p>
</li>
<li><p>Issues <strong>short-lived Vault tokens</strong> to ESO when it presents the Kubernetes JWT. No root tokens or static creds in manifests.</p>
</li>
<li><p>Scopes access to exactly the secret paths you choose (nothing more).</p>
</li>
</ul>
<p>The first piece that has to be done (if it was never done previously) is enabling the Kubernetes authentication method. To enable it through the GUI, follow the steps below:</p>
<ol>
<li><p>In the left-hand pane, click <strong>Access</strong>.</p>
</li>
<li><p>Under <strong>Authentication</strong>, click <strong>Enable New Method</strong> (top right).</p>
</li>
<li><p>Under <strong>Infra</strong>, choose <strong>Kubernetes</strong>.</p>
</li>
<li><p>Leave the options at their defaults and click <strong>Enable Method</strong>.</p>
</li>
<li><p>Back on the <strong>Authentication Methods</strong> list, you should now see <strong>kubernetes/</strong> and <strong>token/</strong>. Click <strong>kubernetes/</strong>.</p>
</li>
<li><p>Click <strong>Configuration</strong> (top area), then <strong>Configure</strong> (right side).</p>
</li>
<li><p>In <strong>Configuration</strong>, set <strong>Kubernetes host</strong> to your API URL (I use the Kube-VIP URL from earlier posts). If you don’t have one, you can use <a target="_blank" href="https://kubernetes.default.svc"><code>https://kubernetes.default.svc</code></a>.</p>
</li>
</ol>
<p>Kubernetes auth is now configured. Next, create the Nautobot role.</p>
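<p>If you prefer the CLI, the same setup can be sketched in two commands (this assumes you’re logged into Vault with sufficient privileges; adjust <code>kubernetes_host</code> to your own API URL, such as the Kube-VIP address):</p>
<pre><code class="lang-bash"># Enable the Kubernetes auth method at the default mount path
vault auth enable kubernetes

# Point Vault at the cluster API it should validate ServiceAccount tokens against
vault write auth/kubernetes/config \
    kubernetes_host="https://kubernetes.default.svc"
</code></pre>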
<hr />
<p><strong>Create the Nautobot role:</strong></p>
<ol>
<li><p>From <strong>Authentication Methods</strong>, select <strong>kubernetes/</strong>.</p>
</li>
<li><p>Click <strong>Create role</strong> (right side).</p>
</li>
<li><p>Use these values (adjust as needed for your environment):</p>
<ul>
<li><p><strong>Name:</strong> <code>nautobot</code></p>
</li>
<li><p><strong>Alias name source:</strong> <code>serviceaccount_name</code></p>
</li>
<li><p><strong>Bound service account names:</strong> <code>nautobot-prod</code></p>
</li>
<li><p><strong>Bound service account namespaces:</strong> <code>nautobot-prod</code></p>
</li>
<li><p>Under <strong>Tokens → Generated Token’s Policies:</strong> add <code>nautobot</code> (we’ll create this policy next)</p>
</li>
</ul>
</li>
<li><p>Leave other token settings at their defaults; other fields can remain blank.</p>
</li>
<li><p>Click <strong>Save</strong>.</p>
</li>
</ol>
<p>That’s all we need for the Nautobot role. The referenced ServiceAccount will be created by our Helm deployment a bit later.</p>
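<p>As a rough CLI equivalent of the GUI steps above, the role can be created with a single <code>vault write</code> (values mirror the list; adjust names for your environment):</p>
<pre><code class="lang-bash"># Bind the nautobot-prod ServiceAccount/namespace to the 'nautobot' policy
vault write auth/kubernetes/role/nautobot \
    alias_name_source=serviceaccount_name \
    bound_service_account_names=nautobot-prod \
    bound_service_account_namespaces=nautobot-prod \
    token_policies=nautobot
</code></pre>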
<hr />
<p><strong>Create the ACL policy:</strong></p>
<ol>
<li><p>In the left-hand pane, click <strong>Policies</strong>.</p>
</li>
<li><p>Click <strong>Create ACL policy</strong>.</p>
</li>
<li><p>Enter a policy name (e.g., <code>nautobot</code>).</p>
</li>
<li><p>Paste in the policy content. <em>Note:</em> for simplicity, I start from the default policy and add <strong>read</strong> and <strong>list</strong> capabilities for the upcoming secrets paths (shown below).</p>
</li>
</ol>
# Allow tokens to look up their own properties
<pre><code class="lang-hcl"># Allow tokens to look up their own properties
path "auth/token/lookup-self" {
    capabilities = ["read"]
}

# Allow tokens to renew themselves
path "auth/token/renew-self" {
    capabilities = ["update"]
}

# Allow tokens to revoke themselves
path "auth/token/revoke-self" {
    capabilities = ["update"]
}

# Allow a token to look up its own capabilities on a path
path "sys/capabilities-self" {
    capabilities = ["update"]
}

# Allow a token to look up its own entity by id or name
path "identity/entity/id/{{identity.entity.id}}" {
  capabilities = ["read"]
}
path "identity/entity/name/{{identity.entity.name}}" {
  capabilities = ["read"]
}

# Allow a token to look up its resultant ACL from all policies. This is useful
# for UIs. It is an internal path because the format may change at any time
# based on how the internal ACL features and capabilities change.
path "sys/internal/ui/resultant-acl" {
    capabilities = ["read"]
}

# Allow a token to renew a lease via lease_id in the request body; old path for
# old clients, new path for newer
path "sys/renew" {
    capabilities = ["update"]
}
path "sys/leases/renew" {
    capabilities = ["update"]
}

# Allow looking up lease properties. This requires knowing the lease ID ahead
# of time and does not divulge any sensitive information.
path "sys/leases/lookup" {
    capabilities = ["update"]
}

# Allow a token to manage its own cubbyhole
path "cubbyhole/*" {
    capabilities = ["create", "read", "update", "delete", "list"]
}

# Allow a token to wrap arbitrary values in a response-wrapping token
path "sys/wrapping/wrap" {
    capabilities = ["update"]
}

# Allow a token to look up the creation time and TTL of a given
# response-wrapping token
path "sys/wrapping/lookup" {
    capabilities = ["update"]
}

# Allow a token to unwrap a response-wrapping token. This is a convenience to
# avoid client token swapping since this is also part of the response wrapping
# policy.
path "sys/wrapping/unwrap" {
    capabilities = ["update"]
}

# Allow general purpose tools
path "sys/tools/hash" {
    capabilities = ["update"]
}
path "sys/tools/hash/*" {
    capabilities = ["update"]
}

# Allow checking the status of a Control Group request if the user has the
# accessor
path "sys/control-group/request" {
    capabilities = ["update"]
}

# Allow a token to make requests to the Authorization Endpoint for OIDC providers.
path "identity/oidc/provider/+/authorize" {
    capabilities = ["read", "update"]
}

# Allow a token to access nautobot db secrets
# (KV v2 serves secret data under the data/ prefix, so the policy path includes it)
path "secret/data/nautobot-prod-db-credentials" {
    capabilities = ["read", "list"]
}

# Allow a token to access nautobot superuser secrets
path "secret/data/nautobot-prod-superuser-credentials" {
    capabilities = ["read", "list"]
}
</code></pre>
<p>That’s it. Kubernetes Auth, the Nautobot Role, and the policy are set. Let’s finally add our actual secrets to Vault.</p>
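<p>For the CLI-inclined, the policy above can be saved to a local file (here hypothetically named <code>nautobot-policy.hcl</code>) and written in one command:</p>
<pre><code class="lang-bash"># Create (or update) the 'nautobot' ACL policy from a local HCL file
vault policy write nautobot nautobot-policy.hcl

# Confirm it was stored
vault policy read nautobot
</code></pre>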
<h2 id="heading-add-secrets-to-vault-under-secret">Add Secrets to Vault (under <code>secret/</code>)</h2>
<p>We’ll store the credentials and app secrets that Nautobot (and its dependencies) need under a clear, predictable hierarchy in the <code>/secret</code> (KV) mount.</p>
<p><strong>What to store for a “basic but secure” deploy:</strong></p>
<ul>
<li><p><strong>Superuser</strong>: password and API token (for first login and automation).</p>
</li>
<li><p><strong>Database Passwords:</strong> Postgres and Redis</p>
</li>
</ul>
<p>If the <strong>KV (Key/Value) secrets engine</strong> isn’t enabled yet, start here. Otherwise, skip to <strong>Create the secrets</strong>.</p>
<h3 id="heading-enable-the-kv-secrets-engine">Enable the KV secrets engine</h3>
<ol>
<li><p>In the left navigation, click <strong>Secrets Engines</strong>.</p>
</li>
<li><p>Click <strong>Enable new engine +</strong> (top right).</p>
</li>
<li><p>Choose <strong>KV</strong> under “Generic.”</p>
</li>
<li><p>Set the <strong>Path</strong> to <code>secret</code>; leave other options at defaults.</p>
</li>
<li><p>Click <strong>Enable Engine</strong>.</p>
</li>
</ol>
<p>If this is a fresh Vault and KV wasn’t previously enabled, you should now see it listed alongside the existing engines.</p>
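<p>The CLI equivalent is a single command (a sketch, assuming the KV v2 engine and the <code>secret</code> path used throughout this post):</p>
<pre><code class="lang-bash"># Enable the KV version 2 secrets engine at the path 'secret'
vault secrets enable -path=secret kv-v2
</code></pre>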
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1757300650945/bc75aa92-b0f4-402f-95c1-48b5b8e3d07d.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-create-the-secrets">Create the secrets</h3>
<ol>
<li><p>In the left navigation, click <strong>Secrets Engines</strong>.</p>
</li>
<li><p>Select the new <strong>secret</strong> (KV) engine.</p>
</li>
<li><p>Click <strong>Create secret +</strong> (right side).</p>
</li>
<li><p>For <strong>Path</strong>, enter <code>nautobot-prod-db-credentials</code> (or your preferred name).</p>
</li>
<li><p>Under <strong>Secret data</strong>, add a key <code>postgres-pass</code> with its value.</p>
</li>
<li><p>Click <strong>Add</strong> and create a second key <code>redis-pass</code> with its value.</p>
</li>
<li><p>Click <strong>Save</strong>.</p>
</li>
</ol>
<p>If completed correctly, it should look like the screenshot below:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1757301597283/980ad067-8e9c-473f-85c9-f4d143f25333.png" alt class="image--center mx-auto" /></p>
<p>Repeat the process for the <strong>Superuser</strong> secret. Create a new secret with two keys (for example, <code>superuser-pass</code> and <code>superuser-api-token</code>, matching the properties the superuser ExternalSecret will reference later) and save.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1757301897242/ea3aad8a-6741-44f9-86f2-b4425e9a91c2.png" alt class="image--center mx-auto" /></p>
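<p>Both secrets can also be written from the CLI. The placeholder values below are obviously yours to replace; the key names match what the ExternalSecrets will reference later:</p>
<pre><code class="lang-bash"># Database/Redis credentials
vault kv put secret/nautobot-prod-db-credentials \
    postgres-pass='CHANGE-ME' \
    redis-pass='CHANGE-ME'

# Nautobot superuser credentials
vault kv put secret/nautobot-prod-superuser-credentials \
    superuser-pass='CHANGE-ME' \
    superuser-api-token='CHANGE-ME'
</code></pre>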
<h2 id="heading-how-eso-and-vault-work-together-high-level">How ESO and Vault Work Together (high-level)</h2>
<p>Once the role and secrets exist and Argo CD deploys the app, ESO will:</p>
<ul>
<li><p>Use the <strong>Kubernetes auth</strong> role to obtain a short-lived Vault token (via its ServiceAccount).</p>
</li>
<li><p>Read the exact keys under <code>/secret/nautobot...</code> as defined by your policy.</p>
</li>
<li><p>Materialize a single <strong>Kubernetes Secret</strong> (or multiple, your call) in the Nautobot namespace with the names/keys your Helm chart expects.</p>
</li>
</ul>
<p>With Vault and External Secrets in place, we now have a clean, Git-free path for credentials: a Kubernetes auth role that scopes exactly who can read what, a tidy set of KV paths for Nautobot’s superuser + databases, and ESO ready to materialize those values as Kubernetes Secrets when Argo CD reconciles. That closes the loop on “secure by default” for this deployment. Next up, we’ll use everything we’ve built so far (storage classes, ingress patterns, secrets) to assemble our <strong>Kustomize</strong> resources and configure the Nautobot Helm chart the GitOps way.</p>
<h1 id="heading-the-rest-of-the-kustomize-resources">The Rest of the Kustomize Resources</h1>
<p>Earlier we created the <strong>IngressRoute</strong> Kustomize file to publish Nautobot through Traefik. Now we’ll add the rest of the overlay, mostly focused on integrating the work from the <strong>Secrets</strong> section. We’ll also add a top-level <code>kustomization.yml</code> to bundle these pieces so the cluster can build them as a single unit. Once this overlay is in place, everything we’ve prepared (storage, secrets, and ingress) comes together as one declarative package.</p>
<p>The first file will be the ClusterSecretStore - <code>cluster-secret-store.yml</code></p>
<h2 id="heading-clustersecretstore-esos-shortcut-to-vault">ClusterSecretStore: ESO’s shortcut to Vault</h2>
<p>A <strong>ClusterSecretStore</strong> is a cluster-wide connection profile that tells External Secrets (ESO) how to reach Vault, which KV (<code>secret/</code>) mount to read, and how to authenticate (Kubernetes auth + Vault role). Use a <strong>ClusterSecretStore</strong> to share one Vault setup across namespaces; use a <strong>SecretStore</strong> if you want it namespace-scoped. For a simpler deployment, I chose a <strong>ClusterSecretStore</strong>.</p>
<p><strong>What this sets:</strong></p>
<ul>
<li><p><code>server</code> – Vault URL reachable from the cluster</p>
</li>
<li><p><code>path</code>/<code>version</code> – your KV mount (e.g., <code>secret</code>, v2)</p>
</li>
<li><p><code>auth.kubernetes</code> – use SA token login; <code>role</code> maps SA+namespace → read-only policy</p>
</li>
<li><p><code>serviceAccountRef</code> – which SA ESO uses to authenticate</p>
</li>
</ul>
<p><strong>Repo Example (with comments):</strong></p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">external-secrets.io/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">ClusterSecretStore</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">vault-backend</span>                 <span class="hljs-comment"># cluster-wide handle ESO will reference</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">provider:</span>
    <span class="hljs-attr">vault:</span>
      <span class="hljs-comment"># Where Vault is reachable from the cluster using a cluster internal URL</span>
      <span class="hljs-comment"># (Many setups use http://vault.vault.svc:8200 or https with proper CA)</span>
      <span class="hljs-attr">server:</span> <span class="hljs-string">"http://hashi-vault-prod-0.hashi-vault-prod-internal.hashi-vault-prod.svc.cluster.local:8200"</span>

      <span class="hljs-comment"># The KV mount path and version you enabled in Vault</span>
      <span class="hljs-attr">path:</span> <span class="hljs-string">"secret"</span>                  <span class="hljs-comment"># e.g., 'secret', 'kv', etc.</span>
      <span class="hljs-attr">version:</span> <span class="hljs-string">"v2"</span>                   <span class="hljs-comment"># be explicit to avoid surprises</span>

      <span class="hljs-comment"># Authenticate to Vault using the Kubernetes auth method</span>
      <span class="hljs-attr">auth:</span>
        <span class="hljs-attr">kubernetes:</span>
          <span class="hljs-attr">mountPath:</span> <span class="hljs-string">"kubernetes"</span>     <span class="hljs-comment"># must match your Vault auth mount path</span>
          <span class="hljs-attr">role:</span> <span class="hljs-string">"nautobot"</span>            <span class="hljs-comment"># Vault role bound to SA+namespace with read-only policy</span>
          <span class="hljs-attr">serviceAccountRef:</span>
            <span class="hljs-attr">name:</span> <span class="hljs-string">nautobot-prod</span>       <span class="hljs-comment"># SA whose token ESO will use for login</span>
            <span class="hljs-attr">namespace:</span> <span class="hljs-string">nautobot-prod</span>  <span class="hljs-comment"># namespace where that SA lives</span>
</code></pre>
<p><strong>How it flows:</strong> ESO reads this store → logs into Vault with the SA token → gets a short-lived token for the <code>nautobot</code> role → pulls only the allowed keys → renders Kubernetes Secrets for Helm/Kustomize.</p>
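<p>After Argo CD applies the store, a quick sanity check from the CLI (assuming <code>kubectl</code> access and the ESO CRDs installed) confirms ESO can log into Vault:</p>
<pre><code class="lang-bash"># The store should report a valid/ready status once ESO authenticates successfully
kubectl get clustersecretstore vault-backend
</code></pre>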
<p>The next pair of files are for the database and superuser secret creation.</p>
<h2 id="heading-externalsecrets-mapping-vault-data-into-kubernetes-secrets">ExternalSecrets: mapping Vault data into Kubernetes Secrets</h2>
<p><strong>Why these exist:</strong> An <strong>ExternalSecret</strong> tells ESO <em>which</em> Vault keys to read and <em>how</em> to materialize them as a plain Kubernetes Secret that Helm/Kustomize can mount. We’ll use two: one for <strong>database/redis creds</strong> and one for the <strong>Nautobot superuser</strong>. An ExternalSecret is namespace-scoped.</p>
<h3 id="heading-database-amp-redis-externalsecret-commented">Database &amp; Redis ExternalSecret (commented)</h3>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">external-secrets.io/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">ExternalSecret</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">nautobot-prod-db-external-secret</span>   <span class="hljs-comment"># ESO resource name</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">nautobot-prod</span>                 <span class="hljs-comment"># where the resulting K8s Secret will live</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">refreshInterval:</span> <span class="hljs-string">"1h"</span>                    <span class="hljs-comment"># re-sync cadence from Vault</span>
  <span class="hljs-attr">secretStoreRef:</span>
    <span class="hljs-attr">name:</span> <span class="hljs-string">vault-backend</span>                    <span class="hljs-comment"># points to our (Cluster)SecretStore</span>
    <span class="hljs-attr">kind:</span> <span class="hljs-string">ClusterSecretStore</span>
  <span class="hljs-attr">target:</span>
    <span class="hljs-attr">name:</span> <span class="hljs-string">nautobot-prod-db-secrets</span>         <span class="hljs-comment"># name of the K8s Secret ESO will create/update</span>
    <span class="hljs-attr">creationPolicy:</span> <span class="hljs-string">Owner</span>                  <span class="hljs-comment"># ESO owns and reconciles this Secret</span>
  <span class="hljs-attr">data:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">secretKey:</span> <span class="hljs-string">postgres-password</span>         <span class="hljs-comment"># key inside the K8s Secret</span>
      <span class="hljs-attr">remoteRef:</span>
        <span class="hljs-attr">key:</span> <span class="hljs-string">secret/data/nautobot-prod-db-credentials</span>  <span class="hljs-comment"># Vault path (KV v2 HTTP style)</span>
        <span class="hljs-attr">property:</span> <span class="hljs-string">postgres-pass</span>            <span class="hljs-comment"># field inside that Vault doc</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">secretKey:</span> <span class="hljs-string">password</span>                  <span class="hljs-comment"># duplicate key for charts expecting 'password'</span>
      <span class="hljs-attr">remoteRef:</span>
        <span class="hljs-attr">key:</span> <span class="hljs-string">secret/data/nautobot-prod-db-credentials</span>
        <span class="hljs-attr">property:</span> <span class="hljs-string">postgres-pass</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">secretKey:</span> <span class="hljs-string">redis-password</span>            <span class="hljs-comment"># Redis password (optional if Redis is unauthenticated)</span>
      <span class="hljs-attr">remoteRef:</span>
        <span class="hljs-attr">key:</span> <span class="hljs-string">secret/data/nautobot-prod-db-credentials</span>
        <span class="hljs-attr">property:</span> <span class="hljs-string">redis-pass</span>
</code></pre>
<h3 id="heading-superuser-externalsecret-commented">Superuser ExternalSecret (commented)</h3>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">external-secrets.io/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">ExternalSecret</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">nautobot-prod-superuser-external-secret</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">nautobot-prod</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">refreshInterval:</span> <span class="hljs-string">"1h"</span>
  <span class="hljs-attr">secretStoreRef:</span>
    <span class="hljs-attr">name:</span> <span class="hljs-string">vault-backend</span>
    <span class="hljs-attr">kind:</span> <span class="hljs-string">ClusterSecretStore</span>
  <span class="hljs-attr">target:</span>
    <span class="hljs-attr">name:</span> <span class="hljs-string">nautobot-prod-superuser-secrets</span>  <span class="hljs-comment"># K8s Secret with Nautobot bootstrap creds</span>
    <span class="hljs-attr">creationPolicy:</span> <span class="hljs-string">Owner</span>
  <span class="hljs-attr">data:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">secretKey:</span> <span class="hljs-string">password</span>                  <span class="hljs-comment"># superuser password</span>
      <span class="hljs-attr">remoteRef:</span>
        <span class="hljs-attr">key:</span> <span class="hljs-string">secret/data/nautobot-prod-superuser-credentials</span>
        <span class="hljs-attr">property:</span> <span class="hljs-string">superuser-pass</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">secretKey:</span> <span class="hljs-string">api_token</span>                 <span class="hljs-comment"># superuser API token</span>
      <span class="hljs-attr">remoteRef:</span>
        <span class="hljs-attr">key:</span> <span class="hljs-string">secret/data/nautobot-prod-superuser-credentials</span>
        <span class="hljs-attr">property:</span> <span class="hljs-string">superuser-api-token</span>
</code></pre>
<h3 id="heading-notes">Notes</h3>
<ul>
<li><p><strong>Key naming:</strong> The <code>secretKey</code> entries become keys in your Kubernetes Secret. Align them with whatever your Helm values or manifests expect.</p>
</li>
<li><p><strong>KV v2 pathing:</strong> Some setups prefer the logical path (e.g., <code>nautobot-prod-db-credentials</code>) rather than the HTTP-style <code>secret/data/...</code>. Use the style that matches how your <code>ClusterSecretStore</code> is configured.</p>
</li>
<li><p><strong>Duped mappings:</strong> Having both <code>postgres-password</code> <strong>and</strong> <code>password</code> mapped to the same Vault value is fine if different consumers expect different key names.</p>
</li>
<li><p><strong>Refresh:</strong> <code>refreshInterval</code> controls how quickly rotations in Vault propagate to Kubernetes. Pick something that fits your rotation policy.</p>
</li>
</ul>
<h2 id="heading-kustomize-bundling-our-resources-together">Kustomize: Bundling our Resources Together</h2>
<p>Time to bundle everything we’ve created into a single overlay Kustomize can build (and Argo CD can track). Keep this file in your environment overlay (e.g., <code>overlays/prod/</code>).</p>
<p><strong>What this overlay does:</strong></p>
<ul>
<li><p>Registers Vault access for ESO via the <strong>ClusterSecretStore</strong></p>
</li>
<li><p>Pulls database + superuser creds via <strong>ExternalSecret</strong> objects</p>
</li>
<li><p>Publishes Nautobot through Traefik with our <strong>IngressRoute</strong></p>
</li>
</ul>
<pre><code class="lang-yaml"><span class="hljs-comment">## kustomization.yml</span>

<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">kustomize.config.k8s.io/v1beta1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Kustomization</span>

<span class="hljs-comment"># The building blocks we created earlier</span>
<span class="hljs-attr">resources:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-string">cluster-secret-store.yml</span>        <span class="hljs-comment"># ESO → Vault connection (cluster-scoped; namespace here is ignored)</span>
  <span class="hljs-bullet">-</span> <span class="hljs-string">external-secrets-db.yml</span>         <span class="hljs-comment"># Database &amp; Redis credentials from Vault → K8s Secret</span>
  <span class="hljs-bullet">-</span> <span class="hljs-string">external-secret-superuser.yml</span>   <span class="hljs-comment"># Nautobot superuser creds from Vault → K8s Secret</span>
  <span class="hljs-bullet">-</span> <span class="hljs-string">ingress-route.yml</span>               <span class="hljs-comment"># Traefik exposure for Nautobot</span>
</code></pre>
<h3 id="heading-notes-1">Notes</h3>
<ul>
<li><p><strong>Order of operations:</strong> Kustomize doesn’t enforce ordering, but Argo CD will reconcile until everything is healthy. If you want strict sequencing later, you can add Argo CD sync waves via annotations.</p>
</li>
<li><p><strong>Where this fits:</strong> Your Argo CD Application will point at this folder (done in the next section). Once synced, ESO will authenticate to Vault, create the Kubernetes Secrets, and Traefik will expose the app host defined in your IngressRoute.</p>
</li>
</ul>
<p>Commit this file alongside the four resources, and you’ve got a clean, declarative package ready for Argo CD to manage.</p>
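<p>Before committing, it’s worth rendering the overlay locally to catch typos. This sketch assumes the overlay lives at <code>apps/nautobot/overlays/prod</code>, as referenced later in the Argo CD Application:</p>
<pre><code class="lang-bash"># Render the overlay to stdout without applying anything to the cluster
kubectl kustomize apps/nautobot/overlays/prod
</code></pre>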
<h1 id="heading-the-final-pieces-custom-helm-values-argocd-app-manifest">The Final Pieces - Custom Helm Values + ArgoCD App Manifest</h1>
<h2 id="heading-custom-helm-values-values-prodyml">Custom Helm values (<code>values-prod.yml</code>)</h2>
<p>This file wires Nautobot to the secrets that will be deployed, dials replicas down for a tidy first deploy, and pins persistence to the CephFS StorageClass(es). Drop it next to your overlay (e.g., <code>apps/nautobot/values-prod.yml</code>) and reference it from your Argo CD Application (next section).</p>
<pre><code class="lang-yaml"><span class="hljs-comment"># values-prod.yml</span>
<span class="hljs-attr">nautobot:</span>
  <span class="hljs-comment"># Keep it small for the first sync; scale later.</span>
  <span class="hljs-attr">replicaCount:</span> <span class="hljs-number">1</span>

  <span class="hljs-comment"># Probes off for initial bring-up (migrations can make probes flap).</span>
  <span class="hljs-comment"># Once stable, consider enabling these.</span>
  <span class="hljs-attr">livenessProbe:</span>
    <span class="hljs-attr">enabled:</span> <span class="hljs-literal">false</span>
  <span class="hljs-attr">readinessProbe:</span>
    <span class="hljs-attr">enabled:</span> <span class="hljs-literal">false</span>

  <span class="hljs-comment"># Bootstrap superuser from our ExternalSecret-backed K8s Secret.</span>
  <span class="hljs-attr">superUser:</span>
    <span class="hljs-attr">existingSecret:</span> <span class="hljs-string">"nautobot-prod-superuser-secrets"</span>   <span class="hljs-comment"># created by ESO</span>
    <span class="hljs-attr">existingSecretPasswordKey:</span> <span class="hljs-string">"password"</span>               <span class="hljs-comment"># key in that Secret</span>
    <span class="hljs-attr">existingSecretApiTokenKey:</span> <span class="hljs-string">"api_token"</span>              <span class="hljs-comment"># key in that Secret</span>
    <span class="hljs-attr">username:</span> <span class="hljs-string">"jeff"</span>                                    <span class="hljs-comment"># static bootstrap username</span>

<span class="hljs-attr">celery:</span>
  <span class="hljs-comment"># One worker to start; bump if you run jobs/heavy plugins.</span>
  <span class="hljs-attr">replicaCount:</span> <span class="hljs-number">1</span>

<span class="hljs-attr">serviceAccount:</span>
  <span class="hljs-comment"># Leave token mounted, used for ESO/ClusterSecretStore</span>
  <span class="hljs-attr">automountServiceAccountToken:</span> <span class="hljs-literal">true</span>

<span class="hljs-attr">postgresql:</span>
  <span class="hljs-comment"># Using the chart’s built-in PostgreSQL with CephFS persistence.</span>
  <span class="hljs-attr">primary:</span>
    <span class="hljs-attr">persistence:</span>
      <span class="hljs-attr">enabled:</span> <span class="hljs-literal">true</span>
      <span class="hljs-attr">size:</span> <span class="hljs-string">"2Gi"</span>                          <span class="hljs-comment"># starter size; expand later</span>
      <span class="hljs-attr">storageClass:</span> <span class="hljs-string">"rook-cephfs-retain"</span>   <span class="hljs-comment"># keep data if PVC is deleted</span>
      <span class="hljs-attr">accessModes:</span> [<span class="hljs-string">'ReadWriteOnce'</span>]       <span class="hljs-comment"># DB should be single-writer</span>
  <span class="hljs-attr">auth:</span>
    <span class="hljs-comment"># Pull the password from the ExternalSecret-created Secret.</span>
    <span class="hljs-attr">existingSecret:</span> <span class="hljs-string">nautobot-prod-db-secrets</span>

<span class="hljs-attr">redis:</span>
  <span class="hljs-comment"># Enable persistence if you want cache/queue data to survive restarts.</span>
  <span class="hljs-attr">master:</span>
    <span class="hljs-attr">persistence:</span>
      <span class="hljs-attr">enabled:</span> <span class="hljs-literal">true</span>
      <span class="hljs-attr">size:</span> <span class="hljs-string">"1Gi"</span>
      <span class="hljs-attr">storageClass:</span> <span class="hljs-string">"rook-cephfs-delete"</span>   <span class="hljs-comment"># okay to delete for cache data</span>
      <span class="hljs-attr">accessModes:</span> [<span class="hljs-string">'ReadWriteOnce'</span>]
  <span class="hljs-attr">auth:</span>
    <span class="hljs-attr">enabled:</span> <span class="hljs-literal">true</span>
    <span class="hljs-attr">existingSecret:</span> <span class="hljs-string">nautobot-prod-db-secrets</span>
</code></pre>
<h3 id="heading-why-these-choices">Why these choices</h3>
<ul>
<li><p><strong>Probes disabled (initially):</strong> first runs often include migrations; turning probes off avoids noisy restarts. Re-enable once everything is healthy.</p>
</li>
<li><p><strong>CephFS everywhere:</strong> aligns with the storage classes you built earlier.</p>
<ul>
<li><p><code>rook-cephfs-retain</code> for Postgres so accidental PVC deletes don’t nuke data.</p>
</li>
<li><p><code>rook-cephfs-delete</code> for Redis because it’s cache/queue data.</p>
</li>
</ul>
</li>
<li><p><code>ReadWriteOnce</code> for DB/Redis: even though CephFS supports RWX, keeping databases single-writer reduces foot-guns (performance issues, data corruption, or scalability bottlenecks).</p>
</li>
<li><p><strong>Secrets via ESO:</strong> <code>existingSecret</code> keys point at the Kubernetes Secrets materialized from Vault, so nothing sensitive lives in Git or in the helm values.</p>
</li>
</ul>
<h2 id="heading-rounding-out-the-argo-cd-application">Rounding Out the Argo CD Application</h2>
<p>Now that Helm (the app) and Kustomize (secrets + ingress) are defined and your custom Helm values exist, we just need to finish the Argo CD Application so it points at both sources and deploys them to the right place (below).</p>
<pre><code class="lang-yaml"><span class="hljs-attr">project:</span> <span class="hljs-string">prod-home</span>
<span class="hljs-attr">destination:</span>
  <span class="hljs-attr">server:</span> <span class="hljs-string">https://prod-kube-vip.jjland.local:6443</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">nautobot-prod</span>
<span class="hljs-attr">syncPolicy:</span>
  <span class="hljs-attr">syncOptions:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">CreateNamespace=true</span>
<span class="hljs-attr">sources:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">repoURL:</span> <span class="hljs-string">https://nautobot.github.io/helm-charts/</span>
    <span class="hljs-attr">targetRevision:</span> <span class="hljs-number">2.5</span><span class="hljs-number">.5</span>
    <span class="hljs-attr">helm:</span>
      <span class="hljs-attr">valueFiles:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-string">$values/apps/nautobot/values-prod.yml</span>
    <span class="hljs-attr">chart:</span> <span class="hljs-string">nautobot</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">repoURL:</span> <span class="hljs-string">https://github.com/leothelyon17/kubernetes-gitops-playground.git</span>
    <span class="hljs-attr">path:</span> <span class="hljs-string">apps/nautobot/overlays/prod</span>
    <span class="hljs-attr">targetRevision:</span> <span class="hljs-string">HEAD</span>
    <span class="hljs-attr">ref:</span> <span class="hljs-string">values</span>
</code></pre>
<p>Copy and paste the above into the Argo CD GUI, or edit the Application manually, the same way we configured similar app manifests in previous posts.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1757979999407/3ceacc80-ade3-43bb-8317-beb39b1aa09a.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1757980030045/270127da-a3e2-4c1f-8b85-524ce9aa5b40.png" alt class="image--center mx-auto" /></p>
<h1 id="heading-deploying-and-syncing-the-app">Deploying and Syncing the App</h1>
<p>With everything bundled via <strong>Kustomize</strong> and correctly referenced by ArgoCD, it’s time to deploy.</p>
<p>Open the Argo CD Application and click <strong>Sync</strong>. You should see the Helm release create a batch of Kubernetes objects. To focus on what we built in <em>this</em> post, look for:</p>
<ul>
<li><p><strong>PVCs bound</strong> to your CephFS StorageClasses and mounted by the pods</p>
</li>
<li><p><strong>PostgreSQL and Redis</strong> pods coming up <strong>Healthy</strong></p>
</li>
<li><p><strong>Secrets flow</strong>: <code>ClusterSecretStore</code> and <code>ExternalSecret</code> resources showing <strong>Synced</strong>, and the resulting Kubernetes Secrets present in the namespace</p>
</li>
<li><p><strong>IngressRoute</strong> created and admitted by Traefik (host matches your DNS A record)</p>
</li>
</ul>
<p>If all of the above is green, the Argo CD app should land in <strong>Synced / Healthy</strong>. Screenshots below show an example of what you should see.</p>
<h3 id="heading-storage"><strong>Storage</strong></h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1757948088008/8175b608-36f6-4c19-972f-b4ce9a9026fb.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1757948164574/2f5d5d8b-5834-451d-a51f-785437169fa2.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-secrets"><strong>Secrets</strong></h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1757948312940/7cf275ea-cc3c-473c-910f-ab58a29e9756.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1757948386277/c99b8106-b287-41d7-a286-ae97ccfba0d0.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-ingressroutetraefik"><strong>IngressRoute/Traefik</strong></h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1757948427916/e46f68ab-65f8-44d1-b25d-0049a7f2027d.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1757948476931/c0ea51c6-8bab-4c44-b3fb-33a02f887bcc.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1757948588646/ab6ec5fb-8e03-4daa-b8d8-7e28f47ef020.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-the-application-pods">The Application Pods</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1757980315587/097ea482-297b-4416-8f08-955e09946014.png" alt class="image--center mx-auto" /></p>
<p><strong>Note:</strong> It can take a little while for the app to show <strong>Healthy</strong> and become reachable. On the first deploy, once Nautobot connects to Postgres it will run initial database migrations to create tables—this adds extra time on top of the normal startup. If you’re curious, watch the <code>nautobot-init</code> logs for migration progress.</p>
<p>If everything’s green in Argo CD and the pods look steady, open the host defined in your <strong>IngressRoute</strong>. You should land on the Nautobot login page. Sign in with the <strong>superuser</strong> credentials you stored in Vault (surfaced via External Secrets and referenced in your custom Helm values). If login fails, check the logs for the <code>nautobot-init</code> container. On first start it runs migrations and bootstraps the superuser. You’ll see log messages confirming the account creation (not the raw secrets), which is a quick way to verify the secret wiring end to end.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1757948964815/b37fd16c-babe-4027-90ad-0f8bfe28007f.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1757949163152/fd20e079-45b0-4099-8b6c-4da3e5aa3daf.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1757949210758/9fad986e-c571-4ae9-97a8-ce315968a690.png" alt class="image--center mx-auto" /></p>
<p>If you can log in, <strong>CONGRATULATIONS!</strong> You’ve just deployed Nautobot on Kubernetes, fully managed the GitOps way.</p>
<h1 id="heading-troubleshooting-tips">Troubleshooting Tips</h1>
<p>If your deployment isn’t landing cleanly, work through these quick checks, organized by the same pieces we built in this post.</p>
<hr />
<h2 id="heading-1-argo-cd-amp-kustomize">1) Argo CD &amp; Kustomize</h2>
<p><strong>What to look for</strong></p>
<ul>
<li><p>App stuck in <strong>OutOfSync</strong> or <strong>Progressing</strong>.</p>
</li>
<li><p>Sync immediately fails.</p>
</li>
<li><p>Resources missing from the tree.</p>
</li>
</ul>
<p><strong>Checks</strong></p>
<ul>
<li><p>Open the Argo CD git diff for the app: look for bad paths/filenames in <code>kustomization.yml</code>.</p>
</li>
<li><p>Verify the repo folder the Application points to contains:</p>
<ul>
<li><p><code>cluster-secret-store.yml</code></p>
</li>
<li><p><code>external-secrets-db.yml</code></p>
</li>
<li><p><code>external-secret-superuser.yml</code></p>
</li>
<li><p><code>ingress-route.yml</code></p>
</li>
<li><p><code>values-prod.yml</code> (referenced by your Helm app)</p>
</li>
</ul>
</li>
<li><p>Confirm file paths in the ArgoCD App manifest</p>
</li>
<li><p>Double-check all YAML syntax</p>
</li>
</ul>
<hr />
<h2 id="heading-2-secrets-pipeline-vault-eso-k8s-secret">2) Secrets pipeline: Vault → ESO → K8s Secret</h2>
<p><strong>Symptoms</strong></p>
<ul>
<li>ExternalSecrets show <strong>Not Synced</strong>, or Nautobot init fails with missing env vars/credentials.</li>
</ul>
<p><strong>Checks</strong></p>
<ul>
<li><p><code>ClusterSecretStore</code>:</p>
<ul>
<li><p>Server URL reachable inside the cluster?</p>
</li>
<li><p><code>auth.kubernetes.mountPath</code> matches your Vault auth mount?</p>
</li>
<li><p><code>role</code> name matches the role you created in Vault?</p>
</li>
</ul>
</li>
<li><p><code>ExternalSecret</code>:</p>
<ul>
<li><p>Conditions should be <strong>Ready=True</strong>; if not, <code>describe</code> it for a clear error (auth denied, key not found, etc.).</p>
</li>
<li><p>Verify Vault paths/field names match exactly (KV v2 pathing trips people up).</p>
</li>
</ul>
</li>
<li><p>ServiceAccount binding:</p>
<ul>
<li>The SA referenced in the store exists in the right namespace, and your Vault role binds to <strong>that</strong> SA+namespace.</li>
</ul>
</li>
</ul>
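<p>As a reference, a Vault-backed <code>ClusterSecretStore</code> that covers the checks above looks roughly like this. The server URL, mount path, role, and ServiceAccount names below are placeholders from a typical setup, not values from this deployment:</p>
<pre><code class="lang-yaml">apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: vault-backend
spec:
  provider:
    vault:
      server: "http://vault.vault.svc:8200"  # must be reachable from inside the cluster
      path: "secret"                         # KV mount name in Vault
      version: "v2"                          # KV v2 changes the read path, a common gotcha
      auth:
        kubernetes:
          mountPath: "kubernetes"            # must match your Vault auth mount
          role: "external-secrets"           # must match the role created in Vault
          serviceAccountRef:
            name: "external-secrets-sa"
            namespace: "nautobot-prod"       # Vault role must bind to this SA+namespace
</code></pre>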
<hr />
<h2 id="heading-3-storage-cephfs-storageclass-amp-pvcs">3) Storage: CephFS StorageClass &amp; PVCs</h2>
<p><strong>Symptoms</strong></p>
<ul>
<li>PVCs stuck in <strong>Pending</strong>; pods can’t mount volumes.</li>
</ul>
<p><strong>Checks</strong></p>
<ul>
<li><p><code>StorageClass</code> name in Helm values matches your CephFS SC (e.g., <code>rook-cephfs-retain</code>).</p>
</li>
<li><p>Access modes fit usage:</p>
<ul>
<li><p><strong>Postgres/Redis</strong>: <code>ReadWriteOnce</code> (single writer).</p>
</li>
<li><p><strong>Nautobot media/static</strong> (if used): <code>ReadWriteMany</code>.</p>
</li>
</ul>
</li>
<li><p>Rook-Ceph health:</p>
<ul>
<li>OSDs/MONs healthy, pool/FS exists, quota not exceeded.</li>
</ul>
</li>
<li><p>If PVC deleted but PV persists:</p>
<ul>
<li>That’s expected with <code>reclaimPolicy: Retain</code>; either reuse or manually clean it up before recreating.</li>
</ul>
</li>
</ul>
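<p>For context, the storage-related Helm values typically follow the Bitnami subchart convention sketched below. Key paths vary by chart version, so treat these names as illustrative rather than exact:</p>
<pre><code class="lang-yaml">postgresql:
  primary:
    persistence:
      storageClass: rook-cephfs-retain  # must name an existing StorageClass
      accessModes:
        - ReadWriteOnce                 # single writer for the database
redis:
  master:
    persistence:
      storageClass: rook-cephfs-retain
</code></pre>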
<hr />
<h2 id="heading-4-postgres-amp-redis-built-in-charts">4) Postgres &amp; Redis (built-in charts)</h2>
<p><strong>Symptoms</strong></p>
<ul>
<li>DB pod CrashLoopBackOff; app can’t connect.</li>
</ul>
<p><strong>Checks</strong></p>
<ul>
<li><p>Secrets:</p>
<ul>
<li>The <code>existingSecret</code> names line up with what the subcharts expect, and key names (<code>password</code>, <code>postgres-password</code>, <code>redis-password</code>) match your ExternalSecret outputs.</li>
</ul>
</li>
<li><p>Persistence:</p>
<ul>
<li>Correct StorageClass; PVC bound.</li>
</ul>
</li>
<li><p>Logs:</p>
<ul>
<li><p>Postgres: authentication/permissions, initdb errors.</p>
</li>
<li><p>Redis: refuses connections or auth errors if <code>auth.enabled=true</code>.</p>
</li>
</ul>
</li>
</ul>
<hr />
<h2 id="heading-5-nautobot-app-webworkerbeat">5) Nautobot app (web/worker/beat)</h2>
<p><strong>Symptoms</strong></p>
<ul>
<li>Web never becomes Ready, 502 via Traefik, or superuser not created.</li>
</ul>
<p><strong>Checks</strong></p>
<ul>
<li><p><code>nautobot-init</code> logs:</p>
<ul>
<li>Confirms migrations and superuser bootstrap; errors here usually mean secret keys missing/wrong.</li>
</ul>
</li>
<li><p>Probes:</p>
<ul>
<li>We disabled probes initially—good. If you enabled them early, they can flap during migrations; disable, sync, let it settle, then re-enable.</li>
</ul>
</li>
<li><p>Environment wiring:</p>
<ul>
<li>Confirm the Helm values reference the K8s Secret keys you created (names and casing must match).</li>
</ul>
</li>
</ul>
<hr />
<h2 id="heading-6-ingress-traefik-amp-dns">6) Ingress, Traefik &amp; DNS</h2>
<p><strong>Symptoms</strong></p>
<ul>
<li>404/503 at the browser, TLS errors, or wrong host.</li>
</ul>
<p><strong>Checks</strong></p>
<ul>
<li><p><strong>IngressRoute</strong>:</p>
<ul>
<li><p><code>routes.match</code> host matches your DNS A record exactly.</p>
</li>
<li><p><code>entryPoints: ["websecure"]</code> and Traefik has that entrypoint enabled.</p>
</li>
</ul>
</li>
<li><p>Traefik Service:</p>
<ul>
<li>Has an <strong>EXTERNAL-IP</strong> from MetalLB; DNS A record points to it.</li>
</ul>
</li>
<li><p>If using certs later:</p>
<ul>
<li>Don’t reference cert-manager resources yet if you haven’t set them up; keep TLS simple at Traefik.</li>
</ul>
</li>
</ul>
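<p>For comparison, a minimal Traefik <code>IngressRoute</code> for this app looks roughly like the following. The hostname, service name, and port are placeholders, and the <code>apiVersion</code> depends on your Traefik version:</p>
<pre><code class="lang-yaml">apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: nautobot
  namespace: nautobot-prod
spec:
  entryPoints:
    - websecure                            # must be an enabled Traefik entrypoint
  routes:
    - match: Host(`nautobot.example.com`)  # must match your DNS A record exactly
      kind: Rule
      services:
        - name: nautobot
          port: 80
</code></pre>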
<hr />
<h2 id="heading-7-metallb-external-reachability">7) MetalLB (external reachability)</h2>
<p><strong>Symptoms</strong></p>
<ul>
<li>Traefik never gets an external IP; no traffic into the cluster.</li>
</ul>
<p><strong>Checks</strong></p>
<ul>
<li><p><code>IPAddressPool</code> contains the IP/range; it’s unused on your LAN.</p>
</li>
<li><p><code>L2Advertisement</code> exists for that pool.</p>
</li>
<li><p>Traefik Service <code>type: LoadBalancer</code> and (optionally) <code>loadBalancerIP</code> matches your chosen IP.</p>
</li>
</ul>
<hr />
<h2 id="heading-8-resources-amp-scheduling">8) Resources &amp; scheduling</h2>
<p><strong>Symptoms</strong></p>
<ul>
<li>Pods Pending or OOMKilled.</li>
</ul>
<p><strong>Checks</strong></p>
<ul>
<li><p>Nodes have capacity; Ceph/DB pods especially need memory/CPU.</p>
</li>
<li><p>Start small (single replicas) then scale up.</p>
</li>
<li><p>If OOMs, raise limits/requests or add memory.</p>
</li>
</ul>
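<p>If you need to adjust limits, it’s the standard Kubernetes <code>resources</code> block in your Helm values. The numbers below are illustrative starting points, not recommendations:</p>
<pre><code class="lang-yaml">resources:
  requests:
    cpu: 500m
    memory: 1Gi
  limits:
    memory: 2Gi  # raise this first if pods are getting OOMKilled
</code></pre>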
<hr />
<h2 id="heading-9-naming-amp-key-mismatches-sneaky-but-common">9) Naming &amp; key mismatches (sneaky but common)</h2>
<p><strong>What to verify</strong></p>
<ul>
<li><p>Secret <strong>names</strong> and <strong>keys</strong> in:</p>
<ul>
<li><p><code>ExternalSecret</code> → <strong>target</strong> Secret</p>
</li>
<li><p>Helm values (e.g., <code>existingSecret</code>, <code>existingSecretPasswordKey</code>, etc.)</p>
</li>
</ul>
</li>
<li><p>Namespace consistency across all manifests (<code>nautobot-prod</code> vs something else).</p>
</li>
</ul>
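<p>To make the name/key chain concrete, here’s a hedged <code>ExternalSecret</code> sketch. Every name in it (store, target Secret, Vault path, keys) is a placeholder; the point is that <code>target.name</code> and each <code>secretKey</code> must match exactly what your Helm values reference:</p>
<pre><code class="lang-yaml">apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: nautobot-superuser
  namespace: nautobot-prod           # must match the app namespace
spec:
  secretStoreRef:
    kind: ClusterSecretStore
    name: vault-backend
  target:
    name: nautobot-superuser         # the K8s Secret your Helm values reference
  data:
    - secretKey: superuser-password  # key inside the K8s Secret; casing matters
      remoteRef:
        key: nautobot/superuser      # Vault path (KV v2: no 'data/' segment here)
        property: password           # field name within the Vault secret
</code></pre>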
<hr />
<h2 id="heading-10-quick-sanity-commands-lightweight">10) Quick sanity commands (lightweight)</h2>
<ul>
<li><p><strong>Objects at a glance:</strong> <code>kubectl -n nautobot-prod get all</code></p>
</li>
<li><p><strong>ESO health:</strong> <code>kubectl -n nautobot-prod get externalsecret,secretstore,clustersecretstore</code></p>
</li>
<li><p><strong>PVCs:</strong> <code>kubectl -n nautobot-prod get pvc</code></p>
</li>
<li><p><strong>Describe failures:</strong> <code>kubectl -n nautobot-prod describe &lt;kind&gt; &lt;name&gt;</code></p>
</li>
<li><p><strong>App logs:</strong> <code>kubectl -n nautobot-prod logs deploy/nautobot -c nautobot-init --tail=100</code></p>
</li>
</ul>
<p>Ultimately, if you don’t know where to start, <strong>USE THE CONTAINER LOGS</strong>. ArgoCD makes viewing them easy, and you can usually find the issue in the logs themselves.</p>
<h1 id="heading-summary">Summary</h1>
<p>Well, we did it. We didn’t just get Nautobot running; we established a repeatable pattern for network-automation apps, or any containerized app: ArgoCD for reconciliation, Kustomize for environment shaping, Vault + External Secrets for credentials, Traefik + MetalLB for reachability, and CephFS for persistence. That stack gives you a stable runway to ship changes the same way every time, through Git, without snowflakes or manual tweaks. The same method works on-prem, in the cloud, or across a mix of the two.</p>
<h3 id="heading-why-this-helps-your-automation-journey">Why this helps your automation journey</h3>
<ul>
<li><p><strong>Trustable intent:</strong> Nautobot becomes the system of record for sites, devices, IPAM, and custom models exposed via REST/GraphQL for pipelines and tools.</p>
</li>
<li><p><strong>Safe, auditable change:</strong> Every tweak (charts, values, secrets wiring, ingress) goes through Git reviews and rolls back cleanly. Drift is visible; fixes are deterministic.</p>
</li>
<li><p><strong>Fewer blockers:</strong> Secrets are handled with least-privilege, storage/ingress are standardized, so you can focus on workflows, not plumbing.</p>
</li>
<li><p><strong>From dev to prod:</strong> The same pattern scales to new apps (observability, chatops, CI/CD helpers) with minimal friction. Copy the overlay, adjust values, and commit.</p>
</li>
</ul>
<h3 id="heading-where-im-going-next">Where I’m going next</h3>
<ul>
<li><p>An <strong>advanced Nautobot</strong> deployment (plugins, app config, HTTPS/certs, SSO).</p>
</li>
<li><p><strong>Integrations</strong> with other GitOps-deployed apps.</p>
</li>
<li><p>A <strong>NetBox</strong> deployment for folks who prefer that app (I love it too!).</p>
</li>
</ul>
<p>This is the moment where GitOps stops being theory and starts accelerating real network automation and manageable application delivery.</p>
<p>Thanks for reading!</p>
<hr />
<h2 id="heading-links">Links</h2>
<ul>
<li><p><a target="_blank" href="https://blog.nerdylyonsden.io/bridging-the-gap-gitops-for-network-engineers-part-1">Bridging the Gap: GitOps for Network Engineers - Part 1</a> (Deploying ArgoCD)</p>
</li>
<li><p><a target="_blank" href="https://blog.nerdylyonsden.io/bridging-the-gap-gitops-for-network-engineers-part-2">Bridging the Gap: GitOps for Network Engineers - Part 2</a> (Deploying Critical Infrastructure with ArgoCD)</p>
</li>
<li><p><a target="_blank" href="https://github.com/leothelyon17/kubernetes-gitops-playground/tree/main">Github Repo</a></p>
</li>
<li><p><a target="_blank" href="https://networktocode.com/nautobot/">Nautobot - Official Site</a></p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Bridging the Gap: GitOps for Network Engineers -  Part 2]]></title><description><![CDATA[ArgoCD Is Amazing—But Let’s Make It Do Something!
Intro
In Part 1, we laid the foundation by installing ArgoCD and setting up the basic structure for a GitOps-driven platform. If you've followed along, you should now have a working Kubernetes cluster...]]></description><link>https://blog.nerdylyonsden.io/bridging-the-gap-gitops-for-network-engineers-part-2</link><guid isPermaLink="true">https://blog.nerdylyonsden.io/bridging-the-gap-gitops-for-network-engineers-part-2</guid><category><![CDATA[gitops]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[ArgoCD]]></category><category><![CDATA[Network Automation]]></category><category><![CDATA[metallb]]></category><category><![CDATA[Traefik]]></category><category><![CDATA[ceph]]></category><category><![CDATA[hashicorp-vault]]></category><category><![CDATA[external-secrets]]></category><dc:creator><![CDATA[Jeffrey Lyon]]></dc:creator><pubDate>Mon, 05 May 2025 16:23:13 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1744599246210/770fda69-3c15-4b77-810c-89d5bc72797a.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>ArgoCD Is Amazing—But Let’s Make It Do Something!</p>
<h1 id="heading-intro">Intro</h1>
<p>In <strong>Part 1</strong>, we laid the foundation by installing ArgoCD and setting up the basic structure for a GitOps-driven platform. If you've followed along, you should now have a working Kubernetes cluster, ArgoCD deployed and accessible, and your first project created in the UI.</p>
<p>Now it's time to turn that foundation into something usable.</p>
<p>In <strong>Part 2</strong>, we'll start deploying the critical infrastructure pieces that power everything else. That includes <strong>MetalLB</strong> for external load balancing, <strong>Traefik</strong> for ingress, persistent storage using <strong>Rook + Ceph</strong>, and secrets management with <strong>External Secrets and HashiCorp Vault</strong>. All of these will be deployed through ArgoCD, GitOps-style.</p>
<p>We’ll kick things off with <strong>MetalLB</strong>, which enables us to expose services outside the cluster, an essential first step in making your platform actually accessible. Let’s get into it.</p>
<h1 id="heading-metallb-load-balancing-for-bare-metal-and-home-labs"><strong>MetalLB: Load Balancing for Bare Metal and Home Labs</strong></h1>
<p>If you're running Kubernetes in a cloud environment, you typically get a load balancer as part of the package, something like an AWS ELB or an Azure Load Balancer that magically routes traffic to your services. But when you're running on bare metal, in a lab, or on-prem (which, let’s be real, a lot of network engineers are), you're on your own. That's where <strong>MetalLB</strong> comes in.</p>
<h2 id="heading-what-is-metallb"><strong>What is MetalLB?</strong></h2>
<p><strong>MetalLB</strong> is a load balancer implementation for Kubernetes clusters that <strong>don’t</strong> have access to cloud-native load balancer resources. It allows you to assign external IP addresses to your Kubernetes services so that they can be accessed from outside the cluster, exactly what you'd expect from a "real" load balancer, just built for the DIY crowd.</p>
<h2 id="heading-why-you-need-it"><strong>Why You Need It</strong></h2>
<p>In any Kubernetes-based GitOps platform, exposing services to the outside world is non-negotiable. Whether it’s ArgoCD, Traefik, Vault, or any of your network automation tools, they all need to be reachable by users, APIs, or other systems. While NodePorts can get the job done in a lab, they’re clunky, inconsistent, and definitely not production-grade.</p>
<p>MetalLB solves this by handling <strong>Service type: LoadBalancer</strong> in environments where a cloud load balancer doesn’t exist, like bare metal or your home lab. You define a pool of IP addresses from your local network, and MetalLB assigns those IPs to services that request them.</p>
<p>Here’s where the networking magic comes in: MetalLB (when running in Layer 2 mode) announces those external IPs using ARP. If a device outside of the cluster ARPs for an exposed service IP, MetalLB replies with the MAC address of the node running the service. It’s simple, reliable, and doesn’t require BGP or complex router configs.</p>
<p>So when a LoadBalancer service is created, for example, to expose ArgoCD or Traefik, MetalLB makes that service’s external IP reachable from anywhere on your local network, just like a real load balancer would in a cloud environment.</p>
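<p>In practice, that means any Service of <code>type: LoadBalancer</code> gets an IP from MetalLB’s pool. A minimal sketch, with names and ports as examples only:</p>
<pre><code class="lang-yaml">apiVersion: v1
kind: Service
metadata:
  name: traefik
  namespace: traefik
spec:
  type: LoadBalancer  # MetalLB watches for this and assigns a pool IP
  selector:
    app: traefik
  ports:
    - port: 443
      targetPort: 8443
</code></pre>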
<h2 id="heading-how-it-powers-the-platform"><strong>How It Powers the Platform</strong></h2>
<p>MetalLB becomes one of the core enablers of our GitOps stack. It allows you to:</p>
<ul>
<li><p>Expose ArgoCD with a proper external IP</p>
</li>
<li><p>Route external traffic to Traefik, our ingress controller</p>
</li>
<li><p>Provide consistent access to internal services that need to be reachable from your network</p>
</li>
<li><p>Maintain a production-like networking experience, even in a lab or homelab environment</p>
</li>
</ul>
<p>Without MetalLB, you’d either be stuck manually forwarding ports, messing with IP tables, or leaning on NodePorts. With it, your platform starts acting like it belongs in a real, routable network, and that’s exactly what we want.</p>
<p>Now that we understand what MetalLB does and how it fits into the big picture, let’s deploy it the GitOps way, starting with adding the Helm chart repository to our config</p>
<h3 id="heading-quick-review-helm-charts-and-how-they-fit-into-argocd"><strong>Quick Review: Helm Charts and How They Fit into ArgoCD</strong></h3>
<p>Before we deploy MetalLB, let’s quickly go over how <strong>Helm</strong> works, especially how it integrates with <strong>ArgoCD</strong>.</p>
<p>Helm is a package manager for Kubernetes. Instead of manually writing and applying a bunch of YAML files, Helm lets you deploy versioned, configurable "charts", pre-packaged bundles of Kubernetes manifests that define an application. These charts live in remote <strong>Helm repositories</strong>, similar to how <code>apt</code> or <code>yum</code> fetch packages on a Linux system.</p>
<p>In a GitOps workflow, Helm charts are referenced as part of an ArgoCD <strong>Application</strong> manifest, specifically as a <code>source</code>. ArgoCD uses this source definition to pull the chart directly from the repo, apply any custom <code>values.yaml</code> overrides you’ve stored in Git, and deploy everything into your cluster automatically.</p>
<h3 id="heading-using-the-metallb-helm-chart-with-argocd"><strong>Using the MetalLB Helm Chart with ArgoCD</strong></h3>
<p>The official MetalLB Helm chart is hosted at:</p>
<pre><code class="lang-bash">https://metallb.github.io/metallb
</code></pre>
<p>When creating your ArgoCD Application, one of your <code>sources</code> will look like this:</p>
<ul>
<li><p><strong>Type</strong>: <code>Helm</code></p>
</li>
<li><p><strong>Chart</strong>: <code>metallb</code></p>
</li>
<li><p><strong>Repo URL</strong>: <a target="_blank" href="https://metallb.github.io/metallb"><code>https://metallb.github.io/metallb</code></a></p>
</li>
<li><p><strong>Target Revision</strong>: Usually the latest</p>
</li>
</ul>
<p>ArgoCD will then treat this Helm chart as part of the desired state. It will sync the chart, merge in your values (if you’re overriding anything), and deploy MetalLB as part of your platform, all driven from Git.</p>
<h2 id="heading-metallb-installation">MetalLB Installation</h2>
<p>These initial steps, adding the Helm repo or other base sources, creating the app in ArgoCD, and wiring up the basic Helm configuration, are <strong>mostly the same for every application</strong> we’ll deploy. Because of that, I’ll only walk through this process in detail once (here), and only call out major differences for other apps later in the post. Screenshots are included below where it helps, but once you’ve done it once, you’ll be able to rinse and repeat for everything else.</p>
<h3 id="heading-step-1-add-the-helm-repo"><strong>Step 1: Add the Helm Repo</strong></h3>
<p>ArgoCD needs to know where to fetch the Helm chart from. For MetalLB, we’ll be using the GitHub-hosted chart:</p>
<ul>
<li><strong>Helm Repo URL</strong>:<br />  <code>https://metallb.github.io/metallb</code></li>
</ul>
<p>In the ArgoCD UI:</p>
<ul>
<li><p>Go to <strong>Settings → Repositories</strong></p>
</li>
<li><p>Click <strong>+ CONNECT REPO</strong></p>
</li>
<li><p>Enter the Helm repo URL</p>
</li>
<li><p>Choose <strong>Helm</strong> as the type</p>
</li>
<li><p>Give the repo a name (Optional)</p>
</li>
<li><p>Choose the project you created earlier to associate this repo with (mine was ‘prod-home’)</p>
</li>
<li><p>No authentication is needed for this public repo</p>
</li>
<li><p>Once done, click <strong>CONNECT</strong></p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745370089083/f3239ee4-6f34-4663-a800-45a20626e987.png" alt class="image--center mx-auto" /></p>
<p>Once added, ArgoCD can now pull charts from this source.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1744597093848/3ce4d22b-a831-47ba-9629-e114dcd2b704.png" alt class="image--center mx-auto" /></p>
<p><strong>Note:</strong> You’ll also need to add the <strong>GitHub repo</strong> that contains your custom configuration files, like Helm <code>values.yml</code> files and Kustomize overlays.</p>
<ul>
<li><p>If you're using <strong>my example repo</strong>, add <code>https://github.com/leothelyon17/kubernetes-gitops-playground.git</code> as another source, of type Git.</p>
</li>
<li><p>If you're using <strong>your own repo</strong>, just make sure it's added in the same way so ArgoCD can pull your values and overlays when syncing.</p>
<p>  <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745371223758/1a29db15-5695-4cbb-b229-12876dc0f5d7.png" alt class="image--center mx-auto" /></p>
</li>
</ul>
<h3 id="heading-step-2-create-the-argocd-application"><strong>Step 2: Create the ArgoCD Application</strong></h3>
<p>Head to the <strong>Applications</strong> tab and click <strong>+ NEW APP</strong> to start the deployment.</p>
<p>Here’s how to fill it out:</p>
<ul>
<li><p><strong>Application Name</strong>: <code>metallb</code></p>
</li>
<li><p><strong>Project</strong>: Select your project (e.g., lab-home)</p>
</li>
<li><p><strong>Sync Policy</strong>: Manual for now (we’ll automate later)</p>
</li>
<li><p><strong>Repository URL</strong>: Select the Helm repo you just added</p>
</li>
<li><p><strong>Chart Name</strong>: <code>metallb</code></p>
</li>
<li><p><strong>Target Revision</strong>: Use the latest or specify a version (recommended once things are stable)</p>
</li>
<li><p><strong>Cluster URL</strong>: Use <a target="_blank" href="https://kubernetes.default.svc"><code>https://kubernetes.default.svc</code></a> if deploying to the same cluster (mine may differ from the default; don’t worry)</p>
</li>
<li><p><strong>Namespace</strong>: <code>metallb-system</code> (check to create it if it doesn’t exist)</p>
</li>
</ul>
<p>Click CREATE when finished.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745370359345/80f3c9da-09e6-4cdd-bab6-d82f4c1ea8c3.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745370416424/160969ef-aaf7-41ab-bdb3-1032ffe5f716.png" alt class="image--center mx-auto" /></p>
<p>If everything is in order, you should see the App created like the screenshot below, though yours will show all-yellow status and ‘OutOfSync’:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745370461925/06f245ac-7393-4cf8-a38b-5cebe567c3ef.png" alt class="image--center mx-auto" /></p>
<p>Click into the app and you’ll see that ArgoCD has pulled in all the Kubernetes objects defined by the Helm chart. Everything will show as <strong>OutOfSync</strong> for now, ArgoCD knows what needs to be deployed, but we’re not quite ready to hit sync just yet. You're doing great, let’s move on to the next step.</p>
<h3 id="heading-step-3-add-the-kustomize-configuration-layer"><strong>Step 3: Add the Kustomize Configuration Layer</strong></h3>
<p>For MetalLB, we’re keeping things straightforward (kind of): the Helm chart gets deployed using its <strong>default values</strong>, no need to touch <code>values.yml</code> here. But MetalLB still needs to be told <em>how</em> to operate: what IP ranges it can assign, and how it should advertise them. We handle that using a second source: a <strong>Kustomize overlay</strong>.</p>
<p>Here’s what to do next:</p>
<ol>
<li><p>In the ArgoCD UI, go to the <strong>Application</strong> you just created for MetalLB.</p>
</li>
<li><p>Click the <strong>App details (🖉 edit)</strong> icon in the top right to open the manifest editor.</p>
</li>
<li><p>Scroll down to the <code>source</code> section.</p>
</li>
<li><p>You’ll now be editing this app to include a <strong>second source</strong>.</p>
</li>
</ol>
<p>Add the following block under <code>source:</code> to include the Kustomize overlay for your MetalLB custom resources:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">project:</span> <span class="hljs-string">prod-home</span>
<span class="hljs-attr">destination:</span>
  <span class="hljs-attr">server:</span> <span class="hljs-string">https://prod-kube-vip.jjland.local:6443</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">metallb-prod</span>
<span class="hljs-attr">syncPolicy:</span>
  <span class="hljs-attr">syncOptions:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">CreateNamespace=true</span>
<span class="hljs-attr">sources:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">repoURL:</span> <span class="hljs-string">https://metallb.github.io/metallb</span>
    <span class="hljs-attr">targetRevision:</span> <span class="hljs-number">0.14</span><span class="hljs-number">.9</span>
    <span class="hljs-attr">chart:</span> <span class="hljs-string">metallb</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">repoURL:</span> <span class="hljs-string">https://github.com/leothelyon17/kubernetes-gitops-playground.git</span>
    <span class="hljs-attr">path:</span> <span class="hljs-string">apps/metallb/overlays/lab</span>
    <span class="hljs-attr">targetRevision:</span> <span class="hljs-string">HEAD</span>
</code></pre>
<p>NOTE: ‘source’ needs to be changed to ‘sources’, as there is now more than one source.</p>
<p>This tells ArgoCD to deploy not just the Helm chart, but also the additional Kubernetes objects (like <code>IPAddressPool</code> and <code>L2Advertisement</code>) defined in your overlay. These are located in your <code>apps/metallb</code> directory and should include a <code>kustomization.yml</code> that pulls them together.</p>
<p>Once saved, ArgoCD will treat both the Helm install and the Kustomize overlay as part of the same application, and sync them together.</p>
<h3 id="heading-step-4-sync-the-app"><strong>Step 4: Sync the App</strong></h3>
<p>Once everything looks good, hit <strong>Sync</strong>. ArgoCD will pull the chart, merge/build your kustomize files, and deploy MetalLB into the cluster.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745371584446/205aec86-4be7-4483-bba3-023031ac1a8b.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745371728253/a26c3afe-23aa-48c0-a33c-76857d38076c.png" alt class="image--center mx-auto" /></p>
<p>You can click into the app to watch MetalLB’s resources come online; Deployments, ConfigMaps, the speaker DaemonSet, and more. If the sync fails on the first try, don’t panic, just retry it. This can happen if the chart includes CRDs (Custom Resource Definitions), which sometimes cause the sync to complete out of order while the CRDs are still registering.</p>
<p>Once things settle, you should see the application status show <strong>“Healthy”</strong> and <strong>“Synced”</strong>. You’ll also see multiple healthy MetalLB pods running in your cluster, just like the screenshot above.</p>
<p><strong>Congrats! MetalLB is now deployed and ready to hand out external IPs like a proper load balancer.</strong></p>
<h2 id="heading-metallb-custom-configuration">MetalLB Custom Configuration</h2>
<p>I wanted to provide a breakdown of the custom MetalLB files I’m using and why. This directory contains a <strong>Kustomize overlay</strong> used to deploy <strong>MetalLB's custom configuration</strong> in a lab environment. It layers environment-specific resources, like IP pools and advertisements, on top of the base Helm chart deployment, following GitOps best practices.</p>
<h3 id="heading-file-breakdown">File Breakdown</h3>
<h4 id="heading-ip-address-poolyml"><code>ip-address-pool.yml</code></h4>
<p>Defines an <code>IPAddressPool</code> custom resource:</p>
<ul>
<li><p>Specifies a range of IP addresses MetalLB can assign to LoadBalancer services</p>
</li>
<li><p>Ensures services are reachable from the local network</p>
</li>
<li><p>Helps avoid IP conflicts in your lab environment</p>
</li>
</ul>
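<p>For reference, a minimal <code>IPAddressPool</code> manifest looks something like this. The pool name and address range below are illustrative; carve a range out of your own lab subnet that nothing else is using:</p>
<pre><code class="lang-yaml">apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: lab-pool              # illustrative name
  namespace: metallb-system
spec:
  addresses:
    - 172.16.99.30-172.16.99.40   # example range; must not overlap DHCP or static assignments
</code></pre>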
<h4 id="heading-l2-advertisementyml"><code>l2-advertisement.yml</code></h4>
<p>Defines an <code>L2Advertisement</code> custom resource:</p>
<ul>
<li><p>Tells MetalLB to advertise the IPs via <strong>Layer 2</strong> (e.g., ARP)</p>
</li>
<li><p>Perfect for home labs and bare metal where BGP isn’t in use</p>
</li>
<li><p>Allows MetalLB to function like a basic network-aware load balancer</p>
</li>
</ul>
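<p>The matching <code>L2Advertisement</code> is short. A sketch with illustrative names; the entry under <code>ipAddressPools</code> must match whatever pool name you defined in <code>ip-address-pool.yml</code>:</p>
<pre><code class="lang-yaml">apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: lab-l2                # illustrative name
  namespace: metallb-system
spec:
  ipAddressPools:
    - lab-pool                # must match your IPAddressPool name
</code></pre>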
<h4 id="heading-kustomizationyml"><code>kustomization.yml</code></h4>
<p>Kustomize overlay file:</p>
<ul>
<li><p>Combines and applies the above resources</p>
</li>
<li><p>Enables clean separation between base and environment-specific config</p>
</li>
<li><p>Keeps your repo organized and scalable</p>
</li>
</ul>
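<p>And the overlay’s <code>kustomization.yml</code> can be as small as a resource list tying the two files together:</p>
<pre><code class="lang-yaml">apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ip-address-pool.yml
  - l2-advertisement.yml
</code></pre>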
<h3 id="heading-why-it-matters">Why It Matters</h3>
<p>This overlay is what makes MetalLB actually <em>work</em> in your lab. While the Helm chart installs the MetalLB controller and speaker pods, these custom resources tell MetalLB <strong>what IPs to use</strong> and <strong>how to announce them</strong> to your network.</p>
<p>By keeping these files in Git and applying them via ArgoCD, you’re not just deploying MetalLB, you’re making your configuration declarative, version-controlled, and repeatable across environments.</p>
<p>Moving on…</p>
<h1 id="heading-traefik-ingress-routing-built-for-gitops"><strong>Traefik: Ingress Routing Built for GitOps</strong></h1>
<p>Once MetalLB is in place and capable of handing out external IPs, we need something that can route incoming HTTP and HTTPS traffic to the right service inside the cluster. That’s where an <strong>ingress controller</strong> comes in, and for our GitOps setup, <strong>Traefik</strong> is a perfect fit.</p>
<h2 id="heading-what-is-traefik">What is Traefik?</h2>
<p>Traefik is a modern, Kubernetes-native ingress controller that handles routing external traffic into your cluster based on rules you define in Kubernetes. It supports things like:</p>
<ul>
<li><p>Routing traffic based on hostname or path</p>
</li>
<li><p>TLS termination (including Let’s Encrypt support)</p>
</li>
<li><p>Load balancing between multiple pods</p>
</li>
<li><p>Middleware support for things like authentication, redirects, rate limiting, etc.</p>
</li>
</ul>
<p>Traefik is also highly compatible with GitOps workflows. It uses Kubernetes Custom Resource Definitions (CRDs) like <code>IngressRoute</code> and <code>Middleware</code>, which makes it easy to manage all of your ingress behavior declaratively, right from your Git repo.</p>
<h2 id="heading-why-you-need-it-1"><strong>Why You Need It</strong></h2>
<p>Without an ingress controller, every service you want to expose needs its own LoadBalancer service (i.e., a dedicated external IP). That scales poorly, especially in a lab environment with limited IP space.</p>
<p>Traefik solves that problem by letting you expose <strong>multiple services through a single external IP</strong>, usually on ports 80 and 443, by routing requests based on hostnames or paths. This means:</p>
<ul>
<li><p>You can access services like <code>argocd.yourdomain.local</code> and <code>vault.yourdomain.local</code> through the same IP.</p>
</li>
<li><p>You get clean, centralized HTTPS management with built-in TLS support.</p>
</li>
<li><p>You dramatically reduce the number of open ports and public IPs you need.</p>
</li>
</ul>
<p>Paired with MetalLB, Traefik becomes the front door to your entire GitOps platform.</p>
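<p>To make that concrete, here’s a sketch of a hostname-based <code>IngressRoute</code>. The hostname, service name, and namespace are placeholders, not part of this deployment:</p>
<pre><code class="lang-yaml">apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: argocd                # placeholder
  namespace: argocd           # placeholder
spec:
  entryPoints:
    - websecure
  routes:
    - match: Host(`argocd.yourdomain.local`)
      kind: Rule
      services:
        - name: argocd-server   # placeholder backend service
          port: 80
</code></pre>
<p>Any number of routes like this can share the single external IP that MetalLB hands to Traefik’s LoadBalancer service.</p>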
<h2 id="heading-how-it-powers-the-platform-1">How It Powers the Platform</h2>
<p>Traefik is the gateway that makes all the services behind it easily and securely accessible. It enables you to:</p>
<ul>
<li><p>Route HTTP/HTTPS traffic to services like ArgoCD, Vault, and your internal tools</p>
</li>
<li><p>Handle TLS (with optional Let’s Encrypt integration)</p>
</li>
<li><p>Define ingress behavior declaratively via CRDs</p>
</li>
<li><p>Share a single external IP across multiple services, using hostnames or paths</p>
</li>
</ul>
<p>All of this is deployed using ArgoCD, meaning every route, certificate, and service exposure is version-controlled and reproducible.</p>
<h2 id="heading-traefik-installation"><strong>Traefik Installation</strong></h2>
<p>As we covered during the MetalLB install, adding Helm repositories, creating the app in ArgoCD, and configuring the basic Helm parameters is mostly the same for each app we deploy. Because we've already gone through that in detail with MetalLB, I'll just briefly outline the steps again here. No detailed screenshots needed unless there’s a significant difference.</p>
<h3 id="heading-step-1-add-the-traefik-helm-repo"><strong>Step 1: Add the Traefik Helm Repo</strong></h3>
<p>ArgoCD needs to know where to pull the Traefik Helm chart from. For Traefik, we’ll use the official Traefik Helm repository:</p>
<p><strong>Helm Repo URL:</strong></p>
<pre><code class="lang-bash">https://helm.traefik.io/traefik
</code></pre>
<p>In the ArgoCD UI:</p>
<ul>
<li><p>Navigate to <strong>Settings → Repositories</strong></p>
</li>
<li><p>Click <strong>+ CONNECT REPO</strong></p>
</li>
<li><p>Enter the Traefik Helm repo URL listed above</p>
</li>
<li><p>Select <strong>Helm</strong> as the repository type</p>
</li>
<li><p>Provide a name (optional, something like <code>traefik-charts</code>)</p>
</li>
<li><p>Associate the repo with the appropriate ArgoCD project (mine was <code>lab-home</code>)</p>
</li>
<li><p>No authentication is required since this repo is publicly accessible</p>
</li>
<li><p>Click <strong>CONNECT</strong> to finish</p>
</li>
</ul>
<p>Once connected, ArgoCD is ready to deploy the Traefik Helm chart into your cluster.</p>
<h3 id="heading-step-2-create-the-argocd-application-traefik"><strong>Step 2: Create the ArgoCD Application (Traefik)</strong></h3>
<p>Head to the <strong>Applications</strong> tab in ArgoCD, and click <strong>+ NEW APP</strong> to start deploying Traefik.</p>
<p>Here's how you'll fill it out:</p>
<ul>
<li><p><strong>Application Name:</strong> <code>traefik</code></p>
</li>
<li><p><strong>Project:</strong> Select your ArgoCD project</p>
</li>
<li><p><strong>Sync Policy:</strong> Manual (for now)</p>
</li>
<li><p><strong>Repository URL:</strong> Select the Traefik Helm repo you just connected</p>
</li>
<li><p><strong>Chart Name:</strong> <code>traefik</code></p>
</li>
<li><p><strong>Target Revision:</strong> Use <code>latest</code>, or specify a stable version once you've tested and confirmed compatibility</p>
</li>
<li><p><strong>Cluster URL:</strong> Typically <a target="_blank" href="https://kubernetes.default.svc"><code>https://kubernetes.default.svc</code></a> for an in-cluster deploy (if yours differs, just use the appropriate URL)</p>
</li>
<li><p><strong>Namespace:</strong> Use <code>kube-system</code> (check the option to create it if it doesn’t exist yet)</p>
</li>
</ul>
<p><strong>Why</strong> <code>kube-system</code> namespace?<br />Deploying Traefik to the <code>kube-system</code> namespace makes sense because Traefik is essentially a core infrastructure service. Placing it here aligns with Kubernetes best practices: core infrastructure and networking-related services belong in this namespace, clearly separated from user and application workloads.</p>
<p>When finished, click <strong>CREATE</strong> to finalize the setup.</p>
<h3 id="heading-step-3-add-custom-helm-values-for-traefik"><strong>Step 3: Add Custom Helm Values for Traefik</strong></h3>
<p>Unlike MetalLB, our Traefik deployment uses custom Helm values directly from our Git repository, <strong>without Kustomize</strong>. We'll define these custom values as a second source within our ArgoCD Application manifest.</p>
<p>Here's how you'll set this up in the ArgoCD UI:</p>
<ol>
<li><p>Navigate to the <strong>Traefik</strong> Application you created earlier.</p>
</li>
<li><p>Click the <strong>App details (🖉 edit)</strong> icon in the top-right corner to open the manifest editor.</p>
</li>
<li><p>Scroll down to the manifest, and ensure you're using <code>sources:</code> (plural), since we're adding an additional source.</p>
</li>
<li><p>Modify your ArgoCD Application manifest to look similar to this:</p>
</li>
</ol>
<pre><code class="lang-yaml"><span class="hljs-attr">project:</span> <span class="hljs-string">home-lab</span>
<span class="hljs-attr">destination:</span>
  <span class="hljs-attr">server:</span> <span class="hljs-string">https://172.16.99.25:6443</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">kube-system</span>
<span class="hljs-attr">syncPolicy:</span>
  <span class="hljs-attr">syncOptions:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">CreateNamespace=true</span>
<span class="hljs-attr">sources:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">repoURL:</span> <span class="hljs-string">https://helm.traefik.io/traefik</span>
    <span class="hljs-attr">targetRevision:</span> <span class="hljs-number">35.0</span><span class="hljs-number">.1</span>
    <span class="hljs-attr">helm:</span>
      <span class="hljs-attr">valueFiles:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-string">$values/apps/traefik/values-lab.yml</span>
    <span class="hljs-attr">chart:</span> <span class="hljs-string">traefik</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">repoURL:</span> <span class="hljs-string">https://github.com/leothelyon17/kubernetes-gitops-playground.git</span>
    <span class="hljs-attr">targetRevision:</span> <span class="hljs-string">HEAD</span>
    <span class="hljs-attr">ref:</span> <span class="hljs-string">values</span>
</code></pre>
<p><strong>Explanation:</strong></p>
<ul>
<li><p>The <strong>first source</strong> references the official Traefik Helm repository, specifying the chart version.</p>
</li>
<li><p>The <strong>second source</strong> references my GitHub repo (or your own), where your custom Helm values (<code>values-lab.yml</code>) are stored.</p>
</li>
<li><p>ArgoCD merges these values when syncing Traefik, allowing environment-specific customizations, such as ingress rules, TLS settings, dashboard exposure, middleware options, and other important configurations.</p>
</li>
</ul>
<p>Once you've updated and saved this manifest, ArgoCD will apply the changes, and Traefik will deploy using your customized configuration, all neatly managed by GitOps.</p>
<h3 id="heading-step-4-sync-the-traefik-application"><strong>Step 4: Sync the Traefik Application</strong></h3>
<p>Once everything looks good, click <strong>Sync</strong> in ArgoCD. It will pull the Traefik Helm chart, merge your custom Helm values (<code>values-lab.yml</code>), and deploy Traefik into your cluster.</p>
<p>You can click into the application details to watch Traefik’s resources spin up: Deployments, Services, IngressRoutes, and more. If the sync fails initially, don’t worry; just retry it.</p>
<p>After a short period, you should see Traefik showing a status of <strong>“Healthy”</strong> and <strong>“Synced”</strong>. Verify that Traefik pods are running successfully in your cluster (similar to MetalLB earlier).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745375690165/0b4a9843-3a7e-4306-9622-6dd80cf3bc32.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745375720682/203d51f8-d54a-4a20-9671-1bffe72cb7ac.png" alt class="image--center mx-auto" /></p>
<p>Congratulations! Traefik is now up and running as your ingress controller, ready to handle external HTTP(S) traffic into your cluster.</p>
<h2 id="heading-traefik-custom-helm-values"><strong>Traefik Custom Helm Values</strong></h2>
<p>Let’s take a look at the custom Helm values we’re using for Traefik, pulled from <a target="_blank" href="https://github.com/leothelyon17/kubernetes-gitops-playground/blob/main/apps/traefik/values-lab.yml"><code>apps/traefik/values-lab.yml</code></a>. These provide a simple but functional starting point for ingress, dashboard access, and authentication in a lab environment.</p>
<h3 id="heading-key-configuration-highlights"><strong>Key Configuration Highlights</strong></h3>
<h4 id="heading-ingressroute-for-the-traefik-dashboard">IngressRoute for the Traefik Dashboard</h4>
<pre><code class="lang-yaml"><span class="hljs-attr">ingressRoute:</span>
  <span class="hljs-attr">dashboard:</span>
    <span class="hljs-attr">enabled:</span> <span class="hljs-literal">true</span>
    <span class="hljs-attr">matchRule:</span> <span class="hljs-string">Host(`YOUR-URL`)</span>
    <span class="hljs-attr">entryPoints:</span> [<span class="hljs-string">"web"</span>, <span class="hljs-string">"websecure"</span>]
    <span class="hljs-attr">middlewares:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">traefik-dashboard-auth</span>
</code></pre>
<ul>
<li><p><strong>Enables the Traefik dashboard</strong> and exposes it via both HTTP and HTTPS.</p>
</li>
<li><p>Routes traffic based on hostname (e.g., <code>traefik-dashboard-lab.jjland.local</code>).</p>
</li>
<li><p>Adds a middleware for basic authentication to protect access.</p>
</li>
</ul>
<h4 id="heading-basic-authentication-middleware">Basic Authentication Middleware</h4>
<pre><code class="lang-yaml"><span class="hljs-attr">extraObjects:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">kind:</span> <span class="hljs-string">Secret</span>
    <span class="hljs-attr">type:</span> <span class="hljs-string">kubernetes.io/basic-auth</span>
    <span class="hljs-attr">stringData:</span>
      <span class="hljs-attr">username:</span> <span class="hljs-string">admin</span>
      <span class="hljs-attr">password:</span> <span class="hljs-string">changeme</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">kind:</span> <span class="hljs-string">Middleware</span>
    <span class="hljs-attr">spec:</span>
      <span class="hljs-attr">basicAuth:</span>
        <span class="hljs-attr">secret:</span> <span class="hljs-string">traefik-dashboard-auth-secret</span>
</code></pre>
<ul>
<li><p>Creates a <strong>Kubernetes Secret</strong> with hardcoded credentials (<code>admin</code> / <code>changeme</code>).</p>
</li>
<li><p>Defines a <strong>Traefik Middleware</strong> that references the secret and applies HTTP basic auth to protected routes.</p>
</li>
</ul>
<blockquote>
<p><strong>NOTE:</strong> These credentials are hardcoded and intended only for lab/demo use. You should absolutely replace <code>"changeme"</code> with a strong, securely managed password, or better yet, use a more robust authentication mechanism in production.</p>
</blockquote>
<h4 id="heading-static-loadbalancer-ip-assignment">Static LoadBalancer IP Assignment</h4>
<pre><code class="lang-yaml"><span class="hljs-attr">service:</span>
  <span class="hljs-attr">spec:</span>
    <span class="hljs-attr">loadBalancerIP:</span> <span class="hljs-string">&lt;YOUR</span> <span class="hljs-string">IP</span> <span class="hljs-string">SET</span> <span class="hljs-string">ASIDE</span> <span class="hljs-string">BY</span> <span class="hljs-string">METALLB&gt;</span>
</code></pre>
<ul>
<li>This assigns a <strong>specific external IP</strong> to Traefik’s LoadBalancer service, ensuring stable access through MetalLB.</li>
</ul>
<h3 id="heading-accessing-the-dashboard"><strong>Accessing the Dashboard</strong></h3>
<p>Once deployed and synced in ArgoCD, you can access the Traefik dashboard by visiting the URL set in the custom values file.</p>
<p>To make this work:</p>
<ul>
<li><p>Add a <strong>DNS record</strong> (or local <code>/etc/hosts</code> entry) pointing to your Traefik service IP (in my case, <code>172.16.99.30</code>).</p>
</li>
<li><p>Use the credentials you set in the values file (<code>admin</code> / <code>changeme</code>) to log in via the basic auth prompt.</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745375844745/dbe816a6-6f02-48c1-9ccc-867f70ca8dc9.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-why-it-matters-1"><strong>Why It Matters</strong></h3>
<p>This configuration gives you:</p>
<ul>
<li><p>A working Traefik dashboard protected by basic auth</p>
</li>
<li><p>A predictable IP address exposed by MetalLB</p>
</li>
<li><p>A GitOps-managed ingress setup, all stored in Git and synced automatically via ArgoCD</p>
</li>
</ul>
<p>These are just <strong>starter settings</strong>. They work great in a lab, but you’ll want to harden and expand them for production use. Still, even at this basic level, you’re getting all the core benefits: visibility, consistency, and version-controlled configuration.</p>
<p>Let’s move on to the next part of the platform.</p>
<h1 id="heading-rook-ceph-persistent-storage-for-stateful-applications"><strong>Rook + Ceph: Persistent Storage for Stateful Applications</strong></h1>
<p>So far, we’ve deployed the pieces that make your platform accessible: MetalLB for external IPs, and Traefik for routing traffic. But modern platforms don’t just serve traffic; they store data. If you’re planning to run apps like Nautobot, NetBox, or Postgres, you’ll need reliable, persistent storage to keep data alive across restarts and node failures.</p>
<p>That’s where <strong>Rook + Ceph</strong> comes in.</p>
<h2 id="heading-what-is-rook-ceph"><strong>What is Rook + Ceph?</strong></h2>
<p><strong>Ceph</strong> is a distributed storage system that provides block, object, and file storage, all highly available and scalable. It’s used in enterprise environments for cloud-native storage, and it’s rock solid.</p>
<p><strong>Rook</strong> is the Kubernetes operator that makes deploying and managing Ceph clusters easier and more native to the Kubernetes ecosystem. Together, they turn a set of disks across your nodes into a <strong>resilient, self-healing storage platform</strong>.</p>
<h2 id="heading-why-you-need-it-2">Why You Need It</h2>
<p>Kubernetes doesn’t come with a built-in storage backend. While it allows you to declare <code>PersistentVolumeClaims</code>, it’s up to you to provide the actual storage behind them. In cloud environments, that’s easy: just hook into EBS, Azure Disks, or whatever your platform provides. But in a lab or on-prem cluster? You’re on your own.</p>
<p><strong>Rook + Ceph fills that gap</strong>. Once deployed, it becomes your cluster’s dynamic, self-healing storage layer. You can provision persistent volumes for any stateful workload (databases, internal tooling, monitoring stacks, and more) without having to manually manage local disks or worry about data loss.</p>
<h2 id="heading-how-it-powers-the-platform-2">How It Powers the Platform</h2>
<p>Rook + Ceph is the backbone of persistent infrastructure in this setup. It enables you to:</p>
<ul>
<li><p><strong>Create</strong> <code>PersistentVolumes</code> dynamically, on demand, using <code>StorageClass</code> definitions</p>
</li>
<li><p><strong>Run stateful apps</strong> like NetBox, Nautobot, PostgreSQL, and Prometheus with confidence</p>
</li>
<li><p><strong>Survive pod restarts and node reboots</strong>: your data stays intact and available</p>
</li>
<li><p><strong>Manage it all declaratively</strong>, deployed and version-controlled with ArgoCD, just like everything else</p>
</li>
</ul>
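<p>From an application’s point of view, all of this boils down to an ordinary <code>PersistentVolumeClaim</code>. A sketch, where the claim name, storage class name, and size are illustrative; use whatever name your overlay’s <code>StorageClass</code> actually defines:</p>
<pre><code class="lang-yaml">apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nautobot-data               # illustrative claim name
spec:
  accessModes:
    - ReadWriteMany                 # CephFS supports shared access across pods
  storageClassName: ceph-filesystem # illustrative; match your StorageClass name
  resources:
    requests:
      storage: 10Gi
</code></pre>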
<h2 id="heading-what-this-looks-like-when-deployed">What This Looks Like When Deployed</h2>
<p>Once your Rook + Ceph configuration is applied and the cluster becomes active, you’ll effectively have a <strong>resilient, distributed storage system</strong> spanning all your nodes. In this setup:</p>
<ul>
<li><p>Ceph stores data <strong>redundantly across all three nodes</strong>, similar in concept to a 3-node <strong>RAID-1</strong> (mirrored) configuration.</p>
</li>
<li><p>When one node goes offline or a disk fails, your data is still accessible and safe.</p>
</li>
<li><p>The Ceph monitor daemons ensure quorum and cluster health, while OSDs (Object Storage Daemons) replicate data across your available storage devices (e.g., <code>/dev/vdb</code> on each node).</p>
</li>
</ul>
<p>This redundancy is built-in and automatically managed by the Ceph cluster itself, no manual RAID configuration needed. It’s a core reason why Ceph is trusted in both enterprise and lab-scale deployments.</p>
<h2 id="heading-what-were-deploying-the-operator-storagecluster">What We’re Deploying: The Operator + StorageCluster</h2>
<p>As with many Kubernetes-native tools, Rook uses the <strong>Operator pattern</strong> to manage Ceph. We’ll be deploying two key components:</p>
<ul>
<li><p><strong>The Rook-Ceph Operator</strong> – Acts as a controller that manages Ceph-specific resources and keeps everything in the desired state.</p>
</li>
<li><p><strong>A</strong> <code>CephCluster</code> resource – Defines how the storage backend should be built using the disks available across your nodes.</p>
</li>
</ul>
<blockquote>
<p><strong>What’s an Operator?</strong><br />A Kubernetes Operator is a purpose-built controller that manages complex stateful applications by watching for custom resources (like <code>CephCluster</code>) and continuously reconciling their desired state—creating, healing, scaling, and configuring everything automatically.</p>
</blockquote>
<p>By deploying both the operator and the cluster config together, we get a hands-off, fully declarative storage setup. Everything is defined in Git, synced by ArgoCD, and managed by the operator—including provisioning, recovery, and upgrades.</p>
<h3 id="heading-step-1-add-the-rook-ceph-helm-repo"><strong>Step 1: Add the Rook-Ceph Helm Repo</strong></h3>
<p>ArgoCD needs to know where to pull the Rook-Ceph Helm chart from. For this, we’ll use the official Rook Helm repository:</p>
<p><strong>Helm Repo URL:</strong></p>
<pre><code class="lang-bash">https://charts.rook.io/release
</code></pre>
<p>In the ArgoCD UI:</p>
<ul>
<li><p>Navigate to <strong>Settings → Repositories</strong></p>
</li>
<li><p>Click <strong>+ CONNECT REPO</strong></p>
</li>
<li><p>Enter the Helm repo URL listed above</p>
</li>
<li><p>Select <strong>Helm</strong> as the repository type</p>
</li>
<li><p>Optionally give it a name (e.g., <code>rook-ceph-charts</code>)</p>
</li>
<li><p>Associate the repo with your ArgoCD project (mine was <code>lab-home</code>)</p>
</li>
<li><p>No authentication is required since it’s publicly accessible</p>
</li>
<li><p>Click <strong>CONNECT</strong> to finish</p>
</li>
</ul>
<p>Once connected, ArgoCD will be able to deploy both the <strong>Rook-Ceph operator</strong> and <strong>storage cluster</strong> using this chart.</p>
<h3 id="heading-step-2-create-the-argocd-application-rook-ceph"><strong>Step 2: Create the ArgoCD Application (Rook-Ceph)</strong></h3>
<p>Now that the repo is connected, head to the <strong>Applications</strong> tab in ArgoCD and click <strong>+ NEW APP</strong> to start the deployment.</p>
<p>Here’s how to fill it out:</p>
<ul>
<li><p><strong>Application Name:</strong> <code>rook-ceph</code></p>
</li>
<li><p><strong>Project:</strong> Select your ArgoCD project (e.g., <code>lab-home</code>)</p>
</li>
<li><p><strong>Sync Policy:</strong> Manual (for now)</p>
</li>
<li><p><strong>Repository URL:</strong> Select the Rook Helm repo you just connected</p>
</li>
<li><p><strong>Chart Name:</strong> <code>rook-ceph</code></p>
</li>
<li><p><strong>Target Revision:</strong> Use <code>latest</code>, or pin to a stable version you’ve tested</p>
</li>
<li><p><strong>Cluster URL:</strong> Typically <a target="_blank" href="https://kubernetes.default.svc"><code>https://kubernetes.default.svc</code></a> if deploying in-cluster</p>
</li>
<li><p><strong>Namespace:</strong> <code>rook-ceph</code> (check the box to create it if it doesn’t exist)</p>
</li>
</ul>
<h3 id="heading-why-the-rook-ceph-namespace">Why the <code>rook-ceph</code> Namespace?</h3>
<p>Rook and Ceph manage a lot of moving parts—monitors, OSDs, managers, etc.—and isolating those components into their own namespace (<code>rook-ceph</code>) helps keep your cluster clean and easier to troubleshoot. It also aligns with common community best practices and makes upgrades and deletions much safer.</p>
<p>Once you’ve filled everything out, click <strong>CREATE</strong> to finish provisioning the application.</p>
<h3 id="heading-step-3-add-custom-helm-values-kustomize-overlay-for-rook-ceph"><strong>Step 3: Add Custom Helm Values + Kustomize Overlay for Rook-Ceph</strong></h3>
<p>Rook-Ceph is one of the more complex components in our GitOps platform. It’s not just a single deployment; it involves multiple controllers, CRDs, and cluster-level storage logic. Because of that, we’ll be using <strong>both a Helm chart (with custom values)</strong> and a <strong>Kustomize overlay</strong> to deploy it cleanly and maintainably.</p>
<p>This dual-source approach lets us:</p>
<ul>
<li><p>Use the <strong>Helm chart</strong> to install the Rook-Ceph operator and core components</p>
</li>
<li><p>Apply <strong>custom values</strong> to tailor behavior for our environment (resource tuning, monitor placement, dashboard settings, etc.)</p>
</li>
<li><p>Layer in <strong>Kustomize-based manifests</strong> for complex resources like <code>CephCluster</code>, <code>StorageClass</code>, and <code>CephFilesystem</code>, which often require more precise control</p>
</li>
</ul>
<h3 id="heading-argocd-application-sources">ArgoCD Application Sources</h3>
<p>When editing your ArgoCD Application manifest, your <code>sources</code> block will look similar to this:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">sources:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">repoURL:</span> <span class="hljs-string">https://charts.rook.io/release</span>
    <span class="hljs-attr">targetRevision:</span> <span class="hljs-string">v1.17.0</span>
    <span class="hljs-attr">helm:</span>
      <span class="hljs-attr">valueFiles:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-string">$values/apps/rook-ceph/values-lab.yml</span>
    <span class="hljs-attr">chart:</span> <span class="hljs-string">rook-ceph</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">repoURL:</span> <span class="hljs-string">https://github.com/leothelyon17/kubernetes-gitops-playground.git</span>
    <span class="hljs-attr">path:</span> <span class="hljs-string">apps/rook-ceph/overlays/lab</span>
    <span class="hljs-attr">targetRevision:</span> <span class="hljs-string">HEAD</span>
    <span class="hljs-attr">ref:</span> <span class="hljs-string">values</span>
</code></pre>
<h3 id="heading-why-both-sources">Why Both Sources?</h3>
<ul>
<li><p>The <strong>Helm chart</strong> deploys the operator and all required CRDs in the correct order.</p>
</li>
<li><p>The <strong>Kustomize overlay</strong> (from your Git repo) contains environment-specific resources like:</p>
<ul>
<li><p><strong>CephCluster</strong> – the main storage cluster definition</p>
</li>
<li><p><strong>StorageClass</strong> – so other apps can request storage using <code>PersistentVolumeClaims</code></p>
</li>
<li><p><strong>CephFilesystem</strong> – enables shared POSIX-compliant volumes for apps needing ReadWriteMany access</p>
</li>
<li><p><strong>Optional extras</strong> like <code>CephBlockPool</code> or a toolbox deployment for CLI-based Ceph management</p>
</li>
</ul>
</li>
</ul>
<blockquote>
<p>You can find these manifests in the repo under:<br /><code>apps/rook-ceph/overlays/lab/</code></p>
</blockquote>
<p>Once saved, ArgoCD will treat both sources as part of the same application and sync them together, ensuring everything is deployed in the right order and stays in sync with Git.</p>
<h2 id="heading-understanding-the-rook-ceph-overlay-managing-complexity-with-gitops"><strong>Understanding the Rook-Ceph Overlay: Managing Complexity with GitOps</strong></h2>
<p>I wanted to cover this now, before we try to sync. Setting up Rook-Ceph in a GitOps workflow involves more than just deploying a Helm chart. You’re orchestrating a sophisticated storage platform made up of tightly coupled components: an operator, CRDs, a distributed Ceph cluster, storage classes, ingress routes, and more. Each piece needs to be configured correctly and deployed in the proper order.</p>
<p>To keep all of this manageable and repeatable, we separate concerns using a combination of <strong>custom Helm values</strong> and a <strong>Kustomize overlay</strong>. The overlay found in <code>apps/rook-ceph/overlays/lab</code> brings together the critical resources required for a working Ceph deployment—block pools, shared filesystems, storage classes, and even a dashboard ingress.</p>
<p>The sections below break down each of these files so you can understand what’s happening, why it’s needed, and how it fits into the larger GitOps puzzle.</p>
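<p>Structurally, the overlay’s own <code>kustomization.yml</code> is just a resource list. A sketch; the two file names shown are the ones covered in this article, and any additional manifests in the repo (StorageClass, block pool, dashboard ingress) would be listed alongside them:</p>
<pre><code class="lang-yaml">apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: rook-ceph
resources:
  - ceph-cluster.yml
  - ceph-filesystem.yml
  # plus any StorageClass, CephBlockPool, or ingress manifests in the overlay
</code></pre>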
<h2 id="heading-appsrook-cephvalues-labyml"><code>apps/rook-ceph/values-lab.yml</code></h2>
<pre><code class="lang-yaml"><span class="hljs-attr">csi:</span>
  <span class="hljs-attr">enableRbdDriver:</span> <span class="hljs-literal">false</span>
</code></pre>
<ul>
<li><p><strong>Purpose:</strong> Disables the RBD (block-device) CSI driver in this lab setup, since we’re only using CephFS here.</p>
</li>
<li><p><strong>Why it matters:</strong> Keeps the cluster lean by not installing unused CSI components.</p>
</li>
</ul>
<h2 id="heading-appsrook-cephoverlayslab"><code>apps/rook-ceph/overlays/lab/</code></h2>
<h3 id="heading-ceph-clusteryml"><code>ceph-cluster.yml</code></h3>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">ceph.rook.io/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">CephCluster</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">rook-ceph</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">rook-ceph</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">cephVersion:</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">quay.io/ceph/ceph:v19.2.1</span>
  <span class="hljs-attr">dataDirHostPath:</span> <span class="hljs-string">/var/lib/rook</span>
  <span class="hljs-attr">mon:</span>
    <span class="hljs-attr">count:</span> <span class="hljs-number">3</span>
    <span class="hljs-attr">allowMultiplePerNode:</span> <span class="hljs-literal">false</span>
  <span class="hljs-attr">dashboard:</span>
    <span class="hljs-attr">enabled:</span> <span class="hljs-literal">true</span>
  <span class="hljs-attr">storage:</span>
    <span class="hljs-attr">useAllNodes:</span> <span class="hljs-literal">true</span>
    <span class="hljs-attr">useAllDevices:</span> <span class="hljs-literal">false</span>
    <span class="hljs-attr">deviceFilter:</span> <span class="hljs-string">vdb</span>
</code></pre>
<ul>
<li><p><strong>Defines</strong> the core <code>CephCluster</code> resource.</p>
</li>
<li><p><strong>Key settings:</strong></p>
<ul>
<li><p>Runs 3 monitors for quorum.</p>
</li>
<li><p>Uses each node’s <code>vdb</code> device for OSDs (fits your lab VM disk layout).</p>
</li>
<li><p>Enables the Ceph dashboard for visual health checks.</p>
</li>
</ul>
</li>
</ul>
<p><strong>⚠️ NOTE:</strong> These settings are specific to <strong>my 3-node lab cluster</strong>, where each node has:</p>
<ul>
<li><p>One OS disk (<code>vda</code>)</p>
</li>
<li><p>One dedicated Ceph data disk (<code>vdb</code>)</p>
</li>
</ul>
<p>Example disk layout (<code>lsblk</code> output from one node):</p>
<pre><code class="lang-bash">[jeff@rocky9-lab-node1 ~]$ lsblk
NAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
sr0          11:0    1  1.7G  0 rom  
vda         252:0    0   50G  0 disk 
├─vda1      252:1    0    1G  0 part /boot
└─vda2      252:2    0   49G  0 part 
  ├─rl-root 253:0    0   44G  0 lvm  /
  └─rl-swap 253:1    0    5G  0 lvm  
vdb         252:16   0  250G  0 disk
</code></pre>
<p>Your disk layout will likely be different. I’ve configured Ceph to use only the <code>vdb</code> disk via the <code>deviceFilter</code> setting to avoid accidentally wiping the OS disk.</p>
<p>⚠️ <strong>Be careful:</strong> If you don’t tailor these values to your hardware, you could unintentionally destroy existing data. Always verify your node’s disk setup and adjust your configuration accordingly.</p>
<h3 id="heading-ceph-filesystemyml"><code>ceph-filesystem.yml</code></h3>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">ceph.rook.io/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">CephFilesystem</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">k8s-ceph-fs</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">rook-ceph</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">metadataPool:</span>
    <span class="hljs-attr">failureDomain:</span> <span class="hljs-string">host</span>
    <span class="hljs-attr">replicated:</span>
      <span class="hljs-attr">size:</span> <span class="hljs-number">3</span>
  <span class="hljs-attr">dataPools:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">replicated</span>
      <span class="hljs-attr">failureDomain:</span> <span class="hljs-string">host</span>
      <span class="hljs-attr">replicated:</span>
        <span class="hljs-attr">size:</span> <span class="hljs-number">3</span>
  <span class="hljs-attr">preserveFilesystemOnDelete:</span> <span class="hljs-literal">true</span>
  <span class="hljs-attr">metadataServer:</span>
    <span class="hljs-attr">activeCount:</span> <span class="hljs-number">1</span>
    <span class="hljs-attr">activeStandby:</span> <span class="hljs-literal">true</span>
</code></pre>
<ul>
<li><p><strong>Creates</strong> a <code>CephFilesystem</code> (CephFS) for <strong>shared, POSIX-style volumes</strong>.</p>
</li>
<li><p><strong>Why CephFS?</strong> Enables <code>ReadWriteMany</code> storage, which block pools alone can’t provide.</p>
</li>
</ul>
<h3 id="heading-ceph-storageclass-deleteyml-amp-ceph-storageclass-retainyml"><code>ceph-storageclass-delete.yml</code> &amp; <code>ceph-storageclass-retain.yml</code></h3>
<p>Both define Kubernetes <code>StorageClass</code> objects that front the CephFS CSI driver:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">storage.k8s.io/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">StorageClass</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">rook-cephfs-delete</span>      <span class="hljs-comment"># or rook-cephfs-retain</span>
<span class="hljs-attr">provisioner:</span> <span class="hljs-string">rook-ceph.cephfs.csi.ceph.com</span>
<span class="hljs-attr">parameters:</span>
  <span class="hljs-attr">clusterID:</span> <span class="hljs-string">rook-ceph</span>
  <span class="hljs-attr">fsName:</span> <span class="hljs-string">k8s-ceph-fs</span>
  <span class="hljs-attr">pool:</span> <span class="hljs-string">k8s-ceph-fs-replicated</span>
  <span class="hljs-attr">csi.storage.k8s.io/provisioner-secret-name:</span> <span class="hljs-string">rook-csi-cephfs-provisioner</span>
  <span class="hljs-attr">csi.storage.k8s.io/node-stage-secret-name:</span> <span class="hljs-string">rook-csi-cephfs-node</span>
<span class="hljs-attr">reclaimPolicy:</span> <span class="hljs-string">Delete</span>       <span class="hljs-comment"># or Retain</span>
<span class="hljs-attr">allowVolumeExpansion:</span> <span class="hljs-literal">true</span>
</code></pre>
<ul>
<li><p><strong>Difference:</strong></p>
<ul>
<li><p><code>rook-cephfs-delete</code> will <strong>delete</strong> PV data when PVCs are removed.</p>
</li>
<li><p><code>rook-cephfs-retain</code> will <strong>retain</strong> data for manual cleanup or backup.</p>
</li>
</ul>
</li>
<li><p><strong>Why two classes?</strong> Gives you flexibility for different workloads (ephemeral test vs. persistent data).</p>
</li>
</ul>
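<p>To make this concrete, here’s a rough sketch of a PVC that requests shared <code>ReadWriteMany</code> storage from one of these classes; the claim name, namespace, and size are hypothetical placeholders:</p>
<pre><code class="lang-yaml">apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data            # hypothetical name
  namespace: default           # hypothetical namespace
spec:
  accessModes:
    - ReadWriteMany            # possible because CephFS backs the class
  resources:
    requests:
      storage: 5Gi
  storageClassName: rook-cephfs-delete   # or rook-cephfs-retain for data you want to keep
</code></pre>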
<h3 id="heading-ingress-route-guiyml"><code>ingress-route-gui.yml</code></h3>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">traefik.io/v1alpha1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">IngressRoute</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">ceph-ingressroute-gui</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">rook-ceph</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">entryPoints:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">web</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">websecure</span>
  <span class="hljs-attr">routes:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">match:</span> <span class="hljs-string">Host(`ceph-dashboard-lab.jjland.local`)</span> <span class="hljs-comment"># EXAMPLE</span>
      <span class="hljs-attr">kind:</span> <span class="hljs-string">Rule</span>
      <span class="hljs-attr">services:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">rook-ceph-mgr-dashboard</span>
          <span class="hljs-attr">port:</span> <span class="hljs-number">7000</span>
</code></pre>
<ul>
<li><p><strong>Exposes</strong> the Ceph dashboard through Traefik on your chosen host.</p>
</li>
<li><p><strong>Why:</strong> Lets you reach the Ceph UI (after DNS/hosts setup) without manually port-forwarding.</p>
</li>
</ul>
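<p>If you don’t run DNS in your lab, a local <code>/etc/hosts</code> entry pointing my example hostname at your Traefik ingress IP is enough (the IP below is hypothetical):</p>
<pre><code class="lang-bash"># /etc/hosts entry (replace with your own MetalLB/Traefik ingress IP)
192.168.1.240  ceph-dashboard-lab.jjland.local
</code></pre>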
<h3 id="heading-kustomizationyml-1"><code>kustomization.yml</code></h3>
<pre><code class="lang-yaml"><span class="hljs-attr">resources:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-string">ceph-cluster.yml</span>
  <span class="hljs-bullet">-</span> <span class="hljs-string">ingress-route-gui.yml</span>
  <span class="hljs-bullet">-</span> <span class="hljs-string">ceph-filesystem.yml</span>
  <span class="hljs-bullet">-</span> <span class="hljs-string">ceph-storageclass-delete.yml</span>
  <span class="hljs-bullet">-</span> <span class="hljs-string">ceph-storageclass-retain.yml</span>
</code></pre>
<ul>
<li><p><strong>Aggregates</strong> all the above files into a single overlay that ArgoCD can sync.</p>
</li>
<li><p><strong>Why Kustomize?</strong> Keeps base Helm installs separate from environment-specific definitions, making updates cleaner and more maintainable.</p>
</li>
</ul>
<h3 id="heading-step-4-sync-the-rook-ceph-application"><strong>Step 4: Sync the Rook-Ceph Application</strong></h3>
<p>Ready? Go ahead and click <strong>Sync</strong> in ArgoCD for the <code>rook-ceph</code> application.</p>
<p>This one’s going to take a little more time, and for good reason. There’s a lot happening under the hood.</p>
<p>When you sync, ArgoCD will:</p>
<ul>
<li><p>Deploy the <strong>Rook-Ceph Operator</strong>, which is responsible for watching and managing Ceph resources in your cluster</p>
</li>
<li><p>Install <strong>CephFS CSI drivers</strong>, RBAC roles, and CRDs needed to support persistent volumes</p>
</li>
<li><p>Apply your <code>CephCluster</code>, <code>CephFilesystem</code>, and <code>StorageClass</code> definitions via the Kustomize overlay</p>
</li>
</ul>
<p>But the real magic starts <strong>after the operator is running</strong>.</p>
<p>Once the operator is up, it will immediately start watching for additional Ceph custom resources in the <code>rook-ceph</code> namespace. When it discovers the <code>CephCluster</code> definition, it will:</p>
<ul>
<li><p>Initialize the <strong>monitors</strong> (MONs) for quorum</p>
</li>
<li><p>Deploy the <strong>manager</strong> (MGR) for handling cluster state and dashboard</p>
</li>
<li><p>Start spinning up the <strong>OSDs</strong> (Object Storage Daemons) using the storage devices you specified (in this case, <code>vdb</code> on each node)</p>
</li>
</ul>
<p>This process can take several minutes depending on your hardware, node performance, and the size of your disks.</p>
<blockquote>
<p><strong>How do you know it worked?</strong><br />The cluster is healthy when you see:</p>
<ul>
<li><p><strong>3 running OSD pods</strong>, one for each disk across your 3 nodes</p>
</li>
<li><p>The <code>rook-ceph</code> application status in ArgoCD shows <strong>“Healthy”</strong> and <strong>“Synced”</strong></p>
</li>
<li><p>Optionally: access the Ceph dashboard and verify health checks (covered earlier)</p>
</li>
</ul>
</blockquote>
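<p>If you prefer the CLI to the ArgoCD UI, a couple of <code>kubectl</code> commands confirm the same thing (assuming your kubeconfig points at the lab cluster):</p>
<pre><code class="lang-bash"># Watch the operator bring up the mon, mgr, and OSD pods
kubectl -n rook-ceph get pods -w

# Check the health the operator reports on the cluster resource
kubectl -n rook-ceph get cephcluster rook-ceph
</code></pre>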
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745381983708/e34e0bd9-e2a4-47e0-a7e4-73fec8ededf3.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745382021147/c5657415-988e-4e35-be7f-645ea2bc5bd9.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-troubleshooting-tips">Troubleshooting Tips</h2>
<p>Rook-Ceph is powerful, but complex. And with that complexity comes the potential for a lot of things to go sideways. I won’t dive into every failure mode here, but I’ll leave you with a few quick tips that can help when something’s not working as expected:</p>
<ul>
<li><p><strong>Use the ArgoCD UI to inspect pod logs.</strong><br />  Click into the <code>rook-ceph</code> application, navigate to the "PODS" tab, and use the logs view to get real-time output from key components like the operator, mons, OSDs, and mgr. Most issues will reveal themselves here.</p>
<p>  <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745382219139/9aa55b8c-e93d-4c20-b606-6af283220c6c.png" alt class="image--center mx-auto" /></p>
</li>
<li><p><strong>Resync the operator app to restart it.</strong><br />  If the cluster gets stuck or fails to initialize certain pieces, manually syncing the operator application in ArgoCD will redeploy the pod. This is often enough to force a retry or pull in updated CRDs.</p>
<p>  <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745382285748/83df49ee-635a-4ca7-bfb3-5e58f56657b0.png" alt class="image--center mx-auto" /></p>
</li>
<li><p><strong>Disk issues?</strong><br />  If Ceph is skipping disks or refusing to reuse them, it’s usually leftover metadata. Try running a full zap with <code>ceph-volume</code> or fall back to <code>wipefs</code>, <code>sgdisk</code>, and <code>dd</code> to fully clean the disk.</p>
</li>
</ul>
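<p>As a sketch, that fallback cleanup might look like the following. <strong>These commands are destructive</strong>: verify the device with <code>lsblk</code> first, and only ever run them against the dedicated Ceph disk (<code>vdb</code> in my lab), never the OS disk:</p>
<pre><code class="lang-bash"># DESTRUCTIVE: erases everything on /dev/vdb
wipefs --all /dev/vdb                                       # remove filesystem/RAID signatures
sgdisk --zap-all /dev/vdb                                   # clear GPT and MBR partition tables
dd if=/dev/zero of=/dev/vdb bs=1M count=100 oflag=direct    # zero the first 100 MB
</code></pre>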
<p>Congratulations! Once everything is green, you now have a fully functional Ceph storage backend—redundant, self-healing, and fully managed through GitOps.</p>
<h1 id="heading-secrets-management-external-secrets-hashicorp-vault">Secrets Management: External Secrets + HashiCorp Vault</h1>
<p>In any production platform, secrets management isn’t optional; it’s foundational. We're talking about things like API tokens, database passwords, SSH keys, and TLS certs. Storing these directly in your Git repo? Not an option. Hardcoding them into manifests? Definitely not.</p>
<p>That’s where <strong>External Secrets</strong> and <strong>HashiCorp Vault</strong> come in, and together, they solve this problem the right way.</p>
<h2 id="heading-what-is-hashicorp-vault">What is HashiCorp Vault?</h2>
<p><strong>Vault</strong> is a centralized secrets manager that securely stores, encrypts, and dynamically serves secrets to applications and users. It supports access control, auditing, and integration with identity systems and cloud providers. In this stack, Vault acts as the secure system of record for all sensitive data.</p>
<h2 id="heading-what-is-external-secrets">What is External Secrets?</h2>
<p><strong>External Secrets</strong> is a Kubernetes operator that bridges external secret stores (like Vault) with native Kubernetes <code>Secret</code> objects. It watches for custom resources like <code>ExternalSecret</code> and automatically pulls values from Vault into the cluster, keeping them updated and consistent without manual intervention.</p>
<h2 id="heading-why-network-automation-needs-this">Why Network Automation Needs This</h2>
<p>Network automation platforms—like NetBox, Nautobot, and custom Python tooling—frequently need access to sensitive data:</p>
<ul>
<li><p>Device credentials for SSH or API-based provisioning</p>
</li>
<li><p>Authentication tokens for systems like GitHub, Slack, or ServiceNow</p>
</li>
<li><p>Vaulted credentials for orchestrating changes via Ansible or Nornir</p>
</li>
</ul>
<p>You don’t want these values floating around in plaintext in Git. But you still want to <strong>declare your intent</strong> (what secrets are needed and where) in version control. This is especially critical when you're deploying infrastructure with GitOps and need environments to be reproducible and secure.</p>
<p>With Vault + External Secrets, you can:</p>
<ul>
<li><p>Keep the actual secret values <strong>outside of Git</strong></p>
</li>
<li><p>Still declare your <code>ExternalSecret</code> manifests <strong>in Git</strong> as part of your ArgoCD-managed platform</p>
</li>
<li><p>Let Kubernetes handle syncing and refreshing secrets automatically</p>
</li>
</ul>
<p>This pattern ensures your network automation stack is <strong>secure, scalable, and compliant</strong>, without losing any GitOps benefits.</p>
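<p>To make “declaring intent in Git” concrete, here’s a rough sketch of an <code>ExternalSecret</code> that pulls a device credential out of Vault. The store name, Vault path, and key names are hypothetical; we’ll wire up the actual backend after Vault is installed:</p>
<pre><code class="lang-yaml">apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: device-credentials         # hypothetical
  namespace: network-automation    # hypothetical
spec:
  refreshInterval: 1h              # re-sync from Vault every hour
  secretStoreRef:
    name: vault-backend            # hypothetical ClusterSecretStore
    kind: ClusterSecretStore
  target:
    name: device-credentials       # name of the resulting Kubernetes Secret
  data:
    - secretKey: password
      remoteRef:
        key: network/devices       # hypothetical Vault path
        property: password
</code></pre>
<p>Only this manifest lives in Git; the password itself never leaves Vault until the operator syncs it into the cluster.</p>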
<h2 id="heading-installing-external-secrets-operator">Installing External Secrets Operator</h2>
<p>Setting up External Secrets is straightforward and follows the same pattern we’ve used throughout this platform. In this section, we’ll deploy the External Secrets Operator using its official Helm chart with default values; no custom overlays or secret stores just yet.</p>
<h3 id="heading-step-1-add-the-helm-repo-1">Step 1: Add the Helm Repo</h3>
<p>First, add the External Secrets Helm repository to ArgoCD:</p>
<ol>
<li><p>In the ArgoCD UI, go to <strong>Settings → Repositories</strong></p>
</li>
<li><p>Click <strong>+ CONNECT REPO</strong></p>
</li>
<li><p>Fill in the following:</p>
<ul>
<li><p><strong>Type:</strong> Helm</p>
</li>
<li><p><strong>URL:</strong> <a target="_blank" href="https://charts.external-secrets.io"><code>https://charts.external-secrets.io</code></a></p>
</li>
<li><p><strong>Name (optional):</strong> external-secrets</p>
</li>
<li><p><strong>Project:</strong> Choose your ArgoCD project (e.g., <code>lab-home</code>)</p>
</li>
<li><p><strong>Authentication:</strong> Leave empty (this is a public repo)</p>
</li>
</ul>
</li>
<li><p>Click <strong>CONNECT</strong> to save</p>
</li>
</ol>
<h3 id="heading-step-2-create-the-argocd-application-1">Step 2: Create the ArgoCD Application</h3>
<p>Navigate to <strong>Applications → + NEW APP</strong>, and fill out the form like this:</p>
<ul>
<li><p><strong>Application Name:</strong> external-secrets</p>
</li>
<li><p><strong>Project:</strong> lab-home (or your equivalent)</p>
</li>
<li><p><strong>Sync Policy:</strong> Manual</p>
</li>
<li><p><strong>Repository URL:</strong> Select the Helm repo you just added</p>
</li>
<li><p><strong>Chart:</strong> <code>external-secrets</code></p>
</li>
<li><p><strong>Target Revision:</strong> latest (or a specific version like <code>0.16.1</code>)</p>
</li>
<li><p><strong>Cluster URL:</strong> <a target="_blank" href="https://kubernetes.default.svc"><code>https://kubernetes.default.svc</code></a></p>
</li>
<li><p><strong>Namespace:</strong> <code>external-secrets</code><br />  <em>(Check the box to create the namespace if it doesn’t exist)</em></p>
</li>
</ul>
<p>Click <strong>CREATE</strong> to finish.</p>
<h3 id="heading-step-3-sync-the-application">Step 3: Sync the Application</h3>
<p>Once the app is created, hit <strong>SYNC</strong> in the ArgoCD UI. This will:</p>
<ul>
<li><p>Deploy the External Secrets Operator into your cluster</p>
</li>
<li><p>Create the necessary CRDs and controller components</p>
</li>
<li><p>Make the <code>ExternalSecret</code>, <code>SecretStore</code>, and <code>ClusterSecretStore</code> resource types available</p>
</li>
</ul>
<p>You should see the app enter a <strong>Synced</strong> and <strong>Healthy</strong> state once everything is up and running. No custom values or overlays are needed at this stage.</p>
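<p>If you want to double-check from the CLI, the new resource types and controller pods should be visible (exact names can vary by chart version):</p>
<pre><code class="lang-bash">kubectl get crds | grep external-secrets.io
kubectl -n external-secrets get pods
</code></pre>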
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745417501986/ca879d52-362e-4da4-87df-14e68decf18b.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-installing-hashicorp-vault">Installing HashiCorp Vault</h2>
<p>Vault is our centralized secrets store, and in this setup we’re deploying it with two main goals in mind:</p>
<ul>
<li><p>Enable its built-in GUI for easy inspection and management</p>
</li>
<li><p>Ensure secret data is persisted using our Rook-Ceph-backed StorageClass</p>
</li>
</ul>
<p>To accomplish this, we’ll combine a Helm-based deployment with a Kustomize overlay that adds a Traefik <code>IngressRoute</code> for secure browser access.</p>
<h3 id="heading-step-1-add-the-helm-repo-2">Step 1: Add the Helm Repo</h3>
<p>Add the official HashiCorp Helm chart repo to ArgoCD:</p>
<ol>
<li><p>In the ArgoCD UI, go to <strong>Settings → Repositories</strong></p>
</li>
<li><p>Click <strong>+ CONNECT REPO</strong></p>
</li>
<li><p>Fill in:</p>
<ul>
<li><p><strong>Type:</strong> Helm</p>
</li>
<li><p><strong>URL:</strong> <code>https://helm.releases.hashicorp.com</code></p>
</li>
<li><p><strong>Project:</strong> <code>lab-home</code> (or whatever you're using)</p>
</li>
<li><p><strong>Authentication:</strong> Leave blank (public repo)</p>
</li>
</ul>
</li>
<li><p>Click <strong>CONNECT</strong> to save</p>
</li>
</ol>
<h3 id="heading-step-2-prepare-your-vault-application">Step 2: Prepare Your Vault Application</h3>
<p>Vault is more stateful and config-heavy than most apps, so we’re using <strong>two sources</strong> in our ArgoCD Application:</p>
<ul>
<li><p>A Helm chart to install Vault and enable persistent storage</p>
</li>
<li><p>A Kustomize overlay that exposes the Vault UI through Traefik</p>
</li>
</ul>
<p>Here’s an example Application manifest (adjust values as needed for your setup):</p>
<pre><code class="lang-yaml"><span class="hljs-attr">project:</span> <span class="hljs-string">lab-home</span>
<span class="hljs-attr">destination:</span>
  <span class="hljs-attr">server:</span> <span class="hljs-string">https://kubernetes.default.svc</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">vault</span>
<span class="hljs-attr">syncPolicy:</span>
  <span class="hljs-attr">syncOptions:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">CreateNamespace=true</span>
<span class="hljs-attr">sources:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">repoURL:</span> <span class="hljs-string">https://helm.releases.hashicorp.com</span>
    <span class="hljs-attr">chart:</span> <span class="hljs-string">vault</span>
    <span class="hljs-attr">targetRevision:</span> <span class="hljs-number">0.30</span><span class="hljs-number">.0</span> <span class="hljs-comment"># or latest stable</span>
    <span class="hljs-attr">helm:</span>
      <span class="hljs-attr">valueFiles:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-string">$values/apps/hashicorp-vault/values-lab.yml</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">repoURL:</span> <span class="hljs-string">https://github.com/leothelyon17/kubernetes-gitops-playground.git</span>
    <span class="hljs-attr">targetRevision:</span> <span class="hljs-string">HEAD</span>
    <span class="hljs-attr">path:</span> <span class="hljs-string">apps/hashicorp-vault/overlays/lab</span>
    <span class="hljs-attr">ref:</span> <span class="hljs-string">values</span>
</code></pre>
<blockquote>
<p><strong>Note:</strong> The Git repo and folder structure here are based on my <a target="_blank" href="https://github.com/leothelyon17/kubernetes-gitops-playground">kubernetes-gitops-playground</a>. If you’re using your own repo, be sure to adjust the <code>repoURL</code>, <code>path</code>, and <code>valueFiles</code> references accordingly.</p>
</blockquote>
<h3 id="heading-step-3-custom-helm-values">Step 3: Custom Helm Values</h3>
<p>In your Git repo, the file at <code>apps/hashicorp-vault/values-lab.yml</code> should enable:</p>
<ul>
<li><p>The Vault UI (<code>ui: true</code>)</p>
</li>
<li><p>Persistent storage via the Rook-Ceph-backed <code>StorageClass</code> you created earlier</p>
</li>
</ul>
<p>Example configuration:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">server:</span>

  <span class="hljs-attr">dataStorage:</span>
    <span class="hljs-attr">enabled:</span> <span class="hljs-literal">true</span>
    <span class="hljs-comment"># Size of the PVC created</span>
    <span class="hljs-attr">size:</span> <span class="hljs-string">1Gi</span>
    <span class="hljs-comment"># Location where the PVC will be mounted.</span>
    <span class="hljs-attr">mountPath:</span> <span class="hljs-string">"/vault/data"</span>
    <span class="hljs-comment"># Name of the storage class to use.  If null it will use the</span>
    <span class="hljs-comment"># configured default Storage Class.</span>
    <span class="hljs-attr">storageClass:</span> <span class="hljs-string">rook-cephfs-retain</span>
    <span class="hljs-comment"># Access Mode of the storage device being used for the PVC</span>
    <span class="hljs-attr">accessMode:</span> <span class="hljs-string">ReadWriteOnce</span>

<span class="hljs-comment"># Vault UI</span>
<span class="hljs-attr">ui:</span>
  <span class="hljs-attr">enabled:</span> <span class="hljs-literal">true</span>
</code></pre>
<h3 id="heading-step-4-expose-vault-securely-with-traefik">Step 4: Expose Vault Securely with Traefik</h3>
<p>In your <code>apps/hashicorp-vault/overlays/lab</code> directory, define a Kustomize file to expose the UI via Traefik.</p>
<p>Example: <code>kustomization.yml</code></p>
<pre><code class="lang-yaml"><span class="hljs-attr">resources:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-string">ingress-route-gui.yml</span>
</code></pre>
<p>And in <code>ingress-route-gui.yml</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">traefik.io/v1alpha1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">IngressRoute</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">vault-dashboard</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">vault</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">entryPoints:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">websecure</span>
  <span class="hljs-attr">routes:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">match:</span> <span class="hljs-string">Host(`vault-lab.jjland.local`)</span> <span class="hljs-comment"># EXAMPLE</span>
      <span class="hljs-attr">kind:</span> <span class="hljs-string">Rule</span>
      <span class="hljs-attr">services:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">vault</span>
          <span class="hljs-attr">port:</span> <span class="hljs-number">8200</span>
</code></pre>
<blockquote>
<p><strong>Note:</strong> <code>vault-lab.jjland.local</code> is an example hostname used in my lab.<br />If you're following along exactly, feel free to use it, just be sure to add a local DNS or <code>/etc/hosts</code> entry that maps this to your cluster’s ingress IP.<br />Otherwise, replace this hostname with one appropriate for your environment.</p>
</blockquote>
<h3 id="heading-step-5-sync-the-application">Step 5: Sync the Application</h3>
<p>Once your Helm values and Kustomize overlay are in place and committed to Git, go ahead and <strong>sync the Vault application from ArgoCD</strong>.</p>
<p>ArgoCD will deploy all Vault components into the <code>vault</code> namespace, including:</p>
<ul>
<li><p>The StatefulSet for the Vault server</p>
</li>
<li><p>The service account, RBAC roles, and services</p>
</li>
<li><p>A PersistentVolumeClaim (PVC) for storing Vault data</p>
</li>
<li><p>Your custom IngressRoute for exposing the GUI</p>
</li>
</ul>
<p>After syncing, head to the <strong>Vault app in ArgoCD</strong> to verify the following:</p>
<ul>
<li><p>The app status should be <strong>Synced</strong></p>
</li>
<li><p>The PVC should be <strong>Bound</strong> and <strong>Healthy</strong></p>
</li>
<li><p>The main Vault pod will likely remain in a <strong>Progressing</strong> state; this is expected</p>
</li>
</ul>
<p>That <strong>“Progressing” status is normal</strong> because Vault isn’t fully initialized yet. It won’t report itself as ready until it has been <strong>manually initialized and unsealed</strong> for the first time.</p>
<p>Before moving forward, it’s a good idea to:</p>
<ul>
<li><p>Inspect the pod logs in the ArgoCD UI if anything seems stuck</p>
</li>
<li><p>Check <code>kubectl get pvc -n vault</code> to confirm the PVC is attached and healthy</p>
</li>
<li><p>Use <code>kubectl describe pod</code> or <code>describe pvc</code> to troubleshoot issues</p>
</li>
</ul>
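<p>You can also ask Vault directly, assuming the chart’s default StatefulSet naming (<code>vault-0</code>). Seeing <code>Initialized: false</code> and <code>Sealed: true</code> is expected at this stage:</p>
<pre><code class="lang-bash"># vault status exits non-zero while sealed; that's normal here
kubectl -n vault exec vault-0 -- vault status
</code></pre>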
<p>If all looks good, navigate to the Vault UI in your browser:</p>
<pre><code class="lang-bash">https://vault-lab.jjland.local <span class="hljs-comment"># EXAMPLE</span>
</code></pre>
<blockquote>
<p>If you’re using a different hostname, be sure you’ve created the appropriate DNS or <code>/etc/hosts</code> entry.</p>
</blockquote>
<p>From the web UI, you can <strong>initialize Vault</strong>, generate unseal keys, and perform the first unseal operation, all interactively.</p>
<h2 id="heading-initializing-vault-through-the-gui">Initializing Vault Through the GUI</h2>
<p>Once the Vault UI is accessible, it’s time to initialize the system. Vault doesn’t become “ready” until this step is completed, and it only needs to be done once per cluster.</p>
<h3 id="heading-step-1-open-the-vault-ui">Step 1: Open the Vault UI</h3>
<p>Navigate to the Vault dashboard in your browser:</p>
<pre><code class="lang-bash">https://vault-lab.jjland.local
</code></pre>
<p>(Or your custom hostname if you’re using a different setup.)</p>
<p>You’ll be presented with a message that Vault has not yet been initialized. Click the <strong>“Initialize”</strong> button to begin the process.</p>
<h3 id="heading-step-2-generate-unseal-keys">Step 2: Generate Unseal Keys</h3>
<p>The GUI will prompt you to configure <strong>key shares</strong> and <strong>key threshold</strong>. Leave these at the defaults unless you have a specific security model in mind:</p>
<ul>
<li><p><strong>Key Shares:</strong> <code>5</code></p>
</li>
<li><p><strong>Key Threshold:</strong> <code>3</code></p>
</li>
</ul>
<p>This means Vault will generate 5 unseal keys, and any 3 of them will be required to unseal the Vault.</p>
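<p>For reference, the CLI equivalent of this step (run inside the Vault pod) would be roughly:</p>
<pre><code class="lang-bash">kubectl -n vault exec -it vault-0 -- vault operator init \
  -key-shares=5 -key-threshold=3
</code></pre>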
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745419735235/5fa7e30b-b7ec-4361-ab8b-b9c48a24045a.png" alt class="image--center mx-auto" /></p>
<p>Click <strong>"Initialize"</strong> to proceed. Vault will generate a JSON file containing:</p>
<ul>
<li><p>The root token (used to log in as admin)</p>
</li>
<li><p>All 5 unseal keys</p>
</li>
</ul>
<p><strong>Download this file immediately</strong> and store it in a secure location. These keys cannot be recovered later.</p>
<blockquote>
<p>⚠️ <strong>Do not skip this download</strong>. If you lose these keys before unsealing, you’ll have to wipe and redeploy Vault from scratch.</p>
</blockquote>
<h3 id="heading-step-3-unseal-the-vault">Step 3: Unseal the Vault</h3>
<p>After downloading the key file, Vault will prompt you to enter the unseal keys one by one.</p>
<ul>
<li><p>Copy a single unseal key from the JSON file</p>
</li>
<li><p>Paste it into the field and click <strong>“Unseal”</strong></p>
</li>
<li><p>Repeat with two more keys (for a total of 3)</p>
</li>
</ul>
<p>Once the required threshold is met, Vault will unlock and become <strong>active</strong>.</p>
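<p>The CLI equivalent, repeated with three different keys until the threshold is met:</p>
<pre><code class="lang-bash">kubectl -n vault exec -it vault-0 -- vault operator unseal &lt;unseal-key&gt;
</code></pre>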
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745419845341/08b5e59b-e7a7-434c-9a6d-af3d6c1a546e.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-step-4-log-in-with-the-root-token">Step 4: Log In with the Root Token</h3>
<p>After unsealing, return to the login screen and paste in the <strong>root token</strong> from your downloaded JSON file.</p>
<p>Once logged in, you’ll have full admin access to Vault.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745419885929/d64ca1aa-e54a-4bf4-9f49-9c72ee525ca7.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-step-5-verify-in-argocd">Step 5: Verify in ArgoCD</h3>
<p>Flip back to the ArgoCD UI and check the status of the Vault application. At this point, the main pod should switch from <strong>Progressing</strong> to <strong>Healthy</strong>, and your application should show as <strong>fully operational</strong>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745419923132/0acf4555-4588-45e6-ae53-fdcc16245b2c.png" alt class="image--center mx-auto" /></p>
<p>You're now ready to configure Vault as a backend for External Secrets, so your GitOps-managed workloads can securely retrieve credentials, tokens, and other sensitive data on demand.</p>
<p>This completes Part 2 of this series.</p>
<h1 id="heading-summary-amp-whats-next">Summary &amp; What’s Next</h1>
<p>In Part 2, we took our GitOps foundation and turned it into a functional, production-capable platform. We integrated critical infrastructure components like MetalLB for external access, Traefik for routing, Rook-Ceph for persistent storage, and a full-fledged secrets management stack using External Secrets and HashiCorp Vault, all deployed declaratively using ArgoCD.</p>
<p>At this point, you have a GitOps-powered Kubernetes environment that’s capable of:</p>
<ul>
<li><p>Exposing services securely with external IPs and ingress rules</p>
</li>
<li><p>Persisting data across workloads using Ceph-backed volumes</p>
</li>
<li><p>Managing secrets securely without embedding them in Git</p>
</li>
<li><p>Deploying and managing infrastructure the same way you'll deploy apps: as code</p>
</li>
</ul>
<p>This platform is now ready to host real-world applications, whether it’s NetBox, Nautobot, or custom tooling built for your network automation workflows.</p>
<p>In <strong>Part 3</strong>, we’ll finally do just that: deploy a real application on top of everything we’ve built. I haven’t finalized which app we’ll use yet, but it’ll be something practical and network-engineer focused. Stay tuned and thank you for reading!</p>
]]></content:encoded></item><item><title><![CDATA[Bridging the Gap: GitOps for Network Engineers - Part 1]]></title><description><![CDATA[Intro
Over the past 6–9 months, my career and perspective on technology have shifted dramatically. I’ve found myself drifting away from my views of traditional networking and increasingly seeing everything through the lens of applications, and treati...]]></description><link>https://blog.nerdylyonsden.io/bridging-the-gap-gitops-for-network-engineers-part-1</link><guid isPermaLink="true">https://blog.nerdylyonsden.io/bridging-the-gap-gitops-for-network-engineers-part-1</guid><category><![CDATA[Kubernetes]]></category><category><![CDATA[ArgoCD]]></category><category><![CDATA[Network Automation]]></category><category><![CDATA[Git]]></category><category><![CDATA[gitops]]></category><category><![CDATA[Network Engineering]]></category><category><![CDATA[networking]]></category><dc:creator><![CDATA[Jeffrey Lyon]]></dc:creator><pubDate>Wed, 23 Apr 2025 15:32:57 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1744512989708/13187ff9-4b83-46b1-b4d6-fa107f0e9ea1.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-intro">Intro</h1>
<p>Over the past 6–9 months, my career and perspective on technology have shifted dramatically. I’ve found myself drifting away from my views of traditional networking and increasingly seeing everything through the lens of applications, and treating it accordingly. To meet the demands of my current role, a colleague introduced me to the concept of GitOps and suggested we integrate it into our network automation workflows. At the time, I had no idea what GitOps even was. But a few months later, I’m wondering why I didn’t adopt this approach much earlier in my career. Within that short span, I had built a complete platform capable of hosting all of our network automation tools—NetBox, Nautobot, custom Python scripts, databases, monitoring stacks, and even Clabernetes (containerlab) for running virtual topologies. All self-contained, all deployed declaratively, and all benefiting from the GitOps principles I’ll be breaking down throughout this article.</p>
<h2 id="heading-so-what-is-gitops">So… what is GitOps?</h2>
<p>At its core, GitOps is a way of managing infrastructure and applications using Git as the single source of truth. Think of it like this: instead of logging into systems and manually making changes (we’ve all been there), you define your desired state in code—YAML, JSON, whatever floats your repo, and store that in Git. From there, automation tools take over, constantly reconciling what's deployed with what lives in the repo. If something drifts or breaks, the system can alert you, fix it, or at least give you a clean way to roll back.</p>
<p>In traditional terms, it’s like having a version-controlled config file for every part of your infrastructure, and having robots to deploy it all for you, exactly how you wrote it.</p>
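<p>To make that concrete, here’s a minimal (hypothetical) example of what a “desired state” file in Git might look like, a plain Kubernetes Deployment that a GitOps controller would apply and keep enforced:</p>
<pre><code class="lang-yaml"># deployment.yaml - a hypothetical desired-state file stored in Git.
# The GitOps controller applies this and reconciles the cluster against it.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app
  namespace: tools
spec:
  replicas: 2
  selector:
    matchLabels:
      app: demo-app
  template:
    metadata:
      labels:
        app: demo-app
    spec:
      containers:
        - name: demo-app
          image: nginx:1.27   # pin image tags so Git history shows exactly what ran
          ports:
            - containerPort: 80
</code></pre>
<p>Bump <code>replicas</code> in a commit and the controller scales the Deployment; revert the commit and it scales right back. That’s the whole loop.</p>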
<h2 id="heading-why-should-network-engineersorgs-care">Why Should Network Engineers/Orgs Care?</h2>
<p>Historically, network automation has been about scripts, Python, maybe some Ansible sprinkled on top. But the problem with that approach is scale, visibility, and consistency. You might have 10 engineers all running different scripts in slightly different ways. Who knows what changed and when?</p>
<p>GitOps brings the same rigor DevOps teams apply to applications into the world of network automation. Imagine managing Nautobot or NetBox deployments through Git. Want to roll out a plugin, change a config, or update a container? You create a pull request, get it reviewed, and once it’s merged, it’s live in production (via ArgoCD, Flux, or whatever your GitOps controller is).</p>
<p>Even beyond the apps themselves, this mindset works for deploying the tools that generate your configs, run validations, or even trigger device changes. You're turning networking workflows into a pipeline. And once that happens, you get auditability, consistency, and less of that "it works on my machine" nonsense.</p>
<p>This is <strong>Part 1</strong> of a series aimed at helping network engineers get hands-on with GitOps and understand the core components involved in building a modern automation platform. In this first part, we’ll focus on the foundational concepts of GitOps, the tools that power it, and walk through installing ArgoCD as the GitOps engine for our platform. Even if you're not deploying anything just yet, the goal here is to bridge the knowledge gap, so network engineers can better understand the deployment process and begin delivering their own code and tools in a structured, scalable way. At the very least, this knowledge helps you communicate more effectively with DevOps and Platform Engineering teams, making it easier to explain what you need when it comes to production-ready deployments.</p>
<p>In <strong>Part 2</strong>, we’ll pick up by deploying core infrastructure components—like MetalLB, Traefik, persistent storage, and secrets management—using the GitOps workflow established here.</p>
<p><strong><em>For those interested in exploring the configurations and examples discussed in this article, all the code and resources are available in my GitHub repository:</em></strong> <a target="_blank" href="https://github.com/leothelyon17/kubernetes-gitops-playground"><strong><em>kubernetes-gitops-playground</em></strong></a><strong><em>.</em></strong></p>
<p><strong><em>This repository serves as a comprehensive reference for setting up a GitOps-driven Kubernetes environment. It includes structured directories for applications like Nautobot, configurations for ArgoCD, and various Kubernetes add-ons. The repository is designed to be a practical guide for network engineers aiming to implement GitOps methodologies in their infrastructure.​</em></strong></p>
<p><strong><em>Feel free to explore the repository to gain insights into the practical implementation of the concepts discussed here.</em></strong></p>
<h1 id="heading-the-gitops-ecosystem-a-network-automation-perspective">The GitOps Ecosystem: A Network Automation Perspective</h1>
<p>Here’s a high-level breakdown of the components I use to power my GitOps-driven automation platform. This list reflects a practical, production-minded approach to deploying and managing applications, especially for network engineers looking to build, scale, or just better understand modern automation workflows.</p>
<p>Each component below plays a specific role in the platform, helping ensure security, flexibility, repeatability, and operational clarity.</p>
<h4 id="heading-kubernetes-cluster-obviously"><strong>Kubernetes Cluster (Obviously)</strong></h4>
<p>The foundation of everything. Kubernetes orchestrates and runs your containerized applications, managing scaling, availability, and resource utilization.</p>
<h4 id="heading-git-provider-github"><strong>Git Provider (GitHub)</strong></h4>
<p>The single source of truth. All manifests, Helm values, and Kustomize overlays live here. Every change is tracked, reviewed, and version-controlled.</p>
<h4 id="heading-argocd"><strong>ArgoCD</strong></h4>
<p>This is the GitOps engine of the platform. It continuously syncs application state from Git repositories into the cluster, ensuring what’s deployed always matches what’s defined in code.</p>
<h4 id="heading-cluster-load-balancing-metallb"><strong>Cluster Load Balancing (MetalLB)</strong></h4>
<p>MetalLB enables load-balanced services in bare-metal or home lab environments by assigning external IPs to services that require them.</p>
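<p>For reference, MetalLB’s configuration is itself just a couple of manifests you can keep in Git. The pool name and address range below are made up, so adjust them to your lab network:</p>
<pre><code class="lang-yaml"># A hypothetical MetalLB layer-2 setup: an address pool plus an advertisement
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: lab-pool
  namespace: metallb-system
spec:
  addresses:
    - 192.168.1.240-192.168.1.250   # IPs MetalLB may hand out to LoadBalancer services
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: lab-l2
  namespace: metallb-system
spec:
  ipAddressPools:
    - lab-pool
</code></pre>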
<h4 id="heading-traefik-ingressroute"><strong>Traefik (IngressRoute)</strong></h4>
<p>Traefik is a powerful and flexible ingress controller that routes external traffic into your Kubernetes cluster using custom IngressRoute CRDs. It gives you fine-grained control over how services are exposed, supports TLS, and integrates smoothly with GitOps workflows.</p>
<p><strong>Note:</strong> You can use NodePorts if you’re not ready for an ingress controller and want a simpler setup, but that approach isn’t ideal for production use and lacks the flexibility and security that Traefik provides.</p>
<h4 id="heading-persistent-storage-rook-ceph"><strong>Persistent Storage (Rook + Ceph)</strong></h4>
<p>Apps like network automation platforms often require persistent volumes. Rook with Ceph provides resilient, scalable storage within the cluster, critical for stateful services.</p>
<h4 id="heading-secrets-vault-ie-hashicorp-vault"><strong>Secrets Vault (e.g., HashiCorp Vault)</strong></h4>
<p>A secure place to store sensitive information like API tokens, database credentials, and TLS certificates, outside the cluster and outside of Git.</p>
<h4 id="heading-secrets-operator-ie-external-secrets"><strong>Secrets Operator (e.g., External Secrets)</strong></h4>
<p>This bridges the gap between Vault and Kubernetes. It watches your external secret store and injects the data into Kubernetes Secrets based on declarative manifests.</p>
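<p>As a sketch of what that looks like in practice (names, Vault paths, and the SecretStore below are all hypothetical), an <code>ExternalSecret</code> manifest ties a Vault entry to a generated Kubernetes Secret:</p>
<pre><code class="lang-yaml">apiVersion: external-secrets.io/v1beta1   # API version varies by operator release
kind: ExternalSecret
metadata:
  name: nautobot-db-creds
  namespace: tools
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault-backend        # a ClusterSecretStore pointing at Vault (assumed to exist)
    kind: ClusterSecretStore
  target:
    name: nautobot-db-creds    # the Kubernetes Secret the operator creates
  data:
    - secretKey: password
      remoteRef:
        key: network-automation/nautobot   # path in Vault
        property: db_password
</code></pre>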
<h4 id="heading-kubernetes-secrets"><strong>Kubernetes Secrets</strong></h4>
<p>The native format for storing and referencing secrets inside Kubernetes workloads. These are the final form of secrets that your apps consume at runtime.</p>
<h4 id="heading-helm-amp-custom-values"><strong>Helm &amp; Custom Values</strong></h4>
<p>Helm acts as the package manager for Kubernetes, simplifying the deployment of complex, production-ready applications through reusable charts. By supplying custom values, you can easily override default configurations, tuning things like ports, storage, resource limits, and app-specific settings to fit your environment without modifying the underlying chart.</p>
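<p>As a rough illustration (these keys are typical but chart-specific, so check your chart’s defaults), a custom values file might override just the handful of settings you care about:</p>
<pre><code class="lang-yaml"># values.yaml - hypothetical overrides layered on a chart's defaults
service:
  type: ClusterIP            # expose via an ingress controller instead of a NodePort
persistence:
  enabled: true
  storageClass: rook-ceph-block   # assumes the Rook/Ceph StorageClass from this stack
  size: 10Gi
resources:
  requests:
    cpu: 250m
    memory: 512Mi
</code></pre>
<p>You’d then deploy with something like <code>helm install my-app repo/chart -f values.yaml</code>, or point ArgoCD at the same values file.</p>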
<h4 id="heading-kustomize"><strong>Kustomize</strong></h4>
<p>Kustomize lets you customize Kubernetes manifests without copying or editing the original files. It uses overlays to manage environment-specific changes, like different configs for dev, test, or prod. This helps keep your Git repo organized and clean.</p>
<p>You can also use Kustomize alongside Helm by referencing rendered Helm charts as a base, then layering custom configs on top, giving you the best of both tools.</p>
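<p>A minimal sketch of that combination (the chart name, version, and patch file here are illustrative, and rendering Helm from Kustomize requires the <code>--enable-helm</code> flag or the equivalent ArgoCD setting):</p>
<pre><code class="lang-yaml"># kustomization.yaml - hypothetical overlay that renders a Helm chart as its base
helmCharts:
  - name: nautobot
    repo: https://nautobot.github.io/helm-charts/
    version: 2.1.0
    releaseName: nautobot
    namespace: tools
    valuesFile: values.yaml
patches:
  - path: patch-replicas.yaml   # environment-specific tweak layered on top
</code></pre>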
<h1 id="heading-requirements-amp-housekeeping"><strong>Requirements &amp; Housekeeping</strong></h1>
<p>Before we dive into the individual components of the platform, there are a few things that need to be in place:</p>
<ul>
<li><p><strong>Kubernetes Cluster:</strong> I won’t be covering how to stand up a Kubernetes cluster in this post. If you need help with that, check out <a target="_blank" href="https://blog.nerdylyonsden.io/kubernetes-and-containerlab-part-1-building-a-cluster">this earlier article I wrote</a> that walks through the setup. This also isn’t a Kubernetes 101 guide, you’ll need a solid understanding of how Kubernetes works, especially when it comes to common resource types like Deployments, Services, Secrets, ConfigMaps, and PersistentVolumeClaims. <code>kubectl</code> and <code>helm</code> should also be installed and working against the cluster.</p>
</li>
<li><p><strong>Git &amp; GitHub (or another Git provider):</strong> This isn’t a Git 101 tutorial. You’ll need some working knowledge of Git and GitHub, and you should already have an account set up. If you’re using another provider (like GitLab or Bitbucket), that’ll work too.</p>
</li>
<li><p><strong>Persistent Storage:</strong> While persistent storage is part of the overall stack, this post won’t go deep into the setup. I’ll touch on what’s needed to support the apps, but I’m saving the storage deep dive for a separate article.</p>
</li>
<li><p><strong>Linux &amp; Bash:</strong> You should be comfortable using Linux and working in a bash shell. There will be commands, file edits, and troubleshooting that assume you’re not new to the terminal.</p>
</li>
<li><p><strong>IDE (like VSCode):</strong> You’ll need a code editor to work with YAML, Helm values, and general GitOps structure. VSCode is a solid choice, it has excellent Git integration and Kubernetes plugins that can speed up your workflow.</p>
</li>
</ul>
<h2 id="heading-my-setup">My Setup</h2>
<p>My lab runs on a three-node Rocky Linux 9 cluster, the same setup used in my other blog posts. Most other major distributions should work much the same, but if you’re following along closely, Rocky and Red Hat are the better OS choices.</p>
<p>If you’re good on those fronts, let’s keep going.</p>
<h1 id="heading-argocd-your-gitops-automation-engine"><strong>ArgoCD: Your GitOps Automation Engine</strong></h1>
<p>Now that your Kubernetes cluster is built and your GitHub account is ready, it's time to dive into the heart of GitOps: <strong>ArgoCD</strong>.</p>
<h3 id="heading-what-is-argocd"><strong>What is ArgoCD?</strong></h3>
<p>ArgoCD (short for <em>Argo Continuous Delivery</em>) is a GitOps controller for Kubernetes. It continuously monitors Git repositories and ensures the live state of your cluster matches the declared state in Git. If something drifts, like someone manually edits a resource, ArgoCD can detect that and reconcile it back to the desired state stored in Git. It’s declarative, automated, and very production-friendly.</p>
<p>In simple terms: <strong>Git is the source of truth, and ArgoCD makes sure your cluster does what Git says.</strong></p>
<h3 id="heading-where-argocd-fits-in-the-gitops-model"><strong>Where ArgoCD Fits in the GitOps Model</strong></h3>
<p>GitOps workflows revolve around a few key principles:</p>
<ul>
<li><p><strong>Version control as truth:</strong> All manifests live in Git.</p>
</li>
<li><p><strong>Pull-based automation:</strong> Kubernetes doesn’t wait for you to push changes, it pulls from Git.</p>
</li>
<li><p><strong>Observability and rollback:</strong> You can track exactly what changed, when, and by whom. Rolling back is as easy as reverting a commit.</p>
</li>
</ul>
<p>ArgoCD is the engine that powers this model. It watches your repo, compares it to what’s actually running in your cluster, and syncs everything up, either automatically or on demand. It also gives you a nice web UI, CLI, and API for managing applications and monitoring sync status.</p>
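<p>To ground that, here’s roughly what an ArgoCD <code>Application</code> manifest looks like (the repo is mine, but the path and sync options are illustrative):</p>
<pre><code class="lang-yaml">apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: metallb
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/leothelyon17/kubernetes-gitops-playground.git
    targetRevision: main
    path: apps/metallb        # illustrative path within the repo
  destination:
    server: https://kubernetes.default.svc
    namespace: metallb-system
  syncPolicy:
    automated:
      prune: true      # delete resources removed from Git
      selfHeal: true   # revert manual drift automatically
    syncOptions:
      - CreateNamespace=true
</code></pre>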
<p>On a personal note—<strong>I freaking love ArgoCD</strong>! When I was first dipping my toes into GitOps and only had a surface-level understanding of Kubernetes, ArgoCD was an absolute game changer. Being able to visually see every single Kubernetes object that makes up an app, and how they relate to each other, leveled up my Kubernetes knowledge fast. The fact that you can pause, sync, delete, or rebuild individual resources with basically the flip of a switch? Insanely useful. And not having to constantly hammer out <code>kubectl</code> commands just to check logs or dig into the YAML? Crazy time saver! Seriously, it’s one of the most valuable tools in this whole setup, and in tech today, period.</p>
<h3 id="heading-installing-argocd-on-rocky-linux-9"><strong>Installing ArgoCD on Rocky Linux 9</strong></h3>
<p>Let’s walk step-by-step through a basic installation of ArgoCD and its CLI. These steps assume you already have:</p>
<ul>
<li><p><code>kubectl</code> configured and pointing to your Kubernetes cluster</p>
</li>
<li><p><code>helm</code> installed (needed for app creation later)</p>
</li>
<li><p>Root or sudo access on your Rocky Linux 9 system</p>
</li>
</ul>
<h4 id="heading-step-1-install-argocd-into-the-cluster"><strong>Step 1: Install ArgoCD into the Cluster</strong></h4>
<p>We'll install ArgoCD in its own namespace using the official manifests:</p>
<pre><code class="lang-bash">kubectl create namespace argocd

kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
</code></pre>
<p>This will install all the ArgoCD components: API server, controller, repo server, and UI server.</p>
<p>To confirm the installation was successful, run the command below.</p>
<pre><code class="lang-bash">kubectl get pods -n argocd
</code></pre>
<p>You should see all the argocd pods in a ‘Running’ state after 30 seconds or so -</p>
<pre><code class="lang-bash">NAME                                               READY   STATUS    RESTARTS     AGE
argocd-application-controller-0                    1/1     Running   0            1m
argocd-applicationset-controller-dc47f7989-77ztg   1/1     Running   0            1m
argocd-dex-server-bc9bc7d65-68rxn                  1/1     Running   0            1m
argocd-notifications-controller-5698dbd744-7vmzc   1/1     Running   0            1m
argocd-redis-656948fbd6-zfgjd                      1/1     Running   0            1m
argocd-repo-server-74c4cb6cc5-pnxfv                1/1     Running   0            1m
argocd-server-856f78f5df-cxh9h                     1/1     Running   0            1m
</code></pre>
<h4 id="heading-step-2-expose-the-argocd-ui"><strong>Step 2: Expose the ArgoCD UI</strong></h4>
<p>By default, ArgoCD’s API server is only accessible inside the cluster. For testing or lab use, you can expose it using a <code>NodePort</code> or via your ingress controller (like Traefik):</p>
<p><strong>Option A: NodePort (quick and dirty)</strong></p>
<pre><code class="lang-bash">kubectl patch svc argocd-server -n argocd -p <span class="hljs-string">'{"spec": {"type": "NodePort"}}'</span>
</code></pre>
<p>Find the port:</p>
<pre><code class="lang-bash">kubectl get svc argocd-server -n argocd
</code></pre>
<p>NodePorts are usually assigned within the <strong>30000–32767</strong> range. Look for the <code>PORT(S)</code> column in the output: an entry like <code>8080:32678/TCP</code> means ArgoCD is accessible on port <strong>32678</strong> of any node in the cluster.</p>
<p>Then access the UI at:<br /><code>http://&lt;node-ip&gt;:&lt;nodeport&gt;</code></p>
<p><strong>Option B: IngressRoute (we’ll add this later once Traefik is installed)</strong></p>
<p>If you're planning to use Traefik as your ingress controller, you'll eventually want to expose ArgoCD using an IngressRoute. This is the more GitOps-friendly approach because your ingress config, just like everything else, can live in Git and be managed declaratively.</p>
<p>That said, you probably don’t have an ingress controller installed yet, so this option won’t work just yet. No problem, start with the NodePort method for now, and once Traefik is in place, switching over to an IngressRoute is quick and clean. It fits perfectly into the GitOps model and keeps your exposure configs version-controlled along with the rest of your stack.</p>
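<p>For a preview, an IngressRoute for ArgoCD might look something like this (the hostname is hypothetical, and the <code>apiVersion</code> depends on your Traefik release; note that <code>argocd-server</code> serves HTTPS by default, so you may need to run it with <code>--insecure</code> behind an ingress that terminates TLS):</p>
<pre><code class="lang-yaml">apiVersion: traefik.io/v1alpha1   # older Traefik releases use traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  name: argocd-server
  namespace: argocd
spec:
  entryPoints:
    - websecure
  routes:
    - kind: Rule
      match: Host(`argocd.example.lab`)   # hypothetical hostname
      services:
        - name: argocd-server
          port: 80
</code></pre>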
<h4 id="heading-step-3-get-the-initial-admin-password"><strong>Step 3: Get the Initial Admin Password</strong></h4>
<p>The default username is <code>admin</code>. To get the initial password:</p>
<pre><code class="lang-bash">kubectl get secret argocd-initial-admin-secret -n argocd \
  -o jsonpath=<span class="hljs-string">"{.data.password}"</span> | base64 -d &amp;&amp; <span class="hljs-built_in">echo</span>
</code></pre>
<p>Log in via the web UI or CLI using this password.</p>
<h4 id="heading-step-4-install-the-argocd-cli"><strong>Step 4: Install the ArgoCD CLI</strong></h4>
<p>Install the CLI to interact with ArgoCD from your terminal.</p>
<pre><code class="lang-bash">VERSION=$(curl -s https://api.github.com/repos/argoproj/argo-cd/releases/latest \
  | grep tag_name | cut -d <span class="hljs-string">'"'</span> -f 4)

curl -sSL -o argocd <span class="hljs-string">"https://github.com/argoproj/argo-cd/releases/download/<span class="hljs-variable">${VERSION}</span>/argocd-linux-amd64"</span>

chmod +x argocd
sudo mv argocd /usr/<span class="hljs-built_in">local</span>/bin/
</code></pre>
<p>Confirm it’s installed:</p>
<pre><code class="lang-bash">argocd version
</code></pre>
<h4 id="heading-step-5-log-in-using-the-cli"><strong>Step 5: Log In Using the CLI</strong></h4>
<pre><code class="lang-bash">argocd login &lt;ARGOCD-SERVER&gt; --username admin --password &lt;PASSWORD&gt;
</code></pre>
<p>Use the hostname or IP that maps to your <code>argocd-server</code> service.</p>
<h2 id="heading-alternative-installation-method-github-actions-runner">Alternative Installation Method (GitHub Actions Runner)</h2>
<p>If you’ve followed along this far, you’re probably realizing we could automate a good chunk of this platform bootstrapping. And yes, we absolutely can.</p>
<p>I've created a <a target="_blank" href="https://github.com/leothelyon17/kubernetes-gitops-playground/blob/main/.github/workflows/install-argocd.yml">GitHub Actions workflow</a> that installs ArgoCD (and its CLI), exposes it, configures custom admin users, and even adds the Kubernetes cluster back into ArgoCD, all automatically. This method is particularly useful if you're managing multiple clusters or frequently rebuilding your platform. Feel free to use this for assistance in setting up your own runner. Here’s how it works.</p>
<h3 id="heading-requirements">Requirements</h3>
<p>To use this workflow, you’ll need:</p>
<ul>
<li><p>A self-hosted GitHub Actions runner that has access to your Kubernetes cluster</p>
</li>
<li><p><code>kubectl</code> and <code>python3.12</code> installed on the runner</p>
</li>
<li><p>A valid Kubeconfig file for the cluster you're targeting</p>
</li>
<li><p>GitHub repository secrets and variables configured properly:</p>
<ul>
<li><p><code>ARGOCD_ADMIN_USER</code>, <code>ARGOCD_ADMIN_PASSWORD</code> – default admin login</p>
</li>
<li><p><code>ARGOCD_MY_ADMIN_USER</code>, <code>ARGOCD_MY_ADMIN_PASSWORD</code> – a secondary, more permanent admin account</p>
</li>
<li><p><code>PAT_TOKEN</code> – GitHub personal access token for storing encrypted secrets per environment</p>
</li>
<li><p>GitHub Actions environment variables like <code>ARGOCD_PORT</code> and <code>ARGOCD_SERVER</code> (the IP or DNS hostname of a Kubernetes control-plane node)</p>
</li>
</ul>
</li>
</ul>
<h3 id="heading-supporting-workflows-worth-noting">Supporting Workflows Worth Noting</h3>
<p>If you're wondering how this all connects behind the scenes, the repo also includes a few helper workflows that make this setup much smoother.</p>
<ul>
<li><p><strong>Kubeconfig Setup &amp; Storage:</strong> There's a <a target="_blank" href="https://github.com/leothelyon17/kubernetes-gitops-playground/blob/main/.github/workflows/create-refresh-kube-configs.yml">workflow</a> that helps you extract your kubeconfig file and securely store it in GitHub as a repository variable or secret. This is crucial for giving your self-hosted runner authenticated access to your cluster during automated jobs.</p>
</li>
<li><p><strong>kubectl Installation &amp; Verification:</strong> Another <a target="_blank" href="https://github.com/leothelyon17/kubernetes-gitops-playground/blob/main/.github/workflows/setup-kubectl-test-k8s-access.yml">workflow</a> ensures <code>kubectl</code> is installed and properly configured on your self-hosted runner. It also includes a quick test to confirm the runner can talk to the cluster, basically your first "sanity check" before deploying anything.</p>
</li>
</ul>
<p>These smaller workflows aren’t flashy, but they’re essential in keeping everything reliable, reproducible, and GitOps-friendly.</p>
<h3 id="heading-workflow-breakdown">Workflow Breakdown</h3>
<p>Here's what the job actually does:</p>
<ol>
<li><p><strong>Checkout the Repo</strong><br /> Grabs your current Git repository so that scripts and manifests can be used during the workflow.</p>
</li>
<li><p><strong>Set the Environment</strong><br /> Dynamically sets the target environment (e.g., <code>lab</code> or <code>prod</code>) based on your manual trigger input. This is used for cluster context switching and naming.</p>
</li>
<li><p><strong>Configure kubectl</strong><br /> Updates the active Kubernetes context based on the selected environment so the workflow knows which cluster to operate on.</p>
</li>
<li><p><strong>Install Dependencies</strong><br /> Sets up a Python virtual environment and installs <code>pynacl</code>, which is used later for encrypting the ArgoCD password.</p>
</li>
<li><p><strong>Install ArgoCD</strong><br /> Creates the <code>argocd</code> namespace (if it doesn't exist) and applies the official ArgoCD manifests to install the full stack into your cluster.</p>
</li>
<li><p><strong>Install the ArgoCD CLI</strong><br /> Downloads and installs the latest CLI version for use in later steps like login, user config, and cluster registration.</p>
</li>
<li><p><strong>Wait for ArgoCD to Come Online</strong><br /> Uses <code>kubectl wait</code> to ensure the <code>argocd-server</code> deployment is available before proceeding.</p>
</li>
<li><p><strong>Expose ArgoCD via NodePort</strong><br /> Temporarily exposes the ArgoCD UI using a NodePort service on the configured port. This makes it accessible during early setup (before Ingress is configured).</p>
</li>
<li><p><strong>Extract the Initial Admin Password</strong><br /> Pulls the default ArgoCD admin password from the Kubernetes secret and stores it as a masked GitHub environment variable.</p>
</li>
<li><p><strong>Encrypt and Store the Admin Password in GitHub Secrets</strong><br />Uses GitHub’s public key API and a Python script to encrypt the ArgoCD admin password and securely store it in the environment-specific GitHub Secrets.</p>
</li>
<li><p><strong>Log into ArgoCD with Default Admin</strong><br />Authenticates with ArgoCD using the default credentials and ensures the CLI is working.</p>
</li>
<li><p><strong>Create a Custom Admin User</strong><br />Edits the <code>argocd-cm</code> ConfigMap to define a new admin-level account.</p>
</li>
<li><p><strong>Assign RBAC Permissions to the New User</strong><br />Updates the <code>argocd-rbac-cm</code> ConfigMap to give your new user full admin access.</p>
</li>
<li><p><strong>Set a Password for the New User</strong><br />Uses the CLI to set the new admin user’s password securely.</p>
</li>
<li><p><strong>Verify the New Admin Login</strong><br />Logs in with the new user credentials to confirm everything’s configured properly.</p>
</li>
<li><p><strong>Register the Cluster with ArgoCD</strong><br />Ensures the current Kubernetes cluster is registered with ArgoCD, allowing future applications to target it via the ArgoCD UI or CLI.</p>
</li>
</ol>
<h3 id="heading-why-this-rocks">Why This Rocks</h3>
<p>Instead of manually copying YAML and running a dozen <code>kubectl</code> commands, this workflow automates the whole thing, and tracks it all in Git. It’s GitOps deploying GitOps, and yes, I’m into that level of inception.</p>
<p>You can trigger it manually for different environments (e.g., lab vs prod), and the entire setup becomes repeatable, shareable, and documented as code.</p>
<hr />
<p>ArgoCD is now up and running (hopefully). You should be able to reach the login page using the IP (or hostname) of any cluster node together with the NodePort assigned earlier -</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1744570101134/fdaf99ab-3b45-427b-bca9-6538addacd19.png" alt class="image--center mx-auto" /></p>
<p>Go ahead and log in using the admin credentials, or the credentials you created via the Actions workflow. You should see a blank applications list like so -</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745368878759/5e69cf35-9470-48e3-972e-065ea439505b.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-initial-setup">Initial Setup</h2>
<h3 id="heading-configuring-your-cluster-within-argocd"><strong>Configuring Your Cluster within ArgoCD</strong></h3>
<p>Before deploying anything, ArgoCD needs to know which Kubernetes cluster(s) it can target. If you installed ArgoCD into the same cluster you're working in, there's good news, ArgoCD automatically configures access to that cluster. It will show up as <code>in-cluster</code> and is ready to go out of the box.</p>
<p>But if you're managing a <strong>remote cluster</strong>, or skipped using the automated GitHub Actions workflow I showed earlier, you’ll need to manually register the cluster using the ArgoCD CLI. This is required because <strong>you cannot add a new cluster through the ArgoCD UI</strong>.</p>
<h4 id="heading-step-1-login-to-the-argocd-cli"><strong>Step 1: Login to the ArgoCD CLI</strong></h4>
<p>Before you can register a cluster, you need to authenticate using the CLI:</p>
<pre><code class="lang-bash">argocd login &lt;ARGOCD_SERVER&gt;:&lt;PORT&gt; --username admin --password &lt;PASSWORD&gt; --insecure
</code></pre>
<p>Replace the values with your ArgoCD server address and credentials. The <code>--insecure</code> flag is common during lab/testing since you might not have valid TLS configured yet.</p>
<h4 id="heading-step-2-register-the-cluster"><strong>Step 2: Register the Cluster</strong></h4>
<p>Once logged in, you can add the Kubernetes cluster currently pointed to by <code>kubectl</code>:</p>
<pre><code class="lang-bash">argocd cluster add &lt;kube-context-name&gt;
</code></pre>
<p>You can find your context name with:</p>
<pre><code class="lang-bash">kubectl config current-context
</code></pre>
<p>This command sets up a service account and RBAC within the target cluster, and registers it inside ArgoCD. Once complete, the cluster will appear in the UI and can be used for application deployments.</p>
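<p>Under the hood, ArgoCD stores each registered cluster as a labeled Secret in the <code>argocd</code> namespace, which means clusters can also be registered declaratively. A rough sketch (server address and credentials are placeholders):</p>
<pre><code class="lang-yaml">apiVersion: v1
kind: Secret
metadata:
  name: lab-cluster
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: cluster   # tells ArgoCD this Secret defines a cluster
type: Opaque
stringData:
  name: lab-cluster
  server: https://10.0.0.10:6443   # hypothetical API server address
  config: |
    {
      "bearerToken": "&lt;service-account-token&gt;",
      "tlsClientConfig": {
        "caData": "&lt;base64-encoded-ca-cert&gt;"
      }
    }
</code></pre>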
<h3 id="heading-adding-and-configuring-a-new-project-via-gui"><strong>Adding and Configuring a New Project (via GUI)</strong></h3>
<p>Projects in ArgoCD are used to organize applications, enforce boundaries, and apply access rules. They’re especially useful when you want to group related apps, like having one project for core platform components and another for automation tools.</p>
<h4 id="heading-step-by-step-create-a-new-project-in-the-ui"><strong>Step-by-Step: Create a New Project in the UI</strong></h4>
<ol>
<li><p><strong>Login to the ArgoCD UI</strong></p>
<p> Use the NodePort or Ingress you’ve set up earlier to access the web UI. Login with your <code>admin</code> or custom user credentials.</p>
</li>
<li><p><strong>Go to “Settings” → “Projects”</strong></p>
<p> In the sidebar, click <strong>Settings</strong>, then select <strong>Projects</strong>. Click <strong>+ NEW PROJECT</strong> to create a new one.</p>
</li>
<li><p><strong>Name Your Project</strong></p>
<p> Give your project a meaningful name, like <code>platform-core</code> or <code>network-tools</code>. Mine is shown below -</p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745369063850/d71ee8b5-a449-4aef-976b-4d4a5caabd1a.png" alt class="image--center mx-auto" /></p>
</li>
<li><p><strong>Define Destinations</strong></p>
<p> These are the clusters and namespaces that apps in this project are allowed to deploy to. If you're using the default in-cluster setup, your server URL will be <a target="_blank" href="https://kubernetes.default.svc"><code>https://kubernetes.default.svc</code></a>.</p>
<ul>
<li><p>Server: <a target="_blank" href="https://kubernetes.default.svc"><code>https://kubernetes.default.svc</code></a></p>
</li>
<li><p>Namespace: e.g., <code>default</code>, <code>argocd</code>, or <code>tools</code>. For a basic setup, just set it to <code>*</code> (all namespaces)</p>
</li>
</ul>
</li>
</ol>
<p>    <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745369318060/4a919df4-9520-4f55-ae0a-344fe4f3b504.png" alt class="image--center mx-auto" /></p>
<ol start="5">
<li><p><strong>Configure Role-Based Access and Restrictions</strong></p>
<p> When you're setting up a new project in ArgoCD, you'll see options to define what types of Kubernetes resources the project is allowed to manage. This is where you can lock things down pretty tightly, but for <strong>basic setup and initial testing</strong>, it’s easiest to just allow everything and refine later once things are working.</p>
<p> Here’s what that looks like:</p>
<ul>
<li><p><strong>Cluster Resource Allow List</strong></p>
<ul>
<li><p>Kind: <code>*</code></p>
</li>
<li><p>Group: <code>*</code></p>
</li>
</ul>
</li>
<li><p><strong>Cluster Resource Deny List</strong></p>
<ul>
<li>Leave this empty</li>
</ul>
</li>
<li><p><strong>Namespace Resource Allow List</strong></p>
<ul>
<li><p>Kind: <code>*</code></p>
</li>
<li><p>Group: <code>*</code></p>
</li>
</ul>
</li>
<li><p><strong>Namespace Resource Deny List</strong></p>
<ul>
<li>Leave this empty</li>
</ul>
</li>
</ul>
</li>
</ol>
<p>    <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745369201192/2d7ab395-8511-4a49-a2f6-5ad9ba8f5973.png" alt class="image--center mx-auto" /></p>
<ol start="6">
<li><p>Resource Monitoring</p>
<ul>
<li>Move the slider to ‘Enabled’</li>
</ul>
</li>
<li><p><strong>Click “Create”</strong></p>
<p> Your project is now set up and ready to have apps assigned to it.</p>
</li>
</ol>
<p>From here, you’ll be able to define Git-based applications, point them at your manifests or Helm charts, and let ArgoCD handle the rest.</p>
<p>The first applications we’ll deploy with it are the core pieces of our GitOps infrastructure itself: tools like MetalLB for load balancing, Traefik for ingress, and persistent storage components. In other words, we’ll be using GitOps to finish building out the platform that enables GitOps. Poetic, right?</p>
<p>Before we wrap up Part 1, let's talk about how this all ties back to Git...</p>
<h2 id="heading-understanding-the-repo-structure-and-why-everything-belongs-in-git"><strong>Understanding the Repo Structure (and Why Everything Belongs in Git)</strong></h2>
<p>One of the core principles of GitOps is keeping <strong>everything</strong>—infrastructure, applications, configurations, and deployment logic—in version control. The folder layout in my example repo is designed with that in mind. It reflects GitOps best practices: everything is declarative, versioned, and easy to manage or scale over time.</p>
<p>Having a clear and intentional structure not only makes your deployments cleaner, it also simplifies troubleshooting, auditing, onboarding new team members, and extending the platform as your needs grow.</p>
<p>Here’s a quick breakdown of the folders that matter most for this series:</p>
<ul>
<li><p><code>apps/</code><br />  This is where you’ll find custom Helm values files and Kustomize overlays for each application managed by ArgoCD. Each subdirectory corresponds to a specific app—like MetalLB, Traefik, or ArgoCD itself—and contains the configuration needed to tailor the deployment to your environment. This keeps your app logic cleanly separated and easy to maintain.</p>
</li>
<li><p><code>argocd-app-manifests/</code><br />  Contains the ArgoCD <code>Application</code> and <code>AppProject</code> manifests. These define <em>what</em> ArgoCD deploys, <em>where</em>, and <em>from which repo</em>. Managing these separately from app-specific config keeps the logic declarative and helps you track application lifecycle separately from platform logic.</p>
</li>
<li><p><code>helm-charts/</code><br />  This folder stores any custom or forked Helm charts that don’t live in an external Helm repo. It gives you a clean place to manage pinned chart versions or make local edits without cluttering the main app or manifest directories.</p>
</li>
</ul>
<p>This layout isn’t just for organization, it’s what enables a GitOps workflow to scale. As your platform grows, this structure makes it easy to maintain a consistent, observable, and testable deployment pipeline across your infrastructure.</p>
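<p>To make this concrete, here’s a hedged sketch of an ArgoCD <code>Application</code> manifest that would live in <code>argocd-app-manifests/</code> and point at configuration under <code>apps/</code>. The repo URL, project name, and paths are illustrative, not taken from the actual repo:</p>
<pre><code class="lang-yaml">apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: metallb                  # example app from this series
  namespace: argocd
spec:
  project: my-gitops-project     # placeholder project name
  source:
    repoURL: https://github.com/example/gitops-repo.git  # placeholder repo
    targetRevision: main
    path: apps/metallb           # app-specific config lives under apps/
  destination:
    server: https://kubernetes.default.svc
    namespace: metallb-system
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
</code></pre>
<p>The manifest (the <em>what</em> and <em>where</em>) lives in one folder while the app’s Helm values or Kustomize overlays (the <em>how</em>) live in another — exactly the separation described above.</p>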
<h1 id="heading-summary-amp-whats-next"><strong>Summary &amp; What’s Next</strong></h1>
<p>In Part 1, we laid the groundwork for a GitOps-driven automation platform. We covered the key components that make up the stack, walked through what GitOps actually is (without the fluff), and deployed ArgoCD, the engine that brings it all to life.</p>
<p>By now, you should have:</p>
<ul>
<li><p>A working Kubernetes cluster</p>
</li>
<li><p>ArgoCD fully installed and accessible via NodePort or Ingress</p>
</li>
<li><p>Logged into the ArgoCD UI</p>
</li>
<li><p>Created your first ArgoCD project and verified it’s configured with the settings described earlier (associated cluster aka ‘Destination’, RBAC/Allowed Lists, enabled Resource Monitoring)</p>
</li>
</ul>
<p>If you’ve made it this far, that’s a huge step forward, especially if you’re coming from a traditional networking background. You’ve already started to shift from manually pushing scripts to building a scalable, Git-driven platform.</p>
<p>But we’re just getting started.</p>
<p>In <strong>Part 2</strong>, we’ll begin deploying actual infrastructure apps using the GitOps workflow you’ve set up here. We’ll cover MetalLB (for load balancing), Traefik (for ingress), persistent storage with Rook/Ceph, and secrets management with External Secrets and HashiCorp Vault. These hands-on deployments depend on the foundation you just built, so make sure everything is in place before continuing.</p>
<p>Let’s keep building.</p>
]]></content:encoded></item><item><title><![CDATA[Kubernetes and Containerlab: Part 1 – Building a Cluster]]></title><description><![CDATA[Intro
Hello again to all my longtime readers 😉, and welcome to my new series where we’ll dive into the world of Kubernetes and Containerlab (creators named this Clabernetes) to help you build containerized virtual network environments from the groun...]]></description><link>https://blog.nerdylyonsden.io/kubernetes-and-containerlab-part-1-building-a-cluster</link><guid isPermaLink="true">https://blog.nerdylyonsden.io/kubernetes-and-containerlab-part-1-building-a-cluster</guid><category><![CDATA[Kubernetes]]></category><category><![CDATA[containerlab]]></category><category><![CDATA[kubespray]]></category><category><![CDATA[ansible]]></category><dc:creator><![CDATA[Jeffrey Lyon]]></dc:creator><pubDate>Sat, 05 Oct 2024 19:26:28 GMT</pubDate><content:encoded><![CDATA[<h1 id="heading-intro">Intro</h1>
<p>Hello again to all my longtime readers 😉, and welcome to my new series where we’ll dive into the world of <strong>Kubernetes</strong> and <strong>Containerlab</strong> (its creators named this combination <strong>Clabernetes</strong>) to help you build containerized virtual network environments from the ground up. As a former network engineer, my goal with this series is to help those with little to no Kubernetes experience get containerized labs up and running quickly, using minimal resources. Advanced platform engineers seeking a streamlined approach to deploying Kubernetes may also find these posts useful.</p>
<p>We’ll kick off this series by laying the foundation: building a small Kubernetes cluster. We'll use tools like Kubespray (Ansible) along with some of my custom playbooks to streamline the process. By the end of this post, we’ll transform three freshly deployed servers into a fully operational Kubernetes cluster.</p>
<p>Next, I’ll be configuring our existing cluster to deploy a shared MicroCeph storage pool, exploring its use cases, and transforming our cluster into a hyper-converged infrastructure.</p>
<p>From there, we'll explore Clabernetes, a Kubernetes-based solution created by the makers of Containerlab. Clabernetes deploys Containerlab topologies into a Kubernetes cluster, allowing them to scale beyond a single node. It’s designed to support some pretty robust networking labs, and I’ll show you how to create some example topologies. Each post in this series will build on the previous, providing practical, hands-on examples to guide you through every step.</p>
<p>By the end of this series, you’ll have a fully functional Kubernetes setup with advanced storage, networking, and simulation capabilities—perfect for learning, experimenting, or even scaling into production.</p>
<p>Let’s get started!</p>
<h1 id="heading-scenario">Scenario</h1>
<p>Let’s imagine you’re a network engineer (maybe you are!) exploring alternatives to traditional virtual lab solutions. While tools like EVE-NG, GNS3, and Cisco’s CML are popular, they often struggle to scale efficiently. You want to build larger, more complex topologies to enhance both your day-to-day work and your networking knowledge. You've heard a lot about Containerlab recently and are eager to experiment, but you're unsure where to start and your resources are limited. Although you have extensive experience with physical and virtual network setups, container-based environments—especially Kubernetes—are new territory. You've heard about the scalability and flexibility that Kubernetes and Containerlab offer for creating virtual labs and testing advanced topologies, and you want to integrate that power into your workflow.</p>
<p>Let’s begin by discussing the devices used in this post to build the cluster. All of them are lower-spec virtual machines running <strong>Rocky Linux 9.4</strong>. My goal is to demonstrate the power of Kubernetes and Containerlab, even when deployed with minimal, cost-effective resources. There are four devices in total: one server dedicated to Ansible and three Kubernetes nodes. The server specifications are as follows:</p>
<ul>
<li><p>Ansible Host - 1 CPU core, 4 GB RAM, 50 GB OS HD</p>
</li>
<li><p>3x Kubernetes Hosts - 2 CPU cores, 16 GB RAM, 50 GB OS HD, 250 GB HD for MicroCeph storage pool (covered in Part 2 of this series)</p>
</li>
</ul>
<p>These systems will be communicating over the same 172.16.99.x MGMT network.</p>
<p><strong>NOTE:</strong> One host is dedicated as the control plane node and also serves as the etcd server. All three nodes, including the control plane node, will be set as worker nodes and can take on workloads.</p>
<p><strong>NOTE:</strong> <strong>The kubespray automation in this setup is using all other defaults and no additional plugins will be installed.</strong> I may add sections to this post or separate posts in the future covering use of additional plugins.</p>
<h2 id="heading-requirements"><strong>Requirements</strong></h2>
<p>I will include required packages, configuration, and setup for the systems involved in this automation.</p>
<p><strong>NOTE:</strong> Unless specified I am working as a non-root user (“jeff”) and in my home directory.</p>
<h3 id="heading-ansible-host">Ansible host</h3>
<p>You will need the following:</p>
<ul>
<li><p><strong>Update OS</strong></p>
<pre><code class="lang-bash">  sudo dnf update -y
</code></pre>
</li>
<li><p><strong>Python</strong> (3.9 or greater suggested)</p>
<p>  The default on Rocky 9 is Python 3.9; this setup uses 3.10.9. Installing a newer Python on Rocky is a little more involved, so I’ve included the steps below:</p>
<p>  Install Dependencies</p>
<pre><code class="lang-bash">  sudo dnf install tar curl gcc openssl-devel bzip2-devel libffi-devel zlib-devel wget make -y
</code></pre>
<p>  Install Python</p>
<pre><code class="lang-bash">  <span class="hljs-comment"># Download and unzip</span>
  wget https://www.python.org/ftp/python/3.10.9/Python-3.10.9.tar.xz
  tar -xf Python-3.10.9.tar.xz
  <span class="hljs-comment"># Change directory and configure Python</span>
  <span class="hljs-built_in">cd</span> Python-3.10.9
  ./configure --enable-optimizations
  <span class="hljs-comment"># Start and complete the build process</span>
  make -j $(nproc)
  <span class="hljs-comment"># Install Python</span>
  sudo make altinstall
  <span class="hljs-comment"># Verify install using</span>
  python3.10 --version
</code></pre>
</li>
<li><p><strong>Python Virtual Environment</strong></p>
<p>  I suggest using a virtual environment for this setup. It makes it easier to keep Ansible, its modules, and Kubespray separate from anything else the host is being used for.</p>
<pre><code class="lang-bash">  <span class="hljs-comment"># Create the virtual environment</span>
  python3.10 -m venv kubespray_env
  <span class="hljs-comment"># Activate the virtual environment</span>
  <span class="hljs-built_in">source</span> kubespray_env/bin/activate

  <span class="hljs-comment"># To deactivate</span>
  deactivate
</code></pre>
</li>
<li><p><strong>Download Kubespray</strong></p>
<pre><code class="lang-bash">  <span class="hljs-comment"># Change into virtual environment directory</span>
  <span class="hljs-built_in">cd</span> kubespray_env
  <span class="hljs-comment"># Pull down Kubespray from Github</span>
  git <span class="hljs-built_in">clone</span> https://github.com/kubernetes-sigs/kubespray.git
</code></pre>
</li>
<li><p><strong>Install Ansible and Kubespray packages within the virtual environment</strong></p>
<pre><code class="lang-bash">  <span class="hljs-comment"># From within the virtual environment main folder</span>
  <span class="hljs-built_in">cd</span> kubespray
  <span class="hljs-comment"># Install packages (Ansible mainly)</span>
  pip3.10 install -r requirements.txt
</code></pre>
</li>
<li><p><strong>Tweak Ansible configuration</strong></p>
<p>  Modify your <code>ansible.cfg</code> file to ignore <code>host_key_checking</code>. It’s usually located in <code>/etc/ansible/</code>; create a new file if none exists.</p>
<pre><code class="lang-ini">  <span class="hljs-section">[defaults]</span>
  <span class="hljs-attr">host_key_checking</span> = <span class="hljs-literal">False</span>
</code></pre>
<p>  <strong>NOTE</strong>: If you’re unsure where to find your <code>ansible.cfg</code>, just run <code>ansible --version</code> as shown below:</p>
<pre><code class="lang-bash">  ansible --version

  ansible [core 2.16.3]
    config file = /etc/ansible/ansible.cfg
</code></pre>
</li>
<li><p><strong>Download my custom kubespray-addons repository</strong></p>
<pre><code class="lang-bash">  <span class="hljs-comment"># Change directory to root virtual environment folder</span>
  <span class="hljs-built_in">cd</span> ~/kubespray_env
  <span class="hljs-comment"># Pull down kubespray-addons from Github</span>
  git <span class="hljs-built_in">clone</span> https://github.com/leothelyon17/kubespray-addons.git
</code></pre>
</li>
</ul>
<h3 id="heading-kubernetes-nodes-freshly-created-vms">Kubernetes Nodes (freshly created VMs)</h3>
<ul>
<li><p><strong>Upgrade OS</strong> (same as Ansible host)</p>
<pre><code class="lang-bash">  sudo dnf update -y

  <span class="hljs-comment"># Optional</span>
  sudo dnf install nano -y
</code></pre>
<p>  <strong>NOTE:</strong> I also install the Nano text editor on these nodes for quick file editing if needed. The default Python 3.9 included in the OS install works just fine.</p>
</li>
</ul>
<p>That’s it! The automation takes care of everything else.</p>
<h2 id="heading-getting-into-the-weeds"><strong>Getting into the Weeds</strong></h2>
<h3 id="heading-automation-overview-and-breakdown"><strong>Automation Overview and Breakdown</strong></h3>
<p>We’ll start with a quick overview of Kubespray, then cover my custom add-on automation and what it aims to accomplish. Finally, we’ll break down both Addons playbooks—Pre and Post.</p>
<h3 id="heading-kubespray">Kubespray</h3>
<p>Kubespray is an open-source tool that automates the deployment of highly available Kubernetes clusters. It uses Ansible playbooks to install and configure Kubernetes across various environments, including bare-metal servers, virtual machines, or cloud infrastructures. Kubespray simplifies the deployment process, providing a robust, flexible, and scalable solution for setting up production-grade Kubernetes clusters.</p>
<p>Kubespray is undoubtedly a powerful tool. However, as I worked through various tutorials to get started, I noticed the number of steps required, such as setting up the inventory, configuring server settings, and fixing <code>kubeadm</code> on the control nodes once the cluster is up and running. That friction is something I felt needed to be addressed, which brings us to the next section…</p>
<h3 id="heading-kubespray-addons-custom-automation">Kubespray-Addons (Custom Automation)</h3>
<p>I wanted to make using Kubespray and getting a K8s cluster up and running even easier than it already is. This is especially true for my fellow network engineers who might be new to all things Kubernetes, or anyone who doesn’t want to spend extra time on the additional setup required to run Kubespray.</p>
<p>The initial setup for Kubespray requires users to define environment variables, which are then passed into a Python script to generate the necessary <code>inventory.yml</code> file. This approach, outlined in the official Kubespray documentation and many online tutorials, produces an inventory file with numerous predefined defaults. However, users often still need to manually modify the Kubespray inventory file afterward. My goal was to create a more intuitive and streamlined solution—one that not only generates the required Kubespray inventory file but also serves as the inventory for the Addons playbooks.</p>
<h3 id="heading-inventory-inventoryyml">Inventory - <code>inventory.yml</code></h3>
<p>Let’s break down the inventory file with an example:</p>
<pre><code class="lang-yaml"><span class="hljs-meta">---</span>
<span class="hljs-attr">all:</span>
  <span class="hljs-attr">hosts:</span>
    <span class="hljs-attr">rocky9-lab-node1:</span>
      <span class="hljs-attr">ansible_host:</span> <span class="hljs-number">172.16</span><span class="hljs-number">.99</span><span class="hljs-number">.25</span>
      <span class="hljs-attr">domain_name:</span> <span class="hljs-string">jjland.local</span>
      <span class="hljs-attr">master_node:</span> <span class="hljs-literal">True</span>
      <span class="hljs-attr">worker_node:</span> <span class="hljs-literal">True</span>
      <span class="hljs-attr">etcd_node:</span> <span class="hljs-literal">True</span>
    <span class="hljs-attr">rocky9-lab-node2:</span>
      <span class="hljs-attr">ansible_host:</span> <span class="hljs-number">172.16</span><span class="hljs-number">.99</span><span class="hljs-number">.26</span>
      <span class="hljs-attr">domain_name:</span> <span class="hljs-string">jjland.local</span>
      <span class="hljs-attr">master_node:</span> <span class="hljs-literal">False</span>
      <span class="hljs-attr">worker_node:</span> <span class="hljs-literal">True</span>
      <span class="hljs-attr">etcd_node:</span> <span class="hljs-literal">False</span>
    <span class="hljs-attr">rocky9-lab-node3:</span>
      <span class="hljs-attr">ansible_host:</span> <span class="hljs-number">172.16</span><span class="hljs-number">.99</span><span class="hljs-number">.27</span>
      <span class="hljs-attr">domain_name:</span> <span class="hljs-string">jjland.local</span>
      <span class="hljs-attr">master_node:</span> <span class="hljs-literal">False</span>
      <span class="hljs-attr">worker_node:</span> <span class="hljs-literal">True</span>
      <span class="hljs-attr">etcd_node:</span> <span class="hljs-literal">False</span>
    <span class="hljs-attr">rocky9-lab-mgmt:</span>
      <span class="hljs-attr">ansible_host:</span> <span class="hljs-number">172.16</span><span class="hljs-number">.99</span><span class="hljs-number">.20</span>
      <span class="hljs-attr">domain_name:</span> <span class="hljs-string">jjland.local</span>

  <span class="hljs-attr">children:</span>
    <span class="hljs-attr">k8s_nodes:</span>
      <span class="hljs-attr">hosts:</span>
        <span class="hljs-attr">rocky9-lab-node1:</span>
        <span class="hljs-attr">rocky9-lab-node2:</span>
        <span class="hljs-attr">rocky9-lab-node3:</span>
    <span class="hljs-attr">ansible_nodes:</span>
      <span class="hljs-attr">hosts:</span>
        <span class="hljs-attr">rocky9-lab-mgmt:</span>

  <span class="hljs-attr">vars:</span>
    <span class="hljs-attr">ansible_user:</span> <span class="hljs-string">jeff</span>
</code></pre>
<p>The file, which users need to customize, is based on the official Kubespray inventory file but with some key improvements. My version allows users to predefine the roles of each node—something the official method doesn’t provide. It also specifies individual host names, used not only in the Addons playbooks but also to properly name the Kubernetes nodes, instead of using the default 'node' from the Kubespray file. Additionally, it defines the domain name, used for updating the <code>/etc/hosts</code> file on all hosts during a task in the Pre-Kubespray playbook. It also sets the <code>ansible_host</code> variable for device connections and configures the <code>ansible_user</code> for all Addons playbook tasks.</p>
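<p>For context, the Kubespray inventory (<code>k8s-hosts.yml</code>) that the Pre-Kubespray playbook later generates from this file follows Kubespray’s standard group layout. Roughly sketched for the hosts above — the exact output depends on the Jinja2 template — it would look something like this:</p>
<pre><code class="lang-yaml">all:
  hosts:
    rocky9-lab-node1:
      ansible_host: 172.16.99.25
    rocky9-lab-node2:
      ansible_host: 172.16.99.26
    rocky9-lab-node3:
      ansible_host: 172.16.99.27
  children:
    kube_control_plane:        # hosts with master_node: True
      hosts:
        rocky9-lab-node1:
    kube_node:                 # hosts with worker_node: True
      hosts:
        rocky9-lab-node1:
        rocky9-lab-node2:
        rocky9-lab-node3:
    etcd:                      # hosts with etcd_node: True
      hosts:
        rocky9-lab-node1:
    k8s_cluster:
      children:
        kube_control_plane:
        kube_node:
</code></pre>
<p>This is why predefining the role flags in the custom inventory matters: each flag determines which Kubespray group a node lands in.</p>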
<h3 id="heading-pre-kubespray-playbook-pre-kubespray-setup-pbyml">Pre-Kubespray Playbook - <code>pre-kubespray-setup-pb.yml</code></h3>
<p>This playbook consists of two plays and is designed to fully prepare a set of hosts for Kubernetes deployment using Kubespray. It installs required Ansible collections, sets up SSH key-based authentication, modifies system configurations (disables swap, configures sysctl settings), ensures required kernel modules are loaded, and configures firewall rules.</p>
<p><strong>Play 1 - Pre Kubespray Setup</strong></p>
<pre><code class="lang-yaml"><span class="hljs-meta">---</span>
<span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Pre</span> <span class="hljs-string">Kubespray</span> <span class="hljs-string">Setup</span>
  <span class="hljs-attr">hosts:</span> <span class="hljs-string">all</span>
  <span class="hljs-attr">gather_facts:</span> <span class="hljs-literal">false</span>

  <span class="hljs-attr">tasks:</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Install</span> <span class="hljs-string">collections</span> <span class="hljs-string">from</span> <span class="hljs-string">requirements.yml</span>
      <span class="hljs-attr">ansible.builtin.command:</span>
        <span class="hljs-attr">cmd:</span> <span class="hljs-string">ansible-galaxy</span> <span class="hljs-string">collection</span> <span class="hljs-string">install</span> <span class="hljs-string">-r</span> <span class="hljs-string">requirements.yml</span>
      <span class="hljs-attr">delegate_to:</span> <span class="hljs-string">localhost</span>
      <span class="hljs-attr">run_once:</span> <span class="hljs-literal">true</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Generate</span> <span class="hljs-string">SSH</span> <span class="hljs-string">key</span> <span class="hljs-string">pair</span>
      <span class="hljs-attr">openssh_keypair:</span>
        <span class="hljs-attr">path:</span> <span class="hljs-string">"/home/<span class="hljs-template-variable">{{ ansible_user }}</span>/.ssh/kubespray_ansible"</span>
        <span class="hljs-attr">type:</span> <span class="hljs-string">rsa</span>
        <span class="hljs-attr">size:</span> <span class="hljs-number">2048</span>
        <span class="hljs-attr">state:</span> <span class="hljs-string">present</span>
        <span class="hljs-attr">mode:</span> <span class="hljs-string">'0600'</span>
      <span class="hljs-attr">register:</span> <span class="hljs-string">ssh_keypair_result</span>
      <span class="hljs-attr">delegate_to:</span> <span class="hljs-string">localhost</span>
      <span class="hljs-attr">run_once:</span> <span class="hljs-literal">true</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Ensure</span> <span class="hljs-string">the</span> <span class="hljs-string">SSH</span> <span class="hljs-string">public</span> <span class="hljs-string">key</span> <span class="hljs-string">is</span> <span class="hljs-string">present</span> <span class="hljs-string">on</span> <span class="hljs-string">the</span> <span class="hljs-string">remote</span> <span class="hljs-string">host</span>
      <span class="hljs-attr">authorized_key:</span>
        <span class="hljs-attr">user:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ ansible_user }}</span>"</span>
        <span class="hljs-attr">state:</span> <span class="hljs-string">present</span>
        <span class="hljs-attr">key:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ lookup('file', '/home/{{ ansible_user }}</span>/.ssh/kubespray_ansible.pub') }}"</span>
      <span class="hljs-attr">when:</span> <span class="hljs-string">inventory_hostname</span> <span class="hljs-string">not</span> <span class="hljs-string">in</span> <span class="hljs-string">groups['ansible_nodes']</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Add</span> <span class="hljs-string">entries</span> <span class="hljs-string">to</span> <span class="hljs-string">/etc/hosts</span>
      <span class="hljs-attr">become:</span> <span class="hljs-literal">true</span>
      <span class="hljs-attr">lineinfile:</span>
        <span class="hljs-attr">path:</span> <span class="hljs-string">/etc/hosts</span>
        <span class="hljs-attr">state:</span> <span class="hljs-string">present</span>
        <span class="hljs-attr">line:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ hostvars[item].ansible_host }}</span> <span class="hljs-template-variable">{{ hostvars[item].inventory_hostname }}</span>.<span class="hljs-template-variable">{{ hostvars[item].domain_name }}</span> <span class="hljs-template-variable">{{ hostvars[item].inventory_hostname }}</span>"</span>
        <span class="hljs-attr">backup:</span> <span class="hljs-literal">yes</span>
      <span class="hljs-attr">loop:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ groups['all'] }}</span>"</span>
      <span class="hljs-attr">loop_control:</span>
        <span class="hljs-attr">loop_var:</span> <span class="hljs-string">item</span>
</code></pre>
<p><strong>Purpose:</strong><br />This playbook ensures that all hosts are prepared for Kubespray by installing required Ansible collections, generating SSH keys, and configuring the environment.</p>
<p><strong>Hosts</strong>:<br />Targets <strong>all</strong> hosts unless overridden in a specific task.</p>
<p><strong>Tasks:</strong></p>
<ol>
<li><p><strong>Install collections from</strong> <code>requirements.yml</code></p>
<p> Installs required Ansible collections from <code>requirements.yml</code>. This is only run once on the <a target="_blank" href="http://localhost">localhost</a>. Right now, the only requirements are <code>community.crypto</code> and <code>ansible.posix</code>.</p>
</li>
<li><p><strong>Generate SSH key pair</strong><br /> Generates an RSA SSH key pair for Ansible on the <a target="_blank" href="http://localhost">localhost</a> for later access to remote hosts. The private key is stored in the <code>.ssh</code> directory under <code>kubespray_ansible</code>.</p>
</li>
<li><p><strong>Ensure the SSH public key is present on the remote host</strong><br /> Adds the generated SSH public key to the remote hosts to allow passwordless access. It applies this only to hosts not in the <code>ansible_nodes</code> group.</p>
<p> <strong>NOTE:</strong> You will still need the ‘sudo’ password. This also leaves flexibility to add a task for passwordless sudo; I may add that feature later.</p>
</li>
<li><p><strong>Add entries to</strong> <code>/etc/hosts</code><br /> Adds entries to the <code>/etc/hosts</code> file on each host to ensure proper DNS resolution between them. It loops through all hosts in the inventory and updates their hosts file with IP addresses and hostnames.</p>
</li>
</ol>
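<p>As an example of what that last task produces: given the inventory shown earlier, every host’s <code>/etc/hosts</code> file would gain lines of the form <code>&lt;ip&gt; &lt;host&gt;.&lt;domain&gt; &lt;host&gt;</code> (a sketch derived from the template string in the task):</p>
<pre><code class="lang-plaintext">172.16.99.25 rocky9-lab-node1.jjland.local rocky9-lab-node1
172.16.99.26 rocky9-lab-node2.jjland.local rocky9-lab-node2
172.16.99.27 rocky9-lab-node3.jjland.local rocky9-lab-node3
172.16.99.20 rocky9-lab-mgmt.jjland.local rocky9-lab-mgmt
</code></pre>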
<p><strong>Play 2 - Build Kubespray inventory and additional k8s node setup</strong></p>
<pre><code class="lang-yaml"><span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Build</span> <span class="hljs-string">Kubespray</span> <span class="hljs-string">inventory</span> <span class="hljs-string">and</span> <span class="hljs-string">additional</span> <span class="hljs-string">k8s</span> <span class="hljs-string">node</span> <span class="hljs-string">setup</span>
  <span class="hljs-attr">hosts:</span> <span class="hljs-string">k8s_nodes</span>
  <span class="hljs-attr">gather_facts:</span> <span class="hljs-literal">false</span>
  <span class="hljs-attr">tasks:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Create</span> <span class="hljs-string">inventory</span> <span class="hljs-string">directory</span> <span class="hljs-string">if</span> <span class="hljs-string">it</span> <span class="hljs-string">does</span> <span class="hljs-string">not</span> <span class="hljs-string">exist</span>
      <span class="hljs-attr">ansible.builtin.file:</span>
        <span class="hljs-attr">path:</span> <span class="hljs-string">../kubespray/inventory/</span>
        <span class="hljs-attr">state:</span> <span class="hljs-string">directory</span>
        <span class="hljs-attr">mode:</span> <span class="hljs-string">'0755'</span>
      <span class="hljs-attr">delegate_to:</span> <span class="hljs-string">localhost</span>
      <span class="hljs-attr">run_once:</span> <span class="hljs-literal">true</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Generate</span> <span class="hljs-string">inventory.yml</span> <span class="hljs-string">for</span> <span class="hljs-string">kubespray</span> <span class="hljs-string">using</span> <span class="hljs-string">Jinja2</span>
      <span class="hljs-attr">template:</span>
        <span class="hljs-attr">src:</span> <span class="hljs-string">./templates/kubespray-inventory-yaml.j2</span>
        <span class="hljs-attr">dest:</span> <span class="hljs-string">./k8s-hosts.yml</span>
      <span class="hljs-attr">delegate_to:</span> <span class="hljs-string">localhost</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Copy</span> <span class="hljs-string">completed</span> <span class="hljs-string">template</span> <span class="hljs-string">to</span> <span class="hljs-string">kubespray</span> <span class="hljs-string">inventory</span> <span class="hljs-string">folder</span>
      <span class="hljs-attr">ansible.builtin.copy:</span>
        <span class="hljs-attr">src:</span> <span class="hljs-string">./k8s-hosts.yml</span>
        <span class="hljs-attr">dest:</span> <span class="hljs-string">../kubespray/inventory</span>
        <span class="hljs-attr">mode:</span> <span class="hljs-string">'0755'</span>
      <span class="hljs-attr">delegate_to:</span> <span class="hljs-string">localhost</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Disable</span> <span class="hljs-string">swap</span>
      <span class="hljs-attr">become:</span> <span class="hljs-literal">true</span>
      <span class="hljs-attr">ansible.builtin.command:</span> <span class="hljs-string">swapoff</span> <span class="hljs-string">-a</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Remove</span> <span class="hljs-string">swap</span> <span class="hljs-string">entry</span> <span class="hljs-string">from</span> <span class="hljs-string">/etc/fstab</span>
      <span class="hljs-attr">become:</span> <span class="hljs-literal">true</span>
      <span class="hljs-attr">ansible.builtin.replace:</span>
        <span class="hljs-attr">path:</span> <span class="hljs-string">/etc/fstab</span>
        <span class="hljs-attr">regexp:</span> <span class="hljs-string">'(^.*swap.*$)'</span>
        <span class="hljs-attr">replace:</span> <span class="hljs-string">'# \1'</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Load</span> <span class="hljs-string">necessary</span> <span class="hljs-string">kernel</span> <span class="hljs-string">modules</span>
      <span class="hljs-attr">become:</span> <span class="hljs-literal">true</span>
      <span class="hljs-attr">ansible.builtin.modprobe:</span>
        <span class="hljs-attr">name:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ item }}</span>"</span>
      <span class="hljs-attr">loop:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-string">br_netfilter</span>
        <span class="hljs-bullet">-</span> <span class="hljs-string">overlay</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Ensure</span> <span class="hljs-string">kernel</span> <span class="hljs-string">modules</span> <span class="hljs-string">are</span> <span class="hljs-string">loaded</span> <span class="hljs-string">on</span> <span class="hljs-string">boot</span>
      <span class="hljs-attr">become:</span> <span class="hljs-literal">true</span>
      <span class="hljs-attr">ansible.builtin.copy:</span>
        <span class="hljs-attr">dest:</span> <span class="hljs-string">/etc/modules-load.d/kubernetes.conf</span>
        <span class="hljs-attr">content:</span> <span class="hljs-string">|
          br_netfilter
          overlay
</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Configure</span> <span class="hljs-string">sysctl</span> <span class="hljs-string">for</span> <span class="hljs-string">Kubernetes</span> <span class="hljs-string">networking</span>
      <span class="hljs-attr">become:</span> <span class="hljs-literal">true</span>
      <span class="hljs-attr">ansible.builtin.copy:</span>
        <span class="hljs-attr">dest:</span> <span class="hljs-string">/etc/sysctl.d/kubernetes.conf</span>
        <span class="hljs-attr">content:</span> <span class="hljs-string">|
          net.bridge.bridge-nf-call-ip6tables = 1
          net.bridge.bridge-nf-call-iptables = 1
          net.ipv4.ip_forward = 1
</span>      <span class="hljs-attr">notify:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-string">Reload</span> <span class="hljs-string">sysctl</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Apply</span> <span class="hljs-string">sysctl</span> <span class="hljs-string">settings</span>
      <span class="hljs-attr">become:</span> <span class="hljs-literal">true</span>
      <span class="hljs-attr">ansible.builtin.command:</span> <span class="hljs-string">sysctl</span> <span class="hljs-string">--system</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Configure</span> <span class="hljs-string">firewall</span> <span class="hljs-string">rules</span> <span class="hljs-string">for</span> <span class="hljs-string">Kubernetes</span>
      <span class="hljs-attr">become:</span> <span class="hljs-literal">true</span>
      <span class="hljs-attr">ansible.builtin.firewalld:</span>
        <span class="hljs-attr">service:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ item }}</span>"</span>
        <span class="hljs-attr">permanent:</span> <span class="hljs-literal">yes</span>
        <span class="hljs-attr">state:</span> <span class="hljs-string">enabled</span>
        <span class="hljs-attr">immediate:</span> <span class="hljs-literal">yes</span>
      <span class="hljs-attr">loop:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-string">ssh</span>
        <span class="hljs-bullet">-</span> <span class="hljs-string">http</span>
        <span class="hljs-bullet">-</span> <span class="hljs-string">https</span>
        <span class="hljs-bullet">-</span> <span class="hljs-string">kube-api</span> 
        <span class="hljs-bullet">-</span> <span class="hljs-string">kube-apiserver</span>
        <span class="hljs-bullet">-</span> <span class="hljs-string">kube-control-plane</span>
        <span class="hljs-bullet">-</span> <span class="hljs-string">kube-control-plane-secure</span> 
        <span class="hljs-bullet">-</span> <span class="hljs-string">kube-controller-manager</span>
        <span class="hljs-bullet">-</span> <span class="hljs-string">kube-controller-manager-secure</span>
        <span class="hljs-bullet">-</span> <span class="hljs-string">kube-nodeport-services</span>
        <span class="hljs-bullet">-</span> <span class="hljs-string">kube-scheduler</span> 
        <span class="hljs-bullet">-</span> <span class="hljs-string">kube-scheduler-secure</span>
        <span class="hljs-bullet">-</span> <span class="hljs-string">kube-worker</span> 
        <span class="hljs-bullet">-</span> <span class="hljs-string">kubelet</span>
        <span class="hljs-bullet">-</span> <span class="hljs-string">kubelet-readonly</span> 
        <span class="hljs-bullet">-</span> <span class="hljs-string">kubelet-worker</span>
        <span class="hljs-bullet">-</span> <span class="hljs-string">etcd-server</span>
      <span class="hljs-attr">notify:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-string">Reload</span> <span class="hljs-string">firewalld</span>
</code></pre>
<p><strong>Purpose:</strong><br />This playbook sets up the environment for Kubernetes and configures the nodes for a Kubespray deployment.</p>
<p><strong>Hosts</strong>:<br />Targets the 3 Kubernetes nodes only.</p>
<p><strong>Tasks:</strong></p>
<ol>
<li><p><strong>Create inventory directory if it does not exist</strong><br /> Creates the inventory directory required by Kubespray to store the <code>inventory.yml</code> file.</p>
</li>
<li><p><strong>Generate</strong> <code>inventory.yml</code> <strong>for Kubespray using Jinja2</strong><br /> Uses a Jinja2 template to create the <code>inventory.yml</code> file that Kubespray needs, based on the defined hosts.</p>
</li>
<li><p><strong>Copy completed template to Kubespray inventory folder</strong><br /> Copies the generated <code>inventory.yml</code> file into the Kubespray directory for further use.</p>
</li>
<li><p><strong>Disable swap</strong><br /> Disables swap on the hosts as required by Kubernetes.</p>
</li>
<li><p><strong>Remove swap entry from</strong> <code>/etc/fstab</code><br /> Removes any entries related to swap in <code>/etc/fstab</code> to prevent it from re-enabling at boot.</p>
</li>
<li><p><strong>Load necessary kernel modules</strong><br /> Loads required kernel modules (<code>br_netfilter</code> and <code>overlay</code>) for Kubernetes networking.</p>
</li>
<li><p><strong>Ensure kernel modules are loaded on boot</strong><br /> Adds kernel modules to <code>/etc/modules-load.d/kubernetes.conf</code> to ensure they are loaded on boot.</p>
</li>
<li><p><strong>Configure sysctl for Kubernetes networking</strong><br /> Configures sysctl settings to enable IP forwarding and ensure proper Kubernetes networking (<code>net.bridge.bridge-nf-call-iptables</code>, <code>net.ipv4.ip_forward</code>).</p>
</li>
<li><p><strong>Apply sysctl settings</strong></p>
<p> Applies the sysctl settings to ensure they are active immediately.</p>
</li>
<li><p><strong>Configure firewall rules for Kubernetes</strong><br />Configures firewalld to allow traffic on essential Kubernetes services like <code>ssh</code>, <code>kube-api</code>, and more.</p>
</li>
</ol>
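<p>One caveat worth noting on the fstab task: the pattern <code>'(^.*swap.*$)'</code> also matches lines that are already commented out, so each re-run of the playbook prepends another <code>#</code>. A slightly tighter pattern (a sketch of an alternative, not what the playbook above uses) only touches active entries, keeping repeated runs idempotent:</p>
<pre><code class="lang-yaml">    - name: Remove swap entry from /etc/fstab
      become: true
      ansible.builtin.replace:
        path: /etc/fstab
        # Skip lines that already start with '#', so re-runs
        # don't stack additional comment markers
        regexp: '^([^#].*\sswap\s.*)$'
        replace: '# \1'
</code></pre>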
<p><strong>Handlers</strong></p>
<pre><code class="lang-yaml"> <span class="hljs-attr">handlers:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Reload</span> <span class="hljs-string">firewalld</span>
    <span class="hljs-attr">become:</span> <span class="hljs-literal">true</span>
    <span class="hljs-attr">ansible.builtin.command:</span> <span class="hljs-string">systemctl</span> <span class="hljs-string">reload</span> <span class="hljs-string">firewalld</span>

  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Reload</span> <span class="hljs-string">sysctl</span>
    <span class="hljs-attr">become:</span> <span class="hljs-literal">true</span>
    <span class="hljs-attr">ansible.builtin.command:</span> <span class="hljs-string">sysctl</span> <span class="hljs-string">--system</span>
</code></pre>
<p><strong>Purpose:</strong><br />The handlers will reload the firewall and apply sysctl settings when triggered.</p>
<ol>
<li><p><strong>Reload firewalld</strong></p>
<p> Reloads the firewalld service to apply the newly configured rules.</p>
</li>
<li><p><strong>Reload sysctl</strong></p>
<p> Reloads the sysctl configurations to apply networking changes.</p>
</li>
</ol>
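<p>As an aside, the separate "Apply sysctl settings" task and the "Reload sysctl" handler could be folded into the <code>ansible.posix.sysctl</code> module, which writes the file and reloads the settings in one idempotent step. A minimal sketch of that alternative, assuming the <code>ansible.posix</code> collection is installed (the <code>k8s_sysctl</code> variable name here is illustrative):</p>
<pre><code class="lang-yaml">    - name: Configure sysctl for Kubernetes networking
      become: true
      ansible.posix.sysctl:
        name: "{{ item.key }}"
        value: "{{ item.value }}"
        sysctl_file: /etc/sysctl.d/kubernetes.conf
        reload: true
      loop: "{{ k8s_sysctl | dict2items }}"
      vars:
        k8s_sysctl:
          net.bridge.bridge-nf-call-ip6tables: 1
          net.bridge.bridge-nf-call-iptables: 1
          net.ipv4.ip_forward: 1
</code></pre>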
<h3 id="heading-post-kubespray-playbook-post-kubespray-setup-pbyml">Post-Kubespray Playbook - <code>post-kubespray-setup-pb.yml</code></h3>
<p>The Post playbook currently performs a single function, though it involves several tasks. It downloads the latest version of <code>kubectl</code>, puts the admin kubeconfig in place, and ensures proper file ownership for the user. The playbook primarily uses Ansible's <code>file</code> and <code>shell</code> modules, essentially turning a series of steps from the documentation into an automated process. Notably, these <code>kubectl</code> tasks only run on hosts designated as Kubernetes control-plane nodes. I plan to expand this playbook in the future to include additional tasks, such as tests and more.</p>
<pre><code class="lang-yaml"><span class="hljs-meta">---</span>
<span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Setup</span> <span class="hljs-string">kubectl</span> <span class="hljs-string">on</span> <span class="hljs-string">control</span> <span class="hljs-string">plane</span> <span class="hljs-string">nodes</span>
  <span class="hljs-attr">hosts:</span> <span class="hljs-string">k8s_nodes</span>
  <span class="hljs-attr">gather_facts:</span> <span class="hljs-literal">false</span>
  <span class="hljs-attr">tasks:</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Kubectl</span> <span class="hljs-string">block</span>
      <span class="hljs-attr">block:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Download</span> <span class="hljs-string">kubectl</span> <span class="hljs-string">files</span> <span class="hljs-string">(latest)</span>
          <span class="hljs-attr">ansible.builtin.shell:</span>
            <span class="hljs-attr">cmd:</span> <span class="hljs-string">curl</span> <span class="hljs-string">-LO</span> <span class="hljs-string">https://storage.googleapis.com/kubernetes-release/release/`curl</span> <span class="hljs-string">-s</span> <span class="hljs-string">https://storage.googleapis.com/kubernetes-release/release/stable.txt`/bin/linux/amd64/kubectl</span>
            <span class="hljs-attr">chdir:</span> <span class="hljs-string">/home/{{</span> <span class="hljs-string">ansible_user</span> <span class="hljs-string">}}</span>

        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Copy</span> <span class="hljs-string">kubernetes</span> <span class="hljs-string">admin</span> <span class="hljs-string">configuration</span>
          <span class="hljs-attr">become:</span> <span class="hljs-literal">true</span>
          <span class="hljs-attr">ansible.builtin.shell:</span>
            <span class="hljs-attr">cmd:</span> <span class="hljs-string">cp</span> <span class="hljs-string">/etc/kubernetes/admin.conf</span> <span class="hljs-string">/home/{{</span> <span class="hljs-string">ansible_user</span> <span class="hljs-string">}}/config</span>
            <span class="hljs-attr">chdir:</span> <span class="hljs-string">/home/{{</span> <span class="hljs-string">ansible_user</span> <span class="hljs-string">}}</span>

        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Remove</span> <span class="hljs-string">existing</span> <span class="hljs-string">.kube</span> <span class="hljs-string">directory</span>
          <span class="hljs-attr">ansible.builtin.file:</span>
            <span class="hljs-attr">path:</span> <span class="hljs-string">/home/{{</span> <span class="hljs-string">ansible_user</span> <span class="hljs-string">}}/.kube</span>
            <span class="hljs-attr">state:</span> <span class="hljs-string">absent</span>

        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Create</span> <span class="hljs-string">fresh</span> <span class="hljs-string">.kube</span> <span class="hljs-string">directory</span>
          <span class="hljs-attr">ansible.builtin.file:</span>
            <span class="hljs-attr">path:</span> <span class="hljs-string">/home/{{</span> <span class="hljs-string">ansible_user</span> <span class="hljs-string">}}/.kube</span>
            <span class="hljs-attr">state:</span> <span class="hljs-string">directory</span>
            <span class="hljs-attr">mode:</span> <span class="hljs-string">'0755'</span>

        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Move</span> <span class="hljs-string">kubernetes</span> <span class="hljs-string">admin</span> <span class="hljs-string">configuration</span>
          <span class="hljs-attr">ansible.builtin.shell:</span>
            <span class="hljs-attr">cmd:</span> <span class="hljs-string">mv</span> <span class="hljs-string">config</span> <span class="hljs-string">.kube/</span>
            <span class="hljs-attr">chdir:</span> <span class="hljs-string">/home/{{</span> <span class="hljs-string">ansible_user</span> <span class="hljs-string">}}</span>

        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Correct</span> <span class="hljs-string">ownership</span> <span class="hljs-string">of</span> <span class="hljs-string">.kube</span> <span class="hljs-string">config</span>
          <span class="hljs-attr">become:</span> <span class="hljs-literal">true</span>
          <span class="hljs-attr">ansible.builtin.file:</span>
            <span class="hljs-attr">path:</span> <span class="hljs-string">/home/{{</span> <span class="hljs-string">ansible_user</span> <span class="hljs-string">}}/.kube/config</span>
            <span class="hljs-attr">owner:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ ansible_user }}</span>"</span>
            <span class="hljs-attr">group:</span> <span class="hljs-number">1000</span>

      <span class="hljs-attr">when:</span> <span class="hljs-string">hostvars[inventory_hostname]['master_node']</span>
</code></pre>
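<p>The <code>cp</code>/<code>mv</code> shell steps above could also be collapsed into a single idempotent task using <code>ansible.builtin.copy</code> with <code>remote_src</code>. A hedged sketch of that alternative (it assumes the earlier <code>file</code> task has already created the <code>.kube</code> directory):</p>
<pre><code class="lang-yaml">        - name: Install kubernetes admin configuration into ~/.kube
          become: true
          ansible.builtin.copy:
            # remote_src copies a file that already exists on the
            # managed host instead of pushing one from the controller
            src: /etc/kubernetes/admin.conf
            dest: "/home/{{ ansible_user }}/.kube/config"
            remote_src: true
            owner: "{{ ansible_user }}"
            group: "{{ ansible_user }}"
            mode: '0600'
</code></pre>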
<h2 id="heading-building-the-k8s-cluster-running-the-playbook"><strong>Building the K8s Cluster (Running the Playbook)</strong></h2>
<h3 id="heading-pre-kubespray-setup-pbyml"><code>pre-kubespray-setup-pb.yml</code></h3>
<p>To start, the inventory file needs to be created or modified. If it was pulled down from the GitHub repository, you will only need to modify it to match your environment. If not, the example <code>inventory.yml</code> file shown earlier in this post can be used. In the example file there is just a single control-plane node and etcd server (rocky9-lab-node1), and all 3 nodes are set as worker nodes.</p>
<p>To execute the playbook run the following command -</p>
<p><code>ansible-playbook pre-kubespray-setup-pb.yml -i inventory.yml --ask-become-pass --ask-pass</code></p>
<p><strong>NOTE:</strong> This assumes all previous setup was completed, the Python virtual environment is active, and the <code>kubespray-addons</code> folder sits adjacent to the main Kubespray folder. Otherwise this playbook will fail.</p>
<p>The SSH/sudo passwords for the K8s nodes will need to be entered.</p>
<p>Below is an example output of successful Pre Kubespray playbook run -</p>
<pre><code class="lang-typescript">(kubespray_env) [jeff<span class="hljs-meta">@rocky9</span>-lab-mgmt kubespray-addons]$ ansible-playbook pre-kubespray-setup-pb.yml -i inventory.yml --ask-become-pass --ask-pass
SSH password: 
BECOME password[defaults to SSH password]: 

PLAY [Pre Kubespray Setup] ***************************************************************************************************************************

TASK [Install collections <span class="hljs-keyword">from</span> requirements.yml] *****************************************************************************************************
changed: [rocky9-lab-node1 -&gt; localhost]

TASK [Generate SSH key pair] *************************************************************************************************************************
ok: [rocky9-lab-node1 -&gt; localhost]

TASK [Ensure the SSH <span class="hljs-keyword">public</span> key is present on the remote host] ***************************************************************************************
skipping: [rocky9-lab-mgmt]
changed: [rocky9-lab-node1]
changed: [rocky9-lab-node2]
changed: [rocky9-lab-node3]

TASK [Add entries to /etc/hosts] *********************************************************************************************************************
changed: [rocky9-lab-node2] =&gt; (item=rocky9-lab-node1)
changed: [rocky9-lab-node1] =&gt; (item=rocky9-lab-node1)
changed: [rocky9-lab-node3] =&gt; (item=rocky9-lab-node1)
changed: [rocky9-lab-node1] =&gt; (item=rocky9-lab-node2)
changed: [rocky9-lab-node2] =&gt; (item=rocky9-lab-node2)
changed: [rocky9-lab-node3] =&gt; (item=rocky9-lab-node2)
ok: [rocky9-lab-mgmt] =&gt; (item=rocky9-lab-node1)
changed: [rocky9-lab-node2] =&gt; (item=rocky9-lab-node3)
changed: [rocky9-lab-node1] =&gt; (item=rocky9-lab-node3)
changed: [rocky9-lab-node3] =&gt; (item=rocky9-lab-node3)
changed: [rocky9-lab-node2] =&gt; (item=rocky9-lab-mgmt)
changed: [rocky9-lab-node1] =&gt; (item=rocky9-lab-mgmt)
ok: [rocky9-lab-mgmt] =&gt; (item=rocky9-lab-node2)
changed: [rocky9-lab-node3] =&gt; (item=rocky9-lab-mgmt)
ok: [rocky9-lab-mgmt] =&gt; (item=rocky9-lab-node3)
ok: [rocky9-lab-mgmt] =&gt; (item=rocky9-lab-mgmt)

PLAY [Build Kubespray inventory and additional k8s node setup] ***************************************************************************************

TASK [Create inventory directory <span class="hljs-keyword">if</span> it does not exist] ***********************************************************************************************
ok: [rocky9-lab-node1 -&gt; localhost]

TASK [Generate inventory.yml <span class="hljs-keyword">for</span> kubespray using Jinja2] *********************************************************************************************
ok: [rocky9-lab-node2 -&gt; localhost]
ok: [rocky9-lab-node3 -&gt; localhost]
ok: [rocky9-lab-node1 -&gt; localhost]

TASK [Copy completed template to kubespray inventory folder] *****************************************************************************************
changed: [rocky9-lab-node1 -&gt; localhost]
changed: [rocky9-lab-node2 -&gt; localhost]
changed: [rocky9-lab-node3 -&gt; localhost]

TASK [Disable swap] **********************************************************************************************************************************
changed: [rocky9-lab-node1]
changed: [rocky9-lab-node3]
changed: [rocky9-lab-node2]

TASK [Remove swap entry <span class="hljs-keyword">from</span> /etc/fstab] *************************************************************************************************************
changed: [rocky9-lab-node2]
changed: [rocky9-lab-node1]
changed: [rocky9-lab-node3]

TASK [Load necessary kernel modules] *****************************************************************************************************************
changed: [rocky9-lab-node1] =&gt; (item=br_netfilter)
changed: [rocky9-lab-node3] =&gt; (item=br_netfilter)
changed: [rocky9-lab-node2] =&gt; (item=br_netfilter)
changed: [rocky9-lab-node1] =&gt; (item=overlay)
changed: [rocky9-lab-node2] =&gt; (item=overlay)
changed: [rocky9-lab-node3] =&gt; (item=overlay)

TASK [Ensure kernel modules are loaded on boot] ******************************************************************************************************
changed: [rocky9-lab-node2]
changed: [rocky9-lab-node3]
changed: [rocky9-lab-node1]

TASK [Configure sysctl <span class="hljs-keyword">for</span> Kubernetes networking] ****************************************************************************************************
changed: [rocky9-lab-node1]
changed: [rocky9-lab-node2]
changed: [rocky9-lab-node3]

TASK [Apply sysctl settings] *************************************************************************************************************************
changed: [rocky9-lab-node2]
changed: [rocky9-lab-node1]
changed: [rocky9-lab-node3]

TASK [Configure firewall rules <span class="hljs-keyword">for</span> Kubernetes] *******************************************************************************************************
ok: [rocky9-lab-node1] =&gt; (item=ssh)
ok: [rocky9-lab-node2] =&gt; (item=ssh)
ok: [rocky9-lab-node3] =&gt; (item=ssh)
changed: [rocky9-lab-node3] =&gt; (item=http)
changed: [rocky9-lab-node1] =&gt; (item=http)
changed: [rocky9-lab-node2] =&gt; (item=http)
...output omitted <span class="hljs-keyword">for</span> brevity...
changed: [rocky9-lab-node2] =&gt; (item=etcd-server)
changed: [rocky9-lab-node3] =&gt; (item=kubelet-worker)
changed: [rocky9-lab-node3] =&gt; (item=etcd-server)

RUNNING HANDLER [Reload firewalld] *******************************************************************************************************************
changed: [rocky9-lab-node1]
changed: [rocky9-lab-node2]
changed: [rocky9-lab-node3]

RUNNING HANDLER [Reload sysctl] **********************************************************************************************************************
changed: [rocky9-lab-node2]
changed: [rocky9-lab-node1]
changed: [rocky9-lab-node3]

PLAY RECAP *******************************************************************************************************************************************
rocky9-lab-mgmt            : ok=<span class="hljs-number">1</span>    changed=<span class="hljs-number">0</span>    unreachable=<span class="hljs-number">0</span>    failed=<span class="hljs-number">0</span>    skipped=<span class="hljs-number">1</span>    rescued=<span class="hljs-number">0</span>    ignored=<span class="hljs-number">0</span>   
rocky9-lab-node1           : ok=<span class="hljs-number">16</span>   changed=<span class="hljs-number">13</span>   unreachable=<span class="hljs-number">0</span>    failed=<span class="hljs-number">0</span>    skipped=<span class="hljs-number">0</span>    rescued=<span class="hljs-number">0</span>    ignored=<span class="hljs-number">0</span>   
rocky9-lab-node2           : ok=<span class="hljs-number">13</span>   changed=<span class="hljs-number">12</span>   unreachable=<span class="hljs-number">0</span>    failed=<span class="hljs-number">0</span>    skipped=<span class="hljs-number">0</span>    rescued=<span class="hljs-number">0</span>    ignored=<span class="hljs-number">0</span>   
rocky9-lab-node3           : ok=<span class="hljs-number">13</span>   changed=<span class="hljs-number">12</span>   unreachable=<span class="hljs-number">0</span>    failed=<span class="hljs-number">0</span>    skipped=<span class="hljs-number">0</span>    rescued=<span class="hljs-number">0</span>    ignored=<span class="hljs-number">0</span>
</code></pre>
<p>There should be many changes for the K8s nodes and no failures.</p>
<p>The Kubernetes nodes should now be ready for the Kubernetes cluster build via Kubespray.</p>
<h3 id="heading-kubespray-1"><code>Kubespray</code></h3>
<p>The next step is executing the Kubespray cluster build playbook, which should now be straightforward. We will use the <code>k8s-hosts.yml</code> file generated by the Pre-Kubespray playbook as the inventory Kubespray requires. It is located in the <code>inventory</code> folder within the main Kubespray directory. You can see the contents of this file below -</p>
<pre><code class="lang-yaml"><span class="hljs-attr">all:</span>
  <span class="hljs-attr">hosts:</span>
    <span class="hljs-attr">rocky9-lab-node1:</span>
      <span class="hljs-attr">ansible_host:</span> <span class="hljs-number">172.16</span><span class="hljs-number">.99</span><span class="hljs-number">.25</span>
      <span class="hljs-attr">ip:</span> <span class="hljs-number">172.16</span><span class="hljs-number">.99</span><span class="hljs-number">.25</span>
      <span class="hljs-attr">access_ip:</span> <span class="hljs-number">172.16</span><span class="hljs-number">.99</span><span class="hljs-number">.25</span>
    <span class="hljs-attr">rocky9-lab-node2:</span>
      <span class="hljs-attr">ansible_host:</span> <span class="hljs-number">172.16</span><span class="hljs-number">.99</span><span class="hljs-number">.26</span>
      <span class="hljs-attr">ip:</span> <span class="hljs-number">172.16</span><span class="hljs-number">.99</span><span class="hljs-number">.26</span>
      <span class="hljs-attr">access_ip:</span> <span class="hljs-number">172.16</span><span class="hljs-number">.99</span><span class="hljs-number">.26</span>
    <span class="hljs-attr">rocky9-lab-node3:</span>
      <span class="hljs-attr">ansible_host:</span> <span class="hljs-number">172.16</span><span class="hljs-number">.99</span><span class="hljs-number">.27</span>
      <span class="hljs-attr">ip:</span> <span class="hljs-number">172.16</span><span class="hljs-number">.99</span><span class="hljs-number">.27</span>
      <span class="hljs-attr">access_ip:</span> <span class="hljs-number">172.16</span><span class="hljs-number">.99</span><span class="hljs-number">.27</span>

  <span class="hljs-attr">children:</span>
    <span class="hljs-attr">kube_control_plane:</span>
      <span class="hljs-attr">hosts:</span>
        <span class="hljs-attr">rocky9-lab-node1:</span>

    <span class="hljs-attr">kube_node:</span>
      <span class="hljs-attr">hosts:</span>
        <span class="hljs-attr">rocky9-lab-node1:</span>
        <span class="hljs-attr">rocky9-lab-node2:</span>
        <span class="hljs-attr">rocky9-lab-node3:</span>

    <span class="hljs-attr">etcd:</span>
      <span class="hljs-attr">hosts:</span>
        <span class="hljs-attr">rocky9-lab-node1:</span>

    <span class="hljs-attr">k8s_cluster:</span>
      <span class="hljs-attr">children:</span>
        <span class="hljs-attr">kube_control_plane:</span>
        <span class="hljs-attr">kube_node:</span>
    <span class="hljs-attr">calico_rr:</span>
      <span class="hljs-attr">hosts:</span> {}
</code></pre>
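<p>The inventory above describes a single control-plane/etcd lab layout. If you later wanted an HA control plane, the same file would simply grow extra hosts under <code>kube_control_plane</code> and <code>etcd</code> (etcd member counts should stay odd for quorum). A sketch of what that change might look like for this 3-node lab:</p>
<pre><code class="lang-yaml">  children:
    kube_control_plane:
      hosts:
        rocky9-lab-node1:
        rocky9-lab-node2:
        rocky9-lab-node3:

    # etcd needs an odd number of members to maintain quorum
    etcd:
      hosts:
        rocky9-lab-node1:
        rocky9-lab-node2:
        rocky9-lab-node3:
</code></pre>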
<p>Change into the main Kubespray directory and execute the playbook like below -</p>
<p><code>ansible-playbook -i inventory/k8s-hosts.yml --ask-pass --become --ask-become-pass cluster.yml</code></p>
<p><strong>NOTE:</strong> Kubespray/Kubernetes requires root access to run successfully, hence the <code>--become</code> flag. SSH/sudo passwords are again required.</p>
<p>Kubespray can take 15-20 minutes to finish. The output is vast, so I won't paste a full example here. A successful run should end with output like the below -</p>
<pre><code class="lang-typescript">PLAY RECAP ***********************************************************************************************************************************************
rocky9-lab-node1           : ok=<span class="hljs-number">649</span>  changed=<span class="hljs-number">88</span>   unreachable=<span class="hljs-number">0</span>    failed=<span class="hljs-number">0</span>    skipped=<span class="hljs-number">1090</span> rescued=<span class="hljs-number">0</span>    ignored=<span class="hljs-number">6</span>   
rocky9-lab-node2           : ok=<span class="hljs-number">415</span>  changed=<span class="hljs-number">36</span>   unreachable=<span class="hljs-number">0</span>    failed=<span class="hljs-number">0</span>    skipped=<span class="hljs-number">625</span>  rescued=<span class="hljs-number">0</span>    ignored=<span class="hljs-number">1</span>   
rocky9-lab-node3           : ok=<span class="hljs-number">416</span>  changed=<span class="hljs-number">37</span>   unreachable=<span class="hljs-number">0</span>    failed=<span class="hljs-number">0</span>    skipped=<span class="hljs-number">624</span>  rescued=<span class="hljs-number">0</span>    ignored=<span class="hljs-number">1</span>   

Saturday <span class="hljs-number">05</span> October <span class="hljs-number">2024</span>  <span class="hljs-number">13</span>:<span class="hljs-number">39</span>:<span class="hljs-number">17</span> <span class="hljs-number">-0400</span> (<span class="hljs-number">0</span>:<span class="hljs-number">00</span>:<span class="hljs-number">00.115</span>)       <span class="hljs-number">0</span>:<span class="hljs-number">07</span>:<span class="hljs-number">36.442</span> ****** 
=============================================================================== 
kubernetes/kubeadm : Join to cluster ------------------------------------------------------------------------------------------------------------- <span class="hljs-number">21.11</span>s
kubernetes/control-plane : Kubeadm | Initialize first control plane node ------------------------------------------------------------------------- <span class="hljs-number">20.15</span>s
download : Download_container | Download image <span class="hljs-keyword">if</span> required --------------------------------------------------------------------------------------- <span class="hljs-number">11.65</span>s
download : Download_container | Download image <span class="hljs-keyword">if</span> required --------------------------------------------------------------------------------------- <span class="hljs-number">10.34</span>s
container-engine/runc : Download_file | Download item --------------------------------------------------------------------------------------------- <span class="hljs-number">8.51</span>s
container-engine/containerd : Download_file | Download item --------------------------------------------------------------------------------------- <span class="hljs-number">8.25</span>s
container-engine/crictl : Download_file | Download item ------------------------------------------------------------------------------------------- <span class="hljs-number">8.19</span>s
container-engine/nerdctl : Download_file | Download item ------------------------------------------------------------------------------------------ <span class="hljs-number">8.16</span>s
download : Download_container | Download image <span class="hljs-keyword">if</span> required ---------------------------------------------------------------------------------------- <span class="hljs-number">7.65</span>s
etcd : Reload etcd -------------------------------------------------------------------------------------------------------------------------------- <span class="hljs-number">6.14</span>s
container-engine/crictl : Extract_file | Unpacking archive ---------------------------------------------------------------------------------------- <span class="hljs-number">6.08</span>s
container-engine/nerdctl : Extract_file | Unpacking archive --------------------------------------------------------------------------------------- <span class="hljs-number">5.62</span>s
download : Download_container | Download image <span class="hljs-keyword">if</span> required ---------------------------------------------------------------------------------------- <span class="hljs-number">5.23</span>s
etcd : Configure | Check <span class="hljs-keyword">if</span> etcd cluster is healthy ----------------------------------------------------------------------------------------------- <span class="hljs-number">5.23</span>s
kubernetes-apps/ansible : Kubernetes Apps | Lay Down CoreDNS templates ---------------------------------------------------------------------------- <span class="hljs-number">4.75</span>s
kubernetes-apps/ansible : Kubernetes Apps | Start Resources --------------------------------------------------------------------------------------- <span class="hljs-number">4.05</span>s
download : Download_container | Download image <span class="hljs-keyword">if</span> required ---------------------------------------------------------------------------------------- <span class="hljs-number">4.02</span>s
download : Download_container | Download image <span class="hljs-keyword">if</span> required ---------------------------------------------------------------------------------------- <span class="hljs-number">3.58</span>s
network_plugin/cni : CNI | Copy cni plugins ------------------------------------------------------------------------------------------------------- <span class="hljs-number">3.25</span>s
download : Download_file | Download item ---------------------------------------------------------------------------------------------------------- <span class="hljs-number">3.00</span>s
</code></pre>
<p>Kubespray execution can sometimes fail due to connectivity issues or similar problems, especially when pulling down multiple container images, which might time out. If this happens, simply re-run the playbook as described earlier. It will pick up where it left off, skipping the tasks that have already been successfully completed.</p>
<p>If you want to wipe out the Kubernetes cluster, Kubespray provides a reset playbook for that as well. It can be executed as shown in the following example:</p>
<p><code>ansible-playbook -i inventory/k8s-hosts.yml --ask-pass --become --ask-become-pass reset.yml</code></p>
<h3 id="heading-post-kubespray-setup-pbyml"><code>post-kubespray-setup-pb.yml</code></h3>
<p>After successfully creating a K8s cluster with Kubespray, the last piece required is configuring <code>kubectl</code> on the control-plane nodes. To do this, change back into the <code>kubespray-addons</code> directory. The Post-Kubespray playbook can then be executed as shown below:</p>
<p><code>ansible-playbook post-kubespray-setup-pb.yml -i inventory.yml --ask-pass --ask-become-pass</code></p>
<p>A successful execution should produce output similar to this:</p>
<pre><code class="lang-bash">(kubespray_env) [jeff<span class="hljs-meta">@rocky9</span>-lab-mgmt kubespray-addons]$ ansible-playbook post-kubespray-setup-pb.yml -i inventory.yml --ask-pass --ask-become-pass 
SSH password: 
BECOME password[defaults to SSH password]: 

PLAY [Setup kubectl on control plane nodes] **********************************************************************************************************

TASK [Download kubectl files (latest)] ***************************************************************************************************************
skipping: [rocky9-lab-node2]
skipping: [rocky9-lab-node3]
changed: [rocky9-lab-node1]

TASK [Copy kubernetes admin configuration] ***********************************************************************************************************
skipping: [rocky9-lab-node2]
skipping: [rocky9-lab-node3]
changed: [rocky9-lab-node1]

TASK [Remove existing .kube directory] ***************************************************************************************************************
skipping: [rocky9-lab-node2]
skipping: [rocky9-lab-node3]
ok: [rocky9-lab-node1]

TASK [Create fresh .kube directory] ******************************************************************************************************************
skipping: [rocky9-lab-node2]
skipping: [rocky9-lab-node3]
changed: [rocky9-lab-node1]

TASK [Move kubernetes admin configuration] ***********************************************************************************************************
skipping: [rocky9-lab-node2]
skipping: [rocky9-lab-node3]
changed: [rocky9-lab-node1]

TASK [Correct ownership <span class="hljs-keyword">of</span> .kube config] *************************************************************************************************************
skipping: [rocky9-lab-node2]
skipping: [rocky9-lab-node3]
changed: [rocky9-lab-node1]

PLAY RECAP *******************************************************************************************************************************************
rocky9-lab-node1           : ok=<span class="hljs-number">6</span>    changed=<span class="hljs-number">5</span>    unreachable=<span class="hljs-number">0</span>    failed=<span class="hljs-number">0</span>    skipped=<span class="hljs-number">0</span>    rescued=<span class="hljs-number">0</span>    ignored=<span class="hljs-number">0</span>   
rocky9-lab-node2           : ok=<span class="hljs-number">0</span>    changed=<span class="hljs-number">0</span>    unreachable=<span class="hljs-number">0</span>    failed=<span class="hljs-number">0</span>    skipped=<span class="hljs-number">6</span>    rescued=<span class="hljs-number">0</span>    ignored=<span class="hljs-number">0</span>   
rocky9-lab-node3           : ok=<span class="hljs-number">0</span>    changed=<span class="hljs-number">0</span>    unreachable=<span class="hljs-number">0</span>    failed=<span class="hljs-number">0</span>    skipped=<span class="hljs-number">6</span>    rescued=<span class="hljs-number">0</span>    ignored=<span class="hljs-number">0</span>
</code></pre>
<p><strong>NOTE:</strong> You should see ‘changed’ only for nodes designated as control-plane nodes.</p>
<h2 id="heading-closing-thoughts"><strong>Closing Thoughts</strong></h2>
<p>If all 3 playbooks ran successfully, CONGRATULATIONS, you should have a fully working Kubernetes cluster. To confirm this, log into any of the cluster control-plane nodes and run <code>kubectl get nodes</code>. You should see output similar to the following:</p>
<pre><code class="lang-bash">[jeff@rocky9-lab-node1 ~]$ kubectl get nodes
NAME               STATUS   ROLES           AGE   VERSION
rocky9-lab-node1   Ready    control-plane   39m   v1.30.4
rocky9-lab-node2   Ready    &lt;none&gt;          39m   v1.30.4
rocky9-lab-node3   Ready    &lt;none&gt;          39m   v1.30.4
</code></pre>
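<p>Beyond eyeballing the STATUS column, readiness can be checked programmatically. The shell sketch below parses sample text mirroring the output above; on a live cluster you would capture <code>kubectl get nodes --no-headers</code> instead:</p>

```bash
# Check that every node reports Ready. Sample text mirrors the output above;
# on a real cluster use: nodes_output=$(kubectl get nodes --no-headers)
nodes_output='rocky9-lab-node1   Ready    control-plane   39m   v1.30.4
rocky9-lab-node2   Ready    &lt;none&gt;          39m   v1.30.4
rocky9-lab-node3   Ready    &lt;none&gt;          39m   v1.30.4'

# Column 2 is STATUS; collect any node whose status is not "Ready".
not_ready=$(printf '%s\n' "$nodes_output" | awk '$2 != "Ready" { print $1 }')
if [ -z "$not_ready" ]; then
  echo "all nodes Ready"
else
  echo "not Ready: $not_ready"
fi
```

<p>The same check works unattended in a cron job or CI pipeline, where a non-empty <code>not_ready</code> list could trigger an alert instead of an echo.</p>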
<p>The Kubernetes cluster is fully set up, providing a solid foundation for what’s coming in the next post, and eventually Containerlab/Clabernetes. You can also use this cluster to dive deeper into the world of Kubernetes beyond what we’re covering here. Experiment, expand the cluster, tear it down, and rebuild it—become an expert if you wish. Hopefully, this post makes the entry into Kubernetes a bit easier for those starting out.</p>
<h2 id="heading-whats-next"><strong>What’s next?</strong></h2>
<p>I have at least two more pieces to add to this series:</p>
<ol>
<li><p><s>Building a Cluster (Part 1)</s></p>
</li>
<li><p>Adding Built-in Storage Cluster using MicroCeph (Part 2)</p>
</li>
<li><p>Setting up and exploring Containerlab/Clabernetes (Part 3)</p>
</li>
</ol>
<p>I also plan to add posts covering specific topology examples, integration with other tools, and network automation testing. These topics may either extend this series or become their own separate posts. There’s always a wealth of topics to explore and write about.</p>
<p>You can find the code that goes along with this post <a target="_blank" href="https://github.com/leothelyon17/kubespray-addons">here</a> (Github).</p>
<p>Thoughts, questions, and comments are appreciated. Please follow me here at Hashnode or connect with me on <a target="_blank" href="https://www.linkedin.com/in/jeffrey-m-lyon/">Linkedin</a>.</p>
<p>Thank you for reading fellow techies!</p>
]]></content:encoded></item><item><title><![CDATA[Unraid VM Snapshot Automation with Ansible: Part 2 - Restoring Snapshots]]></title><description><![CDATA[Intro
Hello again, and welcome to the second post in my Unraid snapshot automation series!
In my first post, we explored how to use Ansible to automate the creation of VM snapshots on Unraid, simplifying the backup process for home lab setups or even...]]></description><link>https://blog.nerdylyonsden.io/unraid-vm-snapshot-automation-with-ansible-part-2</link><guid isPermaLink="true">https://blog.nerdylyonsden.io/unraid-vm-snapshot-automation-with-ansible-part-2</guid><category><![CDATA[unraid]]></category><category><![CDATA[ansible]]></category><category><![CDATA[automation]]></category><category><![CDATA[snapshot]]></category><dc:creator><![CDATA[Jeffrey Lyon]]></dc:creator><pubDate>Mon, 16 Sep 2024 14:21:09 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1726412209635/fecf7a1e-0e6c-4bdd-b86b-af02374f79c2.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-intro"><strong>Intro</strong></h2>
<p>Hello again, and welcome to the second post in my Unraid snapshot automation series!</p>
<p>In my first post, we explored how to use Ansible to automate the creation of VM snapshots on Unraid, simplifying the backup process for home lab setups or even more advanced environments. Now, it's time to complete the picture by diving into <strong>snapshot restoration</strong>. In this post, I'll show you how to leverage those snapshots we created earlier to quickly and efficiently roll back VMs to a previous state.</p>
<p>Whether you're testing, troubleshooting, or simply maintaining a reliable baseline for your VMs, automated snapshot restoration will save you time and effort. Like before, this is designed with the home lab community in mind, but the process can easily be adapted for other Linux-based systems.</p>
<p>The first post can be found here:<br /><a target="_blank" href="https://thenerdylyonsden.hashnode.dev/unraid-vm-snapshot-automation-with-ansible-part-1">https://thenerdylyonsden.hashnode.dev/unraid-vm-snapshot-automation-with-ansible-part-1</a></p>
<p>Let’s get started!</p>
<h2 id="heading-scenario-and-requirements">Scenario and Requirements</h2>
<p>This section largely mirrors the previous post. I'll be using the snapshot files created earlier—both the remote <code>.img</code> and local <code>.tar</code> files. The setup remains the same: I'll use the Ubuntu Ansible host, the Unraid server for local snapshots, and the Synology DiskStation for remote storage. For local restores, the Unraid server will act as both the source and destination. No additional packages or configurations are required on any of the systems.</p>
<h2 id="heading-lets-automate"><strong>Let's Automate!</strong></h2>
<h3 id="heading-overview-and-setup"><strong>Overview and Setup</strong></h3>
<p>Let's review the playbook directory structure from the previous post. It looks like this:</p>
<pre><code class="lang-bash">
├── README.md
├── create-snapshot-pb.yml
├── defaults
│   └── inventory.yml
├── files
│   ├── backup-playbook-old.yml
│   └── snapshot-creation-unused.yml
├── handlers
├── meta
├── restore-from-local-tar-pb.yml
├── restore-from-snapshot-pb.yml
├── tasks
│   ├── shutdown-vm.yml
│   └── start-vm.yml
├── templates
├── tests
│   ├── debug-tests-pb.yml
│   └── simple-debugs.yml
└── vars
    ├── snapshot-creation-vars.yml
    └── snapshot-restore-vars.yml
</code></pre>
<p>Most of this was covered in the previous post. I will cover the new files here:</p>
<ul>
<li><p><code>vars/snapshot-restore-vars.yml</code> Similar to the create file, this file is where users specify the list of VMs and their corresponding disks for snapshot restoration. It primarily consists of a dictionary outlining the VMs and the disks to be restored. Additionally, it includes variables for configuring the connection to the destination NAS device.</p>
</li>
<li><p><code>restore-from-snapshot-pb.yml</code> This playbook manages the restoration process from the remote snapshot repository and is composed of three plays. The first play serves two functions: it verifies the targeted Unraid VMs and disks, and builds additional data structures along with dynamic host groups. The second play locates the correct snapshots, transfers them to the Unraid server, and handles file comparison, VM shutdown, and replacing the original disk with the snapshot. The third play restarts the VMs once all other tasks are completed.</p>
</li>
<li><p><code>restore-from-local-tar-pb.yml</code> Largely the same as above, but it performs everything locally on the Unraid server, using <code>.tar</code> files instead of remote snapshots.</p>
</li>
</ul>
<h3 id="heading-inventory-defaultsinventoryyml">Inventory - <code>defaults/inventory.yml</code></h3>
<p>Covered in Part 1. Shown here again for reference:</p>
<pre><code class="lang-yaml"><span class="hljs-meta">---</span>
<span class="hljs-attr">nodes:</span>
  <span class="hljs-attr">hosts:</span>
    <span class="hljs-attr">diskstation:</span>
      <span class="hljs-attr">ansible_host:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ lookup('env', 'DISKSTATION_IP_ADDRESS') }}</span>"</span>
      <span class="hljs-attr">ansible_user:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ lookup('env', 'DISKSTATION_USER') }}</span>"</span>
      <span class="hljs-attr">ansible_password:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ lookup('env', 'DISKSTATION_PASS') }}</span>"</span>
    <span class="hljs-attr">unraid:</span>
      <span class="hljs-attr">ansible_host:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ lookup('env', 'UNRAID_IP_ADDRESS') }}</span>"</span>
      <span class="hljs-attr">ansible_user:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ lookup('env', 'UNRAID_USER') }}</span>"</span>
      <span class="hljs-attr">ansible_password:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ lookup('env', 'UNRAID_PASS') }}</span>"</span>
</code></pre>
<p>Defines the connection variables for the unraid and diskstation hosts.</p>
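<p>Because the inventory pulls everything from environment variables, those variables must be exported in your shell before running any of the playbooks. The values below are placeholders, not real hosts or credentials:</p>

```bash
# Environment variables consumed by the lookup('env', ...) calls above.
# Placeholder values -- substitute your own addresses and credentials.
export DISKSTATION_IP_ADDRESS="192.168.1.50"
export DISKSTATION_USER="backup-user"
export DISKSTATION_PASS="changeme"
export UNRAID_IP_ADDRESS="192.168.1.10"
export UNRAID_USER="root"
export UNRAID_PASS="changeme"
```

<p>Keeping these in a sourced env file outside version control keeps credentials out of the inventory and playbooks themselves.</p>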
<h3 id="heading-variables-varssnapshot-restore-varsyml">Variables - <code>vars/snapshot-restore-vars.yml</code></h3>
<p>Much like the snapshot creation automation, this playbook relies on a single variable file that serves as the primary point of interaction for the user. In this file, you’ll list the VMs, specify the disks to be restored for each VM, provide the path to the existing disk <code>.img</code> file, and indicate the snapshot you wish to restore from. If a snapshot name is not specified, the playbook will automatically search for and restore from the most recent snapshot associated with the disk.</p>
<pre><code class="lang-yaml"><span class="hljs-meta">---</span>
<span class="hljs-attr">snapshot_repository_base_directory:</span> <span class="hljs-string">volume1/Home\</span> <span class="hljs-string">Media/Backup</span>
<span class="hljs-attr">repository_user:</span> <span class="hljs-string">unraid</span>

<span class="hljs-attr">snapshot_restore_list:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">vm_name:</span> <span class="hljs-string">Rocky9-TESTNode</span>
    <span class="hljs-attr">disks_to_restore:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">vm_disk_to_restore:</span> <span class="hljs-string">vdisk1.img</span>
        <span class="hljs-attr">vm_disk_directory:</span> <span class="hljs-string">/mnt/cache/domains</span>
        <span class="hljs-attr">snapshot_to_restore_from:</span> <span class="hljs-string">test-snapshot</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">vm_disk_to_restore:</span> <span class="hljs-string">vdisk2.img</span>
        <span class="hljs-attr">vm_disk_directory:</span> <span class="hljs-string">/mnt/disk1/domains</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">vm_name:</span> <span class="hljs-string">Rocky9-LabNode3</span>
    <span class="hljs-attr">disks_to_restore:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">vm_disk_to_restore:</span> <span class="hljs-string">vdisk1.img</span>
        <span class="hljs-attr">vm_disk_directory:</span> <span class="hljs-string">/mnt/nvme_cache/domains</span>
        <span class="hljs-attr">snapshot_to_restore_from:</span> <span class="hljs-string">kubernetes-baseline</span>
</code></pre>
<p>Let's examine this file. It's similar to the one used for creation, though I used slightly more descriptive key names this time:</p>
<ul>
<li><p><code>snapshot_restore_list</code> - the main data structure for defining your list of VMs and disks. Within this there are two main variables: <code>vm_name</code> and <code>disks_to_restore</code>.</p>
</li>
<li><p><code>vm_name</code> - defines the name of your VM. It must match the name of the VM as configured within the Unraid system itself.</p>
</li>
<li><p><code>disks_to_restore</code> - a per VM list consisting of the disks that will be restored. This list requires two variables—<code>vm_disk_to_restore</code> and <code>vm_disk_directory</code>, with <code>snapshot_to_restore_from</code> as an ‘optional’ third variable.</p>
</li>
<li><p><code>vm_disk_to_restore</code> - contains the existing <code>.img</code> file name for that VM disk, e.g. <code>vdisk1.img</code></p>
</li>
<li><p><code>vm_disk_directory</code> - contains the absolute root directory path where the per-VM files are stored. An example of a full path to an <code>.img</code> file within Unraid would be: <code>/mnt/cache/domains/Rocky9-TESTNode/vdisk1.img</code></p>
</li>
<li><p><code>snapshot_to_restore_from</code> - is an optional attribute that allows the user to specify the name of the snapshot for restoration. If this attribute is not provided, the playbook will automatically search for and use the latest snapshot that matches the disk.</p>
</li>
<li><p><code>snapshot_repository_base_directory</code> and <code>repository_user</code> are used within the playbook's rsync task. These variables offer flexibility, allowing the user to specify their own remote user and target destination for the rsync operation. They are used only if the snapshots were sent to a remote location upon creation.</p>
</li>
</ul>
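<p>For reference, snapshot files in the repository are named after the disk plus a sanitized snapshot name: dashes become underscores and other special characters are stripped. A shell sketch of that naming (illustrative only, mirroring the Jinja2 logic the restore playbook uses), with the example values from the vars file above:</p>

```bash
# Rebuild the stored snapshot filename from a disk name and snapshot name.
# Illustrative shell version of the playbook's disk_name[:-4] + regex_replace logic.
disk="vdisk1.img"
snap="test-snapshot"

base="${disk%.img}"                                              # strip the .img suffix
clean=$(printf '%s' "$snap" | tr '-' '_' | tr -cd '[:alnum:]_')  # sanitize snapshot name
echo "${base}.${clean}.img"                                      # vdisk1.test_snapshot.img
```

<p>Knowing this pattern makes it easy to browse the repository by hand and confirm which snapshot a given <code>snapshot_to_restore_from</code> value will match.</p>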
<p>Following the provided example, you can define your VMs, disk names, locations, and restoration snapshot names when running the playbook.</p>
<h3 id="heading-playbooks"><strong>Playbooks</strong></h3>
<p>Two distinct playbooks were created to manage disk restoration. The <code>restore-from-snapshot-pb.yml</code> playbook handles restoration from the remote repository (DiskStation) using <code>rsync</code>. Meanwhile, local restoration is managed by <code>restore-from-local-tar-pb.yml</code>. Combining these processes proved to be too complex and unwieldy, so it was simpler and more manageable to build, test, and understand them separately.</p>
<p><strong>NOTE:</strong> Snapshot restoration is much trickier to automate than creation. There are a lot more tasks/conditionals related to error handling in these playbooks.</p>
<h3 id="heading-restore-from-snapshot-pbyml"><code>restore-from-snapshot-pb.yml</code></h3>
<p><strong>Restore Snapshot Preparation Play</strong></p>
<pre><code class="lang-yaml"><span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Restore</span> <span class="hljs-string">Snapshot</span> <span class="hljs-string">Preparation</span>
  <span class="hljs-attr">hosts:</span> <span class="hljs-string">unraid</span>
  <span class="hljs-attr">gather_facts:</span> <span class="hljs-literal">no</span>
  <span class="hljs-attr">vars_files:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">./vars/snapshot-restore-vars.yml</span>

  <span class="hljs-attr">tasks:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Retrieve</span> <span class="hljs-string">List</span> <span class="hljs-string">of</span> <span class="hljs-string">All</span> <span class="hljs-string">Existing</span> <span class="hljs-string">VMs</span> <span class="hljs-string">on</span> <span class="hljs-string">UnRAID</span> <span class="hljs-string">Hypervisor</span>
      <span class="hljs-attr">shell:</span> <span class="hljs-string">virsh</span> <span class="hljs-string">list</span> <span class="hljs-string">--all</span> <span class="hljs-string">|</span> <span class="hljs-string">tail</span> <span class="hljs-string">-n</span> <span class="hljs-string">+3</span> <span class="hljs-string">|</span> <span class="hljs-string">awk</span> <span class="hljs-string">'{ print $2}'</span>
      <span class="hljs-attr">register:</span> <span class="hljs-string">hypervisor_existing_vm_list</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Generate</span> <span class="hljs-string">VM</span> <span class="hljs-string">and</span> <span class="hljs-string">Disk</span> <span class="hljs-string">Lists</span> <span class="hljs-string">for</span> <span class="hljs-string">Validated</span> <span class="hljs-string">VMs</span> <span class="hljs-string">in</span> <span class="hljs-string">User</span> <span class="hljs-string">Inputted</span> <span class="hljs-string">Data</span>
      <span class="hljs-attr">set_fact:</span> 
        <span class="hljs-attr">vms_map:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ snapshot_restore_list | map(attribute='vm_name') }}</span>"</span>
        <span class="hljs-attr">disks_map:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ snapshot_restore_list | map(attribute='disks_to_restore') }}</span>"</span>
      <span class="hljs-attr">when:</span> <span class="hljs-string">item.vm_name</span> <span class="hljs-string">in</span> <span class="hljs-string">hypervisor_existing_vm_list.stdout_lines</span>
      <span class="hljs-attr">with_items:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ snapshot_restore_list }}</span>"</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Build</span> <span class="hljs-string">Data</span> <span class="hljs-string">Structure</span> <span class="hljs-string">for</span> <span class="hljs-string">Snapshot</span> <span class="hljs-string">Restoration</span>
      <span class="hljs-attr">set_fact:</span> 
        <span class="hljs-attr">snapshot_data_map:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ dict(vms_map | zip(disks_map)) | dict2items(key_name='vm_name', value_name='disks_to_restore') | subelements('disks_to_restore') }}</span>"</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Verify</span> <span class="hljs-string">Snapshot</span> <span class="hljs-string">Data</span> <span class="hljs-string">is</span> <span class="hljs-string">Available</span> <span class="hljs-string">for</span> <span class="hljs-string">Restoration</span>
      <span class="hljs-attr">assert:</span>
        <span class="hljs-attr">that:</span>
          <span class="hljs-bullet">-</span> <span class="hljs-string">snapshot_data_map</span>
        <span class="hljs-attr">fail_msg:</span> <span class="hljs-string">"Restore operation failed. Not enough data to proceed."</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Dynamically</span> <span class="hljs-string">Create</span> <span class="hljs-string">Host</span> <span class="hljs-string">Group</span> <span class="hljs-string">for</span> <span class="hljs-string">Disks</span> <span class="hljs-string">to</span> <span class="hljs-string">be</span> <span class="hljs-string">Restored</span>
      <span class="hljs-attr">ansible.builtin.add_host:</span>
        <span class="hljs-attr">name:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ item[0]['vm_name'] }}</span>-<span class="hljs-template-variable">{{ item[1]['vm_disk_to_restore'][:-4] }}</span>"</span>
        <span class="hljs-attr">groups:</span> <span class="hljs-string">disks</span>
        <span class="hljs-attr">vm_name:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ item[0]['vm_name'] }}</span>"</span>
        <span class="hljs-attr">disk_name:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ item[1]['vm_disk_to_restore'] }}</span>"</span>
        <span class="hljs-attr">source_directory:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ item[1]['vm_disk_directory'] }}</span>"</span>
        <span class="hljs-attr">snapshot_to_restore_from:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ item[1]['snapshot_to_restore_from'] | default('latest') }}</span>"</span>
      <span class="hljs-attr">loop:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ snapshot_data_map }}</span>"</span>
</code></pre>
<p><strong>Purpose</strong>:<br />Designed to prepare for the restoration of VM snapshots on an Unraid hypervisor. It gathers information about existing VMs, validates user input, structures the data for restoration, and dynamically creates host groups for managing the restore process.</p>
<p><strong>Hosts</strong>:<br />Targets the <code>unraid</code> host.</p>
<p><strong>Variables File</strong>:<br />Loads additional variables from <code>./vars/snapshot-restore-vars.yml</code>, primarily the user's modified <code>snapshot_restore_list</code>.</p>
<p><strong>Tasks</strong>:</p>
<ol>
<li><p><strong>Retrieve List of All Existing VMs on Unraid Hypervisor</strong>:<br /> Executes a shell command to list all VMs on the Unraid hypervisor and registers the result. It extracts VM names using <code>virsh</code> and formats the output for further use.</p>
</li>
<li><p><strong>Generate VM and Disk Lists for Validated VMs in User Inputted Data:</strong><br /> Constructs lists of VM names and disks to restore from the user input data, but only includes those VMs that exist on the hypervisor. It runs only if the user-supplied variable data matches at least one existing VM name; otherwise the playbook fails due to lack of data.</p>
<ul>
<li><p><code>vms_map</code>: List of VM names.</p>
</li>
<li><p><code>disks_map</code>: List of disks to restore.</p>
</li>
</ul>
</li>
</ol>
<p>    These lists are then used to create the larger <code>snapshot_data_map</code>.</p>
<ol start="3">
<li><p><strong>Build Data Structure for Snapshot Restoration:</strong><br /> Creates a nested data structure that maps each VM to its corresponding disks to restore, preparing it for subsequent tasks.</p>
<ul>
<li><code>snapshot_data_map</code>: Merges the VM and disk maps into a more structured data format, making it easier to access and manage the VM/disk information programmatically. My goal was to keep the inventory files simple for users to understand and modify. However, this approach didn’t work well with the looping logic I needed, so I created this new data map for better flexibility and control.</li>
</ul>
</li>
<li><p><strong>Verify Snapshot Data is Available for Restoration:</strong></p>
<p> Checks that <code>snapshot_data_map</code> has been populated correctly and ensures that there is enough data to proceed with the restoration. If not, it triggers a failure message to indicate insufficient data and halts the playbook.</p>
</li>
<li><p><strong>Dynamically Create Host Group for Disks to be Restored:</strong></p>
<p> Creates dynamic host entries for each disk that needs to be restored. Each host is added to the <code>disks</code> group with relevant information about the VM, disk, and optional snapshot name.</p>
</li>
</ol>
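<p>The <code>virsh</code> pipeline in the first task can be illustrated on representative output: <code>tail -n +3</code> skips the two header rows, and <code>awk</code> prints the second column (the VM names). The sample output below is illustrative, not captured from a live hypervisor:</p>

```bash
# Representative `virsh list --all` output (illustrative only).
virsh_output=' Id   Name              State
-----------------------------------
 1    Rocky9-TESTNode   running
 -    Rocky9-LabNode3   shut off'

# Same pipeline as the playbook task: drop header rows, keep the Name column.
printf '%s\n' "$virsh_output" | tail -n +3 | awk '{ print $2 }'
```

<p>The resulting names are what the user-supplied <code>vm_name</code> values are validated against in the second task.</p>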
<p><strong>Disk Restore From Snapshot Play</strong></p>
<pre><code class="lang-yaml"><span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Disk</span> <span class="hljs-string">Restore</span> <span class="hljs-string">From</span> <span class="hljs-string">Snapshot</span>
  <span class="hljs-attr">hosts:</span> <span class="hljs-string">disks</span>
  <span class="hljs-attr">gather_facts:</span> <span class="hljs-literal">no</span>
  <span class="hljs-attr">vars_files:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">./vars/snapshot-restore-vars.yml</span>

  <span class="hljs-attr">tasks:</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Find</span> <span class="hljs-string">files</span> <span class="hljs-string">in</span> <span class="hljs-string">the</span> <span class="hljs-string">VM</span> <span class="hljs-string">folder</span> <span class="hljs-string">containing</span> <span class="hljs-string">the</span> <span class="hljs-string">target</span> <span class="hljs-string">VM</span> <span class="hljs-string">disk</span> <span class="hljs-string">name</span>
      <span class="hljs-attr">find:</span>
        <span class="hljs-attr">paths:</span> <span class="hljs-string">"/<span class="hljs-template-variable">{{ snapshot_repository_base_directory | regex_replace('\\\\', '')}}</span>/<span class="hljs-template-variable">{{ vm_name }}</span>/"</span>
        <span class="hljs-attr">patterns:</span> <span class="hljs-string">"*<span class="hljs-template-variable">{{ disk_name[:-4] }}</span>*"</span>
        <span class="hljs-attr">recurse:</span> <span class="hljs-literal">yes</span>
      <span class="hljs-attr">register:</span> <span class="hljs-string">found_files</span> 
      <span class="hljs-attr">delegate_to:</span> <span class="hljs-string">diskstation</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Ensure</span> <span class="hljs-string">that</span> <span class="hljs-string">files</span> <span class="hljs-string">were</span> <span class="hljs-string">found</span>
      <span class="hljs-attr">assert:</span>
        <span class="hljs-attr">that:</span>
          <span class="hljs-bullet">-</span> <span class="hljs-string">found_files.matched</span> <span class="hljs-string">&gt;</span> <span class="hljs-number">0</span>
        <span class="hljs-attr">fail_msg:</span> <span class="hljs-string">"No files found matching disk <span class="hljs-template-variable">{{ disk_name[:-4] }}</span> for VM <span class="hljs-template-variable">{{ vm_name }}</span>."</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Create</span> <span class="hljs-string">a</span> <span class="hljs-string">file</span> <span class="hljs-string">list</span> <span class="hljs-string">from</span> <span class="hljs-string">the</span> <span class="hljs-string">target</span> <span class="hljs-string">VM</span> <span class="hljs-string">folder</span> <span class="hljs-string">with</span> <span class="hljs-string">only</span> <span class="hljs-string">file</span> <span class="hljs-string">names</span>
      <span class="hljs-attr">set_fact:</span> 
        <span class="hljs-attr">file_list:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ found_files.files | map(attribute='path') | map('regex_replace','^.*/(.*)$','\\1') | list }}</span>"</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Stitch</span> <span class="hljs-string">together</span> <span class="hljs-string">full</span> <span class="hljs-string">snapshot</span> <span class="hljs-string">name.</span> <span class="hljs-string">Replace</span> <span class="hljs-string">dashes</span> <span class="hljs-string">and</span> <span class="hljs-string">remove</span> <span class="hljs-string">special</span> <span class="hljs-string">characters</span>
      <span class="hljs-attr">set_fact:</span> 
        <span class="hljs-attr">full_snapshot_name:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ disk_name[:-4] }}</span>.<span class="hljs-template-variable">{{ snapshot_to_restore_from | regex_replace('\\-', '_') | regex_replace('\\W', '') }}</span>.img"</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Find</span> <span class="hljs-string">and</span> <span class="hljs-string">set</span> <span class="hljs-string">correct</span> <span class="hljs-string">snapshot</span> <span class="hljs-string">if</span> <span class="hljs-string">file</span> <span class="hljs-string">found</span> <span class="hljs-string">in</span> <span class="hljs-string">snapshot</span> <span class="hljs-string">folder</span>
      <span class="hljs-attr">set_fact:</span> 
        <span class="hljs-attr">found_snapshot:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ full_snapshot_name }}</span>"</span>
      <span class="hljs-attr">when:</span> <span class="hljs-string">full_snapshot_name</span> <span class="hljs-string">in</span> <span class="hljs-string">file_list</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Find</span> <span class="hljs-string">and</span> <span class="hljs-string">set</span> <span class="hljs-string">snapshot</span> <span class="hljs-string">to</span> <span class="hljs-string">latest</span> <span class="hljs-string">if</span> <span class="hljs-string">undefined</span> <span class="hljs-string">or</span> <span class="hljs-string">error</span> <span class="hljs-string">handle</span> <span class="hljs-string">block</span>
      <span class="hljs-attr">block:</span>

        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Sort</span> <span class="hljs-string">found</span> <span class="hljs-string">files</span> <span class="hljs-string">by</span> <span class="hljs-string">modification</span> <span class="hljs-string">time</span> <span class="hljs-string">(newest</span> <span class="hljs-string">first)</span> <span class="hljs-bullet">-</span> <span class="hljs-string">LATEST</span> <span class="hljs-string">Block</span>
          <span class="hljs-attr">set_fact:</span>
            <span class="hljs-attr">sorted_files:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ found_files.files | sort(attribute='mtime', reverse=True) | map(attribute='path') | map('regex_replace','^.*/(.*)$','\\1') | list  }}</span>"</span>

        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Find</span> <span class="hljs-string">and</span> <span class="hljs-string">set</span> <span class="hljs-string">correct</span> <span class="hljs-string">snapshot</span> <span class="hljs-string">for</span> <span class="hljs-string">newest</span> <span class="hljs-string">found</span> <span class="hljs-string">.img</span> <span class="hljs-string">file</span> <span class="hljs-bullet">-</span> <span class="hljs-string">LATEST</span> <span class="hljs-string">Block</span>
          <span class="hljs-attr">set_fact:</span> 
            <span class="hljs-attr">found_snapshot:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ sorted_files | first }}</span>"</span>

      <span class="hljs-attr">when:</span> <span class="hljs-string">found_snapshot</span> <span class="hljs-string">is</span> <span class="hljs-string">undefined</span> <span class="hljs-string">or</span> <span class="hljs-string">found_snapshot</span> <span class="hljs-string">==</span> <span class="hljs-string">None</span>  

    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Ensure</span> <span class="hljs-string">that</span> <span class="hljs-string">the</span> <span class="hljs-string">desired</span> <span class="hljs-string">snapshot</span> <span class="hljs-string">file</span> <span class="hljs-string">was</span> <span class="hljs-string">found</span>
      <span class="hljs-attr">assert:</span>
        <span class="hljs-attr">that:</span>
          <span class="hljs-bullet">-</span> <span class="hljs-string">found_snapshot</span> <span class="hljs-string">is</span> <span class="hljs-string">defined</span> <span class="hljs-string">and</span> <span class="hljs-string">found_snapshot</span> <span class="hljs-type">!=</span> <span class="hljs-string">None</span>
        <span class="hljs-attr">fail_msg:</span> <span class="hljs-string">"The snapshot to restore was not found. May not exist or user date was entered incorrectly."</span>
        <span class="hljs-attr">success_msg:</span> <span class="hljs-string">"Snapshot found! Will begin restore process NOW."</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Transfer</span> <span class="hljs-string">snapshots</span> <span class="hljs-string">to</span> <span class="hljs-string">VM</span> <span class="hljs-string">hypervisor</span> <span class="hljs-string">server</span> <span class="hljs-string">via</span> <span class="hljs-string">rsync</span>
      <span class="hljs-attr">command:</span> <span class="hljs-string">rsync</span> {{ <span class="hljs-string">repository_user</span> }}<span class="hljs-string">@{{</span> <span class="hljs-string">hostvars['diskstation']['ansible_host']</span> <span class="hljs-string">}}:/{{</span> <span class="hljs-string">snapshot_repository_base_directory</span> <span class="hljs-string">}}/{{</span> <span class="hljs-string">vm_name</span> <span class="hljs-string">}}/{{</span> <span class="hljs-string">found_snapshot</span> <span class="hljs-string">}}</span> {{ <span class="hljs-string">found_snapshot</span> }}
      <span class="hljs-attr">args:</span>
        <span class="hljs-attr">chdir:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ source_directory }}</span>/<span class="hljs-template-variable">{{ vm_name }}</span>"</span>
      <span class="hljs-attr">delegate_to:</span> <span class="hljs-string">unraid</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Get</span> <span class="hljs-string">attributes</span> <span class="hljs-string">of</span> <span class="hljs-string">original</span> <span class="hljs-string">stored</span> <span class="hljs-string">snapshot</span> <span class="hljs-string">.img</span> <span class="hljs-string">file</span>
      <span class="hljs-attr">stat:</span>
        <span class="hljs-attr">path:</span> <span class="hljs-string">"/<span class="hljs-template-variable">{{ snapshot_repository_base_directory | regex_replace('\\\\', '')}}</span>/<span class="hljs-template-variable">{{ vm_name }}</span>/<span class="hljs-template-variable">{{ found_snapshot }}</span>"</span>
        <span class="hljs-attr">get_checksum:</span> <span class="hljs-literal">false</span>
      <span class="hljs-attr">register:</span> <span class="hljs-string">file1</span>
      <span class="hljs-attr">delegate_to:</span> <span class="hljs-string">diskstation</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Get</span> <span class="hljs-string">attributes</span> <span class="hljs-string">of</span> <span class="hljs-string">newly</span> <span class="hljs-string">transferred</span> <span class="hljs-string">snapshot</span> <span class="hljs-string">.img</span> <span class="hljs-string">file</span>
      <span class="hljs-attr">stat:</span>
        <span class="hljs-attr">path:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ source_directory }}</span>/<span class="hljs-template-variable">{{ vm_name }}</span>/<span class="hljs-template-variable">{{ found_snapshot }}</span>"</span>
        <span class="hljs-attr">get_checksum:</span> <span class="hljs-literal">false</span>
      <span class="hljs-attr">register:</span> <span class="hljs-string">file2</span>
      <span class="hljs-attr">delegate_to:</span> <span class="hljs-string">unraid</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Ensure</span> <span class="hljs-string">original</span> <span class="hljs-string">and</span> <span class="hljs-string">transferred</span> <span class="hljs-string">file</span> <span class="hljs-string">sizes</span> <span class="hljs-string">are</span> <span class="hljs-string">the</span> <span class="hljs-string">same</span>
      <span class="hljs-attr">assert:</span>
        <span class="hljs-attr">that:</span>
          <span class="hljs-bullet">-</span> <span class="hljs-string">file1.stat.size</span> <span class="hljs-string">==</span> <span class="hljs-string">file2.stat.size</span>
        <span class="hljs-attr">fail_msg:</span> <span class="hljs-string">"Files failed size comparison post transfer. Aborting operation for <span class="hljs-template-variable">{{ inventory_hostname }}</span>"</span>
        <span class="hljs-attr">success_msg:</span> <span class="hljs-string">File</span> <span class="hljs-string">size</span> <span class="hljs-string">comparison</span> <span class="hljs-string">passed.</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Shutdown</span> <span class="hljs-string">VM(s)</span>
      <span class="hljs-attr">include_tasks:</span> <span class="hljs-string">./tasks/shutdown-vm.yml</span>
      <span class="hljs-attr">loop:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ hostvars['unraid']['vms_map'] }}</span>"</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Delete</span> {{ <span class="hljs-string">disk_name</span> }} <span class="hljs-string">for</span> <span class="hljs-string">VM</span> {{ <span class="hljs-string">vm_name</span> }}
      <span class="hljs-attr">ansible.builtin.file:</span>
        <span class="hljs-attr">path:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ source_directory }}</span>/<span class="hljs-template-variable">{{ vm_name }}</span>/<span class="hljs-template-variable">{{ disk_name }}</span>"</span>
        <span class="hljs-attr">state:</span> <span class="hljs-string">absent</span>
      <span class="hljs-attr">delegate_to:</span> <span class="hljs-string">unraid</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Rename</span> <span class="hljs-string">snapshot</span> <span class="hljs-string">to</span> <span class="hljs-string">proper</span> <span class="hljs-string">disk</span> <span class="hljs-string">name</span>
      <span class="hljs-attr">command:</span> <span class="hljs-string">mv</span> {{ <span class="hljs-string">found_snapshot</span> }} {{ <span class="hljs-string">disk_name</span> }}
      <span class="hljs-attr">args:</span>
        <span class="hljs-attr">chdir:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ source_directory }}</span>/<span class="hljs-template-variable">{{ vm_name }}</span>"</span>
      <span class="hljs-attr">delegate_to:</span> <span class="hljs-string">unraid</span>
</code></pre>
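<p>One small simplification worth noting: the final delete-and-rename pair in the play above could be collapsed into a single forced move, since <code>mv -f</code> overwrites its destination (and, on the same filesystem, a rename is atomic). A sketch using the same variables:</p>
<pre><code class="lang-yaml">    - name: Replace disk with restored snapshot in one step (alternative sketch)
      command: mv -f {{ found_snapshot }} {{ disk_name }}
      args:
        chdir: "{{ source_directory }}/{{ vm_name }}"
      delegate_to: unraid
</code></pre>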
<p><strong>Purpose</strong>:<br />This play facilitates the restoration of VM disk snapshots on an Unraid server. It searches for the required snapshot, validates the snapshot file, and transfers it back to the hypervisor for restoration, ensuring the integrity of the restored disk.</p>
<p><strong>Hosts</strong>:<br />Targets the <code>disks</code> host group.</p>
<p><strong>Variables File</strong>:<br />Loads additional variables from <code>./vars/snapshot-restore-vars.yml</code>, most importantly the user's modified <code>snapshot_restore_list</code>.</p>
<p><strong>Tasks</strong>:</p>
<ol>
<li><p><strong>Find Files in the VM Folder Containing the Target VM Disk Name:</strong><br /> Recursively searches the snapshot repository for files that match the target VM disk name (e.g., <code>vdisk1</code>, <code>vdisk2</code>) and stores the results in the <code>found_files</code> variable.</p>
</li>
<li><p><strong>Ensure That Files Were Found:</strong></p>
<p> Verifies that at least one file matching the disk name was found. If none were found, it produces a failure message and the playbook fails for that host disk.</p>
</li>
<li><p><strong>Create a File List From the Target VM Folder with only File Names:</strong></p>
<p> Extracts and stores only the file names from the <code>found_files</code> list.</p>
</li>
<li><p><strong>Stitch Together Full Snapshot Name:</strong></p>
<p> Constructs the full snapshot name by combining the disk name, the user-supplied snapshot name (if provided), and the <code>.img</code> extension. Dashes are replaced with underscores and any other special characters are removed.</p>
</li>
<li><p><strong>Find and Set the Correct Snapshot if File Found in Snapshot Folder:</strong></p>
<p> If the constructed snapshot name is found in the list of files, it sets <code>found_snapshot</code> to this name.</p>
</li>
<li><p><strong>Find and Set Snapshot to Latest if Undefined or Error Handling (Block):</strong></p>
<p> If no specific snapshot is found or defined, this block sorts the found files by modification time (newest first) and sets the snapshot to the latest available one.</p>
</li>
<li><p><strong>Ensure the Desired Snapshot File Was Found:</strong></p>
<p> Confirms that a snapshot was found and is ready for restoration. If not, it fails with an error message and the playbook aborts for that host disk.</p>
</li>
<li><p><strong>Transfer Snapshots to VM Hypervisor Server via rsync:</strong></p>
<p> Uses <code>rsync</code> to transfer the found snapshot from the remote DiskStation to the Unraid server, where the VM is located. Changes into the correct disk directory prior to transfer.</p>
</li>
<li><p><strong>Get Attributes of the Snapshot Files and Compare Size:</strong></p>
<p> The next three tasks retrieve the attributes of the original DiskStation snapshot and of the newly transferred copy on the Unraid server, then compare the two file sizes to confirm the transfer succeeded. The playbook fails for the host disk if the sizes are not equal.</p>
</li>
<li><p><strong>Shutdown VMs</strong>:</p>
<p>Shuts down the VMs in preparation for the restoration process by calling a separate task file (<code>/tasks/shutdown-vm.yml</code>). For more details on the shutdown tasks, refer to the previous post.</p>
</li>
<li><p><strong>Delete the Original Disk for the VM:</strong></p>
<p>Deletes the original disk file for the VM so that the snapshot file can be renamed to the correct disk name.</p>
</li>
<li><p><strong>Rename Snapshot to Proper Disk Name:</strong></p>
<p>Renames the restored snapshot file to match the original disk file name, completing the restoration process.</p>
</li>
</ol>
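<p>To make the name-stitching step above concrete, here is how the <code>full_snapshot_name</code> template evaluates for a sample input (values taken from the test run shown later in this post):</p>
<pre><code class="lang-yaml"># disk_name: vdisk2.img
#   disk_name[:-4]               ->  vdisk2
# snapshot_to_restore_from: test-snapshot
#   | regex_replace('\-', '_')   ->  test_snapshot
#   | regex_replace('\W', '')    ->  test_snapshot  (underscores are word characters, so they survive)
# full_snapshot_name: vdisk2.test_snapshot.img
</code></pre>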
<p><strong>Restart Affected VMs Play</strong></p>
<pre><code class="lang-yaml"><span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Restart</span> <span class="hljs-string">Affected</span> <span class="hljs-string">VMs</span>
  <span class="hljs-attr">hosts:</span> <span class="hljs-string">unraid</span>
  <span class="hljs-attr">gather_facts:</span> <span class="hljs-literal">no</span>
  <span class="hljs-attr">vars_files:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">./vars/snapshot-restore-vars.yml</span>

  <span class="hljs-attr">tasks:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Start</span> <span class="hljs-string">VM(s)</span> <span class="hljs-string">back</span> <span class="hljs-string">up</span>
      <span class="hljs-attr">include_tasks:</span> <span class="hljs-string">./tasks/start-vm.yml</span>
      <span class="hljs-attr">loop:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ snapshot_restore_list }}</span>"</span>
</code></pre>
<p><strong>Purpose</strong>:<br />This play’s only purpose is to start the targeted VMs after the restore process has completed for all disks.</p>
<p><strong>Hosts</strong>:<br />Targets the <code>unraid</code> host.</p>
<p><strong>Variables File</strong>:<br />Loads additional variables from <code>./vars/snapshot-restore-vars.yml</code>, most importantly the user's modified <code>snapshot_restore_list</code>.</p>
<p><strong>Tasks</strong>:</p>
<ol>
<li><strong>Start VM(s) Back Up:</strong><br /> Starts up the VMs once the restoration process has completed by calling a separate task file (<code>/tasks/start-vm.yml</code>). For more details on the startup tasks, refer to the previous post.</li>
</ol>
<p><strong>NOTE:</strong> This was intentionally made a separate play at the end of the playbook to ensure all disk restore operations are completed beforehand. By looping over the VMs using the <code>snapshot_restore_list</code> variable, only one start command per VM is sent, reducing the chance of errors.</p>
<h3 id="heading-restore-from-local-tar-pbyml"><code>restore-from-local-tar-pb.yml</code></h3>
<p>NOTE: This playbook is quite similar to the <code>restore-from-snapshot-pb.yml</code> playbook, but focuses on local restoration using the <code>.tar</code> files. All tasks are executed either on the Ansible host or the Unraid server. In this breakdown, I'll only highlight the key task differences from the previous playbook.</p>
<p><strong>Restore Snapshot Preparation Play</strong></p>
<p>Exactly the same as in the <code>restore-from-snapshot-pb.yml</code> playbook. Nothing new to cover here.</p>
<p><strong>Disk Restore From TAR file Play</strong></p>
<pre><code class="lang-yaml"><span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Find</span> <span class="hljs-string">files</span> <span class="hljs-string">in</span> <span class="hljs-string">the</span> <span class="hljs-string">VM</span> <span class="hljs-string">folder</span> <span class="hljs-string">containing</span> <span class="hljs-string">the</span> <span class="hljs-string">target</span> <span class="hljs-string">VM</span> <span class="hljs-string">disk</span> <span class="hljs-string">name</span>
      <span class="hljs-attr">find:</span>
        <span class="hljs-attr">paths:</span> <span class="hljs-string">"/<span class="hljs-template-variable">{{ source_directory | regex_replace('\\\\', '')}}</span>/<span class="hljs-template-variable">{{ vm_name }}</span>/"</span>
        <span class="hljs-attr">patterns:</span> <span class="hljs-string">"*<span class="hljs-template-variable">{{ disk_name[:-4] }}</span>*"</span>
        <span class="hljs-attr">recurse:</span> <span class="hljs-literal">yes</span>
      <span class="hljs-attr">register:</span> <span class="hljs-string">found_files</span> 
      <span class="hljs-attr">delegate_to:</span> <span class="hljs-string">unraid</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Filter</span> <span class="hljs-string">files</span> <span class="hljs-string">matching</span> <span class="hljs-string">patterns</span> <span class="hljs-string">for</span> <span class="hljs-string">.tar</span> <span class="hljs-string">files</span>
      <span class="hljs-attr">set_fact:</span>
        <span class="hljs-attr">matched_tar_files:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ found_files.files | selectattr('path', 'search', '.*\\.(tar)$') | list }}</span>"</span>
</code></pre>
<p><strong>Tasks</strong>:</p>
<ol>
<li><p><strong>Find Files in the VM Folder Containing the Target VM Disk Name:</strong></p>
<p> Similar to the other playbook. This task searches through the VM's directory to locate any files that match the target disk name, regardless of file type (e.g., .img, .tar).</p>
</li>
<li><p><strong>Filter Files Matching Patterns for .tar Files</strong></p>
<p> After locating files in the previous task, this task filters out only the <code>.tar</code> files from the list of found files. Uses <code>set_fact</code> to store list in variable <code>matched_tar_files</code>.</p>
</li>
</ol>
<p>Everything is the same until the unzip task (below).</p>
<pre><code class="lang-yaml"><span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Unzip</span> <span class="hljs-string">.tar</span> <span class="hljs-string">file</span>
      <span class="hljs-attr">command:</span> <span class="hljs-string">tar</span> <span class="hljs-string">-xf</span> {{ <span class="hljs-string">found_snapshot</span> }}
      <span class="hljs-attr">args:</span>
        <span class="hljs-attr">chdir:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ source_directory }}</span>/<span class="hljs-template-variable">{{ vm_name }}</span>"</span>
      <span class="hljs-attr">delegate_to:</span> <span class="hljs-string">unraid</span>
</code></pre>
<p>Pretty straightforward here. This just extracts the correct snapshot <code>.tar</code> file back to a usable <code>.img</code> file.</p>
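<p>As an aside, the same extraction could be done with the built-in <code>unarchive</code> module rather than shelling out to <code>tar</code>. A minimal sketch, assuming the same variables:</p>
<pre><code class="lang-yaml">    - name: Unzip .tar file (unarchive alternative sketch)
      ansible.builtin.unarchive:
        src: "{{ source_directory }}/{{ vm_name }}/{{ found_snapshot }}"
        dest: "{{ source_directory }}/{{ vm_name }}"
        remote_src: yes
      delegate_to: unraid
</code></pre>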
<p>The remaining tasks follow the same process as the <code>restore-from-snapshot-pb.yml</code> playbook. They gather the attributes of both the original and newly unzipped files, verify that their sizes match, shut down the required VMs, delete the original disk file, rename the snapshot to the appropriate disk name, and finally, restart the VMs.</p>
<h3 id="heading-restoring-in-action-running-the-playbook"><strong>Restoring in Action (Running the Playbook)</strong></h3>
<p>Like the create playbook in the previous post, these playbooks are very simple to run. Run them from the root of the playbook directory:</p>
<pre><code class="lang-yaml"><span class="hljs-string">ansible-playbook</span> <span class="hljs-string">restore-from-snapshot-pb.yml</span> <span class="hljs-string">-i</span> <span class="hljs-string">defaults/inventory.yml</span>
</code></pre>
<pre><code class="lang-yaml"><span class="hljs-string">ansible-playbook</span> <span class="hljs-string">restore-from-local-tar-pb.yml</span> <span class="hljs-string">-i</span> <span class="hljs-string">defaults/inventory.yml</span>
</code></pre>
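<p>If you want a quick sanity pass before touching any disks, <code>ansible-playbook</code> supports <code>--syntax-check</code>, which parses the playbook without executing anything (full <code>--check</code> mode is less useful here, since the <code>command</code>-based tasks are skipped in check mode):</p>
<pre><code class="lang-yaml">ansible-playbook restore-from-snapshot-pb.yml -i defaults/inventory.yml --syntax-check
</code></pre>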
<p>Below are the results of successful playbook runs, tested using a single 2GB disk for both local and remote restores. One run uses a static snapshot name, while the other demonstrates the process of finding the 'latest' snapshot when the name is not defined.</p>
<p><strong>Restore from snapshot w/ finding the latest (omitting python version warnings):</strong></p>
<pre><code class="lang-typescript">PLAY [Restore Snapshot Preparation] ******************************************************************************************************************

TASK [Retrieve List <span class="hljs-keyword">of</span> All Existing VMs on UnRAID Hypervisor] ****************************************************************************************
changed: [unraid]

TASK [Generate VM and Disk Lists <span class="hljs-keyword">for</span> Validated VMs <span class="hljs-keyword">in</span> User Inputted Data] ****************************************************************************
ok: [unraid] =&gt; (item={<span class="hljs-string">'vm_name'</span>: <span class="hljs-string">'Rocky9-TESTNode'</span>, <span class="hljs-string">'disks_to_restore'</span>: [{<span class="hljs-string">'vm_disk_to_restore'</span>: <span class="hljs-string">'vdisk2.img'</span>, <span class="hljs-string">'vm_disk_directory'</span>: <span class="hljs-string">'/mnt/disk1/domains'</span>}]})

TASK [Build Data Structure <span class="hljs-keyword">for</span> Snapshot Restoration] *************************************************************************************************
ok: [unraid]

TASK [Verify Snapshot Data is Available <span class="hljs-keyword">for</span> Restoration] *********************************************************************************************
ok: [unraid] =&gt; {
    <span class="hljs-string">"changed"</span>: <span class="hljs-literal">false</span>,
    <span class="hljs-string">"msg"</span>: <span class="hljs-string">"All assertions passed"</span>
}

TASK [Dynamically Create Host Group <span class="hljs-keyword">for</span> Disks to be Restored] ****************************************************************************************
changed: [unraid] =&gt; (item=[{<span class="hljs-string">'vm_name'</span>: <span class="hljs-string">'Rocky9-TESTNode'</span>, <span class="hljs-string">'disks_to_restore'</span>: [{<span class="hljs-string">'vm_disk_to_restore'</span>: <span class="hljs-string">'vdisk2.img'</span>, <span class="hljs-string">'vm_disk_directory'</span>: <span class="hljs-string">'/mnt/disk1/domains'</span>}]}, {<span class="hljs-string">'vm_disk_to_restore'</span>: <span class="hljs-string">'vdisk2.img'</span>, <span class="hljs-string">'vm_disk_directory'</span>: <span class="hljs-string">'/mnt/disk1/domains'</span>}])

PLAY [Disk Restore From Snapshot] ********************************************************************************************************************

TASK [Find files <span class="hljs-keyword">in</span> the VM folder containing the target VM disk name] ********************************************************************************
ok: [Rocky9-TESTNode-vdisk2 -&gt; diskstation({{ lookup(<span class="hljs-string">'env'</span>, <span class="hljs-string">'DISKSTATION_IP_ADDRESS'</span>) }})]

TASK [Ensure that files were found] ******************************************************************************************************************
ok: [Rocky9-TESTNode-vdisk2] =&gt; {
    <span class="hljs-string">"changed"</span>: <span class="hljs-literal">false</span>,
    <span class="hljs-string">"msg"</span>: <span class="hljs-string">"All assertions passed"</span>
}

TASK [Create a file list <span class="hljs-keyword">from</span> the target VM folder <span class="hljs-keyword">with</span> only file names] *****************************************************************************
ok: [Rocky9-TESTNode-vdisk2]

TASK [Stitch together full snapshot name. Replace dashes and remove special characters] **************************************************************
ok: [Rocky9-TESTNode-vdisk2]

TASK [Find and set correct snapshot <span class="hljs-keyword">if</span> file found <span class="hljs-keyword">in</span> snapshot folder] ********************************************************************************
skipping: [Rocky9-TESTNode-vdisk2]

TASK [Sort found files by modification time (newest first) - LATEST Block] ***************************************************************************
ok: [Rocky9-TESTNode-vdisk2]

TASK [Find and set correct snapshot <span class="hljs-keyword">for</span> newest found .img file - LATEST Block] ***********************************************************************
ok: [Rocky9-TESTNode-vdisk2]

TASK [Ensure that the desired snapshot file was found] ***********************************************************************************************
ok: [Rocky9-TESTNode-vdisk2] =&gt; {
    <span class="hljs-string">"changed"</span>: <span class="hljs-literal">false</span>,
    <span class="hljs-string">"msg"</span>: <span class="hljs-string">"Snapshot found! Will begin restore process NOW."</span>
}

TASK [Transfer snapshots to VM hypervisor server via rsync] ******************************************************************************************
changed: [Rocky9-TESTNode-vdisk2 -&gt; unraid({{ lookup(<span class="hljs-string">'env'</span>, <span class="hljs-string">'UNRAID_IP_ADDRESS'</span>) }})]

TASK [Get attributes <span class="hljs-keyword">of</span> original stored snapshot .img file] ******************************************************************************************
ok: [Rocky9-TESTNode-vdisk2 -&gt; diskstation({{ lookup(<span class="hljs-string">'env'</span>, <span class="hljs-string">'DISKSTATION_IP_ADDRESS'</span>) }})]

TASK [Get attributes <span class="hljs-keyword">of</span> newly transferred snapshot .img file] ****************************************************************************************
ok: [Rocky9-TESTNode-vdisk2 -&gt; unraid({{ lookup(<span class="hljs-string">'env'</span>, <span class="hljs-string">'UNRAID_IP_ADDRESS'</span>) }})]

TASK [Ensure original and transferred file sizes are the same] ****************************************************************************************
ok: [Rocky9-TESTNode-vdisk2] =&gt; {
    <span class="hljs-string">"changed"</span>: <span class="hljs-literal">false</span>,
    <span class="hljs-string">"msg"</span>: <span class="hljs-string">"File size comparison passed."</span>
}

TASK [Shutdown VM(s)] ********************************************************************************************************************************
included: <span class="hljs-regexp">/mnt/</span>c/Dev/Git/unraid-vm-snapshots/tasks/shutdown-vm.yml <span class="hljs-keyword">for</span> Rocky9-TESTNode-<span class="hljs-function"><span class="hljs-params">vdisk2</span> =&gt;</span> (item=Rocky9-TESTNode)

TASK [Shutdown VM - Rocky9-TESTNode] *****************************************************************************************************************
changed: [Rocky9-TESTNode-vdisk2 -&gt; unraid({{ lookup(<span class="hljs-string">'env'</span>, <span class="hljs-string">'UNRAID_IP_ADDRESS'</span>) }})]

TASK [Get VM status - Rocky9-TESTNode] ***************************************************************************************************************
FAILED - RETRYING: [Rocky9-TESTNode-vdisk2 -&gt; unraid]: Get VM status - Rocky9-TESTNode (<span class="hljs-number">5</span> retries left).
changed: [Rocky9-TESTNode-vdisk2 -&gt; unraid({{ lookup(<span class="hljs-string">'env'</span>, <span class="hljs-string">'UNRAID_IP_ADDRESS'</span>) }})]

TASK [Delete vdisk2.img <span class="hljs-keyword">for</span> VM Rocky9-TESTNode] ******************************************************************************************************
changed: [Rocky9-TESTNode-vdisk2 -&gt; unraid({{ lookup(<span class="hljs-string">'env'</span>, <span class="hljs-string">'UNRAID_IP_ADDRESS'</span>) }})]

TASK [Rename snapshot to proper disk name] ***********************************************************************************************************
changed: [Rocky9-TESTNode-vdisk2 -&gt; unraid({{ lookup(<span class="hljs-string">'env'</span>, <span class="hljs-string">'UNRAID_IP_ADDRESS'</span>) }})]

PLAY [Restart Affected VMs] **************************************************************************************************************************

TASK [Start VM(s) back up] ***************************************************************************************************************************
included: <span class="hljs-regexp">/mnt/</span>c/Dev/Git/unraid-vm-snapshots/tasks/start-vm.yml <span class="hljs-keyword">for</span> unraid =&gt; (item={<span class="hljs-string">'vm_name'</span>: <span class="hljs-string">'Rocky9-TESTNode'</span>, <span class="hljs-string">'disks_to_restore'</span>: [{<span class="hljs-string">'vm_disk_to_restore'</span>: <span class="hljs-string">'vdisk2.img'</span>, <span class="hljs-string">'vm_disk_directory'</span>: <span class="hljs-string">'/mnt/disk1/domains'</span>}]})

TASK [Start VM - Rocky9-TESTNode] ********************************************************************************************************************
changed: [unraid]

TASK [Get VM status - Rocky9-TESTNode] ***************************************************************************************************************
changed: [unraid]

TASK [Ensure VM <span class="hljs-string">'running'</span> status] ********************************************************************************************************************
ok: [unraid] =&gt; {
    <span class="hljs-string">"changed"</span>: <span class="hljs-literal">false</span>,
    <span class="hljs-string">"msg"</span>: <span class="hljs-string">"Rocky9-TESTNode has successfully started. Restore from snapshot complete."</span>
}

PLAY RECAP *******************************************************************************************************************************************
Rocky9-TESTNode-vdisk2     : ok=<span class="hljs-number">16</span>   changed=<span class="hljs-number">5</span>    unreachable=<span class="hljs-number">0</span>    failed=<span class="hljs-number">0</span>    skipped=<span class="hljs-number">1</span>    rescued=<span class="hljs-number">0</span>    ignored=<span class="hljs-number">0</span>   
unraid                     : ok=<span class="hljs-number">9</span>    changed=<span class="hljs-number">4</span>    unreachable=<span class="hljs-number">0</span>    failed=<span class="hljs-number">0</span>    skipped=<span class="hljs-number">0</span>    rescued=<span class="hljs-number">0</span>    ignored=<span class="hljs-number">0</span>
</code></pre>
<p><strong>Restore from local .tar using defined snapshot name (omitting python version warnings):</strong></p>
<pre><code class="lang-typescript">PLAY [Restore Snapshot Preparation] ******************************************************************************************************************

TASK [Retrieve List <span class="hljs-keyword">of</span> All Existing VMs on UnRAID Hypervisor] ****************************************************************************************
changed: [unraid]

TASK [Generate VM and Disk Lists <span class="hljs-keyword">for</span> Validated VMs <span class="hljs-keyword">in</span> User Inputted Data] ****************************************************************************
ok: [unraid] =&gt; (item={<span class="hljs-string">'vm_name'</span>: <span class="hljs-string">'Rocky9-TESTNode'</span>, <span class="hljs-string">'disks_to_restore'</span>: [{<span class="hljs-string">'vm_disk_to_restore'</span>: <span class="hljs-string">'vdisk2.img'</span>, <span class="hljs-string">'vm_disk_directory'</span>: <span class="hljs-string">'/mnt/disk1/domains'</span>, <span class="hljs-string">'snapshot_to_restore_from'</span>: <span class="hljs-string">'test-snapshot'</span>}]})

TASK [Build Data Structure <span class="hljs-keyword">for</span> Snapshot Restoration] *************************************************************************************************
ok: [unraid]

TASK [Verify Snapshot Data is Available <span class="hljs-keyword">for</span> Restoration] *********************************************************************************************
ok: [unraid] =&gt; {
    <span class="hljs-string">"changed"</span>: <span class="hljs-literal">false</span>,
    <span class="hljs-string">"msg"</span>: <span class="hljs-string">"All assertions passed"</span>
}

TASK [Dynamically Create Host Group <span class="hljs-keyword">for</span> Disks to be Restored] ****************************************************************************************
changed: [unraid] =&gt; (item=[{<span class="hljs-string">'vm_name'</span>: <span class="hljs-string">'Rocky9-TESTNode'</span>, <span class="hljs-string">'disks_to_restore'</span>: [{<span class="hljs-string">'vm_disk_to_restore'</span>: <span class="hljs-string">'vdisk2.img'</span>, <span class="hljs-string">'vm_disk_directory'</span>: <span class="hljs-string">'/mnt/disk1/domains'</span>, <span class="hljs-string">'snapshot_to_restore_from'</span>: <span class="hljs-string">'test-snapshot'</span>}]}, {<span class="hljs-string">'vm_disk_to_restore'</span>: <span class="hljs-string">'vdisk2.img'</span>, <span class="hljs-string">'vm_disk_directory'</span>: <span class="hljs-string">'/mnt/disk1/domains'</span>, <span class="hljs-string">'snapshot_to_restore_from'</span>: <span class="hljs-string">'test-snapshot'</span>}])

PLAY [Disk Restore From TAR file] ********************************************************************************************************************

TASK [Find files <span class="hljs-keyword">in</span> the VM folder containing the target VM disk name] ********************************************************************************
ok: [Rocky9-TESTNode-vdisk2 -&gt; unraid({{ lookup(<span class="hljs-string">'env'</span>, <span class="hljs-string">'UNRAID_IP_ADDRESS'</span>) }})]

TASK [Filter files matching patterns <span class="hljs-keyword">for</span> .tar files] *************************************************************************************************
ok: [Rocky9-TESTNode-vdisk2]

TASK [Ensure that files were found] ******************************************************************************************************************
ok: [Rocky9-TESTNode-vdisk2] =&gt; {
    <span class="hljs-string">"changed"</span>: <span class="hljs-literal">false</span>,
    <span class="hljs-string">"msg"</span>: <span class="hljs-string">"All assertions passed"</span>
}

TASK [Create a file list <span class="hljs-keyword">from</span> the target VM folder <span class="hljs-keyword">with</span> only file names] *****************************************************************************
ok: [Rocky9-TESTNode-vdisk2]

TASK [Stitch together full snapshot name. Replace dashes and remove special characters] **************************************************************
ok: [Rocky9-TESTNode-vdisk2]

TASK [Find and set correct snapshot <span class="hljs-keyword">if</span> file found <span class="hljs-keyword">in</span> snapshot folder] ********************************************************************************
skipping: [Rocky9-TESTNode-vdisk2]

TASK [Sort found files by modification time (newest first) - LATEST Block] ***************************************************************************
ok: [Rocky9-TESTNode-vdisk2]

TASK [Find and set correct snapshot <span class="hljs-keyword">for</span> newest found .img file - LATEST Block] ***********************************************************************
ok: [Rocky9-TESTNode-vdisk2]

TASK [Ensure that the desired snapshot file was found] ***********************************************************************************************
ok: [Rocky9-TESTNode-vdisk2] =&gt; {
    <span class="hljs-string">"changed"</span>: <span class="hljs-literal">false</span>,
    <span class="hljs-string">"msg"</span>: <span class="hljs-string">"Snapshot found! Will begin restore process NOW."</span>
}

TASK [Unzip .tar file] *******************************************************************************************************************************
changed: [Rocky9-TESTNode-vdisk2 -&gt; unraid({{ lookup(<span class="hljs-string">'env'</span>, <span class="hljs-string">'UNRAID_IP_ADDRESS'</span>) }})]

TASK [Get attributes <span class="hljs-keyword">of</span> unzipped .img file] **********************************************************************************************************
ok: [Rocky9-TESTNode-vdisk2 -&gt; unraid({{ lookup(<span class="hljs-string">'env'</span>, <span class="hljs-string">'UNRAID_IP_ADDRESS'</span>) }})]

TASK [Get attributes <span class="hljs-keyword">of</span> original disk .img file] *****************************************************************************************************
ok: [Rocky9-TESTNode-vdisk2 -&gt; unraid({{ lookup(<span class="hljs-string">'env'</span>, <span class="hljs-string">'UNRAID_IP_ADDRESS'</span>) }})]

TASK [Ensure original and unzipped .img file sizes are the same] *************************************************************************************
ok: [Rocky9-TESTNode-vdisk2] =&gt; {
    <span class="hljs-string">"changed"</span>: <span class="hljs-literal">false</span>,
    <span class="hljs-string">"msg"</span>: <span class="hljs-string">"File size comparison passed."</span>
}

TASK [Shutdown VM(s)] ********************************************************************************************************************************
included: <span class="hljs-regexp">/mnt/</span>c/Dev/Git/unraid-vm-snapshots/tasks/shutdown-vm.yml <span class="hljs-keyword">for</span> Rocky9-TESTNode-<span class="hljs-function"><span class="hljs-params">vdisk2</span> =&gt;</span> (item=Rocky9-TESTNode)

TASK [Shutdown VM - Rocky9-TESTNode] *****************************************************************************************************************
changed: [Rocky9-TESTNode-vdisk2 -&gt; unraid({{ lookup(<span class="hljs-string">'env'</span>, <span class="hljs-string">'UNRAID_IP_ADDRESS'</span>) }})]

TASK [Get VM status - Rocky9-TESTNode] ***************************************************************************************************************
FAILED - RETRYING: [Rocky9-TESTNode-vdisk2 -&gt; unraid]: Get VM status - Rocky9-TESTNode (<span class="hljs-number">5</span> retries left).
FAILED - RETRYING: [Rocky9-TESTNode-vdisk2 -&gt; unraid]: Get VM status - Rocky9-TESTNode (<span class="hljs-number">4</span> retries left).
changed: [Rocky9-TESTNode-vdisk2 -&gt; unraid({{ lookup(<span class="hljs-string">'env'</span>, <span class="hljs-string">'UNRAID_IP_ADDRESS'</span>) }})]

TASK [Delete vdisk2.img <span class="hljs-keyword">for</span> VM Rocky9-TESTNode] ******************************************************************************************************
changed: [Rocky9-TESTNode-vdisk2 -&gt; unraid({{ lookup(<span class="hljs-string">'env'</span>, <span class="hljs-string">'UNRAID_IP_ADDRESS'</span>) }})]

TASK [Rename unzipped snapshot to proper disk name] **************************************************************************************************
changed: [Rocky9-TESTNode-vdisk2 -&gt; unraid({{ lookup(<span class="hljs-string">'env'</span>, <span class="hljs-string">'UNRAID_IP_ADDRESS'</span>) }})]

PLAY [Restart Affected VMs] **************************************************************************************************************************

TASK [Start VM(s) back up] ***************************************************************************************************************************
included: <span class="hljs-regexp">/mnt/</span>c/Dev/Git/unraid-vm-snapshots/tasks/start-vm.yml <span class="hljs-keyword">for</span> unraid =&gt; (item={<span class="hljs-string">'vm_name'</span>: <span class="hljs-string">'Rocky9-TESTNode'</span>, <span class="hljs-string">'disks_to_restore'</span>: [{<span class="hljs-string">'vm_disk_to_restore'</span>: <span class="hljs-string">'vdisk2.img'</span>, <span class="hljs-string">'vm_disk_directory'</span>: <span class="hljs-string">'/mnt/disk1/domains'</span>, <span class="hljs-string">'snapshot_to_restore_from'</span>: <span class="hljs-string">'test-snapshot'</span>}]})

TASK [Start VM - Rocky9-TESTNode] ********************************************************************************************************************
changed: [unraid]

TASK [Get VM status - Rocky9-TESTNode] ***************************************************************************************************************
changed: [unraid]

TASK [Ensure VM <span class="hljs-string">'running'</span> status] ********************************************************************************************************************
ok: [unraid] =&gt; {
    <span class="hljs-string">"changed"</span>: <span class="hljs-literal">false</span>,
    <span class="hljs-string">"msg"</span>: <span class="hljs-string">"Rocky9-TESTNode has successfully started. Restore from snapshot complete."</span>
}

PLAY RECAP *******************************************************************************************************************************************
Rocky9-TESTNode-vdisk2     : ok=<span class="hljs-number">17</span>   changed=<span class="hljs-number">5</span>    unreachable=<span class="hljs-number">0</span>    failed=<span class="hljs-number">0</span>    skipped=<span class="hljs-number">1</span>    rescued=<span class="hljs-number">0</span>    ignored=<span class="hljs-number">0</span>   
unraid                     : ok=<span class="hljs-number">9</span>    changed=<span class="hljs-number">4</span>    unreachable=<span class="hljs-number">0</span>    failed=<span class="hljs-number">0</span>    skipped=<span class="hljs-number">0</span>    rescued=<span class="hljs-number">0</span>    ignored=<span class="hljs-number">0</span>
</code></pre>
<h2 id="heading-closing-thoughts"><strong>Closing Thoughts</strong></h2>
<p><img src="https://media1.tenor.com/m/QpbU3Jf0HL0AAAAd/happy-gilmore-my-fingers-hurt.gif" alt="an elderly woman is sitting in a hospital bed with her fingers hurt and says `` my fingers hurt '' ." class="image--center mx-auto" /></p>
<p>Aside from the fingers-hurting situation, this was another enjoyable mini-project. With both snapshot creation and restoration now fully functional, it’s going to be incredibly useful. It will save a ton of time on larger projects I have planned, eliminating the need to manually roll back configurations.</p>
<h2 id="heading-whats-next"><strong>What’s next?</strong></h2>
<p>I have one more piece planned for this series: cleaning up old snapshots on your storage, whether the local <code>.tar</code> files or the <code>.img</code> files on a remote repo (DiskStation).</p>
<p>Some thoughts and drafts I have for future posts include Kubernetes, Containerlab, network automation testing, Nautobot, and a few more. We’ll see!</p>
<p>You can find the code that goes along with this post <a target="_blank" href="https://github.com/leothelyon17/unraid-vm-snapshots">here</a> (GitHub).</p>
<p>Thoughts, questions, and comments are appreciated. Please follow me here on Hashnode or connect with me on <a target="_blank" href="https://www.linkedin.com/in/jeffrey-m-lyon/">LinkedIn</a>.</p>
<p>Thank you for reading fellow techies!</p>
]]></content:encoded></item><item><title><![CDATA[Unraid VM Snapshot Automation with Ansible: Part 1 - Creating Snapshots]]></title><description><![CDATA[Intro
Hello! Welcome to my very first blog post EVER!
In this series, I’ll dive into how you can leverage Ansible to automate snapshot creation and restoration in Unraid, helping to streamline your backup and recovery processes. Whether you’re new to...]]></description><link>https://blog.nerdylyonsden.io/unraid-vm-snapshot-automation-with-ansible-part-1</link><guid isPermaLink="true">https://blog.nerdylyonsden.io/unraid-vm-snapshot-automation-with-ansible-part-1</guid><category><![CDATA[unraid]]></category><category><![CDATA[ansible]]></category><category><![CDATA[automation]]></category><category><![CDATA[snapshot]]></category><dc:creator><![CDATA[Jeffrey Lyon]]></dc:creator><pubDate>Mon, 09 Sep 2024 22:57:36 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1724988764761/f44e4cc6-c222-48d1-b382-591ecdd6fa86.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-intro"><strong>Intro</strong></h2>
<p>Hello! Welcome to my very first blog post EVER!</p>
<p>In this series, I’ll dive into how you can leverage Ansible to automate snapshot creation and restoration in Unraid, helping to streamline your backup and recovery processes. Whether you’re new to Unraid or looking for ways to optimize your existing setup, this post will provide some insight, starting with what I did to create snapshots when no official solution is provided (that I know of...). Recovery using our created snapshots will come in the next post.</p>
<p>This is a warm-up series/post to help me start my blogging journey. It's mainly aimed at the home lab community, but this post, or parts of it, can definitely be useful for other scenarios and various Linux-based systems as well.</p>
<h2 id="heading-scenario">Scenario</h2>
<p>Unraid is a great platform for managing storage, virtualization, and Docker containers, but it doesn't have built-in support for taking snapshots of virtual machines (VMs). Snapshots are important because they let you save the state of a VM disk at a specific time, so you can easily restore the disk if something goes wrong, like errors, updates, or failures. Without this feature, users who depend on VMs for important tasks or development need to find other ways or use third-party tools to handle snapshots. This makes automating backup and recovery harder, especially in setups where snapshots are key for keeping the system stable and protecting data.</p>
<p>I will be using an Ubuntu Ansible host, my Unraid server as the snapshot source, and my Synology DiskStation as the remote storage destination for backing up the snapshots. Unraid will transfer these snapshots using rsync. Local snapshot creation as TAR files, which allows for faster restores, will also be covered.</p>
<ul>
<li><p>Ansible host (Ubuntu 24.04)</p>
</li>
<li><p>Unraid server (v6.12) - Runs custom Linux OS based on Slackware Linux</p>
</li>
<li><p>Synology DiskStation (v7.1) - Runs custom Linux OS - Synology DiskStation Manager (DSM)</p>
</li>
</ul>
<p>These systems will be communicating over the same 192.168.x.x MGMT network.</p>
<p><strong>NOTE:</strong> Throughout this post (and in future related posts), I’ll refer to the DiskStation as the "destination" or "NAS" device. I’m keeping these terms generic to accommodate those who might be following along with different system setups, ensuring the concepts apply broadly across various environments. I also won't be going into much detail on specific Ansible modules, structured data, or Jinja2 templating syntax. There are plenty of great resources/documentation out there to cover that.</p>
<h2 id="heading-requirements"><strong>Requirements</strong></h2>
<p>I will include required packages, configuration, and setup for the systems involved in this automation.</p>
<h3 id="heading-ansible-host">Ansible host</h3>
<p>You will need the following:</p>
<ul>
<li><p>Python (3.10 or greater suggested)</p>
</li>
<li><p>Ansible core</p>
<pre><code class="lang-bash">  sudo apt install -y ansible-core python3
</code></pre>
</li>
<li><p>Modify your ansible.cfg file to ignore host_key_checking. Usually located in /etc/ansible/</p>
<pre><code class="lang-ini">  <span class="hljs-section">[defaults]</span>
  <span class="hljs-attr">host_key_checking</span> = <span class="hljs-literal">False</span>
</code></pre>
</li>
</ul>
<p><strong>NOTE</strong>: If you're unsure where to find your ansible.cfg, just run <code>ansible --version</code> as shown below:</p>
<pre><code class="lang-bash">ansible --version

ansible [core 2.16.3]
  config file = /etc/ansible/ansible.cfg
</code></pre>
<h3 id="heading-unraid-server">Unraid server</h3>
<p>This setup is not terrible, but it's not as flexible:</p>
<p><strong>NOTE:</strong> All commands in Unraid I'm running as user 'root'. Not the most secure, yes, but easiest for now.</p>
<ul>
<li><p>Python (only supports version 3.8) - Needs to be installed from the 'Nerd Tools' plugin and enabled in GUI.</p>
</li>
<li><p>Nerd Tools plugin</p>
<p>  <strong>To Install</strong> - In GUI click on APPs -&gt; Search for 'nerd tools' -&gt; Click <s>'Actions'</s> 'Install'.</p>
<p>  <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1725922028194/f88d367a-26a8-4be0-9b23-1ed7dc88ed7f.png" alt class="image--center mx-auto" /></p>
<p>  Once installed click on 'Settings' -&gt; Scroll down until you see 'Nerd Tools' and click on it</p>
<p>  <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1725922100995/928751a3-4279-4618-b4f3-b3daf023f909.png" alt class="image--center mx-auto" /></p>
<p>  Once it loads find the Python3 option and flip it to 'On' -&gt; Scroll down to the bottom of that page and click 'Apply'.</p>
<p>  <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1725922211068/698a2c66-c37b-4e0c-ad20-924cfd9502e9.png" alt class="image--center mx-auto" /></p>
<p>  <strong>NOTE</strong>: Will install 'pip' and 'python-setuptools' automatically as well.</p>
</li>
<li><p>rsync (enabled by default)</p>
</li>
</ul>
<h3 id="heading-synology-diskstation">Synology DiskStation</h3>
<p>A few things are needed here. You can't really install packages from the CLI; everything is pulled down from the Package Center:</p>
<ul>
<li><p>Python (minimum version 3.8 - higher versions can be downloaded from the Package Center)</p>
<p>  NOTE: This isn't necessary for the automation covered in this post. Will be necessary for future posts when Ansible actually has to connect directly.</p>
</li>
<li><p>Enable SSH</p>
<p>  <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1725921849793/527781cb-96d4-4dd8-9ced-3cb8115db661.png" alt class="image--center mx-auto" /></p>
</li>
<li><p>Enable rsync service</p>
<p>  <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1725921926963/bb1f7ebd-5ef8-404c-b3f8-85a0d5dc22ef.png" alt class="image--center mx-auto" /></p>
</li>
</ul>
<h2 id="heading-lets-automate"><strong>Let's Automate!</strong></h2>
<p><strong>...but first some more boring setup</strong></p>
<p>Most of the automation will be executed directly on the Unraid host. This means we need to configure proper Ansible credentials for both the Ansible host and Unraid to authenticate when connecting remotely to the DiskStation. Using rsync—particularly with Ansible's module—can be quite troublesome when setting up remote-to-remote authentication. To simplify this process, I'll be using SSH key-based authentication, enabling passwordless login and making remote connectivity much smoother.</p>
<p>As a prerequisite to this, I have already set up a user 'unraid' on the DiskStation system. It is allowed to SSH into the DiskStation and has read/write access to the Backup folder I created earlier.</p>
<p><strong>To configure SSH key-based authentication (on Unraid server)</strong></p>
<ol>
<li><p>Generate SSH Key</p>
<p> <code>unraid# ssh-keygen</code></p>
<p> Follow the prompts. Name the key pair something descriptive if you wish. Don't bother creating a passphrase for it. Since I was doing this as the 'root' user, it dropped the new public and private key files in '/root/.ssh/'.</p>
</li>
<li><p>Copy SSH Key to DiskStation system.</p>
<p> <code>unraid# ssh-copy-id unraid@&lt;diskstation_ip&gt;</code></p>
<p> You will be prompted for the 'unraid' user's SSH password. If successful, you should see something similar to the output below.</p>
<p> <code>Number of key(s) added: 1</code></p>
<p> <code>Now try logging into the machine, with: "ssh 'unraid@&lt;diskstation_ip&gt;'" and check to make sure that only the key(s) you wanted were added.</code></p>
</li>
<li><p>Check the destination directory for the key files. They should be in the <code>.ssh</code> folder of the user's (i.e. 'unraid') home directory.</p>
</li>
<li><p>Test using the above-mentioned command. In my case -</p>
<p> <code>ssh unraid@&lt;diskstation_ip&gt;</code></p>
</li>
</ol>
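Step 1 can be done non-interactively, which is handy when scripting. This is a sketch against a scratch directory; the key type, file name, and path are examples, and the `ssh-copy-id` step in the comment still has to run on the Unraid server against your actual NAS:

```shell
# Generate a passphrase-less key pair in a scratch directory (example paths/names).
KEY_DIR=$(mktemp -d)
ssh-keygen -t rsa -b 4096 -N "" -f "$KEY_DIR/unraid_to_nas" -q
ls -1 "$KEY_DIR"   # unraid_to_nas (private), unraid_to_nas.pub (public)
# Step 2 would then be (run on Unraid, NAS must be reachable):
#   ssh-copy-id -i "$KEY_DIR/unraid_to_nas.pub" unraid@<diskstation_ip>
```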
<p><strong>NOTE:</strong> A few gotchas I'd like to share -</p>
<ul>
<li><p><em>Destination NAS device (DiskStation) still asking for password</em></p>
<p>  Solved by modifying the .ssh folder rights on both the Unraid and destination NAS (DiskStation) devices as follows -</p>
<p>  <code>chmod g-w /&lt;absolute path&gt;/.ssh/</code></p>
<p>  <code>chmod o-wx /&lt;absolute path&gt;/.ssh/</code></p>
</li>
<li><p><em>Errors for modifying the 'ssh known_hosts file</em></p>
<p>  <code>hostfile_replace_entries: link /root/.ssh/known_hosts to /root/.ssh/known_hosts.old: Operation not permitted</code></p>
<p>  <code>update_known_hosts: hostfile_replace_entries failed for /root/.ssh/known_hosts: Operation not permitted</code></p>
<p>  Solved by running an ssh-keyscan from Unraid to destination NAS -</p>
<p>  <code>unraid# ssh-keyscan -H &lt;diskstation_ip&gt; &gt;&gt; ~/.ssh/known_hosts</code></p>
</li>
</ul>
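The permission fix from the first gotcha can be verified safely on a throwaway directory before touching the real <code>/root/.ssh/</code>:

```shell
# Reproduce the fix on a scratch ".ssh" directory.
SSH_DIR="$(mktemp -d)/.ssh"
mkdir -p "$SSH_DIR"
chmod g-w "$SSH_DIR"    # strip group write
chmod o-wx "$SSH_DIR"   # strip other write/execute
stat -c '%A' "$SSH_DIR" # e.g. drwxr-xr-- : no group/other write remains
```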
<h3 id="heading-overview-and-breakdown"><strong>Overview and Breakdown</strong></h3>
<p>Let's start by discussing the playbook directory structure. It looks like this:</p>
<pre><code class="lang-bash">
├── README.md
├── create-snapshot-pb.yml
├── defaults
│   └── inventory.yml
├── files
│   ├── backup-playbook-old.yml
│   └── snapshot-creation-unused.yml
├── handlers
├── meta
├── restore-from-local-tar-pb.yml
├── restore-from-snapshot-pb.yml
├── tasks
│   ├── shutdown-vm.yml
│   └── start-vm.yml
├── templates
├── tests
│   ├── debug-tests-pb.yml
│   └── simple-debugs.yml
└── vars
    ├── snapshot-creation-vars.yml
    └── snapshot-restore-vars.yml
</code></pre>
<p>I copied the standard Ansible role directory structure in case I wanted to publish it as a role in the future. Let's go over the breakdown:</p>
<ul>
<li><p><code>defaults/inventory.yml</code> The main static inventory. Consists of the unraid and diskstation hosts with their Ansible connection variables and SSH credentials.</p>
</li>
<li><p><code>vars/snapshot-creation-vars.yml</code> This file is where users define the list of VMs and their associated disks for snapshot creation. It's mainly a dictionary specifying the targeted VMs and their disks to be snapshotted. Additionally, it includes a few variables related to the connection with the destination NAS device.</p>
</li>
<li><p><code>tasks/shutdown-vm.yml</code> Consists of tasks used to gracefully shut down targeted VMs and poll until shutdown status is confirmed.</p>
</li>
<li><p><code>tasks/start-vm.yml</code> Consists of tasks used to start up targeted VMs, poll their status, and assert they are running before moving on.</p>
</li>
<li><p><code>create-snapshot-pb.yml</code> The main playbook we are covering in this post. Consists of two plays. The first play has two purposes: to perform checks on the targeted Unraid VMs/disks and to build additional data structures/dynamic hosts. The second play then creates the snapshots and pushes them to the destination.</p>
</li>
<li><p>Tests and Files folders - <code>files/</code> consists of unused files/tasks I used to create and test the main playbook; <code>tests/</code> contains some simple debug tasks I could quickly copy and paste in to get output from playbook execution.</p>
</li>
<li><p><code>restore-from-local-tar-pb.yml, restore-from-snapshot-pb.yml, and snapshot-restore-vars.yml</code> These are files related to restoring the disks once the snapshots are created. They will be covered in the next article of this series.</p>
</li>
</ul>
<h3 id="heading-inventory-defaultsinventoryyml">Inventory - <code>defaults/inventory.yml</code></h3>
<p>The inventory file is pretty straightforward, as shown here:</p>
<pre><code class="lang-yaml"><span class="hljs-meta">---</span>
<span class="hljs-attr">nodes:</span>
  <span class="hljs-attr">hosts:</span>
    <span class="hljs-attr">diskstation:</span>
      <span class="hljs-attr">ansible_host:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ lookup('env', 'DISKSTATION_IP_ADDRESS') }}</span>"</span>
      <span class="hljs-attr">ansible_user:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ lookup('env', 'DISKSTATION_USER') }}</span>"</span>
      <span class="hljs-attr">ansible_password:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ lookup('env', 'DISKSTATION_PASS') }}</span>"</span>
    <span class="hljs-attr">unraid:</span>
      <span class="hljs-attr">ansible_host:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ lookup('env', 'UNRAID_IP_ADDRESS') }}</span>"</span>
      <span class="hljs-attr">ansible_user:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ lookup('env', 'UNRAID_USER') }}</span>"</span>
      <span class="hljs-attr">ansible_password:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ lookup('env', 'UNRAID_PASS') }}</span>"</span>
</code></pre>
<p>This file defines two hosts—unraid and diskstation—along with the essential connection variables Ansible requires to establish SSH access to these devices. For more details on the various types of connection variables, refer to the link provided below:<br /><a target="_blank" href="https://docs.ansible.com/ansible/latest/inventory_guide/intro_inventory.html#connecting-to-hosts-behavioral-inventory-parameters">Ansible Connection Variables</a></p>
<p>To keep things simple (and enhance security), I’m using environment variables to store the Ansible connection values. These variables need to be set up on the Ansible host before running the playbook. If you’re new to automation or Linux, you can create environment variables using the examples provided below:<br /><code>ansible_host# export UNRAID_USER=root</code><br /><code>ansible_host# export DISKSTATION_IP_ADDRESS=192.168.1.100</code></p>
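For reference, the full set of environment variables consumed by the inventory's <code>lookup('env', ...)</code> calls can be staged in one go. The values below are placeholders; substitute your own:

```shell
# Placeholder values -- substitute your own before running the playbook.
export UNRAID_IP_ADDRESS=192.168.1.50
export UNRAID_USER=root
export UNRAID_PASS='changeme'
export DISKSTATION_IP_ADDRESS=192.168.1.100
export DISKSTATION_USER=unraid
export DISKSTATION_PASS='changeme'
# Quick sanity check that everything is set:
env | grep -E 'UNRAID|DISKSTATION' | cut -d= -f1 | sort
```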
<h3 id="heading-variables-varssnapshot-creation-varsyml">Variables - <code>vars/snapshot-creation-vars.yml</code></h3>
<p>This playbook uses a single variable file, which serves as the main file the user will interact with. In this file, you'll define your list of VMs, specify the disks associated with each VM that need snapshots, and provide the path to the directory where each VM's existing disk <code>.img</code> files are stored.</p>
<pre><code class="lang-yaml"><span class="hljs-meta">---</span>
<span class="hljs-attr">snapshot_repository_base_directory:</span> <span class="hljs-string">volume1/Home\</span> <span class="hljs-string">Media/Backup</span>
<span class="hljs-attr">repository_user:</span> <span class="hljs-string">unraid</span>

<span class="hljs-attr">snapshot_create_list:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">vm_name:</span> <span class="hljs-string">Rocky9-TESTNode</span>
    <span class="hljs-attr">disks_to_snapshot:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">disk_name:</span> <span class="hljs-string">vdisk1.img</span>
        <span class="hljs-attr">source_directory:</span> <span class="hljs-string">/mnt/cache/domains</span>
        <span class="hljs-attr">desired_snapshot_name:</span> <span class="hljs-string">test-snapshot</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">disk_name:</span> <span class="hljs-string">vdisk2.img</span>
        <span class="hljs-attr">source_directory:</span> <span class="hljs-string">/mnt/disk1/domains</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">vm_name:</span> <span class="hljs-string">Rocky9-LabNode3</span>
    <span class="hljs-attr">disks_to_snapshot:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">disk_name:</span> <span class="hljs-string">vdisk1.img</span>
        <span class="hljs-attr">source_directory:</span> <span class="hljs-string">/mnt/nvme_cache/domains</span>
        <span class="hljs-attr">desired_snapshot_name:</span> <span class="hljs-string">kuberne&amp;&lt;tes-baseline</span>
</code></pre>
<p>Let's break this down:</p>
<ul>
<li><p><code>snapshot_create_list</code> - the main data structure for defining your list of VMs and disks. Within this there are two main variables: <code>vm_name</code> and <code>disks_to_snapshot</code>.</p>
</li>
<li><p><code>vm_name</code> - used to define the name of your VM. It must match the name of the VM within the Unraid system itself.</p>
</li>
<li><p><code>disks_to_snapshot</code> - a per-VM list of the disks that will be snapshotted. This list requires two variables, <code>disk_name</code> and <code>source_directory</code>, with <code>desired_snapshot_name</code> as an optional third.</p>
</li>
<li><p><code>disk_name</code> - the existing <code>.img</code> file name for that VM disk, e.g. <code>vdisk1.img</code></p>
</li>
<li><p><code>source_directory</code> - the absolute root path of the directory where the per-VM files are stored. An example of a full path to an <code>.img</code> file within Unraid would be: <code>/mnt/cache/domains/Rocky9-TESTNode/vdisk1.img</code></p>
</li>
<li><p><code>desired_snapshot_name</code> - an optional attribute the user can define to customize the name of the snapshot. If left undefined, a timestamp of the current date/time is used as the snapshot name, e.g. <code>vdisk2.2024-09-12T03.09.17Z.img</code></p>
</li>
<li><p><code>snapshot_repository_base_directory</code> and <code>repository_user</code> are used within the playbook's rsync task. These variables offer flexibility, allowing the user to specify their own remote user and target destination for the rsync operation. They are used only if the snapshots are sent to a remote location upon creation.</p>
</li>
</ul>
<p>Following the provided example, you can define your VMs, disk names, and locations when running the playbook.</p>
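The timestamp fallback is easy to reproduce outside Ansible. This is a rough shell equivalent of the generated name; the exact format string the playbook uses is an assumption inferred from the example above:

```shell
# Build a snapshot file name like vdisk2.2024-09-12T03.09.17Z.img
# from a disk name and the current UTC time.
disk_name="vdisk2.img"
stamp=$(date -u +%Y-%m-%dT%H.%M.%SZ)
snapshot_name="${disk_name%.img}.${stamp}.img"   # strip .img, append stamp, re-add .img
echo "$snapshot_name"
```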
<h3 id="heading-the-playbook"><strong>The Playbook</strong></h3>
<p>The playbook file is called <code>create-snapshot-pb.yml</code>. It consists of two plays and two additional task files.</p>
<p><strong>Snapshot Creation Prep Play</strong></p>
<pre><code class="lang-yaml"><span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Unraid</span> <span class="hljs-string">Snapshot</span> <span class="hljs-string">Creation</span> <span class="hljs-string">Preparation</span>
  <span class="hljs-attr">hosts:</span> <span class="hljs-string">unraid</span>
  <span class="hljs-attr">gather_facts:</span> <span class="hljs-literal">yes</span>
  <span class="hljs-attr">vars:</span>
    <span class="hljs-attr">needs_shutdown:</span> []
    <span class="hljs-attr">confirmed_shutdown:</span> []
    <span class="hljs-attr">vms_map:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ snapshot_create_list | map(attribute='vm_name') }}</span>"</span>
    <span class="hljs-attr">disks_map:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ snapshot_create_list | map(attribute='disks_to_snapshot') }}</span>"</span>
    <span class="hljs-attr">snapshot_data_map:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ dict(vms_map | zip(disks_map)) | dict2items(key_name='vm_name', value_name='disks_to_snapshot') | subelements('disks_to_snapshot') }}</span>"</span>
  <span class="hljs-attr">vars_files:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">./vars/snapshot-creation-vars.yml</span>

  <span class="hljs-attr">tasks:</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Get</span> <span class="hljs-string">initial</span> <span class="hljs-string">VM</span> <span class="hljs-string">status</span>
      <span class="hljs-attr">shell:</span> <span class="hljs-string">virsh</span> <span class="hljs-string">list</span> <span class="hljs-string">--all</span> <span class="hljs-string">|</span> <span class="hljs-string">grep</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ item.vm_name }}</span>"</span> <span class="hljs-string">|</span> <span class="hljs-string">awk</span> <span class="hljs-string">'{ print $3}'</span>
      <span class="hljs-attr">register:</span> <span class="hljs-string">cmd_res</span>
      <span class="hljs-attr">tags:</span> <span class="hljs-string">always</span>
      <span class="hljs-attr">with_items:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ snapshot_create_list }}</span>"</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Create</span> <span class="hljs-string">list</span> <span class="hljs-string">of</span> <span class="hljs-string">VMs</span> <span class="hljs-string">that</span> <span class="hljs-string">need</span> <span class="hljs-string">shutdown</span>
      <span class="hljs-attr">set_fact:</span>
        <span class="hljs-attr">needs_shutdown:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ needs_shutdown + [item.item.vm_name] }}</span>"</span>
      <span class="hljs-attr">when:</span> <span class="hljs-string">item.stdout</span> <span class="hljs-type">!=</span> <span class="hljs-string">'shut'</span>
      <span class="hljs-attr">tags:</span> <span class="hljs-string">always</span>
      <span class="hljs-attr">with_items:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ cmd_res.results }}</span>"</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Shutdown</span> <span class="hljs-string">VM(s)</span>
      <span class="hljs-attr">include_tasks:</span> <span class="hljs-string">./tasks/shutdown-vm.yml</span>
      <span class="hljs-attr">loop:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ needs_shutdown }}</span>"</span>
      <span class="hljs-attr">tags:</span> <span class="hljs-string">always</span>
      <span class="hljs-attr">when:</span> <span class="hljs-string">needs_shutdown</span>
</code></pre>
<p><strong>Purpose</strong>:<br />Prepares the Unraid server for VM snapshot creation by checking the status of VMs, identifying which need to be shut down, and initiating shutdowns where necessary.</p>
<p><strong>Hosts</strong>:<br />Targets the <code>unraid</code> host.</p>
<p><strong>Variables</strong>:</p>
<ul>
<li><p><code>needs_shutdown</code>: Placeholder list of VMs that require shutdown before snapshot creation.</p>
</li>
<li><p><code>confirmed_shutdown</code>: Placeholder list for VMs confirmed to be shut down.</p>
</li>
<li><p><code>vms_map</code> and <code>disks_map</code>: Maps (new lists) of just the VM names and their individual disk data, respectively. These lists are then used to build the larger <code>snapshot_data_map</code>.</p>
</li>
<li><p><code>snapshot_data_map</code>: Merges the VM and disk maps into a more structured data format, making it easier to access and manage the VM/disk information programmatically. My goal was to keep the inventory files simple for users to understand and modify. However, this approach didn’t work well with the looping logic I needed, so I created this new data map for better flexibility and control.</p>
</li>
</ul>
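<p>To visualize the transformation, here's a sketch (illustrative, based on the example VM from earlier) of the shape <code>snapshot_data_map</code> ends up with. The <code>subelements</code> filter emits one <code>[VM, disk]</code> pair per disk, which is why the looping tasks reference <code>item[0]['vm_name']</code> and <code>item[1]['disk_name']</code>:</p>
<pre><code class="lang-yaml"># Each element is a two-item pair: [the VM entry, one of its disks]
snapshot_data_map:
  - - vm_name: Rocky9-TESTNode
      disks_to_snapshot:
        - disk_name: vdisk1.img
          source_directory: /mnt/cache/domains
    - disk_name: vdisk1.img
      source_directory: /mnt/cache/domains
</code></pre>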
<p><strong>Variables File</strong>:<br />Loads additional variables from <code>./vars/snapshot-creation-vars.yml</code>, mainly the user's modified <code>snapshot_create_list</code>.</p>
<p><strong>Tasks</strong>:</p>
<ol>
<li><p><strong>Get Initial VM Status</strong>:<br /> Runs a shell command using <code>virsh list --all</code> to check the current status of each VM (running or shut down). Results are stored in <code>cmd_res</code>.</p>
</li>
<li><p><strong>Identify VMs Needing Shutdown</strong>:<br /> Uses a conditional check to add VMs that are not already shut down to the <code>needs_shutdown</code> list.</p>
</li>
<li><p><strong>Shutdown VMs</strong>:<br /> Includes an external task file (<code>shutdown-vm.yml</code>) to gracefully shut down the VMs listed in <code>needs_shutdown</code>. This task loops through the VMs in that list and executes the shutdown process. Using an external task file enables looping over a block of tasks while preserving error handling. If any task within the block fails, the entire block fails, ensuring that the VM is not added to the <code>confirmed_shutdown</code> list later in the play. This method provides better control and validation during the shutdown process.</p>
</li>
</ol>
<p><strong>NOTE</strong>: The tasks above all carry the tag ‘always’, a special tag that ensures a task runs regardless of which tags are specified when you run the playbook.</p>
<p><strong>Shutdown VMs task block (within Snapshot Creation Preparation play)</strong></p>
<pre><code class="lang-yaml"><span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Shutdown</span> <span class="hljs-string">VMs</span> <span class="hljs-string">Block</span>
  <span class="hljs-attr">block:</span>

  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Shutdown</span> <span class="hljs-string">VM</span> <span class="hljs-bullet">-</span> {{ <span class="hljs-string">item</span> }}
    <span class="hljs-attr">command:</span> <span class="hljs-string">virsh</span> <span class="hljs-string">shutdown</span> {{ <span class="hljs-string">item</span> }}
    <span class="hljs-attr">ignore_errors:</span> <span class="hljs-literal">true</span>

  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Get</span> <span class="hljs-string">VM</span> <span class="hljs-string">status</span> <span class="hljs-bullet">-</span> {{ <span class="hljs-string">item</span> }}
    <span class="hljs-attr">shell:</span> <span class="hljs-string">virsh</span> <span class="hljs-string">list</span> <span class="hljs-string">--all</span> <span class="hljs-string">|</span> <span class="hljs-string">grep</span> {{ <span class="hljs-string">item</span> }} <span class="hljs-string">|</span> <span class="hljs-string">awk</span> <span class="hljs-string">'{ print $3}'</span>
    <span class="hljs-attr">register:</span> <span class="hljs-string">cmd_res</span>
    <span class="hljs-attr">retries:</span> <span class="hljs-number">5</span>
    <span class="hljs-attr">delay:</span> <span class="hljs-number">10</span>
    <span class="hljs-attr">until:</span> <span class="hljs-string">cmd_res.stdout</span> <span class="hljs-type">!=</span> <span class="hljs-string">'running'</span>

  <span class="hljs-attr">delegate_to:</span> <span class="hljs-string">unraid</span>
  <span class="hljs-attr">tags:</span> <span class="hljs-string">always</span>
</code></pre>
<p>Here's a breakdown of the task block to shut down the targeted VMs:</p>
<p><strong>Purpose</strong>:<br />This block is designed to gracefully shut down virtual machines (VMs) and verify their shutdown status. This block is also tagged as ‘always’, ensuring ALL tasks in the block run.</p>
<p><strong>Tasks</strong>:</p>
<ol>
<li><p><strong>Shutdown VM</strong>:<br /> Uses the <code>virsh shutdown</code> command to initiate the shutdown of the specified VM.</p>
</li>
<li><p><strong>Check VM Status</strong>:<br /> Runs a shell command to retrieve the VM's current status using <code>virsh list</code>. The status is checked by parsing the output to confirm whether the VM is no longer running. The task will retry up to 5 times, with a 10-second delay between attempts, until the VM is confirmed to have shut down (<code>cmd_res.stdout != 'running'</code>).</p>
</li>
</ol>
<p><strong>Snapshot Creation Preparation Play (continued)</strong></p>
<pre><code class="lang-yaml"><span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Get</span> <span class="hljs-string">VM</span> <span class="hljs-string">status</span>
      <span class="hljs-attr">shell:</span> <span class="hljs-string">virsh</span> <span class="hljs-string">list</span> <span class="hljs-string">--all</span> <span class="hljs-string">|</span> <span class="hljs-string">grep</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ item.vm_name }}</span>"</span> <span class="hljs-string">|</span> <span class="hljs-string">awk</span> <span class="hljs-string">'{ print $3}'</span>
      <span class="hljs-attr">register:</span> <span class="hljs-string">cmd_res</span>
      <span class="hljs-attr">tags:</span> <span class="hljs-string">always</span>
      <span class="hljs-attr">with_items:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ snapshot_create_list }}</span>"</span>

<span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Create</span> <span class="hljs-string">list</span> <span class="hljs-string">to</span> <span class="hljs-string">use</span> <span class="hljs-string">for</span> <span class="hljs-string">confirmation</span> <span class="hljs-string">of</span> <span class="hljs-string">VMs</span> <span class="hljs-string">being</span> <span class="hljs-string">shutdown</span>
  <span class="hljs-attr">set_fact:</span>
    <span class="hljs-attr">confirmed_shutdown:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ confirmed_shutdown + [item.item.vm_name] }}</span>"</span>
  <span class="hljs-attr">when:</span> <span class="hljs-string">item.stdout</span> <span class="hljs-string">==</span> <span class="hljs-string">'shut'</span>
  <span class="hljs-attr">tags:</span> <span class="hljs-string">always</span>
  <span class="hljs-attr">with_items:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ cmd_res.results }}</span>"</span>

<span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Add</span> <span class="hljs-string">host</span> <span class="hljs-string">to</span> <span class="hljs-string">group</span> <span class="hljs-string">'disks'</span> <span class="hljs-string">with</span> <span class="hljs-string">variables</span>
  <span class="hljs-attr">ansible.builtin.add_host:</span>
    <span class="hljs-attr">name:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ item[0]['vm_name'] }}</span>-<span class="hljs-template-variable">{{ item[1]['disk_name'][:-4] }}</span>"</span>
    <span class="hljs-attr">groups:</span> <span class="hljs-string">disks</span>
    <span class="hljs-attr">vm_name:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ item[0]['vm_name'] }}</span>"</span>
    <span class="hljs-attr">disk_name:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ item[1]['disk_name'] }}</span>"</span>
    <span class="hljs-attr">source_directory:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ item[1]['source_directory'] }}</span>"</span>
    <span class="hljs-attr">desired_snapshot_name:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ item[1]['desired_snapshot_name'] | default('') }}</span>"</span>
  <span class="hljs-attr">tags:</span> <span class="hljs-string">always</span>
  <span class="hljs-attr">loop:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ snapshot_data_map }}</span>"</span>
</code></pre>
<p><strong>Purpose</strong>:<br />This second group of tasks (still within the Snapshot Prep play) checks the status of VMs, confirms which have been shut down, and adds their disks to a dynamic inventory group for snapshot creation.</p>
<p><strong>Tasks</strong>:</p>
<ol>
<li><p><strong>Get VM Status</strong>:<br /> Runs a shell command using <code>virsh list --all</code> to retrieve the current status (e.g., running, shut) of each VM in the <code>snapshot_create_list</code>. The result is stored in <code>cmd_res</code>.</p>
</li>
<li><p><strong>Confirm VM Shutdown</strong>:<br /> Updates the <code>confirmed_shutdown</code> list by adding VMs that are confirmed to be in the "shut" state. This ensures only properly shut down VMs proceed to the next steps.</p>
</li>
<li><p><strong>Add Disks to Group 'disks'</strong>:<br /> Dynamically adds VMs and their respective disks to the Ansible inventory group <code>disks</code>. It includes variables like <code>vm_name</code>, <code>disk_name</code>, and <code>source_directory</code>, which will be used for subsequent snapshot operations.</p>
</li>
</ol>
<p><strong>Other things to point out</strong>:</p>
<ul>
<li>Ansible lets you dynamically add inventory hosts during playbook execution, which I used to treat each disk as a "host" rather than relying solely on variables. This approach lets the playbook leverage Ansible's native batch task execution, allowing snapshot creation tasks to run concurrently across all disks. Without this method, using standard variables and looping would result in snapshots being created and synced <strong>one at a time</strong>— UGH. That's the reason behind Task #3. These tasks, too, are all tagged with ‘always’.</li>
</ul>
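<p>As a rough sketch (assuming the example VM from earlier), the <code>add_host</code> task builds a runtime-only inventory group that would look something like this if written out by hand; the host name is the <code>vm_name</code> plus the disk name minus its <code>.img</code> extension:</p>
<pre><code class="lang-yaml"># The dynamic 'disks' group, shown as if it were a static inventory (illustrative only)
disks:
  hosts:
    Rocky9-TESTNode-vdisk1:
      vm_name: Rocky9-TESTNode
      disk_name: vdisk1.img
      source_directory: /mnt/cache/domains
      desired_snapshot_name: ''
</code></pre>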
<p><strong>Snapshot Creation Play</strong></p>
<pre><code class="lang-yaml"><span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Unraid</span> <span class="hljs-string">Snapshot</span> <span class="hljs-string">Creation</span>
  <span class="hljs-attr">hosts:</span> <span class="hljs-string">disks</span>
  <span class="hljs-attr">gather_facts:</span> <span class="hljs-literal">no</span>
  <span class="hljs-attr">vars_files:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">./vars/snapshot-creation-vars.yml</span>

  <span class="hljs-attr">tasks:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Snapshot</span> <span class="hljs-string">Creation</span> <span class="hljs-string">Task</span> <span class="hljs-string">Block</span>
      <span class="hljs-attr">block:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">setup:</span>
            <span class="hljs-attr">gather_subset:</span>
              <span class="hljs-bullet">-</span> <span class="hljs-string">'min'</span>
          <span class="hljs-attr">delegate_to:</span> <span class="hljs-string">unraid</span>

        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Create</span> <span class="hljs-string">snapshot</span> <span class="hljs-string">image</span> <span class="hljs-string">filename</span>
          <span class="hljs-attr">set_fact:</span>
            <span class="hljs-attr">snapshot_filename:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ disk_name[:-4] }}</span>.<span class="hljs-template-variable">{{ desired_snapshot_name | regex_replace('\\-', '_') | regex_replace('\\W', '') }}</span>.img"</span>
          <span class="hljs-attr">delegate_to:</span> <span class="hljs-string">unraid</span>
          <span class="hljs-attr">when:</span> <span class="hljs-string">desired_snapshot_name</span> <span class="hljs-string">is</span> <span class="hljs-string">defined</span> <span class="hljs-string">and</span> <span class="hljs-string">desired_snapshot_name</span> <span class="hljs-string">|</span> <span class="hljs-string">length</span> <span class="hljs-string">&gt;</span> <span class="hljs-number">0</span>

        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Create</span> <span class="hljs-string">snapshot</span> <span class="hljs-string">image</span> <span class="hljs-string">filename</span> <span class="hljs-string">with</span> <span class="hljs-string">default</span> <span class="hljs-string">date/time</span> <span class="hljs-string">if</span> <span class="hljs-string">necessary</span>
          <span class="hljs-attr">set_fact:</span>
            <span class="hljs-attr">snapshot_filename:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ disk_name[:-4] }}</span>.<span class="hljs-template-variable">{{ ansible_date_time.iso8601|replace(':', '.')}}</span>.img"</span>
          <span class="hljs-attr">delegate_to:</span> <span class="hljs-string">unraid</span>
          <span class="hljs-attr">when:</span> <span class="hljs-string">desired_snapshot_name</span> <span class="hljs-string">is</span> <span class="hljs-string">not</span> <span class="hljs-string">defined</span> <span class="hljs-string">or</span> <span class="hljs-string">desired_snapshot_name</span> <span class="hljs-string">|</span> <span class="hljs-string">length</span> <span class="hljs-string">==</span> <span class="hljs-number">0</span>

        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Create</span> <span class="hljs-string">reflink</span> <span class="hljs-string">for</span> {{ <span class="hljs-string">vm_name</span> }}
          <span class="hljs-attr">command:</span> <span class="hljs-string">cp</span> <span class="hljs-string">--reflink</span> <span class="hljs-string">-rf</span> {{ <span class="hljs-string">disk_name</span> }} {{ <span class="hljs-string">snapshot_filename</span> }}
          <span class="hljs-attr">args:</span>
            <span class="hljs-attr">chdir:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ source_directory }}</span>/<span class="hljs-template-variable">{{ vm_name }}</span>"</span>
          <span class="hljs-attr">delegate_to:</span> <span class="hljs-string">unraid</span>

        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Check</span> <span class="hljs-string">if</span> <span class="hljs-string">reflink</span> <span class="hljs-string">exists</span>
          <span class="hljs-attr">stat:</span> 
            <span class="hljs-attr">path:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ source_directory }}</span>/<span class="hljs-template-variable">{{ vm_name }}</span>/<span class="hljs-template-variable">{{ snapshot_filename }}</span>"</span>
            <span class="hljs-attr">get_checksum:</span> <span class="hljs-literal">False</span>
          <span class="hljs-attr">register:</span> <span class="hljs-string">check_reflink_hd</span>
          <span class="hljs-attr">delegate_to:</span> <span class="hljs-string">unraid</span>

        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Backup</span> <span class="hljs-string">HD(s)</span> <span class="hljs-string">to</span> <span class="hljs-string">DiskStation</span>
          <span class="hljs-attr">command:</span> <span class="hljs-string">rsync</span> <span class="hljs-string">--progress</span> {{ <span class="hljs-string">snapshot_filename</span> }} {{ <span class="hljs-string">repository_user</span> }}<span class="hljs-string">@{{</span> <span class="hljs-string">hostvars['diskstation']['ansible_host']</span> <span class="hljs-string">}}:/{{</span> <span class="hljs-string">snapshot_repository_base_directory</span> <span class="hljs-string">}}/{{</span> <span class="hljs-string">vm_name</span> <span class="hljs-string">}}/</span>
          <span class="hljs-attr">args:</span>
            <span class="hljs-attr">chdir:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ source_directory }}</span>/<span class="hljs-template-variable">{{ vm_name }}</span>"</span>
          <span class="hljs-attr">when:</span> <span class="hljs-string">check_reflink_hd.stat.exists</span> <span class="hljs-string">and</span> <span class="hljs-string">'use_local'</span> <span class="hljs-string">not</span> <span class="hljs-string">in</span> <span class="hljs-string">ansible_run_tags</span>
          <span class="hljs-attr">delegate_to:</span> <span class="hljs-string">unraid</span>

        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Backup</span> <span class="hljs-string">HD(s)</span> <span class="hljs-string">to</span> <span class="hljs-string">Local</span> <span class="hljs-string">VM</span> <span class="hljs-string">Folder</span> <span class="hljs-string">as</span> <span class="hljs-string">.tar</span>
          <span class="hljs-attr">command:</span> <span class="hljs-string">tar</span> <span class="hljs-string">cf</span> {{ <span class="hljs-string">snapshot_filename</span> }}<span class="hljs-string">.tar</span> {{ <span class="hljs-string">snapshot_filename</span> }}
          <span class="hljs-attr">args:</span>
            <span class="hljs-attr">chdir:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ source_directory }}</span>/<span class="hljs-template-variable">{{ vm_name }}</span>"</span>
          <span class="hljs-attr">when:</span> <span class="hljs-string">check_reflink_hd.stat.exists</span> <span class="hljs-string">and</span> <span class="hljs-string">'use_local'</span> <span class="hljs-string">in</span> <span class="hljs-string">ansible_run_tags</span>
          <span class="hljs-attr">delegate_to:</span> <span class="hljs-string">unraid</span>

        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Delete</span> <span class="hljs-string">reflink</span> <span class="hljs-string">file</span>
          <span class="hljs-attr">command:</span> <span class="hljs-string">rm</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ source_directory }}</span>/<span class="hljs-template-variable">{{ vm_name }}</span>/<span class="hljs-template-variable">{{ snapshot_filename }}</span>"</span>
          <span class="hljs-attr">when:</span> <span class="hljs-string">check_reflink_hd.stat.exists</span>
          <span class="hljs-attr">delegate_to:</span> <span class="hljs-string">unraid</span>

        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Start</span> <span class="hljs-string">VM</span> <span class="hljs-string">following</span> <span class="hljs-string">snapshot</span> <span class="hljs-string">transfer</span>
          <span class="hljs-attr">command:</span> <span class="hljs-string">virsh</span> <span class="hljs-string">start</span> {{ <span class="hljs-string">vm_name</span> }}
          <span class="hljs-attr">tags:</span> <span class="hljs-string">always</span>
          <span class="hljs-attr">delegate_to:</span> <span class="hljs-string">unraid</span>


      <span class="hljs-attr">when:</span> <span class="hljs-string">vm_name</span> <span class="hljs-string">in</span> <span class="hljs-string">hostvars['unraid']['confirmed_shutdown']</span>
      <span class="hljs-attr">tags:</span> <span class="hljs-string">always</span>
</code></pre>
<p>Here's a breakdown of the second play in the playbook—<strong>Unraid Snapshot Creation</strong></p>
<p><strong>Purpose</strong>:<br />This play automates the creation of VM disk snapshots on the Unraid server, backing them up to a destination NAS via rsync or creating local snapshots as TAR files, stored in the same directory as the original disk.</p>
<p><strong>Hosts</strong>:</p>
<ul>
<li>Uses the dynamically created <code>disks</code> group from the previous play. It can also still reference the <code>unraid</code> host, which remains in memory from the previous play. <code>gather_facts</code> is set to 'no', since the members of the <code>disks</code> group aren't actually hosts we connect to (explained in the previous play).</li>
</ul>
<p><strong>Variables</strong>:</p>
<ul>
<li>Loads variables from the external file <code>./vars/snapshot-creation-vars.yml</code>, specifically <code>snapshot_repository_base_directory</code> and <code>repository_user</code>.</li>
</ul>
<p><strong>Tasks</strong>:</p>
<ol>
<li><p><strong>Setup Minimal Facts</strong>:<br /> Gathers a minimal fact subset from the <code>unraid</code> host to prepare for snapshot creation, mainly to populate the <code>ansible_date_time.iso8601</code> variable.</p>
</li>
<li><p><strong>Create Snapshot Filename</strong>:<br /> Generates a unique snapshot filename based on the <code>desired_snapshot_name</code> variable if defined by the user. It also sanitizes that value by replacing dashes with underscores and removing any remaining special characters.</p>
</li>
<li><p><strong>Create Snapshot Image Filename with Default Date/Time if necessary:</strong></p>
<p> Acts as the fallback for generating the snapshot name. Builds the filename with an ISO 8601 date/time stamp if a filename wasn’t created by the previous task.</p>
</li>
<li><p><strong>Create Snapshot (Reflink)</strong>:<br /> Uses a <code>cp --reflink</code> command to create a snapshot (reflink) of the specified disk in the source directory.</p>
</li>
<li><p><strong>Verify Snapshot Creation</strong>:<br /> Checks if the snapshot (reflink) was successfully created in the target directory.</p>
</li>
<li><p><strong>Backup Snapshot to DiskStation</strong>:<br /> If the snapshot exists, it's transferred to the DiskStation NAS using rsync, executed via Ansible's <code>command</code> module. A downside is that there’s no live progress shown in the Ansible shell output, which can be frustrating for large or numerous disk files. In my case, I monitor the DiskStation GUI to track the snapshot's file size growth to confirm it’s still running. If you want better visibility, Ansible AWX provides progress tracking without this limitation. Conditionally runs only if Ansible finds an existing reflink for the disk and the playbook WASN’T run with the <code>use_local</code> tag.</p>
</li>
<li><p><strong>Backup HD(s) to Local VM Folder as .tar:</strong></p>
<p> Alternatively, if the <code>use_local</code> tag is present, the snapshot is archived locally as a <code>.tar</code> file. This option allows users to store the snapshot on the same server, in the same source disk folder, without needing external storage. The play provides a mechanism to skip this step if not required, offering tag-based control for local or remote backups. Conditionally runs only if Ansible finds an existing reflink for the disk.</p>
</li>
<li><p><strong>Delete Reflink File</strong>:<br /> Once the snapshot has been successfully backed up, it deletes the temporary reflink file on the <code>unraid</code> host.</p>
</li>
<li><p><strong>Start VM Following Successful Snapshot Creation</strong></p>
<p> Starts the impacted VMs back up once the snapshot creation process completes.</p>
</li>
</ol>
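<p>To make the filename logic concrete, here are the results the two <code>set_fact</code> tasks above would produce for a disk named <code>vdisk1.img</code>. The custom name below is a hypothetical value; the timestamp form matches the default example from earlier in the article:</p>
<pre><code class="lang-yaml"># Illustrative results of the snapshot_filename set_fact tasks
desired_snapshot_name: my-baseline   # -> vdisk1.my_baseline.img
desired_snapshot_name: ''            # -> vdisk1.2024-09-12T03.09.17Z.img (ISO 8601, ':' replaced with '.')
</code></pre>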
<p><strong>Conditional Execution</strong>:</p>
<ul>
<li>The block only executes if the VM is confirmed to be in a shutdown state, based on the <code>vm_name</code> value being present in the <code>confirmed_shutdown</code> host variable list created on the <code>unraid</code> host in the previous play. The whole block is tagged with ‘always’, so every task always runs, with the exception of the two backup tasks, which are selected between by the <code>use_local</code> tag (see above).</li>
</ul>
<p><strong>Other things to point out</strong>:</p>
<ul>
<li><p>All these tasks are being executed or <code>delegated_to</code> the <code>unraid</code> host itself. Nothing will run on the <code>disks</code> host group.</p>
</li>
<li><p>I opted to use <code>.tar</code> files to speed up both the creation and restoration of snapshots. A traditional local file copy took nearly as long as an <code>rsync</code> to a remote destination. By using <code>.tar</code> files within the same disk source folder, I reduced the time required by 25-50%.</p>
</li>
</ul>
<h3 id="heading-creating-the-snapshots-running-the-playbook"><strong>Creating the Snapshots (Running the Playbook)</strong></h3>
<p>Finally, we can move on to the most exciting piece: running the playbook. It's very simple. Just run the following command in the root of the playbook directory:</p>
<pre><code class="lang-bash">ansible-playbook create-snapshot-pb.yml -i defaults/inventory.yml
</code></pre>
<p>As long as your data and formatting are clean and all the required setup was done, you should see the playbook shut down the VMs (if necessary) and quickly get to the Backup task for the disks. That's where it will spend the majority of its time.</p>
<p>Alternatively, you can run this play with the <code>use_local</code> tag to save snapshots as <code>.tar</code> files locally. This approach is ideal for faster recovery in a lab environment, where you're actively building or testing. Instead of rolling back multiple changes on a server, it's quicker and simpler to erase the disk and restore from a local baseline snapshot.</p>
<pre><code class="lang-bash">ansible-playbook create-snapshot-pb.yml -i defaults/inventory.yml --tags <span class="hljs-string">'use_local'</span>
</code></pre>
<p>Successful output should look similar to the following:</p>
<pre><code class="lang-typescript">PLAY [Unraid Snapshot Creation Prep] *****************************************************************************************************************

TASK [Gathering Facts] *******************************************************************************************************************************
ok: [unraid]

TASK [Get initial VM status] *************************************************************************************************************************
changed: [unraid] =&gt; (item={<span class="hljs-string">'vm_name'</span>: <span class="hljs-string">'Rocky9-TESTNode'</span>, <span class="hljs-string">'disks_to_snapshot'</span>: [{<span class="hljs-string">'disk_name'</span>: <span class="hljs-string">'vdisk1.img'</span>, <span class="hljs-string">'source_directory'</span>: <span class="hljs-string">'/mnt/cache/domains'</span>}]})
changed: [unraid] =&gt; (item={<span class="hljs-string">'vm_name'</span>: <span class="hljs-string">'Rocky9-LabNode3'</span>, <span class="hljs-string">'disks_to_snapshot'</span>: [{<span class="hljs-string">'disk_name'</span>: <span class="hljs-string">'vdisk1.img'</span>, <span class="hljs-string">'source_directory'</span>: <span class="hljs-string">'/mnt/nvme_cache/domains'</span>}]})

TASK [Create list <span class="hljs-keyword">of</span> VMs that need shutdown] *********************************************************************************************************
ok: [unraid]

TASK [Shutdown VM(s)] ********************************************************************************************************************************
included: <span class="hljs-regexp">/mnt/</span>c/Dev/Git/unraid-vm-snapshots/tasks/shutdown-vm.yml <span class="hljs-keyword">for</span> unraid =&gt; (item=Rocky9-TESTNode)
included: <span class="hljs-regexp">/mnt/</span>c/Dev/Git/unraid-vm-snapshots/tasks/shutdown-vm.yml <span class="hljs-keyword">for</span> unraid =&gt; (item=Rocky9-LabNode3)

TASK [Shutdown VM - Rocky9-TESTNode] *****************************************************************************************************************
changed: [unraid]

TASK [Get VM status - Rocky9-TESTNode] ***************************************************************************************************************
changed: [unraid]

TASK [Shutdown VM - Rocky9-LabNode3] *****************************************************************************************************************
changed: [unraid]

TASK [Get VM status - Rocky9-LabNode3] ***************************************************************************************************************
FAILED - RETRYING: [unraid]: Get VM status - Rocky9-LabNode3 (<span class="hljs-number">5</span> retries left).
changed: [unraid]

TASK [Get VM status] *********************************************************************************************************************************
changed: [unraid] =&gt; (item={<span class="hljs-string">'vm_name'</span>: <span class="hljs-string">'Rocky9-TESTNode'</span>, <span class="hljs-string">'disks_to_snapshot'</span>: [{<span class="hljs-string">'disk_name'</span>: <span class="hljs-string">'vdisk1.img'</span>, <span class="hljs-string">'source_directory'</span>: <span class="hljs-string">'/mnt/cache/domains'</span>}]})
changed: [unraid] =&gt; (item={<span class="hljs-string">'vm_name'</span>: <span class="hljs-string">'Rocky9-LabNode3'</span>, <span class="hljs-string">'disks_to_snapshot'</span>: [{<span class="hljs-string">'disk_name'</span>: <span class="hljs-string">'vdisk1.img'</span>, <span class="hljs-string">'source_directory'</span>: <span class="hljs-string">'/mnt/nvme_cache/domains'</span>}]})

TASK [Create list to use <span class="hljs-keyword">for</span> confirmation <span class="hljs-keyword">of</span> VMs being shutdown] *************************************************************************************
ok: [unraid] =&gt; (item={<span class="hljs-string">'changed'</span>: True, <span class="hljs-string">'stdout'</span>: <span class="hljs-string">'shut'</span>, <span class="hljs-string">'stderr'</span>: <span class="hljs-string">''</span>, <span class="hljs-string">'rc'</span>: <span class="hljs-number">0</span>, <span class="hljs-string">'cmd'</span>: <span class="hljs-string">'virsh list --all | grep "Rocky9-TESTNode" | awk \'{ print $3}\''</span>, <span class="hljs-string">'start'</span>: <span class="hljs-string">'2024-09-09 18:04:55.797046'</span>, <span class="hljs-string">'end'</span>: <span class="hljs-string">'2024-09-09 18:04:55.809047'</span>, <span class="hljs-string">'delta'</span>: <span class="hljs-string">'0:00:00.012001'</span>, <span class="hljs-string">'msg'</span>: <span class="hljs-string">''</span>, <span class="hljs-string">'invocation'</span>: {<span class="hljs-string">'module_args'</span>: {<span class="hljs-string">'_raw_params'</span>: <span class="hljs-string">'virsh list --all | grep "Rocky9-TESTNode" | awk \'{ print $3}\''</span>, <span class="hljs-string">'_uses_shell'</span>: True, <span class="hljs-string">'expand_argument_vars'</span>: True, <span class="hljs-string">'stdin_add_newline'</span>: True, <span class="hljs-string">'strip_empty_ends'</span>: True, <span class="hljs-string">'argv'</span>: None, <span class="hljs-string">'chdir'</span>: None, <span class="hljs-string">'executable'</span>: None, <span class="hljs-string">'creates'</span>: None, <span class="hljs-string">'removes'</span>: None, <span class="hljs-string">'stdin'</span>: None}}, <span class="hljs-string">'stdout_lines'</span>: [<span class="hljs-string">'shut'</span>], <span class="hljs-string">'stderr_lines'</span>: [], <span class="hljs-string">'failed'</span>: False, <span class="hljs-string">'item'</span>: {<span class="hljs-string">'vm_name'</span>: <span class="hljs-string">'Rocky9-TESTNode'</span>, <span class="hljs-string">'disks_to_snapshot'</span>: 
[{<span class="hljs-string">'disk_name'</span>: <span class="hljs-string">'vdisk1.img'</span>, <span class="hljs-string">'source_directory'</span>: <span class="hljs-string">'/mnt/cache/domains'</span>}]}, <span class="hljs-string">'ansible_loop_var'</span>: <span class="hljs-string">'item'</span>})
ok: [unraid] =&gt; (item={<span class="hljs-string">'changed'</span>: True, <span class="hljs-string">'stdout'</span>: <span class="hljs-string">'shut'</span>, <span class="hljs-string">'stderr'</span>: <span class="hljs-string">''</span>, <span class="hljs-string">'rc'</span>: <span class="hljs-number">0</span>, <span class="hljs-string">'cmd'</span>: <span class="hljs-string">'virsh list --all | grep "Rocky9-LabNode3" | awk \'{ print $3}\''</span>, <span class="hljs-string">'start'</span>: <span class="hljs-string">'2024-09-09 18:04:57.638402'</span>, <span class="hljs-string">'end'</span>: <span class="hljs-string">'2024-09-09 18:04:57.650150'</span>, <span class="hljs-string">'delta'</span>: <span class="hljs-string">'0:00:00.011748'</span>, <span class="hljs-string">'msg'</span>: <span class="hljs-string">''</span>, <span class="hljs-string">'invocation'</span>: {<span class="hljs-string">'module_args'</span>: {<span class="hljs-string">'_raw_params'</span>: <span class="hljs-string">'virsh list --all | grep "Rocky9-LabNode3" | awk \'{ print $3}\''</span>, <span class="hljs-string">'_uses_shell'</span>: True, <span class="hljs-string">'expand_argument_vars'</span>: True, <span class="hljs-string">'stdin_add_newline'</span>: True, <span class="hljs-string">'strip_empty_ends'</span>: True, <span class="hljs-string">'argv'</span>: None, <span class="hljs-string">'chdir'</span>: None, <span class="hljs-string">'executable'</span>: None, <span class="hljs-string">'creates'</span>: None, <span class="hljs-string">'removes'</span>: None, <span class="hljs-string">'stdin'</span>: None}}, <span class="hljs-string">'stdout_lines'</span>: [<span class="hljs-string">'shut'</span>], <span class="hljs-string">'stderr_lines'</span>: [], <span class="hljs-string">'failed'</span>: False, <span class="hljs-string">'item'</span>: {<span class="hljs-string">'vm_name'</span>: <span class="hljs-string">'Rocky9-LabNode3'</span>, <span class="hljs-string">'disks_to_snapshot'</span>: 
[{<span class="hljs-string">'disk_name'</span>: <span class="hljs-string">'vdisk1.img'</span>, <span class="hljs-string">'source_directory'</span>: <span class="hljs-string">'/mnt/nvme_cache/domains'</span>}]}, <span class="hljs-string">'ansible_loop_var'</span>: <span class="hljs-string">'item'</span>})

TASK [Add host to group <span class="hljs-string">'disks'</span> <span class="hljs-keyword">with</span> variables] ******************************************************************************************************
changed: [unraid] =&gt; (item=[{<span class="hljs-string">'vm_name'</span>: <span class="hljs-string">'Rocky9-TESTNode'</span>, <span class="hljs-string">'disks_to_snapshot'</span>: [{<span class="hljs-string">'disk_name'</span>: <span class="hljs-string">'vdisk1.img'</span>, <span class="hljs-string">'source_directory'</span>: <span class="hljs-string">'/mnt/cache/domains'</span>}]}, {<span class="hljs-string">'disk_name'</span>: <span class="hljs-string">'vdisk1.img'</span>, <span class="hljs-string">'source_directory'</span>: <span class="hljs-string">'/mnt/cache/domains'</span>}])
changed: [unraid] =&gt; (item=[{<span class="hljs-string">'vm_name'</span>: <span class="hljs-string">'Rocky9-LabNode3'</span>, <span class="hljs-string">'disks_to_snapshot'</span>: [{<span class="hljs-string">'disk_name'</span>: <span class="hljs-string">'vdisk1.img'</span>, <span class="hljs-string">'source_directory'</span>: <span class="hljs-string">'/mnt/nvme_cache/domains'</span>}]}, {<span class="hljs-string">'disk_name'</span>: <span class="hljs-string">'vdisk1.img'</span>, <span class="hljs-string">'source_directory'</span>: <span class="hljs-string">'/mnt/nvme_cache/domains'</span>}])

PLAY [Unraid Snapshot Creation] **********************************************************************************************************************

TASK [setup] *****************************************************************************************************************************************
ok: [Rocky9-TESTNode-vdisk1 -&gt; unraid({{ lookup(<span class="hljs-string">'env'</span>, <span class="hljs-string">'UNRAID_IP_ADDRESS'</span>) }})]
ok: [Rocky9-LabNode3-vdisk1 -&gt; unraid({{ lookup(<span class="hljs-string">'env'</span>, <span class="hljs-string">'UNRAID_IP_ADDRESS'</span>) }})]

TASK [Create snapshot image filename] ****************************************************************************************************************
ok: [Rocky9-TESTNode-vdisk1 -&gt; unraid({{ lookup(<span class="hljs-string">'env'</span>, <span class="hljs-string">'UNRAID_IP_ADDRESS'</span>) }})]
ok: [Rocky9-LabNode3-vdisk1 -&gt; unraid({{ lookup(<span class="hljs-string">'env'</span>, <span class="hljs-string">'UNRAID_IP_ADDRESS'</span>) }})]

TASK [Create reflink <span class="hljs-keyword">for</span> Rocky9-TESTNode] ************************************************************************************************************
changed: [Rocky9-LabNode3-vdisk1 -&gt; unraid({{ lookup(<span class="hljs-string">'env'</span>, <span class="hljs-string">'UNRAID_IP_ADDRESS'</span>) }})]
changed: [Rocky9-TESTNode-vdisk1 -&gt; unraid({{ lookup(<span class="hljs-string">'env'</span>, <span class="hljs-string">'UNRAID_IP_ADDRESS'</span>) }})]

TASK [Check <span class="hljs-keyword">if</span> reflink exists] ***********************************************************************************************************************
ok: [Rocky9-LabNode3-vdisk1 -&gt; unraid({{ lookup(<span class="hljs-string">'env'</span>, <span class="hljs-string">'UNRAID_IP_ADDRESS'</span>) }})]
ok: [Rocky9-TESTNode-vdisk1 -&gt; unraid({{ lookup(<span class="hljs-string">'env'</span>, <span class="hljs-string">'UNRAID_IP_ADDRESS'</span>) }})]

TASK [Backup HD1 to DiskStation] *********************************************************************************************************************
changed: [Rocky9-TESTNode-vdisk1 -&gt; unraid({{ lookup(<span class="hljs-string">'env'</span>, <span class="hljs-string">'UNRAID_IP_ADDRESS'</span>) }})]
changed: [Rocky9-LabNode3-vdisk1 -&gt; unraid({{ lookup(<span class="hljs-string">'env'</span>, <span class="hljs-string">'UNRAID_IP_ADDRESS'</span>) }})]

TASK [Delete reflink file] ***************************************************************************************************************************
changed: [Rocky9-LabNode3-vdisk1 -&gt; unraid({{ lookup(<span class="hljs-string">'env'</span>, <span class="hljs-string">'UNRAID_IP_ADDRESS'</span>) }})]
changed: [Rocky9-TESTNode-vdisk1 -&gt; unraid({{ lookup(<span class="hljs-string">'env'</span>, <span class="hljs-string">'UNRAID_IP_ADDRESS'</span>) }})]

PLAY RECAP *******************************************************************************************************************************************
Rocky9-LabNode3-vdisk1     : ok=<span class="hljs-number">6</span>    changed=<span class="hljs-number">3</span>    unreachable=<span class="hljs-number">0</span>    failed=<span class="hljs-number">0</span>    skipped=<span class="hljs-number">0</span>    rescued=<span class="hljs-number">0</span>    ignored=<span class="hljs-number">0</span>   
Rocky9-TESTNode-vdisk1     : ok=<span class="hljs-number">6</span>    changed=<span class="hljs-number">3</span>    unreachable=<span class="hljs-number">0</span>    failed=<span class="hljs-number">0</span>    skipped=<span class="hljs-number">0</span>    rescued=<span class="hljs-number">0</span>    ignored=<span class="hljs-number">0</span>   
unraid                     : ok=<span class="hljs-number">12</span>   changed=<span class="hljs-number">7</span>    unreachable=<span class="hljs-number">0</span>    failed=<span class="hljs-number">0</span>    skipped=<span class="hljs-number">0</span>    rescued=<span class="hljs-number">0</span>    ignored=<span class="hljs-number">0</span>
</code></pre>
<p><strong>From DiskStation:</strong></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1725921373213/cecf909a-cc21-482c-9eab-2766675783e2.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1725921432966/e8e6127a-c9f7-43c9-a22e-a396c9f0c6c1.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-closing-thoughts"><strong>Closing Thoughts</strong></h2>
<p>Well, that was fun. Creating and backing up snapshots is incredibly useful, especially in a home lab where tooling might be less advanced. I plan to leverage this for more complex automation (Kubernetes, anyone?), since restoring from a snapshot is far simpler than undoing multiple changes. Again, the main drawback is that running the raw rsync command through Ansible gives no progress visibility. Pushing backups to the NAS can also be slow when dealing with hundreds of GBs or more; it takes roughly 4-5 minutes to push a 25 GB image file over a 1 Gbps connection.</p>
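<p>That 4-5 minute figure lines up with a quick back-of-the-envelope check. This is just a rough sketch using the 25 GB image size and 1 Gbps link speed mentioned above; real-world throughput will always be lower than line rate due to protocol and disk overhead:</p>

```python
# Theoretical minimum time to move a 25 GB disk image over a 1 Gbps link.
# Real transfers run slower (TCP/rsync overhead, disk I/O), which is why
# the observed time is closer to 4-5 minutes than the theoretical best case.

size_gb = 25        # vdisk image size in gigabytes (from the run above)
link_gbps = 1       # 1 Gbps network link

bits_to_move = size_gb * 8            # 200 gigabits
seconds = bits_to_move / link_gbps    # 200 seconds at full line rate
print(f"~{seconds:.0f} s, or about {seconds / 60:.1f} minutes, at best")
```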
<h2 id="heading-whats-next"><strong>What’s next?</strong></h2>
<p>I have two more pieces I hope to add to this series:</p>
<ol>
<li><p>Restoring from a snapshot (whether it's a specific snapshot or the latest).</p>
</li>
<li><p>Cleaning up old snapshots on your storage (in my case, the DiskStation).</p>
</li>
</ol>
<p>Down the road I may look at updating this to use the rclone utility instead of rsync. I might also turn all of this into a published Ansible role.</p>
<p>You can find the code that goes along with this post <a target="_blank" href="https://github.com/leothelyon17/unraid-vm-snapshots">here</a> (Github).</p>
<p>Thoughts, questions, and comments are appreciated. Please follow me here on Hashnode or connect with me on <a target="_blank" href="https://www.linkedin.com/in/jeffrey-m-lyon/">LinkedIn</a>.</p>
<p>Thank you for reading, fellow techies!</p>
]]></content:encoded></item></channel></rss>