<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Nerdy Lyon's Den... Tech Blog]]></title><description><![CDATA[Hello! I'm just a curious tech Lyon here to explore and share some fun experiences in tech. Despite being told by a previous manager I have no relevant tech ski]]></description><link>https://blog.nerdylyonsden.io</link><image><url>https://cdn.hashnode.com/res/hashnode/image/upload/v1724528104871/9a1703a6-34c8-4bf5-90a4-45f60869d4aa.png</url><title>Nerdy Lyon&apos;s Den... Tech Blog</title><link>https://blog.nerdylyonsden.io</link></image><generator>RSS for Node</generator><lastBuildDate>Tue, 14 Apr 2026 22:44:58 GMT</lastBuildDate><atom:link href="https://blog.nerdylyonsden.io/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[GitOps for Network Engineers - Deploying Nautobot]]></title><description><![CDATA[Previous Articles in the Series
Bridging the Gap: GitOps for Network Engineers - Part 1 (Deploying ArgoCD)
Bridging the Gap: GitOps for Network Engineers - Part 2 (Deploying Critical Infrastructure with ArgoCD)
Intro
Here we go! Time to deploy someth...]]></description><link>https://blog.nerdylyonsden.io/gitops-for-network-engineers-deploying-nautobot</link><guid isPermaLink="true">https://blog.nerdylyonsden.io/gitops-for-network-engineers-deploying-nautobot</guid><category><![CDATA[nautobot]]></category><category><![CDATA[gitops]]></category><category><![CDATA[ArgoCD]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[metallb]]></category><category><![CDATA[Traefik]]></category><category><![CDATA[hashicorp-vault]]></category><category><![CDATA[external-secrets]]></category><category><![CDATA[Helm]]></category><category><![CDATA[Kustomize]]></category><category><![CDATA[Network Automation]]></category><category><![CDATA[networking]]></category><dc:creator><![CDATA[Jeffrey Lyon]]></dc:creator><pubDate>Tue, 16 Sep 2025 12:13:28 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1756162643734/ee38d38f-a5bf-4d88-97ea-562bc82a104f.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3 id="heading-previous-articles-in-the-series">Previous Articles in the Series</h3>
<p><a target="_blank" href="https://blog.nerdylyonsden.io/bridging-the-gap-gitops-for-network-engineers-part-1">Bridging the Gap: GitOps for Network Engineers - Part 1</a> (Deploying ArgoCD)</p>
<p><a target="_blank" href="https://blog.nerdylyonsden.io/bridging-the-gap-gitops-for-network-engineers-part-2">Bridging the Gap: GitOps for Network Engineers - Part 2</a> (Deploying Critical Infrastructure with ArgoCD)</p>
<h1 id="heading-intro">Intro</h1>
<p>Here we go! Time to deploy something network automation engineers actually use: <a target="_blank" href="https://docs.nautobot.com/projects/core/en/stable/"><strong>Nautobot</strong></a>. For those who are unfamiliar, Nautobot is an open-source Network Source of Truth and automation platform. It gives you a clean API, GraphQL, plugins, and jobs for modeling your network and driving intent-based automation. In a GitOps workflow, Nautobot becomes the living database of network intent and inventory, while Argo CD ensures the platform itself is deployed and maintained declaratively. It’s one of my favorite tools because you can’t have a solid network automation foundation without a solid source of truth (okay, “source of intent” if you prefer). Either way, Nautobot is among the best; kudos to the <strong>Network to Code</strong> team for a great product. Before we dive in, let’s quickly recap the previous <strong>GitOps for Network Engineers</strong> posts. If you haven’t read those yet, I’d recommend starting there first. The links are posted above.</p>
<p><strong>Part 1</strong> established the groundwork: why GitOps matters for network engineers (intent-as-code, reviews, rollbacks), installing <strong>Argo CD</strong>, connecting it to Git, and proving the reconcile loop with a simple, Git-managed deployment.</p>
<p><strong>Part 2</strong> leveled that foundation into a production-ready platform. We declaratively integrated:</p>
<ul>
<li><p><strong>MetalLB</strong> for external service IPs</p>
</li>
<li><p><strong>Traefik</strong> for ingress routing and TLS</p>
</li>
<li><p><strong>Rook-Ceph</strong> for durable, cluster-native storage</p>
</li>
<li><p>A secrets stack using <strong>External Secrets</strong> backed by <strong>HashiCorp Vault</strong>, all continuously managed by <strong>ArgoCD</strong>.</p>
</li>
</ul>
<p>As a result, the platform can now:</p>
<ul>
<li><p>Expose apps securely via external IPs and ingress rules</p>
</li>
<li><p>Persist data with Ceph-backed volumes</p>
</li>
<li><p>Manage secrets without committing them to Git</p>
</li>
<li><p>Treat infrastructure the same as applications: defined in code, reconciled by Argo CD</p>
</li>
</ul>
<p>Instead of stamping this post as “Part 3,” I’m branching it off from the foundation posts. That gives me room to play with future installments while still keeping them under the <strong>GitOps for Network Engineers</strong> umbrella when it makes sense. The goal here is simple: bring a <strong>basic</strong> Nautobot deployment online, fully managed by ArgoCD, using the same GitOps patterns we established earlier. Specifically, we will:</p>
<ul>
<li><p>Add the main Nautobot Helm chart to ArgoCD</p>
</li>
<li><p>Define (or confirm) a StorageClass for Nautobot’s persistent needs</p>
</li>
<li><p>Allocate a MetalLB IP for Traefik to serve Nautobot externally</p>
</li>
<li><p>Create Secrets for DB, Redis, and an initial Nautobot superuser</p>
</li>
<li><p>Compose Kustomize resources to wrap Helm and environment overlays</p>
</li>
<li><p>Author custom <code>values.yml</code> for your environment</p>
</li>
<li><p>Deploy the App</p>
</li>
</ul>
<p>When we are done, our deployment will include five pods:</p>
<ul>
<li><p><strong>Nautobot Web (frontend/API)</strong> - serves the UI plus REST/GraphQL endpoints</p>
</li>
<li><p><strong>Nautobot Celery Worker</strong> - executes background jobs and plugin tasks</p>
</li>
<li><p><strong>Nautobot Celery Beat</strong> - schedules periodic tasks for the worker</p>
</li>
<li><p><strong>PostgreSQL</strong> - primary application database for Nautobot objects/state</p>
</li>
<li><p><strong>Redis</strong> - cache and message broker backing Celery queues</p>
</li>
</ul>
<p>This deployment will not include any building of custom container images, Nautobot plugins, or custom Nautobot configurations. I’m planning that for a future post.</p>
<p>Let’s dive in.</p>
<h1 id="heading-adding-nautobots-helm-chart">Adding Nautobot’s Helm Chart</h1>
<p>First things first: let’s add the Nautobot Helm chart to <strong>Argo CD</strong>. If you followed the earlier posts, this will feel familiar. In the examples below, I’m using my <code>prod-home</code> Argo CD Project; you’ll see that name throughout. Your Project name can (and likely will) be different, so substitute your own wherever you see <code>prod-home</code>.</p>
<h3 id="heading-step-1-add-the-helm-repo"><strong>Step 1: Add the Helm Repo</strong></h3>
<ul>
<li><strong>Helm Repo URL</strong>:<br />  <code>https://nautobot.github.io/helm-charts/</code></li>
</ul>
<p>In the ArgoCD UI:</p>
<ul>
<li><p>Go to <strong>Settings → Repositories</strong></p>
</li>
<li><p>Click <strong>+ CONNECT REPO</strong></p>
</li>
<li><p>Enter the Helm repo URL</p>
</li>
<li><p>Choose <strong>Helm</strong> as the type</p>
</li>
<li><p>Give the repo a name (Optional)</p>
</li>
<li><p>Choose the project you created earlier to associate this repo with (mine was ‘prod-home’)</p>
</li>
<li><p>No authentication is needed for this public repo</p>
</li>
<li><p>When done, click <strong>CONNECT</strong></p>
</li>
</ul>
<p>Once added, ArgoCD can now pull charts from this source.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756514554340/1bdcaa3f-b118-4108-bd9a-d7f81d8697b0.png" alt class="image--center mx-auto" /></p>
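<p>If you prefer the CLI, the same repo connection can be made with the <code>argocd</code> client. This is a sketch: the login host is a placeholder, and the <code>--name</code>/<code>--project</code> values are assumptions matching my <code>prod-home</code> setup.</p>
<pre><code class="lang-bash"># Log in to the Argo CD API server first (replace the host with your install)
argocd login argocd.example.local

# Register the public Nautobot Helm repo; no credentials required
argocd repo add https://nautobot.github.io/helm-charts/ \
  --type helm \
  --name nautobot \
  --project prod-home
</code></pre>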
<p><strong>Note:</strong> As seen in Part 2, you’ll also need to add the <strong>GitHub repo</strong> that contains your custom configuration files, like Helm <code>values.yml</code> files and Kustomize overlays.</p>
<ul>
<li><p>If you're using <strong>my example repo</strong>, add <code>https://github.com/leothelyon17/kubernetes-gitops-playground.git</code> as another source, of type Git.</p>
</li>
<li><p>If you're using <strong>your own repo</strong>, just make sure it's added in the same way so ArgoCD can pull your values and overlays when syncing.</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756824233567/6010ca27-6e7b-4ccf-b34e-2546baace246.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-step-2-create-the-argocd-application"><strong>Step 2: Create the ArgoCD Application</strong></h3>
<p>Head to the <strong>Applications</strong> tab and click <strong>+ NEW APP</strong> to start the deployment.</p>
<p>Here’s how to fill it out:</p>
<ul>
<li><p><strong>Application Name</strong>: <code>nautobot</code> (or in my case <code>nautobot-prod</code>)</p>
</li>
<li><p><strong>Project</strong>: Select your project (e.g., prod-home)</p>
</li>
<li><p><strong>Sync Policy</strong>: Manual for now (we’ll automate later)</p>
</li>
<li><p><strong>Repository URL</strong>: Select the Helm repo you just added</p>
</li>
<li><p><strong>Chart Name</strong>: <code>nautobot</code></p>
</li>
<li><p><strong>Target Revision</strong>: Use the latest or specify a version (latest is recommended)</p>
</li>
<li><p><strong>Cluster URL</strong>: Use <a target="_blank" href="https://kubernetes.default.svc"><code>https://kubernetes.default.svc</code></a> if deploying to the same cluster (mine might be different from the default; don’t worry.)</p>
</li>
<li><p><strong>Namespace</strong>: <code>nautobot</code> or <code>nautobot-prod</code> to match the ArgoCD application name. Check the box to create the namespace if it doesn’t already exist in your Kubernetes cluster</p>
</li>
</ul>
<p>Click CREATE when finished.</p>
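<p>For reference, the UI steps above map to a declarative Argo CD <code>Application</code> manifest you could commit to Git instead. This is a sketch only; the project, application name, and namespace are assumptions from my environment.</p>
<pre><code class="lang-yaml">apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: nautobot-prod
  namespace: argocd
spec:
  project: prod-home
  source:
    repoURL: https://nautobot.github.io/helm-charts/
    chart: nautobot
    targetRevision: "*"   # latest; pin a specific chart version if you prefer
  destination:
    server: https://kubernetes.default.svc
    namespace: nautobot-prod
  syncPolicy:
    syncOptions:
      - CreateNamespace=true   # matches the "create namespace" checkbox
</code></pre>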
<p>If everything is in order you should see the App created like the screenshot below, though yours will show all-yellow status and ‘OutOfSync’ -</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756515387339/a0339e2f-b27f-4c62-a6a2-67d1180a61de.png" alt class="image--center mx-auto" /></p>
<p>Just like before, ArgoCD will immediately show you all the Kubernetes objects it plans to create. <strong>Don’t hit Sync yet.</strong> We haven’t configured the databases, secrets, or persistent storage, so a deploy right now would fail: the databases would be unable to mount their volumes. We’ll get there.</p>
<p>For this first section, the goal was simple: pull in the main <strong>Nautobot</strong> Helm chart, which we’ve done. In previous posts, we’d usually fine-tune the ArgoCD Application to point at our Kustomize overlays or custom Helm values. We’ll come back to that once all those pieces exist; if you do this in the Application now, ArgoCD will fail on the missing paths. Onward.</p>
<h1 id="heading-overview-for-nautobots-helm-values">Overview for Nautobot’s Helm Values</h1>
<p>Here we’ll take a quick pass over Nautobot’s default Helm values so we know exactly where our overrides will land later.</p>
<p>Defaults can be found here:<br /><a target="_blank" href="https://github.com/nautobot/helm-charts/blob/develop/charts/nautobot/values.yaml">https://github.com/nautobot/helm-charts/blob/develop/charts/nautobot/values.yaml</a></p>
<p>For this deployment, we’ll customize these core sections:</p>
<ul>
<li><code>superuser</code> - bootstrap admin (username/email/password).</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756863994072/f0340d09-1759-4174-a99b-0f3837d0a327.png" alt class="image--center mx-auto" /></p>
<ul>
<li><code>postgresql</code> - point at our Postgres (in-cluster or external), version, storage, and connection settings.</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756864141999/c9a1d046-53e6-4579-97a7-b0eb2bf8b6ef.png" alt class="image--center mx-auto" /></p>
<ul>
<li><code>redis</code> - enable/disable and wire the cache/queue endpoint (persistence optional).</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756864187629/81674508-8ad8-4cde-896f-bde3b28e14eb.png" alt class="image--center mx-auto" /></p>
<p>A few optional knobs worth calling out:</p>
<ul>
<li><p><strong>Replicas</strong>: under both <code>nautobot</code> and <code>celery</code>, you can set <code>replicas: 1</code> for dev or tight clusters; bump later as you scale. I will be setting the replicas to ‘1’.</p>
</li>
<li><p><strong>Image</strong>: under <code>nautobot.image</code> set a specific <code>tag</code> (or a custom image) if you don’t want “latest.” Unless you know what you are doing, leave the defaults for this deployment.</p>
</li>
<li><p><strong>Ingress</strong>: the chart can create it, but we’re keeping that <strong>off</strong> and handling exposure via our Kustomize <strong>IngressRoute</strong> pattern.</p>
</li>
</ul>
<p>That’s it for the big call-outs. We’ll circle back and set those values once the rest of the pieces (storage, secrets, and ingress) are in place later in the post.</p>
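<p>As a rough preview, the overrides we’ll land on later follow this general shape. This is a sketch only: the exact key layout can differ between chart versions (check the linked defaults), and the storage class names are from my environment.</p>
<pre><code class="lang-yaml">nautobot:
  replicas: 1          # single web pod for a small cluster
celery:
  replicas: 1          # single worker
postgresql:
  enabled: true
  primary:
    persistence:
      storageClass: rook-cephfs-retain   # durable storage for the DB
redis:
  enabled: true
  master:
    persistence:
      storageClass: rook-cephfs-delete   # optional; cache can be ephemeral
</code></pre>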
<h1 id="heading-add-persistent-storage">Add Persistent Storage</h1>
<p>For Nautobot, the one thing that absolutely needs persistence is the primary database, by default that’s PostgreSQL, and it should live on durable storage. Redis handles caching/queuing, and persistence there is optional: if you need cached data to survive pod restarts or rolling updates, back it with a PVC; otherwise keep it ephemeral and let it rebuild as needed.</p>
<p>In the <strong>Part 2</strong> post we created two CephFS storage classes with Rook-Ceph. For this post I’m using the <code>rook-cephfs-retain</code> class for Postgres and <code>rook-cephfs-delete</code> for Redis (optional), which we will see later in our custom Helm values.</p>
<h3 id="heading-cephfs-storageclass-retain">CephFS StorageClass (Retain)</h3>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">storage.k8s.io/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">StorageClass</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-comment"># Name you’ll reference from PVCs (spec.storageClassName)</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">rook-cephfs-retain</span>
<span class="hljs-comment"># CSI driver that provisions CephFS-backed volumes via Rook</span>
<span class="hljs-attr">provisioner:</span> <span class="hljs-string">rook-ceph.cephfs.csi.ceph.com</span>

<span class="hljs-attr">parameters:</span>
  <span class="hljs-comment"># ----- Tell the CSI driver which Ceph cluster/filesystem to use -----</span>

  <span class="hljs-comment"># Namespace where your Rook-Ceph cluster runs (operator, mons/osds, etc.)</span>
  <span class="hljs-comment"># If your cluster is in a different namespace, update this and the secrets below.</span>
  <span class="hljs-attr">clusterID:</span> <span class="hljs-string">rook-ceph</span>

  <span class="hljs-comment"># Name of the CephFS filesystem (created during CephFS setup)</span>
  <span class="hljs-comment"># You can confirm with `ceph fs ls`.</span>
  <span class="hljs-attr">fsName:</span> <span class="hljs-string">k8s-ceph-fs</span>

  <span class="hljs-comment"># Ceph pool backing the filesystem (required when provisionVolume is true)</span>
  <span class="hljs-comment"># Must match the pool configured for your fsName.</span>
  <span class="hljs-attr">pool:</span> <span class="hljs-string">k8s-ceph-fs-replicated</span>

  <span class="hljs-comment"># ----- CSI secrets for provisioning/expansion/node-stage (auto-created by Rook) -----</span>

  <span class="hljs-comment"># Secret used by the provisioner sidecar to create volumes</span>
  <span class="hljs-attr">csi.storage.k8s.io/provisioner-secret-name:</span> <span class="hljs-string">rook-csi-cephfs-provisioner</span>
  <span class="hljs-attr">csi.storage.k8s.io/provisioner-secret-namespace:</span> <span class="hljs-string">rook-ceph</span>

  <span class="hljs-comment"># Secret used by the controller for volume expansion operations</span>
  <span class="hljs-attr">csi.storage.k8s.io/controller-expand-secret-name:</span> <span class="hljs-string">rook-csi-cephfs-provisioner</span>
  <span class="hljs-attr">csi.storage.k8s.io/controller-expand-secret-namespace:</span> <span class="hljs-string">rook-ceph</span>

  <span class="hljs-comment"># Secret used on the node to stage/mount volumes</span>
  <span class="hljs-attr">csi.storage.k8s.io/node-stage-secret-name:</span> <span class="hljs-string">rook-csi-cephfs-node</span>
  <span class="hljs-attr">csi.storage.k8s.io/node-stage-secret-namespace:</span> <span class="hljs-string">rook-ceph</span>

  <span class="hljs-comment"># ----- Optional: choose the client implementation for CephFS mounts -----</span>
  <span class="hljs-comment"># If omitted, CSI auto-detects. Kernel client is typical in prod.</span>
  <span class="hljs-comment"># mounter: kernel</span>

<span class="hljs-comment"># Keep PVs (and data) when PVCs are deleted—safer for DBs and long-lived data</span>
<span class="hljs-attr">reclaimPolicy:</span> <span class="hljs-string">Retain</span>

<span class="hljs-comment"># Allow growing PVCs in place (kubectl patch ... size: 40Gi, etc.)</span>
<span class="hljs-attr">allowVolumeExpansion:</span> <span class="hljs-literal">true</span>

<span class="hljs-comment"># Mount-time options passed to the client</span>
<span class="hljs-attr">mountOptions:</span>
  <span class="hljs-comment"># Uncomment for verbose client debug logs during troubleshooting</span>
  <span class="hljs-comment"># - debug</span>
</code></pre>
<h3 id="heading-why-choose-retain-vs-delete">Why choose <strong>Retain</strong> vs <strong>Delete</strong></h3>
<ul>
<li><p><code>Retain</code> keeps the PV (and data) when its PVC is deleted.<br />  Use it for anything you don’t want accidentally destroyed (databases, long-lived app data, easy rollbacks). The trade-off is manual cleanup later.</p>
</li>
<li><p><code>Delete</code> removes the PV and backend data when the PVC goes away.<br />  Great for ephemeral/dev workloads where you don’t care about the data. Trade-off: once it’s gone, it’s gone.</p>
</li>
</ul>
<h3 id="heading-why-allowvolumeexpansion-is-important">Why <strong>allowVolumeExpansion</strong> is important</h3>
<ul>
<li><p>Lets you grow PVCs <strong>in place</strong> as your data grows (no migrate-and-restore dance).</p>
</li>
<li><p>With CephFS + CSI, online expansion is supported; Kubernetes handles the resize.</p>
</li>
<li><p>You still need available capacity in the Ceph cluster. This just makes growth operationally simple.</p>
</li>
</ul>
<p>Use this class for your Nautobot <strong>Postgres</strong> PVC. Redis persistence is optional. Enable it only if you truly need cache durability.</p>
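<p>For reference, a PVC against the retain class looks like this. Note that the chart’s bundled Postgres normally creates its PVC for you from Helm values, so in practice you’d just set the <code>storageClass</code> there; the names below are illustrative.</p>
<pre><code class="lang-yaml">apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data          # example name
  namespace: nautobot-prod
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: rook-cephfs-retain   # the class defined above
  resources:
    requests:
      storage: 10Gi
</code></pre>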
<p>Add this storage class (or classes) to your Rook-Ceph deployment if you haven’t already, as shown below, and let’s move forward.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756842384980/3ee3ea61-aa87-4314-b611-715bfa7b35ef.png" alt class="image--center mx-auto" /></p>
<h1 id="heading-metallb-ip-pool-for-traefik">MetalLB IP Pool for Traefik</h1>
<p>Before we can expose apps to the outside world, <strong>Traefik</strong> needs an externally reachable IP from <strong>MetalLB</strong>. “Public” here just means <strong>outside the cluster</strong> (it can still be RFC1918). Since we already set up MetalLB in the earlier posts, this is a quick tweak.</p>
<h3 id="heading-1-give-metallb-an-address-to-hand-out">1) Give MetalLB an address to hand out</h3>
<p>Add a single IP (or a range) to your existing <code>IPAddressPool</code>. I like dedicating a single /32 for Traefik so DNS stays stable.</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">metallb.io/v1beta1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">IPAddressPool</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">prod-traefik-pool</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">metallb-prod</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">addresses:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-number">192.168</span><span class="hljs-number">.101</span><span class="hljs-number">.161</span><span class="hljs-string">/32</span>  <span class="hljs-comment"># Traefik LB IP</span>
</code></pre>
<p>If you don’t already have one, pair the pool with an <code>L2Advertisement</code> (MetalLB won’t announce addresses without it):</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">metallb.io/v1beta1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">L2Advertisement</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">prod-traefik-l2adv</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">metallb-prod</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">ipAddressPools:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">prod-traefik-pool</span>
</code></pre>
<p><strong>Note:</strong> Pick an <strong>unused</strong> IP in your LAN (outside DHCP scope). Then sync your Argo CD app for MetalLB.</p>
<h3 id="heading-2-pin-that-ip-on-the-traefik-service">2) Pin that IP on the Traefik Service</h3>
<p>In your Traefik Helm values, set the Service to <code>LoadBalancer</code> and assign the static IP:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">service:</span>
  <span class="hljs-attr">enabled:</span> <span class="hljs-literal">true</span>
  <span class="hljs-attr">type:</span> <span class="hljs-string">LoadBalancer</span>
  <span class="hljs-attr">spec:</span>
    <span class="hljs-attr">loadBalancerIP:</span> <span class="hljs-number">192.168</span><span class="hljs-number">.101</span><span class="hljs-number">.161</span>
  <span class="hljs-comment"># optional, preserves client source IP if you care about logs:</span>
  <span class="hljs-attr">externalTrafficPolicy:</span> <span class="hljs-string">Local</span>
</code></pre>
<p>Sync your Traefik app. You should see the EXTERNAL-IP appear:</p>
<pre><code class="lang-bash">kubectl -n kube-system get svc

traefik-prod  LoadBalancer   10.233.23.76   192.168.101.161   32400:30228/TCP,80:31007/TCP,443:32150/TCP   108d
</code></pre>
<h3 id="heading-3-optional-dns-now-or-later">3) (Optional) DNS now or later</h3>
<p>Once the IP is live, create a DNS <strong>A record</strong> (e.g., <code>nautobot.example.local → 192.168.101.161</code>). We’ll wire the IngressRoute host to match this in the next steps.</p>
<p>That’s it. Traefik now has a stable, outside-facing address; we can safely publish Nautobot behind it.</p>
<h1 id="heading-exposing-nautobot-using-an-ingress-route">Exposing Nautobot using an Ingress Route</h1>
<p>With Traefik now holding an external IP, we can move on to exposing <strong>Nautobot</strong> to the outside world. Time to configure the IngressRoute so users and devices can reach it.</p>
<p>This part is straightforward if you already have an ingress controller. If not, jump back to the Part 2 post for deploying Traefik in-cluster. By default, the Nautobot Helm chart does <strong>not</strong> create any Ingress/IngressRoute resources.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756833851243/18758fa6-db12-4d93-ba0a-c74b4aa595a5.png" alt class="image--center mx-auto" /></p>
<p>You can use the Nautobot chart values to let it create ingress, but we’re leaving those at the defaults. Instead, we’ll handle exposure in the overlay with a <strong>Traefik IngressRoute</strong>. I prefer this split: Helm owns the app; Kustomize owns how it’s exposed. It’s a repeatable, cookie-cutter pattern across apps and keeps odd edge cases out of chart values. Goal here is simple, publish the web UI outside the cluster. Nothing fancy.</p>
<p>A working IngressRoute example is below, and can also be found in my GitOps Playground repository in the apps/nautobot/overlays/prod folder -</p>
<pre><code class="lang-yaml">---
# --&gt; (Example) Create an IngressRoute for your service...
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: nautobot-prod-ingressroute  # &lt;-- Replace with your IngressRoute name
  namespace: nautobot-prod  # &lt;-- Replace with your namespace
spec:
  entryPoints:
    - websecure
  routes:
    - match: Host(`nautobot.home.nerdylyonsden.io`)  # &lt;-- Replace with your FQDN
      kind: Rule
      services:
        - name: nautobot-prod-default  # &lt;-- Replace with your service name
          port: 80
  # --&gt; (Optional) Add certificate secret
  tls:
    secretName: prod-apps-certificate-secret # &lt;-- cert-manager will store the created certificate in this secret.
</code></pre>
<p>The main points to cover here are:</p>
<ul>
<li><p><strong>Namespace</strong> - Make sure the manifest’s namespace matches where Nautobot will live.</p>
</li>
<li><p><strong>EntryPoints</strong> - Use only <code>websecure</code> so traffic is encrypted at least up to Traefik inside the cluster.</p>
</li>
<li><p><strong>Host rule</strong> - <code>routes.match</code> must match the public DNS A record users will hit for Nautobot.</p>
</li>
<li><p><strong>Service wiring</strong> - <code>services.name</code> and <code>services.port</code> must match the Nautobot Service. In my setup the name is <code>&lt;namespace&gt;-default</code>; adjust if yours differs.</p>
</li>
<li><p><strong>Port</strong> - Defaults to <code>80</code> unless you’ve changed it in the Service.</p>
</li>
<li><p><strong>TLS / certs</strong> - If you have a cluster cert solution (e.g., cert-manager), wire it here. If not, leave the TLS section out for now; I’ll cover this in an advanced post.</p>
</li>
</ul>
<p><strong>Note:</strong> To check the Service name and port, you can click into the app in ArgoCD, whether fully deployed or not, then click the Service → Summary tab → Desired Manifest as shown below -</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756836806375/d2b5e972-0227-41e9-90a1-170ed32b1487.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756837047208/f97d7b21-38b8-43f1-9449-a6902dec8280.png" alt class="image--center mx-auto" /></p>
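<p>If you’d rather use the CLI, the same information is available with <code>kubectl</code> (the namespace and Service name below are assumptions from my setup):</p>
<pre><code class="lang-bash"># List Services in the Nautobot namespace; the NAME and PORT(S) columns
# are what the IngressRoute's services.name / services.port must match
kubectl -n nautobot-prod get svc

# Inspect the full manifest of the default web Service
kubectl -n nautobot-prod get svc nautobot-prod-default -o yaml
</code></pre>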
<p><strong>Note (again):</strong> The Service also exposes port <strong>443</strong>, but we’re not using it. Nautobot needs additional app-level config to terminate HTTPS directly. For now we’ll keep TLS at Traefik and speak HTTP to the Service. End-to-end HTTPS on Nautobot itself is out of scope for this post (maybe a future one).</p>
<p>Once the IngressRoute is set the way you want, drop it into your environment overlay (e.g., <code>apps/nautobot/overlays/prod/ingress-route.yml</code>) and commit it. That’s it for this piece, on to the next section.</p>
<h1 id="heading-deploying-securely-creating-our-secrets">Deploying Securely - Creating Our Secrets</h1>
<p>For a starter implementation of Nautobot with some basic security, we are going to need the following secrets stored in Vault -</p>
<ul>
<li><p>Super User Login Credentials (which will include a password and API token)</p>
</li>
<li><p>Postgres DB Credentials</p>
</li>
<li><p>Redis DB Credentials</p>
</li>
</ul>
<p>We’re going to keep credentials out of Git and let <strong>External Secrets (ESO)</strong> fetch them from <strong>HashiCorp Vault</strong> at deploy time. The two things we need to cover here are: (1) enabling <strong>Kubernetes authentication</strong> in Vault with a <strong>role</strong> dedicated to Nautobot, and (2) <strong>adding the actual secrets</strong> into Vault under the <code>/secret</code> path.</p>
<p>You should hopefully have an existing instance of Hashicorp Vault already if you’ve been following along with the previous posts.</p>
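<p>For the record, the three secrets can be seeded from the Vault CLI like this. This is a sketch: the paths and key names under <code>secret/</code> are assumptions, and the values are obviously placeholders.</p>
<pre><code class="lang-bash"># Superuser bootstrap credentials (password + API token)
vault kv put secret/nautobot/superuser \
  password='change-me' \
  api_token='change-me-too'

# PostgreSQL credentials
vault kv put secret/nautobot/postgres \
  password='change-me'

# Redis credentials
vault kv put secret/nautobot/redis \
  password='change-me'
</code></pre>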
<h2 id="heading-kubernetes-authentication-nautobot-role">Kubernetes Authentication + Nautobot Role</h2>
<p><strong>Why we need it:</strong><br />External Secrets runs inside your cluster. It needs a secure, short-lived way to prove to Vault, “I’m allowed to read <strong>only</strong> the Nautobot secrets.” Vault’s <strong>Kubernetes auth</strong> method does exactly that by validating a pod’s <strong>service account</strong> token against the cluster API and mapping it to a <strong>least-privilege policy</strong>.</p>
<p><strong>What the role does:</strong></p>
<ul>
<li><p>Binds a specific <strong>ServiceAccount + Namespace</strong> (e.g., the one where Nautobot lives) to a <strong>read-only policy</strong> for your Nautobot secret paths.</p>
</li>
<li><p>Issues <strong>short-lived Vault tokens</strong> to ESO when it presents the Kubernetes JWT. No root tokens or static creds in manifests.</p>
</li>
<li><p>Scopes access to exactly the secret paths you choose (nothing more).</p>
</li>
</ul>
<p>The first piece that has to be done (if it was never done previously) is enabling the Kubernetes authentication method. To enable it through the GUI, follow the steps below:</p>
<ol>
<li><p>In the left-hand pane, click <strong>Access</strong>.</p>
</li>
<li><p>Under <strong>Authentication</strong>, click <strong>Enable New Method</strong> (top right).</p>
</li>
<li><p>Under <strong>Infra</strong>, choose <strong>Kubernetes</strong>.</p>
</li>
<li><p>Leave the options at their defaults and click <strong>Enable Method</strong>.</p>
</li>
<li><p>Back on the <strong>Authentication Methods</strong> list, you should now see <strong>kubernetes/</strong> and <strong>token/</strong>. Click <strong>kubernetes/</strong>.</p>
</li>
<li><p>Click <strong>Configuration</strong> (top area), then <strong>Configure</strong> (right side).</p>
</li>
<li><p>In <strong>Configuration</strong>, set <strong>Kubernetes host</strong> to your API URL (I use the Kube-VIP URL from earlier posts). If you don’t have one, you can use <a target="_blank" href="https://kubernetes.default.svc"><code>https://kubernetes.default.svc</code></a>.</p>
</li>
</ol>
<p>Kubernetes auth is now configured. Next, create the Nautobot role.</p>
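<p>If you prefer the CLI, the same setup can be sketched in two commands (this assumes you’re logged into Vault with sufficient privileges; adjust <code>kubernetes_host</code> to your own API URL, such as the Kube-VIP address):</p>
<pre><code class="lang-bash"># Enable the Kubernetes auth method at the default mount path
vault auth enable kubernetes

# Point Vault at the cluster API it should validate ServiceAccount tokens against
vault write auth/kubernetes/config \
    kubernetes_host="https://kubernetes.default.svc"
</code></pre>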
<hr />
<p><strong>Create the Nautobot role:</strong></p>
<ol>
<li><p>From <strong>Authentication Methods</strong>, select <strong>kubernetes/</strong>.</p>
</li>
<li><p>Click <strong>Create role</strong> (right side).</p>
</li>
<li><p>Use these values (adjust as needed for your environment):</p>
<ul>
<li><p><strong>Name:</strong> <code>nautobot</code></p>
</li>
<li><p><strong>Alias name source:</strong> <code>serviceaccount_name</code></p>
</li>
<li><p><strong>Bound service account names:</strong> <code>nautobot-prod</code></p>
</li>
<li><p><strong>Bound service account namespaces:</strong> <code>nautobot-prod</code></p>
</li>
<li><p>Under <strong>Tokens → Generated Token’s Policies:</strong> add <code>nautobot</code> (we’ll create this policy next)</p>
</li>
</ul>
</li>
<li><p>Leave other token settings at their defaults; other fields can remain blank.</p>
</li>
<li><p>Click <strong>Save</strong>.</p>
</li>
</ol>
<p>That’s all we need for the Nautobot role. The referenced ServiceAccount will be created by our Helm deployment a bit later.</p>
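<p>As a rough CLI equivalent of the GUI steps above, the role can be created with a single <code>vault write</code> (values mirror the list; adjust names for your environment):</p>
<pre><code class="lang-bash"># Bind the nautobot-prod ServiceAccount/namespace to the 'nautobot' policy
vault write auth/kubernetes/role/nautobot \
    alias_name_source=serviceaccount_name \
    bound_service_account_names=nautobot-prod \
    bound_service_account_namespaces=nautobot-prod \
    token_policies=nautobot
</code></pre>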
<hr />
<p><strong>Create the ACL policy:</strong></p>
<ol>
<li><p>In the left-hand pane, click <strong>Policies</strong>.</p>
</li>
<li><p>Click <strong>Create ACL policy</strong>.</p>
</li>
<li><p>Enter a policy name (e.g., <code>nautobot</code>).</p>
</li>
<li><p>Paste in the policy content. <em>Note:</em> for simplicity, I start from the default policy and add <strong>read</strong> and <strong>list</strong> capabilities for the upcoming secrets paths (shown below).</p>
</li>
</ol>
# Allow tokens to look up their own properties
<pre><code class="lang-hcl"># Allow tokens to look up their own properties
path "auth/token/lookup-self" {
    capabilities = ["read"]
}

# Allow tokens to renew themselves
path "auth/token/renew-self" {
    capabilities = ["update"]
}

# Allow tokens to revoke themselves
path "auth/token/revoke-self" {
    capabilities = ["update"]
}

# Allow a token to look up its own capabilities on a path
path "sys/capabilities-self" {
    capabilities = ["update"]
}

# Allow a token to look up its own entity by id or name
path "identity/entity/id/{{identity.entity.id}}" {
  capabilities = ["read"]
}
path "identity/entity/name/{{identity.entity.name}}" {
  capabilities = ["read"]
}

# Allow a token to look up its resultant ACL from all policies. This is useful
# for UIs. It is an internal path because the format may change at any time
# based on how the internal ACL features and capabilities change.
path "sys/internal/ui/resultant-acl" {
    capabilities = ["read"]
}

# Allow a token to renew a lease via lease_id in the request body; old path for
# old clients, new path for newer
path "sys/renew" {
    capabilities = ["update"]
}
path "sys/leases/renew" {
    capabilities = ["update"]
}

# Allow looking up lease properties. This requires knowing the lease ID ahead
# of time and does not divulge any sensitive information.
path "sys/leases/lookup" {
    capabilities = ["update"]
}

# Allow a token to manage its own cubbyhole
path "cubbyhole/*" {
    capabilities = ["create", "read", "update", "delete", "list"]
}

# Allow a token to wrap arbitrary values in a response-wrapping token
path "sys/wrapping/wrap" {
    capabilities = ["update"]
}

# Allow a token to look up the creation time and TTL of a given
# response-wrapping token
path "sys/wrapping/lookup" {
    capabilities = ["update"]
}

# Allow a token to unwrap a response-wrapping token. This is a convenience to
# avoid client token swapping since this is also part of the response wrapping
# policy.
path "sys/wrapping/unwrap" {
    capabilities = ["update"]
}

# Allow general purpose tools
path "sys/tools/hash" {
    capabilities = ["update"]
}
path "sys/tools/hash/*" {
    capabilities = ["update"]
}

# Allow checking the status of a Control Group request if the user has the
# accessor
path "sys/control-group/request" {
    capabilities = ["update"]
}

# Allow a token to make requests to the Authorization Endpoint for OIDC providers.
path "identity/oidc/provider/+/authorize" {
    capabilities = ["read", "update"]
}

# Allow a token to access nautobot db secrets
# (KV v2 serves secret data under the data/ prefix, so the policy path includes it)
path "secret/data/nautobot-prod-db-credentials" {
    capabilities = ["read", "list"]
}

# Allow a token to access nautobot superuser secrets
path "secret/data/nautobot-prod-superuser-credentials" {
    capabilities = ["read", "list"]
}
</code></pre>
<p>That’s it. Kubernetes Auth, the Nautobot Role, and the policy are set. Let’s finally add our actual secrets to Vault.</p>
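<p>For the CLI-inclined, the policy above can be saved to a local file (here hypothetically named <code>nautobot-policy.hcl</code>) and written in one command:</p>
<pre><code class="lang-bash"># Create (or update) the 'nautobot' ACL policy from a local HCL file
vault policy write nautobot nautobot-policy.hcl

# Confirm it was stored
vault policy read nautobot
</code></pre>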
<h2 id="heading-add-secrets-to-vault-under-secret">Add Secrets to Vault (under <code>secret/</code>)</h2>
<p>We’ll store the credentials and app secrets that Nautobot (and its dependencies) need under a clear, predictable hierarchy in the <code>/secret</code> (KV) mount.</p>
<p><strong>What to store for a “basic but secure” deploy:</strong></p>
<ul>
<li><p><strong>Superuser</strong>: password and API token (for first login and automation).</p>
</li>
<li><p><strong>Database Passwords:</strong> Postgres and Redis</p>
</li>
</ul>
<p>If the <strong>KV (Key/Value) secrets engine</strong> isn’t enabled yet, start here. Otherwise, skip to <strong>Create the secrets</strong>.</p>
<h3 id="heading-enable-the-kv-secrets-engine">Enable the KV secrets engine</h3>
<ol>
<li><p>In the left navigation, click <strong>Secrets Engines</strong>.</p>
</li>
<li><p>Click <strong>Enable new engine +</strong> (top right).</p>
</li>
<li><p>Choose <strong>KV</strong> under “Generic.”</p>
</li>
<li><p>Set the <strong>Path</strong> to <code>secret</code>; leave other options at defaults.</p>
</li>
<li><p>Click <strong>Enable Engine</strong>.</p>
</li>
</ol>
<p>If this is a fresh Vault and KV wasn’t previously enabled, you should now see it listed alongside the existing engines.</p>
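<p>The CLI equivalent is a single command (a sketch, assuming the KV v2 engine and the <code>secret</code> path used throughout this post):</p>
<pre><code class="lang-bash"># Enable the KV version 2 secrets engine at the path 'secret'
vault secrets enable -path=secret kv-v2
</code></pre>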
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1757300650945/bc75aa92-b0f4-402f-95c1-48b5b8e3d07d.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-create-the-secrets">Create the secrets</h3>
<ol>
<li><p>In the left navigation, click <strong>Secrets Engines</strong>.</p>
</li>
<li><p>Select the new <strong>secret</strong> (KV) engine.</p>
</li>
<li><p>Click <strong>Create secret +</strong> (right side).</p>
</li>
<li><p>For <strong>Path</strong>, enter <code>nautobot-prod-db-credentials</code> (or your preferred name).</p>
</li>
<li><p>Under <strong>Secret data</strong>, add a key <code>postgres-pass</code> with its value.</p>
</li>
<li><p>Click <strong>Add</strong> and create a second key <code>redis-pass</code> with its value.</p>
</li>
<li><p>Click <strong>Save</strong>.</p>
</li>
</ol>
<p>If completed correctly, it should look like the screenshot below:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1757301597283/980ad067-8e9c-473f-85c9-f4d143f25333.png" alt class="image--center mx-auto" /></p>
<p>Repeat the process for the <strong>Superuser</strong> secret. Create a new secret with two keys (for example, <code>superuser-pass</code> and <code>superuser-api-token</code>, matching the properties the superuser ExternalSecret will reference later) and save.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1757301897242/ea3aad8a-6741-44f9-86f2-b4425e9a91c2.png" alt class="image--center mx-auto" /></p>
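<p>Both secrets can also be written from the CLI. The placeholder values below are obviously yours to replace; the key names match what the ExternalSecrets will reference later:</p>
<pre><code class="lang-bash"># Database/Redis credentials
vault kv put secret/nautobot-prod-db-credentials \
    postgres-pass='CHANGE-ME' \
    redis-pass='CHANGE-ME'

# Nautobot superuser credentials
vault kv put secret/nautobot-prod-superuser-credentials \
    superuser-pass='CHANGE-ME' \
    superuser-api-token='CHANGE-ME'
</code></pre>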
<h2 id="heading-how-eso-and-vault-work-together-high-level">How ESO and Vault Work Together (high-level)</h2>
<p>Once the role and secrets exist and Argo CD deploys the app, ESO will:</p>
<ul>
<li><p>Use the <strong>Kubernetes auth</strong> role to obtain a short-lived Vault token (via its ServiceAccount).</p>
</li>
<li><p>Read the exact keys under <code>/secret/nautobot...</code> as defined by your policy.</p>
</li>
<li><p>Materialize a single <strong>Kubernetes Secret</strong> (or multiple, your call) in the Nautobot namespace with the names/keys your Helm chart expects.</p>
</li>
</ul>
<p>With Vault and External Secrets in place, we now have a clean, Git-free path for credentials: a Kubernetes auth role that scopes exactly who can read what, a tidy set of KV paths for Nautobot’s superuser + databases, and ESO ready to materialize those values as Kubernetes Secrets when Argo CD reconciles. That closes the loop on “secure by default” for this deployment. Next up, we’ll use everything we’ve built so far (storage classes, ingress patterns, secrets) to assemble our <strong>Kustomize</strong> resources and configure the Nautobot Helm chart the GitOps way.</p>
<h1 id="heading-the-rest-of-the-kustomize-resources">The Rest of the Kustomize Resources</h1>
<p>Earlier we created the <strong>IngressRoute</strong> Kustomize file to publish Nautobot through Traefik. Now we’ll add the rest of the overlay, mostly focused on integrating the work from the <strong>Secrets</strong> section. We’ll also add a top-level <code>kustomization.yml</code> to bundle these pieces so the cluster can build them as a single unit. Once this overlay is in place, everything we’ve prepared (storage, secrets, and ingress) comes together as one declarative package.</p>
<p>The first file will be the ClusterSecretStore - <code>cluster-secret-store.yml</code></p>
<h2 id="heading-clustersecretstore-esos-shortcut-to-vault">ClusterSecretStore: ESO’s shortcut to Vault</h2>
<p>A <strong>ClusterSecretStore</strong> is a cluster-wide connection profile that tells External Secrets (ESO) how to reach Vault, which KV (<code>secret/</code>) mount to read, and how to authenticate (Kubernetes auth + Vault role). Use a <strong>ClusterSecretStore</strong> to share one Vault setup across namespaces; use a <strong>SecretStore</strong> if you want it namespace-scoped. For a simpler deployment, I chose a <strong>ClusterSecretStore</strong>.</p>
<p><strong>What this sets:</strong></p>
<ul>
<li><p><code>server</code> – Vault URL reachable from the cluster</p>
</li>
<li><p><code>path</code>/<code>version</code> – your KV mount (e.g., <code>secret</code>, v2)</p>
</li>
<li><p><code>auth.kubernetes</code> – use SA token login; <code>role</code> maps SA+namespace → read-only policy</p>
</li>
<li><p><code>serviceAccountRef</code> – which SA ESO uses to authenticate</p>
</li>
</ul>
<p><strong>Repo Example (with comments):</strong></p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">external-secrets.io/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">ClusterSecretStore</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">vault-backend</span>                 <span class="hljs-comment"># cluster-wide handle ESO will reference</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">provider:</span>
    <span class="hljs-attr">vault:</span>
      <span class="hljs-comment"># Where Vault is reachable from the cluster using a cluster internal URL</span>
      <span class="hljs-comment"># (Many setups use http://vault.vault.svc:8200 or https with proper CA)</span>
      <span class="hljs-attr">server:</span> <span class="hljs-string">"http://hashi-vault-prod-0.hashi-vault-prod-internal.hashi-vault-prod.svc.cluster.local:8200"</span>

      <span class="hljs-comment"># The KV mount path and version you enabled in Vault</span>
      <span class="hljs-attr">path:</span> <span class="hljs-string">"secret"</span>                  <span class="hljs-comment"># e.g., 'secret', 'kv', etc.</span>
      <span class="hljs-attr">version:</span> <span class="hljs-string">"v2"</span>                   <span class="hljs-comment"># be explicit to avoid surprises</span>

      <span class="hljs-comment"># Authenticate to Vault using the Kubernetes auth method</span>
      <span class="hljs-attr">auth:</span>
        <span class="hljs-attr">kubernetes:</span>
          <span class="hljs-attr">mountPath:</span> <span class="hljs-string">"kubernetes"</span>     <span class="hljs-comment"># must match your Vault auth mount path</span>
          <span class="hljs-attr">role:</span> <span class="hljs-string">"nautobot"</span>            <span class="hljs-comment"># Vault role bound to SA+namespace with read-only policy</span>
          <span class="hljs-attr">serviceAccountRef:</span>
            <span class="hljs-attr">name:</span> <span class="hljs-string">nautobot-prod</span>       <span class="hljs-comment"># SA whose token ESO will use for login</span>
            <span class="hljs-attr">namespace:</span> <span class="hljs-string">nautobot-prod</span>  <span class="hljs-comment"># namespace where that SA lives</span>
</code></pre>
<p><strong>How it flows:</strong> ESO reads this store → logs into Vault with the SA token → gets a short-lived token for the <code>nautobot</code> role → pulls only the allowed keys → renders Kubernetes Secrets for Helm/Kustomize.</p>
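<p>After Argo CD applies the store, a quick sanity check from the CLI (assuming <code>kubectl</code> access and the ESO CRDs installed) confirms ESO can log into Vault:</p>
<pre><code class="lang-bash"># The store should report a valid/ready status once ESO authenticates successfully
kubectl get clustersecretstore vault-backend
</code></pre>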
<p>The next pair of files are for the database and superuser secret creation.</p>
<h2 id="heading-externalsecrets-mapping-vault-data-into-kubernetes-secrets">ExternalSecrets: mapping Vault data into Kubernetes Secrets</h2>
<p><strong>Why these exist:</strong> An <strong>ExternalSecret</strong> tells ESO <em>which</em> Vault keys to read and <em>how</em> to materialize them as a plain Kubernetes Secret that Helm/Kustomize can mount. We’ll use two: one for <strong>database/redis creds</strong> and one for the <strong>Nautobot superuser</strong>. An ExternalSecret is namespace-scoped.</p>
<h3 id="heading-database-amp-redis-externalsecret-commented">Database &amp; Redis ExternalSecret (commented)</h3>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">external-secrets.io/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">ExternalSecret</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">nautobot-prod-db-external-secret</span>   <span class="hljs-comment"># ESO resource name</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">nautobot-prod</span>                 <span class="hljs-comment"># where the resulting K8s Secret will live</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">refreshInterval:</span> <span class="hljs-string">"1h"</span>                    <span class="hljs-comment"># re-sync cadence from Vault</span>
  <span class="hljs-attr">secretStoreRef:</span>
    <span class="hljs-attr">name:</span> <span class="hljs-string">vault-backend</span>                    <span class="hljs-comment"># points to our (Cluster)SecretStore</span>
    <span class="hljs-attr">kind:</span> <span class="hljs-string">ClusterSecretStore</span>
  <span class="hljs-attr">target:</span>
    <span class="hljs-attr">name:</span> <span class="hljs-string">nautobot-prod-db-secrets</span>         <span class="hljs-comment"># name of the K8s Secret ESO will create/update</span>
    <span class="hljs-attr">creationPolicy:</span> <span class="hljs-string">Owner</span>                  <span class="hljs-comment"># ESO owns and reconciles this Secret</span>
  <span class="hljs-attr">data:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">secretKey:</span> <span class="hljs-string">postgres-password</span>         <span class="hljs-comment"># key inside the K8s Secret</span>
      <span class="hljs-attr">remoteRef:</span>
        <span class="hljs-attr">key:</span> <span class="hljs-string">secret/data/nautobot-prod-db-credentials</span>  <span class="hljs-comment"># Vault path (KV v2 HTTP style)</span>
        <span class="hljs-attr">property:</span> <span class="hljs-string">postgres-pass</span>            <span class="hljs-comment"># field inside that Vault doc</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">secretKey:</span> <span class="hljs-string">password</span>                  <span class="hljs-comment"># duplicate key for charts expecting 'password'</span>
      <span class="hljs-attr">remoteRef:</span>
        <span class="hljs-attr">key:</span> <span class="hljs-string">secret/data/nautobot-prod-db-credentials</span>
        <span class="hljs-attr">property:</span> <span class="hljs-string">postgres-pass</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">secretKey:</span> <span class="hljs-string">redis-password</span>            <span class="hljs-comment"># Redis password (optional if Redis is unauthenticated)</span>
      <span class="hljs-attr">remoteRef:</span>
        <span class="hljs-attr">key:</span> <span class="hljs-string">secret/data/nautobot-prod-db-credentials</span>
        <span class="hljs-attr">property:</span> <span class="hljs-string">redis-pass</span>
</code></pre>
<h3 id="heading-superuser-externalsecret-commented">Superuser ExternalSecret (commented)</h3>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">external-secrets.io/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">ExternalSecret</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">nautobot-prod-superuser-external-secret</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">nautobot-prod</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">refreshInterval:</span> <span class="hljs-string">"1h"</span>
  <span class="hljs-attr">secretStoreRef:</span>
    <span class="hljs-attr">name:</span> <span class="hljs-string">vault-backend</span>
    <span class="hljs-attr">kind:</span> <span class="hljs-string">ClusterSecretStore</span>
  <span class="hljs-attr">target:</span>
    <span class="hljs-attr">name:</span> <span class="hljs-string">nautobot-prod-superuser-secrets</span>  <span class="hljs-comment"># K8s Secret with Nautobot bootstrap creds</span>
    <span class="hljs-attr">creationPolicy:</span> <span class="hljs-string">Owner</span>
  <span class="hljs-attr">data:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">secretKey:</span> <span class="hljs-string">password</span>                  <span class="hljs-comment"># superuser password</span>
      <span class="hljs-attr">remoteRef:</span>
        <span class="hljs-attr">key:</span> <span class="hljs-string">secret/data/nautobot-prod-superuser-credentials</span>
        <span class="hljs-attr">property:</span> <span class="hljs-string">superuser-pass</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">secretKey:</span> <span class="hljs-string">api_token</span>                 <span class="hljs-comment"># superuser API token</span>
      <span class="hljs-attr">remoteRef:</span>
        <span class="hljs-attr">key:</span> <span class="hljs-string">secret/data/nautobot-prod-superuser-credentials</span>
        <span class="hljs-attr">property:</span> <span class="hljs-string">superuser-api-token</span>
</code></pre>
<h3 id="heading-notes">Notes</h3>
<ul>
<li><p><strong>Key naming:</strong> The <code>secretKey</code> entries become keys in your Kubernetes Secret. Align them with whatever your Helm values or manifests expect.</p>
</li>
<li><p><strong>KV v2 pathing:</strong> Some setups prefer the logical path (e.g., <code>nautobot-prod-db-credentials</code>) rather than the HTTP-style <code>secret/data/...</code>. Use the style that matches how your <code>ClusterSecretStore</code> is configured.</p>
</li>
<li><p><strong>Duped mappings:</strong> Having both <code>postgres-password</code> <strong>and</strong> <code>password</code> mapped to the same Vault value is fine if different consumers expect different key names.</p>
</li>
<li><p><strong>Refresh:</strong> <code>refreshInterval</code> controls how quickly rotations in Vault propagate to Kubernetes. Pick something that fits your rotation policy.</p>
</li>
</ul>
<h2 id="heading-kustomize-bundling-our-resources-together">Kustomize: Bundling our Resources Together</h2>
<p>Time to bundle everything we’ve created into a single overlay Kustomize can build (and Argo CD can track). Keep this file in your environment overlay (e.g., <code>overlays/prod/</code>).</p>
<p><strong>What this overlay does:</strong></p>
<ul>
<li><p>Registers Vault access for ESO via the <strong>ClusterSecretStore</strong></p>
</li>
<li><p>Pulls database + superuser creds via <strong>ExternalSecret</strong> objects</p>
</li>
<li><p>Publishes Nautobot through Traefik with our <strong>IngressRoute</strong></p>
</li>
</ul>
<pre><code class="lang-yaml"><span class="hljs-comment">## kustomization.yml</span>

<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">kustomize.config.k8s.io/v1beta1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Kustomization</span>

<span class="hljs-comment"># The building blocks we created earlier</span>
<span class="hljs-attr">resources:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-string">cluster-secret-store.yml</span>        <span class="hljs-comment"># ESO → Vault connection (cluster-scoped; namespace here is ignored)</span>
  <span class="hljs-bullet">-</span> <span class="hljs-string">external-secrets-db.yml</span>         <span class="hljs-comment"># Database &amp; Redis credentials from Vault → K8s Secret</span>
  <span class="hljs-bullet">-</span> <span class="hljs-string">external-secret-superuser.yml</span>   <span class="hljs-comment"># Nautobot superuser creds from Vault → K8s Secret</span>
  <span class="hljs-bullet">-</span> <span class="hljs-string">ingress-route.yml</span>               <span class="hljs-comment"># Traefik exposure for Nautobot</span>
</code></pre>
<h3 id="heading-notes-1">Notes</h3>
<ul>
<li><p><strong>Order of operations:</strong> Kustomize doesn’t enforce ordering, but Argo CD will reconcile until everything is healthy. If you want strict sequencing later, you can add Argo CD sync waves via annotations.</p>
</li>
<li><p><strong>Where this fits:</strong> Your Argo CD Application will point at this folder (done in the next section). Once synced, ESO will authenticate to Vault, create the Kubernetes Secrets, and Traefik will expose the app host defined in your IngressRoute.</p>
</li>
</ul>
<p>Commit this file alongside the four resources, and you’ve got a clean, declarative package ready for Argo CD to manage.</p>
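<p>Before committing, it’s worth rendering the overlay locally to catch typos. This sketch assumes the overlay lives at <code>apps/nautobot/overlays/prod</code>, as referenced later in the Argo CD Application:</p>
<pre><code class="lang-bash"># Render the overlay to stdout without applying anything to the cluster
kubectl kustomize apps/nautobot/overlays/prod
</code></pre>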
<h1 id="heading-the-final-pieces-custom-helm-values-argocd-app-manifest">The Final Pieces - Custom Helm Values + ArgoCD App Manifest</h1>
<h2 id="heading-custom-helm-values-values-prodyml">Custom Helm values (<code>values-prod.yml</code>)</h2>
<p>This file wires Nautobot to the secrets that will be deployed, dials replicas down for a tidy first deploy, and pins persistence to the CephFS StorageClass(es). Drop it next to your overlay (e.g., <code>apps/nautobot/values-prod.yml</code>) and reference it from your Argo CD Application (next section).</p>
<pre><code class="lang-yaml"><span class="hljs-comment"># values-prod.yml</span>
<span class="hljs-attr">nautobot:</span>
  <span class="hljs-comment"># Keep it small for the first sync; scale later.</span>
  <span class="hljs-attr">replicaCount:</span> <span class="hljs-number">1</span>

  <span class="hljs-comment"># Probes off for initial bring-up (migrations can make probes flap).</span>
  <span class="hljs-comment"># Once stable, consider enabling these.</span>
  <span class="hljs-attr">livenessProbe:</span>
    <span class="hljs-attr">enabled:</span> <span class="hljs-literal">false</span>
  <span class="hljs-attr">readinessProbe:</span>
    <span class="hljs-attr">enabled:</span> <span class="hljs-literal">false</span>

  <span class="hljs-comment"># Bootstrap superuser from our ExternalSecret-backed K8s Secret.</span>
  <span class="hljs-attr">superUser:</span>
    <span class="hljs-attr">existingSecret:</span> <span class="hljs-string">"nautobot-prod-superuser-secrets"</span>   <span class="hljs-comment"># created by ESO</span>
    <span class="hljs-attr">existingSecretPasswordKey:</span> <span class="hljs-string">"password"</span>               <span class="hljs-comment"># key in that Secret</span>
    <span class="hljs-attr">existingSecretApiTokenKey:</span> <span class="hljs-string">"api_token"</span>              <span class="hljs-comment"># key in that Secret</span>
    <span class="hljs-attr">username:</span> <span class="hljs-string">"jeff"</span>                                    <span class="hljs-comment"># static bootstrap username</span>

<span class="hljs-attr">celery:</span>
  <span class="hljs-comment"># One worker to start; bump if you run jobs/heavy plugins.</span>
  <span class="hljs-attr">replicaCount:</span> <span class="hljs-number">1</span>

<span class="hljs-attr">serviceAccount:</span>
  <span class="hljs-comment"># Leave token mounted, used for ESO/ClusterSecretStore</span>
  <span class="hljs-attr">automountServiceAccountToken:</span> <span class="hljs-literal">true</span>

<span class="hljs-attr">postgresql:</span>
  <span class="hljs-comment"># Using the chart’s built-in PostgreSQL with CephFS persistence.</span>
  <span class="hljs-attr">primary:</span>
    <span class="hljs-attr">persistence:</span>
      <span class="hljs-attr">enabled:</span> <span class="hljs-literal">true</span>
      <span class="hljs-attr">size:</span> <span class="hljs-string">"2Gi"</span>                          <span class="hljs-comment"># starter size; expand later</span>
      <span class="hljs-attr">storageClass:</span> <span class="hljs-string">"rook-cephfs-retain"</span>   <span class="hljs-comment"># keep data if PVC is deleted</span>
      <span class="hljs-attr">accessModes:</span> [<span class="hljs-string">'ReadWriteOnce'</span>]       <span class="hljs-comment"># DB should be single-writer</span>
  <span class="hljs-attr">auth:</span>
    <span class="hljs-comment"># Pull the password from the ExternalSecret-created Secret.</span>
    <span class="hljs-attr">existingSecret:</span> <span class="hljs-string">nautobot-prod-db-secrets</span>

<span class="hljs-attr">redis:</span>
  <span class="hljs-comment"># Enable persistence if you want cache/queue data to survive restarts.</span>
  <span class="hljs-attr">master:</span>
    <span class="hljs-attr">persistence:</span>
      <span class="hljs-attr">enabled:</span> <span class="hljs-literal">true</span>
      <span class="hljs-attr">size:</span> <span class="hljs-string">"1Gi"</span>
      <span class="hljs-attr">storageClass:</span> <span class="hljs-string">"rook-cephfs-delete"</span>   <span class="hljs-comment"># okay to delete for cache data</span>
      <span class="hljs-attr">accessModes:</span> [<span class="hljs-string">'ReadWriteOnce'</span>]
  <span class="hljs-attr">auth:</span>
    <span class="hljs-attr">enabled:</span> <span class="hljs-literal">true</span>
    <span class="hljs-attr">existingSecret:</span> <span class="hljs-string">nautobot-prod-db-secrets</span>
</code></pre>
<h3 id="heading-why-these-choices">Why these choices</h3>
<ul>
<li><p><strong>Probes disabled (initially):</strong> first runs often include migrations; turning probes off avoids noisy restarts. Re-enable once everything is healthy.</p>
</li>
<li><p><strong>CephFS everywhere:</strong> aligns with the storage classes you built earlier.</p>
<ul>
<li><p><code>rook-cephfs-retain</code> for Postgres so accidental PVC deletes don’t nuke data.</p>
</li>
<li><p><code>rook-cephfs-delete</code> for Redis because it’s cache/queue data.</p>
</li>
</ul>
</li>
<li><p><code>ReadWriteOnce</code> for DB/Redis: even though CephFS supports RWX, keeping databases single-writer reduces foot-guns (performance issues, data corruption, or scalability bottlenecks).</p>
</li>
<li><p><strong>Secrets via ESO:</strong> <code>existingSecret</code> keys point at the Kubernetes Secrets materialized from Vault, so nothing sensitive lives in Git or in the helm values.</p>
</li>
</ul>
<h2 id="heading-rounding-out-the-argo-cd-application">Rounding Out the Argo CD Application</h2>
<p>Now that Helm (the app) and Kustomize (secrets + ingress) are defined and your custom Helm values exist, we just need to finish the Argo CD Application so it points at both sources and deploys them to the right place (below).</p>
<pre><code class="lang-yaml"><span class="hljs-attr">project:</span> <span class="hljs-string">prod-home</span>
<span class="hljs-attr">destination:</span>
  <span class="hljs-attr">server:</span> <span class="hljs-string">https://prod-kube-vip.jjland.local:6443</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">nautobot-prod</span>
<span class="hljs-attr">syncPolicy:</span>
  <span class="hljs-attr">syncOptions:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">CreateNamespace=true</span>
<span class="hljs-attr">sources:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">repoURL:</span> <span class="hljs-string">https://nautobot.github.io/helm-charts/</span>
    <span class="hljs-attr">targetRevision:</span> <span class="hljs-number">2.5</span><span class="hljs-number">.5</span>
    <span class="hljs-attr">helm:</span>
      <span class="hljs-attr">valueFiles:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-string">$values/apps/nautobot/values-prod.yml</span>
    <span class="hljs-attr">chart:</span> <span class="hljs-string">nautobot</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">repoURL:</span> <span class="hljs-string">https://github.com/leothelyon17/kubernetes-gitops-playground.git</span>
    <span class="hljs-attr">path:</span> <span class="hljs-string">apps/nautobot/overlays/prod</span>
    <span class="hljs-attr">targetRevision:</span> <span class="hljs-string">HEAD</span>
    <span class="hljs-attr">ref:</span> <span class="hljs-string">values</span>
</code></pre>
<p>Copy and paste the above into the Argo CD GUI, or edit the Application manually, the same way we configured similar app manifests in previous posts.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1757979999407/3ceacc80-ade3-43bb-8317-beb39b1aa09a.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1757980030045/270127da-a3e2-4c1f-8b85-524ce9aa5b40.png" alt class="image--center mx-auto" /></p>
<h1 id="heading-deploying-and-syncing-the-app">Deploying and Syncing the App</h1>
<p>With everything bundled via <strong>Kustomize</strong> and correctly referenced by ArgoCD, it’s time to deploy.</p>
<p>Open the Argo CD Application and click <strong>Sync</strong>. You should see the Helm release create a batch of Kubernetes objects. To focus on what we built in <em>this</em> post, look for:</p>
<ul>
<li><p><strong>PVCs bound</strong> to your CephFS StorageClasses and mounted by the pods</p>
</li>
<li><p><strong>PostgreSQL and Redis</strong> pods coming up <strong>Healthy</strong></p>
</li>
<li><p><strong>Secrets flow</strong>: <code>ClusterSecretStore</code> and <code>ExternalSecret</code> resources showing <strong>Synced</strong>, and the resulting Kubernetes Secrets present in the namespace</p>
</li>
<li><p><strong>IngressRoute</strong> created and admitted by Traefik (host matches your DNS A record)</p>
</li>
</ul>
<p>If all of the above is green, the Argo CD app should land in <strong>Synced / Healthy</strong>. Screenshots below show an example of what you should see.</p>
<h3 id="heading-storage"><strong>Storage</strong></h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1757948088008/8175b608-36f6-4c19-972f-b4ce9a9026fb.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1757948164574/2f5d5d8b-5834-451d-a51f-785437169fa2.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-secrets"><strong>Secrets</strong></h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1757948312940/7cf275ea-cc3c-473c-910f-ab58a29e9756.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1757948386277/c99b8106-b287-41d7-a286-ae97ccfba0d0.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-ingressroutetraefik"><strong>IngressRoute/Traefik</strong></h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1757948427916/e46f68ab-65f8-44d1-b25d-0049a7f2027d.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1757948476931/c0ea51c6-8bab-4c44-b3fb-33a02f887bcc.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1757948588646/ab6ec5fb-8e03-4daa-b8d8-7e28f47ef020.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-the-application-pods">The Application Pods</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1757980315587/097ea482-297b-4416-8f08-955e09946014.png" alt class="image--center mx-auto" /></p>
<p><strong>Note:</strong> It can take a little while for the app to show <strong>Healthy</strong> and become reachable. On the first deploy, once Nautobot connects to Postgres it will run initial database migrations to create tables—this adds extra time on top of the normal startup. If you’re curious, watch the <code>nautobot-init</code> logs for migration progress.</p>
<p>If everything’s green in Argo CD and the pods look steady, open the host defined in your <strong>IngressRoute</strong>. You should land on the Nautobot login page. Sign in with the <strong>superuser</strong> credentials you stored in Vault (surfaced via External Secrets and referenced in your custom Helm values). If login fails, check the logs for the <code>nautobot-init</code> container. On first start it runs migrations and bootstraps the superuser. You’ll see log messages confirming the account creation (not the raw secrets), which is a quick way to verify the secret wiring end to end.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1757948964815/b37fd16c-babe-4027-90ad-0f8bfe28007f.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1757949163152/fd20e079-45b0-4099-8b6c-4da3e5aa3daf.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1757949210758/9fad986e-c571-4ae9-97a8-ce315968a690.png" alt class="image--center mx-auto" /></p>
<p>If you can log in, <strong>CONGRATULATIONS!</strong> You’ve just deployed Nautobot on Kubernetes, fully managed the GitOps way.</p>
<h1 id="heading-troubleshooting-tips">Troubleshooting Tips</h1>
<p>If your deployment isn’t landing cleanly, work through these quick checks, organized by the same pieces we built in this post.</p>
<hr />
<h2 id="heading-1-argo-cd-amp-kustomize">1) Argo CD &amp; Kustomize</h2>
<p><strong>What to look for</strong></p>
<ul>
<li><p>App stuck in <strong>OutOfSync</strong> or <strong>Progressing</strong>.</p>
</li>
<li><p>Sync immediately fails.</p>
</li>
<li><p>Resources missing from the tree.</p>
</li>
</ul>
<p><strong>Checks</strong></p>
<ul>
<li><p>Open the Argo CD git diff for the app: look for bad paths/filenames in <code>kustomization.yml</code>.</p>
</li>
<li><p>Verify the repo folder the Application points to contains:</p>
<ul>
<li><p><code>cluster-secret-store.yml</code></p>
</li>
<li><p><code>external-secrets-db.yml</code></p>
</li>
<li><p><code>external-secret-superuser.yml</code></p>
</li>
<li><p><code>ingress-route.yml</code></p>
</li>
<li><p><code>values-prod.yml</code> (referenced by your Helm app)</p>
</li>
</ul>
</li>
<li><p>Confirm file paths in the ArgoCD App manifest</p>
</li>
<li><p>Double-check all YAML syntax</p>
</li>
</ul>
<hr />
<h2 id="heading-2-secrets-pipeline-vault-eso-k8s-secret">2) Secrets pipeline: Vault → ESO → K8s Secret</h2>
<p><strong>Symptoms</strong></p>
<ul>
<li>ExternalSecrets show <strong>Not Synced</strong>, or Nautobot init fails with missing env vars/credentials.</li>
</ul>
<p><strong>Checks</strong></p>
<ul>
<li><p><code>ClusterSecretStore</code>:</p>
<ul>
<li><p>Server URL reachable inside the cluster?</p>
</li>
<li><p><code>auth.kubernetes.mountPath</code> matches your Vault auth mount?</p>
</li>
<li><p><code>role</code> name matches the role you created in Vault?</p>
</li>
</ul>
</li>
<li><p><code>ExternalSecret</code>:</p>
<ul>
<li><p>Conditions should be <strong>Ready=True</strong>; if not, <code>describe</code> it for a clear error (auth denied, key not found, etc.).</p>
</li>
<li><p>Verify Vault paths/field names match exactly (KV v2 pathing trips people up).</p>
</li>
</ul>
</li>
<li><p>ServiceAccount binding:</p>
<ul>
<li>The SA referenced in the store exists in the right namespace, and your Vault role binds to <strong>that</strong> SA+namespace.</li>
</ul>
</li>
</ul>
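<p>As a reference, a Vault-backed <code>ClusterSecretStore</code> that covers the checks above looks roughly like this. The server URL, mount path, role, and ServiceAccount names below are placeholders from a typical setup, not values from this deployment:</p>
<pre><code class="lang-yaml">apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: vault-backend
spec:
  provider:
    vault:
      server: "http://vault.vault.svc:8200"  # must be reachable from inside the cluster
      path: "secret"                         # KV mount name in Vault
      version: "v2"                          # KV v2 changes the read path, a common gotcha
      auth:
        kubernetes:
          mountPath: "kubernetes"            # must match your Vault auth mount
          role: "external-secrets"           # must match the role created in Vault
          serviceAccountRef:
            name: "external-secrets-sa"
            namespace: "nautobot-prod"       # Vault role must bind to this SA+namespace
</code></pre>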
<hr />
<h2 id="heading-3-storage-cephfs-storageclass-amp-pvcs">3) Storage: CephFS StorageClass &amp; PVCs</h2>
<p><strong>Symptoms</strong></p>
<ul>
<li>PVCs stuck in <strong>Pending</strong>; pods can’t mount volumes.</li>
</ul>
<p><strong>Checks</strong></p>
<ul>
<li><p><code>StorageClass</code> name in Helm values matches your CephFS SC (e.g., <code>rook-cephfs-retain</code>).</p>
</li>
<li><p>Access modes fit usage:</p>
<ul>
<li><p><strong>Postgres/Redis</strong>: <code>ReadWriteOnce</code> (single writer).</p>
</li>
<li><p><strong>Nautobot media/static</strong> (if used): <code>ReadWriteMany</code>.</p>
</li>
</ul>
</li>
<li><p>Rook-Ceph health:</p>
<ul>
<li>OSDs/MONs healthy, pool/FS exists, quota not exceeded.</li>
</ul>
</li>
<li><p>If PVC deleted but PV persists:</p>
<ul>
<li>That’s expected with <code>reclaimPolicy: Retain</code>; either reuse or manually clean it up before recreating.</li>
</ul>
</li>
</ul>
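<p>For context, the storage-related Helm values typically follow the Bitnami subchart convention sketched below. Key paths vary by chart version, so treat these names as illustrative rather than exact:</p>
<pre><code class="lang-yaml">postgresql:
  primary:
    persistence:
      storageClass: rook-cephfs-retain  # must name an existing StorageClass
      accessModes:
        - ReadWriteOnce                 # single writer for the database
redis:
  master:
    persistence:
      storageClass: rook-cephfs-retain
</code></pre>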
<hr />
<h2 id="heading-4-postgres-amp-redis-built-in-charts">4) Postgres &amp; Redis (built-in charts)</h2>
<p><strong>Symptoms</strong></p>
<ul>
<li>DB pod CrashLoopBackOff; app can’t connect.</li>
</ul>
<p><strong>Checks</strong></p>
<ul>
<li><p>Secrets:</p>
<ul>
<li>The <code>existingSecret</code> names line up with what the subcharts expect, and key names (<code>password</code>, <code>postgres-password</code>, <code>redis-password</code>) match your ExternalSecret outputs.</li>
</ul>
</li>
<li><p>Persistence:</p>
<ul>
<li>Correct StorageClass; PVC bound.</li>
</ul>
</li>
<li><p>Logs:</p>
<ul>
<li><p>Postgres: authentication/permissions, initdb errors.</p>
</li>
<li><p>Redis: refuses connections or auth errors if <code>auth.enabled=true</code>.</p>
</li>
</ul>
</li>
</ul>
<hr />
<h2 id="heading-5-nautobot-app-webworkerbeat">5) Nautobot app (web/worker/beat)</h2>
<p><strong>Symptoms</strong></p>
<ul>
<li>Web never becomes Ready, 502 via Traefik, or superuser not created.</li>
</ul>
<p><strong>Checks</strong></p>
<ul>
<li><p><code>nautobot-init</code> logs:</p>
<ul>
<li>Confirms migrations and superuser bootstrap; errors here usually mean secret keys missing/wrong.</li>
</ul>
</li>
<li><p>Probes:</p>
<ul>
<li>We disabled probes initially—good. If you enabled them early, they can flap during migrations; disable, sync, let it settle, then re-enable.</li>
</ul>
</li>
<li><p>Environment wiring:</p>
<ul>
<li>Confirm the Helm values reference the K8s Secret keys you created (names and casing must match).</li>
</ul>
</li>
</ul>
<hr />
<h2 id="heading-6-ingress-traefik-amp-dns">6) Ingress, Traefik &amp; DNS</h2>
<p><strong>Symptoms</strong></p>
<ul>
<li>404/503 at the browser, TLS errors, or wrong host.</li>
</ul>
<p><strong>Checks</strong></p>
<ul>
<li><p><strong>IngressRoute</strong>:</p>
<ul>
<li><p><code>routes.match</code> host matches your DNS A record exactly.</p>
</li>
<li><p><code>entryPoints: ["websecure"]</code> and Traefik has that entrypoint enabled.</p>
</li>
</ul>
</li>
<li><p>Traefik Service:</p>
<ul>
<li>Has an <strong>EXTERNAL-IP</strong> from MetalLB; DNS A record points to it.</li>
</ul>
</li>
<li><p>If using certs later:</p>
<ul>
<li>Don’t reference cert-manager resources yet if you haven’t set them up; keep TLS simple at Traefik.</li>
</ul>
</li>
</ul>
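<p>For comparison, a minimal Traefik <code>IngressRoute</code> for this app looks roughly like the following. The hostname, service name, and port are placeholders, and the <code>apiVersion</code> depends on your Traefik version:</p>
<pre><code class="lang-yaml">apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: nautobot
  namespace: nautobot-prod
spec:
  entryPoints:
    - websecure                            # must be an enabled Traefik entrypoint
  routes:
    - match: Host(`nautobot.example.com`)  # must match your DNS A record exactly
      kind: Rule
      services:
        - name: nautobot
          port: 80
</code></pre>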
<hr />
<h2 id="heading-7-metallb-external-reachability">7) MetalLB (external reachability)</h2>
<p><strong>Symptoms</strong></p>
<ul>
<li>Traefik never gets an external IP; no traffic into the cluster.</li>
</ul>
<p><strong>Checks</strong></p>
<ul>
<li><p><code>IPAddressPool</code> contains the IP/range; it’s unused on your LAN.</p>
</li>
<li><p><code>L2Advertisement</code> exists for that pool.</p>
</li>
<li><p>Traefik Service <code>type: LoadBalancer</code> and (optionally) <code>loadBalancerIP</code> matches your chosen IP.</p>
</li>
</ul>
<hr />
<h2 id="heading-8-resources-amp-scheduling">8) Resources &amp; scheduling</h2>
<p><strong>Symptoms</strong></p>
<ul>
<li>Pods Pending or OOMKilled.</li>
</ul>
<p><strong>Checks</strong></p>
<ul>
<li><p>Nodes have capacity; Ceph/DB pods especially need memory/CPU.</p>
</li>
<li><p>Start small (single replicas) then scale up.</p>
</li>
<li><p>If OOMs, raise limits/requests or add memory.</p>
</li>
</ul>
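<p>If you need to adjust limits, it’s the standard Kubernetes <code>resources</code> block in your Helm values. The numbers below are illustrative starting points, not recommendations:</p>
<pre><code class="lang-yaml">resources:
  requests:
    cpu: 500m
    memory: 1Gi
  limits:
    memory: 2Gi  # raise this first if pods are getting OOMKilled
</code></pre>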
<hr />
<h2 id="heading-9-naming-amp-key-mismatches-sneaky-but-common">9) Naming &amp; key mismatches (sneaky but common)</h2>
<p><strong>What to verify</strong></p>
<ul>
<li><p>Secret <strong>names</strong> and <strong>keys</strong> in:</p>
<ul>
<li><p><code>ExternalSecret</code> → <strong>target</strong> Secret</p>
</li>
<li><p>Helm values (e.g., <code>existingSecret</code>, <code>existingSecretPasswordKey</code>, etc.)</p>
</li>
</ul>
</li>
<li><p>Namespace consistency across all manifests (<code>nautobot-prod</code> vs something else).</p>
</li>
</ul>
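<p>To make the name/key chain concrete, here’s a hedged <code>ExternalSecret</code> sketch. Every name in it (store, target Secret, Vault path, keys) is a placeholder; the point is that <code>target.name</code> and each <code>secretKey</code> must match exactly what your Helm values reference:</p>
<pre><code class="lang-yaml">apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: nautobot-superuser
  namespace: nautobot-prod           # must match the app namespace
spec:
  secretStoreRef:
    kind: ClusterSecretStore
    name: vault-backend
  target:
    name: nautobot-superuser         # the K8s Secret your Helm values reference
  data:
    - secretKey: superuser-password  # key inside the K8s Secret; casing matters
      remoteRef:
        key: nautobot/superuser      # Vault path (KV v2: no 'data/' segment here)
        property: password           # field name within the Vault secret
</code></pre>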
<hr />
<h2 id="heading-10-quick-sanity-commands-lightweight">10) Quick sanity commands (lightweight)</h2>
<ul>
<li><p><strong>Objects at a glance:</strong> <code>kubectl -n nautobot-prod get all</code></p>
</li>
<li><p><strong>ESO health:</strong> <code>kubectl -n nautobot-prod get externalsecret,secretstore,clustersecretstore</code></p>
</li>
<li><p><strong>PVCs:</strong> <code>kubectl -n nautobot-prod get pvc</code></p>
</li>
<li><p><strong>Describe failures:</strong> <code>kubectl -n nautobot-prod describe &lt;kind&gt; &lt;name&gt;</code></p>
</li>
<li><p><strong>App logs:</strong> <code>kubectl -n nautobot-prod logs deploy/nautobot -c nautobot-init --tail=100</code></p>
</li>
</ul>
<p>Ultimately, if you don’t know where to start, <strong>USE THE CONTAINER LOGS</strong>. ArgoCD makes viewing them easy, and you can usually find the issue in the logs themselves.</p>
<h1 id="heading-summary">Summary</h1>
<p>Well, we did it. We didn’t just get Nautobot running; we established a repeatable pattern for network-automation apps, or any containerized app: ArgoCD for reconciliation, Kustomize for environment shaping, Vault + External Secrets for credentials, Traefik + MetalLB for reachability, and CephFS for persistence. That stack gives you a stable runway to ship changes the same way every time, through Git, without snowflakes or manual tweaks. The same method works on-prem, in the cloud, or across a mix of the two.</p>
<h3 id="heading-why-this-helps-your-automation-journey">Why this helps your automation journey</h3>
<ul>
<li><p><strong>Trustable intent:</strong> Nautobot becomes the system of record for sites, devices, IPAM, and custom models exposed via REST/GraphQL for pipelines and tools.</p>
</li>
<li><p><strong>Safe, auditable change:</strong> Every tweak (charts, values, secrets wiring, ingress) goes through Git reviews and rolls back cleanly. Drift is visible; fixes are deterministic.</p>
</li>
<li><p><strong>Fewer blockers:</strong> Secrets are handled with least-privilege, storage/ingress are standardized, so you can focus on workflows, not plumbing.</p>
</li>
<li><p><strong>From dev to prod:</strong> The same pattern scales to new apps (observability, chatops, CI/CD helpers) with minimal friction. Copy the overlay, adjust values, and commit.</p>
</li>
</ul>
<h3 id="heading-where-im-going-next">Where I’m going next</h3>
<ul>
<li><p>An <strong>advanced Nautobot</strong> deployment (plugins, app config, HTTPS/certs, SSO).</p>
</li>
<li><p><strong>Integrations</strong> with other GitOps-deployed apps.</p>
</li>
<li><p>A <strong>NetBox</strong> deployment for folks who prefer that app (I love it too!).</p>
</li>
</ul>
<p>This is the moment where GitOps stops being theory and starts accelerating real network automation and manageable application delivery.</p>
<p>Thanks for reading!</p>
<hr />
<h2 id="heading-links">Links</h2>
<ul>
<li><p><a target="_blank" href="https://blog.nerdylyonsden.io/bridging-the-gap-gitops-for-network-engineers-part-1">Bridging the Gap: GitOps for Network Engineers - Part 1</a> (Deploying ArgoCD)</p>
</li>
<li><p><a target="_blank" href="https://blog.nerdylyonsden.io/bridging-the-gap-gitops-for-network-engineers-part-2">Bridging the Gap: GitOps for Network Engineers - Part 2</a> (Deploying Critical Infrastructure with ArgoCD)</p>
</li>
<li><p><a target="_blank" href="https://github.com/leothelyon17/kubernetes-gitops-playground/tree/main">Github Repo</a></p>
</li>
<li><p><a target="_blank" href="https://networktocode.com/nautobot/">Nautobot - Official Site</a></p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Bridging the Gap: GitOps for Network Engineers -  Part 2]]></title><description><![CDATA[ArgoCD Is Amazing—But Let’s Make It Do Something!
Intro
In Part 1, we laid the foundation by installing ArgoCD and setting up the basic structure for a GitOps-driven platform. If you've followed along, you should now have a working Kubernetes cluster...]]></description><link>https://blog.nerdylyonsden.io/bridging-the-gap-gitops-for-network-engineers-part-2</link><guid isPermaLink="true">https://blog.nerdylyonsden.io/bridging-the-gap-gitops-for-network-engineers-part-2</guid><category><![CDATA[gitops]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[ArgoCD]]></category><category><![CDATA[Network Automation]]></category><category><![CDATA[metallb]]></category><category><![CDATA[Traefik]]></category><category><![CDATA[ceph]]></category><category><![CDATA[hashicorp-vault]]></category><category><![CDATA[external-secrets]]></category><dc:creator><![CDATA[Jeffrey Lyon]]></dc:creator><pubDate>Mon, 05 May 2025 16:23:13 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1744599246210/770fda69-3c15-4b77-810c-89d5bc72797a.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>ArgoCD Is Amazing—But Let’s Make It Do Something!</p>
<h1 id="heading-intro">Intro</h1>
<p>In <strong>Part 1</strong>, we laid the foundation by installing ArgoCD and setting up the basic structure for a GitOps-driven platform. If you've followed along, you should now have a working Kubernetes cluster, ArgoCD deployed and accessible, and your first project created in the UI.</p>
<p>Now it's time to turn that foundation into something usable.</p>
<p>In <strong>Part 2</strong>, we'll start deploying the critical infrastructure pieces that power everything else. That includes <strong>MetalLB</strong> for external load balancing, <strong>Traefik</strong> for ingress, persistent storage using <strong>Rook + Ceph</strong>, and secrets management with <strong>External Secrets and HashiCorp Vault</strong>. All of these will be deployed through ArgoCD, GitOps-style.</p>
<p>We’ll kick things off with <strong>MetalLB</strong>, which enables us to expose services outside the cluster, an essential first step in making your platform actually accessible. Let’s get into it.</p>
<h1 id="heading-metallb-load-balancing-for-bare-metal-and-home-labs"><strong>MetalLB: Load Balancing for Bare Metal and Home Labs</strong></h1>
<p>If you're running Kubernetes in a cloud environment, you typically get a load balancer as part of the package, something like an AWS ELB or an Azure Load Balancer that magically routes traffic to your services. But when you're running on bare metal, in a lab, or on-prem (which, let’s be real, a lot of network engineers are), you're on your own. That's where <strong>MetalLB</strong> comes in.</p>
<h2 id="heading-what-is-metallb"><strong>What is MetalLB?</strong></h2>
<p><strong>MetalLB</strong> is a load balancer implementation for Kubernetes clusters that <strong>don’t</strong> have access to cloud-native load balancer resources. It allows you to assign external IP addresses to your Kubernetes services so that they can be accessed from outside the cluster, exactly what you'd expect from a "real" load balancer, just built for the DIY crowd.</p>
<h2 id="heading-why-you-need-it"><strong>Why You Need It</strong></h2>
<p>In any Kubernetes-based GitOps platform, exposing services to the outside world is non-negotiable. Whether it’s ArgoCD, Traefik, Vault, or any of your network automation tools, they all need to be reachable by users, APIs, or other systems. While NodePorts can get the job done in a lab, they’re clunky, inconsistent, and definitely not production-grade.</p>
<p>MetalLB solves this by handling <strong>Service type: LoadBalancer</strong> in environments where a cloud load balancer doesn’t exist, like bare metal or your home lab. You define a pool of IP addresses from your local network, and MetalLB assigns those IPs to services that request them.</p>
<p>Here’s where the networking magic comes in: MetalLB (when running in Layer 2 mode) announces those external IPs using ARP. If a device outside of the cluster ARPs for an exposed service IP, MetalLB replies with the MAC address of the node running the service. It’s simple, reliable, and doesn’t require BGP or complex router configs.</p>
<p>So when a LoadBalancer service is created, for example, to expose ArgoCD or Traefik, MetalLB makes that service’s external IP reachable from anywhere on your local network, just like a real load balancer would in a cloud environment.</p>
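<p>In practice, that means any Service of <code>type: LoadBalancer</code> gets an IP from MetalLB’s pool. A minimal sketch, with names and ports as examples only:</p>
<pre><code class="lang-yaml">apiVersion: v1
kind: Service
metadata:
  name: traefik
  namespace: traefik
spec:
  type: LoadBalancer  # MetalLB watches for this and assigns a pool IP
  selector:
    app: traefik
  ports:
    - port: 443
      targetPort: 8443
</code></pre>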
<h2 id="heading-how-it-powers-the-platform"><strong>How It Powers the Platform</strong></h2>
<p>MetalLB becomes one of the core enablers of our GitOps stack. It allows you to:</p>
<ul>
<li><p>Expose ArgoCD with a proper external IP</p>
</li>
<li><p>Route external traffic to Traefik, our ingress controller</p>
</li>
<li><p>Provide consistent access to internal services that need to be reachable from your network</p>
</li>
<li><p>Maintain a production-like networking experience, even in a lab or homelab environment</p>
</li>
</ul>
<p>Without MetalLB, you’d either be stuck manually forwarding ports, messing with IP tables, or leaning on NodePorts. With it, your platform starts acting like it belongs in a real, routable network, and that’s exactly what we want.</p>
<p>Now that we understand what MetalLB does and how it fits into the big picture, let’s deploy it the GitOps way, starting with adding the Helm chart repository to our config</p>
<h3 id="heading-quick-review-helm-charts-and-how-they-fit-into-argocd"><strong>Quick Review: Helm Charts and How They Fit into ArgoCD</strong></h3>
<p>Before we deploy MetalLB, let’s quickly go over how <strong>Helm</strong> works, especially how it integrates with <strong>ArgoCD</strong>.</p>
<p>Helm is a package manager for Kubernetes. Instead of manually writing and applying a bunch of YAML files, Helm lets you deploy versioned, configurable "charts", pre-packaged bundles of Kubernetes manifests that define an application. These charts live in remote <strong>Helm repositories</strong>, similar to how <code>apt</code> or <code>yum</code> fetch packages on a Linux system.</p>
<p>In a GitOps workflow, Helm charts are referenced as part of an ArgoCD <strong>Application</strong> manifest, specifically as a <code>source</code>. ArgoCD uses this source definition to pull the chart directly from the repo, apply any custom <code>values.yaml</code> overrides you’ve stored in Git, and deploy everything into your cluster automatically.</p>
<h3 id="heading-using-the-metallb-helm-chart-with-argocd"><strong>Using the MetalLB Helm Chart with ArgoCD</strong></h3>
<p>The official MetalLB Helm chart is hosted at:</p>
<pre><code class="lang-bash">https://metallb.github.io/metallb
</code></pre>
<p>When creating your ArgoCD Application, one of your <code>sources</code> will look like this:</p>
<ul>
<li><p><strong>Type</strong>: <code>Helm</code></p>
</li>
<li><p><strong>Chart</strong>: <code>metallb</code></p>
</li>
<li><p><strong>Repo URL</strong>: <a target="_blank" href="https://metallb.github.io/metallb"><code>https://metallb.github.io/metallb</code></a></p>
</li>
<li><p><strong>Target Revision</strong>: Usually the latest</p>
</li>
</ul>
<p>ArgoCD will then treat this Helm chart as part of the desired state. It will sync the chart, merge in your values (if you’re overriding anything), and deploy MetalLB as part of your platform, all driven from Git.</p>
<h2 id="heading-metallb-installation">MetalLB Installation</h2>
<p>These initial steps, adding the Helm repo or other base sources, creating the app in ArgoCD, and wiring up the basic Helm configuration, are <strong>mostly the same for every application</strong> we’ll deploy. Because of that, I’ll only walk through this process in detail once (here), and only call out major differences for other apps later in the post. Screenshots are included below where it helps, but once you’ve done it once, you’ll be able to rinse and repeat for everything else.</p>
<h3 id="heading-step-1-add-the-helm-repo"><strong>Step 1: Add the Helm Repo</strong></h3>
<p>ArgoCD needs to know where to fetch the Helm chart from. For MetalLB, we’ll be using the GitHub-hosted chart:</p>
<ul>
<li><strong>Helm Repo URL</strong>:<br />  <code>https://metallb.github.io/metallb</code></li>
</ul>
<p>In the ArgoCD UI:</p>
<ul>
<li><p>Go to <strong>Settings → Repositories</strong></p>
</li>
<li><p>Click <strong>+ CONNECT REPO</strong></p>
</li>
<li><p>Enter the Helm repo URL</p>
</li>
<li><p>Choose <strong>Helm</strong> as the type</p>
</li>
<li><p>Give the repo a name (Optional)</p>
</li>
<li><p>Choose the project you created earlier to associate this repo with (mine was ‘prod-home’)</p>
</li>
<li><p>No authentication is needed for this public repo</p>
</li>
<li><p>Once done, click <strong>CONNECT</strong></p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745370089083/f3239ee4-6f34-4663-a800-45a20626e987.png" alt class="image--center mx-auto" /></p>
<p>Once added, ArgoCD can now pull charts from this source.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1744597093848/3ce4d22b-a831-47ba-9629-e114dcd2b704.png" alt class="image--center mx-auto" /></p>
<p><strong>Note:</strong> You’ll also need to add the <strong>GitHub repo</strong> that contains your custom configuration files, like Helm <code>values.yml</code> files and Kustomize overlays.</p>
<ul>
<li><p>If you're using <strong>my example repo</strong>, add <code>https://github.com/leothelyon17/kubernetes-gitops-playground.git</code> as another source, of type Git.</p>
</li>
<li><p>If you're using <strong>your own repo</strong>, just make sure it's added in the same way so ArgoCD can pull your values and overlays when syncing.</p>
<p>  <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745371223758/1a29db15-5695-4cbb-b229-12876dc0f5d7.png" alt class="image--center mx-auto" /></p>
</li>
</ul>
<h3 id="heading-step-2-create-the-argocd-application"><strong>Step 2: Create the ArgoCD Application</strong></h3>
<p>Head to the <strong>Applications</strong> tab and click <strong>+ NEW APP</strong> to start the deployment.</p>
<p>Here’s how to fill it out:</p>
<ul>
<li><p><strong>Application Name</strong>: <code>metallb</code></p>
</li>
<li><p><strong>Project</strong>: Select your project (e.g., lab-home)</p>
</li>
<li><p><strong>Sync Policy</strong>: Manual for now (we’ll automate later)</p>
</li>
<li><p><strong>Repository URL</strong>: Select the Helm repo you just added</p>
</li>
<li><p><strong>Chart Name</strong>: <code>metallb</code></p>
</li>
<li><p><strong>Target Revision</strong>: Use the latest or specify a version (recommended once things are stable)</p>
</li>
<li><p><strong>Cluster URL</strong>: Use <a target="_blank" href="https://kubernetes.default.svc"><code>https://kubernetes.default.svc</code></a> if deploying to the same cluster (mine may differ from the default; don’t worry)</p>
</li>
<li><p><strong>Namespace</strong>: <code>metallb-system</code> (check to create it if it doesn’t exist)</p>
</li>
</ul>
<p>Click CREATE when finished.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745370359345/80f3c9da-09e6-4cdd-bab6-d82f4c1ea8c3.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745370416424/160969ef-aaf7-41ab-bdb3-1032ffe5f716.png" alt class="image--center mx-auto" /></p>
<p>If everything is in order, you should see the App created like the screenshot below, though yours will show all-yellow status and ‘OutOfSync’:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745370461925/06f245ac-7393-4cf8-a38b-5cebe567c3ef.png" alt class="image--center mx-auto" /></p>
<p>Click into the app and you’ll see that ArgoCD has pulled in all the Kubernetes objects defined by the Helm chart. Everything will show as <strong>OutOfSync</strong> for now, ArgoCD knows what needs to be deployed, but we’re not quite ready to hit sync just yet. You're doing great, let’s move on to the next step.</p>
<h3 id="heading-step-3-add-the-kustomize-configuration-layer"><strong>Step 3: Add the Kustomize Configuration Layer</strong></h3>
<p>For MetalLB, we’re keeping things straightforward (kind of): the Helm chart gets deployed using its <strong>default values</strong>, no need to touch <code>values.yml</code> here. But MetalLB still needs to be told <em>how</em> to operate: what IP ranges it can assign, and how it should advertise them. We handle that using a second source: a <strong>Kustomize overlay</strong>.</p>
<p>Here’s what to do next:</p>
<ol>
<li><p>In the ArgoCD UI, go to the <strong>Application</strong> you just created for MetalLB.</p>
</li>
<li><p>Click the <strong>App details (🖉 edit)</strong> icon in the top right to open the manifest editor.</p>
</li>
<li><p>Scroll down to the <code>source</code> section.</p>
</li>
<li><p>You’ll now be editing this app to include a <strong>second source</strong>.</p>
</li>
</ol>
<p>Add the following block under <code>source:</code> to include the Kustomize overlay for your MetalLB custom resources:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">project:</span> <span class="hljs-string">prod-home</span>
<span class="hljs-attr">destination:</span>
  <span class="hljs-attr">server:</span> <span class="hljs-string">https://prod-kube-vip.jjland.local:6443</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">metallb-prod</span>
<span class="hljs-attr">syncPolicy:</span>
  <span class="hljs-attr">syncOptions:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">CreateNamespace=true</span>
<span class="hljs-attr">sources:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">repoURL:</span> <span class="hljs-string">https://metallb.github.io/metallb</span>
    <span class="hljs-attr">targetRevision:</span> <span class="hljs-number">0.14</span><span class="hljs-number">.9</span>
    <span class="hljs-attr">chart:</span> <span class="hljs-string">metallb</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">repoURL:</span> <span class="hljs-string">https://github.com/leothelyon17/kubernetes-gitops-playground.git</span>
    <span class="hljs-attr">path:</span> <span class="hljs-string">apps/metallb/overlays/lab</span>
    <span class="hljs-attr">targetRevision:</span> <span class="hljs-string">HEAD</span>
</code></pre>
<p>NOTE: ‘source’ needs to be changed to ‘sources’, as there is now more than one source.</p>
<p>This tells ArgoCD to deploy not just the Helm chart, but also the additional Kubernetes objects (like <code>IPAddressPool</code> and <code>L2Advertisement</code>) defined in your overlay. These are located in your <code>apps/metallb</code> directory and should include a <code>kustomization.yml</code> that pulls them together.</p>
<p>Once saved, ArgoCD will treat both the Helm install and the Kustomize overlay as part of the same application, and sync them together.</p>
<h3 id="heading-step-4-sync-the-app"><strong>Step 4: Sync the App</strong></h3>
<p>Once everything looks good, hit <strong>Sync</strong>. ArgoCD will pull the chart, merge/build your kustomize files, and deploy MetalLB into the cluster.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745371584446/205aec86-4be7-4483-bba3-023031ac1a8b.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745371728253/a26c3afe-23aa-48c0-a33c-76857d38076c.png" alt class="image--center mx-auto" /></p>
<p>You can click into the app to watch MetalLB’s resources come online; Deployments, ConfigMaps, the speaker DaemonSet, and more. If the sync fails on the first try, don’t panic, just retry it. This can happen if the chart includes CRDs (Custom Resource Definitions), which sometimes cause the sync to complete out of order while the CRDs are still registering.</p>
<p>Once things settle, you should see the application status show <strong>“Healthy”</strong> and <strong>“Synced”</strong>. You’ll also see multiple healthy MetalLB pods running in your cluster, just like the screenshot above.</p>
<p><strong>Congrats! MetalLB is now deployed and ready to hand out external IPs like a proper load balancer.</strong></p>
<h2 id="heading-metallb-custom-configuration">MetalLB Custom Configuration</h2>
<p>I wanted to provide a breakdown of the custom MetalLB files I’m using and why. This directory contains a <strong>Kustomize overlay</strong> used to deploy <strong>MetalLB's custom configuration</strong> in a lab environment. It layers environment-specific resources, like IP pools and advertisements, on top of the base Helm chart deployment, following GitOps best practices.</p>
<h3 id="heading-file-breakdown">File Breakdown</h3>
<h4 id="heading-ip-address-poolyml"><code>ip-address-pool.yml</code></h4>
<p>Defines an <code>IPAddressPool</code> custom resource:</p>
<ul>
<li><p>Specifies a range of IP addresses MetalLB can assign to LoadBalancer services</p>
</li>
<li><p>Ensures services are reachable from the local network</p>
</li>
<li><p>Helps avoid IP conflicts in your lab environment</p>
</li>
</ul>
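<p>For reference, a minimal <code>IPAddressPool</code> manifest looks something like this. The pool name and address range below are illustrative; carve a range out of your own lab subnet that nothing else is using:</p>
<pre><code class="lang-yaml">apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: lab-pool              # illustrative name
  namespace: metallb-system
spec:
  addresses:
    - 172.16.99.30-172.16.99.40   # example range; must not overlap DHCP or static assignments
</code></pre>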
<h4 id="heading-l2-advertisementyml"><code>l2-advertisement.yml</code></h4>
<p>Defines an <code>L2Advertisement</code> custom resource:</p>
<ul>
<li><p>Tells MetalLB to advertise the IPs via <strong>Layer 2</strong> (e.g., ARP)</p>
</li>
<li><p>Perfect for home labs and bare metal where BGP isn’t in use</p>
</li>
<li><p>Allows MetalLB to function like a basic network-aware load balancer</p>
</li>
</ul>
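<p>The matching <code>L2Advertisement</code> is short. A sketch with illustrative names; the entry under <code>ipAddressPools</code> must match whatever pool name you defined in <code>ip-address-pool.yml</code>:</p>
<pre><code class="lang-yaml">apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: lab-l2                # illustrative name
  namespace: metallb-system
spec:
  ipAddressPools:
    - lab-pool                # must match your IPAddressPool name
</code></pre>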
<h4 id="heading-kustomizationyml"><code>kustomization.yml</code></h4>
<p>Kustomize overlay file:</p>
<ul>
<li><p>Combines and applies the above resources</p>
</li>
<li><p>Enables clean separation between base and environment-specific config</p>
</li>
<li><p>Keeps your repo organized and scalable</p>
</li>
</ul>
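<p>And the overlay’s <code>kustomization.yml</code> can be as small as a resource list tying the two files together:</p>
<pre><code class="lang-yaml">apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ip-address-pool.yml
  - l2-advertisement.yml
</code></pre>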
<h3 id="heading-why-it-matters">Why It Matters</h3>
<p>This overlay is what makes MetalLB actually <em>work</em> in your lab. While the Helm chart installs the MetalLB controller and speaker pods, these custom resources tell MetalLB <strong>what IPs to use</strong> and <strong>how to announce them</strong> to your network.</p>
<p>By keeping these files in Git and applying them via ArgoCD, you’re not just deploying MetalLB, you’re making your configuration declarative, version-controlled, and repeatable across environments.</p>
<p>Moving on…</p>
<h1 id="heading-traefik-ingress-routing-built-for-gitops"><strong>Traefik: Ingress Routing Built for GitOps</strong></h1>
<p>Once MetalLB is in place and capable of handing out external IPs, we need something that can route incoming HTTP and HTTPS traffic to the right service inside the cluster. That’s where an <strong>ingress controller</strong> comes in, and for our GitOps setup, <strong>Traefik</strong> is a perfect fit.</p>
<h2 id="heading-what-is-traefik">What is Traefik?</h2>
<p>Traefik is a modern, Kubernetes-native ingress controller that handles routing external traffic into your cluster based on rules you define in Kubernetes. It supports things like:</p>
<ul>
<li><p>Routing traffic based on hostname or path</p>
</li>
<li><p>TLS termination (including Let’s Encrypt support)</p>
</li>
<li><p>Load balancing between multiple pods</p>
</li>
<li><p>Middleware support for things like authentication, redirects, rate limiting, etc.</p>
</li>
</ul>
<p>Traefik is also highly compatible with GitOps workflows. It uses Kubernetes Custom Resource Definitions (CRDs) like <code>IngressRoute</code> and <code>Middleware</code>, which makes it easy to manage all of your ingress behavior declaratively, right from your Git repo.</p>
<h2 id="heading-why-you-need-it-1"><strong>Why You Need It</strong></h2>
<p>Without an ingress controller, every service you want to expose needs its own LoadBalancer service (i.e., a dedicated external IP). That scales poorly, especially in a lab environment with limited IP space.</p>
<p>Traefik solves that problem by letting you expose <strong>multiple services through a single external IP</strong>, usually on ports 80 and 443, by routing requests based on hostnames or paths. This means:</p>
<ul>
<li><p>You can access services like <code>argocd.yourdomain.local</code> and <code>vault.yourdomain.local</code> through the same IP.</p>
</li>
<li><p>You get clean, centralized HTTPS management with built-in TLS support.</p>
</li>
<li><p>You dramatically reduce the number of open ports and public IPs you need.</p>
</li>
</ul>
<p>Paired with MetalLB, Traefik becomes the front door to your entire GitOps platform.</p>
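<p>To make that concrete, here’s a sketch of a hostname-based <code>IngressRoute</code>. The hostname, service name, and namespace are placeholders, not part of this deployment:</p>
<pre><code class="lang-yaml">apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: argocd                # placeholder
  namespace: argocd           # placeholder
spec:
  entryPoints:
    - websecure
  routes:
    - match: Host(`argocd.yourdomain.local`)
      kind: Rule
      services:
        - name: argocd-server   # placeholder backend service
          port: 80
</code></pre>
<p>Any number of routes like this can share the single external IP that MetalLB hands to Traefik’s LoadBalancer service.</p>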
<h2 id="heading-how-it-powers-the-platform-1">How It Powers the Platform</h2>
<p>Traefik is the gateway that makes all the services behind it easily and securely accessible. It enables you to:</p>
<ul>
<li><p>Route HTTP/HTTPS traffic to services like ArgoCD, Vault, and your internal tools</p>
</li>
<li><p>Handle TLS (with optional Let’s Encrypt integration)</p>
</li>
<li><p>Define ingress behavior declaratively via CRDs</p>
</li>
<li><p>Share a single external IP across multiple services, using hostnames or paths</p>
</li>
</ul>
<p>All of this is deployed using ArgoCD, meaning every route, certificate, and service exposure is version-controlled and reproducible.</p>
<h2 id="heading-traefik-installation"><strong>Traefik Installation</strong></h2>
<p>As we covered during the MetalLB install, adding Helm repositories, creating the app in ArgoCD, and configuring the basic Helm parameters is mostly the same for each app we deploy. Because we've already gone through that in detail with MetalLB, I'll just briefly outline the steps again here. No detailed screenshots needed unless there’s a significant difference.</p>
<h3 id="heading-step-1-add-the-traefik-helm-repo"><strong>Step 1: Add the Traefik Helm Repo</strong></h3>
<p>ArgoCD needs to know where to pull the Traefik Helm chart from. For Traefik, we’ll use the official Traefik Helm repository:</p>
<p><strong>Helm Repo URL:</strong></p>
<pre><code class="lang-bash">https://helm.traefik.io/traefik
</code></pre>
<p>In the ArgoCD UI:</p>
<ul>
<li><p>Navigate to <strong>Settings → Repositories</strong></p>
</li>
<li><p>Click <strong>+ CONNECT REPO</strong></p>
</li>
<li><p>Enter the Traefik Helm repo URL listed above</p>
</li>
<li><p>Select <strong>Helm</strong> as the repository type</p>
</li>
<li><p>Provide a name (optional, something like <code>traefik-charts</code>)</p>
</li>
<li><p>Associate the repo with the appropriate ArgoCD project (mine was <code>lab-home</code>)</p>
</li>
<li><p>No authentication is required since this repo is publicly accessible</p>
</li>
<li><p>Click <strong>CONNECT</strong> to finish</p>
</li>
</ul>
<p>Once connected, ArgoCD is ready to deploy the Traefik Helm chart into your cluster.</p>
<h3 id="heading-step-2-create-the-argocd-application-traefik"><strong>Step 2: Create the ArgoCD Application (Traefik)</strong></h3>
<p>Head to the <strong>Applications</strong> tab in ArgoCD, and click <strong>+ NEW APP</strong> to start deploying Traefik.</p>
<p>Here's how you'll fill it out:</p>
<ul>
<li><p><strong>Application Name:</strong> <code>traefik</code></p>
</li>
<li><p><strong>Project:</strong> Select your ArgoCD project</p>
</li>
<li><p><strong>Sync Policy:</strong> Manual (for now)</p>
</li>
<li><p><strong>Repository URL:</strong> Select the Traefik Helm repo you just connected</p>
</li>
<li><p><strong>Chart Name:</strong> <code>traefik</code></p>
</li>
<li><p><strong>Target Revision:</strong> Use <code>latest</code>, or specify a stable version once you've tested and confirmed compatibility</p>
</li>
<li><p><strong>Cluster URL:</strong> Typically <a target="_blank" href="https://kubernetes.default.svc"><code>https://kubernetes.default.svc</code></a> for an in-cluster deploy (if yours differs, just use the appropriate URL)</p>
</li>
<li><p><strong>Namespace:</strong> Use <code>kube-system</code> (check the option to create it if it doesn’t exist yet)</p>
</li>
</ul>
<p><strong>Why</strong> <code>kube-system</code> namespace?<br />Deploying Traefik to the <code>kube-system</code> namespace makes sense because Traefik is essentially a core infrastructure service. Placing it here aligns with Kubernetes best practices: core infrastructure and networking-related services belong in this namespace, clearly separated from user and application workloads.</p>
<p>When finished, click <strong>CREATE</strong> to finalize the setup.</p>
<h3 id="heading-step-3-add-custom-helm-values-for-traefik"><strong>Step 3: Add Custom Helm Values for Traefik</strong></h3>
<p>Unlike MetalLB, our Traefik deployment uses custom Helm values directly from our Git repository, <strong>without Kustomize</strong>. We'll define these custom values as a second source within our ArgoCD Application manifest.</p>
<p>Here's how you'll set this up in the ArgoCD UI:</p>
<ol>
<li><p>Navigate to the <strong>Traefik</strong> Application you created earlier.</p>
</li>
<li><p>Click the <strong>App details (🖉 edit)</strong> icon in the top-right corner to open the manifest editor.</p>
</li>
<li><p>Scroll down to the manifest, and ensure you're using <code>sources:</code> (plural), since we're adding an additional source.</p>
</li>
<li><p>Modify your ArgoCD Application manifest to look similar to this:</p>
</li>
</ol>
<pre><code class="lang-yaml"><span class="hljs-attr">project:</span> <span class="hljs-string">home-lab</span>
<span class="hljs-attr">destination:</span>
  <span class="hljs-attr">server:</span> <span class="hljs-string">https://172.16.99.25:6443</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">kube-system</span>
<span class="hljs-attr">syncPolicy:</span>
  <span class="hljs-attr">syncOptions:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">CreateNamespace=true</span>
<span class="hljs-attr">sources:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">repoURL:</span> <span class="hljs-string">https://helm.traefik.io/traefik</span>
    <span class="hljs-attr">targetRevision:</span> <span class="hljs-number">35.0</span><span class="hljs-number">.1</span>
    <span class="hljs-attr">helm:</span>
      <span class="hljs-attr">valueFiles:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-string">$values/apps/traefik/values-lab.yml</span>
    <span class="hljs-attr">chart:</span> <span class="hljs-string">traefik</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">repoURL:</span> <span class="hljs-string">https://github.com/leothelyon17/kubernetes-gitops-playground.git</span>
    <span class="hljs-attr">targetRevision:</span> <span class="hljs-string">HEAD</span>
    <span class="hljs-attr">ref:</span> <span class="hljs-string">values</span>
</code></pre>
<p><strong>Explanation:</strong></p>
<ul>
<li><p>The <strong>first source</strong> references the official Traefik Helm repository, specifying the chart version.</p>
</li>
<li><p>The <strong>second source</strong> references my GitHub repo (or your own), where your custom Helm values (<code>values-lab.yml</code>) are stored.</p>
</li>
<li><p>ArgoCD merges these values when syncing Traefik, allowing environment-specific customizations, such as ingress rules, TLS settings, dashboard exposure, middleware options, and other important configurations.</p>
</li>
</ul>
<p>Once you've updated and saved this manifest, ArgoCD will apply the changes, and Traefik will deploy using your customized configuration, all neatly managed by GitOps.</p>
<h3 id="heading-step-4-sync-the-traefik-application"><strong>Step 4: Sync the Traefik Application</strong></h3>
<p>Once everything looks good, click <strong>Sync</strong> in ArgoCD. It will pull the Traefik Helm chart, merge your custom Helm values (<code>values-lab.yml</code>), and deploy Traefik into your cluster.</p>
<p>You can click into the application details to watch Traefik’s resources spin up: Deployments, Services, IngressRoutes, and more. If the sync fails initially, don’t worry; just retry it.</p>
<p>After a short period, you should see Traefik showing a status of <strong>“Healthy”</strong> and <strong>“Synced”</strong>. Verify that Traefik pods are running successfully in your cluster (similar to MetalLB earlier).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745375690165/0b4a9843-3a7e-4306-9622-6dd80cf3bc32.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745375720682/203d51f8-d54a-4a20-9671-1bffe72cb7ac.png" alt class="image--center mx-auto" /></p>
<p>Congratulations! Traefik is now up and running as your ingress controller, ready to handle external HTTP(S) traffic into your cluster.</p>
<h2 id="heading-traefik-custom-helm-values"><strong>Traefik Custom Helm Values</strong></h2>
<p>Let’s take a look at the custom Helm values we’re using for Traefik, pulled from <a target="_blank" href="https://github.com/leothelyon17/kubernetes-gitops-playground/blob/main/apps/traefik/values-lab.yml"><code>apps/traefik/values-lab.yml</code></a>. These provide a simple but functional starting point for ingress, dashboard access, and authentication in a lab environment.</p>
<h3 id="heading-key-configuration-highlights"><strong>Key Configuration Highlights</strong></h3>
<h4 id="heading-ingressroute-for-the-traefik-dashboard">IngressRoute for the Traefik Dashboard</h4>
<pre><code class="lang-yaml"><span class="hljs-attr">ingressRoute:</span>
  <span class="hljs-attr">dashboard:</span>
    <span class="hljs-attr">enabled:</span> <span class="hljs-literal">true</span>
    <span class="hljs-attr">matchRule:</span> <span class="hljs-string">Host(`YOUR-URL`)</span>
    <span class="hljs-attr">entryPoints:</span> [<span class="hljs-string">"web"</span>, <span class="hljs-string">"websecure"</span>]
    <span class="hljs-attr">middlewares:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">traefik-dashboard-auth</span>
</code></pre>
<ul>
<li><p><strong>Enables the Traefik dashboard</strong> and exposes it via both HTTP and HTTPS.</p>
</li>
<li><p>Routes traffic based on hostname (e.g., <code>traefik-dashboard-lab.jjland.local</code>).</p>
</li>
<li><p>Adds a middleware for basic authentication to protect access.</p>
</li>
</ul>
<h4 id="heading-basic-authentication-middleware">Basic Authentication Middleware</h4>
<pre><code class="lang-yaml"><span class="hljs-attr">extraObjects:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">kind:</span> <span class="hljs-string">Secret</span>
    <span class="hljs-attr">type:</span> <span class="hljs-string">kubernetes.io/basic-auth</span>
    <span class="hljs-attr">stringData:</span>
      <span class="hljs-attr">username:</span> <span class="hljs-string">admin</span>
      <span class="hljs-attr">password:</span> <span class="hljs-string">changeme</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">kind:</span> <span class="hljs-string">Middleware</span>
    <span class="hljs-attr">spec:</span>
      <span class="hljs-attr">basicAuth:</span>
        <span class="hljs-attr">secret:</span> <span class="hljs-string">traefik-dashboard-auth-secret</span>
</code></pre>
<ul>
<li><p>Creates a <strong>Kubernetes Secret</strong> with hardcoded credentials (<code>admin</code> / <code>changeme</code>).</p>
</li>
<li><p>Defines a <strong>Traefik Middleware</strong> that references the secret and applies HTTP basic auth to protected routes.</p>
</li>
</ul>
<blockquote>
<p><strong>NOTE:</strong> These credentials are hardcoded and intended only for lab/demo use. You should absolutely replace <code>"changeme"</code> with a strong, securely managed password, or better yet, use a more robust authentication mechanism in production.</p>
</blockquote>
<h4 id="heading-static-loadbalancer-ip-assignment">Static LoadBalancer IP Assignment</h4>
<pre><code class="lang-yaml"><span class="hljs-attr">service:</span>
  <span class="hljs-attr">spec:</span>
    <span class="hljs-attr">loadBalancerIP:</span> <span class="hljs-string">&lt;YOUR</span> <span class="hljs-string">IP</span> <span class="hljs-string">SET</span> <span class="hljs-string">ASIDE</span> <span class="hljs-string">BY</span> <span class="hljs-string">METALLB&gt;</span>
</code></pre>
<ul>
<li>This assigns a <strong>specific external IP</strong> to Traefik’s LoadBalancer service, ensuring stable access through MetalLB.</li>
</ul>
<h3 id="heading-accessing-the-dashboard"><strong>Accessing the Dashboard</strong></h3>
<p>Once deployed and synced in ArgoCD, you can access the Traefik dashboard by visiting the URL set in the custom values file.</p>
<p>To make this work:</p>
<ul>
<li><p>Add a <strong>DNS record</strong> (or local <code>/etc/hosts</code> entry) pointing to your Traefik service IP (in my case, <code>172.16.99.30</code>).</p>
</li>
<li><p>Use the credentials you set in the values file (<code>admin</code> / <code>changeme</code>) to log in via the basic auth prompt.</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745375844745/dbe816a6-6f02-48c1-9ccc-867f70ca8dc9.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-why-it-matters-1"><strong>Why It Matters</strong></h3>
<p>This configuration gives you:</p>
<ul>
<li><p>A working Traefik dashboard protected by basic auth</p>
</li>
<li><p>A predictable IP address exposed by MetalLB</p>
</li>
<li><p>A GitOps-managed ingress setup, all stored in Git and synced automatically via ArgoCD</p>
</li>
</ul>
<p>These are just <strong>starter settings</strong>. They work great in a lab, but you’ll want to harden and expand them for production use. Still, even at this basic level, you’re getting all the core benefits: visibility, consistency, and version-controlled configuration.</p>
<p>Let’s move on to the next part of the platform.</p>
<h1 id="heading-rook-ceph-persistent-storage-for-stateful-applications"><strong>Rook + Ceph: Persistent Storage for Stateful Applications</strong></h1>
<p>So far, we’ve deployed the pieces that make your platform accessible: MetalLB for external IPs, and Traefik for routing traffic. But modern platforms don’t just serve traffic; they store data. If you’re planning to run apps like Nautobot, NetBox, or Postgres, you’ll need reliable, persistent storage to keep data alive across restarts and node failures.</p>
<p>That’s where <strong>Rook + Ceph</strong> comes in.</p>
<h2 id="heading-what-is-rook-ceph"><strong>What is Rook + Ceph?</strong></h2>
<p><strong>Ceph</strong> is a distributed storage system that provides block, object, and file storage, all highly available and scalable. It’s used in enterprise environments for cloud-native storage, and it’s rock solid.</p>
<p><strong>Rook</strong> is the Kubernetes operator that makes deploying and managing Ceph clusters easier and more native to the Kubernetes ecosystem. Together, they turn a set of disks across your nodes into a <strong>resilient, self-healing storage platform</strong>.</p>
<h2 id="heading-why-you-need-it-2">Why You Need It</h2>
<p>Kubernetes doesn’t come with a built-in storage backend. While it allows you to declare <code>PersistentVolumeClaims</code>, it’s up to you to provide the actual storage behind them. In cloud environments, that’s easy: just hook into EBS, Azure Disks, or whatever your platform provides. But in a lab or on-prem cluster? You’re on your own.</p>
<p><strong>Rook + Ceph fills that gap</strong>. Once deployed, it becomes your cluster’s dynamic, self-healing storage layer. You can provision persistent volumes for any stateful workload (databases, internal tooling, monitoring stacks, and more) without having to manually manage local disks or worry about data loss.</p>
<h2 id="heading-how-it-powers-the-platform-2">How It Powers the Platform</h2>
<p>Rook + Ceph is the backbone of persistent infrastructure in this setup. It enables you to:</p>
<ul>
<li><p><strong>Create</strong> <code>PersistentVolumes</code> dynamically, on demand, using <code>StorageClass</code> definitions</p>
</li>
<li><p><strong>Run stateful apps</strong> like NetBox, Nautobot, PostgreSQL, and Prometheus with confidence</p>
</li>
<li><p><strong>Survive pod restarts and node reboots</strong>: your data stays intact and available</p>
</li>
<li><p><strong>Manage it all declaratively</strong>, deployed and version-controlled with ArgoCD, just like everything else</p>
</li>
</ul>
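<p>From an application’s point of view, all of this boils down to an ordinary <code>PersistentVolumeClaim</code>. A sketch, where the claim name, storage class name, and size are illustrative; use whatever name your overlay’s <code>StorageClass</code> actually defines:</p>
<pre><code class="lang-yaml">apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nautobot-data               # illustrative claim name
spec:
  accessModes:
    - ReadWriteMany                 # CephFS supports shared access across pods
  storageClassName: ceph-filesystem # illustrative; match your StorageClass name
  resources:
    requests:
      storage: 10Gi
</code></pre>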
<h2 id="heading-what-this-looks-like-when-deployed">What This Looks Like When Deployed</h2>
<p>Once your Rook + Ceph configuration is applied and the cluster becomes active, you’ll effectively have a <strong>resilient, distributed storage system</strong> spanning all your nodes. In this setup:</p>
<ul>
<li><p>Ceph stores data <strong>redundantly across all three nodes</strong>, similar in concept to a 3-node <strong>RAID-1</strong> (mirrored) configuration.</p>
</li>
<li><p>When one node goes offline or a disk fails, your data is still accessible and safe.</p>
</li>
<li><p>The Ceph monitor daemons ensure quorum and cluster health, while OSDs (Object Storage Daemons) replicate data across your available storage devices (e.g., <code>/dev/vdb</code> on each node).</p>
</li>
</ul>
<p>This redundancy is built-in and automatically managed by the Ceph cluster itself, no manual RAID configuration needed. It’s a core reason why Ceph is trusted in both enterprise and lab-scale deployments.</p>
<h2 id="heading-what-were-deploying-the-operator-storagecluster">What We’re Deploying: The Operator + StorageCluster</h2>
<p>As with many Kubernetes-native tools, Rook uses the <strong>Operator pattern</strong> to manage Ceph. We’ll be deploying two key components:</p>
<ul>
<li><p><strong>The Rook-Ceph Operator</strong> – Acts as a controller that manages Ceph-specific resources and keeps everything in the desired state.</p>
</li>
<li><p><strong>A</strong> <code>CephCluster</code> resource – Defines how the storage backend should be built using the disks available across your nodes.</p>
</li>
</ul>
<blockquote>
<p><strong>What’s an Operator?</strong><br />A Kubernetes Operator is a purpose-built controller that manages complex stateful applications by watching for custom resources (like <code>CephCluster</code>) and continuously reconciling their desired state—creating, healing, scaling, and configuring everything automatically.</p>
</blockquote>
<p>By deploying both the operator and the cluster config together, we get a hands-off, fully declarative storage setup. Everything is defined in Git, synced by ArgoCD, and managed by the operator—including provisioning, recovery, and upgrades.</p>
<h3 id="heading-step-1-add-the-rook-ceph-helm-repo"><strong>Step 1: Add the Rook-Ceph Helm Repo</strong></h3>
<p>ArgoCD needs to know where to pull the Rook-Ceph Helm chart from. For this, we’ll use the official Rook Helm repository:</p>
<p><strong>Helm Repo URL:</strong></p>
<pre><code class="lang-bash">https://charts.rook.io/release
</code></pre>
<p>In the ArgoCD UI:</p>
<ul>
<li><p>Navigate to <strong>Settings → Repositories</strong></p>
</li>
<li><p>Click <strong>+ CONNECT REPO</strong></p>
</li>
<li><p>Enter the Helm repo URL listed above</p>
</li>
<li><p>Select <strong>Helm</strong> as the repository type</p>
</li>
<li><p>Optionally give it a name (e.g., <code>rook-ceph-charts</code>)</p>
</li>
<li><p>Associate the repo with your ArgoCD project (mine was <code>lab-home</code>)</p>
</li>
<li><p>No authentication is required since it’s publicly accessible</p>
</li>
<li><p>Click <strong>CONNECT</strong> to finish</p>
</li>
</ul>
<p>Once connected, ArgoCD will be able to deploy both the <strong>Rook-Ceph operator</strong> and <strong>storage cluster</strong> using this chart.</p>
<h3 id="heading-step-2-create-the-argocd-application-rook-ceph"><strong>Step 2: Create the ArgoCD Application (Rook-Ceph)</strong></h3>
<p>Now that the repo is connected, head to the <strong>Applications</strong> tab in ArgoCD and click <strong>+ NEW APP</strong> to start the deployment.</p>
<p>Here’s how to fill it out:</p>
<ul>
<li><p><strong>Application Name:</strong> <code>rook-ceph</code></p>
</li>
<li><p><strong>Project:</strong> Select your ArgoCD project (e.g., <code>lab-home</code>)</p>
</li>
<li><p><strong>Sync Policy:</strong> Manual (for now)</p>
</li>
<li><p><strong>Repository URL:</strong> Select the Rook Helm repo you just connected</p>
</li>
<li><p><strong>Chart Name:</strong> <code>rook-ceph</code></p>
</li>
<li><p><strong>Target Revision:</strong> Use <code>latest</code>, or pin to a stable version you’ve tested</p>
</li>
<li><p><strong>Cluster URL:</strong> Typically <a target="_blank" href="https://kubernetes.default.svc"><code>https://kubernetes.default.svc</code></a> if deploying in-cluster</p>
</li>
<li><p><strong>Namespace:</strong> <code>rook-ceph</code> (check the box to create it if it doesn’t exist)</p>
</li>
</ul>
<h3 id="heading-why-the-rook-ceph-namespace">Why the <code>rook-ceph</code> Namespace?</h3>
<p>Rook and Ceph manage a lot of moving parts—monitors, OSDs, managers, etc.—and isolating those components into their own namespace (<code>rook-ceph</code>) helps keep your cluster clean and easier to troubleshoot. It also aligns with common community best practices and makes upgrades and deletions much safer.</p>
<p>Once you’ve filled everything out, click <strong>CREATE</strong> to finish provisioning the application.</p>
<h3 id="heading-step-3-add-custom-helm-values-kustomize-overlay-for-rook-ceph"><strong>Step 3: Add Custom Helm Values + Kustomize Overlay for Rook-Ceph</strong></h3>
<p>Rook-Ceph is one of the more complex components in our GitOps platform. It’s not just a single deployment; it involves multiple controllers, CRDs, and cluster-level storage logic. Because of that, we’ll be using <strong>both a Helm chart (with custom values)</strong> and a <strong>Kustomize overlay</strong> to deploy it cleanly and maintainably.</p>
<p>This dual-source approach lets us:</p>
<ul>
<li><p>Use the <strong>Helm chart</strong> to install the Rook-Ceph operator and core components</p>
</li>
<li><p>Apply <strong>custom values</strong> to tailor behavior for our environment (resource tuning, monitor placement, dashboard settings, etc.)</p>
</li>
<li><p>Layer in <strong>Kustomize-based manifests</strong> for complex resources like <code>CephCluster</code>, <code>StorageClass</code>, and <code>CephFilesystem</code>, which often require more precise control</p>
</li>
</ul>
<h3 id="heading-argocd-application-sources">ArgoCD Application Sources</h3>
<p>When editing your ArgoCD Application manifest, your <code>sources</code> block will look similar to this:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">sources:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">repoURL:</span> <span class="hljs-string">https://charts.rook.io/release</span>
    <span class="hljs-attr">targetRevision:</span> <span class="hljs-string">v1.17.0</span>
    <span class="hljs-attr">helm:</span>
      <span class="hljs-attr">valueFiles:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-string">$values/apps/rook-ceph/values-lab.yml</span>
    <span class="hljs-attr">chart:</span> <span class="hljs-string">rook-ceph</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">repoURL:</span> <span class="hljs-string">https://github.com/leothelyon17/kubernetes-gitops-playground.git</span>
    <span class="hljs-attr">path:</span> <span class="hljs-string">apps/rook-ceph/overlays/lab</span>
    <span class="hljs-attr">targetRevision:</span> <span class="hljs-string">HEAD</span>
    <span class="hljs-attr">ref:</span> <span class="hljs-string">values</span>
</code></pre>
<h3 id="heading-why-both-sources">Why Both Sources?</h3>
<ul>
<li><p>The <strong>Helm chart</strong> deploys the operator and all required CRDs in the correct order.</p>
</li>
<li><p>The <strong>Kustomize overlay</strong> (from your Git repo) contains environment-specific resources like:</p>
<ul>
<li><p><strong>CephCluster</strong> – the main storage cluster definition</p>
</li>
<li><p><strong>StorageClass</strong> – so other apps can request storage using <code>PersistentVolumeClaims</code></p>
</li>
<li><p><strong>CephFilesystem</strong> – enables shared POSIX-compliant volumes for apps needing ReadWriteMany access</p>
</li>
<li><p><strong>Optional extras</strong> like <code>CephBlockPool</code> or a toolbox deployment for CLI-based Ceph management</p>
</li>
</ul>
</li>
</ul>
<blockquote>
<p>You can find these manifests in the repo under:<br /><code>apps/rook-ceph/overlays/lab/</code></p>
</blockquote>
<p>Once saved, ArgoCD will treat both sources as part of the same application and sync them together, ensuring everything is deployed in the right order and stays in sync with Git.</p>
<h2 id="heading-understanding-the-rook-ceph-overlay-managing-complexity-with-gitops"><strong>Understanding the Rook-Ceph Overlay: Managing Complexity with GitOps</strong></h2>
<p>I wanted to cover this now, before we try to sync. Setting up Rook-Ceph in a GitOps workflow involves more than just deploying a Helm chart. You’re orchestrating a sophisticated storage platform made up of tightly coupled components: an operator, CRDs, a distributed Ceph cluster, storage classes, ingress routes, and more. Each piece needs to be configured correctly and deployed in the proper order.</p>
<p>To keep all of this manageable and repeatable, we separate concerns using a combination of <strong>custom Helm values</strong> and a <strong>Kustomize overlay</strong>. The overlay found in <code>apps/rook-ceph/overlays/lab</code> brings together the critical resources required for a working Ceph deployment—block pools, shared filesystems, storage classes, and even a dashboard ingress.</p>
<p>The sections below break down each of these files so you can understand what’s happening, why it’s needed, and how it fits into the larger GitOps puzzle.</p>
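<p>Structurally, the overlay’s own <code>kustomization.yml</code> is just a resource list. A sketch; the two file names shown are the ones covered in this article, and any additional manifests in the repo (StorageClass, block pool, dashboard ingress) would be listed alongside them:</p>
<pre><code class="lang-yaml">apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: rook-ceph
resources:
  - ceph-cluster.yml
  - ceph-filesystem.yml
  # plus any StorageClass, CephBlockPool, or ingress manifests in the overlay
</code></pre>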
<h2 id="heading-appsrook-cephvalues-labyml"><code>apps/rook-ceph/values-lab.yml</code></h2>
<pre><code class="lang-yaml"><span class="hljs-attr">csi:</span>
  <span class="hljs-attr">enableRbdDriver:</span> <span class="hljs-literal">false</span>
</code></pre>
<ul>
<li><p><strong>Purpose:</strong> Disables the RBD (block-device) CSI driver in this lab setup, since we’re only using CephFS here.</p>
</li>
<li><p><strong>Why it matters:</strong> Keeps the cluster lean by not installing unused CSI components.</p>
</li>
</ul>
<h2 id="heading-appsrook-cephoverlayslab"><code>apps/rook-ceph/overlays/lab/</code></h2>
<h3 id="heading-ceph-clusteryml"><code>ceph-cluster.yml</code></h3>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">ceph.rook.io/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">CephCluster</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">rook-ceph</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">rook-ceph</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">cephVersion:</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">quay.io/ceph/ceph:v19.2.1</span>
  <span class="hljs-attr">dataDirHostPath:</span> <span class="hljs-string">/var/lib/rook</span>
  <span class="hljs-attr">mon:</span>
    <span class="hljs-attr">count:</span> <span class="hljs-number">3</span>
    <span class="hljs-attr">allowMultiplePerNode:</span> <span class="hljs-literal">false</span>
  <span class="hljs-attr">dashboard:</span>
    <span class="hljs-attr">enabled:</span> <span class="hljs-literal">true</span>
  <span class="hljs-attr">storage:</span>
    <span class="hljs-attr">useAllNodes:</span> <span class="hljs-literal">true</span>
    <span class="hljs-attr">useAllDevices:</span> <span class="hljs-literal">false</span>
    <span class="hljs-attr">deviceFilter:</span> <span class="hljs-string">vdb</span>
</code></pre>
<ul>
<li><p><strong>Defines</strong> the core <code>CephCluster</code> resource.</p>
</li>
<li><p><strong>Key settings:</strong></p>
<ul>
<li><p>Runs 3 monitors for quorum.</p>
</li>
<li><p>Uses each node’s <code>vdb</code> device for OSDs (fits your lab VM disk layout).</p>
</li>
<li><p>Enables the Ceph dashboard for visual health checks.</p>
</li>
</ul>
</li>
</ul>
<p><strong>⚠️ NOTE:</strong> These settings are specific to <strong>my 3-node lab cluster</strong>, where each node has:</p>
<ul>
<li><p>One OS disk (<code>vda</code>)</p>
</li>
<li><p>One dedicated Ceph data disk (<code>vdb</code>)</p>
</li>
</ul>
<p>Example disk layout (<code>lsblk</code> output from one node):</p>
<pre><code class="lang-bash">[jeff@rocky9-lab-node1 ~]$ lsblk
NAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
sr0          11:0    1  1.7G  0 rom  
vda         252:0    0   50G  0 disk 
├─vda1      252:1    0    1G  0 part /boot
└─vda2      252:2    0   49G  0 part 
  ├─rl-root 253:0    0   44G  0 lvm  /
  └─rl-swap 253:1    0    5G  0 lvm  
vdb         252:16   0  250G  0 disk
</code></pre>
<p>Your disk layout will likely be different. I’ve configured Ceph to use only the <code>vdb</code> disk via the <code>deviceFilter</code> setting to avoid accidentally wiping the OS disk.</p>
<p>⚠️ <strong>Be careful:</strong> If you don’t tailor these values to your hardware, you could unintentionally destroy existing data. Always verify your node’s disk setup and adjust your configuration accordingly.</p>
<h3 id="heading-ceph-filesystemyml"><code>ceph-filesystem.yml</code></h3>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">ceph.rook.io/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">CephFilesystem</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">k8s-ceph-fs</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">rook-ceph</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">metadataPool:</span>
    <span class="hljs-attr">failureDomain:</span> <span class="hljs-string">host</span>
    <span class="hljs-attr">replicated:</span>
      <span class="hljs-attr">size:</span> <span class="hljs-number">3</span>
  <span class="hljs-attr">dataPools:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">replicated</span>
      <span class="hljs-attr">failureDomain:</span> <span class="hljs-string">host</span>
      <span class="hljs-attr">replicated:</span>
        <span class="hljs-attr">size:</span> <span class="hljs-number">3</span>
  <span class="hljs-attr">preserveFilesystemOnDelete:</span> <span class="hljs-literal">true</span>
  <span class="hljs-attr">metadataServer:</span>
    <span class="hljs-attr">activeCount:</span> <span class="hljs-number">1</span>
    <span class="hljs-attr">activeStandby:</span> <span class="hljs-literal">true</span>
</code></pre>
<ul>
<li><p><strong>Creates</strong> a <code>CephFilesystem</code> (CephFS) for <strong>shared, POSIX-style volumes</strong>.</p>
</li>
<li><p><strong>Why CephFS?</strong> Enables <code>ReadWriteMany</code> storage, which block pools alone can’t provide.</p>
</li>
</ul>
<h3 id="heading-ceph-storageclass-deleteyml-amp-ceph-storageclass-retainyml"><code>ceph-storageclass-delete.yml</code> &amp; <code>ceph-storageclass-retain.yml</code></h3>
<p>Both define Kubernetes <code>StorageClass</code> objects that front the CephFS CSI driver:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">storage.k8s.io/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">StorageClass</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">rook-cephfs-delete</span>      <span class="hljs-comment"># or rook-cephfs-retain</span>
<span class="hljs-attr">provisioner:</span> <span class="hljs-string">rook-ceph.cephfs.csi.ceph.com</span>
<span class="hljs-attr">parameters:</span>
  <span class="hljs-attr">clusterID:</span> <span class="hljs-string">rook-ceph</span>
  <span class="hljs-attr">fsName:</span> <span class="hljs-string">k8s-ceph-fs</span>
  <span class="hljs-attr">pool:</span> <span class="hljs-string">k8s-ceph-fs-replicated</span>
  <span class="hljs-attr">csi.storage.k8s.io/provisioner-secret-name:</span> <span class="hljs-string">rook-csi-cephfs-provisioner</span>
  <span class="hljs-attr">csi.storage.k8s.io/node-stage-secret-name:</span> <span class="hljs-string">rook-csi-cephfs-node</span>
<span class="hljs-attr">reclaimPolicy:</span> <span class="hljs-string">Delete</span>       <span class="hljs-comment"># or Retain</span>
<span class="hljs-attr">allowVolumeExpansion:</span> <span class="hljs-literal">true</span>
</code></pre>
<ul>
<li><p><strong>Difference:</strong></p>
<ul>
<li><p><code>rook-cephfs-delete</code> will <strong>delete</strong> PV data when PVCs are removed.</p>
</li>
<li><p><code>rook-cephfs-retain</code> will <strong>retain</strong> data for manual cleanup or backup.</p>
</li>
</ul>
</li>
<li><p><strong>Why two classes?</strong> Gives you flexibility for different workloads (ephemeral test vs. persistent data).</p>
</li>
</ul>
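<p>To make this concrete, here’s a rough sketch of a PVC that requests shared <code>ReadWriteMany</code> storage from one of these classes; the claim name, namespace, and size are hypothetical placeholders:</p>
<pre><code class="lang-yaml">apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data            # hypothetical name
  namespace: default           # hypothetical namespace
spec:
  accessModes:
    - ReadWriteMany            # possible because CephFS backs the class
  resources:
    requests:
      storage: 5Gi
  storageClassName: rook-cephfs-delete   # or rook-cephfs-retain for data you want to keep
</code></pre>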
<h3 id="heading-ingress-route-guiyml"><code>ingress-route-gui.yml</code></h3>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">traefik.io/v1alpha1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">IngressRoute</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">ceph-ingressroute-gui</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">rook-ceph</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">entryPoints:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">web</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">websecure</span>
  <span class="hljs-attr">routes:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">match:</span> <span class="hljs-string">Host(`ceph-dashboard-lab.jjland.local`)</span> <span class="hljs-comment"># EXAMPLE</span>
      <span class="hljs-attr">kind:</span> <span class="hljs-string">Rule</span>
      <span class="hljs-attr">services:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">rook-ceph-mgr-dashboard</span>
          <span class="hljs-attr">port:</span> <span class="hljs-number">7000</span>
</code></pre>
<ul>
<li><p><strong>Exposes</strong> the Ceph dashboard through Traefik on your chosen host.</p>
</li>
<li><p><strong>Why:</strong> Lets you reach the Ceph UI (after DNS/hosts setup) without manually port-forwarding.</p>
</li>
</ul>
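<p>If you don’t run DNS in your lab, a local <code>/etc/hosts</code> entry pointing my example hostname at your Traefik ingress IP is enough (the IP below is hypothetical):</p>
<pre><code class="lang-bash"># /etc/hosts entry (replace with your own MetalLB/Traefik ingress IP)
192.168.1.240  ceph-dashboard-lab.jjland.local
</code></pre>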
<h3 id="heading-kustomizationyml-1"><code>kustomization.yml</code></h3>
<pre><code class="lang-yaml"><span class="hljs-attr">resources:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-string">ceph-cluster.yml</span>
  <span class="hljs-bullet">-</span> <span class="hljs-string">ingress-route-gui.yml</span>
  <span class="hljs-bullet">-</span> <span class="hljs-string">ceph-filesystem.yml</span>
  <span class="hljs-bullet">-</span> <span class="hljs-string">ceph-storageclass-delete.yml</span>
  <span class="hljs-bullet">-</span> <span class="hljs-string">ceph-storageclass-retain.yml</span>
</code></pre>
<ul>
<li><p><strong>Aggregates</strong> all the above files into a single overlay that ArgoCD can sync.</p>
</li>
<li><p><strong>Why Kustomize?</strong> Keeps base Helm installs separate from environment-specific definitions, making updates cleaner and more maintainable.</p>
</li>
</ul>
<h3 id="heading-step-4-sync-the-rook-ceph-application"><strong>Step 4: Sync the Rook-Ceph Application</strong></h3>
<p>Ready? Go ahead and click <strong>Sync</strong> in ArgoCD for the <code>rook-ceph</code> application.</p>
<p>This one’s going to take a little more time, and for good reason. There’s a lot happening under the hood.</p>
<p>When you sync, ArgoCD will:</p>
<ul>
<li><p>Deploy the <strong>Rook-Ceph Operator</strong>, which is responsible for watching and managing Ceph resources in your cluster</p>
</li>
<li><p>Install <strong>CephFS CSI drivers</strong>, RBAC roles, and CRDs needed to support persistent volumes</p>
</li>
<li><p>Apply your <code>CephCluster</code>, <code>CephFilesystem</code>, and <code>StorageClass</code> definitions via the Kustomize overlay</p>
</li>
</ul>
<p>But the real magic starts <strong>after the operator is running</strong>.</p>
<p>Once the operator is up, it will immediately start watching for additional Ceph custom resources in the <code>rook-ceph</code> namespace. When it discovers the <code>CephCluster</code> definition, it will:</p>
<ul>
<li><p>Initialize the <strong>monitors</strong> (MONs) for quorum</p>
</li>
<li><p>Deploy the <strong>manager</strong> (MGR) for handling cluster state and dashboard</p>
</li>
<li><p>Start spinning up the <strong>OSDs</strong> (Object Storage Daemons) using the storage devices you specified (in this case, <code>vdb</code> on each node)</p>
</li>
</ul>
<p>This process can take several minutes depending on your hardware, node performance, and the size of your disks.</p>
<blockquote>
<p><strong>How do you know it worked?</strong><br />The cluster is healthy when you see:</p>
<ul>
<li><p><strong>3 running OSD pods</strong>, one for each disk across your 3 nodes</p>
</li>
<li><p>The <code>rook-ceph</code> application status in ArgoCD shows <strong>“Healthy”</strong> and <strong>“Synced”</strong></p>
</li>
<li><p>Optionally: access the Ceph dashboard and verify health checks (covered earlier)</p>
</li>
</ul>
</blockquote>
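<p>If you prefer the CLI to the ArgoCD UI, a couple of <code>kubectl</code> commands confirm the same thing (assuming your kubeconfig points at the lab cluster):</p>
<pre><code class="lang-bash"># Watch the operator bring up the mon, mgr, and OSD pods
kubectl -n rook-ceph get pods -w

# Check the health the operator reports on the cluster resource
kubectl -n rook-ceph get cephcluster rook-ceph
</code></pre>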
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745381983708/e34e0bd9-e2a4-47e0-a7e4-73fec8ededf3.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745382021147/c5657415-988e-4e35-be7f-645ea2bc5bd9.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-troubleshooting-tips">Troubleshooting Tips</h2>
<p>Rook-Ceph is powerful, but complex. And with that complexity comes the potential for a lot of things to go sideways. I won’t dive into every failure mode here, but I’ll leave you with a few quick tips that can help when something’s not working as expected:</p>
<ul>
<li><p><strong>Use the ArgoCD UI to inspect pod logs.</strong><br />  Click into the <code>rook-ceph</code> application, navigate to the "PODS" tab, and use the logs view to get real-time output from key components like the operator, mons, OSDs, and mgr. Most issues will reveal themselves here.</p>
<p>  <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745382219139/9aa55b8c-e93d-4c20-b606-6af283220c6c.png" alt class="image--center mx-auto" /></p>
</li>
<li><p><strong>Resync the operator app to restart it.</strong><br />  If the cluster gets stuck or fails to initialize certain pieces, manually syncing the operator application in ArgoCD will redeploy the pod. This is often enough to force a retry or pull in updated CRDs.</p>
<p>  <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745382285748/83df49ee-635a-4ca7-bfb3-5e58f56657b0.png" alt class="image--center mx-auto" /></p>
</li>
<li><p><strong>Disk issues?</strong><br />  If Ceph is skipping disks or refusing to reuse them, it’s usually leftover metadata. Try running a full zap with <code>ceph-volume</code> or fall back to <code>wipefs</code>, <code>sgdisk</code>, and <code>dd</code> to fully clean the disk.</p>
</li>
</ul>
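<p>As a sketch, that fallback cleanup might look like the following. <strong>These commands are destructive</strong>: verify the device with <code>lsblk</code> first, and only ever run them against the dedicated Ceph disk (<code>vdb</code> in my lab), never the OS disk:</p>
<pre><code class="lang-bash"># DESTRUCTIVE: erases everything on /dev/vdb
wipefs --all /dev/vdb                                       # remove filesystem/RAID signatures
sgdisk --zap-all /dev/vdb                                   # clear GPT and MBR partition tables
dd if=/dev/zero of=/dev/vdb bs=1M count=100 oflag=direct    # zero the first 100 MB
</code></pre>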
<p>Congratulations! Once everything is green, you now have a fully functional Ceph storage backend—redundant, self-healing, and fully managed through GitOps.</p>
<h1 id="heading-secrets-management-external-secrets-hashicorp-vault">Secrets Management: External Secrets + HashiCorp Vault</h1>
<p>In any production platform, secrets management isn’t optional; it’s foundational. We're talking about things like API tokens, database passwords, SSH keys, and TLS certs. Storing these directly in your Git repo? Not an option. Hardcoding them into manifests? Definitely not.</p>
<p>That’s where <strong>External Secrets</strong> and <strong>HashiCorp Vault</strong> come in, and together, they solve this problem the right way.</p>
<h2 id="heading-what-is-hashicorp-vault">What is HashiCorp Vault?</h2>
<p><strong>Vault</strong> is a centralized secrets manager that securely stores, encrypts, and dynamically serves secrets to applications and users. It supports access control, auditing, and integration with identity systems and cloud providers. In this stack, Vault acts as the secure system of record for all sensitive data.</p>
<h2 id="heading-what-is-external-secrets">What is External Secrets?</h2>
<p><strong>External Secrets</strong> is a Kubernetes operator that bridges external secret stores (like Vault) with native Kubernetes <code>Secret</code> objects. It watches for custom resources like <code>ExternalSecret</code> and automatically pulls values from Vault into the cluster, keeping them updated and consistent without manual intervention.</p>
<h2 id="heading-why-network-automation-needs-this">Why Network Automation Needs This</h2>
<p>Network automation platforms—like NetBox, Nautobot, and custom Python tooling—frequently need access to sensitive data:</p>
<ul>
<li><p>Device credentials for SSH or API-based provisioning</p>
</li>
<li><p>Authentication tokens for systems like GitHub, Slack, or ServiceNow</p>
</li>
<li><p>Vaulted credentials for orchestrating changes via Ansible or Nornir</p>
</li>
</ul>
<p>You don’t want these values floating around in plaintext in Git. But you still want to <strong>declare your intent</strong> (what secrets are needed and where) in version control. This is especially critical when you're deploying infrastructure with GitOps and need environments to be reproducible and secure.</p>
<p>With Vault + External Secrets, you can:</p>
<ul>
<li><p>Keep the actual secret values <strong>outside of Git</strong></p>
</li>
<li><p>Still declare your <code>ExternalSecret</code> manifests <strong>in Git</strong> as part of your ArgoCD-managed platform</p>
</li>
<li><p>Let Kubernetes handle syncing and refreshing secrets automatically</p>
</li>
</ul>
<p>This pattern ensures your network automation stack is <strong>secure, scalable, and compliant</strong>, without losing any GitOps benefits.</p>
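<p>To make “declaring intent in Git” concrete, here’s a rough sketch of an <code>ExternalSecret</code> that pulls a device credential out of Vault. The store name, Vault path, and key names are hypothetical; we’ll wire up the actual backend after Vault is installed:</p>
<pre><code class="lang-yaml">apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: device-credentials         # hypothetical
  namespace: network-automation    # hypothetical
spec:
  refreshInterval: 1h              # re-sync from Vault every hour
  secretStoreRef:
    name: vault-backend            # hypothetical ClusterSecretStore
    kind: ClusterSecretStore
  target:
    name: device-credentials       # name of the resulting Kubernetes Secret
  data:
    - secretKey: password
      remoteRef:
        key: network/devices       # hypothetical Vault path
        property: password
</code></pre>
<p>Only this manifest lives in Git; the password itself never leaves Vault until the operator syncs it into the cluster.</p>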
<h2 id="heading-installing-external-secrets-operator">Installing External Secrets Operator</h2>
<p>Setting up External Secrets is straightforward and follows the same pattern we’ve used throughout this platform. In this section, we’ll deploy the External Secrets Operator using its official Helm chart with default values; no custom overlays or secret stores just yet.</p>
<h3 id="heading-step-1-add-the-helm-repo-1">Step 1: Add the Helm Repo</h3>
<p>First, add the External Secrets Helm repository to ArgoCD:</p>
<ol>
<li><p>In the ArgoCD UI, go to <strong>Settings → Repositories</strong></p>
</li>
<li><p>Click <strong>+ CONNECT REPO</strong></p>
</li>
<li><p>Fill in the following:</p>
<ul>
<li><p><strong>Type:</strong> Helm</p>
</li>
<li><p><strong>URL:</strong> <a target="_blank" href="https://charts.external-secrets.io"><code>https://charts.external-secrets.io</code></a></p>
</li>
<li><p><strong>Name (optional):</strong> external-secrets</p>
</li>
<li><p><strong>Project:</strong> Choose your ArgoCD project (e.g., <code>lab-home</code>)</p>
</li>
<li><p><strong>Authentication:</strong> Leave empty (this is a public repo)</p>
</li>
</ul>
</li>
<li><p>Click <strong>CONNECT</strong> to save</p>
</li>
</ol>
<h3 id="heading-step-2-create-the-argocd-application-1">Step 2: Create the ArgoCD Application</h3>
<p>Navigate to <strong>Applications → + NEW APP</strong>, and fill out the form like this:</p>
<ul>
<li><p><strong>Application Name:</strong> external-secrets</p>
</li>
<li><p><strong>Project:</strong> lab-home (or your equivalent)</p>
</li>
<li><p><strong>Sync Policy:</strong> Manual</p>
</li>
<li><p><strong>Repository URL:</strong> Select the Helm repo you just added</p>
</li>
<li><p><strong>Chart:</strong> <code>external-secrets</code></p>
</li>
<li><p><strong>Target Revision:</strong> latest (or a specific version like <code>0.16.1</code>)</p>
</li>
<li><p><strong>Cluster URL:</strong> <a target="_blank" href="https://kubernetes.default.svc"><code>https://kubernetes.default.svc</code></a></p>
</li>
<li><p><strong>Namespace:</strong> <code>external-secrets</code><br />  <em>(Check the box to create the namespace if it doesn’t exist)</em></p>
</li>
</ul>
<p>Click <strong>CREATE</strong> to finish.</p>
<h3 id="heading-step-3-sync-the-application">Step 3: Sync the Application</h3>
<p>Once the app is created, hit <strong>SYNC</strong> in the ArgoCD UI. This will:</p>
<ul>
<li><p>Deploy the External Secrets Operator into your cluster</p>
</li>
<li><p>Create the necessary CRDs and controller components</p>
</li>
<li><p>Make the <code>ExternalSecret</code>, <code>SecretStore</code>, and <code>ClusterSecretStore</code> resource types available</p>
</li>
</ul>
<p>You should see the app enter a <strong>Synced</strong> and <strong>Healthy</strong> state once everything is up and running. No custom values or overlays are needed at this stage.</p>
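<p>If you want to double-check from the CLI, the new resource types and controller pods should be visible (exact names can vary by chart version):</p>
<pre><code class="lang-bash">kubectl get crds | grep external-secrets.io
kubectl -n external-secrets get pods
</code></pre>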
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745417501986/ca879d52-362e-4da4-87df-14e68decf18b.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-installing-hashicorp-vault">Installing HashiCorp Vault</h2>
<p>Vault is our centralized secrets store, and in this setup we’re deploying it with two main goals in mind:</p>
<ul>
<li><p>Enable its built-in GUI for easy inspection and management</p>
</li>
<li><p>Ensure secret data is persisted using our Rook-Ceph-backed StorageClass</p>
</li>
</ul>
<p>To accomplish this, we’ll combine a Helm-based deployment with a Kustomize overlay that adds a Traefik <code>IngressRoute</code> for secure browser access.</p>
<h3 id="heading-step-1-add-the-helm-repo-2">Step 1: Add the Helm Repo</h3>
<p>Add the official HashiCorp Helm chart repo to ArgoCD:</p>
<ol>
<li><p>In the ArgoCD UI, go to <strong>Settings → Repositories</strong></p>
</li>
<li><p>Click <strong>+ CONNECT REPO</strong></p>
</li>
<li><p>Fill in:</p>
<ul>
<li><p><strong>Type:</strong> Helm</p>
</li>
<li><p><strong>URL:</strong> <code>https://helm.releases.hashicorp.com</code></p>
</li>
<li><p><strong>Project:</strong> <code>lab-home</code> (or whatever you're using)</p>
</li>
<li><p><strong>Authentication:</strong> Leave blank (public repo)</p>
</li>
</ul>
</li>
<li><p>Click <strong>CONNECT</strong> to save</p>
</li>
</ol>
<h3 id="heading-step-2-prepare-your-vault-application">Step 2: Prepare Your Vault Application</h3>
<p>Vault is more stateful and config-heavy than most apps, so we’re using <strong>two sources</strong> in our ArgoCD Application:</p>
<ul>
<li><p>A Helm chart to install Vault and enable persistent storage</p>
</li>
<li><p>A Kustomize overlay that exposes the Vault UI through Traefik</p>
</li>
</ul>
<p>Here’s an example Application manifest (adjust values as needed for your setup):</p>
<pre><code class="lang-yaml"><span class="hljs-attr">project:</span> <span class="hljs-string">lab-home</span>
<span class="hljs-attr">destination:</span>
  <span class="hljs-attr">server:</span> <span class="hljs-string">https://kubernetes.default.svc</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">vault</span>
<span class="hljs-attr">syncPolicy:</span>
  <span class="hljs-attr">syncOptions:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">CreateNamespace=true</span>
<span class="hljs-attr">sources:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">repoURL:</span> <span class="hljs-string">https://helm.releases.hashicorp.com</span>
    <span class="hljs-attr">chart:</span> <span class="hljs-string">vault</span>
    <span class="hljs-attr">targetRevision:</span> <span class="hljs-number">0.30</span><span class="hljs-number">.0</span> <span class="hljs-comment"># or latest stable</span>
    <span class="hljs-attr">helm:</span>
      <span class="hljs-attr">valueFiles:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-string">$values/apps/hashicorp-vault/values-lab.yml</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">repoURL:</span> <span class="hljs-string">https://github.com/leothelyon17/kubernetes-gitops-playground.git</span>
    <span class="hljs-attr">targetRevision:</span> <span class="hljs-string">HEAD</span>
    <span class="hljs-attr">path:</span> <span class="hljs-string">apps/hashicorp-vault/overlays/lab</span>
    <span class="hljs-attr">ref:</span> <span class="hljs-string">values</span>
</code></pre>
<blockquote>
<p><strong>Note:</strong> The Git repo and folder structure here are based on my <a target="_blank" href="https://github.com/leothelyon17/kubernetes-gitops-playground">kubernetes-gitops-playground</a>. If you’re using your own repo, be sure to adjust the <code>repoURL</code>, <code>path</code>, and <code>valueFiles</code> references accordingly.</p>
</blockquote>
<h3 id="heading-step-3-custom-helm-values">Step 3: Custom Helm Values</h3>
<p>In your Git repo, the file at <code>apps/hashicorp-vault/values-lab.yml</code> should enable:</p>
<ul>
<li><p>The Vault UI (<code>ui: true</code>)</p>
</li>
<li><p>Persistent storage via the Rook-Ceph-backed <code>StorageClass</code> you created earlier</p>
</li>
</ul>
<p>Example configuration:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">server:</span>

  <span class="hljs-attr">dataStorage:</span>
    <span class="hljs-attr">enabled:</span> <span class="hljs-literal">true</span>
    <span class="hljs-comment"># Size of the PVC created</span>
    <span class="hljs-attr">size:</span> <span class="hljs-string">1Gi</span>
    <span class="hljs-comment"># Location where the PVC will be mounted.</span>
    <span class="hljs-attr">mountPath:</span> <span class="hljs-string">"/vault/data"</span>
    <span class="hljs-comment"># Name of the storage class to use.  If null it will use the</span>
    <span class="hljs-comment"># configured default Storage Class.</span>
    <span class="hljs-attr">storageClass:</span> <span class="hljs-string">rook-cephfs-retain</span>
    <span class="hljs-comment"># Access Mode of the storage device being used for the PVC</span>
    <span class="hljs-attr">accessMode:</span> <span class="hljs-string">ReadWriteOnce</span>

<span class="hljs-comment"># Vault UI</span>
<span class="hljs-attr">ui:</span>
  <span class="hljs-attr">enabled:</span> <span class="hljs-literal">true</span>
</code></pre>
<h3 id="heading-step-4-expose-vault-securely-with-traefik">Step 4: Expose Vault Securely with Traefik</h3>
<p>In your <code>apps/hashicorp-vault/overlays/lab</code> directory, define a Kustomize file to expose the UI via Traefik.</p>
<p>Example: <code>kustomization.yml</code></p>
<pre><code class="lang-yaml"><span class="hljs-attr">resources:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-string">ingress-route-gui.yml</span>
</code></pre>
<p>And in <code>ingress-route-gui.yml</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">traefik.io/v1alpha1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">IngressRoute</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">vault-dashboard</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">vault</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">entryPoints:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">websecure</span>
  <span class="hljs-attr">routes:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">match:</span> <span class="hljs-string">Host(`vault-lab.jjland.local`)</span> <span class="hljs-comment"># EXAMPLE</span>
      <span class="hljs-attr">kind:</span> <span class="hljs-string">Rule</span>
      <span class="hljs-attr">services:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">vault</span>
          <span class="hljs-attr">port:</span> <span class="hljs-number">8200</span>
</code></pre>
<blockquote>
<p><strong>Note:</strong> <code>vault-lab.jjland.local</code> is an example hostname used in my lab.<br />If you're following along exactly, feel free to use it, just be sure to add a local DNS or <code>/etc/hosts</code> entry that maps this to your cluster’s ingress IP.<br />Otherwise, replace this hostname with one appropriate for your environment.</p>
</blockquote>
<h3 id="heading-step-5-sync-the-application">Step 5: Sync the Application</h3>
<p>Once your Helm values and Kustomize overlay are in place and committed to Git, go ahead and <strong>sync the Vault application from ArgoCD</strong>.</p>
<p>ArgoCD will deploy all Vault components into the <code>vault</code> namespace, including:</p>
<ul>
<li><p>The StatefulSet for the Vault server</p>
</li>
<li><p>The service account, RBAC roles, and services</p>
</li>
<li><p>A PersistentVolumeClaim (PVC) for storing Vault data</p>
</li>
<li><p>Your custom IngressRoute for exposing the GUI</p>
</li>
</ul>
<p>After syncing, head to the <strong>Vault app in ArgoCD</strong> to verify the following:</p>
<ul>
<li><p>The app status should be <strong>Synced</strong></p>
</li>
<li><p>The PVC should be <strong>Bound</strong> and <strong>Healthy</strong></p>
</li>
<li><p>The main Vault pod will likely remain in a <strong>Progressing</strong> state; this is expected</p>
</li>
</ul>
<p>That <strong>“Progressing” status is normal</strong> because Vault isn’t fully initialized yet. It won’t report itself as ready until it has been <strong>manually initialized and unsealed</strong> for the first time.</p>
<p>Before moving forward, it’s a good idea to:</p>
<ul>
<li><p>Inspect the pod logs in the ArgoCD UI if anything seems stuck</p>
</li>
<li><p>Check <code>kubectl get pvc -n vault</code> to confirm the PVC is attached and healthy</p>
</li>
<li><p>Use <code>kubectl describe pod</code> or <code>describe pvc</code> to troubleshoot issues</p>
</li>
</ul>
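<p>You can also ask Vault directly, assuming the chart’s default StatefulSet naming (<code>vault-0</code>). Seeing <code>Initialized: false</code> and <code>Sealed: true</code> is expected at this stage:</p>
<pre><code class="lang-bash"># vault status exits non-zero while sealed; that's normal here
kubectl -n vault exec vault-0 -- vault status
</code></pre>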
<p>If all looks good, navigate to the Vault UI in your browser:</p>
<pre><code class="lang-bash">https://vault-lab.jjland.local <span class="hljs-comment"># EXAMPLE</span>
</code></pre>
<blockquote>
<p>If you’re using a different hostname, be sure you’ve created the appropriate DNS or <code>/etc/hosts</code> entry.</p>
</blockquote>
<p>From the web UI, you can <strong>initialize Vault</strong>, generate unseal keys, and perform the first unseal operation, all interactively.</p>
<h2 id="heading-initializing-vault-through-the-gui">Initializing Vault Through the GUI</h2>
<p>Once the Vault UI is accessible, it’s time to initialize the system. Vault doesn’t become “ready” until this step is completed, and it only needs to be done once per cluster.</p>
<h3 id="heading-step-1-open-the-vault-ui">Step 1: Open the Vault UI</h3>
<p>Navigate to the Vault dashboard in your browser:</p>
<pre><code class="lang-bash">https://vault-lab.jjland.local
</code></pre>
<p>(Or your custom hostname if you’re using a different setup.)</p>
<p>You’ll be presented with a message that Vault has not yet been initialized. Click the <strong>“Initialize”</strong> button to begin the process.</p>
<h3 id="heading-step-2-generate-unseal-keys">Step 2: Generate Unseal Keys</h3>
<p>The GUI will prompt you to configure <strong>key shares</strong> and <strong>key threshold</strong>. Leave these at the defaults unless you have a specific security model in mind:</p>
<ul>
<li><p><strong>Key Shares:</strong> <code>5</code></p>
</li>
<li><p><strong>Key Threshold:</strong> <code>3</code></p>
</li>
</ul>
<p>This means Vault will generate 5 unseal keys, and any 3 of them will be required to unseal the Vault.</p>
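<p>For reference, the CLI equivalent of this step (run inside the Vault pod) would be roughly:</p>
<pre><code class="lang-bash">kubectl -n vault exec -it vault-0 -- vault operator init \
  -key-shares=5 -key-threshold=3
</code></pre>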
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745419735235/5fa7e30b-b7ec-4361-ab8b-b9c48a24045a.png" alt class="image--center mx-auto" /></p>
<p>Click <strong>"Initialize"</strong> to proceed. Vault will generate a JSON file containing:</p>
<ul>
<li><p>The root token (used to log in as admin)</p>
</li>
<li><p>All 5 unseal keys</p>
</li>
</ul>
<p><strong>Download this file immediately</strong> and store it in a secure location. These keys cannot be recovered later.</p>
<blockquote>
<p>⚠️ <strong>Do not skip this download</strong>. If you lose these keys before unsealing, you’ll have to wipe and redeploy Vault from scratch.</p>
</blockquote>
<h3 id="heading-step-3-unseal-the-vault">Step 3: Unseal the Vault</h3>
<p>After downloading the key file, Vault will prompt you to enter the unseal keys one by one.</p>
<ul>
<li><p>Copy a single unseal key from the JSON file</p>
</li>
<li><p>Paste it into the field and click <strong>“Unseal”</strong></p>
</li>
<li><p>Repeat with two more keys (for a total of 3)</p>
</li>
</ul>
<p>Once the required threshold is met, Vault will unlock and become <strong>active</strong>.</p>
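<p>The CLI equivalent, repeated with three different keys until the threshold is met:</p>
<pre><code class="lang-bash">kubectl -n vault exec -it vault-0 -- vault operator unseal &lt;unseal-key&gt;
</code></pre>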
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745419845341/08b5e59b-e7a7-434c-9a6d-af3d6c1a546e.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-step-4-log-in-with-the-root-token">Step 4: Log In with the Root Token</h3>
<p>After unsealing, return to the login screen and paste in the <strong>root token</strong> from your downloaded JSON file.</p>
<p>Once logged in, you’ll have full admin access to Vault.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745419885929/d64ca1aa-e54a-4bf4-9f49-9c72ee525ca7.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-step-5-verify-in-argocd">Step 5: Verify in ArgoCD</h3>
<p>Flip back to the ArgoCD UI and check the status of the Vault application. At this point, the main pod should switch from <strong>Progressing</strong> to <strong>Healthy</strong>, and your application should show as <strong>fully operational</strong>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745419923132/0acf4555-4588-45e6-ae53-fdcc16245b2c.png" alt class="image--center mx-auto" /></p>
<p>You're now ready to configure Vault as a backend for External Secrets, so your GitOps-managed workloads can securely retrieve credentials, tokens, and other sensitive data on demand.</p>
<p>This completes Part 2 of this series.</p>
<h1 id="heading-summary-amp-whats-next">Summary &amp; What’s Next</h1>
<p>In Part 2, we took our GitOps foundation and turned it into a functional, production-capable platform. We integrated critical infrastructure components like MetalLB for external access, Traefik for routing, Rook-Ceph for persistent storage, and a full-fledged secrets management stack using External Secrets and HashiCorp Vault, all deployed declaratively using ArgoCD.</p>
<p>At this point, you have a GitOps-powered Kubernetes environment that’s capable of:</p>
<ul>
<li><p>Exposing services securely with external IPs and ingress rules</p>
</li>
<li><p>Persisting data across workloads using Ceph-backed volumes</p>
</li>
<li><p>Managing secrets securely without embedding them in Git</p>
</li>
<li><p>Deploying and managing infrastructure the same way you'll deploy apps: as code</p>
</li>
</ul>
<p>This platform is now ready to host real-world applications, whether it’s NetBox, Nautobot, or custom tooling built for your network automation workflows.</p>
<p>In <strong>Part 3</strong>, we’ll finally do just that: deploy a real application on top of everything we’ve built. I haven’t finalized which app we’ll use yet, but it’ll be something practical and network-engineer focused. Stay tuned and thank you for reading!</p>
]]></content:encoded></item><item><title><![CDATA[Bridging the Gap: GitOps for Network Engineers - Part 1]]></title><description><![CDATA[Intro
Over the past 6–9 months, my career and perspective on technology have shifted dramatically. I’ve found myself drifting away from my views of traditional networking and increasingly seeing everything through the lens of applications, and treati...]]></description><link>https://blog.nerdylyonsden.io/bridging-the-gap-gitops-for-network-engineers-part-1</link><guid isPermaLink="true">https://blog.nerdylyonsden.io/bridging-the-gap-gitops-for-network-engineers-part-1</guid><category><![CDATA[Kubernetes]]></category><category><![CDATA[ArgoCD]]></category><category><![CDATA[Network Automation]]></category><category><![CDATA[Git]]></category><category><![CDATA[gitops]]></category><category><![CDATA[Network Engineering]]></category><category><![CDATA[networking]]></category><dc:creator><![CDATA[Jeffrey Lyon]]></dc:creator><pubDate>Wed, 23 Apr 2025 15:32:57 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1744512989708/13187ff9-4b83-46b1-b4d6-fa107f0e9ea1.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-intro">Intro</h1>
<p>Over the past 6–9 months, my career and perspective on technology have shifted dramatically. I’ve found myself drifting away from my views of traditional networking and increasingly seeing everything through the lens of applications, and treating it accordingly. To meet the demands of my current role, a colleague introduced me to the concept of GitOps and suggested we integrate it into our network automation workflows. At the time, I had no idea what GitOps even was. But a few months later, I’m wondering why I didn’t adopt this approach much earlier in my career. Within that short span, I had built a complete platform capable of hosting all of our network automation tools—NetBox, Nautobot, custom Python scripts, databases, monitoring stacks, and even Clabernetes (containerlab) for running virtual topologies. All self-contained, all deployed declaratively, and all benefiting from the GitOps principles I’ll be breaking down throughout this article.</p>
<h2 id="heading-so-what-is-gitops">So… what is GitOps?</h2>
<p>At its core, GitOps is a way of managing infrastructure and applications using Git as the single source of truth. Think of it like this: instead of logging into systems and manually making changes (we’ve all been there), you define your desired state in code—YAML, JSON, whatever floats your repo, and store that in Git. From there, automation tools take over, constantly reconciling what's deployed with what lives in the repo. If something drifts or breaks, the system can alert you, fix it, or at least give you a clean way to roll back.</p>
<p>In traditional terms, it’s like having a version-controlled config file for every part of your infrastructure, and having robots to deploy it all for you, exactly how you wrote it.</p>
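<p>To make that concrete, here’s a minimal (hypothetical) example of what a “desired state” file in Git might look like, a plain Kubernetes Deployment that a GitOps controller would apply and keep enforced:</p>
<pre><code class="lang-yaml"># deployment.yaml - a hypothetical desired-state file stored in Git.
# The GitOps controller applies this and reconciles the cluster against it.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app
  namespace: tools
spec:
  replicas: 2
  selector:
    matchLabels:
      app: demo-app
  template:
    metadata:
      labels:
        app: demo-app
    spec:
      containers:
        - name: demo-app
          image: nginx:1.27   # pin image tags so Git history shows exactly what ran
          ports:
            - containerPort: 80
</code></pre>
<p>Bump <code>replicas</code> in a commit and the controller scales the Deployment; revert the commit and it scales right back. That’s the whole loop.</p>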
<h2 id="heading-why-should-network-engineersorgs-care">Why Should Network Engineers/Orgs Care?</h2>
<p>Historically, network automation has been about scripts, Python, maybe some Ansible sprinkled on top. But the problem with that approach is scale, visibility, and consistency. You might have 10 engineers all running different scripts in slightly different ways. Who knows what changed and when?</p>
<p>GitOps brings the same rigor DevOps teams apply to applications into the world of network automation. Imagine managing Nautobot or NetBox deployments through Git. Want to roll out a plugin, change a config, or update a container? You create a pull request, get it reviewed, and once it’s merged, it’s live in production (via ArgoCD, Flux, or whatever your GitOps controller is).</p>
<p>Even beyond the apps themselves, this mindset works for deploying the tools that generate your configs, run validations, or even trigger device changes. You're turning networking workflows into a pipeline. And once that happens, you get auditability, consistency, and less of that "it works on my machine" nonsense.</p>
<p>This is <strong>Part 1</strong> of a series aimed at helping network engineers get hands-on with GitOps and understand the core components involved in building a modern automation platform. In this first part, we’ll focus on the foundational concepts of GitOps, the tools that power it, and walk through installing ArgoCD as the GitOps engine for our platform. Even if you're not deploying anything just yet, the goal here is to bridge the knowledge gap, so network engineers can better understand the deployment process and begin delivering their own code and tools in a structured, scalable way. At the very least, this knowledge helps you communicate more effectively with DevOps and Platform Engineering teams, making it easier to explain what you need when it comes to production-ready deployments.</p>
<p>In <strong>Part 2</strong>, we’ll pick up by deploying core infrastructure components—like MetalLB, Traefik, persistent storage, and secrets management—using the GitOps workflow established here.</p>
<p><strong><em>For those interested in exploring the configurations and examples discussed in this article, all the code and resources are available in my GitHub repository:</em></strong> <a target="_blank" href="https://github.com/leothelyon17/kubernetes-gitops-playground"><strong><em>kubernetes-gitops-playground</em></strong></a><strong><em>.</em></strong></p>
<p><strong><em>This repository serves as a comprehensive reference for setting up a GitOps-driven Kubernetes environment. It includes structured directories for applications like Nautobot, configurations for ArgoCD, and various Kubernetes add-ons. The repository is designed to be a practical guide for network engineers aiming to implement GitOps methodologies in their infrastructure.​</em></strong></p>
<p><strong><em>Feel free to explore the repository to gain insights into the practical implementation of the concepts discussed here.</em></strong></p>
<h1 id="heading-the-gitops-ecosystem-a-network-automation-perspective">The GitOps Ecosystem: A Network Automation Perspective</h1>
<p>Here’s a high-level breakdown of the components I use to power my GitOps-driven automation platform. This list reflects a practical, production-minded approach to deploying and managing applications, especially for network engineers looking to build, scale, or just better understand modern automation workflows.</p>
<p>Each component below plays a specific role in the platform, helping ensure security, flexibility, repeatability, and operational clarity.</p>
<h4 id="heading-kubernetes-cluster-obviously"><strong>Kubernetes Cluster (Obviously)</strong></h4>
<p>The foundation of everything. Kubernetes orchestrates and runs your containerized applications, managing scaling, availability, and resource utilization.</p>
<h4 id="heading-git-provider-github"><strong>Git Provider (GitHub)</strong></h4>
<p>The single source of truth. All manifests, Helm values, and Kustomize overlays live here. Every change is tracked, reviewed, and version-controlled.</p>
<h4 id="heading-argocd"><strong>ArgoCD</strong></h4>
<p>This is the GitOps engine of the platform. It continuously syncs application state from Git repositories into the cluster, ensuring what’s deployed always matches what’s defined in code.</p>
<h4 id="heading-cluster-load-balancing-metallb"><strong>Cluster Load Balancing (MetalLB)</strong></h4>
<p>MetalLB enables load-balanced services in bare-metal or home lab environments by assigning external IPs to services that require them.</p>
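<p>For reference, MetalLB’s configuration is itself just a couple of manifests you can keep in Git. The pool name and address range below are made up, so adjust them to your lab network:</p>
<pre><code class="lang-yaml"># A hypothetical MetalLB layer-2 setup: an address pool plus an advertisement
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: lab-pool
  namespace: metallb-system
spec:
  addresses:
    - 192.168.1.240-192.168.1.250   # IPs MetalLB may hand out to LoadBalancer services
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: lab-l2
  namespace: metallb-system
spec:
  ipAddressPools:
    - lab-pool
</code></pre>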
<h4 id="heading-traefik-ingressroute"><strong>Traefik (IngressRoute)</strong></h4>
<p>Traefik is a powerful and flexible ingress controller that routes external traffic into your Kubernetes cluster using custom IngressRoute CRDs. It gives you fine-grained control over how services are exposed, supports TLS, and integrates smoothly with GitOps workflows.</p>
<p><strong>Note:</strong> You can use NodePorts if you’re not ready for an ingress controller and want a simpler setup, but that approach isn’t ideal for production use and lacks the flexibility and security that Traefik provides.</p>
<h4 id="heading-persistent-storage-rook-ceph"><strong>Persistent Storage (Rook + Ceph)</strong></h4>
<p>Apps like network automation platforms often require persistent volumes. Rook with Ceph provides resilient, scalable storage within the cluster, critical for stateful services.</p>
<h4 id="heading-secrets-vault-ie-hashicorp-vault"><strong>Secrets Vault (e.g., HashiCorp Vault)</strong></h4>
<p>A secure place to store sensitive information like API tokens, database credentials, and TLS certificates, outside the cluster and outside of Git.</p>
<h4 id="heading-secrets-operator-ie-external-secrets"><strong>Secrets Operator (e.g., External Secrets)</strong></h4>
<p>This bridges the gap between Vault and Kubernetes. It watches your external secret store and injects the data into Kubernetes Secrets based on declarative manifests.</p>
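<p>As a sketch of what that looks like in practice (names, Vault paths, and the SecretStore below are all hypothetical), an <code>ExternalSecret</code> manifest ties a Vault entry to a generated Kubernetes Secret:</p>
<pre><code class="lang-yaml">apiVersion: external-secrets.io/v1beta1   # API version varies by operator release
kind: ExternalSecret
metadata:
  name: nautobot-db-creds
  namespace: tools
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault-backend        # a ClusterSecretStore pointing at Vault (assumed to exist)
    kind: ClusterSecretStore
  target:
    name: nautobot-db-creds    # the Kubernetes Secret the operator creates
  data:
    - secretKey: password
      remoteRef:
        key: network-automation/nautobot   # path in Vault
        property: db_password
</code></pre>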
<h4 id="heading-kubernetes-secrets"><strong>Kubernetes Secrets</strong></h4>
<p>The native format for storing and referencing secrets inside Kubernetes workloads. These are the final form of secrets that your apps consume at runtime.</p>
<h4 id="heading-helm-amp-custom-values"><strong>Helm &amp; Custom Values</strong></h4>
<p>Helm acts as the package manager for Kubernetes, simplifying the deployment of complex, production-ready applications through reusable charts. By supplying custom values, you can easily override default configurations, tuning things like ports, storage, resource limits, and app-specific settings to fit your environment without modifying the underlying chart.</p>
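<p>As a rough illustration (these keys are typical but chart-specific, so check your chart’s defaults), a custom values file might override just the handful of settings you care about:</p>
<pre><code class="lang-yaml"># values.yaml - hypothetical overrides layered on a chart's defaults
service:
  type: ClusterIP            # expose via an ingress controller instead of a NodePort
persistence:
  enabled: true
  storageClass: rook-ceph-block   # assumes the Rook/Ceph StorageClass from this stack
  size: 10Gi
resources:
  requests:
    cpu: 250m
    memory: 512Mi
</code></pre>
<p>You’d then deploy with something like <code>helm install my-app repo/chart -f values.yaml</code>, or point ArgoCD at the same values file.</p>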
<h4 id="heading-kustomize"><strong>Kustomize</strong></h4>
<p>Kustomize lets you customize Kubernetes manifests without copying or editing the original files. It uses overlays to manage environment-specific changes, like different configs for dev, test, or prod. This helps keep your Git repo organized and clean.</p>
<p>You can also use Kustomize alongside Helm by referencing rendered Helm charts as a base, then layering custom configs on top, giving you the best of both tools.</p>
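<p>A minimal sketch of that combination (the chart name, version, and patch file here are illustrative, and rendering Helm from Kustomize requires the <code>--enable-helm</code> flag or the equivalent ArgoCD setting):</p>
<pre><code class="lang-yaml"># kustomization.yaml - hypothetical overlay that renders a Helm chart as its base
helmCharts:
  - name: nautobot
    repo: https://nautobot.github.io/helm-charts/
    version: 2.1.0
    releaseName: nautobot
    namespace: tools
    valuesFile: values.yaml
patches:
  - path: patch-replicas.yaml   # environment-specific tweak layered on top
</code></pre>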
<h1 id="heading-requirements-amp-housekeeping"><strong>Requirements &amp; Housekeeping</strong></h1>
<p>Before we dive into the individual components of the platform, there are a few things that need to be in place:</p>
<ul>
<li><p><strong>Kubernetes Cluster:</strong> I won’t be covering how to stand up a Kubernetes cluster in this post. If you need help with that, check out <a target="_blank" href="https://blog.nerdylyonsden.io/kubernetes-and-containerlab-part-1-building-a-cluster">this earlier article I wrote</a> that walks through the setup. This also isn’t a Kubernetes 101 guide, you’ll need a solid understanding of how Kubernetes works, especially when it comes to common resource types like Deployments, Services, Secrets, ConfigMaps, and PersistentVolumeClaims. <code>kubectl</code> and <code>helm</code> should also be installed and working against the cluster.</p>
</li>
<li><p><strong>Git &amp; GitHub (or another Git provider):</strong> This isn’t a Git 101 tutorial. You’ll need some working knowledge of Git and GitHub, and you should already have an account set up. If you’re using another provider (like GitLab or Bitbucket), that’ll work too.</p>
</li>
<li><p><strong>Persistent Storage:</strong> While persistent storage is part of the overall stack, this post won’t go deep into the setup. I’ll touch on what’s needed to support the apps, but I’m saving the storage deep dive for a separate article.</p>
</li>
<li><p><strong>Linux &amp; Bash:</strong> You should be comfortable using Linux and working in a bash shell. There will be commands, file edits, and troubleshooting that assume you’re not new to the terminal.</p>
</li>
<li><p><strong>IDE (like VSCode):</strong> You’ll need a code editor to work with YAML, Helm values, and general GitOps structure. VSCode is a solid choice, it has excellent Git integration and Kubernetes plugins that can speed up your workflow.</p>
</li>
</ul>
<h2 id="heading-my-setup">My Setup</h2>
<p>My lab runs on a three-node Rocky Linux 9 cluster, the same setup used in my other blog posts. Most other major distributions should work much the same, but if you’re following along closely, Rocky and Red Hat are the better OS choices.</p>
<p>If you’re good on those fronts, let’s keep going.</p>
<h1 id="heading-argocd-your-gitops-automation-engine"><strong>ArgoCD: Your GitOps Automation Engine</strong></h1>
<p>Now that your Kubernetes cluster is built and your GitHub account is ready, it's time to dive into the heart of GitOps: <strong>ArgoCD</strong>.</p>
<h3 id="heading-what-is-argocd"><strong>What is ArgoCD?</strong></h3>
<p>ArgoCD (short for <em>Argo Continuous Delivery</em>) is a GitOps controller for Kubernetes. It continuously monitors Git repositories and ensures the live state of your cluster matches the declared state in Git. If something drifts, like someone manually edits a resource, ArgoCD can detect that and reconcile it back to the desired state stored in Git. It’s declarative, automated, and very production-friendly.</p>
<p>In simple terms: <strong>Git is the source of truth, and ArgoCD makes sure your cluster does what Git says.</strong></p>
<h3 id="heading-where-argocd-fits-in-the-gitops-model"><strong>Where ArgoCD Fits in the GitOps Model</strong></h3>
<p>GitOps workflows revolve around a few key principles:</p>
<ul>
<li><p><strong>Version control as truth:</strong> All manifests live in Git.</p>
</li>
<li><p><strong>Pull-based automation:</strong> Kubernetes doesn’t wait for you to push changes, it pulls from Git.</p>
</li>
<li><p><strong>Observability and rollback:</strong> You can track exactly what changed, when, and by whom. Rolling back is as easy as reverting a commit.</p>
</li>
</ul>
<p>ArgoCD is the engine that powers this model. It watches your repo, compares it to what’s actually running in your cluster, and syncs everything up, either automatically or on demand. It also gives you a nice web UI, CLI, and API for managing applications and monitoring sync status.</p>
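<p>To ground that, here’s roughly what an ArgoCD <code>Application</code> manifest looks like (the repo is mine, but the path and sync options are illustrative):</p>
<pre><code class="lang-yaml">apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: metallb
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/leothelyon17/kubernetes-gitops-playground.git
    targetRevision: main
    path: apps/metallb        # illustrative path within the repo
  destination:
    server: https://kubernetes.default.svc
    namespace: metallb-system
  syncPolicy:
    automated:
      prune: true      # delete resources removed from Git
      selfHeal: true   # revert manual drift automatically
    syncOptions:
      - CreateNamespace=true
</code></pre>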
<p>On a personal note—<strong>I freaking love ArgoCD</strong>! When I was first dipping my toes into GitOps and only had a surface-level understanding of Kubernetes, ArgoCD was an absolute game changer. Being able to visually see every single Kubernetes object that makes up an app, and how they relate to each other, leveled up my Kubernetes knowledge fast. The fact that you can pause, sync, delete, or rebuild individual resources with basically the flip of a switch? Insanely useful. And not having to constantly hammer out <code>kubectl</code> commands just to check logs or dig into the YAML? Crazy time saver! Seriously, it’s one of the most valuable tools in this whole setup, and in tech today, period.</p>
<h3 id="heading-installing-argocd-on-rocky-linux-9"><strong>Installing ArgoCD on Rocky Linux 9</strong></h3>
<p>Let’s walk step-by-step through a basic installation of ArgoCD and its CLI. These steps assume you already have:</p>
<ul>
<li><p><code>kubectl</code> configured and pointing to your Kubernetes cluster</p>
</li>
<li><p><code>helm</code> installed (needed for app creation later)</p>
</li>
<li><p>Root or sudo access on your Rocky Linux 9 system</p>
</li>
</ul>
<h4 id="heading-step-1-install-argocd-into-the-cluster"><strong>Step 1: Install ArgoCD into the Cluster</strong></h4>
<p>We'll install ArgoCD in its own namespace using the official manifests:</p>
<pre><code class="lang-bash">kubectl create namespace argocd

kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
</code></pre>
<p>This will install all the ArgoCD components: API server, controller, repo server, and UI server.</p>
<p>To confirm the installation was successful, run the command below.</p>
<pre><code class="lang-bash">kubectl get pods -n argocd
</code></pre>
<p>You should see all the argocd pods in a ‘Running’ state after 30 seconds or so -</p>
<pre><code class="lang-bash">NAME                                               READY   STATUS    RESTARTS     AGE
argocd-application-controller-0                    1/1     Running   0            1m
argocd-applicationset-controller-dc47f7989-77ztg   1/1     Running   0            1m
argocd-dex-server-bc9bc7d65-68rxn                  1/1     Running   0            1m
argocd-notifications-controller-5698dbd744-7vmzc   1/1     Running   0            1m
argocd-redis-656948fbd6-zfgjd                      1/1     Running   0            1m
argocd-repo-server-74c4cb6cc5-pnxfv                1/1     Running   0            1m
argocd-server-856f78f5df-cxh9h                     1/1     Running   0            1m
</code></pre>
<h4 id="heading-step-2-expose-the-argocd-ui"><strong>Step 2: Expose the ArgoCD UI</strong></h4>
<p>By default, ArgoCD’s API server is only accessible inside the cluster. For testing or lab use, you can expose it using a <code>NodePort</code> or via your ingress controller (like Traefik):</p>
<p><strong>Option A: NodePort (quick and dirty)</strong></p>
<pre><code class="lang-bash">kubectl patch svc argocd-server -n argocd -p <span class="hljs-string">'{"spec": {"type": "NodePort"}}'</span>
</code></pre>
<p>Find the port:</p>
<pre><code class="lang-bash">kubectl get svc argocd-server -n argocd
</code></pre>
<p>NodePorts are usually assigned within the <strong>30000–32767</strong> range. Look for the <code>PORT(S)</code> column in the output: an entry like <code>8080:32678/TCP</code> means ArgoCD is accessible on port <strong>32678</strong> of any node in the cluster.</p>
<p>Then access the UI at:<br /><code>http://&lt;node-ip&gt;:&lt;nodeport&gt;</code></p>
<p><strong>Option B: IngressRoute (we’ll add this later once Traefik is installed)</strong></p>
<p>If you're planning to use Traefik as your ingress controller, you'll eventually want to expose ArgoCD using an IngressRoute. This is the more GitOps-friendly approach because your ingress config, just like everything else, can live in Git and be managed declaratively.</p>
<p>That said, you probably don’t have an ingress controller installed yet, so this option won’t work just yet. No problem, start with the NodePort method for now, and once Traefik is in place, switching over to an IngressRoute is quick and clean. It fits perfectly into the GitOps model and keeps your exposure configs version-controlled along with the rest of your stack.</p>
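<p>For a preview, an IngressRoute for ArgoCD might look something like this (the hostname is hypothetical, and the <code>apiVersion</code> depends on your Traefik release; note that <code>argocd-server</code> serves HTTPS by default, so you may need to run it with <code>--insecure</code> behind an ingress that terminates TLS):</p>
<pre><code class="lang-yaml">apiVersion: traefik.io/v1alpha1   # older Traefik releases use traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  name: argocd-server
  namespace: argocd
spec:
  entryPoints:
    - websecure
  routes:
    - kind: Rule
      match: Host(`argocd.example.lab`)   # hypothetical hostname
      services:
        - name: argocd-server
          port: 80
</code></pre>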
<h4 id="heading-step-3-get-the-initial-admin-password"><strong>Step 3: Get the Initial Admin Password</strong></h4>
<p>The default username is <code>admin</code>. To get the initial password:</p>
<pre><code class="lang-bash">kubectl get secret argocd-initial-admin-secret -n argocd \
  -o jsonpath=<span class="hljs-string">"{.data.password}"</span> | base64 -d &amp;&amp; <span class="hljs-built_in">echo</span>
</code></pre>
<p>Log in via the web UI or CLI using this password.</p>
<h4 id="heading-step-4-install-the-argocd-cli"><strong>Step 4: Install the ArgoCD CLI</strong></h4>
<p>Install the CLI to interact with ArgoCD from your terminal.</p>
<pre><code class="lang-bash">VERSION=$(curl -s https://api.github.com/repos/argoproj/argo-cd/releases/latest \
  | grep tag_name | cut -d <span class="hljs-string">'"'</span> -f 4)

curl -sSL -o argocd <span class="hljs-string">"https://github.com/argoproj/argo-cd/releases/download/<span class="hljs-variable">${VERSION}</span>/argocd-linux-amd64"</span>

chmod +x argocd
sudo mv argocd /usr/<span class="hljs-built_in">local</span>/bin/
</code></pre>
<p>Confirm it’s installed:</p>
<pre><code class="lang-bash">argocd version
</code></pre>
<h4 id="heading-step-5-log-in-using-the-cli"><strong>Step 5: Log In Using the CLI</strong></h4>
<pre><code class="lang-bash">argocd login &lt;ARGOCD-SERVER&gt; --username admin --password &lt;PASSWORD&gt;
</code></pre>
<p>Use the hostname or IP that maps to your <code>argocd-server</code> service.</p>
<h2 id="heading-alternative-installation-method-github-actions-runner">Alternative Installation Method (GitHub Actions Runner)</h2>
<p>If you’ve followed along this far, you’re probably realizing we could automate a good chunk of this platform bootstrapping. And yes, we absolutely can.</p>
<p>I've created a <a target="_blank" href="https://github.com/leothelyon17/kubernetes-gitops-playground/blob/main/.github/workflows/install-argocd.yml">GitHub Actions workflow</a> that installs ArgoCD (and its CLI), exposes it, configures custom admin users, and even adds the Kubernetes cluster back into ArgoCD, all automatically. This method is particularly useful if you're managing multiple clusters or frequently rebuilding your platform. Feel free to use this for assistance in setting up your own runner. Here’s how it works.</p>
<h3 id="heading-requirements">Requirements</h3>
<p>To use this workflow, you’ll need:</p>
<ul>
<li><p>A self-hosted GitHub Actions runner that has access to your Kubernetes cluster</p>
</li>
<li><p><code>kubectl</code> and <code>python3.12</code> installed on the runner</p>
</li>
<li><p>A valid Kubeconfig file for the cluster you're targeting</p>
</li>
<li><p>GitHub repository secrets and variables configured properly:</p>
<ul>
<li><p><code>ARGOCD_ADMIN_USER</code>, <code>ARGOCD_ADMIN_PASSWORD</code> – default admin login</p>
</li>
<li><p><code>ARGOCD_MY_ADMIN_USER</code>, <code>ARGOCD_MY_ADMIN_PASSWORD</code> – a secondary, more permanent admin account</p>
</li>
<li><p><code>PAT_TOKEN</code> – GitHub personal access token for storing encrypted secrets per environment</p>
</li>
<li><p>GitHub Actions environment variables like <code>ARGOCD_PORT</code> and <code>ARGOCD_SERVER</code> (the IP or DNS hostname of a Kubernetes control-plane node)</p>
</li>
</ul>
</li>
</ul>
<h3 id="heading-supporting-workflows-worth-noting">Supporting Workflows Worth Noting</h3>
<p>If you're wondering how this all connects behind the scenes, the repo also includes a few helper workflows that make this setup much smoother.</p>
<ul>
<li><p><strong>Kubeconfig Setup &amp; Storage:</strong> There's a <a target="_blank" href="https://github.com/leothelyon17/kubernetes-gitops-playground/blob/main/.github/workflows/create-refresh-kube-configs.yml">workflow</a> that helps you extract your kubeconfig file and securely store it in GitHub as a repository variable or secret. This is crucial for giving your self-hosted runner authenticated access to your cluster during automated jobs.</p>
</li>
<li><p><strong>kubectl Installation &amp; Verification:</strong> Another <a target="_blank" href="https://github.com/leothelyon17/kubernetes-gitops-playground/blob/main/.github/workflows/setup-kubectl-test-k8s-access.yml">workflow</a> ensures <code>kubectl</code> is installed and properly configured on your self-hosted runner. It also includes a quick test to confirm the runner can talk to the cluster, basically your first "sanity check" before deploying anything.</p>
</li>
</ul>
<p>These smaller workflows aren’t flashy, but they’re essential in keeping everything reliable, reproducible, and GitOps-friendly.</p>
<h3 id="heading-workflow-breakdown">Workflow Breakdown</h3>
<p>Here's what the job actually does:</p>
<ol>
<li><p><strong>Checkout the Repo</strong><br /> Grabs your current Git repository so that scripts and manifests can be used during the workflow.</p>
</li>
<li><p><strong>Set the Environment</strong><br /> Dynamically sets the target environment (e.g., <code>lab</code> or <code>prod</code>) based on your manual trigger input. This is used for cluster context switching and naming.</p>
</li>
<li><p><strong>Configure kubectl</strong><br /> Updates the active Kubernetes context based on the selected environment so the workflow knows which cluster to operate on.</p>
</li>
<li><p><strong>Install Dependencies</strong><br /> Sets up a Python virtual environment and installs <code>pynacl</code>, which is used later for encrypting the ArgoCD password.</p>
</li>
<li><p><strong>Install ArgoCD</strong><br /> Creates the <code>argocd</code> namespace (if it doesn't exist) and applies the official ArgoCD manifests to install the full stack into your cluster.</p>
</li>
<li><p><strong>Install the ArgoCD CLI</strong><br /> Downloads and installs the latest CLI version for use in later steps like login, user config, and cluster registration.</p>
</li>
<li><p><strong>Wait for ArgoCD to Come Online</strong><br /> Uses <code>kubectl wait</code> to ensure the <code>argocd-server</code> deployment is available before proceeding.</p>
</li>
<li><p><strong>Expose ArgoCD via NodePort</strong><br /> Temporarily exposes the ArgoCD UI using a NodePort service on the configured port. This makes it accessible during early setup (before Ingress is configured).</p>
</li>
<li><p><strong>Extract the Initial Admin Password</strong><br /> Pulls the default ArgoCD admin password from the Kubernetes secret and stores it as a masked GitHub environment variable.</p>
</li>
<li><p><strong>Encrypt and Store the Admin Password in GitHub Secrets</strong><br />Uses GitHub’s public key API and a Python script to encrypt the ArgoCD admin password and securely store it in the environment-specific GitHub Secrets.</p>
</li>
<li><p><strong>Log into ArgoCD with Default Admin</strong><br />Authenticates with ArgoCD using the default credentials and ensures the CLI is working.</p>
</li>
<li><p><strong>Create a Custom Admin User</strong><br />Edits the <code>argocd-cm</code> ConfigMap to define a new admin-level account.</p>
</li>
<li><p><strong>Assign RBAC Permissions to the New User</strong><br />Updates the <code>argocd-rbac-cm</code> ConfigMap to give your new user full admin access.</p>
</li>
<li><p><strong>Set a Password for the New User</strong><br />Uses the CLI to set the new admin user’s password securely.</p>
</li>
<li><p><strong>Verify the New Admin Login</strong><br />Logs in with the new user credentials to confirm everything’s configured properly.</p>
</li>
<li><p><strong>Register the Cluster with ArgoCD</strong><br />Ensures the current Kubernetes cluster is registered with ArgoCD, allowing future applications to target it via the ArgoCD UI or CLI.</p>
</li>
</ol>
<h3 id="heading-why-this-rocks">Why This Rocks</h3>
<p>Instead of manually copying YAML and running a dozen <code>kubectl</code> commands, this workflow automates the whole thing, and tracks it all in Git. It’s GitOps deploying GitOps, and yes, I’m into that level of inception.</p>
<p>You can trigger it manually for different environments (e.g., lab vs prod), and the entire setup becomes repeatable, shareable, and documented as code.</p>
<hr />
<p>ArgoCD is now up and running (hopefully). You should be able to reach the login page using the IP (or hostname) of any cluster node together with the NodePort assigned earlier -</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1744570101134/fdaf99ab-3b45-427b-bca9-6538addacd19.png" alt class="image--center mx-auto" /></p>
<p>Go ahead and log in using the admin credentials, or the credentials you created via the Actions workflow. You should see a blank applications list like so -</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745368878759/5e69cf35-9470-48e3-972e-065ea439505b.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-initial-setup">Initial Setup</h2>
<h3 id="heading-configuring-your-cluster-within-argocd"><strong>Configuring Your Cluster within ArgoCD</strong></h3>
<p>Before deploying anything, ArgoCD needs to know which Kubernetes cluster(s) it can target. If you installed ArgoCD into the same cluster you're working in, there's good news, ArgoCD automatically configures access to that cluster. It will show up as <code>in-cluster</code> and is ready to go out of the box.</p>
<p>But if you're managing a <strong>remote cluster</strong>, or skipped using the automated GitHub Actions workflow I showed earlier, you’ll need to manually register the cluster using the ArgoCD CLI. This is required because <strong>you cannot add a new cluster through the ArgoCD UI</strong>.</p>
<h4 id="heading-step-1-login-to-the-argocd-cli"><strong>Step 1: Login to the ArgoCD CLI</strong></h4>
<p>Before you can register a cluster, you need to authenticate using the CLI:</p>
<pre><code class="lang-bash">argocd login &lt;ARGOCD_SERVER&gt;:&lt;PORT&gt; --username admin --password &lt;PASSWORD&gt; --insecure
</code></pre>
<p>Replace the values with your ArgoCD server address and credentials. The <code>--insecure</code> flag is common during lab/testing since you might not have valid TLS configured yet.</p>
<h4 id="heading-step-2-register-the-cluster"><strong>Step 2: Register the Cluster</strong></h4>
<p>Once logged in, you can add the Kubernetes cluster currently pointed to by <code>kubectl</code>:</p>
<pre><code class="lang-bash">argocd cluster add &lt;kube-context-name&gt;
</code></pre>
<p>You can find your context name with:</p>
<pre><code class="lang-bash">kubectl config current-context
</code></pre>
<p>This command sets up a service account and RBAC within the target cluster, and registers it inside ArgoCD. Once complete, the cluster will appear in the UI and can be used for application deployments.</p>
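<p>Under the hood, ArgoCD stores each registered cluster as a labeled Secret in the <code>argocd</code> namespace, which means clusters can also be registered declaratively. A rough sketch (server address and credentials are placeholders):</p>
<pre><code class="lang-yaml">apiVersion: v1
kind: Secret
metadata:
  name: lab-cluster
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: cluster   # tells ArgoCD this Secret defines a cluster
type: Opaque
stringData:
  name: lab-cluster
  server: https://10.0.0.10:6443   # hypothetical API server address
  config: |
    {
      "bearerToken": "&lt;service-account-token&gt;",
      "tlsClientConfig": {
        "caData": "&lt;base64-encoded-ca-cert&gt;"
      }
    }
</code></pre>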
<h3 id="heading-adding-and-configuring-a-new-project-via-gui"><strong>Adding and Configuring a New Project (via GUI)</strong></h3>
<p>Projects in ArgoCD are used to organize applications, enforce boundaries, and apply access rules. They’re especially useful when you want to group related apps, like having one project for core platform components and another for automation tools.</p>
<h4 id="heading-step-by-step-create-a-new-project-in-the-ui"><strong>Step-by-Step: Create a New Project in the UI</strong></h4>
<ol>
<li><p><strong>Login to the ArgoCD UI</strong></p>
<p> Use the NodePort or Ingress you’ve set up earlier to access the web UI. Login with your <code>admin</code> or custom user credentials.</p>
</li>
<li><p><strong>Go to “Settings” → “Projects”</strong></p>
<p> In the sidebar, click <strong>Settings</strong>, then select <strong>Projects</strong>. Click <strong>+ NEW PROJECT</strong> to create a new one.</p>
</li>
<li><p><strong>Name Your Project</strong></p>
<p> Give your project a meaningful name, like <code>platform-core</code> or <code>network-tools</code>. Mine is shown below -</p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745369063850/d71ee8b5-a449-4aef-976b-4d4a5caabd1a.png" alt class="image--center mx-auto" /></p>
</li>
<li><p><strong>Define Destinations</strong></p>
<p> These are the clusters and namespaces that apps in this project are allowed to deploy to. If you're using the default in-cluster setup, your server URL will be <a target="_blank" href="https://kubernetes.default.svc"><code>https://kubernetes.default.svc</code></a>.</p>
<ul>
<li><p>Server: <a target="_blank" href="https://kubernetes.default.svc"><code>https://kubernetes.default.svc</code></a></p>
</li>
<li><p>Namespace: e.g., <code>default</code>, <code>argocd</code>, or <code>tools</code>. For a basic setup, just set it to <code>*</code> (all namespaces)</p>
</li>
</ul>
</li>
</ol>
<p>    <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745369318060/4a919df4-9520-4f55-ae0a-344fe4f3b504.png" alt class="image--center mx-auto" /></p>
<ol start="5">
<li><p><strong>Configure Role-Based Access and Restrictions</strong></p>
<p> When you're setting up a new project in ArgoCD, you'll see options to define what types of Kubernetes resources the project is allowed to manage. This is where you can lock things down pretty tightly, but for <strong>basic setup and initial testing</strong>, it’s easiest to just allow everything and refine later once things are working.</p>
<p> Here’s what that looks like:</p>
<ul>
<li><p><strong>Cluster Resource Allow List</strong></p>
<ul>
<li><p>Kind: <code>*</code></p>
</li>
<li><p>Group: <code>*</code></p>
</li>
</ul>
</li>
<li><p><strong>Cluster Resource Deny List</strong></p>
<ul>
<li>Leave this empty</li>
</ul>
</li>
<li><p><strong>Namespace Resource Allow List</strong></p>
<ul>
<li><p>Kind: <code>*</code></p>
</li>
<li><p>Group: <code>*</code></p>
</li>
</ul>
</li>
<li><p><strong>Namespace Resource Deny List</strong></p>
<ul>
<li>Leave this empty</li>
</ul>
</li>
</ul>
</li>
</ol>
<p>    <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745369201192/2d7ab395-8511-4a49-a2f6-5ad9ba8f5973.png" alt class="image--center mx-auto" /></p>
<ol start="6">
<li><p>Resource Monitoring</p>
<ul>
<li>Move the slider to ‘Enabled’</li>
</ul>
</li>
<li><p><strong>Click “Create”</strong></p>
<p> Your project is now set up and ready to have apps assigned to it.</p>
</li>
</ol>
<p>From here, you’ll be able to define Git-based applications, point them at your manifests or Helm charts, and let ArgoCD handle the rest.</p>
<p>The first applications we’ll deploy with it are the core pieces of our GitOps infrastructure itself: tools like MetalLB for load balancing, Traefik for ingress, and persistent storage components. In other words, we’ll be using GitOps to finish building out the platform that enables GitOps. Poetic, right?</p>
<p>Before we wrap up Part 1, let's talk about how this all ties back to Git...</p>
<h2 id="heading-understanding-the-repo-structure-and-why-everything-belongs-in-git"><strong>Understanding the Repo Structure (and Why Everything Belongs in Git)</strong></h2>
<p>One of the core principles of GitOps is keeping <strong>everything</strong>—infrastructure, applications, configurations, and deployment logic—in version control. The folder layout in my example repo is designed with that in mind. It reflects GitOps best practices: everything is declarative, versioned, and easy to manage or scale over time.</p>
<p>Having a clear and intentional structure not only makes your deployments cleaner, it also simplifies troubleshooting, auditing, onboarding new team members, and extending the platform as your needs grow.</p>
<p>Here’s a quick breakdown of the folders that matter most for this series:</p>
<ul>
<li><p><code>apps/</code><br />  This is where you’ll find custom Helm values files and Kustomize overlays for each application managed by ArgoCD. Each subdirectory corresponds to a specific app—like MetalLB, Traefik, or ArgoCD itself—and contains the configuration needed to tailor the deployment to your environment. This keeps your app logic cleanly separated and easy to maintain.</p>
</li>
<li><p><code>argocd-app-manifests/</code><br />  Contains the ArgoCD <code>Application</code> and <code>AppProject</code> manifests. These define <em>what</em> ArgoCD deploys, <em>where</em>, and <em>from which repo</em>. Managing these separately from app-specific config keeps the logic declarative and helps you track application lifecycle separately from platform logic.</p>
</li>
<li><p><code>helm-charts/</code><br />  This folder stores any custom or forked Helm charts that don’t live in an external Helm repo. It gives you a clean place to manage pinned chart versions or make local edits without cluttering the main app or manifest directories.</p>
</li>
</ul>
<p>This layout isn’t just for organization, it’s what enables a GitOps workflow to scale. As your platform grows, this structure makes it easy to maintain a consistent, observable, and testable deployment pipeline across your infrastructure.</p>
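<p>To make this concrete, here’s a hedged sketch of an ArgoCD <code>Application</code> manifest that would live in <code>argocd-app-manifests/</code> and point at configuration under <code>apps/</code>. The repo URL, project name, and paths are illustrative, not taken from the actual repo:</p>
<pre><code class="lang-yaml">apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: metallb                  # example app from this series
  namespace: argocd
spec:
  project: my-gitops-project     # placeholder project name
  source:
    repoURL: https://github.com/example/gitops-repo.git  # placeholder repo
    targetRevision: main
    path: apps/metallb           # app-specific config lives under apps/
  destination:
    server: https://kubernetes.default.svc
    namespace: metallb-system
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
</code></pre>
<p>The manifest (the <em>what</em> and <em>where</em>) lives in one folder while the app’s Helm values or Kustomize overlays (the <em>how</em>) live in another — exactly the separation described above.</p>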
<h1 id="heading-summary-amp-whats-next"><strong>Summary &amp; What’s Next</strong></h1>
<p>In Part 1, we laid the groundwork for a GitOps-driven automation platform. We covered the key components that make up the stack, walked through what GitOps actually is (without the fluff), and deployed ArgoCD, the engine that brings it all to life.</p>
<p>By now, you should have:</p>
<ul>
<li><p>A working Kubernetes cluster</p>
</li>
<li><p>ArgoCD fully installed and accessible via NodePort or Ingress</p>
</li>
<li><p>Logged into the ArgoCD UI</p>
</li>
<li><p>Created your first ArgoCD project and verified it’s configured with the settings described earlier (associated cluster aka ‘Destination’, RBAC/Allowed Lists, enabled Resource Monitoring)</p>
</li>
</ul>
<p>If you’ve made it this far, that’s a huge step forward, especially if you’re coming from a traditional networking background. You’ve already started to shift from manually pushing scripts to building a scalable, Git-driven platform.</p>
<p>But we’re just getting started.</p>
<p>In <strong>Part 2</strong>, we’ll begin deploying actual infrastructure apps using the GitOps workflow you’ve set up here. We’ll cover MetalLB (for load balancing), Traefik (for ingress), persistent storage with Rook/Ceph, and secrets management with External Secrets and HashiCorp Vault. These hands-on deployments depend on the foundation you just built, so make sure everything is in place before continuing.</p>
<p>Let’s keep building.</p>
]]></content:encoded></item><item><title><![CDATA[Kubernetes and Containerlab: Part 1 – Building a Cluster]]></title><description><![CDATA[Intro
Hello again to all my longtime readers 😉, and welcome to my new series where we’ll dive into the world of Kubernetes and Containerlab (creators named this Clabernetes) to help you build containerized virtual network environments from the groun...]]></description><link>https://blog.nerdylyonsden.io/kubernetes-and-containerlab-part-1-building-a-cluster</link><guid isPermaLink="true">https://blog.nerdylyonsden.io/kubernetes-and-containerlab-part-1-building-a-cluster</guid><category><![CDATA[Kubernetes]]></category><category><![CDATA[containerlab]]></category><category><![CDATA[kubespray]]></category><category><![CDATA[ansible]]></category><dc:creator><![CDATA[Jeffrey Lyon]]></dc:creator><pubDate>Sat, 05 Oct 2024 19:26:28 GMT</pubDate><content:encoded><![CDATA[<h1 id="heading-intro">Intro</h1>
<p>Hello again to all my longtime readers 😉, and welcome to my new series where we’ll dive into the world of <strong>Kubernetes</strong> and <strong>Containerlab</strong> (its creators named this combination <strong>Clabernetes</strong>) to help you build containerized virtual network environments from the ground up. As a former network engineer, my goal with this series is to help those with little to no Kubernetes experience get containerized labs up and running quickly, using minimal resources. Advanced platform engineers seeking a streamlined approach to deploying Kubernetes may also find these posts useful.</p>
<p>We’ll kick off this series by laying the foundation: building a small Kubernetes cluster. We'll use tools like Kubespray (Ansible) along with some of my custom playbooks to streamline the process. By the end of this post, we’ll transform three freshly deployed servers into a fully operational Kubernetes cluster.</p>
<p>Next, I’ll be configuring our existing cluster to deploy a shared MicroCeph storage pool, exploring its use cases, and transforming our cluster into a hyper-converged infrastructure.</p>
<p>From there, we'll explore Clabernetes, a Kubernetes-based solution created by the makers of Containerlab. Clabernetes deploys Containerlab topologies into a Kubernetes cluster, allowing them to scale beyond a single node. It’s designed to support some pretty robust networking labs, and I’ll show you how to create some example topologies. Each post in this series will build on the previous, providing practical, hands-on examples to guide you through every step.</p>
<p>By the end of this series, you’ll have a fully functional Kubernetes setup with advanced storage, networking, and simulation capabilities—perfect for learning, experimenting, or even scaling into production.</p>
<p>Let’s get started!</p>
<h1 id="heading-scenario">Scenario</h1>
<p>Let’s imagine you’re a network engineer (maybe you are!) exploring alternatives to traditional virtual lab solutions. While tools like EVE-NG, GNS3, and Cisco’s CML are popular, they often struggle to scale efficiently. You want to build larger, more complex topologies to enhance both your day-to-day work and your networking knowledge. You've heard a lot about Containerlab recently and are eager to experiment, but you're unsure where to start and your resources are limited. Although you have extensive experience with physical and virtual network setups, container-based environments—especially Kubernetes—are new territory. You've heard about the scalability and flexibility that Kubernetes and Containerlab offer for creating virtual labs and testing advanced topologies, and you want to integrate that power into your workflow.</p>
<p>Let’s begin by discussing the devices used in this post to build the cluster. All of them are lower-spec virtual machines running <strong>Rocky Linux 9.4</strong>. My goal is to demonstrate the power of Kubernetes and Containerlab, even when deployed with minimal, cost-effective resources. There are four devices in total: one server dedicated to Ansible and three Kubernetes nodes. The server specifications are as follows:</p>
<ul>
<li><p>Ansible Host - 1 CPU core, 4 GB RAM, 50 GB OS HD</p>
</li>
<li><p>3x Kubernetes Hosts - 2 CPU cores, 16 GB RAM, 50 GB OS HD, 250 GB HD for MicroCeph storage pool (covered in Part 2 of this series)</p>
</li>
</ul>
<p>These systems will be communicating over the same 172.16.99.x MGMT network.</p>
<p><strong>NOTE:</strong> One host is dedicated as the control plane node and also serves as the etcd server. All three nodes, including the control plane node, will be set as worker nodes and can take on workloads.</p>
<p><strong>NOTE:</strong> <strong>The kubespray automation in this setup is using all other defaults and no additional plugins will be installed.</strong> I may add sections to this post or separate posts in the future covering use of additional plugins.</p>
<h2 id="heading-requirements"><strong>Requirements</strong></h2>
<p>I will include required packages, configuration, and setup for the systems involved in this automation.</p>
<p><strong>NOTE:</strong> Unless specified I am working as a non-root user (“jeff”) and in my home directory.</p>
<h3 id="heading-ansible-host">Ansible host</h3>
<p>You will need the following:</p>
<ul>
<li><p><strong>Update OS</strong></p>
<pre><code class="lang-bash">  sudo dnf update -y
</code></pre>
</li>
<li><p><strong>Python</strong> (3.9 or greater suggested)</p>
<p>  The default on Rocky 9 is Python 3.9; this setup uses 3.10.9. Installing a newer Python on Rocky is a little more involved, so I’ve included the steps below:</p>
<p>  Install Dependencies</p>
<pre><code class="lang-bash">  sudo dnf install tar curl gcc openssl-devel bzip2-devel libffi-devel zlib-devel wget make -y
</code></pre>
<p>  Install Python</p>
<pre><code class="lang-bash">  <span class="hljs-comment"># Download and unzip</span>
  wget https://www.python.org/ftp/python/3.10.9/Python-3.10.9.tar.xz
  tar -xf Python-3.10.9.tar.xz
  <span class="hljs-comment"># Change directory and configure Python</span>
  <span class="hljs-built_in">cd</span> Python-3.10.9
  ./configure --enable-optimizations
  <span class="hljs-comment"># Start and complete the build process</span>
  make -j $(nproc)
  <span class="hljs-comment"># Install Python</span>
  sudo make altinstall
  <span class="hljs-comment"># Verify install using</span>
  python3.10 --version
</code></pre>
</li>
<li><p><strong>Python Virtual Environment</strong></p>
<p>  I suggest using a virtual environment for this setup. It makes it easier to keep Ansible, its modules, and Kubespray separate from anything else the host is being used for.</p>
<pre><code class="lang-bash">  <span class="hljs-comment"># Create the virtual environment</span>
  python3.10 -m venv kubespray_env
  <span class="hljs-comment"># Activate the virtual environment</span>
  <span class="hljs-built_in">source</span> kubespray_env/bin/activate

  <span class="hljs-comment"># To deactivate</span>
  deactivate
</code></pre>
</li>
<li><p><strong>Download Kubespray</strong></p>
<pre><code class="lang-bash">  <span class="hljs-comment"># Change into virtual environment directory</span>
  <span class="hljs-built_in">cd</span> kubespray_env
  <span class="hljs-comment"># Pull down Kubespray from Github</span>
  git <span class="hljs-built_in">clone</span> https://github.com/kubernetes-sigs/kubespray.git
</code></pre>
</li>
<li><p><strong>Install Ansible and Kubespray packages within the virtual environment</strong></p>
<pre><code class="lang-bash">  <span class="hljs-comment"># From within the virtual environment main folder</span>
  <span class="hljs-built_in">cd</span> kubespray
  <span class="hljs-comment"># Install packages (Ansible mainly)</span>
  pip3.10 install -r requirements.txt
</code></pre>
</li>
<li><p><strong>Tweak Ansible configuration</strong></p>
<p>  Modify your <code>ansible.cfg</code> file to ignore <code>host_key_checking</code>. It’s usually located in <code>/etc/ansible/</code>; create a new file if none exists.</p>
<pre><code class="lang-ini">  <span class="hljs-section">[defaults]</span>
  <span class="hljs-attr">host_key_checking</span> = <span class="hljs-literal">False</span>
</code></pre>
<p>  <strong>NOTE</strong>: If you’re unsure where to find your <code>ansible.cfg</code>, just run <code>ansible --version</code> as shown below:</p>
<pre><code class="lang-bash">  ansible --version

  ansible [core 2.16.3]
    config file = /etc/ansible/ansible.cfg
</code></pre>
</li>
<li><p><strong>Download my custom kubespray-addons repository</strong></p>
<pre><code class="lang-bash">  <span class="hljs-comment"># Change directory to root virtual environment folder</span>
  <span class="hljs-built_in">cd</span> ~/kubespray_env
  <span class="hljs-comment"># Pull down kubespray-addons from Github</span>
  git <span class="hljs-built_in">clone</span> https://github.com/leothelyon17/kubespray-addons.git
</code></pre>
</li>
</ul>
<h3 id="heading-kubernetes-nodes-freshly-created-vms">Kubernetes Nodes (freshly created VMs)</h3>
<ul>
<li><p><strong>Upgrade OS</strong> (same as Ansible host)</p>
<pre><code class="lang-bash">  sudo dnf update -y

  <span class="hljs-comment"># Optional</span>
  sudo dnf install nano -y
</code></pre>
<p>  <strong>NOTE:</strong> I also install the Nano text editor on these nodes for quick file editing if needed. The default Python 3.9 included in the OS install works just fine.</p>
</li>
</ul>
<p>That’s it! The automation takes care of everything else.</p>
<h2 id="heading-getting-into-the-weeds"><strong>Getting into the Weeds</strong></h2>
<h3 id="heading-automation-overview-and-breakdown"><strong>Automation Overview and Breakdown</strong></h3>
<p>We’ll start with a quick overview of Kubespray, then cover my custom add-on automation and what it aims to accomplish. Finally, we’ll break down both Addons playbooks—Pre and Post.</p>
<h3 id="heading-kubespray">Kubespray</h3>
<p>Kubespray is an open-source tool that automates the deployment of highly available Kubernetes clusters. It uses Ansible playbooks to install and configure Kubernetes across various environments, including bare-metal servers, virtual machines, or cloud infrastructures. Kubespray simplifies the deployment process, providing a robust, flexible, and scalable solution for setting up production-grade Kubernetes clusters.</p>
<p>Kubespray is undoubtedly a powerful tool. However, as I worked through various tutorials to get started, I noticed the number of steps required, such as setting up the inventory, configuring server settings, and fixing <code>kubeadm</code> on the control nodes once the cluster is up and running. That friction is something I felt needed to be addressed, which brings us to the next section…</p>
<h3 id="heading-kubespray-addons-custom-automation">Kubespray-Addons (Custom Automation)</h3>
<p>I wanted to make using Kubespray and getting a K8s cluster up and running even easier than it already is. This is especially true for my fellow network engineers who might be new to all things Kubernetes, or anyone who doesn’t want to spend extra time on the additional setup required to run Kubespray.</p>
<p>The initial setup for Kubespray requires users to define environment variables, which are then passed into a Python script to generate the necessary <code>inventory.yml</code> file. This approach, outlined in the official Kubespray documentation and many online tutorials, produces an inventory file with numerous predefined defaults. However, users often still need to manually modify the Kubespray inventory file afterward. My goal was to create a more intuitive and streamlined solution—one that not only generates the required Kubespray inventory file but also serves as the inventory for the Addons playbooks.</p>
<h3 id="heading-inventory-inventoryyml">Inventory - <code>inventory.yml</code></h3>
<p>Let’s break down the inventory file with an example:</p>
<pre><code class="lang-yaml"><span class="hljs-meta">---</span>
<span class="hljs-attr">all:</span>
  <span class="hljs-attr">hosts:</span>
    <span class="hljs-attr">rocky9-lab-node1:</span>
      <span class="hljs-attr">ansible_host:</span> <span class="hljs-number">172.16</span><span class="hljs-number">.99</span><span class="hljs-number">.25</span>
      <span class="hljs-attr">domain_name:</span> <span class="hljs-string">jjland.local</span>
      <span class="hljs-attr">master_node:</span> <span class="hljs-literal">True</span>
      <span class="hljs-attr">worker_node:</span> <span class="hljs-literal">True</span>
      <span class="hljs-attr">etcd_node:</span> <span class="hljs-literal">True</span>
    <span class="hljs-attr">rocky9-lab-node2:</span>
      <span class="hljs-attr">ansible_host:</span> <span class="hljs-number">172.16</span><span class="hljs-number">.99</span><span class="hljs-number">.26</span>
      <span class="hljs-attr">domain_name:</span> <span class="hljs-string">jjland.local</span>
      <span class="hljs-attr">master_node:</span> <span class="hljs-literal">False</span>
      <span class="hljs-attr">worker_node:</span> <span class="hljs-literal">True</span>
      <span class="hljs-attr">etcd_node:</span> <span class="hljs-literal">False</span>
    <span class="hljs-attr">rocky9-lab-node3:</span>
      <span class="hljs-attr">ansible_host:</span> <span class="hljs-number">172.16</span><span class="hljs-number">.99</span><span class="hljs-number">.27</span>
      <span class="hljs-attr">domain_name:</span> <span class="hljs-string">jjland.local</span>
      <span class="hljs-attr">master_node:</span> <span class="hljs-literal">False</span>
      <span class="hljs-attr">worker_node:</span> <span class="hljs-literal">True</span>
      <span class="hljs-attr">etcd_node:</span> <span class="hljs-literal">False</span>
    <span class="hljs-attr">rocky9-lab-mgmt:</span>
      <span class="hljs-attr">ansible_host:</span> <span class="hljs-number">172.16</span><span class="hljs-number">.99</span><span class="hljs-number">.20</span>
      <span class="hljs-attr">domain_name:</span> <span class="hljs-string">jjland.local</span>

  <span class="hljs-attr">children:</span>
    <span class="hljs-attr">k8s_nodes:</span>
      <span class="hljs-attr">hosts:</span>
        <span class="hljs-attr">rocky9-lab-node1:</span>
        <span class="hljs-attr">rocky9-lab-node2:</span>
        <span class="hljs-attr">rocky9-lab-node3:</span>
    <span class="hljs-attr">ansible_nodes:</span>
      <span class="hljs-attr">hosts:</span>
        <span class="hljs-attr">rocky9-lab-mgmt:</span>

  <span class="hljs-attr">vars:</span>
    <span class="hljs-attr">ansible_user:</span> <span class="hljs-string">jeff</span>
</code></pre>
<p>The file, which users need to customize, is based on the official Kubespray inventory file but with some key improvements. My version allows users to predefine the roles of each node—something the official method doesn’t provide. It also specifies individual host names, used not only in the Addons playbooks but also to properly name the Kubernetes nodes, instead of using the default 'node' from the Kubespray file. Additionally, it defines the domain name, used for updating the <code>/etc/hosts</code> file on all hosts during a task in the Pre-Kubespray playbook. It also sets the <code>ansible_host</code> variable for device connections and configures the <code>ansible_user</code> for all Addons playbook tasks.</p>
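<p>For context, the Kubespray inventory (<code>k8s-hosts.yml</code>) that the Pre-Kubespray playbook later generates from this file follows Kubespray’s standard group layout. Roughly sketched for the hosts above — the exact output depends on the Jinja2 template — it would look something like this:</p>
<pre><code class="lang-yaml">all:
  hosts:
    rocky9-lab-node1:
      ansible_host: 172.16.99.25
    rocky9-lab-node2:
      ansible_host: 172.16.99.26
    rocky9-lab-node3:
      ansible_host: 172.16.99.27
  children:
    kube_control_plane:        # hosts with master_node: True
      hosts:
        rocky9-lab-node1:
    kube_node:                 # hosts with worker_node: True
      hosts:
        rocky9-lab-node1:
        rocky9-lab-node2:
        rocky9-lab-node3:
    etcd:                      # hosts with etcd_node: True
      hosts:
        rocky9-lab-node1:
    k8s_cluster:
      children:
        kube_control_plane:
        kube_node:
</code></pre>
<p>This is why predefining the role flags in the custom inventory matters: each flag determines which Kubespray group a node lands in.</p>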
<h3 id="heading-pre-kubespray-playbook-pre-kubespray-setup-pbyml">Pre-Kubespray Playbook - <code>pre-kubespray-setup-pb.yml</code></h3>
<p>This playbook consists of two plays and is designed to fully prepare a set of hosts for Kubernetes deployment using Kubespray. It installs required Ansible collections, sets up SSH key-based authentication, modifies system configurations (disables swap, configures sysctl settings), ensures required kernel modules are loaded, and configures firewall rules.</p>
<p><strong>Play 1 - Pre Kubespray Setup</strong></p>
<pre><code class="lang-yaml"><span class="hljs-meta">---</span>
<span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Pre</span> <span class="hljs-string">Kubespray</span> <span class="hljs-string">Setup</span>
  <span class="hljs-attr">hosts:</span> <span class="hljs-string">all</span>
  <span class="hljs-attr">gather_facts:</span> <span class="hljs-literal">false</span>

  <span class="hljs-attr">tasks:</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Install</span> <span class="hljs-string">collections</span> <span class="hljs-string">from</span> <span class="hljs-string">requirements.yml</span>
      <span class="hljs-attr">ansible.builtin.command:</span>
        <span class="hljs-attr">cmd:</span> <span class="hljs-string">ansible-galaxy</span> <span class="hljs-string">collection</span> <span class="hljs-string">install</span> <span class="hljs-string">-r</span> <span class="hljs-string">requirements.yml</span>
      <span class="hljs-attr">delegate_to:</span> <span class="hljs-string">localhost</span>
      <span class="hljs-attr">run_once:</span> <span class="hljs-literal">true</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Generate</span> <span class="hljs-string">SSH</span> <span class="hljs-string">key</span> <span class="hljs-string">pair</span>
      <span class="hljs-attr">openssh_keypair:</span>
        <span class="hljs-attr">path:</span> <span class="hljs-string">"/home/<span class="hljs-template-variable">{{ ansible_user }}</span>/.ssh/kubespray_ansible"</span>
        <span class="hljs-attr">type:</span> <span class="hljs-string">rsa</span>
        <span class="hljs-attr">size:</span> <span class="hljs-number">2048</span>
        <span class="hljs-attr">state:</span> <span class="hljs-string">present</span>
        <span class="hljs-attr">mode:</span> <span class="hljs-string">'0600'</span>
      <span class="hljs-attr">register:</span> <span class="hljs-string">ssh_keypair_result</span>
      <span class="hljs-attr">delegate_to:</span> <span class="hljs-string">localhost</span>
      <span class="hljs-attr">run_once:</span> <span class="hljs-literal">true</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Ensure</span> <span class="hljs-string">the</span> <span class="hljs-string">SSH</span> <span class="hljs-string">public</span> <span class="hljs-string">key</span> <span class="hljs-string">is</span> <span class="hljs-string">present</span> <span class="hljs-string">on</span> <span class="hljs-string">the</span> <span class="hljs-string">remote</span> <span class="hljs-string">host</span>
      <span class="hljs-attr">authorized_key:</span>
        <span class="hljs-attr">user:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ ansible_user }}</span>"</span>
        <span class="hljs-attr">state:</span> <span class="hljs-string">present</span>
        <span class="hljs-attr">key:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ lookup('file', '/home/{{ ansible_user }}</span>/.ssh/kubespray_ansible.pub') }}"</span>
      <span class="hljs-attr">when:</span> <span class="hljs-string">inventory_hostname</span> <span class="hljs-string">not</span> <span class="hljs-string">in</span> <span class="hljs-string">groups['ansible_nodes']</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Add</span> <span class="hljs-string">entries</span> <span class="hljs-string">to</span> <span class="hljs-string">/etc/hosts</span>
      <span class="hljs-attr">become:</span> <span class="hljs-literal">true</span>
      <span class="hljs-attr">lineinfile:</span>
        <span class="hljs-attr">path:</span> <span class="hljs-string">/etc/hosts</span>
        <span class="hljs-attr">state:</span> <span class="hljs-string">present</span>
        <span class="hljs-attr">line:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ hostvars[item].ansible_host }}</span> <span class="hljs-template-variable">{{ hostvars[item].inventory_hostname }}</span>.<span class="hljs-template-variable">{{ hostvars[item].domain_name }}</span> <span class="hljs-template-variable">{{ hostvars[item].inventory_hostname }}</span>"</span>
        <span class="hljs-attr">backup:</span> <span class="hljs-literal">yes</span>
      <span class="hljs-attr">loop:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ groups['all'] }}</span>"</span>
      <span class="hljs-attr">loop_control:</span>
        <span class="hljs-attr">loop_var:</span> <span class="hljs-string">item</span>
</code></pre>
<p><strong>Purpose:</strong><br />This playbook ensures that all hosts are prepared for Kubespray by installing required Ansible collections, generating SSH keys, and configuring the environment.</p>
<p><strong>Hosts</strong>:<br />Targets <strong>all</strong> hosts unless overridden in a specific task.</p>
<p><strong>Tasks:</strong></p>
<ol>
<li><p><strong>Install collections from</strong> <code>requirements.yml</code></p>
<p> Installs required Ansible collections from <code>requirements.yml</code>. This is only run once on the <a target="_blank" href="http://localhost">localhost</a>. Right now, the only requirements are <code>community.crypto</code> and <code>ansible.posix</code>.</p>
</li>
<li><p><strong>Generate SSH key pair</strong><br /> Generates an RSA SSH key pair for Ansible on the <a target="_blank" href="http://localhost">localhost</a> for later access to remote hosts. The private key is stored in the <code>.ssh</code> directory under <code>kubespray_ansible</code>.</p>
</li>
<li><p><strong>Ensure the SSH public key is present on the remote host</strong><br /> Adds the generated SSH public key to the remote hosts to allow passwordless access. It applies this only to hosts not in the <code>ansible_nodes</code> group.</p>
<p> <strong>NOTE:</strong> You will still need the ‘sudo’ password. This also leaves flexibility to add a task for passwordless sudo; I may add that feature later.</p>
</li>
<li><p><strong>Add entries to</strong> <code>/etc/hosts</code><br /> Adds entries to the <code>/etc/hosts</code> file on each host to ensure proper DNS resolution between them. It loops through all hosts in the inventory and updates their hosts file with IP addresses and hostnames.</p>
</li>
</ol>
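<p>As an example of what that last task produces: given the inventory shown earlier, every host’s <code>/etc/hosts</code> file would gain lines of the form <code>&lt;ip&gt; &lt;host&gt;.&lt;domain&gt; &lt;host&gt;</code> (a sketch derived from the template string in the task):</p>
<pre><code class="lang-plaintext">172.16.99.25 rocky9-lab-node1.jjland.local rocky9-lab-node1
172.16.99.26 rocky9-lab-node2.jjland.local rocky9-lab-node2
172.16.99.27 rocky9-lab-node3.jjland.local rocky9-lab-node3
172.16.99.20 rocky9-lab-mgmt.jjland.local rocky9-lab-mgmt
</code></pre>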
<p><strong>Play 2 - Build Kubespray inventory and additional k8s node setup</strong></p>
<pre><code class="lang-yaml"><span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Build</span> <span class="hljs-string">Kubespray</span> <span class="hljs-string">inventory</span> <span class="hljs-string">and</span> <span class="hljs-string">additional</span> <span class="hljs-string">k8s</span> <span class="hljs-string">node</span> <span class="hljs-string">setup</span>
  <span class="hljs-attr">hosts:</span> <span class="hljs-string">k8s_nodes</span>
  <span class="hljs-attr">gather_facts:</span> <span class="hljs-literal">false</span>
  <span class="hljs-attr">tasks:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Create</span> <span class="hljs-string">inventory</span> <span class="hljs-string">directory</span> <span class="hljs-string">if</span> <span class="hljs-string">it</span> <span class="hljs-string">does</span> <span class="hljs-string">not</span> <span class="hljs-string">exist</span>
      <span class="hljs-attr">ansible.builtin.file:</span>
        <span class="hljs-attr">path:</span> <span class="hljs-string">../kubespray/inventory/</span>
        <span class="hljs-attr">state:</span> <span class="hljs-string">directory</span>
        <span class="hljs-attr">mode:</span> <span class="hljs-string">'0755'</span>
      <span class="hljs-attr">delegate_to:</span> <span class="hljs-string">localhost</span>
      <span class="hljs-attr">run_once:</span> <span class="hljs-literal">true</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Generate</span> <span class="hljs-string">inventory.yml</span> <span class="hljs-string">for</span> <span class="hljs-string">kubespray</span> <span class="hljs-string">using</span> <span class="hljs-string">Jinja2</span>
      <span class="hljs-attr">template:</span>
        <span class="hljs-attr">src:</span> <span class="hljs-string">./templates/kubespray-inventory-yaml.j2</span>
        <span class="hljs-attr">dest:</span> <span class="hljs-string">./k8s-hosts.yml</span>
      <span class="hljs-attr">delegate_to:</span> <span class="hljs-string">localhost</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Copy</span> <span class="hljs-string">completed</span> <span class="hljs-string">template</span> <span class="hljs-string">to</span> <span class="hljs-string">kubespray</span> <span class="hljs-string">inventory</span> <span class="hljs-string">folder</span>
      <span class="hljs-attr">ansible.builtin.copy:</span>
        <span class="hljs-attr">src:</span> <span class="hljs-string">./k8s-hosts.yml</span>
        <span class="hljs-attr">dest:</span> <span class="hljs-string">../kubespray/inventory</span>
        <span class="hljs-attr">mode:</span> <span class="hljs-string">'0755'</span>
      <span class="hljs-attr">delegate_to:</span> <span class="hljs-string">localhost</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Disable</span> <span class="hljs-string">swap</span>
      <span class="hljs-attr">become:</span> <span class="hljs-literal">true</span>
      <span class="hljs-attr">ansible.builtin.command:</span> <span class="hljs-string">swapoff</span> <span class="hljs-string">-a</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Remove</span> <span class="hljs-string">swap</span> <span class="hljs-string">entry</span> <span class="hljs-string">from</span> <span class="hljs-string">/etc/fstab</span>
      <span class="hljs-attr">become:</span> <span class="hljs-literal">true</span>
      <span class="hljs-attr">ansible.builtin.replace:</span>
        <span class="hljs-attr">path:</span> <span class="hljs-string">/etc/fstab</span>
        <span class="hljs-attr">regexp:</span> <span class="hljs-string">'(^.*swap.*$)'</span>
        <span class="hljs-attr">replace:</span> <span class="hljs-string">'# \1'</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Load</span> <span class="hljs-string">necessary</span> <span class="hljs-string">kernel</span> <span class="hljs-string">modules</span>
      <span class="hljs-attr">become:</span> <span class="hljs-literal">true</span>
      <span class="hljs-attr">ansible.builtin.modprobe:</span>
        <span class="hljs-attr">name:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ item }}</span>"</span>
      <span class="hljs-attr">loop:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-string">br_netfilter</span>
        <span class="hljs-bullet">-</span> <span class="hljs-string">overlay</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Ensure</span> <span class="hljs-string">kernel</span> <span class="hljs-string">modules</span> <span class="hljs-string">are</span> <span class="hljs-string">loaded</span> <span class="hljs-string">on</span> <span class="hljs-string">boot</span>
      <span class="hljs-attr">become:</span> <span class="hljs-literal">true</span>
      <span class="hljs-attr">ansible.builtin.copy:</span>
        <span class="hljs-attr">dest:</span> <span class="hljs-string">/etc/modules-load.d/kubernetes.conf</span>
        <span class="hljs-attr">content:</span> <span class="hljs-string">|
          br_netfilter
          overlay
</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Configure</span> <span class="hljs-string">sysctl</span> <span class="hljs-string">for</span> <span class="hljs-string">Kubernetes</span> <span class="hljs-string">networking</span>
      <span class="hljs-attr">become:</span> <span class="hljs-literal">true</span>
      <span class="hljs-attr">ansible.builtin.copy:</span>
        <span class="hljs-attr">dest:</span> <span class="hljs-string">/etc/sysctl.d/kubernetes.conf</span>
        <span class="hljs-attr">content:</span> <span class="hljs-string">|
          net.bridge.bridge-nf-call-ip6tables = 1
          net.bridge.bridge-nf-call-iptables = 1
          net.ipv4.ip_forward = 1
</span>      <span class="hljs-attr">notify:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-string">Reload</span> <span class="hljs-string">sysctl</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Apply</span> <span class="hljs-string">sysctl</span> <span class="hljs-string">settings</span>
      <span class="hljs-attr">become:</span> <span class="hljs-literal">true</span>
      <span class="hljs-attr">ansible.builtin.command:</span> <span class="hljs-string">sysctl</span> <span class="hljs-string">--system</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Configure</span> <span class="hljs-string">firewall</span> <span class="hljs-string">rules</span> <span class="hljs-string">for</span> <span class="hljs-string">Kubernetes</span>
      <span class="hljs-attr">become:</span> <span class="hljs-literal">true</span>
      <span class="hljs-attr">ansible.builtin.firewalld:</span>
        <span class="hljs-attr">service:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ item }}</span>"</span>
        <span class="hljs-attr">permanent:</span> <span class="hljs-literal">yes</span>
        <span class="hljs-attr">state:</span> <span class="hljs-string">enabled</span>
        <span class="hljs-attr">immediate:</span> <span class="hljs-literal">yes</span>
      <span class="hljs-attr">loop:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-string">ssh</span>
        <span class="hljs-bullet">-</span> <span class="hljs-string">http</span>
        <span class="hljs-bullet">-</span> <span class="hljs-string">https</span>
        <span class="hljs-bullet">-</span> <span class="hljs-string">kube-api</span> 
        <span class="hljs-bullet">-</span> <span class="hljs-string">kube-apiserver</span>
        <span class="hljs-bullet">-</span> <span class="hljs-string">kube-control-plane</span>
        <span class="hljs-bullet">-</span> <span class="hljs-string">kube-control-plane-secure</span> 
        <span class="hljs-bullet">-</span> <span class="hljs-string">kube-controller-manager</span>
        <span class="hljs-bullet">-</span> <span class="hljs-string">kube-controller-manager-secure</span>
        <span class="hljs-bullet">-</span> <span class="hljs-string">kube-nodeport-services</span>
        <span class="hljs-bullet">-</span> <span class="hljs-string">kube-scheduler</span> 
        <span class="hljs-bullet">-</span> <span class="hljs-string">kube-scheduler-secure</span>
        <span class="hljs-bullet">-</span> <span class="hljs-string">kube-worker</span> 
        <span class="hljs-bullet">-</span> <span class="hljs-string">kubelet</span>
        <span class="hljs-bullet">-</span> <span class="hljs-string">kubelet-readonly</span> 
        <span class="hljs-bullet">-</span> <span class="hljs-string">kubelet-worker</span>
        <span class="hljs-bullet">-</span> <span class="hljs-string">etcd-server</span>
      <span class="hljs-attr">notify:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-string">Reload</span> <span class="hljs-string">firewalld</span>
</code></pre>
<p><strong>Purpose:</strong><br />This playbook sets up the environment for Kubernetes and configures the nodes for a Kubespray deployment.</p>
<p><strong>Hosts</strong>:<br />Targets the 3 Kubernetes nodes only.</p>
<p><strong>Tasks:</strong></p>
<ol>
<li><p><strong>Create inventory directory if it does not exist</strong><br /> Creates the inventory directory required by Kubespray to store the <code>inventory.yml</code> file.</p>
</li>
<li><p><strong>Generate</strong> <code>inventory.yml</code> <strong>for Kubespray using Jinja2</strong><br /> Uses a Jinja2 template to create the <code>inventory.yml</code> file that Kubespray needs, based on the defined hosts.</p>
</li>
<li><p><strong>Copy completed template to Kubespray inventory folder</strong><br /> Copies the generated <code>inventory.yml</code> file into the Kubespray directory for further use.</p>
</li>
<li><p><strong>Disable swap</strong><br /> Disables swap on the hosts as required by Kubernetes.</p>
</li>
<li><p><strong>Remove swap entry from</strong> <code>/etc/fstab</code><br /> Removes any entries related to swap in <code>/etc/fstab</code> to prevent it from re-enabling at boot.</p>
</li>
<li><p><strong>Load necessary kernel modules</strong><br /> Loads required kernel modules (<code>br_netfilter</code> and <code>overlay</code>) for Kubernetes networking.</p>
</li>
<li><p><strong>Ensure kernel modules are loaded on boot</strong><br /> Adds kernel modules to <code>/etc/modules-load.d/kubernetes.conf</code> to ensure they are loaded on boot.</p>
</li>
<li><p><strong>Configure sysctl for Kubernetes networking</strong><br /> Configures sysctl settings to enable IP forwarding and ensure proper Kubernetes networking (<code>net.bridge.bridge-nf-call-iptables</code>, <code>net.ipv4.ip_forward</code>).</p>
</li>
<li><p><strong>Apply sysctl settings</strong></p>
<p> Applies the sysctl settings to ensure they are active immediately.</p>
</li>
<li><p><strong>Configure firewall rules for Kubernetes</strong><br />Configures firewalld to allow traffic on essential Kubernetes services like <code>ssh</code>, <code>kube-api</code>, and more.</p>
</li>
</ol>
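<p>One caveat worth noting on the fstab task: the pattern <code>'(^.*swap.*$)'</code> also matches lines that are already commented out, so each re-run of the playbook prepends another <code>#</code>. A slightly tighter pattern (a sketch of an alternative, not what the playbook above uses) only touches active entries, keeping repeated runs idempotent:</p>
<pre><code class="lang-yaml">    - name: Remove swap entry from /etc/fstab
      become: true
      ansible.builtin.replace:
        path: /etc/fstab
        # Skip lines that already start with '#', so re-runs
        # don't stack additional comment markers
        regexp: '^([^#].*\sswap\s.*)$'
        replace: '# \1'
</code></pre>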
<p><strong>Handlers</strong></p>
<pre><code class="lang-yaml"> <span class="hljs-attr">handlers:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Reload</span> <span class="hljs-string">firewalld</span>
    <span class="hljs-attr">become:</span> <span class="hljs-literal">true</span>
    <span class="hljs-attr">ansible.builtin.command:</span> <span class="hljs-string">systemctl</span> <span class="hljs-string">reload</span> <span class="hljs-string">firewalld</span>

  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Reload</span> <span class="hljs-string">sysctl</span>
    <span class="hljs-attr">become:</span> <span class="hljs-literal">true</span>
    <span class="hljs-attr">ansible.builtin.command:</span> <span class="hljs-string">sysctl</span> <span class="hljs-string">--system</span>
</code></pre>
<p><strong>Purpose:</strong><br />The handlers will reload the firewall and apply sysctl settings when triggered.</p>
<ol>
<li><p><strong>Reload firewalld</strong></p>
<p> Reloads the firewalld service to apply the newly configured rules.</p>
</li>
<li><p><strong>Reload sysctl</strong></p>
<p> Reloads the sysctl configurations to apply networking changes.</p>
</li>
</ol>
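<p>As an aside, the separate "Apply sysctl settings" task and the "Reload sysctl" handler could be folded into the <code>ansible.posix.sysctl</code> module, which writes the file and reloads the settings in one idempotent step. A minimal sketch of that alternative, assuming the <code>ansible.posix</code> collection is installed (the <code>k8s_sysctl</code> variable name here is illustrative):</p>
<pre><code class="lang-yaml">    - name: Configure sysctl for Kubernetes networking
      become: true
      ansible.posix.sysctl:
        name: "{{ item.key }}"
        value: "{{ item.value }}"
        sysctl_file: /etc/sysctl.d/kubernetes.conf
        reload: true
      loop: "{{ k8s_sysctl | dict2items }}"
      vars:
        k8s_sysctl:
          net.bridge.bridge-nf-call-ip6tables: 1
          net.bridge.bridge-nf-call-iptables: 1
          net.ipv4.ip_forward: 1
</code></pre>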
<h3 id="heading-post-kubespray-playbook-post-kubespray-setup-pbyml">Post-Kubespray Playbook - <code>post-kubespray-setup-pb.yml</code></h3>
<p>The Post playbook currently performs a single function, though it involves several tasks. It downloads the latest version of <code>kubectl</code>, puts the admin kubeconfig in place, and ensures proper file ownership for the user. The playbook primarily uses Ansible's <code>file</code> and <code>shell</code> modules, essentially turning a series of steps from the documentation into an automated process. Notably, these <code>kubectl</code> tasks only run on hosts designated as Kubernetes control-plane nodes. I plan to expand this playbook in the future to include additional tasks, such as tests and more.</p>
<pre><code class="lang-yaml"><span class="hljs-meta">---</span>
<span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Setup</span> <span class="hljs-string">kubectl</span> <span class="hljs-string">on</span> <span class="hljs-string">control</span> <span class="hljs-string">plane</span> <span class="hljs-string">nodes</span>
  <span class="hljs-attr">hosts:</span> <span class="hljs-string">k8s_nodes</span>
  <span class="hljs-attr">gather_facts:</span> <span class="hljs-literal">false</span>
  <span class="hljs-attr">tasks:</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Kubectl</span> <span class="hljs-string">block</span>
      <span class="hljs-attr">block:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Download</span> <span class="hljs-string">kubectl</span> <span class="hljs-string">files</span> <span class="hljs-string">(latest)</span>
          <span class="hljs-attr">ansible.builtin.shell:</span>
            <span class="hljs-attr">cmd:</span> <span class="hljs-string">curl</span> <span class="hljs-string">-LO</span> <span class="hljs-string">https://storage.googleapis.com/kubernetes-release/release/`curl</span> <span class="hljs-string">-s</span> <span class="hljs-string">https://storage.googleapis.com/kubernetes-release/release/stable.txt`/bin/linux/amd64/kubectl</span>
            <span class="hljs-attr">chdir:</span> <span class="hljs-string">/home/{{</span> <span class="hljs-string">ansible_user</span> <span class="hljs-string">}}</span>

        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Copy</span> <span class="hljs-string">kubernetes</span> <span class="hljs-string">admin</span> <span class="hljs-string">configuration</span>
          <span class="hljs-attr">become:</span> <span class="hljs-literal">true</span>
          <span class="hljs-attr">ansible.builtin.shell:</span>
            <span class="hljs-attr">cmd:</span> <span class="hljs-string">cp</span> <span class="hljs-string">/etc/kubernetes/admin.conf</span> <span class="hljs-string">/home/{{</span> <span class="hljs-string">ansible_user</span> <span class="hljs-string">}}/config</span>
            <span class="hljs-attr">chdir:</span> <span class="hljs-string">/home/{{</span> <span class="hljs-string">ansible_user</span> <span class="hljs-string">}}</span>

        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Remove</span> <span class="hljs-string">existing</span> <span class="hljs-string">.kube</span> <span class="hljs-string">directory</span>
          <span class="hljs-attr">ansible.builtin.file:</span>
            <span class="hljs-attr">path:</span> <span class="hljs-string">/home/{{</span> <span class="hljs-string">ansible_user</span> <span class="hljs-string">}}/.kube</span>
            <span class="hljs-attr">state:</span> <span class="hljs-string">absent</span>

        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Create</span> <span class="hljs-string">fresh</span> <span class="hljs-string">.kube</span> <span class="hljs-string">directory</span>
          <span class="hljs-attr">ansible.builtin.file:</span>
            <span class="hljs-attr">path:</span> <span class="hljs-string">/home/{{</span> <span class="hljs-string">ansible_user</span> <span class="hljs-string">}}/.kube</span>
            <span class="hljs-attr">state:</span> <span class="hljs-string">directory</span>
            <span class="hljs-attr">mode:</span> <span class="hljs-string">'0755'</span>

        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Move</span> <span class="hljs-string">kubernetes</span> <span class="hljs-string">admin</span> <span class="hljs-string">configuration</span>
          <span class="hljs-attr">ansible.builtin.shell:</span>
            <span class="hljs-attr">cmd:</span> <span class="hljs-string">mv</span> <span class="hljs-string">config</span> <span class="hljs-string">.kube/</span>
            <span class="hljs-attr">chdir:</span> <span class="hljs-string">/home/{{</span> <span class="hljs-string">ansible_user</span> <span class="hljs-string">}}</span>

        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Correct</span> <span class="hljs-string">ownership</span> <span class="hljs-string">of</span> <span class="hljs-string">.kube</span> <span class="hljs-string">config</span>
          <span class="hljs-attr">become:</span> <span class="hljs-literal">true</span>
          <span class="hljs-attr">ansible.builtin.file:</span>
            <span class="hljs-attr">path:</span> <span class="hljs-string">/home/{{</span> <span class="hljs-string">ansible_user</span> <span class="hljs-string">}}/.kube/config</span>
            <span class="hljs-attr">owner:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ ansible_user }}</span>"</span>
            <span class="hljs-attr">group:</span> <span class="hljs-number">1000</span>

      <span class="hljs-attr">when:</span> <span class="hljs-string">hostvars[inventory_hostname]['master_node']</span>
</code></pre>
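<p>The <code>cp</code>/<code>mv</code> shell steps above could also be collapsed into a single idempotent task using <code>ansible.builtin.copy</code> with <code>remote_src</code>. A hedged sketch of that alternative (it assumes the earlier <code>file</code> task has already created the <code>.kube</code> directory):</p>
<pre><code class="lang-yaml">        - name: Install kubernetes admin configuration into ~/.kube
          become: true
          ansible.builtin.copy:
            # remote_src copies a file that already exists on the
            # managed host instead of pushing one from the controller
            src: /etc/kubernetes/admin.conf
            dest: "/home/{{ ansible_user }}/.kube/config"
            remote_src: true
            owner: "{{ ansible_user }}"
            group: "{{ ansible_user }}"
            mode: '0600'
</code></pre>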
<h2 id="heading-building-the-k8s-cluster-running-the-playbook"><strong>Building the K8s Cluster (Running the Playbook)</strong></h2>
<h3 id="heading-pre-kubespray-setup-pbyml"><code>pre-kubespray-setup-pb.yml</code></h3>
<p>To start, the inventory file needs to be created or modified. If it was pulled down from the GitHub repository, you will only need to modify it to match your environment. If not, the example <code>inventory.yml</code> file shown earlier in this post can be used. In the example file there is just a single control-plane node and etcd server (rocky9-lab-node1), and all 3 nodes are set as worker nodes.</p>
<p>To execute the playbook run the following command -</p>
<p><code>ansible-playbook pre-kubespray-setup-pb.yml -i inventory.yml --ask-become-pass --ask-pass</code></p>
<p><strong>NOTE:</strong> This assumes all previous setup was completed, the Python virtual environment is active, and the <code>kubespray-addons</code> folder sits adjacent to the main Kubespray folder. Otherwise this playbook will fail.</p>
<p>The SSH/sudo passwords for the K8s nodes will need to be entered.</p>
<p>Below is an example output of successful Pre Kubespray playbook run -</p>
<pre><code class="lang-typescript">(kubespray_env) [jeff<span class="hljs-meta">@rocky9</span>-lab-mgmt kubespray-addons]$ ansible-playbook pre-kubespray-setup-pb.yml -i inventory.yml --ask-become-pass --ask-pass
SSH password: 
BECOME password[defaults to SSH password]: 

PLAY [Pre Kubespray Setup] ***************************************************************************************************************************

TASK [Install collections <span class="hljs-keyword">from</span> requirements.yml] *****************************************************************************************************
changed: [rocky9-lab-node1 -&gt; localhost]

TASK [Generate SSH key pair] *************************************************************************************************************************
ok: [rocky9-lab-node1 -&gt; localhost]

TASK [Ensure the SSH <span class="hljs-keyword">public</span> key is present on the remote host] ***************************************************************************************
skipping: [rocky9-lab-mgmt]
changed: [rocky9-lab-node1]
changed: [rocky9-lab-node2]
changed: [rocky9-lab-node3]

TASK [Add entries to /etc/hosts] *********************************************************************************************************************
changed: [rocky9-lab-node2] =&gt; (item=rocky9-lab-node1)
changed: [rocky9-lab-node1] =&gt; (item=rocky9-lab-node1)
changed: [rocky9-lab-node3] =&gt; (item=rocky9-lab-node1)
changed: [rocky9-lab-node1] =&gt; (item=rocky9-lab-node2)
changed: [rocky9-lab-node2] =&gt; (item=rocky9-lab-node2)
changed: [rocky9-lab-node3] =&gt; (item=rocky9-lab-node2)
ok: [rocky9-lab-mgmt] =&gt; (item=rocky9-lab-node1)
changed: [rocky9-lab-node2] =&gt; (item=rocky9-lab-node3)
changed: [rocky9-lab-node1] =&gt; (item=rocky9-lab-node3)
changed: [rocky9-lab-node3] =&gt; (item=rocky9-lab-node3)
changed: [rocky9-lab-node2] =&gt; (item=rocky9-lab-mgmt)
changed: [rocky9-lab-node1] =&gt; (item=rocky9-lab-mgmt)
ok: [rocky9-lab-mgmt] =&gt; (item=rocky9-lab-node2)
changed: [rocky9-lab-node3] =&gt; (item=rocky9-lab-mgmt)
ok: [rocky9-lab-mgmt] =&gt; (item=rocky9-lab-node3)
ok: [rocky9-lab-mgmt] =&gt; (item=rocky9-lab-mgmt)

PLAY [Build Kubespray inventory and additional k8s node setup] ***************************************************************************************

TASK [Create inventory directory <span class="hljs-keyword">if</span> it does not exist] ***********************************************************************************************
ok: [rocky9-lab-node1 -&gt; localhost]

TASK [Generate inventory.yml <span class="hljs-keyword">for</span> kubespray using Jinja2] *********************************************************************************************
ok: [rocky9-lab-node2 -&gt; localhost]
ok: [rocky9-lab-node3 -&gt; localhost]
ok: [rocky9-lab-node1 -&gt; localhost]

TASK [Copy completed template to kubespray inventory folder] *****************************************************************************************
changed: [rocky9-lab-node1 -&gt; localhost]
changed: [rocky9-lab-node2 -&gt; localhost]
changed: [rocky9-lab-node3 -&gt; localhost]

TASK [Disable swap] **********************************************************************************************************************************
changed: [rocky9-lab-node1]
changed: [rocky9-lab-node3]
changed: [rocky9-lab-node2]

TASK [Remove swap entry <span class="hljs-keyword">from</span> /etc/fstab] *************************************************************************************************************
changed: [rocky9-lab-node2]
changed: [rocky9-lab-node1]
changed: [rocky9-lab-node3]

TASK [Load necessary kernel modules] *****************************************************************************************************************
changed: [rocky9-lab-node1] =&gt; (item=br_netfilter)
changed: [rocky9-lab-node3] =&gt; (item=br_netfilter)
changed: [rocky9-lab-node2] =&gt; (item=br_netfilter)
changed: [rocky9-lab-node1] =&gt; (item=overlay)
changed: [rocky9-lab-node2] =&gt; (item=overlay)
changed: [rocky9-lab-node3] =&gt; (item=overlay)

TASK [Ensure kernel modules are loaded on boot] ******************************************************************************************************
changed: [rocky9-lab-node2]
changed: [rocky9-lab-node3]
changed: [rocky9-lab-node1]

TASK [Configure sysctl <span class="hljs-keyword">for</span> Kubernetes networking] ****************************************************************************************************
changed: [rocky9-lab-node1]
changed: [rocky9-lab-node2]
changed: [rocky9-lab-node3]

TASK [Apply sysctl settings] *************************************************************************************************************************
changed: [rocky9-lab-node2]
changed: [rocky9-lab-node1]
changed: [rocky9-lab-node3]

TASK [Configure firewall rules <span class="hljs-keyword">for</span> Kubernetes] *******************************************************************************************************
ok: [rocky9-lab-node1] =&gt; (item=ssh)
ok: [rocky9-lab-node2] =&gt; (item=ssh)
ok: [rocky9-lab-node3] =&gt; (item=ssh)
changed: [rocky9-lab-node3] =&gt; (item=http)
changed: [rocky9-lab-node1] =&gt; (item=http)
changed: [rocky9-lab-node2] =&gt; (item=http)
...output omitted <span class="hljs-keyword">for</span> brevity...
changed: [rocky9-lab-node2] =&gt; (item=etcd-server)
changed: [rocky9-lab-node3] =&gt; (item=kubelet-worker)
changed: [rocky9-lab-node3] =&gt; (item=etcd-server)

RUNNING HANDLER [Reload firewalld] *******************************************************************************************************************
changed: [rocky9-lab-node1]
changed: [rocky9-lab-node2]
changed: [rocky9-lab-node3]

RUNNING HANDLER [Reload sysctl] **********************************************************************************************************************
changed: [rocky9-lab-node2]
changed: [rocky9-lab-node1]
changed: [rocky9-lab-node3]

PLAY RECAP *******************************************************************************************************************************************
rocky9-lab-mgmt            : ok=<span class="hljs-number">1</span>    changed=<span class="hljs-number">0</span>    unreachable=<span class="hljs-number">0</span>    failed=<span class="hljs-number">0</span>    skipped=<span class="hljs-number">1</span>    rescued=<span class="hljs-number">0</span>    ignored=<span class="hljs-number">0</span>   
rocky9-lab-node1           : ok=<span class="hljs-number">16</span>   changed=<span class="hljs-number">13</span>   unreachable=<span class="hljs-number">0</span>    failed=<span class="hljs-number">0</span>    skipped=<span class="hljs-number">0</span>    rescued=<span class="hljs-number">0</span>    ignored=<span class="hljs-number">0</span>   
rocky9-lab-node2           : ok=<span class="hljs-number">13</span>   changed=<span class="hljs-number">12</span>   unreachable=<span class="hljs-number">0</span>    failed=<span class="hljs-number">0</span>    skipped=<span class="hljs-number">0</span>    rescued=<span class="hljs-number">0</span>    ignored=<span class="hljs-number">0</span>   
rocky9-lab-node3           : ok=<span class="hljs-number">13</span>   changed=<span class="hljs-number">12</span>   unreachable=<span class="hljs-number">0</span>    failed=<span class="hljs-number">0</span>    skipped=<span class="hljs-number">0</span>    rescued=<span class="hljs-number">0</span>    ignored=<span class="hljs-number">0</span>
</code></pre>
<p>There should be many changes for the K8s nodes and no failures.</p>
<p>The Kubernetes nodes should now be ready for the Kubernetes cluster build via Kubespray.</p>
<h3 id="heading-kubespray-1"><code>Kubespray</code></h3>
<p>The next step is executing the Kubespray cluster build playbook, which should now be straightforward. We will use the <code>k8s-hosts.yml</code> file generated by the Pre-Kubespray playbook as the inventory Kubespray requires. It is located in the <code>inventory</code> folder within the main Kubespray directory. You can see the contents of this file below -</p>
<pre><code class="lang-yaml"><span class="hljs-attr">all:</span>
  <span class="hljs-attr">hosts:</span>
    <span class="hljs-attr">rocky9-lab-node1:</span>
      <span class="hljs-attr">ansible_host:</span> <span class="hljs-number">172.16</span><span class="hljs-number">.99</span><span class="hljs-number">.25</span>
      <span class="hljs-attr">ip:</span> <span class="hljs-number">172.16</span><span class="hljs-number">.99</span><span class="hljs-number">.25</span>
      <span class="hljs-attr">access_ip:</span> <span class="hljs-number">172.16</span><span class="hljs-number">.99</span><span class="hljs-number">.25</span>
    <span class="hljs-attr">rocky9-lab-node2:</span>
      <span class="hljs-attr">ansible_host:</span> <span class="hljs-number">172.16</span><span class="hljs-number">.99</span><span class="hljs-number">.26</span>
      <span class="hljs-attr">ip:</span> <span class="hljs-number">172.16</span><span class="hljs-number">.99</span><span class="hljs-number">.26</span>
      <span class="hljs-attr">access_ip:</span> <span class="hljs-number">172.16</span><span class="hljs-number">.99</span><span class="hljs-number">.26</span>
    <span class="hljs-attr">rocky9-lab-node3:</span>
      <span class="hljs-attr">ansible_host:</span> <span class="hljs-number">172.16</span><span class="hljs-number">.99</span><span class="hljs-number">.27</span>
      <span class="hljs-attr">ip:</span> <span class="hljs-number">172.16</span><span class="hljs-number">.99</span><span class="hljs-number">.27</span>
      <span class="hljs-attr">access_ip:</span> <span class="hljs-number">172.16</span><span class="hljs-number">.99</span><span class="hljs-number">.27</span>

  <span class="hljs-attr">children:</span>
    <span class="hljs-attr">kube_control_plane:</span>
      <span class="hljs-attr">hosts:</span>
        <span class="hljs-attr">rocky9-lab-node1:</span>

    <span class="hljs-attr">kube_node:</span>
      <span class="hljs-attr">hosts:</span>
        <span class="hljs-attr">rocky9-lab-node1:</span>
        <span class="hljs-attr">rocky9-lab-node2:</span>
        <span class="hljs-attr">rocky9-lab-node3:</span>

    <span class="hljs-attr">etcd:</span>
      <span class="hljs-attr">hosts:</span>
        <span class="hljs-attr">rocky9-lab-node1:</span>

    <span class="hljs-attr">k8s_cluster:</span>
      <span class="hljs-attr">children:</span>
        <span class="hljs-attr">kube_control_plane:</span>
        <span class="hljs-attr">kube_node:</span>
    <span class="hljs-attr">calico_rr:</span>
      <span class="hljs-attr">hosts:</span> {}
</code></pre>
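<p>The inventory above describes a single control-plane/etcd lab layout. If you later wanted an HA control plane, the same file would simply grow extra hosts under <code>kube_control_plane</code> and <code>etcd</code> (etcd member counts should stay odd for quorum). A sketch of what that change might look like for this 3-node lab:</p>
<pre><code class="lang-yaml">  children:
    kube_control_plane:
      hosts:
        rocky9-lab-node1:
        rocky9-lab-node2:
        rocky9-lab-node3:

    # etcd needs an odd number of members to maintain quorum
    etcd:
      hosts:
        rocky9-lab-node1:
        rocky9-lab-node2:
        rocky9-lab-node3:
</code></pre>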
<p>Change into the main Kubespray directory and execute the playbook like below -</p>
<p><code>ansible-playbook -i inventory/k8s-hosts.yml --ask-pass --become --ask-become-pass cluster.yml</code></p>
<p><strong>NOTE:</strong> Kubespray/Kubernetes requires root access to run successfully, hence the <code>--become</code> flag. SSH/sudo passwords are again required.</p>
<p>Kubespray can take 15-20 minutes to finish. The output is vast, so I won't paste a full example here. A successful run should end with output like the below -</p>
<pre><code class="lang-typescript">PLAY RECAP ***********************************************************************************************************************************************
rocky9-lab-node1           : ok=<span class="hljs-number">649</span>  changed=<span class="hljs-number">88</span>   unreachable=<span class="hljs-number">0</span>    failed=<span class="hljs-number">0</span>    skipped=<span class="hljs-number">1090</span> rescued=<span class="hljs-number">0</span>    ignored=<span class="hljs-number">6</span>   
rocky9-lab-node2           : ok=<span class="hljs-number">415</span>  changed=<span class="hljs-number">36</span>   unreachable=<span class="hljs-number">0</span>    failed=<span class="hljs-number">0</span>    skipped=<span class="hljs-number">625</span>  rescued=<span class="hljs-number">0</span>    ignored=<span class="hljs-number">1</span>   
rocky9-lab-node3           : ok=<span class="hljs-number">416</span>  changed=<span class="hljs-number">37</span>   unreachable=<span class="hljs-number">0</span>    failed=<span class="hljs-number">0</span>    skipped=<span class="hljs-number">624</span>  rescued=<span class="hljs-number">0</span>    ignored=<span class="hljs-number">1</span>   

Saturday <span class="hljs-number">05</span> October <span class="hljs-number">2024</span>  <span class="hljs-number">13</span>:<span class="hljs-number">39</span>:<span class="hljs-number">17</span> <span class="hljs-number">-0400</span> (<span class="hljs-number">0</span>:<span class="hljs-number">00</span>:<span class="hljs-number">00.115</span>)       <span class="hljs-number">0</span>:<span class="hljs-number">07</span>:<span class="hljs-number">36.442</span> ****** 
=============================================================================== 
kubernetes/kubeadm : Join to cluster ------------------------------------------------------------------------------------------------------------- <span class="hljs-number">21.11</span>s
kubernetes/control-plane : Kubeadm | Initialize first control plane node ------------------------------------------------------------------------- <span class="hljs-number">20.15</span>s
download : Download_container | Download image <span class="hljs-keyword">if</span> required --------------------------------------------------------------------------------------- <span class="hljs-number">11.65</span>s
download : Download_container | Download image <span class="hljs-keyword">if</span> required --------------------------------------------------------------------------------------- <span class="hljs-number">10.34</span>s
container-engine/runc : Download_file | Download item --------------------------------------------------------------------------------------------- <span class="hljs-number">8.51</span>s
container-engine/containerd : Download_file | Download item --------------------------------------------------------------------------------------- <span class="hljs-number">8.25</span>s
container-engine/crictl : Download_file | Download item ------------------------------------------------------------------------------------------- <span class="hljs-number">8.19</span>s
container-engine/nerdctl : Download_file | Download item ------------------------------------------------------------------------------------------ <span class="hljs-number">8.16</span>s
download : Download_container | Download image <span class="hljs-keyword">if</span> required ---------------------------------------------------------------------------------------- <span class="hljs-number">7.65</span>s
etcd : Reload etcd -------------------------------------------------------------------------------------------------------------------------------- <span class="hljs-number">6.14</span>s
container-engine/crictl : Extract_file | Unpacking archive ---------------------------------------------------------------------------------------- <span class="hljs-number">6.08</span>s
container-engine/nerdctl : Extract_file | Unpacking archive --------------------------------------------------------------------------------------- <span class="hljs-number">5.62</span>s
download : Download_container | Download image <span class="hljs-keyword">if</span> required ---------------------------------------------------------------------------------------- <span class="hljs-number">5.23</span>s
etcd : Configure | Check <span class="hljs-keyword">if</span> etcd cluster is healthy ----------------------------------------------------------------------------------------------- <span class="hljs-number">5.23</span>s
kubernetes-apps/ansible : Kubernetes Apps | Lay Down CoreDNS templates ---------------------------------------------------------------------------- <span class="hljs-number">4.75</span>s
kubernetes-apps/ansible : Kubernetes Apps | Start Resources --------------------------------------------------------------------------------------- <span class="hljs-number">4.05</span>s
download : Download_container | Download image <span class="hljs-keyword">if</span> required ---------------------------------------------------------------------------------------- <span class="hljs-number">4.02</span>s
download : Download_container | Download image <span class="hljs-keyword">if</span> required ---------------------------------------------------------------------------------------- <span class="hljs-number">3.58</span>s
network_plugin/cni : CNI | Copy cni plugins ------------------------------------------------------------------------------------------------------- <span class="hljs-number">3.25</span>s
download : Download_file | Download item ---------------------------------------------------------------------------------------------------------- <span class="hljs-number">3.00</span>s
</code></pre>
<p>Kubespray execution can sometimes fail due to connectivity issues or similar problems, especially when pulling down multiple container images, which might time out. If this happens, simply re-run the playbook as described earlier. It will pick up where it left off, skipping the tasks that have already been successfully completed.</p>
<p>If you want to wipe out the Kubernetes cluster, Kubespray provides a reset playbook for that as well. It can be executed as shown in the following example:</p>
<p><code>ansible-playbook -i inventory/k8s-hosts.yml --ask-pass --become --ask-become-pass reset.yml</code></p>
<h3 id="heading-post-kubespray-setup-pbyml"><code>post-kubespray-setup-pb.yml</code></h3>
<p>After successfully creating a K8s cluster with Kubespray, the last piece required is configuring <code>kubectl</code> on the control-plane nodes. To do this, change back into the <code>kubespray-addons</code> directory. The Post-Kubespray playbook can then be executed as shown below:</p>
<p><code>ansible-playbook post-kubespray-setup-pb.yml -i inventory.yml --ask-pass --ask-become-pass</code></p>
<p>A successful execution should produce output similar to this:</p>
<pre><code class="lang-bash">(kubespray_env) [jeff<span class="hljs-meta">@rocky9</span>-lab-mgmt kubespray-addons]$ ansible-playbook post-kubespray-setup-pb.yml -i inventory.yml --ask-pass --ask-become-pass 
SSH password: 
BECOME password[defaults to SSH password]: 

PLAY [Setup kubectl on control plane nodes] **********************************************************************************************************

TASK [Download kubectl files (latest)] ***************************************************************************************************************
skipping: [rocky9-lab-node2]
skipping: [rocky9-lab-node3]
changed: [rocky9-lab-node1]

TASK [Copy kubernetes admin configuration] ***********************************************************************************************************
skipping: [rocky9-lab-node2]
skipping: [rocky9-lab-node3]
changed: [rocky9-lab-node1]

TASK [Remove existing .kube directory] ***************************************************************************************************************
skipping: [rocky9-lab-node2]
skipping: [rocky9-lab-node3]
ok: [rocky9-lab-node1]

TASK [Create fresh .kube directory] ******************************************************************************************************************
skipping: [rocky9-lab-node2]
skipping: [rocky9-lab-node3]
changed: [rocky9-lab-node1]

TASK [Move kubernetes admin configuration] ***********************************************************************************************************
skipping: [rocky9-lab-node2]
skipping: [rocky9-lab-node3]
changed: [rocky9-lab-node1]

TASK [Correct ownership <span class="hljs-keyword">of</span> .kube config] *************************************************************************************************************
skipping: [rocky9-lab-node2]
skipping: [rocky9-lab-node3]
changed: [rocky9-lab-node1]

PLAY RECAP *******************************************************************************************************************************************
rocky9-lab-node1           : ok=<span class="hljs-number">6</span>    changed=<span class="hljs-number">5</span>    unreachable=<span class="hljs-number">0</span>    failed=<span class="hljs-number">0</span>    skipped=<span class="hljs-number">0</span>    rescued=<span class="hljs-number">0</span>    ignored=<span class="hljs-number">0</span>   
rocky9-lab-node2           : ok=<span class="hljs-number">0</span>    changed=<span class="hljs-number">0</span>    unreachable=<span class="hljs-number">0</span>    failed=<span class="hljs-number">0</span>    skipped=<span class="hljs-number">6</span>    rescued=<span class="hljs-number">0</span>    ignored=<span class="hljs-number">0</span>   
rocky9-lab-node3           : ok=<span class="hljs-number">0</span>    changed=<span class="hljs-number">0</span>    unreachable=<span class="hljs-number">0</span>    failed=<span class="hljs-number">0</span>    skipped=<span class="hljs-number">6</span>    rescued=<span class="hljs-number">0</span>    ignored=<span class="hljs-number">0</span>
</code></pre>
<p><strong>NOTE:</strong> You should see ‘changed’ only for nodes designated as control-plane nodes.</p>
<h2 id="heading-closing-thoughts"><strong>Closing Thoughts</strong></h2>
<p>If all 3 playbooks ran successfully, CONGRATULATIONS, you should have a fully working Kubernetes cluster. To confirm this, log into any of the cluster control-plane nodes and run <code>kubectl get nodes</code>. You should see output similar to the following:</p>
<pre><code class="lang-bash">[jeff@rocky9-lab-node1 ~]$ kubectl get nodes
NAME               STATUS   ROLES           AGE   VERSION
rocky9-lab-node1   Ready    control-plane   39m   v1.30.4
rocky9-lab-node2   Ready    &lt;none&gt;          39m   v1.30.4
rocky9-lab-node3   Ready    &lt;none&gt;          39m   v1.30.4
</code></pre>
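<p>Beyond eyeballing the STATUS column, readiness can be checked programmatically. The shell sketch below parses sample text mirroring the output above; on a live cluster you would capture <code>kubectl get nodes --no-headers</code> instead:</p>

```bash
# Check that every node reports Ready. Sample text mirrors the output above;
# on a real cluster use: nodes_output=$(kubectl get nodes --no-headers)
nodes_output='rocky9-lab-node1   Ready    control-plane   39m   v1.30.4
rocky9-lab-node2   Ready    &lt;none&gt;          39m   v1.30.4
rocky9-lab-node3   Ready    &lt;none&gt;          39m   v1.30.4'

# Column 2 is STATUS; collect any node whose status is not "Ready".
not_ready=$(printf '%s\n' "$nodes_output" | awk '$2 != "Ready" { print $1 }')
if [ -z "$not_ready" ]; then
  echo "all nodes Ready"
else
  echo "not Ready: $not_ready"
fi
```

<p>The same check works unattended in a cron job or CI pipeline, where a non-empty <code>not_ready</code> list could trigger an alert instead of an echo.</p>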
<p>The Kubernetes cluster is fully set up, providing a solid foundation for what’s coming in the next post, and eventually Containerlab/Clabernetes. You can also use this cluster to dive deeper into the world of Kubernetes beyond what we’re covering here. Experiment, expand the cluster, tear it down, and rebuild it—become an expert if you wish. Hopefully, this post makes the entry into Kubernetes a bit easier for those starting out.</p>
<h2 id="heading-whats-next"><strong>What’s next?</strong></h2>
<p>I have at least two more pieces to add to this series:</p>
<ol>
<li><p><s>Building a Cluster (Part 1)</s></p>
</li>
<li><p>Adding Built-in Storage Cluster using MicroCeph (Part 2)</p>
</li>
<li><p>Setting up and exploring Containerlab/Clabernetes (Part 3)</p>
</li>
</ol>
<p>I also plan to add posts covering specific topology examples, integration with other tools, and network automation testing. These topics may either extend this series or become their own separate posts. There’s always a wealth of topics to explore and write about.</p>
<p>You can find the code that goes along with this post <a target="_blank" href="https://github.com/leothelyon17/kubespray-addons">here</a> (Github).</p>
<p>Thoughts, questions, and comments are appreciated. Please follow me here at Hashnode or connect with me on <a target="_blank" href="https://www.linkedin.com/in/jeffrey-m-lyon/">Linkedin</a>.</p>
<p>Thank you for reading fellow techies!</p>
]]></content:encoded></item><item><title><![CDATA[Unraid VM Snapshot Automation with Ansible: Part 2 - Restoring Snapshots]]></title><description><![CDATA[Intro
Hello again, and welcome to the second post in my Unraid snapshot automation series!
In my first post, we explored how to use Ansible to automate the creation of VM snapshots on Unraid, simplifying the backup process for home lab setups or even...]]></description><link>https://blog.nerdylyonsden.io/unraid-vm-snapshot-automation-with-ansible-part-2</link><guid isPermaLink="true">https://blog.nerdylyonsden.io/unraid-vm-snapshot-automation-with-ansible-part-2</guid><category><![CDATA[unraid]]></category><category><![CDATA[ansible]]></category><category><![CDATA[automation]]></category><category><![CDATA[snapshot]]></category><dc:creator><![CDATA[Jeffrey Lyon]]></dc:creator><pubDate>Mon, 16 Sep 2024 14:21:09 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1726412209635/fecf7a1e-0e6c-4bdd-b86b-af02374f79c2.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-intro"><strong>Intro</strong></h2>
<p>Hello again, and welcome to the second post in my Unraid snapshot automation series!</p>
<p>In my first post, we explored how to use Ansible to automate the creation of VM snapshots on Unraid, simplifying the backup process for home lab setups or even more advanced environments. Now, it's time to complete the picture by diving into <strong>snapshot restoration</strong>. In this post, I'll show you how to leverage those snapshots we created earlier to quickly and efficiently roll back VMs to a previous state.</p>
<p>Whether you're testing, troubleshooting, or simply maintaining a reliable baseline for your VMs, automated snapshot restoration will save you time and effort. Like before, this is designed with the home lab community in mind, but the process can easily be adapted for other Linux-based systems.</p>
<p>The first post can be found here:<br /><a target="_blank" href="https://thenerdylyonsden.hashnode.dev/unraid-vm-snapshot-automation-with-ansible-part-1">https://thenerdylyonsden.hashnode.dev/unraid-vm-snapshot-automation-with-ansible-part-1</a></p>
<p>Let’s get started!</p>
<h2 id="heading-scenario-and-requirements">Scenario and Requirements</h2>
<p>This section largely mirrors the previous post. I'll be using the snapshot files created earlier—both the remote <code>.img</code> and local <code>.tar</code> files. The setup remains the same: I'll use the Ubuntu Ansible host, the Unraid server for local snapshots, and the Synology DiskStation for remote storage. For local restores, the Unraid server will act as both the source and destination. No additional packages or configurations are required on any of the systems.</p>
<h2 id="heading-lets-automate"><strong>Let's Automate!</strong></h2>
<h3 id="heading-overview-and-setup"><strong>Overview and Setup</strong></h3>
<p>Let's review the playbook directory structure from the previous post. It looks like this:</p>
<pre><code class="lang-bash">
├── README.md
├── create-snapshot-pb.yml
├── defaults
│   └── inventory.yml
├── files
│   ├── backup-playbook-old.yml
│   └── snapshot-creation-unused.yml
├── handlers
├── meta
├── restore-from-local-tar-pb.yml
├── restore-from-snapshot-pb.yml
├── tasks
│   ├── shutdown-vm.yml
│   └── start-vm.yml
├── templates
├── tests
│   ├── debug-tests-pb.yml
│   └── simple-debugs.yml
└── vars
    ├── snapshot-creation-vars.yml
    └── snapshot-restore-vars.yml
</code></pre>
<p>Most of this was covered in the previous post. I will cover the new files here:</p>
<ul>
<li><p><code>vars/snapshot-restore-vars.yml</code> Similar to the create file, this file is where users specify the list of VMs and their corresponding disks for snapshot restoration. It primarily consists of a dictionary outlining the VMs and the disks to be restored. Additionally, it includes variables for configuring the connection to the destination NAS device.</p>
</li>
<li><p><code>restore-from-snapshot-pb.yml</code> This playbook manages the restoration process from the remote snapshot repository and is composed of three plays. The first play serves two functions: it verifies the targeted Unraid VMs and disks, and builds additional data structures along with dynamic host groups. The second play locates the correct snapshots, transfers them to the Unraid server, and handles file comparison, VM shutdown, and replacing the original disk with the snapshot. The third play restarts the VMs once all other tasks are completed.</p>
</li>
<li><p><code>restore-from-local-tar-pb.yml</code> Largely the same as above, but it performs everything locally on the Unraid server, using <code>.tar</code> files instead of remote snapshots.</p>
</li>
</ul>
<h3 id="heading-inventory-defaultsinventoryyml">Inventory - <code>defaults/inventory.yml</code></h3>
<p>Covered in Part 1. Shown here again for reference:</p>
<pre><code class="lang-yaml"><span class="hljs-meta">---</span>
<span class="hljs-attr">nodes:</span>
  <span class="hljs-attr">hosts:</span>
    <span class="hljs-attr">diskstation:</span>
      <span class="hljs-attr">ansible_host:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ lookup('env', 'DISKSTATION_IP_ADDRESS') }}</span>"</span>
      <span class="hljs-attr">ansible_user:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ lookup('env', 'DISKSTATION_USER') }}</span>"</span>
      <span class="hljs-attr">ansible_password:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ lookup('env', 'DISKSTATION_PASS') }}</span>"</span>
    <span class="hljs-attr">unraid:</span>
      <span class="hljs-attr">ansible_host:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ lookup('env', 'UNRAID_IP_ADDRESS') }}</span>"</span>
      <span class="hljs-attr">ansible_user:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ lookup('env', 'UNRAID_USER') }}</span>"</span>
      <span class="hljs-attr">ansible_password:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ lookup('env', 'UNRAID_PASS') }}</span>"</span>
</code></pre>
<p>Defines the connection variables for the unraid and diskstation hosts.</p>
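<p>Because the inventory pulls everything from environment variables, those variables must be exported in your shell before running any of the playbooks. The values below are placeholders, not real hosts or credentials:</p>

```bash
# Environment variables consumed by the lookup('env', ...) calls above.
# Placeholder values -- substitute your own addresses and credentials.
export DISKSTATION_IP_ADDRESS="192.168.1.50"
export DISKSTATION_USER="backup-user"
export DISKSTATION_PASS="changeme"
export UNRAID_IP_ADDRESS="192.168.1.10"
export UNRAID_USER="root"
export UNRAID_PASS="changeme"
```

<p>Keeping these in a sourced env file outside version control keeps credentials out of the inventory and playbooks themselves.</p>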
<h3 id="heading-variables-varssnapshot-restore-varsyml">Variables - <code>vars/snapshot-restore-vars.yml</code></h3>
<p>Much like the snapshot creation automation, this playbook relies on a single variable file that serves as the primary point of interaction for the user. In this file, you’ll list the VMs, specify the disks to be restored for each VM, provide the path to the existing disk <code>.img</code> file, and indicate the snapshot you wish to restore from. If a snapshot name is not specified, the playbook will automatically search for and restore from the most recent snapshot associated with the disk.</p>
<pre><code class="lang-yaml"><span class="hljs-meta">---</span>
<span class="hljs-attr">snapshot_repository_base_directory:</span> <span class="hljs-string">volume1/Home\</span> <span class="hljs-string">Media/Backup</span>
<span class="hljs-attr">repository_user:</span> <span class="hljs-string">unraid</span>

<span class="hljs-attr">snapshot_restore_list:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">vm_name:</span> <span class="hljs-string">Rocky9-TESTNode</span>
    <span class="hljs-attr">disks_to_restore:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">vm_disk_to_restore:</span> <span class="hljs-string">vdisk1.img</span>
        <span class="hljs-attr">vm_disk_directory:</span> <span class="hljs-string">/mnt/cache/domains</span>
        <span class="hljs-attr">snapshot_to_restore_from:</span> <span class="hljs-string">test-snapshot</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">vm_disk_to_restore:</span> <span class="hljs-string">vdisk2.img</span>
        <span class="hljs-attr">vm_disk_directory:</span> <span class="hljs-string">/mnt/disk1/domains</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">vm_name:</span> <span class="hljs-string">Rocky9-LabNode3</span>
    <span class="hljs-attr">disks_to_restore:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">vm_disk_to_restore:</span> <span class="hljs-string">vdisk1.img</span>
        <span class="hljs-attr">vm_disk_directory:</span> <span class="hljs-string">/mnt/nvme_cache/domains</span>
        <span class="hljs-attr">snapshot_to_restore_from:</span> <span class="hljs-string">kubernetes-baseline</span>
</code></pre>
<p>Let's examine this file. It's similar to the one used for creation, though I used slightly more descriptive key names this time:</p>
<ul>
<li><p><code>snapshot_restore_list</code> - the main data structure for defining your list of VMs and disks. Within this there are two main variables: <code>vm_name</code> and <code>disks_to_restore</code>.</p>
</li>
<li><p><code>vm_name</code> - defines the name of your VM. It must match the name of the VM as configured within the Unraid system itself.</p>
</li>
<li><p><code>disks_to_restore</code> - a per VM list consisting of the disks that will be restored. This list requires two variables—<code>vm_disk_to_restore</code> and <code>vm_disk_directory</code>, with <code>snapshot_to_restore_from</code> as an ‘optional’ third variable.</p>
</li>
<li><p><code>vm_disk_to_restore</code> - contains the existing <code>.img</code> file name for that VM disk, e.g. <code>vdisk1.img</code></p>
</li>
<li><p><code>vm_disk_directory</code> - contains the absolute root directory path where the per-VM files are stored. An example of a full path to an <code>.img</code> file within Unraid would be: <code>/mnt/cache/domains/Rocky9-TESTNode/vdisk1.img</code></p>
</li>
<li><p><code>snapshot_to_restore_from</code> - is an optional attribute that allows the user to specify the name of the snapshot for restoration. If this attribute is not provided, the playbook will automatically search for and use the latest snapshot that matches the disk.</p>
</li>
<li><p><code>snapshot_repository_base_directory</code> and <code>repository_user</code> are used within the playbook's rsync task. These variables offer flexibility, allowing the user to specify their own remote user and target destination for the rsync operation. They are used only if the snapshots were sent to a remote location upon creation.</p>
</li>
</ul>
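<p>For reference, snapshot files in the repository are named after the disk plus a sanitized snapshot name: dashes become underscores and other special characters are stripped. A shell sketch of that naming (illustrative only, mirroring the Jinja2 logic the restore playbook uses), with the example values from the vars file above:</p>

```bash
# Rebuild the stored snapshot filename from a disk name and snapshot name.
# Illustrative shell version of the playbook's disk_name[:-4] + regex_replace logic.
disk="vdisk1.img"
snap="test-snapshot"

base="${disk%.img}"                                              # strip the .img suffix
clean=$(printf '%s' "$snap" | tr '-' '_' | tr -cd '[:alnum:]_')  # sanitize snapshot name
echo "${base}.${clean}.img"                                      # vdisk1.test_snapshot.img
```

<p>Knowing this pattern makes it easy to browse the repository by hand and confirm which snapshot a given <code>snapshot_to_restore_from</code> value will match.</p>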
<p>Following the provided example, you can define your VMs, disk names, locations, and restoration snapshot names when running the playbook.</p>
<h3 id="heading-playbooks"><strong>Playbooks</strong></h3>
<p>Two distinct playbooks were created to manage disk restoration. The <code>restore-from-snapshot-pb.yml</code> playbook handles restoration from the remote repository (DiskStation) using <code>rsync</code>. Meanwhile, local restoration is managed by <code>restore-from-local-tar-pb.yml</code>. Combining these processes proved to be too complex and unwieldy, so it was simpler and more manageable to build, test, and understand them separately.</p>
<p><strong>NOTE:</strong> Snapshot restoration is much trickier to automate than creation. There are a lot more tasks/conditionals related to error handling in these playbooks.</p>
<h3 id="heading-restore-from-snapshot-pbyml"><code>restore-from-snapshot-pb.yml</code></h3>
<p><strong>Restore Snapshot Preparation Play</strong></p>
<pre><code class="lang-yaml"><span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Restore</span> <span class="hljs-string">Snapshot</span> <span class="hljs-string">Preparation</span>
  <span class="hljs-attr">hosts:</span> <span class="hljs-string">unraid</span>
  <span class="hljs-attr">gather_facts:</span> <span class="hljs-literal">no</span>
  <span class="hljs-attr">vars_files:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">./vars/snapshot-restore-vars.yml</span>

  <span class="hljs-attr">tasks:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Retrieve</span> <span class="hljs-string">List</span> <span class="hljs-string">of</span> <span class="hljs-string">All</span> <span class="hljs-string">Existing</span> <span class="hljs-string">VMs</span> <span class="hljs-string">on</span> <span class="hljs-string">UnRAID</span> <span class="hljs-string">Hypervisor</span>
      <span class="hljs-attr">shell:</span> <span class="hljs-string">virsh</span> <span class="hljs-string">list</span> <span class="hljs-string">--all</span> <span class="hljs-string">|</span> <span class="hljs-string">tail</span> <span class="hljs-string">-n</span> <span class="hljs-string">+3</span> <span class="hljs-string">|</span> <span class="hljs-string">awk</span> <span class="hljs-string">'{ print $2}'</span>
      <span class="hljs-attr">register:</span> <span class="hljs-string">hypervisor_existing_vm_list</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Generate</span> <span class="hljs-string">VM</span> <span class="hljs-string">and</span> <span class="hljs-string">Disk</span> <span class="hljs-string">Lists</span> <span class="hljs-string">for</span> <span class="hljs-string">Validated</span> <span class="hljs-string">VMs</span> <span class="hljs-string">in</span> <span class="hljs-string">User</span> <span class="hljs-string">Inputted</span> <span class="hljs-string">Data</span>
      <span class="hljs-attr">set_fact:</span> 
        <span class="hljs-attr">vms_map:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ snapshot_restore_list | map(attribute='vm_name') }}</span>"</span>
        <span class="hljs-attr">disks_map:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ snapshot_restore_list | map(attribute='disks_to_restore') }}</span>"</span>
      <span class="hljs-attr">when:</span> <span class="hljs-string">item.vm_name</span> <span class="hljs-string">in</span> <span class="hljs-string">hypervisor_existing_vm_list.stdout_lines</span>
      <span class="hljs-attr">with_items:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ snapshot_restore_list }}</span>"</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Build</span> <span class="hljs-string">Data</span> <span class="hljs-string">Structure</span> <span class="hljs-string">for</span> <span class="hljs-string">Snapshot</span> <span class="hljs-string">Restoration</span>
      <span class="hljs-attr">set_fact:</span> 
        <span class="hljs-attr">snapshot_data_map:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ dict(vms_map | zip(disks_map)) | dict2items(key_name='vm_name', value_name='disks_to_restore') | subelements('disks_to_restore') }}</span>"</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Verify</span> <span class="hljs-string">Snapshot</span> <span class="hljs-string">Data</span> <span class="hljs-string">is</span> <span class="hljs-string">Available</span> <span class="hljs-string">for</span> <span class="hljs-string">Restoration</span>
      <span class="hljs-attr">assert:</span>
        <span class="hljs-attr">that:</span>
          <span class="hljs-bullet">-</span> <span class="hljs-string">snapshot_data_map</span>
        <span class="hljs-attr">fail_msg:</span> <span class="hljs-string">"Restore operation failed. Not enough data to proceed."</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Dynamically</span> <span class="hljs-string">Create</span> <span class="hljs-string">Host</span> <span class="hljs-string">Group</span> <span class="hljs-string">for</span> <span class="hljs-string">Disks</span> <span class="hljs-string">to</span> <span class="hljs-string">be</span> <span class="hljs-string">Restored</span>
      <span class="hljs-attr">ansible.builtin.add_host:</span>
        <span class="hljs-attr">name:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ item[0]['vm_name'] }}</span>-<span class="hljs-template-variable">{{ item[1]['vm_disk_to_restore'][:-4] }}</span>"</span>
        <span class="hljs-attr">groups:</span> <span class="hljs-string">disks</span>
        <span class="hljs-attr">vm_name:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ item[0]['vm_name'] }}</span>"</span>
        <span class="hljs-attr">disk_name:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ item[1]['vm_disk_to_restore'] }}</span>"</span>
        <span class="hljs-attr">source_directory:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ item[1]['vm_disk_directory'] }}</span>"</span>
        <span class="hljs-attr">snapshot_to_restore_from:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ item[1]['snapshot_to_restore_from'] | default('latest') }}</span>"</span>
      <span class="hljs-attr">loop:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ snapshot_data_map }}</span>"</span>
</code></pre>
<p><strong>Purpose</strong>:<br />Designed to prepare for the restoration of VM snapshots on an Unraid hypervisor. It gathers information about existing VMs, validates user input, structures the data for restoration, and dynamically creates host groups for managing the restore process.</p>
<p><strong>Hosts</strong>:<br />Targets the <code>unraid</code> host.</p>
<p><strong>Variables File</strong>:<br />Loads additional variables from <code>./vars/snapshot-restore-vars.yml</code>, primarily the user's modified <code>snapshot_restore_list</code>.</p>
<p><strong>Tasks</strong>:</p>
<ol>
<li><p><strong>Retrieve List of All Existing VMs on Unraid Hypervisor</strong>:<br /> Executes a shell command to list all VMs on the Unraid hypervisor and registers the result. It extracts VM names using <code>virsh</code> and formats the output for further use.</p>
</li>
<li><p><strong>Generate VM and Disk Lists for Validated VMs in User Inputted Data:</strong><br /> Constructs lists of VM names and disks to restore from the user input data, but only includes those VMs that exist on the hypervisor. It runs only if the user-supplied variable data matches at least one existing VM name; otherwise the playbook fails due to lack of data.</p>
<ul>
<li><p><code>vms_map</code>: List of VM names.</p>
</li>
<li><p><code>disks_map</code>: List of disks to restore.</p>
</li>
</ul>
</li>
</ol>
<p>    These lists are then used to create the larger <code>snapshot_data_map</code>.</p>
<ol start="3">
<li><p><strong>Build Data Structure for Snapshot Restoration:</strong><br /> Creates a nested data structure that maps each VM to its corresponding disks to restore, preparing it for subsequent tasks.</p>
<ul>
<li><code>snapshot_data_map</code>: Merges the VM and disk maps into a more structured data format, making it easier to access and manage the VM/disk information programmatically. My goal was to keep the inventory files simple for users to understand and modify. However, this approach didn’t work well with the looping logic I needed, so I created this new data map for better flexibility and control.</li>
</ul>
</li>
<li><p><strong>Verify Snapshot Data is Available for Restoration:</strong></p>
<p> Checks that <code>snapshot_data_map</code> has been populated correctly and ensures that there is enough data to proceed with the restoration. If not, it triggers a failure message to indicate insufficient data and halts the playbook.</p>
</li>
<li><p><strong>Dynamically Create Host Group for Disks to be Restored:</strong></p>
<p> Creates dynamic host entries for each disk that needs to be restored. Each host is added to the <code>disks</code> group with relevant information about the VM, disk, and optional snapshot name.</p>
</li>
</ol>
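<p>The <code>virsh</code> pipeline in the first task can be illustrated on representative output: <code>tail -n +3</code> skips the two header rows, and <code>awk</code> prints the second column (the VM names). The sample output below is illustrative, not captured from a live hypervisor:</p>

```bash
# Representative `virsh list --all` output (illustrative only).
virsh_output=' Id   Name              State
-----------------------------------
 1    Rocky9-TESTNode   running
 -    Rocky9-LabNode3   shut off'

# Same pipeline as the playbook task: drop header rows, keep the Name column.
printf '%s\n' "$virsh_output" | tail -n +3 | awk '{ print $2 }'
```

<p>The resulting names are what the user-supplied <code>vm_name</code> values are validated against in the second task.</p>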
<p><strong>Disk Restore From Snapshot Play</strong></p>
<pre><code class="lang-yaml"><span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Disk</span> <span class="hljs-string">Restore</span> <span class="hljs-string">From</span> <span class="hljs-string">Snapshot</span>
  <span class="hljs-attr">hosts:</span> <span class="hljs-string">disks</span>
  <span class="hljs-attr">gather_facts:</span> <span class="hljs-literal">no</span>
  <span class="hljs-attr">vars_files:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">./vars/snapshot-restore-vars.yml</span>

  <span class="hljs-attr">tasks:</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Find</span> <span class="hljs-string">files</span> <span class="hljs-string">in</span> <span class="hljs-string">the</span> <span class="hljs-string">VM</span> <span class="hljs-string">folder</span> <span class="hljs-string">containing</span> <span class="hljs-string">the</span> <span class="hljs-string">target</span> <span class="hljs-string">VM</span> <span class="hljs-string">disk</span> <span class="hljs-string">name</span>
      <span class="hljs-attr">find:</span>
        <span class="hljs-attr">paths:</span> <span class="hljs-string">"/<span class="hljs-template-variable">{{ snapshot_repository_base_directory | regex_replace('\\\\', '')}}</span>/<span class="hljs-template-variable">{{ vm_name }}</span>/"</span>
        <span class="hljs-attr">patterns:</span> <span class="hljs-string">"*<span class="hljs-template-variable">{{ disk_name[:-4] }}</span>*"</span>
        <span class="hljs-attr">recurse:</span> <span class="hljs-literal">yes</span>
      <span class="hljs-attr">register:</span> <span class="hljs-string">found_files</span> 
      <span class="hljs-attr">delegate_to:</span> <span class="hljs-string">diskstation</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Ensure</span> <span class="hljs-string">that</span> <span class="hljs-string">files</span> <span class="hljs-string">were</span> <span class="hljs-string">found</span>
      <span class="hljs-attr">assert:</span>
        <span class="hljs-attr">that:</span>
          <span class="hljs-bullet">-</span> <span class="hljs-string">found_files.matched</span> <span class="hljs-string">&gt;</span> <span class="hljs-number">0</span>
        <span class="hljs-attr">fail_msg:</span> <span class="hljs-string">"No files found matching disk <span class="hljs-template-variable">{{ disk_name[:-4] }}</span> for VM <span class="hljs-template-variable">{{ vm_name }}</span>."</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Create</span> <span class="hljs-string">a</span> <span class="hljs-string">file</span> <span class="hljs-string">list</span> <span class="hljs-string">from</span> <span class="hljs-string">the</span> <span class="hljs-string">target</span> <span class="hljs-string">VM</span> <span class="hljs-string">folder</span> <span class="hljs-string">with</span> <span class="hljs-string">only</span> <span class="hljs-string">file</span> <span class="hljs-string">names</span>
      <span class="hljs-attr">set_fact:</span> 
        <span class="hljs-attr">file_list:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ found_files.files | map(attribute='path') | map('regex_replace','^.*/(.*)$','\\1') | list }}</span>"</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Stitch</span> <span class="hljs-string">together</span> <span class="hljs-string">full</span> <span class="hljs-string">snapshot</span> <span class="hljs-string">name.</span> <span class="hljs-string">Replace</span> <span class="hljs-string">dashes</span> <span class="hljs-string">and</span> <span class="hljs-string">remove</span> <span class="hljs-string">special</span> <span class="hljs-string">characters</span>
      <span class="hljs-attr">set_fact:</span> 
        <span class="hljs-attr">full_snapshot_name:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ disk_name[:-4] }}</span>.<span class="hljs-template-variable">{{ snapshot_to_restore_from | regex_replace('\\-', '_') | regex_replace('\\W', '') }}</span>.img"</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Find</span> <span class="hljs-string">and</span> <span class="hljs-string">set</span> <span class="hljs-string">correct</span> <span class="hljs-string">snapshot</span> <span class="hljs-string">if</span> <span class="hljs-string">file</span> <span class="hljs-string">found</span> <span class="hljs-string">in</span> <span class="hljs-string">snapshot</span> <span class="hljs-string">folder</span>
      <span class="hljs-attr">set_fact:</span> 
        <span class="hljs-attr">found_snapshot:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ full_snapshot_name }}</span>"</span>
      <span class="hljs-attr">when:</span> <span class="hljs-string">full_snapshot_name</span> <span class="hljs-string">in</span> <span class="hljs-string">file_list</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Find</span> <span class="hljs-string">and</span> <span class="hljs-string">set</span> <span class="hljs-string">snapshot</span> <span class="hljs-string">to</span> <span class="hljs-string">latest</span> <span class="hljs-string">if</span> <span class="hljs-string">undefined</span> <span class="hljs-string">or</span> <span class="hljs-string">error</span> <span class="hljs-string">handle</span> <span class="hljs-string">block</span>
      <span class="hljs-attr">block:</span>

        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Sort</span> <span class="hljs-string">found</span> <span class="hljs-string">files</span> <span class="hljs-string">by</span> <span class="hljs-string">modification</span> <span class="hljs-string">time</span> <span class="hljs-string">(newest</span> <span class="hljs-string">first)</span> <span class="hljs-bullet">-</span> <span class="hljs-string">LATEST</span> <span class="hljs-string">Block</span>
          <span class="hljs-attr">set_fact:</span>
            <span class="hljs-attr">sorted_files:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ found_files.files | sort(attribute='mtime', reverse=True) | map(attribute='path') | map('regex_replace','^.*/(.*)$','\\1') | list  }}</span>"</span>

        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Find</span> <span class="hljs-string">and</span> <span class="hljs-string">set</span> <span class="hljs-string">correct</span> <span class="hljs-string">snapshot</span> <span class="hljs-string">for</span> <span class="hljs-string">newest</span> <span class="hljs-string">found</span> <span class="hljs-string">.img</span> <span class="hljs-string">file</span> <span class="hljs-bullet">-</span> <span class="hljs-string">LATEST</span> <span class="hljs-string">Block</span>
          <span class="hljs-attr">set_fact:</span> 
            <span class="hljs-attr">found_snapshot:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ sorted_files | first }}</span>"</span>

      <span class="hljs-attr">when:</span> <span class="hljs-string">found_snapshot</span> <span class="hljs-string">is</span> <span class="hljs-string">undefined</span> <span class="hljs-string">or</span> <span class="hljs-string">found_snapshot</span> <span class="hljs-string">==</span> <span class="hljs-string">None</span>  

    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Ensure</span> <span class="hljs-string">that</span> <span class="hljs-string">the</span> <span class="hljs-string">desired</span> <span class="hljs-string">snapshot</span> <span class="hljs-string">file</span> <span class="hljs-string">was</span> <span class="hljs-string">found</span>
      <span class="hljs-attr">assert:</span>
        <span class="hljs-attr">that:</span>
          <span class="hljs-bullet">-</span> <span class="hljs-string">found_snapshot</span> <span class="hljs-string">is</span> <span class="hljs-string">defined</span> <span class="hljs-string">and</span> <span class="hljs-string">found_snapshot</span> <span class="hljs-type">!=</span> <span class="hljs-string">None</span>
        <span class="hljs-attr">fail_msg:</span> <span class="hljs-string">"The snapshot to restore was not found. May not exist or user date was entered incorrectly."</span>
        <span class="hljs-attr">success_msg:</span> <span class="hljs-string">"Snapshot found! Will begin restore process NOW."</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Transfer</span> <span class="hljs-string">snapshots</span> <span class="hljs-string">to</span> <span class="hljs-string">VM</span> <span class="hljs-string">hypervisor</span> <span class="hljs-string">server</span> <span class="hljs-string">via</span> <span class="hljs-string">rsync</span>
      <span class="hljs-attr">command:</span> <span class="hljs-string">rsync</span> {{ <span class="hljs-string">repository_user</span> }}<span class="hljs-string">@{{</span> <span class="hljs-string">hostvars['diskstation']['ansible_host']</span> <span class="hljs-string">}}:/{{</span> <span class="hljs-string">snapshot_repository_base_directory</span> <span class="hljs-string">}}/{{</span> <span class="hljs-string">vm_name</span> <span class="hljs-string">}}/{{</span> <span class="hljs-string">found_snapshot</span> <span class="hljs-string">}}</span> {{ <span class="hljs-string">found_snapshot</span> }}
      <span class="hljs-attr">args:</span>
        <span class="hljs-attr">chdir:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ source_directory }}</span>/<span class="hljs-template-variable">{{ vm_name }}</span>"</span>
      <span class="hljs-attr">delegate_to:</span> <span class="hljs-string">unraid</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Get</span> <span class="hljs-string">attributes</span> <span class="hljs-string">of</span> <span class="hljs-string">original</span> <span class="hljs-string">stored</span> <span class="hljs-string">snapshot</span> <span class="hljs-string">.img</span> <span class="hljs-string">file</span>
      <span class="hljs-attr">stat:</span>
        <span class="hljs-attr">path:</span> <span class="hljs-string">"/<span class="hljs-template-variable">{{ snapshot_repository_base_directory | regex_replace('\\\\', '')}}</span>/<span class="hljs-template-variable">{{ vm_name }}</span>/<span class="hljs-template-variable">{{ found_snapshot }}</span>"</span>
        <span class="hljs-attr">get_checksum:</span> <span class="hljs-literal">false</span>
      <span class="hljs-attr">register:</span> <span class="hljs-string">file1</span>
      <span class="hljs-attr">delegate_to:</span> <span class="hljs-string">diskstation</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Get</span> <span class="hljs-string">attributes</span> <span class="hljs-string">of</span> <span class="hljs-string">newly</span> <span class="hljs-string">transferred</span> <span class="hljs-string">snapshot</span> <span class="hljs-string">.img</span> <span class="hljs-string">file</span>
      <span class="hljs-attr">stat:</span>
        <span class="hljs-attr">path:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ source_directory }}</span>/<span class="hljs-template-variable">{{ vm_name }}</span>/<span class="hljs-template-variable">{{ found_snapshot }}</span>"</span>
        <span class="hljs-attr">get_checksum:</span> <span class="hljs-literal">false</span>
      <span class="hljs-attr">register:</span> <span class="hljs-string">file2</span>
      <span class="hljs-attr">delegate_to:</span> <span class="hljs-string">unraid</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Ensure</span> <span class="hljs-string">original</span> <span class="hljs-string">and</span> <span class="hljs-string">transferred</span> <span class="hljs-string">file</span> <span class="hljs-string">sizes</span> <span class="hljs-string">are</span> <span class="hljs-string">the</span> <span class="hljs-string">same</span>
      <span class="hljs-attr">assert:</span>
        <span class="hljs-attr">that:</span>
          <span class="hljs-bullet">-</span> <span class="hljs-string">file1.stat.size</span> <span class="hljs-string">==</span> <span class="hljs-string">file2.stat.size</span>
        <span class="hljs-attr">fail_msg:</span> <span class="hljs-string">"Files failed size comparison post transfer. Aborting operation for <span class="hljs-template-variable">{{ inventory_hostname }}</span>"</span>
        <span class="hljs-attr">success_msg:</span> <span class="hljs-string">File</span> <span class="hljs-string">size</span> <span class="hljs-string">comparison</span> <span class="hljs-string">passed.</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Shutdown</span> <span class="hljs-string">VM(s)</span>
      <span class="hljs-attr">include_tasks:</span> <span class="hljs-string">./tasks/shutdown-vm.yml</span>
      <span class="hljs-attr">loop:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ hostvars['unraid']['vms_map'] }}</span>"</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Delete</span> {{ <span class="hljs-string">disk_name</span> }} <span class="hljs-string">for</span> <span class="hljs-string">VM</span> {{ <span class="hljs-string">vm_name</span> }}
      <span class="hljs-attr">ansible.builtin.file:</span>
        <span class="hljs-attr">path:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ source_directory }}</span>/<span class="hljs-template-variable">{{ vm_name }}</span>/<span class="hljs-template-variable">{{ disk_name }}</span>"</span>
        <span class="hljs-attr">state:</span> <span class="hljs-string">absent</span>
      <span class="hljs-attr">delegate_to:</span> <span class="hljs-string">unraid</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Rename</span> <span class="hljs-string">snapshot</span> <span class="hljs-string">to</span> <span class="hljs-string">proper</span> <span class="hljs-string">disk</span> <span class="hljs-string">name</span>
      <span class="hljs-attr">command:</span> <span class="hljs-string">mv</span> {{ <span class="hljs-string">found_snapshot</span> }} {{ <span class="hljs-string">disk_name</span> }}
      <span class="hljs-attr">args:</span>
        <span class="hljs-attr">chdir:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ source_directory }}</span>/<span class="hljs-template-variable">{{ vm_name }}</span>"</span>
      <span class="hljs-attr">delegate_to:</span> <span class="hljs-string">unraid</span>
</code></pre>
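<p>One small simplification worth noting: the final delete-and-rename pair in the play above could be collapsed into a single forced move, since <code>mv -f</code> overwrites its destination (and, on the same filesystem, a rename is atomic). A sketch using the same variables:</p>
<pre><code class="lang-yaml">    - name: Replace disk with restored snapshot in one step (alternative sketch)
      command: mv -f {{ found_snapshot }} {{ disk_name }}
      args:
        chdir: "{{ source_directory }}/{{ vm_name }}"
      delegate_to: unraid
</code></pre>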
<p><strong>Purpose</strong>:<br />This play facilitates the restoration of VM disk snapshots on an Unraid server. It searches for the required snapshot, validates the snapshot file, and transfers it back to the hypervisor for restoration, ensuring the integrity of the restored disk.</p>
<p><strong>Hosts</strong>:<br />Targets the <code>disks</code> host group.</p>
<p><strong>Variables File</strong>:<br />Loads additional variables from <code>./vars/snapshot-restore-vars.yml</code>, most importantly the user's modified <code>snapshot_restore_list</code>.</p>
<p><strong>Tasks</strong>:</p>
<ol>
<li><p><strong>Find Files in the VM Folder Containing the Target VM Disk Name:</strong><br /> Recursively searches the snapshot repository for files that match the target VM disk name (e.g., <code>vdisk1</code>, <code>vdisk2</code>) and stores the results in the <code>found_files</code> variable.</p>
</li>
<li><p><strong>Ensure That Files Were Found:</strong></p>
<p> Verifies that at least one file matching the disk name was found. If none were found, it produces a failure message and the playbook fails for that host disk.</p>
</li>
<li><p><strong>Create a File List From the Target VM Folder with only File Names:</strong></p>
<p> Extracts and stores only the file names from the <code>found_files</code> list.</p>
</li>
<li><p><strong>Stitch Together Full Snapshot Name:</strong></p>
<p> Constructs the full snapshot name by combining the disk name, the user-supplied snapshot name (if provided), and the <code>.img</code> extension. Dashes are replaced with underscores and any other special characters are removed.</p>
</li>
<li><p><strong>Find and Set the Correct Snapshot if File Found in Snapshot Folder:</strong></p>
<p> If the constructed snapshot name is found in the list of files, it sets <code>found_snapshot</code> to this name.</p>
</li>
<li><p><strong>Find and Set Snapshot to Latest if Undefined or Error Handling (Block):</strong></p>
<p> If no specific snapshot is found or defined, this block sorts the found files by modification time (newest first) and sets the snapshot to the latest available one.</p>
</li>
<li><p><strong>Ensure the Desired Snapshot File Was Found:</strong></p>
<p> Confirms that a snapshot was found and is ready for restoration. If not, it fails with an error message and the playbook aborts for that host disk.</p>
</li>
<li><p><strong>Transfer Snapshots to VM Hypervisor Server via rsync:</strong></p>
<p> Uses <code>rsync</code> to transfer the found snapshot from the remote DiskStation to the Unraid server, where the VM is located. Changes into the correct disk directory prior to transfer.</p>
</li>
<li><p><strong>Get Attributes of the Snapshot Files and Compare Size:</strong></p>
<p> The next three tasks retrieve the attributes of the original DiskStation snapshot and of the newly transferred copy on the Unraid server, then compare the two file sizes to confirm the transfer succeeded. The playbook fails for the host disk if the sizes are not equal.</p>
</li>
<li><p><strong>Shutdown VMs</strong>:</p>
<p>Shuts down the VMs in preparation for the restoration process by calling a separate task file (<code>/tasks/shutdown-vm.yml</code>). For more details on the shutdown tasks, refer to the previous post.</p>
</li>
<li><p><strong>Delete the Original Disk for the VM:</strong></p>
<p>Deletes the original disk file for the VM so that the snapshot file can be renamed to the correct disk name.</p>
</li>
<li><p><strong>Rename Snapshot to Proper Disk Name:</strong></p>
<p>Renames the restored snapshot file to match the original disk file name, completing the restoration process.</p>
</li>
</ol>
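<p>To make the name-stitching step above concrete, here is how the <code>full_snapshot_name</code> template evaluates for a sample input (values taken from the test run shown later in this post):</p>
<pre><code class="lang-yaml"># disk_name: vdisk2.img
#   disk_name[:-4]               ->  vdisk2
# snapshot_to_restore_from: test-snapshot
#   | regex_replace('\-', '_')   ->  test_snapshot
#   | regex_replace('\W', '')    ->  test_snapshot  (underscores are word characters, so they survive)
# full_snapshot_name: vdisk2.test_snapshot.img
</code></pre>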
<p><strong>Restart Affected VMs Play</strong></p>
<pre><code class="lang-yaml"><span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Restart</span> <span class="hljs-string">Affected</span> <span class="hljs-string">VMs</span>
  <span class="hljs-attr">hosts:</span> <span class="hljs-string">unraid</span>
  <span class="hljs-attr">gather_facts:</span> <span class="hljs-literal">no</span>
  <span class="hljs-attr">vars_files:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">./vars/snapshot-restore-vars.yml</span>

  <span class="hljs-attr">tasks:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Start</span> <span class="hljs-string">VM(s)</span> <span class="hljs-string">back</span> <span class="hljs-string">up</span>
      <span class="hljs-attr">include_tasks:</span> <span class="hljs-string">./tasks/start-vm.yml</span>
      <span class="hljs-attr">loop:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ snapshot_restore_list }}</span>"</span>
</code></pre>
<p><strong>Purpose</strong>:<br />This play’s only purpose is to start the targeted VMs after the restore process has completed for all disks.</p>
<p><strong>Hosts</strong>:<br />Targets the <code>unraid</code> host.</p>
<p><strong>Variables File</strong>:<br />Loads additional variables from <code>./vars/snapshot-restore-vars.yml</code>, most importantly the user's modified <code>snapshot_restore_list</code>.</p>
<p><strong>Tasks</strong>:</p>
<ol>
<li><strong>Start VM(s) Back Up:</strong><br /> Starts up the VMs once the restoration process has completed by calling a separate task file (<code>/tasks/start-vm.yml</code>). For more details on the startup tasks, refer to the previous post.</li>
</ol>
<p><strong>NOTE:</strong> This was intentionally made a separate play at the end of the playbook to ensure all disk restore operations are completed beforehand. By looping over the VMs using the <code>snapshot_restore_list</code> variable, only one start command per VM is sent, reducing the chance of errors.</p>
<h3 id="heading-restore-from-local-tar-pbyml"><code>restore-from-local-tar-pb.yml</code></h3>
<p>NOTE: This playbook is quite similar to the <code>restore-from-snapshot-pb.yml</code> playbook, but focuses on local restoration using the <code>.tar</code> files. All tasks are executed either on the Ansible host or the Unraid server. In this breakdown, I'll only highlight the key task differences from the previous playbook.</p>
<p><strong>Restore Snapshot Preparation Play</strong></p>
<p>Exactly the same as in the <code>restore-from-snapshot-pb.yml</code> playbook. Nothing new to cover here.</p>
<p><strong>Disk Restore From TAR file Play</strong></p>
<pre><code class="lang-yaml"><span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Find</span> <span class="hljs-string">files</span> <span class="hljs-string">in</span> <span class="hljs-string">the</span> <span class="hljs-string">VM</span> <span class="hljs-string">folder</span> <span class="hljs-string">containing</span> <span class="hljs-string">the</span> <span class="hljs-string">target</span> <span class="hljs-string">VM</span> <span class="hljs-string">disk</span> <span class="hljs-string">name</span>
      <span class="hljs-attr">find:</span>
        <span class="hljs-attr">paths:</span> <span class="hljs-string">"/<span class="hljs-template-variable">{{ source_directory | regex_replace('\\\\', '')}}</span>/<span class="hljs-template-variable">{{ vm_name }}</span>/"</span>
        <span class="hljs-attr">patterns:</span> <span class="hljs-string">"*<span class="hljs-template-variable">{{ disk_name[:-4] }}</span>*"</span>
        <span class="hljs-attr">recurse:</span> <span class="hljs-literal">yes</span>
      <span class="hljs-attr">register:</span> <span class="hljs-string">found_files</span> 
      <span class="hljs-attr">delegate_to:</span> <span class="hljs-string">unraid</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Filter</span> <span class="hljs-string">files</span> <span class="hljs-string">matching</span> <span class="hljs-string">patterns</span> <span class="hljs-string">for</span> <span class="hljs-string">.tar</span> <span class="hljs-string">files</span>
      <span class="hljs-attr">set_fact:</span>
        <span class="hljs-attr">matched_tar_files:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ found_files.files | selectattr('path', 'search', '.*\\.(tar)$') | list }}</span>"</span>
</code></pre>
<p><strong>Tasks</strong>:</p>
<ol>
<li><p><strong>Find Files in the VM Folder Containing the Target VM Disk Name:</strong></p>
<p> Similar to the other playbook. This task searches through the VM's directory to locate any files that match the target disk name, regardless of file type (e.g., .img, .tar).</p>
</li>
<li><p><strong>Filter Files Matching Patterns for .tar Files</strong></p>
<p> After locating files in the previous task, this task filters out only the <code>.tar</code> files from the list of found files. Uses <code>set_fact</code> to store list in variable <code>matched_tar_files</code>.</p>
</li>
</ol>
<p>Everything is the same until the unzip task (below).</p>
<pre><code class="lang-yaml"><span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Unzip</span> <span class="hljs-string">.tar</span> <span class="hljs-string">file</span>
      <span class="hljs-attr">command:</span> <span class="hljs-string">tar</span> <span class="hljs-string">-xf</span> {{ <span class="hljs-string">found_snapshot</span> }}
      <span class="hljs-attr">args:</span>
        <span class="hljs-attr">chdir:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ source_directory }}</span>/<span class="hljs-template-variable">{{ vm_name }}</span>"</span>
      <span class="hljs-attr">delegate_to:</span> <span class="hljs-string">unraid</span>
</code></pre>
<p>Pretty straightforward here. This just extracts the correct snapshot <code>.tar</code> file back to a usable <code>.img</code> file.</p>
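<p>As an aside, the same extraction could be done with the built-in <code>unarchive</code> module rather than shelling out to <code>tar</code>. A minimal sketch, assuming the same variables:</p>
<pre><code class="lang-yaml">    - name: Unzip .tar file (unarchive alternative sketch)
      ansible.builtin.unarchive:
        src: "{{ source_directory }}/{{ vm_name }}/{{ found_snapshot }}"
        dest: "{{ source_directory }}/{{ vm_name }}"
        remote_src: yes
      delegate_to: unraid
</code></pre>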
<p>The remaining tasks follow the same process as the <code>restore-from-snapshot-pb.yml</code> playbook. They gather the attributes of both the original and newly unzipped files, verify that their sizes match, shut down the required VMs, delete the original disk file, rename the snapshot to the appropriate disk name, and finally, restart the VMs.</p>
<h3 id="heading-restoring-in-action-running-the-playbook"><strong>Restoring in Action (Running the Playbook)</strong></h3>
<p>Like the create playbook in the previous post, these playbooks are very simple to run. Run them from the root of the playbook directory:</p>
<pre><code class="lang-yaml"><span class="hljs-string">ansible-playbook</span> <span class="hljs-string">restore-from-snapshot-pb.yml</span> <span class="hljs-string">-i</span> <span class="hljs-string">defaults/inventory.yml</span>
</code></pre>
<pre><code class="lang-yaml"><span class="hljs-string">ansible-playbook</span> <span class="hljs-string">restore-from-local-tar-pb.yml</span> <span class="hljs-string">-i</span> <span class="hljs-string">defaults/inventory.yml</span>
</code></pre>
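<p>If you want a quick sanity pass before touching any disks, <code>ansible-playbook</code> supports <code>--syntax-check</code>, which parses the playbook without executing anything (full <code>--check</code> mode is less useful here, since the <code>command</code>-based tasks are skipped in check mode):</p>
<pre><code class="lang-yaml">ansible-playbook restore-from-snapshot-pb.yml -i defaults/inventory.yml --syntax-check
</code></pre>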
<p>Below are the results of successful playbook runs, tested using a single 2GB disk for both local and remote restores. One run uses a static snapshot name, while the other demonstrates the process of finding the 'latest' snapshot when the name is not defined.</p>
<p><strong>Restore from snapshot w/ finding the latest (omitting python version warnings):</strong></p>
<pre><code class="lang-typescript">PLAY [Restore Snapshot Preparation] ******************************************************************************************************************

TASK [Retrieve List <span class="hljs-keyword">of</span> All Existing VMs on UnRAID Hypervisor] ****************************************************************************************
changed: [unraid]

TASK [Generate VM and Disk Lists <span class="hljs-keyword">for</span> Validated VMs <span class="hljs-keyword">in</span> User Inputted Data] ****************************************************************************
ok: [unraid] =&gt; (item={<span class="hljs-string">'vm_name'</span>: <span class="hljs-string">'Rocky9-TESTNode'</span>, <span class="hljs-string">'disks_to_restore'</span>: [{<span class="hljs-string">'vm_disk_to_restore'</span>: <span class="hljs-string">'vdisk2.img'</span>, <span class="hljs-string">'vm_disk_directory'</span>: <span class="hljs-string">'/mnt/disk1/domains'</span>}]})

TASK [Build Data Structure <span class="hljs-keyword">for</span> Snapshot Restoration] *************************************************************************************************
ok: [unraid]

TASK [Verify Snapshot Data is Available <span class="hljs-keyword">for</span> Restoration] *********************************************************************************************
ok: [unraid] =&gt; {
    <span class="hljs-string">"changed"</span>: <span class="hljs-literal">false</span>,
    <span class="hljs-string">"msg"</span>: <span class="hljs-string">"All assertions passed"</span>
}

TASK [Dynamically Create Host Group <span class="hljs-keyword">for</span> Disks to be Restored] ****************************************************************************************
changed: [unraid] =&gt; (item=[{<span class="hljs-string">'vm_name'</span>: <span class="hljs-string">'Rocky9-TESTNode'</span>, <span class="hljs-string">'disks_to_restore'</span>: [{<span class="hljs-string">'vm_disk_to_restore'</span>: <span class="hljs-string">'vdisk2.img'</span>, <span class="hljs-string">'vm_disk_directory'</span>: <span class="hljs-string">'/mnt/disk1/domains'</span>}]}, {<span class="hljs-string">'vm_disk_to_restore'</span>: <span class="hljs-string">'vdisk2.img'</span>, <span class="hljs-string">'vm_disk_directory'</span>: <span class="hljs-string">'/mnt/disk1/domains'</span>}])

PLAY [Disk Restore From Snapshot] ********************************************************************************************************************

TASK [Find files <span class="hljs-keyword">in</span> the VM folder containing the target VM disk name] ********************************************************************************
ok: [Rocky9-TESTNode-vdisk2 -&gt; diskstation({{ lookup(<span class="hljs-string">'env'</span>, <span class="hljs-string">'DISKSTATION_IP_ADDRESS'</span>) }})]

TASK [Ensure that files were found] ******************************************************************************************************************
ok: [Rocky9-TESTNode-vdisk2] =&gt; {
    <span class="hljs-string">"changed"</span>: <span class="hljs-literal">false</span>,
    <span class="hljs-string">"msg"</span>: <span class="hljs-string">"All assertions passed"</span>
}

TASK [Create a file list <span class="hljs-keyword">from</span> the target VM folder <span class="hljs-keyword">with</span> only file names] *****************************************************************************
ok: [Rocky9-TESTNode-vdisk2]

TASK [Stitch together full snapshot name. Replace dashes and remove special characters] **************************************************************
ok: [Rocky9-TESTNode-vdisk2]

TASK [Find and set correct snapshot <span class="hljs-keyword">if</span> file found <span class="hljs-keyword">in</span> snapshot folder] ********************************************************************************
skipping: [Rocky9-TESTNode-vdisk2]

TASK [Sort found files by modification time (newest first) - LATEST Block] ***************************************************************************
ok: [Rocky9-TESTNode-vdisk2]

TASK [Find and set correct snapshot <span class="hljs-keyword">for</span> newest found .img file - LATEST Block] ***********************************************************************
ok: [Rocky9-TESTNode-vdisk2]

TASK [Ensure that the desired snapshot file was found] ***********************************************************************************************
ok: [Rocky9-TESTNode-vdisk2] =&gt; {
    <span class="hljs-string">"changed"</span>: <span class="hljs-literal">false</span>,
    <span class="hljs-string">"msg"</span>: <span class="hljs-string">"Snapshot found! Will begin restore process NOW."</span>
}

TASK [Transfer snapshots to VM hypervisor server via rsync] ******************************************************************************************
changed: [Rocky9-TESTNode-vdisk2 -&gt; unraid({{ lookup(<span class="hljs-string">'env'</span>, <span class="hljs-string">'UNRAID_IP_ADDRESS'</span>) }})]

TASK [Get attributes <span class="hljs-keyword">of</span> original stored snapshot .img file] ******************************************************************************************
ok: [Rocky9-TESTNode-vdisk2 -&gt; diskstation({{ lookup(<span class="hljs-string">'env'</span>, <span class="hljs-string">'DISKSTATION_IP_ADDRESS'</span>) }})]

TASK [Get attributes <span class="hljs-keyword">of</span> newly transferred snapshot .img file] ****************************************************************************************
ok: [Rocky9-TESTNode-vdisk2 -&gt; unraid({{ lookup(<span class="hljs-string">'env'</span>, <span class="hljs-string">'UNRAID_IP_ADDRESS'</span>) }})]

TASK [Ensure original and transferred file sizes are the same] ****************************************************************************************
ok: [Rocky9-TESTNode-vdisk2] =&gt; {
    <span class="hljs-string">"changed"</span>: <span class="hljs-literal">false</span>,
    <span class="hljs-string">"msg"</span>: <span class="hljs-string">"File size comparison passed."</span>
}

TASK [Shutdown VM(s)] ********************************************************************************************************************************
included: <span class="hljs-regexp">/mnt/</span>c/Dev/Git/unraid-vm-snapshots/tasks/shutdown-vm.yml <span class="hljs-keyword">for</span> Rocky9-TESTNode-<span class="hljs-function"><span class="hljs-params">vdisk2</span> =&gt;</span> (item=Rocky9-TESTNode)

TASK [Shutdown VM - Rocky9-TESTNode] *****************************************************************************************************************
changed: [Rocky9-TESTNode-vdisk2 -&gt; unraid({{ lookup(<span class="hljs-string">'env'</span>, <span class="hljs-string">'UNRAID_IP_ADDRESS'</span>) }})]

TASK [Get VM status - Rocky9-TESTNode] ***************************************************************************************************************
FAILED - RETRYING: [Rocky9-TESTNode-vdisk2 -&gt; unraid]: Get VM status - Rocky9-TESTNode (<span class="hljs-number">5</span> retries left).
changed: [Rocky9-TESTNode-vdisk2 -&gt; unraid({{ lookup(<span class="hljs-string">'env'</span>, <span class="hljs-string">'UNRAID_IP_ADDRESS'</span>) }})]

TASK [Delete vdisk2.img <span class="hljs-keyword">for</span> VM Rocky9-TESTNode] ******************************************************************************************************
changed: [Rocky9-TESTNode-vdisk2 -&gt; unraid({{ lookup(<span class="hljs-string">'env'</span>, <span class="hljs-string">'UNRAID_IP_ADDRESS'</span>) }})]

TASK [Rename snapshot to proper disk name] ***********************************************************************************************************
changed: [Rocky9-TESTNode-vdisk2 -&gt; unraid({{ lookup(<span class="hljs-string">'env'</span>, <span class="hljs-string">'UNRAID_IP_ADDRESS'</span>) }})]

PLAY [Restart Affected VMs] **************************************************************************************************************************

TASK [Start VM(s) back up] ***************************************************************************************************************************
included: <span class="hljs-regexp">/mnt/</span>c/Dev/Git/unraid-vm-snapshots/tasks/start-vm.yml <span class="hljs-keyword">for</span> unraid =&gt; (item={<span class="hljs-string">'vm_name'</span>: <span class="hljs-string">'Rocky9-TESTNode'</span>, <span class="hljs-string">'disks_to_restore'</span>: [{<span class="hljs-string">'vm_disk_to_restore'</span>: <span class="hljs-string">'vdisk2.img'</span>, <span class="hljs-string">'vm_disk_directory'</span>: <span class="hljs-string">'/mnt/disk1/domains'</span>}]})

TASK [Start VM - Rocky9-TESTNode] ********************************************************************************************************************
changed: [unraid]

TASK [Get VM status - Rocky9-TESTNode] ***************************************************************************************************************
changed: [unraid]

TASK [Ensure VM <span class="hljs-string">'running'</span> status] ********************************************************************************************************************
ok: [unraid] =&gt; {
    <span class="hljs-string">"changed"</span>: <span class="hljs-literal">false</span>,
    <span class="hljs-string">"msg"</span>: <span class="hljs-string">"Rocky9-TESTNode has successfully started. Restore from snapshot complete."</span>
}

PLAY RECAP *******************************************************************************************************************************************
Rocky9-TESTNode-vdisk2     : ok=<span class="hljs-number">16</span>   changed=<span class="hljs-number">5</span>    unreachable=<span class="hljs-number">0</span>    failed=<span class="hljs-number">0</span>    skipped=<span class="hljs-number">1</span>    rescued=<span class="hljs-number">0</span>    ignored=<span class="hljs-number">0</span>   
unraid                     : ok=<span class="hljs-number">9</span>    changed=<span class="hljs-number">4</span>    unreachable=<span class="hljs-number">0</span>    failed=<span class="hljs-number">0</span>    skipped=<span class="hljs-number">0</span>    rescued=<span class="hljs-number">0</span>    ignored=<span class="hljs-number">0</span>
</code></pre>
<p><strong>Restore from local .tar using defined snapshot name (omitting python version warnings):</strong></p>
<pre><code class="lang-typescript">PLAY [Restore Snapshot Preparation] ******************************************************************************************************************

TASK [Retrieve List <span class="hljs-keyword">of</span> All Existing VMs on UnRAID Hypervisor] ****************************************************************************************
changed: [unraid]

TASK [Generate VM and Disk Lists <span class="hljs-keyword">for</span> Validated VMs <span class="hljs-keyword">in</span> User Inputted Data] ****************************************************************************
ok: [unraid] =&gt; (item={<span class="hljs-string">'vm_name'</span>: <span class="hljs-string">'Rocky9-TESTNode'</span>, <span class="hljs-string">'disks_to_restore'</span>: [{<span class="hljs-string">'vm_disk_to_restore'</span>: <span class="hljs-string">'vdisk2.img'</span>, <span class="hljs-string">'vm_disk_directory'</span>: <span class="hljs-string">'/mnt/disk1/domains'</span>, <span class="hljs-string">'snapshot_to_restore_from'</span>: <span class="hljs-string">'test-snapshot'</span>}]})

TASK [Build Data Structure <span class="hljs-keyword">for</span> Snapshot Restoration] *************************************************************************************************
ok: [unraid]

TASK [Verify Snapshot Data is Available <span class="hljs-keyword">for</span> Restoration] *********************************************************************************************
ok: [unraid] =&gt; {
    <span class="hljs-string">"changed"</span>: <span class="hljs-literal">false</span>,
    <span class="hljs-string">"msg"</span>: <span class="hljs-string">"All assertions passed"</span>
}

TASK [Dynamically Create Host Group <span class="hljs-keyword">for</span> Disks to be Restored] ****************************************************************************************
changed: [unraid] =&gt; (item=[{<span class="hljs-string">'vm_name'</span>: <span class="hljs-string">'Rocky9-TESTNode'</span>, <span class="hljs-string">'disks_to_restore'</span>: [{<span class="hljs-string">'vm_disk_to_restore'</span>: <span class="hljs-string">'vdisk2.img'</span>, <span class="hljs-string">'vm_disk_directory'</span>: <span class="hljs-string">'/mnt/disk1/domains'</span>, <span class="hljs-string">'snapshot_to_restore_from'</span>: <span class="hljs-string">'test-snapshot'</span>}]}, {<span class="hljs-string">'vm_disk_to_restore'</span>: <span class="hljs-string">'vdisk2.img'</span>, <span class="hljs-string">'vm_disk_directory'</span>: <span class="hljs-string">'/mnt/disk1/domains'</span>, <span class="hljs-string">'snapshot_to_restore_from'</span>: <span class="hljs-string">'test-snapshot'</span>}])

PLAY [Disk Restore From TAR file] ********************************************************************************************************************

TASK [Find files <span class="hljs-keyword">in</span> the VM folder containing the target VM disk name] ********************************************************************************
ok: [Rocky9-TESTNode-vdisk2 -&gt; unraid({{ lookup(<span class="hljs-string">'env'</span>, <span class="hljs-string">'UNRAID_IP_ADDRESS'</span>) }})]

TASK [Filter files matching patterns <span class="hljs-keyword">for</span> .tar files] *************************************************************************************************
ok: [Rocky9-TESTNode-vdisk2]

TASK [Ensure that files were found] ******************************************************************************************************************
ok: [Rocky9-TESTNode-vdisk2] =&gt; {
    <span class="hljs-string">"changed"</span>: <span class="hljs-literal">false</span>,
    <span class="hljs-string">"msg"</span>: <span class="hljs-string">"All assertions passed"</span>
}

TASK [Create a file list <span class="hljs-keyword">from</span> the target VM folder <span class="hljs-keyword">with</span> only file names] *****************************************************************************
ok: [Rocky9-TESTNode-vdisk2]

TASK [Stitch together full snapshot name. Replace dashes and remove special characters] **************************************************************
ok: [Rocky9-TESTNode-vdisk2]

TASK [Find and set correct snapshot <span class="hljs-keyword">if</span> file found <span class="hljs-keyword">in</span> snapshot folder] ********************************************************************************
skipping: [Rocky9-TESTNode-vdisk2]

TASK [Sort found files by modification time (newest first) - LATEST Block] ***************************************************************************
ok: [Rocky9-TESTNode-vdisk2]

TASK [Find and set correct snapshot <span class="hljs-keyword">for</span> newest found .img file - LATEST Block] ***********************************************************************
ok: [Rocky9-TESTNode-vdisk2]

TASK [Ensure that the desired snapshot file was found] ***********************************************************************************************
ok: [Rocky9-TESTNode-vdisk2] =&gt; {
    <span class="hljs-string">"changed"</span>: <span class="hljs-literal">false</span>,
    <span class="hljs-string">"msg"</span>: <span class="hljs-string">"Snapshot found! Will begin restore process NOW."</span>
}

TASK [Unzip .tar file] *******************************************************************************************************************************
changed: [Rocky9-TESTNode-vdisk2 -&gt; unraid({{ lookup(<span class="hljs-string">'env'</span>, <span class="hljs-string">'UNRAID_IP_ADDRESS'</span>) }})]

TASK [Get attributes <span class="hljs-keyword">of</span> unzipped .img file] **********************************************************************************************************
ok: [Rocky9-TESTNode-vdisk2 -&gt; unraid({{ lookup(<span class="hljs-string">'env'</span>, <span class="hljs-string">'UNRAID_IP_ADDRESS'</span>) }})]

TASK [Get attributes <span class="hljs-keyword">of</span> original disk .img file] *****************************************************************************************************
ok: [Rocky9-TESTNode-vdisk2 -&gt; unraid({{ lookup(<span class="hljs-string">'env'</span>, <span class="hljs-string">'UNRAID_IP_ADDRESS'</span>) }})]

TASK [Ensure original and unzipped .img file sizes are the same] *************************************************************************************
ok: [Rocky9-TESTNode-vdisk2] =&gt; {
    <span class="hljs-string">"changed"</span>: <span class="hljs-literal">false</span>,
    <span class="hljs-string">"msg"</span>: <span class="hljs-string">"File size comparison passed."</span>
}

TASK [Shutdown VM(s)] ********************************************************************************************************************************
included: <span class="hljs-regexp">/mnt/</span>c/Dev/Git/unraid-vm-snapshots/tasks/shutdown-vm.yml <span class="hljs-keyword">for</span> Rocky9-TESTNode-<span class="hljs-function"><span class="hljs-params">vdisk2</span> =&gt;</span> (item=Rocky9-TESTNode)

TASK [Shutdown VM - Rocky9-TESTNode] *****************************************************************************************************************
changed: [Rocky9-TESTNode-vdisk2 -&gt; unraid({{ lookup(<span class="hljs-string">'env'</span>, <span class="hljs-string">'UNRAID_IP_ADDRESS'</span>) }})]

TASK [Get VM status - Rocky9-TESTNode] ***************************************************************************************************************
FAILED - RETRYING: [Rocky9-TESTNode-vdisk2 -&gt; unraid]: Get VM status - Rocky9-TESTNode (<span class="hljs-number">5</span> retries left).
FAILED - RETRYING: [Rocky9-TESTNode-vdisk2 -&gt; unraid]: Get VM status - Rocky9-TESTNode (<span class="hljs-number">4</span> retries left).
changed: [Rocky9-TESTNode-vdisk2 -&gt; unraid({{ lookup(<span class="hljs-string">'env'</span>, <span class="hljs-string">'UNRAID_IP_ADDRESS'</span>) }})]

TASK [Delete vdisk2.img <span class="hljs-keyword">for</span> VM Rocky9-TESTNode] ******************************************************************************************************
changed: [Rocky9-TESTNode-vdisk2 -&gt; unraid({{ lookup(<span class="hljs-string">'env'</span>, <span class="hljs-string">'UNRAID_IP_ADDRESS'</span>) }})]

TASK [Rename unzipped snapshot to proper disk name] **************************************************************************************************
changed: [Rocky9-TESTNode-vdisk2 -&gt; unraid({{ lookup(<span class="hljs-string">'env'</span>, <span class="hljs-string">'UNRAID_IP_ADDRESS'</span>) }})]

PLAY [Restart Affected VMs] **************************************************************************************************************************

TASK [Start VM(s) back up] ***************************************************************************************************************************
included: <span class="hljs-regexp">/mnt/</span>c/Dev/Git/unraid-vm-snapshots/tasks/start-vm.yml <span class="hljs-keyword">for</span> unraid =&gt; (item={<span class="hljs-string">'vm_name'</span>: <span class="hljs-string">'Rocky9-TESTNode'</span>, <span class="hljs-string">'disks_to_restore'</span>: [{<span class="hljs-string">'vm_disk_to_restore'</span>: <span class="hljs-string">'vdisk2.img'</span>, <span class="hljs-string">'vm_disk_directory'</span>: <span class="hljs-string">'/mnt/disk1/domains'</span>, <span class="hljs-string">'snapshot_to_restore_from'</span>: <span class="hljs-string">'test-snapshot'</span>}]})

TASK [Start VM - Rocky9-TESTNode] ********************************************************************************************************************
changed: [unraid]

TASK [Get VM status - Rocky9-TESTNode] ***************************************************************************************************************
changed: [unraid]

TASK [Ensure VM <span class="hljs-string">'running'</span> status] ********************************************************************************************************************
ok: [unraid] =&gt; {
    <span class="hljs-string">"changed"</span>: <span class="hljs-literal">false</span>,
    <span class="hljs-string">"msg"</span>: <span class="hljs-string">"Rocky9-TESTNode has successfully started. Restore from snapshot complete."</span>
}

PLAY RECAP *******************************************************************************************************************************************
Rocky9-TESTNode-vdisk2     : ok=<span class="hljs-number">17</span>   changed=<span class="hljs-number">5</span>    unreachable=<span class="hljs-number">0</span>    failed=<span class="hljs-number">0</span>    skipped=<span class="hljs-number">1</span>    rescued=<span class="hljs-number">0</span>    ignored=<span class="hljs-number">0</span>   
unraid                     : ok=<span class="hljs-number">9</span>    changed=<span class="hljs-number">4</span>    unreachable=<span class="hljs-number">0</span>    failed=<span class="hljs-number">0</span>    skipped=<span class="hljs-number">0</span>    rescued=<span class="hljs-number">0</span>    ignored=<span class="hljs-number">0</span>
</code></pre>
<h2 id="heading-closing-thoughts"><strong>Closing Thoughts</strong></h2>
<p><img src="https://media1.tenor.com/m/QpbU3Jf0HL0AAAAd/happy-gilmore-my-fingers-hurt.gif" alt="an elderly woman is sitting in a hospital bed with her fingers hurt and says `` my fingers hurt '' ." class="image--center mx-auto" /></p>
<p>Aside from the fingers-hurting situation, this was another enjoyable mini-project. With both snapshot creation and restoration now fully functional, it’s going to be incredibly useful. It will save a ton of time on larger projects I have planned, eliminating the need to manually roll back configurations.</p>
<h2 id="heading-whats-next"><strong>What’s next?</strong></h2>
<p>I have one more piece planned for this series: cleaning up old snapshots on your storage, whether the local <code>.tar</code> files or the <code>.img</code> files on a remote repo (DiskStation).</p>
<p>Some thoughts and drafts I have for future posts include Kubernetes, Containerlab, network automation testing, Nautobot, and a few more. We’ll see!</p>
<p>You can find the code that goes along with this post <a target="_blank" href="https://github.com/leothelyon17/unraid-vm-snapshots">here</a> (GitHub).</p>
<p>Thoughts, questions, and comments are appreciated. Please follow me here on Hashnode or connect with me on <a target="_blank" href="https://www.linkedin.com/in/jeffrey-m-lyon/">LinkedIn</a>.</p>
<p>Thank you for reading fellow techies!</p>
]]></content:encoded></item><item><title><![CDATA[Unraid VM Snapshot Automation with Ansible: Part 1 - Creating Snapshots]]></title><description><![CDATA[Intro
Hello! Welcome to my very first blog post EVER!
In this series, I’ll dive into how you can leverage Ansible to automate snapshot creation and restoration in Unraid, helping to streamline your backup and recovery processes. Whether you’re new to...]]></description><link>https://blog.nerdylyonsden.io/unraid-vm-snapshot-automation-with-ansible-part-1</link><guid isPermaLink="true">https://blog.nerdylyonsden.io/unraid-vm-snapshot-automation-with-ansible-part-1</guid><category><![CDATA[unraid]]></category><category><![CDATA[ansible]]></category><category><![CDATA[automation]]></category><category><![CDATA[snapshot]]></category><dc:creator><![CDATA[Jeffrey Lyon]]></dc:creator><pubDate>Mon, 09 Sep 2024 22:57:36 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1724988764761/f44e4cc6-c222-48d1-b382-591ecdd6fa86.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-intro"><strong>Intro</strong></h2>
<p>Hello! Welcome to my very first blog post EVER!</p>
<p>In this series, I’ll dive into how you can leverage Ansible to automate snapshot creation and restoration in Unraid, helping to streamline your backup and recovery processes. Whether you’re new to Unraid or looking for ways to optimize your existing setup, this post will provide some insight, starting with what I did to create snapshots when no official solution is provided (that I know of...). Recovery using our created snapshots will come in the next post.</p>
<p>This is a warm-up series/post to help me start my blogging journey. It's mainly aimed at the home lab community, but this post, or parts of it, can definitely be useful for other scenarios and various Linux-based systems as well.</p>
<h2 id="heading-scenario">Scenario</h2>
<p>Unraid is a great platform for managing storage, virtualization, and Docker containers, but it doesn't have built-in support for taking snapshots of virtual machines (VMs). Snapshots are important because they let you save the state of a VM disk at a specific time, so you can easily restore the disk if something goes wrong, like errors, updates, or failures. Without this feature, users who depend on VMs for important tasks or development need to find other ways or use third-party tools to handle snapshots. This makes automating backup and recovery harder, especially in setups where snapshots are key for keeping the system stable and protecting data.</p>
<p>I will be using an Ubuntu Ansible host, my Unraid server as the snapshot source, and my Synology DiskStation as the remote storage destination for backing up the snapshots. Unraid will transfer these snapshots using rsync. Local snapshot creation as TAR files, which allows for faster restores, will also be covered.</p>
<ul>
<li><p>Ansible host (Ubuntu 24.04)</p>
</li>
<li><p>Unraid server (v6.12) - Runs custom Linux OS based on Slackware Linux</p>
</li>
<li><p>Synology DiskStation (v7.1) - Runs custom Linux OS - Synology DiskStation Manager (DSM)</p>
</li>
</ul>
<p>These systems will be communicating over the same 192.168.x.x MGMT network.</p>
<p><strong>NOTE:</strong> Throughout this post (and in future related posts), I’ll refer to the DiskStation as the "destination" or "NAS" device. I’m keeping these terms generic to accommodate those who might be following along with different system setups, ensuring the concepts apply broadly across various environments. I also won't be going into much detail on specific Ansible modules, structured data, or Jinja2 templating syntax. There are plenty of great resources/documentation out there to cover that.</p>
<h2 id="heading-requirements"><strong>Requirements</strong></h2>
<p>I will include required packages, configuration, and setup for the systems involved in this automation.</p>
<h3 id="heading-ansible-host">Ansible host</h3>
<p>You will need the following:</p>
<ul>
<li><p>Python (3.10 or greater suggested)</p>
</li>
<li><p>Ansible core</p>
<pre><code class="lang-bash">  sudo apt install -y ansible-core python3
</code></pre>
</li>
<li><p>Modify your ansible.cfg file to ignore host_key_checking. Usually located in /etc/ansible/</p>
<pre><code class="lang-ini">  <span class="hljs-section">[defaults]</span>
  <span class="hljs-attr">host_key_checking</span> = <span class="hljs-literal">False</span>
</code></pre>
</li>
</ul>
<p><strong>NOTE</strong>: If you're unsure where to find your ansible.cfg, just run <code>ansible --version</code> as shown below:</p>
<pre><code class="lang-bash">ansible --version

ansible [core 2.16.3]
  config file = /etc/ansible/ansible.cfg
</code></pre>
<h3 id="heading-unraid-server">Unraid server</h3>
<p>This setup is not terrible, but it's not as flexible:</p>
<p><strong>NOTE:</strong> All commands in Unraid I'm running as user 'root'. Not the most secure, yes, but easiest for now.</p>
<ul>
<li><p>Python (only supports version 3.8) - Needs to be installed from the 'Nerd Tools' plugin and enabled in GUI.</p>
</li>
<li><p>Nerd Tools plugin</p>
<p>  <strong>To Install</strong> - In GUI click on APPs -&gt; Search for 'nerd tools' -&gt; Click <s>'Actions'</s> 'Install'.</p>
<p>  <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1725922028194/f88d367a-26a8-4be0-9b23-1ed7dc88ed7f.png" alt class="image--center mx-auto" /></p>
<p>  Once installed click on 'Settings' -&gt; Scroll down until you see 'Nerd Tools' and click on it</p>
<p>  <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1725922100995/928751a3-4279-4618-b4f3-b3daf023f909.png" alt class="image--center mx-auto" /></p>
<p>  Once it loads find the Python3 option and flip it to 'On' -&gt; Scroll down to the bottom of that page and click 'Apply'.</p>
<p>  <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1725922211068/698a2c66-c37b-4e0c-ad20-924cfd9502e9.png" alt class="image--center mx-auto" /></p>
<p>  <strong>NOTE</strong>: Will install 'pip' and 'python-setuptools' automatically as well.</p>
</li>
<li><p>rsync (enabled by default)</p>
</li>
</ul>
<h3 id="heading-synology-diskstation">Synology DiskStation</h3>
<p>A few things are needed here. You can't really install packages from the CLI; everything is pulled down from the Package Center:</p>
<ul>
<li><p>Python (minimum version 3.8 - higher versions can be downloaded from the Package Center)</p>
<p>  NOTE: This isn't necessary for the automation covered in this post. Will be necessary for future posts when Ansible actually has to connect directly.</p>
</li>
<li><p>Enable SSH</p>
<p>  <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1725921849793/527781cb-96d4-4dd8-9ced-3cb8115db661.png" alt class="image--center mx-auto" /></p>
</li>
<li><p>Enable rsync service</p>
<p>  <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1725921926963/bb1f7ebd-5ef8-404c-b3f8-85a0d5dc22ef.png" alt class="image--center mx-auto" /></p>
</li>
</ul>
<h2 id="heading-lets-automate"><strong>Let's Automate!</strong></h2>
<p><strong>...but first some more boring setup</strong></p>
<p>Most of the automation will be executed directly on the Unraid host. This means we need to configure proper Ansible credentials for both the Ansible host and Unraid to authenticate when connecting remotely to the DiskStation. Using rsync—particularly with Ansible's module—can be quite troublesome when setting up remote-to-remote authentication. To simplify this process, I'll be using SSH key-based authentication, enabling passwordless login and making remote connectivity much smoother.</p>
<p>As a prerequisite to this, I have already set up a user 'unraid' on the DiskStation system. It is allowed to SSH into the DiskStation and has read/write access to the Backup folder I created earlier.</p>
<p><strong>To configure SSH key-based authentication (on Unraid server)</strong></p>
<ol>
<li><p>Generate SSH Key</p>
<p> <code>unraid# ssh-keygen</code></p>
<p> Follow the prompts. Name the key pair something descriptive if you wish. Don't bother creating a passphrase for it. Since I was doing this as the 'root' user, it dropped the new public and private key files in '/root/.ssh/'.</p>
</li>
<li><p>Copy SSH Key to DiskStation system.</p>
<p> <code>unraid# ssh-copy-id unraid@&lt;diskstation_ip&gt;</code></p>
<p> You will be prompted for the 'unraid' user's SSH password. If successful, you should see something similar to the output below.</p>
<p> <code>Number of key(s) added: 1</code></p>
<p> <code>Now try logging into the machine, with: "ssh 'unraid@&lt;diskstation_ip&gt;'" and check to make sure that only the key(s) you wanted were added.</code></p>
</li>
<li><p>Check the destination directory for the key files. They should be in the <code>.ssh</code> folder of the user's (i.e. 'unraid') home directory.</p>
</li>
<li><p>Test using the above-mentioned command. In my case -</p>
<p> <code>ssh unraid@&lt;diskstation_ip&gt;</code></p>
</li>
</ol>
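Step 1 can be done non-interactively, which is handy when scripting. This is a sketch against a scratch directory; the key type, file name, and path are examples, and the `ssh-copy-id` step in the comment still has to run on the Unraid server against your actual NAS:

```shell
# Generate a passphrase-less key pair in a scratch directory (example paths/names).
KEY_DIR=$(mktemp -d)
ssh-keygen -t rsa -b 4096 -N "" -f "$KEY_DIR/unraid_to_nas" -q
ls -1 "$KEY_DIR"   # unraid_to_nas (private), unraid_to_nas.pub (public)
# Step 2 would then be (run on Unraid, NAS must be reachable):
#   ssh-copy-id -i "$KEY_DIR/unraid_to_nas.pub" unraid@<diskstation_ip>
```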
<p><strong>NOTE:</strong> A few gotchas I'd like to share -</p>
<ul>
<li><p><em>Destination NAS device (DiskStation) still asking for password</em></p>
<p>  Solved by modifying the .ssh folder rights on both the Unraid and destination NAS (DiskStation) devices as follows -</p>
<p>  <code>chmod g-w /&lt;absolute path&gt;/.ssh/</code></p>
<p>  <code>chmod o-wx /&lt;absolute path&gt;/.ssh/</code></p>
</li>
<li><p><em>Errors for modifying the 'ssh known_hosts file</em></p>
<p>  <code>hostfile_replace_entries: link /root/.ssh/known_hosts to /root/.ssh/known_hosts.old: Operation not permitted</code></p>
<p>  <code>update_known_hosts: hostfile_replace_entries failed for /root/.ssh/known_hosts: Operation not permitted</code></p>
<p>  Solved by running an ssh-keyscan from Unraid to destination NAS -</p>
<p>  <code>unraid# ssh-keyscan -H &lt;diskstation_ip&gt; &gt;&gt; ~/.ssh/known_hosts</code></p>
</li>
</ul>
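The permission fix from the first gotcha can be verified safely on a throwaway directory before touching the real <code>/root/.ssh/</code>:

```shell
# Reproduce the fix on a scratch ".ssh" directory.
SSH_DIR="$(mktemp -d)/.ssh"
mkdir -p "$SSH_DIR"
chmod g-w "$SSH_DIR"    # strip group write
chmod o-wx "$SSH_DIR"   # strip other write/execute
stat -c '%A' "$SSH_DIR" # e.g. drwxr-xr-- : no group/other write remains
```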
<h3 id="heading-overview-and-breakdown"><strong>Overview and Breakdown</strong></h3>
<p>Let's start by discussing the playbook directory structure. It looks like this:</p>
<pre><code class="lang-bash">
├── README.md
├── create-snapshot-pb.yml
├── defaults
│   └── inventory.yml
├── files
│   ├── backup-playbook-old.yml
│   └── snapshot-creation-unused.yml
├── handlers
├── meta
├── restore-from-local-tar-pb.yml
├── restore-from-snapshot-pb.yml
├── tasks
│   ├── shutdown-vm.yml
│   └── start-vm.yml
├── templates
├── tests
│   ├── debug-tests-pb.yml
│   └── simple-debugs.yml
└── vars
    ├── snapshot-creation-vars.yml
    └── snapshot-restore-vars.yml
</code></pre>
<p>I copied the standard Ansible role directory structure in case I wanted to publish it as a role in the future. Let's go over the breakdown:</p>
<ul>
<li><p><code>defaults/inventory.yml</code> The main static inventory. Consists of the unraid and diskstation hosts with their Ansible connection variables and SSH credentials.</p>
</li>
<li><p><code>vars/snapshot-creation-vars.yml</code> This file is where users define the list of VMs and their associated disks for snapshot creation. It's mainly a dictionary specifying the targeted VMs and their disks to be snapshotted. Additionally, it includes a few variables related to the connection with the destination NAS device.</p>
</li>
<li><p><code>tasks/shutdown-vm.yml</code> Consists of tasks used to gracefully shut down targeted VMs and poll until shutdown status is confirmed.</p>
</li>
<li><p><code>tasks/start-vm.yml</code> Consists of tasks used to start up targeted VMs, poll their status, and assert they are running before moving on.</p>
</li>
<li><p><code>create-snapshot-pb.yml</code> The main playbook we are covering in this post. Consists of two plays. The first play has two purposes: to perform checks on the targeted Unraid VMs/disks and to build additional data structures/dynamic hosts. The second play then creates the snapshots and pushes them to the destination.</p>
</li>
<li><p>Tests and Files folders - <code>files/</code> consists of unused files/tasks I used to create and test the main playbook; <code>tests/</code> contains some simple debug tasks I could quickly copy and paste in to get output from playbook execution.</p>
</li>
<li><p><code>restore-from-local-tar-pb.yml, restore-from-snapshot-pb.yml, and snapshot-restore-vars.yml</code> These are files related to restoring the disks once the snapshots are created. They will be covered in the next article of this series.</p>
</li>
</ul>
<h3 id="heading-inventory-defaultsinventoryyml">Inventory - <code>defaults/inventory.yml</code></h3>
<p>The inventory file is pretty straightforward, as shown here:</p>
<pre><code class="lang-yaml"><span class="hljs-meta">---</span>
<span class="hljs-attr">nodes:</span>
  <span class="hljs-attr">hosts:</span>
    <span class="hljs-attr">diskstation:</span>
      <span class="hljs-attr">ansible_host:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ lookup('env', 'DISKSTATION_IP_ADDRESS') }}</span>"</span>
      <span class="hljs-attr">ansible_user:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ lookup('env', 'DISKSTATION_USER') }}</span>"</span>
      <span class="hljs-attr">ansible_password:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ lookup('env', 'DISKSTATION_PASS') }}</span>"</span>
    <span class="hljs-attr">unraid:</span>
      <span class="hljs-attr">ansible_host:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ lookup('env', 'UNRAID_IP_ADDRESS') }}</span>"</span>
      <span class="hljs-attr">ansible_user:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ lookup('env', 'UNRAID_USER') }}</span>"</span>
      <span class="hljs-attr">ansible_password:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ lookup('env', 'UNRAID_PASS') }}</span>"</span>
</code></pre>
<p>This file defines two hosts—unraid and diskstation—along with the essential connection variables Ansible requires to establish SSH access to these devices. For more details on the various types of connection variables, refer to the link provided below:<br /><a target="_blank" href="https://docs.ansible.com/ansible/latest/inventory_guide/intro_inventory.html#connecting-to-hosts-behavioral-inventory-parameters">Ansible Connection Variables</a></p>
<p>To keep things simple (and enhance security), I’m using environment variables to store the Ansible connection values. These variables need to be set up on the Ansible host before running the playbook. If you’re new to automation or Linux, you can create environment variables using the examples provided below:<br /><code>ansible_host# export UNRAID_USER=root</code><br /><code>ansible_host# export DISKSTATION_IP_ADDRESS=192.168.1.100</code></p>
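For reference, the full set of environment variables consumed by the inventory's <code>lookup('env', ...)</code> calls can be staged in one go. The values below are placeholders; substitute your own:

```shell
# Placeholder values -- substitute your own before running the playbook.
export UNRAID_IP_ADDRESS=192.168.1.50
export UNRAID_USER=root
export UNRAID_PASS='changeme'
export DISKSTATION_IP_ADDRESS=192.168.1.100
export DISKSTATION_USER=unraid
export DISKSTATION_PASS='changeme'
# Quick sanity check that everything is set:
env | grep -E 'UNRAID|DISKSTATION' | cut -d= -f1 | sort
```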
<h3 id="heading-variables-varssnapshot-creation-varsyml">Variables - <code>vars/snapshot-creation-vars.yml</code></h3>
<p>This playbook uses a single variable file, which serves as the main file the user will interact with. In this file, you'll define your list of VMs, specify the disks associated with each VM that need snapshots, and provide the path to the directory where each VM's existing disk <code>.img</code> files are stored.</p>
<pre><code class="lang-yaml"><span class="hljs-meta">---</span>
<span class="hljs-attr">snapshot_repository_base_directory:</span> <span class="hljs-string">volume1/Home\</span> <span class="hljs-string">Media/Backup</span>
<span class="hljs-attr">repository_user:</span> <span class="hljs-string">unraid</span>

<span class="hljs-attr">snapshot_create_list:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">vm_name:</span> <span class="hljs-string">Rocky9-TESTNode</span>
    <span class="hljs-attr">disks_to_snapshot:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">disk_name:</span> <span class="hljs-string">vdisk1.img</span>
        <span class="hljs-attr">source_directory:</span> <span class="hljs-string">/mnt/cache/domains</span>
        <span class="hljs-attr">desired_snapshot_name:</span> <span class="hljs-string">test-snapshot</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">disk_name:</span> <span class="hljs-string">vdisk2.img</span>
        <span class="hljs-attr">source_directory:</span> <span class="hljs-string">/mnt/disk1/domains</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">vm_name:</span> <span class="hljs-string">Rocky9-LabNode3</span>
    <span class="hljs-attr">disks_to_snapshot:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">disk_name:</span> <span class="hljs-string">vdisk1.img</span>
        <span class="hljs-attr">source_directory:</span> <span class="hljs-string">/mnt/nvme_cache/domains</span>
        <span class="hljs-attr">desired_snapshot_name:</span> <span class="hljs-string">kuberne&amp;&lt;tes-baseline</span>
</code></pre>
<p>Let's break this down:</p>
<ul>
<li><p><code>snapshot_create_list</code> - the main data structure for defining your list of VMs and disks. Within this there are two main variables: <code>vm_name</code> and <code>disks_to_snapshot</code>.</p>
</li>
<li><p><code>vm_name</code> - used to define the name of your VM. It must match the name of the VM within the Unraid system itself.</p>
</li>
<li><p><code>disks_to_snapshot</code> - a per-VM list of the disks that will be snapshotted. This list requires two variables, <code>disk_name</code> and <code>source_directory</code>, with <code>desired_snapshot_name</code> as an optional third.</p>
</li>
<li><p><code>disk_name</code> - the existing <code>.img</code> file name for that VM disk, e.g. <code>vdisk1.img</code></p>
</li>
<li><p><code>source_directory</code> - the absolute root path of the directory where the per-VM files are stored. An example of a full path to an <code>.img</code> file within Unraid would be: <code>/mnt/cache/domains/Rocky9-TESTNode/vdisk1.img</code></p>
</li>
<li><p><code>desired_snapshot_name</code> - an optional attribute the user can define to customize the name of the snapshot. If left undefined, a timestamp of the current date/time is used as the snapshot name, e.g. <code>vdisk2.2024-09-12T03.09.17Z.img</code></p>
</li>
<li><p><code>snapshot_repository_base_directory</code> and <code>repository_user</code> are used within the playbook's rsync task. These variables offer flexibility, allowing the user to specify their own remote user and target destination for the rsync operation. They are used only if the snapshots are sent to a remote location upon creation.</p>
</li>
</ul>
<p>Following the provided example, you can define your VMs, disk names, and locations when running the playbook.</p>
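The timestamp fallback is easy to reproduce outside Ansible. This is a rough shell equivalent of the generated name; the exact format string the playbook uses is an assumption inferred from the example above:

```shell
# Build a snapshot file name like vdisk2.2024-09-12T03.09.17Z.img
# from a disk name and the current UTC time.
disk_name="vdisk2.img"
stamp=$(date -u +%Y-%m-%dT%H.%M.%SZ)
snapshot_name="${disk_name%.img}.${stamp}.img"   # strip .img, append stamp, re-add .img
echo "$snapshot_name"
```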
<h3 id="heading-the-playbook"><strong>The Playbook</strong></h3>
<p>The playbook file is called <code>create-snapshot-pb.yml</code>. It consists of two plays and two additional task files.</p>
<p><strong>Snapshot Creation Prep Play</strong></p>
<pre><code class="lang-yaml"><span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Unraid</span> <span class="hljs-string">Snapshot</span> <span class="hljs-string">Creation</span> <span class="hljs-string">Preparation</span>
  <span class="hljs-attr">hosts:</span> <span class="hljs-string">unraid</span>
  <span class="hljs-attr">gather_facts:</span> <span class="hljs-literal">yes</span>
  <span class="hljs-attr">vars:</span>
    <span class="hljs-attr">needs_shutdown:</span> []
    <span class="hljs-attr">confirmed_shutdown:</span> []
    <span class="hljs-attr">vms_map:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ snapshot_create_list | map(attribute='vm_name') }}</span>"</span>
    <span class="hljs-attr">disks_map:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ snapshot_create_list | map(attribute='disks_to_snapshot') }}</span>"</span>
    <span class="hljs-attr">snapshot_data_map:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ dict(vms_map | zip(disks_map)) | dict2items(key_name='vm_name', value_name='disks_to_snapshot') | subelements('disks_to_snapshot') }}</span>"</span>
  <span class="hljs-attr">vars_files:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">./vars/snapshot-creation-vars.yml</span>

  <span class="hljs-attr">tasks:</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Get</span> <span class="hljs-string">initial</span> <span class="hljs-string">VM</span> <span class="hljs-string">status</span>
      <span class="hljs-attr">shell:</span> <span class="hljs-string">virsh</span> <span class="hljs-string">list</span> <span class="hljs-string">--all</span> <span class="hljs-string">|</span> <span class="hljs-string">grep</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ item.vm_name }}</span>"</span> <span class="hljs-string">|</span> <span class="hljs-string">awk</span> <span class="hljs-string">'{ print $3}'</span>
      <span class="hljs-attr">register:</span> <span class="hljs-string">cmd_res</span>
      <span class="hljs-attr">tags:</span> <span class="hljs-string">always</span>
      <span class="hljs-attr">with_items:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ snapshot_create_list }}</span>"</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Create</span> <span class="hljs-string">list</span> <span class="hljs-string">of</span> <span class="hljs-string">VMs</span> <span class="hljs-string">that</span> <span class="hljs-string">need</span> <span class="hljs-string">shutdown</span>
      <span class="hljs-attr">set_fact:</span>
        <span class="hljs-attr">needs_shutdown:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ needs_shutdown + [item.item.vm_name] }}</span>"</span>
      <span class="hljs-attr">when:</span> <span class="hljs-string">item.stdout</span> <span class="hljs-type">!=</span> <span class="hljs-string">'shut'</span>
      <span class="hljs-attr">tags:</span> <span class="hljs-string">always</span>
      <span class="hljs-attr">with_items:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ cmd_res.results }}</span>"</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Shutdown</span> <span class="hljs-string">VM(s)</span>
      <span class="hljs-attr">include_tasks:</span> <span class="hljs-string">./tasks/shutdown-vm.yml</span>
      <span class="hljs-attr">loop:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ needs_shutdown }}</span>"</span>
      <span class="hljs-attr">tags:</span> <span class="hljs-string">always</span>
      <span class="hljs-attr">when:</span> <span class="hljs-string">needs_shutdown</span>
</code></pre>
<p><strong>Purpose</strong>:<br />Prepares the Unraid server for VM snapshot creation by checking the status of VMs, identifying which need to be shut down, and initiating shutdowns where necessary.</p>
<p><strong>Hosts</strong>:<br />Targets the <code>unraid</code> host.</p>
<p><strong>Variables</strong>:</p>
<ul>
<li><p><code>needs_shutdown</code>: Placeholder list of VMs that require shutdown before snapshot creation.</p>
</li>
<li><p><code>confirmed_shutdown</code>: Placeholder list for VMs confirmed to be shut down.</p>
</li>
<li><p><code>vms_map</code> and <code>disks_map</code>: Maps (new lists) of just the VM names and their individual disk data, respectively. These lists are then used to build the larger <code>snapshot_data_map</code>.</p>
</li>
<li><p><code>snapshot_data_map</code>: Merges the VM and disk maps into a more structured data format, making it easier to access and manage the VM/disk information programmatically. My goal was to keep the inventory files simple for users to understand and modify. However, this approach didn’t work well with the looping logic I needed, so I created this new data map for better flexibility and control.</p>
</li>
</ul>
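<p>To visualize the transformation, here's a sketch (illustrative, based on the example VM from earlier) of the shape <code>snapshot_data_map</code> ends up with. The <code>subelements</code> filter emits one <code>[VM, disk]</code> pair per disk, which is why the looping tasks reference <code>item[0]['vm_name']</code> and <code>item[1]['disk_name']</code>:</p>
<pre><code class="lang-yaml"># Each element is a two-item pair: [the VM entry, one of its disks]
snapshot_data_map:
  - - vm_name: Rocky9-TESTNode
      disks_to_snapshot:
        - disk_name: vdisk1.img
          source_directory: /mnt/cache/domains
    - disk_name: vdisk1.img
      source_directory: /mnt/cache/domains
</code></pre>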
<p><strong>Variables File</strong>:<br />Loads additional variables from <code>./vars/snapshot-creation-vars.yml</code>, mainly the user's modified <code>snapshot_create_list</code>.</p>
<p><strong>Tasks</strong>:</p>
<ol>
<li><p><strong>Get Initial VM Status</strong>:<br /> Runs a shell command using <code>virsh list --all</code> to check the current status of each VM (running or shut down). Results are stored in <code>cmd_res</code>.</p>
</li>
<li><p><strong>Identify VMs Needing Shutdown</strong>:<br /> Uses a conditional check to add VMs that are not already shut down to the <code>needs_shutdown</code> list.</p>
</li>
<li><p><strong>Shutdown VMs</strong>:<br /> Includes an external task file (<code>shutdown-vm.yml</code>) to gracefully shut down the VMs listed in <code>needs_shutdown</code>. This task loops through the VMs in that list and executes the shutdown process. Using an external task file enables looping over a block of tasks while preserving error handling. If any task within the block fails, the entire block fails, ensuring that the VM is not added to the <code>confirmed_shutdown</code> list later in the play. This method provides better control and validation during the shutdown process.</p>
</li>
</ol>
<p><strong>NOTE</strong>: The tasks above all carry the tag ‘always’, a special tag that ensures a task runs regardless of which tags are specified when you run the playbook.</p>
<p><strong>Shutdown VMs task block (within Snapshot Creation Preparation play)</strong></p>
<pre><code class="lang-yaml"><span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Shutdown</span> <span class="hljs-string">VMs</span> <span class="hljs-string">Block</span>
  <span class="hljs-attr">block:</span>

  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Shutdown</span> <span class="hljs-string">VM</span> <span class="hljs-bullet">-</span> {{ <span class="hljs-string">item</span> }}
    <span class="hljs-attr">command:</span> <span class="hljs-string">virsh</span> <span class="hljs-string">shutdown</span> {{ <span class="hljs-string">item</span> }}
    <span class="hljs-attr">ignore_errors:</span> <span class="hljs-literal">true</span>

  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Get</span> <span class="hljs-string">VM</span> <span class="hljs-string">status</span> <span class="hljs-bullet">-</span> {{ <span class="hljs-string">item</span> }}
    <span class="hljs-attr">shell:</span> <span class="hljs-string">virsh</span> <span class="hljs-string">list</span> <span class="hljs-string">--all</span> <span class="hljs-string">|</span> <span class="hljs-string">grep</span> {{ <span class="hljs-string">item</span> }} <span class="hljs-string">|</span> <span class="hljs-string">awk</span> <span class="hljs-string">'{ print $3}'</span>
    <span class="hljs-attr">register:</span> <span class="hljs-string">cmd_res</span>
    <span class="hljs-attr">retries:</span> <span class="hljs-number">5</span>
    <span class="hljs-attr">delay:</span> <span class="hljs-number">10</span>
    <span class="hljs-attr">until:</span> <span class="hljs-string">cmd_res.stdout</span> <span class="hljs-type">!=</span> <span class="hljs-string">'running'</span>

  <span class="hljs-attr">delegate_to:</span> <span class="hljs-string">unraid</span>
  <span class="hljs-attr">tags:</span> <span class="hljs-string">always</span>
</code></pre>
<p>Here's a breakdown of the task block to shut down the targeted VMs:</p>
<p><strong>Purpose</strong>:<br />This block is designed to gracefully shut down virtual machines (VMs) and verify their shutdown status. This block is also tagged as ‘always’, ensuring ALL tasks in the block run.</p>
<p><strong>Tasks</strong>:</p>
<ol>
<li><p><strong>Shutdown VM</strong>:<br /> Uses the <code>virsh shutdown</code> command to initiate the shutdown of the specified VM.</p>
</li>
<li><p><strong>Check VM Status</strong>:<br /> Runs a shell command to retrieve the VM's current status using <code>virsh list</code>. The status is checked by parsing the output to confirm whether the VM is no longer running. The task will retry up to 5 times, with a 10-second delay between attempts, until the VM is confirmed to have shut down (<code>cmd_res.stdout != 'running'</code>).</p>
</li>
</ol>
<p><strong>Snapshot Creation Preparation Play (continued)</strong></p>
<pre><code class="lang-yaml"><span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Get</span> <span class="hljs-string">VM</span> <span class="hljs-string">status</span>
      <span class="hljs-attr">shell:</span> <span class="hljs-string">virsh</span> <span class="hljs-string">list</span> <span class="hljs-string">--all</span> <span class="hljs-string">|</span> <span class="hljs-string">grep</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ item.vm_name }}</span>"</span> <span class="hljs-string">|</span> <span class="hljs-string">awk</span> <span class="hljs-string">'{ print $3}'</span>
      <span class="hljs-attr">register:</span> <span class="hljs-string">cmd_res</span>
      <span class="hljs-attr">tags:</span> <span class="hljs-string">always</span>
      <span class="hljs-attr">with_items:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ snapshot_create_list }}</span>"</span>

<span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Create</span> <span class="hljs-string">list</span> <span class="hljs-string">to</span> <span class="hljs-string">use</span> <span class="hljs-string">for</span> <span class="hljs-string">confirmation</span> <span class="hljs-string">of</span> <span class="hljs-string">VMs</span> <span class="hljs-string">being</span> <span class="hljs-string">shutdown</span>
  <span class="hljs-attr">set_fact:</span>
    <span class="hljs-attr">confirmed_shutdown:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ confirmed_shutdown + [item.item.vm_name] }}</span>"</span>
  <span class="hljs-attr">when:</span> <span class="hljs-string">item.stdout</span> <span class="hljs-string">==</span> <span class="hljs-string">'shut'</span>
  <span class="hljs-attr">tags:</span> <span class="hljs-string">always</span>
  <span class="hljs-attr">with_items:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ cmd_res.results }}</span>"</span>

<span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Add</span> <span class="hljs-string">host</span> <span class="hljs-string">to</span> <span class="hljs-string">group</span> <span class="hljs-string">'disks'</span> <span class="hljs-string">with</span> <span class="hljs-string">variables</span>
  <span class="hljs-attr">ansible.builtin.add_host:</span>
    <span class="hljs-attr">name:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ item[0]['vm_name'] }}</span>-<span class="hljs-template-variable">{{ item[1]['disk_name'][:-4] }}</span>"</span>
    <span class="hljs-attr">groups:</span> <span class="hljs-string">disks</span>
    <span class="hljs-attr">vm_name:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ item[0]['vm_name'] }}</span>"</span>
    <span class="hljs-attr">disk_name:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ item[1]['disk_name'] }}</span>"</span>
    <span class="hljs-attr">source_directory:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ item[1]['source_directory'] }}</span>"</span>
    <span class="hljs-attr">desired_snapshot_name:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ item[1]['desired_snapshot_name'] | default('') }}</span>"</span>
  <span class="hljs-attr">tags:</span> <span class="hljs-string">always</span>
  <span class="hljs-attr">loop:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ snapshot_data_map }}</span>"</span>
</code></pre>
<p><strong>Purpose</strong>:<br />This second group of tasks (still within the Snapshot Prep play) checks the status of VMs, confirms which have been shut down, and adds their disks to a dynamic inventory group for snapshot creation.</p>
<p><strong>Tasks</strong>:</p>
<ol>
<li><p><strong>Get VM Status</strong>:<br /> Runs a shell command using <code>virsh list --all</code> to retrieve the current status (e.g., running, shut) of each VM in the <code>snapshot_create_list</code>. The result is stored in <code>cmd_res</code>.</p>
</li>
<li><p><strong>Confirm VM Shutdown</strong>:<br /> Updates the <code>confirmed_shutdown</code> list by adding VMs that are confirmed to be in the "shut" state. This ensures only properly shut down VMs proceed to the next steps.</p>
</li>
<li><p><strong>Add Disks to Group 'disks'</strong>:<br /> Dynamically adds VMs and their respective disks to the Ansible inventory group <code>disks</code>. It includes variables like <code>vm_name</code>, <code>disk_name</code>, and <code>source_directory</code>, which will be used for subsequent snapshot operations.</p>
</li>
</ol>
<p><strong>Other things to point out</strong>:</p>
<ul>
<li>Ansible lets you dynamically add inventory hosts during playbook execution, which I used to treat each disk as a "host" rather than relying solely on variables. This approach lets the playbook leverage Ansible's native batch task execution, allowing snapshot creation tasks to run concurrently across all disks. Without this method, using standard variables and looping would result in snapshots being created and synced <strong>one at a time</strong>— UGH. That's the reason behind Task #3. These tasks, too, are all tagged with ‘always’.</li>
</ul>
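<p>As a rough sketch (assuming the example VM from earlier), the <code>add_host</code> task builds a runtime-only inventory group that would look something like this if written out by hand; the host name is the <code>vm_name</code> plus the disk name minus its <code>.img</code> extension:</p>
<pre><code class="lang-yaml"># The dynamic 'disks' group, shown as if it were a static inventory (illustrative only)
disks:
  hosts:
    Rocky9-TESTNode-vdisk1:
      vm_name: Rocky9-TESTNode
      disk_name: vdisk1.img
      source_directory: /mnt/cache/domains
      desired_snapshot_name: ''
</code></pre>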
<p><strong>Snapshot Creation Play</strong></p>
<pre><code class="lang-yaml"><span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Unraid</span> <span class="hljs-string">Snapshot</span> <span class="hljs-string">Creation</span>
  <span class="hljs-attr">hosts:</span> <span class="hljs-string">disks</span>
  <span class="hljs-attr">gather_facts:</span> <span class="hljs-literal">no</span>
  <span class="hljs-attr">vars_files:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">./vars/snapshot-creation-vars.yml</span>

  <span class="hljs-attr">tasks:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Snapshot</span> <span class="hljs-string">Creation</span> <span class="hljs-string">Task</span> <span class="hljs-string">Block</span>
      <span class="hljs-attr">block:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">setup:</span>
            <span class="hljs-attr">gather_subset:</span>
              <span class="hljs-bullet">-</span> <span class="hljs-string">'min'</span>
          <span class="hljs-attr">delegate_to:</span> <span class="hljs-string">unraid</span>

        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Create</span> <span class="hljs-string">snapshot</span> <span class="hljs-string">image</span> <span class="hljs-string">filename</span>
          <span class="hljs-attr">set_fact:</span>
            <span class="hljs-attr">snapshot_filename:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ disk_name[:-4] }}</span>.<span class="hljs-template-variable">{{ desired_snapshot_name | regex_replace('\\-', '_') | regex_replace('\\W', '') }}</span>.img"</span>
          <span class="hljs-attr">delegate_to:</span> <span class="hljs-string">unraid</span>
          <span class="hljs-attr">when:</span> <span class="hljs-string">desired_snapshot_name</span> <span class="hljs-string">is</span> <span class="hljs-string">defined</span> <span class="hljs-string">and</span> <span class="hljs-string">desired_snapshot_name</span> <span class="hljs-string">|</span> <span class="hljs-string">length</span> <span class="hljs-string">&gt;</span> <span class="hljs-number">0</span>

        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Create</span> <span class="hljs-string">snapshot</span> <span class="hljs-string">image</span> <span class="hljs-string">filename</span> <span class="hljs-string">with</span> <span class="hljs-string">default</span> <span class="hljs-string">date/time</span> <span class="hljs-string">if</span> <span class="hljs-string">necessary</span>
          <span class="hljs-attr">set_fact:</span>
            <span class="hljs-attr">snapshot_filename:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ disk_name[:-4] }}</span>.<span class="hljs-template-variable">{{ ansible_date_time.iso8601|replace(':', '.')}}</span>.img"</span>
          <span class="hljs-attr">delegate_to:</span> <span class="hljs-string">unraid</span>
          <span class="hljs-attr">when:</span> <span class="hljs-string">desired_snapshot_name</span> <span class="hljs-string">is</span> <span class="hljs-string">not</span> <span class="hljs-string">defined</span> <span class="hljs-string">or</span> <span class="hljs-string">desired_snapshot_name</span> <span class="hljs-string">|</span> <span class="hljs-string">length</span> <span class="hljs-string">==</span> <span class="hljs-number">0</span>

        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Create</span> <span class="hljs-string">reflink</span> <span class="hljs-string">for</span> {{ <span class="hljs-string">vm_name</span> }}
          <span class="hljs-attr">command:</span> <span class="hljs-string">cp</span> <span class="hljs-string">--reflink</span> <span class="hljs-string">-rf</span> {{ <span class="hljs-string">disk_name</span> }} {{ <span class="hljs-string">snapshot_filename</span> }}
          <span class="hljs-attr">args:</span>
            <span class="hljs-attr">chdir:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ source_directory }}</span>/<span class="hljs-template-variable">{{ vm_name }}</span>"</span>
          <span class="hljs-attr">delegate_to:</span> <span class="hljs-string">unraid</span>

        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Check</span> <span class="hljs-string">if</span> <span class="hljs-string">reflink</span> <span class="hljs-string">exists</span>
          <span class="hljs-attr">stat:</span> 
            <span class="hljs-attr">path:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ source_directory }}</span>/<span class="hljs-template-variable">{{ vm_name }}</span>/<span class="hljs-template-variable">{{ snapshot_filename }}</span>"</span>
            <span class="hljs-attr">get_checksum:</span> <span class="hljs-literal">False</span>
          <span class="hljs-attr">register:</span> <span class="hljs-string">check_reflink_hd</span>
          <span class="hljs-attr">delegate_to:</span> <span class="hljs-string">unraid</span>

        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Backup</span> <span class="hljs-string">HD(s)</span> <span class="hljs-string">to</span> <span class="hljs-string">DiskStation</span>
          <span class="hljs-attr">command:</span> <span class="hljs-string">rsync</span> <span class="hljs-string">--progress</span> {{ <span class="hljs-string">snapshot_filename</span> }} {{ <span class="hljs-string">repository_user</span> }}<span class="hljs-string">@{{</span> <span class="hljs-string">hostvars['diskstation']['ansible_host']</span> <span class="hljs-string">}}:/{{</span> <span class="hljs-string">snapshot_repository_base_directory</span> <span class="hljs-string">}}/{{</span> <span class="hljs-string">vm_name</span> <span class="hljs-string">}}/</span>
          <span class="hljs-attr">args:</span>
            <span class="hljs-attr">chdir:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ source_directory }}</span>/<span class="hljs-template-variable">{{ vm_name }}</span>"</span>
          <span class="hljs-attr">when:</span> <span class="hljs-string">check_reflink_hd.stat.exists</span> <span class="hljs-string">and</span> <span class="hljs-string">'use_local'</span> <span class="hljs-string">not</span> <span class="hljs-string">in</span> <span class="hljs-string">ansible_run_tags</span>
          <span class="hljs-attr">delegate_to:</span> <span class="hljs-string">unraid</span>

        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Backup</span> <span class="hljs-string">HD(s)</span> <span class="hljs-string">to</span> <span class="hljs-string">Local</span> <span class="hljs-string">VM</span> <span class="hljs-string">Folder</span> <span class="hljs-string">as</span> <span class="hljs-string">.tar</span>
          <span class="hljs-attr">command:</span> <span class="hljs-string">tar</span> <span class="hljs-string">cf</span> {{ <span class="hljs-string">snapshot_filename</span> }}<span class="hljs-string">.tar</span> {{ <span class="hljs-string">snapshot_filename</span> }}
          <span class="hljs-attr">args:</span>
            <span class="hljs-attr">chdir:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ source_directory }}</span>/<span class="hljs-template-variable">{{ vm_name }}</span>"</span>
          <span class="hljs-attr">when:</span> <span class="hljs-string">check_reflink_hd.stat.exists</span> <span class="hljs-string">and</span> <span class="hljs-string">'use_local'</span> <span class="hljs-string">in</span> <span class="hljs-string">ansible_run_tags</span>
          <span class="hljs-attr">delegate_to:</span> <span class="hljs-string">unraid</span>

        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Delete</span> <span class="hljs-string">reflink</span> <span class="hljs-string">file</span>
          <span class="hljs-attr">command:</span> <span class="hljs-string">rm</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ source_directory }}</span>/<span class="hljs-template-variable">{{ vm_name }}</span>/<span class="hljs-template-variable">{{ snapshot_filename }}</span>"</span>
          <span class="hljs-attr">when:</span> <span class="hljs-string">check_reflink_hd.stat.exists</span>
          <span class="hljs-attr">delegate_to:</span> <span class="hljs-string">unraid</span>

        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Start</span> <span class="hljs-string">VM</span> <span class="hljs-string">following</span> <span class="hljs-string">snapshot</span> <span class="hljs-string">transfer</span>
          <span class="hljs-attr">command:</span> <span class="hljs-string">virsh</span> <span class="hljs-string">start</span> {{ <span class="hljs-string">vm_name</span> }}
          <span class="hljs-attr">tags:</span> <span class="hljs-string">always</span>
          <span class="hljs-attr">delegate_to:</span> <span class="hljs-string">unraid</span>


      <span class="hljs-attr">when:</span> <span class="hljs-string">vm_name</span> <span class="hljs-string">in</span> <span class="hljs-string">hostvars['unraid']['confirmed_shutdown']</span>
      <span class="hljs-attr">tags:</span> <span class="hljs-string">always</span>
</code></pre>
<p>Here's a breakdown of the second play in the playbook—<strong>Unraid Snapshot Creation</strong></p>
<p><strong>Purpose</strong>:<br />This play automates the creation of VM disk snapshots on the Unraid server, backing them up to a destination NAS via rsync or creating local snapshots as TAR files, stored in the same directory as the original disk.</p>
<p><strong>Hosts</strong>:</p>
<ul>
<li>Uses the dynamically created <code>disks</code> group from the previous play. It can also still reference the <code>unraid</code> host, which remains in memory from the previous play. <code>gather_facts</code> is set to 'no', since the members of the <code>disks</code> group aren't actually hosts we connect to (explained in the previous play).</li>
</ul>
<p><strong>Variables</strong>:</p>
<ul>
<li>Loads variables from the external file <code>./vars/snapshot-creation-vars.yml</code>, specifically <code>snapshot_repository_base_directory</code> and <code>repository_user</code>.</li>
</ul>
<p><strong>Tasks</strong>:</p>
<ol>
<li><p><strong>Setup Minimal Facts</strong>:<br /> Gathers a minimal fact subset from the <code>unraid</code> host to prepare for snapshot creation, mainly to populate the <code>ansible_date_time.iso8601</code> variable.</p>
</li>
<li><p><strong>Create Snapshot Filename</strong>:<br /> Generates a unique snapshot filename based on the <code>desired_snapshot_name</code> variable if defined by the user. It also sanitizes that value by replacing dashes with underscores and removing any remaining special characters.</p>
</li>
<li><p><strong>Create Snapshot Image Filename with Default Date/Time if necessary:</strong></p>
<p> Acts as the fallback for generating the snapshot name. Builds the filename with an ISO 8601 date/time stamp if a filename wasn’t created by the previous task.</p>
</li>
<li><p><strong>Create Snapshot (Reflink)</strong>:<br /> Uses a <code>cp --reflink</code> command to create a snapshot (reflink) of the specified disk in the source directory.</p>
</li>
<li><p><strong>Verify Snapshot Creation</strong>:<br /> Checks if the snapshot (reflink) was successfully created in the target directory.</p>
</li>
<li><p><strong>Backup Snapshot to DiskStation</strong>:<br /> If the snapshot exists, it's transferred to the DiskStation NAS using rsync, executed via Ansible's <code>command</code> module. A downside is that there’s no live progress shown in the Ansible shell output, which can be frustrating for large or numerous disk files. In my case, I monitor the DiskStation GUI to track the snapshot's file size growth to confirm it’s still running. If you want better visibility, Ansible AWX provides progress tracking without this limitation. Conditionally runs only if Ansible finds an existing reflink for the disk and the playbook WASN’T run with the <code>use_local</code> tag.</p>
</li>
<li><p><strong>Backup HD(s) to Local VM Folder as .tar:</strong></p>
<p> Alternatively, if the <code>use_local</code> tag is present, the snapshot is archived locally as a <code>.tar</code> file. This option allows users to store the snapshot on the same server, in the same source disk folder, without needing external storage. The play provides a mechanism to skip this step if not required, offering tag-based control for local or remote backups. Conditionally runs only if Ansible finds an existing reflink for the disk.</p>
</li>
<li><p><strong>Delete Reflink File</strong>:<br /> Once the snapshot has been successfully backed up, it deletes the temporary reflink file on the <code>unraid</code> host.</p>
</li>
<li><p><strong>Start VM Following Successful Snapshot Creation</strong></p>
<p> Starts the impacted VMs back up once the snapshot creation process completes.</p>
</li>
</ol>
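<p>To make the filename logic concrete, here are the results the two <code>set_fact</code> tasks above would produce for a disk named <code>vdisk1.img</code>. The custom name below is a hypothetical value; the timestamp form matches the default example from earlier in the article:</p>
<pre><code class="lang-yaml"># Illustrative results of the snapshot_filename set_fact tasks
desired_snapshot_name: my-baseline   # -> vdisk1.my_baseline.img
desired_snapshot_name: ''            # -> vdisk1.2024-09-12T03.09.17Z.img (ISO 8601, ':' replaced with '.')
</code></pre>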
<p><strong>Conditional Execution</strong>:</p>
<ul>
<li>The block only executes if the VM is confirmed to be in a shutdown state, based on the <code>vm_name</code> value being present in the <code>confirmed_shutdown</code> host variable list created on the <code>unraid</code> host in the previous play. The whole block is tagged with ‘always’, so every task always runs, with the exception of the two backup tasks, which are selected between by the <code>use_local</code> tag (see above).</li>
</ul>
<p><strong>Other things to point out</strong>:</p>
<ul>
<li><p>All these tasks are being executed or <code>delegated_to</code> the <code>unraid</code> host itself. Nothing will run on the <code>disks</code> host group.</p>
</li>
<li><p>I opted to use <code>.tar</code> files to speed up both the creation and restoration of snapshots. A traditional local file copy took nearly as long as an <code>rsync</code> to a remote destination. By using <code>.tar</code> files within the same disk source folder, I reduced the time required by 25-50%.</p>
</li>
</ul>
<h3 id="heading-creating-the-snapshots-running-the-playbook"><strong>Creating the Snapshots (Running the Playbook)</strong></h3>
<p>Finally, we can move on to the most exciting piece: running the playbook. It's very simple. Just run the following command in the root of the playbook directory:</p>
<pre><code class="lang-bash">ansible-playbook create-snapshot-pb.yml -i defaults/inventory.yml
</code></pre>
<p>As long as your data and formatting are clean and all the required setup was done, you should see the playbook shut down the VMs (if necessary) and quickly get to the Backup task for the disks. That's where it will spend the majority of its time.</p>
<p>Alternatively, you can run this play with the <code>use_local</code> tag to save snapshots as <code>.tar</code> files locally. This approach is ideal for faster recovery in a lab environment, where you're actively building or testing. Instead of rolling back multiple changes on a server, it's quicker and simpler to erase the disk and restore from a local baseline snapshot.</p>
<pre><code class="lang-bash">ansible-playbook create-snapshot-pb.yml -i defaults/inventory.yml --tags <span class="hljs-string">'use_local'</span>
</code></pre>
<p>Successful output should look similar to the following:</p>
<pre><code class="lang-typescript">PLAY [Unraid Snapshot Creation Prep] *****************************************************************************************************************

TASK [Gathering Facts] *******************************************************************************************************************************
ok: [unraid]

TASK [Get initial VM status] *************************************************************************************************************************
changed: [unraid] =&gt; (item={<span class="hljs-string">'vm_name'</span>: <span class="hljs-string">'Rocky9-TESTNode'</span>, <span class="hljs-string">'disks_to_snapshot'</span>: [{<span class="hljs-string">'disk_name'</span>: <span class="hljs-string">'vdisk1.img'</span>, <span class="hljs-string">'source_directory'</span>: <span class="hljs-string">'/mnt/cache/domains'</span>}]})
changed: [unraid] =&gt; (item={<span class="hljs-string">'vm_name'</span>: <span class="hljs-string">'Rocky9-LabNode3'</span>, <span class="hljs-string">'disks_to_snapshot'</span>: [{<span class="hljs-string">'disk_name'</span>: <span class="hljs-string">'vdisk1.img'</span>, <span class="hljs-string">'source_directory'</span>: <span class="hljs-string">'/mnt/nvme_cache/domains'</span>}]})

TASK [Create list <span class="hljs-keyword">of</span> VMs that need shutdown] *********************************************************************************************************
ok: [unraid]

TASK [Shutdown VM(s)] ********************************************************************************************************************************
included: <span class="hljs-regexp">/mnt/</span>c/Dev/Git/unraid-vm-snapshots/tasks/shutdown-vm.yml <span class="hljs-keyword">for</span> unraid =&gt; (item=Rocky9-TESTNode)
included: <span class="hljs-regexp">/mnt/</span>c/Dev/Git/unraid-vm-snapshots/tasks/shutdown-vm.yml <span class="hljs-keyword">for</span> unraid =&gt; (item=Rocky9-LabNode3)

TASK [Shutdown VM - Rocky9-TESTNode] *****************************************************************************************************************
changed: [unraid]

TASK [Get VM status - Rocky9-TESTNode] ***************************************************************************************************************
changed: [unraid]

TASK [Shutdown VM - Rocky9-LabNode3] *****************************************************************************************************************
changed: [unraid]

TASK [Get VM status - Rocky9-LabNode3] ***************************************************************************************************************
FAILED - RETRYING: [unraid]: Get VM status - Rocky9-LabNode3 (<span class="hljs-number">5</span> retries left).
changed: [unraid]

TASK [Get VM status] *********************************************************************************************************************************
changed: [unraid] =&gt; (item={<span class="hljs-string">'vm_name'</span>: <span class="hljs-string">'Rocky9-TESTNode'</span>, <span class="hljs-string">'disks_to_snapshot'</span>: [{<span class="hljs-string">'disk_name'</span>: <span class="hljs-string">'vdisk1.img'</span>, <span class="hljs-string">'source_directory'</span>: <span class="hljs-string">'/mnt/cache/domains'</span>}]})
changed: [unraid] =&gt; (item={<span class="hljs-string">'vm_name'</span>: <span class="hljs-string">'Rocky9-LabNode3'</span>, <span class="hljs-string">'disks_to_snapshot'</span>: [{<span class="hljs-string">'disk_name'</span>: <span class="hljs-string">'vdisk1.img'</span>, <span class="hljs-string">'source_directory'</span>: <span class="hljs-string">'/mnt/nvme_cache/domains'</span>}]})

TASK [Create list to use <span class="hljs-keyword">for</span> confirmation <span class="hljs-keyword">of</span> VMs being shutdown] *************************************************************************************
ok: [unraid] =&gt; (item={<span class="hljs-string">'changed'</span>: True, <span class="hljs-string">'stdout'</span>: <span class="hljs-string">'shut'</span>, <span class="hljs-string">'stderr'</span>: <span class="hljs-string">''</span>, <span class="hljs-string">'rc'</span>: <span class="hljs-number">0</span>, <span class="hljs-string">'cmd'</span>: <span class="hljs-string">'virsh list --all | grep "Rocky9-TESTNode" | awk \'{ print $3}\''</span>, <span class="hljs-string">'start'</span>: <span class="hljs-string">'2024-09-09 18:04:55.797046'</span>, <span class="hljs-string">'end'</span>: <span class="hljs-string">'2024-09-09 18:04:55.809047'</span>, <span class="hljs-string">'delta'</span>: <span class="hljs-string">'0:00:00.012001'</span>, <span class="hljs-string">'msg'</span>: <span class="hljs-string">''</span>, <span class="hljs-string">'invocation'</span>: {<span class="hljs-string">'module_args'</span>: {<span class="hljs-string">'_raw_params'</span>: <span class="hljs-string">'virsh list --all | grep "Rocky9-TESTNode" | awk \'{ print $3}\''</span>, <span class="hljs-string">'_uses_shell'</span>: True, <span class="hljs-string">'expand_argument_vars'</span>: True, <span class="hljs-string">'stdin_add_newline'</span>: True, <span class="hljs-string">'strip_empty_ends'</span>: True, <span class="hljs-string">'argv'</span>: None, <span class="hljs-string">'chdir'</span>: None, <span class="hljs-string">'executable'</span>: None, <span class="hljs-string">'creates'</span>: None, <span class="hljs-string">'removes'</span>: None, <span class="hljs-string">'stdin'</span>: None}}, <span class="hljs-string">'stdout_lines'</span>: [<span class="hljs-string">'shut'</span>], <span class="hljs-string">'stderr_lines'</span>: [], <span class="hljs-string">'failed'</span>: False, <span class="hljs-string">'item'</span>: {<span class="hljs-string">'vm_name'</span>: <span class="hljs-string">'Rocky9-TESTNode'</span>, <span class="hljs-string">'disks_to_snapshot'</span>: 
[{<span class="hljs-string">'disk_name'</span>: <span class="hljs-string">'vdisk1.img'</span>, <span class="hljs-string">'source_directory'</span>: <span class="hljs-string">'/mnt/cache/domains'</span>}]}, <span class="hljs-string">'ansible_loop_var'</span>: <span class="hljs-string">'item'</span>})
ok: [unraid] =&gt; (item={<span class="hljs-string">'changed'</span>: True, <span class="hljs-string">'stdout'</span>: <span class="hljs-string">'shut'</span>, <span class="hljs-string">'stderr'</span>: <span class="hljs-string">''</span>, <span class="hljs-string">'rc'</span>: <span class="hljs-number">0</span>, <span class="hljs-string">'cmd'</span>: <span class="hljs-string">'virsh list --all | grep "Rocky9-LabNode3" | awk \'{ print $3}\''</span>, <span class="hljs-string">'start'</span>: <span class="hljs-string">'2024-09-09 18:04:57.638402'</span>, <span class="hljs-string">'end'</span>: <span class="hljs-string">'2024-09-09 18:04:57.650150'</span>, <span class="hljs-string">'delta'</span>: <span class="hljs-string">'0:00:00.011748'</span>, <span class="hljs-string">'msg'</span>: <span class="hljs-string">''</span>, <span class="hljs-string">'invocation'</span>: {<span class="hljs-string">'module_args'</span>: {<span class="hljs-string">'_raw_params'</span>: <span class="hljs-string">'virsh list --all | grep "Rocky9-LabNode3" | awk \'{ print $3}\''</span>, <span class="hljs-string">'_uses_shell'</span>: True, <span class="hljs-string">'expand_argument_vars'</span>: True, <span class="hljs-string">'stdin_add_newline'</span>: True, <span class="hljs-string">'strip_empty_ends'</span>: True, <span class="hljs-string">'argv'</span>: None, <span class="hljs-string">'chdir'</span>: None, <span class="hljs-string">'executable'</span>: None, <span class="hljs-string">'creates'</span>: None, <span class="hljs-string">'removes'</span>: None, <span class="hljs-string">'stdin'</span>: None}}, <span class="hljs-string">'stdout_lines'</span>: [<span class="hljs-string">'shut'</span>], <span class="hljs-string">'stderr_lines'</span>: [], <span class="hljs-string">'failed'</span>: False, <span class="hljs-string">'item'</span>: {<span class="hljs-string">'vm_name'</span>: <span class="hljs-string">'Rocky9-LabNode3'</span>, <span class="hljs-string">'disks_to_snapshot'</span>: 
[{<span class="hljs-string">'disk_name'</span>: <span class="hljs-string">'vdisk1.img'</span>, <span class="hljs-string">'source_directory'</span>: <span class="hljs-string">'/mnt/nvme_cache/domains'</span>}]}, <span class="hljs-string">'ansible_loop_var'</span>: <span class="hljs-string">'item'</span>})

TASK [Add host to group <span class="hljs-string">'disks'</span> <span class="hljs-keyword">with</span> variables] ******************************************************************************************************
changed: [unraid] =&gt; (item=[{<span class="hljs-string">'vm_name'</span>: <span class="hljs-string">'Rocky9-TESTNode'</span>, <span class="hljs-string">'disks_to_snapshot'</span>: [{<span class="hljs-string">'disk_name'</span>: <span class="hljs-string">'vdisk1.img'</span>, <span class="hljs-string">'source_directory'</span>: <span class="hljs-string">'/mnt/cache/domains'</span>}]}, {<span class="hljs-string">'disk_name'</span>: <span class="hljs-string">'vdisk1.img'</span>, <span class="hljs-string">'source_directory'</span>: <span class="hljs-string">'/mnt/cache/domains'</span>}])
changed: [unraid] =&gt; (item=[{<span class="hljs-string">'vm_name'</span>: <span class="hljs-string">'Rocky9-LabNode3'</span>, <span class="hljs-string">'disks_to_snapshot'</span>: [{<span class="hljs-string">'disk_name'</span>: <span class="hljs-string">'vdisk1.img'</span>, <span class="hljs-string">'source_directory'</span>: <span class="hljs-string">'/mnt/nvme_cache/domains'</span>}]}, {<span class="hljs-string">'disk_name'</span>: <span class="hljs-string">'vdisk1.img'</span>, <span class="hljs-string">'source_directory'</span>: <span class="hljs-string">'/mnt/nvme_cache/domains'</span>}])

PLAY [Unraid Snapshot Creation] **********************************************************************************************************************

TASK [setup] *****************************************************************************************************************************************
ok: [Rocky9-TESTNode-vdisk1 -&gt; unraid({{ lookup(<span class="hljs-string">'env'</span>, <span class="hljs-string">'UNRAID_IP_ADDRESS'</span>) }})]
ok: [Rocky9-LabNode3-vdisk1 -&gt; unraid({{ lookup(<span class="hljs-string">'env'</span>, <span class="hljs-string">'UNRAID_IP_ADDRESS'</span>) }})]

TASK [Create snapshot image filename] ****************************************************************************************************************
ok: [Rocky9-TESTNode-vdisk1 -&gt; unraid({{ lookup(<span class="hljs-string">'env'</span>, <span class="hljs-string">'UNRAID_IP_ADDRESS'</span>) }})]
ok: [Rocky9-LabNode3-vdisk1 -&gt; unraid({{ lookup(<span class="hljs-string">'env'</span>, <span class="hljs-string">'UNRAID_IP_ADDRESS'</span>) }})]

TASK [Create reflink <span class="hljs-keyword">for</span> Rocky9-TESTNode] ************************************************************************************************************
changed: [Rocky9-LabNode3-vdisk1 -&gt; unraid({{ lookup(<span class="hljs-string">'env'</span>, <span class="hljs-string">'UNRAID_IP_ADDRESS'</span>) }})]
changed: [Rocky9-TESTNode-vdisk1 -&gt; unraid({{ lookup(<span class="hljs-string">'env'</span>, <span class="hljs-string">'UNRAID_IP_ADDRESS'</span>) }})]

TASK [Check <span class="hljs-keyword">if</span> reflink exists] ***********************************************************************************************************************
ok: [Rocky9-LabNode3-vdisk1 -&gt; unraid({{ lookup(<span class="hljs-string">'env'</span>, <span class="hljs-string">'UNRAID_IP_ADDRESS'</span>) }})]
ok: [Rocky9-TESTNode-vdisk1 -&gt; unraid({{ lookup(<span class="hljs-string">'env'</span>, <span class="hljs-string">'UNRAID_IP_ADDRESS'</span>) }})]

TASK [Backup HD1 to DiskStation] *********************************************************************************************************************
changed: [Rocky9-TESTNode-vdisk1 -&gt; unraid({{ lookup(<span class="hljs-string">'env'</span>, <span class="hljs-string">'UNRAID_IP_ADDRESS'</span>) }})]
changed: [Rocky9-LabNode3-vdisk1 -&gt; unraid({{ lookup(<span class="hljs-string">'env'</span>, <span class="hljs-string">'UNRAID_IP_ADDRESS'</span>) }})]

TASK [Delete reflink file] ***************************************************************************************************************************
changed: [Rocky9-LabNode3-vdisk1 -&gt; unraid({{ lookup(<span class="hljs-string">'env'</span>, <span class="hljs-string">'UNRAID_IP_ADDRESS'</span>) }})]
changed: [Rocky9-TESTNode-vdisk1 -&gt; unraid({{ lookup(<span class="hljs-string">'env'</span>, <span class="hljs-string">'UNRAID_IP_ADDRESS'</span>) }})]

PLAY RECAP *******************************************************************************************************************************************
Rocky9-LabNode3-vdisk1     : ok=<span class="hljs-number">6</span>    changed=<span class="hljs-number">3</span>    unreachable=<span class="hljs-number">0</span>    failed=<span class="hljs-number">0</span>    skipped=<span class="hljs-number">0</span>    rescued=<span class="hljs-number">0</span>    ignored=<span class="hljs-number">0</span>   
Rocky9-TESTNode-vdisk1     : ok=<span class="hljs-number">6</span>    changed=<span class="hljs-number">3</span>    unreachable=<span class="hljs-number">0</span>    failed=<span class="hljs-number">0</span>    skipped=<span class="hljs-number">0</span>    rescued=<span class="hljs-number">0</span>    ignored=<span class="hljs-number">0</span>   
unraid                     : ok=<span class="hljs-number">12</span>   changed=<span class="hljs-number">7</span>    unreachable=<span class="hljs-number">0</span>    failed=<span class="hljs-number">0</span>    skipped=<span class="hljs-number">0</span>    rescued=<span class="hljs-number">0</span>    ignored=<span class="hljs-number">0</span>
</code></pre>
<p><strong>From DiskStation:</strong></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1725921373213/cecf909a-cc21-482c-9eab-2766675783e2.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1725921432966/e8e6127a-c9f7-43c9-a22e-a396c9f0c6c1.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-closing-thoughts"><strong>Closing Thoughts</strong></h2>
<p>Well, that was fun. Creating and backing up snapshots is incredibly useful, especially in a home lab where tooling might be less advanced. I plan to leverage this for more complex automation (Kubernetes, anyone?), since restoring from a snapshot is far simpler than undoing multiple changes. Again, the main drawback is that running the raw rsync command through Ansible gives no progress visibility. Pushing backups to the NAS can also be slow when dealing with hundreds of GBs or more; it takes roughly 4-5 minutes to push a 25 GB image file over a 1 Gbps connection.</p>
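<p>That 4-5 minute figure lines up with a quick back-of-the-envelope check. This is just a rough sketch using the 25 GB image size and 1 Gbps link speed mentioned above; real-world throughput will always be lower than line rate due to protocol and disk overhead:</p>

```python
# Theoretical minimum time to move a 25 GB disk image over a 1 Gbps link.
# Real transfers run slower (TCP/rsync overhead, disk I/O), which is why
# the observed time is closer to 4-5 minutes than the theoretical best case.

size_gb = 25        # vdisk image size in gigabytes (from the run above)
link_gbps = 1       # 1 Gbps network link

bits_to_move = size_gb * 8            # 200 gigabits
seconds = bits_to_move / link_gbps    # 200 seconds at full line rate
print(f"~{seconds:.0f} s, or about {seconds / 60:.1f} minutes, at best")
```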
<h2 id="heading-whats-next"><strong>What’s next?</strong></h2>
<p>I have two more pieces I hope to add to this series:</p>
<ol>
<li><p>Restoring from a snapshot (whether it's a specific snapshot or the latest).</p>
</li>
<li><p>Cleaning up old snapshots on your storage (in my case, the DiskStation).</p>
</li>
</ol>
<p>Down the road I may look at updating this to use the rclone utility instead of rsync. I might also turn all of this into a published Ansible role.</p>
<p>You can find the code that goes along with this post <a target="_blank" href="https://github.com/leothelyon17/unraid-vm-snapshots">here</a> (Github).</p>
<p>Thoughts, questions, and comments are appreciated. Please follow me here on Hashnode or connect with me on <a target="_blank" href="https://www.linkedin.com/in/jeffrey-m-lyon/">LinkedIn</a>.</p>
<p>Thank you for reading, fellow techies!</p>
]]></content:encoded></item></channel></rss>