TL;DR: k8smaker on Github

For those of you who have read my escapades, or even just noticed the cloud-related topics on my website, many of those posts document my struggles with building Kubernetes clusters, learning Docker, and trying to use helpful 3rd party software to do so.  If you want a piece of advice about that software, in a word: don't.

I spent a year trying to use Rancher's tech in various ways.  RancherOS (ROS) was a neat, stripped-down operating system in the vein of CoreOS: a minimal Linux meant to let you stop worrying about OS configuration.  In theory, it was great.  All I wanted to do was network-boot a machine and install it from scratch via iPXE.  In practice, things never work out like the theory says.  First, the OS failed with out-of-memory errors because it refused to use the local hard drive and instead tried to run everything from RAM.  After giving up on installing via iPXE, I manually installed ROS to my hard drive, and while that kind of worked, I really needed iSCSI support, and it simply wouldn't work.

Eventually giving up on ROS, I installed Ubuntu Server 18.04 LTS, which immediately simplified my setup.  Drivers installed swiftly.  I built a repeatable process for installing Docker via the recommended method.  Then I tried to use Rancher to install Kubernetes onto the machines.  Since Rancher's k3s system is listed as the ideal way to configure the Kubernetes control plane, I followed their instructions and set up the head control node.  I had a baremetal cluster and an AWS cluster running, and things looked good.  The Rancher UI is nice, and I had moved some of my apps over to the cluster.
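For reference, the "recommended method" I scripted is Docker's own apt-repository install for Ubuntu.  A minimal sketch, following Docker's documented steps for 18.04 (package names and URLs per their docs at the time):

# Add Docker's official GPG key and apt repository
sudo apt-get update
sudo apt-get install -y apt-transport-https ca-certificates curl gnupg-agent software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"

# Install Docker Engine and let the current user run it without sudo
sudo apt-get update
sudo apt-get install -y docker-ce docker-ce-cli containerd.io
sudo usermod -aG docker "$USER"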

After only a few days, the baremetal system failed in a way that deadlocked the cluster when I decommissioned one of the servers, forcing me to do a full recovery of the hard drives and rebuild it from scratch (for something like the 15th time, truthfully, but I thought I was in a production rhythm at this point).  Only days later, the AWS cluster filled a 20GB hosted MySQL database, and I blew a day diagnosing it and trying to get the system running again.  It turns out the k3s datastore is just not well designed: it uses a relational database as a key-value store, and those aren't really compatible concepts.

Restoring the storage system on my baremetal cluster was difficult, since I was also using Rancher's Longhorn storage system.  It isn't a bad system.  However, the restoration process leaves a lot to be desired: you have to figure out the right incantations to feed it the partial names of S3 bucket folders, pull down the whole data volume as a QEMU virtual hard drive, then mount it on a system with QEMU installed just to actually get at your files.  Not a straightforward process by any means.  At this point, it became clear that I was employing too many wild-west technologies and needed to be a lot more conservative.  And learn more.  So I dumped all the Rancher tech and proceeded to make some simpler choices.
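To give a flavor of that restoration dance, the "mount it with QEMU" step alone looks roughly like this (the image file name here is hypothetical, and your partition layout may differ):

# Expose the downloaded qcow2 volume as a block device via the NBD kernel module
sudo modprobe nbd max_part=8
sudo qemu-nbd --connect=/dev/nbd0 restored-volume.qcow2

# Mount the filesystem inside it and copy your files out
sudo mount /dev/nbd0 /mnt    # or /dev/nbd0p1 if the volume is partitioned
cp -a /mnt/. ~/recovered-data/

# Clean up
sudo umount /mnt
sudo qemu-nbd --disconnect /dev/nbd0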

My Baremetal Cluster

I'm proud to say I now have a production-worthy baremetal cluster that has been running smoothly for a month.  What makes it so stronk?

kubectl get nodes
NAME     STATUS   ROLES    AGE   VERSION
atlas    Ready    worker   31d   v1.18.4
hermes   Ready    worker   31d   v1.18.4
sauron   Ready    master   31d   v1.18.4
  • FreeNAS with ZFS handles all my storage needs with redundancy and high performance.  Everything is hosted via NFS shares, which gives me many-read/many-write access and file locking (see the sketch after this list).  If a server goes down again, or the whole cluster catches fire, my data is safe and in a trivial form to access.
  • I only have three servers, so I only need one to be the control plane.  Making all three servers into control planes is actually less stable than just one: while I was running Rancher, when one node went down, etcd refused to elect a leader and would no longer write new keys with only two live control nodes.  In a cloud environment, you can just spin up a new instance to replace the missing control node, but on baremetal, unless you have spare hardware, you've just locked up your cluster.  Rather than being sensitive to every node's health, I'd rather tolerate two of the three nodes being down and protect the single control node with everything I've got.
  • Knowledge.  I understand how everything is put together now, as I'll explain below.
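To illustrate the NFS point, this is roughly how a many-read/many-write volume backed by a FreeNAS export looks to Kubernetes.  The server address and export path below are placeholders, not my actual setup:

kubectl apply -f - <<'EOF'
# A PersistentVolume backed by an NFS export (server/path are placeholders)
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-example
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteMany              # many-read/many-write, which is what NFS buys you
  persistentVolumeReclaimPolicy: Retain
  nfs:
    server: 192.168.1.50
    path: /mnt/tank/k8s
EOF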

(Finally) Introducing k8smaker

After stubbing my toe on every product Rancher offered, I decided to just learn how to do a simple Kubernetes install myself.  Having some familiarity with kubectl, and having spent enough time building YAML files, I found that not everything about the process was a complete surprise.  Still, there are many steps and careful considerations required to set up a cluster without a 3rd party tool.  The Kubernetes team provides a tool called kubeadm to perform the complex steps of creating a cluster and adding workers to it, so I lean on it heavily.  After a few weeks of walking through the various configuration steps, discovering my mistakes and correcting them, I captured all of the steps in some very simple scripts and published them on Github.
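To give a sense of what the scripts wrap, the core of a kubeadm bring-up looks something like this.  The pod CIDR shown is Calico's default; the exact flags k8smaker uses may differ:

# On the control-plane node: create the cluster
sudo kubeadm init --pod-network-cidr=192.168.0.0/16

# Make kubectl usable by the non-root user
mkdir -p $HOME/.kube
sudo cp /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

# On each worker node: join using the token printed by kubeadm init
sudo kubeadm join <control-plane-ip>:6443 --token <token> \
    --discovery-token-ca-cert-hash sha256:<hash>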

Seeing as how the majority of hobbyist users will have a spare machine or two lying about, as I do, I made specific scripts for building baremetal clusters.  Hobbyists who would rather build on AWS will find similar scripts for that, too.  Although it took me weeks to figure all this out, it only takes a few minutes to build a k8s cluster with these scripts.  Just follow the Quickstart in the Github repo, and you're on your way.

Unlike Rancher, my scripts aren't trying to manage every aspect of your cluster for you: on AWS, Rancher will create new nodes and install everything for you, whereas with k8smaker you have to create the node and install Ubuntu 18.04 LTS yourself, then run one command to install everything and join it to the cluster.  Rancher offers many network overlay options and plugins; my configuration is built specifically for Calico and Istio.  Although I can appreciate the power Rancher gives a knowledgeable user, it caused me a lot of heartburn simply because it didn't help bridge the knowledge gap the way a simpler system like k8smaker would have.
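If you're curious what the Calico and Istio layer amounts to, installed by hand it's roughly the following.  These are the standard upstream steps from around that time, not necessarily verbatim what my scripts run:

# Install the Calico network overlay from the upstream manifest
kubectl apply -f https://docs.projectcalico.org/manifests/calico.yaml

# Install Istio with its default profile (requires istioctl on your PATH)
istioctl install --set profile=default

# Opt a namespace into automatic sidecar injection
kubectl label namespace default istio-injection=enabled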

Anyway, if you're curious about standing up a complete Kubernetes cluster in an afternoon, whether baremetal or on Amazon's cloud, k8smaker is a cluster-in-a-bag, and it even comes with a fully self-contained Calico/Istio/CertManager/Let's Encrypt/website example that you can set up to see how all those pieces work together.
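As a taste of the CertManager/Let's Encrypt piece: its heart is a ClusterIssuer pointed at Let's Encrypt's ACME endpoint.  This is a sketch, with a placeholder email, a current cert-manager API version that may differ from what the example ships, and an assumed Istio ingress class:

kubectl apply -f - <<'EOF'
# ClusterIssuer that requests certificates from Let's Encrypt via HTTP-01 challenges
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: you@example.com           # placeholder; use your own address
    privateKeySecretRef:
      name: letsencrypt-prod-key     # secret holding the ACME account key
    solvers:
      - http01:
          ingress:
            class: istio             # assumes Istio's ingress serves the challenge
EOF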

If you find it useful, please let me know.  There are probably plenty of issues with the scripts.  Send me a pull request, or educate me on how it could be better.  I'm all ears.