At my organization, we’ve deployed Kubernetes on a bunch of rented bare metal servers. We’ve come up with an architecture that uses modern IPv6-only networking (with interfaces to the IPv4 internet), doesn’t require overlay networks, and in general is fairly lean and hence quite simple to administer.
I’m writing about the architecture here, in the hopes that it can help others who are looking into deploying Kubernetes in similar environments.
Why deploy Kubernetes on a bare-metal cluster?
Let’s start with a brief explanation of why we’re even doing this. This question is really made up of two separate ones: first, why use bare metal servers to host our services, and second, why use Kubernetes?
Why use bare metal?
The alternative would be virtual machines in a cloud. The reasons are partly cost and partly functionality. We anticipate our workloads to have little variance, so the cost savings of not running in a traditional cloud can be significant. We also rely on hardware-accelerated virtualization for our product, so we need actual physical servers: running virtualization from within cloud virtual machines would mean nested virtualization, with considerable performance overhead.
Why Kubernetes?
My organization is small. We have only a few servers, which could technically be managed by hand. However, for ease of management, we were looking for a declarative workload manager.
Initially, we deployed Nomad, since it is very easy to get started with (single binary, excellent docs). Unfortunately, we quickly came to the realization that it was not a sustainable choice. It is too bare-bones (and some features like IPv6 with service discovery don’t work well out of the box), and a few things that are available in Kubernetes either don’t exist or need to be installed as add-ons. Also, the ecosystem isn’t as large as Kubernetes’, so its future is just less certain from an outsider’s perspective.
Hence, we looked into deploying Kubernetes. The learning curve is a bit steep, but once you get familiar with it, it’s really not that complicated. The benefits of being able to deploy and manage your services declaratively, however, are immense.
Note that while our clusters may be tiny compared to what Kubernetes is designed for, the reason we use it is the uniformity of its API. We could potentially also have used a self-contained Kubernetes distribution, such as k3s, which is more lightweight, but once you understand how a full-fledged vanilla k8s system works, setting one up isn’t that much more complex, and it brings with it a stronger guarantee of upstream maintenance.
Cluster architecture
Let’s get into the actual architecture we’re using.
As a general rule, we keep the cluster architecture as simple as possible, and eschew many of the optional add-ons from the Kubernetes ecosystem.
Essentially, our Kubernetes setup consists of a vanilla Kubernetes distribution, made up of the following components:
- Control plane:
  - etcd
  - kube-apiserver
  - kube-scheduler
  - kube-controller-manager
- Worker nodes:
  - kubelet
  - kube-proxy
Note that we don’t run a cloud-controller-manager, since we are not in a cloud.
In our architecture, all the control plane components are assumed to be collocated on the same host and speak to each other via localhost. With multiple control plane nodes, we’ll run an instance of each component on each node.
A kubelet runs on every host; on control plane nodes, it is given static pod manifests for the other components. This way, the kubelet is the only application that needs to be spawned by the host’s init system.
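For illustration, a static pod manifest for one of these components might look roughly like the following. This is a minimal sketch, not our exact manifest: the image tag, file paths and the staticPodPath (e.g. /etc/kubernetes/manifests) are assumptions.

# /etc/kubernetes/manifests/kube-scheduler.yaml -- hypothetical sketch
apiVersion: v1
kind: Pod
metadata:
  name: kube-scheduler
  namespace: kube-system
spec:
  hostNetwork: true                                 # talks to the apiserver over localhost
  containers:
  - name: kube-scheduler
    image: registry.k8s.io/kube-scheduler:v1.30.0   # example version
    command:
    - kube-scheduler
    - --kubeconfig=/etc/kubernetes/scheduler.conf
    - --leader-elect=true
    volumeMounts:
    - name: kubeconfig
      mountPath: /etc/kubernetes/scheduler.conf
      readOnly: true
  volumes:
  - name: kubeconfig
    hostPath:
      path: /etc/kubernetes/scheduler.conf
      type: File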
[!NOTE] Running multiple instances of the control plane components should ensure failure resilience if one control-plane node dies:
- etcd is of course designed for this (that is the whole point of it)
- the controller manager and scheduler use leader election to ensure only one instance is ever active
- the apiserver is stateless
Unfortunately, as of this writing, there is a bootstrapping problem for Kubernetes workers that requires more effort than simply relying on etcd for failure tolerance. The problem is that kubelets can only connect to one apiserver, so some kind of load balancing is required. The most straightforward load balancer is of course a virtual cluster IP, but that is implemented by kube-proxy, which itself needs an apiserver. So unless kube-proxy can talk to multiple apiservers (which it can’t as of this writing), an external load balancer solution is necessary.
Until https://github.com/Kubernetes/Kubernetes/issues/18174 is resolved, the cluster can only have one control-plane node.
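Concretely, every kubelet’s kubeconfig points at a single apiserver endpoint, so there is nothing to fail over to. A minimal sketch (the address and file paths are placeholders, not our real configuration):

# /var/lib/kubelet/kubeconfig -- hypothetical sketch
apiVersion: v1
kind: Config
clusters:
- name: cluster
  cluster:
    certificate-authority: /var/lib/kubelet/ca.crt
    server: https://[<control-plane-ula-address>]:6443   # the single control-plane endpoint
contexts:
- name: default
  context:
    cluster: cluster
    user: kubelet
current-context: default
users:
- name: kubelet
  user:
    client-certificate: /var/lib/kubelet/kubelet.crt
    client-key: /var/lib/kubelet/kubelet.key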
Personal note: this is a HUGE missing feature in Kubernetes, and shows how it really pushes users towards cloud offerings, since the common solution to this problem is to “use an external load balancer”. Unless you want to manage that yourself outside of Kubernetes, the go-to answer is to use a “cloud load balancer”. A failure-tolerant control plane should be the default in my opinion.
System services
We spawn kube-proxy as a DaemonSet on every node (see below for network and service configuration).
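For reference, a stripped-down kube-proxy DaemonSet might look roughly like this. It is a sketch only; the image tag, config path, ServiceAccount and ConfigMap names are assumptions, not our exact manifest.

# Hypothetical sketch of a kube-proxy DaemonSet
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kube-proxy
  namespace: kube-system
spec:
  selector:
    matchLabels:
      k8s-app: kube-proxy
  template:
    metadata:
      labels:
        k8s-app: kube-proxy
    spec:
      hostNetwork: true                             # kube-proxy programs the host's netfilter rules
      serviceAccountName: kube-proxy                # assumed ServiceAccount
      tolerations:
      - operator: Exists                            # run on every node, including the control plane
      containers:
      - name: kube-proxy
        image: registry.k8s.io/kube-proxy:v1.30.0   # example version
        command:
        - kube-proxy
        - --config=/var/lib/kube-proxy/config.conf
        securityContext:
          privileged: true
        volumeMounts:
        - name: config
          mountPath: /var/lib/kube-proxy
      volumes:
      - name: config
        configMap:
          name: kube-proxy                          # assumed ConfigMap holding the kube-proxy config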
For the container network interface (CNI), we rely on the built-in bridge plugin with host-local IPAM, and delegate whole IP prefixes to nodes and container bridges.
We also run coredns on a few nodes at a hardcoded cluster IP address, in order to provide service discovery via DNS.
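The hardcoded address is simply a fixed clusterIP on the DNS Service, chosen from the service range; a sketch, with placeholder names, labels and address:

# Hypothetical sketch of the DNS Service with a fixed cluster IP
apiVersion: v1
kind: Service
metadata:
  name: kube-dns               # conventional name, assumed here
  namespace: kube-system
spec:
  selector:
    k8s-app: coredns           # assumed label on the coredns pods
  clusterIP: <cluster>::a      # hardcoded service IP inside <cluster>::/112 (placeholder)
  ports:
  - name: dns
    port: 53
    protocol: UDP
  - name: dns-tcp
    port: 53
    protocol: TCP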
Networking
Since we’re IPv6 only, and the network space is vast, we can assign whole prefixes to hosts and let them handle routing to pods (a form of “prefix delegation”). This way, we don’t need any advanced CNI plugins or overlay networks.
Note that as of this writing, the internal network is implemented as a wireguard mesh network. However, Kubernetes is agnostic to the nature of the internal network, so we could change this easily to become a “regular” network directly behind some routers in the future.
Cluster
All machines in the cluster share a common /64 unique local address (ULA) prefix. They don’t share globally unique address (GUA) prefixes in this architecture, as these are assigned per-machine by the hosting provider (again, if one day we have our own GUA prefix and network, we could use that).
Assume <cluster> represents the cluster’s /64 ULA prefix in the following.
Services
The first /80 prefix in the cluster (i.e. <cluster>:0000::/80) is reserved for virtual IP addresses, and specifically <cluster>::/112 is used for Kubernetes service IPs.
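This range is what the control plane is pointed at via the service-CIDR flags; roughly, as excerpts from the apiserver and controller-manager static pod commands (a sketch, with all other flags omitted):

# Excerpt from the kube-apiserver static pod manifest (sketch)
    command:
    - kube-apiserver
    - --service-cluster-ip-range=<cluster>::/112   # service IPs are allocated from this range
    # ... other flags omitted

# Excerpt from the kube-controller-manager static pod manifest (sketch)
    command:
    - kube-controller-manager
    - --service-cluster-ip-range=<cluster>::/112
    # ... other flags omitted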
Host
Every machine in the cluster has at least two IP addresses assigned:
- A /80 GUA on a physical interface, assigned by the hosting provider
- A /80 ULA on a wireguard mesh network interface, assigned manually
The GUA is assigned by the hosting provider, and can have completely different prefixes for all machines. We only care about this address for making machines and workloads accessible to the internet. Note that the assigned GUA prefix can be shorter than /80 (for example, Hetzner assigns a /64 to every machine), in which case we logically extend it to a /80 by appending zeros, so that the ULA and GUA prefixes have the same length.
The ULA starts at <cluster>:0001::/80. Assume that <nodeula> and <nodegua> represent the node’s /80 ULA and GUA prefixes respectively.
Pods
Pods on a machine all run in a bridge network, which assigns addresses out of the machine’s prefixes: <nodeula>:0001::/96 and <nodegua>:0001::/96.
With this hierarchical approach, the node a pod is running on is implicitly known from its IP address, and there is no need for overlay networking or additional broadcasting of routing tables.
The pod’s ULA will be used for internal communication (e.g. pod-to-pod, or service IP to pod via the host), and the GUA will be used if the pod needs to make any connections to the public internet, outside of the cluster.
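To make the hierarchy concrete, here is a worked example with made-up prefixes: a hypothetical cluster ULA, and the IPv6 documentation range standing in for a provider-assigned GUA.

# Hypothetical addressing example (not our real prefixes)
cluster:          fd12:3456:789a::/64          # <cluster>
serviceIPs:       fd12:3456:789a::/112         # Kubernetes service range
node1:
  ula:            fd12:3456:789a:0:1::/80      # <nodeula>, first node prefix
  gua:            2001:db8:1:2:0::/80          # <nodegua>, provider /64 extended with zeros
  podBridgeUla:   fd12:3456:789a:0:1:1::/96    # <nodeula>:1::/96
  podBridgeGua:   2001:db8:1:2:0:1::/96        # <nodegua>:1::/96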
This setup is achieved with a CNI config similar to the following.
{
"cniVersion": "0.4.0",
"name": "cluster",
"plugins": [
{
"type": "loopback"
},
{
"type": "bridge",
"bridge": "cluster0",
"isGateway": true,
"isDefaultGateway": true,
"hairpinMode": true,
"forceAddress": true,
"ipam": {
"type": "host-local",
"ranges": [
[
{
"subnet": "<nodeula>:1::/96",
"rangeStart": "<nodeula>:1::10",
"rangeEnd": "<nodeula>:1:ffff:ffff"
}
],
[
{
"subnet": "<nodegua>:1::/96",
"rangeStart": "<nodegua>:1::10",
"rangeEnd": "<nodegua>:1:ffff:ffff"
}
]
]
}
},
{
"type": "firewall",
"backend": "iptables"
}
]
}
The CNI config is set up when machines are provisioned and added to the internal network.
If you are interested in learning more about how CNI works, I suggest reading this article. It goes into detail, and even shows how you can write your own CNI plugin.
Load balancing (getting traffic in and out of the cluster)
This is one of the limitations with a simple setup like ours.
Since we don’t have control over any routers in front of our servers, and we can’t announce (virtual) addresses over BGP, there is no support for external load balancers to get traffic into the cluster, and hence we can’t define services of type LoadBalancer in Kubernetes.
Instead, we rely on a mix of:
- client-side load balancing via DNS (multiple records for various public-facing hosts)
- making services accessible externally by adding externalIPs to services of type ClusterIP, that correspond to the GUAs of hosts (see the sketch after this list)
- running reverse proxies in externally-accessible hosts’ network namespaces
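As referenced above, an externally-reachable service is just a ClusterIP service with one or more host GUAs listed as externalIPs; a minimal sketch, with placeholder names, ports and address:

# Hypothetical sketch: expose a ClusterIP service on a host's GUA
apiVersion: v1
kind: Service
metadata:
  name: my-frontend             # placeholder service name
spec:
  type: ClusterIP
  selector:
    app: my-frontend            # placeholder pod label
  externalIPs:
  - <nodegua-address>           # a host's public GUA; kube-proxy accepts traffic for it
  ports:
  - port: 443
    targetPort: 8443            # placeholder container port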
If you think about it, this isn’t really any different from how you expose services traditionally on individually managed servers.
IPv4
Some hosts have been assigned a public IPv4 address, to allow inbound traffic from IPv4-only clients and outbound traffic to IPv4-only services (e.g. GitHub). On these hosts, we run a NAT64 service which is used for both directions.
Essentially, every host with an IPv4 address reserves a /96 prefix from its ULA, which the NAT64 service uses to map IPv4 addresses into IPv6 addresses. This prefix is <nodeula>:64::/96.
Inbound
Inbound IPv4 traffic is mapped to the host’s IPv6 address, which itself may forward to cluster externalIPs, or be used by services listening directly on the host network.
Outbound
Outbound traffic works from any pod or node via a DNS64+NAT64 deployment. CoreDNS is configured to respond with a synthetic AAAA record for sites that only advertise an A record. This synthetic AAAA record points into the NAT64 prefix of a gateway running on an IPv4-enabled host.
The NAT64 gateway then strips the prefix off the destination address and converts it to an IPv4 address.
For example, assume a pod in our cluster wants to contact github.com, and let’s assume that this site only has an A record pointing to 1.1.1.1. The cluster’s DNS server would respond with an AAAA record that it creates on the fly: <nodeula>:64:1.1.1.1. The request is then sent to said node, where the NAT64 service finds the original destination 1.1.1.1 by stripping off the prefix again.
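On the DNS side, this behaviour can be expressed with CoreDNS’s dns64 plugin. A sketch of the relevant Corefile inside the CoreDNS ConfigMap, assuming a single designated NAT64 host whose ULA prefix is <nodeula>, plus the usual kubernetes and forward plugins (names and zones are placeholders):

# Hypothetical sketch of the CoreDNS ConfigMap with DNS64 enabled
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        kubernetes cluster.local        # service discovery for cluster services
        dns64 <nodeula>:64::/96         # synthesize AAAA records pointing at the NAT64 host
        forward . /etc/resolv.conf      # upstream resolution
        cache 30
    }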
The gateway also masquerades the source address of IPv4 packets leaving the host, so that they have a valid routable return path.
Conclusion
The architecture I presented here is simple to understand and easy to deploy on cheap rented servers, where you do not control any network equipment. It can’t do “proper” load balancing for traffic coming from the internet, but that is no different from dealing with these kinds of servers in the first place.
As we expand, we may augment our infrastructure with proper routers and actual public IP ranges that we could announce to peers. This would make proper load balancing possible.