Equal-cost multi-path routing (ECMP) is a way to distribute network traffic with a given destination over multiple different paths. In this post, I’ll show how to experiment with ECMP configuration on Linux, using Docker.
Motivation
(This section covers the background that led me to investigate ECMP; you can skip it if all you care about is experimenting with it.)
I’m running an IPv6-only Kubernetes cluster on bare metal, and while IPv6-only simplifies managing the internal network, some services running on the cluster need to interface with the legacy IPv4 internet. To achieve this, I’ve deployed a NAT64 implementation (using Tayga) on one of the hosts in the cluster, which has a public IPv4 address. As of now, this NAT64 uses a /96 subnet out of the /80 that is routed to the host, and DNS64 is configured to serve all A records as AAAAs in that subnet. For example, an A record pointing to 10.0.0.1 will be served as `2001:db8::64:10.0.0.1`, where `2001:db8::64:0:0/96` is the subnet used by the NAT64.
With this setup, routing happens automatically and correctly, since all IPs within the NAT64 subnet are also within the subnet of the host running the NAT64. However, since I’m using “real” host-based subnets in DNS64, there are a couple of drawbacks:
- as-is, I have a single point of failure. If my IPv4 host goes down, then NAT64 won’t work cluster-wide.
- I can remedy this with redundant IPv4 hosts and NAT64 gateways, but then I need my DNS64 solution to support serving different /96 prefixes (in round-robin or the like). This seems like the most obvious solution, but I’d like the DNS64 configuration to be identical across all my (future) clusters, and I don’t want to mess with dynamically generating configuration for discovered IPv4 nodes.
Hence, I’ve decided to instead use a virtual subnet for the NAT64 gateway (the “well-known” `64:ff9b::/96` prefix) and set up ECMP on all cluster hosts to load-balance traffic to the IPv4 hosts.[^1]
Setup
We’ll use a Linux host and two Docker containers in order to test ECMP. You can read more about ECMP on Linux on FHR’s Blog and jkbs' blog. In this article we’ll focus on using it to implement a virtual IP that resolves to two different real IPs.
The containers will each run an nginx instance, representing one backend of a virtual IP, which we’ll arbitrarily define as `::2`. We’ll then configure ECMP on the host to target both containers with this IP address. Each container serves a web page displaying “container 1” or “container 2” respectively. Hence, if everything is set up correctly, we should see “container 1” and “container 2” more-or-less equally often when querying `::2`.
*A virtual IP `::2` is routed to two different backends via ECMP.*
Steps
- Create an IPv6 Docker network.

  ```
  docker network create --ipv6 ip6net --subnet fd00:dddd::/64
  ```

  By default, Docker does not work with IPv6. This command will essentially create a bridge device at `fd00:dddd::1`, add a host route to that bridge for `fd00:dddd::/64`, and configure containers to use addresses within that subnet.
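  As a quick sanity check (the bridge interface name varies per machine), you can confirm the bridge address and route on the host:

  ```
  ip -6 addr show | grep fd00:dddd    # the bridge should hold fd00:dddd::1/64
  ip -6 route show | grep fd00:dddd   # a route for fd00:dddd::/64 via the bridge
  ```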
- Spawn two containers.

  In two terminals, run

  ```
  docker run --cap-add=NET_ADMIN --network ip6net -it --rm --entrypoint /bin/bash debian:trixie
  ```

  Note that `--cap-add=NET_ADMIN` is necessary for manually adding IP addresses to interfaces within the container’s network namespace.
- Configure the containers.

  In each container, install the following packages:

  ```
  apt install iproute2 nginx
  ```

  Then, edit `/var/www/html/index.nginx-debian.html` to show “container 1” and “container 2” respectively. Also, take note of the containers’ IP addresses (run `ip addr` in the containers). They should be `fd00:dddd::2` and `fd00:dddd::3`.
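  For concreteness, here is one way to do this in the first container (mirror it with “container 2” in the second); since the container has no init system, nginx is started manually:

  ```
  apt update && apt install -y iproute2 nginx
  echo "container 1" > /var/www/html/index.nginx-debian.html
  nginx              # daemonizes using the default /etc/nginx/nginx.conf
  ip addr show eth0  # note the container's fd00:dddd:: address
  ```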
- Next, check that you can contact the containers using their “real” IPs: `curl "[fd00:dddd::2]"` should show “container 1” and `curl "[fd00:dddd::3]"` should show “container 2”.
- Now, let’s configure the virtual IP address for each container.

  In each container, run

  ```
  ip addr add ::2 dev lo
  ```

  This makes sure that the container will accept packets targeted at `::2`. Note that using the loopback interface is for demonstration purposes only. In the motivating example, the host node will already have an interface with `64:ff9b::/96` routed to it, and hence we won’t need to use the loopback interface.
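  You can verify that the address took effect inside each container:

  ```
  ip -6 addr show dev lo   # should now list ::2 in addition to the usual ::1
  ```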
- Finally, configure the ECMP routes on the host.

  ```
  sudo ip route add ::2/128 nexthop via fd00:dddd::2 weight 1 nexthop via fd00:dddd::3 weight 1
  ```

  Note that you’ll also need to run

  ```
  sudo sysctl -w net.ipv6.fib_multipath_hash_policy=1
  ```

  to enable layer 4 (TCP) connection hashing. By default, Linux will choose ECMP destinations based on the 3-tuple `(source IP, destination IP, flow label)`, and hence all connections originating from the host will always use the same real IP. With this setting, it will use the 5-tuple `(source IP, source port, destination IP, destination port, protocol)`, allowing us to actually test ECMP from within the same host (since the OS will use a different source port for every outbound connection).
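  Before testing, it’s worth double-checking both settings on the host:

  ```
  ip -6 route show ::2                        # should list both nexthops, weight 1 each
  sysctl net.ipv6.fib_multipath_hash_policy   # should report 1
  ```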
- Test it.

  Running `curl "[::2]"` repeatedly should give a mix of “container 1” and “container 2” responses. You can also add a `--local-port` option to curl to pin the source port used, which should give you a deterministic resolution of the virtual IP, e.g. `curl --local-port 4000 "[::2]"`.
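  To see the distribution at a glance, a quick loop works too (ten requests is an arbitrary choice):

  ```
  for i in $(seq 1 10); do curl -s "http://[::2]"; done
  ```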
Conclusion
This simple setup makes it easy to test ECMP configuration on Linux. It relies on network namespaces, conveniently provided by Docker, to emulate a network topology.
I was surprised how simple it was to get ECMP working on Linux; however, I am not sure it is actually a good fit for my motivating example. The main issue is that while ECMP does a great job at load balancing, it does not handle backend failures at all. In the motivating example, if one NAT64 gateway host were to fail, the routes to that host would still exist and a portion of traffic would still be lost. To fix this, I’d need some kind of control loop which monitors node state and configures the ECMP routes on all hosts (a sketch of such a loop follows below). In that case, I could just as well consider dynamically configuring DNS64 and using host-based subnets for the NAT64 gateways; it would also be more efficient, since DNS64 configuration can be centralized on just a few nodes, whereas ECMP routes need to be set on all nodes.
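For illustration, here is a minimal sketch of such a control loop, reusing this post’s demo setup; the backend addresses, virtual IP, and five-second interval are all placeholder choices, and ping is a stand-in for a real health check:

```
#!/bin/bash
# Hypothetical watchdog (sketch only): rebuild the ECMP route from
# whichever backends currently answer a ping.
BACKENDS="fd00:dddd::2 fd00:dddd::3"
VIP="::2/128"

while sleep 5; do
  nexthops=""
  for b in $BACKENDS; do
    # -c1: single probe, -W1: one-second timeout
    if ping -6 -c1 -W1 "$b" >/dev/null 2>&1; then
      nexthops="$nexthops nexthop via $b weight 1"
    fi
  done
  # swap in the rebuilt route only if at least one backend is alive;
  # "ip route replace" updates the route atomically
  if [ -n "$nexthops" ]; then
    ip route replace $VIP $nexthops
  fi
done
```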
[^1]: Note that I considered using kube-proxy, which is already running on all cluster nodes, to resolve the NAT64 virtual subnet to actual hosts. However, as far as I can tell, kube-proxy, and in fact Kubernetes as a whole, only has the machinery to manage individual virtual IPs, not subnets. Hence, adding static ECMP routes seemed like the next-best solution.