Kubernetes is a complicated system with multiple components interacting with each other in complex ways. As you may already know, Kubernetes is made of master and node components.
Master components such as kube-scheduler, kube-controller-manager, etcd, and kube-apiserver make up the Kubernetes Control Plane, which runs on the master node(s). The Control Plane is responsible for managing the cluster lifecycle, access to the K8s API, data persistence (etcd), and maintaining the desired cluster state.
In turn, node components such as the kubelet, the container runtime (e.g., Docker), and kube-proxy run on the nodes. They are responsible for managing containerized workloads (kubelet) and for managing Services and enabling communication between Pods (kube-proxy).
Kube-proxy is one of the most important node components that participates in managing Pod-to-Service and External-to-Service networking. Kubernetes has great documentation about Services that mentions kube-proxy and its modes. However, we would like to discuss this component in depth using practical examples. This will help you understand how Kubernetes Services work under the hood and how kube-proxy manages them by interacting with the networking frameworks inside the Linux kernel. Let’s get started!
A proxy server is any server/host that works as an intermediary between clients requesting resources and the servers providing those resources. There are three basic types of proxy servers: (a) tunneling proxies; (b) forward proxies; and (c) reverse proxies.
A tunneling proxy passes unmodified requests from clients to servers on some network. It works as a gateway that enables packets from one network to access servers on another network.
A forward proxy is an Internet-facing proxy that mediates client connections to web resources/servers on the Internet. It manages outgoing connections and can service a wide range of resource types.
Finally, a reverse proxy is an internal-facing proxy. It may be thought of as a frontend that controls access to servers on a private network. A reverse proxy takes incoming requests and redirects them to some internal server without the client knowing which one it is accessing. This is often done to protect a private network against direct exposure to external users. Reverse proxies can also perform load balancing, authentication, caching, and decryption.
Kube-proxy is the closest to the reverse proxy model in its concept and design (at least in the userspace mode as we’ll see later). As a reverse proxy, kube-proxy is responsible for watching client requests to some IP:port and forwarding/proxying them to the corresponding service/application on the private network. However, the difference between the kube-proxy and a normal reverse proxy is that the kube-proxy proxies requests to Kubernetes Services and their backend Pods, not hosts. There are some other important differences that we will discuss.
So, as we just noted, the kube-proxy proxies client requests to backend Pods managed by a Service. Its main task is to translate Virtual IPs of Services into IPs of backend Pods controlled by Services. This way, the clients accessing the Service do not need to know which Pods are available for that Service.
Kube-proxy can also work as a load balancer for the Service’s Pods. It can do simple TCP, UDP, and SCTP stream forwarding or round-robin TCP, UDP, and SCTP forwarding across a set of backends.
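To make the round-robin idea concrete, here is a tiny, purely illustrative shell sketch (not kube-proxy's actual code) that rotates requests across a hypothetical set of backend endpoints:

```shell
# Hypothetical list of backend Pod endpoints for a Service
BACKENDS="172.17.0.5:80 172.17.0.6:80 172.17.0.7:80"

# pick_backend INDEX: round-robin selection, i.e., the
# (INDEX mod N)-th endpoint, where N is the number of backends
pick_backend() {
    n=$(echo $BACKENDS | wc -w | tr -d ' ')
    i=$(( $1 % n + 1 ))
    echo $BACKENDS | cut -d' ' -f$i
}

# Successive requests rotate through the backends:
pick_backend 0   # 172.17.0.5:80
pick_backend 1   # 172.17.0.6:80
pick_backend 3   # 172.17.0.5:80 (wraps around)
```

Real kube-proxy, of course, tracks live endpoints reported by the API server rather than a static list, but the selection principle is the same.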
Network Address Translation (NAT) helps forward packets between different networks. More specifically, it allows packets originating from one network to find destinations on another network. In Kubernetes, we need some sort of NAT to translate Virtual IPs/Cluster IPs of Services into IPs of backend Pods.
However, by default, kube-proxy does not know how to implement this kind of network packet forwarding on its own. Moreover, it needs to account for the fact that Service endpoints, i.e., Pods, are constantly changing. Thus, kube-proxy needs to know the state of the Service network at any point in time to ensure that packets arrive at the right Pods. We will discuss how kube-proxy solves these two challenges in what follows.
When a new Service of the type “ClusterIP” is created, the system assigns a virtual IP to it. This IP is virtual because there is no network interface or MAC address associated with it. Thus, the network as a whole does not know how to route packets going to this VIP.
How then does kube-proxy know how to route traffic from this virtual IP to the correct Pod? On the Linux systems where Kubernetes runs, kube-proxy closely interacts with the kernel's packet-filtering framework, netfilter, via its configuration tool, iptables, to set up packet routing rules for this VIP. Let's see how these tools work and how kube-proxy interacts with them.
Netfilter is a set of Linux kernel hooks that allow various kernel modules to register callback functions that intercept network packets and change their destination/routing. A registered callback function can be thought of as a set of rules tested against every packet passing through the network stack. So netfilter's role is to provide an interface for software working with network rules to match packets against those rules. When a packet matching a rule is found, netfilter takes the specified action (e.g., redirects the packet). In general, netfilter and other components of the Linux networking framework enable packet filtering, network address and port translation (NAPT), and other packet mangling.
To set network routing rules in netfilter, kube-proxy uses a userspace program called iptables. This program can inspect, forward, modify, redirect, and/or drop IP packets. Iptables consists of five tables (raw, filter, nat, mangle, and security) that process packets at various stages of their travel through the network stack. In turn, each table has a set of chains: lists of rules followed in order. For example, the filter table consists of the INPUT, OUTPUT, and FORWARD chains. A packet destined for the local host is processed by the INPUT chain, a locally generated packet by the OUTPUT chain, and a packet being routed through the host by the FORWARD chain.
Each chain includes individual rules that consist of a condition and a corresponding action to take when the condition is met. Here is an example of setting an iptables rule that blocks connections from a specific IP address (15.15.15.51) in the INPUT chain of the filter table:
sudo iptables -A INPUT -s 15.15.15.51 -j DROP
Here, INPUT is the chain of the filter table where the target (the source IP address given by -s) is matched and the corresponding action (-j DROP, dropping the packet) is taken.
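To verify the rule took effect and to clean it up afterwards, you can list the chain and then delete the rule (both commands assume root privileges):

```shell
# List the INPUT chain rules with their positions in the chain
sudo iptables -L INPUT --line-numbers -n

# Delete the rule either by repeating its exact specification...
sudo iptables -D INPUT -s 15.15.15.51 -j DROP
# ...or by its position in the chain (here, rule number 1)
sudo iptables -D INPUT 1
```

Note that iptables rules set this way are not persistent across reboots; distributions provide their own mechanisms (e.g., an iptables-save/iptables-restore pair) for persisting them.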
Note: This is a very simplified picture of how iptables works, though. If you want to learn more about iptables, check the excellent iptables article on the Arch Linux wiki.
So, we have established that kube-proxy configures the netfilter Linux kernel feature via its user interface – iptables.
However, configuring routing rules is not enough.
IP addresses churn frequently in a containerized environment like Kubernetes. Therefore, kube-proxy has to watch the Kubernetes API for changes, such as the creation or update of a Service or the addition and removal of backend Pod IPs, and adjust the iptables rules accordingly so that traffic sent to a virtual IP always reaches a correct Pod. The details of translating VIPs to real Pod IPs differ depending on the kube-proxy mode selected. Let's discuss these modes now.
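You can observe this churn yourself. Using the HTTPD Service created later in this article (any Service/Deployment pair will do), stream the Endpoints object while scaling the Deployment:

```shell
# Terminal 1: stream updates to the Service's Endpoints object;
# every Pod IP that appears or disappears here is a change
# kube-proxy must propagate into iptables rules
kubectl get endpoints httpd-deployment --watch

# Terminal 2: change the number of backend Pods
kubectl scale deployment httpd-deployment --replicas=3
```

Each scaling event produces a new set of endpoint IPs in the watch output, which is exactly the signal kube-proxy reacts to.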
Kube-proxy can work in three different modes: (a) userspace; (b) iptables; and (c) IPVS.
Why do we need all these modes? Well, these modes differ in how kube-proxy interacts with the Linux userspace and kernelspace and what roles these spaces play in packet routing and load balancing of traffic to Service backends. To make the discussion clear, you should understand the difference between userspace and kernelspace.
In Linux, system memory can be divided into two distinct areas: kernel space and user space.
The core of the operating system, known as the kernel, executes its commands and provides OS services in the kernelspace. All user software and processes installed by users run in the userspace. When they need CPU time, disk I/O, or want to fork a process, they issue system calls to the kernel asking for its services.
In general, kernelspace modules and processes are much faster than userspace processes because they interact with the system's hardware directly. Userspace programs, which must go through system calls to access kernel services, are slower.
Now that you understand the implications of userspace vs. kernelspace, let's discuss all kube-proxy modes.
In the userspace mode, most networking tasks, including setting packet rules and load balancing, are directly performed by the kube-proxy operating in the userspace. In this mode, kube-proxy comes the closest to the role of a reverse proxy that involves listening to traffic, routing traffic, and load balancing between traffic destinations. Also, in the userspace mode, kube-proxy must frequently switch context between userspace and kernelspace when it interacts with iptables and does load balancing.
Proxying traffic between the VIPs and backend Pods in the userspace mode is done in four steps: (a) kube-proxy watches the K8s API for the addition, modification, and removal of Services and their Endpoints; (b) when a new Service is created, kube-proxy opens a randomly chosen port on the node; (c) it installs iptables rules that redirect packets sent to the Service's VIP to that proxy port; and (d) when a packet arrives at the proxy port, kube-proxy selects one of the backend Pods and forwards the connection to it.
As you see, in this mode kube-proxy works as a userspace proxy that opens a proxy port, listens on it, and redirects packets from the port to the backend Pods.
This approach involves a lot of context switching, however. Kube-proxy has to switch to the kernelspace when packets addressed to VIPs are redirected to the proxy port, and then back to the userspace to load balance between the set of backend Pods. This is because it does not install iptables rules for load balancing between Service endpoints/backends; load balancing is done directly by kube-proxy in the userspace. As a result of the frequent context switching, the userspace mode is not as fast and scalable as the other two modes we are about to describe.
Let's illustrate how the userspace mode works using the example in the image above. Here, kube-proxy opens a random port (10400) on the node's eth0 interface after the Service with the ClusterIP 10.104.141.67 is created.
Then, kube-proxy creates netfilter rules that reroute packets sent to the Service VIP to that proxy port. Once packets arrive at the port, kube-proxy selects one of the backend Pods (e.g., Pod 1) and forwards the traffic to it. As you can imagine, a number of intermediary steps are involved in this process.
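A simplified sketch of the redirect rule kube-proxy sets up in this mode might look like the following. This is an illustration rather than kube-proxy's exact output; KUBE-PORTALS-CONTAINER is the chain name the userspace proxier historically used, and 10400 is the example proxy port from above:

```shell
# Redirect packets destined for the Service VIP (10.104.141.67:80)
# to the local proxy port (10400) on which kube-proxy is listening
iptables -t nat -A KUBE-PORTALS-CONTAINER \
    -d 10.104.141.67/32 -p tcp --dport 80 \
    -j REDIRECT --to-ports 10400
```

The REDIRECT target hands the packet to the local proxy port; from there, kube-proxy itself (in userspace) picks a backend Pod and forwards the connection, which is where the extra context switching comes from.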
Iptables has been the default kube-proxy mode since Kubernetes v1.2, and it allows for faster packet resolution between Services and backend Pods than the userspace mode.
In the iptables mode, kube-proxy no longer works as a reverse proxy load balancing the traffic between backend Pods. This task is delegated to iptables/netfilter. Iptables is tightly integrated with netfilter, so there is no need to frequently switch context between the userspace and the kernelspace. Also, load balancing between backend Pods is done directly via the iptables rules.
This is how the entire process looks (see the image below):
However, kube-proxy retains its role of keeping netfilter rules in sync. It constantly watches for Service and Endpoints updates and changes iptables rules accordingly.
The iptables mode is great, but it has one tangible limitation. Remember that in the userspace mode kube-proxy load balances between Pods directly? It can select another Pod if the one it is trying to access does not respond. Iptables rules, however, have no mechanism to automatically retry another Pod if the initially selected one does not respond. Therefore, this mode depends on having working readiness probes.
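A minimal readiness probe for the HTTPD Deployment used in the example below might look like this (the probe path and timings are illustrative, and the container name assumes the default produced by kubectl run). Unready Pods are removed from the Service's Endpoints, so iptables rules never route traffic to them:

```shell
# Add an HTTP readiness probe to the Deployment's container
kubectl patch deployment httpd-deployment --patch '
spec:
  template:
    spec:
      containers:
      - name: httpd-deployment
        readinessProbe:
          httpGet:
            path: /
            port: 80
          initialDelaySeconds: 5
          periodSeconds: 10
'
```

With this in place, a Pod whose web server stops answering is dropped from the endpoint set, and kube-proxy rewrites the iptables rules to exclude it.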
Thus, in the iptables mode, kube-proxy fully delegates the task of redirecting traffic and load balancing between the backend Pods to netfilter/iptables. All these tasks happen in the kernelspace, which makes the process much faster than in the userspace mode.
In this example, we demonstrate how to access the iptables rules created by kube-proxy for an HTTPD Service. This example was tested on Kubernetes 1.13.0 running on Minikube 0.33.1.
First, let's create an HTTPD Deployment:
kubectl run httpd-deployment --image=httpd --replicas=2
Next, expose it via Service:
kubectl expose deployment httpd-deployment --port=80
We need to know the ClusterIP of the Service to identify it later. It is 10.104.141.67 as the output below suggests:
kubectl describe svc httpd-deployment
Name:              httpd-deployment
Namespace:         default
Labels:            run=httpd-deployment
Annotations:       <none>
Selector:          run=httpd-deployment
Type:              ClusterIP
IP:                10.104.141.67
Port:              <unset>  80/TCP
TargetPort:        80/TCP
Endpoints:         172.17.0.5:80,172.17.0.6:80
Session Affinity:  None
Events:            <none>
Iptables rules are installed by the kube-proxy Pod so we’ll need to get its name first.
kubectl get pods --namespace kube-system
NAME READY STATUS RESTARTS AGE
kube-proxy-pz9l9 1/1 Running 0 4m12s
Finally, get a shell to the running kube-proxy Pod:
kubectl exec -ti kube-proxy-pz9l9 --namespace kube-system -- /bin/sh
We can now access iptables inside the kube-proxy Pod. For example, you can list all rules in the nat table like this:
iptables --table nat --list
Among the chains in the output, the KUBE-SERVICES chain includes a list of rules for your K8s Services:
Chain KUBE-SERVICES (2 references)
target     prot opt source               destination
KUBE-SVC-ERIFXISQEP7F7OF4  tcp  --  anywhere  10.96.0.10     /* kube-system/kube-dns:dns-tcp cluster IP */ tcp dpt:domain
KUBE-SVC-LC5QY66VUV2HJ6WZ  tcp  --  anywhere  10.99.201.218  /* kube-system/metrics-server: cluster IP */ tcp dpt:https
KUBE-SVC-KO6WMUDK3F2YFERC  tcp  --  anywhere  10.104.141.67  /* default/httpd-deployment: cluster IP */ tcp dpt:http
KUBE-SVC-NPX46M4PTMTKRN6Y  tcp  --  anywhere  10.96.0.1      /* default/kubernetes:https cluster IP */ tcp dpt:https
KUBE-SVC-TCOU7JCQXEZGVUNU  udp  --  anywhere  10.96.0.10     /* kube-system/kube-dns:dns cluster IP */ udp dpt:domain
KUBE-NODEPORTS  all  --  anywhere  anywhere  /* kubernetes service nodeports; NOTE: this must be the last rule in this chain */ ADDRTYPE match dst-type LOCAL
As the third rule suggests, traffic to our Service with the ClusterIP 10.104.141.67 is forwarded to default/httpd-deployment (the Service's backend Pods) via TCP dpt:http forwarding. This forwarding is performed directly by iptables using random Pod selection.
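You can inspect that random selection by listing the Service's own chain (the chain name is taken from the KUBE-SERVICES output above and will differ in your cluster):

```shell
# Inside the kube-proxy Pod: list the per-Service chain
# for default/httpd-deployment
iptables --table nat --list KUBE-SVC-KO6WMUDK3F2YFERC
```

With two endpoints, you should see two KUBE-SEP-* rules: the first one carries a `statistic mode random probability 0.5` match, and the second one catches the remaining traffic. Each KUBE-SEP chain then DNATs packets to one Pod IP (172.17.0.5:80 or 172.17.0.6:80 in our example), which is how iptables load balances without any userspace involvement.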