Understanding Kubernetes (k8s) Networking in Depth, Part 6: Ways to Connect Pods on the Same Host and a Performance Comparison

Original · by 一生无聊 · last modified 2021-09-08 17:25:11

This installment was originally supposed to walk through writing a CNI plugin, but there isn't much to say there: GitHub is full of CNI source code, and the gist is simply reimplementing the commands from the earlier articles in Go. So, writing about whatever comes to mind, I decided instead to cover the other ways of connecting pods on the same host and compare their performance.

In the first article of this series we showed pods connected to the host through veth interfaces. In fact, pods on the same host can be connected in the following ways:

  • veth pairs attached to the host, with the host acting as a router; this is what the first article described (referred to below as the veth approach)
  • a Linux bridge connecting the pods, with the gateway address assigned to the bridge; this is how flannel works
  • macvlan, whose usage scenarios are somewhat restricted; it generally cannot be used on cloud platforms
  • ipvlan, which has kernel requirements; the stock 3.10 kernel does not support it
  • enabling eBPF support on top of the above

In this chapter we compare the differences and performance of these approaches.

Because eBPF has kernel-version requirements, the test environment runs Linux kernel 4.18 on an ordinary PC with 8 cores and 32 GB of RAM.

When running the commands below, make sure the interface names you create do not clash with the host's physical NICs. My host NIC is eno2, so I create interfaces named eth0 throughout; that name is very likely to collide in your environment, so adjust it first.
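
For example, a quick pre-flight check might look like this (a minimal sketch; eno2 and eth0 are simply the names used in this article):

ip -br link show                                ## list all interfaces that already exist on the host
ip link show eth0 >/dev/null 2>&1 && echo "eth0 already exists, pick another name"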

The veth approach

This approach was covered in detail in the first article, so let's go straight to the commands.

Create pod-a and pod-b:

ip netns add pod-a
ip link add eth0 type veth peer name veth-pod-a
ip link set eth0 netns pod-a
ip netns exec pod-a ip addr add 192.168.10.10/24 dev eth0
ip netns exec pod-a ip link set eth0 up
ip netns exec pod-a ip route add default via 169.254.10.24 dev eth0 onlink
ip link set veth-pod-a up
echo 1 > /proc/sys/net/ipv4/conf/veth-pod-a/proxy_arp
ip route add 192.168.10.10 dev veth-pod-a scope link

ip netns add pod-b
ip link add eth0 type veth peer name veth-pod-b
ip link set eth0 netns pod-b
ip netns exec pod-b ip addr add 192.168.10.11/24 dev eth0
ip netns exec pod-b ip link set eth0 up
ip netns exec pod-b ip route add default via 169.254.10.24 dev eth0 onlink
ip link set veth-pod-b up
echo 1 > /proc/sys/net/ipv4/conf/veth-pod-b/proxy_arp
ip route add 192.168.10.11 dev veth-pod-b scope link


echo 1 > /proc/sys/net/ipv4/ip_forward
iptables -I FORWARD -s 192.168.0.0/16 -d 192.168.0.0/16 -j ACCEPT
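
Before benchmarking, it is worth confirming that the two namespaces can actually reach each other through the host; a minimal sketch:

ip netns exec pod-b ping -c 3 192.168.10.10    ## pod-b -> pod-a, routed via the host
ip netns exec pod-a ping -c 3 192.168.10.11    ## pod-a -> pod-b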

Start the iperf3 server in pod-a:

ip netns exec pod-a iperf3 -s

In another terminal, connect from pod-b to pod-a to measure throughput:

ip netns exec pod-b iperf3 -c 192.168.10.10 -i 1 -t 10

Connecting to host 192.168.10.10, port 5201
[  5] local 192.168.10.11 port 35014 connected to 192.168.10.10 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  7.32 GBytes  62.9 Gbits/sec    0    964 KBytes
[  5]   1.00-2.00   sec  7.90 GBytes  67.8 Gbits/sec    0    964 KBytes
[  5]   2.00-3.00   sec  7.79 GBytes  66.9 Gbits/sec    0   1012 KBytes
[  5]   3.00-4.00   sec  7.92 GBytes  68.0 Gbits/sec    0   1012 KBytes
[  5]   4.00-5.00   sec  7.89 GBytes  67.8 Gbits/sec    0   1012 KBytes
[  5]   5.00-6.00   sec  7.87 GBytes  67.6 Gbits/sec    0   1012 KBytes
[  5]   6.00-7.00   sec  7.68 GBytes  66.0 Gbits/sec    0   1.16 MBytes
[  5]   7.00-8.00   sec  7.79 GBytes  66.9 Gbits/sec    0   1.27 MBytes
[  5]   8.00-9.00   sec  7.75 GBytes  66.5 Gbits/sec    0   1.47 MBytes
[  5]   9.00-10.00  sec  7.90 GBytes  67.8 Gbits/sec    0   1.47 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  77.8 GBytes  66.8 Gbits/sec    0             sender
[  5]   0.00-10.04  sec  77.8 GBytes  66.6 Gbits/sec                  receiver

Since applications deployed in containers usually talk to each other through a clusterIP, we should also test access via clusterIP. First create the clusterIP with iptables, as described in the second article:

iptables -A PREROUTING -t nat -d 10.96.0.100 -j DNAT --to-destination 192.168.10.10

To keep the results free of interference, make sure this is the only nat rule on the host. It can be reused for the rest of the tests, so we won't keep deleting and recreating it.
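
To confirm that this really is the only rule in the nat table's PREROUTING chain, you can dump it; roughly what to expect (a sketch):

iptables -t nat -S PREROUTING
## -P PREROUTING ACCEPT
## -A PREROUTING -d 10.96.0.100/32 -j DNAT --to-destination 192.168.10.10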

The results:

ip netns exec pod-b iperf3 -c 10.96.0.100 -i 1 -t 10

Connecting to host 10.96.0.100, port 5201
[  5] local 192.168.10.11 port 44528 connected to 10.96.0.100 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  6.41 GBytes  55.0 Gbits/sec    0    725 KBytes
[  5]   1.00-2.00   sec  7.21 GBytes  61.9 Gbits/sec    0    902 KBytes
[  5]   2.00-3.00   sec  7.99 GBytes  68.6 Gbits/sec    0    902 KBytes
[  5]   3.00-4.00   sec  7.88 GBytes  67.7 Gbits/sec    0    902 KBytes
[  5]   4.00-5.00   sec  7.94 GBytes  68.2 Gbits/sec    0    947 KBytes
[  5]   5.00-6.00   sec  7.86 GBytes  67.5 Gbits/sec    0    947 KBytes
[  5]   6.00-7.00   sec  7.79 GBytes  66.9 Gbits/sec    0    947 KBytes
[  5]   7.00-8.00   sec  8.04 GBytes  69.1 Gbits/sec    0    947 KBytes
[  5]   8.00-9.00   sec  7.85 GBytes  67.5 Gbits/sec    0    996 KBytes
[  5]   9.00-10.00  sec  7.88 GBytes  67.7 Gbits/sec    0    996 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  76.9 GBytes  66.0 Gbits/sec    0             sender
[  5]   0.00-10.04  sec  76.9 GBytes  65.8 Gbits/sec                  receiver

As you can see, with the veth approach there is little difference between accessing the pod by podIP and by clusterIP. Of course, once there are many iptables rules (say, 2000 Kubernetes services) performance does suffer, but that is not the focus of this article.
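
If you are curious how many nat rules your own node carries, a rough count (just a sketch, not part of this benchmark) is:

iptables -t nat -S | wc -l      ## on a busy node running kube-proxy in iptables mode this can run into the thousands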

Clean up: stop the iperf3 server first, then delete the namespaces (keep the iptables rule above):

ip netns del pod-a
ip netns del pod-b

Next, let's try the bridge approach.

The bridge approach

If the bridge is used as a pure layer-2 switch to connect the two pods, performance is quite good, better than the veth approach above. But to support Kubernetes service IPs, packets still need to pass through the host's iptables rules, so a gateway IP is usually assigned to the bridge. This serves two purposes: the bridge can answer the pods' ARP requests, so there is no need to enable proxy ARP on the host side of the veth pairs; and packets get to traverse the host's netfilter hooks, so the iptables rules can take effect.

In theory a Linux bridge is a layer-2 switch, yet the source code shows that it really does invoke the netfilter hook functions (PREROUTING/INPUT/FORWARD/OUTPUT/POSTROUTING). There is, of course, a switch to turn this off: net.bridge.bridge-nf-call-iptables.
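
You can check the current state of this switch as follows (the sysctl only exists once the br_netfilter module is loaded; a minimal sketch):

modprobe br_netfilter                          ## load the module if it is not loaded yet
sysctl net.bridge.bridge-nf-call-iptables      ## 1 = bridged traffic traverses iptables, 0 = it does not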

Let's measure the performance of connecting two pods through a bridge. Create a bridge named br0 and attach pod-a and pod-b to it:

ip link add br0 type bridge 
ip addr add 192.168.10.1 dev br0  ## assign the pods' default-gateway IP to br0
ip link set br0 up

ip netns add pod-a
ip link add eth0 type veth peer name veth-pod-a
ip link set eth0 netns pod-a
ip netns exec pod-a ip addr add 192.168.10.10/24 dev eth0
ip netns exec pod-a ip link set eth0 up
ip netns exec pod-a ip route add default via 192.168.10.1 dev eth0 onlink  ## the default gateway is br0's address
ip link set veth-pod-a master br0  ## plug the host end of the veth pair into the bridge
ip link set veth-pod-a up

ip netns add pod-b
ip link add eth0 type veth peer name veth-pod-b
ip link set eth0 netns pod-b
ip netns exec pod-b ip addr add 192.168.10.11/24 dev eth0
ip netns exec pod-b ip link set eth0 up
ip netns exec pod-b ip route add default via 192.168.10.1 dev eth0 onlink
ip link set veth-pod-b master br0
ip link set veth-pod-b up

ip route add 192.168.10.0/24 via 192.168.10.1 dev br0 scope link  ## route from the host to all pods, next hop is br0
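
Optionally, verify that both veth ends are enslaved to br0 before benchmarking; a quick sketch:

bridge link show                ## veth-pod-a and veth-pod-b should both show "master br0"
ip -br addr show br0            ## br0 should be UP and hold 192.168.10.1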

Run iperf3 in pod-a again:

ip netns exec pod-a iperf3 -s

Measure throughput from pod-b:

ip netns exec pod-b iperf3 -c 192.168.10.10 -i 1 -t 10

Connecting to host 192.168.10.10, port 5201
[  5] local 192.168.10.11 port 38232 connected to 192.168.10.10 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  7.64 GBytes  65.6 Gbits/sec    0    642 KBytes
[  5]   1.00-2.00   sec  7.71 GBytes  66.2 Gbits/sec    0    642 KBytes
[  5]   2.00-3.00   sec  7.49 GBytes  64.4 Gbits/sec    0    786 KBytes
[  5]   3.00-4.00   sec  7.61 GBytes  65.3 Gbits/sec    0    786 KBytes
[  5]   4.00-5.00   sec  7.54 GBytes  64.8 Gbits/sec    0    786 KBytes
[  5]   5.00-6.00   sec  7.71 GBytes  66.2 Gbits/sec    0    786 KBytes
[  5]   6.00-7.00   sec  7.66 GBytes  65.8 Gbits/sec    0    826 KBytes
[  5]   7.00-8.00   sec  7.63 GBytes  65.5 Gbits/sec    0    826 KBytes
[  5]   8.00-9.00   sec  7.64 GBytes  65.7 Gbits/sec    0    826 KBytes
[  5]   9.00-10.00  sec  7.71 GBytes  66.2 Gbits/sec    0    826 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  76.3 GBytes  65.6 Gbits/sec    0             sender
[  5]   0.00-10.04  sec  76.3 GBytes  65.3 Gbits/sec                  receiver

Not much different from veth. I repeated the test several times with similar results, which felt off: judging from the source code, the bridge's forwarding path should definitely be faster than forwarding through the host's full protocol stack. So I wondered whether disabling the bridge's netfilter hook calls would help:

sysctl -w net.bridge.bridge-nf-call-iptables=0

And tried again:

ip netns exec pod-b iperf3 -c 192.168.10.10 -i 1 -t 10
 
Connecting to host 192.168.10.10, port 5201
[  5] local 192.168.10.11 port 40658 connected to 192.168.10.10 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  8.55 GBytes  73.4 Gbits/sec    0    810 KBytes
[  5]   1.00-2.00   sec  8.86 GBytes  76.1 Gbits/sec    0    940 KBytes
[  5]   2.00-3.00   sec  8.96 GBytes  77.0 Gbits/sec    0    940 KBytes
[  5]   3.00-4.00   sec  9.04 GBytes  77.6 Gbits/sec    0    940 KBytes
[  5]   4.00-5.00   sec  8.89 GBytes  76.4 Gbits/sec    0    987 KBytes
[  5]   5.00-6.00   sec  9.18 GBytes  78.9 Gbits/sec    0    987 KBytes
[  5]   6.00-7.00   sec  9.09 GBytes  78.1 Gbits/sec    0    987 KBytes
[  5]   7.00-8.00   sec  9.10 GBytes  78.1 Gbits/sec    0   1.01 MBytes
[  5]   8.00-9.00   sec  8.98 GBytes  77.1 Gbits/sec    0   1.12 MBytes
[  5]   9.00-10.00  sec  9.13 GBytes  78.4 Gbits/sec    0   1.12 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  89.8 GBytes  77.1 Gbits/sec    0             sender
[  5]   0.00-10.04  sec  89.8 GBytes  76.8 Gbits/sec                  receiver

Sure enough, it is much faster. The result still surprised me somewhat: with only a single iptables rule on the host, merely enabling the bridge's netfilter hook calls makes a difference of about 10 Gbits/sec.

However, this flag cannot stay off: as mentioned above, we need the iptables rules to translate the clusterIP into a podIP, so turn it back on:

sysctl -w net.bridge.bridge-nf-call-iptables=1

Now test via the clusterIP:

ip netns exec pod-b iperf3 -c 10.96.0.100 -i 1 -t 10

Connecting to host 10.96.0.100, port 5201
[  5] local 192.168.10.11 port 47414 connected to 10.96.0.100 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  7.02 GBytes  60.3 Gbits/sec    0    669 KBytes
[  5]   1.00-2.00   sec  7.23 GBytes  62.1 Gbits/sec    0    738 KBytes
[  5]   2.00-3.00   sec  7.17 GBytes  61.6 Gbits/sec    0    899 KBytes
[  5]   3.00-4.00   sec  7.21 GBytes  62.0 Gbits/sec    0    899 KBytes
[  5]   4.00-5.00   sec  7.31 GBytes  62.8 Gbits/sec    0    899 KBytes
[  5]   5.00-6.00   sec  7.19 GBytes  61.8 Gbits/sec    0   1008 KBytes
[  5]   6.00-7.00   sec  7.24 GBytes  62.2 Gbits/sec    0   1008 KBytes
[  5]   7.00-8.00   sec  7.22 GBytes  62.0 Gbits/sec    0   1.26 MBytes
[  5]   8.00-9.00   sec  6.99 GBytes  60.0 Gbits/sec    0   1.26 MBytes
[  5]   9.00-10.00  sec  7.07 GBytes  60.8 Gbits/sec    0   1.26 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  71.6 GBytes  61.5 Gbits/sec    0             sender
[  5]   0.00-10.04  sec  71.6 GBytes  61.3 Gbits/sec                  receiver

Repeated runs give roughly the same value. Accessing via clusterIP is about 8% slower than going directly to the podIP.

Next up is macvlan. Clean up first:

ip netns del pod-a
ip netns del pod-b
ip link del br0

macvlan

macvlan creates multiple virtual interfaces on top of one physical NIC. Each virtual interface has its own MAC address and can be assigned an IP address. In bridge mode (the other modes don't fit our scenario, so they are not discussed), the parent interface acts as a switch for communication between the sub-interfaces, and the sub-interfaces reach the outside world through the parent.

We use bridge mode:

ip link add link eno2 name eth0 type macvlan mode bridge  ## eno2 is my physical NIC, eth0 is the virtual sub-interface's name - adjust for your environment
ip netns add pod-a
ip l set eth0 netns pod-a
ip netns exec pod-a ip addr add 192.168.10.10/24 dev eth0
ip netns exec pod-a ip link set eth0 up
ip netns exec pod-a ip route add default via 192.168.10.1 dev eth0  

ip link add link eno2 name eth0 type macvlan mode bridge
ip netns add pod-b
ip l set eth0 netns pod-b
ip netns exec pod-b ip addr add 192.168.10.11/24 dev eth0
ip netns exec pod-b ip link set eth0 up
ip netns exec pod-b ip route add default via 192.168.10.1 dev eth0

ip link add link eno2 name eth0 type macvlan mode bridge
ip addr add 192.168.10.1/24 dev eth0  ## this extra sub-interface stays on the host and holds the pods' default-gateway IP, so traffic from the pods to the clusterIP reaches the host and the host's iptables rules get a chance to run
ip link set eth0 up

iptables -A POSTROUTING -t nat -s 192.168.10.0/24 -d 192.168.10.0/24 -j MASQUERADE  ## the return packets must also traverse the host stack, so source NAT is needed
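
As a quick sanity check of the claim that every macvlan sub-interface gets its own MAC address, compare them with the parent NIC (eno2 here; a sketch):

ip -br link show eno2                          ## the parent NIC's MAC address
ip netns exec pod-a ip -br link show eth0      ## a different, per-sub-interface MAC
ip netns exec pod-b ip -br link show eth0      ## different again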

Test results:

ip netns exec pod-b iperf3 -c 192.168.10.10 -i 1 -t 10

Connecting to host 192.168.10.10, port 5201
[  5] local 192.168.10.11 port 47050 connected to 192.168.10.10 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  9.47 GBytes  81.3 Gbits/sec    0    976 KBytes
[  5]   1.00-2.00   sec  9.81 GBytes  84.3 Gbits/sec    0    976 KBytes
[  5]   2.00-3.00   sec  9.74 GBytes  83.6 Gbits/sec    0   1.16 MBytes
[  5]   3.00-4.00   sec  9.77 GBytes  83.9 Gbits/sec    0   1.16 MBytes
[  5]   4.00-5.00   sec  9.73 GBytes  83.6 Gbits/sec    0   1.16 MBytes
[  5]   5.00-6.00   sec  9.69 GBytes  83.2 Gbits/sec    0   1.16 MBytes
[  5]   6.00-7.00   sec  9.75 GBytes  83.7 Gbits/sec    0   1.22 MBytes
[  5]   7.00-8.00   sec  9.78 GBytes  84.0 Gbits/sec    0   1.30 MBytes
[  5]   8.00-9.00   sec  9.81 GBytes  84.3 Gbits/sec    0   1.30 MBytes
[  5]   9.00-10.00  sec  9.84 GBytes  84.5 Gbits/sec    0   1.30 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  97.4 GBytes  83.6 Gbits/sec    0             sender
[  5]   0.00-10.04  sec  97.4 GBytes  83.3 Gbits/sec                  receiver

Direct podIP access is very fast, the fastest of all the approaches.

Now via the clusterIP:

ip netns exec pod-b iperf3 -c 10.96.0.100 -i 1 -t 10

Connecting to host 10.96.0.100, port 5201
[  5] local 192.168.10.11 port 35780 connected to 10.96.0.100 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  7.16 GBytes  61.5 Gbits/sec    0    477 KBytes
[  5]   1.00-2.00   sec  7.18 GBytes  61.7 Gbits/sec    0    477 KBytes
[  5]   2.00-3.00   sec  7.13 GBytes  61.2 Gbits/sec    0    608 KBytes
[  5]   3.00-4.00   sec  7.18 GBytes  61.7 Gbits/sec    0    670 KBytes
[  5]   4.00-5.00   sec  7.16 GBytes  61.5 Gbits/sec    0    670 KBytes
[  5]   5.00-6.00   sec  7.30 GBytes  62.7 Gbits/sec    0    822 KBytes
[  5]   6.00-7.00   sec  7.34 GBytes  63.1 Gbits/sec    0    822 KBytes
[  5]   7.00-8.00   sec  7.30 GBytes  62.7 Gbits/sec    0   1.00 MBytes
[  5]   8.00-9.00   sec  7.19 GBytes  61.8 Gbits/sec    0   1.33 MBytes
[  5]   9.00-10.00  sec  7.15 GBytes  61.4 Gbits/sec    0   1.33 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  72.1 GBytes  61.9 Gbits/sec    0             sender
[  5]   0.00-10.04  sec  72.1 GBytes  61.7 Gbits/sec                  receiver

Going through the host's kernel protocol stack costs nearly 25%.

Clean up:

ip netns del pod-a
ip netns del pod-b
ip link del eth0    ## also remove the host-side macvlan sub-interface holding 192.168.10.1, otherwise the name clashes with the ipvlan setup below
iptables -D POSTROUTING -t nat -s 192.168.10.0/24 -d 192.168.10.0/24 -j MASQUERADE

ipvlan

Like macvlan, ipvlan creates multiple sub-interfaces on top of one physical NIC. The difference is that all ipvlan sub-interfaces share the same MAC address and differ only in IP address.

ipvlan has l2 and l3 modes. In l2 mode it works much like macvlan: the parent interface acts as a switch forwarding packets between sub-interfaces. The difference is that ipvlan decides a packet is sub-interface-to-sub-interface traffic by checking whether dmac == smac.

l2 mode:

ip l add link eno2 name eth0 type ipvlan mode l2
ip netns add pod-a
ip l set eth0 netns pod-a
ip netns exec pod-a ip addr add 192.168.10.10/24 dev eth0
ip netns exec pod-a ip link set eth0 up
ip netns exec pod-a ip route add default via 192.168.10.1 dev eth0

ip l add link eno2 name eth0 type ipvlan mode l2
ip netns add pod-b
ip l set eth0 netns pod-b
ip netns exec pod-b ip addr add 192.168.10.11/24 dev eth0
ip netns exec pod-b ip link set eth0 up
ip netns exec pod-b ip route add default via 192.168.10.1 dev eth0

ip link add link eno2 name eth0 type ipvlan mode l2
ip link set eth0 up
ip addr add 192.168.10.1/24 dev eth0

iptables -A POSTROUTING -t nat -s 192.168.10.0/24 -d 192.168.10.0/24 -j MASQUERADE  ## same as above: the return packets must traverse the host stack, hence the source NAT
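
The same check as in the macvlan section now shows the opposite: every ipvlan sub-interface reports the parent NIC's MAC address (a sketch):

ip -br link show eno2                          ## the parent NIC's MAC address
ip netns exec pod-a ip -br link show eth0      ## same MAC as eno2
ip netns exec pod-b ip -br link show eth0      ## same MAC again; only the IPs differ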

Test results:

ip netns exec pod-b iperf3 -c 192.168.10.10 -i 1 -t 10

Connecting to host 192.168.10.10, port 5201
[  5] local 192.168.10.11 port 59580 connected to 192.168.10.10 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  9.70 GBytes  83.3 Gbits/sec    0    748 KBytes
[  5]   1.00-2.00   sec  9.71 GBytes  83.4 Gbits/sec    0    748 KBytes
[  5]   2.00-3.00   sec  10.1 GBytes  86.9 Gbits/sec    0    748 KBytes
[  5]   3.00-4.00   sec  9.84 GBytes  84.5 Gbits/sec    0    826 KBytes
[  5]   4.00-5.00   sec  9.82 GBytes  84.3 Gbits/sec    0    826 KBytes
[  5]   5.00-6.00   sec  9.74 GBytes  83.6 Gbits/sec    0   1.19 MBytes
[  5]   6.00-7.00   sec  9.77 GBytes  83.9 Gbits/sec    0   1.19 MBytes
[  5]   7.00-8.00   sec  9.53 GBytes  81.8 Gbits/sec    0   1.41 MBytes
[  5]   8.00-9.00   sec  9.56 GBytes  82.1 Gbits/sec    0   1.41 MBytes
[  5]   9.00-10.00  sec  9.54 GBytes  82.0 Gbits/sec    0   1.41 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  97.3 GBytes  83.6 Gbits/sec    0             sender
[  5]   0.00-10.04  sec  97.3 GBytes  83.3 Gbits/sec                  receiver

In ipvlan l2 mode, same-host pod-to-pod performance is about the same as macvlan; l3 mode is similar too, so its results are not shown.

Access via the clusterIP:

ip netns exec pod-b iperf3 -c 10.96.0.100 -i 1 -t 10

Connecting to host 10.96.0.100, port 5201
[  5] local 192.168.10.11 port 38540 connected to 10.96.0.100 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  7.17 GBytes  61.6 Gbits/sec    0    611 KBytes
[  5]   1.00-2.00   sec  7.22 GBytes  62.0 Gbits/sec    0    708 KBytes
[  5]   2.00-3.00   sec  7.31 GBytes  62.8 Gbits/sec    0    708 KBytes
[  5]   3.00-4.00   sec  7.29 GBytes  62.6 Gbits/sec    0    833 KBytes
[  5]   4.00-5.00   sec  7.27 GBytes  62.4 Gbits/sec    0    833 KBytes
[  5]   5.00-6.00   sec  7.36 GBytes  63.3 Gbits/sec    0    833 KBytes
[  5]   6.00-7.00   sec  7.26 GBytes  62.4 Gbits/sec    0    874 KBytes
[  5]   7.00-8.00   sec  7.19 GBytes  61.8 Gbits/sec    0    874 KBytes
[  5]   8.00-9.00   sec  7.17 GBytes  61.6 Gbits/sec    0    874 KBytes
[  5]   9.00-10.00  sec  7.28 GBytes  62.5 Gbits/sec    0    874 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  72.5 GBytes  62.3 Gbits/sec    0             sender
[  5]   0.00-10.04  sec  72.5 GBytes  62.0 Gbits/sec                  receiver

Again, clusterIP access is about 25% slower than direct podIP access. There is another problem: same-host pod-to-pod traffic does not pass through the iptables rules, so network policies cannot take effect. It would be great if the clusterIP problem could be solved cleanly under ipvlan or macvlan. Below we try attaching eBPF programs directly to the pod's interface to handle the clusterIP.

ipvlan with eBPF programs attached

The following requires our custom eBPF program (mst_lxc.o), so you won't be able to follow along here; just take a look at the results.

Attach eBPF programs to tc ingress and tc egress on pod-b's eth0 interface (because we attach inside the pod, the programs end up attached the other way round: the eBPF programs were written from the perspective of the host-side interface):

ip netns exec pod-b bash
ulimit -l unlimited
tc qdisc add dev eth0 clsact
tc filter add dev eth0 ingress prio 1 handle 1 bpf da obj mst_lxc.o sec lxc-egress   # ingress gets lxc-egress, which mainly performs rev-DNAT: podIP -> clusterIP
tc filter add dev eth0 egress prio 1 handle 1 bpf da obj mst_lxc.o sec lxc-ingress   # egress gets lxc-ingress, which performs DNAT: clusterIP -> podIP
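
To confirm the programs are actually attached, list the tc filters from inside the namespace (exact output varies with the kernel and iproute2 version; a sketch):

ip netns exec pod-b tc filter show dev eth0 ingress   ## should show a bpf filter referencing mst_lxc.o:[lxc-egress]
ip netns exec pod-b tc filter show dev eth0 egress    ## should show mst_lxc.o:[lxc-ingress]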

Add the service and its backend (functionally similar to the iptables DNAT rule above):

mustang svc add --service=10.96.0.100:5201 --backend=192.168.10.10:5201

I0908 09:51:35.754539    5694 bpffs_linux.go:260] Detected mounted BPF filesystem at /sys/fs/bpf
I0908 09:51:35.891046    5694 service.go:578] Restored services from maps
I0908 09:51:35.891103    5694 bpf_svc_add.go:86] created success,id:1

Check the configuration (mustang is our custom user-space tool for manipulating the eBPF maps):

mustang svc ls

==========================================================================
Mustang Service count:1
==========================================================================
id      pro     service         port    backends
--------------------------------------------------------------------------
1       NONE    10.96.0.100     5201    192.168.10.10:5201,
==========================================================================

Everything is ready; now test accessing pod-a via the clusterIP:

ip netns exec pod-b iperf3 -c 10.96.0.100 -i 1 -t 10

Connecting to host 10.96.0.100, port 5201
[  5] local 192.168.10.11 port 57152 connected to 10.96.0.100 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  8.96 GBytes  77.0 Gbits/sec    0   1.04 MBytes
[  5]   1.00-2.00   sec  8.83 GBytes  75.8 Gbits/sec    0   1.26 MBytes
[  5]   2.00-3.00   sec  9.13 GBytes  78.4 Gbits/sec    0   1.46 MBytes
[  5]   3.00-4.00   sec  9.10 GBytes  78.2 Gbits/sec    0   1.46 MBytes
[  5]   4.00-5.00   sec  9.37 GBytes  80.5 Gbits/sec    0   1.46 MBytes
[  5]   5.00-6.00   sec  9.44 GBytes  81.1 Gbits/sec    0   1.46 MBytes
[  5]   6.00-7.00   sec  9.14 GBytes  78.5 Gbits/sec    0   1.46 MBytes
[  5]   7.00-8.00   sec  9.36 GBytes  80.4 Gbits/sec    0   1.68 MBytes
[  5]   8.00-9.00   sec  9.11 GBytes  78.3 Gbits/sec    0   1.68 MBytes
[  5]   9.00-10.00  sec  9.29 GBytes  79.8 Gbits/sec    0   1.68 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  91.7 GBytes  78.8 Gbits/sec    0             sender
[  5]   0.00-10.04  sec  91.7 GBytes  78.5 Gbits/sec                  receiver

With ipvlan, implementing the clusterIP via attached eBPF programs improves performance dramatically: the overhead compared to direct podIP access is under 5%, and throughput is about 20% higher than implementing the clusterIP with the host's iptables rules. This is the best of all the schemes above. Our in-house CNI (mustang) supports this mode, and also supports attaching eBPF programs in the veth approach; the veth + eBPF test results follow.

We need to clean up the previous setup here first; a sketch of the cleanup follows.
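
Assuming the host-side ipvlan sub-interface is still named eth0, the cleanup mirrors the earlier sections: delete the namespaces, remove that sub-interface, and drop the MASQUERADE rule:

ip netns del pod-a
ip netns del pod-b
ip link del eth0                 ## the host-side ipvlan sub-interface holding 192.168.10.1
iptables -D POSTROUTING -t nat -s 192.168.10.0/24 -d 192.168.10.0/24 -j MASQUERADE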

veth with eBPF programs attached

  1. As before, create pod-a and pod-b:
ip netns add pod-a
ip link add eth0 type veth peer name veth-pod-a
ip link set eth0 netns pod-a
ip netns exec pod-a ip addr add 192.168.10.10/24 dev eth0
ip netns exec pod-a ip link set eth0 up
ip netns exec pod-a ip route add default via 169.254.10.24 dev eth0 onlink
ip link set veth-pod-a up
echo 1 > /proc/sys/net/ipv4/conf/veth-pod-a/proxy_arp
ip route add 192.168.10.10 dev veth-pod-a scope link

ip netns add pod-b
ip link add eth0 type veth peer name veth-pod-b
ip link set eth0 netns pod-b
ip netns exec pod-b ip addr add 192.168.10.11/24 dev eth0
ip netns exec pod-b ip link set eth0 up
ip netns exec pod-b ip route add default via 169.254.10.24 dev eth0 onlink
ip link set veth-pod-b up
echo 1 > /proc/sys/net/ipv4/conf/veth-pod-b/proxy_arp
ip route add 192.168.10.11 dev veth-pod-b scope link

echo 1 > /proc/sys/net/ipv4/ip_forward

  2. Attach the eBPF programs

Attach eBPF programs to both tc ingress and tc egress on the host-side interfaces of pod-a and pod-b:

tc qdisc add dev veth-pod-b clsact
tc filter add dev veth-pod-b ingress prio 1 handle 1 bpf da obj mst_lxc.o sec lxc-ingress  ## attached on the host-side interface, so the direction is the reverse of attaching inside the container
tc filter add dev veth-pod-b egress prio 1 handle 1 bpf da obj mst_lxc.o sec lxc-egress

tc qdisc add dev veth-pod-a clsact
tc filter add dev veth-pod-a ingress prio 1 handle 1 bpf da obj mst_lxc.o sec lxc-ingress
tc filter add dev veth-pod-a egress prio 1 handle 1 bpf da obj mst_lxc.o sec lxc-egress

  3. Add the service and the endpoints (for fast packet redirection):
mustang svc add --service=10.96.0.100:5201 --backend=192.168.10.10:5201
mustang ep add --ifname=eth0 --netns=/var/run/netns/pod-a
mustang ep add --ifname=eth0 --netns=/var/run/netns/pod-b

  4. Check the configuration:
mustang svc ls
==========================================================================
Mustang Service count:1
==========================================================================
id      pro     service         port    backends
--------------------------------------------------------------------------
1       NONE    10.96.0.100     5201    192.168.10.10:5201,
==========================================================================

mustang ep ls
==========================================================================
Mustang Endpoint count:2
==========================================================================
Id      IP              Host    IfIndex         LxcMAC                  NodeMAC
--------------------------------------------------------------------------
1       192.168.10.10   false   653             76:6C:37:2C:81:05       3E:E9:02:96:60:D6
2       192.168.10.11   false   655             6A:CD:11:13:76:2C       72:90:74:4A:CB:84
==========================================================================

  5. Time to test. Same routine as before: run the iperf3 server in pod-a and test from pod-b, starting with the clusterIP:
 ip netns exec pod-b iperf3 -c 10.96.0.100 -i 1 -t 10
Connecting to host 10.96.0.100, port 5201
[  5] local 192.168.10.11 port 36492 connected to 10.96.0.100 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  8.52 GBytes  73.2 Gbits/sec    0   1.07 MBytes
[  5]   1.00-2.00   sec  8.54 GBytes  73.4 Gbits/sec    0   1.07 MBytes
[  5]   2.00-3.00   sec  8.64 GBytes  74.2 Gbits/sec    0   1.07 MBytes
[  5]   3.00-4.00   sec  8.57 GBytes  73.6 Gbits/sec    0   1.12 MBytes
[  5]   4.00-5.00   sec  8.61 GBytes  73.9 Gbits/sec    0   1.18 MBytes
[  5]   5.00-6.00   sec  8.48 GBytes  72.8 Gbits/sec    0   1.18 MBytes
[  5]   6.00-7.00   sec  8.57 GBytes  73.6 Gbits/sec    0   1.18 MBytes
[  5]   7.00-8.00   sec  9.11 GBytes  78.3 Gbits/sec    0   1.18 MBytes
[  5]   8.00-9.00   sec  8.86 GBytes  76.1 Gbits/sec    0   1.18 MBytes
[  5]   9.00-10.00  sec  8.71 GBytes  74.8 Gbits/sec    0   1.18 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  86.6 GBytes  74.4 Gbits/sec    0             sender
[  5]   0.00-10.04  sec  86.6 GBytes  74.1 Gbits/sec                  receiver

Now try the podIP directly:

ip netns exec pod-b iperf3 -c 192.168.10.10 -i 1 -t 10
Connecting to host 192.168.10.10, port 5201
[  5] local 192.168.10.11 port 56460 connected to 192.168.10.10 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  8.76 GBytes  75.3 Gbits/sec    0    676 KBytes
[  5]   1.00-2.00   sec  8.87 GBytes  76.2 Gbits/sec    0    708 KBytes
[  5]   2.00-3.00   sec  8.78 GBytes  75.5 Gbits/sec    0    708 KBytes
[  5]   3.00-4.00   sec  8.99 GBytes  77.2 Gbits/sec    0    708 KBytes
[  5]   4.00-5.00   sec  8.85 GBytes  76.1 Gbits/sec    0    782 KBytes
[  5]   5.00-6.00   sec  8.98 GBytes  77.1 Gbits/sec    0    782 KBytes
[  5]   6.00-7.00   sec  8.64 GBytes  74.2 Gbits/sec    0   1.19 MBytes
[  5]   7.00-8.00   sec  8.40 GBytes  72.2 Gbits/sec    0   1.55 MBytes
[  5]   8.00-9.00   sec  8.16 GBytes  70.1 Gbits/sec    0   1.88 MBytes
[  5]   9.00-10.00  sec  8.16 GBytes  70.1 Gbits/sec    0   1.88 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  86.6 GBytes  74.4 Gbits/sec    0             sender
[  5]   0.00-10.04  sec  86.6 GBytes  74.1 Gbits/sec                  receiver

With the eBPF programs attached, podIP access is roughly 12% faster than before (compare the plain veth results at the top, 66 Gbits/sec), and there is almost no difference between clusterIP and podIP access.

Summary

  • veth: (podIP) 66 Gbits/sec, (clusterIP) 66 Gbits/sec
  • bridge: (podIP) 77 Gbits/sec with bridge-nf-call-iptables off, (clusterIP) 61 Gbits/sec
  • macvlan: (podIP) 83 Gbits/sec, (clusterIP) 62 Gbits/sec
  • ipvlan: (podIP) 83 Gbits/sec, (clusterIP) 62 Gbits/sec
  • ipvlan + eBPF: (podIP) 83 Gbits/sec, (clusterIP) 78 Gbits/sec
  • veth + eBPF: (podIP) 74 Gbits/sec, (clusterIP) 74 Gbits/sec

[Chart: same-host pod connection approaches and their measured throughput]

In summary:

  • macvlan and ipvlan deliver the highest podIP throughput of all the schemes, but clusterIP access drops about 25% because it traverses the host protocol stack; eBPF then wins back about 20%. macvlan, however, does not work on wireless NICs, a single NIC supports only a limited number of MAC addresses, and it tends to fail the source/destination checks of cloud platforms, so ipvlan + eBPF is usually the recommended combination. Attaching the eBPF programs in this mode is a little fiddly, though: for brevity the demo above created the eBPF maps inside the pod, whereas in practice the maps are created on the host and shared by all pods.
  • With eBPF programs attached in the veth approach, serviceIP and podIP access perform identically, about 12% better than without eBPF, and the host's iptables firewall rules can still be used. Attaching the eBPF programs on the host side is also simpler, so this is mustang's default scheme.
  • bridge beats veth for podIP access, but executing the iptables rules drags it down, so without eBPF the veth approach is still the better recommendation; veth is calico's default scheme, while bridge is flannel's default.

Original content statement: this article was published on the Tencent Cloud Developer Community with the author's authorization; reproduction without permission is prohibited.

For infringement concerns, please contact cloudcommunity@tencent.com for removal.

