I have a k8s node named edge1, which has two pods: a client pod named net-tool-edge1 and a server pod named nginx-edge1. There is also a service named nginx.

For some reason, this node doesn't run kube-proxy; instead, an agent generates IPVS rules for services.

Today I found that net-tool-edge1 couldn't reach the nginx service: there was no response. After capturing traffic with tcpdump, I found that IPVS didn't work as expected.

Pod net-tool-edge1's IP is 10.234.67.29, pod nginx-edge1's IP is 10.234.67.28, and service nginx has ClusterIP 10.234.39.157.
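
For reference, the captures below were gathered roughly like this (the interface names are assumptions; adjust them to your CNI setup):

# inside the client pod's network namespace (e.g. via kubectl exec or nsenter):
tcpdump -nn -i eth0 host 10.234.39.157 or host 10.234.67.28
# on the edge1 node itself, across all interfaces:
tcpdump -nn -i any host 10.234.39.157 or host 10.234.67.28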

The output of ipvsadm -Ln:

IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  10.234.14.175:80 rr
  -> 10.234.67.28:80              Masq    1      0          0         
TCP  10.234.39.157:80 rr
  -> 10.234.67.28:80              Masq    1      0          0         
TCP  10.234.50.96:80 rr
  -> 10.22.48.15:80               Masq    1      0          0          
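
Entries like these can also be created by hand with ipvsadm. A minimal sketch of what the agent presumably sets up for the nginx ClusterIP (the agent itself may do the equivalent via netlink):

ipvsadm -A -t 10.234.39.157:80 -s rr                  # virtual service, round-robin scheduler
ipvsadm -a -t 10.234.39.157:80 -r 10.234.67.28:80 -m  # real server in masquerade (NAT) mode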

Following is the tcpdump output from net-tool-edge1:

02:45:46.016748 ARP, Request who-has 10.234.67.1 tell 10.234.67.29, length 28
02:45:46.016858 ARP, Request who-has 10.234.67.1 tell 10.234.67.29, length 28
02:45:46.016862 ARP, Reply 10.234.67.1 is-at 96:b9:e1:32:f0:fa, length 28
02:45:46.016864 IP 10.234.67.29.52704 > 169.254.25.10.53: 6768+ A? nginx.fabedge-e2e-test.svc.cluster.local. (58)
02:45:46.016953 IP 10.234.67.29.52704 > 169.254.25.10.53: 7300+ AAAA? nginx.fabedge-e2e-test.svc.cluster.local. (58)
02:45:46.025844 IP 169.254.25.10.53 > 10.234.67.29.52704: 6768*- 1/0/0 A 10.234.39.157 (114)
02:45:47.023403 IP 169.254.25.10.53 > 10.234.67.29.52704: 7300*- 0/1/0 (151)
02:45:47.023958 IP 10.234.67.29.57824 > 10.234.39.157.80: Flags [S], seq 1920875836, win 27200, options [mss 1360,sackOK,TS val 688253 ecr 0,nop,wscale 7], length 0
02:45:47.024040 ARP, Request who-has 10.234.67.28 tell 10.234.67.1, length 28
02:45:47.024149 ARP, Request who-has 10.234.67.29 tell 10.234.67.28, length 28
02:45:47.024153 ARP, Reply 10.234.67.29 is-at f2:3e:7d:a6:f5:1d, length 28
02:45:47.024162 IP 10.234.67.28.80 > 10.234.67.29.57824: Flags [S.], seq 3004459791, ack 1920875837, win 26960, options [mss 1360,sackOK,TS val 688253 ecr 688253,nop,wscale 7], length 0
02:45:47.024180 IP 10.234.67.29.57824 > 10.234.67.28.80: Flags [R], seq 1920875837, win 0, length 0
02:45:48.026571 IP 10.234.67.29.57824 > 10.234.39.157.80: Flags [S], seq 1920875836, win 27200, options [mss 1360,sackOK,TS val 689256 ecr 0,nop,wscale 7], length 0
02:45:48.026687 IP 10.234.67.28.80 > 10.234.67.29.57824: Flags [S.], seq 3020124674, ack 1920875837, win 26960, options [mss 1360,sackOK,TS val 689256 ecr 689256,nop,wscale 7], length 0
02:45:48.026702 IP 10.234.67.29.57824 > 10.234.67.28.80: Flags [R], seq 1920875837, win 0, length 0
02:45:51.026582 ARP, Request who-has 10.234.67.29 tell 10.234.67.1, length 28
02:45:51.026595 ARP, Reply 10.234.67.29 is-at f2:3e:7d:a6:f5:1d, length 28
02:45:52.034599 ARP, Request who-has 10.234.67.28 tell 10.234.67.29, length 28
02:45:52.034668 ARP, Reply 10.234.67.28 is-at 92:01:73:4f:50:2f, length 28

As we can see, inside pod net-tool-edge1 the request's destination IP is 10.234.39.157, but the response packet's source IP is 10.234.67.28, so net-tool-edge1 sends a RST packet.

Here is the tcpdump output on the edge1 node (the node clock is in a different timezone, hence the 10:45 timestamps):

10:45:47.023961 IP 10.234.67.29.57824 > 10.234.39.157.80: Flags [S], seq 1920875836, win 27200, options [mss 1360,sackOK,TS val 688253 ecr 0,nop,wscale 7], length 0
10:45:47.023978 IP 10.234.67.29.57824 > 10.234.39.157.80: Flags [S], seq 1920875836, win 27200, options [mss 1360,sackOK,TS val 688253 ecr 0,nop,wscale 7], length 0
10:45:47.024063 IP 10.234.67.29.57824 > 10.234.67.28.80: Flags [S], seq 1920875836, win 27200, options [mss 1360,sackOK,TS val 688253 ecr 0,nop,wscale 7], length 0
10:45:47.024064 IP 10.234.67.29.57824 > 10.234.67.28.80: Flags [S], seq 1920875836, win 27200, options [mss 1360,sackOK,TS val 688253 ecr 0,nop,wscale 7], length 0
10:45:47.024160 IP 10.234.67.28.80 > 10.234.67.29.57824: Flags [S.], seq 3004459791, ack 1920875837, win 26960, options [mss 1360,sackOK,TS val 688253 ecr 688253,nop,wscale 7], length 0
10:45:47.024161 IP 10.234.67.28.80 > 10.234.67.29.57824: Flags [S.], seq 3004459791, ack 1920875837, win 26960, options [mss 1360,sackOK,TS val 688253 ecr 688253,nop,wscale 7], length 0
10:45:47.024185 IP 10.234.67.29.57824 > 10.234.67.28.80: Flags [R], seq 1920875837, win 0, length 0
10:45:47.024186 IP 10.234.67.29.57824 > 10.234.67.28.80: Flags [R], seq 1920875837, win 0, length 0
10:45:48.026585 IP 10.234.67.29.57824 > 10.234.39.157.80: Flags [S], seq 1920875836, win 27200, options [mss 1360,sackOK,TS val 689256 ecr 0,nop,wscale 7], length 0
10:45:48.026598 IP 10.234.67.29.57824 > 10.234.39.157.80: Flags [S], seq 1920875836, win 27200, options [mss 1360,sackOK,TS val 689256 ecr 0,nop,wscale 7], length 0
10:45:48.026636 IP 10.234.67.29.57824 > 10.234.67.28.80: Flags [S], seq 1920875836, win 27200, options [mss 1360,sackOK,TS val 689256 ecr 0,nop,wscale 7], length 0
10:45:48.026640 IP 10.234.67.29.57824 > 10.234.67.28.80: Flags [S], seq 1920875836, win 27200, options [mss 1360,sackOK,TS val 689256 ecr 0,nop,wscale 7], length 0
10:45:48.026684 IP 10.234.67.28.80 > 10.234.67.29.57824: Flags [S.], seq 3020124674, ack 1920875837, win 26960, options [mss 1360,sackOK,TS val 689256 ecr 689256,nop,wscale 7], length 0
10:45:48.026686 IP 10.234.67.28.80 > 10.234.67.29.57824: Flags [S.], seq 3020124674, ack 1920875837, win 26960, options [mss 1360,sackOK,TS val 689256 ecr 689256,nop,wscale 7], length 0
10:45:48.026703 IP 10.234.67.29.57824 > 10.234.67.28.80: Flags [R], seq 1920875837, win 0, length 0
10:45:48.026704 IP 10.234.67.29.57824 > 10.234.67.28.80: Flags [R], seq 1920875837, win 0, length 0

Here we can see that when net-tool-edge1 sends a request to 10.234.39.157, IPVS (or some other kernel module) changes the destination IP to 10.234.67.28, but when nginx-edge1 sends its response packet, the source IP is not changed back to 10.234.39.157.
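
To confirm that the forward-path translation really is done by IPVS, its own connection table can be inspected (a diagnostic sketch; the exact column layout varies by version):

ipvsadm -Lnc   # list IPVS connection entries
# an entry mapping 10.234.67.29:57824 -> 10.234.39.157:80 -> 10.234.67.28:80
# would show the DNAT is tracked, pointing the failure at the reply path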

I also have another node named edge2 with the same settings, but everything on edge2 works well. In fact, before I restarted edge1, everything worked well there too.

I googled a lot but didn't find anything helpful. Any help, tip, or document is welcome. Thanks in advance.

For reference, here are the iptables rules on edge1:

[root@edge1 ~]# iptables -S 
-P INPUT ACCEPT
-P FORWARD DROP
-P OUTPUT ACCEPT
-N DOCKER
-N DOCKER-ISOLATION-STAGE-1
-N DOCKER-ISOLATION-STAGE-2
-N DOCKER-USER
-N FABEDGE-FORWARD
-A INPUT -d 169.254.25.10/32 -p udp -m udp --dport 53 -j ACCEPT
-A INPUT -d 169.254.25.10/32 -p tcp -m tcp --dport 53 -j ACCEPT
-A FORWARD -j DOCKER-USER
-A FORWARD -j DOCKER-ISOLATION-STAGE-1
-A FORWARD -o docker0 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -o docker0 -j DOCKER
-A FORWARD -i docker0 ! -o docker0 -j ACCEPT
-A FORWARD -i docker0 -o docker0 -j ACCEPT
-A FORWARD -j FABEDGE-FORWARD
-A OUTPUT -s 169.254.25.10/32 -p udp -m udp --sport 53 -j ACCEPT
-A OUTPUT -s 169.254.25.10/32 -p tcp -m tcp --sport 53 -j ACCEPT
-A DOCKER-ISOLATION-STAGE-1 -i docker0 ! -o docker0 -j DOCKER-ISOLATION-STAGE-2
-A DOCKER-ISOLATION-STAGE-1 -j RETURN
-A DOCKER-ISOLATION-STAGE-2 -o docker0 -j DROP
-A DOCKER-ISOLATION-STAGE-2 -j RETURN
-A DOCKER-USER -j RETURN
-A FABEDGE-FORWARD -s 10.234.67.0/24 -j ACCEPT
-A FABEDGE-FORWARD -d 10.234.67.0/24 -j ACCEPT
[root@edge1 ~]# iptables -t nat -S 
-P PREROUTING ACCEPT
-P INPUT ACCEPT
-P OUTPUT ACCEPT
-P POSTROUTING ACCEPT
-N DOCKER
-N FABEDGE-NAT-OUTGOING
-A PREROUTING -m addrtype --dst-type LOCAL -j DOCKER
-A OUTPUT ! -d 127.0.0.0/8 -m addrtype --dst-type LOCAL -j DOCKER
-A POSTROUTING -j FABEDGE-NAT-OUTGOING
-A DOCKER -i docker0 -j RETURN
-A FABEDGE-NAT-OUTGOING -s 10.234.67.0/24 -m set --match-set FABEDGE-PEER-CIDR dst -j RETURN
-A FABEDGE-NAT-OUTGOING -s 10.234.67.0/24 -d 10.234.67.0/24 -j RETURN
-A FABEDGE-NAT-OUTGOING -s 10.234.67.0/24 -j MASQUERADE
[root@edge1 ~]# ipset list 
Name: FABEDGE-PEER-CIDR
Type: hash:net
Revision: 6
Header: family inet hashsize 1024 maxelem 65536
Size in memory: 952
References: 1
Number of entries: 9
Members:
10.234.66.0/24
10.234.0.0/18
10.22.48.28
10.22.48.16
10.22.48.34
10.234.64.0/24
10.234.68.0/24
10.234.65.0/24
10.22.48.17
  • Looks like a kind of NAT hairpin and asymmetric routing problem: client is in server's LAN, so when the server replies, it replies directly without being routed (by the "IPVS agent"?) and the reply is left unchanged with the wrong source. Of course in Docker context everything is more complex. Normally Docker uses br_netfilter which should also allow bridged traffic to be NATed but it looks like for some reason this doesn't apply here. Hope this comment gives you some clues where to continue because I have no more clue. – A.B Jul 26 '22 at 06:51
  • @A.B Thanks for your reply, I have fixed the problem. – Jianbo Yan Jul 26 '22 at 09:53

1 Answer

It turned out that a kernel parameter named net.bridge.bridge-nf-call-iptables had been changed to 0 after the restart. After resetting it to 1, the problem was gone.
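
A minimal sketch of how to verify and fix it (the br_netfilter module must be loaded for this sysctl to exist):

lsmod | grep br_netfilter                        # check the module is loaded
modprobe br_netfilter                            # load it if missing
sysctl net.bridge.bridge-nf-call-iptables        # shows 0 on the broken node
sysctl -w net.bridge.bridge-nf-call-iptables=1   # re-enable iptables for bridged traffic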

I should mention that there is a bridge device named br-fabedge; packets from net-tool-edge1 and nginx-edge1 pass through it.

When node edge1 performs DNAT, the kernel saves some info in the conntrack table, and when the response packet returns, the kernel rewrites its source address according to that table. But this happens at layer 3.

However, when nginx-edge1 sends its response packet through br-fabedge, that happens at layer 2, where the kernel doesn't consult the conntrack table.

Setting bridge-nf-call-iptables=1 makes traffic crossing br-fabedge pass through netfilter, so the conntrack table is consulted and the source address is rewritten.
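
To make this survive the next reboot, the setting can be persisted; a sketch (the file names here are just conventions):

# /etc/modules-load.d/br_netfilter.conf
br_netfilter

# /etc/sysctl.d/99-bridge-nf.conf
net.bridge.bridge-nf-call-iptables = 1

# then apply without rebooting:
sysctl --system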

My explanation may not be precise; take it as a tip.
