Where Did My Packet Go? A Short Journey Into iptables

When building virtual networks, or in other scenarios where connectivity breaks somewhere inside a Linux host, it is easy to get stuck: to a developer focused on the application layer, the kernel protocol stack is a black box. You cannot see what is actually happening inside it, so why are packets being dropped?

Of course, most of the time the cause is a mistake in our own usage or configuration. But by looking down from the upper layers, and then using the lower layers to cross-check the upper ones, problems can be found earlier and solved faster.

This article works through a concrete example: an iptables policy problem in a virtual-network setup, and the process of tracing it back and cross-checking it with tooling. Consider it a slightly deeper journey into iptables.

System environment

root@ubuntu:~# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04.3 LTS
Release: 16.04
Codename: xenial

System image

ubuntu-16.04.3-server-amd64.iso

Network topology:

[Figure: network topology — namespaces net0 and net1 connected through bridge br0]

  • 1.1.1.1 and 1.1.1.2 are the addresses inside namespaces net0 and net1
  • net0 and net1 are connected through the Linux bridge br0

Setup script for the basic topology

brctl addbr br0 
ifconfig br0 1.1.2.254/24 up
ip addr add 1.1.1.254/24 dev br0

ip link add net0_eth0 type veth peer name tap0
ip netns add net0
ip link set dev net0_eth0 netns net0
ip netns exec net0 ip link set dev net0_eth0 name eth0
ip netns exec net0 ip addr add 1.1.1.1/24 dev eth0
ip netns exec net0 ip link set dev eth0 up
ip link set dev tap0 master br0
ip link set dev tap0 up


ip link add net1_eth0 type veth peer name tap1
ip netns add net1
ip link set dev net1_eth0 netns net1
ip netns exec net1 ip link set dev net1_eth0 name eth0
ip netns exec net1 ip addr add 1.1.1.2/24 dev eth0
ip netns exec net1 ip link set dev eth0 up
ip link set dev tap1 master br0
ip link set dev tap1 up
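For completeness, a teardown sketch for the same topology (assuming the names used above, and root):

```shell
# Tear the demo topology back down: deleting a namespace also removes the
# veth end inside it, which destroys its peer tap on the host side.
ip netns del net0
ip netns del net1
ip link set dev br0 down
brctl delbr br0
```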

Initial iptables rules

iptables -A INPUT -p icmp -m physdev --physdev-in tap0 -j LOG

iptables -A FORWARD -p icmp -m physdev --physdev-in tap0 --physdev-out tap1 -j LOG

iptables -A FORWARD -p icmp -m physdev --physdev-is-out

iptables -A FORWARD -p icmp -m physdev --physdev-is-out --physdev-is-in --physdev-is-bridged
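Note that the last two rules have no -j target at all. That is legal: a rule without a target does nothing except increment its match counters, which makes such rules handy probes. The counters can be read back like this:

```shell
# Show per-rule packet/byte counters in the FORWARD chain,
# numbered, without reverse-resolving addresses.
iptables -L FORWARD -v -n --line-numbers
```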

Verifying the environment

root@ubuntu:~# ip netns
net1 (id: 1)
net0 (id: 0)
root@ubuntu:~# ip netns exec net0 bash
root@ubuntu:~# ip r
1.1.1.0/24 dev eth0 proto kernel scope link src 1.1.1.1
root@ubuntu:~# ping 1.1.1.2
PING 1.1.1.2 (1.1.1.2) 56(84) bytes of data.
64 bytes from 1.1.1.2: icmp_seq=8 ttl=64 time=0.061 ms
64 bytes from 1.1.1.2: icmp_seq=9 ttl=64 time=0.055 ms
64 bytes from 1.1.1.2: icmp_seq=10 ttl=64 time=0.057 ms
64 bytes from 1.1.1.2: icmp_seq=11 ttl=64 time=0.057 ms

With the environment up, pinging net1's address from net0 works.

At this point the basic environment is complete. Then I did something (in fact, I ran iptables -P FORWARD DROP on the host), and the setup above could no longer connect. What happened?! Time to start locating the problem.

Troubleshooting

Step 1: ping net1 from net0 — no reply. Check the ARP table:

root@ubuntu:~# ip netns exec net0 arp -n
Address HWtype HWaddress Flags Mask Iface
1.1.1.2 ether 26:4c:48:39:09:21 C eth0
root@ubuntu:~#

An ARP entry is learned, which means ARP packets flow in both directions, yet ping still fails.

ARP works at the link layer, while IP and ICMP sit above it, so there are two possibilities for the ICMP failure:

1: the peer does not respond to ping

2: something in between is blocking the traffic

Possibility 1 is clearly unlikely: everything worked at first, and nobody had configured the peer to ignore ping.

Possibility 2, on the other hand, is entirely plausible.
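(For the record, possibility 1 does exist as a kernel knob; had the peer been configured like this, it would silently ignore echo requests. A sketch, assuming a Linux peer and root:)

```shell
# Make this host ignore all ICMP echo requests (possibility 1 above)...
sysctl -w net.ipv4.icmp_echo_ignore_all=1
# ...and undo it:
sysctl -w net.ipv4.icmp_echo_ignore_all=0
```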

Step 2: verify possibility 2

[Figure: capture points 1, 2, 3, 4 along the path net0 eth0 → tap0 → br0 → tap1 → net1 eth0]

Points 1, 2, 3 and 4 above are the four packet-capture points.

Capture at point 4

root@ubuntu:~# ip netns exec net1 tcpdump -i eth0 -ne
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
^C13:54:32.427376 c2:a3:10:45:e4:65 > 26:4c:48:39:09:21, ethertype ARP (0x0806), length 42: Request who-has 1.1.1.2 tell 1.1.1.1, length 28
13:54:32.427392 26:4c:48:39:09:21 > c2:a3:10:45:e4:65, ethertype ARP (0x0806), length 42: Reply 1.1.1.2 is-at 26:4c:48:39:09:21, length 28

ARP seen, but no ICMP

Capture at point 2

root@ubuntu:~/drop_watch/src# tcpdump -i tap0 -ne
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on tap0, link-type EN10MB (Ethernet), capture size 262144 bytes
14:05:55.104561 c2:a3:10:45:e4:65 > 26:4c:48:39:09:21, ethertype IPv4 (0x0800), length 98: 1.1.1.1 > 1.1.1.2: ICMP echo request, id 5357, seq 1, length 64
14:05:56.103918 c2:a3:10:45:e4:65 > 26:4c:48:39:09:21, ethertype IPv4 (0x0800), length 98: 1.1.1.1 > 1.1.1.2: ICMP echo request, id 5357, seq 2, length 64
14:05:57.103702 c2:a3:10:45:e4:65 > 26:4c:48:39:09:21, ethertype IPv4 (0x0800), length 98: 1.1.1.1 > 1.1.1.2: ICMP echo request, id 5357, seq 3, length 64
14:05:59.454076 c2:a3:10:45:e4:65 > 26:4c:48:39:09:21, ethertype IPv4 (0x0800), length 98: 1.1.1.1 > 1.1.1.2: ICMP echo request, id 5358, seq 1, length 64
14:06:00.461847 c2:a3:10:45:e4:65 > 26:4c:48:39:09:21, ethertype IPv4 (0x0800), length 98: 1.1.1.1 > 1.1.1.2: ICMP echo request, id 5358, seq 2, length 64
14:06:09.373442 c2:a3:10:45:e4:65 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 1.1.1.2 tell 1.1.1.1, length 28
14:06:09.373461 26:4c:48:39:09:21 > c2:a3:10:45:e4:65, ethertype ARP (0x0806), length 42: Reply 1.1.1.2 is-at 26:4c:48:39:09:21, length 28
14:06:09.373463 c2:a3:10:45:e4:65 > 26:4c:48:39:09:21, ethertype IPv4 (0x0800), length 98: 1.1.1.1 > 1.1.1.2: ICMP echo request, id 5361, seq 1, length 64
14:06:10.381635 c2:a3:10:45:e4:65 > 26:4c:48:39:09:21, ethertype IPv4 (0x0800), length 98: 1.1.1.1 > 1.1.1.2: ICMP echo request, id 5361, seq 2, length 64

Both ICMP and ARP seen

Capture at point 3

root@ubuntu:~/drop_watch/src# tcpdump -i tap1 -ne
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on tap1, link-type EN10MB (Ethernet), capture size 262144 bytes
14:05:25.067625 c2:a3:10:45:e4:65 > 26:4c:48:39:09:21, ethertype ARP (0x0806), length 42: Request who-has 1.1.1.2 tell 1.1.1.1, length 28
14:05:25.067636 26:4c:48:39:09:21 > c2:a3:10:45:e4:65, ethertype ARP (0x0806), length 42: Reply 1.1.1.2 is-at 26:4c:48:39:09:21, length 28

ARP seen, but no ICMP

Conclusion so far: packets are being dropped while traversing br0, quite possibly by the iptables forwarding policy.

Check the forwarding rules in effect on br0:

iptables -nL -v

The FORWARD chain is dropping packets, and the drops track the ping traffic exactly: stop the ping and the drop counter stops increasing.

Chain FORWARD (policy DROP 813 packets, 68292 bytes)
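The drop count in that policy line can be pulled out with a quick sed, handy for scripting or watch loops; the sample line below stands in for live iptables -nL -v output, so this runs without root:

```shell
# Extract the packet counter from a FORWARD policy line.
line='Chain FORWARD (policy DROP 813 packets, 68292 bytes)'
pkts=$(printf '%s\n' "$line" | sed -n 's/.*policy DROP \([0-9][0-9]*\) packets.*/\1/p')
echo "dropped: $pkts"   # prints "dropped: 813"
```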

policy DROP is the key phrase: the chain's default policy is to drop. Check the full ruleset with iptables -S:

root@ubuntu:~/drop_watch/src# iptables -S
-P INPUT ACCEPT
-P FORWARD DROP
-P OUTPUT ACCEPT
-A INPUT -p icmp -m physdev --physdev-in tap0 -j LOG
-A FORWARD -p icmp -m physdev --physdev-in tap0 --physdev-out tap1 -j LOG
-A FORWARD -p icmp -m physdev --physdev-is-out
-A FORWARD -p icmp -m physdev --physdev-is-in --physdev-is-out --physdev-is-bridged
root@ubuntu:~/drop_watch/src#

As the output shows, -P FORWARD DROP sets the forward chain's default policy to drop: any packet not explicitly accepted by a rule is discarded. But I only understood that after finding the root cause. Before that, my thinking went: good grief, there is no rule anywhere that drops packets, yet packets are being dropped — and surely with no rules configured, traffic should pass by default! On top of that, the real environment was complex enough to invite over-thinking: is a driver missing? ... and so on.
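Once the cause is understood, the fix is either to restore the default (iptables -P FORWARD ACCEPT) or, better, to keep the default-deny and explicitly allow the bridged traffic. A sketch, assuming the tap0/tap1 names above and root:

```shell
# Keep -P FORWARD DROP, but accept traffic bridged between the two taps,
# in both directions.
iptables -A FORWARD -m physdev --physdev-in tap0 --physdev-out tap1 -j ACCEPT
iptables -A FORWARD -m physdev --physdev-in tap1 --physdev-out tap0 -j ACCEPT
```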

The investigation could have ended here, but in the real environment this is how I actually flailed around:

Is some driver or configuration option missing? Nothing else seems to be blocking the traffic — is it a kernel problem? What did I do?

Use dropwatch to inspect the drop point and call stack, and confirm where the packet is dropped

apt-get install -y libnl-3-dev libnl-genl-3-dev binutils-dev libreadline6-dev gcc
git clone https://github.com/pavel-odintsov/drop_watch
cd drop_watch/src
make

With the ping still running, start dropwatch:

root@ubuntu:~/drop_watch/src# ./dropwatch -l kas
Initalizing kallsyms db
dropwatch> start
Enabling monitoring...
Kernel monitoring activated.
Issue Ctrl-C to stop monitoring
1 drops at nf_hook_slow+b0 (0xffffffff8176d2e0)
1 drops at nf_hook_slow+b0 (0xffffffff8176d2e0)
1 drops at nf_hook_slow+b0 (0xffffffff8176d2e0)
1 drops at nf_hook_slow+b0 (0xffffffff8176d2e0)
1 drops at nf_hook_slow+b0 (0xffffffff8176d2e0)
1 drops at nf_hook_slow+b0 (0xffffffff8176d2e0)
1 drops at nf_hook_slow+b0 (0xffffffff8176d2e0)
1 drops at nf_hook_slow+b0 (0xffffffff8176d2e0)
1 drops at nf_hook_slow+b0 (0xffffffff8176d2e0)
1 drops at nf_hook_slow+b0 (0xffffffff8176d2e0)
1 drops at nf_hook_slow+b0 (0xffffffff8176d2e0)
1 drops at nf_hook_slow+b0 (0xffffffff8176d2e0)
2 drops at nf_hook_slow+b0 (0xffffffff8176d2e0)
2 drops at nf_hook_slow+b0 (0xffffffff8176d2e0)
^CGot a stop message
dropwatch> Terminating dropwatch...
Shutting down ...

This pins the drop down to nf_hook_slow, which is part of the netfilter forward-hook processing (if unsure, check the kernel source or print a fuller call stack). So the packet is being dropped by a hook policy — no missing driver; I was overthinking it. That turned the investigation back to iptables, and eventually to the default drop policy, -P FORWARD DROP.
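As an aside, the +b0 in nf_hook_slow+b0 is just the hex offset of the drop address from the symbol's start, which dropwatch resolves via kallsyms. The arithmetic is plain shell; the symbol start address below is a sample value, not taken from a real kernel:

```shell
# Reproduce dropwatch's "symbol+offset" formatting by hand.
drop_addr=0xffffffff8176d2e0   # the "location=" address from the trace
sym_start=0xffffffff8176d230   # start of nf_hook_slow (sample value)
printf 'nf_hook_slow+%x\n' $(( drop_addr - sym_start ))   # prints "nf_hook_slow+b0"
```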

One more note: if you want an even more detailed call stack at the drop point, the method below works (systemtap can do this too, but is not covered here).

Use perf to locate the drop point in the protocol stack and print the call stack

Installation

apt install linux-tools-common -y
apt install linux-tools-4.4.0-87-generic linux-cloud-tools-generic -y

Usage:

Step 1: record:

sudo perf record -g -a -e skb:kfree_skb

Step 2: analyze:

sudo perf script

Example:


root@ubuntu:~/drop_watch/src# sudo perf record -g -a -e skb:kfree_skb

^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.143 MB perf.data (12 samples) ]

root@ubuntu:~/drop_watch/src# sudo perf script
ping 1811 [000] 301.864213: skb:kfree_skb: skbaddr=0xffff8800d59bb400 protocol=2048 location=0xffffffff8176d2e0
921b1a kfree_skb (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic)
96d2e0 nf_hook_slow (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic)
f7c br_nf_forward_ip ([br_netfilter])
96d212 nf_iterate (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic)
96d2a3 nf_hook_slow (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic)
3264 __br_forward ([bridge])
3797 br_forward ([bridge])
48d0 br_handle_frame_finish ([bridge])
347 NF_HOOK_THRESH ([br_netfilter])
1239 br_nf_pre_routing_finish ([br_netfilter])
1fd1 br_nf_pre_routing ([br_netfilter])
96d212 nf_iterate (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic)
96d2a3 nf_hook_slow (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic)
4d2a br_handle_frame ([bridge])
936424 __netif_receive_skb_core (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic)
936b38 __netif_receive_skb (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic)
937938 process_backlog (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic)
93707e net_rx_action (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic)
285e11 __do_softirq (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic)
a43bcc do_softirq_own_stack (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic)
285858 do_softirq.part.19 (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic)
2858dd __local_bh_enable_ip (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic)
978c89 ip_finish_output2 (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic)
979c16 ip_finish_output (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic)
97a61e ip_output (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic)
979de5 ip_local_out (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic)
97afe9 ip_send_skb (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic)
97b043 ip_push_pending_frames (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic)
9a16b3 raw_sendmsg (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic)
9b14c5 inet_sendmsg (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic)
919ad8 sock_sendmsg (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic)
91a581 ___sys_sendmsg (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic)
91aed1 __sys_sendmsg (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic)
91af22 sys_sendmsg (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic)
a41eb2 entry_SYSCALL_64_fastpath (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic)
108490 sendmsg (/lib/x86_64-linux-gnu/libc-2.23.so)
0 [unknown] ([unknown])

ping 1811 [000] 302.863683: skb:kfree_skb: skbaddr=0xffff8800d59bb400 protocol=2048 location=0xffffffff8176d2e0
921b1a kfree_skb (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic)
96d2e0 nf_hook_slow (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic)
f7c br_nf_forward_ip ([br_netfilter])
96d212 nf_iterate (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic)
96d2a3 nf_hook_slow (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic)
3264 __br_forward ([bridge])
3797 br_forward ([bridge])
48d0 br_handle_frame_finish ([bridge])
347 NF_HOOK_THRESH ([br_netfilter])
1239 br_nf_pre_routing_finish ([br_netfilter])
1fd1 br_nf_pre_routing ([br_netfilter])
96d212 nf_iterate (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic)
96d2a3 nf_hook_slow (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic)
4d2a br_handle_frame ([bridge])
936424 __netif_receive_skb_core (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic)
936b38 __netif_receive_skb (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic)
937938 process_backlog (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic)
93707e net_rx_action (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic)
285e11 __do_softirq (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic)
a43bcc do_softirq_own_stack (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic)
285858 do_softirq.part.19 (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic)
root@ubuntu:~/drop_watch/src#