nf_conntrack_sip does not work SOMETIMES, restarting iptables USUALLY fixes it

Question

I'm trying to use nf_conntrack_sip on a box that runs Asterisk itself, that is, the box is not routing traffic for another VoIP box. The setup works until I reboot. After a reboot nf_conntrack_sip ALMOST always stops working and media traffic is dropped.

conntrack --dump | grep -E 'sip|helper'
# No output matching 'sip' nor 'helper' while a call is in progress (albeit no audio)

The iptables rules are loaded correctly, as confirmed by iptables-save.

Then I run systemctl restart iptables, and 9 times out of 10 that fixes it. If it does not, I repeat the iptables restart.

conntrack --dump | grep -E 'sip|helper'
conntrack v1.4.4 (conntrack-tools): 9 flow entries have been shown.
udp      17 3597 src=10.7.0.38 dst=10.47.1.11 sport=5063 dport=5060 src=10.47.1.11 dst=10.7.0.38 sport=5060 dport=5063 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 helper=sip use=2

Simply reloading the rules with iptables-restore < /etc/sysconfig/iptables does not help. I suspect unloading/loading conntrack or some modules does the trick.

Occasionally it does work at boot, but that is very rare. Asterisk starts quickly. Giving it more time to "finish starting something" does not help.

Update: restarting iptables while nf_conntrack_sip is working as expected can, rarely, break it.

The setup:

Update: Initially the problem was described as occurring on a VM, but since then I have reinstalled onto real hardware (i5-6500 CPU @ 3.20GHz with 8 GB RAM) and exactly the same problem still occurs. All packages are identical (same provisioning script) to those on the initial VM.

The OS is CentOS-7.4 Minimal + updates, kernel 3.10.0-693.21.1.el7.x86_64. It is all installed from RPMs, no custom kernels nor modules. Update: I also did yum update to latest stable packages and kernel available from CentOS at 2018-08-10. The problem persists.

I did yum autoremove firewalld and yum install iptables-services.

Diffs to /etc/sysconfig/iptables-config (other values are defaults by RPM)

-IPTABLES_MODULES=""
+IPTABLES_MODULES="nf_conntrack_sip"

Added file /etc/modprobe.d/nf_conntrack.conf:

options nf_conntrack nf_conntrack_helper=0

The entire /etc/sysconfig/iptables is very simple:

*raw
-A PREROUTING -p udp --dport 5060 -j CT --helper sip
COMMIT

*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
-A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
-A INPUT -p icmp -j ACCEPT
-A INPUT -i lo -j ACCEPT
-A INPUT -p tcp -m state --state NEW -m tcp --dport 22 -j ACCEPT
-A INPUT -p udp -m state --state NEW -m udp --dport 5060 -j ACCEPT
-A INPUT -j LOG --log-level 7 --log-prefix "REJECT in filter.INPUT:"
-A INPUT -j REJECT --reject-with icmp-host-prohibited
-A FORWARD -j REJECT --reject-with icmp-host-prohibited
COMMIT

Update: Setting the module option options nf_conntrack nf_conntrack_helper=1 and NOT using the iptables rule ... -j CT --helper sip does NOT fix it; the behavior remains non-deterministic.

Not relevant to the problem itself, but to confirm that packets are dropped (as opposed to NAT issues), I added /etc/rsyslog.d/kern-debug.conf:

kern.=debug /var/log/kernel-debug

I am testing with a Cisco SPA504G phone that dials into the PBX and gets hold music. I am not trying to do anything complicated with media. SIP signalling and media are exchanged with the same IPv4 address. The test call is only between the phone and the PBX. No other parties are involved.

My attempt to diagnose it:

I've made a short script that captures the state of various things before and after the fix (restarting iptables), so that the two captures can be compared with diff. The script:

for f in $( find /proc/sys/net/netfilter -type f ); do
  echo f=${f}
  cat "${f}"
done

echo cat /sys/module/nf_conntrack/parameters/*
cat /sys/module/nf_conntrack/parameters/*

echo ls /sys/module/nf_conntrack/holders/
ls /sys/module/nf_conntrack/holders/

echo cat /sys/module/nf_conntrack_sip/parameters/*
cat /sys/module/nf_conntrack_sip/parameters/*
echo ls /sys/module/nf_conntrack_sip/holders/
ls /sys/module/nf_conntrack_sip/holders/

echo ls /sys/module/{ip,nf_}*/holders/
ls /sys/module/{ip,nf_}*/holders/

echo sysctl -a
sysctl -a

echo lsmod
lsmod

echo iptables-save
iptables-save
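
To compare the two captures, one option is to run the script above once while the call has no audio and once after the restart, then diff the results. This is only a minimal sketch: `capture_state` and `compare_states` are illustrative helper names, and `./dump-state.sh` is a hypothetical file name for the script above.

```shell
#!/bin/sh
# capture_state OUTFILE CMD [ARGS...]
# Run CMD and store its stdout and stderr in OUTFILE.
capture_state() {
  out=$1
  shift
  "$@" >"$out" 2>&1
}

# compare_states BEFORE AFTER
# Print a unified diff; exit status 0 means the two captures are identical.
compare_states() {
  diff -u "$1" "$2"
}

# Example, as root (./dump-state.sh is a hypothetical name for the script above):
#   capture_state /tmp/state-broken.txt ./dump-state.sh
#   systemctl restart iptables
#   capture_state /tmp/state-working.txt ./dump-state.sh
#   compare_states /tmp/state-broken.txt /tmp/state-working.txt
```

Counters and timeouts under /proc change between runs, so some noise in the diff is expected.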

The only thing I notice is that OFTEN module nf_conntrack_netlink IS listed as loaded after the boot, while there is a problem. Sometimes it is NOT listed by lsmod AFTER the boot but there is still the problem. After restarting iptables it is, to the best of my knowledge, never listed as loaded. I suspect it is unrelated because there is no direct link between it being loaded and the problem manifesting.

Have a look at your system logs. Do you have a message about exhausted nf_conntrack connections? — Rui F Ribeiro, Aug 08 '18 at 19:32
@Rui F Ribeiro I was running `tail -f /var/log/messages` and nothing like that came up. I suspect what you suggest would happen on a busy system or under a DoS attack. This is an isolated test network, and the fault appears right after boot, when conntrack --dump lists 12 entries on average. I will check again and post the result. Thanks. — AnyDev, Aug 08 '18 at 22:38
@Rui F Ribeiro - no, there is nothing in the logs about reaching nf_conntrack connections limits. I'm reaching conntrack connection counts in the 20-40 range, with max being 65536. — AnyDev, Aug 09 '18 at 01:23
No, I did not install VMWare tools. I set up this test in VMWare because I was getting strange behaviour (probably a different problem) in LXC, so I wanted to take LXC out of the equation. The goal is to run it either on hardware or in LXC hosted on hardware. — AnyDev, Aug 09 '18 at 01:26
Let us [continue this discussion in chat](https://chat.stackexchange.com/rooms/81372/discussion-between-andrdevek-and-rui-f-ribeiro). — AnyDev, Aug 09 '18 at 04:45
Instead of restarting iptables to fix it when it fails, you should try clearing conntrack entries: if it fixes it, that would pinpoint an issue related to conflict of ports in conntrack. Check `conntrack -L` and `conntrack -D -p udp --dport ...` etc. Not that it would solve anything, but that would be a step in the right direction. The restart does have this effect (iptables.init in CentOS has: `# try to unload remaining netfilter modules used by ipv4 and ipv6` ) — A.B, Aug 20 '18 at 18:35
@A.B Thank you! when I did `conntrack -F` that "fixed" it. I was looking at `conntrack -L | grep sip`, what I needed to be looking for also is "5060". I will have to update the question as it grew too long and lots of text is not relevant. — AnyDev, Aug 23 '18 at 09:03
Also you should try to reproduce the behavior in a non RHEL kernel (ie 4.14 or later). In case some recent bug fix was lost and not backported to RHEL's 3.10 kernel — A.B, Aug 23 '18 at 09:23
example of recent fix [allow duplicate SDP expectations](https://marc.info/?l=netfilter-devel&m=152275069206743&w=2) — A.B, Aug 23 '18 at 10:32
@A.B Your comment about "conflict of ports" seems to be on the right track. I believe I can now reliably reproduce the problem. If, after flushing conntrack, I make a call FROM the box to the phone, it causes the problem and it sticks that way. If, after flushing, I make a call from the phone to the box, then it continues to work fine. After a reboot, Asterisk restarts and "pings" peers based on the last known state, so it makes an OUT connection to them, thus breaking things. It also "pings" phones at an interval. Phones also ping the PBX at an interval, which can explain why it sometimes worked after a reboot. — AnyDev, Aug 23 '18 at 11:03
I believe this has to do with the initial OUT packet not being tagged as SIP but still creating a conntrack entry, which then prevents a SIP helper entry with the same ports from being created. I THINK I just need to fix the iptables ".... -j CT --helper sip" rules to match outgoing packets. — AnyDev, Aug 23 '18 at 11:07
Yey! I just added `iptables -t raw -A OUTPUT -p udp -m udp --sport 5060 -j CT --helper sip` on the top of the other rules listed in the Q and it solved my problem. Thank you @A.B — AnyDev, Aug 23 '18 at 11:10
so it was a missing config. had it been a router and not the asterisk box that would have been enough. I didn't see that. you should add an answer telling about the missing OUTPUT then — A.B, Aug 23 '18 at 11:13

score 3 · Accepted Answer · answered Aug 23 '18 at 15:16

Solution

The solution was simply to mark the outgoing packets to be handled by the conntrack SIP helper too:

iptables -t raw -A OUTPUT -p udp -m udp --sport 5060 -j CT --helper sip
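
To make the rule survive a restart of the iptables service, it presumably also needs to go into the *raw section of /etc/sysconfig/iptables shown in the question. A sketch of that section in iptables-restore format, with the original PREROUTING rule left unchanged and the new OUTPUT rule added next to it:

```
*raw
-A PREROUTING -p udp --dport 5060 -j CT --helper sip
-A OUTPUT -p udp --sport 5060 -j CT --helper sip
COMMIT
```

PREROUTING sees packets arriving at the box, while locally generated packets, such as those sent by Asterisk, pass through OUTPUT instead, so the helper has to be assigned in both chains.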

Cause

The problem was that the firewall rule marked only incoming packets for the conntrack SIP helper:

iptables -t raw -A PREROUTING -p udp -m udp --dport 5060 -j CT --helper sip

When the PBX was the one to send the first packet toward the phone, it would establish a conntrack entry without the SIP helper. That entry then continued to match the SIP conversation without the SIP helper being involved.

[root@test ~]# conntrack -L | grep -E '5060|sip'
conntrack v1.4.4 (conntrack-tools): 13 flow entries have been shown.
udp      17 159 src=10.47.1.11 dst=10.7.0.38 sport=5060 dport=1024 src=10.7.0.38 dst=10.47.1.11 sport=1024 dport=5060 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 use=1

When the phone was the one to send the first packet toward the PBX, it would hit the rule with "-j CT --helper sip" listed above and the conntrack entry would get created with sip helper.

[root@test ~]# conntrack -L | grep -E '5060|sip'
conntrack v1.4.4 (conntrack-tools): 9 flow entries have been shown.
udp      17 3588 src=10.7.0.38 dst=10.47.1.11 sport=1024 dport=5060 src=10.47.1.11 dst=10.7.0.38 sport=5060 dport=1024 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 helper=sip use=2

Note the "helper=sip" towards the end of the entry, and compare it with the lack of it in the first sample.
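
The comparison above can be automated. The following is a minimal sketch, not a standard tool: `sip_entries_without_helper` is an illustrative function name, and it assumes the `conntrack -L` line format shown in the two samples.

```shell
#!/bin/sh
# sip_entries_without_helper [PORT]
# Read `conntrack -L` output on stdin and print every UDP entry that involves
# PORT (default 5060) as a source or destination port but has no "helper=sip".
sip_entries_without_helper() {
  port=${1:-5060}
  awk -v port="$port" '
    $1 == "udp" {
      uses_port = 0
      for (i = 1; i <= NF; i++)
        if ($i == ("sport=" port) || $i == ("dport=" port)) uses_port = 1
      if (uses_port && $0 !~ /(^| )helper=sip( |$)/) print
    }'
}

# Example, as root:
#   conntrack -L 2>/dev/null | sip_entries_without_helper
```

Any line printed while a call is up points at an entry like the first sample, created without the helper.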

The PBX and the phone both send SIP packets to each other periodically to confirm the other is there, so whichever side happened to send first decided whether the entry got the helper. That timing made the problem appear non-deterministic.

Asterisk preserves the state of peers across reboots and probes them after a reboot, so it is far more likely to send the first outgoing packet and thereby create the non-SIP-helper entry in conntrack.
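
Because a helper-less entry keeps matching for as long as it lives, such entries may need to be removed once after the OUTPUT rule is added; `conntrack -F` did that in the comments. A narrower sketch follows, assuming it is acceptable to drop all UDP entries on the SIP port. `drop_stale_sip_entries` and the `DRY_RUN` variable are illustrative names, not part of conntrack-tools.

```shell
#!/bin/sh
# drop_stale_sip_entries [PORT]
# Delete UDP conntrack entries whose original-direction source or destination
# port is PORT (default 5060). This also removes entries that already have the
# helper; they are recreated by the next SIP packet.
# With DRY_RUN=1 the commands are printed instead of executed.
drop_stale_sip_entries() {
  port=${1:-5060}
  for opt in --orig-port-src --orig-port-dst; do
    if [ "${DRY_RUN:-0}" = 1 ]; then
      echo "conntrack -D -p udp $opt $port"
    else
      conntrack -D -p udp "$opt" "$port"
    fi
  done
}

# Example, as root:
#   DRY_RUN=1 drop_stale_sip_entries   # show what would be run
#   drop_stale_sip_entries             # actually delete the entries
```

Running this during an active call would presumably interrupt its media until the next SIP exchange recreates the expectations, so it is best done while the phones are idle.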

Big thank you to user A.B for pointing me in the right direction in the comments.

Remaining mystery

What I cannot explain is why, when I had the modprobe option

options nf_conntrack nf_conntrack_helper=1

it was still getting broken and "fixed" in the same way. I did not spend much time on automatic helper assignment, so maybe I did something wrong. I might update this answer if I find out more. I am not planning to run with automatic helper assignment enabled.
