Large TCP backlog for uwsgi, but no visible connections

Question

In a setup with three docker instances, one running haproxy and two others running a flask-based python application through uWsgi, we run into a situation after about a day where no new connections are accepted on one or both instances.

uWsgi is set up to accept up to 100 backlogged connections. This is less than the default 128 configured in /proc/sys/net/core/somaxconn. uWsgi gives up on the 101st connection.

ss confirms that there is a backlog of 101.

root@ad9380a94c50:/# ss -nlpt
State      Recv-Q Send-Q        Local Address:Port          Peer Address:Port 
LISTEN     101    100                       *:8080                     *:*      users:(("uwsgi",pid=25,fd=3),("uwsgi",pid=19,fd=3))
LISTEN     0      128              127.0.0.11:38230                    *:*

However, no corresponding connections show up when running, for example, netstat -npt.

The source code for uwsgi shows that the backlog queue length is obtained by calling getsockopt and retrieving the tcpi_unacked field. In other words, this does not appear to be a bug in uwsgi; the linux kernel and/or docker seems to believe there are connections that aren't really there. I suspect they did exist at one point, in the form of health checks made by haproxy.
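For readers who want to inspect the same counter from Python, here is a minimal sketch. It assumes the Linux `struct tcp_info` layout (eight one-byte fields followed by 32-bit fields, which puts tcpi_unacked at byte offset 24) and relies on Linux reporting the current accept-queue length in tcpi_unacked and the configured limit in tcpi_sacked for a socket in the LISTEN state. The helper names `parse_listen_queue` and `listen_queue` are made up for illustration and are not part of uwsgi.

```python
import socket
import struct
from typing import Tuple

# Assumption: Linux struct tcp_info layout. Eight u8 fields come first,
# then u32 fields rto, ato, snd_mss, rcv_mss, unacked, sacked, ...
# so tcpi_unacked starts at byte 8 + 4 * 4 = 24.
_UNACKED_OFFSET = 24
_TCP_INFO_BUFLEN = 104  # enough to cover the fields read here


def parse_listen_queue(raw: bytes) -> Tuple[int, int]:
    """Return (tcpi_unacked, tcpi_sacked) from a raw tcp_info buffer."""
    unacked, sacked = struct.unpack_from("=II", raw, _UNACKED_OFFSET)
    return unacked, sacked


def listen_queue(sock: socket.socket) -> Tuple[int, int]:
    """Query a *listening* TCP socket; Linux only (needs socket.TCP_INFO)."""
    raw = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_INFO, _TCP_INFO_BUFLEN)
    return parse_listen_queue(raw)


if __name__ == "__main__":
    # Synthetic buffer mimicking the ss output above (Recv-Q 101, Send-Q 100).
    fake = bytes(24) + struct.pack("=II", 101, 100) + bytes(72)
    print(parse_listen_queue(fake))
```

On a live system, `listen_queue` would have to be called on the listening socket itself (for example from inside the server process), since the counters belong to that socket and not to any client connection.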

The backlog does not build up slowly. While the instance is healthy, the backlog stays at zero. It seems that something goes wrong, after which the backlog runs up to 100 very quickly and things hang.

We're running docker on an Amazon virtual machine.

Do you see anything out of the ordinary in `/proc/25/net/ip_conntrack` and `/proc/19/net/ip_conntrack`? — Josip Rodin, Jun 28 '16 at 15:43
I'll check tomorrow. After I killed the process with a KILL signal, the backlog disappeared. So I think the process really is holding on to the backlog by hanging, which indicates some kind of problem in uwsgi or flask. — izak, Jun 28 '16 at 22:25

izak · Accepted Answer · 2016-06-30T13:13:17.240

The process is blocked on a mutex (a futex, in linux speak). So the backlog is legit: we're literally stuck behind a system call, and although the connections are going away, nothing is updating the queue.

For future reference to others finding this question, this was the breakthrough command:

# strace -p 5340
Process 5340 attached
futex(0x223cee0, FUTEX_WAIT_PRIVATE, 0, NULL

So there is some kind of deadlock, and I now only have to figure out which process is using mutexes. gdb eventually gave me that information:

(gdb) bt
#0  sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85
#1  0x00007f0ecc982068 in PyThread_acquire_lock () from /usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0
#2  0x00007f0ecd037b29 in gil_real_get () from /usr/lib/uwsgi/plugins/python_plugin.so
#3  0x00007f0ecd030167 in uwsgi_python_master_fixup () from /usr/lib/uwsgi/plugins/python_plugin.so
#4  0x000000000042cb66 in uwsgi_respawn_worker ()
#5  0x000000000042b38f in master_loop ()
#6  0x000000000046741e in uwsgi_run ()
#7  0x000000000041698e in main ()

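So this is some kind of deadlock in attempting to acquire the global interpreter lock (GIL): the backtrace shows the master waiting in gil_real_get, called from uwsgi_python_master_fixup while respawning a worker.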

Edit 2: The plot thickens. This is almost exactly the same problem as this guy had, except ours is with RabbitMQ rather than MongoDB. Running a second thread causes problems during reloading: sometimes the GIL is not released, and the process then hangs when trying to re-acquire it.

Basically, we're doing something we shouldn't be doing and just need to rethink the whole thing.
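The general hazard behind Edit 2 (a lock held by a second thread at the moment a process forks) can be reproduced in plain Python. The sketch below is only an analogy, not the uwsgi code path: an ordinary `threading.Lock` stands in for the GIL, `os.fork` stands in for the worker respawn, and the function name `child_sees_lock_held` is made up for illustration. It assumes a POSIX system.

```python
import os
import threading
import time


def child_sees_lock_held() -> bool:
    """Fork while another thread holds a lock; report whether the child is stuck."""
    lock = threading.Lock()
    holding = threading.Event()

    def holder() -> None:
        # Second thread: take the lock and keep it while the fork happens.
        with lock:
            holding.set()
            time.sleep(0.5)

    thread = threading.Thread(target=holder)
    thread.start()
    holding.wait()  # make sure the lock is held before forking

    pid = os.fork()
    if pid == 0:
        # Child: only the forking thread was copied. The holder thread does
        # not exist here, but the lock's "locked" state was copied with it.
        code = 2
        try:
            got = lock.acquire(timeout=1.0)
            code = 0 if got else 1
        finally:
            os._exit(code)

    _, status = os.waitpid(pid, 0)
    thread.join()
    # Exit code 1 means the child timed out waiting for the lock.
    return os.WEXITSTATUS(status) == 1


if __name__ == "__main__":
    print("child could not acquire the lock:", child_sees_lock_held())
```

Without the timeout, the child would wait forever, which is the same shape as the futex wait seen in the strace output. Starting background threads only after the fork is one common way to avoid this class of problem.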
