I have a 2.6.35 PREEMPT kernel running on moderate-speed ARMv7 processor. About once every 100 - 125s, something causes the kernel to fail to process some audio-related drivers sufficiently quickly to avoid underruns. The hold-up are generally in the range 15-30ms, but can be very much longer. It is not clear whether the hold-up is entirely in-kernel or may relate to the scheduling of a user process running with real-time priority (SCHED_RR, 2).
My guess is that there is a (at least one) driver that is not playing nice with preempt.
Some strace output from the user process illustrates some aspect of both normal and abnormal behaviour, although I am not certain how to interpret the various time reports?
Normal case:
0.000518 poll([{fd=10, events=POLLIN|POLLERR|POLLNVAL, revents=POLLIN}], 1, 3415) = 1
0.010202 poll([{fd=10, events=POLLIN|POLLERR|POLLNVAL}, {fd=6, events=POLLOUT|POLLERR|POLLNVAL, revents=POLLOUT}], 2, 3404) = 1
0.000585 poll([{fd=10, events=POLLIN|POLLERR|POLLNVAL}, {fd=6, events=POLLOUT|POLLERR|POLLNVAL, revents=POLLOUT}], 2, 3404) = 1
0.000302 poll([{fd=10, events=POLLIN|POLLERR|POLLNVAL, revents=POLLIN}], 1, 3404) = 1
0.010706 poll([{fd=10, events=POLLIN|POLLERR|POLLNVAL}, {fd=6, events=POLLOUT|POLLERR|POLLNVAL, revents=POLLOUT}], 2, 3393) = 1
0.000480 poll([{fd=10, events=POLLIN|POLLERR|POLLNVAL}, {fd=6, events=POLLOUT|POLLERR|POLLNVAL, revents=POLLOUT}], 2, 3392) = 1
No blocking occurs on the poll for output on fd6 and, when only fd10 is polled for input, a block of around 10ms occurs. This is reflected both in the report of the duration of the system call and the interval between system calls (they are consistent).
Failure case (extreme example):
0.000305 poll([{fd=10, events=POLLIN|POLLERR|POLLNVAL, revents=POLLIN}], 1, 3543) = 1
0.010730 poll([{fd=10, events=POLLIN|POLLERR|POLLNVAL}, {fd=6, events=POLLOUT|POLLERR|POLLNVAL, revents=POLLOUT}], 2, 3533) = 1
0.000475 poll([{fd=10, events=POLLIN|POLLERR|POLLNVAL}, {fd=6, events=POLLOUT|POLLERR|POLLNVAL, revents=POLLOUT}], 2, 3532) = 1
0.000329 poll([{fd=10, events=POLLIN|POLLERR|POLLNVAL, revents=POLLIN}], 1, 3532) = 1
0.953349 poll([{fd=10, events=POLLIN|POLLERR|POLLNVAL}, {fd=6, events=POLLOUT|POLLERR|POLLNVAL, revents=POLLOUT|POLLERR}], 2, 2578) = 1
Notice in this case that, even though the second-last call is recorded as taking 10ms (normal), it is 953ms before the last call.
What tools can I use to track down the culprit?