We have several 90TB servers (Areca RAID-6 partitioned into 10 ext4 partitions)
The application is basically a ring buffer; continuously writing data and deleting old data. As such, each partition is always 100% full (15GB per partition held as headroom).
Now we're seeing the writing application segfault because (I suppose) it cannot write to disk fast enough.
The app segfaults happen about the same time as this error (of which there are several):
Nov 26 11:33:10 localhost kernel: INFO: task sync:30312 blocked for more than 120 seconds.
Nov 26 11:33:10 localhost kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 26 11:33:10 localhost kernel: sync D f63a4ec0 0 30312 6161 0x00000080
Nov 26 11:33:10 localhost kernel: f571fe9c 00000086 d9b29930 f63a4ec0 d9b29930 c18d5ec0 c18d5ec0 cca87419
Nov 26 11:33:10 localhost kernel: 0005c511 c18d5ec0 c18d5ec0 cca7d888 0005c511 c18d5ec0 f63b2ec0 e3abe130
Nov 26 11:33:10 localhost kernel: c107d121 00000001 00000046 00000000 d9b29d52 d9b29930 f3544d00 f3c62800
Nov 26 11:33:10 localhost kernel: Call Trace:
Nov 26 11:33:10 localhost kernel: [<c107d121>] ? try_to_wake_up+0x1d1/0x230
Nov 26 11:33:10 localhost kernel: [<c107d1df>] ? wake_up_process+0x1f/0x40
Nov 26 11:33:10 localhost kernel: [<c1063efe>] ? wake_up_worker+0x1e/0x30
Nov 26 11:33:10 localhost kernel: [<c1065a58>] ? insert_work+0x58/0x90
Nov 26 11:33:10 localhost kernel: [<c154ab53>] schedule+0x23/0x60
Nov 26 11:33:10 localhost kernel: [<c15490e5>] schedule_timeout+0x155/0x1d0
Nov 26 11:33:10 localhost kernel: [<c100dffe>] ? __switch_to+0xee/0x370
Nov 26 11:33:10 localhost kernel: [<c1066371>] ? __queue_delayed_work+0x91/0x150
Nov 26 11:33:10 localhost kernel: [<c154b311>] wait_for_completion+0x71/0xc0
Nov 26 11:33:10 localhost kernel: [<c107d180>] ? try_to_wake_up+0x230/0x230
Nov 26 11:33:10 localhost kernel: [<c118865c>] sync_inodes_sb+0x7c/0xb0
Nov 26 11:33:10 localhost kernel: [<c118dbc5>] sync_inodes_one_sb+0x15/0x20
Nov 26 11:33:10 localhost kernel: [<c1168988>] iterate_supers+0xa8/0xb0
Nov 26 11:33:10 localhost kernel: [<c118dbb0>] ? fdatawrite_one_bdev+0x20/0x20
Nov 26 11:33:10 localhost kernel: [<c118dc01>] sys_sync+0x31/0x80
Nov 26 11:33:10 localhost kernel: [<c15534cd>] sysenter_do_call+0x12/0x12
fstab mounts the partitions as ext4 noauto,rw,users,exec 0 0
System is 32 bit Centos 6.6 with 3.10.80-1 kernel.
Question: Is this some kind of disk corruption problem or is there something I need to tune in Linux or the filesystem to fix this? The application needs to run 24x7, forever...