Last Monday morning, I found that my server couldn't run any command; everything showed "Input/output error". After trying for half an hour, I found that the only command that could be executed was sudo poweroff -f (the -f flag is required, otherwise I get "Input/output error" again).
I then booted the server manually and checked the system log, but found nothing special. I also ran a smartctl test to check whether there was any problem with the hard disk, and it passed without error.
Then this Monday the problem appeared again. I shut down the server and booted it manually, and it looked fine, as if nothing had happened.
Then I used MemTest86 8.2 to test whether the memory stick is OK, and made sure the SATA cable and hard disk are in good condition and firmly connected.
I think maybe the problem is with the OS or the file system? My OS is Debian 8.11. Can you give me some advice? Thank you all!
- Are you low on disk space, especially in /tmp? – ajgringo619 Sep 19 '19 at 03:16
- Hello @ajgringo619, I checked my hard disk usage, and it still has 600+ GB available. – fajin yu Sep 19 '19 at 05:36
- You can try the command `badblocks -nv <device name>` (e.g. `badblocks -v /dev/sda2`). Here the device name (the block device that is mounted on `/`, e.g. `/dev/sda2`) can be found with the command `lsblk`. – ss_iwe Sep 19 '19 at 06:32
- @ajgringo619, if you were running low on space, your programs would get an `ENOSPC` ("No space left on device") error instead. I personally run into this condition from time to time in my small desktop configuration. – xwindows -on strike- Sep 19 '19 at 07:37
2 Answers
I found my server can't run any command, and it shows "input output error"
The error code EIO ("Input/output error") on command launch happens when your filesystem is damaged; or worse, when you are running on faulty storage.
Cross your fingers; either way, be aware that at this point you should NOT try to power on the server unless really necessary.1
The Test
There is one sure-fire way to distinguish between the two root causes: conduct a block-level read scan on the system disk, and watch for kernel messages.
- Boot your system with a GNU/Linux recovery boot disk.
- Change the system to the plain old text console (press Ctrl+Alt+F1); don't use a graphical terminal for this.
- Log in as root.
- Run `dmesg -E` to enable live kernel message display on the console.
- Run `dmesg -n debug` to let low-level kernel messages through.
- Run `blkid` to see which disk contains the system partition. (Note that `blkid` will list partitions; strip the number off the end of the partition path and you will get the disk.)
- Run `time -p dd if=/dev/sda of=/dev/null bs=4M` to conduct an entire-disk read test (please type this carefully). If your system disk is not `/dev/sda`, substitute accordingly.
- Watch the screen (it will take a long while)...
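The "strip the number off the partition path" step is easy to get wrong on NVMe or eMMC systems, where partitions are named like `/dev/nvme0n1p2` rather than `/dev/sda2`. Below is a minimal sketch of that step; the helper name `disk_of_partition` is hypothetical and not part of the original answer, and it only assumes the common `sdXN`, `nvmeXnYpZ` and `mmcblkXpY` naming schemes.

```shell
# Hypothetical helper (not from the original answer): derive the whole-disk
# path from a partition path, as described in the blkid step.
disk_of_partition() {
    case "$1" in
        /dev/nvme*|/dev/mmcblk*)
            # NVMe and eMMC/SD card partitions end in "p<number>" (e.g. /dev/nvme0n1p2).
            printf '%s\n' "$1" | sed -E 's/p[0-9]+$//' ;;
        *)
            # SATA/SCSI partitions end in a bare number (e.g. /dev/sda2).
            printf '%s\n' "$1" | sed -E 's/[0-9]+$//' ;;
    esac
}

# Usage on the recovery system, as root (assumes the system partition is /dev/sda2):
#   disk=$(disk_of_partition /dev/sda2)
#   time -p dd if="$disk" of=/dev/null bs=4M
```

The function only manipulates text, so it is safe to try before touching the disk; the `dd` line is left as a comment because it should be typed deliberately.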
Results
In the best case, where `dd` completes successfully and uneventfully, it is likely a filesystem problem.
- If you are comfortable doing a filesystem check from the boot disk, you can do it now (recommended).
- If you would rather let the system sort it out by itself, reboot (also remove the boot disk), and boot your usual system but with `fsck.mode=force` appended to the end of the kernel command line. (See this question for details)
- Discussing the result of the filesystem check will warrant a different question though.
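If you go the `fsck.mode=force` route, it can help to confirm after booting that the option really reached the kernel. The following is a small sketch under the assumption that the running kernel's command line is read from `/proc/cmdline`; the function name `cmdline_forces_fsck` is made up for illustration.

```shell
# Hypothetical check (not from the original answer): succeed if the given
# kernel command line contains fsck.mode=force as a separate word.
cmdline_forces_fsck() {
    case " $1 " in
        *" fsck.mode=force "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage on the booted system:
#   if cmdline_forces_fsck "$(cat /proc/cmdline)"; then
#       echo "forced fsck was requested for this boot"
#   fi
```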
However, in the worst case, you would see kernel messages like this spewing on the screen:
    ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
    ata2.00: irq_stat 0x40000001
    ata2.00: failed command: READ DMA EXT
    ata2.00: cmd 25/00:08:78:15:c5/00:00:6c:00:00/e0 tag 0 dma 4096 in
             res 51/40:00:78:15:c5/00:00:6c:00:00/e0 Emask 0x9 (media error)
    ata2.00: status: { DRDY ERR }
    ata2.00: error: { UNC }
    ata2.00: configured for UDMA/100
    sd 1:0:0:0: [sda] Unhandled sense code
    sd 1:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
    sd 1:0:0:0: [sda] Sense Key : Medium Error [current] [descriptor]
    Descriptor sense data with sense descriptors (in hex):
            72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
            6c c5 15 78
    sd 1:0:0:0: [sda] Add. Sense: Unrecovered read error - auto reallocate failed
    sd 1:0:0:0: [sda] CDB: Read(10): 28 00 6c c5 15 78 00 00 08 00
    end_request: I/O error, dev sda, sector 1824855416
    Buffer I/O error on device sda, logical block 228106927
    ata2: EH complete

Look for the key parts:
- `DRDY`, `ERR` and `UNC` in braces
- `Medium Error` status
- `Unrecovered read error` sense message
If you glance at the messages and find any of these (even once), they show that you are facing a physical disk error.
When this is the case, don't let `dd` finish: press Ctrl+C to stop, NOW; shut down your system, and bring your disk to a data recovery shop you trust.

If you did not find the above worst-case telltales, and instead found this kind of kernel message repeated:
    ata2: exception Emask 0x10 SAct 0x0 SErr 0x4040000 action 0xe frozen
    ata2: irq_stat 0x00000040, connection status changed
    ata2: SError: { CommWake DevExch }
    ata2: hard resetting link
    ata2: link is slow to respond, please be patient (ready=0)

Key parts:
- `hard resetting link`
- `link is slow to respond`
Then you are more likely facing a SATA link problem (e.g. bad cabling): press Ctrl+C to stop, shut down your system, fix your disk cable and connection, and try again.
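If the messages scroll by too fast to read, the same telltales can be searched for in saved kernel output. This is only a rough sketch: the function name `classify_kernel_log` and its three output labels are invented for illustration, and the patterns are simply the key parts listed in this answer, so it will miss failures that are worded differently.

```shell
# Hypothetical helper (not from the original answer): read kernel messages on
# stdin and report which of the telltales described above appear.
classify_kernel_log() {
    log=$(cat)
    if printf '%s\n' "$log" | grep -Eq 'Medium Error|Unrecovered read error|error: \{ UNC \}'; then
        # Physical disk error: stop reading the disk.
        echo "media-error"
    elif printf '%s\n' "$log" | grep -Eq 'hard resetting link|link is slow to respond'; then
        # SATA link trouble: check cabling and connectors.
        echo "link-problem"
    else
        echo "no-telltale"
    fi
}

# Usage from a second text console while dd is running:
#   dmesg | classify_kernel_log
```

Media errors are checked first on purpose: if both kinds of message show up, the disk-error case is the one that calls for stopping immediately.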
Side Notes
And I made a smartctl test to confirm if there is any problem with hard disk. And it passed without error.
Beware that some hard disks tell straight lies in their S.M.A.R.T. status (I'm looking at you, Toshiba); my previous laptop hard disk just ground to a halt when reading, spewing read errors, and it still said "nothing's wrong" in its status registers.
If your server is mission-critical, then you should consider a RAID-based setup.
1 Cautionary tale: My housemate once ignored this warning and kept the filesystem checker grinding on his desktop system anyway. He didn't wait for me to check it, and it eventually failed to boot. By the time I got a chance to check it, the disk damage was already beyond recovery (the 500 GB disk could barely be read at a snail's pace of KB/s, and no significant continuous readable area was found even after several days).
On the other hand, in another case with the same symptom, the machine owner heeded my warning and left the machine off until I could check it. Of course, it was a hard disk failure. After half a day of GNU ddrescue and one new hard disk, I brought him the good news that his system and data were 100% recovered at block level, i.e. all files intact and ready to boot again without any modification.
- Why: Change the system to the plain old text console (press Ctrl+Alt+F1); don't use graphical terminal for this. – HUA Di Aug 17 '21 at 12:40
- @HUADi because the X system, WM and DE will write files in the background, for their own purposes and for starting up other applications. If you're worried about a nearly-failed disk, you don't want any disk activity happening unless it's directly helping you to back up data. – mcint Feb 07 '23 at 01:15
I ran into this error on my Linux server (running Debian 10) when navigating folders and accessing files, despite the drive passing all SMART tests. I was not able to solve the problem using any of the answers posted on Stack Exchange.
I was using a 2.5" HDD in a 3.5" drive bay, and it turned out the drive had vibrated loose from the SATA connector. I shut the server down and plugged the drive back in firmly, and the errors disappeared.
- Thank you! In my case it was probably dust on the connectors of my NVMe drive. I cleaned it using compressed air and the problem went away. – He Shiming Mar 12 '22 at 01:24