7

This question addresses the first pass of ddrescue on the device to be rescued.

I had to rescue a 1.5TB hard disk.

The command I used is:

# ddrescue /dev/sdc1 my-part-img my-part-map

When the rescue is started (with no optional parameters) on a good area of the disk, the read rate ("current rate") stays around 18 MB/s.

It occasionally slows a bit, but then comes back to this speed.

However, when it encounters a bad area of the disk, it may slow down significantly, and then it never comes back to the 18 MB/s, but stays around 3 MB/s, even after reading 50 GB of good disk with no problem.

The strange part is that, when it is currently scanning a good disk area at 3 MB/s, if I stop ddrescue and restart it, it restarts at the higher reading rate of 18 MB/s. I actually saved about 2 days by stopping and restarting ddrescue when it was going at 3 MB/s, which I had to do 8 times to finish the first pass.
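The manual stop/restart described above can be approximated in a loop, since the map file lets ddrescue resume safely. This is only a sketch (not from the original rescue): it uses coreutils `timeout` to interrupt ddrescue periodically, and the one-hour interval is an arbitrary choice.

```shell
# ddrescue saves its progress in the map file, so killing and restarting
# it loses no work. `timeout` sends SIGTERM after 1h; ddrescue catches it,
# updates the map, and exits nonzero, so the loop restarts it.
# When the pass completes normally (exit 0), the loop ends.
# Note: a persistent failure (e.g. bad arguments) would also loop forever.
while ! timeout 1h ddrescue /dev/sdc1 my-part-img my-part-map; do
    sleep 5   # let the drive settle before reopening it
done
```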

My question is: why will ddrescue not try to return to the highest speed on its own? Given the policy, explicitly stated in the documentation, of doing the easy areas first and fast, that is what should happen, and the behavior I observed seems to me to be a bug.

I have been wondering whether this can be addressed with the option -a or --min-read-rate=…, but the manual is so terse that I was not sure. Besides, I do not understand on what basis one should choose a read rate for this option. Should it be the 18 MB/s mentioned above?

Still, even with an option to specify it, I am surprised this is not done by default.

Meta note

Two users have voted to close the question for being primarily opinion based.

I would appreciate knowing in what sense it is opinion based.

I describe with some numerical precision the behavior of an important piece of software on an actual example, showing clearly that it does not meet a major design objective stated in its documentation (doing the easy parts as quickly as possible), and that very simple reasoning could improve that.

The software is well known, from a very trusted source, with precise algorithms, and I expect that most defects were weeded out long ago. So I am asking experts for a possible known reason for this unexpected behavior, not being an expert on this issue myself.

Furthermore, I ask whether one of the options of the software should be used to resolve the issue, which is an even more precise question. And I ask about a detailed aspect (how to choose the parameter for this option), since I did not find documentation for that.

I am asking for facts that I need for my work, not opinions. And I motivate it with experimental facts, not opinions.

babou
  • 826
  • 11
  • 19
  • Did you try `sdd` from `schilytools`? Sdd is much older than ddrescue, so it learned from the time when disks had more problems (in the 1980s). Its read speed depends only on the error state of the source disk. – schily Aug 06 '18 at 07:27
  • Two users voted to close the question (not to mention one downvote), but without a word of explanation. I am not experienced with **ddrescue** and disk reading issues, but I did put some care and time into researching my problem, and into writing the question, which does differ from the many questions just asking "*why is ddrescue so slow?*". I would much appreciate a word of comment regarding the reason(s) for wanting to close the question. – babou Aug 06 '18 at 12:47
  • 1
    @schily From what I read on SourceForge, _sdd_ is a replacement for _dd_, not for _ddrescue_. Also, not being very experienced on this issue, I must confess that I tend to stay with the tool for which I will more easily find help on the net. And I tend to trust GNU software in general. But I will look ... I assume you wrote it. – babou Aug 06 '18 at 12:59
  • Sdd has the needed properties inside. The main options to control that are `-noerror` and `try=`. I know that I repaired hundreds of disks with `sdd`, and I usually tend to distrust GNU software because I've seen too many problems in too many GNU programs, and a reported bug typically takes 20 years for a fix. – schily Aug 06 '18 at 13:46
  • Not being competent to judge, I will not discuss software sources. Two things I like in *ddrescue* are the policy to do the easy parts first, and more generally the map feature allowing stop and restart with changed parameters, or prioritizing a specific part of the disk. Would *sdd* do that? BTW, I love your avatar; it reminds me of the caricature of an Italian free-software activist baking his disk in a pizza oven. – babou Aug 06 '18 at 14:09
  • sdd is intended to copy things like disks and to do this in a way that gives the best possible error recovery. You can use it to read the disk and let the firmware do the refresh, you can copy the disk to itself to get a write refresh, and you can intentionally write e.g. nulled blocks to problematic areas. You also could use `iseek=` and `oseek=` to let it work at known-defective areas first. My avatar was created by a Russian person for `cdrecord` and it shows someone who engraves a CD. – schily Aug 06 '18 at 14:24
  • @babou Try shortening your text, making it scannable, and including the exact `ddrescue` command you are issuing; that lowers the effort for other Stack Exchange users to start reading. – Pro Backup Aug 06 '18 at 14:44
  • @ProBackup I added the command, though it says nothing more (but you are right). I do not understand what you mean by _making the text scannable_. I shortened the text slightly, but I do not really see how I can shorten it more, and still keep my question precise and informed, especially under accusations of being too broad (am I) or opinion based. It is not really several questions, but aspects of the same one: what is the right way to avoid the problem I identified. – babou Aug 06 '18 at 17:59
  • @babou For the first phase of ddrescue I was advised to use `-n` a.k.a. `--no-scrape` to skip the scraping phase. Your question reads as 2 questions: (1) why does ddrescue slow down and the actual rate never recover, although it does 100% of the time when just stopping and restarting, and (2) why does software (xyz) not automagically do what I want? In my opinion, #2 is the broad and opinion-based question. For #2 there is no exact answer, I think, which is why it doesn't fit Stack Exchange. Only 1 question at a time is also one of the rules of Stack Exchange. – Pro Backup Aug 07 '18 at 21:17

2 Answers

8

I have been wondering whether this can be dealt with with the option -a or --min-read-rate= ... but the manual is so terse that I was not sure. Besides, I do not understand on what basis one should choose a read rate for this option. Should it be the above 18 MB/s?

The --min-read-rate= option should help. Modern drives tend to spend a lot of time in their internal error checking, so while the rate slows down extremely, this isn't reported as an error condition.

even after reading 50 GB of good disk with no problem.

Which also means: you don't even know if there are problems anymore. The drive might have a problem, and decide to not report it.

Now, ddrescue supports using a dynamic --min-read-rate= value, from info ddrescue:

 If BYTES is 0 (auto), the minimum read rate is recalculated every
 second as (average_rate / 10).

But in my experience, the auto setting doesn't seem to help much. Once the drive gets stuck, especially if that happens right at the beginning, I guess the average_rate never stays high enough for it to be effective.

So in a first pass when you want to grab as much data as possible, fast areas first, I just set it to average_rate / 10 manually, average_rate being what the drive's average rate would be if it was intact.

So for example you can go with 10M here (for a drive that is supposed to go at ~100M/s) and then you can always go back and try your luck with the slow areas later.
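Putting the rule of thumb above on the command line might look like this; the device and file names are placeholders, and the 10M threshold assumes a drive whose healthy rate is around 100 MB/s:

```shell
# Skip (for now) any area that reads slower than a tenth of the drive's
# healthy rate, and defer the scraping phase with --no-scrape (-n).
# Skipped slow areas stay marked in the map file, so a later pass with a
# lower threshold (or none) can retry them.
ddrescue --min-read-rate=10M --no-scrape /dev/sdX1 image.img image.map
```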

the behavior I observed seems to me to be a bug.

If you have a bug then you have to debug it. It's hard to reproduce without having the same kind of drive failure. It could just as well be the drive itself that is stuck in some recovery mode.

When dealing with defective drives, you also have to check dmesg if there are any odd things happening, such as bus resets and the like. Some controllers are also worse at dealing with failing drives than others.

Sometimes manual intervention just can't be avoided.

Even then, I am surprised this is not done by default.

Most programs don't come with sane defaults. dd still uses 512 byte blocksize by default, which is the "wrong" choice in most cases... What is considered sane might also change over time.

I am asking for facts that I need for my work, not opinions.

Having good backups is better than having to rely on ddrescue. Getting data off a failing drive is a matter of luck in the first place. Data recovery involves a lot of personal experience and thus - opinions.

Most recovery tools we have are also stupid. The tool does not have an AI that reports to a central server, and goes like "Oh I've seen this failure pattern on this particular drive model before, so let's change our strategy...". So this part has to be done by humans.

frostschutz
  • 47,228
  • 5
  • 112
  • 159
  • Quote: _It could just as well be the drive itself that is stuck in some recovery mode._ It could also be S.M.A.R.T. that is running some periodic check. – Pro Backup Aug 06 '18 at 17:48
  • Much to think about, and you do provide some clarification on apparently weird behaviour. Still, it does not explain why `ddrescue` is not able to regain its former speed, unless I stop-restart it. And now, I suspect the `--min-read-rate=` option will not help for that, but only controls skipping. I will try to check that, when I am finished rescuing the disk which has lost 10 MB out of 1.5 TB. But I also lack experience in repairing file systems. – babou Aug 06 '18 at 21:59
  • Nobody *cares* if the drive has a problem, we are going to chuck it. gddrescue is not about saving the drive; it is about saving the data. – mckenzm Oct 27 '21 at 08:49
6

This is a bit of a necro post, but for anyone who might happen across this:

I've been able to reproduce OP's behaviour and have gotten ddrescue to resume its maximum read speed by using its -O flag, which reopens the input file after each error.

Unfortunately I haven't had a chance to dig into why it seems to resume at ~3 MiB/s after encountering an error, but I thought I'd share my experience.
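The suggested invocation would look like this; the device and file names are placeholders:

```shell
# -O (long form: --reopen-on-error) makes ddrescue reopen the input file
# after every read error, which in this case appeared to restore the full
# read rate instead of staying stuck at ~3 MiB/s.
ddrescue -O /dev/sdX1 image.img image.map
```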

ParoXoN
  • 161
  • 1
  • 2