I'm trying to find the first non-zero byte (starting from an optional offset) on a block device using dd and print its offset, but I am stuck. I didn't mention dd in the title because there may be a more appropriate tool for this, but dd seemed like a good start. If you know of a more appropriate tool and/or a more efficient way to reach my goal, that's fine too.

In the meantime, here is how far I've come with dd in bash.

#!/bin/bash

# infile is just a temporary test file for now, which will be replaced with /dev/sdb, for instance
infile=test.txt
offset=0

while true; do
  byte=`dd status='none' bs=1 count=1 if="$infile" skip=$offset`
  ret=$?

  # the following doesn't appear to work
  # ret is always 0, even when the end of file/device is reached
  # how do I correctly determine if dd has reached the end of file/device?
  if [ $ret -gt 0 ]; then
    echo 'error, or end of file reached'
    break
  fi

  # I don't know how to correctly determine if the byte is non-zero
  # how do I determine if the read byte is non-zero?
  if [ $byte ???? ]; then
    echo "non-zero byte found at $offset"
    break
  fi

  ((++offset))
done


As you can see, I'm stuck with two issues that I don't know how to solve:
a. How do I make the while loop break when dd has reached the end of the file/device? dd gives an exit code of 0, where I expected a non-zero exit code instead.
b. How do I evaluate whether the byte that dd read and returns on stdout is non-zero? I think I've read somewhere that special care should be taken in bash with \0 bytes as well, but I'm not even sure this pertains to this situation.

Can you give me some hints on how to proceed, or perhaps suggest an alternative way to achieve my goal?

  • It is not a good idea to read a **block** device one byte at a time - read a block into a file with `dd` then either read the file a byte at a time or perhaps use something like `cut` to detect the null (zero) byte(s) otherwise known as `\0`. Might be necessary to prefill your file with non-null bytes. – Jeremy Boden Jun 01 '21 at 14:11
  • Reading past the end of a file isn't an error, it just gives you nothing, since there's nothing there. With the command substitution, you get the single byte in `$byte`, except if it's a newline (which command substitution removes), or a NUL (which Bash treats as end of string). In those cases you'd get an empty variable, and I don't think there's any way to tell them apart in just Bash. – ilkkachu Jun 01 '21 at 14:24
  • Also... this is the same as using a shell loop to process text, you're running at least one invocation of `dd` per _byte_ of input, so it'll be _horribly_ slow. Well, (apart from the issues with NUL and newline), you could read larger blocks and process the string one byte at a time, but really, you'd be better off just taking an actual programming language more suited for data processing. – ilkkachu Jun 01 '21 at 14:29
  • A question like this should be answered for every mainstream language, and for byte blocks and files in general, not just block devices. – quant2016 May 23 '23 at 15:53
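
To illustrate the point made in the comments about NUL and newline bytes getting lost in command substitution: one possible workaround is to turn the byte into a decimal number with od before it reaches a shell variable. The sketch below is only an illustration; byte_at is a hypothetical helper name, and it still runs one dd per byte, so it stays far too slow for scanning a real device.

```shell
#!/bin/sh
# byte_at FILE OFFSET (hypothetical helper, not part of any tool):
# print the byte at OFFSET as a decimal number (0-255),
# or return 1 if OFFSET is at or past the end of FILE.
byte_at() {
    # od converts the raw byte to text, so NUL and newline survive
    # the command substitution as "0" and "10".
    val=$(dd if="$1" bs=1 skip="$2" count=1 2>/dev/null | od -An -tu1 | tr -d ' \n')
    # Empty output means dd had nothing to read: end of file.
    [ -n "$val" ] || return 1
    echo "$val"
}
```

With this, the test in the loop could be written as a plain numeric comparison, for example `[ "$(byte_at "$infile" "$offset")" -ne 0 ]`, and a non-zero return status from byte_at signals the end of the input.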

2 Answers

You can do this using cmp, comparing to /dev/zero:

cmp /path/to/block-device /dev/zero

cmp will give you the offset of the first non-zero byte.
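
For scripting, it may help to know that cmp exits with status 1 both when it finds a non-zero byte and when the input ends without one (the input is then just a prefix of /dev/zero), so the exit status alone does not separate the two cases. A minimal sketch of one way to tell them apart, assuming a hypothetical helper named all_zero and relying on cmp writing the "differ" message to stdout and the end-of-file notice to stderr:

```shell
#!/bin/sh
# all_zero FILE (hypothetical helper):
#   returns 0 if FILE contains only zero bytes (or is empty),
#   returns 1 if a non-zero byte was found,
#   returns 2 if cmp itself failed (for example, FILE is unreadable).
all_zero() {
    out=$(cmp "$1" /dev/zero 2>/dev/null)
    status=$?
    # Status above 1 means trouble, not a comparison result.
    [ "$status" -le 1 ] || return 2
    # Empty stdout: cmp hit end of file before seeing any difference.
    [ -z "$out" ]
}
```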

If you want to skip bytes, you can use GNU cmp’s -i option, or if you’re not using GNU cmp, feed it the appropriate data using dd:

cmp -i 100 /path/to/block-device /dev/zero
dd if=/path/to/block-device bs=1 skip=100 | cmp - /dev/zero

This will work with any file, not just block devices.
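
With the pipeline form, cmp only sees the data after the skipped bytes, so the position it prints is relative to that point and is counted from 1. The following sketch turns that into a 0-based offset from the start of the file. It is an illustration under a few assumptions: first_nonzero_from is a hypothetical helper name, `tail -c +N` is swapped in for `dd bs=1 skip=N` purely to avoid one-byte reads, and the sed expression assumes cmp's message has the usual "differ: byte N," (GNU) or "differ: char N," (POSIX wording) shape.

```shell
#!/bin/sh
# first_nonzero_from FILE [SKIP] (hypothetical helper):
# print the 0-based offset, from the start of FILE, of the first
# non-zero byte at or after SKIP; return 1 if there is none.
first_nonzero_from() {
    file=$1
    skip=${2:-0}
    # tail -c +N starts output at byte N (counted from 1).
    # LC_ALL=C keeps cmp's message untranslated so sed can parse it.
    pos=$(tail -c "+$((skip + 1))" "$file" |
          LC_ALL=C cmp - /dev/zero 2>/dev/null |
          sed -n 's/.* differ: [a-z]* \([0-9][0-9]*\),.*/\1/p')
    [ -n "$pos" ] || return 1
    # pos is 1-based and relative to the skip point.
    echo "$((skip + pos - 1))"
}
```

For example, `first_nonzero_from /path/to/block-device 100` would report the position as an offset into the device rather than into the skipped stream.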

Stephen Kitt

Stephen Kitt's answer makes this a bit pointless (it is more concise and more than one order of magnitude faster), but an alternative you have is to (hex)dump the content of your device, one byte per line, and pipe it to a program that prints the address of the first byte whose representation is not 00 and exits as soon as it finds it:

od -Ad -w1 -tx1 /dev/device | awk '$2 && $2 != "00" { print $1 + 1; exit }'

od's -j option allows you to optionally select the number of bytes to skip (at the beginning of the input).

A much faster variation (thanks to Peter Cordes' comments) requires a bit more typing:

od -Ad -tx1 /dev/device | awk '
  {
    # $1 is the line address; fields 2..NF are the hex bytes on that line
    for (i=2; i<=NF; i++)
      if ($i != "00") {
        print ($1 + i - 1)
        exit
      }
  }'

Letting od output data in its preferred format means the offset of the first non-zero byte has to be computed by adding the byte's position within its line to that line's address.
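
As a rough sketch, the od and awk pieces above can be wrapped in a function that also takes the optional number of bytes to skip. first_nonzero_od is a hypothetical name, and the sketch assumes that the addresses od prints already include the bytes skipped with -j (as GNU od does), so no correction is needed. Like the commands above, it prints a position counted from 1.

```shell
#!/bin/sh
# first_nonzero_od FILE [SKIP] (hypothetical helper):
# print the 1-based position of the first non-zero byte at or after
# SKIP; return non-zero if every remaining byte is zero.
first_nonzero_od() {
    od -Ad -tx1 -j "${2:-0}" "$1" | awk '
      {
        # $1 is the decimal line address; fields 2..NF are hex bytes.
        # Collapsed "*" lines and the final address line have NF == 1.
        for (i = 2; i <= NF; i++)
          if ($i != "00") { print $1 + i - 1; found = 1; exit }
      }
      # END still runs after the exit above; report whether we found a byte.
      END { exit !found }'
}
```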

fra-san
  • It may well be that awk is mainly what is slowing this down. grep might be a bit faster: `od -Ad -w1 -tx1 /dev/sda | grep -E -m1 -v '\*|00$'` – Digital Trauma Jun 02 '21 at 00:27
  • @DigitalTrauma I wouldn't know. On my system, `grep`, `awk` and `sed` take roughly the same time (either a few tens of milliseconds with a real device or a few seconds with a testing file starting with a few hundred MiB of nulls); they just read three short lines at most, after all. I preferred `awk` mainly because POSIX grep doesn't support `-m`. – fra-san Jun 02 '21 at 08:56
  • I typically use `hexdump -C /dev/device | less` for this, because hexdump collapses repeats (at least of 00 lines) by default, so no need to burn CPU time formatting into hex and parsing the 00s. That makes it usable even if there's a non-zero header, although `cmp -l foo /dev/zero | less` also handles that case. But anyway, `hexdump -C` isn't amazingly efficient perhaps, compared to how fast it's possible for modern CPUs to scan memory for non-zero bytes/words, but it's usable for gigabyte-sized files. – Peter Cordes Jun 02 '21 at 10:55
  • (`time hd /dev/zero` (and control-C) shows most of the cost is user time, like 2.0 user vs. 0.168 sys, so that's pretty bad. `time cmp /dev/zero /dev/full` shows a 50:50 split of user/kernel time, so it's scanning (err comparing :/) memory about as fast as copy_to_user can copy, presumably using memcmp so it's taking advantage of glibc's optimized hand-written asm loop using AVX2 SIMD instructions (on my Skylake CPU)). – Peter Cordes Jun 02 '21 at 11:02
  • Re: optional skip: GNU `cmp` can do that, too, with `cmp -i 4096:0 foo /dev/zero` to e.g. skip the first page of foo. (Or just `-i 4096` to skip that many bytes in both inputs.) – Peter Cordes Jun 02 '21 at 11:03
  • @PeterCordes `od` collapses repeated values by default too (I chose it because it's POSIX; though I see `hexdump` is probably faster). It isn't unbearably slow because the goal, here, allows for limiting the emitted lines to three at most. – fra-san Jun 02 '21 at 11:21
  • Ok, so just a missed optimization in `od` and `hexdump`. `od` is similar speed to `hexdump` if you omit `-w1`; probably there's a bunch of outer-loop work that happens, not a dedicated loop scanning for non-zeros to find where to start a new line. Anyway, usable without `-w1` if you're willing to manually count to see where in the output line the first non-zero byte is, or you don't mind having the offset rounded down to the start of a 16-byte block. – Peter Cordes Jun 02 '21 at 20:10
  • @PeterCordes Thanks! I didn't realize `-w` was slowing `od` down that much. – fra-san Jun 02 '21 at 22:54