Preface
I just solved this problem myself with Python. It's simple in theory, but in practice, it actually requires quite a bit of code to do correctly. I wanted to share my work here so others don't have to figure this out by themselves.
The Simple (Bad) Way
The simplest method (posted previously by letmaik) is to load the file into memory as a string of bytes, use Python's .rstrip() to remove trailing null bytes from the bytestring, then save that bytestring over the original file.
def strip_file_blank_space(filename):
    # Strips null bytes at the end of a file, and returns the new file size
    # This will process the file all at once, entirely in memory
    # Open the file for reading bytes (then close)
    with open(filename, "rb") as f:
        # Read all of the data into memory
        data = f.read()
    # Strip trailing null bytes from the data in-memory
    data = data.rstrip(b"\x00")
    # Open the file for writing bytes (then close)
    with open(filename, "wb") as f:
        # Write the data from memory back to disk
        f.write(data)
    # Return the new file size
    return len(data)

new_size = strip_file_blank_space("file.bin")
This will probably work most of the time, as long as the file is smaller than the available system memory. But with larger files (32+ GB) or on systems with little RAM (like a Raspberry Pi), the process will either grind the machine to a halt through swapping, or it will be killed by the operating system's out-of-memory handler.
The Difficult (Correct) Way
The only way around the limited-memory problem is to load one small block of data at a time, process it, release it from memory, and then repeat on the next block until the whole file is processed. Normally you can do this in Python with very compact code, but because we need to process blocks from the end of the file, moving backward, it takes a bit more work.
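To make the backward pass easier to picture, here is a small standalone generator (my own sketch, separate from the full solution) that yields fixed-size blocks starting from the end of the file and moving toward the beginning:

```python
import os

def read_blocks_backward(filename, block_size=1024 * 1024):
    # Yield blocks of the file from the end toward the beginning.
    # The first (possibly partial) block of the file is yielded last.
    with open(filename, "rb") as f:
        position = os.fstat(f.fileno()).st_size
        while position > 0:
            start = max(0, position - block_size)
            f.seek(start)
            yield f.read(position - start)
            position = start
```

Only one block is ever in memory at a time; each iteration seeks backward, reads, and yields. The full function below does essentially this, plus the bookkeeping to track where the real data ends and to rewrite the file.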
I've done the work for you. Here it is:
import os
import shutil
import tempfile
import warnings

def strip_file_blank_space(filename, block_size=1024 * 1024):
    # Strips null bytes at the end of a file, and returns the new file size
    # This will process the file in blocks, to conserve memory (default = 1 MiB)
    file_end_loc = None  # Used if the file is larger than the block size
    simple_data = None   # Used if the file fits within a single block
    # Open the source file for reading
    with open(filename, "rb") as f:
        # Get original file size
        filesize = os.fstat(f.fileno()).st_size
        # Test if file size is less than (or equal to) the block size
        if filesize <= block_size:
            # Load data to do a normal rstrip all in-memory
            simple_data = f.read()
        # If the file is larger than the specified block size
        else:
            # Compute number of whole blocks (remainder at beginning processed separately)
            num_whole_blocks = filesize // block_size
            # Compute number of remaining bytes
            num_bytes_partial_block = filesize - (num_whole_blocks * block_size)
            # Go through each block, looking for the location where the zeros end
            for block in range(num_whole_blocks):
                # Set file position, relative to the end of the file
                current_position = filesize - ((block + 1) * block_size)
                f.seek(current_position)
                # Read current block
                block_data = f.read(block_size)
                # Strip current block from right side
                block_data = block_data.rstrip(b"\x00")
                # Test if the block data was all zeros
                if len(block_data) == 0:
                    # Move on to next block
                    continue
                # If it was not all zeros
                else:
                    # Find the location in the file where the real data ends
                    blocks_not_processed = num_whole_blocks - (block + 1)
                    file_end_loc = num_bytes_partial_block + (blocks_not_processed * block_size) + len(block_data)
                    break
            # Test if the end location was not found in the full blocks loop
            if file_end_loc is None:
                # Read partial block at the beginning of the file
                f.seek(0)
                partial_block_data = f.read(num_bytes_partial_block)
                # Strip from the right side
                partial_block_data = partial_block_data.rstrip(b"\x00")
                # Test if this block (and therefore the entire file) is zeros
                if len(partial_block_data) == 0:
                    # Warn about the empty file
                    warnings.warn("File was all zeros and will be replaced with an empty file")
                # Set the location where the real data ends
                file_end_loc = len(partial_block_data)
    # If we are doing a normal strip:
    if simple_data is not None:
        # Strip right trailing null bytes
        simple_data = simple_data.rstrip(b"\x00")
        # Directly replace file
        with open(filename, "wb") as f:
            f.write(simple_data)
        # Return the new file size
        return len(simple_data)
    # If we are doing a block-by-block copy and replace
    else:
        # Create temporary file (not auto-deleted; we will move it into place ourselves)
        temp_file = tempfile.NamedTemporaryFile(mode="wb", delete=False)
        # Open the source file for reading
        with open(filename, "rb") as f:
            # Test if data is smaller than (or equal to) the block size
            if file_end_loc <= block_size:
                # Do a direct copy
                f.seek(0)
                data = f.read(file_end_loc)
                temp_file.write(data)
                temp_file.close()
            # If the data is larger than the block size
            else:
                # Find number of whole blocks to copy
                num_whole_blocks_copy = file_end_loc // block_size
                # Find partial block data size (at the end of the file this time)
                num_bytes_partial_block_copy = file_end_loc - (num_whole_blocks_copy * block_size)
                # Copy whole blocks
                f.seek(0)
                for block in range(num_whole_blocks_copy):
                    # Read block data (automatically moves position)
                    block_data = f.read(block_size)
                    # Write block to temp file
                    temp_file.write(block_data)
                # Test for any partial block data
                if num_bytes_partial_block_copy > 0:
                    # Read remaining data
                    partial_block_data = f.read(num_bytes_partial_block_copy)
                    # Write remaining data to temp file
                    temp_file.write(partial_block_data)
                # Close temp file
                temp_file.close()
        # Delete original file
        os.remove(filename)
        # Replace original with temporary file
        shutil.move(temp_file.name, filename)
        # Return the new file size
        return file_end_loc

new_size = strip_file_blank_space("file.bin")  # Defaults to 1 MiB blocks
As you can see, it takes many more lines of code, but if you're reading this, then those are lines you don't have to write now! You're welcome. :)
I've tested this function using 4+ GB files on a Raspberry Pi with 1 GB of RAM, and the process never used more than 50 MB of memory in total. It took a while to process, but it worked flawlessly.
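One design note: because we only ever remove bytes from the end, the copy-and-replace step isn't strictly necessary. If you're comfortable modifying the file in place, the same backward scan can be followed by a truncate() call, which avoids the temporary file and the second full read of the data. Here is a minimal sketch of that variant (my own, hypothetical name, not the function above):

```python
import os

def strip_trailing_nulls_inplace(filename, block_size=1024 * 1024):
    # Scan backward in blocks to find where the real data ends,
    # then truncate the file in place (no temp file, no full copy).
    with open(filename, "r+b") as f:
        end = os.fstat(f.fileno()).st_size
        while end > 0:
            start = max(0, end - block_size)
            f.seek(start)
            block = f.read(end - start).rstrip(b"\x00")
            # If this block held real data, we've found where it ends
            if block:
                end = start + len(block)
                break
            # Otherwise the block was all zeros; keep scanning backward
            end = start
        f.truncate(end)
        return end
```

The trade-off is that truncation modifies the original file directly instead of building a fresh copy, but it preserves the file's permissions and ownership, and it never needs extra disk space for a duplicate.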
Conclusion
When programming, be mindful of how much data you load into memory at any given time. Keep in mind the largest file size you might be working with, and the lower limits of the memory available to you.
I hope this helps someone down the line!