3

I want to get the size of a file that is being downloaded. Since the file is preallocated, using du -sd just returns its final, full size. I want to know how much has been downloaded, so I don't want those trailing zero bytes to count. How do I get this size?

This should be possible, since aria2c can easily resume its stopped downloads, and it doesn't seem to store the downloaded length in its control (session) files. I have written a script to read total_length from .aria2 control files. This is the total length though, not downloaded length. You can easily use that script and the technical specs to get any other property aria2 stores.

Update from comments:

As ilkkachu was hinting, BITFIELD in the .aria2 file seems to actually be a map: each bit corresponds to a file chunk, 1 meaning "downloaded" (0 meaning "not downloaded"). BITFIELD LENGTH gives you the number of chunks (and the chunk size is likely just that of the file divided by the chunk number). I'm pretty sure the download progress is given by the ratio of 1s over the number of chunks in BITFIELD. Unfortunately, AFAICT, the .aria2 file seems to be updated after some delay, or as soon as the download is interrupted.

HappyFace
  • it looks to me as if `aria2c` keeps track of downloaded chunks, to resume on a chunk boundary. A Unix utility can't decide what is a "good" or a "bad" byte (or zero) in a file. – Archemar May 12 '20 at 13:01
  • @Archemar I have written a [script](https://github.com/aria2/aria2/issues/1621) to read `total_length` from `.aria2` control files. This is the total length though, not downloaded length. You can easily use that script and the [technical specs](https://aria2.github.io/manual/en/html/technical-notes.html) to get any other property aria2 stores. – HappyFace May 12 '20 at 13:06
  • precisely: aria2c **stores** the upload length (and updates it); it doesn't get it from a Unix command/system call. – Archemar May 12 '20 at 13:09
  • you could use `aria2c --file-allocation=none`, but that is probably not what you want ;-) – pLumo May 12 '20 at 13:12
  • @pLumo Indeed, I do want the preallocation. – HappyFace May 12 '20 at 13:32
  • If available on your platform/file system, you may be interested in exploring `debugfs`: it can list the allocated blocks for a file, grouped into extents, showing the extents that are marked as uninitialized, i.e. the state they are in after the to-be-downloaded file has been created and before the corresponding parts are actually downloaded. The main drawback is it must be run as root. The main upside is it's pretty fast. – fra-san May 12 '20 at 17:05
  • @fra-san Unfortunately I'm on macOS. Any alternatives for macOS? – HappyFace May 12 '20 at 17:40
  • What's "UPLOAD LENGTH" in the control file? It says "The uploaded length in this download" in that linked page, but I'm not sure if that really makes sense. In any case, the downloaded length pretty much has to be somewhere in there, since scanning the file for the non-zero byte nearest the end would just be insane for a large file. – ilkkachu May 12 '20 at 17:47
  • I mean, you could do a binary search for zeroed blocks of some size, but might give false positives if the file contained long-enough runs of zeroes, and would immediately fall on its face if some joker made a file that contained _only_ zeroes, plus one non-zero byte at the end... – ilkkachu May 12 '20 at 17:49
  • @HappyFace Not that I'm aware of, sorry. It may be worth adding your OS to your question, since getting info about file allocation is likely OS/file system-dependent. – fra-san May 12 '20 at 17:50
  • @ilkkachu I don't know. I only tried getting the total length. – HappyFace May 12 '20 at 17:54
  • ok, it doesn't seem to be used for regular http downloads. though uploading would actually make sense with Bittorrent. There's the bitmap for downloaded pieces, so I suppose you'd need to look at that. Having a bitmap instead of just a scalar counter might also make more sense with torrents, but yeah. – ilkkachu May 12 '20 at 18:13
  • As ilkkachu was hinting, `BITFIELD` in the `.aria2` file seems to actually be a map: each bit corresponds to a file chunk, `1` meaning "downloaded" (`0` meaning "not downloaded"). `BITFIELD LENGTH` gives you the number of chunks (and the chunk size is likely just that of the file divided by the chunk number). I'm pretty sure the download progress is given by the ratio of 1s over the number of chunks in `BITFIELD`. Unfortunately, AFAICT, the `.aria2` file seems to be updated after some delay, or as soon as the download is interrupted. – fra-san May 13 '20 at 01:07
  • @HappyFace, ah, you want to follow the download _while_ it's proceeding? (and not just after it's stopped/crashed) – ilkkachu May 13 '20 at 09:11
  • @ilkkachu Yeah, while downloading. My problem has been solved by my own answer though. – HappyFace May 13 '20 at 09:13

3 Answers

3

Considering just the issue of finding out how far along aria2 is on a download, there are a few options.

As discussed in the comments, the information is in a bitmap in the control file (filename.aria2). It's documented in https://aria2.github.io/manual/en/html/technical-notes.html. Having a bitmap doesn't make much sense for an HTTP download, which goes linearly from the start, but I suppose it would make more sense for a BitTorrent download or such.

Here's a hex dump of a control file for a particular download with the important fields marked (od -tx1 file.aria2):

0000000 00 01 00 00 00 00 00 00 00 00 00 10 00 00 00 00
                                      ^^^^^^^^^^^ ^^^^^^  
0000020 00 00 82 9d c0 00 00 00 00 00 00 00 00 00 00 00 
        ^^^^^^^^^^^^^^^^^                         ^^^^^^
0000040 01 06 ff ff ff ff ff ff ff ff ff ff ff ff ff ff
        ^^^^^ ^^^... 
0000060 ff ff ff ff ff ff ff ff ff fe 00 00 00 00 00 00


offset 10: 00 10 00 00 => piece length = 0x100000 = 1 MiB
offset 14: 00 00 00 00 
           82 9d c0 00 => file length = 0x829dc000 = 2191376384 (~ 2 GiB)
offset 30: 00 00 01 06 => size of bitmap = 0x0106 = 262 bytes, could fit 2096 pieces
offset 34: ff ff ...   => bitmap

Counting the set bits in the bitmap, that particular download was interrupted after at least 191 pieces of 1 MiB (200278016 bytes) were downloaded, which pretty much matches the resulting file size I got, 201098200 bytes. (The actual file was bigger by just less than a MiB; the records for in-flight pieces in the control file might account for that, but I didn't check. I didn't have pre-allocation on, just so that I could cross-check with the size on the filesystem.)
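Turning that layout into a number can be scripted. Here's a rough sketch of mine (not from the aria2 project) that assumes control-file format version 1 and an empty info hash, i.e. exactly the decimal offsets shown above, and ignores the in-flight piece records that follow the bitmap:

```shell
# Estimate downloaded bytes from an .aria2 control file.
# Assumes format version 1 and a zero-length info hash (as for plain
# HTTP downloads), so the fields sit at the fixed offsets shown above.
aria2_downloaded() {
    ctl=$1
    piece_len=$((  0x$(od -An -tx1 -j10 -N4 "$ctl" | tr -d ' \n') ))
    total_len=$((  0x$(od -An -tx1 -j14 -N8 "$ctl" | tr -d ' \n') ))
    bitmap_len=$(( 0x$(od -An -tx1 -j30 -N4 "$ctl" | tr -d ' \n') ))
    # Count the set bits (downloaded pieces) in the bitmap.
    bits=0
    for byte in $(od -An -tx1 -j34 -N"$bitmap_len" "$ctl"); do
        b=$(( 0x$byte ))
        while [ "$b" -gt 0 ]; do
            bits=$(( bits + (b & 1) ))
            b=$(( b >> 1 ))
        done
    done
    # The last piece may be partial, so cap at the total length.
    dl=$(( bits * piece_len ))
    [ "$dl" -gt "$total_len" ] && dl=$total_len
    echo "$dl"
}
```

For the control file dumped above, this should print the 191-piece figure, 200278016.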

By default aria2c saves the control file every 60 seconds, but we can use --auto-save-interval=<secs> to change that:

--auto-save-interval=<SEC>
       Save a control file(*.aria2) every SEC seconds.  If 0 is
       given, a control file is not saved during download. aria2
       saves  a  control  file  when  it stops regardless of the
       value.  The possible values are between 0 to 600. 
       Default: 60

Alternatively, I suppose you could use aria2c --log=<logfile> and fish the download progress out of the log. Though it seems the progress only shows up in write-cache entries at DEBUG level, and with those enabled, the log is rather verbose.

Also, you could use --summary-interval=1 to print some progress output to stdout, possibly redirected to some log file (and perhaps with --show-console-readout=false to hide the live readout). Though it only seems to give rounded figures:

 *** Download Progress Summary as of Wed May 13 12:57:11 2020 ***
=================================================================
[#b56779 1.7GiB/2.0GiB(86%) CN:1 DL:105MiB ETA:2s]
FILE: /work/blah.iso
-----------------------------------------------------------------
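If you take that route, a small helper can fish the latest percentage out of the redirected output. A sketch (the pattern matches the bracketed status line shown above; the exact format may vary across aria2c versions):

```shell
# Print the most recent progress percentage from an aria2c summary log.
latest_progress() {
    grep -o '([0-9]*%)' "$1" | tail -n 1 | tr -d '()%'
}
```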
ilkkachu
1

There is a way.

What you want to match are the zero bytes at the end of a line; this regex:

\0*$

will match that, provided that the tool executing the regex doesn't choke on NUL bytes (\0) and understands the \0 escape. GNU grep with PCRE regexes does, like this (-a allows binary files, -o prints only the matched section, -P selects PCRE regexes):

grep -aPo '\0*$' file

That will output all the zero bytes at the end of each line (plus a newline after each match).

To extract only the last line, we can use sed (GNU sed is documented to work with files containing NULs, cf. its -z option; some other tools don't like NUL bytes):

sed -n '$p' file | grep -aPo '\0*$'

All that needs to be done is to count them:

zerobytes=$(( $( sed -n '$p' file | grep -aPo '\0*$' | wc -c ) - 1 ))
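As a quick sanity check of that pipeline (this assumes GNU grep built with PCRE support and GNU sed; on macOS use Homebrew's `ggrep`/`gsed`), try it on a small synthetic file:

```shell
# 7-byte test file: "abc", a newline, then three NUL bytes.
printf 'abc\n\0\0\0' > /tmp/zeros-demo
# The last "line" is the three NULs; grep -o prints the match plus a
# newline (4 bytes), hence the -1.
zerobytes=$(( $( sed -n '$p' /tmp/zeros-demo | grep -aPo '\0*$' | wc -c ) - 1 ))
echo "$zerobytes"   # prints 3
```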

Then subtract that value from the overall file length to get the downloaded size.

Untested code

# alias ggrep and gdu to GNU grep and GNU du or install coreutils from Homebrew
filesize() {
    local filename="$1"
    test -e "$filename" || return 1

    local filesize="$(gdu -sb "$filename" | awk '{ print $1 }')"
    echo "$filesize"
}
filesizereal() {
    local file="$1"
    local zerobytes size
    zerobytes=$(( $( gsed -n '$p' "$file" | ggrep -aPo '\0*$' | wc -c ) - 1 ))
    # No trailing NULs => grep prints nothing and wc gives 0; clamp to 0.
    [ "$zerobytes" -lt 0 ] && zerobytes=0
    size="$(filesize "$file")" || return 1
    echo "$(( ${size:-0} - zerobytes ))"
}
  • This seems to work, thanks! It's rather slow though. – HappyFace May 12 '20 at 15:11
  • 1
    That assumes the file contains no NUL NL (0x0 0xa) sequences which is not uncommon for binary files (see `perl -l -0777 -ne 'print $ARGV if /\0\n/' /bin/*` for instance). `\0*$` matches on the NULs at the end of each line. – Stéphane Chazelas May 13 '20 at 08:30
  • Ahh, yes, indeed, and grep pcre is not understanding `(?-m)` (disable multiline) either. A `$` should match the end of the string when not in multiline mode. Anyway, workaround added. @StéphaneChazelas –  May 13 '20 at 09:54
  • @HappyFace You need to add the sed part to avoid problems with files that might have **several** zero bytes just before a newline (I am not sure if ariac files could contain such bytes). –  May 13 '20 at 09:57
  • It's not so much that `ggrep -P` is not understanding `(?-m)` (it does), but that `grep` processes one line at a time, so the RE is matched against something that never contains a newline (at least in current versions, GNU `grep`'s PCRE support is still meant to be considered experimental, and there is a lot of variation in behaviour between versions). What that means as well is that approach doesn't work if the file ends in `NUL NL`. – Stéphane Chazelas May 13 '20 at 14:32
0

I have written a Rust script that counts the trailing zeroes. It's pretty fast, but it loads the whole file into memory. See this question.

To run this script, you need rust and scriptisto installed on your system. I have named this script trailingzeroes.rs on my system.

#!/usr/bin/env scriptisto

// scriptisto-begin
// script_src: src/main.rs
// build_cmd: cargo build --release
// target_bin: ./target/release/script
// files:
//  - path: Cargo.toml
//    content: |
//     package = { name = "script", version = "0.1.0", edition = "2018"}
//     [dependencies]
// scriptisto-end

// https://users.rust-lang.org/t/count-trailing-zero-bytes-of-a-binary-file/42503/4

use std::env;
use std::fs;

fn main() {
    let filename = env::args().nth(1).unwrap();
    let buffer = fs::read(filename).unwrap();
    let count = buffer.iter().rev().take_while(|b| **b == 0).count();
    println!("{}", count);
}

Now,

# gdu is GNU du
# ggrep is GNU grep

function filesize() {
    # <file> ; returns size in bytes.

    local FILENAME="$1"
    test -e "$FILENAME" || { echo "File $FILENAME doesn't exist." >&2 ; return 1 ; }

    local SIZE="$(gdu -sb "$FILENAME" | awk '{ print $1 }')"
    echo "$SIZE"
}
function filesizereal() {
    local file="$1"
    test -e "$file" || { echo "File $file doesn't exist." >&2 ; return 1 ; }
    local zerobytes size
    # zerobytes=$(( $( ggrep -aPo '\0*$' "$file" | wc -c ) - 1 ))
    zerobytes="$(trailingzeroes.rs "$file")"
    size="$(filesize "$file")" || return 1
    echo "$(( ${size:-0} - zerobytes ))"
}
HappyFace
  • I have edited (**just now**) my answer. Added a sed command to select only the last line, it should be faster than before. Please try it. –  May 13 '20 at 09:59
  • @Isaac Yes, it's almost ten times faster now! The rust script is still about 3 times faster than that, but now the difference doesn't matter much. – HappyFace May 13 '20 at 10:10