
Basically I have a 3TB disk, inside of it I have 4 directories, and inside of those directories I have hundreds of directories.

I would like to view the relative size of any of the directories, it doesn't have to be very accurate, and displaying in GB would be preferable.

I've tried du -sh /disk/dir1/asdf

but because those directories are hundreds of GB, the above takes 10+ minutes.

"ncdu" is nice, but again, takes way too long to go through everything (hours).

df -h works well: it's quick and gives me relative sizes. However, it only displays the whole disk. Is there a way to emulate the functionality of df -h but adjust the depth?

If it helps, we're using glusterfs on that disk, maybe there's a way to speed things up with that option? A general approach would be best though.

Dmytro Lysak
    the file size is not why this takes time, at all, it's the number of files in there. – Marcus Müller Nov 24 '22 at 14:13
  • Make it a separate filesystem, mount it there and then use df? – frostschutz Nov 24 '22 at 14:31
  • `df` only needs to check one thing -- the free space list. Most file systems track this as file blocks get used and freed. `du` has to navigate the whole directory tree top-down, and accumulate sizes at every level. Further, it has to keep track of which inodes it has visited, to avoid double-counting files with multiple hard links. If your answer does not have to be very accurate, run `du` every few hours under cron, save the results, and script an extract function. A 3TB filestore can't change that much, that fast. – Paul_Pedant Nov 24 '22 at 16:20
  • Using zfs, you can just create a new filesystem using the same pool of space. – James Risner Nov 24 '22 at 17:26
  • @frostschutz glusterfs is a remote file system, typically for storage clusters. I don't think a local file system on the individual machine is a solution here :D – Marcus Müller Nov 25 '22 at 11:27
  • @JamesRisner nah, glusterfs is a remote FS with focus on storage cluster applications, I don't think a local ZFS pool helps a lot here :) – Marcus Müller Nov 25 '22 at 11:28
  • @MarcusMüller "a local file system on the individual machine" is not what I suggested, not sure where you got that from. As long as glusterfs (any filesystem, local or otherwise) knows its own size, you can make a separate entity of it and query it directly. That is the general approach to make df -h work for one specific directory. Whether its applicable in this particular situation is questionable hence the question mark in my comment… – frostschutz Nov 25 '22 at 12:16
  • Alternatives are accounting of the filesystem (quota and the like) if that's supported by the filesystem at all, or... you're pretty much stuck with du, ncdu as it were. If you have to query it often, maybe you can just cache it. – frostschutz Nov 25 '22 at 12:20
  • https://docs.gluster.org/en/main/Administrator-Guide/Directory-Quota/ – frostschutz Nov 25 '22 at 12:23
  • @frostschutz I'm not sure how you'd make a separate file system if that file system is actually remote; the question doesn't indicate the asker had the power to change the filesystem itself, just that they can mount it. But the quota approach is pretty nice! – Marcus Müller Nov 25 '22 at 13:13
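The quota route suggested in the comments can be sketched roughly like this. This is a sketch only, assuming a Gluster volume named myvol (hypothetical) and admin access to the cluster; check the Directory Quota guide linked above for your Gluster version:

```shell
# Sketch, assuming a Gluster volume named "myvol" (hypothetical name)
# and directory paths given relative to the volume root.
gluster volume quota myvol enable
gluster volume quota myvol limit-usage /dir1 1TB
# "list" then reports per-directory usage against the configured limits,
# answering "how big is this directory" without a full du walk.
gluster volume quota myvol list
```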

1 Answer


df -h works well, quick and gives me relative size, however it only displays the disk, is there a way to emulate functionality of df -h but adjust the depth size?

No, df -h simply asks the file system how much of its space is used in total. That information only exists for the whole file system, not for individual subdirectories.

If it helps, we're using glusterfs on that disk, maybe there's a way to speed things up with that option? A general approach would be best though.

Do the counting on a machine that has the lowest possible latency connection to the actual (metadata) storage.
You're probably limited by how long it takes to get the file listings and to ask for the individual file sizes. I don't know glusterfs and its implementation well enough, but:

du -s . does the following: for each directory, it gets the list of entries in that directory (using the getdents64 syscall, which in turn makes the file system deliver a list of files). It then iterates sequentially through these entries and gets the statistics of each file (using the newfstatat syscall, which in turn makes the file system deliver information on each file); those statistics include the file size, which is used to calculate the total.
For each subdirectory it encounters, it recurses down.
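You can watch exactly this syscall pattern yourself, assuming strace is installed (the stat-family syscall may appear as statx rather than newfstatat on newer coreutils/kernels):

```shell
# Count the directory-listing and per-file-stat syscalls du issues.
# getdents64 fetches directory entries; newfstatat/statx fetch file metadata.
# The -c summary is printed to stderr after du finishes.
strace -f -c -e trace=getdents64,newfstatat,statx du -s . >/dev/null
```

On a remote file system, every one of those calls is a network round trip, which is where the minutes go.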

So, there's a whole lot of communication there, where, if you have the following directory tree:

.
├── b
│   ├── b
│   ├── c
│   │   ├── e
│   │   ├── g
│   │   └── h
│   ├── d
│   └── f
├── bar
├── baz
├── foo
└── foooo

size information on ./baz won't be fetched before the info on ./bar has been found. And because most of the time of getting that info is spent waiting for the filesystem to go and fetch info from the glusterfs daemon (via network!), what the program dominantly does is wait for long times, then ask for the next file info, then wait again, and so on. Very little time would be spent on your computer doing something (like understanding what the server sent you or adding up sizes), and very much time on waiting.

If glusterfs is capable of asynchronous requests (and we can be pretty sure it is), a simple solution would be to put the "get directory listings" and the "get file sizes" aspects into separate functional units, and to make the getting of file sizes multithreaded (in the naive, extreme case, spawn a thread for every file).

You can do that using Ole Tange's parallel.

First, use find /disk/dir1/asdf -type f to (sequentially) get a list of all files. (This could also be multithreaded, but that would be more complicated, and also depends on how "wide" vs. "deep" the directory structures below that directory are.) Then, use parallel to run stat -c '%s' on each file in parallel (note it's -c for the format with GNU stat; -f would query the file system, not the file). Finally, consolidate the results and sum them up.
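Put together, that could look something like the following sketch, assuming GNU findutils, GNU parallel, and GNU coreutils stat (the path is the one from the question):

```shell
# List files NUL-separated (safe for odd filenames), stat them in
# parallel batches (-X packs many files per stat invocation), sum in awk.
find /disk/dir1/asdf -type f -print0 \
    | parallel -0 -X stat -c '%s' \
    | awk '{ sum += $1 } END { printf "%.2f GB\n", sum / 1e9 }'
```

If GNU parallel isn't available, xargs -0 -P8 is a rougher stand-in for the middle stage.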

Then, while you're not reducing the total time spent waiting, a lot of the waiting happens in parallel.

This might also be a fine thing to do in C++, as std::async makes the task of collecting data rather simple. Something like:

#include <cstdint>
#include <filesystem>
#include <future>
#include <iostream>
#include <vector>

namespace fs = std::filesystem;
using future_t = std::future<std::uintmax_t>;

int main() {
    std::vector<future_t> futures;

    // Walk the tree; for each regular file, request its size
    // asynchronously, so the waits overlap instead of serializing.
    for(auto const& dir_entry : fs::recursive_directory_iterator(".")) {
        if(fs::is_regular_file(dir_entry)) {
            futures.emplace_back(std::async(
                std::launch::async, // naive: one thread per file
                [](auto path) {
                    return fs::file_size(path);
                },
                dir_entry
            ));
        }
    }
    // Collect the results; get() blocks until each size has arrived.
    std::uintmax_t total_size = 0;
    for(auto& future : futures) {
        total_size += future.get();
    }
    std::cout << "Total size " << total_size << " B\n";
}

(try on compiler explorer! Or copy to a file main.cpp, and build locally using g++ -O3 -std=c++17 -o async_size main.cpp -lpthread; run via cd /path/I/want/to/know/size/of; /path/of/async_size)

Marcus Müller