df -h works well: it's quick and gives me relative sizes. However, it only displays whole disks. Is there a way to emulate the functionality of df -h but adjust the depth?
No, df -h simply asks the file system how much space is used in it as a whole. That information only exists for the whole file system, not for its subdirectories.
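If what you're after is per-directory totals at a chosen depth, GNU du can produce that (the path below is a placeholder); the catch is that, unlike df, it has to walk the whole tree:

```shell
# Per-directory totals, one level deep. Unlike df, this walks the
# tree and stats every file, so it can be slow on a network filesystem:
du -h --max-depth=1 /path/to/dir
```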
If it helps, we're using glusterfs on that disk, maybe there's a way to speed things up with that option? A general approach would be best though.
Do the counting on a machine that has the lowest possible latency connection to the actual (metadata) storage.
You're probably limited by how long it takes to get the file listing and to ask for the individual file sizes. I don't know glusterfs and its implementation well enough, but:
du -s . does the following: for each directory, get the list of entries in the directory (using the getdents(64) syscall, which in turn asks the file system to deliver a list of files).
Then, iterate sequentially through these entries and get the file statistics for each one (using the (new)fstat(at) syscall, which in turn asks the file system to deliver information on each file); those statistics contain the file size, which is added to the total.
For each directory it encounters, it recurses down.
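Roughly the same sequential access pattern can be sketched in shell (a sketch, assuming GNU stat's -c flag): one stat invocation per file, each waiting for the previous one to finish.

```shell
# Strictly sequential: each stat call blocks until the filesystem
# answers before the next file is even asked about.
find . -type f -exec stat -c '%s' {} \; | awk '{ t += $1 } END { print t }'
```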
So, there's a whole lot of communication there. Say you have the following directory tree:
.
├── b
│   ├── b
│   ├── c
│   │   ├── e
│   │   ├── g
│   │   └── h
│   ├── d
│   └── f
├── bar
├── baz
├── foo
└── foooo
Size information on ./baz won't be fetched before the info on ./bar has been found. And because most of the time spent getting that info is spent waiting for the file system to fetch it from the glusterfs daemon (via the network!), what the program dominantly does is wait, ask for the next file's info, then wait again, and so on. Very little time is spent on your computer doing something (like parsing what the server sent you or adding up sizes), and a great deal of time is spent waiting.
If glusterfs is capable of asynchronous requests (and we can be pretty sure it is), a simple solution would be to put the "get directory listings" and the "get file sizes" aspects into separate functional units, and to make the getting of file sizes multithreaded (in the naive, extreme case, spawn a thread for every file).
You can do that using Ole Tange's parallel.
First, use find /disk/dir1/asdf -type f to (sequentially) get a list of all files. (This could also be multithreaded, but that would be more complicated, and it also depends on how "wide" vs. "deep" the directory structures below that directory are.) Then, use parallel to run stat -c '%s' (GNU stat; -f would report file system status instead) on each file, in parallel; finally, consolidate the results and sum them up.
Then, while you're not reducing the total time spent waiting, a lot of the waiting happens in parallel.
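Concretely, that pipeline might look like this (a sketch: it assumes GNU stat and GNU parallel, and the batch/job counts are numbers you'd tune to your setup):

```shell
# List files once, stat them in parallel batches, sum the sizes:
find /disk/dir1/asdf -type f -print0 \
  | parallel -0 -m stat -c '%s' \
  | awk '{ total += $1 } END { print total }'

# Same idea with xargs, if GNU parallel is not installed:
find /disk/dir1/asdf -type f -print0 \
  | xargs -0 -P 16 -n 64 stat -c '%s' \
  | awk '{ total += $1 } END { print total }'
```

The -print0/-0 pairing keeps file names with spaces or newlines intact, and batching many files per stat invocation amortizes the process startup cost.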
This might also be a fine thing to do in C++, as std::async makes the task of collecting data rather simple. Something like:
#include <cstdint>
#include <filesystem>
#include <future>
#include <iostream>
#include <vector>

namespace fs = std::filesystem;
using future_t = std::future<std::uintmax_t>;

int main() {
    std::vector<future_t> futures;
    // Kick off one asynchronous size query per regular file.
    for (auto const& dir_entry : fs::recursive_directory_iterator(".")) {
        if (fs::is_regular_file(dir_entry)) {
            futures.emplace_back(std::async(
                std::launch::async,
                [](fs::path const& path) { return fs::file_size(path); },
                dir_entry.path()
            ));
        }
    }
    // Collect the answers; get() only blocks if that file's
    // size hasn't arrived yet.
    std::uintmax_t total_size = 0;
    for (auto& future : futures) {
        total_size += future.get();
    }
    std::cout << "Total size " << total_size << "b\n";
}
(Try it on Compiler Explorer! Or copy it to a file main.cpp and build locally using g++ -O3 -std=c++17 -o async_size main.cpp -lpthread (on GCC 8 and older, also append -lstdc++fs); run it via cd /path/I/want/to/know/size/of; /path/of/async_size.)