I have a 100 M line file that fits in RAM on a GNU/Linux system.
This is rather slow:
sort bigfile > bigfile.sorted
and does not use all 48 cores on my machine.
How do I sort that file fast?
Let us assume you have 48 cores, 500 GB of free RAM, and a file with 100 M lines that fits in memory.
If you use normal sort it is rather slow:
$ time sort bigfile > bigfile.sort
real 4m48.664s
user 21m15.259s
sys 0m42.184s
You can make it a bit faster by ignoring your locale:
$ export LC_ALL=C
$ time sort bigfile > bigfile.sort
real 1m51.957s
user 6m2.053s
sys 0m42.524s
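The speedup comes from sort comparing raw bytes instead of applying locale collation rules. A minimal sketch of how the ordering itself changes under the C locale:

```shell
# Under LC_ALL=C, sort compares byte values, so uppercase letters
# (0x41-0x5A) sort before lowercase ones (0x61-0x7A):
printf 'a\nB\n' | LC_ALL=C sort
# B
# a
```

Byte comparison is much cheaper than locale-aware collation, which is where the time savings come from.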
You can make it faster by telling sort to use more cores:
$ export LC_ALL=C
$ time sort --parallel=48 bigfile > bigfile.sort
real 1m39.977s
user 15m32.202s
sys 1m1.336s
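If you do not want to hard-code the core count, nproc from GNU coreutils reports it, so the same command generalizes to any machine:

```shell
# Use all available cores without hard-coding 48:
sort --parallel=$(nproc) bigfile > bigfile.sort
```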
You can also try giving sort more working memory (this does not help if sort already has enough memory):
$ export LC_ALL=C
$ time sort --buffer-size=80% --parallel=48 bigfile > bigfile.sort
real 1m39.779s
user 14m31.033s
sys 1m0.304s
But it seems sort really likes to do a lot of single threading. You can force it to parallelize more with:
$ merge() {
    if [ $1 -le 1 ] ; then
      parallel -Xj1 -n2 --dr 'sort -m <({=uq=}) | mbuffer -m 30M;'
    else
      parallel -Xj1 -n2 --dr 'sort -m <({=uq=}) | mbuffer -m 30M;' |
        merge $(( $1/2 ));
    fi
  }
# Generate commands that will read blocks of bigfile and sort those
# This only builds the commands - it does not run anything
$ parallel --pipepart -a bigfile --block -1 --dr -vv sort |
    # Merge these commands 2 by 2 until only one is left
    # This only builds the commands - it does not run anything
    merge $(parallel --number-of-threads) |
    # Execute the commands
    # This runs the commands built in the previous steps
    bash > bigfile.sort
real 0m30.906s
user 0m21.963s
sys 0m28.870s
It chops the file into 48 blocks on the fly (one block per core) and sorts those blocks in parallel. Then it merge-sorts pairs of those sorted blocks, then pairs of the results, and so on, halving the number of streams at each level until only a single sorted stream remains. All of this is done in parallel when possible.
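Each merge step relies on sort -m, which merges inputs that are already sorted without re-sorting them. A minimal illustration with two hypothetical pre-sorted streams:

```shell
# sort -m merges already-sorted inputs in a single linear pass:
sort -m <(printf '1\n3\n5\n') <(printf '2\n4\n6\n')
# 1
# 2
# 3
# 4
# 5
# 6
```

Because each merge only reads two streams sequentially, many merges at the same level of the tree can run concurrently.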
For a 100 GB file with 4 G lines the timings are:
$ LC_ALL=C time sort --parallel=48 -S 80% --compress-program pzstd bigfile >/dev/null
real 77m22.255s
$ LC_ALL=C time parsort bigfile >/dev/null
649.49user 727.04system 18:10.37elapsed 126%CPU (0avgtext+0avgdata 32896maxresident)k
So the extra parallelization speeds this up by around a factor of 4.
To make this easier to use, I have made it into a small tool: parsort, which is now part of GNU Parallel.
It supports sort's options and reading from stdin, too (parsort -k2rn < bigfile).