25

I have a 100 M line file that fits in RAM on a GNU/Linux system.

This is rather slow:

sort bigfile > bigfile.sorted

and does not use all 48 cores on my machine.

How do I sort that file fast?

Ole Tange

1 Answer

50

Let us assume you have 48 cores, 500 GB of free RAM, and that the file has 100 M lines and fits in memory.
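If you want to reproduce the timings, a test file can be generated like this (the line count here is illustrative; scale it up towards 100 M as your disk and patience allow):

```shell
# Generate a shuffled test file: seq produces the lines,
# shuf randomizes their order so the file is genuinely unsorted.
seq 1000000 | shuf > bigfile
wc -l bigfile
```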

If you use normal sort it is rather slow:

$ time sort bigfile > bigfile.sort
real    4m48.664s
user    21m15.259s
sys     0m42.184s

You can make it a bit faster by ignoring your locale (comparing raw bytes instead of applying collation rules):

$ export LC_ALL=C
$ time sort bigfile > bigfile.sort
real    1m51.957s
user    6m2.053s
sys     0m42.524s
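The speedup comes from `sort` comparing raw bytes under `LC_ALL=C` rather than consulting locale collation tables. A quick way to see the byte-order behaviour (this shows the ordering, not the speed):

```shell
# With LC_ALL=C, sort compares raw bytes, so all uppercase ASCII
# letters ('A' = 0x41) come before all lowercase ones ('b' = 0x62).
printf 'b\nA\nc\nB\n' | LC_ALL=C sort
```

In a UTF-8 locale the same input may come out in dictionary order instead, which is exactly the extra work you are paying for.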

You can make it faster by telling sort to use more cores:

$ export LC_ALL=C
$ time sort --parallel=48 bigfile > bigfile.sort
real    1m39.977s
user    15m32.202s
sys     1m1.336s

You can also try giving sort more working memory (this does not help if sort already has enough memory):

$ export LC_ALL=C
$ time sort --buffer-size=80% --parallel=48 bigfile > bigfile.sort
real    1m39.779s
user    14m31.033s
sys     1m0.304s

But it seems sort still does a lot of its work single-threaded. You can force it to parallelize more with:

$ merge() {
    # Read sort commands on stdin, pair them 2-by-2 into "sort -m"
    # commands, and recurse, halving the count, until one is left.
    if [ $1 -le 1 ] ; then
        parallel -Xj1 -n2 --dr 'sort -m <({=uq=}) | mbuffer -m 30M;'
    else
        parallel -Xj1 -n2 --dr 'sort -m <({=uq=}) | mbuffer -m 30M;' |
          merge $(( $1/2 ));
    fi
  }
# Generate commands that will read blocks of bigfile and sort those
# This only builds the command - it does not run anything
$ parallel --pipepart -a bigfile --block -1 --dr -vv sort |
    # Merge these commands 2 by 2 until only one is left
    # This only builds the command - it does not run anything
    merge $(parallel --number-of-threads) |
    # Execute the command
    # This runs the command built in the previous step
    bash > bigfile.sort
real    0m30.906s
user    0m21.963s
sys     0m28.870s

It chops the file into 48 blocks on the fly (one block per core) and sorts those blocks in parallel. Then pairs of sorted blocks are merge-sorted, then pairs of those, and so on, halving the number of streams at each level until only a single sorted stream is left. All of this is done in parallel where possible.
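The pairwise merging relies on `sort -m`, which combines already-sorted inputs in a single linear pass instead of re-sorting them. A minimal illustration:

```shell
# Two pre-sorted chunks, like the ones the parallel block sorts produce.
printf '1\n3\n5\n' > chunk1
printf '2\n4\n6\n' > chunk2
# -m merges without re-sorting; the inputs must already be sorted.
sort -m chunk1 chunk2
rm chunk1 chunk2
```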

For a 100 GB file with 4 G lines the timings are:

$ LC_ALL=C time sort --parallel=48 -S 80% --compress-program pzstd bigfile >/dev/null
real    77m22.255s
$ LC_ALL=C time parsort bigfile >/dev/null
649.49user 727.04system 18:10.37elapsed 126%CPU (0avgtext+0avgdata 32896maxresident)k

So the parallelization speeds things up by around a factor of 4 (77 minutes down to 18).

To make it easier to use, I have made it into a small tool: parsort, which is now part of GNU Parallel.

It supports sort options and reading from stdin, too (parsort -k2rn < bigfile).

Ole Tange
  • I moved a long, albeit interesting, technical discussion to chat. If you want to continue that discussion, please join [this chat room](https://chat.stackexchange.com/rooms/106644/discussion-on-answer-by-ole-tange-how-to-use-parallel-to-speed-up-sort-for-b). Any further comments that are not asking for a clarification of the answer or suggesting an improvement will be deleted. – terdon Apr 12 '20 at 18:46
  • 2
    Amazing answer, especially considering that it ends with just casually mentioning this is now distributed with GNU parallel! Note for macOS users: homebrew's `parallel` package includes `parsort`, but it's not linked so won't appear on your path. Also the script assumes that GNU parallel is on the path as `parallel` and gnu sort is `sort`. – Chris Jun 30 '21 at 04:18
  • 1
    It looks like you just implemented an out-of-core sort using `parallel` and the standard GNU `sort`. Impressive. I've said for years that CS programming students should be required to write an out of core sort program limited to a very small amount of memory in order to graduate but you've now implemented that in a bash function. – doneal24 Aug 11 '22 at 18:12