For example, for bzip there is pbzip, a parallel version of bzip. Are there any such parallelization tools for sort to improve performance?
6 Answers
As of coreutils 8.6 (2010-10-15), GNU sort already sorts in parallel to make use of several processors where available. So it cannot be further improved in that regard the way pigz or pbzip2 improve on gzip or bzip2.
If your sort is not parallel, you can try and install the GNU sort from the latest version of GNU coreutils.
With GNU sort, you can limit the number of threads with the --parallel option.
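As a hedged sketch of how these flags combine (the filenames are placeholders, and `-S` here also raises the memory buffer as discussed in another answer below):

```shell
# Sort with up to 4 threads and a 2 GiB memory buffer.
# "big.txt" and "sorted.txt" are placeholder filenames.
sort --parallel=4 -S 2G big.txt > sorted.txt
```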
sort --stable gives a 15% performance boost, at least in my test workload. – jrw32982 Apr 30 '15 at 23:07
The one thing that always helps me most with sort is giving it as much memory as possible, so as to reduce swapping, e.g.:
sort -S 20G
Thanks, this is a trick I use lately, too - just let sort use half the RAM, if needed: `sort -S 50%` – miku Jun 22 '15 at 20:31
If your file is large enough, sorting will cause disk swapping, either because the allocated virtual memory is growing too big, or because the sort program itself is swapping chunks to disk and back. Older sort implementations are more likely to have this "sort via disk buffer" sort of behavior, since it was the only way to sort large files in the old days.
sort has a -m option that may help you here. It might be faster to split the file into chunks — say with split -l — sort them independently, then merge them back together.
Then again, it may be that this is exactly what "sort via disk buffer" does. The only way to find out if it helps is to benchmark it on your particular test load. The critical parameter will be the line count you give to split -l.
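A minimal sketch of that split-sort-merge idea, assuming a placeholder input file and chunk size (`sort -m` requires that each input is already sorted):

```shell
# Split into 1,000,000-line chunks named chunk.aa, chunk.ab, ...
split -l 1000000 big.txt chunk.

# Sort each chunk independently.
for f in chunk.*; do
    sort "$f" > "$f.sorted"
done

# Merge the pre-sorted chunks in a single pass.
sort -m chunk.*.sorted > big.sorted

# Remove the intermediate files.
rm -f chunk.*
```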
Thanks for your answer. I will conduct some benchmarks with `split` and `merge` and see if it helps. – miku Aug 29 '13 at 06:53
@miku: I don't see that `merge(1)` has applicability here. Use `sort -m`. – Warren Young Aug 29 '13 at 06:54
If you split the file and sort the pieces, you will still have to sort the entire thing when you put it back together, right? How will that be faster? – terdon Aug 29 '13 at 11:12
This is a variant on the [merge sort](http://en.m.wikipedia.org/wiki/Merge_sort) algorithm, one of the fastest sorting methods available. – Warren Young Aug 29 '13 at 11:34
export LC_COLLATE=C
export LANG=C
sort big_file > /dev/null
Usually, Linux sort does some nifty work to comply with Unicode collation rules... if you change the locale to C, it switches to byte-only comparison...
For a 1.4GB file the difference on my machine is 20s vs. 400s (!!!)
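The locale can also be overridden for a single invocation instead of exporting it, for example (filenames are placeholders):

```shell
# LC_ALL overrides LC_COLLATE and LANG for this one command,
# giving byte-wise comparison just for this sort.
LC_ALL=C sort big_file > sorted_file
```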
I think so... maybe `LC_COLLATE` is already enough. AFAIK `sort` uses `strcoll` for comparison and the manpage says the behavior depends on `LC_COLLATE` – kei1aeh5quahQu4U Jul 03 '15 at 17:25
I had a very significant gain using sort -n, which expects numeric values (integer or float, without scientific notation) in all selected columns.
Another possibility that might bring a great improvement to your process is to keep intermediary files on /dev/shm, a RAM-backed tmpfs filesystem.
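GNU sort writes its temporary spill files to the directory given by `-T` (or `$TMPDIR`), so a hedged sketch of the /dev/shm idea, assuming that tmpfs exists on your system and the filenames are placeholders:

```shell
# Keep sort's intermediate spill files in RAM-backed tmpfs
# instead of on disk.
sort -T /dev/shm -n big.txt > sorted.txt
```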
#!/bin/sh
# Configure MAX_LINES_PER_CHUNK based on the file length.
MAX_LINES_PER_CHUNK=1000
ORIGINAL_FILE=inputfile.txt
SORTED_FILE=outputfile.txt
CHUNK_FILE_PREFIX=$ORIGINAL_FILE.split.
SORTED_CHUNK_FILES=$CHUNK_FILE_PREFIX*.sorted

# Clean up any leftover files.
rm -f $SORTED_CHUNK_FILES
rm -f $CHUNK_FILE_PREFIX*
rm -f "$SORTED_FILE"

# Split $ORIGINAL_FILE into chunks.
split -l "$MAX_LINES_PER_CHUNK" "$ORIGINAL_FILE" "$CHUNK_FILE_PREFIX"

# Sort each chunk in the background, then wait for all of them.
for file in $CHUNK_FILE_PREFIX*
do
    sort -n -t , -k 1,1 "$file" > "$file.sorted" &
done
wait

# Merge the sorted chunks, using the same key as the chunk sort.
sort -mn -t , -k 1,1 $SORTED_CHUNK_FILES > "$SORTED_FILE"

# Clean up the intermediate files.
rm -f $SORTED_CHUNK_FILES
rm -f $CHUNK_FILE_PREFIX*
The file is split into chunks that are sorted in parallel, which increases the speed of sorting.
Hi! This answer could be improved by explaining what it is meant to do, rather than being only a code dump (also, if it has been benchmarked to be faster than GNU sort on some input, that would be interesting to know!). – dhag Dec 14 '15 at 13:31