For example, for bzip there is pbzip, a parallel version of bzip. Are there any such parallelization tools for sort to improve performance?
6 Answers
As of coreutils 8.6 (2010-10-15), GNU sort already sorts in parallel to make use of several processors where available. So it cannot be further improved in that regard the way pigz or pbzip2 improve on gzip or bzip2.
If your sort is not parallel, you can try and install the GNU sort from the latest version of GNU coreutils.
With GNU sort, you can limit the number of threads with the --parallel option.
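As a hedged sketch of how these flags combine (the filenames are placeholders, and `-S` here also raises the memory buffer as discussed in another answer below):

```shell
# Sort with up to 4 threads and a 2 GiB memory buffer.
# "big.txt" and "sorted.txt" are placeholder filenames.
sort --parallel=4 -S 2G big.txt > sorted.txt
```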
sort --stable gives a 15% performance boost, at least in my test workload. – jrw32982 Apr 30 '15 at 23:07
The one thing that always helps me most with sort is giving it as much memory as possible, so as to reduce swapping, e.g.:
sort -S 20G
Thanks, this is a trick I use lately, too - just let sort use half the RAM, if needed: `sort -S 50%` – miku Jun 22 '15 at 20:31
If your file is large enough, sorting will cause disk swapping, either because the allocated virtual memory is growing too big, or because the sort program itself is swapping chunks to disk and back. Older sort implementations are more likely to have this "sort via disk buffer" sort of behavior, since it was the only way to sort large files in the old days.
sort has a -m option that may help you here. It might be faster to split the file into chunks — say with split -l — sort them independently, then merge them back together.
Then again, it may be that this is exactly what "sort via disk buffer" does. The only way to find out if it helps is to benchmark it on your particular test load. The critical parameter will be the line count you give to split -l.
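A minimal sketch of that split-sort-merge idea, assuming a placeholder input file and chunk size (`sort -m` requires that each input is already sorted):

```shell
# Split into 1,000,000-line chunks named chunk.aa, chunk.ab, ...
split -l 1000000 big.txt chunk.

# Sort each chunk independently.
for f in chunk.*; do
    sort "$f" > "$f.sorted"
done

# Merge the pre-sorted chunks in a single pass.
sort -m chunk.*.sorted > big.sorted

# Remove the intermediate files.
rm -f chunk.*
```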
Thanks for your answer. I will conduct some benchmarks with `split` and `merge` and see if it helps. – miku Aug 29 '13 at 06:53
@miku: I don't see that `merge(1)` has applicability here. Use `sort -m`. – Warren Young Aug 29 '13 at 06:54
If you split the file and sort the pieces, you will still have to sort the entire thing when you put it back together, right? How will that be faster? – terdon Aug 29 '13 at 11:12
This is a variant on the [merge sort](http://en.m.wikipedia.org/wiki/Merge_sort) algorithm, one of the fastest sorting methods available. – Warren Young Aug 29 '13 at 11:34
export LC_COLLATE=C
export LANG=C
sort big_file > /dev/null
Usually, Linux sort does some nifty work to comply with Unicode collation rules... if you change the locale to C, it switches to byte-only comparison...
For a 1.4GB file the difference on my machine is 20s vs. 400s (!!!)
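The locale can also be overridden for a single invocation instead of exporting it, for example (filenames are placeholders):

```shell
# LC_ALL overrides LC_COLLATE and LANG for this one command,
# giving byte-wise comparison just for this sort.
LC_ALL=C sort big_file > sorted_file
```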
I think so... maybe `LC_COLLATE` is already enough. AFAIK `sort` uses `strcoll` for comparison and the manpage says the behavior depends on `LC_COLLATE` – kei1aeh5quahQu4U Jul 03 '15 at 17:25
I had a very significant gain using sort -n, which expects numeric values (integer or float, without scientific notation) in all selected columns.
Another possibility that might bring a great improvement to your process is to keep intermediary files on /dev/shm, a RAM-backed tmpfs filesystem.
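GNU sort writes its temporary spill files to the directory given by `-T` (or `$TMPDIR`), so a hedged sketch of the /dev/shm idea, assuming that tmpfs exists on your system and the filenames are placeholders:

```shell
# Keep sort's intermediate spill files in RAM-backed tmpfs
# instead of on disk.
sort -T /dev/shm -n big.txt > sorted.txt
```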
#!/bin/sh
# Configure MAX_LINES_PER_CHUNK based on the file length.
MAX_LINES_PER_CHUNK=1000
ORIGINAL_FILE=inputfile.txt
SORTED_FILE=outputfile.txt
CHUNK_FILE_PREFIX=$ORIGINAL_FILE.split.
SORTED_CHUNK_FILES=$CHUNK_FILE_PREFIX*.sorted

# Clean up any leftover files.
rm -f $SORTED_CHUNK_FILES
rm -f $CHUNK_FILE_PREFIX*
rm -f "$SORTED_FILE"

# Split $ORIGINAL_FILE into chunks.
split -l "$MAX_LINES_PER_CHUNK" "$ORIGINAL_FILE" "$CHUNK_FILE_PREFIX"

# Sort each chunk in the background, then wait for all of them.
for file in $CHUNK_FILE_PREFIX*
do
    sort -n -t , -k 1,1 "$file" > "$file.sorted" &
done
wait

# Merge the sorted chunks, using the same key as the chunk sort.
sort -mn -t , -k 1,1 $SORTED_CHUNK_FILES > "$SORTED_FILE"

# Clean up the intermediate files.
rm -f $SORTED_CHUNK_FILES
rm -f $CHUNK_FILE_PREFIX*
The file is split into chunks that are sorted in parallel, which increases the speed of sorting.
Hi! This answer could be improved by explaining what it is meant to do, rather than being only a code dump (also, if it has been benchmarked to be faster than GNU sort on some input, that would be interesting to know!). – dhag Dec 14 '15 at 13:31