In a common Linux distribution, do utilities like rm, mv, ls, grep, wc, etc. run in parallel on their arguments?
In other words, if I grep a huge file on a 32-threaded CPU, will it go faster than on dual-core CPU?
You can get a first impression by checking whether the utility is linked with the pthread library. Any dynamically linked program that uses OS threads should use the pthread library.
ldd /bin/grep | grep -F libpthread.so
So for example on Ubuntu:
for x in $(dpkg -L coreutils grep findutils util-linux | grep /bin/); do if ldd $x | grep -q -F libpthread.so; then echo $x; fi; done
However, this produces a lot of false positives due to programs that are linked with a library that itself is linked with pthread. For example, /bin/mkdir on my system is linked with PCRE (I don't know why…) which itself is linked with pthread. But mkdir is not parallelized in any way.
In practice, checking whether the executable itself contains the string pthread gives more reliable results. It can miss executables whose parallel behavior is entirely contained in a library, but basic utilities typically aren't designed that way.
dpkg -L coreutils grep findutils util-linux | grep /bin/ | xargs grep pthread
Binary file /usr/bin/timeout matches
Binary file /usr/bin/sort matches
So the only tool that actually has a chance of being parallelized is sort. (timeout only links to libpthread because it links to librt.) GNU sort does work in parallel: the number of threads can be configured with the --parallel option, and by default it uses one thread per processor up to 8. (Using more processors gives less and less benefit as the number of processors increases, tapering off at a rate that depends on how parallelizable the task is.)
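You can see sort's parallelism in action with an explicit thread count (this is a sketch; the temp file path and input size are arbitrary):

```shell
# Generate a million shuffled numbers, then sort them numerically
# using 4 threads (--parallel is a GNU sort option).
seq 1000000 | shuf > /tmp/nums.txt
sort --parallel=4 -n /tmp/nums.txt | tail -n 1
# The last line of the numerically sorted output is 1000000.
```

With a large enough input, `time` will show the parallel run using more CPU time than wall-clock time.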
grep isn't parallelized at all. The PCRE library actually links to the pthread library only because it provides thread-safe functions that use locks and the lock manipulation functions are in the pthread library.
The typical simple approach to benefit from parallelization when processing a large amount of data is to split this data into pieces, and process the pieces in parallel. In the case of grep, keep file sizes manageable (for example, if they're log files, rotate them often enough) and call separate instances of grep on each file (for example with GNU Parallel). Note that grepping is usually IO-bound (it's only CPU-bound if you have a very complicated regex, or if you hit some Unicode corner cases of GNU grep where it has bad performance), so you're unlikely to get much benefit from having many threads.
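As a sketch of that split-and-process approach, here is one grep per file with at most 4 running at once, using xargs -P (GNU Parallel offers the same idea with nicer output handling); the log directory and pattern are just placeholders:

```shell
# One grep process per file, up to 4 in parallel.
# -n 1 passes one file per grep; -H prefixes matches with the file name.
printf '%s\n' /var/log/*.log | xargs -n 1 -P 4 grep -H 'ERROR'
```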
Another way to find an answer is to use something like sysdig to examine the system calls executed by a process. For example, if you want to see if rm creates any threads (via the clone system call), you could do:
# sysdig proc.name=rm and evt.type=clone and evt.dir='<'
With that running, I did:
$ mkdir foo
$ cd foo
$ touch {1..9999}
$ rm *
And saw no clones -- no threading there. You could repeat this experiment for other tools, but I don't think you'll find that they're threaded.
Note that clone() is the underpinnings of fork() as well, so if a tool starts some other process (e.g., find ... -exec), you'd see that output. The flags will differ from the "create a new thread" use case:
# sysdig proc.name=find and evt.type=clone and evt.dir='<'
...
1068339 18:55:59.702318832 2 find (2960545) < clone res=0 exe=find args=/tmp/foo.-type.f.-exec.rm.{}.;. tid=2960545(find) pid=2960545(find) ptid=2960332(find) cwd= fdlimit=1024 pgft_maj=0 pgft_min=1 vm_size=9100 vm_rss=436 vm_swap=0 comm=find cgroups=cpuset=/.cpu=/user.slice.cpuacct=/user.slice.io=/user.slice.memory=/user.slic... flags=25165824(CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID) uid=1026 gid=1026 vtid=2960545(find) vpid=2960545(find)
See xargs or GNU Parallel for how to run such tools in parallel.
However, the parallelisable part tends toward zero time as more processes are added. What remains is the non-parallelisable part, which does not get faster, so there is a limit to how much a task can be sped up by adding processes. Very quickly you reach a point where adding processes makes very little difference.
Then there is communication overhead: each added process costs something, and if the benefit of adding a process is smaller than that cost, the task actually gets slower.
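The diminishing returns described above are Amdahl's law: speedup(n) = 1 / ((1 - p) + p/n), where p is the parallelisable fraction. A quick sketch, assuming (hypothetically) that 90% of the work is parallelisable:

```shell
# Amdahl's law for p = 0.9: doubling workers gives less and less benefit,
# and the speedup can never exceed 1/(1-p) = 10x.
awk 'BEGIN {
  p = 0.9
  for (n = 1; n <= 32; n *= 2)
    printf "%2d workers: %.2fx speedup\n", n, 1 / ((1 - p) + p / n)
}'
# 32 workers yield only about 7.80x, not 32x.
```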
If you are mainly interested in the utilities you named, it is very unlikely that threaded versions of those commands exist.
Even worse, if such a variant did exist, it would most likely be slower than its single-threaded counterpart.
This is because the utilities you named all interact heavily with the filesystem, and doing so from multiple threads would defeat kernel optimizations like read-ahead.
A well-implemented kernel detects a sequential read through a file, such as the one grep performs, and fetches the file content in advance.
A mv operation is a rename inside one or two directories, and that requires a directory lock in the kernel. Another rename operation on those directories cannot happen at the same time, unless renames were implemented in a non-atomic way.
The oldest free tar implementation (star), on the other hand, has been parallelized for 30 years with respect to its two basic tasks: there are two processes with a piece of shared memory between them, so that one process does the archive read/write while the other does the filesystem I/O simultaneously.
Your specific question about grep can be answered with "basically yes", since the filesystem prefetch in the kernel is faster with more than one CPU than with only one. If the file you operate on is not huge, or is already in the kernel cache, there is no prefetch advantage...
BTW: Modern shells have a built-in time feature that not only shows the times but also a percentage computed as the ratio of the sum of USER and SYS CPU time to wall-clock time. If that output is more than 100%, the utility took advantage of more than one CPU. For non-threaded utilities, however, this is typically only something like 105%.
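In bash, for instance, the TIMEFORMAT variable can print that percentage directly (%P is user+sys divided by real time); the input here is just an example, and the exact numbers depend on your machine:

```shell
# %P prints CPU utilization: (user + sys) / real, as a percentage.
# Values well above 100% mean the command really used several CPUs.
TIMEFORMAT='CPU utilization: %P%%'
time (seq 2000000 | sort -n --parallel=4 > /dev/null)
```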
Finally: parallelization also takes place at the process level, and a parallelized make run can easily be 3x faster than a non-parallelized one.
If your platform allows you to switch off CPUs at runtime, I encourage you to switch off all but one CPU and compare the results with the multi-CPU environment on the otherwise identical machine.
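If you can't offline CPUs (that needs root and /sys support), taskset from util-linux approximates the experiment by pinning a command to a single CPU. A sketch, assuming the /tmp/nums.txt input file from earlier:

```shell
# Pin the run to CPU 0 only, then compare with the unrestricted run.
# taskset is part of util-linux; actually offlining a CPU would instead
# mean writing 0 to /sys/devices/system/cpu/cpuN/online as root.
time taskset -c 0 sort -n --parallel=8 /tmp/nums.txt > /dev/null
time sort -n --parallel=8 /tmp/nums.txt > /dev/null
```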